AWS EMR ImportError: this version of pandas is incompatible with numpy < 1.17.3

I ran into another issue that I thought was worth a quick blog post. We use AWS Elastic MapReduce (EMR) with transient clusters, so to get our Python libraries installed we have to use the bootstrap action feature. We ran into many issues with the standard bootstrap script, which looked something like this:

The contents of requirements.txt looked like this:

All the nodes in the cluster would bootstrap properly; however, the logs showed the following:

And when trying to import from pyspark, we saw this:

After speaking with AWS support, it turns out this is a known issue. When a cluster is launched, EMR first provisions the EC2 instances, then runs the bootstrap actions, and only after that installs the cluster applications. So the bootstrap action does install the version we asked for, but because the applications are installed afterwards, they override our custom installation of the Python packages.

The workaround is a bootstrap action that delays the package installation until the nodes are fully up and running, so that our versions are installed last and nothing overrides them. This resolved the conflict we had been seeing between pandas and numpy. Hope this helps, it was a tough one to solve. Here is what our final working bootstrap.sh looks like:
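To illustrate the delayed-install pattern, here is a minimal sketch. It is an illustration under stated assumptions, not necessarily identical to the script we run. It assumes the node's provisioning state can be read from `/emr/instance-controller/lib/info/job-flow-state.txt` as a `status:` line inside the `nodeProvisionCheckinRecord` block of `localInstance`. The package list, the log path, and the `--launch` argument are all hypothetical names chosen for the example. The important detail is that the wait loop runs in the background: applications are only installed after the bootstrap action returns, so a loop that waits in the foreground would likely never see `SUCCESSFUL`.

```shell
#!/bin/bash
# Sketch of a delayed-install bootstrap action (illustrative assumptions throughout).

# Assumed location of the instance controller's state file on an EMR node.
STATE_FILE="/emr/instance-controller/lib/info/job-flow-state.txt"
# Hypothetical log path and package list; substitute your own requirements.
LOG_FILE="/tmp/delayed_pip_install.log"
PACKAGES=("numpy>=1.17.3" "pandas")

# Print the first "status:" value found inside the nodeProvisionCheckinRecord
# block that follows the localInstance block. Prints nothing if the file is
# missing or the block is absent. The file layout is an assumption.
read_provision_state() {
  [ -r "$1" ] || return 0
  awk '
    /localInstance[[:space:]]*[{]/                          { in_local = 1 }
    in_local && /nodeProvisionCheckinRecord[[:space:]]*[{]/ { in_record = 1 }
    in_record && $1 == "status:"                            { print $2; exit }
  ' "$1"
}

# Poll the state file every 10 seconds; once the node reports SUCCESSFUL,
# install the given packages and return.
wait_and_install() {
  local state_file="$1"
  shift
  while true; do
    if [ "$(read_provision_state "$state_file")" = "SUCCESSFUL" ]; then
      sleep 10
      sudo python3 -m pip install "$@"
      return
    fi
    sleep 10
  done
}

# Only start the background loop when the (hypothetical) --launch argument is
# passed, so the functions above can be sourced or tested on their own.
if [ "${1:-}" = "--launch" ]; then
  # Run detached so the bootstrap action itself finishes immediately.
  wait_and_install "$STATE_FILE" "${PACKAGES[@]}" > "$LOG_FILE" 2>&1 < /dev/null &
fi
```

With this layout, the bootstrap action would be registered with `--launch` as its argument, and the output of the later pip run would end up in the log file on each node rather than in the bootstrap action logs.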

7 thoughts on “AWS EMR ImportError: this version of pandas is incompatible with numpy < 1.17.3”

  1. Thanks for this post! I am having a similar problem where I need a version of numpy that keeps getting overwritten on EMR. I tried this solution and it did not work for me. I added an echo statement to print out $NODEPROVISIONSTATE in each loop iteration, and it was always PENDING. I wonder if there is a chicken-and-egg situation here, where NODEPROVISIONSTATE stays PENDING until the bootstrap action completes, and the bootstrap action never completes because NODEPROVISIONSTATE never becomes SUCCESSFUL.

    1. Hi, that’s interesting. It certainly would not execute the pip installs unless $NODEPROVISIONSTATE becomes SUCCESSFUL. I wonder if something else is causing the cluster to stay pending. Are you running the shell script above as the single bootstrap call? This solution worked for us. Are you using the latest EMR release? Interested to hear if you solve it.

  2. Can you please tell us which EMR version you are using? We have spent days trying to fix this with no luck.
