AWS EMR ImportError: this version of pandas is incompatible with numpy < 1.17.3

I found another one that I thought was worth writing a quick blog post about. We use AWS Elastic Map Reduce with transient clusters, so in order to get the python libraries installed, we need to use the bootstrap feature. We ran into many issues trying the standard bootstrap script which looked something like this:

The contents of requirements.txt looked like this:

We would get all the nodes in the cluster to bootstrap properly however the logs showed the following:

And when trying to import from pyspark, we saw this:

After speaking with AWS support, it turns out this was a known issue. When a cluster is launched, EMR first provisions the EC2 instances, after that it runs the bootstrap actions. Thus, when the bootstrap action runs, it installs the desired version. However, since the applications are installed after the bootstrap action, these applications override the custom installation for the Python packages. In order to get around the issue of the version being overridden, the workaround is to make use of a Bootstrap Action that delays the installation of the packages until the nodes are fully up and running. This will resolve the conflict that we have been seeing with pandas and numpy. Here is what our final working bootstrap.sh looks like, hope this helps, it was a tough one to solve: