I found another issue that I thought was worth a quick blog post. We use AWS Elastic MapReduce (EMR) with transient clusters, so to get our Python libraries installed we need to use the bootstrap action feature. We ran into many issues with the standard bootstrap script, which looked something like this:
```
[09:43:14] jason@jralph-mbp14:~ $ cat bootstrap.sh
aws s3 cp s3://bucket1-us-east-1/EMR/requirements.txt .
sudo python3 -m pip install -r requirements.txt
```
The contents of requirements.txt looked like this:
```
[09:43:14] jason@jralph-mbp14:~ $ cat requirements.txt
boto3
botocore
awscli
requests
scikit-learn
numpy
pandas
```
All the nodes in the cluster would bootstrap properly; however, the logs showed the following:
```
Traceback (most recent call last):
  File "analysis.py", line 6, in <module>
    import pandas as pd
  File "/usr/local/lib64/python3.7/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat import (
  File "/usr/local/lib64/python3.7/site-packages/pandas/compat/__init__.py", line 15, in <module>
    from pandas.compat.numpy import (
  File "/usr/local/lib64/python3.7/site-packages/pandas/compat/numpy/__init__.py", line 27, in <module>
    f"this version of pandas is incompatible with numpy < {_min_numpy_ver}\n"
ImportError: this version of pandas is incompatible with numpy < 1.17.3
your numpy version is 1.16.5.
Please upgrade numpy to >= 1.17.3 to use this pandas version
```
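To confirm which numpy version the interpreter on a node actually picks up, a quick check along the following lines can help. This is only a minimal sketch: the helper name `version_ge` is hypothetical, the 1.17.3 threshold is simply the minimum reported in the traceback above, and the comparison assumes `sort -V` from GNU coreutils is available.

```shell
#!/bin/bash

# Hypothetical helper: succeeds when version $1 is >= version $2.
# Assumes GNU sort, whose -V option sorts version strings.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum numpy version, taken from the pandas ImportError above.
MIN_NUMPY="1.17.3"

# Ask python3 which numpy it sees; stays empty if numpy cannot be imported.
INSTALLED_NUMPY=$(python3 -c 'import numpy; print(numpy.__version__)' 2>/dev/null || true)

if [ -z "$INSTALLED_NUMPY" ]; then
    echo "numpy is not importable by python3"
elif version_ge "$INSTALLED_NUMPY" "$MIN_NUMPY"; then
    echo "numpy $INSTALLED_NUMPY is new enough"
else
    echo "numpy $INSTALLED_NUMPY is older than $MIN_NUMPY"
fi
```

Running this on a node after the cluster is fully up, rather than only during the bootstrap action, should show whether the package version was replaced later in provisioning.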
And when trying to import from pyspark, we saw this:
```
Traceback (most recent call last):
  File "analysis.py", line 6, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
```
After speaking with AWS support, it turned out this was a known issue. When a cluster is launched, EMR first provisions the EC2 instances and then runs the bootstrap actions, so the bootstrap action does install the desired package versions. However, the EMR applications are installed after the bootstrap actions, and they override our custom installation of the Python packages. The workaround is a bootstrap action that delays the package installation until the node is fully provisioned: it writes a small script that runs in the background, polls the node's provisioning state, and runs the pip installs only once that state is SUCCESSFUL. This resolves the conflict we had been seeing with pandas and numpy. Here is what our final working bootstrap.sh looks like. Hope this helps, it was a tough one to solve:
```
#!/bin/bash
set -x

# Write the delayed-install script to disk; the quoted 'EOF' keeps the
# variables and backticks below from being expanded while writing the file.
cat > /var/tmp/fix-bootstap.sh <<'EOF'
#!/bin/bash
set -x

while true; do
    # Read this node's provisioning status from the instance controller state file.
    NODEPROVISIONSTATE=`sed -n '/localInstance [{]/,/[}]/{
    /nodeProvisionCheckinRecord [{]/,/[}]/ {
    /status: / { p }
    /[}]/a
    }
    /[}]/a
    }' /emr/instance-controller/lib/info/job-flow-state.txt | awk ' { print $2 }'`

    if [ "$NODEPROVISIONSTATE" == "SUCCESSFUL" ]; then
        echo "Running my post provision bootstrap"
        # Enter your code here
        sudo python3 -m pip install --upgrade pip
        sudo python3 -m pip install boto3
        sudo python3 -m pip install botocore
        # The PyPI package is named scikit-learn; "sklearn" is a deprecated alias.
        sudo python3 -m pip install scikit-learn
        sudo python3 -m pip install requests
        sudo python3 -m pip install numpy
        sudo python3 -m pip install pandas

        echo '-------BOOTSTRAP COMPLETE-------'
        exit
    else
        echo "Sleeping Till Node is Provisioned"
        sleep 10
    fi
done
EOF

# Run the script in the background so the bootstrap action itself can finish.
chmod +x /var/tmp/fix-bootstap.sh
nohup /var/tmp/fix-bootstap.sh 2>&1 &
```
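One caveat with the loop above is that it keeps polling forever if the node never reports SUCCESSFUL. A bounded variant might look like the sketch below. It is untested on a real cluster: the function names `wait_for` and `node_is_provisioned`, and the attempt and delay values in the usage comment, are hypothetical, and the state check simply reuses the sed/awk pipeline from the script above.

```shell
#!/bin/bash

# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# after each failed attempt. Returns 0 as soon as the command succeeds,
# or 1 if every attempt failed.
wait_for() {
    local max_attempts="$1"
    local delay="$2"
    shift 2
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    return 1
}

# Hypothetical wrapper around the same state check used in the script above.
node_is_provisioned() {
    local state
    state=$(sed -n '/localInstance [{]/,/[}]/{
    /nodeProvisionCheckinRecord [{]/,/[}]/ {
    /status: / { p }
    /[}]/a
    }
    /[}]/a
    }' /emr/instance-controller/lib/info/job-flow-state.txt 2>/dev/null | awk '{ print $2 }')
    [ "$state" = "SUCCESSFUL" ]
}

# Example use inside the background script (the numbers are assumptions,
# roughly 30 minutes of polling at 10-second intervals):
#
# if wait_for 180 10 node_is_provisioned; then
#     sudo python3 -m pip install numpy pandas
# else
#     echo "Node never reached SUCCESSFUL, giving up" >&2
# fi
```

Giving up after a fixed number of attempts leaves a clear message in the log instead of a background process that polls indefinitely, which can make a stuck PENDING state easier to notice.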
Thanks for this post! I am having a similar problem where I need a version of numpy that keeps getting overwritten on EMR. I tried this solution and it did not work for me. I added an echo statement to print out $NODEPROVISIONSTATE in each loop iteration, and it was always PENDING. I wonder if there is a chicken-and-egg situation here, where NODEPROVISIONSTATE stays PENDING until the bootstrap action completes, which never happens because NODEPROVISIONSTATE never becomes SUCCESSFUL.
Hi, that’s interesting. For sure it would not execute the pip installs unless $NODEPROVISIONSTATE becomes SUCCESSFUL. I wonder if something else is causing the cluster to stay pending. Are you running the shell script above as the single bootstrap call? This solution worked for us. Are you using the latest EMR release? Interested to hear if you solve it.
Could you please tell us which EMR version you are using? We have spent days trying to fix this with no luck.
Hello, we have this working with release label emr-6.6.0.
It works like a charm, thank you!
You are welcome!!