I found another issue that I thought was worth a quick blog post. We use AWS Elastic MapReduce (EMR) with transient clusters, so to get our Python libraries installed we need to use the bootstrap action feature. We ran into many issues with the standard bootstrap script, which looked something like this:
```
[09:43:14] jason@jralph-mbp14:~ $ cat bootstrap.sh
aws s3 cp s3://bucket1-us-east-1/EMR/requirements.txt .
sudo python3 -m pip install -r requirements.txt
```
The contents of requirements.txt looked like this:
```
[09:43:14] jason@jralph-mbp14:~ $ cat requirements.txt
boto3
botocore
awscli
requests
scikit-learn
numpy
pandas
```
All the nodes in the cluster would bootstrap properly; however, the logs showed the following:
```
Traceback (most recent call last):
  File "analysis.py", line 6, in <module>
    import pandas as pd
  File "/usr/local/lib64/python3.7/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat import (
  File "/usr/local/lib64/python3.7/site-packages/pandas/compat/__init__.py", line 15, in <module>
    from pandas.compat.numpy import (
  File "/usr/local/lib64/python3.7/site-packages/pandas/compat/numpy/__init__.py", line 27, in <module>
    f"this version of pandas is incompatible with numpy < {_min_numpy_ver}\n"
ImportError: this version of pandas is incompatible with numpy < 1.17.3
your numpy version is 1.16.5.
Please upgrade numpy to >= 1.17.3 to use this pandas version
```
And when trying to import from pyspark, we saw this:
```
Traceback (most recent call last):
  File "analysis.py", line 6, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
```
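When two tracebacks disagree like this, it can help to ask the interpreter directly which copy of a package it sees. The snippet below is a rough sketch and not something from our original troubleshooting. `report_pkg` is a made-up helper name, and it assumes `python3` on the PATH is the interpreter your job actually runs under (if PySpark is configured to use a different one, swap it in).

```shell
#!/bin/bash
# Hypothetical helper (not an EMR or pip tool): print the version and on-disk
# location of a Python module as seen by python3, or the import error if the
# module cannot be imported.
report_pkg() {
    python3 - "$1" <<'PY'
import importlib
import sys

name = sys.argv[1]
try:
    mod = importlib.import_module(name)
except ImportError as exc:
    # Covers both "No module named ..." and version-conflict ImportErrors
    print("%s: NOT IMPORTABLE (%s)" % (name, exc))
else:
    version = getattr(mod, "__version__", "unknown")
    location = getattr(mod, "__file__", "built-in")
    print("%s: %s from %s" % (name, version, location))
PY
}

# Example usage on a cluster node (commented out so sourcing has no side effects):
# for pkg in numpy pandas; do report_pkg "$pkg"; done
```

Running it on a node after the cluster is fully up shows whether the versions you installed in the bootstrap action are still the ones being imported.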
After speaking with AWS support, it turns out this is a known issue. When a cluster is launched, EMR first provisions the EC2 instances, then runs the bootstrap actions, and only after that installs the applications. So the bootstrap action does install the desired versions, but because the applications are installed afterwards, they override our custom Python package installation. The workaround is a bootstrap action that delays the package installation until the nodes are fully up and running, which resolves the conflict we had been seeing with pandas and numpy. Here is what our final working bootstrap.sh looks like. Hope this helps; it was a tough one to solve:
```shell
#!/bin/bash
set -x

# Write a helper script that waits for provisioning to finish, then installs packages.
cat > /var/tmp/fix-bootstap.sh <<'EOF'
#!/bin/bash
set -x

while true; do
    # Pull this node's provisioning status out of the instance controller's state file
    NODEPROVISIONSTATE=`sed -n '/localInstance [{]/,/[}]/{
/nodeProvisionCheckinRecord [{]/,/[}]/ {
/status: / { p }
/[}]/a
}
/[}]/a
}' /emr/instance-controller/lib/info/job-flow-state.txt | awk ' { print $2 }'`

    if [ "$NODEPROVISIONSTATE" == "SUCCESSFUL" ]; then
        echo "Running my post provision bootstrap"
        # Enter your code here
        sudo python3 -m pip install --upgrade pip
        sudo python3 -m pip install boto3
        sudo python3 -m pip install botocore
        # scikit-learn is the PyPI package name (matches requirements.txt); "sklearn" is a deprecated alias
        sudo python3 -m pip install scikit-learn
        sudo python3 -m pip install requests
        sudo python3 -m pip install numpy
        sudo python3 -m pip install pandas
        echo '-------BOOTSTRAP COMPLETE-------'
        exit
    else
        echo "Sleeping Till Node is Provisioned"
        sleep 10
    fi
done

EOF

# Run the helper in the background so the bootstrap action itself returns immediately
chmod +x /var/tmp/fix-bootstap.sh
nohup /var/tmp/fix-bootstap.sh 2>&1 &
```
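One thing the loop above does not do is give up: if a node never reports SUCCESSFUL, the background script keeps polling forever. Below is a minimal sketch of the same wait with an upper bound on the number of attempts. This is my own variation, not something AWS support gave us. `wait_for_state` and `read_provision_state` are hypothetical names, and the retry count and delay are arbitrary example values.

```shell
#!/bin/bash
# Hypothetical helper (not part of EMR): run a state-reading command until it
# prints the wanted value, giving up after max_tries attempts.
# Usage: wait_for_state WANTED MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
wait_for_state() {
    local wanted="$1" max_tries="$2" delay="$3"
    shift 3
    local try=1 state
    while [ "$try" -le "$max_tries" ]; do
        state="$("$@" 2>/dev/null)"
        if [ "$state" = "$wanted" ]; then
            return 0    # the command reported the wanted state
        fi
        sleep "$delay"
        try=$((try + 1))
    done
    return 1            # gave up: the state never matched
}

# Example wiring (commented out so that sourcing this file has no side effects).
# read_provision_state is a hypothetical function that would wrap the sed/awk
# pipeline from the script above and print the node's status.
#
# if wait_for_state SUCCESSFUL 180 10 read_provision_state; then
#     sudo python3 -m pip install numpy pandas
# else
#     echo "Node not provisioned after roughly 30 minutes, skipping pip installs" >&2
# fi
```

Passing the state reader in as a command keeps the retry logic separate from the EMR-specific parsing, which also makes it easy to try out on a laptop with a fake reader.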