I came across another fun one the other day. We are in the process of migrating our on-premises MapReduce system into the cloud, using AWS EMR with Amazon Managed Workflows for Apache Airflow (MWAA) orchestrating our DAGs. We ran into an odd situation with a PySpark application: when submitted through Airflow with a SparkSubmitHook, the cluster would bootstrap looking just fine according to the run logs, but the job would fail with No module named 'requests'
when the application tried to import it. This was very odd, since the same application runs just fine with spark-submit when called from the master node command line.
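For context, the Airflow side looks roughly like the sketch below. This is not our exact DAG; the DAG id, connection id, and application path are placeholders, and I am using the SparkSubmitOperator, which is the operator wrapper around the SparkSubmitHook mentioned above.

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Rough sketch only: DAG id, connection id, and application path are placeholders.
with DAG(
    dag_id="example_pyspark_job",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_pyspark_app",
        application="s3://YOUR_S3_BUCKET_NAME/app/main.py",  # the PySpark application
        conn_id="spark_default",  # Spark connection pointing at the cluster
        name="example_pyspark_app",
    )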
I decided to investigate the differences. Our bootstrap script for installing Python modules via pip, which we reference in the EMR RunJobFlow API call, looks like this:
#!/bin/bash
pip_bin=pip3
${pip_bin} install --user -U pip
${pip_bin} install --user boto3
${pip_bin} install --user boto
${pip_bin} install --user requests
${pip_bin} install --user psycopg2-binary
This is very basic: all it does is upgrade pip and then run pip install for each of the modules. Checking the bootstrap log, I could see that pip upgrades, reaches out to the package repository, and installs each package just fine. So why were we getting the No module named 'requests'
error when executing through Airflow? After a ton of googling and research I found the issue and applied a solution that worked. It turns out that when launched from Airflow, the bootstrap runs as the root user, and notice that we use the --user
argument with pip. That flag installs the packages into the calling user's home directory. The kicker is that the code is run by the hadoop user on the EMR cluster nodes when executed from Airflow, so the hadoop user could not access the requests module because root had installed it with --user into its own home directory.
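A quick way to see this for yourself is a throwaway diagnostic like the sketch below (mine, not part of our bootstrap); run it as root and then as the hadoop user and compare where, or whether, the module resolves:

# check_module.py -- a throwaway diagnostic, not part of the bootstrap.
# Run it as different users, e.g. `sudo python3 check_module.py` and then
# `sudo -u hadoop python3 check_module.py`, and compare the output.
import importlib.util
import site
import sys

MODULE = "requests"  # the module the job failed to import

print("user site-packages:", site.getusersitepackages())  # ~/.local/... for the current user
spec = importlib.util.find_spec(MODULE)
if spec is None:
    print(f"{MODULE!r} is NOT importable for this user")
    print("search path was:", sys.path)
else:
    print(f"{MODULE!r} resolves to {spec.origin}")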
I changed the bootstrap script to the following and everything started working. By removing --user
and prefixing the command with sudo, the packages now get installed in a location available to all users. I am sure there are better ways to do this, and I am still learning and researching, but if you run into this, the change below will get you out of the woods.
#!/bin/bash
sudo python3 -m pip install \
    boto3 \
    boto \
    requests \
    psycopg2-binary
After some further research and testing, we decided to use a requirements.txt file pulled in by the bootstrap shell script referenced in the RunJobFlow call. First, create the requirements.txt file. I like to pin the versions so nothing changes unexpectedly when you bootstrap a new cluster and it reaches out to PyPI to get the packages.
https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html
Add your desired packages and version numbers to a file called requirements.txt like below:
boto3==1.17.54
boto==2.49.0
requests==2.18.4
psycopg2-binary==2.8.6
Then you will need to copy this file into a bucket you have access to:
aws s3 cp requirements.txt s3://YOUR_S3_BUCKET_NAME/requirements.txt
Then create a shell script with the following contents and call it bootstrap.sh:
#!/bin/bash
set -x

echo '-----------RUNNING BOOTSTRAP------------------------'

echo '-----------COPYING REQUIREMENTS FILE LOCALLY--------'
aws s3 cp s3://YOUR_S3_BUCKET_NAME/requirements.txt .

echo '-----------INSTALLING REQUIREMENTS------------------'
sudo python3 -m pip install -r requirements.txt

echo '-----------DONE BOOTSTRAP---------------------------'
Copy that shell script to your bucket:
aws s3 cp bootstrap.sh s3://YOUR_S3_BUCKET_NAME/bootstrap.sh
And execute it via the bootstrap actions in the RunJobFlow EMR API call:
"BootstrapActions": [ { "Name": "string", "ScriptBootstrapAction": { "Path": "s3://YOUR_S3_BUCKET_NAME/bootstrap.sh" } } ], |
As you can see, the shell script will be executed, copying the requirements.txt file locally and then running pip install -r against it to install all the packages. If you want to watch the bootstrapping take place on a running cluster, you can SSH to the master node and view the logs here:
/emr/instance-controller/log/bootstrap-actions
You should see the stdout log looking something like this:
-----------RUNNING BOOTSTRAP------------------
-----------COPYING REQUIREMENTS FILE LOCALLY--------
Completed 67 Bytes/67 Bytes (629 Bytes/s) with 1 file(s) remaining
download: s3://YOUR_S3_BUCKET_NAME/requirements.txt to ./requirements.txt
-----------INSTALLING REQUIREMENTS------------------
Collecting boto==2.48.0
  Downloading boto-2.48.0-py2.py3-none-any.whl (1.4 MB)
Collecting boto3==1.6.15
  Downloading boto3-1.6.15-py2.py3-none-any.whl (128 kB)
Collecting requests==2.18.4
  Downloading requests-2.18.4-py2.py3-none-any.whl (88 kB)
Collecting psycopg2-binary==2.8.6
  Downloading psycopg2_binary-2.8.6-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Collecting botocore<1.10.0,>=1.9.15
  Downloading botocore-1.9.23-py2.py3-none-any.whl (4.1 MB)
Collecting s3transfer<0.2.0,>=0.1.10
  Downloading s3transfer-0.1.13-py2.py3-none-any.whl (59 kB)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3==1.6.15->-r jason_requirements.txt (line 2)) (0.10.0)
Collecting urllib3<1.23,>=1.21.1
  Downloading urllib3-1.22-py2.py3-none-any.whl (132 kB)
Collecting certifi>=2017.4.17
  Downloading certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
Collecting idna<2.7,>=2.5
  Downloading idna-2.6-py2.py3-none-any.whl (56 kB)
Collecting chardet<3.1.0,>=3.0.2
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Requirement already satisfied: docutils>=0.10 in /usr/lib/python3.7/site-packages (from botocore<1.10.0,>=1.9.15->boto3==1.6.15->-r jason_requirements.txt (line 2)) (0.14)
Collecting python-dateutil<2.7.0,>=2.1
  Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194 kB)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<2.7.0,>=2.1->botocore<1.10.0,>=1.9.15->boto3==1.6.15->-r jason_requirements.txt (line 2)) (1.13.0)
Installing collected packages: boto, python-dateutil, botocore, s3transfer, boto3, urllib3, certifi, idna, chardet, requests, psycopg2-binary
  Attempting uninstall: boto
    Found existing installation: boto 2.49.0
    Uninstalling boto-2.49.0:
      Successfully uninstalled boto-2.49.0
Successfully installed boto-2.48.0 boto3-1.6.15 botocore-1.9.23 certifi-2021.10.8 chardet-3.0.4 idna-2.6 psycopg2-binary-2.8.6 python-dateutil-2.6.1 requests-2.18.4 s3transfer-0.1.13 urllib3-1.22
-----------DONE BOOTSTRAP---------------------
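As a final sanity check, a small PySpark sketch like the one below (mine, not part of the original application) will confirm that the bootstrapped module is importable on both the driver and the executors:

from pyspark.sql import SparkSession

# Sanity-check sketch: confirm the module installed by the bootstrap action
# resolves on the driver and on the executors.
spark = SparkSession.builder.appName("bootstrap-module-check").getOrCreate()
sc = spark.sparkContext

def module_location(_):
    import requests  # the module the bootstrap action installed
    return requests.__file__

# Driver-side check
print("driver:", module_location(None))

# Executor-side check: each partition imports the module and reports its path
print("executors:", sc.parallelize(range(4), 4).map(module_location).distinct().collect())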
Hope this helps.