{"id":925,"date":"2021-11-02T21:26:54","date_gmt":"2021-11-03T01:26:54","guid":{"rendered":"https:\/\/jasonralph.org\/?p=925"},"modified":"2021-11-09T09:38:32","modified_gmt":"2021-11-09T14:38:32","slug":"aws-apache-managed-airflow-emr-modulenotfounderror-no-module-named-requests-bootstrap","status":"publish","type":"post","link":"https:\/\/jasonralph.org\/?p=925","title":{"rendered":"AWS Apache Managed Airflow EMR ModuleNotFoundError: No module named &#8216;requests&#8217; Bootstrap"},"content":{"rendered":"<p>I came across another fun one the other day, we are in the process of migrating our on premise elastic map reduce system into the cloud.  We are using AWS EMR and have AWS Managed Airflow as the executor (DAG).  We came across an odd situation with a pyspark application.  When using Airflow with a SparkSubmitHook, the job would bootstrap looking just fine according to the run logs, however it would fail with <code>No module named 'requests' <\/code> when the application tried to import it.  This was very odd since we have this application running from spark-submit just fine when calling it from the master node command line.  <\/p>\n<p>I decided to investigate the differences, our bootstrap script for installing python modules via pip which we call from the EMR API RunJobFlow call looks like this:<\/p>\n<pre class=\"theme:solarized-dark lang:default decode:true \" >\r\n#!\/bin\/bash\r\npip_bin=pip3\r\n${pip_bin} install --user -U pip\r\n${pip_bin} install --user boto3\r\n${pip_bin} install --user boto\r\n${pip_bin} install --user requests\r\n${pip_bin} install --user psycopg2-binary\r\n<\/pre>\n<p>This is very basic, all it does is upgrade PIP and run PIP install to install each of the modules.  When checking the bootstrap log I can see that PIP upgrades and goes out to the repo and installs the packages just fine.  So why were we getting the <code>No module named 'requests' <\/code> error when executing through airflow.  
After a ton of googling and research, I found the issue and applied a solution that worked.  It turns out Airflow's job will bootstrap as the root user, and notice that we use the <code>--user<\/code> argument with pip.  That argument instructs pip to install the packages into the calling user's home directory.  The kicker is that the code is run by the hadoop user on the EMR cluster nodes after executing from Airflow, so the hadoop user was unable to access the requests module, since root had installed it with <code>--user<\/code>.  I changed the bootstrap script to the following and it all started working: by removing <code>--user<\/code> and prefixing the command with sudo, the packages now get installed in a globally available location for all users.  I am sure there are better ways to do this, and I am still learning and researching, but if you run into this, the change below will get you out of the woods. <\/p>\n<pre class=\"theme:solarized-dark lang:default decode:true \" >\r\n#!\/bin\/bash\r\nsudo python3 -m pip install \\\r\n    boto3 \\\r\n    boto \\\r\n    requests \\\r\n    psycopg2-binary\r\n<\/pre>\n<p>After some further research and testing, we decided to have the bootstrap shell script in the RunJobFlow call read from a requirements.txt file.  First, create the requirements.txt file; I like to pin the versions so nothing changes unexpectedly when you bootstrap a new cluster and it reaches out to PyPI to get the packages.  
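<\/p>\n<p>As an aside, you can see the two install locations involved in this fix directly from Python.  A minimal sketch using only the standard library; the exact paths will vary with the Python build on your EMR release:<\/p>

```python
import site
import sysconfig

# Where `pip install --user` puts packages: under the *calling* user's
# home directory, so packages root installs this way stay invisible
# to the hadoop user.
print("per-user site-packages:", site.getusersitepackages())

# Where `sudo python3 -m pip install` puts packages: the global
# site-packages, readable by every user on the node.
print("global site-packages:", sysconfig.get_paths()["purelib"])
```

<p>Comparing the two outputs as root and as the hadoop user shows exactly why the original bootstrap hid the module from the job. 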
<\/p>\n<p><a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/APIReference\/API_RunJobFlow.html\">https:\/\/docs.aws.amazon.com\/emr\/latest\/APIReference\/API_RunJobFlow.html<\/a><\/p>\n<p>Add your desired packages and version numbers to a file called requirements.txt, like below:<\/p>\n<pre class=\"theme:solarized-dark lang:default decode:true \" >\r\nboto3==1.17.54\r\nboto==2.49.0\r\nrequests==2.18.4\r\npsycopg2-binary==2.8.6\r\n<\/pre>\n<p>Then copy this file into a bucket you have access to:<\/p>\n<pre class=\"theme:solarized-dark lang:default decode:true \" >\r\naws s3 cp requirements.txt s3:\/\/YOUR_S3_BUCKET_NAME\/requirements.txt\r\n<\/pre>\n<p>Then create a shell script, call it bootstrap.sh, containing the following:<\/p>\n<pre class=\"theme:solarized-dark lang:default decode:true \" >\r\n#!\/bin\/bash\r\n\r\nset -x \r\n\r\necho '-----------RUNNING BOOTSTRAP------------------------'\r\n\r\necho '-----------COPYING REQUIREMENTS FILE LOCALLY--------'\r\n\r\naws s3 cp s3:\/\/YOUR_S3_BUCKET_NAME\/requirements.txt .\r\n\r\necho '-----------INSTALLING REQUIREMENTS------------------'\r\n\r\nsudo python3 -m pip install -r requirements.txt\r\n\r\necho '-----------DONE BOOTSTRAP---------------------------'\r\n<\/pre>\n<p>Copy that shell script to your bucket:<\/p>\n<pre class=\"theme:solarized-dark lang:default decode:true \" >\r\naws s3 cp bootstrap.sh s3:\/\/YOUR_S3_BUCKET_NAME\/bootstrap.sh\r\n<\/pre>\n<p>And execute it via the bootstrap actions in the RunJobFlow EMR API call:<\/p>\n<pre class=\"theme:solarized-dark lang:default decode:true \" >\r\n\"BootstrapActions\": [\r\n    {\r\n      \"Name\": \"string\",\r\n      \"ScriptBootstrapAction\": {\r\n        \"Path\": \"s3:\/\/YOUR_S3_BUCKET_NAME\/bootstrap.sh\"\r\n      }\r\n    }\r\n  ],\r\n<\/pre>\n<p>As you can see, the shell script copies the requirements.txt file locally and then runs <code>pip install -r<\/code> against it, which installs all of the packages.  
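<\/p>\n<p>For completeness, here is a sketch of how that bootstrap action slots into a RunJobFlow request assembled in Python.  The cluster name, release label, instance sizing, and roles are placeholders, not our production values; with credentials configured, the dictionary would be passed to boto3's EMR <code>run_job_flow<\/code> call (boto3 itself is left out so the snippet stands alone):<\/p>

```python
import json

# Placeholder values -- substitute your own bucket, roles, and sizing.
params = {
    "Name": "my-bootstrapped-cluster",
    "ReleaseLabel": "emr-5.33.0",
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # The piece this post is about: run bootstrap.sh on every node
    # as the cluster comes up.
    "BootstrapActions": [
        {
            "Name": "install-python-deps",
            "ScriptBootstrapAction": {"Path": "s3://YOUR_S3_BUCKET_NAME/bootstrap.sh"},
        }
    ],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# With credentials in place you would then run:
#   boto3.client("emr").run_job_flow(**params)
print(json.dumps(params["BootstrapActions"], indent=2))
```

<p>Only the <code>ScriptBootstrapAction<\/code> path matters for the fix above; everything else is illustrative scaffolding.<\/p>\n<p>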
If you want to see the log on a running cluster, you can SSH to the master node and watch the bootstrapping take place in the logs here:<\/p>\n<pre class=\"theme:solarized-dark lang:default decode:true \" >\r\n\/emr\/instance-controller\/log\/bootstrap-actions\r\n<\/pre>\n<p>You should see the stdout log like so:<\/p>\n<pre class=\"theme:solarized-dark lang:default decode:true \" >\r\n-----------RUNNING BOOTSTRAP------------------\r\n-----------COPYING REQUIREMENTS FILE LOCALLY--------\r\nCompleted 67 Bytes\/67 Bytes (629 Bytes\/s) with 1 file(s) remaining\r\ndownload: s3:\/\/YOUR_S3_BUCKET_NAME\/requirements.txt to .\/requirements.txt\r\n-----------INSTALLING REQUIREMENTS------------------\r\nCollecting boto==2.48.0\r\n  Downloading boto-2.48.0-py2.py3-none-any.whl (1.4 MB)\r\nCollecting boto3==1.6.15\r\n  Downloading boto3-1.6.15-py2.py3-none-any.whl (128 kB)\r\nCollecting requests==2.18.4\r\n  Downloading requests-2.18.4-py2.py3-none-any.whl (88 kB)\r\nCollecting psycopg2-binary==2.8.6\r\n  Downloading psycopg2_binary-2.8.6-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)\r\nCollecting botocore<1.10.0,>=1.9.15\r\n  Downloading botocore-1.9.23-py2.py3-none-any.whl (4.1 MB)\r\nCollecting s3transfer<0.2.0,>=0.1.10\r\n  Downloading s3transfer-0.1.13-py2.py3-none-any.whl (59 kB)\r\nRequirement already satisfied: jmespath<1.0.0,>=0.7.1 in \/usr\/local\/lib\/python3.7\/site-packages (from boto3==1.6.15->-r jason_requirements.txt (line 2)) (0.10.0)\r\nCollecting urllib3<1.23,>=1.21.1\r\n  Downloading urllib3-1.22-py2.py3-none-any.whl (132 kB)\r\nCollecting certifi>=2017.4.17\r\n  Downloading certifi-2021.10.8-py2.py3-none-any.whl (149 kB)\r\nCollecting idna<2.7,>=2.5\r\n  Downloading idna-2.6-py2.py3-none-any.whl (56 kB)\r\nCollecting chardet<3.1.0,>=3.0.2\r\n  Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)\r\nRequirement already satisfied: docutils>=0.10 in \/usr\/lib\/python3.7\/site-packages (from botocore<1.10.0,>=1.9.15->boto3==1.6.15->-r jason_requirements.txt 
(line 2)) (0.14)\r\nCollecting python-dateutil<2.7.0,>=2.1\r\n  Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194 kB)\r\nRequirement already satisfied: six>=1.5 in \/usr\/local\/lib\/python3.7\/site-packages (from python-dateutil<2.7.0,>=2.1->botocore<1.10.0,>=1.9.15->boto3==1.6.15->-r jason_requirements.txt (line 2)) (1.13.0)\r\nInstalling collected packages: boto, python-dateutil, botocore, s3transfer, boto3, urllib3, certifi, idna, chardet, requests, psycopg2-binary\r\n  Attempting uninstall: boto\r\n    Found existing installation: boto 2.49.0\r\n    Uninstalling boto-2.49.0:\r\n      Successfully uninstalled boto-2.49.0\r\nSuccessfully installed boto-2.48.0 boto3-1.6.15 botocore-1.9.23 certifi-2021.10.8 chardet-3.0.4 idna-2.6 psycopg2-binary-2.8.6 python-dateutil-2.6.1 requests-2.18.4 s3transfer-0.1.13 urllib3-1.22\r\n-----------DONE BOOTSTRAP---------------------\r\n<\/pre>\n<p>Hope this helps. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>I came across another fun one the other day, we are in the process of migrating our on premise elastic map reduce system into the cloud. We are using AWS EMR and have AWS Managed Airflow as the executor (DAG). We came across an odd situation with a pyspark application. 
When using Airflow with a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[103,62,102,101,109,107,108,106,104,105],"class_list":["post-925","post","type-post","status-publish","format-standard","hentry","category-general-code","tag-airflow","tag-aws","tag-bootstrap","tag-emr","tag-found","tag-module","tag-not","tag-pip","tag-pyspark","tag-requests"],"_links":{"self":[{"href":"https:\/\/jasonralph.org\/index.php?rest_route=\/wp\/v2\/posts\/925","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jasonralph.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jasonralph.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jasonralph.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jasonralph.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=925"}],"version-history":[{"count":22,"href":"https:\/\/jasonralph.org\/index.php?rest_route=\/wp\/v2\/posts\/925\/revisions"}],"predecessor-version":[{"id":937,"href":"https:\/\/jasonralph.org\/index.php?rest_route=\/wp\/v2\/posts\/925\/revisions\/937"}],"wp:attachment":[{"href":"https:\/\/jasonralph.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=925"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jasonralph.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=925"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jasonralph.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=925"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}