HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out

June 22, 2023 admin

I recently had an issue where one of our EMR clusters failed to install its Python modules with pip during bootstrap. I checked the logs and saw the following error:

HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out

I wanted pip to tolerate a slow connection instead of dying on the first timeout, and I also wanted it to retry on failure. Adding the following to my bootstrap.sh raises pip's socket timeout to 100 seconds and bumps the retries up to 10. The second command is shown without its package list; append your own package names or -r requirements.txt. I have not seen the issue since I applied the new settings.

sudo python3 -m pip --timeout 100 --retries 10 install --upgrade pip
sudo python3 -m pip --timeout 100 --retries 10 install

From the pip help page:

  --retries <retries>         Maximum number of retries each connection should attempt (default 5 times).
  --timeout <sec>             Set the socket timeout (default 15 seconds).
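
If a download still fails after pip has used up its own retries, the whole command can also be wrapped in a small shell retry loop. This is only a sketch and was not part of the bootstrap.sh described above; the function name retry and its arguments (attempt count, delay in seconds, then the command) are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not from the bootstrap.sh above):
#   retry <max_attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or max_attempts is reached.
retry() {
    local max_attempts="$1" delay="$2" attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "Attempt $attempt failed, retrying in ${delay}s: $*" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example use in bootstrap.sh (commented out; needs network access):
# retry 3 30 sudo python3 -m pip --timeout 100 --retries 10 install --upgrade pip
```

The outer loop only matters when pip itself exits non-zero, so it complements --retries rather than replacing it.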

AttributeError: module 'cryptography.utils' has no attribute 'register_interface'

September 7, 2022 (updated September 8, 2022) admin

I recently came across an issue while bootstrapping one of our EMR clusters: importing pgpy failed with the following traceback:

Traceback (most recent call last):
  File "/mnt/var/lib/hadoop/steps/s-B2LJDDVVD5Y1/./aws_s3_decrypt.py", line 18, in <module>
    import pgpy
  File "/usr/local/lib/python3.7/site-packages/pgpy/__init__.py", line 4, in <module>
    from .pgp import PGPKey
  File "/usr/local/lib/python3.7/site-packages/pgpy/pgp.py", line 27, in <module>
    from .constants import CompressionAlgorithm
  File "/usr/local/lib/python3.7/site-packages/pgpy/constants.py", line 23, in <module>
    from ._curves import BrainpoolP256R1, BrainpoolP384R1, BrainpoolP512R1, X25519, Ed25519
  File "/usr/local/lib/python3.7/site-packages/pgpy/_curves.py", line 37, in <module>
    @utils.register_interface(ec.EllipticCurve)
AttributeError: module 'cryptography.utils' has no attribute 'register_interface'
Command exiting with ret '1'

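Apparently the cryptography team released a new version on September 7th, 2022 that broke the pgpy library: as the traceback shows, pgpy still calls cryptography.utils.register_interface, which the new release no longer provides.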
https://pypi.org/project/cryptography/38.0.1/

We needed to downgrade cryptography to get things working again. I figured I would post this in case others run into it. According to the pgpy GitHub page, they are working on a fix:

https://github.com/SecurityInnovation/PGPy/issues/402

Here is how I solved it in the meantime, by downgrading the cryptography library to 37.0.4:

sudo python3 -m pip install PGPy
sudo python3 -m pip uninstall -y cryptography
sudo python3 -m pip install cryptography==37.0.4
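
To avoid touching nodes where the installed cryptography is already old enough, the downgrade can be guarded by a version check. The following is a minimal sketch, assuming GNU sort with the -V (version sort) option is available. The helper names version_ge and needs_downgrade are hypothetical, and treating every release from 38.0.0 onward as affected is an assumption; the post only establishes that 37.0.4 works and that the release from September 7th does not.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when version $1 >= version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: every cryptography release from 38.0.0 onward is affected.
BROKEN_FROM="38.0.0"

# Succeeds when the given installed version should be downgraded.
needs_downgrade() {
    version_ge "$1" "$BROKEN_FROM"
}

# Example use in bootstrap.sh (commented out; needs pip and network access):
# installed=$(python3 -m pip show cryptography | awk '/^Version:/ { print $2 }')
# if needs_downgrade "$installed"; then
#     sudo python3 -m pip install cryptography==37.0.4
# fi
```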

AWS EMR ImportError: this version of pandas is incompatible with numpy < 1.17.3

May 10, 2022 (updated August 5, 2022) admin

I found another one that I thought was worth a quick blog post. We use AWS Elastic MapReduce (EMR) with transient clusters, so to get our Python libraries installed we need to use the bootstrap feature. We ran into many issues with the standard bootstrap script, which looked something like this:

[09:43:14] jason@jralph-mbp14:~ $ cat bootstrap.sh
aws s3 cp s3://bucket1-us-east-1/EMR/requirements.txt .
sudo python3 -m pip install -r requirements.txt

The contents of requirements.txt looked like this:

[09:43:14] jason@jralph-mbp14:~ $ cat requirements.txt
boto3
botocore
awscli
requests
scikit-learn
numpy
pandas

All the nodes in the cluster would bootstrap properly; however, the logs showed the following:

Traceback (most recent call last):
  File "analysis.py", line 6, in <module>
    import pandas as pd
  File "/usr/local/lib64/python3.7/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat import (
  File "/usr/local/lib64/python3.7/site-packages/pandas/compat/__init__.py", line 15, in <module>
    from pandas.compat.numpy import (
  File "/usr/local/lib64/python3.7/site-packages/pandas/compat/numpy/__init__.py", line 27, in <module>
    f"this version of pandas is incompatible with numpy < {_min_numpy_ver}\n"
ImportError: this version of pandas is incompatible with numpy < 1.17.3
your numpy version is 1.16.5.
Please upgrade numpy to >= 1.17.3 to use this pandas version

And when trying to import from pyspark, we saw this:

Traceback (most recent call last):
  File "analysis.py", line 6, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

After speaking with AWS support, it turns out this is a known issue. When a cluster is launched, EMR first provisions the EC2 instances and then runs the bootstrap actions, so the bootstrap action does install the desired versions. However, the EMR applications are installed after the bootstrap actions, and they override the custom Python package installation.

The workaround is a bootstrap action that delays the package installation until the node is fully up and running. It writes a second script, starts it in the background, and returns immediately; that second script polls the node's provisioning state and only runs pip once the state is SUCCESSFUL. This resolved the conflict we had been seeing with pandas and numpy. Here is what our final working bootstrap.sh looks like. Hope this helps, it was a tough one to solve:

#!/bin/bash
set -x

cat > /var/tmp/fix-bootstap.sh <<'EOF'
#!/bin/bash
set -x

while true; do
    NODEPROVISIONSTATE=`sed -n '/localInstance [{]/,/[}]/{
    /nodeProvisionCheckinRecord [{]/,/[}]/ {
    /status: / { p }
    /[}]/a
    }
    /[}]/a
    }' /emr/instance-controller/lib/info/job-flow-state.txt | awk ' { print $2 }'`

    if [ "$NODEPROVISIONSTATE" == "SUCCESSFUL" ]; then
        echo "Running my post provision bootstrap"
        # Enter your code here
        sudo python3 -m pip install --upgrade pip
        sudo python3 -m pip install boto3
        sudo python3 -m pip install botocore
        sudo python3 -m pip install scikit-learn
        sudo python3 -m pip install requests
        sudo python3 -m pip install numpy
        sudo python3 -m pip install pandas
        echo '-------BOOTSTRAP COMPLETE-------' 

        exit
    else
        echo "Sleeping Till Node is Provisioned"
        sleep 10
    fi
done

EOF

chmod +x /var/tmp/fix-bootstap.sh
nohup /var/tmp/fix-bootstap.sh  2>&1 &
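
The loop above keeps polling forever if the node never reports SUCCESSFUL. One option is to bound the wait. The sketch below is an illustration only and not part of the workaround AWS support described: wait_for_state, its arguments, and the timeout values in the example are hypothetical, and read_state stands in for a function wrapping the sed pipeline above.

```shell
#!/bin/bash
# Hypothetical helper (not part of the script above):
#   wait_for_state <expected> <timeout_seconds> <interval_seconds> <command...>
# Polls the command's output until it equals <expected> or the timeout passes.
wait_for_state() {
    local expected="$1" timeout="$2" interval="$3"
    shift 3
    local deadline=$(( $(date +%s) + timeout ))
    while true; do
        if [ "$("$@" 2>/dev/null)" = "$expected" ]; then
            return 0
        fi
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "Timed out waiting for state $expected" >&2
            return 1
        fi
        sleep "$interval"
    done
}

# Example use (commented out): read_state would wrap the sed pipeline above.
# wait_for_state SUCCESSFUL 1800 10 read_state || exit 1
```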

AWS Apache Managed Airflow EMR ModuleNotFoundError: No module named 'requests' Bootstrap

November 2, 2021 (updated November 9, 2021) admin

I came across another fun one the other day. We are in the process of migrating our on-premises MapReduce system into the cloud. We are using AWS EMR, with AWS Managed Airflow running the DAGs. We hit an odd situation with a PySpark application: when using Airflow with a SparkSubmitHook, the cluster would bootstrap just fine according to the run logs, but the job would fail with No module named 'requests' when the application tried to import it. This was very odd, since the same application runs fine with spark-submit from the command line on the master node.

I decided to investigate the differences. Our bootstrap script for installing Python modules via pip, which we call from the EMR RunJobFlow API call, looked like this:

#!/bin/bash
pip_bin=pip3
${pip_bin} install --user -U pip
${pip_bin} install --user boto3
${pip_bin} install --user boto
${pip_bin} install --user requests
${pip_bin} install --user psycopg2-binary

This is very basic: all it does is upgrade pip and then run pip install for each module. In the bootstrap log I could see pip upgrade itself, go out to the repo, and install the packages just fine. So why were we getting the No module named 'requests' error when executing through Airflow?

After a ton of googling and research, I found the issue and applied a solution that worked. It turns out the bootstrap runs as the root user when launched from Airflow, and note that we pass the --user argument to pip. That tells pip to install the packages under the calling user's home directory. The kicker is that, after being submitted from Airflow, the code is run by the hadoop user on the EMR cluster nodes, so the hadoop user cannot see the requests module that root installed with --user.

I changed the bootstrap script to the following and it all started working. By removing --user and prefixing the command with sudo, the packages now get installed in a globally available location for all users. I am sure there are better ways to do this, and I am still learning and researching, but if you run into this, the change below will get you out of the woods.

#!/bin/bash
sudo python3 -m pip install \
    boto3 \
    boto \
    requests \
    psycopg2-binary
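
One way to confirm this kind of problem, or its fix, is to check whether the modules actually import for the user that runs the job. The following is a minimal sketch; can_import and report_imports are hypothetical helper names, and it assumes python3 is on the PATH of whichever user runs it.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if python3 (as found on the current user's
# PATH) can import the named module.
can_import() {
    python3 -c 'import importlib, sys; importlib.import_module(sys.argv[1])' "$1" 2>/dev/null
}

# Print one OK/MISSING line per module name given as an argument.
report_imports() {
    local module
    for module in "$@"; do
        if can_import "$module"; then
            echo "OK      $module"
        else
            echo "MISSING $module"
        fi
    done
}

# Example (commented out): run on a cluster node as the user that runs the job.
# psycopg2-binary is imported under the name psycopg2.
# report_imports boto3 boto requests psycopg2
```

On a node, sudo -u hadoop python3 -c 'import requests' performs the same kind of check as the hadoop user, and python3 -m site --user-site prints the per-user directory that a --user install targets.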

After some further research and testing, we decided to use a requirements.txt file, called by the bootstrap shell script referenced in the RunJobFlow call. First create a requirements.txt file. I like to hardcode the versions so that nothing changes unexpectedly when a new cluster bootstraps and reaches out to PyPI for the packages.

https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html

Add your desired packages and version numbers to a file called requirements.txt like below:

boto3==1.17.54
boto==2.49.0
requests==2.18.4
psycopg2-binary==2.8.6
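
Since an unpinned line silently picks up whatever is newest at bootstrap time, it can help to check the file before uploading it. This is a small sketch under stated assumptions: check_pinned is a hypothetical helper, and it only recognises simple name==version lines (no extras, environment markers, or URLs).

```shell
#!/bin/bash
# Hypothetical helper: prints every requirement that is not pinned with '=='
# and returns non-zero if any are found. Blank lines and comments are skipped.
check_pinned() {
    local unpinned
    unpinned=$(grep -Ev '^[[:space:]]*(#|$)' "$1" \
        | grep -Ev '^[A-Za-z0-9._-]+==[A-Za-z0-9.]+[[:space:]]*$' || true)
    if [ -n "$unpinned" ]; then
        echo "Unpinned requirements in $1:" >&2
        echo "$unpinned" >&2
        return 1
    fi
}

# Example use before uploading the file (commented out):
# check_pinned requirements.txt || exit 1
```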

Then you will need to copy this file into a bucket you have access to:

aws s3 cp requirements.txt s3://YOUR_S3_BUCKET_NAME/requirements.txt

Then create a shell script called bootstrap.sh with the following contents:

#!/bin/bash

set -x 

echo '-----------RUNNING BOOTSTRAP------------------------'

echo '-----------COPYING REQUIREMENTS FILE LOCALLY--------'

aws s3 cp s3://YOUR_S3_BUCKET_NAME/requirements.txt .

echo '-----------INSTALLING REQUIREMENTS------------------'

sudo python3 -m pip install -r requirements.txt

echo '-----------DONE BOOTSTRAP---------------------------'

Copy that shell script to your bucket:

aws s3 cp bootstrap.sh s3://YOUR_S3_BUCKET_NAME/bootstrap.sh

And execute it via the bootstrap actions in the RunJobFlow EMR API call:

"BootstrapActions": [
    {
      "Name": "string",
      "ScriptBootstrapAction": {
        "Path": "s3://YOUR_S3_BUCKET_NAME/bootstrap.sh"
      }
    }
],

As you can see, the shell script copies the requirements.txt file locally and then runs pip install -r against it, which installs all the packages. If you want to watch the bootstrapping on a running cluster, you can ssh to the master node and view the logs here:

/emr/instance-controller/log/bootstrap-actions
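
To read the output of every bootstrap action in one go, something like the sketch below can help. It assumes each action writes a file named stdout somewhere under that directory; treat that layout as an assumption and adjust the file name if your cluster differs. show_bootstrap_stdout is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print each file named "stdout" found under a log
# directory, preceded by its path. The directory layout is an assumption.
show_bootstrap_stdout() {
    local log_dir="${1:-/emr/instance-controller/log/bootstrap-actions}"
    find "$log_dir" -type f -name stdout | sort | while IFS= read -r log_file; do
        echo "==> $log_file"
        cat "$log_file"
    done
}

# Example use on the master node (commented out):
# show_bootstrap_stdout
```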

The stdout log should look like this:

-----------RUNNING BOOTSTRAP------------------
-----------COPYING REQUIREMENTS FILE LOCALLY--------
Completed 67 Bytes/67 Bytes (629 Bytes/s) with 1 file(s) remaining
download: s3://YOUR_S3_BUCKET_NAME/requirements.txt to ./requirements.txt
-----------INSTALLING REQUIREMENTS------------------
Collecting boto==2.48.0
  Downloading boto-2.48.0-py2.py3-none-any.whl (1.4 MB)
Collecting boto3==1.6.15
  Downloading boto3-1.6.15-py2.py3-none-any.whl (128 kB)
Collecting requests==2.18.4
  Downloading requests-2.18.4-py2.py3-none-any.whl (88 kB)
Collecting psycopg2-binary==2.8.6
  Downloading psycopg2_binary-2.8.6-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Collecting botocore<1.10.0,>=1.9.15
  Downloading botocore-1.9.23-py2.py3-none-any.whl (4.1 MB)
Collecting s3transfer<0.2.0,>=0.1.10
  Downloading s3transfer-0.1.13-py2.py3-none-any.whl (59 kB)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3==1.6.15->-r jason_requirements.txt (line 2)) (0.10.0)
Collecting urllib3<1.23,>=1.21.1
  Downloading urllib3-1.22-py2.py3-none-any.whl (132 kB)
Collecting certifi>=2017.4.17
  Downloading certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
Collecting idna<2.7,>=2.5
  Downloading idna-2.6-py2.py3-none-any.whl (56 kB)
Collecting chardet<3.1.0,>=3.0.2
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Requirement already satisfied: docutils>=0.10 in /usr/lib/python3.7/site-packages (from botocore<1.10.0,>=1.9.15->boto3==1.6.15->-r jason_requirements.txt (line 2)) (0.14)
Collecting python-dateutil<2.7.0,>=2.1
  Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194 kB)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<2.7.0,>=2.1->botocore<1.10.0,>=1.9.15->boto3==1.6.15->-r jason_requirements.txt (line 2)) (1.13.0)
Installing collected packages: boto, python-dateutil, botocore, s3transfer, boto3, urllib3, certifi, idna, chardet, requests, psycopg2-binary
  Attempting uninstall: boto
    Found existing installation: boto 2.49.0
    Uninstalling boto-2.49.0:
      Successfully uninstalled boto-2.49.0
Successfully installed boto-2.48.0 boto3-1.6.15 botocore-1.9.23 certifi-2021.10.8 chardet-3.0.4 idna-2.6 psycopg2-binary-2.8.6 python-dateutil-2.6.1 requests-2.18.4 s3transfer-0.1.13 urllib3-1.22
-----------DONE BOOTSTRAP---------------------

Hope this helps.

Jason R. Ralph

Linux All Day Everyday
