AWS Apache Managed Airflow EMR ModuleNotFoundError: No module named ‘requests’ Bootstrap

I came across another fun one the other day, we are in the process of migrating our on premise elastic map reduce system into the cloud. We are using AWS EMR and have AWS Managed Airflow as the executor (DAG). We came across an odd situation with a pyspark application. When using Airflow with a SparkSubmitHook, the job would bootstrap looking just fine according to the run logs, however it would fail with No module named 'requests' when the application tried to import it. This was very odd since we have this application running from spark-submit just fine when calling it from the master node command line.

I decided to investigate the differences, our bootstrap script for installing python modules via pip which we call from the EMR API RunJobFlow call looks like this:

This is very basic, all it does is upgrade PIP and run PIP install to install each of the modules. When checking the bootstrap log I can see that PIP upgrades and goes out to the repo and installs the packages just fine. So why were we getting the No module named 'requests' error when executing through airflow. After a ton of googling and research I have found the issue and applied a solution that worked. Turns out airflow will run as the root user when bootstrapping, so if you notice we use the --user argument in pip. This will instruct the packages to be installed in the calling users home directory, the kicker is the code is run by the hadoop user on the EMR cluster nodes after executing from airflow. So turns out, the hadoop user is unable to access the requests module since root installed it with --user. I changed the bootstrap script to the following and it all started working, by removing --user and prefixing with sudo, the packages now get installed in a globally available area for all users. I am sure there are better ways to do this, I am still learning and researching, but if you run into this, the change below with get you out of the woods.

After some further research, and testing we decided to utilize a requirements.txt file to be called by the bootstrap shell script in the RunJobFlow call, first create a requirements.txt file, I like to hardcode the versions so nothing changes unexpectedly as you bootstrap a new cluster and it reaches out to PyPy to get the packages.

https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html

Add your desired packages and version numbers to a file called requirements.txt like below:

Then you will need to copy this file into a bucket you have access to:

Then create a shell script that has the following, call it bootstrap.sh:

Copy that shell script to your bucket:

And execute it via the bootstrap actions in the RunJobFlow EMR API call:

As you can see the shell script will be executed which will copy the requirements.txt file locally and then run pip -r against it which will install all the packages. If you want to see the log on a running cluster, you can ssh to the master node and view the logs here to see the bootstrapping take place:

You should see the stdout log as so:

Hope this helps.

Capture AWS CLI Output With Timestamps On Each Line Of Output

I needed a way to get output from aws cli captured into a log file with timestamps, out of the box the aws cli output has no timestamps in the output. If you execute a aws s3 cp command, something like this:

You will see output like so:

As you can see this does not show a timestamp in each event of output from the aws cli. So I scoured the internet and found out some interesting things. Turns out that aws cli out of the box outputs with carriage returns instead of newlines. So trying standard awk piping methods was not working. Also aws cli has the ability to change the output, so I needed to add a cli parameter to set output to text. Next we needed to use TR to substitute the carriage returns with newlines, finally we can pipe to awk and print a timestamp on each output event from the aws cli. The final command and output looks like this:

Produces the following in the log file which is my desired result:

I hope this helps someone else as it was a bear to solve for me.

AWS CLI Max Concurrent Requests Tuning

In this post I would like to go over how I tuned a test server for copying / syncing files from the local filesystem to S3 over the internet. If you ever had the task of doing this, you will notice that as the file count grows, so does the time it takes to upload the files to S3. After some web searching I found out that AWS allows you to tune the config to allow more concurrency than default.
AWS CLI S3 Config

The parameter that we will be playing with is max_concurrent_requests
This has a default value of 10, which allows only 10 requests to the AWS API for S3. Lets see if we can make some changes to that value and get some performance gains. My test setup is as follows:

I have 56 102MB files in the test directory:

For the first test I am going to run aws s3 sync with no changes, so out of the box it should have 10 max_concurrent_requests. Lets use the Linux time command to gather the time result to copy all 56 files to S3. I will delete the folder on S3 with each iteration to keep the test the same. You can also view the 443 requests via netstat and count them as well to show whats going on. In all the tests my best result was 250. So as you can see you will need to play with the settings to get the best result, these settings will change along with the server specs.

1. 1m25.919s with the default configuration:

2. Now lets set the max conqurent requests to 20 and try again, you can do this with the command below, after running we can see a little gain.

3. Bumped up to 50 shows a bit more gain:

4. Bumped up to 100, I start to notice that we lost some speed:

5. Bumped up to 250 we see the best result so far:

6. Bumped up to 500, we lose performance, most likely due to the machine resources.

So to wrap up, you can tune the amount of concurrent requests allowed from the aws cli to s3, you will need to play with this setting to get the best results for your machine.

PSQL Connect To AWS Redshift From Windows 10 PowerShell

Coming from a completely Linux background, I was tasked with connecting to a aws redshift cluster or a postgres cluster via Windows powershell and PSQL. I knew it was possible and searching the internet came up with CMD prompt solutions, when I attempted via powershell, I was faced with the following error below, you will need to install postgres on windows10 to get access to the psql binary, you can get it here:
https://www.postgresql.org/download/windows/

Turns out a colleague of mine and I figured out you will need to set the variable PGCLIENTENCODING via the powershell command line. This was expected but we could not nail down the syntax, we found it.

Once this is set, you can connect to PG as normal.