HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out

I recently had an issue where one of our EMR clusters failed to bootstrap the python modules via PIP. I checked the logs and saw that we ran into the following error:

I wanted PIP not to die when it timed out, and I also wanted it to retry on failure. By adding the following to my bootstrap.sh I was able to give the PIP socket timeout a longer interval and bump the retries up to 10. I have not seen the issue since I applied the new settings.
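The change itself is just two pip flags; a minimal sketch, with placeholder package names:

    # bootstrap.sh - give pip a longer socket timeout and more retries
    # (boto3/requests are placeholders, substitute your own module list)
    sudo pip install --timeout 60 --retries 10 boto3 requests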

From the PIP help page:
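The two relevant options read roughly like this (paraphrased from memory; check pip help install on your version for the exact wording):

    --retries <retries>    Maximum number of retries each connection should
                           attempt (default 5 times).
    --timeout <sec>        Set the socket timeout (default 15 seconds).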

Capture AWS CLI Output With Timestamps On Each Line Of Output

I needed a way to capture output from the aws cli into a log file with timestamps; out of the box the aws cli output has no timestamps. If you execute an aws s3 cp command, something like this:
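(The bucket and file names here are made up for illustration.)

    aws s3 cp ./bigfile.dat s3://my-example-bucket/bigfile.dat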

You will see output like so:
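The progress output looks roughly like this (reconstructed from memory rather than captured from a real run, so treat it as illustrative):

    Completed 12.4 MiB/102.0 MiB (9.8 MiB/s) with 1 file(s) remaining
    Completed 23.1 MiB/102.0 MiB (10.1 MiB/s) with 1 file(s) remaining
    upload: ./bigfile.dat to s3://my-example-bucket/bigfile.dat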

As you can see, this does not show a timestamp on each event of output from the aws cli. So I scoured the internet and found out some interesting things. It turns out that the aws cli writes its progress with carriage returns instead of newlines, so standard awk piping methods were not working. The aws cli also lets you change the output format, so I needed to add a cli parameter to set the output to text. Next we use tr to substitute the carriage returns with newlines, and finally we can pipe to awk and print a timestamp on each output event from the aws cli. The final command looks like this:
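A sketch of the pipeline (the log path and S3 names are placeholders, and the strftime()/fflush() calls assume gawk):

    aws s3 cp ./bigfile.dat s3://my-example-bucket/bigfile.dat --output text 2>&1 \
      | tr '\r' '\n' \
      | awk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; fflush(); }' \
      >> /var/log/s3_copy.log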

This produces the following in the log file, which is my desired result:
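Each line now carries the timestamp prepended by awk, something like (illustrative):

    2019-03-01 14:02:11 Completed 34.5 MiB/102.0 MiB (11.2 MiB/s) with 1 file(s) remaining
    2019-03-01 14:02:12 Completed 45.8 MiB/102.0 MiB (11.3 MiB/s) with 1 file(s) remaining
    2019-03-01 14:02:19 upload: ./bigfile.dat to s3://my-example-bucket/bigfile.dat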

I hope this helps someone else as it was a bear to solve for me.

AWS CLI Max Concurrent Requests Tuning

In this post I would like to go over how I tuned a test server for copying / syncing files from the local filesystem to S3 over the internet. If you have ever had this task, you will notice that as the file count grows, so does the time it takes to upload the files to S3. After some web searching I found out that AWS allows you to tune the config to allow more concurrency than the default.
AWS CLI S3 Config

The parameter that we will be playing with is max_concurrent_requests. It has a default value of 10, which allows only 10 concurrent requests to the AWS S3 API. Let's see if we can change that value and get some performance gains. My test setup is as follows:

I have 56 files of roughly 102MB each in the test directory:

For the first test I am going to run aws s3 sync with no changes, so out of the box it should use the default of 10 max_concurrent_requests. Let's use the Linux time command to measure how long it takes to copy all 56 files to S3. I will delete the folder on S3 before each iteration to keep the test the same. You can also watch the port 443 connections via netstat and count them to see what's going on. Across all the tests my best result was with 250, so as you can see you will need to play with the setting to get the best result, and the sweet spot will change with the server specs.

1. 1m25.919s with the default configuration:
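For reference, each timed run looked something like this (the local path and bucket are placeholders):

    time aws s3 sync /data/s3_test_files/ s3://my-test-bucket/s3_test_files/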

2. Now let's set the max concurrent requests to 20 and try again. You can do this with the command below; after running it we can see a small gain.
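Assuming the same sync command and bucket as above:

    aws configure set default.s3.max_concurrent_requests 20
    time aws s3 sync /data/s3_test_files/ s3://my-test-bucket/s3_test_files/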

3. Bumped up to 50 shows a bit more gain:

4. Bumped up to 100, I start to notice that we lost some speed:

5. Bumped up to 250 we see the best result so far:

6. Bumped up to 500, we lose performance, most likely due to the machine resources.

So to wrap up: you can tune the number of concurrent requests the aws cli makes to S3, but you will need to play with the setting to get the best results for your machine.

Mass Rename Files In Gcloud With Python Multiprocessing Parallel Gsutil

I had been tasked with renaming 50,000 files in place, up in the cloud, without bringing the files down locally. I looked at using wildcards with gsutil, however I was not able to remove what I wanted from the file names, so I set out to create a shell script to perform the task. I created a listing of files with gsutil, did some awk magic to get just the file names into listing2.txt, and wrote the following loop.
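A sketch of that loop (the bucket name is a placeholder; listing2.txt holds one object name per line):

    # one gsutil call per file, run serially - this is what made it so slow
    while read -r f; do
      gsutil mv "gs://my-example-bucket/${f}" "gs://my-example-bucket/${f#work-}"
    done < listing2.txt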

This will rename the files stripping out what I wanted, files go from:

work-data-sample__0_0_1.csv.gz to data-sample__0_0_1.csv.gz

I launched it and noticed something odd: it was only iterating over the list and making one call at a time to the gcloud API to rename each file. This was going to take forever; it actually took 24 hours. I did some reading of the docs and saw that gsutil has a -m option for multiprocessing, and I also checked the source code, where it looks like gsutil is multiprocess out of the box.

gsutil source code:

This is basically saying that if the OS can handle multiprocessing, spawn the same number of processes as the system has CPUs, and then set the thread count to 5. So my for loop in bash would have taken forever even with the -m option, since each loop iteration only hands gsutil a single file to move.

So I created some Python code that solves the problem in one pass: it lists the files, substrings out the file name, and uses Python's multiprocessing to spawn 25 workers to make the API calls in chunks. I learned a lot from this and I hope it helps others; the comments in the code show what's going on.
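The original script isn't reproduced here, so below is a minimal sketch of the same approach; the bucket name and chunk size are placeholders, and the real code may have differed in detail:

    # Sketch: list the objects, strip the unwanted prefix, and fan the
    # gsutil mv calls out over 25 worker processes.
    import subprocess
    from multiprocessing import Pool

    BUCKET = 'gs://my-example-bucket'   # placeholder bucket name

    def list_objects():
        """Return the object URLs in the bucket, one per line from gsutil ls."""
        out = subprocess.check_output(['gsutil', 'ls', BUCKET + '/work-*']).decode()
        return [line.strip() for line in out.splitlines() if line.strip()]

    def rename(src):
        """Strip the 'work-' prefix from the object name and move it in place."""
        name = src.rsplit('/', 1)[-1]                        # work-data-sample__0_0_1.csv.gz
        dst = BUCKET + '/' + name.replace('work-', '', 1)    # data-sample__0_0_1.csv.gz
        subprocess.check_call(['gsutil', 'mv', src, dst])
        return dst

    if __name__ == '__main__':
        objects = list_objects()
        pool = Pool(processes=25)        # 25 workers making the API calls
        for moved in pool.imap_unordered(rename, objects, chunksize=50):
            print(moved)
        pool.close()
        pool.join()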

You can see the process spawns 25 worker processes that will iterate over the list and perform the move in chunks.

CENTOS6 Postgres pg_upgrade 9 to 11 – In Place – Link – No Copy – Limited Disk Space

I wanted to share my experience with upgrading a Postgres database server from major version 9.3 to 11. I am showing the steps that I took to get many servers in dev and production upgraded with limited disk space (not enough space to copy the cluster). I am hoping this will help with the problems I faced when testing this procedure. Using the --link parameter has drawbacks, as noted in the documentation, however we perform full VM backups of each server, so we can always restore from backup if the upgrade fails, and we will not need to start the pg9.3 database again.

https://www.postgresql.org/docs/11/pgupgrade.html

-k
--link

use hard links instead of copying files to the new cluster
If you ran pg_upgrade with --link, the data files are shared between the old and new cluster. If you started the new cluster, the new server has written to those shared files and it is unsafe to use the old cluster.

Before we get started, make a backup of pg_hba.conf and postgresql.conf; you will need them later to reconstruct the pg11 configs.

Use WGET to grab the RPMS from https://yum.postgresql.org

Install the RPMS for postgres11 that we just downloaded
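Both steps together look roughly like this; the RPM file names and minor version are placeholders for whatever the current 11.x release is:

    # download the EL6 packages from yum.postgresql.org (file names are placeholders)
    wget https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-6-x86_64/postgresql11-libs-11.2-2PGDG.rhel6.x86_64.rpm
    wget https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-6-x86_64/postgresql11-11.2-2PGDG.rhel6.x86_64.rpm
    wget https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-6-x86_64/postgresql11-server-11.2-2PGDG.rhel6.x86_64.rpm
    wget https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-6-x86_64/postgresql11-contrib-11.2-2PGDG.rhel6.x86_64.rpm

    # install everything we just pulled down
    yum -y localinstall postgresql11-*.rpm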

We will create the data location for postgres11 where the files will be hard-linked and not copied. You can see the tablespace disk locations and the index locations from the pg9.3 install. It's important to create the new pg11 data directory on the same filesystem, since we will be using the --link parameter and hard links cannot cross filesystems.

We will need to init a postgres database in our new location on disk, data11.
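A sketch, assuming the new cluster lives under /data11/pgsql/data (substitute your own path on the same filesystem):

    # create the new data directory on the same filesystem and init the pg11 cluster
    mkdir -p /data11/pgsql/data
    chown -R postgres:postgres /data11
    su - postgres -c "/usr/pgsql-11/bin/initdb -D /data11/pgsql/data"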

Now we are ready to stop pg9.3 and check pg_upgrade compatibility. pg_upgrade ships with a --check argument that verifies the compatibility of the clusters and makes sure the upgrade will work before changing any files. Let's stop pg9.3 and run pg_upgrade with the --check parameter.
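A sketch of that step, assuming the stock PGDG init script name and placeholder old/new data directory paths:

    # stop the old 9.3 cluster first
    service postgresql-9.3 stop

    # dry run: verify the clusters are compatible before touching any files
    su - postgres -c "/usr/pgsql-11/bin/pg_upgrade \
        --old-bindir=/usr/pgsql-9.3/bin \
        --new-bindir=/usr/pgsql-11/bin \
        --old-datadir=/data/pgsql/9.3/data \
        --new-datadir=/data11/pgsql/data \
        --link --check"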


OK, the checks have passed and the cluster versions are ready for upgrade. Let's run this again without the --check parameter and upgrade postgres.

OK, pg_upgrade completed successfully and has generated two scripts: one to analyze the new pg11 cluster to get stats for the query planner and vacuum, and one to clean up and remove the old pg9.3 locations on disk. Let's start pg11. We will need to create an override file to tell pg11 where the data11 data lives, then we should be able to start postgres, check a few things, and verify our upgrade.
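On CENTOS6 with the PGDG packages, the init script reads /etc/sysconfig/pgsql/postgresql-11, so the override can be as simple as this (the data path is the placeholder used above):

    # tell the init script where the new data directory lives
    echo 'PGDATA=/data11/pgsql/data' > /etc/sysconfig/pgsql/postgresql-11

    service postgresql-11 start
    su - postgres -c "psql -c 'SELECT version();'"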


OK, we can see pg11 running and we could run the generated scripts to clean up, but first let's take a look at the data and index directories to see what the upgrade produced.

We can view the shell scripts that pg_upgrade produced to clean up the old pg9.3 references and run the analyze/vacuum.


This looks good. Let's execute them, clean up any pg9.3 references, and remove the pg9.3 rpms.
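Run them as the postgres user from the directory pg_upgrade was launched in (these are the script names pg_upgrade 11 generates):

    su - postgres -c "./analyze_new_cluster.sh"
    su - postgres -c "./delete_old_cluster.sh"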

Remove the pg9.3 rpms and references, set the new data location in the .pgsql_profile.
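Roughly, assuming the same placeholder data path as above and the stock postgres home directory:

    # remove the old 9.3 packages
    yum -y remove postgresql93\*

    # point the postgres user's environment at the new data directory
    echo "export PGDATA=/data11/pgsql/data" >> /var/lib/pgsql/.pgsql_profile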

You can now view the pg_hba.conf and postgresql.conf you saved in /root and add what's needed to the new pg11 configs.

That’s it!!

SINOPIA NPM allow connections to GITHUB as well as the NPM registry

SINOPIA LINK HERE
We use SINOPIA as a proxy on our internal network, behind the firewall, to allow users to install NODE packages without a direct internet connection. We basically run sinopia on a machine that has access to the internet, and the clients point to that server to install packages that are not available locally. We have been running into issues where installs that needed access to github would fail with something like this:

As you can see, we are getting choked at:

To get around this we need to change the config.yml on the server to allow proxying to github. Here is the final configuration; hope this helps other users, as we had a fun time trying to figure it out. Pay attention to the uplinks section and the proxy entries where github is defined.
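I can't reproduce our exact file here, so treat this as a sketch of the shape of the change, based on a stock sinopia config.yml; the uplink name and the package patterns are assumptions:

    # config.yml (sketch)
    uplinks:
      npmjs:
        url: https://registry.npmjs.org/
      github:
        url: https://github.com/

    packages:
      '*':
        allow_access: $all
        allow_publish: $authenticated
        # proxy misses to the public registry first, then github
        proxy: npmjs github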

Python Generator Find Files With Wildcard

This is a neat way to generate file names in a directory that match a specific pattern. I use it to generate a list of files exported out of hive to load into S3.
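A minimal version of the idea, using os.listdir and fnmatch (the directory and pattern in the example are placeholders):

    import fnmatch
    import os

    def find_files(directory, pattern):
        """Yield full paths of files in directory whose names match the wildcard pattern."""
        for name in os.listdir(directory):
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(directory, name)

    # example: every gzipped hive export in /data/hive_export
    for path in find_files('/data/hive_export', '*.gz'):
        print(path)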

POSTGRES – Top 100 Tables In Tablespace

I had a situation where I needed to find the top 100 largest tables in a certain tablespace on a postgres 9 database; in my case we archive tables into an archive1 tablespace. This query will find the largest relations in the archive1 tablespace. It's important to swap out 'archive1' with whatever tablespace you are trying to list.
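A query along these lines does the job (this is a reconstruction, not necessarily the exact query I ran):

    SELECT c.relname,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
    FROM pg_class c
    JOIN pg_tablespace t ON c.reltablespace = t.oid
    WHERE t.spcname = 'archive1'
    ORDER BY pg_total_relation_size(c.oid) DESC
    LIMIT 100;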

Hope this helps you out; it took some time to get it working.

Python Backup WORDPRESS Site / DATABASE and HTML

I have this blog hosted on a LINODE dedicated LINUX server. It's about 10 dollars a month for a 1 core system with about 250GB of disk space and 1GB of RAM, and it runs the common LAMP stack. I needed a quick and dirty script to back up the MYSQL database and the PHP code contained in the /var/www/html folder. I wanted the script to compress the contents of both and move them into a directory with the correct date. See the comments below outlining the code and the action of running it.
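The script below is a minimal sketch of that approach rather than my exact code; the database name, credentials, and backup destination are placeholders:

    #!/usr/bin/env python
    # Sketch: dump the MySQL database and zip the web root into a dated directory.
    import datetime
    import os
    import subprocess

    BACKUP_ROOT = '/backups'      # placeholder destination
    DB_NAME = 'wordpress'         # placeholder database name
    DB_USER = 'backup_user'       # placeholder credentials
    DB_PASS = 'changeme'
    WEB_ROOT = '/var/www/html'

    def main():
        # dated directory, e.g. /backups/2019-03-01
        today = datetime.date.today().isoformat()
        dest = os.path.join(BACKUP_ROOT, today)
        if not os.path.isdir(dest):
            os.makedirs(dest)

        # 1) mysqldump the database and gzip the dump
        sql_gz = os.path.join(dest, DB_NAME + '.sql.gz')
        with open(sql_gz, 'wb') as out:
            dump = subprocess.Popen(
                ['mysqldump', '-u', DB_USER, '-p' + DB_PASS, DB_NAME],
                stdout=subprocess.PIPE)
            subprocess.check_call(['gzip', '-c'], stdin=dump.stdout, stdout=out)
            dump.wait()

        # 2) zip up the web root
        html_zip = os.path.join(dest, 'html.zip')
        subprocess.check_call(['zip', '-qr', html_zip, WEB_ROOT])

    if __name__ == '__main__':
        main()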

So you can see we generated 2 files in a dated directory; I chose to use both zip and gzip as compression algorithms. To view the contents you can run the normal linux commands to extract the files.

So there you have it, I can now tar up the entire dated directory for easy offsite backup of my entire site jasonralph.org. Hope this helps someone; feel free to copy the source code and change it at will.

Best,
Jason

Analytics – With Google

Well, I thought I would write up a quick post to demonstrate that even the slower kids like myself can get visitors hitting their site if they put some effort behind it. I started this blog back in late 2012 and I only posted a couple of code snippets here and there. Then my coding skills developed a bit and I kept the domain alive, so it started to get a bit more traffic. Anyway, this really is not much traffic, but it's neat to say that I have had my blog up and recording visits for some time now.

Google Analytics JR.org

So here it is. If you want to advertise, hit me up at [email protected] 🙂