Mass Rename Files In Gcloud With Python Multiprocessing Parallel Gsutil

I was tasked with renaming 50,000 files in place, up in the cloud, without bringing the files down locally. I looked at using wildcards with gsutil, but I was not able to remove what I wanted from the filenames, so I set out to create a shell script to perform the task. I created a listing of the files with gsutil and did some awk magic to get just the filenames into listing2.txt. I wrote the following loop.

This will rename the files, stripping out what I wanted. Files go from:

work-data-sample__0_0_1.csv.gz to data-sample__0_0_1.csv.gz

I launched it and noticed something odd: it was only iterating over the list and making one call at a time to the gcloud API to rename each file. This was going to take forever; it actually took 24 hours. I did some reading of the docs and saw that gsutil has a -m option for multiprocessing. I also checked the source code, and it looks like gsutil is multiprocess out of the box.

gsutil source code:

This is basically saying that if the OS can handle multiprocessing, spawn as many processes as the system has CPUs, and then set the thread count to 5. So my for loop in bash would have taken forever with the -m option as well, since each gsutil call in the loop handled only one file and had nothing to parallelize.

So I created some Python code to solve this issue. It performs all the steps in one go: it lists the files, substrings out the new filename, and uses Python's multiprocessing to spawn 25 workers that make the API calls in chunks. I learned a lot from this and I hope it helps others. I will add comments in the code to show what's going on.
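As a rough illustration of that approach, here is a minimal sketch, not the production script. The bucket path, the chunk size, and every function name are assumptions made up for the example; only the "work-" prefix and the 25 workers come from the description above.

```python
import subprocess
from multiprocessing import Pool

BUCKET = "gs://my-bucket/exports"  # hypothetical bucket path
PREFIX = "work-"                   # the part of the filename to strip


def new_name(filename, prefix=PREFIX):
    """Return filename without the leading prefix (unchanged if absent)."""
    return filename[len(prefix):] if filename.startswith(prefix) else filename


def build_mv_command(filename, bucket=BUCKET):
    """Build the gsutil mv command that renames one object in place."""
    return ["gsutil", "mv", f"{bucket}/{filename}", f"{bucket}/{new_name(filename)}"]


def rename_one(filename):
    """Run a single rename and return (filename, gsutil exit code)."""
    return filename, subprocess.call(build_mv_command(filename))


def rename_all(filenames, workers=25, chunksize=100):
    """Fan the renames out over a pool of worker processes, in chunks."""
    with Pool(processes=workers) as pool:
        return pool.map(rename_one, filenames, chunksize=chunksize)
```

Calling rename_all() with the list of filenames (from inside an `if __name__ == "__main__":` guard) returns one (filename, exit code) pair per file, so any failed renames can be retried afterwards.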

You can see the process spawns 25 worker processes that will iterate over the list and perform the move in chunks.

Python Function Execute Subprocess With Timeout

I have a project that rsyncs data from an RPM repository to keep a local copy of that repo. The issue I faced was that the remote mirror would sometimes stall the rsync because of an overloaded network or other unforeseen issues. I wanted to use rsync's hashing algorithm to have it start right where it left off, so I wrote a function to do this. If the 900-second timeout was hit, it usually meant there was an issue with the transfer. I also want to state here that I observed rsync stop being served on many mirrors, so it was not just an issue with the network. I use this in production, and it logs each iteration or restart. The function below also kills the current rsync so that multiple copies are not running at the same time. I only wanted to attempt the rsync 5 times, so I use a while loop here.

Here are the individual rsync commands in the INI configuration.

Here is how I call the execute_jobs_timeout() function:

The function:
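A minimal sketch of what such a function might look like, not necessarily the production code. The name execute_jobs_timeout() comes from the text above; the signature, the return value, and the log messages are assumptions for illustration.

```python
import logging
import subprocess


def execute_jobs_timeout(cmd, timeout=900, max_attempts=5):
    """Run cmd, killing and restarting it whenever it exceeds timeout seconds.

    Returns the command's exit code, or None if every attempt timed out.
    """
    attempt = 0
    while attempt < max_attempts:
        attempt += 1
        logging.info("attempt %d of %d: %s", attempt, max_attempts, " ".join(cmd))
        proc = subprocess.Popen(cmd)
        try:
            # Finished within the time limit: hand back its exit code.
            return proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Kill the stalled process so two copies never run at once.
            proc.kill()
            proc.wait()
            logging.warning("timed out after %s seconds, restarting", timeout)
    return None
```

Because rsync skips what has already been transferred, each restart only has to pick up the remainder.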

Log Snippet showing each command executing:

Python Generator Find Files With Wildcard

This is a neat way to generate the names of files in a directory that match a specific pattern. I use this to generate a list of files exported out of Hive to load into S3.
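One common way to write such a generator is os.walk() combined with fnmatch. This is a sketch under that assumption; the function name is made up for the example, and note that os.walk() also descends into subdirectories.

```python
import fnmatch
import os


def find_files(directory, pattern):
    """Yield the full path of every file under directory matching pattern."""
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(root, name)
```

Since it is a generator, the paths are produced one at a time, so a caller can start uploading before the whole directory tree has been scanned.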

Python3 Subprocess and Rsync Deadlock Strace Timeout

I recently came across a tough-to-debug issue where I was calling a shell script from Python using the subprocess module. The shell script called rsync, and no matter what I did I would always run into a timeout. I fired up strace and noticed that the process was timing out in select():

select(4, NULL, [3], [3], {60, 0}) = 0 (Timeout)

I looked at the subprocess documentation, and apparently a child process writing to a pipe can fill the OS pipe buffer and then block.

Warning

This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.

I was baffled. I finally took the approach of discarding stderr and stdout and just checking the return status of the command using run(). Here is what I finally came up with, and all was well.
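A minimal sketch of that approach, assuming the output really is not needed. The helper name run_quiet is hypothetical; subprocess.run() and subprocess.DEVNULL are standard library.

```python
import subprocess


def run_quiet(cmd, timeout=None):
    """Run cmd with stdout and stderr discarded and return its exit status."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,  # no pipe, so no pipe buffer to fill up
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return result.returncode
```

With no pipe between parent and child, the child can write as much as it likes without ever blocking on a full buffer.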

Hope you find this and it helps you.

Amazon Redshift Long Running Query Alert to Slack

This Python code, when run as a user that can query the STV_RECENTS table, checks the duration of each currently running query against the threshold set by the CLI arguments and sends an alert to Slack if a query exceeds it (30 minutes in my case). I have it cronned up and running every 30 minutes.
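The core of the check might look like the sketch below. It assumes the rows come from a psycopg2 cursor running the query shown, and it relies on STV_RECENTS reporting duration in microseconds. The function names and the message format are made up for the example, and the Slack call itself is left out.

```python
# Assumed query: one row per statement that is currently running.
QUERY = """
    SELECT pid, user_name, db_name, duration, query
    FROM stv_recents
    WHERE status = 'Running'
"""


def long_running(rows, threshold_minutes=30):
    """Return the rows whose duration (microseconds) exceeds the threshold."""
    limit_us = threshold_minutes * 60 * 1_000_000
    return [row for row in rows if row[3] > limit_us]


def format_alert(row):
    """Build a one-line alert message for a single long-running query."""
    pid, user, db, duration_us, sql = row
    minutes = duration_us / 60_000_000
    return (f"Redshift query pid={pid} user={user.strip()} db={db.strip()} "
            f"has been running for {minutes:.0f} minutes: {sql.strip()[:200]}")
```

Each message returned by format_alert() would then be posted to the Slack channel.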

CLI example:

You will need slackclient:
https://pypi.python.org/pypi/slackclient
You will need psycopg2:
https://pypi.python.org/pypi/psycopg2

INI file:

Slack message example:

Nagios Check Postgres Table Date Column Against now()

I had a situation where a daily sync of a table from one database to another was failing. This table was updated daily so the query should return something like this when it was synced correctly:

I use Nagios very heavily, so I set up a custom plugin to check the query's date against today's date. It returns WARNING or CRITICAL based on user-supplied arguments. Here is what a failure looks like when running from the Nagios server's command line. This worked well at alerting me when the sync failed; it was integrated into the Nagios subsystem, and emails and Slack alerts are generated as expected.
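The comparison at the heart of such a plugin can be sketched as follows. The function name, the threshold semantics (age in days), and the message wording are assumptions; the exit codes 0, 1, and 2 are the standard Nagios plugin return codes.

```python
from datetime import date

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def check_sync_date(latest, warn_days, crit_days, today=None):
    """Compare the newest date in the table with today's date.

    Returns (exit_code, message) for the plugin to print and exit with.
    """
    today = today or date.today()
    age = (today - latest).days
    if age >= crit_days:
        return CRITICAL, f"CRITICAL - newest row is {age} day(s) old ({latest})"
    if age >= warn_days:
        return WARNING, f"WARNING - newest row is {age} day(s) old ({latest})"
    return OK, f"OK - table is current ({latest})"
```

The plugin would print the message and call sys.exit() with the code, which is how Nagios decides what state the service is in.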

NAGIOS AWS REDSHIFT TABLE COUNT PLUGIN PYTHON

I needed a quick plugin to warn me if one of our AWS Redshift instances had a table count above 6000 and to alert critical if it was above 7000. I decided to write a Python plugin for Nagios to do the chore. You can see the source code and an example of executing it on the Nagios host below.
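A sketch of the argument handling and threshold logic such a plugin needs, assuming the table count has already been fetched from Redshift. The option names and the output format are illustrative only; the part after the pipe follows the Nagios performance-data convention of label=value;warn;crit.

```python
import argparse


def parse_args(argv=None):
    """Parse the warning and critical thresholds from the command line."""
    parser = argparse.ArgumentParser(description="Check Redshift table count")
    parser.add_argument("-w", "--warning", type=int, default=6000)
    parser.add_argument("-c", "--critical", type=int, default=7000)
    return parser.parse_args(argv)


def evaluate(count, warning, critical):
    """Map a table count to a Nagios (exit_code, message) pair."""
    perfdata = f"tables={count};{warning};{critical}"
    if count > critical:
        return 2, f"CRITICAL - {count} tables | {perfdata}"
    if count > warning:
        return 1, f"WARNING - {count} tables | {perfdata}"
    return 0, f"OK - {count} tables | {perfdata}"
```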

Python Backup WORDPRESS Site / DATABASE and HTML

I have this blog hosted on a Linode dedicated Linux server. It’s about 10 dollars a month for a 1-core system with about 250GB of disk space and 1GB of RAM. This server runs the common LAMP stack. I needed a quick and dirty script to back up the MySQL database and the PHP code contained in the /var/www/html folder. I wanted the script to compress the contents of both and move them into a directory named with the current date. See the comments below outlining the code and the action of running it.

So you can see we generated 2 files in a dated directory. I chose to use both zip and gzip for compression. To view the contents you can run the normal Linux commands to extract the files.

So there you have it: I can now tar up the entire dated directory for an easy offsite backup of my entire site, jasonralph.org. Hope this helps someone; feel free to copy the source code and change it at will.
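A minimal sketch of a backup along those lines, not the original script. The function names and archive names are made up, and the database step assumes mysqldump can find its credentials on its own (for example in ~/.my.cnf).

```python
import gzip
import shutil
import subprocess
from datetime import date
from pathlib import Path


def dated_dir(base, today=None):
    """Create and return base/YYYY-MM-DD for today's backup."""
    today = today or date.today()
    path = Path(base) / today.strftime("%Y-%m-%d")
    path.mkdir(parents=True, exist_ok=True)
    return path


def backup_html(src, dest_dir):
    """Zip the web root into dest_dir/html.zip and return the archive path."""
    archive = shutil.make_archive(str(Path(dest_dir) / "html"), "zip", root_dir=src)
    return Path(archive)


def backup_database(db_name, dest_dir):
    """Dump db_name with mysqldump and gzip the result into dest_dir."""
    target = Path(dest_dir) / f"{db_name}.sql.gz"
    dump = subprocess.run(["mysqldump", db_name], stdout=subprocess.PIPE, check=True)
    with gzip.open(target, "wb") as fh:
        fh.write(dump.stdout)
    return target
```

This holds the whole dump in memory before compressing it, which is fine for a small blog database; a large one would be better streamed.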

Best,
Jason

Python Mysql Connector

Thought I would try my hand at some SQL programming with Python. I was stuck using a Windows machine (BLAH), and I wanted to set up a MySQL database and 1 table for testing, so I installed WAMP, which is a Windows Apache MySQL PHP server. You can get the installer for Windows here:
http://www.wampserver.com/en/

You can get the python library / module here:

Mysql Python
Using phpMyAdmin I set up a database and added the table.

As you can see, I created a function that adds data to my new database. I pass it 4 parameters (id, name, dept, and salary), and when you execute the script the database gets populated with the correct data. Pretty cool!!

Here is how you can see the help and call the command properly:

Here is what you can see when performing a select * from the table.

Here is the awesome IDE PyCharm and the results from running the code.
Here is an example of how to query that table in the database.
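A sketch of the insert and the lookup by ID using parameterized queries. It works with any DB-API connection; the table name employees and the function names are assumptions, with the columns following the four parameters mentioned above. The placeholder defaults to %s, which is what MySQL drivers such as mysql.connector expect.

```python
def add_employee(conn, emp_id, name, dept, salary, placeholder="%s"):
    """Insert one row with a parameterized query and commit it."""
    marks = ", ".join([placeholder] * 4)
    sql = f"INSERT INTO employees (id, name, dept, salary) VALUES ({marks})"
    cur = conn.cursor()
    cur.execute(sql, (emp_id, name, dept, salary))
    conn.commit()
    cur.close()


def fetch_employee(conn, emp_id, placeholder="%s"):
    """Return the row for emp_id as a tuple, or None if it does not exist."""
    sql = f"SELECT id, name, dept, salary FROM employees WHERE id = {placeholder}"
    cur = conn.cursor()
    cur.execute(sql, (emp_id,))
    row = cur.fetchone()
    cur.close()
    return row
```

Passing the values separately from the SQL string lets the driver handle quoting, so names with apostrophes in them do not break the statement.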

Here is the output of the fetch command using ID as a parameter:

Took me a while to figure out the syntax, so hopefully this helps someone.
Jason

PYTHON – Script to download youtube videos for offline viewing

I was interested in viewing this video of a news conference (USENIX 2016) on my trip home on the Metro-North train, NYC => CT. The trip is about an hour and 10 minutes from Manhattan’s Grand Central Terminal to Milford, CT, on the express train that is. My concern was that I would have choppy internet service on the way, since I had recently updated my laptop and the built-in Verizon mobile card was not activated yet. I would need to use my AT&T iPhone as a hotspot, which proved to be very shaky at times. A colleague of mine recommended a website for making YouTube videos available for offline viewing. The name of this site was:

http://www.keepvid.com

Right off the rip I was concerned that this site was infested with malware and any other bullshit associated with a free video ripping service. I used the site and was able to create a download of the video I was interested in; however, who knows how sick my Windows-based machine just got. I could have contracted anything from this site.

I thought about this and said there has to be a better way, or a Python lib for this, and lo and behold, a search came up with pytube:
https://github.com/nficano/pytube

This library had some interesting features and blew away the keepvid site in regards to flexibility. Here is some explanation of what this library can do. Please have a look at the examples below; I will do my best to narrate them.

Here I use pip to install the pytube lib. You can ignore the DEPRECATION: warning for my outdated Python that blares at you for being such an idiot.

Next up you can see that I am setting a variable yt (this is the video you want to download). Using Python’s pretty-print library, you can call pprint(yt.get_videos()) to see what formats are available for download.

Please have a look at the comments in the code for a bit more detail on what is going on. In this example I am using Pulp_Fiction.mp4 as the filename I want the download saved as.

Ok so here is what it looks like when you execute the program:

As you can see, we have a new file with the video we asked for, ready to watch without a streaming internet connection. Here is an ls to show it:

As always, I am sure there are better ways to do this and I am sure there is cleaner code. Most of this code was taken right from the author's site, who is a badass. Here is his link:

https://github.com/nficano/pytube

Hope you liked,
J$0N