Mass Rename Files In Gcloud With Python Multiprocessing Parallel Gsutil

I had been tasked with renaming 50000 files in place, up in the cloud, without bringing them down locally. I looked at using wildcards with gsutil, but I was not able to remove what I wanted from the file names, so I set out to create a shell script to perform the task. I created a listing of the files with gsutil and did some awk magic to get just the filenames into listing2.txt. Then I wrote the following loop.

This renames the files, stripping out the part I wanted removed. Files go from:

work-data-sample__0_0_1.csv.gz to data-sample__0_0_1.csv.gz
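To make the name change concrete, here is a minimal Python sketch of the transformation shown above. It is not the original loop; the function name `strip_prefix`, the default prefix `work-`, and the bucket path in the usage line are assumptions made for illustration.

```python
def strip_prefix(url, prefix="work-"):
    """Remove `prefix` from the file-name part of a path or gs:// URL.

    Only the part after the last "/" is touched, so a bucket or
    directory that happens to start with the prefix is left alone.
    """
    head, sep, name = url.rpartition("/")
    if name.startswith(prefix):
        name = name[len(prefix):]
    return head + sep + name


# Hypothetical bucket and directory, used only to show the call.
print(strip_prefix("gs://my-bucket/exports/work-data-sample__0_0_1.csv.gz"))
```

Names that do not start with the prefix come back unchanged, so the function is safe to run over a mixed listing.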

I launched it and noticed something odd: it was only iterating over the list and making one call at a time to the gcloud API to rename each file. This was going to take forever; it actually took 24 hours. I did some reading of the docs and saw that gsutil has a -m option for multiprocessing. I also checked the source code, and it looks like gsutil is multiprocess out of the box.

gsutil source code:

This basically says: if the OS can handle multiprocessing, spawn as many processes as the system has CPUs, and set the thread count to 5. So my for loop in bash would have taken forever with the -m option as well, because each iteration handed gsutil only a single file to move, leaving nothing to run in parallel.

So I wrote some Python code to solve this issue. It performs all the steps in one go: it lists the files, substrings out the filename, and uses Python's multiprocessing to spawn 25 workers that make the API calls in chunks. I learned a lot from this and I hope it helps others. I will add comments in the code to show what's going on.

You can see the process spawns 25 worker processes that will iterate over the list and perform the move in chunks.
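For readers who want to see the shape of the approach, here is a rough sketch of the steps described above: list the objects, work out the new names, and hand the moves to a pool of 25 worker processes in chunks. It is an illustration under stated assumptions, not the script from this post. The helper names (`list_objects`, `plan_renames`, `rename_one`, `rename_all`), the `work-` prefix, and the chunk size of 100 are all hypothetical. It only uses `gsutil ls` and `gsutil mv` through `subprocess`.

```python
import subprocess
from multiprocessing import Pool

PREFIX = "work-"  # assumption: the part of the file name to strip
WORKERS = 25      # worker count mentioned in the post


def list_objects(pattern):
    """Return the object URLs printed by `gsutil ls <pattern>`."""
    result = subprocess.run(
        ["gsutil", "ls", pattern],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def plan_renames(urls, prefix=PREFIX):
    """Pair every URL whose file name starts with `prefix` with its new URL."""
    pairs = []
    for url in urls:
        head, sep, name = url.rpartition("/")
        if name.startswith(prefix):
            pairs.append((url, head + sep + name[len(prefix):]))
    return pairs


def rename_one(pair):
    """Move one object in place and return (source URL, gsutil exit code)."""
    src, dst = pair
    result = subprocess.run(["gsutil", "mv", src, dst])
    return src, result.returncode


def rename_all(pairs, workers=WORKERS, chunksize=100):
    """Spread the moves over `workers` processes, `chunksize` pairs at a time."""
    with Pool(processes=workers) as pool:
        return pool.map(rename_one, pairs, chunksize=chunksize)
```

A run would look something like `rename_all(plan_renames(list_objects("gs://my-bucket/exports/work-*")))`, where the bucket path is made up. On platforms that start worker processes with spawn rather than fork, that call needs to sit under an `if __name__ == "__main__":` guard. Checking the returned exit codes afterwards shows which moves, if any, need a retry.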

2 thoughts on “Mass Rename Files In Gcloud With Python Multiprocessing Parallel Gsutil”

    1. I wrote the multiprocessing Python code after we had already purged the data, so I have no benchmarks from the full run to show. I did save a subset for testing.
      So I ran a similar test on 56 100MB files. The original data set was 50000 100MB files.
      Here are the results.
      Wall-clock time went from ~2 minutes to ~40 seconds, roughly 2.8 times faster.

      Rename Benchmark 56 100MB files in GCLOUD:

      bash loop:
      [jasonr@jr-sandbox test]$ time bash
      real 1m54.362s
      user 1m1.181s
      sys 0m10.148s

      python 25 processes multiprocessing:
      [jasonr@jr-sandbox test]$ time python3.5
      real 0m41.156s
      user 1m9.334s
      sys 0m9.636s
