Mass Rename Files In Gcloud With Python Multiprocessing Parallel Gsutil

I was tasked with renaming 50,000 files in place, up in the cloud, without bringing them down locally. I looked at using wildcards with gsutil, but I was not able to remove the part of the filename I wanted, so I set out to write a shell script for the task. I created a listing of the files with gsutil, did some awk magic to get just the filenames into listing2.txt, and wrote the following loop.

This renames the files, stripping out the prefix I wanted gone. Files go from:

work-data-sample__0_0_1.csv.gz to data-sample__0_0_1.csv.gz

for files in $(cat listing2.txt) ; do  
    echo "Renaming: $files --> ${files#work-}"
    gsutil mv gs://gs-bucket/jason_testing/$files gs://gs-bucket/jason_testing/${files#work-}
done

I launched it and noticed something odd: it was simply iterating over the list and making one call to the gcloud API per file, one rename at a time. This was going to take forever; it actually took 24 hours. I did some reading of the docs and saw that gsutil has a -m option for multiprocessing. I also checked the source code, and it looks like gsutil is multiprocess out of the box.
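As a rough back-of-the-envelope check on those numbers, the snippet below works out the average cost of each rename in the sequential loop. It only uses the two figures quoted in this post; treating the run as exactly 24 hours is an assumption.

```python
# Figures quoted in the post; "exactly 24 hours" is an assumption.
total_files = 50000
elapsed_seconds = 24 * 60 * 60  # 86400 seconds

# Average wall-clock time spent on each sequential gsutil mv call.
seconds_per_file = elapsed_seconds / total_files
print("{:.3f} seconds per rename".format(seconds_per_file))
```

At well over a second per file, the time adds up quickly when every call has to wait for the previous one to finish.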

gsutil source code:

should_prohibit_multiprocessing, unused_os = ShouldProhibitMultiprocessing()
if should_prohibit_multiprocessing:
  DEFAULT_PARALLEL_PROCESS_COUNT = 1
  DEFAULT_PARALLEL_THREAD_COUNT = 24
else:
  DEFAULT_PARALLEL_PROCESS_COUNT = min(multiprocessing.cpu_count(), 32)
  DEFAULT_PARALLEL_THREAD_COUNT = 5

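This is basically saying: if the OS can handle multiprocessing, let's spawn as many processes as the system has CPUs (capped at 32) and set the thread count to 5; otherwise use a single process with 24 threads. That parallelism only applies across the files handled by a single gsutil invocation, though. My bash for loop handed gsutil exactly one file per call, so it would have taken forever with the -m option as well.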
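To make the two branches concrete, here is a minimal sketch that mirrors the logic of the excerpt as a plain function. The function name default_parallelism and its parameters are hypothetical, made up for illustration; they are not part of gsutil.

```python
import multiprocessing


def default_parallelism(prohibit_multiprocessing, cpu_count):
    """Return (process_count, thread_count) following the excerpt's branches.

    Hypothetical helper for illustration only; not a gsutil function.
    """
    if prohibit_multiprocessing:
        # Single process, more threads.
        return 1, 24
    # One process per CPU, capped at 32, with 5 threads each.
    return min(cpu_count, 32), 5


if __name__ == "__main__":
    cpus = multiprocessing.cpu_count()
    print(default_parallelism(False, cpus))
```

On an 8 CPU machine that allows multiprocessing, this gives 8 processes with 5 threads each; a 64 CPU machine would be capped at 32 processes.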

So I wrote some Python code to solve this issue. It performs all the steps in one go: it lists the files, substrings out the filename, and uses Python's multiprocessing module to spawn 25 workers that make the API calls in chunks. I learned a lot from this and I hope it helps others. I have added comments in the code to show what's going on.

#!/usr/bin/env python3.5
import subprocess
import multiprocessing
import datetime
import shlex


class GsRenamer:
    def __init__(self):
        self.gs_cmd = '/home/jasonr/google-cloud-sdk/bin/gsutil'
        self.file_list = []
        self.final_rename_list = []
        self.now = datetime.datetime.now()

    # method to call subprocess on each command I feed it.
    def execute_jobs(self, cmd):
        try:
            print('{0} INFO: Running rename command: [{1}]'.format(self.now,
                                                                    cmd))
            subprocess.run(shlex.split(cmd), check=True)
        except subprocess.CalledProcessError as e:
            # .format() has to be called on the string, not on print()'s
            # return value.
            print('[{0}] FATAL: Command failed with error [{1}]'.format(cmd,
                                                                         e))

    # method to get all the filenames from the bucket in gcloud and split it to
    # get the filenames. Load the results into a list. filter the list for any
    # blank lines. 
    def get_filenames_from_gs(self):
        cmd = [self.gs_cmd, 'ls',
               'gs://gs-bucket/jason_testing']
        output = subprocess.Popen(cmd, stdout=subprocess.PIPE,
        universal_newlines=True).communicate()[0].splitlines()
        for full_file_path in output:
            file_name = full_file_path.split('/')[-1]
            # Skip blank entries, such as listing lines that end in '/'.
            if file_name:
                self.file_list.append(file_name)


    # method to iterate over the list and create the commands,
    # also leverage pythons string replace to rename the source
    # and target file. build the commands and load into a list.
    # Also use pythons multiprocessing module to spawn 25 processes
    # that hit the api in chunks cutting down the time to rename
    # all the files in place.
    def rename_files(self, string_original, string_replace):
        for files in self.file_list:
            renamed_files = files.replace(string_original,
                                          string_replace)
            rename_command = "{0} mv gs://gs-bucket/jason_testing/{1} " \
                             "gs://gs-bucket/jason_testing/{2}" \
                             .format(self.gs_cmd, files, renamed_files)
            self.final_rename_list.append(rename_command)
        self.final_rename_list.sort()
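        # Use a local pool instead of overwriting the multiprocessing.pool
        # module attribute, and clean it up once all the renames are done.
        with multiprocessing.Pool(processes=25) as pool:
            pool.map(self.execute_jobs, self.final_rename_list)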


def main():
    gsr = GsRenamer()
    gsr.get_filenames_from_gs()
    gsr.rename_files('work-', '')


if __name__ == "__main__":
    main()
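One thing to be aware of: str.replace swaps the string wherever it appears in the name, while the bash ${files#work-} expansion only strips it from the front. The sketch below is one possible variation, not the code I ran. It strips a leading prefix only, skips names that do not need a rename, and builds each command as an argument list that could be handed to subprocess.run without shlex.split. The helper names and the second sample filename are made up for illustration.

```python
def strip_prefix(name, prefix):
    """Remove prefix only when name starts with it, like bash ${name#prefix}."""
    if name.startswith(prefix):
        return name[len(prefix):]
    return name


def build_rename_commands(file_names, prefix, base_url, gs_cmd="gsutil"):
    """Build one 'gsutil mv' argument list per file that needs renaming."""
    commands = []
    for name in sorted(file_names):
        new_name = strip_prefix(name, prefix)
        if new_name == name:
            continue  # nothing to strip, so no mv is needed
        commands.append([gs_cmd, "mv",
                         "{0}/{1}".format(base_url, name),
                         "{0}/{1}".format(base_url, new_name)])
    return commands


if __name__ == "__main__":
    # The second name is a hypothetical example that str.replace would mangle.
    names = ["work-data-sample__0_0_1.csv.gz", "homework-notes.csv.gz"]
    for cmd in build_rename_commands(names, "work-",
                                     "gs://gs-bucket/jason_testing"):
        print(cmd)
```

With these two sample names only the first one produces a command; the second is left alone because "work-" is not at the start of the name.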

You can see the process spawns 25 worker processes that will iterate over the list and perform the move in chunks.

jasonr    4612  0.3  0.1 403772 10968 pts/0    Sl+  13:34   0:00  |           \_ python3.5 renamer.py
jasonr    4758  0.0  0.0 176432  8080 pts/0    S+   13:34   0:00  |               \_ python3.5 renamer.py
jasonr    4759  0.0  0.0 176432  8080 pts/0    S+   13:34   0:00  |               \_ python3.5 renamer.py
jasonr    4760  0.0  0.0 176432  8080 pts/0    S+   13:34   0:00  |               \_ python3.5 renamer.py
jasonr    4761  0.0  0.0 176432  8084 pts/0    S+   13:34   0:00  |               \_ python3.5 renamer.py
jasonr    4762  0.0  0.0 176432  8088 pts/0    S+   13:34   0:00  |               \_ python3.5 renamer.py
jasonr    4763  0.0  0.0 176432  8092 pts/0    S+   13:34   0:00  |               \_ python3.5 renamer.py
jasonr    4764  0.0  0.0 176432  8092 pts/0    S+   13:34   0:00  |               \_ python3.5 renamer.py
jasonr    4765  0.0  0.0 176432  8092 pts/0    S+   13:34   0:00  |               \_ python3.5 renamer.py
jasonr    4766  0.0  0.0 176432  8100 pts/0    S+   13:34   0:00  |               \_ python3.5 renamer.py
jasonr    4767  0.0  0.0 176432  8100 pts/0    S+   13:34   0:00  |               \_ python3.5 renamer.py
--SNIP--
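The "in chunks" part comes from Pool.map itself. When no chunksize is passed, CPython's multiprocessing.pool splits the task list into roughly four chunks per worker. The sketch below reproduces that calculation as I understand it from the CPython source; it is an implementation detail and could differ between Python versions. The function name default_chunksize is hypothetical.

```python
def default_chunksize(num_tasks, num_workers):
    """Mirror CPython's Pool.map default: about four chunks per worker.

    Hypothetical helper; the real calculation lives inside
    multiprocessing.pool and is an implementation detail.
    """
    chunksize, extra = divmod(num_tasks, num_workers * 4)
    if extra:
        chunksize += 1
    return chunksize


if __name__ == "__main__":
    print(default_chunksize(50000, 25))  # 500 commands per chunk
    print(default_chunksize(56, 25))     # 1 command per chunk
```

So for the full 50,000 file list with 25 workers, each worker would pick up batches of 500 rename commands at a time, while a small test set gets handed out one command at a time.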

2 thoughts on “Mass Rename Files In Gcloud With Python Multiprocessing Parallel Gsutil”

Atul Chadha says:

October 10, 2019 at 12:42 pm

Great! So, how long did it finally take after your magic (25 worker processes)?

1. admin says:
  
  October 13, 2019 at 12:09 am
  
  I wrote the multiprocessing python code after we already purged the data, so I did not have the benchmarks to show. I did save a subset for testing.
  So I performed a similar test on 56 100MB files. The original data set was 50000 100MB files.
  Here are the results.
  So from ~2 minutes to ~40 seconds.
  
  Rename Benchmark 56 100MB files in GCLOUD:
  
  bash loop:
  [jasonr@jr-sandbox test]$ time bash test_loop.sh
  real 1m54.362s
  user 1m1.181s
  sys 0m10.148s

  python 25 processes multiprocessing:
  [jasonr@jr-sandbox test]$ time python3.5 renamer.py
  real 0m41.156s
  user 1m9.334s
  sys 0m9.636s

Jason R. Ralph
