I had been tasked with renaming 50,000 files in place, up in the cloud, without bringing them down locally. I looked at using wildcards with gsutil, but I was not able to strip what I wanted from the filenames, so I set out to write a shell script to do the job. I created a listing of the files with gsutil, did some awk magic to get just the filenames into listing2.txt, and then wrote the following loop.
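For reference, the listing step probably looked something like the pipeline below. The exact awk expression is my reconstruction rather than the original "awk magic"; the idea is simply to keep only the object name after the last slash:

gsutil ls gs://gs-bucket/jason_testing/ | awk -F'/' '{print $NF}' > listing2.txt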
This will rename the files, stripping out the prefix I wanted removed; the files go from:
work-data-sample__0_0_1.csv.gz to data-sample__0_0_1.csv.gz
for files in $(cat listing2.txt); do
    echo "Renaming: $files --> ${files#work-}"
    gsutil mv gs://gs-bucket/jason_testing/$files gs://gs-bucket/jason_testing/${files#work-}
done
I launched it and noticed something odd: it was iterating over the list and making a single call to the Google Cloud API to rename one file at a time. This was going to take forever; it actually took 24 hours. I did some reading of the docs and saw that gsutil has a -m option for multiprocessing. I also checked the source code, and it looks like gsutil is set up for multiprocessing out of the box.
gsutil source code:
should_prohibit_multiprocessing, unused_os = ShouldProhibitMultiprocessing()
if should_prohibit_multiprocessing:
  DEFAULT_PARALLEL_PROCESS_COUNT = 1
  DEFAULT_PARALLEL_THREAD_COUNT = 24
else:
  DEFAULT_PARALLEL_PROCESS_COUNT = min(multiprocessing.cpu_count(), 32)
  DEFAULT_PARALLEL_THREAD_COUNT = 5
This is basically saying: if the OS can handle multiprocessing, spawn as many processes as the system has CPUs (capped at 32) and set the thread count per process to 5. But since each gsutil mv in my bash loop only operates on a single object, there is nothing for that parallelism to fan out over, so the loop would have taken forever with the -m option as well.
So I wrote some Python code to solve the issue. It performs all the steps in one go: list the files, substring out the filenames, and use Python's multiprocessing module to spawn 25 workers that make the API calls in chunks. I learned a lot from this and I hope it helps others; the comments in the code show what is going on.
#!/usr/bin/env python3.5
import subprocess
import multiprocessing
import datetime
import shlex


class GsRenamer:
    def __init__(self):
        self.gs_cmd = '/home/jasonr/google-cloud-sdk/bin/gsutil'
        self.file_list = []
        self.final_rename_list = []
        self.now = datetime.datetime.now()

    # Method to call subprocess on each command fed to it.
    def execute_jobs(self, cmd):
        try:
            print('{0} INFO: Running rename command: [{1}]'.format(self.now, cmd))
            subprocess.run(shlex.split(cmd), check=True)
        except subprocess.CalledProcessError as e:
            print('[{0}] FATAL: Command failed with error [{1}]'.format(cmd, e))

    # Method to get all the file paths from the bucket in gcloud and split
    # each one to get just the filename. Load the results into a list and
    # filter out any blank lines.
    def get_filenames_from_gs(self):
        cmd = [self.gs_cmd, 'ls', 'gs://gs-bucket/jason_testing']
        output = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                  universal_newlines=True).communicate()[0].splitlines()
        for full_file_path in output:
            file_name = full_file_path.split('/')[-1]
            self.file_list.append(file_name)
        self.file_list = list(filter(None, self.file_list))

    # Method to iterate over the list and build the rename commands,
    # leveraging Python's string replace to turn the source filename
    # into the target filename. Load the commands into a list, then use
    # Python's multiprocessing module to spawn 25 processes that hit the
    # API in chunks, cutting down the time to rename all the files in place.
    def rename_files(self, string_original, string_replace):
        for files in self.file_list:
            renamed_files = files.replace(string_original, string_replace)
            rename_command = '{0} mv gs://gs-bucket/jason_testing/{1} ' \
                             'gs://gs-bucket/jason_testing/{2}' \
                             .format(self.gs_cmd, files, renamed_files)
            self.final_rename_list.append(rename_command)

        self.final_rename_list.sort()
        pool = multiprocessing.Pool(processes=25)
        pool.map(self.execute_jobs, self.final_rename_list)


def main():
    gsr = GsRenamer()
    gsr.get_filenames_from_gs()
    gsr.rename_files('work-', '')


if __name__ == "__main__":
    main()
Below you can see the script spawning 25 worker processes that iterate over the list and perform the moves in chunks.
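The process listing that follows is a process tree view; something along the lines of the command below will show the pool workers hanging off the parent script (the exact ps flags are my assumption, not necessarily what was used originally):

ps auxf | grep renamer.py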
jasonr 4612 0.3 0.1 403772 10968 pts/0 Sl+ 13:34 0:00 | \_ python3.5 renamer.py
jasonr 4758 0.0 0.0 176432 8080 pts/0 S+ 13:34 0:00 | \_ python3.5 renamer.py
jasonr 4759 0.0 0.0 176432 8080 pts/0 S+ 13:34 0:00 | \_ python3.5 renamer.py
jasonr 4760 0.0 0.0 176432 8080 pts/0 S+ 13:34 0:00 | \_ python3.5 renamer.py
jasonr 4761 0.0 0.0 176432 8084 pts/0 S+ 13:34 0:00 | \_ python3.5 renamer.py
jasonr 4762 0.0 0.0 176432 8088 pts/0 S+ 13:34 0:00 | \_ python3.5 renamer.py
jasonr 4763 0.0 0.0 176432 8092 pts/0 S+ 13:34 0:00 | \_ python3.5 renamer.py
jasonr 4764 0.0 0.0 176432 8092 pts/0 S+ 13:34 0:00 | \_ python3.5 renamer.py
jasonr 4765 0.0 0.0 176432 8092 pts/0 S+ 13:34 0:00 | \_ python3.5 renamer.py
jasonr 4766 0.0 0.0 176432 8100 pts/0 S+ 13:34 0:00 | \_ python3.5 renamer.py
jasonr 4767 0.0 0.0 176432 8100 pts/0 S+ 13:34 0:00 | \_ python3.5 renamer.py
--SNIP--
Great! So how long did it finally take with the magic of 25 worker processes?
I wrote the multiprocessing Python code after we had already purged the data, so I do not have benchmarks from the full 50,000-file run, but I did save a subset for testing. I performed a similar test on 56 files of 100 MB each (the original data set was 50,000 files of 100 MB each). Here are the results: from roughly 2 minutes with the bash loop down to roughly 40 seconds with the Python version.
Rename Benchmark 56 100MB files in GCLOUD:
bash loop:
[jasonr@jr-sandbox test]$ time bash test_loop.sh
time:
real 1m54.362s
user 1m1.181s
sys 0m10.148s
python 25 processes multiprocessing:
[jasonr@jr-sandbox test]$ time python3.5 renamer.py
real 0m41.156s
user 1m9.334s
sys 0m9.636s
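Not part of the original run, but a quick way to confirm the rename worked is to check that nothing with the old prefix is left in the bucket, for example:

gsutil ls gs://gs-bucket/jason_testing/ | grep -c 'work-'

which should print 0 once every file has been renamed.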