{"id":752,"date":"2019-09-24T18:16:33","date_gmt":"2019-09-24T22:16:33","guid":{"rendered":"http:\/\/jasonralph.org\/?p=752"},"modified":"2020-01-04T15:14:25","modified_gmt":"2020-01-04T20:14:25","slug":"mass-rename-files-in-gcloud-with-gsutil-multiprocessing-parallel","status":"publish","type":"post","link":"https:\/\/jasonralph.org\/?p=752","title":{"rendered":"Mass Rename Files In Gcloud With Python Multiprocessing Parallel Gsutil"},"content":{"rendered":"<p>I had been tasked with renaming 50,000 files in place, up in the cloud, without bringing them down locally. I looked at using wildcards with gsutil, but they could not strip the substring I wanted from each filename, so I set out to write a shell script for the task. I created a listing of the files with gsutil, did some awk magic to get just the filenames into listing2.txt, and wrote the following loop. <\/p>\n<p>This renames the files, stripping out the prefix I wanted gone; files go from:<\/p>\n<p><code>work-data-sample__0_0_1.csv.gz to data-sample__0_0_1.csv.gz<\/code><\/p>\n<pre class=\"theme:solarized-dark lang:default decode:true \" >\r\nfor files in $(cat listing2.txt) ; do  \r\n    echo \"Renaming: $files --> ${files#work-}\"\r\n    gsutil mv gs:\/\/gs-bucket\/jason_testing\/$files gs:\/\/gs-bucket\/jason_testing\/${files#work-}\r\ndone\r\n<\/pre>\n<p>I launched it and noticed something odd: it iterated over the list serially, making one gcloud API call per file. This was going to take forever; in fact it took 24 hours. I did some reading of the docs and saw that gsutil has a -m option for multiprocessing, and I also checked the source code, where it looks like gsutil is multiprocess out of the box. 
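<\/p>\n<p>The loop above leans on the shell expansion <code>${files#work-}<\/code> to strip the prefix. The same logic in Python can be sketched as follows (the <code>strip_prefix<\/code> helper is hypothetical, for illustration only):<\/p>

```python
# Each name in listing2.txt looks like "work-data-sample__0_0_1.csv.gz".
# The shell expansion ${files#work-} removes the leading "work-" only when
# it is actually a prefix; this helper mirrors that behavior.
def strip_prefix(name, prefix='work-'):
    return name[len(prefix):] if name.startswith(prefix) else name

print(strip_prefix('work-data-sample__0_0_1.csv.gz'))  # data-sample__0_0_1.csv.gz
```

<p>Names without the prefix pass through unchanged, just as they do with the shell expansion. 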
<\/p>\n<p>gsutil source code:<\/p>\n<pre class=\"theme:solarized-dark lang:default decode:true \" >\r\nshould_prohibit_multiprocessing, unused_os = ShouldProhibitMultiprocessing()\r\nif should_prohibit_multiprocessing:\r\n  DEFAULT_PARALLEL_PROCESS_COUNT = 1\r\n  DEFAULT_PARALLEL_THREAD_COUNT = 24\r\nelse:\r\n  DEFAULT_PARALLEL_PROCESS_COUNT = min(multiprocessing.cpu_count(), 32)\r\n  DEFAULT_PARALLEL_THREAD_COUNT = 5\r\n<\/pre>\n<p>This says: if the OS can handle multiprocessing, spawn as many processes as the system has CPUs (capped at 32) with 5 threads each; otherwise fall back to a single process with 24 threads. But -m only parallelizes work within a single gsutil invocation, and each invocation in my loop moves exactly one file, so the bash loop would have taken forever with the -m option as well. <\/p>\n<p>So I wrote some Python that solves the whole problem in one pass: list the files, derive each new filename with a substring replace, and use Python's multiprocessing module to spawn 25 workers that make the API calls in chunks. I learned a lot from this and I hope it helps others; the comments in the code show what's going on. 
<\/p>\n<pre class=\"theme:solarized-dark lang:default decode:true \" >\r\n#!\/usr\/bin\/env python3.5\r\nimport subprocess\r\nimport multiprocessing\r\nimport datetime\r\nimport shlex\r\n\r\n\r\nclass GsRenamer:\r\n    def __init__(self):\r\n        self.gs_cmd = '\/home\/jasonr\/google-cloud-sdk\/bin\/gsutil'\r\n        self.file_list = []\r\n        self.final_rename_list = []\r\n\r\n    # Run one rename command in a subprocess; check=True raises if the\r\n    # command exits non-zero. The timestamp is taken per call so the log\r\n    # lines reflect when each rename actually ran.\r\n    def execute_jobs(self, cmd):\r\n        now = datetime.datetime.now()\r\n        try:\r\n            print('{0} INFO: Running rename command: [{1}]'.format(now, cmd))\r\n            subprocess.run(shlex.split(cmd), check=True)\r\n        except subprocess.CalledProcessError as e:\r\n            print('[{0}] FATAL: Command failed with error [{1}]'.format(cmd, e))\r\n\r\n    # List the bucket, split each full path down to just the filename, and\r\n    # load the results into a list, filtering out any blank lines.\r\n    def get_filenames_from_gs(self):\r\n        cmd = [self.gs_cmd, 'ls',\r\n               'gs:\/\/gs-bucket\/jason_testing']\r\n        output = subprocess.Popen(cmd, stdout=subprocess.PIPE,\r\n                                  universal_newlines=True).communicate()[0].splitlines()\r\n        for full_file_path in output:\r\n            file_name = full_file_path.split('\/')[-1]\r\n            self.file_list.append(file_name)\r\n        self.file_list = list(filter(None, self.file_list))\r\n\r\n    # Iterate over the list, leverage Python's string replace to build the\r\n    # target filename from the source, assemble each gsutil mv command, and\r\n    # load the commands into a list. Then use the multiprocessing module to\r\n    # spawn 25 processes that hit the API in chunks, cutting down the time\r\n    # to rename all the files in place.\r\n    def rename_files(self, string_original, string_replace):\r\n        for files in self.file_list:\r\n            renamed_files = files.replace(string_original,\r\n                                          string_replace)\r\n            rename_command = \"{0} mv gs:\/\/gs-bucket\/jason_testing\/{1} \" \\\r\n                             \"gs:\/\/gs-bucket\/jason_testing\/{2}\" \\\r\n                             .format(self.gs_cmd, files, renamed_files)\r\n            self.final_rename_list.append(rename_command)\r\n        self.final_rename_list.sort()\r\n        with multiprocessing.Pool(processes=25) as pool:\r\n            pool.map(self.execute_jobs, self.final_rename_list)\r\n\r\n\r\ndef main():\r\n    gsr = GsRenamer()\r\n    gsr.get_filenames_from_gs()\r\n    gsr.rename_files('work-', '')\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    main()\r\n\r\n<\/pre>\n<p>You can see the 25 worker processes spawned, iterating over the list and performing the moves in chunks. 
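<\/p>\n<p>The rename logic itself is easy to check in isolation. A quick sketch, assuming <code>gsutil<\/code> is on the PATH rather than at the full SDK path the script uses:<\/p>

```python
# Build one rename command the same way rename_files does: str.replace
# derives the target name, and format() assembles the gsutil mv command.
gs_cmd = 'gsutil'  # assumption: gsutil is on the PATH here
files = 'work-data-sample__0_0_1.csv.gz'
renamed_files = files.replace('work-', '')
rename_command = '{0} mv gs://gs-bucket/jason_testing/{1} ' \
                 'gs://gs-bucket/jason_testing/{2}'.format(gs_cmd, files,
                                                           renamed_files)
print(rename_command)
```

<p>One subtlety: <code>str.replace<\/code> removes every occurrence of the substring, not just a leading one, which is safe here because <code>work-<\/code> only ever appears as the prefix. 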
<\/p>\n<pre class=\"theme:solarized-dark lang:default decode:true \" >\r\njasonr    4612  0.3  0.1 403772 10968 pts\/0    Sl+  13:34   0:00  |           \\_ python3.5 renamer.py\r\njasonr    4758  0.0  0.0 176432  8080 pts\/0    S+   13:34   0:00  |               \\_ python3.5 renamer.py\r\njasonr    4759  0.0  0.0 176432  8080 pts\/0    S+   13:34   0:00  |               \\_ python3.5 renamer.py\r\njasonr    4760  0.0  0.0 176432  8080 pts\/0    S+   13:34   0:00  |               \\_ python3.5 renamer.py\r\njasonr    4761  0.0  0.0 176432  8084 pts\/0    S+   13:34   0:00  |               \\_ python3.5 renamer.py\r\njasonr    4762  0.0  0.0 176432  8088 pts\/0    S+   13:34   0:00  |               \\_ python3.5 renamer.py\r\njasonr    4763  0.0  0.0 176432  8092 pts\/0    S+   13:34   0:00  |               \\_ python3.5 renamer.py\r\njasonr    4764  0.0  0.0 176432  8092 pts\/0    S+   13:34   0:00  |               \\_ python3.5 renamer.py\r\njasonr    4765  0.0  0.0 176432  8092 pts\/0    S+   13:34   0:00  |               \\_ python3.5 renamer.py\r\njasonr    4766  0.0  0.0 176432  8100 pts\/0    S+   13:34   0:00  |               \\_ python3.5 renamer.py\r\njasonr    4767  0.0  0.0 176432  8100 pts\/0    S+   13:34   0:00  |               \\_ python3.5 renamer.py\r\n--SNIP--\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I had been tasked with renaming in place, up in the cloud, not bringing the files down locally, 50000 files. 
I looked at using wildcards with gsutil however I was not able to remove what I wanted from the file, so I set out on creating a shell script to perform the task, I created [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[38,4],"tags":[75,77,74,76,24,73],"class_list":["post-752","post","type-post","status-publish","format-standard","hentry","category-coding-thoughts","category-python","tag-gcloud","tag-glcoud","tag-gsutil","tag-multiprocess","tag-python-2","tag-rename"],"_links":{"self":[{"href":"https:\/\/jasonralph.org\/index.php?rest_route=\/wp\/v2\/posts\/752","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jasonralph.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jasonralph.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jasonralph.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jasonralph.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=752"}],"version-history":[{"count":25,"href":"https:\/\/jasonralph.org\/index.php?rest_route=\/wp\/v2\/posts\/752\/revisions"}],"predecessor-version":[{"id":801,"href":"https:\/\/jasonralph.org\/index.php?rest_route=\/wp\/v2\/posts\/752\/revisions\/801"}],"wp:attachment":[{"href":"https:\/\/jasonralph.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=752"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jasonralph.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=752"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jasonralph.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=752"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}