cli – Jason R. Ralph

I needed a way to capture output from the aws cli in a log file with timestamps. Out of the box, the aws cli does not put timestamps in its output. If you execute an aws s3 cp command, something like this:

aws s3 cp s3://jason-test-bucket-1/test_part_00 s3://jason-test-bucket-2/jason_test/

You will see output like so:

copy: s3://jason-test-bucket-1/test_part_00 to s3://jason-test-bucket-2/jason_test/test_part_00

As you can see, none of the aws cli output carries a timestamp. So I scoured the internet and found out some interesting things. It turns out that the aws cli writes its progress output with carriage returns instead of newlines, so the standard approach of piping straight into awk did not work. The aws cli can also change its output format, so I added a cli parameter to set the output to text. Next, I used tr to substitute the carriage returns with newlines. Finally, I piped the result to awk to print a timestamp on each line of output from the aws cli. The final command and output look like this:

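#!/bin/bash
log='test.log'
# Turn carriage returns into newlines, then prefix each line with a timestamp.
# strftime is not part of POSIX awk; gawk provides it.
aws s3 --output text cp s3://jason-test-bucket-1/test_part_00 s3://jason-test-bucket-2/jason_test/ | tr "\r" "\n" | awk '{print strftime("%Y-%m-%d:%H:%M:%S ") $0}' >> "$log"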

Produces the following in the log file which is my desired result:

2020-12-31:13:32:13 Completed 726.3 KiB/726.3 KiB (3.8 MiB/s) with 1 file(s) remaining
2020-12-31:13:32:13 copy: s3://jason-test-bucket-1/test_part_00 to s3://jason-test-bucket-2/jason_test/test_part_00
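
If gawk is not available, the same idea can be sketched in plain bash. This is only an illustration under my own assumptions: stamp_lines is a hypothetical helper name, it uses date instead of strftime, it skips empty lines, and the printf input merely simulates progress output so that no S3 access is needed.

```shell
#!/bin/bash
# stamp_lines: hypothetical helper, not part of the aws cli.
# Reads stdin, turns carriage returns into newlines, and prefixes
# every non-empty line with a timestamp in the format used above.
stamp_lines() {
    tr '\r' '\n' | while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue
        printf '%s %s\n' "$(date '+%Y-%m-%d:%H:%M:%S')" "$line"
    done
}

# Simulated progress output: two carriage-return updates, then a final line.
printf 'Completed 256.0 KiB/726.3 KiB\rCompleted 726.3 KiB/726.3 KiB\rcopy: a to b\n' | stamp_lines
```

With a real copy, the helper would sit at the end of the pipeline, for example: aws s3 cp SOURCE DEST | stamp_lines >> "$log". Calling date once per line is slower than awk, which should not matter for a handful of progress lines.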

I hope this helps someone else as it was a bear to solve for me.

In this post I would like to go over how I tuned a test server for copying / syncing files from the local filesystem to S3 over the internet. If you have ever had to do this, you will have noticed that as the file count grows, so does the time it takes to upload the files to S3. After some web searching I found out that the aws cli lets you tune its config to allow more concurrency than the default.
AWS CLI S3 Config

The parameter that we will be playing with is max_concurrent_requests.
This has a default value of 10, which means the aws cli runs at most 10 S3 requests at the same time. Let's see if we can change that value and get some performance gains. My test setup is as follows:

2 x Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
8GB RAM
CentOS release 6.10 (Final)

I have 56 files of about 101-102MB each in the test directory:

-rw-r--r-- 1 jasonr domain^users 101M Sep 24 11:44 sample__0_0_7.csv.gz
-rw-r--r-- 1 jasonr domain^users 102M Sep 24 11:44 sample__0_0_53.csv.gz
-rw-r--r-- 1 jasonr domain^users 101M Sep 24 11:44 sample__0_0_6.csv.gz
-rw-r--r-- 1 jasonr domain^users 101M Sep 24 11:44 sample__0_0_8.csv.gz
-rw-r--r-- 1 jasonr domain^users 101M Sep 24 11:44 sample__0_0_55.csv.gz
--snip--
[jasonr@jr-sandbox jason_test]$ ls| wc -l
56

For the first test I am going to run aws s3 sync with no changes, so out of the box it should use a max_concurrent_requests of 10. Let's use the Linux time command to measure how long it takes to copy all 56 files to S3. I will delete the folder on S3 before each iteration to keep the test the same. You can also list the connections to port 443 with netstat and count them to see what is going on. Across all the tests, my best result came with a setting of 250. You will need to play with the setting to get the best result, and the best value will change with the server specs.
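
The procedure described here can be scripted rather than run by hand. The following is a rough sketch, not what I actually ran: the function names set_concurrency, reset_target, run_sync and sweep are hypothetical, the target is the test prefix from this post, and timing uses whole seconds from date instead of the time command. Be aware that reset_target deletes everything under the target prefix.

```shell
#!/bin/bash
# Sketch of a sweep over max_concurrent_requests values.
# The target is the test prefix used in this post; change it before running.
target='s3://dev-redshift/jason_sync_test/'

# Hypothetical helper names; each one wraps a real aws cli command.
set_concurrency() { aws configure set default.s3.max_concurrent_requests "$1"; }
reset_target()    { aws s3 rm "$target" --recursive >/dev/null; }
run_sync()        { aws s3 sync . "$target" >/dev/null; }

# sweep N...: for each value, clear the prefix, run the sync,
# and print "value<TAB>elapsed seconds".
sweep() {
    local n start end
    for n in "$@"; do
        set_concurrency "$n"
        reset_target
        start=$(date +%s)
        run_sync
        end=$(date +%s)
        printf '%s\t%s\n' "$n" "$((end - start))"
    done
}

# Usage, from inside the test directory: sweep 10 20 50 100 250 500
```

Keeping the three aws calls in separate functions makes it easy to swap in a different bucket or a dry run without touching the loop.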

1. 1m25.919s with the default configuration:

[jasonr@jr-sandbox jason_test]$ time aws s3 sync . s3://dev-redshift/jason_sync_test/
upload: ./sample__0_0_0.csv.gz to s3://dev-redshift/jason_sync_test/sample__0_0_0.csv.gz
upload: ./sample__0_0_10.csv.gz to s3://dev-redshift/jason_sync_test/sample__0_0_10.csv.gz
upload: ./sample__0_0_11.csv.gz to s3://dev-redshift/jason_sync_test/sample__0_0_11.csv.gz
upload: ./sample__0_0_12.csv.gz to s3://dev-redshift/jason_sync_test/sample__0_0_12.csv.gz
upload: ./sample__0_0_13.csv.gz to s3://dev-redshift/jason_sync_test/sample__0_0_13.csv.gz
--snip--

real	1m25.919s
user	0m35.153s
sys	0m15.879s

2. Now let's set max_concurrent_requests to 20 and try again. You can do this with the command below. After running it we can see a small gain.

[jasonr@jr-sandbox jason_test]$ aws configure set default.s3.max_concurrent_requests 20
[jasonr@jr-sandbox jason_test]$ cat ~/.aws/config 
[default]
s3 =
    max_concurrent_requests = 20
[root@jr-sandbox ~]# netstat -an| grep 443| wc -l
20

real	1m13.277s
user	0m36.186s
sys	0m16.462s
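
A note on the netstat count: grep 443 matches any line that contains 443 anywhere, including a local port such as 44312 or a connection in TIME_WAIT. A stricter count could look like the sketch below. It assumes the Linux netstat -an column layout (foreign address in column 5, state in column 6), and count_https is a hypothetical helper name.

```shell
#!/bin/bash
# count_https: hypothetical helper that reads `netstat -an` output on stdin
# and counts established TCP connections whose remote port is 443.
count_https() {
    awk '$1 ~ /^tcp/ && $5 ~ /[:.]443$/ && $6 == "ESTABLISHED" { n++ }
         END { print n + 0 }'
}

# Usage: netstat -an | count_https
```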

3. Bumped up to 50 shows a bit more gain:

[jasonr@jr-sandbox jason_test]$ aws configure set default.s3.max_concurrent_requests 50
[jasonr@jr-sandbox jason_test]$ cat ~/.aws/config 
[default]
s3 =
    max_concurrent_requests = 50

[root@jr-sandbox ~]# netstat -an| grep 443| wc -l
49
real	1m0.720s
user	0m37.669s
sys	0m19.344s

4. Bumped up to 100, we lose a little speed compared to 50:

[jasonr@jr-sandbox jason_test]$ aws configure set default.s3.max_concurrent_requests 100
[jasonr@jr-sandbox jason_test]$ cat ~/.aws/config 
[default]
s3 =
    max_concurrent_requests = 100
[root@jr-sandbox ~]# netstat -an| grep 443| wc -l
95
real	1m4.212s
user	0m39.737s
sys	0m21.847s

5. Bumped up to 250, we see the best result so far:

[jasonr@jr-sandbox jason_test]$ aws configure set default.s3.max_concurrent_requests 250
[jasonr@jr-sandbox jason_test]$ cat ~/.aws/config 
[default]
s3 =
    max_concurrent_requests = 250
[root@jr-sandbox ~]# netstat -an| grep 443| wc -l
234
real	0m55.036s
user	0m42.841s
sys	0m21.409s

6. Bumped up to 500, we lose performance, most likely because of the machine's resources:

[jasonr@jr-sandbox jason_test]$ aws configure set default.s3.max_concurrent_requests 500
[jasonr@jr-sandbox jason_test]$ cat ~/.aws/config 
[default]
s3 =
    max_concurrent_requests = 500
[root@jr-sandbox ~]# netstat -an| grep 443| wc -l
465
real	1m16.593s
user	0m50.336s
sys	0m25.806s

So to wrap up, you can tune the number of concurrent requests the aws cli is allowed to make to S3. You will need to play with this setting to get the best results for your machine.
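
To compare the six runs side by side, the "real" times reported above can be converted to seconds and expressed relative to the default run. This is a small sketch; summarize is a hypothetical helper name and the figures are copied from the runs in this post.

```shell
#!/bin/bash
# Wall-clock ("real") times copied from the six runs in this post,
# keyed by the max_concurrent_requests value.
results='10 1m25.919s
20 1m13.277s
50 1m0.720s
100 1m4.212s
250 0m55.036s
500 1m16.593s'

# summarize: hypothetical helper. Converts "XmY.ZZZs" to seconds and prints
# the change relative to the first row, which is the default setting.
summarize() {
    awk '{
        split($2, t, /m|s/)          # t[1] = minutes, t[2] = seconds
        secs = t[1] * 60 + t[2]
        if (NR == 1) base = secs
        printf "%-4s %7.3fs %+6.1f%%\n", $1, secs, (secs - base) / base * 100
    }'
}

printf '%s\n' "$results" | summarize
```

A negative percentage means the run finished faster than the default. These are single runs over the internet, so small differences between settings may just be noise.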

Capture AWS CLI Output With Timestamps On Each Line Of Output

AWS CLI Max Concurrent Requests Tuning