Redshift Serverless Find Largest Tables

You can use the SQL below on Redshift Serverless to find the largest tables. You can return the size in 1 MB data blocks or convert it to TB, and you can change the LIMIT N to control how many results are returned.

1MB data blocks:
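The original query isn't reproduced here, but something along these lines against SVV_TABLE_INFO (whose size column is reported in 1 MB data blocks) does the job:

-- Top 10 largest tables, size in 1 MB data blocks
SELECT "table", size AS size_mb_blocks
FROM svv_table_info
ORDER BY size DESC
LIMIT 10;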

Size In TB:
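To convert to TB, divide the block count by 1024 twice (again a sketch against SVV_TABLE_INFO):

-- Top 10 largest tables, size converted from 1 MB blocks to TB
SELECT "table", size / 1024.0 / 1024.0 AS size_tb
FROM svv_table_info
ORDER BY size DESC
LIMIT 10;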

Specific Fields:
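If you only want a few specific fields, pick the SVV_TABLE_INFO columns you care about; for example:

-- Schema, table name, row count, size, and percent of available space used
SELECT "schema", "table", tbl_rows, size AS size_mb_blocks, pct_used
FROM svv_table_info
ORDER BY size DESC
LIMIT 10;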

botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: https://lambda.us-east-1.amazonaws.com

Recently, while working on one of our EMR projects that uses Lambdas and Airflow, I ran into the following timeout issue:

We have a Lambda that is invoked from boto3 in an Airflow step to update DynamoDB with values needed for our pipeline. This function worked in previous tests with no issues. We had recently added to the Lambda function, which caused it to take longer than normal. When we tested the Lambda from the console, the function worked fine, albeit a bit slower than the previous version. When calling it from Airflow, we would continually run into the timeout issue, causing the function to be executed multiple times during retries.

Testing the function from the AWS CLI revealed the issue: the default boto3 read timeout is 60 seconds, and our Lambda was now taking longer than that. So even though we set the Lambda timeout to 4 minutes, boto3 was timing out at 1 minute and never got the response back from Lambda. The way we fixed this was to create the boto3 Lambda client with a config (lambda_config) that has a longer read timeout.
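The fix looks roughly like this (the timeout values are illustrative and the function name is hypothetical):

import boto3
from botocore.config import Config

# Give the Lambda client a read timeout longer than the Lambda's own timeout
# and disable automatic retries so a slow call isn't invoked multiple times.
lambda_config = Config(
    connect_timeout=60,
    read_timeout=300,
    retries={"max_attempts": 0},
)
lambda_client = boto3.client("lambda", config=lambda_config)

response = lambda_client.invoke(
    FunctionName="update-dynamodb-values",  # hypothetical function name
    Payload=b"{}",
)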

Upgrade Rocky Linux 8 to 9 CLI

I thought I would share my version of how I updated the server that runs this blog from Rocky 8 to Rocky 9 without a clean install. I want to mention this is a do-at-your-own-risk post; an in-place upgrade is not officially supported.

!!!Do not attempt this if you do not have backups and a way to fully recover your system.!!!

The first step I took was to go to the Rocky download site and make sure I grabbed the latest GPG keys, release, and repos packages:

https://download.rockylinux.org/pub/rocky/9/BaseOS/x86_64/os/Packages/r/

You will need to modify the command below to match the versions you find at the site above; once that is done, you can run it.
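The shape of the command is below; swap <version> for whatever is currently listed in that directory (the packages are rocky-gpg-keys, rocky-release, and rocky-repos):

# Replace <version> with the version currently listed in the Packages/r/ directory
sudo dnf install -y \
  https://download.rockylinux.org/pub/rocky/9/BaseOS/x86_64/os/Packages/r/rocky-gpg-keys-<version>.noarch.rpm \
  https://download.rockylinux.org/pub/rocky/9/BaseOS/x86_64/os/Packages/r/rocky-release-<version>.noarch.rpm \
  https://download.rockylinux.org/pub/rocky/9/BaseOS/x86_64/os/Packages/r/rocky-repos-<version>.noarch.rpm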

One roadblock: dnf did not like that I still had the release 8 remi and epel packages installed, so I removed them and the upgrade went fine.

Find the epel and remi release rpms:
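For example:

rpm -qa | grep -Ei 'epel-release|remi-release'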

Remove them:
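Then, using the package names reported by the previous command (yours may differ slightly):

sudo dnf remove -y epel-release remi-release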

Upgrade your system to 9 from 8:
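The unofficial in-place path is a distro-sync against the release 9 repos; roughly:

sudo dnf -y --releasever=9 --allowerasing distro-sync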

I ignored this error; it seems like it's just a GPG error:

Verify:
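Check that the OS now reports a 9.x release:

cat /etc/rocky-release
cat /etc/redhat-release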

Rebuild the RPM database to now use SQLITE:
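A single command handles the conversion:

sudo rpm --rebuilddb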

That's it, reboot:
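One last command and you come back up on Rocky 9:

sudo reboot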

I did have some issues with dnf where I needed to reset some modules.

I needed to reset the modules one by one, there may be more on your system:
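The exact module list isn't shown here; the pattern is to list what is still enabled and reset each one (the module names below are only examples):

dnf module list --enabled

sudo dnf module reset -y perl
sudo dnf module reset -y python36
sudo dnf module reset -y virt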

That seemed to fix it, good luck.

AttributeError: module ‘cryptography.utils’ has no attribute ‘register_interface’

I recently came across an issue when bootstrapping one of our EMR clusters; trying to import pgpy failed with the following traceback:

Apparently the cryptography team released a new version on September 7th 2022 that broke the pgpy library.
https://pypi.org/project/cryptography/38.0.1/

We needed to downgrade our version to get things working again. I figured I would post this in case others run into it; according to the pgpy GitHub page, they are working on a fix.

https://github.com/SecurityInnovation/PGPy/issues/402

Here is how I solved it in the meantime: I needed to downgrade the cryptography library.
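The exact pin we used isn't shown here, but anything below 38 still has register_interface, for example:

pip install 'cryptography<38.0.0'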

Python Linux Find Files With Pattern Accessed Older Than N Days And Remove

This is a neat utility to keep in your sysadmin bag of tricks. It walks the directory you define recursively, grabs all the file access times, and stores them in a list, then compares them against a command line parameter for days ago. If a file is older than N days, it removes the file. What's really nice about this utility is that it has a debug mode, so you can see what will be deleted before you turn debug off and execute it for real.
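The script itself isn't reproduced here, but a minimal sketch of the idea looks like this (argument names and defaults are my own):

#!/usr/bin/env python3
"""Remove files matching a pattern whose access time is older than N days.
With --debug it only prints what it would delete."""
import argparse
import fnmatch
import os
import time

def main():
    parser = argparse.ArgumentParser(description="Remove files accessed more than N days ago")
    parser.add_argument("directory", help="directory to walk recursively")
    parser.add_argument("pattern", help="filename pattern, e.g. '*.log'")
    parser.add_argument("days", type=int, help="remove files last accessed more than this many days ago")
    parser.add_argument("--debug", action="store_true", help="print instead of delete")
    args = parser.parse_args()

    cutoff = time.time() - args.days * 86400
    candidates = []

    # Walk the tree and collect (path, access time) pairs that match the pattern.
    for root, _dirs, files in os.walk(args.directory):
        for name in fnmatch.filter(files, args.pattern):
            path = os.path.join(root, name)
            try:
                candidates.append((path, os.path.getatime(path)))
            except OSError:
                continue  # file vanished or is unreadable; skip it

    for path, atime in candidates:
        if atime < cutoff:
            if args.debug:
                print(f"[debug] would remove {path} (last accessed {time.ctime(atime)})")
            else:
                os.remove(path)
                print(f"removed {path}")

if __name__ == "__main__":
    main()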

AWS EMR ImportError: this version of pandas is incompatible with numpy < 1.17.3

I found another one that I thought was worth a quick blog post. We use AWS Elastic MapReduce with transient clusters, so in order to get the Python libraries installed we need to use the bootstrap feature. We ran into many issues with the standard bootstrap script, which looked something like this:
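The original script isn't shown verbatim, but it was essentially a plain pip install of a requirements file (the S3 path here is hypothetical):

#!/bin/bash
sudo python3 -m pip install --upgrade pip
aws s3 cp s3://my-bucket/bootstrap/requirements.txt /home/hadoop/requirements.txt
sudo python3 -m pip install -r /home/hadoop/requirements.txt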

The contents of requirements.txt looked like this:
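The exact pins aren't reproduced here, but it pinned pandas (which requires numpy >= 1.17.3) alongside the rest of our libraries, something like:

pandas==1.3.5
numpy==1.21.6
requests==2.27.1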

We would get all the nodes in the cluster to bootstrap properly; however, the logs showed the following:

And when trying to import from pyspark, we saw this:

After speaking with AWS support, it turns out this is a known issue. When a cluster is launched, EMR first provisions the EC2 instances and then runs the bootstrap actions. So when the bootstrap action runs, it installs the desired versions; however, since the applications are installed after the bootstrap action, they override the custom installation of the Python packages. To get around the versions being overridden, the workaround is to use a bootstrap action that delays the package installation until the nodes are fully up and running. This resolves the conflict we had been seeing between pandas and numpy. Here is what our final working bootstrap.sh looks like. Hope this helps; it was a tough one to solve:
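(The script below is a sketch of that approach rather than our literal file; the S3 path is hypothetical and the "wait for Spark" check is just one reasonable readiness signal.)

#!/bin/bash
# Write a helper that does the real install, then launch it in the background
# so this bootstrap action returns immediately and EMR can keep provisioning.
cat <<'EOF' > /home/hadoop/install_python_libs.sh
#!/bin/bash
set -e
# Wait until EMR has finished installing its applications (Spark here) so the
# application install can no longer clobber our Python packages. Adjust the
# check for the applications on your cluster.
while [ ! -d /usr/lib/spark ]; do
  sleep 30
done
sudo python3 -m pip install --upgrade pip
aws s3 cp s3://my-bucket/bootstrap/requirements.txt /home/hadoop/requirements.txt
sudo python3 -m pip install -r /home/hadoop/requirements.txt
EOF

chmod +x /home/hadoop/install_python_libs.sh
nohup /home/hadoop/install_python_libs.sh > /home/hadoop/install_python_libs.log 2>&1 &
exit 0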

10 Year Anniversary: www.jasonralph.org

I have not posted much lately; lots going on with my work and personal life. My wife and I moved into a new house in 2022, and at work we have been grinding on a large migration. I looked at my blog this morning and noticed that I have had this spare-time project running for 10 years.

So for 10 years I have had jasonralph.org up and continuously available, with analytics to prove it. It started in my apartment on an old standalone IBM server; it now runs on a single Rocky Linux 8 VM from Linode for 10 dollars a month. I hope to have some new content soon, but for now, I am happy with the 10-year anniversary.

AWS Apache Managed Airflow EMR ModuleNotFoundError: No module named ‘requests’ Bootstrap

I came across another fun one the other day. We are in the process of migrating our on-premise Elastic MapReduce system into the cloud, using AWS EMR with AWS Managed Airflow as the orchestrator running our DAGs. We came across an odd situation with a PySpark application: when using Airflow with a SparkSubmitHook, the job would bootstrap looking just fine according to the run logs, yet it would fail with No module named 'requests' when the application tried to import it. This was very odd, since the same application runs fine from spark-submit when called from the master node command line.

I decided to investigate the differences. Our bootstrap script for installing Python modules via pip, which we call from the EMR RunJobFlow API call, looks like this:
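It boiled down to something like this (note the --user flag, which turns out to be the culprit; the module list is illustrative):

#!/bin/bash
python3 -m pip install --upgrade pip --user
python3 -m pip install --user requests boto3 pandas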

This is very basic; all it does is upgrade pip and run pip install for each of the modules. Checking the bootstrap log, I could see that pip upgrades, goes out to the repo, and installs the packages just fine. So why were we getting the No module named 'requests' error when executing through Airflow?

After a ton of googling and research, I found the issue and applied a solution that worked. It turns out Airflow runs as the root user when bootstrapping, and if you notice, we use the --user argument in pip. This instructs pip to install the packages in the calling user's home directory. The kicker is that the code is run by the hadoop user on the EMR cluster nodes after executing from Airflow, so the hadoop user is unable to access the requests module, since root installed it with --user. I changed the bootstrap script to the following and it all started working: by removing --user and prefixing with sudo, the packages now get installed in a globally available location for all users. I am sure there are better ways to do this, I am still learning and researching, but if you run into this, the change below will get you out of the woods.
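The corrected version, with --user dropped and sudo added so the packages land in a location every user (including hadoop) can import from:

#!/bin/bash
sudo python3 -m pip install --upgrade pip
sudo python3 -m pip install requests boto3 pandas   # module list is illustrative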

After some further research and testing, we decided to use a requirements.txt file called by the bootstrap shell script in the RunJobFlow call. First, create a requirements.txt file; I like to hardcode the versions so nothing changes unexpectedly when you bootstrap a new cluster and it reaches out to PyPI to get the packages.

https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html

Add your desired packages and version numbers to a file called requirements.txt like below:
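For example (the versions here are illustrative):

requests==2.27.1
boto3==1.21.21
pandas==1.3.5
numpy==1.21.6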

Then you will need to copy this file into a bucket you have access to:
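Something like this (the bucket name is hypothetical):

aws s3 cp requirements.txt s3://my-bucket/bootstrap/requirements.txt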

Then create a shell script that has the following, call it bootstrap.sh:
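A sketch of bootstrap.sh (the S3 path is hypothetical):

#!/bin/bash
sudo python3 -m pip install --upgrade pip
aws s3 cp s3://my-bucket/bootstrap/requirements.txt /home/hadoop/requirements.txt
sudo python3 -m pip install -r /home/hadoop/requirements.txt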

Copy that shell script to your bucket:
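Again with a hypothetical bucket path:

aws s3 cp bootstrap.sh s3://my-bucket/bootstrap/bootstrap.sh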

And execute it via the bootstrap actions in the RunJobFlow EMR API call:
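In boto3 this corresponds to the BootstrapActions parameter of run_job_flow; a trimmed-down sketch (the names, release label, and instance sizes are illustrative):

import boto3

emr = boto3.client("emr")
response = emr.run_job_flow(
    Name="transient-pyspark-cluster",
    ReleaseLabel="emr-6.5.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    BootstrapActions=[
        {
            "Name": "install-python-packages",
            "ScriptBootstrapAction": {
                "Path": "s3://my-bucket/bootstrap/bootstrap.sh",
            },
        },
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)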

As you can see, the shell script copies the requirements.txt file locally and then runs pip install -r against it, which installs all the packages. If you want to watch the bootstrapping take place on a running cluster, you can SSH to the master node and view the logs here:
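Bootstrap action logs are written under /mnt/var/log/bootstrap-actions on the node, one numbered directory per action:

ls /mnt/var/log/bootstrap-actions/
cat /mnt/var/log/bootstrap-actions/1/stdout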

You should see the stdout log like so:

Hope this helps.

Node Application Stopped Sending Updates To Slack – can’t identify protocol

I wanted to share my experience with a Node application that I support. This particular application is an API, and it happens to log each and every request it receives to an internal Slack channel. Our team uses this channel for many things: to verify when the API is in maintenance, to check that requests are processing, to see the overall health of the API, etc.

Once in a while, out of nowhere, we would stop receiving these updates in Slack. I set out to troubleshoot why this might be happening. At first we thought we were hitting the Slack rate limits, which are clearly defined here:

https://api.slack.com/docs/rate-limits

However, after reading the linked doc, I was skeptical. The API does serve a lot of requests, but not enough to hit their limit. We have 2 servers that send Slack messages and process the API requests, and when they stopped sending it would be both servers, not just one. Also, we have run into this before and restarting the service fixed the issue, so I was sure we did not hit the rate limit. And trying to send a manual Slack update using curl would not work either! I knew this had to be something with the Linux OS itself, and not the Slack service.

I tried netstat to see if we were hitting some type of OS limit, and all looked well. Next I tried one of my favorite tools, lsof. At first I grepped for deleted to see if something was being held and not released, but nothing stood out. Next I grepped for node and, lo and behold, I saw this:

My eyes went right to the “can’t identify protocol” entries. I opened up a browser and started to research; the first hit when searching “can’t identify protocol” was a Stack Overflow answer with the solution.

https://stackoverflow.com/questions/7911840/seeing-too-many-lsof-cant-identify-protocol

When lsof prints “Can’t identify protocol”, this usually relates to sockets (it should also say ‘sock’ in the relevant output lines).

So, somewhere in your code you are probably connecting sockets and not closing them properly (perhaps you need a finally block).

I suggest you step through your code with a debugger (easiest to use your IDE, potentially with a remote debugger, if necessary), while running lsof side-by-side. You should eventually be able to see which thread / line of code is creating these File Descriptors.

It turns out the Node application was opening file descriptors/sockets and not closing them properly, which caused the system to hit the hard limit on open files/file descriptors. You can view the hard and soft limits like so: switch to the user the application runs as and run:
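For example, as nodeuser:

su - nodeuser
ulimit -Sn   # soft limit on number of open files
ulimit -Hn   # hard limit on number of open files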

You can see that nodeuser has a hard limit of 4096 open files, and because the application was not properly closing them, we hit that ceiling. This explains why restarting the server or the process fixed it: it released the open file descriptors and the system was able to open sockets again. I spoke with the developer and we researched; it looks like one of the modules we were using was the cause of the issue, or perhaps we were using it wrong. I found this out from this article:
https://stackoverflow.com/questions/24922745/node-js-winston-how-to-safely-drain-a-logger

Question:

I have experimented with instantiating and closing winston loggers as (half) described on https://github.com/flatiron/winston#instantiating-your-own-logger, to no avail. I run into trouble closing file transports of Winston’s – walking through its source code, I found that the proper way to close off a logger would seem to be the close method. I expected this to take care of closing the transport file used by the logger – however that turned out to be not so.

Varying in frequency according to node.js server load, winston would still hold on to many transport files, infinitely long after the close method had been called for them, indefinitely long after no new writes were being initiated to them. I observed that through the node.js process file descriptors table (lsof -p). Even though close has been called for a Winston logger, it would indefinitely keep the file descriptor of the log file “in use”, i.e. the log file never gets really closed. Thus leaking file descriptors and eventually making the node.js process bump into the ulimit (-n) limit after my application has been up for long.

Should there be a specific programming pattern for draining a Winston logger such that it can be eventually closed?

Answer:

Create only one logger instance and then derive children from it. In this case, winston will hold only one open file handler. Might also be better for performance.
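With a reasonably recent winston, the pattern looks roughly like this (a sketch, not our application's actual code):

const winston = require('winston');

// One shared logger -> one open file handle for the whole process.
const baseLogger = winston.createLogger({
  transports: [new winston.transports.File({ filename: 'api.log' })],
});

// Per-request loggers are cheap children of the shared instance.
function handleRequest(requestId) {
  const logger = baseLogger.child({ requestId });
  logger.info('request received');
}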

So that was it, the developers agreed and set out to create a patch, problem solved.

centos8 postgresql-11-check-db-dir[]: is missing or empty

We have been rolling out CentOS 8 in our lower environments for testing. We use dedicated VMware virtual servers with a CentOS 8 minimal install, and apart from the main application, which here is PostgreSQL 11, we only apply hardening to these systems. These systems use an LVM-mounted ext4 filesystem for the data directory.

Recently, on 3 of the new PG VMs, we noticed that PG did not start after a reboot, and it seemed intermittent, even though we had enabled the systemd service to start on boot. I checked the PG startup log and did not find much about the issue, so I checked /var/log/messages and found it.

I checked the systemd service file and saw that out of the box postgres had the following:
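Trimmed to the directives that matter, the unit looked roughly like this:

[Unit]
After=syslog.target
After=network.target

[Install]
WantedBy=multi-user.target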

After=syslog.target: a special target unit in systemd and the standardized name to pull in a syslog implementation.

After=network.target: has very little meaning during start-up. It only indicates that the network management stack is up after it has been reached. Whether any network interfaces are already configured when it is reached is undefined.

WantedBy=multi-user.target: normally defines a system state where all network services are started up and the system will accept logins, but a local GUI is not started. This is the typical default system state for server systems, which might be rack-mounted headless systems in a remote server room.

The options above do not ensure that all filesystems in fstab are mounted before PostgreSQL starts. What we were seeing was a classic race condition where PostgreSQL started before the data directory was mounted; as I mentioned, we use a custom PGDATA location. After some research I found the option that fixed this: edit the pg11 service and add the following, then reload systemd and reboot, and all should work. You can find your LVM mount unit by running the following:
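For example:

systemctl list-units --type=mount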

You can see my u02-data1.mount in the output, so edit the service and add an override file with the following; if you have multiple mounts, you can add them as well.
Edit with: systemctl edit postgresql-11
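The drop-in ends up looking something like this (the exact override isn't reproduced here, and your mount unit name will differ):

[Unit]
After=local-fs.target
After=u02-data1.mount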

Reload the daemon with: systemctl daemon-reload

After=local-fs.target: systemd-fstab-generator(8) automatically adds dependencies of type Before= to all mount units that refer to local mount points for this target unit. In addition, it adds dependencies of type Wants= to this target unit for those mounts listed in /etc/fstab that have the auto mount option set.