I wanted to share my experience with a Node.js application that I support. This particular application is an API, and it logs each and every request it receives to an internal Slack channel. Our team uses this channel for many things: to verify when the API is in maintenance, to check that requests are processing, to keep an eye on the overall health of the API, and so on.
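The exact integration details don't matter much for this story, but conceptually it is just a small per-request notification. Here is a minimal sketch of that idea, assuming a Slack incoming webhook; the placeholder webhook URL, environment variable, and message format are made up for illustration, and the real application may well use the Slack Web API or a client library instead.

// Sketch: post a short per-request summary to a Slack incoming webhook.
// SLACK_WEBHOOK_URL is a placeholder, not the real application's config.
const https = require('https');

const SLACK_WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL;

function notifySlack(text) {
  const payload = JSON.stringify({ text });
  const req = https.request(SLACK_WEBHOOK_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(payload),
    },
  }, (res) => {
    // Drain the response so the underlying socket is released.
    res.resume();
  });
  req.on('error', (err) => console.error('slack notify failed:', err.message));
  req.end(payload);
}

// e.g. from a request handler:
// notifySlack(`${req.method} ${req.url} -> processing`);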
Once in a while, seemingly out of nowhere, we would stop receiving these updates in Slack. I set out to troubleshoot why this might be happening. At first we thought we were hitting the Slack rate limits, which are documented here:
https://api.slack.com/docs/rate-limits
However, after reading the linked doc, I was skeptical. The API does serve a lot of requests, but not enough to hit Slack's limit. We have two servers that process API requests and send the Slack messages, and when the messages stopped it was always both servers, not just one. We had also run into this before, and restarting the service fixed the issue, so I was sure we were not hitting the rate limit. On top of that, trying to send a manual Slack update using curl would not work either! I was fairly sure this had to be something with the Linux OS itself, and not the Slack service.
I tried netstat to see if we were hitting some type of OS limit, and everything looked fine. Next I reached for one of my favorite tools, lsof. At first I grepped for “deleted” to see if something was being held open and not released, but nothing stood out. Then I grepped for “node” and, lo and behold, I saw this:
[root@ip-172-x-x-x ~]# lsof | grep node
--SNIP--
node  1794  nodeuser  19u  sock    0,6        0t0  651101  can't identify protocol
node  1794  nodeuser  20w  REG   202,1  209793922  294970  /opt/afs/mc_api_logs/debug.log
node  1794  nodeuser  21w  REG   202,1    2409554  274199  /opt/afs/mc_api_logs/exceptions.log
node  1794  nodeuser  22w  REG   202,1     572278  294971  /opt/afs/mc_api_logs/error.log
node  1794  nodeuser  23w  REG   202,1    2409554  274199  /opt/afs/mc_api_logs/exceptions.log
node  1794  nodeuser  24w  REG   202,1    2258649  294980  /opt/afs/mc_api_logs/warn.log
node  1794  nodeuser  25w  REG   202,1    2409554  274199  /opt/afs/mc_api_logs/exceptions.log
node  1794  nodeuser  26w  REG   202,1          0  294989  /opt/afs/mc_api_logs/info.log
node  1794  nodeuser  27w  REG   202,1    2409554  274199  /opt/afs/mc_api_logs/exceptions.log
node  1794  nodeuser  28u  IPv4  13731        0t0  TCP *:pcsync-https (LISTEN)
node  1794  nodeuser  29u  sock    0,6        0t0  512828  can't identify protocol
node  1794  nodeuser  30u  sock    0,6        0t0  14507   can't identify protocol
node  1794  nodeuser  31u  sock    0,6        0t0  14028   can't identify protocol
node  1794  nodeuser  32u  sock    0,6        0t0  15183   can't identify protocol
node  1794  nodeuser  33u  sock    0,6        0t0  15628   can't identify protocol
node  1794  nodeuser  34u  sock    0,6        0t0  16346   can't identify protocol
node  1794  nodeuser  35u  sock    0,6        0t0  15778   can't identify protocol
node  1794  nodeuser  36u  sock    0,6        0t0  16847   can't identify protocol
node  1794  nodeuser  37u  sock    0,6        0t0  17512   can't identify protocol
node  1794  nodeuser  38u  sock    0,6        0t0  25572   can't identify protocol
node  1794  nodeuser  39u  sock    0,6        0t0  18437   can't identify protocol
--SNIP--
My eyes went right to “can’t identify protocol”. I opened up a browser and started researching; the first hit when searching for “can’t identify protocol” was a Stack Overflow question with the solution:
https://stackoverflow.com/questions/7911840/seeing-too-many-lsof-cant-identify-protocol
When lsof prints “Can’t identify protocol”, this usually relates to sockets (it should also say ‘sock’ in the relevant output lines).
So, somewhere in your code you are probably connecting sockets and not closing them properly (perhaps you need a finally block).
I suggest you step through your code with a debugger (easiest to use your IDE, potentially with a remote debugger, if necessary), while running lsof side-by-side. You should eventually be able to see which thread / line of code is creating these File Descriptors.
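In the spirit of that advice, one lightweight way to watch the descriptors side-by-side without attaching a debugger is to have the process report its own count. This is just a minimal sketch, assuming Linux, where /proc/self/fd lists the process's currently open descriptors; the interval and log message are illustrative, not something from our application.

// Sketch: periodically report how many file descriptors this process holds.
// Linux-only: /proc/self/fd contains one entry per open descriptor.
const fs = require('fs');

setInterval(() => {
  let count;
  try {
    count = fs.readdirSync('/proc/self/fd').length;
  } catch (err) {
    return console.error('could not read /proc/self/fd:', err.message);
  }
  console.log(`[fd-watch] open file descriptors: ${count}`);
}, 60 * 1000).unref(); // unref() so this timer never keeps the process alive

If the number only ever climbs while traffic is steady, something is leaking.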
It turns out the node application was opening file descriptors (sockets) and not closing them properly, which caused the process to hit the hard limit on open files / file descriptors. You can view the hard and soft limits like so: switch to the user the application runs as and run:
[nodeuser@ip-172-x-x-x ~]$ ulimit -Hn
4096
[nodeuser@ip-172-x-x-x ~]$ ulimit -Sn
1024
So you can see that nodeuser has a hard limit of 4096 open files, and because the application was not closing its descriptors, it eventually hit that ceiling. This also explains why restarting the process or the server fixed it: the restart released the open file descriptors and the system was able to open sockets again. I spoke with the developer and we did some research; it looked like one of the modules we were using was the cause of the issue, or perhaps we were just using it wrong. We found the answer in the Stack Overflow question quoted below.
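For anyone who has not run into this failure mode before, here is a small demo of what hitting the ceiling looks like from inside Node. This is not our application's code, just an illustration; run it under a deliberately low soft limit (for example ulimit -Sn 128) if you want it to fail quickly.

// Demo: leak descriptors until the OS says no. Once the per-process limit
// is reached, every new open or connect fails with EMFILE -- which is why
// outbound calls (like our Slack updates) go quiet while the process
// itself keeps running.
const fs = require('fs');

const held = [];
try {
  for (;;) {
    held.push(fs.openSync('/dev/null', 'r')); // open and never close, like the leak
  }
} catch (err) {
  console.error(`failed after ${held.length} descriptors:`, err.code); // EMFILE
} finally {
  held.forEach((fd) => fs.closeSync(fd));
}

Until a restart cleared the table, our API servers were stuck in exactly that state.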
https://stackoverflow.com/questions/24922745/node-js-winston-how-to-safely-drain-a-logger
Question:
I have experimented with instantiating and closing winston loggers as (half) described on https://github.com/flatiron/winston#instantiating-your-own-logger, to no avail. I run into trouble closing file transports of Winston’s – walking through its source code, I found that the proper way to close off a logger would seem to be the close method. I expected this to take care of closing the transport file used by the logger – however, that turned out not to be the case.
With a frequency that varied with node.js server load, winston would still hold on to many transport files, indefinitely long after the close method had been called for them and long after any new writes to them had stopped. I observed this through the node.js process’s file descriptor table (lsof -p). Even though close had been called on a Winston logger, it would keep the log file’s file descriptor “in use” indefinitely, i.e. the log file never really got closed. This leaks file descriptors and eventually makes the node.js process bump into the ulimit (-n) limit after the application has been up for a long time.
Should there be a specific programming pattern for draining a Winston logger such that it can be eventually closed?
Answer:
Create only one logger instance and then derive children from it. In this case, winston will hold only one open file handle. It might also be better for performance.
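In code, that advice translates to creating the logger once at startup and handing out child loggers wherever per-request context is needed. Here is a minimal sketch assuming the winston 3.x API (createLogger / child); our application was on an older winston release, so the exact calls may differ, and the file path is just borrowed from the lsof output above.

// Sketch of the "one logger, many children" pattern from the answer above.
const winston = require('winston');

// One logger instance for the whole process, so winston keeps one file
// handle open per transport instead of one per request.
const baseLogger = winston.createLogger({
  transports: [
    new winston.transports.File({ filename: '/opt/afs/mc_api_logs/info.log' }),
  ],
});

// Per-request loggers are cheap children of the base logger; they share
// its transports and therefore its file descriptors.
function requestLogger(requestId) {
  return baseLogger.child({ requestId });
}

// Usage inside a request handler:
// const log = requestLogger(req.headers['x-request-id']);
// log.info('request received');

The key point is that transports, and the file handles behind them, are created once per process rather than once per request.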
So that was it. The developers agreed, set out to create a patch, and the problem was solved.