Sidekiq Memory Leak

While deploying an app to two brand new Ubuntu 14.04 LTS servers, we ran into a nasty memory leak.

The offending processes were our sidekiq workers, and they were being killed by the Linux kernel's OOM killer within 24 hours. Oddly, these same processes had been running just fine on an older version of Ubuntu (11.10). Unfortunately, an email to Mike Perham, the main guy behind sidekiq, netted me a response I’d given so many times before: “did you try turning it off and on again?”
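
If you want to confirm that the OOM killer is what is taking the workers down, the kill gets recorded in the kernel log. A quick check, assuming a stock Ubuntu syslog setup, looks something like this:

    # look for OOM killer activity in the kernel ring buffer and in syslog
    dmesg | grep -i "out of memory"
    grep -i "killed process" /var/log/kern.log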

Of course, restarting the process certainly fixed the memory leak, but watching a process’s memory usage and restarting it every day did not seem practical.

Ruby Upgrade

So we devised a plan to upgrade Ruby from 2.1.1 to 2.2.0. Until a new version of sidekiq is released that addresses this issue, a Ruby upgrade seemed to be our only option in userspace. One of our expert developers led the upgrade of Ruby on the hosts that house our sidekiq processes. After the Ruby upgrade, testing was performed on our main app as per usual, and then I got my hands on it to test the memory leak issue.

At first it seemed that our app was performing normally, and the process memory use levelled off. However, once a production load was put onto the new code, the memory leak reared its ugly head once more.
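
If you want to watch the leak in something close to real time, the worker's resident set size tells the story. A rough sketch, using the same pidfile the monit config below points at:

    # print the worker's PID, RSS (in KB), uptime and command line every 60 seconds
    watch -n 60 "ps -o pid,rss,etime,cmd -p \$(cat /app_path/pids/api_code.pid)"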

Kernel up/downgrade

The kernel on the old boxes where sidekiq was working was 3.0.0-26-virtual, on Ubuntu 11.10.

On the newer boxes that were exhibiting the memory leak, running Ubuntu 14.04 LTS, the kernel in use was 3.13.0-29-generic. The potential for a bug between Ruby's use of malloc() and the newer kernel seemed high enough to be worth testing another kernel version.
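
Comparing the two is as simple as asking each box which kernel it booted:

    # run on each host and compare
    uname -r    # old box: 3.0.0-26-virtual, new box: 3.13.0-29-generic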

Unfortunately, upgrading a kernel on an Amazon EC2 instance is far from a simple process, and after breaking several boxes and wasting several days in research and testing, I gave up on this path. If you are running sidekiq in a non-EC2 environment, I implore you to try a down/upgrade of the kernel to see whether you can reproduce or eliminate this memory leak behaviour. If the next Ruby or sidekiq upgrades do not fix the issue, I will be building a third production box within my vSphere environment, where it’s a little bit easier to modify the host kernel.

What did work

Call it powdering a corpse, or call it duct tape; I ended up using monit to watch the leaky process and restart it if its memory consumption went above a threshold. To do this, I created a file called /etc/monit/conf.d/api_code.conf and wrote the following poetry inside:

check process api_code with pidfile /app_path/pids/api_code.pid
    start program = "/usr/sbin/service api_code start"
    stop program  = "/usr/sbin/service api_code stop"
    # restart once memory usage (process plus children) exceeds the threshold
    if totalmem > 1024 MB then restart
    # stop trying (unmonitor) if it keeps bouncing
    if 5 restarts within 5 cycles then timeout
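
With the file in place, monit needs to pick it up. Assuming the stock packaging, where /etc/monit/conf.d/ is already included from monitrc, a syntax check and a reload is all it takes:

    # verify the control files parse, reload monit, then check the result
    sudo monit -t
    sudo monit reload
    sudo monit summary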

I’ve also set up an alert with m/monit so that I can track every time the process is restarted. While this solution works and keeps our production API afloat, it does not make for one happy admin. This particular piece of code is relatively harmless to restart, but interrupting other parts of our code could do real damage. It’s not hard to imagine a process hitting that memory threshold in the middle of doing something important.
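
If you want the same visibility, monit can report its events to an m/monit collector from the main monitrc; a sketch along these lines, where the host and credentials are placeholders rather than our real ones:

    # point monit at an m/monit collector so restart events show up there
    set mmonit http://monit:changeme@mmonit.example.com:8080/collector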

Let’s hope that “turning it off and on again” isn’t a long-term solution.