SNMP Monitoring With Datadog

The ability to quickly pull application and server performance metrics, combined with easy alerting and reporting to various integrations, has made Datadog a valuable tool: our dev team uses it to track application issues, and our ops team uses it to track server and database issues.

While browsing the various Datadog integrations, I found an add-on for SNMP. I had planned on using Cacti to graph internal equipment, but having a single looking glass for everything was ideal.

So I made some pretty graphs: CPU usage, and per-VLAN traffic usage.

Datadog charges per host, per month. Luckily, with SNMP we can monitor an entire fleet of devices from a single host.

To start, I created a Linux host on my internal network and configured the Datadog agent with the defaults.
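
For reference, getting the agent onto that host was only a couple of steps. This is a rough sketch, assuming Datadog's apt repository has already been added and using the agent 5 file layout referenced throughout this post; YOUR_API_KEY is a placeholder.

    # Install the agent (assumes Datadog's apt repo is already configured)
    sudo apt-get install datadog-agent

    # Copy the sample config, set api_key inside it, and restart
    sudo cp /etc/dd-agent/datadog.conf.example /etc/dd-agent/datadog.conf
    sudoedit /etc/dd-agent/datadog.conf        # api_key: YOUR_API_KEY
    sudo service datadog-agent restart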

The device I chose for this example is a Cisco ASA, pulling basic operational stats (CPU, RAM) and per-VLAN traffic. The hunt for the OIDs was grueling; I eventually found a list somewhere in the mammoth Cisco knowledge base that had OIDs relevant to my device. Using snmpwalk, I was able to verify an OID’s existence, or poll values to make sure it was the correct OID.

This command will output the current value for our OUTSIDE VLAN, which is our WAN edge interface. Your OID may vary, depending on the device, interface speed, number of interfaces, etc.

snmpwalk -c password -v 2c ""
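
If you’re hunting for OIDs on your own gear, the standard IF-MIB tables are a good starting point for the per-interface counters. A quick sketch (the host, community string, and interface index below are placeholders):

    # List interface names with their indexes (IF-MIB::ifDescr)
    snmpwalk -v 2c -c password 192.0.2.1 1.3.6.1.2.1.2.2.1.2

    # Poll the outbound byte counter for one interface, e.g. index 3 (IF-MIB::ifOutOctets.3)
    snmpget -v 2c -c password 192.0.2.1 1.3.6.1.2.1.2.2.1.16.3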

After I’d finished compiling a list of OIDs, I began adding them to /etc/dd-agent/conf.d/snmp.yaml as shown below. You can add multiple devices by adding a new section starting with - ip_address:

Most of these OIDs should work on most ASAs, but you will want to adjust the interface names to reflect your own internal VLAN assignments.

# SNMP v1-v2 configuration
init_config:

instances:
  - ip_address:
    port: 161
    community_string: password
    snmp_version: 2 # Only required for snmp v1, will default to 2
    timeout: 1 # second, by default
    retries: 5
    metrics:
      - OID:
        name: OUTSIDE_Bytes
      - OID:
        name: COMPUTE_Bytes
      - OID:
        name: GUEST_Bytes
      - OID:
        name: INFERNO_Bytes
      - OID:
        name: ActiveConnections
      - OID:
        name: IPSecTunnels
      - OID:
        name: OUTSIDE_Errors
      - OID:
        name: ifInOctets
      - OID:
        name: ifInDiscards
      - OID:
        name: ifInErrors
      - OID:
        name: ifOutOctets
      - OID:
        name: ifOutDiscards
      - OID:
        name: ifOutErrors
      - OID:
        name: ciscoMemoryPoolName
      - OID:
        name: ciscoMemoryPoolAlternate
      - OID:
        name: ciscoMemoryPoolValid
      - OID:
        name: ciscoMemoryPoolUsed
      - OID:
        name: ciscoMemoryPoolFree
      - OID:
        name: ciscoMemoryPoolLargestFree
      - OID:
        name: cpmCPUTotal1minRev
      - OID:
        name: cpmCPUTotal5minRev
      - OID:
        name: cipSecGlobalOutOctets
      - OID:
        name: crasIPSecNumSessions
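
Once the YAML is saved, restart the agent and make sure the snmp check is running cleanly (a quick sketch, assuming the agent 5 init scripts):

    # Reload the agent so it picks up conf.d/snmp.yaml
    sudo service datadog-agent restart

    # The info output lists each check; snmp should show your instance(s) with no errors
    sudo service datadog-agent info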

Keep in mind that you can poll any SNMP-capable device: access points, switches, servers, telephones, etc.

Sidekiq Memory Leak

While deploying an app to two brand new Ubuntu 14.04 LTS servers, we found a nasty memory leak.

The offending processes were our sidekiq workers, and they were hitting a Linux kernel OOM kill within 24 hours. Oddly, these processes had been running just fine on an older version of Ubuntu (11.10). Unfortunately, an email to Mike Perham, the main guy behind sidekiq, netted me a response I’d given so many times before: “did you try turning it off and on again?”

Of course, restarting the process certainly fixed the memory leak, but watching a process’s memory usage and restarting it every day did not seem practical.

Ruby Upgrade

So we devised a plan to upgrade Ruby from 2.1.1 to 2.2.0. One of our expert developers led the upgrade of Ruby on the hosts that house our sidekiq processes; until a new version of sidekiq is released that addresses this issue, a Ruby upgrade seemed to be our only option within userspace. After the Ruby upgrade, testing was performed on our main app as usual, and then I got my hands on it to test the memory leak issue.

At first it seemed that our app was performing normally, and the process memory use levelled off. However, once a production load was put onto the new code, the memory leak reared its ugly head once more.

Kernel up/downgrade

The kernel on the old boxes where sidekiq was working was 3.0.0-26-virtual, on Ubuntu 11.10.

On the newer boxes that were exhibiting the memory leak, 3.13.0-29-generic was the kernel in use on Ubuntu 14.04 LTS. The potential for a bug somewhere between Ruby’s malloc() calls and the kernel’s memory management seemed high enough to justify testing another kernel version.

Unfortunately, upgrading a kernel on an Amazon EC2 instance is far from a simple process, and after breaking several boxes and wasting several days in research and testing, I gave up on this path. If you are running sidekiq in a non-EC2 environment, I implore you to try a downgrade or upgrade of the kernel to see whether you can reproduce or eliminate this memory leak behaviour. If the next Ruby or sidekiq upgrades do not fix the issue, I will be building a third production box within my vSphere environment, where it’s a little bit easier to modify the host kernel.

What did work

Call it powdering a corpse, or call it duct tape; I ended up using monit to monitor the leaking process and restart it if its memory consumption went above a threshold. To do this, I created a file called /etc/monit/conf.d/api_code.conf and wrote the following poetry inside:

# Restart the api_code workers if resident memory climbs past 1 GB
check process api_code with pidfile /app_path/pids/
    start program = "/usr/sbin/service api_code start"
    stop program = "/usr/sbin/service api_code stop"
    if totalmem > 1024 MB then restart
    if 5 restarts within 5 cycles then timeout

I’ve also set up an alert with m/monit so that I can track every time the process is restarted. While this solution works, and keeps our production API afloat, it does not for one happy admin make. While this particular piece of code is relatively harmless when restarting, other pieces of our code could be damaging if interrupted. It’s not hard to imagine a process hitting that memory threshold in the middle of doing something important.
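
For completeness, the alerting side is just a couple of global directives in monit’s own configuration. This is a sketch with placeholder credentials, hostname, and email address; point the collector URL at wherever your m/monit instance lives.

    # /etc/monit/monitrc additions: queue events locally and forward them to m/monit
    set eventqueue basedir /var/lib/monit/events slots 100
    set mmonit http://monit:changeme@mmonit.example.com:8080/collector

    # Mail every monit event, including the restarts triggered above
    set alert admin@example.com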

Let’s hope that “turning it off and on again” isn’t a long-term solution.


Out with the Old

Growing tired of having to stay on my toes to update, secure, and monitor my WordPress installations, I figured it was time to move to a different CMS. After several false starts with some rather obnoxious blogging tools, I settled on Octopress.

The big selling feature was security: Octopress generates static content locally, which you can then deploy as needed to web servers by several means (scp, rsync, GitHub Pages, etc.).

Other features that I found to be worthy of mention:

  • the Markdown language is simple, and simple is good
  • very small file footprint, and no moving parts
  • zero dependencies on the server side (other than the web server itself)
  • using exitwp to migrate from WordPress to Octopress (see the sketch after this list)
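
For anyone considering the same move, the day-to-day workflow boils down to a handful of rake tasks. A sketch: the deploy step assumes you’ve already filled in your rsync or GitHub Pages details in the Rakefile, and the migration itself assumes a WordPress XML export fed through exitwp beforehand.

    # Write, generate, preview, and publish with Octopress
    rake new_post["SNMP Monitoring With Datadog"]   # creates a Markdown stub in source/_posts
    rake generate                                   # renders the static site into public/
    rake preview                                    # serves it locally for a sanity check
    rake deploy                                     # pushes public/ out via rsync or GitHub Pages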

Necessary Privacy

15 years ago, I held the position that governments and technology providers were not only able to monitor data communications, but were actively doing so through several dragnet operations. Being dismissed as a conspiracy theorist left me with very few people to share GPG keys with for email encryption; not that this stopped me from storing my information encrypted, hiding my online tracks, and taking other actions that furthered the perception that I was a bit paranoid. A typical conversation on this topic usually led to the question “what do you have to hide?” I’ve never had anything incriminating to hide, and it was never about that.

It was about control over my own personal information, where it ended up, who had access, who was rewriting it, and who was selling it. It was also about being labeled something I wasn’t, just because of an acquaintance or affiliation.

Around the time that I started really taking privacy seriously, I pulled out my soapbox and started preaching to anyone within earshot, usually to their chagrin. I observed that most people considered their privacy to be a right, and placed trust in the authorities to regulate and manage any information collected about them.

Fast forward to today. A simple, civilian-accessible search on someone’s name can get you emails, phone numbers, addresses, current and past employers, images, and purchases on eBay. Each piece of information can be used to reveal more, until an entire profile has been built on this person.

Or you can just add them on Facebook.

So what is the problem with all of this information being collected and processed?

  • This data is persistent; it will not be deleted

  • The data is only useful if it’s properly related to other data from the same person; processes are in place to ensure validity, although there is no guarantee that your data isn’t linked erroneously

  • There are no regulations or laws saying that all of your data must be accessible to you (even through FOI/FIPPA)

  • There are no mechanisms in place to opt out of this collection

But what happens when the analytics engine fails? When we’re talking about big data, we’re talking about a staggering amount of information. No single person or group could manually sift through this data and draw correlations; a program or process handles that workload. History has taught us that programs and processes have bugs. Suddenly, you may find yourself on a list among sex offenders, political dissidents, or worse, just because you had a friend in common with a wanted criminal, and your grandmother’s computer had been infected with a botnet that was being used to distribute anti-government propaganda. There is no one you can call to get this association removed.

So what can you do to prevent this in the first place? You can become translucent, obfuscate, and protect. Starting with your web browser and email, encrypt everything that you can.

  • HTTPS for web connections, and GnuPG for email (see the sketch after this list)

  • Don’t bother trying to encrypt to keep the NSA out; just try to keep data-mining apps and scripts from hauling off your personal data

  • Try not to leave tracks: ask sites not to track you, turn off scripts with NoScript, block incoming ads, and don’t accept or store third-party cookies

  • Shred your bills and other hard copies with personal information

  • Don’t give your postal code or personal details to merchants
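
On the GnuPG side, getting started takes only a few commands. This is a sketch with placeholder addresses and filenames; key signing and web-of-trust hygiene are topics of their own.

    gpg --gen-key                                         # generate a keypair, following the prompts
    gpg --armor --export you@example.com > mykey.asc      # export your public key to share
    gpg --encrypt --sign -r friend@example.com note.txt   # encrypt and sign a file for a correspondent
    gpg --decrypt note.txt.gpg                            # read what they send back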

You cannot escape the data dragnet entirely without going completely off the grid. But you can minimize the chances of a disastrous identity theft by corporate interests, governmental bodies, or other types of nefarious criminal organizations.