SNMP polling interval granularity

I recently had a need to increase the granularity of SNMP monitoring on some critical network interfaces, and I thought I’d share what I’ve learned so far.

The reason behind the change was due to an issue where we briefly maxed out the bandwidth on of our WAN circuits. For most companies this would be no big deal, especially since the burst in traffic was less than a minute in duration. However, because our customers are very sensitive to latency, this caused some noticeable delays.

The problem with interval monitoring is that you just don’t know what happens between the polls. The whole idea of interval monitoring is to get an approximation of what’s happening, so by design you’re going to miss some of the details. But how do you know what you miss?

Take a look at this picture to get an idea of what I’m talking about:

SNMP Granularity Example

As you can see, the longer the time between polls, the more detail is missed. This might not be a problem for you — I would say that the goal of most monitoring solutions is to look for traffic *trends*, and not necessarily sub minute bursts. But if you need the extra detail, then it becomes quite a challenge, and there are limited solutions depending on the size of your budget.

If money is no object, then you can buy one of the fancy network traffic monitors by Corvil or Netscout. These are awesome boxes that record every drop of data you feed through them, so that you can replay the exact issue over and over. You pay handsomely for that functionality.

cPacket networks and Riverbed also have some cool products, but price is definitely a factor as well.

The only way to get this kind of granularity on the cheap is to leverage SNMP and pump up the granularity.

IOS SNMP Configuration

Disclaimer: Do not apply these commands to your production environment until you have tested and fully understand the potential impact

By my rough estimation and testing, most Cisco devices (IOS, IOS XE, and NX-OS), by default, will update their internal SNMP statistics every 10 seconds +/-. But there’s hidden command in IOS that will let you adjust the interval at which IOS updates the internal statistics database.

snmp-server hc poll <interval>

The interval for this command is in hundredths of seconds. So to update at 1 second intervals, you need:

snmp-server hc poll 100

I have used this command successfully on Cisco ISR G2 routers, 7200 routers, and 4900M switches.

This does not work on IOS XE or NX-OS devices, including ASR1000’s and Nexus 7K or 3K’s.

Network Management Systems

You might recall that I’m a big fan of Solarwinds for network monitoring. Unfortunately, the lowest interval you can configure for statistics collection in Solarwinds NPM is 1 minute. So I had to look elsewhere for a product that could poll more frequently.

Enter SevOne. Their platform allows you to configure ‘high-speed pollers’ which can be tuned down to 1 second intervals. The folks at SevOne are great to work with, and I received a lot of good guidance as I was configuring my instance.

This isn’t meant to be a review or comparison of the two products — SevOne solved a very specific problem for me and at a price point that was 1/10th the cost of the next closest solution, so props to them.

The other possible solutions are Cacti or other RRDtool based statistics/graphing systems. Since these are so flexible, I would expect that you could grab stats down to one second with these as well, although I haven’t verified this.

Request for Comments

There’s some discussion out there about the problems you can run into if you poll too frequently, one of which is polling more frequently than the agent updates. I definitely observed this problem as I was working on this, and saw the wild numbers you can get. If you have any thoughts on why polling more frequently would be bad, or might lead you to the wrong conclusions, please feel free to comment.

IT Crisis Management: Responding to and dealing with network outages

I’ve worked in the financial industry for almost a decade now, and in that time I’ve seen my fair share of network/system outages. When I started, I heard plenty of stories about “RGE’s” or “Resume Generating Events” as they were called around the office, and was jokingly advised to always keep my resume up to date. That’s not bad advice by itself, but I wanted to share my opinion on dealing with outages and the aftermath.

When I started, I lived in fear of what would happen if something went down during my watch. My heart skipped a beat each time an alert popped up on our monitoring system. And the first time an outage occurred, I was sure it was going to cost me my job.

Thankfully, that didn’t happen.

Over the years, not only have I grown less fearful of outages, I’ve learned that there are two phases to an outage: The outage itself (including the resolution), and the post-mortem.

The Outage

Obviously, if something is going wrong, the primary objective should be restoring service as soon as possible. Key to a rapid resolution is having clear channels of communication with other teams and knowing the most important people to bring into the loop during an issue. This is called being transparent about issues. Don’t wait to open channels of communication to partner teams or managers. In the financial industry information is key. Clients might be unhappy that you’re having problems, but they will be downright livid if you had an issue and you didn’t warn them. This can also have liability issues attached, so be sure to think out these things with your management.

Also vitally important is a strong understanding of the network, how things connect to each other, and having a good monitoring system in place. It almost goes without saying, but if you don’t know how anything relates to each other, and you have no way to monitor your systems, you’ve got some other issues to deal with.

The Post Mortem

You’ve solved the issue and things are working smoothly again — disaster mitigated! What happens after the outage, is in my experience, equally as important as resolving the issue itself. I believe this is where you can really shine and set yourself apart.

A mentor of mine once told me that what people really want after an outage is the answers to three questions:

  1. What happened?
  2. What did you do to fix it?
  3. When did you know about it?

And he followed up by saying that out of the three questions, the last was really the most important. His opinion, and one I share now, is that accidents happen, equipment breaks, circuits get the backhoe treatment, etc. and you generally can’t totally avoid outages. What you can do (besides practicing good operational principles) is respond promptly to your alerting system or reports from users, and show initiative by investigating on the first call and not on the 10th. Act quickly and openly.

Depending on the nature of your team, it’s likely that only a few people will understand the true technical cause of an issue. As you communicate up the chain, the gruesome details will be lost. Its like the telephone game:

Engineer to manager: "We started taking excessive CRC errors on an interface without the line actually going down."

Manager to CTO: "We had a line problem."

CTO to CEO: "It broke."

Ok, so maybe that’s a little too simplified, but you get the idea. You don’t speak the same language as the upper echelon of management, and conversely, they don’t speak your low-level technical language. All they care about is when did we know about it, and how soon did we fix it.

Everything Else

This is not meant to be an exhaustive description of everything that should happen before/during/after an outage. I also don’t want to get into the details of what constitutes good operational principles, but I’m thinking of things like exercising caution when implementing changes, not being cavalier about production environments, and only making changes after following proper change management procedures , and then only during approved change windows.

I’m also aware that there are different cultures around the human side of outages. Some companies or managers will roast you and leave you for dead no matter how small the misstep, while others will have a more forgiving approach. That’s another topic as well.

Just remember that no matter how deep your bureaucracy, or how carefully you plan, there will always be unexpected issues. Just be sure that you do everything you can to be responsive and transparent about the issues you face.

Have thoughts on this subject? I’d love to hear your comments!