How are the switches and routers in your network doing? Are the NAS appliances still up? What's the speed of a particular interface? Is your server room getting too hot? With SNMP (Simple Network Management Protocol) you can find out all kinds of information about the devices in your network.
Since a lot of hardware manufacturers support this protocol, you can collect data from network switches, routers, UPS and power supply devices, NAS appliances, printers, etc. and use it in your monitoring software. By monitoring that kind of data, you can make decisions based on facts instead of guesswork, and you can even prevent some failures before they occur.
In this blog post we'll give an introduction to SNMP and its history, and we'll explain how OIDs (Object Identifiers) and MIBs (Management Information Bases) work and why you need them. We're also going to talk about the pros and cons and share some insights from our developers, who write SNMP checks on a daily basis. And of course, we'll show you how to monitor your devices via SNMP in Checkmk.
Some Facts about SNMP
Let's start with a brief overview of SNMP and how it has changed over the past decades. There are three major versions. SNMPv1 is the oldest one; it dates back to the late 1980s and was developed as a standard for managing network devices via IP (Internet Protocol). More than 30 years later it is still in use, although it is insecure and inefficient. Version 1 only requires a plaintext community string (a kind of password): if it matches the community stored in the agent, access is granted. SNMPv1 also only supports 32-bit counters, which aren't good enough for today's gigabit networks and cause problems because they wrap too fast, i.e. more than once between two samples.
When does that matter? Let’s do some maths: A 32-bit counter can hold 2^32 = 4,294,967,296 different values (roughly 4 billion), from 0 up to 4,294,967,295. If you want to monitor the number of octets received on a fully loaded 1 Gbit/s interface, that’s 1 billion (a thousand million) bits per second. An octet is 8 bits, so every second you receive around 125 million octets. After about 34 seconds (4,294,967,296 / 125,000,000 ≈ 34.4), the counter has passed its maximum, wraps and starts at 0 again. Yay!
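To double-check these numbers, and to compare them with 64-bit counters, here is a small Python sketch. The names `seconds_until_wrap` and `COUNTER32_VALUES` are our own, and the calculation assumes the link is fully loaded the whole time, which is the worst case.

```python
# Worst-case time until an SNMP octet counter wraps, assuming the link
# is 100% utilized the whole time.
COUNTER32_VALUES = 2**32  # values 0 .. 4,294,967,295
COUNTER64_VALUES = 2**64
SECONDS_PER_YEAR = 365 * 24 * 3600


def seconds_until_wrap(counter_values: int, link_bits_per_second: float) -> float:
    """Seconds until an octet counter starting at 0 wraps back to 0."""
    octets_per_second = link_bits_per_second / 8  # one octet = 8 bits
    return counter_values / octets_per_second


for label, speed in [("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    t32 = seconds_until_wrap(COUNTER32_VALUES, speed)
    years64 = seconds_until_wrap(COUNTER64_VALUES, speed) / SECONDS_PER_YEAR
    print(f"{label}: 32-bit wraps after {t32:.1f} s, 64-bit after {years64:,.0f} years")
```

At 1 Gbit/s the 32-bit counter wraps after about 34.4 seconds, at 10 Gbit/s after only about 3.4 seconds, while a 64-bit counter needs centuries even on the faster link.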
SNMPv2c adds support for 64-bit counters for high-capacity interfaces. Like v1, its successor still sends data over the network in plain text, but it introduces a new and optimized way to send and receive management data in bulk transfers (GetBulk requests). SNMPv2c and SNMPv1 are not compatible with each other.
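A wrapping counter by itself is not a problem, as long as it has not gone all the way around between two samples: the monitoring software can then compute the increase modulo the counter size. The following Python sketch illustrates this and also shows what goes wrong when a 32-bit counter laps itself on a busy gigabit link. The function name `counter_delta`, the 60-second polling interval and the starting value are assumptions made for illustration.

```python
def counter_delta(previous: int, current: int, bits: int) -> int:
    """Increase of a counter between two samples.

    Only correct if the real increase is smaller than 2**bits, i.e. the
    counter did not come all the way around past its previous value.
    A larger increase cannot be detected and yields a too-small result.
    """
    modulus = 2**bits
    if not (0 <= previous < modulus and 0 <= current < modulus):
        raise ValueError("sample outside the counter's value range")
    return (current - previous) % modulus  # handles a single wrap correctly


INTERVAL = 60  # seconds between two polls (assumed)
true_octets = 125_000_000 * INTERVAL  # fully loaded 1 Gbit/s link
start = 1_000_000  # arbitrary counter value at the first poll

for bits in (32, 64):
    current = (start + true_octets) % 2**bits  # what the device reports
    octets = counter_delta(start, current, bits)
    print(f"{bits}-bit counter: {octets * 8 / INTERVAL / 1e9:.2f} Gbit/s")
```

With the 64-bit counter the sketch reports the real 1.00 Gbit/s. The 32-bit counter went around more than a full cycle within the minute, so the computed rate drops to about 0.43 Gbit/s. This is why you want the 64-bit interface counters (for example ifHCInOctets from the IF-MIB) wherever a device offers them.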
SNMPv3 adds security on top of v2c: it supports authentication (MD5 or SHA-1) as well as encryption (DES or AES-128, sometimes AES-256), and of course a combination of both. You can choose between three security levels: noAuthNoPriv (no authentication, no privacy), authNoPriv (authentication, no privacy), and authPriv (authentication and privacy). Since MD5 and SHA-1 are considered insecure these days, please make sure to define two different passwords for authentication and encryption when configuring the device, so that cracking one does not automatically reveal the other.
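To make the three security levels a bit more tangible, here is a Python sketch of a plausibility check for SNMPv3 credentials. The `V3Credentials` class and the `validate` function are hypothetical helpers, not part of Checkmk or any SNMP library; the level names follow the spelling used in the SNMP standards.

```python
from dataclasses import dataclass
from typing import Optional

# Which secrets each SNMPv3 security level needs:
# (authentication password, privacy/encryption password)
SECURITY_LEVELS = {
    "noAuthNoPriv": (False, False),
    "authNoPriv": (True, False),
    "authPriv": (True, True),
}


@dataclass
class V3Credentials:  # hypothetical structure for illustration only
    security_level: str
    username: str
    auth_password: Optional[str] = None
    priv_password: Optional[str] = None


def validate(creds: V3Credentials) -> list:
    """Return a list of problems; an empty list means the settings look sane."""
    if creds.security_level not in SECURITY_LEVELS:
        return [f"unknown security level {creds.security_level!r}"]
    needs_auth, needs_priv = SECURITY_LEVELS[creds.security_level]
    problems = []
    if needs_auth and not creds.auth_password:
        problems.append("authentication password missing")
    if needs_priv and not creds.priv_password:
        problems.append("encryption (privacy) password missing")
    if needs_priv and creds.auth_password and creds.auth_password == creds.priv_password:
        problems.append("use different passwords for authentication and encryption")
    return problems


print(validate(V3Credentials("authPriv", "monitoring", "s3cret-one", "s3cret-one")))
```

Here the single reported problem is the reused password, exactly the situation the advice above warns about.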
NOTE: Encryption in SNMPv3 requires considerably more processing power, so it can slow things down significantly. That's why version 3 can become a frustrating bottleneck, e.g. on huge modular switches (see section "Is SNMP great or does it suck?"). Many devices still support version 1 in addition to version 2c, but please try to avoid SNMPv1 because of the wrapping problem.
OID and MIB – What the Hack?
So, we've mentioned that SNMP collects data and transfers it from the managed devices to your monitoring software. Depending on the device, this could be information about a printer's toner level or the number of printed pages, the bandwidth of a network interface, a server's fan status, etc. Every piece of information is an object with its own unique identifier, i.e. an individual address. This address is called an OID (Object Identifier).
An OID is a long sequence of numbers separated by dots, e.g. 1.3.6.1.2.1.1.5.0 (sysName, the name of the device).
The numbers follow a hierarchical order, i.e. a tree structure. Reading an OID from left to right, it starts at the root (1 = ISO), followed by the first child node (3 = identified organization) and so on. The picture below shows a couple of relevant paths in this tree.
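Because an OID is just a path through this tree, you can handle it as a tuple of integers. The Python sketch below splits an OID and names the well-known nodes along the way. The `WELL_KNOWN` table is a small hand-picked excerpt (names as defined in the SNMP standards), and `parse_oid` and `describe_path` are hypothetical helpers, not library functions.

```python
# Hand-picked excerpt of well-known nodes at the top of the OID tree.
WELL_KNOWN = {
    (1,): "iso",
    (1, 3): "org",  # "identified organization"
    (1, 3, 6): "dod",
    (1, 3, 6, 1): "internet",
    (1, 3, 6, 1, 2): "mgmt",
    (1, 3, 6, 1, 2, 1): "mib-2",
    (1, 3, 6, 1, 2, 1, 1): "system",
    (1, 3, 6, 1, 2, 1, 1, 5): "sysName",
    (1, 3, 6, 1, 4): "private",
    (1, 3, 6, 1, 4, 1): "enterprises",
}


def parse_oid(text: str) -> tuple:
    """'1.3.6.1' or '.1.3.6.1' -> (1, 3, 6, 1)"""
    return tuple(int(part) for part in text.strip(".").split("."))


def describe_path(oid: str) -> list:
    """Walk the OID from the root and name every node we know."""
    arcs = parse_oid(oid)
    steps = []
    for depth in range(1, len(arcs) + 1):
        name = WELL_KNOWN.get(arcs[:depth])
        number = arcs[depth - 1]
        steps.append(f"{number} = {name}" if name else str(number))
    return steps


print(" -> ".join(describe_path("1.3.6.1.2.1.1.5.0")))
```

The trailing 0 is not a named node: for single-value (scalar) objects such as sysName, SNMP appends .0 as the instance index.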
Most parts of the tree are standardized. For example, you should find information related to network interfaces in the OIDs below 1.3.6.1.2.1.2. However, every vendor can also define their own OIDs in a subtree of their own below 1.3.6.1.4.1 (enterprises). The blue area of the picture shows part of Cisco's OID tree (below 1.3.6.1.4.1.9). There are generic OIDs (left tree) as well as OIDs for specific products, e.g. a Cisco UCS (right tree).
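Checking whether an OID belongs to such a subtree, e.g. the standardized interfaces subtree or Cisco's enterprise subtree, means comparing the leading numbers of the OID. A plain string prefix test is a classic trap, because 1.3.6.1.4.1.99 starts with the characters 1.3.6.1.4.1.9 but belongs to a different vendor. A minimal Python sketch (the function names are our own):

```python
def parse_oid(text: str) -> tuple:
    """'1.3.6.1' or '.1.3.6.1' -> (1, 3, 6, 1)"""
    return tuple(int(part) for part in text.strip(".").split("."))


def is_in_subtree(oid: str, subtree: str) -> bool:
    """True if `oid` equals `subtree` or lies below it in the OID tree."""
    arcs, root = parse_oid(oid), parse_oid(subtree)
    return arcs[: len(root)] == root  # compare whole numbers, not characters


INTERFACES = "1.3.6.1.2.1.2"  # standardized interfaces subtree
CISCO = "1.3.6.1.4.1.9"  # Cisco's enterprise subtree

print(is_in_subtree("1.3.6.1.2.1.2.2.1.10.3", INTERFACES))  # True
print(is_in_subtree("1.3.6.1.4.1.99.1", CISCO))  # False
print("1.3.6.1.4.1.99.1".startswith(CISCO))  # True, but wrong!
```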
To make an OID readable, you need a translator. The MIB (Management Information Base) provides names, definitions, and descriptions for the objects. Converting an OID into something human-readable with a MIB is therefore a bit like resolving IP addresses to hostnames with the help of DNS. Since a lot of hardware vendors use their own OID numbering scheme, you need their MIB to understand and translate the numbers.
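The DNS analogy goes quite far: a MIB-based translator looks up the longest known prefix of an OID and keeps the rest (usually an index such as the interface number) as a suffix. Tools like Net-SNMP's snmptranslate do this with real MIB files. The Python sketch below fakes the MIB with a tiny hand-written dictionary (names taken from SNMPv2-MIB and IF-MIB), so treat it as an illustration of the idea only; `MIB_EXCERPT` and `translate` are hypothetical names.

```python
# Tiny hand-written stand-in for a real MIB: OID prefix -> object name.
MIB_EXCERPT = {
    (1, 3, 6, 1, 2, 1, 1, 1): "sysDescr",
    (1, 3, 6, 1, 2, 1, 1, 5): "sysName",
    (1, 3, 6, 1, 2, 1, 2, 2, 1, 2): "ifDescr",
    (1, 3, 6, 1, 2, 1, 2, 2, 1, 10): "ifInOctets",
    (1, 3, 6, 1, 2, 1, 31, 1, 1, 1, 6): "ifHCInOctets",
}


def translate(oid: str) -> str:
    """Replace the longest known OID prefix by its name, keep the index."""
    arcs = tuple(int(part) for part in oid.strip(".").split("."))
    for length in range(len(arcs), 0, -1):  # longest matching prefix wins
        name = MIB_EXCERPT.get(arcs[:length])
        if name:
            return name + "".join(f".{n}" for n in arcs[length:])
    return oid  # unknown to our MIB: leave the numbers as they are


print(translate("1.3.6.1.2.1.2.2.1.10.3"))  # ifInOctets.3 (interface index 3)
print(translate("1.3.6.1.2.1.1.5.0"))  # sysName.0
print(translate("1.3.6.1.4.1.9.1"))  # no vendor MIB loaded: stays numeric
```

Without the vendor's MIB, the last OID stays a plain number sequence, which is exactly why you need the vendor MIB for vendor-specific subtrees.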
NOTE: Customers often ask for MIB checks, but keep in mind that the MIB is just a translation – in other words, it's not the solution, it's your way to a solution.
Is SNMP great or does it suck?
SNMP is still the de facto standard for network monitoring. A lot of devices support the protocol (and sometimes nothing else!), so you can monitor almost everything: traffic/bandwidth, CPU load, the fan status of a RAID system, even air conditioners and door alarms. Because SNMP provides data in a standardized way, you can use it with many different monitoring solutions.
But let's face it, there are some serious disadvantages. Querying a device via SNMP is slow and quite inefficient. Compared to SNMPv1, version 2c has improved things a bit, but SNMPv3 makes things worse again with its added security (looking at the MD5/SHA-1 algorithms, I would like to put quotation marks around the word "security").
TIP: If encryption slows things down in your environment, consider going back to v2c and moving the management traffic to a separate VLAN for security.
While many devices support SNMP, that doesn’t mean the vendors did a great job implementing it. Poor performance is a pain for admins monitoring a device whose SNMP requests keep running into timeouts. But a poor SNMP implementation is even more annoying for anyone who develops checks for a monitoring solution. Our developers have seen all kinds of misbehaving SNMP stacks. Just think about different date formats: Some countries use YYYY/MM/DD, others DD/MM/YYYY or DD/MM/YY, or even MM/DD/YY. A couple of weeks ago we wrote a check for an SNMP device whose vendor thought that using three digits for the day in the date format was a great idea. What on earth?! We could probably write an entire book called “SNMP Stories from Hell”.
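Why does this hurt so much? Because the same string can be a perfectly valid date in several formats. The short Python sketch below reads one made-up value, 03/04/05, with three of the formats mentioned above; `FORMATS` and `interpretations` are hypothetical names. A check cannot tell which reading is right unless it knows how the specific device formats its dates.

```python
from datetime import date, datetime

# Three of the date formats mentioned above, as strptime patterns.
FORMATS = {
    "YYYY/MM/DD": "%Y/%m/%d",
    "DD/MM/YY": "%d/%m/%y",
    "MM/DD/YY": "%m/%d/%y",
}


def interpretations(raw: str) -> dict[str, date]:
    """All dates `raw` could mean, keyed by the format that produces them."""
    result = {}
    for name, fmt in FORMATS.items():
        try:
            result[name] = datetime.strptime(raw, fmt).date()
        except ValueError:
            pass  # not a valid date in this format
    return result


for name, day in interpretations("03/04/05").items():
    print(f"{name}: {day.isoformat()}")
```

The string does not fit YYYY/MM/DD, but it is valid both as DD/MM/YY (3 April 2005) and as MM/DD/YY (4 March 2005).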
Of course, the MIB can help, but one thing is for sure: The MIB is not the holy grail, and quite often the information stored there is simply not correct. Where you expect bytes, you might actually find bits, and so on. By the way, this is also why we're not big fans of check generators where you feed in a MIB and out comes a check. It might work, but most likely you will end up spending more time fixing faulty checks.
And here is some more advice from our consultants: Avoid cross-monitoring with SNMP. Never try to access your switches, routers, etc. with more than one SNMP tool. Even if the device has a powerful enough CPU, it's probably not a good idea to configure two monitoring servers or any other tool to monitor the same device simultaneously.
To sum it all up: Only use SNMP if there is no other way. SNMP puts a lot of strain on both the target system and the monitoring software, especially if it is not implemented properly (which happens often enough). If you can use a good lightweight agent for monitoring, do it. If you can’t install an agent but there is a good API available, use that (e.g. we recommend using the NetApp API for monitoring, as it is much better implemented than NetApp's SNMP agent). Only if you can neither install an agent nor use a good API should you fall back to SNMP.
How to monitor via SNMP in Checkmk
Checkmk supports monitoring via SNMP for a lot of different devices, e.g. network switches, routers, UPS and power supply devices, NAS appliances, printers, etc. And we don't just offer generic checks for those devices: over the past years we've written many specific checks with the help of our customers. They've given us valuable input and advice to make sure that we monitor the right things and set default thresholds that make sense.
Assuming your DNS is working correctly, just enter the hostname of a new device and the community (v1, v2c) or the user name and the device's passwords for authentication and encryption (v3). Checkmk will then determine the correct checks for that device and automatically detect its services, with no need to configure anything else.
To find out if Checkmk supports a device, have a look at the Catalog of Check Plug-ins. Click on a list entry for detailed information about the supported agent (Checkmk agent, vendor API, SNMP). Checkmk can also handle SNMP traps using the Event Console. For more information about that, have a look at our user manual.
So, now you know about the different protocol versions, OIDs and MIBs, and we've explained how to use Checkmk to monitor all sorts of devices via SNMP. While it felt good to rant a bit about SNMP, or rather about some vendors' implementations of it, one thing is for sure: SNMP plays an important role in monitoring physical devices and will probably continue to do so for a while. Things would be much worse without SNMP; it's just a necessary evil we have to live with.
This is the first article in a series of blog posts about SNMP network monitoring. Stay tuned for more #monitoringlove.