Splunk is more than just a highly efficient search engine. The cross-platform solution captures, indexes, and correlates real-time data from different sources, stores it in a searchable repository, and creates visualizations. Like any other service in your environment, Splunk can and should be monitored – ideally with an external tool and on a different machine.
After a brief introduction to Splunk, this article describes the Splunk components that can be monitored. In the last section, we'll show how easy it is to integrate Splunk monitoring into Checkmk, so that you no longer need to open Splunk's GUI and have everything in one place.
What is Splunk?
In a nutshell: Splunk makes machine data readable and offers access to all kinds of data that is usually unstructured and quite difficult to understand. Splunk identifies patterns, provides metrics, and generates graphs, reports, alerts, dashboards, and other visualizations. It can index almost any kind of data, such as streaming, machine, and historical data, for example log files or network feeds.
Basically, you point the software at the data source of your choice which then becomes a data input. To get remote data into Splunk, it can read network feeds or receive data from so-called forwarders which are installed on the different hosts where the data originates. On top of that, there are apps and add-ons with pre-configured inputs for specific data sources, for example from Linux or Windows hosts, Cisco Security or Symantec Blue Coat data, and so on.
Since Splunk processes and extracts the relevant data, it helps admins to identify and locate problems in their IT infrastructure. But what about Splunk itself? If you're using Splunk in your organization, it most likely plays a key role, and therefore it's important to keep an eye on it. Let's have a look at the basics first.
Splunk Monitoring: the Basics
A good start is looking at log files. Splunk generates quite a few of these, for example internal logs (user activities, indexed volume in bytes, periodic snapshots of Splunk performance and system data, logs for the Splunk server daemon splunkd, and more), introspection logs (data about the impact of the Splunk software on the host system), and search logs (data about a search, including run time and other performance metrics).
On top of that, it's important to check on Splunk's overall state, i.e. its topology and performance. That includes search and indexing performance, resource and license usage, etc. Here is an overview of things you might want to check:
- Splunk Alerts
- Splunk Health
- Splunk Jobs
- Splunk License State
- Splunk License Usage
- Splunk System Messages
Now, if you're already familiar with all these aspects, you can jump straight to the section 'How to monitor Splunk with Checkmk'. If not, read the next section for more detailed information about Splunk's components.
Alerts, Health, Jobs, and Licenses
As you probably know, you can configure alerts in Splunk so that it responds to certain events and triggers an action when a search result meets a specific condition. An alert action can be an email notification, a webhook that posts messages to a chat room, etc. Once you've configured alerts in Splunk, it's a good idea to keep track of the number of alerts triggered: too many or too few alerts can indicate a misconfigured Splunk service.
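To make "too many or too few" concrete, here is a minimal sketch of rating an alert count against lower and upper levels, roughly the way a monitoring check would. The function name and the default levels are hypothetical examples, not Splunk or Checkmk defaults; only the state numbering (0 = OK, 1 = WARN, 2 = CRIT) follows the usual Checkmk convention.

```python
# Checkmk service states: 0 = OK, 1 = WARN, 2 = CRIT
OK, WARN, CRIT = 0, 1, 2


def alert_count_state(count, lower=(1, 0), upper=(50, 100)):
    """Rate the number of triggered alerts against lower and upper levels.

    lower = (warn_below, crit_below) and upper = (warn_at, crit_at).
    The default levels are hypothetical examples, not Splunk or Checkmk defaults.
    """
    warn_below, crit_below = lower
    warn_at, crit_at = upper
    # Check the critical levels first, in both directions
    if count >= crit_at or count < crit_below:
        return CRIT
    if count >= warn_at or count < warn_below:
        return WARN
    return OK


# No alerts at all is suspicious, 120 alerts is far too many
print(alert_count_state(0), alert_count_state(10), alert_count_state(120))  # prints 1 0 2
```

With `crit_below = 0`, a low alert count can only ever raise a warning; pick levels that match how many alerts your searches normally trigger.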
The Splunk Health Report offers detailed information about the splunkd service. It tells you about the overall state and shows indicators for individual features, such as the File Monitor Input, Index Processor, Search Scheduler, etc. The health report uses different colors: green means the feature is functioning properly, yellow indicates that a feature is experiencing problems, and red means there is a severe problem. You can access the health report via the Splunk web interface, but you can also query the features via Splunk's REST API, which is what the Checkmk special agent does.
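The color scheme maps naturally onto monitoring states. The sketch below shows one possible translation, assuming green/yellow/red become OK/WARN/CRIT and any unexpected value becomes UNKNOWN; this mapping is an illustration, not necessarily what the Checkmk special agent does. For the overall result it takes the worst feature state, using Checkmk's ordering in which UNKNOWN ranks between WARN and CRIT.

```python
# Hypothetical mapping of Splunk health colors to Checkmk states
OK, WARN, CRIT, UNKNOWN = 0, 1, 2, 3
COLOR_TO_STATE = {"green": OK, "yellow": WARN, "red": CRIT}

# Checkmk ranks states OK < WARN < UNKNOWN < CRIT when picking the worst one
SEVERITY = {OK: 0, WARN: 1, UNKNOWN: 2, CRIT: 3}


def feature_state(color):
    """Translate one health color; anything unexpected becomes UNKNOWN."""
    return COLOR_TO_STATE.get(color.lower(), UNKNOWN)


def overall_health(features):
    """features maps a feature name to its color; the worst state wins."""
    if not features:
        return UNKNOWN
    states = [feature_state(color) for color in features.values()]
    return max(states, key=SEVERITY.__getitem__)


health = {
    "File Monitor Input": "green",
    "Index Processor": "yellow",
    "Search Scheduler": "green",
}
print(overall_health(health))  # prints 1 (WARN)
```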
In principle, every search query in Splunk is a job, and if there are too many jobs, or too many unfinished or failed ones, the system can become unstable. So keep an eye on the job count, failed jobs, and zombie jobs (jobs that never reached the status “terminated”).
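Counting these job categories is straightforward once you have the job list. This sketch assumes each job is a dict carrying `isDone`, `isFailed`, and `isZombie` flags, similar to the properties Splunk reports for search jobs over its REST API; the function names are hypothetical, and the exact field names and encodings should be checked against your Splunk version.

```python
from collections import Counter


def _flag(value):
    # REST payloads may encode booleans as True/False, 1/0 or "1"/"0"
    return str(value).lower() in ("1", "true")


def summarize_jobs(jobs):
    """Count total, failed, zombie and unfinished jobs.

    Each job is a dict assumed to carry isDone, isFailed and isZombie flags.
    """
    counts = Counter(total=len(jobs))
    for job in jobs:
        counts["failed"] += int(_flag(job.get("isFailed")))
        counts["zombie"] += int(_flag(job.get("isZombie")))
        counts["unfinished"] += int(not _flag(job.get("isDone")))
    return {key: counts[key] for key in ("total", "failed", "zombie", "unfinished")}


jobs = [
    {"isDone": "1"},
    {"isDone": "0"},
    {"isDone": "1", "isFailed": "1"},
    {"isDone": "0", "isZombie": "1"},
]
print(summarize_jobs(jobs))
# prints {'total': 4, 'failed': 1, 'zombie': 1, 'unfinished': 2}
```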
If you exceed the maximum daily indexing volume allowed by your Splunk license, a license warning is issued, and exceeding the limit for license warnings causes a license violation. During a violation period, Splunk continues to index data, but search (except on the internal index), scheduled reports, and alerts are blocked. This is why you should monitor both the license itself and the license usage.
So, what's the difference? When monitoring the license, you receive information about its status (valid/expired), the expiration time, the maximum number of violations (before search is blocked), and the allowed quota (the data volume that can be indexed per day). To check how close you are to reaching the quota, you need another check: license usage. It tells you how much data has been indexed, including the data volume on the slaves.
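Both checks boil down to simple arithmetic: indexed volume as a share of the daily quota, and the time left until the expiration date. The sketch below assumes the quota and usage are given in bytes and the expiration time as a Unix timestamp; the function names and the 80 %/90 % levels are hypothetical examples.

```python
from datetime import datetime, timezone


def usage_percent(used_bytes, quota_bytes):
    """Share of the daily license quota that has been indexed, in percent."""
    if quota_bytes <= 0:
        raise ValueError("quota_bytes must be positive")
    return 100.0 * used_bytes / quota_bytes


def usage_state(percent, warn=80.0, crit=90.0):
    """0 = OK, 1 = WARN, 2 = CRIT; the levels are hypothetical examples."""
    if percent >= crit:
        return 2
    if percent >= warn:
        return 1
    return 0


def days_until_expiry(expiration_ts, now=None):
    """Whole days until the license expires; negative once it has expired.

    expiration_ts is assumed to be a Unix timestamp in seconds.
    """
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromtimestamp(expiration_ts, tz=timezone.utc)
    return (expires - now).days


MB = 1024 ** 2
pct = usage_percent(400 * MB, 500 * MB)
print(pct, usage_state(pct))  # prints 80.0 1
```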
Last, but not least, you should check on Splunk's system messages. Normally, you can access the messages via the Splunk web interface, but it's possible to access these notifications via the REST API as well, so why not send them to the external monitoring solution?
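Once the messages are available, a monitoring check can summarize them by severity. This sketch assumes each message is a dict with a `severity` key such as "info", "warn", or "error"; the rule that any error is CRIT and any warning is WARN is a hypothetical choice for illustration.

```python
from collections import Counter


def messages_by_severity(messages):
    """Count messages per severity; each message is assumed to have a 'severity' key."""
    return Counter(str(m.get("severity", "unknown")).lower() for m in messages)


def messages_state(messages):
    """Hypothetical rule: any error is CRIT (2), any warning WARN (1), else OK (0)."""
    counts = messages_by_severity(messages)
    if counts["error"]:
        return 2
    if counts["warn"]:
        return 1
    return 0


msgs = [{"severity": "info"}, {"severity": "WARN"}]
print(messages_state(msgs))  # prints 1
```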
How to monitor Splunk with Checkmk
As I mentioned above, Checkmk has a special agent for Splunk which uses the Splunk REST API to access information about Splunk alerts, jobs, health, and licenses. Setting up the monitoring in Checkmk is simple: Create a new host (WATO ➳ Hosts, Create new host), tick the checkbox IP Address Family and choose No IP from the drop-down menu. Open the DATA SOURCES section, activate the check box Check_MK Agent and select No Checkmk agent, all configured special agents from the drop-down menu or Normal Checkmk agent, all configured special agents if the Checkmk agent is installed.
Next, create a new rule. In our example, you need the rule Check state of Splunk, which you can find via the search box in WATO ➳ Host & Service Parameters. It should be listed under Datasource Programs. In the configuration dialog, enter the name of the Splunk instance (FQDN or IP address) you want to observe and your Splunk username and password. If you're using Splunk ≥8.0, make sure to choose HTTPS in the Protocol drop-down menu, even if the Splunk GUI is still accessible via HTTP. You can tick various check boxes to decide what to monitor: License state, License usage, System messages, Jobs, Health, and Alerts.
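The rule's settings (instance, protocol, username, password) are essentially what any client needs to talk to Splunk's REST API. As a rough sketch of that idea, and not the special agent's actual code, the function below builds an authenticated request without sending it. It assumes Splunk's default management port 8089, HTTP basic authentication, and the `output_mode=json` query parameter; the host name, user, and endpoint path are placeholders you should check against the Splunk REST API reference for your version.

```python
import base64
import urllib.parse
import urllib.request


def build_request(host, endpoint, user, password, protocol="https", port=8089):
    """Build (but do not send) a GET request for a Splunk REST endpoint."""
    query = urllib.parse.urlencode({"output_mode": "json"})
    url = f"{protocol}://{host}:{port}/{endpoint.lstrip('/')}?{query}"
    # HTTP basic authentication with the credentials from the rule
    token = base64.b64encode(f"{user}:{password}".encode()).decode("ascii")
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", f"Basic {token}")
    return req


# Placeholder host, endpoint and credentials
req = build_request("splunk.example.com", "/services/server/health/splunkd", "monitoring", "secret")
print(req.full_url)
# prints https://splunk.example.com:8089/services/server/health/splunkd?output_mode=json
```

Passing the request to `urllib.request.urlopen` would actually send it; note that many Splunk installations use a self-signed certificate on the management port, so you may need to supply a suitable SSL context.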
Watch the Watcher
Failing IT systems can cost your organization more than money: they can damage your reputation, cost you customers, or even your job. Monitoring and measuring the performance of servers and services in your infrastructure is therefore vital. It's easy to keep an eye on Splunk and its components, since the software offers a REST API that provides access to essential information. If you're already using Checkmk to monitor your infrastructure and services, then you're only a few clicks away from including Splunk.
If you have any questions or would like to share your tips and tricks, please leave a comment or discuss this topic in our Checkmk Forum!