Linux is more sophisticated than you think
My experience with monitoring memory under Linux goes back a long way. It all started in 1992, when Linux was still installed from floppy disks, couldn't handle TCP/IP, and its version number started with 0.99. In fact, memory was the reason I became aware of Linux in the first place: while programming my first major project in C++ (a game played by physical mail!), I had hit the limits of RAM under MS-DOS, and Linux featured a modern memory architecture with a linear address space that could be extended virtually indefinitely with swap. Simply programming without thinking about memory. Ingenious!
The first tool for monitoring memory was the command free, which almost every Linux user knows. It shows the current "consumption" of RAM and swap:
               total        used        free      shared  buff/cache   available
Mem:        16331712     2397504     9976772      222864     3957436    13369088
Swap:       32097148      553420    31543728
Of course the numbers weren't that big back then (note that free reports kB here, not bytes), and my computer didn't have 16 GB, just 4 MB! But even then the principle was the same. And even then, the fear of heavy swap usage and the associated loss of performance was great. For this reason, every decent monitoring system has a check for RAM usage as well as one for swap.
At least that's what I thought in the beginning. Until I took a closer look at it during Checkmk's development almost 20 years later! Of course, it was pretty clear that buffers and caches can be counted as free memory, and that is what (almost) all monitoring systems do. But I wanted to know more, so I thoroughly researched the meaning of every entry in /proc/meminfo, the file in which the Linux kernel reports the exact current state of its memory management. There is a lot of information here, much more than free shows. In some cases I had to venture into the Linux source code in order to understand the connections exactly.
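As a sketch of what working with this file looks like in practice, here is a minimal Python parser for /proc/meminfo-style output. The sample lines are a shortened, invented excerpt; the field names and the kB unit are the kernel's real format:

```python
# Minimal sketch: parse /proc/meminfo-style text into a dict.
# Most fields are reported in kB, exactly as the kernel prints them.

SAMPLE = """MemTotal:       16331712 kB
MemFree:         9976772 kB
Buffers:          313788 kB
Cached:          3643648 kB
Dirty:               492 kB
"""

def parse_meminfo(text):
    """Return {field: integer value} for each 'Key: value [kB]' line."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info

info = parse_meminfo(SAMPLE)
print(info["MemTotal"], info["Dirty"])  # 16331712 492
```

On a real system you would pass the contents of /proc/meminfo instead of SAMPLE.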
I came to surprising results which, you could probably say, shook the foundations of my view of the world:
- The memory management of Linux is far more ingenious and sophisticated than I thought. The words "free" and "occupied" don't do justice to what actually happens.
- Looking at swap and RAM separately makes no sense at all.
- Even the obvious idea of counting buffers/caches as free is not necessarily correct!
- Many important parameters are not shown by free at all, but they can be absolutely critical.
- Checkmk's Linux memory check needed to be completely reworked.
After a few days of work we were done: Checkmk got the (in my opinion) best, most accurate and above all most technically "correct" Linux memory check imaginable. But this led to a new, much bigger problem: explaining it! The check plugin now worked so correctly that many users were surprised by, and suspicious of, the results. After all, how can a RAM threshold above 100% make sense?
What's more valuable? Processes or Cache?
Let's take a closer look at this. Let's say our server has 64 GB RAM and just as much swap. That makes a total of 128 GB of maximum memory. And let's just forget the fact that the kernel itself needs some memory.
And now let's assume we have a bunch of application processes that happen to need exactly 64 GB of RAM. That sounds wonderful, because no swap space should be needed, right? Here comes the first surprise! Linux is cheeky and moves parts of the processes out to the swap area anyway. Why? The kernel would like some memory for caches. Caching is not just a nice second use for otherwise empty memory (as I used to think), but absolutely critical for a high-performance system. If every file really had to be fetched from disk over and over again, the overall effect would be much worse than having a few unimportant parts of processes end up in swap.
The following graphic shows the development of different caches of a server over a week. Of course, most of the space is used for file contents. But also caches for the file system structure (directories, file names etc.) take up an immense amount of space once a day (up to 17.78 GB). I didn't investigate this further, but it could be that a backup is always running at that time.
Linux therefore swaps out parts of processes much earlier, not only when memory becomes scarce. And the extent of this depends on external influences. If a large number of different files have been read, e.g. during a data backup, the cache is bloated and processes are increasingly moved to swap. Once the backup is over, the processes remain in swap for some time, even if there is space in RAM again. Because why waste valuable disk IO bandwidth on data you might never need?
What does this mean for meaningful monitoring? If you look at RAM and swap separately, you will notice that after the data backup more RAM is free than before, and more swap is occupied. But in reality nothing has changed. So if you have two separate checks, both show a bend in their curves in opposite directions, which leads to wrong conclusions and, above all, to false alarms.
It is better to consider the sum of occupied RAM and Swap. What does this sum mean? It is nothing more than the current total memory consumption of all processes - regardless of the type of memory in which the data is currently stored. This sum and nothing else is relevant for the performance of the system.
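This sum is easy to compute from /proc/meminfo fields. A sketch, assuming the usual convention of subtracting buffers and cache from used RAM; the field names are real meminfo keys, and the sample numbers (in kB) match the free output above:

```python
# Sketch: total memory consumption of all processes
#   = used RAM (without buffers/cache) + used swap.
# Field names are standard /proc/meminfo keys.

def total_consumption_kb(info):
    ram_used = (info["MemTotal"] - info["MemFree"]
                - info["Buffers"] - info["Cached"])
    swap_used = info["SwapTotal"] - info["SwapFree"]
    return ram_used + swap_used

sample = {
    "MemTotal": 16331712, "MemFree": 9976772,
    "Buffers": 313788, "Cached": 3643648,
    "SwapTotal": 32097148, "SwapFree": 31543728,
}
print(total_consumption_kb(sample))  # 2950924
```

After a backup, ram_used goes down and swap_used goes up by roughly the same amount, so this sum stays flat, which is exactly the point.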
A threshold above 100%?
Now, of course, the question arises of a sensible threshold for an alarm. Absolute thresholds such as 64 GB are very impractical if you monitor a large number of different servers. But what should a relative value in percent refer to? From my point of view, it makes the most sense to relate this value exclusively to the RAM. A threshold of 150% then suddenly makes sense! It means that the processes may consume up to 50% more memory than there is physical RAM. This ensures that the majority of the processes can still be held in RAM, even when the caches are taken into account.
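The check logic behind this is simple arithmetic. A sketch, using the 150% threshold from the example above (an illustration, not an official Checkmk default):

```python
# Sketch: express combined RAM+swap consumption relative to physical
# RAM only. The 150% threshold is the example value from the text.

def consumption_percent(used_kb, mem_total_kb):
    return 100.0 * used_kb / mem_total_kb

mem_total = 64 * 1024**2   # 64 GB of RAM, in kB
used = 96 * 1024**2        # processes consume 96 GB across RAM+swap

pct = consumption_percent(used, mem_total)
print(pct, pct > 150.0)  # 150.0 False
```

A value of 150.0 sits exactly at the threshold: the processes use one and a half times the physical RAM.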
Further interesting memory values
If you have ever looked at /proc/meminfo, you were surely surprised how many values there are and what information can be found beyond RAM and swap. A few of them are quite relevant. I would like to mention two of them, both of which have caused me trouble:
Dirty (Filesystem Writeback)
The 'dirty' value includes blocks of files that have been modified by processes but not yet written to disk. Linux usually waits up to 30 seconds to write such blocks, hoping to be able to efficiently merge further changes into the same block. In a healthy, albeit heavily stressed system, this is what it looks like:
The individual peaks show situations where a large number of new files were created or changed in one sweep. This is no cause for concern, because the data was written to disk very quickly. However, if you have a situation where there is a permanent jam, it usually means a bottleneck in the disk IO. The system is not able to write the data back to disk in time.
There is a good test for this: go to the command line and type the command sync. This immediately cuts short the 30-second wait and writes all outstanding data to disk. It should take only a few seconds. If the command takes longer, there is cause for concern: important modified data exists only in RAM and is not making it to disk. It could also indicate a hardware defect in the disk subsystem. If sync takes several hours, you need to ring the alarm bells; urgent action is required.
Monitoring the dirty disk blocks can detect such situations and alert you in time.
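Such a check could look roughly like this. The Dirty and Writeback fields are real /proc/meminfo keys; the thresholds are invented for illustration, not Checkmk's actual defaults:

```python
# Hedged sketch of a dirty-blocks check: warn when pending writeback
# data stays high. Thresholds (in MB) are invented for illustration.

def check_dirty(info, warn_mb=300, crit_mb=600):
    """Return (state, message) in the usual 0=OK/1=WARN/2=CRIT scheme."""
    # Dirty: modified pages not yet queued; Writeback: being written now.
    pending_mb = (info.get("Dirty", 0) + info.get("Writeback", 0)) / 1024.0
    if pending_mb >= crit_mb:
        return 2, f"CRIT - {pending_mb:.1f} MB pending writeback"
    if pending_mb >= warn_mb:
        return 1, f"WARN - {pending_mb:.1f} MB pending writeback"
    return 0, f"OK - {pending_mb:.1f} MB pending writeback"

print(check_dirty({"Dirty": 492, "Writeback": 0}))
```

Short peaks are normal; it is a persistently high value that signals a disk IO bottleneck, so in practice you would also want to look at the trend, not just the momentary number.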
Pagetables
Every process under Linux has a table that maps virtual memory addresses to physical ones. This table also records, for example, where something has been stored in swap. It goes without saying that these pagetables must be kept in RAM and cannot themselves be swapped out. The following graph shows a server where everything is absolutely within limits:
Of course, a maximum of 240 MB is a lot of memory just for these tables, considering that my first computer had only 4 MB in total. However, with 64 GB of RAM this doesn't hurt.
In my time as a Linux consultant I had a customer who was running many Oracle databases on a large server. Oracle's architecture allows many processes to handle requests to the database in parallel, communicating with it via shared memory. The customer had many such processes active. Now, the RAM for shared memory is really only needed once, but each process needs its own pagetable, and the shared memory is mapped again in each of these tables. The tables added up to more than 50% of the RAM in the system. The confusing thing was that this memory was not visible in the processes in top or similar tools. Somehow the memory was gone, and nobody knew where it went!
The solution was then very simple: with Linux you can activate so-called huge pages. A page then no longer covers 4 kB but, for example, 2 MB, and the tables become much smaller. But you have to find the root cause first. That is why checking the pagetables is also crucial for good monitoring.
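A back-of-the-envelope calculation shows the effect. The 8-byte entry size and the 4 KiB/2 MiB page sizes are typical for x86-64; the amount of shared memory and the process count are hypothetical numbers chosen to resemble the Oracle scenario:

```python
# Back-of-the-envelope: pagetable overhead when the same shared memory
# is mapped into many processes. 8-byte entries and 4 KiB vs 2 MiB
# pages are typical x86-64 values; the workload numbers are invented.

PTE_SIZE = 8  # bytes per page table entry on x86-64

def pagetable_bytes(mapped_bytes, page_size, n_processes):
    entries_per_process = mapped_bytes // page_size
    return entries_per_process * PTE_SIZE * n_processes

shm = 48 * 1024**3   # 48 GiB of shared memory (hypothetical)
procs = 500          # number of database processes (hypothetical)

small = pagetable_bytes(shm, 4 * 1024, procs)     # normal 4 KiB pages
huge = pagetable_bytes(shm, 2 * 1024**2, procs)   # 2 MiB huge pages
print(small // 1024**2, "MiB vs", huge // 1024**2, "MiB")  # 48000 MiB vs 93 MiB
```

With normal pages the tables alone eat tens of gigabytes, which matches the "more than 50% of RAM" symptom; huge pages shrink them by a factor of 512.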
Memory monitoring is more than just a simple threshold for "used" RAM. Since memory management works very differently in each operating system, good monitoring has to know its peculiarities and take them into account.