Linux RSS discrepancy

TL;DR: Before kernel 6.2 you can get a difference of up to 64 * 4096 bytes per thread between real memory usage and the value reported by ps, top, and other utilities.

This story is based on information gathered while investigating memory fragmentation in a Java application. I hope to finish that investigation soon and write an article about it.

Short history

While investigating the memory usage of a Java application, our first piece of evidence was the difference between the memory usage values reported by pmap -X PID and ps -q PID -o rss.

After this, I found some posts related to the issue. Here is the most interesting one: https://tbrindus.ca/sometimes-the-kernel-lies-about-process-memory-usage/ It unveils a lot of details about what is going on.

The Linux kernel caches the RSS value per thread

Basically, it means that when you read the Resident Set Size (RSS) value from ps, top, or any other utility, they all get their memory information from /proc/PID/stat. If you are interested in the details of the file format, you can read about it in the proc(5) documentation.

Documentation for the rss field:

(24) rss %ld Resident Set Size: number of pages the process has in real memory. This is just the pages which count toward text, data, or stack space. This does not include pages which have not been demand-loaded in, or which are swapped out. This value is inaccurate; see /proc/pid/statm below.
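For illustration, here is a minimal sketch of mine (not part of the original investigation) that pulls this field out for the current process; the 4 KiB page size is an assumption, the portable value comes from sysconf(_SC_PAGESIZE):

use std::fs;

fn main() -> std::io::Result<()> {
    // /proc/self/stat is a single line; field (2), comm, may contain spaces
    // and parentheses, so skip past the last ')' before splitting.
    let stat = fs::read_to_string("/proc/self/stat")?;
    let after_comm = &stat[stat.rfind(')').unwrap() + 1..];
    let fields: Vec<&str> = after_comm.split_whitespace().collect();

    // Field (24) is rss, counted in pages; after dropping pid and comm the
    // slice starts at field (3), so rss sits at index 24 - 3 = 21.
    let rss_pages: u64 = fields[21].parse().unwrap();

    // Assumption: 4 KiB pages.
    let page_size: u64 = 4096;
    println!("stat RSS: {} pages = {} bytes", rss_pages, rss_pages * page_size);
    Ok(())
}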

From the /proc/pid/statm documentation, we can read this:

Some of these values are inaccurate because of a kernel-internal scalability optimization. If accurate values are required, use /proc/pid/smaps or /proc/pid/smaps_rollup instead, which are much slower but provide accurate, detailed information.

 ~> cat /proc/(pid)/smaps_rollup
00010000-7ffd5c1ec000 ---p 00000000 00:00 0                              [rollup]
Rss:             4084624 kB
Pss:             4067545 kB
Shared_Clean:      28912 kB
Shared_Dirty:          0 kB
Private_Clean:      3320 kB
Private_Dirty:   4052392 kB
Referenced:      4084624 kB
Anonymous:       4052380 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB

Going deeper

This implementation is present in kernels before 6.2. The difference between the real RSS and the "performance optimised" RSS can reach up to 63 * 4096 = 258048 bytes, or 252 KiB, per thread with PAGE_SIZE = 4096.

TASK_RSS_EVENTS_THRESH can be updated only at compile time, so if you want to have a different value, you must recompile the kernel.

SPLIT_RSS_COUNTING is used because of USE_SPLIT_PTE_PTLOCKS:

#if defined(SPLIT_RSS_COUNTING)
...
/* sync counter once per 64 page faults */
#define TASK_RSS_EVENTS_THRESH  (64)
...
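/*
 * Body of check_sync_rss_stat(): once enough page faults have been counted
 * in this task's private rss_stat cache, flush it into the mm-wide counters
 * that /proc/PID/stat reads.
 */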
if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
    sync_mm_rss(task->mm);
}
#else /* SPLIT_RSS_COUNTING */

You can have a look at the real code behind the excerpt above here: https://elixir.bootlin.com/linux/v6.1.46/source/mm/memory.c#L203

Kernel 6.2

Starting from kernel version 6.2 (released 19 February 2023), there are no per-thread counters. Now the performance optimised RSS is brought in line with the real value roughly every nr_cpus ^ 2 page faults. As you can see, this bound does not depend on the number of threads, so it should be more consistent and should not grow as the process gains more threads. Link to the patch discussion: https://lkml.kernel.org/r/[email protected]
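To get a feel for the difference between the two schemes, here is a small back-of-the-envelope sketch of mine; it simply plugs the per-thread threshold of 64 and the nr_cpus ^ 2 batching described above into some arithmetic, and the 512-thread figure is just an assumption to match the experiment later in this post:

use std::thread;

fn main() {
    const PAGE_SIZE: u64 = 4096;            // assumption: 4 KiB pages
    const TASK_RSS_EVENTS_THRESH: u64 = 64; // pre-6.2 per-thread sync threshold

    let cpus = thread::available_parallelism()
        .map(|n| n.get() as u64)
        .unwrap_or(1);
    let threads: u64 = 512; // assumption, to match the experiment below

    // Before 6.2: every thread can keep up to ~64 faulted pages unsynced.
    let old_drift = threads * TASK_RSS_EVENTS_THRESH * PAGE_SIZE;

    // From 6.2: counters are synced roughly every nr_cpus^2 page faults,
    // independent of the number of threads.
    let new_sync_interval = cpus * cpus;

    println!("before 6.2: up to ~{} bytes of drift with {} threads", old_drift, threads);
    println!("6.2+: a sync roughly every {} page faults on {} CPUs", new_sync_interval, cpus);
}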

Some people found performance regressions with the new changes, so after some time this piece of memory management code may change yet again.

What to do?

To get an accurate memory usage value, read /proc/PID/smaps_rollup, or read /proc/PID/smaps and sum all of its Rss values. Short documentation about smaps_rollup: https://www.kernel.org/doc/Documentation/ABI/testing/procfs-smaps_rollup
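For example, here is a minimal sketch (mine, not from the kernel documentation) that sums the Rss: lines of either file; smaps_rollup has a single pre-summed line, smaps one line per mapping:

use std::env;
use std::fs;

// Sum all "Rss:" lines of an smaps-style file, in kB.
fn rss_kib(path: &str) -> u64 {
    fs::read_to_string(path)
        .unwrap_or_default()
        .lines()
        .filter(|l| l.starts_with("Rss:"))
        .filter_map(|l| l.split_whitespace().nth(1))
        .filter_map(|v| v.parse::<u64>().ok())
        .sum()
}

fn main() {
    // PID as the first argument, defaulting to the current process.
    let pid = env::args().nth(1).unwrap_or_else(|| "self".to_string());
    println!("smaps_rollup Rss: {} kB", rss_kib(&format!("/proc/{}/smaps_rollup", pid)));
    println!("smaps summed Rss: {} kB", rss_kib(&format!("/proc/{}/smaps", pid)));
}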

Several places note that reading smaps can be much slower than statm, so let's measure the difference:

~> # this is for fish shell
~> for i in maps stat statm status smaps smaps_rollup; echo $i; bash -c "time cat /proc/$ttt/$i > /dev/null" ; end
maps

real    0m0.001s
user    0m0.001s
sys     0m0.001s
stat

real    0m0.004s
user    0m0.000s
sys     0m0.001s
statm

real    0m0.001s
user    0m0.000s
sys     0m0.001s
status

real    0m0.001s
user    0m0.001s
sys     0m0.000s
smaps

real    0m0.115s
user    0m0.000s
sys     0m0.077s
smaps_rollup

real    0m0.127s
user    0m0.000s
sys     0m0.088s

Indeed, reading smaps can have an impact on performance; around 100 ms is a pretty high latency. If you want more details about why it can be that slow, here is the source code for the function that generates the smaps_rollup output: https://elixir.bootlin.com/linux/v6.1.46/source/fs/proc/task_mmu.c#L877

The worst case?

I read this discussion about the topic: https://lore.kernel.org/linux-man/[email protected]/T/#m777c32932711d629353b3bb000695f8f6325fdc2. And I thought: how about simulating an edge case where we get a massive RSS difference, just for the sake of an example?

After a couple of hours with Rust, I ended up with this small piece of hacky code: https://gist.github.com/libbkmz/bf23d3f5a48290e4d3dc7ac8dd072dd2

If you run it on a kernel older than 6.2 with these parameters: rust-script src/main.rs -t 512 -i 63, you should get numbers similar to mine:

/proc/self/stat RSS: 20_123_648
/proc/self/smaps_rollup RSS: 154_882_048
diff: 134_758_400

The optimised counter shows that the process has about 20 MB, but in reality it consumed more than 150 MB... A 134 MB difference across 512 threads means 134758400 / 512 / 4096 = 64.26 pages per thread.

You can play around with different parameters and check what is possible.
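For completeness, here is a rough sketch of the idea behind such a reproduction (my reconstruction, not the linked gist): each thread faults in slightly fewer pages than the sync threshold and then blocks on a barrier, so on a pre-6.2 kernel its cached counters are never flushed back, and the main thread compares the two RSS sources. The 512 threads and 63 pages per thread mirror the -t 512 -i 63 run above:

use std::fs;
use std::sync::{Arc, Barrier};
use std::thread;

const PAGE_SIZE: usize = 4096;       // assumption: 4 KiB pages
const THREADS: usize = 512;
const PAGES_PER_THREAD: usize = 63;  // stay below the per-thread sync threshold

// RSS in bytes according to /proc/self/stat (field 24, in pages).
fn stat_rss() -> u64 {
    let stat = fs::read_to_string("/proc/self/stat").unwrap();
    let rest = &stat[stat.rfind(')').unwrap() + 1..];
    let pages: u64 = rest.split_whitespace().nth(21).unwrap().parse().unwrap();
    pages * PAGE_SIZE as u64
}

// RSS in bytes according to /proc/self/smaps_rollup.
fn rollup_rss() -> u64 {
    fs::read_to_string("/proc/self/smaps_rollup")
        .unwrap()
        .lines()
        .find(|l| l.starts_with("Rss:"))
        .and_then(|l| l.split_whitespace().nth(1))
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap()
        * 1024
}

fn main() {
    let touched = Arc::new(Barrier::new(THREADS + 1));
    let done = Arc::new(Barrier::new(THREADS + 1));

    let mut handles = Vec::new();
    for _ in 0..THREADS {
        let touched = Arc::clone(&touched);
        let done = Arc::clone(&done);
        handles.push(thread::spawn(move || {
            // Fault in PAGES_PER_THREAD pages; on pre-6.2 kernels they are
            // counted in this thread's rss_stat cache and not yet synced.
            let mut buf = vec![0u8; PAGES_PER_THREAD * PAGE_SIZE];
            for i in (0..buf.len()).step_by(PAGE_SIZE) {
                buf[i] = 1;
            }
            touched.wait(); // signal that all pages are touched
            done.wait();    // stay alive (exiting would sync the counters)
            drop(buf);
        }));
    }

    touched.wait(); // wait until every thread has faulted its pages
    let fast = stat_rss();
    let accurate = rollup_rss();
    println!("/proc/self/stat RSS:         {}", fast);
    println!("/proc/self/smaps_rollup RSS: {}", accurate);
    println!("diff:                        {}", accurate.saturating_sub(fast));
    done.wait();

    for h in handles {
        h.join().unwrap();
    }
}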


Category: misc

Tags: linux kernel memory rust