This document describes the Linux memory management "Unevictable LRU"
infrastructure and the use of this infrastructure to manage several types
of "unevictable" pages.  The document attempts to provide the overall
rationale behind this mechanism and the rationale for some of the design
decisions that drove the implementation.  The latter design rationale is
discussed in the context of an implementation description.  Admittedly, one
can obtain the implementation details--the "what does it do?"--by reading
the code.  One hopes that the descriptions below add value by providing the
answer to "why does it do that?".


Unevictable LRU Infrastructure:

The Unevictable LRU adds an additional LRU list to track unevictable pages
and to hide these pages from vmscan.  This mechanism is based on a patch by
Larry Woodman of Red Hat to address several scalability problems with page
reclaim in Linux.  The problems have been observed at customer sites on
large memory x86_64 systems.  For example, a non-NUMA x86_64 platform with
128GB of main memory will have over 32 million 4k pages in a single zone.
When a large fraction of these pages are not evictable for any reason [see
below], vmscan will spend a lot of time scanning the LRU lists looking for
the small fraction of pages that are evictable.  This can result in a
situation where all CPUs are spending 100% of their time in vmscan for
hours or days on end, with the system completely unresponsive.

The Unevictable LRU infrastructure addresses the following classes of
unevictable pages:

+ pages owned by ramfs
+ pages mapped into SHM_LOCKed shared memory regions
+ pages mapped into VM_LOCKED [mlock()ed] vmas

The infrastructure might be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future.


The Unevictable LRU List

The Unevictable LRU infrastructure consists of an additional, per-zone, LRU
list called the "unevictable" list and an associated page flag,
PG_unevictable, to indicate that the page is being managed on the
unevictable list.  The PG_unevictable flag is analogous to, and mutually
exclusive with, the PG_active flag in that it indicates on which LRU list a
page resides when PG_lru is set.  The unevictable LRU list is source
configurable based on the UNEVICTABLE_LRU Kconfig option.

The Unevictable LRU infrastructure maintains unevictable pages on an
additional LRU list for a few reasons:

1) We get to "treat unevictable pages just like we treat other pages in the
   system, which means we get to use the same code to manipulate them, the
   same code to isolate them (for migrate, etc.), the same code to keep
   track of the statistics, etc..."  [Rik van Riel]

2) We want to be able to migrate unevictable pages between nodes--for
   memory defragmentation, workload management and memory hotplug.  The
   Linux kernel can only migrate pages that it can successfully isolate
   from the LRU lists.  If we were to maintain pages elsewhere than on an
   LRU-like list, where they can be found by isolate_lru_page(), we would
   prevent their migration, unless we reworked the migration code to find
   the unevictable pages.

The unevictable LRU list does not differentiate between file backed and
swap backed [anon] pages.  This differentiation is only important while the
pages are, in fact, evictable.

The unevictable LRU list benefits from the "arrayification" of the per-zone
LRU lists and statistics originally proposed and posted by Christoph
Lameter.
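To make the list selection described above concrete, here is a minimal
user-space C sketch (not kernel code).  The names used here, enum lru_list,
struct page_flags_model and page_lru_index(), are invented for
illustration; only the selection logic follows the text: PG_unevictable
alone names the list a page lives on, while the active and file/anon
distinctions matter only for evictable pages.

#include <stdbool.h>
#include <stdio.h>

/* One slot per per-zone LRU list in the "arrayified" layout. */
enum lru_list {
	LRU_INACTIVE_ANON,
	LRU_ACTIVE_ANON,
	LRU_INACTIVE_FILE,
	LRU_ACTIVE_FILE,
	LRU_UNEVICTABLE,	/* the additional list described above */
	NR_LRU_LISTS
};

/* Illustrative stand-in for the page flags discussed in the text. */
struct page_flags_model {
	bool lru;		/* PG_lru: page is on one of the LRU lists */
	bool active;		/* PG_active: on an active list            */
	bool unevictable;	/* PG_unevictable: on the unevictable list */
	bool file_backed;	/* file backed vs. swap backed [anon]      */
};

/*
 * PG_unevictable is mutually exclusive with PG_active: when PG_lru is set,
 * it alone says which list the page lives on; the file/anon and
 * active/inactive split only matters for evictable pages.
 */
static enum lru_list page_lru_index(const struct page_flags_model *p)
{
	if (p->unevictable)
		return LRU_UNEVICTABLE;
	if (p->file_backed)
		return p->active ? LRU_ACTIVE_FILE : LRU_INACTIVE_FILE;
	return p->active ? LRU_ACTIVE_ANON : LRU_INACTIVE_ANON;
}

int main(void)
{
	struct page_flags_model mlocked_page = {
		.lru = true, .unevictable = true, .file_backed = false,
	};
	struct page_flags_model mapped_file = {
		.lru = true, .active = true, .file_backed = true,
	};

	printf("mlocked page -> list %d (LRU_UNEVICTABLE)\n",
	       page_lru_index(&mlocked_page));
	printf("active file  -> list %d (LRU_ACTIVE_FILE)\n",
	       page_lru_index(&mapped_file));
	return 0;
}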
The unevictable list does not use the lru pagevec mechanism.  Rather,
unevictable pages are placed directly on the page's zone's unevictable list
under the zone lru_lock.  The reason for this is to prevent stranding of
pages on the unevictable list when one task has the page isolated from the
LRU and other tasks are changing the "evictability" state of the page.


Unevictable LRU and Memory Controller Interaction

The memory controller data structure automatically gets a per-zone
unevictable LRU list as a result of the "arrayification" of the per-zone
LRU lists.  The memory controller tracks the movement of pages to and from
the unevictable list.  When a memory control group comes under memory
pressure, the controller will not attempt to reclaim pages on the
unevictable list.  This has a couple of effects.  Because the pages are
"hidden" from reclaim on the unevictable list, the reclaim process can be
more efficient, dealing only with pages that have a chance of being
reclaimed.  On the other hand, if too many of the pages charged to the
control group are unevictable, the evictable portion of the working set of
the tasks in the control group may not fit into the available memory.  This
can cause the control group to thrash or to oom-kill tasks.


Unevictable LRU: Detecting Unevictable Pages

The function page_evictable(page, vma) in vmscan.c determines whether a
page is evictable or not.  For ramfs pages and pages in SHM_LOCKed regions,
page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the
page's address space using a wrapper function.  Wrapper functions are used
to set, clear and test the flag to reduce the requirement for #ifdef's
throughout the source code.

AS_UNEVICTABLE is set on the ramfs inode/mapping when it is created.  This
flag remains for the life of the inode.

For shared memory regions, AS_UNEVICTABLE is set when an application
successfully SHM_LOCKs the region and is removed when the region is
SHM_UNLOCKed.  Note that shmctl(SHM_LOCK, ...) does not populate the page
tables for the region as does, for example, mlock().  So, we make no
special effort to push any pages in the SHM_LOCKed region to the
unevictable list.  Vmscan will do this when/if it encounters the pages
during reclaim.  On SHM_UNLOCK, shmctl() scans the pages in the region and
"rescues" them from the unevictable list if no other condition keeps them
unevictable.  If a SHM_LOCKed region is destroyed, the pages are also
"rescued" from the unevictable list in the process of freeing them.

page_evictable() detects mlock()ed pages by testing an additional page
flag, PG_mlocked, via the PageMlocked() wrapper.  If the page is NOT
mlocked, and a non-NULL vma is supplied, page_evictable() will check
whether the vma is VM_LOCKED via is_mlocked_vma().  is_mlocked_vma() will
SetPageMlocked() and update the appropriate statistics if the vma is
VM_LOCKED.  This method allows efficient "culling" of pages in the fault
path that are being faulted into VM_LOCKED vmas.
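The decision sequence just described can be sketched in user-space C as
follows.  This is a rough illustration and not the kernel implementation:
struct mapping_model, struct vma_model, struct page_model and
page_evictable_sketch() are invented names, and only the three tests
(AS_UNEVICTABLE on the mapping, PG_mlocked on the page, and fault-path
culling of pages faulted into a VM_LOCKED vma) come from the text above.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative stand-ins for the kernel structures discussed above. */
struct mapping_model { bool as_unevictable; };	/* models AS_UNEVICTABLE */
struct vma_model     { bool vm_locked; };	/* models VM_LOCKED      */
struct page_model {
	struct mapping_model *mapping;
	bool mlocked;				/* models PG_mlocked     */
};

static bool page_evictable_sketch(struct page_model *page,
				  struct vma_model *vma)
{
	/* ramfs pages and SHM_LOCKed segments are flagged per mapping */
	if (page->mapping && page->mapping->as_unevictable)
		return false;

	/* page already known to be mlock()ed */
	if (page->mlocked)
		return false;

	/*
	 * Fault-path culling: a page being faulted into a VM_LOCKED vma is
	 * marked mlocked here (the kernel also updates statistics at this
	 * point via is_mlocked_vma()).
	 */
	if (vma && vma->vm_locked) {
		page->mlocked = true;
		return false;
	}

	return true;
}

int main(void)
{
	struct mapping_model shm_locked = { .as_unevictable = true };
	struct page_model shm_page = { .mapping = &shm_locked };
	struct page_model anon_page = { .mapping = NULL };
	struct vma_model locked_vma = { .vm_locked = true };

	printf("SHM_LOCKed page evictable? %d\n",
	       page_evictable_sketch(&shm_page, NULL));
	printf("page in VM_LOCKED vma?     %d\n",
	       page_evictable_sketch(&anon_page, &locked_vma));
	printf("page now marked mlocked?   %d\n", anon_page.mlocked);
	return 0;
}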
Unevictable Pages and Vmscan [shrink_*_list()]

If unevictable pages are culled in the fault path, or moved to the
unevictable list at mlock() or mmap() time, vmscan will never encounter the
pages until they have become evictable again, for example, via munlock(),
and have been "rescued" from the unevictable list.  However, there may be
situations where we decide, for the sake of expediency, to leave an
unevictable page on one of the regular active/inactive LRU lists for vmscan
to deal with.  Vmscan checks for such pages in the shrink_*_list()
functions and will "cull" any that it encounters--that is, it diverts those
pages to the unevictable list of the zone being scanned.


# Monitor the system for dropped packets and produce a report of drop
# locations and counts.
import os
import sys

sys.path.append(os.environ['PERF_EXEC_PATH'] +
                '/scripts/python/Perf-Trace-Util/lib/Perf/Trace')

from perf_trace_context import *
from Core import *
from Util import *

drop_log = {}   # maps drop location (as a string) -> number of drops seen
kallsyms = []   # sorted list of (address, symbol name) from /proc/kallsyms


def get_kallsyms_table():
    global kallsyms

    try:
        f = open("/proc/kallsyms", "r")
    except IOError:
        return

    for line in f:
        loc = int(line.split()[0], 16)
        name = line.split()[2]
        kallsyms.append((loc, name))
    f.close()
    kallsyms.sort()


def get_sym(sloc):
    loc = int(sloc)

    # Binary search for the symbol containing 'loc'.
    # Invariant: kallsyms[i][0] <= loc for all 0 <= i <= start
    #            kallsyms[i][0] >  loc for all end <= i < len(kallsyms)
    start, end = -1, len(kallsyms)
    while end != start + 1:
        pivot = (start + end) // 2
        if loc < kallsyms[pivot][0]:
            end = pivot
        else:
            start = pivot

    # Now (start == -1 or kallsyms[start][0] <= loc)
    # and (start == len(kallsyms) - 1 or loc < kallsyms[start + 1][0])
    if start >= 0:
        symloc, name = kallsyms[start]
        return (name, loc - symloc)
    else:
        return (None, 0)


def print_drop_table():
    print("%25s %25s %25s" % ("LOCATION", "OFFSET", "COUNT"))
    for i in drop_log.keys():
        (sym, off) = get_sym(i)
        if sym is None:
            sym = i
        print("%25s %25s %25s" % (sym, off, drop_log[i]))


def trace_begin():
    print("Starting trace (Ctrl-C to dump results)")


def trace_end():
    print("Gathering kallsyms data")
    get_kallsyms_table()
    print_drop_table()


# Called from perf when it finds a corresponding skb:kfree_skb event.
def skb__kfree_skb(name, context, cpu, sec, nsec, pid, comm, callchain,
                   skbaddr, location, protocol):
    slocation = str(location)
    drop_log[slocation] = drop_log.get(slocation, 0) + 1
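One possible way to drive the script above (a sketch under assumptions,
since the text does not say how the script is named or invoked): save it
as, say, dropmonitor.py, record the skb:kfree_skb tracepoint system-wide
with something like "perf record -e skb:kfree_skb -a sleep 10", and then
replay the recorded trace through the script with
"perf script -s dropmonitor.py".  perf then calls trace_begin(),
skb__kfree_skb() once per event, and trace_end(), which resolves each drop
location against /proc/kallsyms and prints the LOCATION / OFFSET / COUNT
table.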