author		Glenn Elliott <gelliott@cs.unc.edu>	2012-03-04 19:47:13 -0500
committer	Glenn Elliott <gelliott@cs.unc.edu>	2012-03-04 19:47:13 -0500
commit		c71c03bda1e86c9d5198c5d83f712e695c4f2a1e (patch)
tree		ecb166cb3e2b7e2adb3b5e292245fefd23381ac8 /Documentation/vm
parent		ea53c912f8a86a8567697115b6a0d8152beee5c8 (diff)
parent		6a00f206debf8a5c8899055726ad127dbeeed098 (diff)
Merge branch 'mpi-master' into wip-k-fmlp
Conflicts: litmus/sched_cedf.c
Diffstat (limited to 'Documentation/vm')
-rw-r--r--  Documentation/vm/Makefile                |    2
-rw-r--r--  Documentation/vm/active_mm.txt           |    2
-rw-r--r--  Documentation/vm/cleancache.txt          |  278
-rw-r--r--  Documentation/vm/highmem.txt             |  162
-rw-r--r--  Documentation/vm/hugetlbpage.txt         |    2
-rw-r--r--  Documentation/vm/hwpoison.txt            |    6
-rw-r--r--  Documentation/vm/locking                 |    2
-rw-r--r--  Documentation/vm/numa_memory_policy.txt  |    2
-rw-r--r--  Documentation/vm/overcommit-accounting   |    2
-rw-r--r--  Documentation/vm/page-types.c            |  105
-rw-r--r--  Documentation/vm/slabinfo.c              | 1364
-rw-r--r--  Documentation/vm/transhuge.txt           |  298
-rw-r--r--  Documentation/vm/unevictable-lru.txt     |    3
13 files changed, 849 insertions(+), 1379 deletions(-)
diff --git a/Documentation/vm/Makefile b/Documentation/vm/Makefile
index 9dcff328b964..3fa4d0668864 100644
--- a/Documentation/vm/Makefile
+++ b/Documentation/vm/Makefile
@@ -2,7 +2,7 @@
 obj- := dummy.o
 
 # List of programs to build
-hostprogs-y := slabinfo page-types hugepage-mmap hugepage-shm map_hugetlb
+hostprogs-y := page-types hugepage-mmap hugepage-shm map_hugetlb
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
diff --git a/Documentation/vm/active_mm.txt b/Documentation/vm/active_mm.txt
index 4ee1f643d897..dbf45817405f 100644
--- a/Documentation/vm/active_mm.txt
+++ b/Documentation/vm/active_mm.txt
@@ -74,7 +74,7 @@ we have a user context", and is generally done by the page fault handler
 and things like that).
 
 Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago,
-because it slightly changes the interfaces to accomodate the alpha (who
+because it slightly changes the interfaces to accommodate the alpha (who
 would have thought it, but the alpha actually ends up having one of the
 ugliest context switch codes - unlike the other architectures where the MM
 and register state is separate, the alpha PALcode joins the two, and you
diff --git a/Documentation/vm/cleancache.txt b/Documentation/vm/cleancache.txt
new file mode 100644
index 000000000000..36c367c73084
--- /dev/null
+++ b/Documentation/vm/cleancache.txt
@@ -0,0 +1,278 @@
1MOTIVATION
2
3Cleancache is a new optional feature provided by the VFS layer that
4potentially dramatically increases page cache effectiveness for
5many workloads in many environments at a negligible cost.
6
7Cleancache can be thought of as a page-granularity victim cache for clean
8pages that the kernel's pageframe replacement algorithm (PFRA) would like
9to keep around, but can't since there isn't enough memory. So when the
10PFRA "evicts" a page, it first attempts to use cleancache code to
11put the data contained in that page into "transcendent memory", memory
12that is not directly accessible or addressable by the kernel and is
13of unknown and possibly time-varying size.
14
15Later, when a cleancache-enabled filesystem wishes to access a page
16in a file on disk, it first checks cleancache to see if it already
17contains it; if it does, the page of data is copied into the kernel
18and a disk access is avoided.
19
20Transcendent memory "drivers" for cleancache are currently implemented
21in Xen (using hypervisor memory) and zcache (using in-kernel compressed
22memory) and other implementations are in development.
23
24FAQs are included below.
25
26IMPLEMENTATION OVERVIEW
27
28A cleancache "backend" that provides transcendent memory registers itself
29to the kernel's cleancache "frontend" by calling cleancache_register_ops,
30passing a pointer to a cleancache_ops structure with funcs set appropriately.
31Note that cleancache_register_ops returns the previous settings so that
32chaining can be performed if desired. The functions provided must conform to
33certain semantics as follows:
34
35Most important, cleancache is "ephemeral". Pages which are copied into
36cleancache have an indefinite lifetime which is completely unknowable
37by the kernel and so may or may not still be in cleancache at any later time.
38Thus, as its name implies, cleancache is not suitable for dirty pages.
39Cleancache has complete discretion over what pages to preserve and what
40pages to discard and when.
41
42Mounting a cleancache-enabled filesystem should call "init_fs" to obtain a
43pool id which, if positive, must be saved in the filesystem's superblock;
44a negative return value indicates failure. A "put_page" will copy a
45(presumably about-to-be-evicted) page into cleancache and associate it with
46the pool id, a file key, and a page index into the file. (The combination
47of a pool id, a file key, and an index is sometimes called a "handle".)
48A "get_page" will copy the page, if found, from cleancache into kernel memory.
49A "flush_page" will ensure the page no longer is present in cleancache;
50a "flush_inode" will flush all pages associated with the specified file;
51and, when a filesystem is unmounted, a "flush_fs" will flush all pages in
52all files specified by the given pool id and also surrender the pool id.
53
54An "init_shared_fs", like init_fs, obtains a pool id but tells cleancache
55to treat the pool as shared using a 128-bit UUID as a key. On systems
56that may run multiple kernels (such as hard partitioned or virtualized
57systems) that may share a clustered filesystem, and where cleancache
58may be shared among those kernels, calls to init_shared_fs that specify the
59same UUID will receive the same pool id, thus allowing the pages to
60be shared. Note that any security requirements must be imposed outside
61of the kernel (e.g. by "tools" that control cleancache). Or a
62cleancache implementation can simply disable shared_init by always
63returning a negative value.
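
As an illustration only (not code from this patch), a backend's registration
might look roughly like the sketch below.  The operation names follow the text
above; the exact field types and the my_* callbacks are assumptions -- the
authoritative prototypes live in include/linux/cleancache.h.

	/* Hedged sketch of a cleancache backend registering its ops. */
	static struct cleancache_ops my_backend_ops = {
		.init_fs	= my_init_fs,		/* returns pool id, negative on failure */
		.init_shared_fs	= my_init_shared_fs,	/* 128-bit UUID keyed pools */
		.get_page	= my_get_page,
		.put_page	= my_put_page,
		.flush_page	= my_flush_page,
		.flush_inode	= my_flush_inode,
		.flush_fs	= my_flush_fs,
	};

	static int __init my_backend_init(void)
	{
		/* The previous ops are returned so backends can be chained. */
		struct cleancache_ops old_ops =
			cleancache_register_ops(&my_backend_ops);

		(void)old_ops;	/* a chaining backend would save and call these */
		return 0;
	}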
64
65If a get_page is successful on a non-shared pool, the page is flushed (thus
66making cleancache an "exclusive" cache). On a shared pool, the page
67is NOT flushed on a successful get_page so that it remains accessible to
68other sharers. The kernel is responsible for ensuring coherency between
69cleancache (shared or not), the page cache, and the filesystem, using
70cleancache flush operations as required.
71
72Note that cleancache must enforce put-put-get coherency and get-get
73coherency. For the former, if two puts are made to the same handle but
74with different data, say AAA by the first put and BBB by the second, a
75subsequent get can never return the stale data (AAA). For get-get coherency,
76if a get for a given handle fails, subsequent gets for that handle will
77never succeed unless preceded by a successful put with that handle.
78
79Last, cleancache provides no SMP serialization guarantees; if two
80different Linux threads are simultaneously putting and flushing a page
81with the same handle, the results are indeterminate. Callers must
82lock the page to ensure serial behavior.
83
84CLEANCACHE PERFORMANCE METRICS
85
86Cleancache monitoring is done by sysfs files in the
87/sys/kernel/mm/cleancache directory. The effectiveness of cleancache
88can be measured (across all filesystems) with:
89
90succ_gets - number of gets that were successful
91failed_gets - number of gets that failed
92puts - number of puts attempted (all "succeed")
93flushes - number of flushes attempted
94
95A backend implementation may provide additional metrics.
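
For example, the hit rate can be computed from these counters with a small
userspace program in the style of page-types.c (a sketch; the sysfs paths are
exactly the ones listed above):

	#include <stdio.h>

	/* Read one unsigned counter from a sysfs file; 0 if unreadable. */
	static unsigned long long read_counter(const char *path)
	{
		unsigned long long v = 0;
		FILE *f = fopen(path, "r");

		if (f) {
			if (fscanf(f, "%llu", &v) != 1)
				v = 0;
			fclose(f);
		}
		return v;
	}

	int main(void)
	{
		unsigned long long succ =
			read_counter("/sys/kernel/mm/cleancache/succ_gets");
		unsigned long long failed =
			read_counter("/sys/kernel/mm/cleancache/failed_gets");
		unsigned long long total = succ + failed;

		if (total)
			printf("cleancache hit rate: %llu%% (%llu of %llu gets)\n",
			       succ * 100 / total, succ, total);
		else
			printf("no cleancache gets recorded\n");
		return 0;
	}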
96
97FAQ
98
991) Where's the value? (Andrew Morton)
100
101Cleancache provides a significant performance benefit to many workloads
102in many environments with negligible overhead by improving the
103effectiveness of the pagecache. Clean pagecache pages are
104saved in transcendent memory (RAM that is otherwise not directly
105addressable to the kernel); fetching those pages later avoids "refaults"
106and thus disk reads.
107
108Cleancache (and its sister code "frontswap") provide interfaces for
109this transcendent memory (aka "tmem"), which conceptually lies between
110fast kernel-directly-addressable RAM and slower DMA/asynchronous devices.
111Disallowing direct kernel or userland reads/writes to tmem
112is ideal when data is transformed to a different form and size (such
113as with compression) or secretly moved (as might be useful for write-
114balancing for some RAM-like devices). Evicted page-cache pages (and
115swap pages) are a great use for this kind of slower-than-RAM-but-much-
116faster-than-disk transcendent memory, and the cleancache (and frontswap)
117"page-object-oriented" specification provides a nice way to read and
118write -- and indirectly "name" -- the pages.
119
120In the virtual case, the whole point of virtualization is to statistically
121multiplex physical resources across the varying demands of multiple
122virtual machines. This is really hard to do with RAM and efforts to
123do it well with no kernel change have essentially failed (except in some
124well-publicized special-case workloads). Cleancache -- and frontswap --
125with a fairly small impact on the kernel, provide a huge amount
126of flexibility for more dynamic, flexible RAM multiplexing.
127Specifically, the Xen Transcendent Memory backend allows otherwise
128"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
129virtual machines, but the pages can be compressed and deduplicated to
130optimize RAM utilization. And when guest OS's are induced to surrender
131underutilized RAM (e.g. with "self-ballooning"), page cache pages
132are the first to go, and cleancache allows those pages to be
133saved and reclaimed if overall host system memory conditions allow.
134
135And the identical interface used for cleancache can be used in
136physical systems as well. The zcache driver acts as a memory-hungry
137device that stores pages of data in a compressed state. And
138the proposed "RAMster" driver shares RAM across multiple physical
139systems.
140
1412) Why does cleancache have its sticky fingers so deep inside the
142 filesystems and VFS? (Andrew Morton and Christoph Hellwig)
143
144The core hooks for cleancache in VFS are in most cases a single line
145and the minimum set are placed precisely where needed to maintain
146coherency (via cleancache_flush operations) between cleancache,
147the page cache, and disk. All hooks compile into nothingness if
148cleancache is config'ed off and turn into a function-pointer-
149compare-to-NULL if config'ed on but no backend claims the ops
150functions, or to a compare-struct-element-to-negative if a
151backend claims the ops functions but a filesystem doesn't enable
152cleancache.
153
154Some filesystems are built entirely on top of VFS and the hooks
155in VFS are sufficient, so don't require an "init_fs" hook; the
156initial implementation of cleancache didn't provide this hook.
157But for some filesystems (such as btrfs), the VFS hooks are
158incomplete and one or more hooks in fs-specific code are required.
159And for some other filesystems, such as tmpfs, cleancache may
160be counterproductive. So it seemed prudent to require a filesystem
161to "opt in" to use cleancache, which requires adding a hook in
162each filesystem. Some filesystems are not supported by cleancache
163simply because they haven't been tested yet. The existing set should
164be sufficient to validate the concept, the opt-in approach means
165that untested filesystems are not affected, and the hooks in the
166existing filesystems should make it very easy to add more
167filesystems in the future.
168
169The total impact of the hooks to existing fs and mm files is only
170about 40 lines added (not counting comments and blank lines).
171
1723) Why not make cleancache asynchronous and batched so it can
173 more easily interface with real devices with DMA instead
174 of copying each individual page? (Minchan Kim)
175
176The one-page-at-a-time copy semantics simplifies the implementation
177on both the frontend and backend and also allows the backend to
178do fancy things on-the-fly like page compression and
179page deduplication. And since the data is "gone" (copied into/out
180of the pageframe) before the cleancache get/put call returns,
181a great deal of race conditions and potential coherency issues
182are avoided. While the interface seems odd for a "real device"
183or for real kernel-addressable RAM, it makes perfect sense for
184transcendent memory.
185
1864) Why is non-shared cleancache "exclusive"? And where is the
187 page "flushed" after a "get"? (Minchan Kim)
188
189The main reason is to free up space in transcendent memory and
190to avoid unnecessary cleancache_flush calls. If you want inclusive,
191the page can be "put" immediately following the "get". If
192put-after-get for inclusive becomes common, the interface could
193be easily extended to add a "get_no_flush" call.
194
195The flush is done by the cleancache backend implementation.
196
1975) What's the performance impact?
198
199Performance analysis has been presented at OLS'09 and LCA'10.
200Briefly, performance gains can be significant on most workloads,
201especially when memory pressure is high (e.g. when RAM is
202overcommitted in a virtual workload); and because the hooks are
203invoked primarily in place of or in addition to a disk read/write,
204overhead is negligible even in worst case workloads. Basically
205cleancache replaces I/O with memory-copy-CPU-overhead; on older
206single-core systems with slow memory-copy speeds, cleancache
207has little value, but in newer multicore machines, especially
208consolidated/virtualized machines, it has great value.
209
2106) How do I add cleancache support for filesystem X? (Boaz Harrash)
211
212Filesystems that are well-behaved and conform to certain
213restrictions can utilize cleancache simply by making a call to
214cleancache_init_fs at mount time. Unusual, misbehaving, or
215poorly layered filesystems must either add additional hooks
216and/or undergo extensive additional testing... or should just
217not enable the optional cleancache.
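
For illustration, the opt-in amounts to a single call in the filesystem's
mount/fill_super path.  The sketch below uses the "init_fs" hook named in this
document; myfs_* is a hypothetical filesystem, and the exact helper signature
(whether it takes the superblock or returns a pool id to store) should be
checked against include/linux/cleancache.h.

	/* Hedged sketch: opting a filesystem in to cleancache at mount time. */
	static int myfs_fill_super(struct super_block *sb, void *data, int silent)
	{
		int err = myfs_read_super(sb, data, silent);	/* hypothetical */

		if (err)
			return err;

		/* A negative pool id simply leaves cleancache disabled. */
		cleancache_init_fs(sb);
		return 0;
	}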
218
219Some points for a filesystem to consider:
220
221- The FS should be block-device-based (e.g. a ram-based FS such
222 as tmpfs should not enable cleancache)
223- To ensure coherency/correctness, the FS must ensure that all
224 file removal or truncation operations either go through VFS or
225 add hooks to do the equivalent cleancache "flush" operations
226- To ensure coherency/correctness, either inode numbers must
227 be unique across the lifetime of the on-disk file OR the
228 FS must provide an "encode_fh" function.
229- The FS must call the VFS superblock alloc and deactivate routines
230 or add hooks to do the equivalent cleancache calls done there.
231- To maximize performance, all pages fetched from the FS should
232   go through the do_mpage_readpage routine or the FS should add
233 hooks to do the equivalent (cf. btrfs)
234- Currently, the FS blocksize must be the same as PAGESIZE. This
235 is not an architectural restriction, but no backends currently
236 support anything different.
237- A clustered FS should invoke the "shared_init_fs" cleancache
238 hook to get best performance for some backends.
239
2407) Why not use the KVA of the inode as the key? (Christoph Hellwig)
241
242If cleancache would use the inode virtual address instead of
243inode/filehandle, the pool id could be eliminated. But, this
244won't work because cleancache retains pagecache data pages
245persistently even when the inode has been pruned from the
246inode unused list, and only flushes the data page if the file
247gets removed/truncated. So if cleancache used the inode kva,
248there would be potential coherency issues if/when the inode
249kva is reused for a different file. Alternately, if cleancache
250flushed the pages when the inode kva was freed, much of the value
251of cleancache would be lost because the cache of pages in cleancache
252is potentially much larger than the kernel pagecache and is most
253useful if the pages survive inode cache removal.
254
2558) Why is a global variable required?
256
257The cleancache_enabled flag is checked in all of the frequently-used
258cleancache hooks. The alternative is a function call to check a static
259variable. Since cleancache is enabled dynamically at runtime, systems
260that don't enable cleancache would suffer thousands (possibly
261tens-of-thousands) of unnecessary function calls per second. So the
262global variable allows cleancache to be enabled by default at compile
263time, but have insignificant performance impact when cleancache remains
264disabled at runtime.
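
As an illustration of the pattern described above (not necessarily the exact
upstream inline), each hook can be guarded by the global flag so that the
disabled case costs only a test and a branch:

	/* Hedged sketch of a cleancache hook wrapper; the real inlines live
	 * in include/linux/cleancache.h and may differ in detail. */
	extern int cleancache_enabled;

	static inline int cleancache_get_page(struct page *page)
	{
		int ret = -1;

		if (cleancache_enabled)
			ret = __cleancache_get_page(page);
		return ret;	/* -1 means "not found, read from disk" */
	}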
265
2669) Does cleancache work with KVM?
267
268The memory model of KVM is sufficiently different that a cleancache
269backend may have less value for KVM. This remains to be tested,
270especially in an overcommitted system.
271
27210) Does cleancache work in userspace? It sounds useful for
273 memory hungry caches like web browsers. (Jamie Lokier)
274
275No plans yet, though we agree it sounds useful, at least for
276apps that bypass the page cache (e.g. O_DIRECT).
277
278Last updated: Dan Magenheimer, April 13 2011
diff --git a/Documentation/vm/highmem.txt b/Documentation/vm/highmem.txt
new file mode 100644
index 000000000000..4324d24ffacd
--- /dev/null
+++ b/Documentation/vm/highmem.txt
@@ -0,0 +1,162 @@
1
2 ====================
3 HIGH MEMORY HANDLING
4 ====================
5
6By: Peter Zijlstra <a.p.zijlstra@chello.nl>
7
8Contents:
9
10 (*) What is high memory?
11
12 (*) Temporary virtual mappings.
13
14 (*) Using kmap_atomic.
15
16 (*) Cost of temporary mappings.
17
18 (*) i386 PAE.
19
20
21====================
22WHAT IS HIGH MEMORY?
23====================
24
25High memory (highmem) is used when the size of physical memory approaches or
26exceeds the maximum size of virtual memory. At that point it becomes
27impossible for the kernel to keep all of the available physical memory mapped
28at all times. This means the kernel needs to start using temporary mappings of
29the pieces of physical memory that it wants to access.
30
31The part of (physical) memory not covered by a permanent mapping is what we
32refer to as 'highmem'. There are various architecture dependent constraints on
33where exactly that border lies.
34
35In the i386 arch, for example, we choose to map the kernel into every process's
36VM space so that we don't have to pay the full TLB invalidation costs for
37kernel entry/exit. This means the available virtual memory space (4GiB on
38i386) has to be divided between user and kernel space.
39
40The traditional split for architectures using this approach is 3:1, 3GiB for
41userspace and the top 1GiB for kernel space:
42
43 +--------+ 0xffffffff
44 | Kernel |
45 +--------+ 0xc0000000
46 | |
47 | User |
48 | |
49 +--------+ 0x00000000
50
51This means that the kernel can at most map 1GiB of physical memory at any one
52time, but because we need virtual address space for other things - including
53temporary maps to access the rest of the physical memory - the actual direct
54map will typically be less (usually around ~896MiB).
55
56Other architectures that have mm context tagged TLBs can have separate kernel
57and user maps. Some hardware (like some ARMs), however, has limited virtual
58space when it uses mm context tags.
59
60
61==========================
62TEMPORARY VIRTUAL MAPPINGS
63==========================
64
65The kernel contains several ways of creating temporary mappings:
66
67 (*) vmap(). This can be used to make a long duration mapping of multiple
68 physical pages into a contiguous virtual space. It needs global
69 synchronization to unmap.
70
71 (*) kmap(). This permits a short duration mapping of a single page. It needs
72 global synchronization, but is amortized somewhat. It is also prone to
73 deadlocks when used in a nested fashion, and so it is not recommended for
74 new code (a short kmap() sketch follows this list).
75
76 (*) kmap_atomic(). This permits a very short duration mapping of a single
77 page. Since the mapping is restricted to the CPU that issued it, it
78 performs well, but the issuing task is therefore required to stay on that
79 CPU until it has finished, lest some other task displace its mappings.
80
81 kmap_atomic() may also be used by interrupt contexts, since it does not
82 sleep and the caller may not sleep until after kunmap_atomic() is called.
83
84 It may be assumed that k[un]map_atomic() won't fail.
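
As promised in the kmap() item above, a sleepable user of a short-duration
mapping might look like the sketch below (do_something_that_may_sleep() is a
placeholder, not a real kernel function):

	/* kmap() keeps the mapping valid across a sleep, at the cost of
	 * global synchronization; always pair it with kunmap() on the
	 * same struct page. */
	struct page *page = alloc_page(GFP_HIGHUSER);
	void *vaddr = kmap(page);

	do_something_that_may_sleep(vaddr);	/* placeholder */

	kunmap(page);
	__free_page(page);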
85
86
87=================
88USING KMAP_ATOMIC
89=================
90
91When and where to use kmap_atomic() is straightforward. It is used when code
92wants to access the contents of a page that might be allocated from high memory
93(see __GFP_HIGHMEM), for example a page in the pagecache. The API has two
94functions, and they can be used in a manner similar to the following:
95
96 /* Find the page of interest. */
97 struct page *page = find_get_page(mapping, offset);
98
99 /* Gain access to the contents of that page. */
100 void *vaddr = kmap_atomic(page);
101
102 /* Do something to the contents of that page. */
103 memset(vaddr, 0, PAGE_SIZE);
104
105 /* Unmap that page. */
106 kunmap_atomic(vaddr);
107
108Note that the kunmap_atomic() call takes the result of the kmap_atomic() call,
109not the argument.
110
111If you need to map two pages because you want to copy from one page to
112another you need to keep the kmap_atomic calls strictly nested, like:
113
114 vaddr1 = kmap_atomic(page1);
115 vaddr2 = kmap_atomic(page2);
116
117 memcpy(vaddr1, vaddr2, PAGE_SIZE);
118
119 kunmap_atomic(vaddr2);
120 kunmap_atomic(vaddr1);
121
122
123==========================
124COST OF TEMPORARY MAPPINGS
125==========================
126
127The cost of creating temporary mappings can be quite high. The arch has to
128manipulate the kernel's page tables, the data TLB and/or the MMU's registers.
129
130If CONFIG_HIGHMEM is not set, then the kernel will try and create a mapping
131simply with a bit of arithmetic that will convert the page struct address into
132a pointer to the page contents rather than juggling mappings about. In such a
133case, the unmap operation may be a null operation.
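
The "bit of arithmetic" is essentially a direct-map address calculation,
roughly what the kernel's lowmem_page_address() helper does (a sketch, modulo
the exact macros used by a given architecture):

	/* With no highmem, every page lives in the direct map, so its kernel
	 * virtual address can be computed directly from its page frame
	 * number. */
	static inline void *lowmem_page_address(struct page *page)
	{
		return __va(page_to_pfn(page) << PAGE_SHIFT);
	}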
134
135If CONFIG_MMU is not set, then there can be no temporary mappings and no
136highmem. In such a case, the arithmetic approach will also be used.
137
138
139========
140i386 PAE
141========
142
143The i386 arch, under some circumstances, will permit you to stick up to 64GiB
144of RAM into your 32-bit machine. This has a number of consequences:
145
146 (*) Linux needs a page-frame structure for each page in the system and the
147 pageframes need to live in the permanent mapping, which means:
148
149 (*) you can have 896M/sizeof(struct page) page-frames at most; with struct
150 page being 32-bytes that would end up being something in the order of 112G
151 worth of pages; the kernel, however, needs to store more than just
152 page-frames in that memory...
153
154 (*) PAE makes your page tables larger - which slows the system down as more
155 data has to be accessed to traverse in TLB fills and the like. One
156 advantage is that PAE has more PTE bits and can provide advanced features
157 like NX and PAT.
158
159The general recommendation is that you don't use more than 8GiB on a 32-bit
160machine - although more might work for you and your workload, you're pretty
161much on your own - don't expect kernel developers to really care much if things
162come apart.
diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index 457634c1e03e..f8551b3879f8 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -72,7 +72,7 @@ number of huge pages requested. This is the most reliable method of
 allocating huge pages as memory has not yet become fragmented.
 
 Some platforms support multiple huge page sizes.  To allocate huge pages
-of a specific size, one must preceed the huge pages boot command parameters
+of a specific size, one must precede the huge pages boot command parameters
 with a huge page size selection parameter "hugepagesz=<size>".  <size> must
 be specified in bytes with optional scale suffix [kKmMgG].  The default huge
 page size may be selected with the "default_hugepagesz=<size>" boot parameter.
diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt
index 12f9ba20ccb7..550068466605 100644
--- a/Documentation/vm/hwpoison.txt
+++ b/Documentation/vm/hwpoison.txt
@@ -129,12 +129,12 @@ Limit injection to pages owned by memgroup. Specified by inode number
 of the memcg.
 
 Example:
-        mkdir /cgroup/hwpoison
+        mkdir /sys/fs/cgroup/mem/hwpoison
 
         usemem -m 100 -s 1000 &
-        echo `jobs -p` > /cgroup/hwpoison/tasks
+        echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
 
-        memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ')
+        memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
         echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
 
         page-types -p `pidof init` --hwpoison  # shall do nothing
diff --git a/Documentation/vm/locking b/Documentation/vm/locking
index 25fadb448760..f61228bd6395 100644
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@ in some cases it is not really needed. Eg, vm_start is modified by
 expand_stack(), it is hard to come up with a destructive scenario without
 having the vmlist protection in this case.
 
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_mutex and the kmem cache
 c_spinlock spinlocks.  This is okay, since the kmem code asks for pages after
 dropping c_spinlock.  The page_table_lock also nests with pagecache_lock and
 pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 6690fc34ef6d..4e7da6543424 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -424,7 +424,7 @@ a command line tool, numactl(8), exists that allows one to:
 
 + set the shared policy for a shared memory segment via mbind(2)
 
-The numactl(8) tool is packages with the run-time version of the library
+The numactl(8) tool is packaged with the run-time version of the library
 containing the memory policy system call wrappers.  Some distributions
 package the headers and compile-time libraries in a separate development
 package.
diff --git a/Documentation/vm/overcommit-accounting b/Documentation/vm/overcommit-accounting
index 21c7b1f8f32b..706d7ed9d8d2 100644
--- a/Documentation/vm/overcommit-accounting
+++ b/Documentation/vm/overcommit-accounting
@@ -4,7 +4,7 @@ The Linux kernel supports the following overcommit handling modes
 	address space are refused. Used for a typical system. It
 	ensures a seriously wild allocation fails while allowing
 	overcommit to reduce swap usage.  root is allowed to
-	allocate slighly more memory in this mode. This is the
+	allocate slightly more memory in this mode. This is the
 	default.
 
 1	-	Always overcommit. Appropriate for some scientific
diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c
index cc96ee2666f2..7445caa26d05 100644
--- a/Documentation/vm/page-types.c
+++ b/Documentation/vm/page-types.c
@@ -32,8 +32,20 @@
32#include <sys/types.h> 32#include <sys/types.h>
33#include <sys/errno.h> 33#include <sys/errno.h>
34#include <sys/fcntl.h> 34#include <sys/fcntl.h>
35#include <sys/mount.h>
36#include <sys/statfs.h>
37#include "../../include/linux/magic.h"
35 38
36 39
40#ifndef MAX_PATH
41# define MAX_PATH 256
42#endif
43
44#ifndef STR
45# define _STR(x) #x
46# define STR(x) _STR(x)
47#endif
48
37/* 49/*
38 * pagemap kernel ABI bits 50 * pagemap kernel ABI bits
39 */ 51 */
@@ -152,6 +164,12 @@ static const char *page_flag_names[] = {
152}; 164};
153 165
154 166
167static const char *debugfs_known_mountpoints[] = {
168 "/sys/kernel/debug",
169 "/debug",
170 0,
171};
172
155/* 173/*
156 * data structures 174 * data structures
157 */ 175 */
@@ -184,7 +202,7 @@ static int kpageflags_fd;
184static int opt_hwpoison; 202static int opt_hwpoison;
185static int opt_unpoison; 203static int opt_unpoison;
186 204
187static const char hwpoison_debug_fs[] = "/debug/hwpoison"; 205static char hwpoison_debug_fs[MAX_PATH+1];
188static int hwpoison_inject_fd; 206static int hwpoison_inject_fd;
189static int hwpoison_forget_fd; 207static int hwpoison_forget_fd;
190 208
@@ -464,21 +482,100 @@ static uint64_t kpageflags_flags(uint64_t flags)
464 return flags; 482 return flags;
465} 483}
466 484
485/* verify that a mountpoint is actually a debugfs instance */
486static int debugfs_valid_mountpoint(const char *debugfs)
487{
488 struct statfs st_fs;
489
490 if (statfs(debugfs, &st_fs) < 0)
491 return -ENOENT;
492 else if (st_fs.f_type != (long) DEBUGFS_MAGIC)
493 return -ENOENT;
494
495 return 0;
496}
497
498/* find the path to the mounted debugfs */
499static const char *debugfs_find_mountpoint(void)
500{
501 const char **ptr;
502 char type[100];
503 FILE *fp;
504
505 ptr = debugfs_known_mountpoints;
506 while (*ptr) {
507 if (debugfs_valid_mountpoint(*ptr) == 0) {
508 strcpy(hwpoison_debug_fs, *ptr);
509 return hwpoison_debug_fs;
510 }
511 ptr++;
512 }
513
514 /* give up and parse /proc/mounts */
515 fp = fopen("/proc/mounts", "r");
516 if (fp == NULL)
517 perror("Can't open /proc/mounts for read");
518
519 while (fscanf(fp, "%*s %"
520 STR(MAX_PATH)
521 "s %99s %*s %*d %*d\n",
522 hwpoison_debug_fs, type) == 2) {
523 if (strcmp(type, "debugfs") == 0)
524 break;
525 }
526 fclose(fp);
527
528 if (strcmp(type, "debugfs") != 0)
529 return NULL;
530
531 return hwpoison_debug_fs;
532}
533
534/* mount the debugfs somewhere if it's not mounted */
535
536static void debugfs_mount(void)
537{
538 const char **ptr;
539
540 /* see if it's already mounted */
541 if (debugfs_find_mountpoint())
542 return;
543
544 ptr = debugfs_known_mountpoints;
545 while (*ptr) {
546 if (mount(NULL, *ptr, "debugfs", 0, NULL) == 0) {
547 /* save the mountpoint */
548 strcpy(hwpoison_debug_fs, *ptr);
549 break;
550 }
551 ptr++;
552 }
553
554 if (*ptr == NULL) {
555 perror("mount debugfs");
556 exit(EXIT_FAILURE);
557 }
558}
559
467/* 560/*
468 * page actions 561 * page actions
469 */ 562 */
470 563
471static void prepare_hwpoison_fd(void) 564static void prepare_hwpoison_fd(void)
472{ 565{
473 char buf[100]; 566 char buf[MAX_PATH + 1];
567
568 debugfs_mount();
474 569
475 if (opt_hwpoison && !hwpoison_inject_fd) { 570 if (opt_hwpoison && !hwpoison_inject_fd) {
476 sprintf(buf, "%s/corrupt-pfn", hwpoison_debug_fs); 571 snprintf(buf, MAX_PATH, "%s/hwpoison/corrupt-pfn",
572 hwpoison_debug_fs);
477 hwpoison_inject_fd = checked_open(buf, O_WRONLY); 573 hwpoison_inject_fd = checked_open(buf, O_WRONLY);
478 } 574 }
479 575
480 if (opt_unpoison && !hwpoison_forget_fd) { 576 if (opt_unpoison && !hwpoison_forget_fd) {
481 sprintf(buf, "%s/unpoison-pfn", hwpoison_debug_fs); 577 snprintf(buf, MAX_PATH, "%s/hwpoison/unpoison-pfn",
578 hwpoison_debug_fs);
482 hwpoison_forget_fd = checked_open(buf, O_WRONLY); 579 hwpoison_forget_fd = checked_open(buf, O_WRONLY);
483 } 580 }
484} 581}
diff --git a/Documentation/vm/slabinfo.c b/Documentation/vm/slabinfo.c
deleted file mode 100644
index 92e729f4b676..000000000000
--- a/Documentation/vm/slabinfo.c
+++ /dev/null
@@ -1,1364 +0,0 @@
1/*
2 * Slabinfo: Tool to get reports about slabs
3 *
4 * (C) 2007 sgi, Christoph Lameter
5 *
6 * Compile by:
7 *
8 * gcc -o slabinfo slabinfo.c
9 */
10#include <stdio.h>
11#include <stdlib.h>
12#include <sys/types.h>
13#include <dirent.h>
14#include <strings.h>
15#include <string.h>
16#include <unistd.h>
17#include <stdarg.h>
18#include <getopt.h>
19#include <regex.h>
20#include <errno.h>
21
22#define MAX_SLABS 500
23#define MAX_ALIASES 500
24#define MAX_NODES 1024
25
26struct slabinfo {
27 char *name;
28 int alias;
29 int refs;
30 int aliases, align, cache_dma, cpu_slabs, destroy_by_rcu;
31 int hwcache_align, object_size, objs_per_slab;
32 int sanity_checks, slab_size, store_user, trace;
33 int order, poison, reclaim_account, red_zone;
34 unsigned long partial, objects, slabs, objects_partial, objects_total;
35 unsigned long alloc_fastpath, alloc_slowpath;
36 unsigned long free_fastpath, free_slowpath;
37 unsigned long free_frozen, free_add_partial, free_remove_partial;
38 unsigned long alloc_from_partial, alloc_slab, free_slab, alloc_refill;
39 unsigned long cpuslab_flush, deactivate_full, deactivate_empty;
40 unsigned long deactivate_to_head, deactivate_to_tail;
41 unsigned long deactivate_remote_frees, order_fallback;
42 int numa[MAX_NODES];
43 int numa_partial[MAX_NODES];
44} slabinfo[MAX_SLABS];
45
46struct aliasinfo {
47 char *name;
48 char *ref;
49 struct slabinfo *slab;
50} aliasinfo[MAX_ALIASES];
51
52int slabs = 0;
53int actual_slabs = 0;
54int aliases = 0;
55int alias_targets = 0;
56int highest_node = 0;
57
58char buffer[4096];
59
60int show_empty = 0;
61int show_report = 0;
62int show_alias = 0;
63int show_slab = 0;
64int skip_zero = 1;
65int show_numa = 0;
66int show_track = 0;
67int show_first_alias = 0;
68int validate = 0;
69int shrink = 0;
70int show_inverted = 0;
71int show_single_ref = 0;
72int show_totals = 0;
73int sort_size = 0;
74int sort_active = 0;
75int set_debug = 0;
76int show_ops = 0;
77int show_activity = 0;
78
79/* Debug options */
80int sanity = 0;
81int redzone = 0;
82int poison = 0;
83int tracking = 0;
84int tracing = 0;
85
86int page_size;
87
88regex_t pattern;
89
90static void fatal(const char *x, ...)
91{
92 va_list ap;
93
94 va_start(ap, x);
95 vfprintf(stderr, x, ap);
96 va_end(ap);
97 exit(EXIT_FAILURE);
98}
99
100static void usage(void)
101{
102 printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
103 "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
104 "-a|--aliases Show aliases\n"
105 "-A|--activity Most active slabs first\n"
106 "-d<options>|--debug=<options> Set/Clear Debug options\n"
107 "-D|--display-active Switch line format to activity\n"
108 "-e|--empty Show empty slabs\n"
109 "-f|--first-alias Show first alias\n"
110 "-h|--help Show usage information\n"
111 "-i|--inverted Inverted list\n"
112 "-l|--slabs Show slabs\n"
113 "-n|--numa Show NUMA information\n"
114 "-o|--ops Show kmem_cache_ops\n"
115 "-s|--shrink Shrink slabs\n"
116 "-r|--report Detailed report on single slabs\n"
117 "-S|--Size Sort by size\n"
118 "-t|--tracking Show alloc/free information\n"
119 "-T|--Totals Show summary information\n"
120 "-v|--validate Validate slabs\n"
121 "-z|--zero Include empty slabs\n"
122 "-1|--1ref Single reference\n"
123 "\nValid debug options (FZPUT may be combined)\n"
124 "a / A Switch on all debug options (=FZUP)\n"
125 "- Switch off all debug options\n"
126 "f / F Sanity Checks (SLAB_DEBUG_FREE)\n"
127 "z / Z Redzoning\n"
128 "p / P Poisoning\n"
129 "u / U Tracking\n"
130 "t / T Tracing\n"
131 );
132}
133
134static unsigned long read_obj(const char *name)
135{
136 FILE *f = fopen(name, "r");
137
138 if (!f)
139 buffer[0] = 0;
140 else {
141 if (!fgets(buffer, sizeof(buffer), f))
142 buffer[0] = 0;
143 fclose(f);
144 if (buffer[strlen(buffer)] == '\n')
145 buffer[strlen(buffer)] = 0;
146 }
147 return strlen(buffer);
148}
149
150
151/*
152 * Get the contents of an attribute
153 */
154static unsigned long get_obj(const char *name)
155{
156 if (!read_obj(name))
157 return 0;
158
159 return atol(buffer);
160}
161
162static unsigned long get_obj_and_str(const char *name, char **x)
163{
164 unsigned long result = 0;
165 char *p;
166
167 *x = NULL;
168
169 if (!read_obj(name)) {
170 x = NULL;
171 return 0;
172 }
173 result = strtoul(buffer, &p, 10);
174 while (*p == ' ')
175 p++;
176 if (*p)
177 *x = strdup(p);
178 return result;
179}
180
181static void set_obj(struct slabinfo *s, const char *name, int n)
182{
183 char x[100];
184 FILE *f;
185
186 snprintf(x, 100, "%s/%s", s->name, name);
187 f = fopen(x, "w");
188 if (!f)
189 fatal("Cannot write to %s\n", x);
190
191 fprintf(f, "%d\n", n);
192 fclose(f);
193}
194
195static unsigned long read_slab_obj(struct slabinfo *s, const char *name)
196{
197 char x[100];
198 FILE *f;
199 size_t l;
200
201 snprintf(x, 100, "%s/%s", s->name, name);
202 f = fopen(x, "r");
203 if (!f) {
204 buffer[0] = 0;
205 l = 0;
206 } else {
207 l = fread(buffer, 1, sizeof(buffer), f);
208 buffer[l] = 0;
209 fclose(f);
210 }
211 return l;
212}
213
214
215/*
216 * Put a size string together
217 */
218static int store_size(char *buffer, unsigned long value)
219{
220 unsigned long divisor = 1;
221 char trailer = 0;
222 int n;
223
224 if (value > 1000000000UL) {
225 divisor = 100000000UL;
226 trailer = 'G';
227 } else if (value > 1000000UL) {
228 divisor = 100000UL;
229 trailer = 'M';
230 } else if (value > 1000UL) {
231 divisor = 100;
232 trailer = 'K';
233 }
234
235 value /= divisor;
236 n = sprintf(buffer, "%ld",value);
237 if (trailer) {
238 buffer[n] = trailer;
239 n++;
240 buffer[n] = 0;
241 }
242 if (divisor != 1) {
243 memmove(buffer + n - 2, buffer + n - 3, 4);
244 buffer[n-2] = '.';
245 n++;
246 }
247 return n;
248}
249
250static void decode_numa_list(int *numa, char *t)
251{
252 int node;
253 int nr;
254
255 memset(numa, 0, MAX_NODES * sizeof(int));
256
257 if (!t)
258 return;
259
260 while (*t == 'N') {
261 t++;
262 node = strtoul(t, &t, 10);
263 if (*t == '=') {
264 t++;
265 nr = strtoul(t, &t, 10);
266 numa[node] = nr;
267 if (node > highest_node)
268 highest_node = node;
269 }
270 while (*t == ' ')
271 t++;
272 }
273}
274
275static void slab_validate(struct slabinfo *s)
276{
277 if (strcmp(s->name, "*") == 0)
278 return;
279
280 set_obj(s, "validate", 1);
281}
282
283static void slab_shrink(struct slabinfo *s)
284{
285 if (strcmp(s->name, "*") == 0)
286 return;
287
288 set_obj(s, "shrink", 1);
289}
290
291int line = 0;
292
293static void first_line(void)
294{
295 if (show_activity)
296 printf("Name Objects Alloc Free %%Fast Fallb O\n");
297 else
298 printf("Name Objects Objsize Space "
299 "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n");
300}
301
302/*
303 * Find the shortest alias of a slab
304 */
305static struct aliasinfo *find_one_alias(struct slabinfo *find)
306{
307 struct aliasinfo *a;
308 struct aliasinfo *best = NULL;
309
310 for(a = aliasinfo;a < aliasinfo + aliases; a++) {
311 if (a->slab == find &&
312 (!best || strlen(best->name) < strlen(a->name))) {
313 best = a;
314 if (strncmp(a->name,"kmall", 5) == 0)
315 return best;
316 }
317 }
318 return best;
319}
320
321static unsigned long slab_size(struct slabinfo *s)
322{
323 return s->slabs * (page_size << s->order);
324}
325
326static unsigned long slab_activity(struct slabinfo *s)
327{
328 return s->alloc_fastpath + s->free_fastpath +
329 s->alloc_slowpath + s->free_slowpath;
330}
331
332static void slab_numa(struct slabinfo *s, int mode)
333{
334 int node;
335
336 if (strcmp(s->name, "*") == 0)
337 return;
338
339 if (!highest_node) {
340 printf("\n%s: No NUMA information available.\n", s->name);
341 return;
342 }
343
344 if (skip_zero && !s->slabs)
345 return;
346
347 if (!line) {
348 printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
349 for(node = 0; node <= highest_node; node++)
350 printf(" %4d", node);
351 printf("\n----------------------");
352 for(node = 0; node <= highest_node; node++)
353 printf("-----");
354 printf("\n");
355 }
356 printf("%-21s ", mode ? "All slabs" : s->name);
357 for(node = 0; node <= highest_node; node++) {
358 char b[20];
359
360 store_size(b, s->numa[node]);
361 printf(" %4s", b);
362 }
363 printf("\n");
364 if (mode) {
365 printf("%-21s ", "Partial slabs");
366 for(node = 0; node <= highest_node; node++) {
367 char b[20];
368
369 store_size(b, s->numa_partial[node]);
370 printf(" %4s", b);
371 }
372 printf("\n");
373 }
374 line++;
375}
376
377static void show_tracking(struct slabinfo *s)
378{
379 printf("\n%s: Kernel object allocation\n", s->name);
380 printf("-----------------------------------------------------------------------\n");
381 if (read_slab_obj(s, "alloc_calls"))
382 printf(buffer);
383 else
384 printf("No Data\n");
385
386 printf("\n%s: Kernel object freeing\n", s->name);
387 printf("------------------------------------------------------------------------\n");
388 if (read_slab_obj(s, "free_calls"))
389 printf(buffer);
390 else
391 printf("No Data\n");
392
393}
394
395static void ops(struct slabinfo *s)
396{
397 if (strcmp(s->name, "*") == 0)
398 return;
399
400 if (read_slab_obj(s, "ops")) {
401 printf("\n%s: kmem_cache operations\n", s->name);
402 printf("--------------------------------------------\n");
403 printf(buffer);
404 } else
405 printf("\n%s has no kmem_cache operations\n", s->name);
406}
407
408static const char *onoff(int x)
409{
410 if (x)
411 return "On ";
412 return "Off";
413}
414
415static void slab_stats(struct slabinfo *s)
416{
417 unsigned long total_alloc;
418 unsigned long total_free;
419 unsigned long total;
420
421 if (!s->alloc_slab)
422 return;
423
424 total_alloc = s->alloc_fastpath + s->alloc_slowpath;
425 total_free = s->free_fastpath + s->free_slowpath;
426
427 if (!total_alloc)
428 return;
429
430 printf("\n");
431 printf("Slab Perf Counter Alloc Free %%Al %%Fr\n");
432 printf("--------------------------------------------------\n");
433 printf("Fastpath %8lu %8lu %3lu %3lu\n",
434 s->alloc_fastpath, s->free_fastpath,
435 s->alloc_fastpath * 100 / total_alloc,
436 s->free_fastpath * 100 / total_free);
437 printf("Slowpath %8lu %8lu %3lu %3lu\n",
438 total_alloc - s->alloc_fastpath, s->free_slowpath,
439 (total_alloc - s->alloc_fastpath) * 100 / total_alloc,
440 s->free_slowpath * 100 / total_free);
441 printf("Page Alloc %8lu %8lu %3lu %3lu\n",
442 s->alloc_slab, s->free_slab,
443 s->alloc_slab * 100 / total_alloc,
444 s->free_slab * 100 / total_free);
445 printf("Add partial %8lu %8lu %3lu %3lu\n",
446 s->deactivate_to_head + s->deactivate_to_tail,
447 s->free_add_partial,
448 (s->deactivate_to_head + s->deactivate_to_tail) * 100 / total_alloc,
449 s->free_add_partial * 100 / total_free);
450 printf("Remove partial %8lu %8lu %3lu %3lu\n",
451 s->alloc_from_partial, s->free_remove_partial,
452 s->alloc_from_partial * 100 / total_alloc,
453 s->free_remove_partial * 100 / total_free);
454
455 printf("RemoteObj/SlabFrozen %8lu %8lu %3lu %3lu\n",
456 s->deactivate_remote_frees, s->free_frozen,
457 s->deactivate_remote_frees * 100 / total_alloc,
458 s->free_frozen * 100 / total_free);
459
460 printf("Total %8lu %8lu\n\n", total_alloc, total_free);
461
462 if (s->cpuslab_flush)
463 printf("Flushes %8lu\n", s->cpuslab_flush);
464
465 if (s->alloc_refill)
466 printf("Refill %8lu\n", s->alloc_refill);
467
468 total = s->deactivate_full + s->deactivate_empty +
469 s->deactivate_to_head + s->deactivate_to_tail;
470
471 if (total)
472 printf("Deactivate Full=%lu(%lu%%) Empty=%lu(%lu%%) "
473 "ToHead=%lu(%lu%%) ToTail=%lu(%lu%%)\n",
474 s->deactivate_full, (s->deactivate_full * 100) / total,
475 s->deactivate_empty, (s->deactivate_empty * 100) / total,
476 s->deactivate_to_head, (s->deactivate_to_head * 100) / total,
477 s->deactivate_to_tail, (s->deactivate_to_tail * 100) / total);
478}
479
480static void report(struct slabinfo *s)
481{
482 if (strcmp(s->name, "*") == 0)
483 return;
484
485 printf("\nSlabcache: %-20s Aliases: %2d Order : %2d Objects: %lu\n",
486 s->name, s->aliases, s->order, s->objects);
487 if (s->hwcache_align)
488 printf("** Hardware cacheline aligned\n");
489 if (s->cache_dma)
490 printf("** Memory is allocated in a special DMA zone\n");
491 if (s->destroy_by_rcu)
492 printf("** Slabs are destroyed via RCU\n");
493 if (s->reclaim_account)
494 printf("** Reclaim accounting active\n");
495
496 printf("\nSizes (bytes) Slabs Debug Memory\n");
497 printf("------------------------------------------------------------------------\n");
498 printf("Object : %7d Total : %7ld Sanity Checks : %s Total: %7ld\n",
499 s->object_size, s->slabs, onoff(s->sanity_checks),
500 s->slabs * (page_size << s->order));
501 printf("SlabObj: %7d Full : %7ld Redzoning : %s Used : %7ld\n",
502 s->slab_size, s->slabs - s->partial - s->cpu_slabs,
503 onoff(s->red_zone), s->objects * s->object_size);
504 printf("SlabSiz: %7d Partial: %7ld Poisoning : %s Loss : %7ld\n",
505 page_size << s->order, s->partial, onoff(s->poison),
506 s->slabs * (page_size << s->order) - s->objects * s->object_size);
507 printf("Loss : %7d CpuSlab: %7d Tracking : %s Lalig: %7ld\n",
508 s->slab_size - s->object_size, s->cpu_slabs, onoff(s->store_user),
509 (s->slab_size - s->object_size) * s->objects);
510 printf("Align : %7d Objects: %7d Tracing : %s Lpadd: %7ld\n",
511 s->align, s->objs_per_slab, onoff(s->trace),
512 ((page_size << s->order) - s->objs_per_slab * s->slab_size) *
513 s->slabs);
514
515 ops(s);
516 show_tracking(s);
517 slab_numa(s, 1);
518 slab_stats(s);
519}
520
521static void slabcache(struct slabinfo *s)
522{
523 char size_str[20];
524 char dist_str[40];
525 char flags[20];
526 char *p = flags;
527
528 if (strcmp(s->name, "*") == 0)
529 return;
530
531 if (actual_slabs == 1) {
532 report(s);
533 return;
534 }
535
536 if (skip_zero && !show_empty && !s->slabs)
537 return;
538
539 if (show_empty && s->slabs)
540 return;
541
542 store_size(size_str, slab_size(s));
543 snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs - s->cpu_slabs,
544 s->partial, s->cpu_slabs);
545
546 if (!line++)
547 first_line();
548
549 if (s->aliases)
550 *p++ = '*';
551 if (s->cache_dma)
552 *p++ = 'd';
553 if (s->hwcache_align)
554 *p++ = 'A';
555 if (s->poison)
556 *p++ = 'P';
557 if (s->reclaim_account)
558 *p++ = 'a';
559 if (s->red_zone)
560 *p++ = 'Z';
561 if (s->sanity_checks)
562 *p++ = 'F';
563 if (s->store_user)
564 *p++ = 'U';
565 if (s->trace)
566 *p++ = 'T';
567
568 *p = 0;
569 if (show_activity) {
570 unsigned long total_alloc;
571 unsigned long total_free;
572
573 total_alloc = s->alloc_fastpath + s->alloc_slowpath;
574 total_free = s->free_fastpath + s->free_slowpath;
575
576 printf("%-21s %8ld %10ld %10ld %3ld %3ld %5ld %1d\n",
577 s->name, s->objects,
578 total_alloc, total_free,
579 total_alloc ? (s->alloc_fastpath * 100 / total_alloc) : 0,
580 total_free ? (s->free_fastpath * 100 / total_free) : 0,
581 s->order_fallback, s->order);
582 }
583 else
584 printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
585 s->name, s->objects, s->object_size, size_str, dist_str,
586 s->objs_per_slab, s->order,
587 s->slabs ? (s->partial * 100) / s->slabs : 100,
588 s->slabs ? (s->objects * s->object_size * 100) /
589 (s->slabs * (page_size << s->order)) : 100,
590 flags);
591}
592
593/*
594 * Analyze debug options. Return false if something is amiss.
595 */
596static int debug_opt_scan(char *opt)
597{
598 if (!opt || !opt[0] || strcmp(opt, "-") == 0)
599 return 1;
600
601 if (strcasecmp(opt, "a") == 0) {
602 sanity = 1;
603 poison = 1;
604 redzone = 1;
605 tracking = 1;
606 return 1;
607 }
608
609 for ( ; *opt; opt++)
610 switch (*opt) {
611 case 'F' : case 'f':
612 if (sanity)
613 return 0;
614 sanity = 1;
615 break;
616 case 'P' : case 'p':
617 if (poison)
618 return 0;
619 poison = 1;
620 break;
621
622 case 'Z' : case 'z':
623 if (redzone)
624 return 0;
625 redzone = 1;
626 break;
627
628 case 'U' : case 'u':
629 if (tracking)
630 return 0;
631 tracking = 1;
632 break;
633
634 case 'T' : case 't':
635 if (tracing)
636 return 0;
637 tracing = 1;
638 break;
639 default:
640 return 0;
641 }
642 return 1;
643}
644
645static int slab_empty(struct slabinfo *s)
646{
647 if (s->objects > 0)
648 return 0;
649
650 /*
651 * We may still have slabs even if there are no objects. Shrinking will
652 * remove them.
653 */
654 if (s->slabs != 0)
655 set_obj(s, "shrink", 1);
656
657 return 1;
658}
659
660static void slab_debug(struct slabinfo *s)
661{
662 if (strcmp(s->name, "*") == 0)
663 return;
664
665 if (sanity && !s->sanity_checks) {
666 set_obj(s, "sanity", 1);
667 }
668 if (!sanity && s->sanity_checks) {
669 if (slab_empty(s))
670 set_obj(s, "sanity", 0);
671 else
672 fprintf(stderr, "%s not empty cannot disable sanity checks\n", s->name);
673 }
674 if (redzone && !s->red_zone) {
675 if (slab_empty(s))
676 set_obj(s, "red_zone", 1);
677 else
678 fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
679 }
680 if (!redzone && s->red_zone) {
681 if (slab_empty(s))
682 set_obj(s, "red_zone", 0);
683 else
684 fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
685 }
686 if (poison && !s->poison) {
687 if (slab_empty(s))
688 set_obj(s, "poison", 1);
689 else
690 fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
691 }
692 if (!poison && s->poison) {
693 if (slab_empty(s))
694 set_obj(s, "poison", 0);
695 else
696 fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
697 }
698 if (tracking && !s->store_user) {
699 if (slab_empty(s))
700 set_obj(s, "store_user", 1);
701 else
702 fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
703 }
704 if (!tracking && s->store_user) {
705 if (slab_empty(s))
706 set_obj(s, "store_user", 0);
707 else
708 fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
709 }
710 if (tracing && !s->trace) {
711 if (slabs == 1)
712 set_obj(s, "trace", 1);
713 else
714 fprintf(stderr, "%s can only enable trace for one slab at a time\n", s->name);
715 }
716 if (!tracing && s->trace)
717 set_obj(s, "trace", 1);
718}
719
720static void totals(void)
721{
722 struct slabinfo *s;
723
724 int used_slabs = 0;
725 char b1[20], b2[20], b3[20], b4[20];
726 unsigned long long max = 1ULL << 63;
727
728 /* Object size */
729 unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
730
731 /* Number of partial slabs in a slabcache */
732 unsigned long long min_partial = max, max_partial = 0,
733 avg_partial, total_partial = 0;
734
735 /* Number of slabs in a slab cache */
736 unsigned long long min_slabs = max, max_slabs = 0,
737 avg_slabs, total_slabs = 0;
738
739 /* Size of the whole slab */
740 unsigned long long min_size = max, max_size = 0,
741 avg_size, total_size = 0;
742
743 /* Bytes used for object storage in a slab */
744 unsigned long long min_used = max, max_used = 0,
745 avg_used, total_used = 0;
746
747 /* Waste: Bytes used for alignment and padding */
748 unsigned long long min_waste = max, max_waste = 0,
749 avg_waste, total_waste = 0;
750 /* Number of objects in a slab */
751 unsigned long long min_objects = max, max_objects = 0,
752 avg_objects, total_objects = 0;
753 /* Waste per object */
754 unsigned long long min_objwaste = max,
755 max_objwaste = 0, avg_objwaste,
756 total_objwaste = 0;
757
758 /* Memory per object */
759 unsigned long long min_memobj = max,
760 max_memobj = 0, avg_memobj,
761 total_objsize = 0;
762
763 /* Percentage of partial slabs per slab */
764 unsigned long min_ppart = 100, max_ppart = 0,
765 avg_ppart, total_ppart = 0;
766
767 /* Number of objects in partial slabs */
768 unsigned long min_partobj = max, max_partobj = 0,
769 avg_partobj, total_partobj = 0;
770
771 /* Percentage of partial objects of all objects in a slab */
772 unsigned long min_ppartobj = 100, max_ppartobj = 0,
773 avg_ppartobj, total_ppartobj = 0;
774
775
776 for (s = slabinfo; s < slabinfo + slabs; s++) {
777 unsigned long long size;
778 unsigned long used;
779 unsigned long long wasted;
780 unsigned long long objwaste;
781 unsigned long percentage_partial_slabs;
782 unsigned long percentage_partial_objs;
783
784 if (!s->slabs || !s->objects)
785 continue;
786
787 used_slabs++;
788
789 size = slab_size(s);
790 used = s->objects * s->object_size;
791 wasted = size - used;
792 objwaste = s->slab_size - s->object_size;
793
794 percentage_partial_slabs = s->partial * 100 / s->slabs;
795 if (percentage_partial_slabs > 100)
796 percentage_partial_slabs = 100;
797
798 percentage_partial_objs = s->objects_partial * 100
799 / s->objects;
800
801 if (percentage_partial_objs > 100)
802 percentage_partial_objs = 100;
803
804 if (s->object_size < min_objsize)
805 min_objsize = s->object_size;
806 if (s->partial < min_partial)
807 min_partial = s->partial;
808 if (s->slabs < min_slabs)
809 min_slabs = s->slabs;
810 if (size < min_size)
811 min_size = size;
812 if (wasted < min_waste)
813 min_waste = wasted;
814 if (objwaste < min_objwaste)
815 min_objwaste = objwaste;
816 if (s->objects < min_objects)
817 min_objects = s->objects;
818 if (used < min_used)
819 min_used = used;
820 if (s->objects_partial < min_partobj)
821 min_partobj = s->objects_partial;
822 if (percentage_partial_slabs < min_ppart)
823 min_ppart = percentage_partial_slabs;
824 if (percentage_partial_objs < min_ppartobj)
825 min_ppartobj = percentage_partial_objs;
826 if (s->slab_size < min_memobj)
827 min_memobj = s->slab_size;
828
829 if (s->object_size > max_objsize)
830 max_objsize = s->object_size;
831 if (s->partial > max_partial)
832 max_partial = s->partial;
833 if (s->slabs > max_slabs)
834 max_slabs = s->slabs;
835 if (size > max_size)
836 max_size = size;
837 if (wasted > max_waste)
838 max_waste = wasted;
839 if (objwaste > max_objwaste)
840 max_objwaste = objwaste;
841 if (s->objects > max_objects)
842 max_objects = s->objects;
843 if (used > max_used)
844 max_used = used;
845 if (s->objects_partial > max_partobj)
846 max_partobj = s->objects_partial;
847 if (percentage_partial_slabs > max_ppart)
848 max_ppart = percentage_partial_slabs;
849 if (percentage_partial_objs > max_ppartobj)
850 max_ppartobj = percentage_partial_objs;
851 if (s->slab_size > max_memobj)
852 max_memobj = s->slab_size;
853
854 total_partial += s->partial;
855 total_slabs += s->slabs;
856 total_size += size;
857 total_waste += wasted;
858
859 total_objects += s->objects;
860 total_used += used;
861 total_partobj += s->objects_partial;
862 total_ppart += percentage_partial_slabs;
863 total_ppartobj += percentage_partial_objs;
864
865 total_objwaste += s->objects * objwaste;
866 total_objsize += s->objects * s->slab_size;
867 }
868
869 if (!total_objects) {
870 printf("No objects\n");
871 return;
872 }
873 if (!used_slabs) {
874 printf("No slabs\n");
875 return;
876 }
877
878 /* Per slab averages */
879 avg_partial = total_partial / used_slabs;
880 avg_slabs = total_slabs / used_slabs;
881 avg_size = total_size / used_slabs;
882 avg_waste = total_waste / used_slabs;
883
884 avg_objects = total_objects / used_slabs;
885 avg_used = total_used / used_slabs;
886 avg_partobj = total_partobj / used_slabs;
887 avg_ppart = total_ppart / used_slabs;
888 avg_ppartobj = total_ppartobj / used_slabs;
889
890 /* Per object object sizes */
891 avg_objsize = total_used / total_objects;
892 avg_objwaste = total_objwaste / total_objects;
893 avg_partobj = total_partobj * 100 / total_objects;
894 avg_memobj = total_objsize / total_objects;
895
896 printf("Slabcache Totals\n");
897 printf("----------------\n");
898 printf("Slabcaches : %3d Aliases : %3d->%-3d Active: %3d\n",
899 slabs, aliases, alias_targets, used_slabs);
900
901 store_size(b1, total_size);store_size(b2, total_waste);
902 store_size(b3, total_waste * 100 / total_used);
903 printf("Memory used: %6s # Loss : %6s MRatio:%6s%%\n", b1, b2, b3);
904
905 store_size(b1, total_objects);store_size(b2, total_partobj);
906 store_size(b3, total_partobj * 100 / total_objects);
907 printf("# Objects : %6s # PartObj: %6s ORatio:%6s%%\n", b1, b2, b3);
908
909 printf("\n");
910 printf("Per Cache Average Min Max Total\n");
911 printf("---------------------------------------------------------\n");
912
913 store_size(b1, avg_objects);store_size(b2, min_objects);
914 store_size(b3, max_objects);store_size(b4, total_objects);
915 printf("#Objects %10s %10s %10s %10s\n",
916 b1, b2, b3, b4);
917
918 store_size(b1, avg_slabs);store_size(b2, min_slabs);
919 store_size(b3, max_slabs);store_size(b4, total_slabs);
920 printf("#Slabs %10s %10s %10s %10s\n",
921 b1, b2, b3, b4);
922
923 store_size(b1, avg_partial);store_size(b2, min_partial);
924 store_size(b3, max_partial);store_size(b4, total_partial);
925 printf("#PartSlab %10s %10s %10s %10s\n",
926 b1, b2, b3, b4);
927 store_size(b1, avg_ppart);store_size(b2, min_ppart);
928 store_size(b3, max_ppart);
929 store_size(b4, total_partial * 100 / total_slabs);
930 printf("%%PartSlab%10s%% %10s%% %10s%% %10s%%\n",
931 b1, b2, b3, b4);
932
933 store_size(b1, avg_partobj);store_size(b2, min_partobj);
934 store_size(b3, max_partobj);
935 store_size(b4, total_partobj);
936 printf("PartObjs %10s %10s %10s %10s\n",
937 b1, b2, b3, b4);
938
939 store_size(b1, avg_ppartobj);store_size(b2, min_ppartobj);
940 store_size(b3, max_ppartobj);
941 store_size(b4, total_partobj * 100 / total_objects);
942 printf("%% PartObj%10s%% %10s%% %10s%% %10s%%\n",
943 b1, b2, b3, b4);
944
945 store_size(b1, avg_size);store_size(b2, min_size);
946 store_size(b3, max_size);store_size(b4, total_size);
947 printf("Memory %10s %10s %10s %10s\n",
948 b1, b2, b3, b4);
949
950 store_size(b1, avg_used);store_size(b2, min_used);
951 store_size(b3, max_used);store_size(b4, total_used);
952 printf("Used %10s %10s %10s %10s\n",
953 b1, b2, b3, b4);
954
955 store_size(b1, avg_waste);store_size(b2, min_waste);
956 store_size(b3, max_waste);store_size(b4, total_waste);
957 printf("Loss %10s %10s %10s %10s\n",
958 b1, b2, b3, b4);
959
960 printf("\n");
961 printf("Per Object Average Min Max\n");
962 printf("---------------------------------------------\n");
963
964 store_size(b1, avg_memobj);store_size(b2, min_memobj);
965 store_size(b3, max_memobj);
966 printf("Memory %10s %10s %10s\n",
967 b1, b2, b3);
968 store_size(b1, avg_objsize);store_size(b2, min_objsize);
969 store_size(b3, max_objsize);
970 printf("User %10s %10s %10s\n",
971 b1, b2, b3);
972
973 store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
974 store_size(b3, max_objwaste);
975 printf("Loss %10s %10s %10s\n",
976 b1, b2, b3);
977}
978
979static void sort_slabs(void)
980{
981 struct slabinfo *s1,*s2;
982
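	/* Simple O(n^2) exchange sort; the key is slab size, activity, or name. */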
983 for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
984 for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
985 int result;
986
987 if (sort_size)
988 result = slab_size(s1) < slab_size(s2);
989 else if (sort_active)
990 result = slab_activity(s1) < slab_activity(s2);
991 else
992 result = strcasecmp(s1->name, s2->name);
993
994 if (show_inverted)
995 result = -result;
996
997 if (result > 0) {
998 struct slabinfo t;
999
1000 memcpy(&t, s1, sizeof(struct slabinfo));
1001 memcpy(s1, s2, sizeof(struct slabinfo));
1002 memcpy(s2, &t, sizeof(struct slabinfo));
1003 }
1004 }
1005 }
1006}
1007
1008static void sort_aliases(void)
1009{
1010 struct aliasinfo *a1,*a2;
1011
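	/* Sort by alias name, or by target cache name when printing the alias view. */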
1012 for (a1 = aliasinfo; a1 < aliasinfo + aliases; a1++) {
1013 for (a2 = a1 + 1; a2 < aliasinfo + aliases; a2++) {
1014 char *n1, *n2;
1015
1016 n1 = a1->name;
1017 n2 = a2->name;
1018 if (show_alias && !show_inverted) {
1019 n1 = a1->ref;
1020 n2 = a2->ref;
1021 }
1022 if (strcasecmp(n1, n2) > 0) {
1023 struct aliasinfo t;
1024
1025 memcpy(&t, a1, sizeof(struct aliasinfo));
1026 memcpy(a1, a2, sizeof(struct aliasinfo));
1027 memcpy(a2, &t, sizeof(struct aliasinfo));
1028 }
1029 }
1030 }
1031}
1032
1033static void link_slabs(void)
1034{
1035 struct aliasinfo *a;
1036 struct slabinfo *s;
1037
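	/* Resolve each alias to its target cache and count the references. */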
1038 for (a = aliasinfo; a < aliasinfo + aliases; a++) {
1039
1040 for (s = slabinfo; s < slabinfo + slabs; s++)
1041 if (strcmp(a->ref, s->name) == 0) {
1042 a->slab = s;
1043 s->refs++;
1044 break;
1045 }
1046 if (s == slabinfo + slabs)
1047 fatal("Unresolved alias %s\n", a->ref);
1048 }
1049}
1050
1051static void alias(void)
1052{
1053 struct aliasinfo *a;
1054 char *active = NULL;
1055
1056 sort_aliases();
1057 link_slabs();
1058
1059 for(a = aliasinfo; a < aliasinfo + aliases; a++) {
1060
1061 if (!show_single_ref && a->slab->refs == 1)
1062 continue;
1063
1064 if (!show_inverted) {
1065 if (active) {
1066 if (strcmp(a->slab->name, active) == 0) {
1067 printf(" %s", a->name);
1068 continue;
1069 }
1070 }
1071 printf("\n%-12s <- %s", a->slab->name, a->name);
1072 active = a->slab->name;
1073 }
1074 else
1075 printf("%-20s -> %s\n", a->name, a->slab->name);
1076 }
1077 if (active)
1078 printf("\n");
1079}
1080
1081
1082static void rename_slabs(void)
1083{
1084 struct slabinfo *s;
1085 struct aliasinfo *a;
1086
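	/* Caches with a ':' prefixed sysfs name are merged caches; show them under an alias. */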
1087 for (s = slabinfo; s < slabinfo + slabs; s++) {
1088 if (*s->name != ':')
1089 continue;
1090
1091 if (s->refs > 1 && !show_first_alias)
1092 continue;
1093
1094 a = find_one_alias(s);
1095
1096 if (a)
1097 s->name = a->name;
1098 else {
1099 s->name = "*";
1100 actual_slabs--;
1101 }
1102 }
1103}
1104
1105static int slab_mismatch(char *slab)
1106{
1107 return regexec(&pattern, slab, 0, NULL, 0);
1108}
1109
1110static void read_slab_dir(void)
1111{
1112 DIR *dir;
1113 struct dirent *de;
1114 struct slabinfo *slab = slabinfo;
1115 struct aliasinfo *alias = aliasinfo;
1116 char *p;
1117 char *t;
1118 int count;
1119
1120 if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
1121 fatal("SYSFS support for SLUB not active\n");
1122
1123 dir = opendir(".");
1124 while ((de = readdir(dir))) {
1125 if (de->d_name[0] == '.' ||
1126 (de->d_name[0] != ':' && slab_mismatch(de->d_name)))
1127 continue;
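		/* Symlinks are cache aliases, directories are the caches themselves. */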
1128 switch (de->d_type) {
1129 case DT_LNK:
1130 alias->name = strdup(de->d_name);
1131 count = readlink(de->d_name, buffer, sizeof(buffer));
1132
1133 if (count < 0)
1134 fatal("Cannot read symlink %s\n", de->d_name);
1135
1136 buffer[count] = 0;
1137 p = buffer + count;
1138 while (p > buffer && p[-1] != '/')
1139 p--;
1140 alias->ref = strdup(p);
1141 alias++;
1142 break;
1143 case DT_DIR:
1144 if (chdir(de->d_name))
1145				fatal("Unable to access slab %s\n", de->d_name);
1146 slab->name = strdup(de->d_name);
1147 slab->alias = 0;
1148 slab->refs = 0;
1149 slab->aliases = get_obj("aliases");
1150 slab->align = get_obj("align");
1151 slab->cache_dma = get_obj("cache_dma");
1152 slab->cpu_slabs = get_obj("cpu_slabs");
1153 slab->destroy_by_rcu = get_obj("destroy_by_rcu");
1154 slab->hwcache_align = get_obj("hwcache_align");
1155 slab->object_size = get_obj("object_size");
1156 slab->objects = get_obj("objects");
1157 slab->objects_partial = get_obj("objects_partial");
1158 slab->objects_total = get_obj("objects_total");
1159 slab->objs_per_slab = get_obj("objs_per_slab");
1160 slab->order = get_obj("order");
1162			slab->partial = get_obj_and_str("partial", &t);
1163 decode_numa_list(slab->numa_partial, t);
1164 free(t);
1165 slab->poison = get_obj("poison");
1166 slab->reclaim_account = get_obj("reclaim_account");
1167 slab->red_zone = get_obj("red_zone");
1168 slab->sanity_checks = get_obj("sanity_checks");
1169 slab->slab_size = get_obj("slab_size");
1170 slab->slabs = get_obj_and_str("slabs", &t);
1171 decode_numa_list(slab->numa, t);
1172 free(t);
1173 slab->store_user = get_obj("store_user");
1174 slab->trace = get_obj("trace");
1175 slab->alloc_fastpath = get_obj("alloc_fastpath");
1176 slab->alloc_slowpath = get_obj("alloc_slowpath");
1177 slab->free_fastpath = get_obj("free_fastpath");
1178 slab->free_slowpath = get_obj("free_slowpath");
1179			slab->free_frozen = get_obj("free_frozen");
1180 slab->free_add_partial = get_obj("free_add_partial");
1181 slab->free_remove_partial = get_obj("free_remove_partial");
1182 slab->alloc_from_partial = get_obj("alloc_from_partial");
1183 slab->alloc_slab = get_obj("alloc_slab");
1184 slab->alloc_refill = get_obj("alloc_refill");
1185 slab->free_slab = get_obj("free_slab");
1186 slab->cpuslab_flush = get_obj("cpuslab_flush");
1187 slab->deactivate_full = get_obj("deactivate_full");
1188 slab->deactivate_empty = get_obj("deactivate_empty");
1189 slab->deactivate_to_head = get_obj("deactivate_to_head");
1190 slab->deactivate_to_tail = get_obj("deactivate_to_tail");
1191 slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
1192 slab->order_fallback = get_obj("order_fallback");
1193 chdir("..");
1194 if (slab->name[0] == ':')
1195 alias_targets++;
1196 slab++;
1197 break;
1198		default:
1199			fatal("Unknown file type %x\n", de->d_type);
1200 }
1201 }
1202 closedir(dir);
1203 slabs = slab - slabinfo;
1204 actual_slabs = slabs;
1205 aliases = alias - aliasinfo;
1206 if (slabs > MAX_SLABS)
1207 fatal("Too many slabs\n");
1208 if (aliases > MAX_ALIASES)
1209 fatal("Too many aliases\n");
1210}
1211
1212static void output_slabs(void)
1213{
1214 struct slabinfo *slab;
1215
1216 for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
1217
1218 if (slab->alias)
1219 continue;
1220
1221
1222 if (show_numa)
1223 slab_numa(slab, 0);
1224 else if (show_track)
1225 show_tracking(slab);
1226 else if (validate)
1227 slab_validate(slab);
1228 else if (shrink)
1229 slab_shrink(slab);
1230 else if (set_debug)
1231 slab_debug(slab);
1232 else if (show_ops)
1233 ops(slab);
1234 else if (show_slab)
1235 slabcache(slab);
1236 else if (show_report)
1237 report(slab);
1238 }
1239}
1240
1241struct option opts[] = {
1242 { "aliases", 0, NULL, 'a' },
1243 { "activity", 0, NULL, 'A' },
1244 { "debug", 2, NULL, 'd' },
1245 { "display-activity", 0, NULL, 'D' },
1246 { "empty", 0, NULL, 'e' },
1247 { "first-alias", 0, NULL, 'f' },
1248 { "help", 0, NULL, 'h' },
1249 { "inverted", 0, NULL, 'i'},
1250 { "numa", 0, NULL, 'n' },
1251 { "ops", 0, NULL, 'o' },
1252 { "report", 0, NULL, 'r' },
1253 { "shrink", 0, NULL, 's' },
1254 { "slabs", 0, NULL, 'l' },
1255 { "track", 0, NULL, 't'},
1256 { "validate", 0, NULL, 'v' },
1257 { "zero", 0, NULL, 'z' },
1258 { "1ref", 0, NULL, '1'},
1259 { NULL, 0, NULL, 0 }
1260};
1261
1262int main(int argc, char *argv[])
1263{
1264 int c;
1265 int err;
1266 char *pattern_source;
1267
1268 page_size = getpagesize();
1269
1270 while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTS",
1271 opts, NULL)) != -1)
1272 switch (c) {
1273 case '1':
1274 show_single_ref = 1;
1275 break;
1276 case 'a':
1277 show_alias = 1;
1278 break;
1279 case 'A':
1280 sort_active = 1;
1281 break;
1282 case 'd':
1283 set_debug = 1;
1284 if (!debug_opt_scan(optarg))
1285 fatal("Invalid debug option '%s'\n", optarg);
1286 break;
1287 case 'D':
1288 show_activity = 1;
1289 break;
1290 case 'e':
1291 show_empty = 1;
1292 break;
1293 case 'f':
1294 show_first_alias = 1;
1295 break;
1296 case 'h':
1297 usage();
1298 return 0;
1299 case 'i':
1300 show_inverted = 1;
1301 break;
1302 case 'n':
1303 show_numa = 1;
1304 break;
1305 case 'o':
1306 show_ops = 1;
1307 break;
1308 case 'r':
1309 show_report = 1;
1310 break;
1311 case 's':
1312 shrink = 1;
1313 break;
1314 case 'l':
1315 show_slab = 1;
1316 break;
1317 case 't':
1318 show_track = 1;
1319 break;
1320 case 'v':
1321 validate = 1;
1322 break;
1323 case 'z':
1324 skip_zero = 0;
1325 break;
1326 case 'T':
1327 show_totals = 1;
1328 break;
1329 case 'S':
1330 sort_size = 1;
1331 break;
1332
1333 default:
1334 fatal("%s: Invalid option '%c'\n", argv[0], optopt);
1335
1336 }
1337
1338 if (!show_slab && !show_alias && !show_track && !show_report
1339 && !validate && !shrink && !set_debug && !show_ops)
1340 show_slab = 1;
1341
1342 if (argc > optind)
1343 pattern_source = argv[optind];
1344 else
1345 pattern_source = ".*";
1346
1347 err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
1348 if (err)
1349 fatal("%s: Invalid pattern '%s' code %d\n",
1350 argv[0], pattern_source, err);
1351 read_slab_dir();
1352 if (show_alias)
1353 alias();
1354 else
1355 if (show_totals)
1356 totals();
1357 else {
1358 link_slabs();
1359 rename_slabs();
1360 sort_slabs();
1361 output_slabs();
1362 }
1363 return 0;
1364}
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
new file mode 100644
index 000000000000..0924aaca3302
--- /dev/null
+++ b/Documentation/vm/transhuge.txt
@@ -0,0 +1,298 @@
1= Transparent Hugepage Support =
2
3== Objective ==
4
5Performance critical computing applications dealing with large memory
6working sets are already running on top of libhugetlbfs and in turn
7hugetlbfs. Transparent Hugepage Support is an alternative means of
8backing virtual memory with huge pages, one that supports the
9automatic promotion and demotion of page sizes and comes without
10the shortcomings of hugetlbfs.
11
12Currently it only works for anonymous memory mappings but in the
13future it can expand to cover the pagecache layer, starting with tmpfs.
14
15Applications run faster for two reasons. The first factor is almost
16completely irrelevant and not of significant interest because it
17also has the downside of requiring larger clear-page and copy-page
18operations in page faults, which is a potentially negative effect.
19The first factor is that a single page fault is taken for each 2M
20virtual region touched by userland (reducing the enter/exit kernel
21frequency by a factor of 512). This only matters the first time the
22memory is accessed in the lifetime of a memory mapping. The second,
23long lasting and much more important factor affects all subsequent
24accesses to the memory for the whole runtime of the application. The
25second factor consists of two components: 1) the TLB miss will run
26faster (especially with virtualization using nested pagetables, but
27almost always on bare metal without virtualization too) and 2) a
28single TLB entry will be mapping a much larger amount of virtual
29memory, in turn reducing the number of TLB misses. With
30virtualization and nested pagetables the TLB can only map the larger
31size if both KVM and the Linux guest are using hugepages, but a
32significant speedup already happens if only one of the two is using
33hugepages, simply because the TLB miss is going to run faster.
35
36== Design ==
37
38- "graceful fallback": mm components which don't have transparent
39 hugepage knowledge fall back to breaking a transparent hugepage and
40 working on the regular pages and their respective regular pmd/pte
41 mappings
42
43- if a hugepage allocation fails because of memory fragmentation,
44 regular pages should be gracefully allocated instead and mixed in
45 the same vma without any failure or significant delay and without
46 userland noticing
47
48- if some task quits and more hugepages become available (either
49 immediately in the buddy or through the VM), guest physical memory
50  backed by regular pages should be relocated to hugepages
51 automatically (with khugepaged)
52
53- it doesn't require memory reservation and in turn it uses hugepages
54 whenever possible (the only possible reservation here is kernelcore=
55  to avoid unmovable pages fragmenting all the memory, but such a tweak
56 is not specific to transparent hugepage support and it's a generic
57 feature that applies to all dynamic high order allocations in the
58 kernel)
59
60- this initial support only offers the feature in the anonymous memory
61 regions but it'd be ideal to move it to tmpfs and the pagecache
62 later
63
64Transparent Hugepage Support maximizes the usefulness of free memory
65compared to the reservation approach of hugetlbfs by allowing all
66unused memory to be used as cache or as other movable (or even
67unmovable) entities. It requires no reservation to keep hugepage
68allocation failures from being noticeable to userland. It allows paging
69and all other advanced VM features to be available on the
70hugepages. It requires no modifications for applications to take
71advantage of it.
72
73Applications however can be further optimized to take advantage of
74this feature, like for example they've been optimized before to avoid
75a flood of mmap system calls for every malloc(4k). Optimizing userland
76is far from mandatory, and khugepaged can already take care of long
77lived page allocations even for hugepage unaware applications that
78deal with large amounts of memory.
79
80In certain cases when hugepages are enabled system wide, applications
81may end up allocating more memory resources. An application may mmap a
82large region but only touch 1 byte of it; in that case a 2M page might
83be allocated instead of a 4k page for no good reason. This is why it's
84possible to disable hugepages system-wide and to only have them inside
85MADV_HUGEPAGE madvise regions.
86
87Embedded systems should enable hugepages only inside madvise regions
88to eliminate any risk of wasting any precious byte of memory, so that
89hugepages can only make them run faster.
90
91Applications that get a lot of benefit from hugepages, and that don't
92risk losing memory by using hugepages, should use
93madvise(MADV_HUGEPAGE) on their critical mmapped regions.
94
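A minimal userland sketch of opting a region in (the 64MB length is an
illustrative value only, error handling is omitted, and MADV_HUGEPAGE
requires kernel headers that define it):

#include <sys/mman.h>

	size_t len = 64UL * 1024 * 1024;	/* example working-set size */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf != MAP_FAILED)
		/* ask the kernel to back this region with hugepages */
		madvise(buf, len, MADV_HUGEPAGE);
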
95== sysfs ==
96
97Transparent Hugepage Support can be entirely disabled (mostly for
98debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
99avoid the risk of consuming more memory resources) or enabled system
100wide. This can be achieved with one of:
101
102echo always >/sys/kernel/mm/transparent_hugepage/enabled
103echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
104echo never >/sys/kernel/mm/transparent_hugepage/enabled
105
106It's also possible to limit the VM's defrag efforts (used to generate
107hugepages when none are immediately free) to madvise regions only, or
108to never try to defrag memory and simply fall back to regular pages
109unless hugepages are immediately available. Clearly, if we spend CPU
110time to defrag memory, we would expect to gain even more by using
111hugepages later instead of regular pages. This isn't always
112guaranteed, but it is more likely if the allocation is for a
113MADV_HUGEPAGE region.
114
115echo always >/sys/kernel/mm/transparent_hugepage/defrag
116echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
117echo never >/sys/kernel/mm/transparent_hugepage/defrag
118
119khugepaged will be automatically started when
120transparent_hugepage/enabled is set to "always" or "madvise", and it'll
121be automatically shut down if it's set to "never".
122
123khugepaged usually runs at low frequency, so while one may not want to
124invoke defrag algorithms synchronously during page faults, it
125should be worth invoking defrag at least in khugepaged. However it's
126also possible to disable defrag in khugepaged:
127
128echo yes >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
129echo no >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
130
131You can also control how many pages khugepaged should scan at each
132pass:
133
134/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
135
136and how many milliseconds to wait in khugepaged between each pass (you
137can set this to 0 to run khugepaged at 100% utilization of one core):
138
139/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
140
141and how many milliseconds to wait in khugepaged if there's a hugepage
142allocation failure to throttle the next allocation attempt:
143
144/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
145
146The khugepaged progress can be seen in the number of pages collapsed:
147
148/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
149
150and in the number of full scans performed:
151
152/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
153
154== Boot parameter ==
155
156You can change the sysfs boot time defaults of Transparent Hugepage
157Support by passing the parameter "transparent_hugepage=always" or
158"transparent_hugepage=madvise" or "transparent_hugepage=never"
159(without "") to the kernel command line.
160
161== Need of application restart ==
162
163The transparent_hugepage/enabled values only affect future
164behavior. So to make them effective you need to restart any
165application that could have been using hugepages. This also applies to
166the regions registered in khugepaged.
167
168== get_user_pages and follow_page ==
169
170get_user_pages and follow_page, if run on a hugepage, will return the
171head or tail pages as usual (exactly as they would do on
172hugetlbfs). Most gup users will only care about the actual physical
173address of the page and its temporary pinning to be released after the
174I/O is complete, so they won't ever notice the fact that the page is
175huge. But if any driver is going to poke at the page structure of the
176tail page (like checking page->mapping or other bits that are relevant
177for the head page and not the tail page), it should be updated to
178check the head page instead (while serializing properly against
179split_huge_page() so that the head and tail pages cannot disappear
180from under it; see the futex code for an example of that, and note
181that hugetlbfs also needed special handling in futex code for similar reasons).
182
183NOTE: these aren't new constraints to the GUP API, and they match the
184same constraints that apply to hugetlbfs too, so any driver capable
185of handling GUP on hugetlbfs will also work fine on transparent
186hugepage backed mappings.
187
188In case you can't handle compound pages if they're returned by
189follow_page, the FOLL_SPLIT bit can be specified as a parameter to
190follow_page, so that it will split the hugepages before returning
191them. Migration, for example, passes FOLL_SPLIT as a parameter to
192follow_page because it's not hugepage aware and in fact it can't work
193at all on hugetlbfs (but it works fine on transparent
194hugepages thanks to FOLL_SPLIT). Migration simply can't deal with
195hugepages being returned (it doesn't just check the pfn of the
196page and pin it during the copy; it expects to migrate the
197memory in regular page sizes and with regular pte/pmd mappings).
198
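A sketch of that usage, assuming the in-kernel follow_page() prototype
of this series (follow_page(vma, address, flags)), with mm, vma and
address set up by the caller and all error handling omitted:

	struct page *page;

	down_read(&mm->mmap_sem);
	/* FOLL_SPLIT: split any transparent hugepage before returning it */
	page = follow_page(vma, address, FOLL_GET | FOLL_SPLIT);
	up_read(&mm->mmap_sem);

	if (page) {
		/* always a regular 4k page here, never a THP head/tail */
		/* ... use the page ... */
		put_page(page);
	}
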
199== Optimizing the applications ==
200
201To be guaranteed that the kernel will map a 2M page immediately in any
202memory region, the mmap region has to be naturally hugepage
203aligned. posix_memalign() can provide that guarantee.
204
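For instance (a sketch only; the 2M size is an assumption matching
x86-64, other architectures use different hugepage sizes):

#include <stdlib.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumed hugepage size */

	void *buf;

	/* hugepage aligned: the first fault in each 2M chunk can be
	   mapped by a huge pmd right away */
	if (posix_memalign(&buf, HPAGE_SIZE, 16 * HPAGE_SIZE) != 0)
		buf = NULL;
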
205== Hugetlbfs ==
206
207You can use hugetlbfs on a kernel that has transparent hugepage
208support enabled just fine as always. No difference can be noted in
209hugetlbfs other than there will be less overall fragmentation. All
210usual features belonging to hugetlbfs are preserved and
211unaffected. libhugetlbfs will also work fine as usual.
212
213== Graceful fallback ==
214
215Code walking pagetables but unaware of huge pmds can simply call
216split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
217pmd_offset. It's trivial to make the code transparent hugepage aware
218by just grepping for "pmd_offset" and adding split_huge_page_pmd where
219missing after pmd_offset returns the pmd. Thanks to the graceful
220fallback design, with a one-liner change, you can avoid writing
221hundreds if not thousands of lines of complex code to make your code
222hugepage aware.
223
224If you're not walking pagetables but you run into a physical hugepage
225that you can't handle natively in your code, you can split it by
226calling split_huge_page(page). This is what the Linux VM does before
227it tries to swap out the hugepage, for example.
228
229Example to make mremap.c transparent hugepage aware with a one liner
230change:
231
232diff --git a/mm/mremap.c b/mm/mremap.c
233--- a/mm/mremap.c
234+++ b/mm/mremap.c
235@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
236 return NULL;
237
238 pmd = pmd_offset(pud, addr);
239+ split_huge_page_pmd(mm, pmd);
240 if (pmd_none_or_clear_bad(pmd))
241 return NULL;
242
243== Locking in hugepage aware code ==
244
245We want as much code as possible to be hugepage aware, as calling
246split_huge_page() or split_huge_page_pmd() has a cost.
247
248To make pagetable walks huge pmd aware, all you need to do is to call
249pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
250mmap_sem in read (or write) mode to be sure a huge pmd cannot be
251created from under you by khugepaged (khugepaged's collapse_huge_page
252takes the mmap_sem in write mode in addition to the anon_vma lock). If
253pmd_trans_huge returns false, you just fall back to the old code
254paths. If instead pmd_trans_huge returns true, you have to take the
255mm->page_table_lock and re-run pmd_trans_huge. Taking the
256page_table_lock will prevent the huge pmd from being converted into a
257regular pmd from under you (split_huge_page can run in parallel to the
258pagetable walk). If the second pmd_trans_huge returns false, you
259should just drop the page_table_lock and fallback to the old code as
260before. Otherwise you should run pmd_trans_splitting on the pmd. In
261case pmd_trans_splitting returns true, it means split_huge_page is
262already in the middle of splitting the page. So if pmd_trans_splitting
263returns true it's enough to drop the page_table_lock and call
264wait_split_huge_page and then fall back to the old code paths. You are
265guaranteed by the time wait_split_huge_page returns, the pmd isn't
266huge anymore. If pmd_trans_splitting returns false, you can proceed to
267process the huge pmd and the hugepage natively. Once finished you can
268drop the page_table_lock.
269
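Condensed into code, the above protocol looks roughly like this sketch
(the helpers are the ones named above; "handle the huge pmd" stands for
whatever the caller actually needs to do):

	pmd = pmd_offset(pud, address);
	if (pmd_trans_huge(*pmd)) {		/* mmap_sem already held */
		spin_lock(&mm->page_table_lock);
		if (pmd_trans_huge(*pmd)) {
			if (pmd_trans_splitting(*pmd)) {
				spin_unlock(&mm->page_table_lock);
				/* wait for the split; afterwards the pmd is
				   guaranteed not to be huge anymore */
				wait_split_huge_page(vma->anon_vma, pmd);
			} else {
				/* stable huge pmd: handle the huge pmd */
				spin_unlock(&mm->page_table_lock);
				return;
			}
		} else {
			/* raced with a split: not huge anymore */
			spin_unlock(&mm->page_table_lock);
		}
	}
	/* fall back to the regular pte walk */
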
270== compound_lock, get_user_pages and put_page ==
271
272split_huge_page internally has to distribute the refcounts in the head
273page to the tail pages before clearing all PG_head/tail bits from the
274page structures. It can do that easily for refcounts taken by huge pmd
275mappings. But the GUP API as created by hugetlbfs (that returns head
276and tail pages if running get_user_pages on an address backed by any
277hugepage), requires the refcount to be accounted on the tail pages and
278not only in the head pages, if we want to be able to run
279split_huge_page while there are gup pins established on any tail
280page. Not being able to run split_huge_page if there's any gup pin
281on any tail page would mean having to split all hugepages upfront in
282get_user_pages which is unacceptable as too many gup users are
283performance critical and they must work natively on hugepages like
284they work natively on hugetlbfs already (hugetlbfs is simpler because
285hugetlbfs pages cannot be split, so there is no requirement to account
286the pins on the tail pages for hugetlbfs). If we didn't account
287the gup refcounts on the tail pages during gup, we wouldn't know
288anymore which tail page is pinned by gup and which is not while we run
289split_huge_page. But we still have to add the gup pin to the head page
290too, to know when we can free the compound page in case it's never
291split during its lifetime. That requires changing not just
292get_page, but put_page as well so that when put_page runs on a tail
293page (and only on a tail page) it will find its respective head page,
294and then it will decrease the head page refcount in addition to the
295tail page refcount. To obtain a head page reliably and to decrease its
296refcount without race conditions, put_page has to serialize against
297__split_huge_page_refcount using a special per-page lock called
298compound_lock.
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
index 2d70d0d95108..97bae3c576c2 100644
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -84,8 +84,7 @@ indicate that the page is being managed on the unevictable list.
84 84
85The PG_unevictable flag is analogous to, and mutually exclusive with, the 85The PG_unevictable flag is analogous to, and mutually exclusive with, the
86PG_active flag in that it indicates on which LRU list a page resides when 86PG_active flag in that it indicates on which LRU list a page resides when
87PG_lru is set. The unevictable list is compile-time configurable based on the 87PG_lru is set.
88UNEVICTABLE_LRU Kconfig option.
89 88
90The Unevictable LRU infrastructure maintains unevictable pages on an additional 89The Unevictable LRU infrastructure maintains unevictable pages on an additional
91LRU list for a few reasons: 90LRU list for a few reasons: