author     Glenn Elliott <gelliott@cs.unc.edu>  2012-03-04 19:47:13 -0500
committer  Glenn Elliott <gelliott@cs.unc.edu>  2012-03-04 19:47:13 -0500
commit     c71c03bda1e86c9d5198c5d83f712e695c4f2a1e (patch)
tree       ecb166cb3e2b7e2adb3b5e292245fefd23381ac8 /Documentation/vm
parent     ea53c912f8a86a8567697115b6a0d8152beee5c8 (diff)
parent     6a00f206debf8a5c8899055726ad127dbeeed098 (diff)

Merge branch 'mpi-master' into wip-k-fmlp

Conflicts:
	litmus/sched_cedf.c
Diffstat (limited to 'Documentation/vm')
-rw-r--r--  Documentation/vm/Makefile               |    2
-rw-r--r--  Documentation/vm/active_mm.txt          |    2
-rw-r--r--  Documentation/vm/cleancache.txt         |  278
-rw-r--r--  Documentation/vm/highmem.txt            |  162
-rw-r--r--  Documentation/vm/hugetlbpage.txt        |    2
-rw-r--r--  Documentation/vm/hwpoison.txt           |    6
-rw-r--r--  Documentation/vm/locking                |    2
-rw-r--r--  Documentation/vm/numa_memory_policy.txt |    2
-rw-r--r--  Documentation/vm/overcommit-accounting  |    2
-rw-r--r--  Documentation/vm/page-types.c           |  105
-rw-r--r--  Documentation/vm/slabinfo.c             | 1364
-rw-r--r--  Documentation/vm/transhuge.txt          |  298
-rw-r--r--  Documentation/vm/unevictable-lru.txt    |    3
13 files changed, 849 insertions, 1379 deletions
diff --git a/Documentation/vm/Makefile b/Documentation/vm/Makefile
index 9dcff328b964..3fa4d0668864 100644
--- a/Documentation/vm/Makefile
+++ b/Documentation/vm/Makefile
@@ -2,7 +2,7 @@
 obj- := dummy.o

 # List of programs to build
-hostprogs-y := slabinfo page-types hugepage-mmap hugepage-shm map_hugetlb
+hostprogs-y := page-types hugepage-mmap hugepage-shm map_hugetlb

 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
diff --git a/Documentation/vm/active_mm.txt b/Documentation/vm/active_mm.txt
index 4ee1f643d897..dbf45817405f 100644
--- a/Documentation/vm/active_mm.txt
+++ b/Documentation/vm/active_mm.txt
@@ -74,7 +74,7 @@ we have a user context", and is generally done by the page fault handler
 and things like that).

 Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago,
-because it slightly changes the interfaces to accomodate the alpha (who
+because it slightly changes the interfaces to accommodate the alpha (who
 would have thought it, but the alpha actually ends up having one of the
 ugliest context switch codes - unlike the other architectures where the MM
 and register state is separate, the alpha PALcode joins the two, and you
diff --git a/Documentation/vm/cleancache.txt b/Documentation/vm/cleancache.txt
new file mode 100644
index 000000000000..36c367c73084
--- /dev/null
+++ b/Documentation/vm/cleancache.txt
@@ -0,0 +1,278 @@
1 | MOTIVATION | ||
2 | |||
3 | Cleancache is a new optional feature provided by the VFS layer that | ||
4 | potentially dramatically increases page cache effectiveness for | ||
5 | many workloads in many environments at a negligible cost. | ||
6 | |||
7 | Cleancache can be thought of as a page-granularity victim cache for clean | ||
8 | pages that the kernel's pageframe replacement algorithm (PFRA) would like | ||
9 | to keep around, but can't since there isn't enough memory. So when the | ||
10 | PFRA "evicts" a page, it first attempts to use cleancache code to | ||
11 | put the data contained in that page into "transcendent memory", memory | ||
12 | that is not directly accessible or addressable by the kernel and is | ||
13 | of unknown and possibly time-varying size. | ||
14 | |||
15 | Later, when a cleancache-enabled filesystem wishes to access a page | ||
16 | in a file on disk, it first checks cleancache to see if it already | ||
17 | contains it; if it does, the page of data is copied into the kernel | ||
18 | and a disk access is avoided. | ||
19 | |||
20 | Transcendent memory "drivers" for cleancache are currently implemented | ||
21 | in Xen (using hypervisor memory) and zcache (using in-kernel compressed | ||
22 | memory) and other implementations are in development. | ||
23 | |||
24 | FAQs are included below. | ||
25 | |||
26 | IMPLEMENTATION OVERVIEW | ||
27 | |||
28 | A cleancache "backend" that provides transcendent memory registers itself | ||
29 | to the kernel's cleancache "frontend" by calling cleancache_register_ops, | ||
30 | passing a pointer to a cleancache_ops structure with funcs set appropriately. | ||
31 | Note that cleancache_register_ops returns the previous settings so that | ||
32 | chaining can be performed if desired. The functions provided must conform to | ||
33 | certain semantics as follows: | ||
34 | |||
35 | Most important, cleancache is "ephemeral". Pages which are copied into | ||
36 | cleancache have an indefinite lifetime which is completely unknowable | ||
37 | by the kernel and so may or may not still be in cleancache at any later time. | ||
38 | Thus, as its name implies, cleancache is not suitable for dirty pages. | ||
39 | Cleancache has complete discretion over what pages to preserve and what | ||
40 | pages to discard and when. | ||
41 | |||
42 | Mounting a cleancache-enabled filesystem should call "init_fs" to obtain a | ||
43 | pool id which, if positive, must be saved in the filesystem's superblock; | ||
44 | a negative return value indicates failure. A "put_page" will copy a | ||
45 | (presumably about-to-be-evicted) page into cleancache and associate it with | ||
46 | the pool id, a file key, and a page index into the file. (The combination | ||
47 | of a pool id, a file key, and an index is sometimes called a "handle".) | ||
48 | A "get_page" will copy the page, if found, from cleancache into kernel memory. | ||
49 | A "flush_page" will ensure the page no longer is present in cleancache; | ||
50 | a "flush_inode" will flush all pages associated with the specified file; | ||
51 | and, when a filesystem is unmounted, a "flush_fs" will flush all pages in | ||
52 | all files specified by the given pool id and also surrender the pool id. | ||
53 | |||
54 | An "init_shared_fs", like init_fs, obtains a pool id but tells cleancache | ||
55 | to treat the pool as shared using a 128-bit UUID as a key. On systems | ||
56 | that may run multiple kernels (such as hard partitioned or virtualized | ||
57 | systems) that may share a clustered filesystem, and where cleancache | ||
58 | may be shared among those kernels, calls to init_shared_fs that specify the | ||
59 | same UUID will receive the same pool id, thus allowing the pages to | ||
60 | be shared. Note that any security requirements must be imposed outside | ||
61 | of the kernel (e.g. by "tools" that control cleancache). Or a | ||
62 | cleancache implementation can simply disable shared_init by always | ||
63 | returning a negative value. | ||
64 | |||
65 | If a get_page is successful on a non-shared pool, the page is flushed (thus | ||
66 | making cleancache an "exclusive" cache). On a shared pool, the page | ||
67 | is NOT flushed on a successful get_page so that it remains accessible to | ||
68 | other sharers. The kernel is responsible for ensuring coherency between | ||
69 | cleancache (shared or not), the page cache, and the filesystem, using | ||
70 | cleancache flush operations as required. | ||
71 | |||
72 | Note that cleancache must enforce put-put-get coherency and get-get | ||
73 | coherency. For the former, if two puts are made to the same handle but | ||
74 | with different data, say AAA by the first put and BBB by the second, a | ||
75 | subsequent get can never return the stale data (AAA). For get-get coherency, | ||
76 | if a get for a given handle fails, subsequent gets for that handle will | ||
77 | never succeed unless preceded by a successful put with that handle. | ||
78 | |||
79 | Last, cleancache provides no SMP serialization guarantees; if two | ||
80 | different Linux threads are simultaneously putting and flushing a page | ||
81 | with the same handle, the results are indeterminate. Callers must | ||
82 | lock the page to ensure serial behavior. | ||
83 | |||
84 | CLEANCACHE PERFORMANCE METRICS | ||
85 | |||
86 | Cleancache monitoring is done by sysfs files in the | ||
87 | /sys/kernel/mm/cleancache directory. The effectiveness of cleancache | ||
88 | can be measured (across all filesystems) with: | ||
89 | |||
90 | succ_gets - number of gets that were successful | ||
91 | failed_gets - number of gets that failed | ||
92 | puts - number of puts attempted (all "succeed") | ||
93 | flushes - number of flushes attempted | ||
94 | |||
95 | A backend implementation may provide additional metrics. | ||
96 | |||
97 | FAQ | ||
98 | |||
99 | 1) Where's the value? (Andrew Morton) | ||
100 | |||
101 | Cleancache provides a significant performance benefit to many workloads | ||
102 | in many environments with negligible overhead by improving the | ||
103 | effectiveness of the pagecache. Clean pagecache pages are | ||
104 | saved in transcendent memory (RAM that is otherwise not directly | ||
105 | addressable to the kernel); fetching those pages later avoids "refaults" | ||
106 | and thus disk reads. | ||
107 | |||
108 | Cleancache (and its sister code "frontswap") provide interfaces for | ||
109 | this transcendent memory (aka "tmem"), which conceptually lies between | ||
110 | fast kernel-directly-addressable RAM and slower DMA/asynchronous devices. | ||
111 | Disallowing direct kernel or userland reads/writes to tmem | ||
112 | is ideal when data is transformed to a different form and size (such | ||
113 | as with compression) or secretly moved (as might be useful for write- | ||
114 | balancing for some RAM-like devices). Evicted page-cache pages (and | ||
115 | swap pages) are a great use for this kind of slower-than-RAM-but-much- | ||
116 | faster-than-disk transcendent memory, and the cleancache (and frontswap) | ||
117 | "page-object-oriented" specification provides a nice way to read and | ||
118 | write -- and indirectly "name" -- the pages. | ||
119 | |||
120 | In the virtual case, the whole point of virtualization is to statistically | ||
121 | multiplex physical resources across the varying demands of multiple | ||
122 | virtual machines. This is really hard to do with RAM and efforts to | ||
123 | do it well with no kernel change have essentially failed (except in some | ||
124 | well-publicized special-case workloads). Cleancache -- and frontswap -- | ||
125 | with a fairly small impact on the kernel, provide a huge amount | ||
126 | of flexibility for more dynamic, flexible RAM multiplexing. | ||
127 | Specifically, the Xen Transcendent Memory backend allows otherwise | ||
128 | "fallow" hypervisor-owned RAM to not only be "time-shared" between multiple | ||
129 | virtual machines, but the pages can be compressed and deduplicated to | ||
130 | optimize RAM utilization. And when guest OS's are induced to surrender | ||
131 | underutilized RAM (e.g. with "self-ballooning"), page cache pages | ||
132 | are the first to go, and cleancache allows those pages to be | ||
133 | saved and reclaimed if overall host system memory conditions allow. | ||
134 | |||
135 | And the identical interface used for cleancache can be used in | ||
136 | physical systems as well. The zcache driver acts as a memory-hungry | ||
137 | device that stores pages of data in a compressed state. And | ||
138 | the proposed "RAMster" driver shares RAM across multiple physical | ||
139 | systems. | ||
140 | |||
141 | 2) Why does cleancache have its sticky fingers so deep inside the | ||
142 | filesystems and VFS? (Andrew Morton and Christoph Hellwig) | ||
143 | |||
144 | The core hooks for cleancache in VFS are in most cases a single line | ||
145 | and the minimum set are placed precisely where needed to maintain | ||
146 | coherency (via cleancache_flush operations) between cleancache, | ||
147 | the page cache, and disk. All hooks compile into nothingness if | ||
148 | cleancache is config'ed off and turn into a function-pointer- | ||
149 | compare-to-NULL if config'ed on but no backend claims the ops | ||
150 | functions, or to a compare-struct-element-to-negative if a | ||
151 | backend claims the ops functions but a filesystem doesn't enable | ||
152 | cleancache. | ||
153 | |||
154 | Some filesystems are built entirely on top of VFS and the hooks | ||
155 | in VFS are sufficient, so don't require an "init_fs" hook; the | ||
156 | initial implementation of cleancache didn't provide this hook. | ||
157 | But for some filesystems (such as btrfs), the VFS hooks are | ||
158 | incomplete and one or more hooks in fs-specific code are required. | ||
159 | And for some other filesystems, such as tmpfs, cleancache may | ||
160 | be counterproductive. So it seemed prudent to require a filesystem | ||
161 | to "opt in" to use cleancache, which requires adding a hook in | ||
162 | each filesystem. Not all filesystems are supported by cleancache | ||
163 | only because they haven't been tested. The existing set should | ||
164 | be sufficient to validate the concept, the opt-in approach means | ||
165 | that untested filesystems are not affected, and the hooks in the | ||
166 | existing filesystems should make it very easy to add more | ||
167 | filesystems in the future. | ||
168 | |||
169 | The total impact of the hooks to existing fs and mm files is only | ||
170 | about 40 lines added (not counting comments and blank lines). | ||
171 | |||
172 | 3) Why not make cleancache asynchronous and batched so it can | ||
173 | more easily interface with real devices with DMA instead | ||
174 | of copying each individual page? (Minchan Kim) | ||
175 | |||
176 | The one-page-at-a-time copy semantics simplifies the implementation | ||
177 | on both the frontend and backend and also allows the backend to | ||
178 | do fancy things on-the-fly like page compression and | ||
179 | page deduplication. And since the data is "gone" (copied into/out | ||
180 | of the pageframe) before the cleancache get/put call returns, | ||
181 | many race conditions and potential coherency issues | ||
182 | are avoided. While the interface seems odd for a "real device" | ||
183 | or for real kernel-addressable RAM, it makes perfect sense for | ||
184 | transcendent memory. | ||
185 | |||
186 | 4) Why is non-shared cleancache "exclusive"? And where is the | ||
187 | page "flushed" after a "get"? (Minchan Kim) | ||
188 | |||
189 | The main reason is to free up space in transcendent memory and | ||
190 | to avoid unnecessary cleancache_flush calls. If you want inclusive, | ||
191 | the page can be "put" immediately following the "get". If | ||
192 | put-after-get for inclusive becomes common, the interface could | ||
193 | be easily extended to add a "get_no_flush" call. | ||
194 | |||
195 | The flush is done by the cleancache backend implementation. | ||
196 | |||
197 | 5) What's the performance impact? | ||
198 | |||
199 | Performance analysis has been presented at OLS'09 and LCA'10. | ||
200 | Briefly, performance gains can be significant on most workloads, | ||
201 | especially when memory pressure is high (e.g. when RAM is | ||
202 | overcommitted in a virtual workload); and because the hooks are | ||
203 | invoked primarily in place of or in addition to a disk read/write, | ||
204 | overhead is negligible even in worst case workloads. Basically | ||
205 | cleancache replaces I/O with memory-copy-CPU-overhead; on older | ||
206 | single-core systems with slow memory-copy speeds, cleancache | ||
207 | has little value, but in newer multicore machines, especially | ||
208 | consolidated/virtualized machines, it has great value. | ||
209 | |||
210 | 6) How do I add cleancache support for filesystem X? (Boaz Harrash) | ||
211 | |||
212 | Filesystems that are well-behaved and conform to certain | ||
213 | restrictions can utilize cleancache simply by making a call to | ||
214 | cleancache_init_fs at mount time. Unusual, misbehaving, or | ||
215 | poorly layered filesystems must either add additional hooks | ||
216 | and/or undergo extensive additional testing... or should just | ||
217 | not enable the optional cleancache. | ||
218 | |||
219 | Some points for a filesystem to consider: | ||
220 | |||
221 | - The FS should be block-device-based (e.g. a ram-based FS such | ||
222 | as tmpfs should not enable cleancache) | ||
223 | - To ensure coherency/correctness, the FS must ensure that all | ||
224 | file removal or truncation operations either go through VFS or | ||
225 | add hooks to do the equivalent cleancache "flush" operations | ||
226 | - To ensure coherency/correctness, either inode numbers must | ||
227 | be unique across the lifetime of the on-disk file OR the | ||
228 | FS must provide an "encode_fh" function. | ||
229 | - The FS must call the VFS superblock alloc and deactivate routines | ||
230 | or add hooks to do the equivalent cleancache calls done there. | ||
231 | - To maximize performance, all pages fetched from the FS should | ||
232 | go through the do_mpage_readpage routine or the FS should add | ||
233 | hooks to do the equivalent (cf. btrfs) | ||
234 | - Currently, the FS blocksize must be the same as PAGESIZE. This | ||
235 | is not an architectural restriction, but no backends currently | ||
236 | support anything different. | ||
237 | - A clustered FS should invoke the "shared_init_fs" cleancache | ||
238 | hook to get best performance for some backends. | ||
239 | |||
240 | 7) Why not use the KVA of the inode as the key? (Christoph Hellwig) | ||
241 | |||
242 | If cleancache would use the inode virtual address instead of | ||
243 | inode/filehandle, the pool id could be eliminated. But, this | ||
244 | won't work because cleancache retains pagecache data pages | ||
245 | persistently even when the inode has been pruned from the | ||
246 | inode unused list, and only flushes the data page if the file | ||
247 | gets removed/truncated. So if cleancache used the inode kva, | ||
248 | there would be potential coherency issues if/when the inode | ||
249 | kva is reused for a different file. Alternately, if cleancache | ||
250 | flushed the pages when the inode kva was freed, much of the value | ||
251 | of cleancache would be lost because the cache of pages in cleancache | ||
252 | is potentially much larger than the kernel pagecache and is most | ||
253 | useful if the pages survive inode cache removal. | ||
254 | |||
255 | 8) Why is a global variable required? | ||
256 | |||
257 | The cleancache_enabled flag is checked in all of the frequently-used | ||
258 | cleancache hooks. The alternative is a function call to check a static | ||
259 | variable. Since cleancache is enabled dynamically at runtime, systems | ||
260 | that don't enable cleancache would suffer thousands (possibly | ||
261 | tens-of-thousands) of unnecessary function calls per second. So the | ||
262 | global variable allows cleancache to be enabled by default at compile | ||
263 | time, but have insignificant performance impact when cleancache remains | ||
264 | disabled at runtime. | ||
265 | |||
266 | 9) Does cleancache work with KVM? | ||
267 | |||
268 | The memory model of KVM is sufficiently different that a cleancache | ||
269 | backend may have less value for KVM. This remains to be tested, | ||
270 | especially in an overcommitted system. | ||
271 | |||
272 | 10) Does cleancache work in userspace? It sounds useful for | ||
273 | memory hungry caches like web browsers. (Jamie Lokier) | ||
274 | |||
275 | No plans yet, though we agree it sounds useful, at least for | ||
276 | apps that bypass the page cache (e.g. O_DIRECT). | ||
277 | |||
278 | Last updated: Dan Magenheimer, April 13 2011 | ||
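For orientation, the registration interface described in the new cleancache.txt above can be exercised with a skeleton backend along these lines. This is a minimal sketch, not code from the commit: the op names follow the text, but the exact prototypes are assumed from the include/linux/cleancache.h of this kernel generation and should be checked against the tree being built.

	#include <linux/module.h>
	#include <linux/cleancache.h>

	/* A do-nothing backend: refuses every pool and stores no pages. */
	static int noop_init_fs(size_t pagesize)
	{
		return -1;	/* negative pool id: cleancache disabled for this fs */
	}

	static int noop_init_shared_fs(char *uuid, size_t pagesize)
	{
		return -1;	/* likewise refuse shared pools */
	}

	static int noop_get_page(int pool, struct cleancache_filekey key,
				 pgoff_t index, struct page *page)
	{
		return -1;	/* never holds the requested page */
	}

	static void noop_put_page(int pool, struct cleancache_filekey key,
				  pgoff_t index, struct page *page)
	{
		/* puts always "succeed"; this backend just drops the data */
	}

	static void noop_flush_page(int pool, struct cleancache_filekey key,
				    pgoff_t index)
	{
	}

	static void noop_flush_inode(int pool, struct cleancache_filekey key)
	{
	}

	static void noop_flush_fs(int pool)
	{
	}

	static struct cleancache_ops noop_cleancache_ops = {
		.init_fs	= noop_init_fs,
		.init_shared_fs	= noop_init_shared_fs,
		.get_page	= noop_get_page,
		.put_page	= noop_put_page,
		.flush_page	= noop_flush_page,
		.flush_inode	= noop_flush_inode,
		.flush_fs	= noop_flush_fs,
	};

	static int __init noop_cleancache_init(void)
	{
		/* cleancache_register_ops() returns the previous ops for chaining */
		struct cleancache_ops old_ops =
			cleancache_register_ops(&noop_cleancache_ops);

		(void)old_ops;	/* a real backend would chain to these */
		return 0;
	}
	module_init(noop_cleancache_init);
	MODULE_LICENSE("GPL");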
diff --git a/Documentation/vm/highmem.txt b/Documentation/vm/highmem.txt
new file mode 100644
index 000000000000..4324d24ffacd
--- /dev/null
+++ b/Documentation/vm/highmem.txt
@@ -0,0 +1,162 @@
1 | |||
2 | ==================== | ||
3 | HIGH MEMORY HANDLING | ||
4 | ==================== | ||
5 | |||
6 | By: Peter Zijlstra <a.p.zijlstra@chello.nl> | ||
7 | |||
8 | Contents: | ||
9 | |||
10 | (*) What is high memory? | ||
11 | |||
12 | (*) Temporary virtual mappings. | ||
13 | |||
14 | (*) Using kmap_atomic. | ||
15 | |||
16 | (*) Cost of temporary mappings. | ||
17 | |||
18 | (*) i386 PAE. | ||
19 | |||
20 | |||
21 | ==================== | ||
22 | WHAT IS HIGH MEMORY? | ||
23 | ==================== | ||
24 | |||
25 | High memory (highmem) is used when the size of physical memory approaches or | ||
26 | exceeds the maximum size of virtual memory. At that point it becomes | ||
27 | impossible for the kernel to keep all of the available physical memory mapped | ||
28 | at all times. This means the kernel needs to start using temporary mappings of | ||
29 | the pieces of physical memory that it wants to access. | ||
30 | |||
31 | The part of (physical) memory not covered by a permanent mapping is what we | ||
32 | refer to as 'highmem'. There are various architecture dependent constraints on | ||
33 | where exactly that border lies. | ||
34 | |||
35 | In the i386 arch, for example, we choose to map the kernel into every process's | ||
36 | VM space so that we don't have to pay the full TLB invalidation costs for | ||
37 | kernel entry/exit. This means the available virtual memory space (4GiB on | ||
38 | i386) has to be divided between user and kernel space. | ||
39 | |||
40 | The traditional split for architectures using this approach is 3:1, 3GiB for | ||
41 | userspace and the top 1GiB for kernel space: | ||
42 | |||
43 | +--------+ 0xffffffff | ||
44 | | Kernel | | ||
45 | +--------+ 0xc0000000 | ||
46 | | | | ||
47 | | User | | ||
48 | | | | ||
49 | +--------+ 0x00000000 | ||
50 | |||
51 | This means that the kernel can at most map 1GiB of physical memory at any one | ||
52 | time, but because we need virtual address space for other things - including | ||
53 | temporary maps to access the rest of the physical memory - the actual direct | ||
54 | map will typically be less (usually around ~896MiB). | ||
55 | |||
56 | Other architectures that have mm context tagged TLBs can have separate kernel | ||
57 | and user maps. Some hardware (like some ARMs), however, have limited virtual | ||
58 | space when they use mm context tags. | ||
59 | |||
60 | |||
61 | ========================== | ||
62 | TEMPORARY VIRTUAL MAPPINGS | ||
63 | ========================== | ||
64 | |||
65 | The kernel contains several ways of creating temporary mappings: | ||
66 | |||
67 | (*) vmap(). This can be used to make a long duration mapping of multiple | ||
68 | physical pages into a contiguous virtual space. It needs global | ||
69 | synchronization to unmap. | ||
70 | |||
71 | (*) kmap(). This permits a short duration mapping of a single page. It needs | ||
72 | global synchronization, but is amortized somewhat. It is also prone to | ||
73 | deadlocks when used in a nested fashion, and so it is not recommended for | ||
74 | new code. | ||
75 | |||
76 | (*) kmap_atomic(). This permits a very short duration mapping of a single | ||
77 | page. Since the mapping is restricted to the CPU that issued it, it | ||
78 | performs well, but the issuing task is therefore required to stay on that | ||
79 | CPU until it has finished, lest some other task displace its mappings. | ||
80 | |||
81 | kmap_atomic() may also be used by interrupt contexts, since it does not | ||
82 | sleep and the caller may not sleep until after kunmap_atomic() is called. | ||
83 | |||
84 | It may be assumed that k[un]map_atomic() won't fail. | ||
85 | |||
86 | |||
87 | ================= | ||
88 | USING KMAP_ATOMIC | ||
89 | ================= | ||
90 | |||
91 | When and where to use kmap_atomic() is straightforward. It is used when code | ||
92 | wants to access the contents of a page that might be allocated from high memory | ||
93 | (see __GFP_HIGHMEM), for example a page in the pagecache. The API has two | ||
94 | functions, and they can be used in a manner similar to the following: | ||
95 | |||
96 | /* Find the page of interest. */ | ||
97 | struct page *page = find_get_page(mapping, offset); | ||
98 | |||
99 | /* Gain access to the contents of that page. */ | ||
100 | void *vaddr = kmap_atomic(page); | ||
101 | |||
102 | /* Do something to the contents of that page. */ | ||
103 | memset(vaddr, 0, PAGE_SIZE); | ||
104 | |||
105 | /* Unmap that page. */ | ||
106 | kunmap_atomic(vaddr); | ||
107 | |||
108 | Note that the kunmap_atomic() call takes the result of the kmap_atomic() call | ||
109 | not the argument. | ||
110 | |||
111 | If you need to map two pages because you want to copy from one page to | ||
112 | another you need to keep the kmap_atomic calls strictly nested, like: | ||
113 | |||
114 | vaddr1 = kmap_atomic(page1); | ||
115 | vaddr2 = kmap_atomic(page2); | ||
116 | |||
117 | memcpy(vaddr1, vaddr2, PAGE_SIZE); | ||
118 | |||
119 | kunmap_atomic(vaddr2); | ||
120 | kunmap_atomic(vaddr1); | ||
121 | |||
122 | |||
123 | ========================== | ||
124 | COST OF TEMPORARY MAPPINGS | ||
125 | ========================== | ||
126 | |||
127 | The cost of creating temporary mappings can be quite high. The arch has to | ||
128 | manipulate the kernel's page tables, the data TLB and/or the MMU's registers. | ||
129 | |||
130 | If CONFIG_HIGHMEM is not set, then the kernel will try and create a mapping | ||
131 | simply with a bit of arithmetic that will convert the page struct address into | ||
132 | a pointer to the page contents rather than juggling mappings about. In such a | ||
133 | case, the unmap operation may be a null operation. | ||
134 | |||
135 | If CONFIG_MMU is not set, then there can be no temporary mappings and no | ||
136 | highmem. In such a case, the arithmetic approach will also be used. | ||
137 | |||
138 | |||
139 | ======== | ||
140 | i386 PAE | ||
141 | ======== | ||
142 | |||
143 | The i386 arch, under some circumstances, will permit you to stick up to 64GiB | ||
144 | of RAM into your 32-bit machine. This has a number of consequences: | ||
145 | |||
146 | (*) Linux needs a page-frame structure for each page in the system and the | ||
147 | pageframes need to live in the permanent mapping, which means: | ||
148 | |||
149 | (*) you can have 896M/sizeof(struct page) page-frames at most; with struct | ||
150 | page being 32-bytes that would end up being something in the order of 112G | ||
151 | worth of pages; the kernel, however, needs to store more than just | ||
152 | page-frames in that memory... | ||
153 | |||
154 | (*) PAE makes your page tables larger - which slows the system down as more | ||
155 | data has to be accessed to traverse in TLB fills and the like. One | ||
156 | advantage is that PAE has more PTE bits and can provide advanced features | ||
157 | like NX and PAT. | ||
158 | |||
159 | The general recommendation is that you don't use more than 8GiB on a 32-bit | ||
160 | machine - although more might work for you and your workload, you're pretty | ||
161 | much on your own - don't expect kernel developers to really care much if things | ||
162 | come apart. | ||
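The examples in the new highmem.txt above cover the kmap_atomic() path; a minimal sketch of the sleepable kmap()/kunmap() variant mentioned there might look like the following. This is illustrative only and not part of the commit; kmap() may sleep, so it is for process context.

	#include <linux/highmem.h>
	#include <linux/mm.h>
	#include <linux/string.h>

	/* Zero a possibly-highmem page using the sleepable kmap() interface. */
	static void zero_page_sleepable(struct page *page)
	{
		void *vaddr = kmap(page);	/* may sleep waiting for a mapping slot */

		memset(vaddr, 0, PAGE_SIZE);
		kunmap(page);			/* unlike kunmap_atomic(), takes the page */
	}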
diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index 457634c1e03e..f8551b3879f8 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -72,7 +72,7 @@ number of huge pages requested. This is the most reliable method of
 allocating huge pages as memory has not yet become fragmented.

 Some platforms support multiple huge page sizes. To allocate huge pages
-of a specific size, one must preceed the huge pages boot command parameters
+of a specific size, one must precede the huge pages boot command parameters
 with a huge page size selection parameter "hugepagesz=<size>". <size> must
 be specified in bytes with optional scale suffix [kKmMgG]. The default huge
 page size may be selected with the "default_hugepagesz=<size>" boot parameter.
diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt
index 12f9ba20ccb7..550068466605 100644
--- a/Documentation/vm/hwpoison.txt
+++ b/Documentation/vm/hwpoison.txt
@@ -129,12 +129,12 @@ Limit injection to pages owned by memgroup. Specified by inode number
 of the memcg.

 Example:
-mkdir /cgroup/hwpoison
+mkdir /sys/fs/cgroup/mem/hwpoison

 usemem -m 100 -s 1000 &
-echo `jobs -p` > /cgroup/hwpoison/tasks
+echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks

-memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ')
+memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
 echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg

 page-types -p `pidof init` --hwpoison # shall do nothing
diff --git a/Documentation/vm/locking b/Documentation/vm/locking
index 25fadb448760..f61228bd6395 100644
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@ in some cases it is not really needed. Eg, vm_start is modified by
 expand_stack(), it is hard to come up with a destructive scenario without
 having the vmlist protection in this case.

-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_mutex and the kmem cache
 c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
 dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
 pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 6690fc34ef6d..4e7da6543424 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -424,7 +424,7 @@ a command line tool, numactl(8), exists that allows one to:

 + set the shared policy for a shared memory segment via mbind(2)

-The numactl(8) tool is packages with the run-time version of the library
+The numactl(8) tool is packaged with the run-time version of the library
 containing the memory policy system call wrappers. Some distributions
 package the headers and compile-time libraries in a separate development
 package.
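The mbind(2) interface referenced in the hunk above can also be driven directly from C through libnuma's numaif.h rather than through numactl(8). A rough sketch, not part of the commit, with the segment size and node mask chosen purely for illustration (link with -lnuma):

	#include <numaif.h>	/* mbind(), MPOL_BIND; shipped with libnuma */
	#include <sys/mman.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		size_t len = 4096 * 1024;		/* 4 MiB, illustrative */
		unsigned long nodemask = 1UL << 0;	/* bind to node 0 */
		void *seg = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

		if (seg == MAP_FAILED) {
			perror("mmap");
			return EXIT_FAILURE;
		}

		/* Set a shared policy for the segment before it is faulted in. */
		if (mbind(seg, len, MPOL_BIND, &nodemask,
			  sizeof(nodemask) * 8, 0)) {
			perror("mbind");
			return EXIT_FAILURE;
		}
		return EXIT_SUCCESS;
	}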
diff --git a/Documentation/vm/overcommit-accounting b/Documentation/vm/overcommit-accounting
index 21c7b1f8f32b..706d7ed9d8d2 100644
--- a/Documentation/vm/overcommit-accounting
+++ b/Documentation/vm/overcommit-accounting
@@ -4,7 +4,7 @@ The Linux kernel supports the following overcommit handling modes
 address space are refused. Used for a typical system. It
 ensures a seriously wild allocation fails while allowing
 overcommit to reduce swap usage. root is allowed to
-allocate slighly more memory in this mode. This is the
+allocate slightly more memory in this mode. This is the
 default.

 1 - Always overcommit. Appropriate for some scientific
diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c
index cc96ee2666f2..7445caa26d05 100644
--- a/Documentation/vm/page-types.c
+++ b/Documentation/vm/page-types.c
@@ -32,8 +32,20 @@
 #include <sys/types.h>
 #include <sys/errno.h>
 #include <sys/fcntl.h>
+#include <sys/mount.h>
+#include <sys/statfs.h>
+#include "../../include/linux/magic.h"


+#ifndef MAX_PATH
+# define MAX_PATH 256
+#endif
+
+#ifndef STR
+# define _STR(x) #x
+# define STR(x) _STR(x)
+#endif
+
 /*
  * pagemap kernel ABI bits
  */
@@ -152,6 +164,12 @@ static const char *page_flag_names[] = {
 };


+static const char *debugfs_known_mountpoints[] = {
+	"/sys/kernel/debug",
+	"/debug",
+	0,
+};
+
 /*
  * data structures
  */
@@ -184,7 +202,7 @@ static int kpageflags_fd;
 static int opt_hwpoison;
 static int opt_unpoison;

-static const char hwpoison_debug_fs[] = "/debug/hwpoison";
+static char hwpoison_debug_fs[MAX_PATH+1];
 static int hwpoison_inject_fd;
 static int hwpoison_forget_fd;

@@ -464,21 +482,100 @@ static uint64_t kpageflags_flags(uint64_t flags)
 	return flags;
 }

+/* verify that a mountpoint is actually a debugfs instance */
+static int debugfs_valid_mountpoint(const char *debugfs)
+{
+	struct statfs st_fs;
+
+	if (statfs(debugfs, &st_fs) < 0)
+		return -ENOENT;
+	else if (st_fs.f_type != (long) DEBUGFS_MAGIC)
+		return -ENOENT;
+
+	return 0;
+}
+
+/* find the path to the mounted debugfs */
+static const char *debugfs_find_mountpoint(void)
+{
+	const char **ptr;
+	char type[100];
+	FILE *fp;
+
+	ptr = debugfs_known_mountpoints;
+	while (*ptr) {
+		if (debugfs_valid_mountpoint(*ptr) == 0) {
+			strcpy(hwpoison_debug_fs, *ptr);
+			return hwpoison_debug_fs;
+		}
+		ptr++;
+	}
+
+	/* give up and parse /proc/mounts */
+	fp = fopen("/proc/mounts", "r");
+	if (fp == NULL)
+		perror("Can't open /proc/mounts for read");
+
+	while (fscanf(fp, "%*s %"
+		      STR(MAX_PATH)
+		      "s %99s %*s %*d %*d\n",
+		      hwpoison_debug_fs, type) == 2) {
+		if (strcmp(type, "debugfs") == 0)
+			break;
+	}
+	fclose(fp);
+
+	if (strcmp(type, "debugfs") != 0)
+		return NULL;
+
+	return hwpoison_debug_fs;
+}
+
+/* mount the debugfs somewhere if it's not mounted */
+
+static void debugfs_mount(void)
+{
+	const char **ptr;
+
+	/* see if it's already mounted */
+	if (debugfs_find_mountpoint())
+		return;
+
+	ptr = debugfs_known_mountpoints;
+	while (*ptr) {
+		if (mount(NULL, *ptr, "debugfs", 0, NULL) == 0) {
+			/* save the mountpoint */
+			strcpy(hwpoison_debug_fs, *ptr);
+			break;
+		}
+		ptr++;
+	}
+
+	if (*ptr == NULL) {
+		perror("mount debugfs");
+		exit(EXIT_FAILURE);
+	}
+}
+
 /*
  * page actions
  */

 static void prepare_hwpoison_fd(void)
 {
-	char buf[100];
+	char buf[MAX_PATH + 1];
+
+	debugfs_mount();

 	if (opt_hwpoison && !hwpoison_inject_fd) {
-		sprintf(buf, "%s/corrupt-pfn", hwpoison_debug_fs);
+		snprintf(buf, MAX_PATH, "%s/hwpoison/corrupt-pfn",
+			hwpoison_debug_fs);
 		hwpoison_inject_fd = checked_open(buf, O_WRONLY);
 	}

 	if (opt_unpoison && !hwpoison_forget_fd) {
-		sprintf(buf, "%s/unpoison-pfn", hwpoison_debug_fs);
+		snprintf(buf, MAX_PATH, "%s/hwpoison/unpoison-pfn",
+			hwpoison_debug_fs);
 		hwpoison_forget_fd = checked_open(buf, O_WRONLY);
 	}
 }
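The detection added above reduces to a statfs(2) check against DEBUGFS_MAGIC plus a /proc/mounts scan. A standalone sketch of the same check, not part of the commit, for probing a path by hand (the hard-coded fallback value is only used if <linux/magic.h> does not provide the constant):

	#include <stdio.h>
	#include <sys/statfs.h>
	#include <linux/magic.h>	/* DEBUGFS_MAGIC */

	#ifndef DEBUGFS_MAGIC
	#define DEBUGFS_MAGIC 0x64626720	/* fallback; value from linux/magic.h */
	#endif

	int main(int argc, char **argv)
	{
		struct statfs st;
		const char *path = argc > 1 ? argv[1] : "/sys/kernel/debug";

		if (statfs(path, &st) < 0) {
			perror("statfs");
			return 1;
		}

		printf("%s is%s a debugfs mount\n", path,
		       st.f_type == (long) DEBUGFS_MAGIC ? "" : " not");
		return 0;
	}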
diff --git a/Documentation/vm/slabinfo.c b/Documentation/vm/slabinfo.c
deleted file mode 100644
index 92e729f4b676..000000000000
--- a/Documentation/vm/slabinfo.c
+++ /dev/null
@@ -1,1364 +0,0 @@
1 | /* | ||
2 | * Slabinfo: Tool to get reports about slabs | ||
3 | * | ||
4 | * (C) 2007 sgi, Christoph Lameter | ||
5 | * | ||
6 | * Compile by: | ||
7 | * | ||
8 | * gcc -o slabinfo slabinfo.c | ||
9 | */ | ||
10 | #include <stdio.h> | ||
11 | #include <stdlib.h> | ||
12 | #include <sys/types.h> | ||
13 | #include <dirent.h> | ||
14 | #include <strings.h> | ||
15 | #include <string.h> | ||
16 | #include <unistd.h> | ||
17 | #include <stdarg.h> | ||
18 | #include <getopt.h> | ||
19 | #include <regex.h> | ||
20 | #include <errno.h> | ||
21 | |||
22 | #define MAX_SLABS 500 | ||
23 | #define MAX_ALIASES 500 | ||
24 | #define MAX_NODES 1024 | ||
25 | |||
26 | struct slabinfo { | ||
27 | char *name; | ||
28 | int alias; | ||
29 | int refs; | ||
30 | int aliases, align, cache_dma, cpu_slabs, destroy_by_rcu; | ||
31 | int hwcache_align, object_size, objs_per_slab; | ||
32 | int sanity_checks, slab_size, store_user, trace; | ||
33 | int order, poison, reclaim_account, red_zone; | ||
34 | unsigned long partial, objects, slabs, objects_partial, objects_total; | ||
35 | unsigned long alloc_fastpath, alloc_slowpath; | ||
36 | unsigned long free_fastpath, free_slowpath; | ||
37 | unsigned long free_frozen, free_add_partial, free_remove_partial; | ||
38 | unsigned long alloc_from_partial, alloc_slab, free_slab, alloc_refill; | ||
39 | unsigned long cpuslab_flush, deactivate_full, deactivate_empty; | ||
40 | unsigned long deactivate_to_head, deactivate_to_tail; | ||
41 | unsigned long deactivate_remote_frees, order_fallback; | ||
42 | int numa[MAX_NODES]; | ||
43 | int numa_partial[MAX_NODES]; | ||
44 | } slabinfo[MAX_SLABS]; | ||
45 | |||
46 | struct aliasinfo { | ||
47 | char *name; | ||
48 | char *ref; | ||
49 | struct slabinfo *slab; | ||
50 | } aliasinfo[MAX_ALIASES]; | ||
51 | |||
52 | int slabs = 0; | ||
53 | int actual_slabs = 0; | ||
54 | int aliases = 0; | ||
55 | int alias_targets = 0; | ||
56 | int highest_node = 0; | ||
57 | |||
58 | char buffer[4096]; | ||
59 | |||
60 | int show_empty = 0; | ||
61 | int show_report = 0; | ||
62 | int show_alias = 0; | ||
63 | int show_slab = 0; | ||
64 | int skip_zero = 1; | ||
65 | int show_numa = 0; | ||
66 | int show_track = 0; | ||
67 | int show_first_alias = 0; | ||
68 | int validate = 0; | ||
69 | int shrink = 0; | ||
70 | int show_inverted = 0; | ||
71 | int show_single_ref = 0; | ||
72 | int show_totals = 0; | ||
73 | int sort_size = 0; | ||
74 | int sort_active = 0; | ||
75 | int set_debug = 0; | ||
76 | int show_ops = 0; | ||
77 | int show_activity = 0; | ||
78 | |||
79 | /* Debug options */ | ||
80 | int sanity = 0; | ||
81 | int redzone = 0; | ||
82 | int poison = 0; | ||
83 | int tracking = 0; | ||
84 | int tracing = 0; | ||
85 | |||
86 | int page_size; | ||
87 | |||
88 | regex_t pattern; | ||
89 | |||
90 | static void fatal(const char *x, ...) | ||
91 | { | ||
92 | va_list ap; | ||
93 | |||
94 | va_start(ap, x); | ||
95 | vfprintf(stderr, x, ap); | ||
96 | va_end(ap); | ||
97 | exit(EXIT_FAILURE); | ||
98 | } | ||
99 | |||
100 | static void usage(void) | ||
101 | { | ||
102 | printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n" | ||
103 | "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n" | ||
104 | "-a|--aliases Show aliases\n" | ||
105 | "-A|--activity Most active slabs first\n" | ||
106 | "-d<options>|--debug=<options> Set/Clear Debug options\n" | ||
107 | "-D|--display-active Switch line format to activity\n" | ||
108 | "-e|--empty Show empty slabs\n" | ||
109 | "-f|--first-alias Show first alias\n" | ||
110 | "-h|--help Show usage information\n" | ||
111 | "-i|--inverted Inverted list\n" | ||
112 | "-l|--slabs Show slabs\n" | ||
113 | "-n|--numa Show NUMA information\n" | ||
114 | "-o|--ops Show kmem_cache_ops\n" | ||
115 | "-s|--shrink Shrink slabs\n" | ||
116 | "-r|--report Detailed report on single slabs\n" | ||
117 | "-S|--Size Sort by size\n" | ||
118 | "-t|--tracking Show alloc/free information\n" | ||
119 | "-T|--Totals Show summary information\n" | ||
120 | "-v|--validate Validate slabs\n" | ||
121 | "-z|--zero Include empty slabs\n" | ||
122 | "-1|--1ref Single reference\n" | ||
123 | "\nValid debug options (FZPUT may be combined)\n" | ||
124 | "a / A Switch on all debug options (=FZUP)\n" | ||
125 | "- Switch off all debug options\n" | ||
126 | "f / F Sanity Checks (SLAB_DEBUG_FREE)\n" | ||
127 | "z / Z Redzoning\n" | ||
128 | "p / P Poisoning\n" | ||
129 | "u / U Tracking\n" | ||
130 | "t / T Tracing\n" | ||
131 | ); | ||
132 | } | ||
133 | |||
134 | static unsigned long read_obj(const char *name) | ||
135 | { | ||
136 | FILE *f = fopen(name, "r"); | ||
137 | |||
138 | if (!f) | ||
139 | buffer[0] = 0; | ||
140 | else { | ||
141 | if (!fgets(buffer, sizeof(buffer), f)) | ||
142 | buffer[0] = 0; | ||
143 | fclose(f); | ||
144 | if (buffer[strlen(buffer)] == '\n') | ||
145 | buffer[strlen(buffer)] = 0; | ||
146 | } | ||
147 | return strlen(buffer); | ||
148 | } | ||
149 | |||
150 | |||
151 | /* | ||
152 | * Get the contents of an attribute | ||
153 | */ | ||
154 | static unsigned long get_obj(const char *name) | ||
155 | { | ||
156 | if (!read_obj(name)) | ||
157 | return 0; | ||
158 | |||
159 | return atol(buffer); | ||
160 | } | ||
161 | |||
162 | static unsigned long get_obj_and_str(const char *name, char **x) | ||
163 | { | ||
164 | unsigned long result = 0; | ||
165 | char *p; | ||
166 | |||
167 | *x = NULL; | ||
168 | |||
169 | if (!read_obj(name)) { | ||
170 | x = NULL; | ||
171 | return 0; | ||
172 | } | ||
173 | result = strtoul(buffer, &p, 10); | ||
174 | while (*p == ' ') | ||
175 | p++; | ||
176 | if (*p) | ||
177 | *x = strdup(p); | ||
178 | return result; | ||
179 | } | ||
180 | |||
181 | static void set_obj(struct slabinfo *s, const char *name, int n) | ||
182 | { | ||
183 | char x[100]; | ||
184 | FILE *f; | ||
185 | |||
186 | snprintf(x, 100, "%s/%s", s->name, name); | ||
187 | f = fopen(x, "w"); | ||
188 | if (!f) | ||
189 | fatal("Cannot write to %s\n", x); | ||
190 | |||
191 | fprintf(f, "%d\n", n); | ||
192 | fclose(f); | ||
193 | } | ||
194 | |||
195 | static unsigned long read_slab_obj(struct slabinfo *s, const char *name) | ||
196 | { | ||
197 | char x[100]; | ||
198 | FILE *f; | ||
199 | size_t l; | ||
200 | |||
201 | snprintf(x, 100, "%s/%s", s->name, name); | ||
202 | f = fopen(x, "r"); | ||
203 | if (!f) { | ||
204 | buffer[0] = 0; | ||
205 | l = 0; | ||
206 | } else { | ||
207 | l = fread(buffer, 1, sizeof(buffer), f); | ||
208 | buffer[l] = 0; | ||
209 | fclose(f); | ||
210 | } | ||
211 | return l; | ||
212 | } | ||
213 | |||
214 | |||
215 | /* | ||
216 | * Put a size string together | ||
217 | */ | ||
218 | static int store_size(char *buffer, unsigned long value) | ||
219 | { | ||
220 | unsigned long divisor = 1; | ||
221 | char trailer = 0; | ||
222 | int n; | ||
223 | |||
224 | if (value > 1000000000UL) { | ||
225 | divisor = 100000000UL; | ||
226 | trailer = 'G'; | ||
227 | } else if (value > 1000000UL) { | ||
228 | divisor = 100000UL; | ||
229 | trailer = 'M'; | ||
230 | } else if (value > 1000UL) { | ||
231 | divisor = 100; | ||
232 | trailer = 'K'; | ||
233 | } | ||
234 | |||
235 | value /= divisor; | ||
236 | n = sprintf(buffer, "%ld",value); | ||
237 | if (trailer) { | ||
238 | buffer[n] = trailer; | ||
239 | n++; | ||
240 | buffer[n] = 0; | ||
241 | } | ||
242 | if (divisor != 1) { | ||
243 | memmove(buffer + n - 2, buffer + n - 3, 4); | ||
244 | buffer[n-2] = '.'; | ||
245 | n++; | ||
246 | } | ||
247 | return n; | ||
248 | } | ||
249 | |||
250 | static void decode_numa_list(int *numa, char *t) | ||
251 | { | ||
252 | int node; | ||
253 | int nr; | ||
254 | |||
255 | memset(numa, 0, MAX_NODES * sizeof(int)); | ||
256 | |||
257 | if (!t) | ||
258 | return; | ||
259 | |||
260 | while (*t == 'N') { | ||
261 | t++; | ||
262 | node = strtoul(t, &t, 10); | ||
263 | if (*t == '=') { | ||
264 | t++; | ||
265 | nr = strtoul(t, &t, 10); | ||
266 | numa[node] = nr; | ||
267 | if (node > highest_node) | ||
268 | highest_node = node; | ||
269 | } | ||
270 | while (*t == ' ') | ||
271 | t++; | ||
272 | } | ||
273 | } | ||
274 | |||
275 | static void slab_validate(struct slabinfo *s) | ||
276 | { | ||
277 | if (strcmp(s->name, "*") == 0) | ||
278 | return; | ||
279 | |||
280 | set_obj(s, "validate", 1); | ||
281 | } | ||
282 | |||
283 | static void slab_shrink(struct slabinfo *s) | ||
284 | { | ||
285 | if (strcmp(s->name, "*") == 0) | ||
286 | return; | ||
287 | |||
288 | set_obj(s, "shrink", 1); | ||
289 | } | ||
290 | |||
291 | int line = 0; | ||
292 | |||
293 | static void first_line(void) | ||
294 | { | ||
295 | if (show_activity) | ||
296 | printf("Name Objects Alloc Free %%Fast Fallb O\n"); | ||
297 | else | ||
298 | printf("Name Objects Objsize Space " | ||
299 | "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n"); | ||
300 | } | ||
301 | |||
302 | /* | ||
303 | * Find the shortest alias of a slab | ||
304 | */ | ||
305 | static struct aliasinfo *find_one_alias(struct slabinfo *find) | ||
306 | { | ||
307 | struct aliasinfo *a; | ||
308 | struct aliasinfo *best = NULL; | ||
309 | |||
310 | for(a = aliasinfo;a < aliasinfo + aliases; a++) { | ||
311 | if (a->slab == find && | ||
312 | (!best || strlen(best->name) < strlen(a->name))) { | ||
313 | best = a; | ||
314 | if (strncmp(a->name,"kmall", 5) == 0) | ||
315 | return best; | ||
316 | } | ||
317 | } | ||
318 | return best; | ||
319 | } | ||
320 | |||
321 | static unsigned long slab_size(struct slabinfo *s) | ||
322 | { | ||
323 | return s->slabs * (page_size << s->order); | ||
324 | } | ||
325 | |||
326 | static unsigned long slab_activity(struct slabinfo *s) | ||
327 | { | ||
328 | return s->alloc_fastpath + s->free_fastpath + | ||
329 | s->alloc_slowpath + s->free_slowpath; | ||
330 | } | ||
331 | |||
332 | static void slab_numa(struct slabinfo *s, int mode) | ||
333 | { | ||
334 | int node; | ||
335 | |||
336 | if (strcmp(s->name, "*") == 0) | ||
337 | return; | ||
338 | |||
339 | if (!highest_node) { | ||
340 | printf("\n%s: No NUMA information available.\n", s->name); | ||
341 | return; | ||
342 | } | ||
343 | |||
344 | if (skip_zero && !s->slabs) | ||
345 | return; | ||
346 | |||
347 | if (!line) { | ||
348 | printf("\n%-21s:", mode ? "NUMA nodes" : "Slab"); | ||
349 | for(node = 0; node <= highest_node; node++) | ||
350 | printf(" %4d", node); | ||
351 | printf("\n----------------------"); | ||
352 | for(node = 0; node <= highest_node; node++) | ||
353 | printf("-----"); | ||
354 | printf("\n"); | ||
355 | } | ||
356 | printf("%-21s ", mode ? "All slabs" : s->name); | ||
357 | for(node = 0; node <= highest_node; node++) { | ||
358 | char b[20]; | ||
359 | |||
360 | store_size(b, s->numa[node]); | ||
361 | printf(" %4s", b); | ||
362 | } | ||
363 | printf("\n"); | ||
364 | if (mode) { | ||
365 | printf("%-21s ", "Partial slabs"); | ||
366 | for(node = 0; node <= highest_node; node++) { | ||
367 | char b[20]; | ||
368 | |||
369 | store_size(b, s->numa_partial[node]); | ||
370 | printf(" %4s", b); | ||
371 | } | ||
372 | printf("\n"); | ||
373 | } | ||
374 | line++; | ||
375 | } | ||
376 | |||
377 | static void show_tracking(struct slabinfo *s) | ||
378 | { | ||
379 | printf("\n%s: Kernel object allocation\n", s->name); | ||
380 | printf("-----------------------------------------------------------------------\n"); | ||
381 | if (read_slab_obj(s, "alloc_calls")) | ||
382 | printf(buffer); | ||
383 | else | ||
384 | printf("No Data\n"); | ||
385 | |||
386 | printf("\n%s: Kernel object freeing\n", s->name); | ||
387 | printf("------------------------------------------------------------------------\n"); | ||
388 | if (read_slab_obj(s, "free_calls")) | ||
389 | printf(buffer); | ||
390 | else | ||
391 | printf("No Data\n"); | ||
392 | |||
393 | } | ||
394 | |||
395 | static void ops(struct slabinfo *s) | ||
396 | { | ||
397 | if (strcmp(s->name, "*") == 0) | ||
398 | return; | ||
399 | |||
400 | if (read_slab_obj(s, "ops")) { | ||
401 | printf("\n%s: kmem_cache operations\n", s->name); | ||
402 | printf("--------------------------------------------\n"); | ||
403 | printf(buffer); | ||
404 | } else | ||
405 | printf("\n%s has no kmem_cache operations\n", s->name); | ||
406 | } | ||
407 | |||
408 | static const char *onoff(int x) | ||
409 | { | ||
410 | if (x) | ||
411 | return "On "; | ||
412 | return "Off"; | ||
413 | } | ||
414 | |||
415 | static void slab_stats(struct slabinfo *s) | ||
416 | { | ||
417 | unsigned long total_alloc; | ||
418 | unsigned long total_free; | ||
419 | unsigned long total; | ||
420 | |||
421 | if (!s->alloc_slab) | ||
422 | return; | ||
423 | |||
424 | total_alloc = s->alloc_fastpath + s->alloc_slowpath; | ||
425 | total_free = s->free_fastpath + s->free_slowpath; | ||
426 | |||
427 | if (!total_alloc) | ||
428 | return; | ||
429 | |||
430 | printf("\n"); | ||
431 | printf("Slab Perf Counter Alloc Free %%Al %%Fr\n"); | ||
432 | printf("--------------------------------------------------\n"); | ||
433 | printf("Fastpath %8lu %8lu %3lu %3lu\n", | ||
434 | s->alloc_fastpath, s->free_fastpath, | ||
435 | s->alloc_fastpath * 100 / total_alloc, | ||
436 | s->free_fastpath * 100 / total_free); | ||
437 | printf("Slowpath %8lu %8lu %3lu %3lu\n", | ||
438 | total_alloc - s->alloc_fastpath, s->free_slowpath, | ||
439 | (total_alloc - s->alloc_fastpath) * 100 / total_alloc, | ||
440 | s->free_slowpath * 100 / total_free); | ||
441 | printf("Page Alloc %8lu %8lu %3lu %3lu\n", | ||
442 | s->alloc_slab, s->free_slab, | ||
443 | s->alloc_slab * 100 / total_alloc, | ||
444 | s->free_slab * 100 / total_free); | ||
445 | printf("Add partial %8lu %8lu %3lu %3lu\n", | ||
446 | s->deactivate_to_head + s->deactivate_to_tail, | ||
447 | s->free_add_partial, | ||
448 | (s->deactivate_to_head + s->deactivate_to_tail) * 100 / total_alloc, | ||
449 | s->free_add_partial * 100 / total_free); | ||
450 | printf("Remove partial %8lu %8lu %3lu %3lu\n", | ||
451 | s->alloc_from_partial, s->free_remove_partial, | ||
452 | s->alloc_from_partial * 100 / total_alloc, | ||
453 | s->free_remove_partial * 100 / total_free); | ||
454 | |||
455 | printf("RemoteObj/SlabFrozen %8lu %8lu %3lu %3lu\n", | ||
456 | s->deactivate_remote_frees, s->free_frozen, | ||
457 | s->deactivate_remote_frees * 100 / total_alloc, | ||
458 | s->free_frozen * 100 / total_free); | ||
459 | |||
460 | printf("Total %8lu %8lu\n\n", total_alloc, total_free); | ||
461 | |||
462 | if (s->cpuslab_flush) | ||
463 | printf("Flushes %8lu\n", s->cpuslab_flush); | ||
464 | |||
465 | if (s->alloc_refill) | ||
466 | printf("Refill %8lu\n", s->alloc_refill); | ||
467 | |||
468 | total = s->deactivate_full + s->deactivate_empty + | ||
469 | s->deactivate_to_head + s->deactivate_to_tail; | ||
470 | |||
471 | if (total) | ||
472 | printf("Deactivate Full=%lu(%lu%%) Empty=%lu(%lu%%) " | ||
473 | "ToHead=%lu(%lu%%) ToTail=%lu(%lu%%)\n", | ||
474 | s->deactivate_full, (s->deactivate_full * 100) / total, | ||
475 | s->deactivate_empty, (s->deactivate_empty * 100) / total, | ||
476 | s->deactivate_to_head, (s->deactivate_to_head * 100) / total, | ||
477 | s->deactivate_to_tail, (s->deactivate_to_tail * 100) / total); | ||
478 | } | ||
479 | |||
480 | static void report(struct slabinfo *s) | ||
481 | { | ||
482 | if (strcmp(s->name, "*") == 0) | ||
483 | return; | ||
484 | |||
485 | printf("\nSlabcache: %-20s Aliases: %2d Order : %2d Objects: %lu\n", | ||
486 | s->name, s->aliases, s->order, s->objects); | ||
487 | if (s->hwcache_align) | ||
488 | printf("** Hardware cacheline aligned\n"); | ||
489 | if (s->cache_dma) | ||
490 | printf("** Memory is allocated in a special DMA zone\n"); | ||
491 | if (s->destroy_by_rcu) | ||
492 | printf("** Slabs are destroyed via RCU\n"); | ||
493 | if (s->reclaim_account) | ||
494 | printf("** Reclaim accounting active\n"); | ||
495 | |||
496 | printf("\nSizes (bytes) Slabs Debug Memory\n"); | ||
497 | printf("------------------------------------------------------------------------\n"); | ||
498 | printf("Object : %7d Total : %7ld Sanity Checks : %s Total: %7ld\n", | ||
499 | s->object_size, s->slabs, onoff(s->sanity_checks), | ||
500 | s->slabs * (page_size << s->order)); | ||
501 | printf("SlabObj: %7d Full : %7ld Redzoning : %s Used : %7ld\n", | ||
502 | s->slab_size, s->slabs - s->partial - s->cpu_slabs, | ||
503 | onoff(s->red_zone), s->objects * s->object_size); | ||
504 | printf("SlabSiz: %7d Partial: %7ld Poisoning : %s Loss : %7ld\n", | ||
505 | page_size << s->order, s->partial, onoff(s->poison), | ||
506 | s->slabs * (page_size << s->order) - s->objects * s->object_size); | ||
507 | printf("Loss : %7d CpuSlab: %7d Tracking : %s Lalig: %7ld\n", | ||
508 | s->slab_size - s->object_size, s->cpu_slabs, onoff(s->store_user), | ||
509 | (s->slab_size - s->object_size) * s->objects); | ||
510 | printf("Align : %7d Objects: %7d Tracing : %s Lpadd: %7ld\n", | ||
511 | s->align, s->objs_per_slab, onoff(s->trace), | ||
512 | ((page_size << s->order) - s->objs_per_slab * s->slab_size) * | ||
513 | s->slabs); | ||
514 | |||
515 | ops(s); | ||
516 | show_tracking(s); | ||
517 | slab_numa(s, 1); | ||
518 | slab_stats(s); | ||
519 | } | ||
520 | |||
521 | static void slabcache(struct slabinfo *s) | ||
522 | { | ||
523 | char size_str[20]; | ||
524 | char dist_str[40]; | ||
525 | char flags[20]; | ||
526 | char *p = flags; | ||
527 | |||
528 | if (strcmp(s->name, "*") == 0) | ||
529 | return; | ||
530 | |||
531 | if (actual_slabs == 1) { | ||
532 | report(s); | ||
533 | return; | ||
534 | } | ||
535 | |||
536 | if (skip_zero && !show_empty && !s->slabs) | ||
537 | return; | ||
538 | |||
539 | if (show_empty && s->slabs) | ||
540 | return; | ||
541 | |||
542 | store_size(size_str, slab_size(s)); | ||
543 | snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs - s->cpu_slabs, | ||
544 | s->partial, s->cpu_slabs); | ||
545 | |||
546 | if (!line++) | ||
547 | first_line(); | ||
548 | |||
549 | if (s->aliases) | ||
550 | *p++ = '*'; | ||
551 | if (s->cache_dma) | ||
552 | *p++ = 'd'; | ||
553 | if (s->hwcache_align) | ||
554 | *p++ = 'A'; | ||
555 | if (s->poison) | ||
556 | *p++ = 'P'; | ||
557 | if (s->reclaim_account) | ||
558 | *p++ = 'a'; | ||
559 | if (s->red_zone) | ||
560 | *p++ = 'Z'; | ||
561 | if (s->sanity_checks) | ||
562 | *p++ = 'F'; | ||
563 | if (s->store_user) | ||
564 | *p++ = 'U'; | ||
565 | if (s->trace) | ||
566 | *p++ = 'T'; | ||
567 | |||
568 | *p = 0; | ||
569 | if (show_activity) { | ||
570 | unsigned long total_alloc; | ||
571 | unsigned long total_free; | ||
572 | |||
573 | total_alloc = s->alloc_fastpath + s->alloc_slowpath; | ||
574 | total_free = s->free_fastpath + s->free_slowpath; | ||
575 | |||
576 | printf("%-21s %8ld %10ld %10ld %3ld %3ld %5ld %1d\n", | ||
577 | s->name, s->objects, | ||
578 | total_alloc, total_free, | ||
579 | total_alloc ? (s->alloc_fastpath * 100 / total_alloc) : 0, | ||
580 | total_free ? (s->free_fastpath * 100 / total_free) : 0, | ||
581 | s->order_fallback, s->order); | ||
582 | } | ||
583 | else | ||
584 | printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n", | ||
585 | s->name, s->objects, s->object_size, size_str, dist_str, | ||
586 | s->objs_per_slab, s->order, | ||
587 | s->slabs ? (s->partial * 100) / s->slabs : 100, | ||
588 | s->slabs ? (s->objects * s->object_size * 100) / | ||
589 | (s->slabs * (page_size << s->order)) : 100, | ||
590 | flags); | ||
591 | } | ||
592 | |||
593 | /* | ||
594 | * Analyze debug options. Return false if something is amiss. | ||
595 | */ | ||
596 | static int debug_opt_scan(char *opt) | ||
597 | { | ||
598 | if (!opt || !opt[0] || strcmp(opt, "-") == 0) | ||
599 | return 1; | ||
600 | |||
601 | if (strcasecmp(opt, "a") == 0) { | ||
602 | sanity = 1; | ||
603 | poison = 1; | ||
604 | redzone = 1; | ||
605 | tracking = 1; | ||
606 | return 1; | ||
607 | } | ||
608 | |||
609 | for ( ; *opt; opt++) | ||
610 | switch (*opt) { | ||
611 | case 'F' : case 'f': | ||
612 | if (sanity) | ||
613 | return 0; | ||
614 | sanity = 1; | ||
615 | break; | ||
616 | case 'P' : case 'p': | ||
617 | if (poison) | ||
618 | return 0; | ||
619 | poison = 1; | ||
620 | break; | ||
621 | |||
622 | case 'Z' : case 'z': | ||
623 | if (redzone) | ||
624 | return 0; | ||
625 | redzone = 1; | ||
626 | break; | ||
627 | |||
628 | case 'U' : case 'u': | ||
629 | if (tracking) | ||
630 | return 0; | ||
631 | tracking = 1; | ||
632 | break; | ||
633 | |||
634 | case 'T' : case 't': | ||
635 | if (tracing) | ||
636 | return 0; | ||
637 | tracing = 1; | ||
638 | break; | ||
639 | default: | ||
640 | return 0; | ||
641 | } | ||
642 | return 1; | ||
643 | } | ||
644 | |||
645 | static int slab_empty(struct slabinfo *s) | ||
646 | { | ||
647 | if (s->objects > 0) | ||
648 | return 0; | ||
649 | |||
650 | /* | ||
651 | * We may still have slabs even if there are no objects. Shrinking will | ||
652 | * remove them. | ||
653 | */ | ||
654 | if (s->slabs != 0) | ||
655 | set_obj(s, "shrink", 1); | ||
656 | |||
657 | return 1; | ||
658 | } | ||
659 | |||
660 | static void slab_debug(struct slabinfo *s) | ||
661 | { | ||
662 | if (strcmp(s->name, "*") == 0) | ||
663 | return; | ||
664 | |||
665 | if (sanity && !s->sanity_checks) { | ||
666 | set_obj(s, "sanity", 1); | ||
667 | } | ||
668 | if (!sanity && s->sanity_checks) { | ||
669 | if (slab_empty(s)) | ||
670 | set_obj(s, "sanity", 0); | ||
671 | else | ||
672 | fprintf(stderr, "%s not empty cannot disable sanity checks\n", s->name); | ||
673 | } | ||
674 | if (redzone && !s->red_zone) { | ||
675 | if (slab_empty(s)) | ||
676 | set_obj(s, "red_zone", 1); | ||
677 | else | ||
678 | fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name); | ||
679 | } | ||
680 | if (!redzone && s->red_zone) { | ||
681 | if (slab_empty(s)) | ||
682 | set_obj(s, "red_zone", 0); | ||
683 | else | ||
684 | fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name); | ||
685 | } | ||
686 | if (poison && !s->poison) { | ||
687 | if (slab_empty(s)) | ||
688 | set_obj(s, "poison", 1); | ||
689 | else | ||
690 | fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name); | ||
691 | } | ||
692 | if (!poison && s->poison) { | ||
693 | if (slab_empty(s)) | ||
694 | set_obj(s, "poison", 0); | ||
695 | else | ||
696 | fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name); | ||
697 | } | ||
698 | if (tracking && !s->store_user) { | ||
699 | if (slab_empty(s)) | ||
700 | set_obj(s, "store_user", 1); | ||
701 | else | ||
702 | fprintf(stderr, "%s not empty cannot enable tracking\n", s->name); | ||
703 | } | ||
704 | if (!tracking && s->store_user) { | ||
705 | if (slab_empty(s)) | ||
706 | set_obj(s, "store_user", 0); | ||
707 | else | ||
708 | fprintf(stderr, "%s not empty cannot disable tracking\n", s->name); | ||
709 | } | ||
710 | if (tracing && !s->trace) { | ||
711 | if (slabs == 1) | ||
712 | set_obj(s, "trace", 1); | ||
713 | else | ||
714 | fprintf(stderr, "%s can only enable trace for one slab at a time\n", s->name); | ||
715 | } | ||
716 | if (!tracing && s->trace) | ||
717 | set_obj(s, "trace", 0); | ||
718 | } | ||
719 | |||
720 | static void totals(void) | ||
721 | { | ||
722 | struct slabinfo *s; | ||
723 | |||
724 | int used_slabs = 0; | ||
725 | char b1[20], b2[20], b3[20], b4[20]; | ||
726 | unsigned long long max = 1ULL << 63; | ||
727 | |||
728 | /* Object size */ | ||
729 | unsigned long long min_objsize = max, max_objsize = 0, avg_objsize; | ||
730 | |||
731 | /* Number of partial slabs in a slabcache */ | ||
732 | unsigned long long min_partial = max, max_partial = 0, | ||
733 | avg_partial, total_partial = 0; | ||
734 | |||
735 | /* Number of slabs in a slab cache */ | ||
736 | unsigned long long min_slabs = max, max_slabs = 0, | ||
737 | avg_slabs, total_slabs = 0; | ||
738 | |||
739 | /* Size of the whole slab */ | ||
740 | unsigned long long min_size = max, max_size = 0, | ||
741 | avg_size, total_size = 0; | ||
742 | |||
743 | /* Bytes used for object storage in a slab */ | ||
744 | unsigned long long min_used = max, max_used = 0, | ||
745 | avg_used, total_used = 0; | ||
746 | |||
747 | /* Waste: Bytes used for alignment and padding */ | ||
748 | unsigned long long min_waste = max, max_waste = 0, | ||
749 | avg_waste, total_waste = 0; | ||
750 | /* Number of objects in a slab */ | ||
751 | unsigned long long min_objects = max, max_objects = 0, | ||
752 | avg_objects, total_objects = 0; | ||
753 | /* Waste per object */ | ||
754 | unsigned long long min_objwaste = max, | ||
755 | max_objwaste = 0, avg_objwaste, | ||
756 | total_objwaste = 0; | ||
757 | |||
758 | /* Memory per object */ | ||
759 | unsigned long long min_memobj = max, | ||
760 | max_memobj = 0, avg_memobj, | ||
761 | total_objsize = 0; | ||
762 | |||
763 | /* Percentage of partial slabs per slab */ | ||
764 | unsigned long min_ppart = 100, max_ppart = 0, | ||
765 | avg_ppart, total_ppart = 0; | ||
766 | |||
767 | /* Number of objects in partial slabs */ | ||
768 | unsigned long min_partobj = max, max_partobj = 0, | ||
769 | avg_partobj, total_partobj = 0; | ||
770 | |||
771 | /* Percentage of partial objects of all objects in a slab */ | ||
772 | unsigned long min_ppartobj = 100, max_ppartobj = 0, | ||
773 | avg_ppartobj, total_ppartobj = 0; | ||
774 | |||
775 | |||
776 | for (s = slabinfo; s < slabinfo + slabs; s++) { | ||
777 | unsigned long long size; | ||
778 | unsigned long used; | ||
779 | unsigned long long wasted; | ||
780 | unsigned long long objwaste; | ||
781 | unsigned long percentage_partial_slabs; | ||
782 | unsigned long percentage_partial_objs; | ||
783 | |||
784 | if (!s->slabs || !s->objects) | ||
785 | continue; | ||
786 | |||
787 | used_slabs++; | ||
788 | |||
789 | size = slab_size(s); | ||
790 | used = s->objects * s->object_size; | ||
791 | wasted = size - used; | ||
792 | objwaste = s->slab_size - s->object_size; | ||
793 | |||
794 | percentage_partial_slabs = s->partial * 100 / s->slabs; | ||
795 | if (percentage_partial_slabs > 100) | ||
796 | percentage_partial_slabs = 100; | ||
797 | |||
798 | percentage_partial_objs = s->objects_partial * 100 | ||
799 | / s->objects; | ||
800 | |||
801 | if (percentage_partial_objs > 100) | ||
802 | percentage_partial_objs = 100; | ||
803 | |||
804 | if (s->object_size < min_objsize) | ||
805 | min_objsize = s->object_size; | ||
806 | if (s->partial < min_partial) | ||
807 | min_partial = s->partial; | ||
808 | if (s->slabs < min_slabs) | ||
809 | min_slabs = s->slabs; | ||
810 | if (size < min_size) | ||
811 | min_size = size; | ||
812 | if (wasted < min_waste) | ||
813 | min_waste = wasted; | ||
814 | if (objwaste < min_objwaste) | ||
815 | min_objwaste = objwaste; | ||
816 | if (s->objects < min_objects) | ||
817 | min_objects = s->objects; | ||
818 | if (used < min_used) | ||
819 | min_used = used; | ||
820 | if (s->objects_partial < min_partobj) | ||
821 | min_partobj = s->objects_partial; | ||
822 | if (percentage_partial_slabs < min_ppart) | ||
823 | min_ppart = percentage_partial_slabs; | ||
824 | if (percentage_partial_objs < min_ppartobj) | ||
825 | min_ppartobj = percentage_partial_objs; | ||
826 | if (s->slab_size < min_memobj) | ||
827 | min_memobj = s->slab_size; | ||
828 | |||
829 | if (s->object_size > max_objsize) | ||
830 | max_objsize = s->object_size; | ||
831 | if (s->partial > max_partial) | ||
832 | max_partial = s->partial; | ||
833 | if (s->slabs > max_slabs) | ||
834 | max_slabs = s->slabs; | ||
835 | if (size > max_size) | ||
836 | max_size = size; | ||
837 | if (wasted > max_waste) | ||
838 | max_waste = wasted; | ||
839 | if (objwaste > max_objwaste) | ||
840 | max_objwaste = objwaste; | ||
841 | if (s->objects > max_objects) | ||
842 | max_objects = s->objects; | ||
843 | if (used > max_used) | ||
844 | max_used = used; | ||
845 | if (s->objects_partial > max_partobj) | ||
846 | max_partobj = s->objects_partial; | ||
847 | if (percentage_partial_slabs > max_ppart) | ||
848 | max_ppart = percentage_partial_slabs; | ||
849 | if (percentage_partial_objs > max_ppartobj) | ||
850 | max_ppartobj = percentage_partial_objs; | ||
851 | if (s->slab_size > max_memobj) | ||
852 | max_memobj = s->slab_size; | ||
853 | |||
854 | total_partial += s->partial; | ||
855 | total_slabs += s->slabs; | ||
856 | total_size += size; | ||
857 | total_waste += wasted; | ||
858 | |||
859 | total_objects += s->objects; | ||
860 | total_used += used; | ||
861 | total_partobj += s->objects_partial; | ||
862 | total_ppart += percentage_partial_slabs; | ||
863 | total_ppartobj += percentage_partial_objs; | ||
864 | |||
865 | total_objwaste += s->objects * objwaste; | ||
866 | total_objsize += s->objects * s->slab_size; | ||
867 | } | ||
868 | |||
869 | if (!total_objects) { | ||
870 | printf("No objects\n"); | ||
871 | return; | ||
872 | } | ||
873 | if (!used_slabs) { | ||
874 | printf("No slabs\n"); | ||
875 | return; | ||
876 | } | ||
877 | |||
878 | /* Per slab averages */ | ||
879 | avg_partial = total_partial / used_slabs; | ||
880 | avg_slabs = total_slabs / used_slabs; | ||
881 | avg_size = total_size / used_slabs; | ||
882 | avg_waste = total_waste / used_slabs; | ||
883 | |||
884 | avg_objects = total_objects / used_slabs; | ||
885 | avg_used = total_used / used_slabs; | ||
886 | avg_partobj = total_partobj / used_slabs; | ||
887 | avg_ppart = total_ppart / used_slabs; | ||
888 | avg_ppartobj = total_ppartobj / used_slabs; | ||
889 | |||
890 | /* Per object object sizes */ | ||
891 | avg_objsize = total_used / total_objects; | ||
892 | avg_objwaste = total_objwaste / total_objects; | ||
893 | avg_partobj = total_partobj * 100 / total_objects; | ||
894 | avg_memobj = total_objsize / total_objects; | ||
895 | |||
896 | printf("Slabcache Totals\n"); | ||
897 | printf("----------------\n"); | ||
898 | printf("Slabcaches : %3d Aliases : %3d->%-3d Active: %3d\n", | ||
899 | slabs, aliases, alias_targets, used_slabs); | ||
900 | |||
901 | store_size(b1, total_size);store_size(b2, total_waste); | ||
902 | store_size(b3, total_waste * 100 / total_used); | ||
903 | printf("Memory used: %6s # Loss : %6s MRatio:%6s%%\n", b1, b2, b3); | ||
904 | |||
905 | store_size(b1, total_objects);store_size(b2, total_partobj); | ||
906 | store_size(b3, total_partobj * 100 / total_objects); | ||
907 | printf("# Objects : %6s # PartObj: %6s ORatio:%6s%%\n", b1, b2, b3); | ||
908 | |||
909 | printf("\n"); | ||
910 | printf("Per Cache Average Min Max Total\n"); | ||
911 | printf("---------------------------------------------------------\n"); | ||
912 | |||
913 | store_size(b1, avg_objects);store_size(b2, min_objects); | ||
914 | store_size(b3, max_objects);store_size(b4, total_objects); | ||
915 | printf("#Objects %10s %10s %10s %10s\n", | ||
916 | b1, b2, b3, b4); | ||
917 | |||
918 | store_size(b1, avg_slabs);store_size(b2, min_slabs); | ||
919 | store_size(b3, max_slabs);store_size(b4, total_slabs); | ||
920 | printf("#Slabs %10s %10s %10s %10s\n", | ||
921 | b1, b2, b3, b4); | ||
922 | |||
923 | store_size(b1, avg_partial);store_size(b2, min_partial); | ||
924 | store_size(b3, max_partial);store_size(b4, total_partial); | ||
925 | printf("#PartSlab %10s %10s %10s %10s\n", | ||
926 | b1, b2, b3, b4); | ||
927 | store_size(b1, avg_ppart);store_size(b2, min_ppart); | ||
928 | store_size(b3, max_ppart); | ||
929 | store_size(b4, total_partial * 100 / total_slabs); | ||
930 | printf("%%PartSlab%10s%% %10s%% %10s%% %10s%%\n", | ||
931 | b1, b2, b3, b4); | ||
932 | |||
933 | store_size(b1, avg_partobj);store_size(b2, min_partobj); | ||
934 | store_size(b3, max_partobj); | ||
935 | store_size(b4, total_partobj); | ||
936 | printf("PartObjs %10s %10s %10s %10s\n", | ||
937 | b1, b2, b3, b4); | ||
938 | |||
939 | store_size(b1, avg_ppartobj);store_size(b2, min_ppartobj); | ||
940 | store_size(b3, max_ppartobj); | ||
941 | store_size(b4, total_partobj * 100 / total_objects); | ||
942 | printf("%% PartObj%10s%% %10s%% %10s%% %10s%%\n", | ||
943 | b1, b2, b3, b4); | ||
944 | |||
945 | store_size(b1, avg_size);store_size(b2, min_size); | ||
946 | store_size(b3, max_size);store_size(b4, total_size); | ||
947 | printf("Memory %10s %10s %10s %10s\n", | ||
948 | b1, b2, b3, b4); | ||
949 | |||
950 | store_size(b1, avg_used);store_size(b2, min_used); | ||
951 | store_size(b3, max_used);store_size(b4, total_used); | ||
952 | printf("Used %10s %10s %10s %10s\n", | ||
953 | b1, b2, b3, b4); | ||
954 | |||
955 | store_size(b1, avg_waste);store_size(b2, min_waste); | ||
956 | store_size(b3, max_waste);store_size(b4, total_waste); | ||
957 | printf("Loss %10s %10s %10s %10s\n", | ||
958 | b1, b2, b3, b4); | ||
959 | |||
960 | printf("\n"); | ||
961 | printf("Per Object Average Min Max\n"); | ||
962 | printf("---------------------------------------------\n"); | ||
963 | |||
964 | store_size(b1, avg_memobj);store_size(b2, min_memobj); | ||
965 | store_size(b3, max_memobj); | ||
966 | printf("Memory %10s %10s %10s\n", | ||
967 | b1, b2, b3); | ||
968 | store_size(b1, avg_objsize);store_size(b2, min_objsize); | ||
969 | store_size(b3, max_objsize); | ||
970 | printf("User %10s %10s %10s\n", | ||
971 | b1, b2, b3); | ||
972 | |||
973 | store_size(b1, avg_objwaste);store_size(b2, min_objwaste); | ||
974 | store_size(b3, max_objwaste); | ||
975 | printf("Loss %10s %10s %10s\n", | ||
976 | b1, b2, b3); | ||
977 | } | ||
978 | |||
979 | static void sort_slabs(void) | ||
980 | { | ||
981 | struct slabinfo *s1,*s2; | ||
982 | |||
983 | for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) { | ||
984 | for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) { | ||
985 | int result; | ||
986 | |||
987 | if (sort_size) | ||
988 | result = slab_size(s1) < slab_size(s2); | ||
989 | else if (sort_active) | ||
990 | result = slab_activity(s1) < slab_activity(s2); | ||
991 | else | ||
992 | result = strcasecmp(s1->name, s2->name); | ||
993 | |||
994 | if (show_inverted) | ||
995 | result = -result; | ||
996 | |||
997 | if (result > 0) { | ||
998 | struct slabinfo t; | ||
999 | |||
1000 | memcpy(&t, s1, sizeof(struct slabinfo)); | ||
1001 | memcpy(s1, s2, sizeof(struct slabinfo)); | ||
1002 | memcpy(s2, &t, sizeof(struct slabinfo)); | ||
1003 | } | ||
1004 | } | ||
1005 | } | ||
1006 | } | ||
1007 | |||
1008 | static void sort_aliases(void) | ||
1009 | { | ||
1010 | struct aliasinfo *a1,*a2; | ||
1011 | |||
1012 | for (a1 = aliasinfo; a1 < aliasinfo + aliases; a1++) { | ||
1013 | for (a2 = a1 + 1; a2 < aliasinfo + aliases; a2++) { | ||
1014 | char *n1, *n2; | ||
1015 | |||
1016 | n1 = a1->name; | ||
1017 | n2 = a2->name; | ||
1018 | if (show_alias && !show_inverted) { | ||
1019 | n1 = a1->ref; | ||
1020 | n2 = a2->ref; | ||
1021 | } | ||
1022 | if (strcasecmp(n1, n2) > 0) { | ||
1023 | struct aliasinfo t; | ||
1024 | |||
1025 | memcpy(&t, a1, sizeof(struct aliasinfo)); | ||
1026 | memcpy(a1, a2, sizeof(struct aliasinfo)); | ||
1027 | memcpy(a2, &t, sizeof(struct aliasinfo)); | ||
1028 | } | ||
1029 | } | ||
1030 | } | ||
1031 | } | ||
1032 | |||
1033 | static void link_slabs(void) | ||
1034 | { | ||
1035 | struct aliasinfo *a; | ||
1036 | struct slabinfo *s; | ||
1037 | |||
1038 | for (a = aliasinfo; a < aliasinfo + aliases; a++) { | ||
1039 | |||
1040 | for (s = slabinfo; s < slabinfo + slabs; s++) | ||
1041 | if (strcmp(a->ref, s->name) == 0) { | ||
1042 | a->slab = s; | ||
1043 | s->refs++; | ||
1044 | break; | ||
1045 | } | ||
1046 | if (s == slabinfo + slabs) | ||
1047 | fatal("Unresolved alias %s\n", a->ref); | ||
1048 | } | ||
1049 | } | ||
1050 | |||
1051 | static void alias(void) | ||
1052 | { | ||
1053 | struct aliasinfo *a; | ||
1054 | char *active = NULL; | ||
1055 | |||
1056 | sort_aliases(); | ||
1057 | link_slabs(); | ||
1058 | |||
1059 | for(a = aliasinfo; a < aliasinfo + aliases; a++) { | ||
1060 | |||
1061 | if (!show_single_ref && a->slab->refs == 1) | ||
1062 | continue; | ||
1063 | |||
1064 | if (!show_inverted) { | ||
1065 | if (active) { | ||
1066 | if (strcmp(a->slab->name, active) == 0) { | ||
1067 | printf(" %s", a->name); | ||
1068 | continue; | ||
1069 | } | ||
1070 | } | ||
1071 | printf("\n%-12s <- %s", a->slab->name, a->name); | ||
1072 | active = a->slab->name; | ||
1073 | } | ||
1074 | else | ||
1075 | printf("%-20s -> %s\n", a->name, a->slab->name); | ||
1076 | } | ||
1077 | if (active) | ||
1078 | printf("\n"); | ||
1079 | } | ||
1080 | |||
1081 | |||
1082 | static void rename_slabs(void) | ||
1083 | { | ||
1084 | struct slabinfo *s; | ||
1085 | struct aliasinfo *a; | ||
1086 | |||
1087 | for (s = slabinfo; s < slabinfo + slabs; s++) { | ||
1088 | if (*s->name != ':') | ||
1089 | continue; | ||
1090 | |||
1091 | if (s->refs > 1 && !show_first_alias) | ||
1092 | continue; | ||
1093 | |||
1094 | a = find_one_alias(s); | ||
1095 | |||
1096 | if (a) | ||
1097 | s->name = a->name; | ||
1098 | else { | ||
1099 | s->name = "*"; | ||
1100 | actual_slabs--; | ||
1101 | } | ||
1102 | } | ||
1103 | } | ||
1104 | |||
1105 | static int slab_mismatch(char *slab) | ||
1106 | { | ||
1107 | return regexec(&pattern, slab, 0, NULL, 0); | ||
1108 | } | ||
1109 | |||
1110 | static void read_slab_dir(void) | ||
1111 | { | ||
1112 | DIR *dir; | ||
1113 | struct dirent *de; | ||
1114 | struct slabinfo *slab = slabinfo; | ||
1115 | struct aliasinfo *alias = aliasinfo; | ||
1116 | char *p; | ||
1117 | char *t; | ||
1118 | int count; | ||
1119 | |||
1120 | if (chdir("/sys/kernel/slab") && chdir("/sys/slab")) | ||
1121 | fatal("SYSFS support for SLUB not active\n"); | ||
1122 | |||
1123 | dir = opendir("."); | ||
1124 | while ((de = readdir(dir))) { | ||
1125 | if (de->d_name[0] == '.' || | ||
1126 | (de->d_name[0] != ':' && slab_mismatch(de->d_name))) | ||
1127 | continue; | ||
1128 | switch (de->d_type) { | ||
1129 | case DT_LNK: | ||
1130 | alias->name = strdup(de->d_name); | ||
1131 | count = readlink(de->d_name, buffer, sizeof(buffer)); | ||
1132 | |||
1133 | if (count < 0) | ||
1134 | fatal("Cannot read symlink %s\n", de->d_name); | ||
1135 | |||
1136 | buffer[count] = 0; | ||
1137 | p = buffer + count; | ||
1138 | while (p > buffer && p[-1] != '/') | ||
1139 | p--; | ||
1140 | alias->ref = strdup(p); | ||
1141 | alias++; | ||
1142 | break; | ||
1143 | case DT_DIR: | ||
1144 | if (chdir(de->d_name)) | ||
1145 | fatal("Unable to access slab %s\n", de->d_name); | ||
1146 | slab->name = strdup(de->d_name); | ||
1147 | slab->alias = 0; | ||
1148 | slab->refs = 0; | ||
1149 | slab->aliases = get_obj("aliases"); | ||
1150 | slab->align = get_obj("align"); | ||
1151 | slab->cache_dma = get_obj("cache_dma"); | ||
1152 | slab->cpu_slabs = get_obj("cpu_slabs"); | ||
1153 | slab->destroy_by_rcu = get_obj("destroy_by_rcu"); | ||
1154 | slab->hwcache_align = get_obj("hwcache_align"); | ||
1155 | slab->object_size = get_obj("object_size"); | ||
1156 | slab->objects = get_obj("objects"); | ||
1157 | slab->objects_partial = get_obj("objects_partial"); | ||
1158 | slab->objects_total = get_obj("objects_total"); | ||
1159 | slab->objs_per_slab = get_obj("objs_per_slab"); | ||
1160 | slab->order = get_obj("order"); | ||
1161 | slab->partial = get_obj("partial"); | ||
1162 | slab->partial = get_obj_and_str("partial", &t); | ||
1163 | decode_numa_list(slab->numa_partial, t); | ||
1164 | free(t); | ||
1165 | slab->poison = get_obj("poison"); | ||
1166 | slab->reclaim_account = get_obj("reclaim_account"); | ||
1167 | slab->red_zone = get_obj("red_zone"); | ||
1168 | slab->sanity_checks = get_obj("sanity_checks"); | ||
1169 | slab->slab_size = get_obj("slab_size"); | ||
1170 | slab->slabs = get_obj_and_str("slabs", &t); | ||
1171 | decode_numa_list(slab->numa, t); | ||
1172 | free(t); | ||
1173 | slab->store_user = get_obj("store_user"); | ||
1174 | slab->trace = get_obj("trace"); | ||
1175 | slab->alloc_fastpath = get_obj("alloc_fastpath"); | ||
1176 | slab->alloc_slowpath = get_obj("alloc_slowpath"); | ||
1177 | slab->free_fastpath = get_obj("free_fastpath"); | ||
1178 | slab->free_slowpath = get_obj("free_slowpath"); | ||
1179 | slab->free_frozen= get_obj("free_frozen"); | ||
1180 | slab->free_add_partial = get_obj("free_add_partial"); | ||
1181 | slab->free_remove_partial = get_obj("free_remove_partial"); | ||
1182 | slab->alloc_from_partial = get_obj("alloc_from_partial"); | ||
1183 | slab->alloc_slab = get_obj("alloc_slab"); | ||
1184 | slab->alloc_refill = get_obj("alloc_refill"); | ||
1185 | slab->free_slab = get_obj("free_slab"); | ||
1186 | slab->cpuslab_flush = get_obj("cpuslab_flush"); | ||
1187 | slab->deactivate_full = get_obj("deactivate_full"); | ||
1188 | slab->deactivate_empty = get_obj("deactivate_empty"); | ||
1189 | slab->deactivate_to_head = get_obj("deactivate_to_head"); | ||
1190 | slab->deactivate_to_tail = get_obj("deactivate_to_tail"); | ||
1191 | slab->deactivate_remote_frees = get_obj("deactivate_remote_frees"); | ||
1192 | slab->order_fallback = get_obj("order_fallback"); | ||
1193 | chdir(".."); | ||
1194 | if (slab->name[0] == ':') | ||
1195 | alias_targets++; | ||
1196 | slab++; | ||
1197 | break; | ||
1198 | default : | ||
1199 | fatal("Unknown file type %lx\n", de->d_type); | ||
1200 | } | ||
1201 | } | ||
1202 | closedir(dir); | ||
1203 | slabs = slab - slabinfo; | ||
1204 | actual_slabs = slabs; | ||
1205 | aliases = alias - aliasinfo; | ||
1206 | if (slabs > MAX_SLABS) | ||
1207 | fatal("Too many slabs\n"); | ||
1208 | if (aliases > MAX_ALIASES) | ||
1209 | fatal("Too many aliases\n"); | ||
1210 | } | ||
1211 | |||
1212 | static void output_slabs(void) | ||
1213 | { | ||
1214 | struct slabinfo *slab; | ||
1215 | |||
1216 | for (slab = slabinfo; slab < slabinfo + slabs; slab++) { | ||
1217 | |||
1218 | if (slab->alias) | ||
1219 | continue; | ||
1220 | |||
1221 | |||
1222 | if (show_numa) | ||
1223 | slab_numa(slab, 0); | ||
1224 | else if (show_track) | ||
1225 | show_tracking(slab); | ||
1226 | else if (validate) | ||
1227 | slab_validate(slab); | ||
1228 | else if (shrink) | ||
1229 | slab_shrink(slab); | ||
1230 | else if (set_debug) | ||
1231 | slab_debug(slab); | ||
1232 | else if (show_ops) | ||
1233 | ops(slab); | ||
1234 | else if (show_slab) | ||
1235 | slabcache(slab); | ||
1236 | else if (show_report) | ||
1237 | report(slab); | ||
1238 | } | ||
1239 | } | ||
1240 | |||
1241 | struct option opts[] = { | ||
1242 | { "aliases", 0, NULL, 'a' }, | ||
1243 | { "activity", 0, NULL, 'A' }, | ||
1244 | { "debug", 2, NULL, 'd' }, | ||
1245 | { "display-activity", 0, NULL, 'D' }, | ||
1246 | { "empty", 0, NULL, 'e' }, | ||
1247 | { "first-alias", 0, NULL, 'f' }, | ||
1248 | { "help", 0, NULL, 'h' }, | ||
1249 | { "inverted", 0, NULL, 'i'}, | ||
1250 | { "numa", 0, NULL, 'n' }, | ||
1251 | { "ops", 0, NULL, 'o' }, | ||
1252 | { "report", 0, NULL, 'r' }, | ||
1253 | { "shrink", 0, NULL, 's' }, | ||
1254 | { "slabs", 0, NULL, 'l' }, | ||
1255 | { "track", 0, NULL, 't'}, | ||
1256 | { "validate", 0, NULL, 'v' }, | ||
1257 | { "zero", 0, NULL, 'z' }, | ||
1258 | { "1ref", 0, NULL, '1'}, | ||
1259 | { NULL, 0, NULL, 0 } | ||
1260 | }; | ||
1261 | |||
1262 | int main(int argc, char *argv[]) | ||
1263 | { | ||
1264 | int c; | ||
1265 | int err; | ||
1266 | char *pattern_source; | ||
1267 | |||
1268 | page_size = getpagesize(); | ||
1269 | |||
1270 | while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTS", | ||
1271 | opts, NULL)) != -1) | ||
1272 | switch (c) { | ||
1273 | case '1': | ||
1274 | show_single_ref = 1; | ||
1275 | break; | ||
1276 | case 'a': | ||
1277 | show_alias = 1; | ||
1278 | break; | ||
1279 | case 'A': | ||
1280 | sort_active = 1; | ||
1281 | break; | ||
1282 | case 'd': | ||
1283 | set_debug = 1; | ||
1284 | if (!debug_opt_scan(optarg)) | ||
1285 | fatal("Invalid debug option '%s'\n", optarg); | ||
1286 | break; | ||
1287 | case 'D': | ||
1288 | show_activity = 1; | ||
1289 | break; | ||
1290 | case 'e': | ||
1291 | show_empty = 1; | ||
1292 | break; | ||
1293 | case 'f': | ||
1294 | show_first_alias = 1; | ||
1295 | break; | ||
1296 | case 'h': | ||
1297 | usage(); | ||
1298 | return 0; | ||
1299 | case 'i': | ||
1300 | show_inverted = 1; | ||
1301 | break; | ||
1302 | case 'n': | ||
1303 | show_numa = 1; | ||
1304 | break; | ||
1305 | case 'o': | ||
1306 | show_ops = 1; | ||
1307 | break; | ||
1308 | case 'r': | ||
1309 | show_report = 1; | ||
1310 | break; | ||
1311 | case 's': | ||
1312 | shrink = 1; | ||
1313 | break; | ||
1314 | case 'l': | ||
1315 | show_slab = 1; | ||
1316 | break; | ||
1317 | case 't': | ||
1318 | show_track = 1; | ||
1319 | break; | ||
1320 | case 'v': | ||
1321 | validate = 1; | ||
1322 | break; | ||
1323 | case 'z': | ||
1324 | skip_zero = 0; | ||
1325 | break; | ||
1326 | case 'T': | ||
1327 | show_totals = 1; | ||
1328 | break; | ||
1329 | case 'S': | ||
1330 | sort_size = 1; | ||
1331 | break; | ||
1332 | |||
1333 | default: | ||
1334 | fatal("%s: Invalid option '%c'\n", argv[0], optopt); | ||
1335 | |||
1336 | } | ||
1337 | |||
1338 | if (!show_slab && !show_alias && !show_track && !show_report | ||
1339 | && !validate && !shrink && !set_debug && !show_ops) | ||
1340 | show_slab = 1; | ||
1341 | |||
1342 | if (argc > optind) | ||
1343 | pattern_source = argv[optind]; | ||
1344 | else | ||
1345 | pattern_source = ".*"; | ||
1346 | |||
1347 | err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB); | ||
1348 | if (err) | ||
1349 | fatal("%s: Invalid pattern '%s' code %d\n", | ||
1350 | argv[0], pattern_source, err); | ||
1351 | read_slab_dir(); | ||
1352 | if (show_alias) | ||
1353 | alias(); | ||
1354 | else | ||
1355 | if (show_totals) | ||
1356 | totals(); | ||
1357 | else { | ||
1358 | link_slabs(); | ||
1359 | rename_slabs(); | ||
1360 | sort_slabs(); | ||
1361 | output_slabs(); | ||
1362 | } | ||
1363 | return 0; | ||
1364 | } | ||
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt new file mode 100644 index 000000000000..0924aaca3302 --- /dev/null +++ b/Documentation/vm/transhuge.txt | |||
@@ -0,0 +1,298 @@ | |||
1 | = Transparent Hugepage Support = | ||
2 | |||
3 | == Objective == | ||
4 | |||
5 | Performance critical computing applications dealing with large memory | ||
6 | working sets are already running on top of libhugetlbfs and in turn | ||
7 | hugetlbfs. Transparent Hugepage Support is an alternative means of | ||
8 | backing virtual memory with huge pages, one that supports the | ||
9 | automatic promotion and demotion of page sizes and avoids the | ||
10 | shortcomings of hugetlbfs. | ||
11 | |||
12 | Currently it only works for anonymous memory mappings but in the | ||
13 | future it can expand to cover the pagecache layer, starting with tmpfs. | ||
14 | |||
15 | Applications run faster because of two factors. The first factor is | ||
16 | almost completely irrelevant and not of significant interest, because | ||
17 | it also has the downside of requiring larger clear-page and copy-page | ||
18 | operations in page faults, which is a potentially negative effect. | ||
19 | This first factor consists in taking a single page fault for each 2M | ||
20 | virtual region touched by userland (so reducing the enter/exit kernel | ||
21 | frequency by a factor of 512). It only matters the first time the | ||
22 | memory is accessed for the lifetime of a memory mapping. The second, | ||
23 | long lasting and much more important factor affects all subsequent | ||
24 | accesses to the memory for the whole runtime of the application. The | ||
25 | second factor consists of two components: 1) the TLB miss will run | ||
26 | faster (especially with virtualization using nested pagetables, but | ||
27 | almost always also on bare metal without virtualization) and 2) a | ||
28 | single TLB entry will map a much larger amount of virtual memory, | ||
29 | in turn reducing the number of TLB misses. With virtualization and | ||
30 | nested pagetables the TLB can use entries of larger size only if | ||
31 | both KVM and the Linux guest are using hugepages, but a significant | ||
32 | speedup already happens if only one of the two is using hugepages, | ||
33 | just because of the fact that the TLB miss is going to run | ||
34 | faster. | ||
35 | |||
36 | == Design == | ||
37 | |||
38 | - "graceful fallback": mm components which don't have transparent | ||
39 | hugepage knowledge fall back to breaking a transparent hugepage and | ||
40 | working on the regular pages and their respective regular pmd/pte | ||
41 | mappings | ||
42 | |||
43 | - if a hugepage allocation fails because of memory fragmentation, | ||
44 | regular pages should be gracefully allocated instead and mixed in | ||
45 | the same vma without any failure or significant delay and without | ||
46 | userland noticing | ||
47 | |||
48 | - if some task quits and more hugepages become available (either | ||
49 | immediately in the buddy or through the VM), guest physical memory | ||
50 | backed by regular pages should be relocated on hugepages | ||
51 | automatically (with khugepaged) | ||
52 | |||
53 | - it doesn't require memory reservation and in turn it uses hugepages | ||
54 | whenever possible (the only possible reservation here is kernelcore= | ||
55 | to keep unmovable pages from fragmenting all the memory but such a tweak | ||
56 | is not specific to transparent hugepage support and it's a generic | ||
57 | feature that applies to all dynamic high order allocations in the | ||
58 | kernel) | ||
59 | |||
60 | - this initial support only offers the feature in the anonymous memory | ||
61 | regions but it'd be ideal to move it to tmpfs and the pagecache | ||
62 | later | ||
63 | |||
64 | Transparent Hugepage Support maximizes the usefulness of free memory | ||
65 | compared to the reservation approach of hugetlbfs by allowing all | ||
66 | unused memory to be used as cache or for other movable (or even | ||
67 | unmovable) entities. It doesn't require reservation to prevent | ||
68 | hugepage allocation failures from being noticeable to userland. It | ||
69 | allows paging and all other advanced VM features to be available on | ||
70 | hugepages. It requires no modifications for applications to take | ||
71 | advantage of it. | ||
72 | |||
73 | Applications, however, can be further optimized to take advantage of | ||
74 | this feature, much as they have been optimized in the past to avoid | ||
75 | a flood of mmap system calls for every malloc(4k). Optimizing userland | ||
76 | is by no means mandatory, and khugepaged can already take care of long | ||
77 | lived page allocations even for hugepage unaware applications that | ||
78 | deal with large amounts of memory. | ||
79 | |||
80 | In certain cases, when hugepages are enabled system wide, applications | ||
81 | may end up allocating more memory resources. An application may mmap a | ||
82 | large region but only touch 1 byte of it; in that case a 2M page might | ||
83 | be allocated instead of a 4k page for no good reason. This is why it's | ||
84 | possible to disable hugepages system-wide and to only have them inside | ||
85 | MADV_HUGEPAGE madvise regions. | ||
86 | |||
87 | Embedded systems should enable hugepages only inside madvise regions | ||
88 | to eliminate any risk of wasting precious bytes of memory and to | ||
89 | only run faster. | ||
90 | |||
91 | Applications that get a lot of benefit from hugepages and that don't | ||
92 | risk losing memory by using hugepages should use | ||
93 | madvise(MADV_HUGEPAGE) on their critical mmapped regions. | ||
94 | |||
95 | == sysfs == | ||
96 | |||
97 | Transparent Hugepage Support can be entirely disabled (mostly for | ||
98 | debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to | ||
99 | avoid the risk of consuming more memory resources) or enabled system | ||
100 | wide. This can be achieved with one of: | ||
101 | |||
102 | echo always >/sys/kernel/mm/transparent_hugepage/enabled | ||
103 | echo madvise >/sys/kernel/mm/transparent_hugepage/enabled | ||
104 | echo never >/sys/kernel/mm/transparent_hugepage/enabled | ||
105 | |||
106 | It's also possible to limit the defrag efforts the VM makes to | ||
107 | generate hugepages (when none are immediately free) to madvise | ||
108 | regions only, or to never try to defrag memory and simply fall back | ||
109 | to regular pages unless hugepages are immediately available. Clearly | ||
110 | if we spend CPU time to defrag memory, we would expect to gain even | ||
111 | more by the fact that we use hugepages later instead of regular | ||
112 | pages. This isn't always guaranteed, but it may be more likely if | ||
113 | the allocation is for a MADV_HUGEPAGE region. | ||
114 | |||
115 | echo always >/sys/kernel/mm/transparent_hugepage/defrag | ||
116 | echo madvise >/sys/kernel/mm/transparent_hugepage/defrag | ||
117 | echo never >/sys/kernel/mm/transparent_hugepage/defrag | ||
118 | |||
119 | khugepaged will be automatically started when | ||
120 | transparent_hugepage/enabled is set to "always" or "madvise", and it'll | ||
121 | be automatically shut down if it's set to "never". | ||
122 | |||
123 | khugepaged usually runs at low frequency, so while one may not want to | ||
124 | invoke defrag algorithms synchronously during page faults, it | ||
125 | should be worth invoking defrag at least in khugepaged. However it's | ||
126 | also possible to disable defrag in khugepaged: | ||
127 | |||
128 | echo yes >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag | ||
129 | echo no >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag | ||
130 | |||
131 | You can also control how many pages khugepaged should scan at each | ||
132 | pass: | ||
133 | |||
134 | /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan | ||
135 | |||
136 | and how many milliseconds to wait in khugepaged between each pass (you | ||
137 | can set this to 0 to run khugepaged at 100% utilization of one core): | ||
138 | |||
139 | /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs | ||
140 | |||
141 | and how many milliseconds to wait in khugepaged if there's a hugepage | ||
142 | allocation failure, to throttle the next allocation attempt: | ||
143 | |||
144 | /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs | ||
145 | |||
146 | The khugepaged progress can be seen in the number of pages collapsed: | ||
147 | |||
148 | /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed | ||
149 | |||
150 | for each pass: | ||
151 | |||
152 | /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans | ||
153 | |||
154 | == Boot parameter == | ||
155 | |||
156 | You can change the sysfs boot time defaults of Transparent Hugepage | ||
157 | Support by passing the parameter "transparent_hugepage=always" or | ||
158 | "transparent_hugepage=madvise" or "transparent_hugepage=never" | ||
159 | (without "") to the kernel command line. | ||
160 | |||
161 | == Need of application restart == | ||
162 | |||
163 | The transparent_hugepage/enabled values only affect future | ||
164 | behavior. So to make them effective you need to restart any | ||
165 | application that could have been using hugepages. This also applies to | ||
166 | the regions registered in khugepaged. | ||
167 | |||
168 | == get_user_pages and follow_page == | ||
169 | |||
170 | get_user_pages and follow_page, if run on a hugepage, will return the | ||
171 | head or tail pages as usual (exactly as they would do on | ||
172 | hugetlbfs). Most gup users will only care about the actual physical | ||
173 | address of the page and its temporary pinning to release after the I/O | ||
174 | is complete, so they won't ever notice the fact that the page is huge. | ||
175 | But if any driver is going to manipulate the page structure of a tail | ||
176 | page (like checking page->mapping or other bits that are relevant | ||
177 | for the head page and not the tail page), it should be updated to | ||
178 | check the head page instead (while serializing properly against | ||
179 | split_huge_page() to avoid the head and tail pages disappearing from | ||
180 | under it; see the futex code for an example of that, as hugetlbfs also | ||
181 | needed special handling in futex code for similar reasons). | ||
182 | |||
183 | NOTE: these aren't new constraints to the GUP API, and they match the | ||
184 | same constraints that apply to hugetlbfs too, so any driver capable | ||
185 | of handling GUP on hugetlbfs will also work fine on transparent | ||
186 | hugepage backed mappings. | ||
187 | |||
188 | In case you can't handle compound pages if they're returned by | ||
189 | follow_page, the FOLL_SPLIT bit can be specified as a parameter to | ||
190 | follow_page, so that it will split the hugepages before returning | ||
191 | them. Migration, for example, passes FOLL_SPLIT as a parameter to | ||
192 | follow_page because it's not hugepage aware and in fact it can't work | ||
193 | at all on hugetlbfs (but it instead works fine on transparent | ||
194 | hugepages thanks to FOLL_SPLIT). Migration simply can't deal with | ||
195 | hugepages being returned (as it's not only checking the pfn of the | ||
196 | page and pinning it during the copy, but it expects to migrate the | ||
197 | memory in regular page sizes and with regular pte/pmd mappings). | ||
198 | |||
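As a hedged sketch (not part of this patch), a THP-unaware in-kernel caller could combine FOLL_GET and FOLL_SPLIT roughly the way mm/migrate.c does in this era; mm, vma and addr are assumed to be in scope, do_something_with() is a hypothetical placeholder, and the prototype assumed is the in-kernel struct page *follow_page(vma, address, flags):

	struct page *page;

	/* follow_page() requires the mmap_sem to be held */
	down_read(&mm->mmap_sem);
	page = follow_page(vma, addr, FOLL_GET | FOLL_SPLIT);
	if (!IS_ERR_OR_NULL(page)) {
		/* FOLL_SPLIT guarantees this is never a huge tail page */
		do_something_with(page);
		put_page(page);		/* drop the FOLL_GET reference */
	}
	up_read(&mm->mmap_sem);
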
199 | == Optimizing the applications == | ||
200 | |||
201 | To be guaranteed that the kernel will map a 2M page immediately in any | ||
202 | memory region, the mmap region has to be naturally hugepage | ||
203 | aligned. posix_memalign() can provide that guarantee. | ||
204 | |||
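As a hedged userspace sketch (not from this patch), assuming 2M huge pages and a libc that exposes MADV_HUGEPAGE, an application could combine natural alignment with the madvise hint like this:

	#define _GNU_SOURCE
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	#define ALIGN_2M (2UL * 1024 * 1024)	/* assumed hugepage size */

	int main(void)
	{
		size_t len = 8 * ALIGN_2M;
		void *buf;

		/* naturally aligned start, so the kernel can map 2M pages immediately */
		if (posix_memalign(&buf, ALIGN_2M, len))
			return 1;

	#ifdef MADV_HUGEPAGE
		/* only required when THP is enabled in "madvise" mode */
		madvise(buf, len, MADV_HUGEPAGE);
	#endif

		memset(buf, 0, len);	/* first touch can now fault in 2M pages */
		free(buf);
		return 0;
	}
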
205 | == Hugetlbfs == | ||
206 | |||
207 | You can use hugetlbfs just fine, as always, on a kernel that has | ||
208 | transparent hugepage support enabled. No difference can be noted in | ||
209 | hugetlbfs other than there will be less overall fragmentation. All | ||
210 | usual features belonging to hugetlbfs are preserved and | ||
211 | unaffected. libhugetlbfs will also work fine as usual. | ||
212 | |||
213 | == Graceful fallback == | ||
214 | |||
215 | Code walking pagetables but unaware of huge pmds can simply call | ||
216 | split_huge_page_pmd(mm, pmd) where the pmd is the one returned by | ||
217 | pmd_offset. It's trivial to make the code transparent hugepage aware | ||
218 | by just grepping for "pmd_offset" and adding split_huge_page_pmd where | ||
219 | missing after pmd_offset returns the pmd. Thanks to the graceful | ||
220 | fallback design, with a one liner change, you can avoid writing | ||
221 | hundreds if not thousands of lines of complex code to make your code | ||
222 | hugepage aware. | ||
223 | |||
224 | If you're not walking pagetables but you run into a physical hugepage | ||
225 | that you can't handle natively in your code, you can split it by | ||
226 | calling split_huge_page(page). This is what the Linux VM does before | ||
227 | it tries to swap out the hugepage, for example. | ||
228 | |||
229 | Example to make mremap.c transparent hugepage aware with a one liner | ||
230 | change: | ||
231 | |||
232 | diff --git a/mm/mremap.c b/mm/mremap.c | ||
233 | --- a/mm/mremap.c | ||
234 | +++ b/mm/mremap.c | ||
235 | @@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru | ||
236 | return NULL; | ||
237 | |||
238 | pmd = pmd_offset(pud, addr); | ||
239 | + split_huge_page_pmd(mm, pmd); | ||
240 | if (pmd_none_or_clear_bad(pmd)) | ||
241 | return NULL; | ||
242 | |||
243 | == Locking in hugepage aware code == | ||
244 | |||
245 | We want as much code as possible to be hugepage aware, as calling | ||
246 | split_huge_page() or split_huge_page_pmd() has a cost. | ||
247 | |||
248 | To make pagetable walks huge pmd aware, all you need to do is call | ||
249 | pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the | ||
250 | mmap_sem in read (or write) mode to be sure a huge pmd cannot be | ||
251 | created under you by khugepaged (khugepaged's collapse_huge_page | ||
252 | takes the mmap_sem in write mode in addition to the anon_vma lock). If | ||
253 | pmd_trans_huge returns false, you just fall back to the old code | ||
254 | paths. If instead pmd_trans_huge returns true, you have to take the | ||
255 | mm->page_table_lock and re-run pmd_trans_huge. Taking the | ||
256 | page_table_lock prevents the huge pmd from being converted into a | ||
257 | regular pmd under you (split_huge_page can run in parallel to the | ||
258 | pagetable walk). If the second pmd_trans_huge returns false, you | ||
259 | should just drop the page_table_lock and fall back to the old code as | ||
260 | before. Otherwise you should run pmd_trans_splitting on the pmd. If | ||
261 | pmd_trans_splitting returns true, it means split_huge_page is | ||
262 | already in the middle of splitting the page, so it's enough to drop | ||
263 | the page_table_lock, call wait_split_huge_page and then fall back to | ||
264 | the old code paths. You are guaranteed that, by the time | ||
265 | wait_split_huge_page returns, the pmd is no longer huge. If | ||
266 | pmd_trans_splitting returns false, you can proceed to | ||
267 | process the huge pmd and the hugepage natively. Once finished you can | ||
268 | drop the page_table_lock. | ||
269 | |||
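A hedged sketch of the walk just described (not code from this patch; pud, addr, end, vma and mm are assumed to be in scope, and process_huge_pmd()/process_pte_range() are hypothetical placeholders for the caller's own work):

	/* caller already holds mm->mmap_sem in read (or write) mode */
	pmd = pmd_offset(pud, addr);
	if (pmd_trans_huge(*pmd)) {
		spin_lock(&mm->page_table_lock);
		if (pmd_trans_huge(*pmd)) {
			if (unlikely(pmd_trans_splitting(*pmd))) {
				/* split_huge_page is splitting it right now */
				spin_unlock(&mm->page_table_lock);
				wait_split_huge_page(vma->anon_vma, pmd);
				/* pmd is not huge anymore: old code path below */
			} else {
				/* stable huge pmd: handle it natively */
				process_huge_pmd(mm, pmd);
				spin_unlock(&mm->page_table_lock);
				return;
			}
		} else {
			/* lost a race with a split: regular path */
			spin_unlock(&mm->page_table_lock);
		}
	}
	/* regular pte-mapped range */
	process_pte_range(mm, pmd, addr, end);
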
270 | == compound_lock, get_user_pages and put_page == | ||
271 | |||
272 | split_huge_page internally has to distribute the refcounts in the head | ||
273 | page to the tail pages before clearing all PG_head/tail bits from the | ||
274 | page structures. It can do that easily for refcounts taken by huge pmd | ||
275 | mappings. But the gup API as created by hugetlbfs (which returns head | ||
276 | and tail pages when get_user_pages runs on an address backed by any | ||
277 | hugepage) requires the refcount to be accounted on the tail pages and | ||
278 | not only in the head pages, if we want to be able to run | ||
279 | split_huge_page while there are gup pins established on any tail | ||
280 | page. Not being able to run split_huge_page if there's any gup pin | ||
281 | on any tail page would mean having to split all hugepages upfront in | ||
282 | get_user_pages, which is unacceptable as too many gup users are | ||
283 | performance critical and they must work natively on hugepages like | ||
284 | they work natively on hugetlbfs already (hugetlbfs is simpler because | ||
285 | hugetlbfs pages cannot be split, so there is no requirement to | ||
286 | account for the pins on the tail pages for hugetlbfs). If we didn't | ||
287 | account the gup refcounts on the tail pages during gup, we wouldn't | ||
288 | know which tail page is pinned by gup and which is not while we run | ||
289 | split_huge_page. But we still have to add the gup pin to the head page | ||
290 | too, to know when we can free the compound page in case it's never | ||
291 | split during its lifetime. That requires changing not just | ||
292 | get_page, but put_page as well so that when put_page runs on a tail | ||
293 | page (and only on a tail page) it will find its respective head page, | ||
294 | and then it will decrease the head page refcount in addition to the | ||
295 | tail page refcount. To obtain a head page reliably and to decrease its | ||
296 | refcount without race conditions, put_page has to serialize against | ||
297 | __split_huge_page_refcount using a special per-page lock called | ||
298 | compound_lock. | ||
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index 2d70d0d95108..97bae3c576c2 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt | |||
@@ -84,8 +84,7 @@ indicate that the page is being managed on the unevictable list. | |||
84 | 84 | ||
85 | The PG_unevictable flag is analogous to, and mutually exclusive with, the | 85 | The PG_unevictable flag is analogous to, and mutually exclusive with, the |
86 | PG_active flag in that it indicates on which LRU list a page resides when | 86 | PG_active flag in that it indicates on which LRU list a page resides when |
87 | PG_lru is set. The unevictable list is compile-time configurable based on the | 87 | PG_lru is set. |
88 | UNEVICTABLE_LRU Kconfig option. | ||
89 | 88 | ||
90 | The Unevictable LRU infrastructure maintains unevictable pages on an additional | 89 | The Unevictable LRU infrastructure maintains unevictable pages on an additional |
91 | LRU list for a few reasons: | 90 | LRU list for a few reasons: |