author     Christoph Lameter <clameter@sgi.com>    2007-05-06 17:49:36 -0400
committer  Linus Torvalds <torvalds@woody.linux-foundation.org>    2007-05-07 15:12:53 -0400
commit     81819f0fc8285a2a5a921c019e3e3d7b6169d225
tree       47e3da44d3ef6c74ceae6c3771b191b46467bb48 /mm
parent     543691a6cd70b606dd9bed5e77b120c5d9c5c506
SLUB core
This is a new slab allocator which was motivated by the complexity of the existing code in mm/slab.c. It attempts to address a variety of concerns with the existing implementation.

A. Management of object queues

A particular concern was the complex management of the numerous object queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for each allocating CPU and use objects from a slab directly instead of queueing them up.

B. Storage overhead of object queues

SLAB object queues exist per node, per CPU. The alien cache queue even has a queue array that contains a queue for each processor on each node. For very large systems the number of queues and the number of objects that may be caught in those queues grows exponentially. On our systems with 1k nodes / processors we have several gigabytes just tied up for storing references to objects for those queues. This does not include the objects that could be on those queues. One fears that the whole memory of the machine could one day be consumed by those queues.

C. SLAB meta data overhead

SLAB has overhead at the beginning of each slab. This means that data cannot be naturally aligned at the beginning of a slab block. SLUB keeps all meta data in the corresponding page_struct. Objects can be naturally aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte boundaries and can fit tightly into a 4k page with no bytes left over. SLAB cannot do this.

D. SLAB has a complex cache reaper

SLUB does not need a cache reaper for UP systems. On SMP systems the per CPU slab may be pushed back onto the partial list, but that operation is simple and does not require an iteration over a list of objects. SLAB expires per CPU, shared and alien object queues during cache reaping, which may cause strange hold offs.

E. SLAB has complex NUMA policy layer support

SLUB pushes NUMA policy handling into the page allocator. This means that allocation is coarser (SLUB does interleave on a page level) but that situation was also present before 2.6.13. SLAB's application of policies to individual slab objects is certainly a performance concern due to the frequent references to memory policies, which may lead a sequence of objects to come from one node after another. SLUB will get a slab full of objects from one node and then will switch to the next.

F. Reduction of the size of partial slab lists

SLAB has per node partial lists. This means that over time a large number of partial slabs may accumulate on those lists. These can only be reused if allocations occur on specific nodes. SLUB has a global pool of partial slabs and will consume slabs from that pool to decrease fragmentation.

G. Tunables

SLAB has sophisticated tuning abilities for each slab cache. One can manipulate the queue sizes in detail. However, filling the queues still requires the use of the spin lock to check out slabs. SLUB has a global parameter (slub_min_order) for tuning. Increasing the minimum slab order can decrease the locking overhead. The bigger the slab order, the fewer movements of pages between the per CPU and partial lists occur, and the better SLUB will scale.

G. Slab merging

We often have slab caches with similar parameters. SLUB detects those on boot up and merges them into the corresponding general caches. This leads to more effective memory use. About 50% of all caches can be eliminated through slab merging. This will also decrease slab fragmentation because partially allocated slabs can be filled up again.

Slab merging can be switched off by specifying slub_nomerge on boot up. Note that merging can expose heretofore unknown bugs in the kernel because corrupted objects may now be placed differently and corrupt different neighboring objects. Enable sanity checks to find those.

H. Diagnostics

The current slab diagnostics are difficult to use and require a recompilation of the kernel. SLUB contains debugging code that is always available (but is kept out of the hot code paths). SLUB diagnostics can be enabled via the "slub_debug" option. Parameters can be specified to select a single or a group of slab caches for diagnostics. This means that the system is running with the usual performance and it is much more likely that race conditions can be reproduced.

I. Resiliency

If basic sanity checks are on then SLUB is capable of detecting common error conditions and recovering as best as possible to allow the system to continue.

J. Tracing

Tracing can be enabled via the slub_debug=T,<slabcache> option during boot. SLUB will then log all actions on that slabcache and dump the object contents on free.

K. On demand DMA cache creation

Generally DMA caches are not needed. If a kmalloc is used with __GFP_DMA then just the single slabcache that is needed is created. For systems that have no ZONE_DMA requirement the support is completely eliminated.

L. Performance increase

Some benchmarks have shown speed improvements on kernbench in the range of 5-10%. The locking overhead of SLUB depends on the underlying base allocation size. If we can reliably allocate larger order pages then it is possible to increase SLUB performance much further. The anti-fragmentation patches may enable further performance increases.

Tested on: i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator

SLUB Boot options

slub_nomerge		Disable merging of slabs
slub_min_order=x	Require a minimum order for slab caches. This
			increases the managed chunk size and therefore
			reduces meta data and locking overhead.
slub_min_objects=x	Minimum objects per slab. Default is 8.
slub_max_order=x	Avoid generating slabs larger than order specified.
slub_debug		Enable all diagnostics for all caches
slub_debug=<options>	Enable selective options for all caches
slub_debug=<o>,<cache>	Enable selective options for a certain set of caches

Available Debug options

F	Double Free checking, sanity and resiliency
R	Red zoning
P	Object / padding poisoning
U	Track last free / alloc
T	Trace all allocs / frees (only use for individual slabs).

To use SLUB: Apply this patch and then select SLUB as the default slab allocator.

[hugh@veritas.com: fix an oops-causing locking error]
[akpm@linux-foundation.org: various stupid cleanups and small fixes]

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
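SLUB sits behind the existing slab interface, so callers need no changes. A minimal sketch of that caller side follows; it is illustrative only: the cache name "my_record", the struct, the chosen flags, and the NULL ctor/dtor arguments are made up for this example and follow the kmem_cache_create() prototype of this kernel generation.

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/slab.h>

/* Hypothetical object type used only for this sketch. */
struct my_record {
	struct list_head list;
	unsigned long id;
};

static struct kmem_cache *my_record_cache;

static int __init my_record_init(void)
{
	/*
	 * One cache for all my_record objects. SLUB may merge it with a
	 * compatible general cache unless slub_nomerge was given at boot.
	 */
	my_record_cache = kmem_cache_create("my_record",
					    sizeof(struct my_record), 0,
					    SLAB_HWCACHE_ALIGN, NULL, NULL);
	return my_record_cache ? 0 : -ENOMEM;
}

static struct my_record *my_record_alloc(void)
{
	/* Fast path: served from the per cpu slab without taking list_lock. */
	return kmem_cache_alloc(my_record_cache, GFP_KERNEL);
}

static void my_record_free(struct my_record *r)
{
	kmem_cache_free(my_record_cache, r);
}

With the boot options above, debugging could then be limited to such a cache, e.g. slub_debug=FU,my_record to enable free checking and alloc/free tracking for it alone (again using the hypothetical cache name from the sketch).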
Diffstat (limited to 'mm')
-rw-r--r--	mm/Makefile	   1
-rw-r--r--	mm/slub.c	3144
2 files changed, 3145 insertions, 0 deletions
diff --git a/mm/Makefile b/mm/Makefile
index f3c077eb0b8e..1887148e44e7 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
 obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_SLAB) += slab.o
+obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
diff --git a/mm/slub.c b/mm/slub.c
new file mode 100644
index 000000000000..0cd56bd74b64
--- /dev/null
+++ b/mm/slub.c
@@ -0,0 +1,3144 @@
1/*
2 * SLUB: A slab allocator that limits cache line use instead of queuing
3 * objects in per cpu and per node lists.
4 *
5 * The allocator synchronizes using per slab locks and only
6 * uses a centralized lock to manage a pool of partial slabs.
7 *
8 * (C) 2007 SGI, Christoph Lameter <clameter@sgi.com>
9 */
10
11#include <linux/mm.h>
12#include <linux/module.h>
13#include <linux/bit_spinlock.h>
14#include <linux/interrupt.h>
15#include <linux/bitops.h>
16#include <linux/slab.h>
17#include <linux/seq_file.h>
18#include <linux/cpu.h>
19#include <linux/cpuset.h>
20#include <linux/mempolicy.h>
21#include <linux/ctype.h>
22#include <linux/kallsyms.h>
23
24/*
25 * Lock order:
26 * 1. slab_lock(page)
27 * 2. slab->list_lock
28 *
29 * The slab_lock protects operations on the object of a particular
30 * slab and its metadata in the page struct. If the slab lock
31 * has been taken then no allocations nor frees can be performed
32 * on the objects in the slab nor can the slab be added or removed
33 * from the partial or full lists since this would mean modifying
34 * the page_struct of the slab.
35 *
36 * The list_lock protects the partial and full list on each node and
37 * the partial slab counter. If taken then no new slabs may be added or
38 * removed from the lists, nor may the number of partial slabs be modified.
39 * (Note that the total number of slabs is an atomic value that may be
40 * modified without taking the list lock).
41 *
42 * The list_lock is a centralized lock and thus we avoid taking it as
43 * much as possible. As long as SLUB does not have to handle partial
44 * slabs, operations can continue without any centralized lock. F.e.
45 * allocating a long series of objects that fill up slabs does not require
46 * the list lock.
47 *
48 * The lock order is sometimes inverted when we are trying to get a slab
49 * off a list. We take the list_lock and then look for a page on the list
50 * to use. While we do that objects in the slabs may be freed. We can
51 * only operate on the slab if we have also taken the slab_lock. So we use
52 * a slab_trylock() on the slab. If trylock was successful then no frees
53 * can occur anymore and we can use the slab for allocations etc. If the
54 * slab_trylock() does not succeed then frees are in progress in the slab and
55 * we must stay away from it for a while since we may cause a bouncing
56 * cacheline if we try to acquire the lock. So go onto the next slab.
57 * If all pages are busy then we may allocate a new slab instead of reusing
58 * a partial slab. A new slab has no one operating on it and thus there is
59 * no danger of cacheline contention.
60 *
61 * Interrupts are disabled during allocation and deallocation in order to
62 * make the slab allocator safe to use in the context of an irq. In addition
63 * interrupts are disabled to ensure that the processor does not change
64 * while handling per_cpu slabs, due to kernel preemption.
65 *
66 * SLUB assigns one slab for allocation to each processor.
67 * Allocations only occur from these slabs called cpu slabs.
68 *
69 * Slabs with free elements are kept on a partial list.
70 * There is no list for full slabs. If an object in a full slab is
71 * freed then the slab will show up again on the partial lists.
72 * Otherwise there is no need to track full slabs unless we have to
73 * track full slabs for debugging purposes.
74 *
75 * Slabs are freed when they become empty. Teardown and setup is
76 * minimal so we rely on the page allocator's per cpu caches for
77 * fast frees and allocs.
78 *
79 * Overloading of page flags that are otherwise used for LRU management.
80 *
81 * PageActive The slab is used as a cpu cache. Allocations
82 * may be performed from the slab. The slab is not
83 * on any slab list and cannot be moved onto one.
84 *
85 * PageError Slab requires special handling due to debug
86 * options set. This moves slab handling out of
87 * the fast path.
88 */
89
90/*
91 * Issues still to be resolved:
92 *
93 * - The per cpu array is updated for each new slab and is a remote
94 * cacheline for most nodes. This could become a bouncing cacheline given
95 * enough frequent updates. There are 16 pointers in a cacheline, so at
96 * max 16 cpus could compete. Likely okay.
97 *
98 * - Support PAGE_ALLOC_DEBUG. Should be easy to do.
99 *
100 * - Support DEBUG_SLAB_LEAK. Trouble is we do not know where the full
101 * slabs are in SLUB.
102 *
103 * - SLAB_DEBUG_INITIAL is not supported but I have never seen a use of
104 * it.
105 *
106 * - Variable sizing of the per node arrays
107 */
108
109/* Enable to test recovery from slab corruption on boot */
110#undef SLUB_RESILIENCY_TEST
111
112#if PAGE_SHIFT <= 12
113
114/*
115 * Small page size. Make sure that we do not fragment memory
116 */
117#define DEFAULT_MAX_ORDER 1
118#define DEFAULT_MIN_OBJECTS 4
119
120#else
121
122/*
123 * Large page machines are customarily able to handle larger
124 * page orders.
125 */
126#define DEFAULT_MAX_ORDER 2
127#define DEFAULT_MIN_OBJECTS 8
128
129#endif
130
131/*
132 * Flags from the regular SLAB that SLUB does not support:
133 */
134#define SLUB_UNIMPLEMENTED (SLAB_DEBUG_INITIAL)
135
136#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
137 SLAB_POISON | SLAB_STORE_USER)
138/*
139 * Set of flags that will prevent slab merging
140 */
141#define SLUB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
142 SLAB_TRACE | SLAB_DESTROY_BY_RCU)
143
144#define SLUB_MERGE_SAME (SLAB_DEBUG_FREE | SLAB_RECLAIM_ACCOUNT | \
145 SLAB_CACHE_DMA)
146
147#ifndef ARCH_KMALLOC_MINALIGN
148#define ARCH_KMALLOC_MINALIGN sizeof(void *)
149#endif
150
151#ifndef ARCH_SLAB_MINALIGN
152#define ARCH_SLAB_MINALIGN sizeof(void *)
153#endif
154
155/* Internal SLUB flags */
156#define __OBJECT_POISON 0x80000000 /* Poison object */
157
158static int kmem_size = sizeof(struct kmem_cache);
159
160#ifdef CONFIG_SMP
161static struct notifier_block slab_notifier;
162#endif
163
164static enum {
165 DOWN, /* No slab functionality available */
166 PARTIAL, /* kmem_cache_open() works but kmalloc does not */
167 UP, /* Everything works */
168 SYSFS /* Sysfs up */
169} slab_state = DOWN;
170
171/* A list of all slab caches on the system */
172static DECLARE_RWSEM(slub_lock);
173LIST_HEAD(slab_caches);
174
175#ifdef CONFIG_SYSFS
176static int sysfs_slab_add(struct kmem_cache *);
177static int sysfs_slab_alias(struct kmem_cache *, const char *);
178static void sysfs_slab_remove(struct kmem_cache *);
179#else
180static int sysfs_slab_add(struct kmem_cache *s) { return 0; }
181static int sysfs_slab_alias(struct kmem_cache *s, const char *p) { return 0; }
182static void sysfs_slab_remove(struct kmem_cache *s) {}
183#endif
184
185/********************************************************************
186 * Core slab cache functions
187 *******************************************************************/
188
189int slab_is_available(void)
190{
191 return slab_state >= UP;
192}
193
194static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
195{
196#ifdef CONFIG_NUMA
197 return s->node[node];
198#else
199 return &s->local_node;
200#endif
201}
202
203/*
204 * Object debugging
205 */
206static void print_section(char *text, u8 *addr, unsigned int length)
207{
208 int i, offset;
209 int newline = 1;
210 char ascii[17];
211
212 ascii[16] = 0;
213
214 for (i = 0; i < length; i++) {
215 if (newline) {
216 printk(KERN_ERR "%10s 0x%p: ", text, addr + i);
217 newline = 0;
218 }
219 printk(" %02x", addr[i]);
220 offset = i % 16;
221 ascii[offset] = isgraph(addr[i]) ? addr[i] : '.';
222 if (offset == 15) {
223 printk(" %s\n",ascii);
224 newline = 1;
225 }
226 }
227 if (!newline) {
228 i %= 16;
229 while (i < 16) {
230 printk(" ");
231 ascii[i] = ' ';
232 i++;
233 }
234 printk(" %s\n", ascii);
235 }
236}
237
238/*
239 * Slow version of get and set free pointer.
240 *
241 * This requires touching the cache lines of kmem_cache.
242 * The offset can also be obtained from the page. In that
243 * case it is in the cacheline that we already need to touch.
244 */
245static void *get_freepointer(struct kmem_cache *s, void *object)
246{
247 return *(void **)(object + s->offset);
248}
249
250static void set_freepointer(struct kmem_cache *s, void *object, void *fp)
251{
252 *(void **)(object + s->offset) = fp;
253}
254
255/*
256 * Tracking user of a slab.
257 */
258struct track {
259 void *addr; /* Called from address */
260 int cpu; /* Was running on cpu */
261 int pid; /* Pid context */
262 unsigned long when; /* When did the operation occur */
263};
264
265enum track_item { TRACK_ALLOC, TRACK_FREE };
266
267static struct track *get_track(struct kmem_cache *s, void *object,
268 enum track_item alloc)
269{
270 struct track *p;
271
272 if (s->offset)
273 p = object + s->offset + sizeof(void *);
274 else
275 p = object + s->inuse;
276
277 return p + alloc;
278}
279
280static void set_track(struct kmem_cache *s, void *object,
281 enum track_item alloc, void *addr)
282{
283 struct track *p;
284
285 if (s->offset)
286 p = object + s->offset + sizeof(void *);
287 else
288 p = object + s->inuse;
289
290 p += alloc;
291 if (addr) {
292 p->addr = addr;
293 p->cpu = smp_processor_id();
294 p->pid = current ? current->pid : -1;
295 p->when = jiffies;
296 } else
297 memset(p, 0, sizeof(struct track));
298}
299
300#define set_tracking(__s, __o, __a) set_track(__s, __o, __a, \
301 __builtin_return_address(0))
302
303static void init_tracking(struct kmem_cache *s, void *object)
304{
305 if (s->flags & SLAB_STORE_USER) {
306 set_track(s, object, TRACK_FREE, NULL);
307 set_track(s, object, TRACK_ALLOC, NULL);
308 }
309}
310
311static void print_track(const char *s, struct track *t)
312{
313 if (!t->addr)
314 return;
315
316 printk(KERN_ERR "%s: ", s);
317 __print_symbol("%s", (unsigned long)t->addr);
318 printk(" jiffies_ago=%lu cpu=%u pid=%d\n", jiffies - t->when, t->cpu, t->pid);
319}
320
321static void print_trailer(struct kmem_cache *s, u8 *p)
322{
323 unsigned int off; /* Offset of last byte */
324
325 if (s->flags & SLAB_RED_ZONE)
326 print_section("Redzone", p + s->objsize,
327 s->inuse - s->objsize);
328
329 printk(KERN_ERR "FreePointer 0x%p -> 0x%p\n",
330 p + s->offset,
331 get_freepointer(s, p));
332
333 if (s->offset)
334 off = s->offset + sizeof(void *);
335 else
336 off = s->inuse;
337
338 if (s->flags & SLAB_STORE_USER) {
339 print_track("Last alloc", get_track(s, p, TRACK_ALLOC));
340 print_track("Last free ", get_track(s, p, TRACK_FREE));
341 off += 2 * sizeof(struct track);
342 }
343
344 if (off != s->size)
345 /* Beginning of the filler is the free pointer */
346 print_section("Filler", p + off, s->size - off);
347}
348
349static void object_err(struct kmem_cache *s, struct page *page,
350 u8 *object, char *reason)
351{
352 u8 *addr = page_address(page);
353
354 printk(KERN_ERR "*** SLUB %s: %s@0x%p slab 0x%p\n",
355 s->name, reason, object, page);
356 printk(KERN_ERR " offset=%tu flags=0x%04lx inuse=%u freelist=0x%p\n",
357 object - addr, page->flags, page->inuse, page->freelist);
358 if (object > addr + 16)
359 print_section("Bytes b4", object - 16, 16);
360 print_section("Object", object, min(s->objsize, 128));
361 print_trailer(s, object);
362 dump_stack();
363}
364
365static void slab_err(struct kmem_cache *s, struct page *page, char *reason, ...)
366{
367 va_list args;
368 char buf[100];
369
370 va_start(args, reason);
371 vsnprintf(buf, sizeof(buf), reason, args);
372 va_end(args);
373 printk(KERN_ERR "*** SLUB %s: %s in slab @0x%p\n", s->name, buf,
374 page);
375 dump_stack();
376}
377
378static void init_object(struct kmem_cache *s, void *object, int active)
379{
380 u8 *p = object;
381
382 if (s->flags & __OBJECT_POISON) {
383 memset(p, POISON_FREE, s->objsize - 1);
384 p[s->objsize -1] = POISON_END;
385 }
386
387 if (s->flags & SLAB_RED_ZONE)
388 memset(p + s->objsize,
389 active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
390 s->inuse - s->objsize);
391}
392
393static int check_bytes(u8 *start, unsigned int value, unsigned int bytes)
394{
395 while (bytes) {
396 if (*start != (u8)value)
397 return 0;
398 start++;
399 bytes--;
400 }
401 return 1;
402}
403
404
405static int check_valid_pointer(struct kmem_cache *s, struct page *page,
406 void *object)
407{
408 void *base;
409
410 if (!object)
411 return 1;
412
413 base = page_address(page);
414 if (object < base || object >= base + s->objects * s->size ||
415 (object - base) % s->size) {
416 return 0;
417 }
418
419 return 1;
420}
421
422/*
423 * Object layout:
424 *
425 * object address
426 * Bytes of the object to be managed.
427 * If the freepointer may overlay the object then the free
428 * pointer is the first word of the object.
429 * Poisoning uses 0x6b (POISON_FREE) and the last byte is
430 * 0xa5 (POISON_END)
431 *
432 * object + s->objsize
433 * Padding to reach word boundary. This is also used for Redzoning.
434 * Padding is extended to word size if Redzoning is enabled
435 * and objsize == inuse.
436 * We fill with 0xbb (RED_INACTIVE) for inactive objects and with
437 * 0xcc (RED_ACTIVE) for objects in use.
438 *
439 * object + s->inuse
440 * A. Free pointer (if we cannot overwrite object on free)
441 * B. Tracking data for SLAB_STORE_USER
442 * C. Padding to reach required alignment boundary
443 * Padding is done using 0x5a (POISON_INUSE)
444 *
445 * object + s->size
446 *
447 * If slabcaches are merged then the objsize and inuse boundaries are to
448 * be ignored. And therefore no slab options that rely on these boundaries
449 * may be used with merged slabcaches.
450 */
451
452static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
453 void *from, void *to)
454{
455 printk(KERN_ERR "@@@ SLUB: %s Restoring %s (0x%x) from 0x%p-0x%p\n",
456 s->name, message, data, from, to - 1);
457 memset(from, data, to - from);
458}
459
460static int check_pad_bytes(struct kmem_cache *s, struct page *page, u8 *p)
461{
462 unsigned long off = s->inuse; /* The end of info */
463
464 if (s->offset)
465 /* Freepointer is placed after the object. */
466 off += sizeof(void *);
467
468 if (s->flags & SLAB_STORE_USER)
469 /* We also have user information there */
470 off += 2 * sizeof(struct track);
471
472 if (s->size == off)
473 return 1;
474
475 if (check_bytes(p + off, POISON_INUSE, s->size - off))
476 return 1;
477
478 object_err(s, page, p, "Object padding check fails");
479
480 /*
481 * Restore padding
482 */
483 restore_bytes(s, "object padding", POISON_INUSE, p + off, p + s->size);
484 return 0;
485}
486
487static int slab_pad_check(struct kmem_cache *s, struct page *page)
488{
489 u8 *p;
490 int length, remainder;
491
492 if (!(s->flags & SLAB_POISON))
493 return 1;
494
495 p = page_address(page);
496 length = s->objects * s->size;
497 remainder = (PAGE_SIZE << s->order) - length;
498 if (!remainder)
499 return 1;
500
501 if (!check_bytes(p + length, POISON_INUSE, remainder)) {
502 printk(KERN_ERR "SLUB: %s slab 0x%p: Padding fails check\n",
503 s->name, p);
504 dump_stack();
505 restore_bytes(s, "slab padding", POISON_INUSE, p + length,
506 p + length + remainder);
507 return 0;
508 }
509 return 1;
510}
511
512static int check_object(struct kmem_cache *s, struct page *page,
513 void *object, int active)
514{
515 u8 *p = object;
516 u8 *endobject = object + s->objsize;
517
518 if (s->flags & SLAB_RED_ZONE) {
519 unsigned int red =
520 active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE;
521
522 if (!check_bytes(endobject, red, s->inuse - s->objsize)) {
523 object_err(s, page, object,
524 active ? "Redzone Active" : "Redzone Inactive");
525 restore_bytes(s, "redzone", red,
526 endobject, object + s->inuse);
527 return 0;
528 }
529 } else {
530 if ((s->flags & SLAB_POISON) && s->objsize < s->inuse &&
531 !check_bytes(endobject, POISON_INUSE,
532 s->inuse - s->objsize)) {
533 object_err(s, page, p, "Alignment padding check fails");
534 /*
535 * Fix it so that there will not be another report.
536 *
537 * Hmmm... We may be corrupting an object that now expects
538 * to be longer than allowed.
539 */
540 restore_bytes(s, "alignment padding", POISON_INUSE,
541 endobject, object + s->inuse);
542 }
543 }
544
545 if (s->flags & SLAB_POISON) {
546 if (!active && (s->flags & __OBJECT_POISON) &&
547 (!check_bytes(p, POISON_FREE, s->objsize - 1) ||
548 p[s->objsize - 1] != POISON_END)) {
549
550 object_err(s, page, p, "Poison check failed");
551 restore_bytes(s, "Poison", POISON_FREE,
552 p, p + s->objsize -1);
553 restore_bytes(s, "Poison", POISON_END,
554 p + s->objsize - 1, p + s->objsize);
555 return 0;
556 }
557 /*
558 * check_pad_bytes cleans up on its own.
559 */
560 check_pad_bytes(s, page, p);
561 }
562
563 if (!s->offset && active)
564 /*
565 * Object and freepointer overlap. Cannot check
566 * freepointer while object is allocated.
567 */
568 return 1;
569
570 /* Check free pointer validity */
571 if (!check_valid_pointer(s, page, get_freepointer(s, p))) {
572 object_err(s, page, p, "Freepointer corrupt");
573 /*
574 * No choice but to zap it and thus lose the remainder
575 * of the free objects in this slab. May cause
576 * another error because the object count may be
577 * wrong now.
578 */
579 set_freepointer(s, p, NULL);
580 return 0;
581 }
582 return 1;
583}
584
585static int check_slab(struct kmem_cache *s, struct page *page)
586{
587 VM_BUG_ON(!irqs_disabled());
588
589 if (!PageSlab(page)) {
590 printk(KERN_ERR "SLUB: %s Not a valid slab page @0x%p "
591 "flags=%lx mapping=0x%p count=%d \n",
592 s->name, page, page->flags, page->mapping,
593 page_count(page));
594 return 0;
595 }
596 if (page->offset * sizeof(void *) != s->offset) {
597 printk(KERN_ERR "SLUB: %s Corrupted offset %lu in slab @0x%p"
598 " flags=0x%lx mapping=0x%p count=%d\n",
599 s->name,
600 (unsigned long)(page->offset * sizeof(void *)),
601 page,
602 page->flags,
603 page->mapping,
604 page_count(page));
605 dump_stack();
606 return 0;
607 }
608 if (page->inuse > s->objects) {
609 printk(KERN_ERR "SLUB: %s Inuse %u > max %u in slab "
610 "page @0x%p flags=%lx mapping=0x%p count=%d\n",
611 s->name, page->inuse, s->objects, page, page->flags,
612 page->mapping, page_count(page));
613 dump_stack();
614 return 0;
615 }
616 /* Slab_pad_check fixes things up after itself */
617 slab_pad_check(s, page);
618 return 1;
619}
620
621/*
622 * Determine if a certain object on a page is on the freelist and
623 * therefore free. Must hold the slab lock for cpu slabs to
624 * guarantee that the chains are consistent.
625 */
626static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
627{
628 int nr = 0;
629 void *fp = page->freelist;
630 void *object = NULL;
631
632 while (fp && nr <= s->objects) {
633 if (fp == search)
634 return 1;
635 if (!check_valid_pointer(s, page, fp)) {
636 if (object) {
637 object_err(s, page, object,
638 "Freechain corrupt");
639 set_freepointer(s, object, NULL);
640 break;
641 } else {
642 printk(KERN_ERR "SLUB: %s slab 0x%p "
643 "freepointer 0x%p corrupted.\n",
644 s->name, page, fp);
645 dump_stack();
646 page->freelist = NULL;
647 page->inuse = s->objects;
648 return 0;
649 }
650 break;
651 }
652 object = fp;
653 fp = get_freepointer(s, object);
654 nr++;
655 }
656
657 if (page->inuse != s->objects - nr) {
658 printk(KERN_ERR "slab %s: page 0x%p wrong object count."
659 " counter is %d but counted were %d\n",
660 s->name, page, page->inuse,
661 s->objects - nr);
662 page->inuse = s->objects - nr;
663 }
664 return search == NULL;
665}
666
667static int alloc_object_checks(struct kmem_cache *s, struct page *page,
668 void *object)
669{
670 if (!check_slab(s, page))
671 goto bad;
672
673 if (object && !on_freelist(s, page, object)) {
674 printk(KERN_ERR "SLUB: %s Object 0x%p@0x%p "
675 "already allocated.\n",
676 s->name, object, page);
677 goto dump;
678 }
679
680 if (!check_valid_pointer(s, page, object)) {
681 object_err(s, page, object, "Freelist Pointer check fails");
682 goto dump;
683 }
684
685 if (!object)
686 return 1;
687
688 if (!check_object(s, page, object, 0))
689 goto bad;
690 init_object(s, object, 1);
691
692 if (s->flags & SLAB_TRACE) {
693 printk(KERN_INFO "TRACE %s alloc 0x%p inuse=%d fp=0x%p\n",
694 s->name, object, page->inuse,
695 page->freelist);
696 dump_stack();
697 }
698 return 1;
699dump:
700 dump_stack();
701bad:
702 if (PageSlab(page)) {
703 /*
704 * If this is a slab page then lets do the best we can
705 * to avoid issues in the future. Marking all objects
706 * as used avoids touching the remainder.
707 */
708 printk(KERN_ERR "@@@ SLUB: %s slab 0x%p. Marking all objects used.\n",
709 s->name, page);
710 page->inuse = s->objects;
711 page->freelist = NULL;
712 /* Fix up fields that may be corrupted */
713 page->offset = s->offset / sizeof(void *);
714 }
715 return 0;
716}
717
718static int free_object_checks(struct kmem_cache *s, struct page *page,
719 void *object)
720{
721 if (!check_slab(s, page))
722 goto fail;
723
724 if (!check_valid_pointer(s, page, object)) {
725 printk(KERN_ERR "SLUB: %s slab 0x%p invalid "
726 "object pointer 0x%p\n",
727 s->name, page, object);
728 goto fail;
729 }
730
731 if (on_freelist(s, page, object)) {
732 printk(KERN_ERR "SLUB: %s slab 0x%p object "
733 "0x%p already free.\n", s->name, page, object);
734 goto fail;
735 }
736
737 if (!check_object(s, page, object, 1))
738 return 0;
739
740 if (unlikely(s != page->slab)) {
741 if (!PageSlab(page))
742 printk(KERN_ERR "slab_free %s size %d: attempt to"
743 "free object(0x%p) outside of slab.\n",
744 s->name, s->size, object);
745 else
746 if (!page->slab)
747 printk(KERN_ERR
748 "slab_free : no slab(NULL) for object 0x%p.\n",
749 object);
750 else
751 printk(KERN_ERR "slab_free %s(%d): object at 0x%p"
752 " belongs to slab %s(%d)\n",
753 s->name, s->size, object,
754 page->slab->name, page->slab->size);
755 goto fail;
756 }
757 if (s->flags & SLAB_TRACE) {
758 printk(KERN_INFO "TRACE %s free 0x%p inuse=%d fp=0x%p\n",
759 s->name, object, page->inuse,
760 page->freelist);
761 print_section("Object", object, s->objsize);
762 dump_stack();
763 }
764 init_object(s, object, 0);
765 return 1;
766fail:
767 dump_stack();
768 printk(KERN_ERR "@@@ SLUB: %s slab 0x%p object at 0x%p not freed.\n",
769 s->name, page, object);
770 return 0;
771}
772
773/*
774 * Slab allocation and freeing
775 */
776static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
777{
778 struct page * page;
779 int pages = 1 << s->order;
780
781 if (s->order)
782 flags |= __GFP_COMP;
783
784 if (s->flags & SLAB_CACHE_DMA)
785 flags |= SLUB_DMA;
786
787 if (node == -1)
788 page = alloc_pages(flags, s->order);
789 else
790 page = alloc_pages_node(node, flags, s->order);
791
792 if (!page)
793 return NULL;
794
795 mod_zone_page_state(page_zone(page),
796 (s->flags & SLAB_RECLAIM_ACCOUNT) ?
797 NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
798 pages);
799
800 return page;
801}
802
803static void setup_object(struct kmem_cache *s, struct page *page,
804 void *object)
805{
806 if (PageError(page)) {
807 init_object(s, object, 0);
808 init_tracking(s, object);
809 }
810
811 if (unlikely(s->ctor)) {
812 int mode = SLAB_CTOR_CONSTRUCTOR;
813
814 if (!(s->flags & __GFP_WAIT))
815 mode |= SLAB_CTOR_ATOMIC;
816
817 s->ctor(object, s, mode);
818 }
819}
820
821static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
822{
823 struct page *page;
824 struct kmem_cache_node *n;
825 void *start;
826 void *end;
827 void *last;
828 void *p;
829
830 if (flags & __GFP_NO_GROW)
831 return NULL;
832
833 BUG_ON(flags & ~(GFP_DMA | GFP_LEVEL_MASK));
834
835 if (flags & __GFP_WAIT)
836 local_irq_enable();
837
838 page = allocate_slab(s, flags & GFP_LEVEL_MASK, node);
839 if (!page)
840 goto out;
841
842 n = get_node(s, page_to_nid(page));
843 if (n)
844 atomic_long_inc(&n->nr_slabs);
845 page->offset = s->offset / sizeof(void *);
846 page->slab = s;
847 page->flags |= 1 << PG_slab;
848 if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
849 SLAB_STORE_USER | SLAB_TRACE))
850 page->flags |= 1 << PG_error;
851
852 start = page_address(page);
853 end = start + s->objects * s->size;
854
855 if (unlikely(s->flags & SLAB_POISON))
856 memset(start, POISON_INUSE, PAGE_SIZE << s->order);
857
858 last = start;
859 for (p = start + s->size; p < end; p += s->size) {
860 setup_object(s, page, last);
861 set_freepointer(s, last, p);
862 last = p;
863 }
864 setup_object(s, page, last);
865 set_freepointer(s, last, NULL);
866
867 page->freelist = start;
868 page->inuse = 0;
869out:
870 if (flags & __GFP_WAIT)
871 local_irq_disable();
872 return page;
873}
874
875static void __free_slab(struct kmem_cache *s, struct page *page)
876{
877 int pages = 1 << s->order;
878
879 if (unlikely(PageError(page) || s->dtor)) {
880 void *start = page_address(page);
881 void *end = start + (pages << PAGE_SHIFT);
882 void *p;
883
884 slab_pad_check(s, page);
885 for (p = start; p <= end - s->size; p += s->size) {
886 if (s->dtor)
887 s->dtor(p, s, 0);
888 check_object(s, page, p, 0);
889 }
890 }
891
892 mod_zone_page_state(page_zone(page),
893 (s->flags & SLAB_RECLAIM_ACCOUNT) ?
894 NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
895 - pages);
896
897 page->mapping = NULL;
898 __free_pages(page, s->order);
899}
900
901static void rcu_free_slab(struct rcu_head *h)
902{
903 struct page *page;
904
905 page = container_of((struct list_head *)h, struct page, lru);
906 __free_slab(page->slab, page);
907}
908
909static void free_slab(struct kmem_cache *s, struct page *page)
910{
911 if (unlikely(s->flags & SLAB_DESTROY_BY_RCU)) {
912 /*
913 * RCU free overloads the RCU head over the LRU
914 */
915 struct rcu_head *head = (void *)&page->lru;
916
917 call_rcu(head, rcu_free_slab);
918 } else
919 __free_slab(s, page);
920}
921
922static void discard_slab(struct kmem_cache *s, struct page *page)
923{
924 struct kmem_cache_node *n = get_node(s, page_to_nid(page));
925
926 atomic_long_dec(&n->nr_slabs);
927 reset_page_mapcount(page);
928 page->flags &= ~(1 << PG_slab | 1 << PG_error);
929 free_slab(s, page);
930}
931
932/*
933 * Per slab locking using the pagelock
934 */
935static __always_inline void slab_lock(struct page *page)
936{
937 bit_spin_lock(PG_locked, &page->flags);
938}
939
940static __always_inline void slab_unlock(struct page *page)
941{
942 bit_spin_unlock(PG_locked, &page->flags);
943}
944
945static __always_inline int slab_trylock(struct page *page)
946{
947 int rc = 1;
948
949 rc = bit_spin_trylock(PG_locked, &page->flags);
950 return rc;
951}
952
953/*
954 * Management of partially allocated slabs
955 */
956static void add_partial(struct kmem_cache *s, struct page *page)
957{
958 struct kmem_cache_node *n = get_node(s, page_to_nid(page));
959
960 spin_lock(&n->list_lock);
961 n->nr_partial++;
962 list_add(&page->lru, &n->partial);
963 spin_unlock(&n->list_lock);
964}
965
966static void remove_partial(struct kmem_cache *s,
967 struct page *page)
968{
969 struct kmem_cache_node *n = get_node(s, page_to_nid(page));
970
971 spin_lock(&n->list_lock);
972 list_del(&page->lru);
973 n->nr_partial--;
974 spin_unlock(&n->list_lock);
975}
976
977/*
978 * Lock page and remove it from the partial list
979 *
980 * Must hold list_lock
981 */
982static int lock_and_del_slab(struct kmem_cache_node *n, struct page *page)
983{
984 if (slab_trylock(page)) {
985 list_del(&page->lru);
986 n->nr_partial--;
987 return 1;
988 }
989 return 0;
990}
991
992/*
993 * Try to get a partial slab from a specific node
994 */
995static struct page *get_partial_node(struct kmem_cache_node *n)
996{
997 struct page *page;
998
999 /*
1000 * Racy check. If we mistakenly see no partial slabs then we
1001 * just allocate an empty slab. If we mistakenly try to get a
1002 * partial slab then get_partials() will return NULL.
1003 */
1004 if (!n || !n->nr_partial)
1005 return NULL;
1006
1007 spin_lock(&n->list_lock);
1008 list_for_each_entry(page, &n->partial, lru)
1009 if (lock_and_del_slab(n, page))
1010 goto out;
1011 page = NULL;
1012out:
1013 spin_unlock(&n->list_lock);
1014 return page;
1015}
1016
1017/*
1018 * Get a page from somewhere. Search in increasing NUMA
1019 * distances.
1020 */
1021static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
1022{
1023#ifdef CONFIG_NUMA
1024 struct zonelist *zonelist;
1025 struct zone **z;
1026 struct page *page;
1027
1028 /*
1029 * The defrag ratio allows one to configure the tradeoffs between
1030 * inter node defragmentation and node local allocations.
1031 * A lower defrag_ratio increases the tendency to do local
1032 * allocations instead of scanning through the partial
1033 * lists on other nodes.
1034 *
1035 * If defrag_ratio is set to 0 then kmalloc() always
1036 * returns node local objects. If it is higher then kmalloc()
1037 * may return off node objects in order to avoid fragmentation.
1038 *
1039 * A higher ratio means slabs may be taken from other nodes
1040 * thus reducing the number of partial slabs on those nodes.
1041 *
1042 * If /sys/slab/xx/defrag_ratio is set to 100 (which makes
1043 * defrag_ratio = 1000) then every (well almost) allocation
1044 * will first attempt to defrag slab caches on other nodes. This
1045 * means scanning over all nodes to look for partial slabs which
1046 * may be a bit expensive to do on every slab allocation.
1047 */
1048 if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
1049 return NULL;
1050
1051 zonelist = &NODE_DATA(slab_node(current->mempolicy))
1052 ->node_zonelists[gfp_zone(flags)];
1053 for (z = zonelist->zones; *z; z++) {
1054 struct kmem_cache_node *n;
1055
1056 n = get_node(s, zone_to_nid(*z));
1057
1058 if (n && cpuset_zone_allowed_hardwall(*z, flags) &&
1059 n->nr_partial > 2) {
1060 page = get_partial_node(n);
1061 if (page)
1062 return page;
1063 }
1064 }
1065#endif
1066 return NULL;
1067}
1068
1069/*
1070 * Get a partial page, lock it and return it.
1071 */
1072static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
1073{
1074 struct page *page;
1075 int searchnode = (node == -1) ? numa_node_id() : node;
1076
1077 page = get_partial_node(get_node(s, searchnode));
1078 if (page || (flags & __GFP_THISNODE))
1079 return page;
1080
1081 return get_any_partial(s, flags);
1082}
1083
1084/*
1085 * Move a page back to the lists.
1086 *
1087 * Must be called with the slab lock held.
1088 *
1089 * On exit the slab lock will have been dropped.
1090 */
1091static void putback_slab(struct kmem_cache *s, struct page *page)
1092{
1093 if (page->inuse) {
1094 if (page->freelist)
1095 add_partial(s, page);
1096 slab_unlock(page);
1097 } else {
1098 slab_unlock(page);
1099 discard_slab(s, page);
1100 }
1101}
1102
1103/*
1104 * Remove the cpu slab
1105 */
1106static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
1107{
1108 s->cpu_slab[cpu] = NULL;
1109 ClearPageActive(page);
1110
1111 putback_slab(s, page);
1112}
1113
1114static void flush_slab(struct kmem_cache *s, struct page *page, int cpu)
1115{
1116 slab_lock(page);
1117 deactivate_slab(s, page, cpu);
1118}
1119
1120/*
1121 * Flush cpu slab.
1122 * Called from IPI handler with interrupts disabled.
1123 */
1124static void __flush_cpu_slab(struct kmem_cache *s, int cpu)
1125{
1126 struct page *page = s->cpu_slab[cpu];
1127
1128 if (likely(page))
1129 flush_slab(s, page, cpu);
1130}
1131
1132static void flush_cpu_slab(void *d)
1133{
1134 struct kmem_cache *s = d;
1135 int cpu = smp_processor_id();
1136
1137 __flush_cpu_slab(s, cpu);
1138}
1139
1140static void flush_all(struct kmem_cache *s)
1141{
1142#ifdef CONFIG_SMP
1143 on_each_cpu(flush_cpu_slab, s, 1, 1);
1144#else
1145 unsigned long flags;
1146
1147 local_irq_save(flags);
1148 flush_cpu_slab(s);
1149 local_irq_restore(flags);
1150#endif
1151}
1152
1153/*
1154 * slab_alloc is optimized to only modify two cachelines on the fast path
1155 * (aside from the stack):
1156 *
1157 * 1. The page struct
1158 * 2. The first cacheline of the object to be allocated.
1159 *
1160 * The only cache lines that are read (apart from code) are the
1161 * per cpu array in the kmem_cache struct.
1162 *
1163 * Fastpath is not possible if we need to get a new slab or have
1164 * debugging enabled (which means all slabs are marked with PageError)
1165 */
1166static __always_inline void *slab_alloc(struct kmem_cache *s,
1167 gfp_t gfpflags, int node)
1168{
1169 struct page *page;
1170 void **object;
1171 unsigned long flags;
1172 int cpu;
1173
1174 local_irq_save(flags);
1175 cpu = smp_processor_id();
1176 page = s->cpu_slab[cpu];
1177 if (!page)
1178 goto new_slab;
1179
1180 slab_lock(page);
1181 if (unlikely(node != -1 && page_to_nid(page) != node))
1182 goto another_slab;
1183redo:
1184 object = page->freelist;
1185 if (unlikely(!object))
1186 goto another_slab;
1187 if (unlikely(PageError(page)))
1188 goto debug;
1189
1190have_object:
1191 page->inuse++;
1192 page->freelist = object[page->offset];
1193 slab_unlock(page);
1194 local_irq_restore(flags);
1195 return object;
1196
1197another_slab:
1198 deactivate_slab(s, page, cpu);
1199
1200new_slab:
1201 page = get_partial(s, gfpflags, node);
1202 if (likely(page)) {
1203have_slab:
1204 s->cpu_slab[cpu] = page;
1205 SetPageActive(page);
1206 goto redo;
1207 }
1208
1209 page = new_slab(s, gfpflags, node);
1210 if (page) {
1211 cpu = smp_processor_id();
1212 if (s->cpu_slab[cpu]) {
1213 /*
1214 * Someone else populated the cpu_slab while we enabled
1215 * interrupts, or we have got scheduled on another cpu.
1216 * The page may not be on the requested node.
1217 */
1218 if (node == -1 ||
1219 page_to_nid(s->cpu_slab[cpu]) == node) {
1220 /*
1221 * Current cpuslab is acceptable and we
1222 * want the current one since its cache hot
1223 */
1224 discard_slab(s, page);
1225 page = s->cpu_slab[cpu];
1226 slab_lock(page);
1227 goto redo;
1228 }
1229 /* Dump the current slab */
1230 flush_slab(s, s->cpu_slab[cpu], cpu);
1231 }
1232 slab_lock(page);
1233 goto have_slab;
1234 }
1235 local_irq_restore(flags);
1236 return NULL;
1237debug:
1238 if (!alloc_object_checks(s, page, object))
1239 goto another_slab;
1240 if (s->flags & SLAB_STORE_USER)
1241 set_tracking(s, object, TRACK_ALLOC);
1242 goto have_object;
1243}
1244
1245void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
1246{
1247 return slab_alloc(s, gfpflags, -1);
1248}
1249EXPORT_SYMBOL(kmem_cache_alloc);
1250
1251#ifdef CONFIG_NUMA
1252void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
1253{
1254 return slab_alloc(s, gfpflags, node);
1255}
1256EXPORT_SYMBOL(kmem_cache_alloc_node);
1257#endif
1258
1259/*
1260 * The fastpath only writes the cacheline of the page struct and the first
1261 * cacheline of the object.
1262 *
1263 * No special cachelines need to be read
1264 */
1265static void slab_free(struct kmem_cache *s, struct page *page, void *x)
1266{
1267 void *prior;
1268 void **object = (void *)x;
1269 unsigned long flags;
1270
1271 local_irq_save(flags);
1272 slab_lock(page);
1273
1274 if (unlikely(PageError(page)))
1275 goto debug;
1276checks_ok:
1277 prior = object[page->offset] = page->freelist;
1278 page->freelist = object;
1279 page->inuse--;
1280
1281 if (unlikely(PageActive(page)))
1282 /*
1283 * Cpu slabs are never on partial lists and are
1284 * never freed.
1285 */
1286 goto out_unlock;
1287
1288 if (unlikely(!page->inuse))
1289 goto slab_empty;
1290
1291 /*
1292 * Objects left in the slab. If it
1293 * was not on the partial list before
1294 * then add it.
1295 */
1296 if (unlikely(!prior))
1297 add_partial(s, page);
1298
1299out_unlock:
1300 slab_unlock(page);
1301 local_irq_restore(flags);
1302 return;
1303
1304slab_empty:
1305 if (prior)
1306 /*
1307 * Partially used slab that is on the partial list.
1308 */
1309 remove_partial(s, page);
1310
1311 slab_unlock(page);
1312 discard_slab(s, page);
1313 local_irq_restore(flags);
1314 return;
1315
1316debug:
1317 if (free_object_checks(s, page, x))
1318 goto checks_ok;
1319 goto out_unlock;
1320}
1321
1322void kmem_cache_free(struct kmem_cache *s, void *x)
1323{
1324 struct page * page;
1325
1326 page = virt_to_page(x);
1327
1328 if (unlikely(PageCompound(page)))
1329 page = page->first_page;
1330
1331
1332 if (unlikely(PageError(page) && (s->flags & SLAB_STORE_USER)))
1333 set_tracking(s, x, TRACK_FREE);
1334 slab_free(s, page, x);
1335}
1336EXPORT_SYMBOL(kmem_cache_free);
1337
1338/* Figure out on which slab object the object resides */
1339static struct page *get_object_page(const void *x)
1340{
1341 struct page *page = virt_to_page(x);
1342
1343 if (unlikely(PageCompound(page)))
1344 page = page->first_page;
1345
1346 if (!PageSlab(page))
1347 return NULL;
1348
1349 return page;
1350}
1351
1352/*
1353 * kmem_cache_open produces objects aligned at "size" and the first object
1354 * is placed at offset 0 in the slab (We have no metainformation on the
1355 * slab, all slabs are in essence "off slab").
1356 *
1357 * In order to get the desired alignment one just needs to align the
1358 * size.
1359 *
1360 * Notice that the allocation order determines the sizes of the per cpu
1361 * caches. Each processor has always one slab available for allocations.
1362 * Increasing the allocation order reduces the number of times that slabs
1363 * must be moved on and off the partial lists and therefore may influence
1364 * locking overhead.
1365 *
1366 * The offset is used to relocate the free list link in each object. It is
1367 * therefore possible to move the free list link behind the object. This
1368 * is necessary for RCU to work properly and also useful for debugging.
1369 */
1370
1371/*
1372 * Minimum / Maximum order of slab pages. This influences locking overhead
1373 * and slab fragmentation. A higher order reduces the number of partial slabs
1374 * and increases the number of allocations possible without having to
1375 * take the list_lock.
1376 */
1377static int slub_min_order;
1378static int slub_max_order = DEFAULT_MAX_ORDER;
1379
1380/*
1381 * Minimum number of objects per slab. This is necessary in order to
1382 * reduce locking overhead. Similar to the queue size in SLAB.
1383 */
1384static int slub_min_objects = DEFAULT_MIN_OBJECTS;
1385
1386/*
1387 * Merge control. If this is set then no merging of slab caches will occur.
1388 */
1389static int slub_nomerge;
1390
1391/*
1392 * Debug settings:
1393 */
1394static int slub_debug;
1395
1396static char *slub_debug_slabs;
1397
1398/*
1399 * Calculate the order of allocation given a slab object size.
1400 *
1401 * The order of allocation has significant impact on other elements
1402 * of the system. Generally order 0 allocations should be preferred
1403 * since they do not cause fragmentation in the page allocator. Larger
1404 * objects may have problems with order 0 because there may be too much
1405 * space left unused in a slab. We go to a higher order if more than 1/8th
1406 * of the slab would be wasted.
1407 *
1408 * In order to reach satisfactory performance we must ensure that
1409 * a minimum number of objects is in one slab. Otherwise we may
1410 * generate too much activity on the partial lists. This is less a
1411 * concern for large slabs though. slub_max_order specifies the order
1412 * where we begin to stop considering the number of objects in a slab.
1413 *
1414 * Higher order allocations also allow the placement of more objects
1415 * in a slab and thereby reduce object handling overhead. If the user
1416 * has requested a higher minimum order then we start with that one
1417 * instead of zero.
1418 */
1419static int calculate_order(int size)
1420{
1421 int order;
1422 int rem;
1423
1424 for (order = max(slub_min_order, fls(size - 1) - PAGE_SHIFT);
1425 order < MAX_ORDER; order++) {
1426 unsigned long slab_size = PAGE_SIZE << order;
1427
1428 if (slub_max_order > order &&
1429 slab_size < slub_min_objects * size)
1430 continue;
1431
1432 if (slab_size < size)
1433 continue;
1434
1435 rem = slab_size % size;
1436
1437 if (rem <= (PAGE_SIZE << order) / 8)
1438 break;
1439
1440 }
1441 if (order >= MAX_ORDER)
1442 return -E2BIG;
1443 return order;
1444}
1445
1446/*
1447 * Function to figure out which alignment to use from the
1448 * various ways of specifying it.
1449 */
1450static unsigned long calculate_alignment(unsigned long flags,
1451 unsigned long align, unsigned long size)
1452{
1453 /*
1454 * If the user wants hardware cache aligned objects then
1455 * follow that suggestion if the object is sufficiently
1456 * large.
1457 *
1458 * The hardware cache alignment cannot override the
1459 * specified alignment though. If that is greater
1460 * then use it.
1461 */
1462 if ((flags & (SLAB_MUST_HWCACHE_ALIGN | SLAB_HWCACHE_ALIGN)) &&
1463 size > L1_CACHE_BYTES / 2)
1464 return max_t(unsigned long, align, L1_CACHE_BYTES);
1465
1466 if (align < ARCH_SLAB_MINALIGN)
1467 return ARCH_SLAB_MINALIGN;
1468
1469 return ALIGN(align, sizeof(void *));
1470}
1471
1472static void init_kmem_cache_node(struct kmem_cache_node *n)
1473{
1474 n->nr_partial = 0;
1475 atomic_long_set(&n->nr_slabs, 0);
1476 spin_lock_init(&n->list_lock);
1477 INIT_LIST_HEAD(&n->partial);
1478}
1479
1480#ifdef CONFIG_NUMA
1481/*
1482 * No kmalloc_node yet so do it by hand. We know that this is the first
1483 * slab on the node for this slabcache. There are no concurrent accesses
1484 * possible.
1485 *
1486 * Note that this function only works on the kmalloc_node_cache
1487 * when allocating for the kmalloc_node_cache.
1488 */
1489static struct kmem_cache_node * __init early_kmem_cache_node_alloc(gfp_t gfpflags,
1490 int node)
1491{
1492 struct page *page;
1493 struct kmem_cache_node *n;
1494
1495 BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
1496
1497 page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node);
1498 /* new_slab() disables interrupts */
1499 local_irq_enable();
1500
1501 BUG_ON(!page);
1502 n = page->freelist;
1503 BUG_ON(!n);
1504 page->freelist = get_freepointer(kmalloc_caches, n);
1505 page->inuse++;
1506 kmalloc_caches->node[node] = n;
1507 init_object(kmalloc_caches, n, 1);
1508 init_kmem_cache_node(n);
1509 atomic_long_inc(&n->nr_slabs);
1510 add_partial(kmalloc_caches, page);
1511 return n;
1512}
1513
1514static void free_kmem_cache_nodes(struct kmem_cache *s)
1515{
1516 int node;
1517
1518 for_each_online_node(node) {
1519 struct kmem_cache_node *n = s->node[node];
1520 if (n && n != &s->local_node)
1521 kmem_cache_free(kmalloc_caches, n);
1522 s->node[node] = NULL;
1523 }
1524}
1525
1526static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
1527{
1528 int node;
1529 int local_node;
1530
1531 if (slab_state >= UP)
1532 local_node = page_to_nid(virt_to_page(s));
1533 else
1534 local_node = 0;
1535
1536 for_each_online_node(node) {
1537 struct kmem_cache_node *n;
1538
1539 if (local_node == node)
1540 n = &s->local_node;
1541 else {
1542 if (slab_state == DOWN) {
1543 n = early_kmem_cache_node_alloc(gfpflags,
1544 node);
1545 continue;
1546 }
1547 n = kmem_cache_alloc_node(kmalloc_caches,
1548 gfpflags, node);
1549
1550 if (!n) {
1551 free_kmem_cache_nodes(s);
1552 return 0;
1553 }
1554
1555 }
1556 s->node[node] = n;
1557 init_kmem_cache_node(n);
1558 }
1559 return 1;
1560}
1561#else
1562static void free_kmem_cache_nodes(struct kmem_cache *s)
1563{
1564}
1565
1566static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
1567{
1568 init_kmem_cache_node(&s->local_node);
1569 return 1;
1570}
1571#endif
1572
1573/*
1574 * calculate_sizes() determines the order and the distribution of data within
1575 * a slab object.
1576 */
1577static int calculate_sizes(struct kmem_cache *s)
1578{
1579 unsigned long flags = s->flags;
1580 unsigned long size = s->objsize;
1581 unsigned long align = s->align;
1582
1583 /*
1584 * Determine if we can poison the object itself. If the user of
1585 * the slab may touch the object after free or before allocation
1586 * then we should never poison the object itself.
1587 */
1588 if ((flags & SLAB_POISON) && !(flags & SLAB_DESTROY_BY_RCU) &&
1589 !s->ctor && !s->dtor)
1590 s->flags |= __OBJECT_POISON;
1591 else
1592 s->flags &= ~__OBJECT_POISON;
1593
1594 /*
1595 * Round up object size to the next word boundary. We can only
1596 * place the free pointer at word boundaries and this determines
1597 * the possible location of the free pointer.
1598 */
1599 size = ALIGN(size, sizeof(void *));
1600
1601 /*
1602 * If we are redzoning then check if there is some space between the
1603 * end of the object and the free pointer. If not then add an
1604 * additional word, so that we can establish a redzone between
1605 * the object and the freepointer to be able to check for overwrites.
1606 */
1607 if ((flags & SLAB_RED_ZONE) && size == s->objsize)
1608 size += sizeof(void *);
1609
1610 /*
1611 * With that we have determined how much of the slab is in actual
1612 * use by the object. This is the potential offset to the free
1613 * pointer.
1614 */
1615 s->inuse = size;
1616
1617 if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) ||
1618 s->ctor || s->dtor)) {
1619 /*
1620 * Relocate free pointer after the object if it is not
1621 * permitted to overwrite the first word of the object on
1622 * kmem_cache_free.
1623 *
1624 * This is the case if we do RCU, have a constructor or
1625 * destructor or are poisoning the objects.
1626 */
1627 s->offset = size;
1628 size += sizeof(void *);
1629 }
1630
1631 if (flags & SLAB_STORE_USER)
1632 /*
1633 * Need to store information about allocs and frees after
1634 * the object.
1635 */
1636 size += 2 * sizeof(struct track);
1637
1638 if (flags & DEBUG_DEFAULT_FLAGS)
1639 /*
1640 * Add some empty padding so that we can catch
1641 * overwrites from earlier objects rather than let
1642 * tracking information or the free pointer be
1643 * corrupted if a user writes before the start
1644 * of the object.
1645 */
1646 size += sizeof(void *);
1647 /*
1648 * Determine the alignment based on various parameters that the
1649 * user specified (this is unnecessarily complex due to the attempt
1650 * to be compatible with SLAB. Should be cleaned up some day).
1651 */
1652 align = calculate_alignment(flags, align, s->objsize);
1653
1654 /*
1655 * SLUB stores one object immediately after another beginning from
1656 * offset 0. In order to align the objects we have to simply size
1657 * each object to conform to the alignment.
1658 */
1659 size = ALIGN(size, align);
1660 s->size = size;
1661
1662 s->order = calculate_order(size);
1663 if (s->order < 0)
1664 return 0;
1665
1666 /*
1667 * Determine the number of objects per slab
1668 */
1669 s->objects = (PAGE_SIZE << s->order) / size;
1670
1671 /*
1672 * Verify that the number of objects is within permitted limits.
1673 * The page->inuse field is only 16 bit wide! So we cannot have
1674 * more than 64k objects per slab.
1675 */
1676 if (!s->objects || s->objects > 65535)
1677 return 0;
1678 return 1;
1679
1680}
1681
1682static int __init finish_bootstrap(void)
1683{
1684 struct list_head *h;
1685 int err;
1686
1687 slab_state = SYSFS;
1688
1689 list_for_each(h, &slab_caches) {
1690 struct kmem_cache *s =
1691 container_of(h, struct kmem_cache, list);
1692
1693 err = sysfs_slab_add(s);
1694 BUG_ON(err);
1695 }
1696 return 0;
1697}
1698
1699static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
1700 const char *name, size_t size,
1701 size_t align, unsigned long flags,
1702 void (*ctor)(void *, struct kmem_cache *, unsigned long),
1703 void (*dtor)(void *, struct kmem_cache *, unsigned long))
1704{
1705 memset(s, 0, kmem_size);
1706 s->name = name;
1707 s->ctor = ctor;
1708 s->dtor = dtor;
1709 s->objsize = size;
1710 s->flags = flags;
1711 s->align = align;
1712
1713 BUG_ON(flags & SLUB_UNIMPLEMENTED);
1714
1715 /*
1716 * The page->offset field is only 16 bit wide. This is an offset
1717 * in units of words from the beginning of an object. If the slab
1718 * size is bigger then we cannot move the free pointer behind the
1719 * object anymore.
1720 *
1721 * On 32 bit platforms the limit is 256k. On 64bit platforms
1722 * the limit is 512k.
1723 *
1724 * Debugging or ctor/dtors may create a need to move the free
1725 * pointer. Fail if this happens.
1726 */
1727 if (s->size >= 65535 * sizeof(void *)) {
1728 BUG_ON(flags & (SLAB_RED_ZONE | SLAB_POISON |
1729 SLAB_STORE_USER | SLAB_DESTROY_BY_RCU));
1730 BUG_ON(ctor || dtor);
1731 }
1732 else
1733 /*
1734 * Enable debugging if selected on the kernel commandline.
1735 */
1736 if (slub_debug && (!slub_debug_slabs ||
1737 strncmp(slub_debug_slabs, name,
1738 strlen(slub_debug_slabs)) == 0))
1739 s->flags |= slub_debug;
1740
1741 if (!calculate_sizes(s))
1742 goto error;
1743
1744 s->refcount = 1;
1745#ifdef CONFIG_NUMA
1746 s->defrag_ratio = 100;
1747#endif
1748
1749 if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
1750 return 1;
1751error:
1752 if (flags & SLAB_PANIC)
1753 panic("Cannot create slab %s size=%lu realsize=%u "
1754 "order=%u offset=%u flags=%lx\n",
1755 s->name, (unsigned long)size, s->size, s->order,
1756 s->offset, flags);
1757 return 0;
1758}
1759EXPORT_SYMBOL(kmem_cache_open);
1760
1761/*
1762 * Check if a given pointer is valid
1763 */
1764int kmem_ptr_validate(struct kmem_cache *s, const void *object)
1765{
1766 struct page * page;
1767 void *addr;
1768
1769 page = get_object_page(object);
1770
1771 if (!page || s != page->slab)
1772 /* No slab or wrong slab */
1773 return 0;
1774
1775 addr = page_address(page);
1776 if (object < addr || object >= addr + s->objects * s->size)
1777 /* Out of bounds */
1778 return 0;
1779
1780 if ((object - addr) % s->size)
1781 /* Improperly aligned */
1782 return 0;
1783
1784 /*
1785 * We could also check if the object is on the slab's freelist.
1786 * But this would be too expensive and it seems that the main
1787 * purpose of kmem_ptr_validate is to check if the object belongs
1788 * to a certain slab.
1789 */
1790 return 1;
1791}
1792EXPORT_SYMBOL(kmem_ptr_validate);
1793
1794/*
1795 * Determine the size of a slab object
1796 */
1797unsigned int kmem_cache_size(struct kmem_cache *s)
1798{
1799 return s->objsize;
1800}
1801EXPORT_SYMBOL(kmem_cache_size);
1802
1803const char *kmem_cache_name(struct kmem_cache *s)
1804{
1805 return s->name;
1806}
1807EXPORT_SYMBOL(kmem_cache_name);
1808
1809/*
1810 * Attempt to free all unused slabs on the given list of a node
1811 */
1812static int free_list(struct kmem_cache *s, struct kmem_cache_node *n,
1813 struct list_head *list)
1814{
1815 int slabs_inuse = 0;
1816 unsigned long flags;
1817 struct page *page, *h;
1818
1819 spin_lock_irqsave(&n->list_lock, flags);
1820 list_for_each_entry_safe(page, h, list, lru)
1821 if (!page->inuse) {
1822 list_del(&page->lru);
1823 discard_slab(s, page);
1824 } else
1825 slabs_inuse++;
1826 spin_unlock_irqrestore(&n->list_lock, flags);
1827 return slabs_inuse;
1828}
1829
1830/*
1831 * Release all resources used by slab cache
1832 */
1833static int kmem_cache_close(struct kmem_cache *s)
1834{
1835 int node;
1836
1837 flush_all(s);
1838
1839 /* Attempt to free all objects */
1840 for_each_online_node(node) {
1841 struct kmem_cache_node *n = get_node(s, node);
1842
1843 free_list(s, n, &n->partial);
1844 if (atomic_long_read(&n->nr_slabs))
1845 return 1;
1846 }
1847 free_kmem_cache_nodes(s);
1848 return 0;
1849}
1850
1851/*
1852 * Close a cache and release the kmem_cache structure
1853 * (must be used for caches created using kmem_cache_create)
1854 */
1855void kmem_cache_destroy(struct kmem_cache *s)
1856{
1857 down_write(&slub_lock);
1858 s->refcount--;
1859 if (!s->refcount) {
1860 list_del(&s->list);
1861 if (kmem_cache_close(s))
1862 WARN_ON(1);
1863 sysfs_slab_remove(s);
1864 kfree(s);
1865 }
1866 up_write(&slub_lock);
1867}
1868EXPORT_SYMBOL(kmem_cache_destroy);
1869
1870/********************************************************************
1871 * Kmalloc subsystem
1872 *******************************************************************/
1873
1874struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1] __cacheline_aligned;
1875EXPORT_SYMBOL(kmalloc_caches);
1876
1877#ifdef CONFIG_ZONE_DMA
1878static struct kmem_cache *kmalloc_caches_dma[KMALLOC_SHIFT_HIGH + 1];
1879#endif
1880
1881static int __init setup_slub_min_order(char *str)
1882{
1883 get_option(&str, &slub_min_order);
1884
1885 return 1;
1886}
1887
1888__setup("slub_min_order=", setup_slub_min_order);
1889
1890static int __init setup_slub_max_order(char *str)
1891{
1892 get_option(&str, &slub_max_order);
1893
1894 return 1;
1895}
1896
1897__setup("slub_max_order=", setup_slub_max_order);
1898
1899static int __init setup_slub_min_objects(char *str)
1900{
1901 get_option(&str, &slub_min_objects);
1902
1903 return 1;
1904}
1905
1906__setup("slub_min_objects=", setup_slub_min_objects);
1907
1908static int __init setup_slub_nomerge(char *str)
1909{
1910 slub_nomerge = 1;
1911 return 1;
1912}
1913
1914__setup("slub_nomerge", setup_slub_nomerge);
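/*
 * Illustrative boot parameters (values are examples only):
 *
 *	slub_min_order=1	never use slabs smaller than order 1
 *	slub_max_order=3	upper bound considered when sizing slabs
 *	slub_min_objects=16	aim for at least 16 objects per slab
 *	slub_nomerge		keep every cache separate (no alias merging)
 */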
1915
1916static int __init setup_slub_debug(char *str)
1917{
1918 if (!str || *str != '=')
1919 slub_debug = DEBUG_DEFAULT_FLAGS;
1920 else {
1921 str++;
1922 if (*str == 0 || *str == ',')
1923 slub_debug = DEBUG_DEFAULT_FLAGS;
1924 else
1925 for (; *str && *str != ','; str++)
1926 switch (*str) {
1927 case 'f': case 'F':
1928 slub_debug |= SLAB_DEBUG_FREE;
1929 break;
1930 case 'z': case 'Z':
1931 slub_debug |= SLAB_RED_ZONE;
1932 break;
1933 case 'p': case 'P':
1934 slub_debug |= SLAB_POISON;
1935 break;
1936 case 'u': case 'U':
1937 slub_debug |= SLAB_STORE_USER;
1938 break;
1939 case 't': case 'T':
1940 slub_debug |= SLAB_TRACE;
1941 break;
1942 default:
1943 printk(KERN_ERR "slub_debug option '%c' "
1944 "unknown. skipped\n", *str);
1945 }
1946 }
1947
1948 if (*str == ',')
1949 slub_debug_slabs = str + 1;
1950 return 1;
1951}
1952
1953__setup("slub_debug", setup_slub_debug);
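/*
 * Example usages of the slub_debug parameter:
 *
 *	slub_debug		enable DEBUG_DEFAULT_FLAGS for all caches
 *	slub_debug=FZ		sanity checks plus red zoning for all caches
 *	slub_debug=,dentry	default debug flags only for caches whose
 *				name begins with "dentry"
 */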
1954
1955static struct kmem_cache *create_kmalloc_cache(struct kmem_cache *s,
1956 const char *name, int size, gfp_t gfp_flags)
1957{
1958 unsigned int flags = 0;
1959
1960 if (gfp_flags & SLUB_DMA)
1961 flags = SLAB_CACHE_DMA;
1962
1963 down_write(&slub_lock);
1964 if (!kmem_cache_open(s, gfp_flags, name, size, ARCH_KMALLOC_MINALIGN,
1965 flags, NULL, NULL))
1966 goto panic;
1967
1968 list_add(&s->list, &slab_caches);
1969 up_write(&slub_lock);
1970 if (sysfs_slab_add(s))
1971 goto panic;
1972 return s;
1973
1974panic:
1975 panic("Creation of kmalloc slab %s size=%d failed.\n", name, size);
1976}
1977
1978static struct kmem_cache *get_slab(size_t size, gfp_t flags)
1979{
1980 int index = kmalloc_index(size);
1981
1982 if (!size)
1983 return NULL;
1984
1985 /* Allocation too large? */
1986 BUG_ON(index < 0);
1987
1988#ifdef CONFIG_ZONE_DMA
1989 if ((flags & SLUB_DMA)) {
1990 struct kmem_cache *s;
1991 struct kmem_cache *x;
1992 char *text;
1993 size_t realsize;
1994
1995 s = kmalloc_caches_dma[index];
1996 if (s)
1997 return s;
1998
1999 /* Dynamically create dma cache */
2000 x = kmalloc(kmem_size, flags & ~SLUB_DMA);
2001 if (!x)
2002 panic("Unable to allocate memory for dma cache\n");
2003
2004 /* Indices 1 and 2 are the non power of two caches (96 and 192 bytes) */
2005 if (index == 1)
2006 realsize = 96;
2007 else if (index == 2)
2008 realsize = 192;
2009 else
2010 realsize = 1 << index;
2012
2013 text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
2014 (unsigned int)realsize);
2015 s = create_kmalloc_cache(x, text, realsize, flags);
2016 kmalloc_caches_dma[index] = s;
2017 return s;
2018 }
2019#endif
2020 return &kmalloc_caches[index];
2021}
2022
2023void *__kmalloc(size_t size, gfp_t flags)
2024{
2025 struct kmem_cache *s = get_slab(size, flags);
2026
2027 if (s)
2028 return kmem_cache_alloc(s, flags);
2029 return NULL;
2030}
2031EXPORT_SYMBOL(__kmalloc);
2032
2033#ifdef CONFIG_NUMA
2034void *__kmalloc_node(size_t size, gfp_t flags, int node)
2035{
2036 struct kmem_cache *s = get_slab(size, flags);
2037
2038 if (s)
2039 return kmem_cache_alloc_node(s, flags, node);
2040 return NULL;
2041}
2042EXPORT_SYMBOL(__kmalloc_node);
2043#endif
2044
2045size_t ksize(const void *object)
2046{
2047 struct page *page = get_object_page(object);
2048 struct kmem_cache *s;
2049
2050 BUG_ON(!page);
2051 s = page->slab;
2052 BUG_ON(!s);
2053
2054 /*
2055 * Debugging requires use of the padding between object
2056 * and whatever may come after it.
2057 */
2058 if (s->flags & (SLAB_RED_ZONE | SLAB_POISON))
2059 return s->objsize;
2060
2061 /*
2062 * If we have the need to store the freelist pointer
2063 * back there or track user information then we can
2064 * only use the space before that information.
2065 */
2066 if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
2067 return s->inuse;
2068
2069 /*
2070 * Else we can use all the padding etc for the allocation
2071 */
2072 return s->size;
2073}
2074EXPORT_SYMBOL(ksize);
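/*
 * Note that ksize() reports the usable size, not the requested one. A
 * kmalloc(100, GFP_KERNEL) allocation, for instance, is served from the
 * 128 byte kmalloc cache and (without debugging) ksize() returns 128.
 */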
2075
2076void kfree(const void *x)
2077{
2078 struct kmem_cache *s;
2079 struct page *page;
2080
2081 if (!x)
2082 return;
2083
2084 page = virt_to_page(x);
2085
2086 if (unlikely(PageCompound(page)))
2087 page = page->first_page;
2088
2089 s = page->slab;
2090
2091 if (unlikely(PageError(page) && (s->flags & SLAB_STORE_USER)))
2092 set_tracking(s, (void *)x, TRACK_FREE);
2093 slab_free(s, page, (void *)x);
2094}
2095EXPORT_SYMBOL(kfree);
2096
2097/**
2098 * krealloc - reallocate memory. The contents will remain unchanged.
2099 *
2100 * @p: object to reallocate memory for.
2101 * @new_size: how many bytes of memory are required.
2102 * @flags: the type of memory to allocate.
2103 *
2104 * The contents of the object pointed to are preserved up to the
2105 * lesser of the new and old sizes. If @p is %NULL, krealloc()
2106 * behaves exactly like kmalloc(). If @new_size is 0 and @p is not a
2107 * %NULL pointer, the object pointed to is freed.
2108 */
2109void *krealloc(const void *p, size_t new_size, gfp_t flags)
2110{
2111 struct kmem_cache *new_cache;
2112 void *ret;
2113 struct page *page;
2114
2115 if (unlikely(!p))
2116 return kmalloc(new_size, flags);
2117
2118 if (unlikely(!new_size)) {
2119 kfree(p);
2120 return NULL;
2121 }
2122
2123 page = virt_to_page(p);
2124
2125 if (unlikely(PageCompound(page)))
2126 page = page->first_page;
2127
2128 new_cache = get_slab(new_size, flags);
2129
2130 /*
2131 * If new size fits in the current cache, bail out.
2132 */
2133 if (likely(page->slab == new_cache))
2134 return (void *)p;
2135
2136 ret = kmalloc(new_size, flags);
2137 if (ret) {
2138 memcpy(ret, p, min(new_size, ksize(p)));
2139 kfree(p);
2140 }
2141 return ret;
2142}
2143EXPORT_SYMBOL(krealloc);
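/*
 * Typical (illustrative) use. krealloc() may return NULL, in which case
 * the original buffer is left untouched and must still be freed:
 *
 *	new = krealloc(buf, new_len, GFP_KERNEL);
 *	if (!new)
 *		goto err_keep_buf;
 *	buf = new;
 */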
2144
2145/********************************************************************
2146 * Basic setup of slabs
2147 *******************************************************************/
2148
2149void __init kmem_cache_init(void)
2150{
2151 int i;
2152
2153#ifdef CONFIG_NUMA
2154 /*
2155 * Must first have the slab cache available for the allocation of
2156 * struct kmem_cache_node structures. There is special bootstrap code in
2157 * kmem_cache_open for slab_state == DOWN.
2158 */
2159 create_kmalloc_cache(&kmalloc_caches[0], "kmem_cache_node",
2160 sizeof(struct kmem_cache_node), GFP_KERNEL);
2161#endif
2162
2163 /* Able to allocate the per node structures */
2164 slab_state = PARTIAL;
2165
2166 /* Caches that are not of power-of-two size */
2167 create_kmalloc_cache(&kmalloc_caches[1],
2168 "kmalloc-96", 96, GFP_KERNEL);
2169 create_kmalloc_cache(&kmalloc_caches[2],
2170 "kmalloc-192", 192, GFP_KERNEL);
2171
2172 for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
2173 create_kmalloc_cache(&kmalloc_caches[i],
2174 "kmalloc", 1 << i, GFP_KERNEL);
2175
2176 slab_state = UP;
2177
2178 /* Provide the correct kmalloc names now that the caches are up */
2179 for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
2180 kmalloc_caches[i].name =
2181 kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
2182
2183#ifdef CONFIG_SMP
2184 register_cpu_notifier(&slab_notifier);
2185#endif
2186
2187 if (nr_cpu_ids) /* Remove when nr_cpu_ids is fixed upstream ! */
2188 kmem_size = offsetof(struct kmem_cache, cpu_slab)
2189 + nr_cpu_ids * sizeof(struct page *);
2190
2191 printk(KERN_INFO "SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d,"
2192 " Processors=%d, Nodes=%d\n",
2193 KMALLOC_SHIFT_HIGH, L1_CACHE_BYTES,
2194 slub_min_order, slub_max_order, slub_min_objects,
2195 nr_cpu_ids, nr_node_ids);
2196}
2197
2198/*
2199 * Find a mergeable slab cache
2200 */
2201static int slab_unmergeable(struct kmem_cache *s)
2202{
2203 if (slub_nomerge || (s->flags & SLUB_NEVER_MERGE))
2204 return 1;
2205
2206 if (s->ctor || s->dtor)
2207 return 1;
2208
2209 return 0;
2210}
2211
2212static struct kmem_cache *find_mergeable(size_t size,
2213 size_t align, unsigned long flags,
2214 void (*ctor)(void *, struct kmem_cache *, unsigned long),
2215 void (*dtor)(void *, struct kmem_cache *, unsigned long))
2216{
2217 struct list_head *h;
2218
2219 if (slub_nomerge || (flags & SLUB_NEVER_MERGE))
2220 return NULL;
2221
2222 if (ctor || dtor)
2223 return NULL;
2224
2225 size = ALIGN(size, sizeof(void *));
2226 align = calculate_alignment(flags, align, size);
2227 size = ALIGN(size, align);
2228
2229 list_for_each(h, &slab_caches) {
2230 struct kmem_cache *s =
2231 container_of(h, struct kmem_cache, list);
2232
2233 if (slab_unmergeable(s))
2234 continue;
2235
2236 if (size > s->size)
2237 continue;
2238
2239 if (((flags | slub_debug) & SLUB_MERGE_SAME) !=
2240 (s->flags & SLUB_MERGE_SAME))
2241 continue;
2242 /*
2243 * Check if alignment is compatible.
2244 * Courtesy of Adrian Drzewiecki
2245 */
2246 if ((s->size & ~(align - 1)) != s->size)
2247 continue;
2248
2249 if (s->size - size >= sizeof(void *))
2250 continue;
2251
2252 return s;
2253 }
2254 return NULL;
2255}
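/*
 * Example: with word alignment a request for a 52 byte cache without
 * ctor/dtor and with compatible flags rounds up to 56 bytes on 64 bit
 * and is simply aliased to an existing 56 byte cache, since the per
 * object waste stays below sizeof(void *).
 */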
2256
2257struct kmem_cache *kmem_cache_create(const char *name, size_t size,
2258 size_t align, unsigned long flags,
2259 void (*ctor)(void *, struct kmem_cache *, unsigned long),
2260 void (*dtor)(void *, struct kmem_cache *, unsigned long))
2261{
2262 struct kmem_cache *s;
2263
2264 down_write(&slub_lock);
2265 s = find_mergeable(size, align, flags, ctor, dtor);
2266 if (s) {
2267 s->refcount++;
2268 /*
2269 * Adjust the object sizes so that we clear
2270 * the complete object on kzalloc.
2271 */
2272 s->objsize = max(s->objsize, (int)size);
2273 s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
2274 if (sysfs_slab_alias(s, name))
2275 goto err;
2276 } else {
2277 s = kmalloc(kmem_size, GFP_KERNEL);
2278 if (s && kmem_cache_open(s, GFP_KERNEL, name,
2279 size, align, flags, ctor, dtor)) {
2280 if (sysfs_slab_add(s)) {
2281 kfree(s);
2282 goto err;
2283 }
2284 list_add(&s->list, &slab_caches);
2285 } else
2286 kfree(s);
2287 }
2288 up_write(&slub_lock);
2289 return s;
2290
2291err:
2292 up_write(&slub_lock);
2293 if (flags & SLAB_PANIC)
2294 panic("Cannot create slabcache %s\n", name);
2295 else
2296 s = NULL;
2297 return s;
2298}
2299EXPORT_SYMBOL(kmem_cache_create);
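/*
 * Minimal usage sketch (struct foo and foo_cachep are illustrative):
 *
 *	foo_cachep = kmem_cache_create("foo", sizeof(struct foo),
 *					0, SLAB_HWCACHE_ALIGN, NULL, NULL);
 *	p = kmem_cache_zalloc(foo_cachep, GFP_KERNEL);
 *	...
 *	kmem_cache_free(foo_cachep, p);
 *	kmem_cache_destroy(foo_cachep);
 */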
2300
2301void *kmem_cache_zalloc(struct kmem_cache *s, gfp_t flags)
2302{
2303 void *x;
2304
2305 x = kmem_cache_alloc(s, flags);
2306 if (x)
2307 memset(x, 0, s->objsize);
2308 return x;
2309}
2310EXPORT_SYMBOL(kmem_cache_zalloc);
2311
2312#ifdef CONFIG_SMP
2313static void for_all_slabs(void (*func)(struct kmem_cache *, int), int cpu)
2314{
2315 struct list_head *h;
2316
2317 down_read(&slub_lock);
2318 list_for_each(h, &slab_caches) {
2319 struct kmem_cache *s =
2320 container_of(h, struct kmem_cache, list);
2321
2322 func(s, cpu);
2323 }
2324 up_read(&slub_lock);
2325}
2326
2327/*
2328 * Use the cpu notifier to ensure that the cpu slabs are flushed
2329 * when necessary.
2330 */
2331static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
2332 unsigned long action, void *hcpu)
2333{
2334 long cpu = (long)hcpu;
2335
2336 switch (action) {
2337 case CPU_UP_CANCELED:
2338 case CPU_DEAD:
2339 for_all_slabs(__flush_cpu_slab, cpu);
2340 break;
2341 default:
2342 break;
2343 }
2344 return NOTIFY_OK;
2345}
2346
2347static struct notifier_block __cpuinitdata slab_notifier =
2348 { &slab_cpuup_callback, NULL, 0 };
2349
2350#endif
2351
2352/***************************************************************
2353 * Compatibility definitions
2354 **************************************************************/
2355
2356int kmem_cache_shrink(struct kmem_cache *s)
2357{
2358 flush_all(s);
2359 return 0;
2360}
2361EXPORT_SYMBOL(kmem_cache_shrink);
2362
2363#ifdef CONFIG_NUMA
2364
2365/*****************************************************************
2366 * Generic reaper used to support the page allocator
2367 * (the cpu slabs are reaped by a per slab workqueue).
2368 *
2369 * Maybe move this to the page allocator?
2370 ****************************************************************/
2371
2372static DEFINE_PER_CPU(unsigned long, reap_node);
2373
2374static void init_reap_node(int cpu)
2375{
2376 int node;
2377
2378 node = next_node(cpu_to_node(cpu), node_online_map);
2379 if (node == MAX_NUMNODES)
2380 node = first_node(node_online_map);
2381
2382 __get_cpu_var(reap_node) = node;
2383}
2384
2385static void next_reap_node(void)
2386{
2387 int node = __get_cpu_var(reap_node);
2388
2389 /*
2390 * Also drain per cpu pages on remote zones
2391 */
2392 if (node != numa_node_id())
2393 drain_node_pages(node);
2394
2395 node = next_node(node, node_online_map);
2396 if (unlikely(node >= MAX_NUMNODES))
2397 node = first_node(node_online_map);
2398 __get_cpu_var(reap_node) = node;
2399}
2400#else
2401#define init_reap_node(cpu) do { } while (0)
2402#define next_reap_node(void) do { } while (0)
2403#endif
2404
2405#define REAPTIMEOUT_CPUC (2*HZ)
2406
2407#ifdef CONFIG_SMP
2408static DEFINE_PER_CPU(struct delayed_work, reap_work);
2409
2410static void cache_reap(struct work_struct *unused)
2411{
2412 next_reap_node();
2413 refresh_cpu_vm_stats(smp_processor_id());
2414 schedule_delayed_work(&__get_cpu_var(reap_work),
2415 REAPTIMEOUT_CPUC);
2416}
2417
2418static void __devinit start_cpu_timer(int cpu)
2419{
2420 struct delayed_work *reap_work = &per_cpu(reap_work, cpu);
2421
2422 /*
2423 * When this gets called from do_initcalls via cpucache_init(),
2424 * init_workqueues() has already run, so keventd will be setup
2425 * at that time.
2426 */
2427 if (keventd_up() && reap_work->work.func == NULL) {
2428 init_reap_node(cpu);
2429 INIT_DELAYED_WORK(reap_work, cache_reap);
2430 schedule_delayed_work_on(cpu, reap_work, HZ + 3 * cpu);
2431 }
2432}
2433
2434static int __init cpucache_init(void)
2435{
2436 int cpu;
2437
2438 /*
2439 * Register the timers that drain pcp pages and update vm statistics
2440 */
2441 for_each_online_cpu(cpu)
2442 start_cpu_timer(cpu);
2443 return 0;
2444}
2445__initcall(cpucache_init);
2446#endif
2447
2448#ifdef SLUB_RESILIENCY_TEST
2449static unsigned long validate_slab_cache(struct kmem_cache *s);
2450
2451static void resiliency_test(void)
2452{
2453 u8 *p;
2454
2455 printk(KERN_ERR "SLUB resiliency testing\n");
2456 printk(KERN_ERR "-----------------------\n");
2457 printk(KERN_ERR "A. Corruption after allocation\n");
2458
2459 p = kzalloc(16, GFP_KERNEL);
2460 p[16] = 0x12;
2461 printk(KERN_ERR "\n1. kmalloc-16: Clobber Redzone/next pointer"
2462 " 0x12->0x%p\n\n", p + 16);
2463
2464 validate_slab_cache(kmalloc_caches + 4);
2465
2466 /* Hmmm... The next two are dangerous */
2467 p = kzalloc(32, GFP_KERNEL);
2468 p[32 + sizeof(void *)] = 0x34;
2469 printk(KERN_ERR "\n2. kmalloc-32: Clobber next pointer/next slab"
2470 " 0x34 -> -0x%p\n", p);
2471 printk(KERN_ERR "If allocated object is overwritten then not detectable\n\n");
2472
2473 validate_slab_cache(kmalloc_caches + 5);
2474 p = kzalloc(64, GFP_KERNEL);
2475 p += 64 + (get_cycles() & 0xff) * sizeof(void *);
2476 *p = 0x56;
2477 printk(KERN_ERR "\n3. kmalloc-64: corrupting random byte 0x56->0x%p\n",
2478 p);
2479 printk(KERN_ERR "If allocated object is overwritten then not detectable\n\n");
2480 validate_slab_cache(kmalloc_caches + 6);
2481
2482 printk(KERN_ERR "\nB. Corruption after free\n");
2483 p = kzalloc(128, GFP_KERNEL);
2484 kfree(p);
2485 *p = 0x78;
2486 printk(KERN_ERR "1. kmalloc-128: Clobber first word 0x78->0x%p\n\n", p);
2487 validate_slab_cache(kmalloc_caches + 7);
2488
2489 p = kzalloc(256, GFP_KERNEL);
2490 kfree(p);
2491 p[50] = 0x9a;
2492 printk(KERN_ERR "\n2. kmalloc-256: Clobber 50th byte 0x9a->0x%p\n\n", p);
2493 validate_slab_cache(kmalloc_caches + 8);
2494
2495 p = kzalloc(512, GFP_KERNEL);
2496 kfree(p);
2497 p[512] = 0xab;
2498 printk(KERN_ERR "\n3. kmalloc-512: Clobber redzone 0xab->0x%p\n\n", p);
2499 validate_slab_cache(kmalloc_caches + 9);
2500}
2501#else
2502 static void resiliency_test(void) {}
2503#endif
2504
2505/*
2506 * These are not as efficient as kmalloc for the non debug case.
2507 * We do not have the page struct available so we have to touch one
2508 * cacheline in struct kmem_cache to check slab flags.
2509 */
2510void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, void *caller)
2511{
2512 struct kmem_cache *s = get_slab(size, gfpflags);
2513 void *object;
2514
2515 if (!s)
2516 return NULL;
2517
2518 object = kmem_cache_alloc(s, gfpflags);
2519
2520 if (object && (s->flags & SLAB_STORE_USER))
2521 set_track(s, object, TRACK_ALLOC, caller);
2522
2523 return object;
2524}
2525
2526void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
2527 int node, void *caller)
2528{
2529 struct kmem_cache *s = get_slab(size, gfpflags);
2530 void *object;
2531
2532 if (!s)
2533 return NULL;
2534
2535 object = kmem_cache_alloc_node(s, gfpflags, node);
2536
2537 if (object && (s->flags & SLAB_STORE_USER))
2538 set_track(s, object, TRACK_ALLOC, caller);
2539
2540 return object;
2541}
2542
2543#ifdef CONFIG_SYSFS
2544
2545static unsigned long count_partial(struct kmem_cache_node *n)
2546{
2547 unsigned long flags;
2548 unsigned long x = 0;
2549 struct page *page;
2550
2551 spin_lock_irqsave(&n->list_lock, flags);
2552 list_for_each_entry(page, &n->partial, lru)
2553 x += page->inuse;
2554 spin_unlock_irqrestore(&n->list_lock, flags);
2555 return x;
2556}
2557
2558enum slab_stat_type {
2559 SL_FULL,
2560 SL_PARTIAL,
2561 SL_CPU,
2562 SL_OBJECTS
2563};
2564
2565#define SO_FULL (1 << SL_FULL)
2566#define SO_PARTIAL (1 << SL_PARTIAL)
2567#define SO_CPU (1 << SL_CPU)
2568#define SO_OBJECTS (1 << SL_OBJECTS)
2569
2570static unsigned long slab_objects(struct kmem_cache *s,
2571 char *buf, unsigned long flags)
2572{
2573 unsigned long total = 0;
2574 int cpu;
2575 int node;
2576 int x;
2577 unsigned long *nodes;
2578 unsigned long *per_cpu;
2579
2580 nodes = kzalloc(2 * sizeof(unsigned long) * nr_node_ids, GFP_KERNEL);
2581 per_cpu = nodes + nr_node_ids;
2582
2583 for_each_possible_cpu(cpu) {
2584 struct page *page = s->cpu_slab[cpu];
2585 int node;
2586
2587 if (page) {
2588 node = page_to_nid(page);
2589 if (flags & SO_CPU) {
2590 int x = 0;
2591
2592 if (flags & SO_OBJECTS)
2593 x = page->inuse;
2594 else
2595 x = 1;
2596 total += x;
2597 nodes[node] += x;
2598 }
2599 per_cpu[node]++;
2600 }
2601 }
2602
2603 for_each_online_node(node) {
2604 struct kmem_cache_node *n = get_node(s, node);
2605
2606 if (flags & SO_PARTIAL) {
2607 if (flags & SO_OBJECTS)
2608 x = count_partial(n);
2609 else
2610 x = n->nr_partial;
2611 total += x;
2612 nodes[node] += x;
2613 }
2614
2615 if (flags & SO_FULL) {
2616 int full_slabs = atomic_read(&n->nr_slabs)
2617 - per_cpu[node]
2618 - n->nr_partial;
2619
2620 if (flags & SO_OBJECTS)
2621 x = full_slabs * s->objects;
2622 else
2623 x = full_slabs;
2624 total += x;
2625 nodes[node] += x;
2626 }
2627 }
2628
2629 x = sprintf(buf, "%lu", total);
2630#ifdef CONFIG_NUMA
2631 for_each_online_node(node)
2632 if (nodes[node])
2633 x += sprintf(buf + x, " N%d=%lu",
2634 node, nodes[node]);
2635#endif
2636 kfree(nodes);
2637 return x + sprintf(buf + x, "\n");
2638}
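/*
 * The resulting buffer starts with the total followed by optional per
 * node breakdowns, e.g. "12 N0=6 N1=6" on a two node NUMA system.
 */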
2639
2640static int any_slab_objects(struct kmem_cache *s)
2641{
2642 int node;
2643 int cpu;
2644
2645 for_each_possible_cpu(cpu)
2646 if (s->cpu_slab[cpu])
2647 return 1;
2648
2649 for_each_node(node) {
2650 struct kmem_cache_node *n = get_node(s, node);
2651
2652 if (n->nr_partial || atomic_read(&n->nr_slabs))
2653 return 1;
2654 }
2655 return 0;
2656}
2657
2658#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
2659 #define to_slab(n) container_of(n, struct kmem_cache, kobj)
2660
2661struct slab_attribute {
2662 struct attribute attr;
2663 ssize_t (*show)(struct kmem_cache *s, char *buf);
2664 ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
2665};
2666
2667#define SLAB_ATTR_RO(_name) \
2668 static struct slab_attribute _name##_attr = __ATTR_RO(_name)
2669
2670#define SLAB_ATTR(_name) \
2671 static struct slab_attribute _name##_attr = \
2672 __ATTR(_name, 0644, _name##_show, _name##_store)
2673
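/*
 * SLAB_ATTR(trace), for example, expands to
 *
 *	static struct slab_attribute trace_attr =
 *		__ATTR(trace, 0644, trace_show, trace_store);
 *
 * i.e. a read/write sysfs attribute backed by trace_show()/trace_store().
 */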
2674
2675static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
2676{
2677 return sprintf(buf, "%d\n", s->size);
2678}
2679SLAB_ATTR_RO(slab_size);
2680
2681static ssize_t align_show(struct kmem_cache *s, char *buf)
2682{
2683 return sprintf(buf, "%d\n", s->align);
2684}
2685SLAB_ATTR_RO(align);
2686
2687static ssize_t object_size_show(struct kmem_cache *s, char *buf)
2688{
2689 return sprintf(buf, "%d\n", s->objsize);
2690}
2691SLAB_ATTR_RO(object_size);
2692
2693static ssize_t objs_per_slab_show(struct kmem_cache *s, char *buf)
2694{
2695 return sprintf(buf, "%d\n", s->objects);
2696}
2697SLAB_ATTR_RO(objs_per_slab);
2698
2699static ssize_t order_show(struct kmem_cache *s, char *buf)
2700{
2701 return sprintf(buf, "%d\n", s->order);
2702}
2703SLAB_ATTR_RO(order);
2704
2705static ssize_t ctor_show(struct kmem_cache *s, char *buf)
2706{
2707 if (s->ctor) {
2708 int n = sprint_symbol(buf, (unsigned long)s->ctor);
2709
2710 return n + sprintf(buf + n, "\n");
2711 }
2712 return 0;
2713}
2714SLAB_ATTR_RO(ctor);
2715
2716static ssize_t dtor_show(struct kmem_cache *s, char *buf)
2717{
2718 if (s->dtor) {
2719 int n = sprint_symbol(buf, (unsigned long)s->dtor);
2720
2721 return n + sprintf(buf + n, "\n");
2722 }
2723 return 0;
2724}
2725SLAB_ATTR_RO(dtor);
2726
2727static ssize_t aliases_show(struct kmem_cache *s, char *buf)
2728{
2729 return sprintf(buf, "%d\n", s->refcount - 1);
2730}
2731SLAB_ATTR_RO(aliases);
2732
2733static ssize_t slabs_show(struct kmem_cache *s, char *buf)
2734{
2735 return slab_objects(s, buf, SO_FULL|SO_PARTIAL|SO_CPU);
2736}
2737SLAB_ATTR_RO(slabs);
2738
2739static ssize_t partial_show(struct kmem_cache *s, char *buf)
2740{
2741 return slab_objects(s, buf, SO_PARTIAL);
2742}
2743SLAB_ATTR_RO(partial);
2744
2745static ssize_t cpu_slabs_show(struct kmem_cache *s, char *buf)
2746{
2747 return slab_objects(s, buf, SO_CPU);
2748}
2749SLAB_ATTR_RO(cpu_slabs);
2750
2751static ssize_t objects_show(struct kmem_cache *s, char *buf)
2752{
2753 return slab_objects(s, buf, SO_FULL|SO_PARTIAL|SO_CPU|SO_OBJECTS);
2754}
2755SLAB_ATTR_RO(objects);
2756
2757static ssize_t sanity_checks_show(struct kmem_cache *s, char *buf)
2758{
2759 return sprintf(buf, "%d\n", !!(s->flags & SLAB_DEBUG_FREE));
2760}
2761
2762static ssize_t sanity_checks_store(struct kmem_cache *s,
2763 const char *buf, size_t length)
2764{
2765 s->flags &= ~SLAB_DEBUG_FREE;
2766 if (buf[0] == '1')
2767 s->flags |= SLAB_DEBUG_FREE;
2768 return length;
2769}
2770SLAB_ATTR(sanity_checks);
2771
2772static ssize_t trace_show(struct kmem_cache *s, char *buf)
2773{
2774 return sprintf(buf, "%d\n", !!(s->flags & SLAB_TRACE));
2775}
2776
2777static ssize_t trace_store(struct kmem_cache *s, const char *buf,
2778 size_t length)
2779{
2780 s->flags &= ~SLAB_TRACE;
2781 if (buf[0] == '1')
2782 s->flags |= SLAB_TRACE;
2783 return length;
2784}
2785SLAB_ATTR(trace);
2786
2787static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
2788{
2789 return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
2790}
2791
2792static ssize_t reclaim_account_store(struct kmem_cache *s,
2793 const char *buf, size_t length)
2794{
2795 s->flags &= ~SLAB_RECLAIM_ACCOUNT;
2796 if (buf[0] == '1')
2797 s->flags |= SLAB_RECLAIM_ACCOUNT;
2798 return length;
2799}
2800SLAB_ATTR(reclaim_account);
2801
2802static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf)
2803{
2804 return sprintf(buf, "%d\n", !!(s->flags &
2805 (SLAB_HWCACHE_ALIGN|SLAB_MUST_HWCACHE_ALIGN)));
2806}
2807SLAB_ATTR_RO(hwcache_align);
2808
2809#ifdef CONFIG_ZONE_DMA
2810static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
2811{
2812 return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA));
2813}
2814SLAB_ATTR_RO(cache_dma);
2815#endif
2816
2817static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
2818{
2819 return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
2820}
2821SLAB_ATTR_RO(destroy_by_rcu);
2822
2823static ssize_t red_zone_show(struct kmem_cache *s, char *buf)
2824{
2825 return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE));
2826}
2827
2828static ssize_t red_zone_store(struct kmem_cache *s,
2829 const char *buf, size_t length)
2830{
2831 if (any_slab_objects(s))
2832 return -EBUSY;
2833
2834 s->flags &= ~SLAB_RED_ZONE;
2835 if (buf[0] == '1')
2836 s->flags |= SLAB_RED_ZONE;
2837 calculate_sizes(s);
2838 return length;
2839}
2840SLAB_ATTR(red_zone);
2841
2842static ssize_t poison_show(struct kmem_cache *s, char *buf)
2843{
2844 return sprintf(buf, "%d\n", !!(s->flags & SLAB_POISON));
2845}
2846
2847static ssize_t poison_store(struct kmem_cache *s,
2848 const char *buf, size_t length)
2849{
2850 if (any_slab_objects(s))
2851 return -EBUSY;
2852
2853 s->flags &= ~SLAB_POISON;
2854 if (buf[0] == '1')
2855 s->flags |= SLAB_POISON;
2856 calculate_sizes(s);
2857 return length;
2858}
2859SLAB_ATTR(poison);
2860
2861static ssize_t store_user_show(struct kmem_cache *s, char *buf)
2862{
2863 return sprintf(buf, "%d\n", !!(s->flags & SLAB_STORE_USER));
2864}
2865
2866static ssize_t store_user_store(struct kmem_cache *s,
2867 const char *buf, size_t length)
2868{
2869 if (any_slab_objects(s))
2870 return -EBUSY;
2871
2872 s->flags &= ~SLAB_STORE_USER;
2873 if (buf[0] == '1')
2874 s->flags |= SLAB_STORE_USER;
2875 calculate_sizes(s);
2876 return length;
2877}
2878SLAB_ATTR(store_user);
2879
2880#ifdef CONFIG_NUMA
2881static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
2882{
2883 return sprintf(buf, "%d\n", s->defrag_ratio / 10);
2884}
2885
2886static ssize_t defrag_ratio_store(struct kmem_cache *s,
2887 const char *buf, size_t length)
2888{
2889 int n = simple_strtoul(buf, NULL, 10);
2890
2891 if (n < 100)
2892 s->defrag_ratio = n * 10;
2893 return length;
2894}
2895SLAB_ATTR(defrag_ratio);
2896#endif
2897
2898 static struct attribute *slab_attrs[] = {
2899 &slab_size_attr.attr,
2900 &object_size_attr.attr,
2901 &objs_per_slab_attr.attr,
2902 &order_attr.attr,
2903 &objects_attr.attr,
2904 &slabs_attr.attr,
2905 &partial_attr.attr,
2906 &cpu_slabs_attr.attr,
2907 &ctor_attr.attr,
2908 &dtor_attr.attr,
2909 &aliases_attr.attr,
2910 &align_attr.attr,
2911 &sanity_checks_attr.attr,
2912 &trace_attr.attr,
2913 &hwcache_align_attr.attr,
2914 &reclaim_account_attr.attr,
2915 &destroy_by_rcu_attr.attr,
2916 &red_zone_attr.attr,
2917 &poison_attr.attr,
2918 &store_user_attr.attr,
2919#ifdef CONFIG_ZONE_DMA
2920 &cache_dma_attr.attr,
2921#endif
2922#ifdef CONFIG_NUMA
2923 &defrag_ratio_attr.attr,
2924#endif
2925 NULL
2926};
2927
2928static struct attribute_group slab_attr_group = {
2929 .attrs = slab_attrs,
2930};
2931
2932static ssize_t slab_attr_show(struct kobject *kobj,
2933 struct attribute *attr,
2934 char *buf)
2935{
2936 struct slab_attribute *attribute;
2937 struct kmem_cache *s;
2938 int err;
2939
2940 attribute = to_slab_attr(attr);
2941 s = to_slab(kobj);
2942
2943 if (!attribute->show)
2944 return -EIO;
2945
2946 err = attribute->show(s, buf);
2947
2948 return err;
2949}
2950
2951static ssize_t slab_attr_store(struct kobject *kobj,
2952 struct attribute *attr,
2953 const char *buf, size_t len)
2954{
2955 struct slab_attribute *attribute;
2956 struct kmem_cache *s;
2957 int err;
2958
2959 attribute = to_slab_attr(attr);
2960 s = to_slab(kobj);
2961
2962 if (!attribute->store)
2963 return -EIO;
2964
2965 err = attribute->store(s, buf, len);
2966
2967 return err;
2968}
2969
2970static struct sysfs_ops slab_sysfs_ops = {
2971 .show = slab_attr_show,
2972 .store = slab_attr_store,
2973};
2974
2975static struct kobj_type slab_ktype = {
2976 .sysfs_ops = &slab_sysfs_ops,
2977};
2978
2979static int uevent_filter(struct kset *kset, struct kobject *kobj)
2980{
2981 struct kobj_type *ktype = get_ktype(kobj);
2982
2983 if (ktype == &slab_ktype)
2984 return 1;
2985 return 0;
2986}
2987
2988static struct kset_uevent_ops slab_uevent_ops = {
2989 .filter = uevent_filter,
2990};
2991
2992decl_subsys(slab, &slab_ktype, &slab_uevent_ops);
2993
2994#define ID_STR_LENGTH 64
2995
2996/* Create a unique string id for a slab cache:
2997 * format
2998 * :[flags-]size
2999 */
3000static char *create_unique_id(struct kmem_cache *s)
3001{
3002 char *name = kmalloc(ID_STR_LENGTH, GFP_KERNEL);
3003 char *p = name;
3004
3005 BUG_ON(!name);
3006
3007 *p++ = ':';
3008 /*
3009 * First flags affecting slabcache operations. We will only
3010 * get here for aliasable slabs so we do not need to support
3011 * too many flags. The flags here must cover all flags that
3012 * are matched during merging to guarantee that the id is
3013 * unique.
3014 */
3015 if (s->flags & SLAB_CACHE_DMA)
3016 *p++ = 'd';
3017 if (s->flags & SLAB_RECLAIM_ACCOUNT)
3018 *p++ = 'a';
3019 if (s->flags & SLAB_DEBUG_FREE)
3020 *p++ = 'F';
3021 if (p != name + 1)
3022 *p++ = '-';
3023 p += sprintf(p, "%07d", s->size);
3024 BUG_ON(p > name + ID_STR_LENGTH - 1);
3025 return name;
3026}
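/*
 * Examples: a DMA kmalloc cache of size 192 yields ":d-0000192" while a
 * mergeable cache without any of the above flags and size 44 yields
 * ":0000044".
 */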
3027
3028static int sysfs_slab_add(struct kmem_cache *s)
3029{
3030 int err;
3031 const char *name;
3032 int unmergeable;
3033
3034 if (slab_state < SYSFS)
3035 /* Defer until later */
3036 return 0;
3037
3038 unmergeable = slab_unmergeable(s);
3039 if (unmergeable) {
3040 /*
3041 * Slabcache can never be merged so we can use the name proper.
3042 * This is typically the case for debug situations. In that
3043 * case we can catch duplicate names easily.
3044 */
3045 sysfs_remove_link(&slab_subsys.kset.kobj, s->name);
3046 name = s->name;
3047 } else {
3048 /*
3049 * Create a unique name for the slab as a target
3050 * for the symlinks.
3051 */
3052 name = create_unique_id(s);
3053 }
3054
3055 kobj_set_kset_s(s, slab_subsys);
3056 kobject_set_name(&s->kobj, name);
3057 kobject_init(&s->kobj);
3058 err = kobject_add(&s->kobj);
3059 if (err)
3060 return err;
3061
3062 err = sysfs_create_group(&s->kobj, &slab_attr_group);
3063 if (err)
3064 return err;
3065 kobject_uevent(&s->kobj, KOBJ_ADD);
3066 if (!unmergeable) {
3067 /* Setup first alias */
3068 sysfs_slab_alias(s, s->name);
3069 kfree(name);
3070 }
3071 return 0;
3072}
3073
3074static void sysfs_slab_remove(struct kmem_cache *s)
3075{
3076 kobject_uevent(&s->kobj, KOBJ_REMOVE);
3077 kobject_del(&s->kobj);
3078}
3079
3080/*
3081 * Need to buffer aliases during bootup until sysfs becomes
3082 * available lest we lose that information.
3083 */
3084struct saved_alias {
3085 struct kmem_cache *s;
3086 const char *name;
3087 struct saved_alias *next;
3088};
3089
3090struct saved_alias *alias_list;
3091
3092static int sysfs_slab_alias(struct kmem_cache *s, const char *name)
3093{
3094 struct saved_alias *al;
3095
3096 if (slab_state == SYSFS) {
3097 /*
3098 * If we have a leftover link then remove it.
3099 */
3100 sysfs_remove_link(&slab_subsys.kset.kobj, name);
3101 return sysfs_create_link(&slab_subsys.kset.kobj,
3102 &s->kobj, name);
3103 }
3104
3105 al = kmalloc(sizeof(struct saved_alias), GFP_KERNEL);
3106 if (!al)
3107 return -ENOMEM;
3108
3109 al->s = s;
3110 al->name = name;
3111 al->next = alias_list;
3112 alias_list = al;
3113 return 0;
3114}
3115
3116static int __init slab_sysfs_init(void)
3117{
3118 int err;
3119
3120 err = subsystem_register(&slab_subsys);
3121 if (err) {
3122 printk(KERN_ERR "Cannot register slab subsystem.\n");
3123 return -ENOSYS;
3124 }
3125
3126 finish_bootstrap();
3127
3128 while (alias_list) {
3129 struct saved_alias *al = alias_list;
3130
3131 alias_list = alias_list->next;
3132 err = sysfs_slab_alias(al->s, al->name);
3133 BUG_ON(err);
3134 kfree(al);
3135 }
3136
3137 resiliency_test();
3138 return 0;
3139}
3140
3141__initcall(slab_sysfs_init);
3142#else
3143__initcall(finish_bootstrap);
3144#endif