Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/ABI/stable/sysfs-devices-node | 7
-rw-r--r-- Documentation/fault-injection/provoke-crashes.txt | 38
-rw-r--r-- Documentation/feature-removal-schedule.txt | 32
-rw-r--r-- Documentation/filesystems/Locking | 18
-rw-r--r-- Documentation/filesystems/nfs/nfs41-server.txt | 5
-rw-r--r-- Documentation/filesystems/proc.txt | 53
-rw-r--r-- Documentation/gpio.txt | 64
-rw-r--r-- Documentation/init.txt | 49
-rw-r--r-- Documentation/kprobes.txt | 207
-rw-r--r-- Documentation/kvm/api.txt | 12
-rw-r--r-- Documentation/vm/slub.txt | 1
11 files changed, 446 insertions, 40 deletions
diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
new file mode 100644
index 000000000000..49b82cad7003
--- /dev/null
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -0,0 +1,7 @@
1What: /sys/devices/system/node/nodeX
2Date: October 2002
3Contact: Linux Memory Management list <linux-mm@kvack.org>
4Description:
5 When CONFIG_NUMA is enabled, this is a directory containing
6 information on node X such as what CPUs are local to the
7 node.
diff --git a/Documentation/fault-injection/provoke-crashes.txt b/Documentation/fault-injection/provoke-crashes.txt
new file mode 100644
index 000000000000..7a9d3d81525b
--- /dev/null
+++ b/Documentation/fault-injection/provoke-crashes.txt
@@ -0,0 +1,38 @@
1The lkdtm module provides an interface to crash or injure the kernel at
2predefined crashpoints to evaluate the reliability of crash dumps obtained
3using different dumping solutions. The module uses KPROBEs to instrument
4crashing points, but can also crash the kernel directly without KPROBE
5support.
6
7
8You can select the crash point and action either through module arguments
9when inserting the module, or through a debugfs interface.
10
11Usage: insmod lkdtm.ko [recur_count={>0}] cpoint_name=<> cpoint_type=<>
12 [cpoint_count={>0}]
13
14 recur_count : Recursion level for the stack overflow test. Default is 10.
15
16 cpoint_name : Crash point where the kernel is to be crashed. It can be
17 one of INT_HARDWARE_ENTRY, INT_HW_IRQ_EN, INT_TASKLET_ENTRY,
18 FS_DEVRW, MEM_SWAPOUT, TIMERADD, SCSI_DISPATCH_CMD,
19 IDE_CORE_CP, DIRECT
20
21 cpoint_type : Indicates the action to be taken on hitting the crash point.
22 It can be one of PANIC, BUG, EXCEPTION, LOOP, OVERFLOW,
23 CORRUPT_STACK, UNALIGNED_LOAD_STORE_WRITE, OVERWRITE_ALLOCATION,
24 WRITE_AFTER_FREE,
25
26 cpoint_count : Indicates the number of times the crash point is to be hit
27 to trigger an action. The default is 10.
28
29You can also induce failures by mounting debugfs and writing the type to
30<mountpoint>/provoke-crash/<crashpoint>. E.g.,
31
32 mount -t debugfs debugfs /mnt
33 echo EXCEPTION > /mnt/provoke-crash/INT_HARDWARE_ENTRY
34
35
36A special file is `DIRECT' which will induce the crash directly without
37KPROBE instrumentation. This mode is the only one available when the module
38is built on a kernel without KPROBEs support.
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 8debdd625e1a..a5cc0db63d7a 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -550,3 +550,35 @@ Why: udev fully replaces this special file system that only contains CAPI
550 NCCI TTY device nodes. User space (pppdcapiplugin) works without 550 NCCI TTY device nodes. User space (pppdcapiplugin) works without
551 noticing the difference. 551 noticing the difference.
552Who: Jan Kiszka <jan.kiszka@web.de> 552Who: Jan Kiszka <jan.kiszka@web.de>
553
554----------------------------
555
556What: KVM memory aliases support
557When: July 2010
558Why: Memory aliasing support is used for speeding up guest vga access
559 through the vga windows.
560
561 Modern userspace no longer uses this feature, so it's just bitrotted
562 code and can be removed with no impact.
563Who: Avi Kivity <avi@redhat.com>
564
565----------------------------
566
567What: KVM kernel-allocated memory slots
568When: July 2010
569Why: Since 2.6.25, kvm supports user-allocated memory slots, which are
570 much more flexible than kernel-allocated slots. All current userspace
571 supports the newer interface and this code can be removed with no
572 impact.
573Who: Avi Kivity <avi@redhat.com>
574
575----------------------------
576
577What: KVM paravirt mmu host support
578When: January 2011
579Why: The paravirt mmu host support is slower than non-paravirt mmu, both
580 on newer and older hardware. It is already not exposed to the guest,
581 and kept only for live migration purposes.
582Who: Avi Kivity <avi@redhat.com>
583
584----------------------------
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 18b9d0ca0630..06bbbed71206 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -460,13 +460,6 @@ in sys_read() and friends.
460 460
461--------------------------- dquot_operations ------------------------------- 461--------------------------- dquot_operations -------------------------------
462prototypes: 462prototypes:
463 int (*initialize) (struct inode *, int);
464 int (*drop) (struct inode *);
465 int (*alloc_space) (struct inode *, qsize_t, int);
466 int (*alloc_inode) (const struct inode *, unsigned long);
467 int (*free_space) (struct inode *, qsize_t);
468 int (*free_inode) (const struct inode *, unsigned long);
469 int (*transfer) (struct inode *, struct iattr *);
470 int (*write_dquot) (struct dquot *); 463 int (*write_dquot) (struct dquot *);
471 int (*acquire_dquot) (struct dquot *); 464 int (*acquire_dquot) (struct dquot *);
472 int (*release_dquot) (struct dquot *); 465 int (*release_dquot) (struct dquot *);
@@ -479,13 +472,6 @@ a proper locking wrt the filesystem and call the generic quota operations.
479What filesystem should expect from the generic quota functions: 472What filesystem should expect from the generic quota functions:
480 473
481 FS recursion Held locks when called 474 FS recursion Held locks when called
482initialize: yes maybe dqonoff_sem
483drop: yes -
484alloc_space: ->mark_dirty() -
485alloc_inode: ->mark_dirty() -
486free_space: ->mark_dirty() -
487free_inode: ->mark_dirty() -
488transfer: yes -
489write_dquot: yes dqonoff_sem or dqptr_sem 475write_dquot: yes dqonoff_sem or dqptr_sem
490acquire_dquot: yes dqonoff_sem or dqptr_sem 476acquire_dquot: yes dqonoff_sem or dqptr_sem
491release_dquot: yes dqonoff_sem or dqptr_sem 477release_dquot: yes dqonoff_sem or dqptr_sem
@@ -495,10 +481,6 @@ write_info: yes dqonoff_sem
495FS recursion means calling ->quota_read() and ->quota_write() from superblock 481FS recursion means calling ->quota_read() and ->quota_write() from superblock
496operations. 482operations.
497 483
498->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called
499only directly by the filesystem and do not call any fs functions only
500the ->mark_dirty() operation.
501
502More details about quota locking can be found in fs/dquot.c. 484More details about quota locking can be found in fs/dquot.c.
503 485
504--------------------------- vm_operations_struct ----------------------------- 486--------------------------- vm_operations_struct -----------------------------
diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt
index 1bd0d0c05171..6a53a84afc72 100644
--- a/Documentation/filesystems/nfs/nfs41-server.txt
+++ b/Documentation/filesystems/nfs/nfs41-server.txt
@@ -17,8 +17,7 @@ kernels must turn 4.1 on or off *before* turning support for version 4
17on or off; rpc.nfsd does this correctly.) 17on or off; rpc.nfsd does this correctly.)
18 18
19The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based 19The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
20on the latest NFSv4.1 Internet Draft: 20on RFC 5661.
21http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29
22 21
23From the many new features in NFSv4.1 the current implementation 22From the many new features in NFSv4.1 the current implementation
24focuses on the mandatory-to-implement NFSv4.1 Sessions, providing 23focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
@@ -44,7 +43,7 @@ interoperability problems with future clients. Known issues:
44 trunking, but this is a mandatory feature, and its use is 43 trunking, but this is a mandatory feature, and its use is
45 recommended to clients in a number of places. (E.g. to ensure 44 recommended to clients in a number of places. (E.g. to ensure
46 timely renewal in case an existing connection's retry timeouts 45 timely renewal in case an existing connection's retry timeouts
47 have gotten too long; see section 8.3 of the draft.) 46 have gotten too long; see section 8.3 of the RFC.)
48 Therefore, lack of this feature may cause future clients to 47 Therefore, lack of this feature may cause future clients to
49 fail. 48 fail.
50 - Incomplete backchannel support: incomplete backchannel gss 49 - Incomplete backchannel support: incomplete backchannel gss
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 0d07513a67a6..96a44dd95e03 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -164,6 +164,7 @@ read the file /proc/PID/status:
164 VmExe: 68 kB 164 VmExe: 68 kB
165 VmLib: 1412 kB 165 VmLib: 1412 kB
166 VmPTE: 20 kb 166 VmPTE: 20 kb
167 VmSwap: 0 kB
167 Threads: 1 168 Threads: 1
168 SigQ: 0/28578 169 SigQ: 0/28578
169 SigPnd: 0000000000000000 170 SigPnd: 0000000000000000
@@ -188,6 +189,12 @@ memory usage. Its seven fields are explained in Table 1-3. The stat file
188contains details information about the process itself. Its fields are 189contains details information about the process itself. Its fields are
189explained in Table 1-4. 190explained in Table 1-4.
190 191
192(for SMP CONFIG users)
193To make accounting scalable, RSS-related information is handled
194asynchronously, so the value may not be very precise. To see a precise
195snapshot of a moment, read the /proc/<pid>/smaps file, which scans the page
196table. It's slow but very precise.
197
191Table 1-2: Contents of the statm files (as of 2.6.30-rc7) 198Table 1-2: Contents of the statm files (as of 2.6.30-rc7)
192.............................................................................. 199..............................................................................
193 Field Content 200 Field Content
@@ -213,6 +220,7 @@ Table 1-2: Contents of the statm files (as of 2.6.30-rc7)
213 VmExe size of text segment 220 VmExe size of text segment
214 VmLib size of shared library code 221 VmLib size of shared library code
215 VmPTE size of page table entries 222 VmPTE size of page table entries
223 VmSwap size of swap usage (the number of referred swapents)
216 Threads number of threads 224 Threads number of threads
217 SigQ number of signals queued/max. number for queue 225 SigQ number of signals queued/max. number for queue
218 SigPnd bitmap of pending signals for the thread 226 SigPnd bitmap of pending signals for the thread
@@ -430,6 +438,7 @@ Table 1-5: Kernel info in /proc
430 modules List of loaded modules 438 modules List of loaded modules
431 mounts Mounted filesystems 439 mounts Mounted filesystems
432 net Networking info (see text) 440 net Networking info (see text)
441 pagetypeinfo Additional page allocator information (see text) (2.5)
433 partitions Table of partitions known to the system 442 partitions Table of partitions known to the system
434 pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, 443 pci Deprecated info of PCI bus (new way -> /proc/bus/pci/,
435 decoupled by lspci (2.4) 444 decoupled by lspci (2.4)
@@ -584,7 +593,7 @@ Node 0, zone DMA 0 4 5 4 4 3 ...
584Node 0, zone Normal 1 0 0 1 101 8 ... 593Node 0, zone Normal 1 0 0 1 101 8 ...
585Node 0, zone HighMem 2 0 0 1 1 0 ... 594Node 0, zone HighMem 2 0 0 1 1 0 ...
586 595
587Memory fragmentation is a problem under some workloads, and buddyinfo is a 596External fragmentation is a problem under some workloads, and buddyinfo is a
588useful tool for helping diagnose these problems. Buddyinfo will give you a 597useful tool for helping diagnose these problems. Buddyinfo will give you a
589clue as to how big an area you can safely allocate, or why a previous 598clue as to how big an area you can safely allocate, or why a previous
590allocation failed. 599allocation failed.
@@ -594,6 +603,48 @@ available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in
594ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE 603ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
595available in ZONE_NORMAL, etc... 604available in ZONE_NORMAL, etc...
596 605
606More information relevant to external fragmentation can be found in
607pagetypeinfo.
608
609> cat /proc/pagetypeinfo
610Page block order: 9
611Pages per block: 512
612
613Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
614Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0
615Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
616Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2
617Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0
618Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
619Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9
620Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0
621Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452
622Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
623Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
624
625Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
626Node 0, zone DMA 2 0 5 1 0
627Node 0, zone DMA32 41 6 967 2 0
628
629Fragmentation avoidance in the kernel works by grouping pages of the same
630migrate type into contiguous regions of memory called page blocks.
631A page block is typically the size of the default hugepage, e.g. 2MB on
632X86-64. By keeping pages grouped based on their ability to move, the kernel
633can reclaim pages within a page block to satisfy a high-order allocation.
634
635The pagetypeinfo begins with information on the size of a page block. It
636then gives the same type of information as buddyinfo except broken down
637by migrate-type and finishes with details on how many page blocks of each
638type exist.
639
640If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
641from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
642make an estimate of the likely number of huge pages that can be allocated
643at a given point in time. All the "Movable" blocks should be allocatable
644unless memory has been mlock()'d. Some of the Reclaimable blocks should
645also be allocatable although a lot of filesystem metadata may have to be
646reclaimed to achieve this.
647
597.............................................................................. 648..............................................................................
598 649
599meminfo: 650meminfo:
diff --git a/Documentation/gpio.txt b/Documentation/gpio.txt
index 1866c27eec69..c2c6e9b39bbe 100644
--- a/Documentation/gpio.txt
+++ b/Documentation/gpio.txt
@@ -253,6 +253,70 @@ pin setup (e.g. controlling which pin the GPIO uses, pullup/pulldown).
253Also note that it's your responsibility to have stopped using a GPIO 253Also note that it's your responsibility to have stopped using a GPIO
254before you free it. 254before you free it.
255 255
256Because GPIOs are in most cases configured right after they are
257claimed, three additional calls are defined:
258
259 /* request a single GPIO, with initial configuration specified by
260 * 'flags', identical to gpio_request() wrt other arguments and
261 * return value
262 */
263 int gpio_request_one(unsigned gpio, unsigned long flags, const char *label);
264
265 /* request multiple GPIOs in a single call
266 */
267 int gpio_request_array(struct gpio *array, size_t num);
268
269 /* release multiple GPIOs in a single call
270 */
271 void gpio_free_array(struct gpio *array, size_t num);
272
273where 'flags' is currently defined to specify the following properties:
274
275 * GPIOF_DIR_IN - to configure direction as input
276 * GPIOF_DIR_OUT - to configure direction as output
277
278 * GPIOF_INIT_LOW - as output, set initial level to LOW
279 * GPIOF_INIT_HIGH - as output, set initial level to HIGH
280
281Since GPIOF_INIT_* are only valid when configured as output, the valid
282combinations are grouped as:
283
284 * GPIOF_IN - configure as input
285 * GPIOF_OUT_INIT_LOW - configured as output, initial level LOW
286 * GPIOF_OUT_INIT_HIGH - configured as output, initial level HIGH
287
288In the future, these flags can be extended to support more properties such
289as open-drain status.
290
291Furthermore, to ease the claim/release of multiple GPIOs, 'struct gpio' is
292introduced to encapsulate all three fields as:
293
294 struct gpio {
295 unsigned gpio;
296 unsigned long flags;
297 const char *label;
298 };
299
300A typical example of usage:
301
302 static struct gpio leds_gpios[] = {
303 { 32, GPIOF_OUT_INIT_HIGH, "Power LED" }, /* default to ON */
304 { 33, GPIOF_OUT_INIT_LOW, "Green LED" }, /* default to OFF */
305 { 34, GPIOF_OUT_INIT_LOW, "Red LED" }, /* default to OFF */
306 { 35, GPIOF_OUT_INIT_LOW, "Blue LED" }, /* default to OFF */
307 { ... },
308 };
309
310 err = gpio_request_one(31, GPIOF_IN, "Reset Button");
311 if (err)
312 ...
313
314 err = gpio_request_array(leds_gpios, ARRAY_SIZE(leds_gpios));
315 if (err)
316 ...
317
318 gpio_free_array(leds_gpios, ARRAY_SIZE(leds_gpios));
319
256 320
257GPIOs mapped to IRQs 321GPIOs mapped to IRQs
258-------------------- 322--------------------
diff --git a/Documentation/init.txt b/Documentation/init.txt
new file mode 100644
index 000000000000..535ad5e82b98
--- /dev/null
+++ b/Documentation/init.txt
@@ -0,0 +1,49 @@
1Explaining the dreaded "No init found." boot hang message
2=========================================================
3
4OK, so you've got this pretty unintuitive message (currently located
5in init/main.c) and are wondering what the H*** went wrong.
6Some high-level reasons for failure (listed roughly in order of execution)
7to load the init binary are:
8A) Unable to mount root FS
9B) init binary doesn't exist on rootfs
10C) broken console device
11D) binary exists but dependencies not available
12E) binary cannot be loaded
13
14Detailed explanations:
150) Set "debug" kernel parameter (in bootloader config file or CONFIG_CMDLINE)
16 to get more detailed kernel messages.
17A) make sure you have the correct root FS type
18 (and root= kernel parameter points to the correct partition),
19 required drivers such as storage hardware (such as SCSI or USB!)
20 and filesystem (ext3, jffs2 etc.) are builtin (alternatively as modules,
21 to be pre-loaded by an initrd)
22C) Possibly a conflict in console= setup --> initial console unavailable.
23 E.g. some serial consoles are unreliable due to serial IRQ issues (e.g.
24 missing interrupt-based configuration).
25 Try using a different console= device or e.g. netconsole= .
26D) e.g. required library dependencies of the init binary such as
27 /lib/ld-linux.so.2 missing or broken. Use readelf -d <INIT>|grep NEEDED
28 to find out which libraries are required.
29E) make sure the binary's architecture matches your hardware.
30 E.g. i386 vs. x86_64 mismatch, or trying to load x86 on ARM hardware.
31 In case you tried loading a non-binary file here (shell script?),
32 you should make sure that the script specifies an interpreter in its shebang
33 header line (#!/...) that is fully working (including its library
34 dependencies). And before tackling scripts, better first test a simple
35 non-script binary such as /bin/sh and confirm its successful execution.
36 To find out more, add code to init/main.c to display kernel_execve()'s
37 return values.
38
39Please extend this explanation whenever you find new failure causes
40(after all loading the init binary is a CRITICAL and hard transition step
41which needs to be made as painless as possible), then submit a patch to LKML.
42Further TODOs:
43- Implement the various run_init_process() invocations via a struct array
44 which can then store the kernel_execve() result value and on failure
45 log it all by iterating over _all_ results (very important usability fix).
46- try to make the implementation itself more helpful in general,
47 e.g. by providing additional error messages at affected places.
48
49Andreas Mohr <andi at lisas period de>
diff --git a/Documentation/kprobes.txt b/Documentation/kprobes.txt
index 053037a1fe6d..2f9115c0ae62 100644
--- a/Documentation/kprobes.txt
+++ b/Documentation/kprobes.txt
@@ -1,6 +1,7 @@
1Title : Kernel Probes (Kprobes) 1Title : Kernel Probes (Kprobes)
2Authors : Jim Keniston <jkenisto@us.ibm.com> 2Authors : Jim Keniston <jkenisto@us.ibm.com>
3 : Prasanna S Panchamukhi <prasanna@in.ibm.com> 3 : Prasanna S Panchamukhi <prasanna.panchamukhi@gmail.com>
4 : Masami Hiramatsu <mhiramat@redhat.com>
4 5
5CONTENTS 6CONTENTS
6 7
@@ -15,6 +16,7 @@ CONTENTS
159. Jprobes Example 169. Jprobes Example
1610. Kretprobes Example 1710. Kretprobes Example
17Appendix A: The kprobes debugfs interface 18Appendix A: The kprobes debugfs interface
19Appendix B: The kprobes sysctl interface
18 20
191. Concepts: Kprobes, Jprobes, Return Probes 211. Concepts: Kprobes, Jprobes, Return Probes
20 22
@@ -42,13 +44,13 @@ registration/unregistration of a group of *probes. These functions
42can speed up unregistration process when you have to unregister 44can speed up unregistration process when you have to unregister
43a lot of probes at once. 45a lot of probes at once.
44 46
45The next three subsections explain how the different types of 47The next four subsections explain how the different types of
46probes work. They explain certain things that you'll need to 48probes work and how jump optimization works. They explain certain
47know in order to make the best use of Kprobes -- e.g., the 49things that you'll need to know in order to make the best use of
48difference between a pre_handler and a post_handler, and how 50Kprobes -- e.g., the difference between a pre_handler and
49to use the maxactive and nmissed fields of a kretprobe. But 51a post_handler, and how to use the maxactive and nmissed fields of
50if you're in a hurry to start using Kprobes, you can skip ahead 52a kretprobe. But if you're in a hurry to start using Kprobes, you
51to section 2. 53can skip ahead to section 2.
52 54
531.1 How Does a Kprobe Work? 551.1 How Does a Kprobe Work?
54 56
@@ -161,13 +163,125 @@ In case probed function is entered but there is no kretprobe_instance
161object available, then in addition to incrementing the nmissed count, 163object available, then in addition to incrementing the nmissed count,
162the user entry_handler invocation is also skipped. 164the user entry_handler invocation is also skipped.
163 165
1661.4 How Does Jump Optimization Work?
167
168If you configured your kernel with CONFIG_OPTPROBES=y (currently
169this option is supported on x86/x86-64, non-preemptive kernel) and
170the "debug.kprobes_optimization" kernel parameter is set to 1 (see
171sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump
172instruction instead of a breakpoint instruction at each probepoint.
173
1741.4.1 Init a Kprobe
175
176When a probe is registered, before attempting this optimization,
177Kprobes inserts an ordinary, breakpoint-based kprobe at the specified
178address. So, even if it's not possible to optimize this particular
179probepoint, there'll be a probe there.
180
1811.4.2 Safety Check
182
183Before optimizing a probe, Kprobes performs the following safety checks:
184
185- Kprobes verifies that the region that will be replaced by the jump
186instruction (the "optimized region") lies entirely within one function.
187(A jump instruction is multiple bytes, and so may overlay multiple
188instructions.)
189
190- Kprobes analyzes the entire function and verifies that there is no
191jump into the optimized region. Specifically:
192 - the function contains no indirect jump;
193 - the function contains no instruction that causes an exception (since
194 the fixup code triggered by the exception could jump back into the
195 optimized region -- Kprobes checks the exception tables to verify this);
196 and
197 - there is no near jump to the optimized region (other than to the first
198 byte).
199
200- For each instruction in the optimized region, Kprobes verifies that
201the instruction can be executed out of line.
202
2031.4.3 Preparing Detour Buffer
204
205Next, Kprobes prepares a "detour" buffer, which contains the following
206instruction sequence:
207- code to push the CPU's registers (emulating a breakpoint trap)
208- a call to the trampoline code which calls user's probe handlers.
209- code to restore registers
210- the instructions from the optimized region
211- a jump back to the original execution path.
212
2131.4.4 Pre-optimization
214
215After preparing the detour buffer, Kprobes verifies that none of the
216following situations exist:
217- The probe has either a break_handler (i.e., it's a jprobe) or a
218post_handler.
219- Other instructions in the optimized region are probed.
220- The probe is disabled.
221In any of the above cases, Kprobes won't start optimizing the probe.
222Since these are temporary situations, Kprobes tries to start
223optimizing it again if the situation is changed.
224
225If the kprobe can be optimized, Kprobes enqueues the kprobe to an
226optimizing list, and kicks the kprobe-optimizer workqueue to optimize
227it. If the to-be-optimized probepoint is hit before being optimized,
228Kprobes returns control to the original instruction path by setting
229the CPU's instruction pointer to the copied code in the detour buffer
230-- thus at least avoiding the single-step.
231
2321.4.5 Optimization
233
234The Kprobe-optimizer doesn't insert the jump instruction immediately;
235rather, it calls synchronize_sched() for safety first, because it's
236possible for a CPU to be interrupted in the middle of executing the
237optimized region(*). As you know, synchronize_sched() can ensure
238that all interruptions that were active when synchronize_sched()
239was called are done, but only if CONFIG_PREEMPT=n. So, this version
240of kprobe optimization supports only kernels with CONFIG_PREEMPT=n.(**)
241
242After that, the Kprobe-optimizer calls stop_machine() to replace
243the optimized region with a jump instruction to the detour buffer,
244using text_poke_smp().
245
2461.4.6 Unoptimization
247
248When an optimized kprobe is unregistered, disabled, or blocked by
249another kprobe, it will be unoptimized. If this happens before
250the optimization is complete, the kprobe is just dequeued from the
251optimizing list. If the optimization has been done, the jump is
252replaced with the original code (except for an int3 breakpoint in
253the first byte) by using text_poke_smp().
254
255(*)Please imagine that the 2nd instruction is interrupted and then
256the optimizer replaces the 2nd instruction with the jump *address*
257while the interrupt handler is running. When the interrupt
258returns to original address, there is no valid instruction,
259and it causes an unexpected result.
260
261(**)This optimization-safety checking may be replaced with the
262stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y
263kernel.
264
265NOTE for geeks:
266The jump optimization changes the kprobe's pre_handler behavior.
267Without optimization, the pre_handler can change the kernel's execution
268path by changing regs->ip and returning 1. However, when the probe
269is optimized, that modification is ignored. Thus, if you want to
270tweak the kernel's execution path, you need to suppress optimization,
271using one of the following techniques:
272- Specify an empty function for the kprobe's post_handler or break_handler.
273 or
274- Config CONFIG_OPTPROBES=n.
275 or
276- Execute 'sysctl -w debug.kprobes_optimization=0'
277
1642. Architectures Supported 2782. Architectures Supported
165 279
166Kprobes, jprobes, and return probes are implemented on the following 280Kprobes, jprobes, and return probes are implemented on the following
167architectures: 281architectures:
168 282
169- i386 283- i386 (Supports jump optimization)
170- x86_64 (AMD-64, EM64T) 284- x86_64 (AMD-64, EM64T) (Supports jump optimization)
171- ppc64 285- ppc64
172- ia64 (Does not support probes on instruction slot1.) 286- ia64 (Does not support probes on instruction slot1.)
173- sparc64 (Return probes not yet implemented.) 287- sparc64 (Return probes not yet implemented.)
@@ -193,6 +307,10 @@ it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO),
193so you can use "objdump -d -l vmlinux" to see the source-to-object 307so you can use "objdump -d -l vmlinux" to see the source-to-object
194code mapping. 308code mapping.
195 309
310If you want to reduce probing overhead, set "Kprobes jump optimization
311support" (CONFIG_OPTPROBES) to "y". You can find this option under the
312"Kprobes" line.
313
1964. API Reference 3144. API Reference
197 315
198The Kprobes API includes a "register" function and an "unregister" 316The Kprobes API includes a "register" function and an "unregister"
@@ -389,7 +507,10 @@ the probe which has been registered.
389 507
390Kprobes allows multiple probes at the same address. Currently, 508Kprobes allows multiple probes at the same address. Currently,
391however, there cannot be multiple jprobes on the same function at 509however, there cannot be multiple jprobes on the same function at
392the same time. 510the same time. Also, a probepoint for which there is a jprobe or
511a post_handler cannot be optimized. So if you install a jprobe,
512or a kprobe with a post_handler, at an optimized probepoint, the
513probepoint will be unoptimized automatically.
393 514
394In general, you can install a probe anywhere in the kernel. 515In general, you can install a probe anywhere in the kernel.
395In particular, you can probe interrupt handlers. Known exceptions 516In particular, you can probe interrupt handlers. Known exceptions
@@ -453,6 +574,38 @@ reason, Kprobes doesn't support return probes (or kprobes or jprobes)
453on the x86_64 version of __switch_to(); the registration functions 574on the x86_64 version of __switch_to(); the registration functions
454return -EINVAL. 575return -EINVAL.
455 576
577On x86/x86-64, since the Jump Optimization of Kprobes modifies
578instructions widely, there are some limitations to optimization. To
579explain it, we introduce some terminology. Imagine a 3-instruction
580sequence consisting of two 2-byte instructions and one 3-byte
581instruction.
582
583        IA
584         |
585[-2][-1][0][1][2][3][4][5][6][7]
586        [ins1][ins2][  ins3 ]
587        [<-     DCR       ->]
588           [<- JTPR ->]
589
590ins1: 1st Instruction
591ins2: 2nd Instruction
592ins3: 3rd Instruction
593IA: Insertion Address
594JTPR: Jump Target Prohibition Region
595DCR: Detoured Code Region
596
597The instructions in DCR are copied to the out-of-line buffer
598of the kprobe, because the bytes in DCR are replaced by
599a 5-byte jump instruction. So there are several limitations.
600
601a) The instructions in DCR must be relocatable.
602b) The instructions in DCR must not include a call instruction.
603c) JTPR must not be targeted by any jump or call instruction.
604d) DCR must not straddle the border between functions.
605
606Anyway, these limitations are checked by the in-kernel instruction
607decoder, so you don't need to worry about that.
608
4566. Probe Overhead 6096. Probe Overhead
457 610
458On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0 611On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0
@@ -476,6 +629,19 @@ k = 0.49 usec; j = 0.76; r = 0.80; kr = 0.82; jr = 1.07
476ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU) 629ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU)
477k = 0.77 usec; j = 1.31; r = 1.26; kr = 1.45; jr = 1.99 630k = 0.77 usec; j = 1.31; r = 1.26; kr = 1.45; jr = 1.99
478 631
6326.1 Optimized Probe Overhead
633
634Typically, an optimized kprobe hit takes 0.07 to 0.1 microseconds to
635process. Here are sample overhead figures (in usec) for x86 architectures.
636k = unoptimized kprobe, b = boosted (single-step skipped), o = optimized kprobe,
637r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe.
638
639i386: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
640k = 0.80 usec; b = 0.33; o = 0.05; r = 1.10; rb = 0.61; ro = 0.33
641
642x86-64: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
643k = 0.99 usec; b = 0.43; o = 0.06; r = 1.24; rb = 0.68; ro = 0.30
644
4797. TODO 6457. TODO
480 646
481a. SystemTap (http://sourceware.org/systemtap): Provides a simplified 647a. SystemTap (http://sourceware.org/systemtap): Provides a simplified
@@ -523,7 +689,8 @@ is also specified. Following columns show probe status. If the probe is on
523a virtual address that is no longer valid (module init sections, module 689a virtual address that is no longer valid (module init sections, module
524virtual addresses that correspond to modules that've been unloaded), 690virtual addresses that correspond to modules that've been unloaded),
525such probes are marked with [GONE]. If the probe is temporarily disabled, 691such probes are marked with [GONE]. If the probe is temporarily disabled,
526such probes are marked with [DISABLED]. 692such probes are marked with [DISABLED]. If the probe is optimized, it is
693marked with [OPTIMIZED].
527 694
528/sys/kernel/debug/kprobes/enabled: Turn kprobes ON/OFF forcibly. 695/sys/kernel/debug/kprobes/enabled: Turn kprobes ON/OFF forcibly.
529 696
@@ -533,3 +700,19 @@ registered probes will be disarmed, till such time a "1" is echoed to this
533file. Note that this knob just disarms and arms all kprobes and doesn't 700file. Note that this knob just disarms and arms all kprobes and doesn't
534change each probe's disabling state. This means that disabled kprobes (marked 701change each probe's disabling state. This means that disabled kprobes (marked
535[DISABLED]) will not be enabled if you turn ON all kprobes by this knob. 702[DISABLED]) will not be enabled if you turn ON all kprobes by this knob.
703
704
705Appendix B: The kprobes sysctl interface
706
707/proc/sys/debug/kprobes-optimization: Turn kprobes optimization ON/OFF.
708
709When CONFIG_OPTPROBES=y, this sysctl interface appears and it provides
710a knob to globally and forcibly turn jump optimization (see section
7111.4) ON or OFF. By default, jump optimization is allowed (ON).
712If you echo "0" to this file or set "debug.kprobes-optimization" to
7130 via sysctl, all optimized probes will be unoptimized, and any new
714probes registered after that will not be optimized. Note that this
715knob *changes* the optimized state. This means that optimized probes
716(marked [OPTIMIZED]) will be unoptimized ([OPTIMIZED] tag will be
717removed). If the knob is turned on, they will be optimized again.
718
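A program can check the current state of the knob by reading the file.
This is a minimal sketch under the assumption that the file holds a
single integer, as the "0"/"1" usage above suggests;
read_optimization_knob() is a hypothetical helper, not an existing API.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a 0/1 knob such as /proc/sys/debug/kprobes-optimization.
 * Returns 1 (ON), 0 (OFF), or -1 if the file cannot be opened or does
 * not start with a non-negative integer.
 */
static int read_optimization_knob(const char *path)
{
	FILE *f;
	int value;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value < 0 ? -1 : (value != 0);
}
```

A return value of -1 from
read_optimization_knob("/proc/sys/debug/kprobes-optimization") most
likely means the file is absent, for example because the kernel was
built without CONFIG_OPTPROBES=y.
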
diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index 2811e452f756..c6416a398163 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -23,12 +23,12 @@ of a virtual machine. The ioctls belong to three classes
23 Only run vcpu ioctls from the same thread that was used to create the 23 Only run vcpu ioctls from the same thread that was used to create the
24 vcpu. 24 vcpu.
25 25
262. File descritpors 262. File descriptors
27 27
28The kvm API is centered around file descriptors. An initial 28The kvm API is centered around file descriptors. An initial
29open("/dev/kvm") obtains a handle to the kvm subsystem; this handle 29open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
30can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this 30can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this
31handle will create a VM file descripror which can be used to issue VM 31handle will create a VM file descriptor which can be used to issue VM
32ioctls. A KVM_CREATE_VCPU ioctl on a VM fd will create a virtual cpu 32ioctls. A KVM_CREATE_VCPU ioctl on a VM fd will create a virtual cpu
33and return a file descriptor pointing to it. Finally, ioctls on a vcpu 33and return a file descriptor pointing to it. Finally, ioctls on a vcpu
34fd can be used to control the vcpu, including the important task of 34fd can be used to control the vcpu, including the important task of
@@ -643,7 +643,7 @@ Type: vm ioctl
643Parameters: struct kvm_clock_data (in) 643Parameters: struct kvm_clock_data (in)
644Returns: 0 on success, -1 on error 644Returns: 0 on success, -1 on error
645 645
646Sets the current timestamp of kvmclock to the valued specific in its parameter. 646Sets the current timestamp of kvmclock to the value specified in its parameter.
647In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios 647In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios
648such as migration. 648such as migration.
649 649
@@ -795,11 +795,11 @@ Unused.
795 __u64 data_offset; /* relative to kvm_run start */ 795 __u64 data_offset; /* relative to kvm_run start */
796 } io; 796 } io;
797 797
798If exit_reason is KVM_EXIT_IO_IN or KVM_EXIT_IO_OUT, then the vcpu has 798If exit_reason is KVM_EXIT_IO, then the vcpu has
799executed a port I/O instruction which could not be satisfied by kvm. 799executed a port I/O instruction which could not be satisfied by kvm.
800data_offset describes where the data is located (KVM_EXIT_IO_OUT) or 800data_offset describes where the data is located (KVM_EXIT_IO_OUT) or
801where kvm expects application code to place the data for the next 801where kvm expects application code to place the data for the next
802KVM_RUN invocation (KVM_EXIT_IO_IN). Data format is a patcked array. 802KVM_RUN invocation (KVM_EXIT_IO_IN). Data format is a packed array.
803 803
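The role of data_offset can be sketched as follows. This is an
illustration only: io_out_data() and io_in_data() are hypothetical
helpers, 'run' is assumed to point at the start of the mapped kvm_run
area, and 'len' is assumed to be the total size of the packed array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* KVM_EXIT_IO_OUT: copy the data written by the guest out of kvm_run. */
static void io_out_data(const void *run, uint64_t data_offset,
			void *dst, size_t len)
{
	assert(run != NULL && dst != NULL);
	memcpy(dst, (const uint8_t *)run + data_offset, len);
}

/* KVM_EXIT_IO_IN: place the data the guest will read at the next KVM_RUN. */
static void io_in_data(void *run, uint64_t data_offset,
		       const void *src, size_t len)
{
	assert(run != NULL && src != NULL);
	memcpy((uint8_t *)run + data_offset, src, len);
}
```

In both directions the data lives inside the kvm_run area itself, at
data_offset bytes from its start, so no separate buffer is exchanged
with the kernel.
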
804 struct { 804 struct {
805 struct kvm_debug_exit_arch arch; 805 struct kvm_debug_exit_arch arch;
@@ -815,7 +815,7 @@ Unused.
815 __u8 is_write; 815 __u8 is_write;
816 } mmio; 816 } mmio;
817 817
818If exit_reason is KVM_EXIT_MMIO or KVM_EXIT_IO_OUT, then the vcpu has 818If exit_reason is KVM_EXIT_MMIO, then the vcpu has
819executed a memory-mapped I/O instruction which could not be satisfied 819executed a memory-mapped I/O instruction which could not be satisfied
820by kvm. The 'data' member contains the written data if 'is_write' is 820by kvm. The 'data' member contains the written data if 'is_write' is
821true, and should be filled by application code otherwise. 821true, and should be filled by application code otherwise.
diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
index b37300edf27c..07375e73981a 100644
--- a/Documentation/vm/slub.txt
+++ b/Documentation/vm/slub.txt
@@ -41,6 +41,7 @@ Possible debug options are
41 P Poisoning (object and padding) 41 P Poisoning (object and padding)
42 U User tracking (free and alloc) 42 U User tracking (free and alloc)
43 T Trace (please only use on single slabs) 43 T Trace (please only use on single slabs)
44 A Toggle failslab filter mark for the cache
44 O Switch debugging off for caches that would have 45 O Switch debugging off for caches that would have
45 caused higher minimum slab orders 46 caused higher minimum slab orders
46 - Switch all debugging off (useful if the kernel is 47 - Switch all debugging off (useful if the kernel is