diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/Changes | 31 | ||||
-rw-r--r-- | Documentation/CodingStyle | 41 | ||||
-rw-r--r-- | Documentation/RCU/rcuref.txt | 87 | ||||
-rw-r--r-- | Documentation/SubmittingDrivers | 24 | ||||
-rw-r--r-- | Documentation/SubmittingPatches | 63 | ||||
-rw-r--r-- | Documentation/applying-patches.txt | 29 | ||||
-rw-r--r-- | Documentation/block/stat.txt | 82 | ||||
-rw-r--r-- | Documentation/cpu-hotplug.txt | 357 | ||||
-rw-r--r-- | Documentation/cpusets.txt | 161 | ||||
-rw-r--r-- | Documentation/filesystems/ext3.txt | 5 | ||||
-rw-r--r-- | Documentation/filesystems/proc.txt | 17 | ||||
-rw-r--r-- | Documentation/filesystems/ramfs-rootfs-initramfs.txt | 72 | ||||
-rw-r--r-- | Documentation/filesystems/relayfs.txt | 126 | ||||
-rw-r--r-- | Documentation/keys-request-key.txt | 22 | ||||
-rw-r--r-- | Documentation/keys.txt | 43 | ||||
-rw-r--r-- | Documentation/sysctl/vm.txt | 20 |
16 files changed, 1004 insertions, 176 deletions
diff --git a/Documentation/Changes b/Documentation/Changes index 86b86399d61d..fe5ae0f55020 100644 --- a/Documentation/Changes +++ b/Documentation/Changes | |||
@@ -31,8 +31,6 @@ al espaņol de este documento en varios formatos. | |||
31 | Eine deutsche Version dieser Datei finden Sie unter | 31 | Eine deutsche Version dieser Datei finden Sie unter |
32 | <http://www.stefan-winter.de/Changes-2.4.0.txt>. | 32 | <http://www.stefan-winter.de/Changes-2.4.0.txt>. |
33 | 33 | ||
34 | Last updated: October 29th, 2002 | ||
35 | |||
36 | Chris Ricker (kaboom@gatech.edu or chris.ricker@genetics.utah.edu). | 34 | Chris Ricker (kaboom@gatech.edu or chris.ricker@genetics.utah.edu). |
37 | 35 | ||
38 | Current Minimal Requirements | 36 | Current Minimal Requirements |
@@ -48,7 +46,7 @@ necessary on all systems; obviously, if you don't have any ISDN | |||
48 | hardware, for example, you probably needn't concern yourself with | 46 | hardware, for example, you probably needn't concern yourself with |
49 | isdn4k-utils. | 47 | isdn4k-utils. |
50 | 48 | ||
51 | o Gnu C 2.95.3 # gcc --version | 49 | o Gnu C 3.2 # gcc --version |
52 | o Gnu make 3.79.1 # make --version | 50 | o Gnu make 3.79.1 # make --version |
53 | o binutils 2.12 # ld -v | 51 | o binutils 2.12 # ld -v |
54 | o util-linux 2.10o # fdformat --version | 52 | o util-linux 2.10o # fdformat --version |
@@ -74,26 +72,7 @@ GCC | |||
74 | --- | 72 | --- |
75 | 73 | ||
76 | The gcc version requirements may vary depending on the type of CPU in your | 74 | The gcc version requirements may vary depending on the type of CPU in your |
77 | computer. The next paragraph applies to users of x86 CPUs, but not | 75 | computer. |
78 | necessarily to users of other CPUs. Users of other CPUs should obtain | ||
79 | information about their gcc version requirements from another source. | ||
80 | |||
81 | The recommended compiler for the kernel is gcc 2.95.x (x >= 3), and it | ||
82 | should be used when you need absolute stability. You may use gcc 3.0.x | ||
83 | instead if you wish, although it may cause problems. Later versions of gcc | ||
84 | have not received much testing for Linux kernel compilation, and there are | ||
85 | almost certainly bugs (mainly, but not exclusively, in the kernel) that | ||
86 | will need to be fixed in order to use these compilers. In any case, using | ||
87 | pgcc instead of plain gcc is just asking for trouble. | ||
88 | |||
89 | The Red Hat gcc 2.96 compiler subtree can also be used to build this tree. | ||
90 | You should ensure you use gcc-2.96-74 or later. gcc-2.96-54 will not build | ||
91 | the kernel correctly. | ||
92 | |||
93 | In addition, please pay attention to compiler optimization. Anything | ||
94 | greater than -O2 may not be wise. Similarly, if you choose to use gcc-2.95.x | ||
95 | or derivatives, be sure not to use -fstrict-aliasing (which, depending on | ||
96 | your version of gcc 2.95.x, may necessitate using -fno-strict-aliasing). | ||
97 | 76 | ||
98 | Make | 77 | Make |
99 | ---- | 78 | ---- |
@@ -322,9 +301,9 @@ Getting updated software | |||
322 | Kernel compilation | 301 | Kernel compilation |
323 | ****************** | 302 | ****************** |
324 | 303 | ||
325 | gcc 2.95.3 | 304 | gcc |
326 | ---------- | 305 | --- |
327 | o <ftp://ftp.gnu.org/gnu/gcc/gcc-2.95.3.tar.gz> | 306 | o <ftp://ftp.gnu.org/gnu/gcc/> |
328 | 307 | ||
329 | Make | 308 | Make |
330 | ---- | 309 | ---- |
diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle index eb7db3c19227..ce780ef648f1 100644 --- a/Documentation/CodingStyle +++ b/Documentation/CodingStyle | |||
@@ -344,7 +344,7 @@ Remember: if another thread can find your data structure, and you don't | |||
344 | have a reference count on it, you almost certainly have a bug. | 344 | have a reference count on it, you almost certainly have a bug. |
345 | 345 | ||
346 | 346 | ||
347 | Chapter 11: Macros, Enums, Inline functions and RTL | 347 | Chapter 11: Macros, Enums and RTL |
348 | 348 | ||
349 | Names of macros defining constants and labels in enums are capitalized. | 349 | Names of macros defining constants and labels in enums are capitalized. |
350 | 350 | ||
@@ -429,7 +429,35 @@ from void pointer to any other pointer type is guaranteed by the C programming | |||
429 | language. | 429 | language. |
430 | 430 | ||
431 | 431 | ||
432 | Chapter 14: References | 432 | Chapter 14: The inline disease |
433 | |||
434 | There appears to be a common misperception that gcc has a magic "make me | ||
435 | faster" speedup option called "inline". While the use of inlines can be | ||
436 | appropriate (for example as a means of replacing macros, see Chapter 11), it | ||
437 | very often is not. Abundant use of the inline keyword leads to a much bigger | ||
438 | kernel, which in turn slows the system as a whole down, due to a bigger | ||
439 | icache footprint for the CPU and simply because there is less memory | ||
440 | available for the pagecache. Just think about it; a pagecache miss causes a | ||
441 | disk seek, which easily takes 5 miliseconds. There are a LOT of cpu cycles | ||
442 | that can go into these 5 miliseconds. | ||
443 | |||
444 | A reasonable rule of thumb is to not put inline at functions that have more | ||
445 | than 3 lines of code in them. An exception to this rule are the cases where | ||
446 | a parameter is known to be a compiletime constant, and as a result of this | ||
447 | constantness you *know* the compiler will be able to optimize most of your | ||
448 | function away at compile time. For a good example of this later case, see | ||
449 | the kmalloc() inline function. | ||
450 | |||
451 | Often people argue that adding inline to functions that are static and used | ||
452 | only once is always a win since there is no space tradeoff. While this is | ||
453 | technically correct, gcc is capable of inlining these automatically without | ||
454 | help, and the maintenance issue of removing the inline when a second user | ||
455 | appears outweighs the potential value of the hint that tells gcc to do | ||
456 | something it would have done anyway. | ||
457 | |||
458 | |||
459 | |||
460 | Chapter 15: References | ||
433 | 461 | ||
434 | The C Programming Language, Second Edition | 462 | The C Programming Language, Second Edition |
435 | by Brian W. Kernighan and Dennis M. Ritchie. | 463 | by Brian W. Kernighan and Dennis M. Ritchie. |
@@ -444,10 +472,13 @@ ISBN 0-201-61586-X. | |||
444 | URL: http://cm.bell-labs.com/cm/cs/tpop/ | 472 | URL: http://cm.bell-labs.com/cm/cs/tpop/ |
445 | 473 | ||
446 | GNU manuals - where in compliance with K&R and this text - for cpp, gcc, | 474 | GNU manuals - where in compliance with K&R and this text - for cpp, gcc, |
447 | gcc internals and indent, all available from http://www.gnu.org | 475 | gcc internals and indent, all available from http://www.gnu.org/manual/ |
448 | 476 | ||
449 | WG14 is the international standardization working group for the programming | 477 | WG14 is the international standardization working group for the programming |
450 | language C, URL: http://std.dkuug.dk/JTC1/SC22/WG14/ | 478 | language C, URL: http://www.open-std.org/JTC1/SC22/WG14/ |
479 | |||
480 | Kernel CodingStyle, by greg@kroah.com at OLS 2002: | ||
481 | http://www.kroah.com/linux/talks/ols_2002_kernel_codingstyle_talk/html/ | ||
451 | 482 | ||
452 | -- | 483 | -- |
453 | Last updated on 16 February 2004 by a community effort on LKML. | 484 | Last updated on 30 December 2005 by a community effort on LKML. |
diff --git a/Documentation/RCU/rcuref.txt b/Documentation/RCU/rcuref.txt index a23fee66064d..3f60db41b2f0 100644 --- a/Documentation/RCU/rcuref.txt +++ b/Documentation/RCU/rcuref.txt | |||
@@ -1,74 +1,67 @@ | |||
1 | Refcounter framework for elements of lists/arrays protected by | 1 | Refcounter design for elements of lists/arrays protected by RCU. |
2 | RCU. | ||
3 | 2 | ||
4 | Refcounting on elements of lists which are protected by traditional | 3 | Refcounting on elements of lists which are protected by traditional |
5 | reader/writer spinlocks or semaphores are straight forward as in: | 4 | reader/writer spinlocks or semaphores are straight forward as in: |
6 | 5 | ||
7 | 1. 2. | 6 | 1. 2. |
8 | add() search_and_reference() | 7 | add() search_and_reference() |
9 | { { | 8 | { { |
10 | alloc_object read_lock(&list_lock); | 9 | alloc_object read_lock(&list_lock); |
11 | ... search_for_element | 10 | ... search_for_element |
12 | atomic_set(&el->rc, 1); atomic_inc(&el->rc); | 11 | atomic_set(&el->rc, 1); atomic_inc(&el->rc); |
13 | write_lock(&list_lock); ... | 12 | write_lock(&list_lock); ... |
14 | add_element read_unlock(&list_lock); | 13 | add_element read_unlock(&list_lock); |
15 | ... ... | 14 | ... ... |
16 | write_unlock(&list_lock); } | 15 | write_unlock(&list_lock); } |
17 | } | 16 | } |
18 | 17 | ||
19 | 3. 4. | 18 | 3. 4. |
20 | release_referenced() delete() | 19 | release_referenced() delete() |
21 | { { | 20 | { { |
22 | ... write_lock(&list_lock); | 21 | ... write_lock(&list_lock); |
23 | atomic_dec(&el->rc, relfunc) ... | 22 | atomic_dec(&el->rc, relfunc) ... |
24 | ... delete_element | 23 | ... delete_element |
25 | } write_unlock(&list_lock); | 24 | } write_unlock(&list_lock); |
26 | ... | 25 | ... |
27 | if (atomic_dec_and_test(&el->rc)) | 26 | if (atomic_dec_and_test(&el->rc)) |
28 | kfree(el); | 27 | kfree(el); |
29 | ... | 28 | ... |
30 | } | 29 | } |
31 | 30 | ||
32 | If this list/array is made lock free using rcu as in changing the | 31 | If this list/array is made lock free using rcu as in changing the |
33 | write_lock in add() and delete() to spin_lock and changing read_lock | 32 | write_lock in add() and delete() to spin_lock and changing read_lock |
34 | in search_and_reference to rcu_read_lock(), the rcuref_get in | 33 | in search_and_reference to rcu_read_lock(), the atomic_get in |
35 | search_and_reference could potentially hold reference to an element which | 34 | search_and_reference could potentially hold reference to an element which |
36 | has already been deleted from the list/array. rcuref_lf_get_rcu takes | 35 | has already been deleted from the list/array. atomic_inc_not_zero takes |
37 | care of this scenario. search_and_reference should look as; | 36 | care of this scenario. search_and_reference should look as; |
38 | 37 | ||
39 | 1. 2. | 38 | 1. 2. |
40 | add() search_and_reference() | 39 | add() search_and_reference() |
41 | { { | 40 | { { |
42 | alloc_object rcu_read_lock(); | 41 | alloc_object rcu_read_lock(); |
43 | ... search_for_element | 42 | ... search_for_element |
44 | atomic_set(&el->rc, 1); if (rcuref_inc_lf(&el->rc)) { | 43 | atomic_set(&el->rc, 1); if (atomic_inc_not_zero(&el->rc)) { |
45 | write_lock(&list_lock); rcu_read_unlock(); | 44 | write_lock(&list_lock); rcu_read_unlock(); |
46 | return FAIL; | 45 | return FAIL; |
47 | add_element } | 46 | add_element } |
48 | ... ... | 47 | ... ... |
49 | write_unlock(&list_lock); rcu_read_unlock(); | 48 | write_unlock(&list_lock); rcu_read_unlock(); |
50 | } } | 49 | } } |
51 | 3. 4. | 50 | 3. 4. |
52 | release_referenced() delete() | 51 | release_referenced() delete() |
53 | { { | 52 | { { |
54 | ... write_lock(&list_lock); | 53 | ... write_lock(&list_lock); |
55 | rcuref_dec(&el->rc, relfunc) ... | 54 | atomic_dec(&el->rc, relfunc) ... |
56 | ... delete_element | 55 | ... delete_element |
57 | } write_unlock(&list_lock); | 56 | } write_unlock(&list_lock); |
58 | ... | 57 | ... |
59 | if (rcuref_dec_and_test(&el->rc)) | 58 | if (atomic_dec_and_test(&el->rc)) |
60 | call_rcu(&el->head, el_free); | 59 | call_rcu(&el->head, el_free); |
61 | ... | 60 | ... |
62 | } | 61 | } |
63 | 62 | ||
64 | Sometimes, reference to the element need to be obtained in the | 63 | Sometimes, reference to the element need to be obtained in the |
65 | update (write) stream. In such cases, rcuref_inc_lf might be an overkill | 64 | update (write) stream. In such cases, atomic_inc_not_zero might be an |
66 | since the spinlock serialising list updates are held. rcuref_inc | 65 | overkill since the spinlock serialising list updates are held. atomic_inc |
67 | is to be used in such cases. | 66 | is to be used in such cases. |
68 | For arches which do not have cmpxchg rcuref_inc_lf | 67 | |
69 | api uses a hashed spinlock implementation and the same hashed spinlock | ||
70 | is acquired in all rcuref_xxx primitives to preserve atomicity. | ||
71 | Note: Use rcuref_inc api only if you need to use rcuref_inc_lf on the | ||
72 | refcounter atleast at one place. Mixing rcuref_inc and atomic_xxx api | ||
73 | might lead to races. rcuref_inc_lf() must be used in lockfree | ||
74 | RCU critical sections only. | ||
diff --git a/Documentation/SubmittingDrivers b/Documentation/SubmittingDrivers index c3cca924e94b..dd311cff1cc3 100644 --- a/Documentation/SubmittingDrivers +++ b/Documentation/SubmittingDrivers | |||
@@ -27,18 +27,17 @@ Who To Submit Drivers To | |||
27 | ------------------------ | 27 | ------------------------ |
28 | 28 | ||
29 | Linux 2.0: | 29 | Linux 2.0: |
30 | No new drivers are accepted for this kernel tree | 30 | No new drivers are accepted for this kernel tree. |
31 | 31 | ||
32 | Linux 2.2: | 32 | Linux 2.2: |
33 | No new drivers are accepted for this kernel tree. | ||
34 | |||
35 | Linux 2.4: | ||
33 | If the code area has a general maintainer then please submit it to | 36 | If the code area has a general maintainer then please submit it to |
34 | the maintainer listed in MAINTAINERS in the kernel file. If the | 37 | the maintainer listed in MAINTAINERS in the kernel file. If the |
35 | maintainer does not respond or you cannot find the appropriate | 38 | maintainer does not respond or you cannot find the appropriate |
36 | maintainer then please contact the 2.2 kernel maintainer: | 39 | maintainer then please contact Marcelo Tosatti |
37 | Marc-Christian Petersen <m.c.p@wolk-project.de>. | 40 | <marcelo.tosatti@cyclades.com>. |
38 | |||
39 | Linux 2.4: | ||
40 | The same rules apply as 2.2. The final contact point for Linux 2.4 | ||
41 | submissions is Marcelo Tosatti <marcelo.tosatti@cyclades.com>. | ||
42 | 41 | ||
43 | Linux 2.6: | 42 | Linux 2.6: |
44 | The same rules apply as 2.4 except that you should follow linux-kernel | 43 | The same rules apply as 2.4 except that you should follow linux-kernel |
@@ -53,6 +52,7 @@ Licensing: The code must be released to us under the | |||
53 | of exclusive GPL licensing, and if you wish the driver | 52 | of exclusive GPL licensing, and if you wish the driver |
54 | to be useful to other communities such as BSD you may well | 53 | to be useful to other communities such as BSD you may well |
55 | wish to release under multiple licenses. | 54 | wish to release under multiple licenses. |
55 | See accepted licenses at include/linux/module.h | ||
56 | 56 | ||
57 | Copyright: The copyright owner must agree to use of GPL. | 57 | Copyright: The copyright owner must agree to use of GPL. |
58 | It's best if the submitter and copyright owner | 58 | It's best if the submitter and copyright owner |
@@ -143,5 +143,13 @@ KernelNewbies: | |||
143 | http://kernelnewbies.org/ | 143 | http://kernelnewbies.org/ |
144 | 144 | ||
145 | Linux USB project: | 145 | Linux USB project: |
146 | http://sourceforge.net/projects/linux-usb/ | 146 | http://linux-usb.sourceforge.net/ |
147 | |||
148 | How to NOT write kernel driver by arjanv@redhat.com | ||
149 | http://people.redhat.com/arjanv/olspaper.pdf | ||
150 | |||
151 | Kernel Janitor: | ||
152 | http://janitor.kernelnewbies.org/ | ||
147 | 153 | ||
154 | -- | ||
155 | Last updated on 17 Nov 2005. | ||
diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches index 1d47e6c09dc6..6198e5ebcf65 100644 --- a/Documentation/SubmittingPatches +++ b/Documentation/SubmittingPatches | |||
@@ -78,7 +78,9 @@ Randy Dunlap's patch scripts: | |||
78 | http://www.xenotime.net/linux/scripts/patching-scripts-002.tar.gz | 78 | http://www.xenotime.net/linux/scripts/patching-scripts-002.tar.gz |
79 | 79 | ||
80 | Andrew Morton's patch scripts: | 80 | Andrew Morton's patch scripts: |
81 | http://www.zip.com.au/~akpm/linux/patches/patch-scripts-0.20 | 81 | http://www.zip.com.au/~akpm/linux/patches/ |
82 | Instead of these scripts, quilt is the recommended patch management | ||
83 | tool (see above). | ||
82 | 84 | ||
83 | 85 | ||
84 | 86 | ||
@@ -97,7 +99,7 @@ need to split up your patch. See #3, next. | |||
97 | 99 | ||
98 | 3) Separate your changes. | 100 | 3) Separate your changes. |
99 | 101 | ||
100 | Separate each logical change into its own patch. | 102 | Separate _logical changes_ into a single patch file. |
101 | 103 | ||
102 | For example, if your changes include both bug fixes and performance | 104 | For example, if your changes include both bug fixes and performance |
103 | enhancements for a single driver, separate those changes into two | 105 | enhancements for a single driver, separate those changes into two |
@@ -112,6 +114,10 @@ If one patch depends on another patch in order for a change to be | |||
112 | complete, that is OK. Simply note "this patch depends on patch X" | 114 | complete, that is OK. Simply note "this patch depends on patch X" |
113 | in your patch description. | 115 | in your patch description. |
114 | 116 | ||
117 | If you cannot condense your patch set into a smaller set of patches, | ||
118 | then only post say 15 or so at a time and wait for review and integration. | ||
119 | |||
120 | |||
115 | 121 | ||
116 | 4) Select e-mail destination. | 122 | 4) Select e-mail destination. |
117 | 123 | ||
@@ -124,6 +130,10 @@ your patch to the primary Linux kernel developer's mailing list, | |||
124 | linux-kernel@vger.kernel.org. Most kernel developers monitor this | 130 | linux-kernel@vger.kernel.org. Most kernel developers monitor this |
125 | e-mail list, and can comment on your changes. | 131 | e-mail list, and can comment on your changes. |
126 | 132 | ||
133 | |||
134 | Do not send more than 15 patches at once to the vger mailing lists!!! | ||
135 | |||
136 | |||
127 | Linus Torvalds is the final arbiter of all changes accepted into the | 137 | Linus Torvalds is the final arbiter of all changes accepted into the |
128 | Linux kernel. His e-mail address is <torvalds@osdl.org>. He gets | 138 | Linux kernel. His e-mail address is <torvalds@osdl.org>. He gets |
129 | a lot of e-mail, so typically you should do your best to -avoid- sending | 139 | a lot of e-mail, so typically you should do your best to -avoid- sending |
@@ -149,6 +159,9 @@ USB, framebuffer devices, the VFS, the SCSI subsystem, etc. See the | |||
149 | MAINTAINERS file for a mailing list that relates specifically to | 159 | MAINTAINERS file for a mailing list that relates specifically to |
150 | your change. | 160 | your change. |
151 | 161 | ||
162 | Majordomo lists of VGER.KERNEL.ORG at: | ||
163 | <http://vger.kernel.org/vger-lists.html> | ||
164 | |||
152 | If changes affect userland-kernel interfaces, please send | 165 | If changes affect userland-kernel interfaces, please send |
153 | the MAN-PAGES maintainer (as listed in the MAINTAINERS file) | 166 | the MAN-PAGES maintainer (as listed in the MAINTAINERS file) |
154 | a man-pages patch, or at least a notification of the change, | 167 | a man-pages patch, or at least a notification of the change, |
@@ -373,27 +386,14 @@ a diffstat, to show what files have changed, and the number of inserted | |||
373 | and deleted lines per file. A diffstat is especially useful on bigger | 386 | and deleted lines per file. A diffstat is especially useful on bigger |
374 | patches. Other comments relevant only to the moment or the maintainer, | 387 | patches. Other comments relevant only to the moment or the maintainer, |
375 | not suitable for the permanent changelog, should also go here. | 388 | not suitable for the permanent changelog, should also go here. |
389 | Use diffstat options "-p 1 -w 70" so that filenames are listed from the | ||
390 | top of the kernel source tree and don't use too much horizontal space | ||
391 | (easily fit in 80 columns, maybe with some indentation). | ||
376 | 392 | ||
377 | See more details on the proper patch format in the following | 393 | See more details on the proper patch format in the following |
378 | references. | 394 | references. |
379 | 395 | ||
380 | 396 | ||
381 | 13) More references for submitting patches | ||
382 | |||
383 | Andrew Morton, "The perfect patch" (tpp). | ||
384 | <http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt> | ||
385 | |||
386 | Jeff Garzik, "Linux kernel patch submission format." | ||
387 | <http://linux.yyz.us/patch-format.html> | ||
388 | |||
389 | Greg KH, "How to piss off a kernel subsystem maintainer" | ||
390 | <http://www.kroah.com/log/2005/03/31/> | ||
391 | |||
392 | Kernel Documentation/CodingStyle | ||
393 | <http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle> | ||
394 | |||
395 | Linus Torvald's mail on the canonical patch format: | ||
396 | <http://lkml.org/lkml/2005/4/7/183> | ||
397 | 397 | ||
398 | 398 | ||
399 | ----------------------------------- | 399 | ----------------------------------- |
@@ -466,3 +466,30 @@ and 'extern __inline__'. | |||
466 | Don't try to anticipate nebulous future cases which may or may not | 466 | Don't try to anticipate nebulous future cases which may or may not |
467 | be useful: "Make it as simple as you can, and no simpler." | 467 | be useful: "Make it as simple as you can, and no simpler." |
468 | 468 | ||
469 | |||
470 | |||
471 | ---------------------- | ||
472 | SECTION 3 - REFERENCES | ||
473 | ---------------------- | ||
474 | |||
475 | Andrew Morton, "The perfect patch" (tpp). | ||
476 | <http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt> | ||
477 | |||
478 | Jeff Garzik, "Linux kernel patch submission format." | ||
479 | <http://linux.yyz.us/patch-format.html> | ||
480 | |||
481 | Greg Kroah, "How to piss off a kernel subsystem maintainer". | ||
482 | <http://www.kroah.com/log/2005/03/31/> | ||
483 | <http://www.kroah.com/log/2005/07/08/> | ||
484 | <http://www.kroah.com/log/2005/10/19/> | ||
485 | |||
486 | NO!!!! No more huge patch bombs to linux-kernel@vger.kernel.org people!. | ||
487 | <http://marc.theaimsgroup.com/?l=linux-kernel&m=112112749912944&w=2> | ||
488 | |||
489 | Kernel Documentation/CodingStyle | ||
490 | <http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle> | ||
491 | |||
492 | Linus Torvald's mail on the canonical patch format: | ||
493 | <http://lkml.org/lkml/2005/4/7/183> | ||
494 | -- | ||
495 | Last updated on 17 Nov 2005. | ||
diff --git a/Documentation/applying-patches.txt b/Documentation/applying-patches.txt index 681e426e2482..05a08c2c1889 100644 --- a/Documentation/applying-patches.txt +++ b/Documentation/applying-patches.txt | |||
@@ -2,7 +2,8 @@ | |||
2 | Applying Patches To The Linux Kernel | 2 | Applying Patches To The Linux Kernel |
3 | ------------------------------------ | 3 | ------------------------------------ |
4 | 4 | ||
5 | (Written by Jesper Juhl, August 2005) | 5 | Original by: Jesper Juhl, August 2005 |
6 | Last update: 2005-12-02 | ||
6 | 7 | ||
7 | 8 | ||
8 | 9 | ||
@@ -118,7 +119,7 @@ wrong. | |||
118 | 119 | ||
119 | When patch encounters a change that it can't fix up with fuzz it rejects it | 120 | When patch encounters a change that it can't fix up with fuzz it rejects it |
120 | outright and leaves a file with a .rej extension (a reject file). You can | 121 | outright and leaves a file with a .rej extension (a reject file). You can |
121 | read this file to see exactely what change couldn't be applied, so you can | 122 | read this file to see exactly what change couldn't be applied, so you can |
122 | go fix it up by hand if you wish. | 123 | go fix it up by hand if you wish. |
123 | 124 | ||
124 | If you don't have any third party patches applied to your kernel source, but | 125 | If you don't have any third party patches applied to your kernel source, but |
@@ -127,7 +128,7 @@ and have made no modifications yourself to the source files, then you should | |||
127 | never see a fuzz or reject message from patch. If you do see such messages | 128 | never see a fuzz or reject message from patch. If you do see such messages |
128 | anyway, then there's a high risk that either your local source tree or the | 129 | anyway, then there's a high risk that either your local source tree or the |
129 | patch file is corrupted in some way. In that case you should probably try | 130 | patch file is corrupted in some way. In that case you should probably try |
130 | redownloading the patch and if things are still not OK then you'd be advised | 131 | re-downloading the patch and if things are still not OK then you'd be advised |
131 | to start with a fresh tree downloaded in full from kernel.org. | 132 | to start with a fresh tree downloaded in full from kernel.org. |
132 | 133 | ||
133 | Let's look a bit more at some of the messages patch can produce. | 134 | Let's look a bit more at some of the messages patch can produce. |
@@ -180,9 +181,11 @@ wish to apply. | |||
180 | 181 | ||
181 | Are there any alternatives to `patch'? | 182 | Are there any alternatives to `patch'? |
182 | --- | 183 | --- |
183 | Yes there are alternatives. You can use the `interdiff' program | 184 | Yes there are alternatives. |
184 | (http://cyberelk.net/tim/patchutils/) to generate a patch representing the | 185 | |
185 | differences between two patches and then apply the result. | 186 | You can use the `interdiff' program (http://cyberelk.net/tim/patchutils/) to |
187 | generate a patch representing the differences between two patches and then | ||
188 | apply the result. | ||
186 | This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single | 189 | This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single |
187 | step. The -z flag to interdiff will even let you feed it patches in gzip or | 190 | step. The -z flag to interdiff will even let you feed it patches in gzip or |
188 | bzip2 compressed form directly without the use of zcat or bzcat or manual | 191 | bzip2 compressed form directly without the use of zcat or bzcat or manual |
@@ -197,7 +200,7 @@ do the additional steps since interdiff can get things wrong in some cases. | |||
197 | Another alternative is `ketchup', which is a python script for automatic | 200 | Another alternative is `ketchup', which is a python script for automatic |
198 | downloading and applying of patches (http://www.selenic.com/ketchup/). | 201 | downloading and applying of patches (http://www.selenic.com/ketchup/). |
199 | 202 | ||
200 | Other nice tools are diffstat which shows a summary of changes made by a | 203 | Other nice tools are diffstat which shows a summary of changes made by a |
201 | patch, lsdiff which displays a short listing of affected files in a patch | 204 | patch, lsdiff which displays a short listing of affected files in a patch |
202 | file, along with (optionally) the line numbers of the start of each patch | 205 | file, along with (optionally) the line numbers of the start of each patch |
203 | and grepdiff which displays a list of the files modified by a patch where | 206 | and grepdiff which displays a list of the files modified by a patch where |
@@ -258,7 +261,7 @@ $ patch -p1 -R < ../patch-2.6.11.1 # revert the 2.6.11.1 patch | |||
258 | # source dir is now 2.6.11 | 261 | # source dir is now 2.6.11 |
259 | $ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch | 262 | $ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch |
260 | $ cd .. | 263 | $ cd .. |
261 | $ mv linux-2.6.11.1 inux-2.6.12 # rename source dir | 264 | $ mv linux-2.6.11.1 linux-2.6.12 # rename source dir |
262 | 265 | ||
263 | 266 | ||
264 | The 2.6.x.y kernels | 267 | The 2.6.x.y kernels |
@@ -433,7 +436,11 @@ $ cd .. | |||
433 | $ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3 # rename the source dir | 436 | $ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3 # rename the source dir |
434 | 437 | ||
435 | 438 | ||
436 | This concludes this list of explanations of the various kernel trees and I | 439 | This concludes this list of explanations of the various kernel trees. |
437 | hope you are now crystal clear on how to apply the various patches and help | 440 | I hope you are now clear on how to apply the various patches and help testing |
438 | testing the kernel. | 441 | the kernel. |
442 | |||
443 | Thank you's to Randy Dunlap, Rolf Eike Beer, Linus Torvalds, Bodo Eggert, | ||
444 | Johannes Stezenbach, Grant Coady, Pavel Machek and others that I may have | ||
445 | forgotten for their reviews and contributions to this document. | ||
439 | 446 | ||
diff --git a/Documentation/block/stat.txt b/Documentation/block/stat.txt new file mode 100644 index 000000000000..0dbc946de2ea --- /dev/null +++ b/Documentation/block/stat.txt | |||
@@ -0,0 +1,82 @@ | |||
1 | Block layer statistics in /sys/block/<dev>/stat | ||
2 | =============================================== | ||
3 | |||
4 | This file documents the contents of the /sys/block/<dev>/stat file. | ||
5 | |||
6 | The stat file provides several statistics about the state of block | ||
7 | device <dev>. | ||
8 | |||
9 | Q. Why are there multiple statistics in a single file? Doesn't sysfs | ||
10 | normally contain a single value per file? | ||
11 | A. By having a single file, the kernel can guarantee that the statistics | ||
12 | represent a consistent snapshot of the state of the device. If the | ||
13 | statistics were exported as multiple files containing one statistic | ||
14 | each, it would be impossible to guarantee that a set of readings | ||
15 | represent a single point in time. | ||
16 | |||
17 | The stat file consists of a single line of text containing 11 decimal | ||
18 | values separated by whitespace. The fields are summarized in the | ||
19 | following table, and described in more detail below. | ||
20 | |||
21 | Name units description | ||
22 | ---- ----- ----------- | ||
23 | read I/Os requests number of read I/Os processed | ||
24 | read merges requests number of read I/Os merged with in-queue I/O | ||
25 | read sectors sectors number of sectors read | ||
26 | read ticks milliseconds total wait time for read requests | ||
27 | write I/Os requests number of write I/Os processed | ||
28 | write merges requests number of write I/Os merged with in-queue I/O | ||
29 | write sectors sectors number of sectors written | ||
30 | write ticks milliseconds total wait time for write requests | ||
31 | in_flight requests number of I/Os currently in flight | ||
32 | io_ticks milliseconds total time this block device has been active | ||
33 | time_in_queue milliseconds total wait time for all requests | ||
34 | |||
35 | read I/Os, write I/Os | ||
36 | ===================== | ||
37 | |||
38 | These values increment when an I/O request completes. | ||
39 | |||
40 | read merges, write merges | ||
41 | ========================= | ||
42 | |||
43 | These values increment when an I/O request is merged with an | ||
44 | already-queued I/O request. | ||
45 | |||
46 | read sectors, write sectors | ||
47 | =========================== | ||
48 | |||
49 | These values count the number of sectors read from or written to this | ||
50 | block device. The "sectors" in question are the standard UNIX 512-byte | ||
51 | sectors, not any device- or filesystem-specific block size. The | ||
52 | counters are incremented when the I/O completes. | ||
53 | |||
54 | read ticks, write ticks | ||
55 | ======================= | ||
56 | |||
57 | These values count the number of milliseconds that I/O requests have | ||
58 | waited on this block device. If there are multiple I/O requests waiting, | ||
59 | these values will increase at a rate greater than 1000/second; for | ||
60 | example, if 60 read requests wait for an average of 30 ms, the read_ticks | ||
61 | field will increase by 60*30 = 1800. | ||
62 | |||
63 | in_flight | ||
64 | ========= | ||
65 | |||
66 | This value counts the number of I/O requests that have been issued to | ||
67 | the device driver but have not yet completed. It does not include I/O | ||
68 | requests that are in the queue but not yet issued to the device driver. | ||
69 | |||
70 | io_ticks | ||
71 | ======== | ||
72 | |||
73 | This value counts the number of milliseconds during which the device has | ||
74 | had I/O requests queued. | ||
75 | |||
76 | time_in_queue | ||
77 | ============= | ||
78 | |||
79 | This value counts the number of milliseconds that I/O requests have waited | ||
80 | on this block device. If there are multiple I/O requests waiting, this | ||
81 | value will increase as the product of the number of milliseconds times the | ||
82 | number of requests waiting (see "read ticks" above for an example). | ||
diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt new file mode 100644 index 000000000000..08c5d04f3086 --- /dev/null +++ b/Documentation/cpu-hotplug.txt | |||
@@ -0,0 +1,357 @@ | |||
1 | CPU hotplug Support in Linux(tm) Kernel | ||
2 | |||
3 | Maintainers: | ||
4 | CPU Hotplug Core: | ||
5 | Rusty Russell <rusty@rustycorp.com.au> | ||
6 | Srivatsa Vaddagiri <vatsa@in.ibm.com> | ||
7 | i386: | ||
8 | Zwane Mwaikambo <zwane@arm.linux.org.uk> | ||
9 | ppc64: | ||
10 | Nathan Lynch <nathanl@austin.ibm.com> | ||
11 | Joel Schopp <jschopp@austin.ibm.com> | ||
12 | ia64/x86_64: | ||
13 | Ashok Raj <ashok.raj@intel.com> | ||
14 | |||
15 | Authors: Ashok Raj <ashok.raj@intel.com> | ||
16 | Lots of feedback: Nathan Lynch <nathanl@austin.ibm.com>, | ||
17 | Joel Schopp <jschopp@austin.ibm.com> | ||
18 | |||
19 | Introduction | ||
20 | |||
21 | Modern advances in system architectures have introduced advanced error | ||
22 | reporting and correction capabilities in processors. CPU architectures permit | ||
23 | partitioning support, where compute resources of a single CPU could be made | ||
24 | available to virtual machine environments. There are couple OEMS that | ||
25 | support NUMA hardware which are hot pluggable as well, where physical | ||
26 | node insertion and removal require support for CPU hotplug. | ||
27 | |||
28 | Such advances require CPUs available to a kernel to be removed either for | ||
29 | provisioning reasons, or for RAS purposes to keep an offending CPU off | ||
30 | system execution path. Hence the need for CPU hotplug support in the | ||
31 | Linux kernel. | ||
32 | |||
33 | A more novel use of CPU-hotplug support is its use today in suspend | ||
34 | resume support for SMP. Dual-core and HT support makes even | ||
35 | a laptop run SMP kernels which didn't support these methods. SMP support | ||
36 | for suspend/resume is a work in progress. | ||
37 | |||
38 | General Stuff about CPU Hotplug | ||
39 | -------------------------------- | ||
40 | |||
41 | Command Line Switches | ||
42 | --------------------- | ||
43 | maxcpus=n Restrict boot time cpus to n. Say if you have 4 cpus, using | ||
44 | maxcpus=2 will only boot 2. You can choose to bring the | ||
45 | other cpus later online, read FAQ's for more info. | ||
46 | |||
47 | additional_cpus=n [x86_64 only] use this to limit hotpluggable cpus. | ||
48 | This option sets | ||
49 | cpu_possible_map = cpu_present_map + additional_cpus | ||
50 | |||
51 | CPU maps and such | ||
52 | ----------------- | ||
53 | [More on cpumaps and primitive to manipulate, please check | ||
54 | include/linux/cpumask.h that has more descriptive text.] | ||
55 | |||
56 | cpu_possible_map: Bitmap of possible CPUs that can ever be available in the | ||
57 | system. This is used to allocate some boot time memory for per_cpu variables | ||
58 | that aren't designed to grow/shrink as CPUs are made available or removed. | ||
59 | Once set during boot time discovery phase, the map is static, i.e no bits | ||
60 | are added or removed anytime. Trimming it accurately for your system needs | ||
61 | upfront can save some boot time memory. See below for how we use heuristics | ||
62 | in x86_64 case to keep this under check. | ||
63 | |||
64 | cpu_online_map: Bitmap of all CPUs currently online. Its set in __cpu_up() | ||
65 | after a cpu is available for kernel scheduling and ready to receive | ||
66 | interrupts from devices. Its cleared when a cpu is brought down using | ||
67 | __cpu_disable(), before which all OS services including interrupts are | ||
68 | migrated to another target CPU. | ||
69 | |||
70 | cpu_present_map: Bitmap of CPUs currently present in the system. Not all | ||
71 | of them may be online. When physical hotplug is processed by the relevant | ||
72 | subsystem (e.g ACPI) can change and new bit either be added or removed | ||
73 | from the map depending on the event is hot-add/hot-remove. There are currently | ||
74 | no locking rules as of now. Typical usage is to init topology during boot, | ||
75 | at which time hotplug is disabled. | ||
76 | |||
77 | You really dont need to manipulate any of the system cpu maps. They should | ||
78 | be read-only for most use. When setting up per-cpu resources almost always use | ||
79 | cpu_possible_map/for_each_cpu() to iterate. | ||
80 | |||
81 | Never use anything other than cpumask_t to represent bitmap of CPUs. | ||
82 | |||
83 | #include <linux/cpumask.h> | ||
84 | |||
85 | for_each_cpu - Iterate over cpu_possible_map | ||
86 | for_each_online_cpu - Iterate over cpu_online_map | ||
87 | for_each_present_cpu - Iterate over cpu_present_map | ||
88 | for_each_cpu_mask(x,mask) - Iterate over some random collection of cpu mask. | ||
89 | |||
90 | #include <linux/cpu.h> | ||
91 | lock_cpu_hotplug() and unlock_cpu_hotplug(): | ||
92 | |||
93 | The above calls are used to inhibit cpu hotplug operations. While holding the | ||
94 | cpucontrol mutex, cpu_online_map will not change. If you merely need to avoid | ||
95 | cpus going away, you could also use preempt_disable() and preempt_enable() | ||
96 | for those sections. Just remember the critical section cannot call any | ||
97 | function that can sleep or schedule this process away. The preempt_disable() | ||
98 | will work as long as stop_machine_run() is used to take a cpu down. | ||
99 | |||
100 | CPU Hotplug - Frequently Asked Questions. | ||
101 | |||
102 | Q: How to i enable my kernel to support CPU hotplug? | ||
103 | A: When doing make defconfig, Enable CPU hotplug support | ||
104 | |||
105 | "Processor type and Features" -> Support for Hotpluggable CPUs | ||
106 | |||
107 | Make sure that you have CONFIG_HOTPLUG, and CONFIG_SMP turned on as well. | ||
108 | |||
109 | You would need to enable CONFIG_HOTPLUG_CPU for SMP suspend/resume support | ||
110 | as well. | ||
111 | |||
112 | Q: What architectures support CPU hotplug? | ||
113 | A: As of 2.6.14, the following architectures support CPU hotplug. | ||
114 | |||
115 | i386 (Intel), ppc, ppc64, parisc, s390, ia64 and x86_64 | ||
116 | |||
117 | Q: How to test if hotplug is supported on the newly built kernel? | ||
118 | A: You should now notice an entry in sysfs. | ||
119 | |||
120 | Check if sysfs is mounted, using the "mount" command. You should notice | ||
121 | an entry as shown below in the output. | ||
122 | |||
123 | .... | ||
124 | none on /sys type sysfs (rw) | ||
125 | .... | ||
126 | |||
127 | if this is not mounted, do the following. | ||
128 | |||
129 | #mkdir /sysfs | ||
130 | #mount -t sysfs sys /sys | ||
131 | |||
132 | now you should see entries for all present cpu, the following is an example | ||
133 | in a 8-way system. | ||
134 | |||
135 | #pwd | ||
136 | #/sys/devices/system/cpu | ||
137 | #ls -l | ||
138 | total 0 | ||
139 | drwxr-xr-x 10 root root 0 Sep 19 07:44 . | ||
140 | drwxr-xr-x 13 root root 0 Sep 19 07:45 .. | ||
141 | drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu0 | ||
142 | drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu1 | ||
143 | drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu2 | ||
144 | drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu3 | ||
145 | drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu4 | ||
146 | drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu5 | ||
147 | drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu6 | ||
148 | drwxr-xr-x 3 root root 0 Sep 19 07:48 cpu7 | ||
149 | |||
150 | Under each directory you would find an "online" file which is the control | ||
151 | file to logically online/offline a processor. | ||
152 | |||
153 | Q: Does hot-add/hot-remove refer to physical add/remove of cpus? | ||
154 | A: The usage of hot-add/remove may not be very consistently used in the code. | ||
155 | CONFIG_CPU_HOTPLUG enables logical online/offline capability in the kernel. | ||
156 | To support physical addition/removal, one would need some BIOS hooks and | ||
157 | the platform should have something like an attention button in PCI hotplug. | ||
158 | CONFIG_ACPI_HOTPLUG_CPU enables ACPI support for physical add/remove of CPUs. | ||
159 | |||
160 | Q: How do i logically offline a CPU? | ||
161 | A: Do the following. | ||
162 | |||
163 | #echo 0 > /sys/devices/system/cpu/cpuX/online | ||
164 | |||
165 | once the logical offline is successful, check | ||
166 | |||
167 | #cat /proc/interrupts | ||
168 | |||
169 | you should now not see the CPU that you removed. Also online file will report | ||
170 | the state as 0 when a cpu if offline and 1 when its online. | ||
171 | |||
172 | #To display the current cpu state. | ||
173 | #cat /sys/devices/system/cpu/cpuX/online | ||
174 | |||
175 | Q: Why cant i remove CPU0 on some systems? | ||
176 | A: Some architectures may have some special dependency on a certain CPU. | ||
177 | |||
178 | For e.g in IA64 platforms we have ability to sent platform interrupts to the | ||
179 | OS. a.k.a Corrected Platform Error Interrupts (CPEI). In current ACPI | ||
180 | specifications, we didn't have a way to change the target CPU. Hence if the | ||
181 | current ACPI version doesn't support such re-direction, we disable that CPU | ||
182 | by making it not-removable. | ||
183 | |||
184 | In such cases you will also notice that the online file is missing under cpu0. | ||
185 | |||
186 | Q: How do i find out if a particular CPU is not removable? | ||
187 | A: Depending on the implementation, some architectures may show this by the | ||
188 | absence of the "online" file. This is done if it can be determined ahead of | ||
189 | time that this CPU cannot be removed. | ||
190 | |||
191 | In some situations, this can be a run time check, i.e if you try to remove the | ||
192 | last CPU, this will not be permitted. You can find such failures by | ||
193 | investigating the return value of the "echo" command. | ||
194 | |||
195 | Q: What happens when a CPU is being logically offlined? | ||
196 | A: The following happen, listed in no particular order :-) | ||
197 | |||
198 | - A notification is sent to in-kernel registered modules by sending an event | ||
199 | CPU_DOWN_PREPARE | ||
200 | - All process is migrated away from this outgoing CPU to a new CPU | ||
201 | - All interrupts targeted to this CPU is migrated to a new CPU | ||
202 | - timers/bottom half/task lets are also migrated to a new CPU | ||
203 | - Once all services are migrated, kernel calls an arch specific routine | ||
204 | __cpu_disable() to perform arch specific cleanup. | ||
205 | - Once this is successful, an event for successful cleanup is sent by an event | ||
206 | CPU_DEAD. | ||
207 | |||
208 | "It is expected that each service cleans up when the CPU_DOWN_PREPARE | ||
209 | notifier is called, when CPU_DEAD is called its expected there is nothing | ||
210 | running on behalf of this CPU that was offlined" | ||
211 | |||
212 | Q: If i have some kernel code that needs to be aware of CPU arrival and | ||
213 | departure, how to i arrange for proper notification? | ||
214 | A: This is what you would need in your kernel code to receive notifications. | ||
215 | |||
216 | #include <linux/cpu.h> | ||
217 | static int __cpuinit foobar_cpu_callback(struct notifier_block *nfb, | ||
218 | unsigned long action, void *hcpu) | ||
219 | { | ||
220 | unsigned int cpu = (unsigned long)hcpu; | ||
221 | |||
222 | switch (action) { | ||
223 | case CPU_ONLINE: | ||
224 | foobar_online_action(cpu); | ||
225 | break; | ||
226 | case CPU_DEAD: | ||
227 | foobar_dead_action(cpu); | ||
228 | break; | ||
229 | } | ||
230 | return NOTIFY_OK; | ||
231 | } | ||
232 | |||
233 | static struct notifier_block foobar_cpu_notifer = | ||
234 | { | ||
235 | .notifier_call = foobar_cpu_callback, | ||
236 | }; | ||
237 | |||
238 | |||
239 | In your init function, | ||
240 | |||
241 | register_cpu_notifier(&foobar_cpu_notifier); | ||
242 | |||
243 | You can fail PREPARE notifiers if something doesn't work to prepare resources. | ||
244 | This will stop the activity and send a following CANCELED event back. | ||
245 | |||
246 | CPU_DEAD should not be failed, its just a goodness indication, but bad | ||
247 | things will happen if a notifier in path sent a BAD notify code. | ||
248 | |||
249 | Q: I don't see my action being called for all CPUs already up and running? | ||
250 | A: Yes, CPU notifiers are called only when new CPUs are on-lined or offlined. | ||
251 | If you need to perform some action for each cpu already in the system, then | ||
252 | |||
253 | for_each_online_cpu(i) { | ||
254 | foobar_cpu_callback(&foobar_cpu_notifier, CPU_UP_PREPARE, i); | ||
255 | foobar_cpu_callback(&foobar-cpu_notifier, CPU_ONLINE, i); | ||
256 | } | ||
257 | |||
258 | Q: If i would like to develop cpu hotplug support for a new architecture, | ||
259 | what do i need at a minimum? | ||
260 | A: The following are what is required for CPU hotplug infrastructure to work | ||
261 | correctly. | ||
262 | |||
263 | - Make sure you have an entry in Kconfig to enable CONFIG_HOTPLUG_CPU | ||
264 | - __cpu_up() - Arch interface to bring up a CPU | ||
265 | - __cpu_disable() - Arch interface to shutdown a CPU, no more interrupts | ||
266 | can be handled by the kernel after the routine | ||
267 | returns. Including local APIC timers etc are | ||
268 | shutdown. | ||
269 | - __cpu_die() - This actually supposed to ensure death of the CPU. | ||
270 | Actually look at some example code in other arch | ||
271 | that implement CPU hotplug. The processor is taken | ||
272 | down from the idle() loop for that specific | ||
273 | architecture. __cpu_die() typically waits for some | ||
274 | per_cpu state to be set, to ensure the processor | ||
275 | dead routine is called to be sure positively. | ||
276 | |||
277 | Q: I need to ensure that a particular cpu is not removed when there is some | ||
278 | work specific to this cpu is in progress. | ||
279 | A: First switch the current thread context to preferred cpu | ||
280 | |||
281 | int my_func_on_cpu(int cpu) | ||
282 | { | ||
283 | cpumask_t saved_mask, new_mask = CPU_MASK_NONE; | ||
284 | int curr_cpu, err = 0; | ||
285 | |||
286 | saved_mask = current->cpus_allowed; | ||
287 | cpu_set(cpu, new_mask); | ||
288 | err = set_cpus_allowed(current, new_mask); | ||
289 | |||
290 | if (err) | ||
291 | return err; | ||
292 | |||
293 | /* | ||
294 | * If we got scheduled out just after the return from | ||
295 | * set_cpus_allowed() before running the work, this ensures | ||
296 | * we stay locked. | ||
297 | */ | ||
298 | curr_cpu = get_cpu(); | ||
299 | |||
300 | if (curr_cpu != cpu) { | ||
301 | err = -EAGAIN; | ||
302 | goto ret; | ||
303 | } else { | ||
304 | /* | ||
305 | * Do work : But cant sleep, since get_cpu() disables preempt | ||
306 | */ | ||
307 | } | ||
308 | ret: | ||
309 | put_cpu(); | ||
310 | set_cpus_allowed(current, saved_mask); | ||
311 | return err; | ||
312 | } | ||
313 | |||
314 | |||
315 | Q: How do we determine how many CPUs are available for hotplug. | ||
316 | A: There is no clear spec defined way from ACPI that can give us that | ||
317 | information today. Based on some input from Natalie of Unisys, | ||
318 | that the ACPI MADT (Multiple APIC Description Tables) marks those possible | ||
319 | CPUs in a system with disabled status. | ||
320 | |||
321 | Andi implemented some simple heuristics that count the number of disabled | ||
322 | CPUs in MADT as hotpluggable CPUS. In the case there are no disabled CPUS | ||
323 | we assume 1/2 the number of CPUs currently present can be hotplugged. | ||
324 | |||
325 | Caveat: Today's ACPI MADT can only provide 256 entries since the apicid field | ||
326 | in MADT is only 8 bits. | ||
327 | |||
328 | User Space Notification | ||
329 | |||
330 | Hotplug support for devices is common in Linux today. Its being used today to | ||
331 | support automatic configuration of network, usb and pci devices. A hotplug | ||
332 | event can be used to invoke an agent script to perform the configuration task. | ||
333 | |||
334 | You can add /etc/hotplug/cpu.agent to handle hotplug notification user space | ||
335 | scripts. | ||
336 | |||
337 | #!/bin/bash | ||
338 | # $Id: cpu.agent | ||
339 | # Kernel hotplug params include: | ||
340 | #ACTION=%s [online or offline] | ||
341 | #DEVPATH=%s | ||
342 | # | ||
343 | cd /etc/hotplug | ||
344 | . ./hotplug.functions | ||
345 | |||
346 | case $ACTION in | ||
347 | online) | ||
348 | echo `date` ":cpu.agent" add cpu >> /tmp/hotplug.txt | ||
349 | ;; | ||
350 | offline) | ||
351 | echo `date` ":cpu.agent" remove cpu >>/tmp/hotplug.txt | ||
352 | ;; | ||
353 | *) | ||
354 | debug_mesg CPU $ACTION event not supported | ||
355 | exit 1 | ||
356 | ;; | ||
357 | esac | ||
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index a09a8eb80665..9e49b1c35729 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt | |||
@@ -14,7 +14,10 @@ CONTENTS: | |||
14 | 1.1 What are cpusets ? | 14 | 1.1 What are cpusets ? |
15 | 1.2 Why are cpusets needed ? | 15 | 1.2 Why are cpusets needed ? |
16 | 1.3 How are cpusets implemented ? | 16 | 1.3 How are cpusets implemented ? |
17 | 1.4 How do I use cpusets ? | 17 | 1.4 What are exclusive cpusets ? |
18 | 1.5 What does notify_on_release do ? | ||
19 | 1.6 What is memory_pressure ? | ||
20 | 1.7 How do I use cpusets ? | ||
18 | 2. Usage Examples and Syntax | 21 | 2. Usage Examples and Syntax |
19 | 2.1 Basic Usage | 22 | 2.1 Basic Usage |
20 | 2.2 Adding/removing cpus | 23 | 2.2 Adding/removing cpus |
@@ -49,29 +52,6 @@ its cpus_allowed vector, and the kernel page allocator will not | |||
49 | allocate a page on a node that is not allowed in the requesting tasks | 52 | allocate a page on a node that is not allowed in the requesting tasks |
50 | mems_allowed vector. | 53 | mems_allowed vector. |
51 | 54 | ||
52 | If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct | ||
53 | ancestor or descendent, may share any of the same CPUs or Memory Nodes. | ||
54 | A cpuset that is cpu exclusive has a sched domain associated with it. | ||
55 | The sched domain consists of all cpus in the current cpuset that are not | ||
56 | part of any exclusive child cpusets. | ||
57 | This ensures that the scheduler load balacing code only balances | ||
58 | against the cpus that are in the sched domain as defined above and not | ||
59 | all of the cpus in the system. This removes any overhead due to | ||
60 | load balancing code trying to pull tasks outside of the cpu exclusive | ||
61 | cpuset only to be prevented by the tasks' cpus_allowed mask. | ||
62 | |||
63 | A cpuset that is mem_exclusive restricts kernel allocations for | ||
64 | page, buffer and other data commonly shared by the kernel across | ||
65 | multiple users. All cpusets, whether mem_exclusive or not, restrict | ||
66 | allocations of memory for user space. This enables configuring a | ||
67 | system so that several independent jobs can share common kernel | ||
68 | data, such as file system pages, while isolating each jobs user | ||
69 | allocation in its own cpuset. To do this, construct a large | ||
70 | mem_exclusive cpuset to hold all the jobs, and construct child, | ||
71 | non-mem_exclusive cpusets for each individual job. Only a small | ||
72 | amount of typical kernel memory, such as requests from interrupt | ||
73 | handlers, is allowed to be taken outside even a mem_exclusive cpuset. | ||
74 | |||
75 | User level code may create and destroy cpusets by name in the cpuset | 55 | User level code may create and destroy cpusets by name in the cpuset |
76 | virtual file system, manage the attributes and permissions of these | 56 | virtual file system, manage the attributes and permissions of these |
77 | cpusets and which CPUs and Memory Nodes are assigned to each cpuset, | 57 | cpusets and which CPUs and Memory Nodes are assigned to each cpuset, |
@@ -192,9 +172,15 @@ containing the following files describing that cpuset: | |||
192 | 172 | ||
193 | - cpus: list of CPUs in that cpuset | 173 | - cpus: list of CPUs in that cpuset |
194 | - mems: list of Memory Nodes in that cpuset | 174 | - mems: list of Memory Nodes in that cpuset |
175 | - memory_migrate flag: if set, move pages to cpusets nodes | ||
195 | - cpu_exclusive flag: is cpu placement exclusive? | 176 | - cpu_exclusive flag: is cpu placement exclusive? |
196 | - mem_exclusive flag: is memory placement exclusive? | 177 | - mem_exclusive flag: is memory placement exclusive? |
197 | - tasks: list of tasks (by pid) attached to that cpuset | 178 | - tasks: list of tasks (by pid) attached to that cpuset |
179 | - notify_on_release flag: run /sbin/cpuset_release_agent on exit? | ||
180 | - memory_pressure: measure of how much paging pressure in cpuset | ||
181 | |||
182 | In addition, the root cpuset only has the following file: | ||
183 | - memory_pressure_enabled flag: compute memory_pressure? | ||
198 | 184 | ||
199 | New cpusets are created using the mkdir system call or shell | 185 | New cpusets are created using the mkdir system call or shell |
200 | command. The properties of a cpuset, such as its flags, allowed | 186 | command. The properties of a cpuset, such as its flags, allowed |
@@ -228,7 +214,108 @@ exclusive cpuset. Also, the use of a Linux virtual file system (vfs) | |||
228 | to represent the cpuset hierarchy provides for a familiar permission | 214 | to represent the cpuset hierarchy provides for a familiar permission |
229 | and name space for cpusets, with a minimum of additional kernel code. | 215 | and name space for cpusets, with a minimum of additional kernel code. |
230 | 216 | ||
231 | 1.4 How do I use cpusets ? | 217 | |
218 | 1.4 What are exclusive cpusets ? | ||
219 | -------------------------------- | ||
220 | |||
221 | If a cpuset is cpu or mem exclusive, no other cpuset, other than | ||
222 | a direct ancestor or descendent, may share any of the same CPUs or | ||
223 | Memory Nodes. | ||
224 | |||
225 | A cpuset that is cpu_exclusive has a scheduler (sched) domain | ||
226 | associated with it. The sched domain consists of all CPUs in the | ||
227 | current cpuset that are not part of any exclusive child cpusets. | ||
228 | This ensures that the scheduler load balancing code only balances | ||
229 | against the CPUs that are in the sched domain as defined above and | ||
230 | not all of the CPUs in the system. This removes any overhead due to | ||
231 | load balancing code trying to pull tasks outside of the cpu_exclusive | ||
232 | cpuset only to be prevented by the tasks' cpus_allowed mask. | ||
233 | |||
234 | A cpuset that is mem_exclusive restricts kernel allocations for | ||
235 | page, buffer and other data commonly shared by the kernel across | ||
236 | multiple users. All cpusets, whether mem_exclusive or not, restrict | ||
237 | allocations of memory for user space. This enables configuring a | ||
238 | system so that several independent jobs can share common kernel data, | ||
239 | such as file system pages, while isolating each jobs user allocation in | ||
240 | its own cpuset. To do this, construct a large mem_exclusive cpuset to | ||
241 | hold all the jobs, and construct child, non-mem_exclusive cpusets for | ||
242 | each individual job. Only a small amount of typical kernel memory, | ||
243 | such as requests from interrupt handlers, is allowed to be taken | ||
244 | outside even a mem_exclusive cpuset. | ||
245 | |||
246 | |||
247 | 1.5 What does notify_on_release do ? | ||
248 | ------------------------------------ | ||
249 | |||
250 | If the notify_on_release flag is enabled (1) in a cpuset, then whenever | ||
251 | the last task in the cpuset leaves (exits or attaches to some other | ||
252 | cpuset) and the last child cpuset of that cpuset is removed, then | ||
253 | the kernel runs the command /sbin/cpuset_release_agent, supplying the | ||
254 | pathname (relative to the mount point of the cpuset file system) of the | ||
255 | abandoned cpuset. This enables automatic removal of abandoned cpusets. | ||
256 | The default value of notify_on_release in the root cpuset at system | ||
257 | boot is disabled (0). The default value of other cpusets at creation | ||
258 | is the current value of their parents notify_on_release setting. | ||
259 | |||
260 | |||
261 | 1.6 What is memory_pressure ? | ||
262 | ----------------------------- | ||
263 | The memory_pressure of a cpuset provides a simple per-cpuset metric | ||
264 | of the rate that the tasks in a cpuset are attempting to free up in | ||
265 | use memory on the nodes of the cpuset to satisfy additional memory | ||
266 | requests. | ||
267 | |||
268 | This enables batch managers monitoring jobs running in dedicated | ||
269 | cpusets to efficiently detect what level of memory pressure that job | ||
270 | is causing. | ||
271 | |||
272 | This is useful both on tightly managed systems running a wide mix of | ||
273 | submitted jobs, which may choose to terminate or re-prioritize jobs that | ||
274 | are trying to use more memory than allowed on the nodes assigned them, | ||
275 | and with tightly coupled, long running, massively parallel scientific | ||
276 | computing jobs that will dramatically fail to meet required performance | ||
277 | goals if they start to use more memory than allowed to them. | ||
278 | |||
279 | This mechanism provides a very economical way for the batch manager | ||
280 | to monitor a cpuset for signs of memory pressure. It's up to the | ||
281 | batch manager or other user code to decide what to do about it and | ||
282 | take action. | ||
283 | |||
284 | ==> Unless this feature is enabled by writing "1" to the special file | ||
285 | /dev/cpuset/memory_pressure_enabled, the hook in the rebalance | ||
286 | code of __alloc_pages() for this metric reduces to simply noticing | ||
287 | that the cpuset_memory_pressure_enabled flag is zero. So only | ||
288 | systems that enable this feature will compute the metric. | ||
289 | |||
290 | Why a per-cpuset, running average: | ||
291 | |||
292 | Because this meter is per-cpuset, rather than per-task or mm, | ||
293 | the system load imposed by a batch scheduler monitoring this | ||
294 | metric is sharply reduced on large systems, because a scan of | ||
295 | the tasklist can be avoided on each set of queries. | ||
296 | |||
297 | Because this meter is a running average, instead of an accumulating | ||
298 | counter, a batch scheduler can detect memory pressure with a | ||
299 | single read, instead of having to read and accumulate results | ||
300 | for a period of time. | ||
301 | |||
302 | Because this meter is per-cpuset rather than per-task or mm, | ||
303 | the batch scheduler can obtain the key information, memory | ||
304 | pressure in a cpuset, with a single read, rather than having to | ||
305 | query and accumulate results over all the (dynamically changing) | ||
306 | set of tasks in the cpuset. | ||
307 | |||
308 | A per-cpuset simple digital filter (requires a spinlock and 3 words | ||
309 | of data per-cpuset) is kept, and updated by any task attached to that | ||
310 | cpuset, if it enters the synchronous (direct) page reclaim code. | ||
311 | |||
312 | A per-cpuset file provides an integer number representing the recent | ||
313 | (half-life of 10 seconds) rate of direct page reclaims caused by | ||
314 | the tasks in the cpuset, in units of reclaims attempted per second, | ||
315 | times 1000. | ||
316 | |||
317 | |||
318 | 1.7 How do I use cpusets ? | ||
232 | -------------------------- | 319 | -------------------------- |
233 | 320 | ||
234 | In order to minimize the impact of cpusets on critical kernel | 321 | In order to minimize the impact of cpusets on critical kernel |
@@ -277,6 +364,30 @@ rewritten to the 'tasks' file of its cpuset. This is done to avoid | |||
277 | impacting the scheduler code in the kernel with a check for changes | 364 | impacting the scheduler code in the kernel with a check for changes |
278 | in a tasks processor placement. | 365 | in a tasks processor placement. |
279 | 366 | ||
367 | Normally, once a page is allocated (given a physical page | ||
368 | of main memory) then that page stays on whatever node it | ||
369 | was allocated, so long as it remains allocated, even if the | ||
370 | cpusets memory placement policy 'mems' subsequently changes. | ||
371 | If the cpuset flag file 'memory_migrate' is set true, then when | ||
372 | tasks are attached to that cpuset, any pages that task had | ||
373 | allocated to it on nodes in its previous cpuset are migrated | ||
374 | to the tasks new cpuset. Depending on the implementation, | ||
375 | this migration may either be done by swapping the page out, | ||
376 | so that the next time the page is referenced, it will be paged | ||
377 | into the tasks new cpuset, usually on the node where it was | ||
378 | referenced, or this migration may be done by directly copying | ||
379 | the pages from the tasks previous cpuset to the new cpuset, | ||
380 | where possible to the same node, relative to the new cpuset, | ||
381 | as the node that held the page, relative to the old cpuset. | ||
382 | Also if 'memory_migrate' is set true, then if that cpusets | ||
383 | 'mems' file is modified, pages allocated to tasks in that | ||
384 | cpuset, that were on nodes in the previous setting of 'mems', | ||
385 | will be moved to nodes in the new setting of 'mems.' Again, | ||
386 | depending on the implementation, this might be done by swapping, | ||
387 | or by direct copying. In either case, pages that were not in | ||
388 | the tasks prior cpuset, or in the cpusets prior 'mems' setting, | ||
389 | will not be moved. | ||
390 | |||
280 | There is an exception to the above. If hotplug functionality is used | 391 | There is an exception to the above. If hotplug functionality is used |
281 | to remove all the CPUs that are currently assigned to a cpuset, | 392 | to remove all the CPUs that are currently assigned to a cpuset, |
282 | then the kernel will automatically update the cpus_allowed of all | 393 | then the kernel will automatically update the cpus_allowed of all |
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 9840d5b8d5b9..22e4040564d5 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt | |||
@@ -22,6 +22,11 @@ journal=inum When a journal already exists, this option is | |||
22 | the inode which will represent the ext3 file | 22 | the inode which will represent the ext3 file |
23 | system's journal file. | 23 | system's journal file. |
24 | 24 | ||
25 | journal_dev=devnum When the external journal device's major/minor numbers | ||
26 | have changed, this option allows to specify the new | ||
27 | journal location. The journal device is identified | ||
28 | through its new major/minor numbers encoded in devnum. | ||
29 | |||
25 | noload Don't load the journal on mounting. | 30 | noload Don't load the journal on mounting. |
26 | 31 | ||
27 | data=journal All data are committed into the journal prior | 32 | data=journal All data are committed into the journal prior |
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index d4773565ea2f..a4dcf42c2fd9 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt | |||
@@ -1302,6 +1302,23 @@ VM has token based thrashing control mechanism and uses the token to prevent | |||
1302 | unnecessary page faults in thrashing situation. The unit of the value is | 1302 | unnecessary page faults in thrashing situation. The unit of the value is |
1303 | second. The value would be useful to tune thrashing behavior. | 1303 | second. The value would be useful to tune thrashing behavior. |
1304 | 1304 | ||
1305 | drop_caches | ||
1306 | ----------- | ||
1307 | |||
1308 | Writing to this will cause the kernel to drop clean caches, dentries and | ||
1309 | inodes from memory, causing that memory to become free. | ||
1310 | |||
1311 | To free pagecache: | ||
1312 | echo 1 > /proc/sys/vm/drop_caches | ||
1313 | To free dentries and inodes: | ||
1314 | echo 2 > /proc/sys/vm/drop_caches | ||
1315 | To free pagecache, dentries and inodes: | ||
1316 | echo 3 > /proc/sys/vm/drop_caches | ||
1317 | |||
1318 | As this is a non-destructive operation and dirty objects are not freeable, the | ||
1319 | user should run `sync' first. | ||
1320 | |||
1321 | |||
1305 | 2.5 /proc/sys/dev - Device specific parameters | 1322 | 2.5 /proc/sys/dev - Device specific parameters |
1306 | ---------------------------------------------- | 1323 | ---------------------------------------------- |
1307 | 1324 | ||
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt index b3404a032596..60ab61e54e8a 100644 --- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt | |||
@@ -143,12 +143,26 @@ as the following example: | |||
143 | dir /mnt 755 0 0 | 143 | dir /mnt 755 0 0 |
144 | file /init initramfs/init.sh 755 0 0 | 144 | file /init initramfs/init.sh 755 0 0 |
145 | 145 | ||
146 | Run "usr/gen_init_cpio" (after the kernel build) to get a usage message | ||
147 | documenting the above file format. | ||
148 | |||
146 | One advantage of the text file is that root access is not required to | 149 | One advantage of the text file is that root access is not required to |
147 | set permissions or create device nodes in the new archive. (Note that those | 150 | set permissions or create device nodes in the new archive. (Note that those |
148 | two example "file" entries expect to find files named "init.sh" and "busybox" in | 151 | two example "file" entries expect to find files named "init.sh" and "busybox" in |
149 | a directory called "initramfs", under the linux-2.6.* directory. See | 152 | a directory called "initramfs", under the linux-2.6.* directory. See |
150 | Documentation/early-userspace/README for more details.) | 153 | Documentation/early-userspace/README for more details.) |
151 | 154 | ||
155 | The kernel does not depend on external cpio tools, gen_init_cpio is created | ||
156 | from usr/gen_init_cpio.c which is entirely self-contained, and the kernel's | ||
157 | boot-time extractor is also (obviously) self-contained. However, if you _do_ | ||
158 | happen to have cpio installed, the following command line can extract the | ||
159 | generated cpio image back into its component files: | ||
160 | |||
161 | cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames | ||
162 | |||
163 | Contents of initramfs: | ||
164 | ---------------------- | ||
165 | |||
152 | If you don't already understand what shared libraries, devices, and paths | 166 | If you don't already understand what shared libraries, devices, and paths |
153 | you need to get a minimal root filesystem up and running, here are some | 167 | you need to get a minimal root filesystem up and running, here are some |
154 | references: | 168 | references: |
@@ -161,13 +175,69 @@ designed to be a tiny C library to statically link early userspace | |||
161 | code against, along with some related utilities. It is BSD licensed. | 175 | code against, along with some related utilities. It is BSD licensed. |
162 | 176 | ||
163 | I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) | 177 | I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) |
164 | myself. These are LGPL and GPL, respectively. | 178 | myself. These are LGPL and GPL, respectively. (A self-contained initramfs |
179 | package is planned for the busybox 1.2 release.) | ||
165 | 180 | ||
166 | In theory you could use glibc, but that's not well suited for small embedded | 181 | In theory you could use glibc, but that's not well suited for small embedded |
167 | uses like this. (A "hello world" program statically linked against glibc is | 182 | uses like this. (A "hello world" program statically linked against glibc is |
168 | over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do | 183 | over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do |
169 | name lookups, even when otherwise statically linked.) | 184 | name lookups, even when otherwise statically linked.) |
170 | 185 | ||
186 | Why cpio rather than tar? | ||
187 | ------------------------- | ||
188 | |||
189 | This decision was made back in December, 2001. The discussion started here: | ||
190 | |||
191 | http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html | ||
192 | |||
193 | And spawned a second thread (specifically on tar vs cpio), starting here: | ||
194 | |||
195 | http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html | ||
196 | |||
197 | The quick and dirty summary version (which is no substitute for reading | ||
198 | the above threads) is: | ||
199 | |||
200 | 1) cpio is a standard. It's decades old (from the AT&T days), and already | ||
201 | widely used on Linux (inside RPM, Red Hat's device driver disks). Here's | ||
202 | a Linux Journal article about it from 1996: | ||
203 | |||
204 | http://www.linuxjournal.com/article/1213 | ||
205 | |||
206 | It's not as popular as tar because the traditional cpio command line tools | ||
207 | require _truly_hideous_ command line arguments. But that says nothing | ||
208 | either way about the archive format, and there are alternative tools, | ||
209 | such as: | ||
210 | |||
211 | http://freshmeat.net/projects/afio/ | ||
212 | |||
213 | 2) The cpio archive format chosen by the kernel is simpler and cleaner (and | ||
214 | thus easier to create and parse) than any of the (literally dozens of) | ||
215 | various tar archive formats. The complete initramfs archive format is | ||
216 | explained in buffer-format.txt, created in usr/gen_init_cpio.c, and | ||
217 | extracted in init/initramfs.c. All three together come to less than 26k | ||
218 | total of human-readable text. | ||
219 | |||
220 | 3) The GNU project standardizing on tar is approximately as relevant as | ||
221 | Windows standardizing on zip. Linux is not part of either, and is free | ||
222 | to make its own technical decisions. | ||
223 | |||
224 | 4) Since this is a kernel internal format, it could easily have been | ||
225 | something brand new. The kernel provides its own tools to create and | ||
226 | extract this format anyway. Using an existing standard was preferable, | ||
227 | but not essential. | ||
228 | |||
229 | 5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be | ||
230 | supported on the kernel side"): | ||
231 | |||
232 | http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html | ||
233 | |||
234 | explained his reasoning: | ||
235 | |||
236 | http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html | ||
237 | http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html | ||
238 | |||
239 | and, most importantly, designed and implemented the initramfs code. | ||
240 | |||
171 | Future directions: | 241 | Future directions: |
172 | ------------------ | 242 | ------------------ |
173 | 243 | ||
diff --git a/Documentation/filesystems/relayfs.txt b/Documentation/filesystems/relayfs.txt index d803abed29f0..5832377b7340 100644 --- a/Documentation/filesystems/relayfs.txt +++ b/Documentation/filesystems/relayfs.txt | |||
@@ -44,30 +44,41 @@ relayfs can operate in a mode where it will overwrite data not yet | |||
44 | collected by userspace, and not wait for it to consume it. | 44 | collected by userspace, and not wait for it to consume it. |
45 | 45 | ||
46 | relayfs itself does not provide for communication of such data between | 46 | relayfs itself does not provide for communication of such data between |
47 | userspace and kernel, allowing the kernel side to remain simple and not | 47 | userspace and kernel, allowing the kernel side to remain simple and |
48 | impose a single interface on userspace. It does provide a separate | 48 | not impose a single interface on userspace. It does provide a set of |
49 | helper though, described below. | 49 | examples and a separate helper though, described below. |
50 | |||
51 | klog and relay-apps example code | ||
52 | ================================ | ||
53 | |||
54 | relayfs itself is ready to use, but to make things easier, a couple | ||
55 | simple utility functions and a set of examples are provided. | ||
56 | |||
57 | The relay-apps example tarball, available on the relayfs sourceforge | ||
58 | site, contains a set of self-contained examples, each consisting of a | ||
59 | pair of .c files containing boilerplate code for each of the user and | ||
60 | kernel sides of a relayfs application; combined these two sets of | ||
61 | boilerplate code provide glue to easily stream data to disk, without | ||
62 | having to bother with mundane housekeeping chores. | ||
63 | |||
64 | The 'klog debugging functions' patch (klog.patch in the relay-apps | ||
65 | tarball) provides a couple of high-level logging functions to the | ||
66 | kernel which allow writing formatted text or raw data to a channel, | ||
67 | regardless of whether a channel to write into exists or not, or | ||
68 | whether relayfs is compiled into the kernel or is configured as a | ||
69 | module. These functions allow you to put unconditional 'trace' | ||
70 | statements anywhere in the kernel or kernel modules; only when there | ||
71 | is a 'klog handler' registered will data actually be logged (see the | ||
72 | klog and kleak examples for details). | ||
73 | |||
74 | It is of course possible to use relayfs from scratch i.e. without | ||
75 | using any of the relay-apps example code or klog, but you'll have to | ||
76 | implement communication between userspace and kernel, allowing both to | ||
77 | convey the state of buffers (full, empty, amount of padding). | ||
78 | |||
79 | klog and the relay-apps examples can be found in the relay-apps | ||
80 | tarball on http://relayfs.sourceforge.net | ||
50 | 81 | ||
51 | klog, relay-app & librelay | ||
52 | ========================== | ||
53 | |||
54 | relayfs itself is ready to use, but to make things easier, two | ||
55 | additional systems are provided. klog is a simple wrapper to make | ||
56 | writing formatted text or raw data to a channel simpler, regardless of | ||
57 | whether a channel to write into exists or not, or whether relayfs is | ||
58 | compiled into the kernel or is configured as a module. relay-app is | ||
59 | the kernel counterpart of userspace librelay.c, combined these two | ||
60 | files provide glue to easily stream data to disk, without having to | ||
61 | bother with housekeeping. klog and relay-app can be used together, | ||
62 | with klog providing high-level logging functions to the kernel and | ||
63 | relay-app taking care of kernel-user control and disk-logging chores. | ||
64 | |||
65 | It is possible to use relayfs without relay-app & librelay, but you'll | ||
66 | have to implement communication between userspace and kernel, allowing | ||
67 | both to convey the state of buffers (full, empty, amount of padding). | ||
68 | |||
69 | klog, relay-app and librelay can be found in the relay-apps tarball on | ||
70 | http://relayfs.sourceforge.net | ||
71 | 82 | ||
72 | The relayfs user space API | 83 | The relayfs user space API |
73 | ========================== | 84 | ========================== |
@@ -125,6 +136,8 @@ Here's a summary of the API relayfs provides to in-kernel clients: | |||
125 | relay_reset(chan) | 136 | relay_reset(chan) |
126 | relayfs_create_dir(name, parent) | 137 | relayfs_create_dir(name, parent) |
127 | relayfs_remove_dir(dentry) | 138 | relayfs_remove_dir(dentry) |
139 | relayfs_create_file(name, parent, mode, fops, data) | ||
140 | relayfs_remove_file(dentry) | ||
128 | 141 | ||
129 | channel management typically called on instigation of userspace: | 142 | channel management typically called on instigation of userspace: |
130 | 143 | ||
@@ -141,6 +154,8 @@ Here's a summary of the API relayfs provides to in-kernel clients: | |||
141 | subbuf_start(buf, subbuf, prev_subbuf, prev_padding) | 154 | subbuf_start(buf, subbuf, prev_subbuf, prev_padding) |
142 | buf_mapped(buf, filp) | 155 | buf_mapped(buf, filp) |
143 | buf_unmapped(buf, filp) | 156 | buf_unmapped(buf, filp) |
157 | create_buf_file(filename, parent, mode, buf, is_global) | ||
158 | remove_buf_file(dentry) | ||
144 | 159 | ||
145 | helper functions: | 160 | helper functions: |
146 | 161 | ||
@@ -320,6 +335,71 @@ forces a sub-buffer switch on all the channel buffers, and can be used | |||
320 | to finalize and process the last sub-buffers before the channel is | 335 | to finalize and process the last sub-buffers before the channel is |
321 | closed. | 336 | closed. |
322 | 337 | ||
338 | Creating non-relay files | ||
339 | ------------------------ | ||
340 | |||
341 | relay_open() automatically creates files in the relayfs filesystem to | ||
342 | represent the per-cpu kernel buffers; it's often useful for | ||
343 | applications to be able to create their own files alongside the relay | ||
344 | files in the relayfs filesystem as well e.g. 'control' files much like | ||
345 | those created in /proc or debugfs for similar purposes, used to | ||
346 | communicate control information between the kernel and user sides of a | ||
347 | relayfs application. For this purpose the relayfs_create_file() and | ||
348 | relayfs_remove_file() API functions exist. For relayfs_create_file(), | ||
349 | the caller passes in a set of user-defined file operations to be used | ||
350 | for the file and an optional void * to a user-specified data item, | ||
351 | which will be accessible via inode->u.generic_ip (see the relay-apps | ||
352 | tarball for examples). The file_operations are a required parameter | ||
353 | to relayfs_create_file() and thus the semantics of these files are | ||
354 | completely defined by the caller. | ||
355 | |||
356 | See the relay-apps tarball at http://relayfs.sourceforge.net for | ||
357 | examples of how these non-relay files are meant to be used. | ||
358 | |||
359 | Creating relay files in other filesystems | ||
360 | ----------------------------------------- | ||
361 | |||
362 | By default of course, relay_open() creates relay files in the relayfs | ||
363 | filesystem. Because relay_file_operations is exported, however, it's | ||
364 | also possible to create and use relay files in other pseudo-filesytems | ||
365 | such as debugfs. | ||
366 | |||
367 | For this purpose, two callback functions are provided, | ||
368 | create_buf_file() and remove_buf_file(). create_buf_file() is called | ||
369 | once for each per-cpu buffer from relay_open() to allow the client to | ||
370 | create a file to be used to represent the corresponding buffer; if | ||
371 | this callback is not defined, the default implementation will create | ||
372 | and return a file in the relayfs filesystem to represent the buffer. | ||
373 | The callback should return the dentry of the file created to represent | ||
374 | the relay buffer. Note that the parent directory passed to | ||
375 | relay_open() (and passed along to the callback), if specified, must | ||
376 | exist in the same filesystem the new relay file is created in. If | ||
377 | create_buf_file() is defined, remove_buf_file() must also be defined; | ||
378 | it's responsible for deleting the file(s) created in create_buf_file() | ||
379 | and is called during relay_close(). | ||
380 | |||
381 | The create_buf_file() implementation can also be defined in such a way | ||
382 | as to allow the creation of a single 'global' buffer instead of the | ||
383 | default per-cpu set. This can be useful for applications interested | ||
384 | mainly in seeing the relative ordering of system-wide events without | ||
385 | the need to bother with saving explicit timestamps for the purpose of | ||
386 | merging/sorting per-cpu files in a postprocessing step. | ||
387 | |||
388 | To have relay_open() create a global buffer, the create_buf_file() | ||
389 | implementation should set the value of the is_global outparam to a | ||
390 | non-zero value in addition to creating the file that will be used to | ||
391 | represent the single buffer. In the case of a global buffer, | ||
392 | create_buf_file() and remove_buf_file() will be called only once. The | ||
393 | normal channel-writing functions e.g. relay_write() can still be used | ||
394 | - writes from any cpu will transparently end up in the global buffer - | ||
395 | but since it is a global buffer, callers should make sure they use the | ||
396 | proper locking for such a buffer, either by wrapping writes in a | ||
397 | spinlock, or by copying a write function from relayfs_fs.h and | ||
398 | creating a local version that internally does the proper locking. | ||
399 | |||
400 | See the 'exported-relayfile' examples in the relay-apps tarball for | ||
401 | examples of creating and using relay files in debugfs. | ||
402 | |||
323 | Misc | 403 | Misc |
324 | ---- | 404 | ---- |
325 | 405 | ||
diff --git a/Documentation/keys-request-key.txt b/Documentation/keys-request-key.txt index 5f2b9c5edbb5..22488d791168 100644 --- a/Documentation/keys-request-key.txt +++ b/Documentation/keys-request-key.txt | |||
@@ -56,10 +56,12 @@ A request proceeds in the following manner: | |||
56 | (4) request_key() then forks and executes /sbin/request-key with a new session | 56 | (4) request_key() then forks and executes /sbin/request-key with a new session |
57 | keyring that contains a link to auth key V. | 57 | keyring that contains a link to auth key V. |
58 | 58 | ||
59 | (5) /sbin/request-key execs an appropriate program to perform the actual | 59 | (5) /sbin/request-key assumes the authority associated with key U. |
60 | |||
61 | (6) /sbin/request-key execs an appropriate program to perform the actual | ||
60 | instantiation. | 62 | instantiation. |
61 | 63 | ||
62 | (6) The program may want to access another key from A's context (say a | 64 | (7) The program may want to access another key from A's context (say a |
63 | Kerberos TGT key). It just requests the appropriate key, and the keyring | 65 | Kerberos TGT key). It just requests the appropriate key, and the keyring |
64 | search notes that the session keyring has auth key V in its bottom level. | 66 | search notes that the session keyring has auth key V in its bottom level. |
65 | 67 | ||
@@ -67,19 +69,19 @@ A request proceeds in the following manner: | |||
67 | UID, GID, groups and security info of process A as if it was process A, | 69 | UID, GID, groups and security info of process A as if it was process A, |
68 | and come up with key W. | 70 | and come up with key W. |
69 | 71 | ||
70 | (7) The program then does what it must to get the data with which to | 72 | (8) The program then does what it must to get the data with which to |
71 | instantiate key U, using key W as a reference (perhaps it contacts a | 73 | instantiate key U, using key W as a reference (perhaps it contacts a |
72 | Kerberos server using the TGT) and then instantiates key U. | 74 | Kerberos server using the TGT) and then instantiates key U. |
73 | 75 | ||
74 | (8) Upon instantiating key U, auth key V is automatically revoked so that it | 76 | (9) Upon instantiating key U, auth key V is automatically revoked so that it |
75 | may not be used again. | 77 | may not be used again. |
76 | 78 | ||
77 | (9) The program then exits 0 and request_key() deletes key V and returns key | 79 | (10) The program then exits 0 and request_key() deletes key V and returns key |
78 | U to the caller. | 80 | U to the caller. |
79 | 81 | ||
80 | This also extends further. If key W (step 5 above) didn't exist, key W would be | 82 | This also extends further. If key W (step 7 above) didn't exist, key W would be |
81 | created uninstantiated, another auth key (X) would be created [as per step 3] | 83 | created uninstantiated, another auth key (X) would be created (as per step 3) |
82 | and another copy of /sbin/request-key spawned [as per step 4]; but the context | 84 | and another copy of /sbin/request-key spawned (as per step 4); but the context |
83 | specified by auth key X will still be process A, as it was in auth key V. | 85 | specified by auth key X will still be process A, as it was in auth key V. |
84 | 86 | ||
85 | This is because process A's keyrings can't simply be attached to | 87 | This is because process A's keyrings can't simply be attached to |
@@ -138,8 +140,8 @@ until one succeeds: | |||
138 | 140 | ||
139 | (3) The process's session keyring is searched. | 141 | (3) The process's session keyring is searched. |
140 | 142 | ||
141 | (4) If the process has a request_key() authorisation key in its session | 143 | (4) If the process has assumed the authority associated with a request_key() |
142 | keyring then: | 144 | authorisation key then: |
143 | 145 | ||
144 | (a) If extant, the calling process's thread keyring is searched. | 146 | (a) If extant, the calling process's thread keyring is searched. |
145 | 147 | ||
diff --git a/Documentation/keys.txt b/Documentation/keys.txt index 6304db59bfe4..aaa01b0e3ee9 100644 --- a/Documentation/keys.txt +++ b/Documentation/keys.txt | |||
@@ -308,6 +308,8 @@ process making the call: | |||
308 | KEY_SPEC_USER_KEYRING -4 UID-specific keyring | 308 | KEY_SPEC_USER_KEYRING -4 UID-specific keyring |
309 | KEY_SPEC_USER_SESSION_KEYRING -5 UID-session keyring | 309 | KEY_SPEC_USER_SESSION_KEYRING -5 UID-session keyring |
310 | KEY_SPEC_GROUP_KEYRING -6 GID-specific keyring | 310 | KEY_SPEC_GROUP_KEYRING -6 GID-specific keyring |
311 | KEY_SPEC_REQKEY_AUTH_KEY -7 assumed request_key() | ||
312 | authorisation key | ||
311 | 313 | ||
312 | 314 | ||
313 | The main syscalls are: | 315 | The main syscalls are: |
@@ -498,7 +500,11 @@ The keyctl syscall functions are: | |||
498 | keyring is full, error ENFILE will result. | 500 | keyring is full, error ENFILE will result. |
499 | 501 | ||
500 | The link procedure checks the nesting of the keyrings, returning ELOOP if | 502 | The link procedure checks the nesting of the keyrings, returning ELOOP if |
501 | it appears to deep or EDEADLK if the link would introduce a cycle. | 503 | it appears too deep or EDEADLK if the link would introduce a cycle. |
504 | |||
505 | Any links within the keyring to keys that match the new key in terms of | ||
506 | type and description will be discarded from the keyring as the new one is | ||
507 | added. | ||
502 | 508 | ||
503 | 509 | ||
504 | (*) Unlink a key or keyring from another keyring: | 510 | (*) Unlink a key or keyring from another keyring: |
@@ -628,6 +634,41 @@ The keyctl syscall functions are: | |||
628 | there is one, otherwise the user default session keyring. | 634 | there is one, otherwise the user default session keyring. |
629 | 635 | ||
630 | 636 | ||
637 | (*) Set the timeout on a key. | ||
638 | |||
639 | long keyctl(KEYCTL_SET_TIMEOUT, key_serial_t key, unsigned timeout); | ||
640 | |||
641 | This sets or clears the timeout on a key. The timeout can be 0 to clear | ||
642 | the timeout or a number of seconds to set the expiry time that far into | ||
643 | the future. | ||
644 | |||
645 | The process must have attribute modification access on a key to set its | ||
646 | timeout. Timeouts may not be set with this function on negative, revoked | ||
647 | or expired keys. | ||
648 | |||
649 | |||
650 | (*) Assume the authority granted to instantiate a key | ||
651 | |||
652 | long keyctl(KEYCTL_ASSUME_AUTHORITY, key_serial_t key); | ||
653 | |||
654 | This assumes or divests the authority required to instantiate the | ||
655 | specified key. Authority can only be assumed if the thread has the | ||
656 | authorisation key associated with the specified key in its keyrings | ||
657 | somewhere. | ||
658 | |||
659 | Once authority is assumed, searches for keys will also search the | ||
660 | requester's keyrings using the requester's security label, UID, GID and | ||
661 | groups. | ||
662 | |||
663 | If the requested authority is unavailable, error EPERM will be returned, | ||
664 | likewise if the authority has been revoked because the target key is | ||
665 | already instantiated. | ||
666 | |||
667 | If the specified key is 0, then any assumed authority will be divested. | ||
668 | |||
669 | The assumed authorititive key is inherited across fork and exec. | ||
670 | |||
671 | |||
631 | =============== | 672 | =============== |
632 | KERNEL SERVICES | 673 | KERNEL SERVICES |
633 | =============== | 674 | =============== |
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 2f1aae32a5d9..6910c0136f8d 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt | |||
@@ -26,12 +26,13 @@ Currently, these files are in /proc/sys/vm: | |||
26 | - min_free_kbytes | 26 | - min_free_kbytes |
27 | - laptop_mode | 27 | - laptop_mode |
28 | - block_dump | 28 | - block_dump |
29 | - drop-caches | ||
29 | 30 | ||
30 | ============================================================== | 31 | ============================================================== |
31 | 32 | ||
32 | dirty_ratio, dirty_background_ratio, dirty_expire_centisecs, | 33 | dirty_ratio, dirty_background_ratio, dirty_expire_centisecs, |
33 | dirty_writeback_centisecs, vfs_cache_pressure, laptop_mode, | 34 | dirty_writeback_centisecs, vfs_cache_pressure, laptop_mode, |
34 | block_dump, swap_token_timeout: | 35 | block_dump, swap_token_timeout, drop-caches: |
35 | 36 | ||
36 | See Documentation/filesystems/proc.txt | 37 | See Documentation/filesystems/proc.txt |
37 | 38 | ||
@@ -102,3 +103,20 @@ This is used to force the Linux VM to keep a minimum number | |||
102 | of kilobytes free. The VM uses this number to compute a pages_min | 103 | of kilobytes free. The VM uses this number to compute a pages_min |
103 | value for each lowmem zone in the system. Each lowmem zone gets | 104 | value for each lowmem zone in the system. Each lowmem zone gets |
104 | a number of reserved free pages based proportionally on its size. | 105 | a number of reserved free pages based proportionally on its size. |
106 | |||
107 | ============================================================== | ||
108 | |||
109 | percpu_pagelist_fraction | ||
110 | |||
111 | This is the fraction of pages at most (high mark pcp->high) in each zone that | ||
112 | are allocated for each per cpu page list. The min value for this is 8. It | ||
113 | means that we don't allow more than 1/8th of pages in each zone to be | ||
114 | allocated in any single per_cpu_pagelist. This entry only changes the value | ||
115 | of hot per cpu pagelists. User can specify a number like 100 to allocate | ||
116 | 1/100th of each zone to each per cpu page list. | ||
117 | |||
118 | The batch value of each per cpu pagelist is also updated as a result. It is | ||
119 | set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) | ||
120 | |||
121 | The initial value is zero. Kernel does not use this value at boot time to set | ||
122 | the high water marks for each per cpu page list. | ||