Diffstat (limited to 'Documentation')
57 files changed, 3580 insertions, 431 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index f28a24e0279b..433cf5e9ae04 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -46,6 +46,8 @@ SubmittingPatches | |||
46 | - procedure to get a source patch included into the kernel tree. | 46 | - procedure to get a source patch included into the kernel tree. |
47 | VGA-softcursor.txt | 47 | VGA-softcursor.txt |
48 | - how to change your VGA cursor from a blinking underscore. | 48 | - how to change your VGA cursor from a blinking underscore. |
49 | applying-patches.txt | ||
50 | - description of various trees and how to apply their patches. | ||
49 | arm/ | 51 | arm/ |
50 | - directory with info about Linux on the ARM architecture. | 52 | - directory with info about Linux on the ARM architecture. |
51 | basic_profiling.txt | 53 | basic_profiling.txt |
@@ -275,7 +277,7 @@ tty.txt | |||
275 | unicode.txt | 277 | unicode.txt |
276 | - info on the Unicode character/font mapping used in Linux. | 278 | - info on the Unicode character/font mapping used in Linux. |
277 | uml/ | 279 | uml/ |
278 | - directory with infomation about User Mode Linux. | 280 | - directory with information about User Mode Linux. |
279 | usb/ | 281 | usb/ |
280 | - directory with info regarding the Universal Serial Bus. | 282 | - directory with info regarding the Universal Serial Bus. |
281 | video4linux/ | 283 | video4linux/ |
diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle
index f25b3953f513..22e5f9036f3c 100644
--- a/Documentation/CodingStyle
+++ b/Documentation/CodingStyle
@@ -236,6 +236,9 @@ ugly), but try to avoid excess. Instead, put the comments at the head | |||
236 | of the function, telling people what it does, and possibly WHY it does | 236 | of the function, telling people what it does, and possibly WHY it does |
237 | it. | 237 | it. |
238 | 238 | ||
239 | When commenting the kernel API functions, please use the kerneldoc format. | ||
240 | See the files Documentation/kernel-doc-nano-HOWTO.txt and scripts/kernel-doc | ||
241 | for details. | ||
239 | 242 | ||
240 | Chapter 8: You've made a mess of it | 243 | Chapter 8: You've made a mess of it |
241 | 244 | ||
diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index 6ee3cd6134df..1af0f2d50220 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -121,7 +121,7 @@ pool's device. | |||
121 | dma_addr_t addr); | 121 | dma_addr_t addr); |
122 | 122 | ||
123 | This puts memory back into the pool. The pool is what was passed to | 123 | This puts memory back into the pool. The pool is what was passed to |
124 | the the pool allocation routine; the cpu and dma addresses are what | 124 | the pool allocation routine; the cpu and dma addresses are what |
125 | were returned when that routine allocated the memory being freed. | 125 | were returned when that routine allocated the memory being freed. |
126 | 126 | ||
127 | 127 | ||
diff --git a/Documentation/DMA-ISA-LPC.txt b/Documentation/DMA-ISA-LPC.txt
new file mode 100644
index 000000000000..705f6be92bdb
--- /dev/null
+++ b/Documentation/DMA-ISA-LPC.txt
@@ -0,0 +1,151 @@ | |||
1 | DMA with ISA and LPC devices | ||
2 | ============================ | ||
3 | |||
4 | Pierre Ossman <drzeus@drzeus.cx> | ||
5 | |||
6 | This document describes how to do DMA transfers using the old ISA DMA | ||
7 | controller. Even though ISA is more or less dead today the LPC bus | ||
8 | uses the same DMA system so it will be around for quite some time. | ||
9 | |||
10 | Part I - Headers and dependencies | ||
11 | --------------------------------- | ||
12 | |||
13 | To do ISA style DMA you need to include two headers: | ||
14 | |||
15 | #include <linux/dma-mapping.h> | ||
16 | #include <asm/dma.h> | ||
17 | |||
18 | The first is the generic DMA API used to convert virtual addresses to | ||
19 | physical addresses (see Documentation/DMA-API.txt for details). | ||
20 | |||
21 | The second contains the routines specific to ISA DMA transfers. Since | ||
22 | this is not present on all platforms make sure you construct your | ||
23 | Kconfig to be dependent on ISA_DMA_API (not ISA) so that nobody tries | ||
24 | to build your driver on unsupported platforms. | ||
25 | |||
26 | Part II - Buffer allocation | ||
27 | --------------------------- | ||
28 | |||
29 | The ISA DMA controller has some very strict requirements on which | ||
30 | memory it can access so extra care must be taken when allocating | ||
31 | buffers. | ||
32 | |||
33 | (You usually need a special buffer for DMA transfers instead of | ||
34 | transferring directly to and from your normal data structures.) | ||
35 | |||
36 | The DMA-able address space is the lowest 16 MB of _physical_ memory. | ||
37 | Also the transfer block may not cross page boundaries (which are 64 | ||
38 | or 128 KiB depending on which channel you use). | ||
39 | |||
40 | In order to allocate a piece of memory that satisfies all these | ||
41 | requirements you pass the flag GFP_DMA to kmalloc. | ||
42 | |||
43 | Unfortunately the memory available for ISA DMA is scarce so unless you | ||
44 | allocate the memory during boot-up it's a good idea to also pass | ||
45 | __GFP_REPEAT and __GFP_NOWARN to make the allocator try a bit harder. | ||
46 | |||
47 | (This scarcity also means that you should allocate the buffer as | ||
48 | early as possible and not release it until the driver is unloaded.) | ||
49 | |||
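The two placement rules in Part II (below 16 MB, no crossing of a DMA
page) can be checked with plain arithmetic. The following is a minimal
userspace sketch, not kernel code: isa_dma_page_size() and
isa_dma_range_ok() are hypothetical helper names, and the mapping of
64 KiB pages to channels 0-3 and 128 KiB pages to channels 4-7 is an
assumption based on the 8-bit/16-bit channel split.

```c
#include <stdbool.h>
#include <stdint.h>

/* ISA DMA can only reach the lowest 16 MB of physical memory. */
#define ISA_DMA_LIMIT (16u * 1024 * 1024)

/*
 * Assumed DMA page size: 64 KiB for the 8-bit channels (0-3),
 * 128 KiB for the 16-bit channels (4-7).
 */
static uint32_t isa_dma_page_size(unsigned int channel)
{
	return channel <= 3 ? 64u * 1024 : 128u * 1024;
}

/*
 * Hypothetical helper, not a kernel API: returns true if the physical
 * range [phys, phys + len) satisfies both constraints for the channel.
 */
static bool isa_dma_range_ok(uint64_t phys, uint32_t len, unsigned int channel)
{
	uint32_t page = isa_dma_page_size(channel);

	if (len == 0 || channel > 7)
		return false;
	if (phys + len > ISA_DMA_LIMIT)
		return false;
	/* The first and last byte must fall in the same DMA page. */
	return phys / page == (phys + len - 1) / page;
}
```

In a driver, kmalloc() with GFP_DMA is what is meant to satisfy these
rules; a check like this is only useful as a debugging assertion.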
50 | Part III - Address translation | ||
51 | ------------------------------ | ||
52 | |||
53 | To translate the virtual address to a physical one, use the normal DMA | ||
54 | API. Do _not_ use isa_virt_to_phys() even though it does the same | ||
55 | thing. The reason for this is that the function isa_virt_to_phys() | ||
56 | will require a Kconfig dependency to ISA, not just ISA_DMA_API which | ||
57 | is really all you need. Remember that even though the DMA controller | ||
58 | has its origins in ISA it is used elsewhere. | ||
59 | |||
60 | Note: x86_64 had a broken DMA API when it came to ISA but has since | ||
61 | been fixed. If your arch has problems then fix the DMA API instead of | ||
62 | reverting to the ISA functions. | ||
63 | |||
64 | Part IV - Channels | ||
65 | ------------------ | ||
66 | |||
67 | A normal ISA DMA controller has 8 channels. The lower four are for | ||
68 | 8-bit transfers and the upper four are for 16-bit transfers. | ||
69 | |||
70 | (Actually the DMA controller is really two separate controllers where | ||
71 | channel 4 is used to give DMA access for the second controller (0-3). | ||
72 | This means that of the four 16-bit channels only three are usable.) | ||
73 | |||
74 | You allocate these in a similar fashion as all basic resources: | ||
75 | |||
76 | extern int request_dma(unsigned int dmanr, const char * device_id); | ||
77 | extern void free_dma(unsigned int dmanr); | ||
78 | |||
79 | The ability to use 16-bit or 8-bit transfers is _not_ up to you as a | ||
80 | driver author but depends on what the hardware supports. Check your | ||
81 | specs or test different channels. | ||
82 | |||
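The channel layout described in Part IV can be summarised as a small
lookup. This is an illustrative userspace sketch only; the enum and
isa_dma_channel_kind() are hypothetical names, not part of asm/dma.h.

```c
/* Classification of the eight ISA DMA channels (hypothetical names). */
enum isa_dma_kind {
	ISA_DMA_INVALID,	/* not a channel number at all */
	ISA_DMA_8BIT,		/* channels 0-3 */
	ISA_DMA_16BIT,		/* channels 5-7 */
	ISA_DMA_CASCADE		/* channel 4, links the two controllers */
};

static enum isa_dma_kind isa_dma_channel_kind(unsigned int channel)
{
	if (channel > 7)
		return ISA_DMA_INVALID;
	if (channel == 4)
		return ISA_DMA_CASCADE;	/* reserved, not usable by drivers */
	return channel <= 3 ? ISA_DMA_8BIT : ISA_DMA_16BIT;
}
```

A driver would still call request_dma() for the channel its hardware is
wired to; the table only explains why channel 4 is never handed out.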
83 | Part V - Transfer data | ||
84 | ---------------------- | ||
85 | |||
86 | Now for the good stuff, the actual DMA transfer. :) | ||
87 | |||
88 | Before you use any ISA DMA routines you need to claim the DMA lock | ||
89 | using claim_dma_lock(). The reason is that some DMA operations are | ||
90 | not atomic so only one driver may fiddle with the registers at a | ||
91 | time. | ||
92 | |||
93 | The first time you use the DMA controller you should call | ||
94 | clear_dma_ff(). This clears an internal register in the DMA | ||
95 | controller that is used for the non-atomic operations. As long as you | ||
96 | (and everyone else) use the locking functions then you only need to | ||
97 | reset this once. | ||
98 | |||
99 | Next, you tell the controller in which direction you intend to do the | ||
100 | transfer using set_dma_mode(). Currently you have the options | ||
101 | DMA_MODE_READ and DMA_MODE_WRITE. | ||
102 | |||
103 | Set the address from where the transfer should start (this needs to | ||
104 | be 16-bit aligned for 16-bit transfers) and how many bytes to | ||
105 | transfer. Note that it's _bytes_. The DMA routines will do all the | ||
106 | required translation to values that the DMA controller understands. | ||
107 | |||
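To illustrate the kind of translation the DMA routines perform on the
byte count, here is a userspace model. It assumes behaviour along the
lines of the x86 set_dma_count(): the controller transfers one more
unit than the programmed value, and the 16-bit channels count words
rather than bytes. isa_dma_count_reg() is a hypothetical name, and other
architectures may differ.

```c
#include <stdint.h>

/*
 * Hypothetical model of the value programmed into the controller's
 * count register for a transfer of num_bytes bytes. Assumes
 * num_bytes > 0, and an even num_bytes on the 16-bit channels.
 */
static uint16_t isa_dma_count_reg(unsigned int channel, uint32_t num_bytes)
{
	uint32_t count = num_bytes - 1;	/* controller moves count + 1 units */

	if (channel <= 3)
		return (uint16_t)count;		/* 8-bit channels count bytes */
	return (uint16_t)(count >> 1);		/* 16-bit channels count words */
}
```

This is why the driver always passes bytes: the per-channel unit
conversion stays inside the DMA routines.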
108 | The final step is enabling the DMA channel and releasing the DMA | ||
109 | lock. | ||
110 | |||
111 | Once the DMA transfer is finished (or timed out) you should disable | ||
112 | the channel again. You should also check get_dma_residue() to make | ||
113 | sure that all data has been transferred. | ||
114 | |||
115 | Example: | ||
116 | |||
117 | unsigned long flags; | ||
118 | int residue; | ||
119 | |||
120 | flags = claim_dma_lock(); | ||
121 | clear_dma_ff(channel); | ||
122 | |||
123 | set_dma_mode(channel, DMA_MODE_WRITE); | ||
124 | set_dma_addr(channel, phys_addr); | ||
125 | set_dma_count(channel, num_bytes); | ||
126 | |||
127 | enable_dma(channel); | ||
128 | |||
129 | release_dma_lock(flags); | ||
130 | |||
131 | while (!device_done()); | ||
132 | |||
133 | flags = claim_dma_lock(); | ||
134 | |||
135 | disable_dma(channel); | ||
136 | |||
137 | residue = get_dma_residue(channel); | ||
138 | if (residue != 0) | ||
139 | printk(KERN_ERR "driver: Incomplete DMA transfer!" | ||
140 | " %d bytes left!\n", residue); | ||
141 | |||
142 | release_dma_lock(flags); | ||
143 | |||
144 | Part VI - Suspend/resume | ||
145 | ------------------------ | ||
146 | |||
147 | It is the driver's responsibility to make sure that the machine isn't | ||
148 | suspended while a DMA transfer is in progress. Also, all DMA settings | ||
149 | are lost when the system suspends so if your driver relies on the DMA | ||
150 | controller being in a certain state then you have to restore these | ||
151 | registers upon resume. | ||
diff --git a/Documentation/DocBook/journal-api.tmpl b/Documentation/DocBook/journal-api.tmpl
index 1ef6f43c6d8f..341aaa4ce481 100644
--- a/Documentation/DocBook/journal-api.tmpl
+++ b/Documentation/DocBook/journal-api.tmpl
@@ -116,7 +116,7 @@ filesystem. Almost. | |||
116 | 116 | ||
117 | You still need to actually journal your filesystem changes, this | 117 | You still need to actually journal your filesystem changes, this |
118 | is done by wrapping them into transactions. Additionally you | 118 | is done by wrapping them into transactions. Additionally you |
119 | also need to wrap the modification of each of the the buffers | 119 | also need to wrap the modification of each of the buffers |
120 | with calls to the journal layer, so it knows what the modifications | 120 | with calls to the journal layer, so it knows what the modifications |
121 | you are actually making are. To do this use journal_start() which | 121 | you are actually making are. To do this use journal_start() which |
122 | returns a transaction handle. | 122 | returns a transaction handle. |
@@ -128,7 +128,7 @@ and its counterpart journal_stop(), which indicates the end of a transaction | |||
128 | are nestable calls, so you can reenter a transaction if necessary, | 128 | are nestable calls, so you can reenter a transaction if necessary, |
129 | but remember you must call journal_stop() the same number of times as | 129 | but remember you must call journal_stop() the same number of times as |
130 | journal_start() before the transaction is completed (or more accurately | 130 | journal_start() before the transaction is completed (or more accurately |
131 | leaves the the update phase). Ext3/VFS makes use of this feature to simplify | 131 | leaves the update phase). Ext3/VFS makes use of this feature to simplify |
132 | quota support. | 132 | quota support. |
133 | </para> | 133 | </para> |
134 | 134 | ||
diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl
index 49a9ef82d575..6367bba32d22 100644
--- a/Documentation/DocBook/kernel-hacking.tmpl
+++ b/Documentation/DocBook/kernel-hacking.tmpl
@@ -8,8 +8,7 @@ | |||
8 | 8 | ||
9 | <authorgroup> | 9 | <authorgroup> |
10 | <author> | 10 | <author> |
11 | <firstname>Paul</firstname> | 11 | <firstname>Rusty</firstname> |
12 | <othername>Rusty</othername> | ||
13 | <surname>Russell</surname> | 12 | <surname>Russell</surname> |
14 | <affiliation> | 13 | <affiliation> |
15 | <address> | 14 | <address> |
@@ -20,7 +19,7 @@ | |||
20 | </authorgroup> | 19 | </authorgroup> |
21 | 20 | ||
22 | <copyright> | 21 | <copyright> |
23 | <year>2001</year> | 22 | <year>2005</year> |
24 | <holder>Rusty Russell</holder> | 23 | <holder>Rusty Russell</holder> |
25 | </copyright> | 24 | </copyright> |
26 | 25 | ||
@@ -64,7 +63,7 @@ | |||
64 | <chapter id="introduction"> | 63 | <chapter id="introduction"> |
65 | <title>Introduction</title> | 64 | <title>Introduction</title> |
66 | <para> | 65 | <para> |
67 | Welcome, gentle reader, to Rusty's Unreliable Guide to Linux | 66 | Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux |
68 | Kernel Hacking. This document describes the common routines and | 67 | Kernel Hacking. This document describes the common routines and |
69 | general requirements for kernel code: its goal is to serve as a | 68 | general requirements for kernel code: its goal is to serve as a |
70 | primer for Linux kernel development for experienced C | 69 | primer for Linux kernel development for experienced C |
@@ -96,13 +95,13 @@ | |||
96 | 95 | ||
97 | <listitem> | 96 | <listitem> |
98 | <para> | 97 | <para> |
99 | not associated with any process, serving a softirq, tasklet or bh; | 98 | not associated with any process, serving a softirq or tasklet; |
100 | </para> | 99 | </para> |
101 | </listitem> | 100 | </listitem> |
102 | 101 | ||
103 | <listitem> | 102 | <listitem> |
104 | <para> | 103 | <para> |
105 | running in kernel space, associated with a process; | 104 | running in kernel space, associated with a process (user context); |
106 | </para> | 105 | </para> |
107 | </listitem> | 106 | </listitem> |
108 | 107 | ||
@@ -114,11 +113,12 @@ | |||
114 | </itemizedlist> | 113 | </itemizedlist> |
115 | 114 | ||
116 | <para> | 115 | <para> |
117 | There is a strict ordering between these: other than the last | 116 | There is an ordering between these. The bottom two can preempt |
118 | category (userspace) each can only be pre-empted by those above. | 117 | each other, but above that is a strict hierarchy: each can only be |
119 | For example, while a softirq is running on a CPU, no other | 118 | preempted by the ones above it. For example, while a softirq is |
120 | softirq will pre-empt it, but a hardware interrupt can. However, | 119 | running on a CPU, no other softirq will preempt it, but a hardware |
121 | any other CPUs in the system execute independently. | 120 | interrupt can. However, any other CPUs in the system execute |
121 | independently. | ||
122 | </para> | 122 | </para> |
123 | 123 | ||
124 | <para> | 124 | <para> |
@@ -130,10 +130,10 @@ | |||
130 | <title>User Context</title> | 130 | <title>User Context</title> |
131 | 131 | ||
132 | <para> | 132 | <para> |
133 | User context is when you are coming in from a system call or | 133 | User context is when you are coming in from a system call or other |
134 | other trap: you can sleep, and you own the CPU (except for | 134 | trap: like userspace, you can be preempted by more important tasks |
135 | interrupts) until you call <function>schedule()</function>. | 135 | and by interrupts. You can sleep, by calling |
136 | In other words, user context (unlike userspace) is not pre-emptable. | 136 | <function>schedule()</function>. |
137 | </para> | 137 | </para> |
138 | 138 | ||
139 | <note> | 139 | <note> |
@@ -153,7 +153,7 @@ | |||
153 | 153 | ||
154 | <caution> | 154 | <caution> |
155 | <para> | 155 | <para> |
156 | Beware that if you have interrupts or bottom halves disabled | 156 | Beware that if you have preemption or softirqs disabled |
157 | (see below), <function>in_interrupt()</function> will return a | 157 | (see below), <function>in_interrupt()</function> will return a |
158 | false positive. | 158 | false positive. |
159 | </para> | 159 | </para> |
@@ -168,10 +168,10 @@ | |||
168 | <hardware>keyboard</hardware> are examples of real | 168 | <hardware>keyboard</hardware> are examples of real |
169 | hardware which produce interrupts at any time. The kernel runs | 169 | hardware which produce interrupts at any time. The kernel runs |
170 | interrupt handlers, which service the hardware. The kernel | 170 | interrupt handlers, which service the hardware. The kernel |
171 | guarantees that this handler is never re-entered: if another | 171 | guarantees that this handler is never re-entered: if the same |
172 | interrupt arrives, it is queued (or dropped). Because it | 172 | interrupt arrives, it is queued (or dropped). Because it |
173 | disables interrupts, this handler has to be fast: frequently it | 173 | disables interrupts, this handler has to be fast: frequently it |
174 | simply acknowledges the interrupt, marks a `software interrupt' | 174 | simply acknowledges the interrupt, marks a 'software interrupt' |
175 | for execution and exits. | 175 | for execution and exits. |
176 | </para> | 176 | </para> |
177 | 177 | ||
@@ -188,60 +188,52 @@ | |||
188 | </sect1> | 188 | </sect1> |
189 | 189 | ||
190 | <sect1 id="basics-softirqs"> | 190 | <sect1 id="basics-softirqs"> |
191 | <title>Software Interrupt Context: Bottom Halves, Tasklets, softirqs</title> | 191 | <title>Software Interrupt Context: Softirqs and Tasklets</title> |
192 | 192 | ||
193 | <para> | 193 | <para> |
194 | Whenever a system call is about to return to userspace, or a | 194 | Whenever a system call is about to return to userspace, or a |
195 | hardware interrupt handler exits, any `software interrupts' | 195 | hardware interrupt handler exits, any 'software interrupts' |
196 | which are marked pending (usually by hardware interrupts) are | 196 | which are marked pending (usually by hardware interrupts) are |
197 | run (<filename>kernel/softirq.c</filename>). | 197 | run (<filename>kernel/softirq.c</filename>). |
198 | </para> | 198 | </para> |
199 | 199 | ||
200 | <para> | 200 | <para> |
201 | Much of the real interrupt handling work is done here. Early in | 201 | Much of the real interrupt handling work is done here. Early in |
202 | the transition to <acronym>SMP</acronym>, there were only `bottom | 202 | the transition to <acronym>SMP</acronym>, there were only 'bottom |
203 | halves' (BHs), which didn't take advantage of multiple CPUs. Shortly | 203 | halves' (BHs), which didn't take advantage of multiple CPUs. Shortly |
204 | after we switched from wind-up computers made of match-sticks and snot, | 204 | after we switched from wind-up computers made of match-sticks and snot, |
205 | we abandoned this limitation. | 205 | we abandoned this limitation and switched to 'softirqs'. |
206 | </para> | 206 | </para> |
207 | 207 | ||
208 | <para> | 208 | <para> |
209 | <filename class="headerfile">include/linux/interrupt.h</filename> lists the | 209 | <filename class="headerfile">include/linux/interrupt.h</filename> lists the |
210 | different BH's. No matter how many CPUs you have, no two BHs will run at | 210 | different softirqs. A very important softirq is the |
211 | the same time. This made the transition to SMP simpler, but sucks hard for | 211 | timer softirq (<filename |
212 | scalable performance. A very important bottom half is the timer | 212 | class="headerfile">include/linux/timer.h</filename>): you can |
213 | BH (<filename class="headerfile">include/linux/timer.h</filename>): you | 213 | register to have it call functions for you in a given length of |
214 | can register to have it call functions for you in a given length of time. | 214 | time. |
215 | </para> | 215 | </para> |
216 | 216 | ||
217 | <para> | 217 | <para> |
218 | 2.3.43 introduced softirqs, and re-implemented the (now | 218 | Softirqs are often a pain to deal with, since the same softirq |
219 | deprecated) BHs underneath them. Softirqs are fully-SMP | 219 | will run simultaneously on more than one CPU. For this reason, |
220 | versions of BHs: they can run on as many CPUs at once as | 220 | tasklets (<filename |
221 | required. This means they need to deal with any races in shared | 221 | class="headerfile">include/linux/interrupt.h</filename>) are more |
222 | data using their own locks. A bitmask is used to keep track of | 222 | often used: they are dynamically-registrable (meaning you can have |
223 | which are enabled, so the 32 available softirqs should not be | 223 | as many as you want), and they also guarantee that any tasklet |
224 | used up lightly. (<emphasis>Yes</emphasis>, people will | 224 | will only run on one CPU at any time, although different tasklets |
225 | notice). | 225 | can run simultaneously. |
226 | </para> | ||
227 | |||
228 | <para> | ||
229 | tasklets (<filename class="headerfile">include/linux/interrupt.h</filename>) | ||
230 | are like softirqs, except they are dynamically-registrable (meaning you | ||
231 | can have as many as you want), and they also guarantee that any tasklet | ||
232 | will only run on one CPU at any time, although different tasklets can | ||
233 | run simultaneously (unlike different BHs). | ||
234 | </para> | 226 | </para> |
235 | <caution> | 227 | <caution> |
236 | <para> | 228 | <para> |
237 | The name `tasklet' is misleading: they have nothing to do with `tasks', | 229 | The name 'tasklet' is misleading: they have nothing to do with 'tasks', |
238 | and probably more to do with some bad vodka Alexey Kuznetsov had at the | 230 | and probably more to do with some bad vodka Alexey Kuznetsov had at the |
239 | time. | 231 | time. |
240 | </para> | 232 | </para> |
241 | </caution> | 233 | </caution> |
242 | 234 | ||
243 | <para> | 235 | <para> |
244 | You can tell you are in a softirq (or bottom half, or tasklet) | 236 | You can tell you are in a softirq (or tasklet) |
245 | using the <function>in_softirq()</function> macro | 237 | using the <function>in_softirq()</function> macro |
246 | (<filename class="headerfile">include/linux/interrupt.h</filename>). | 238 | (<filename class="headerfile">include/linux/interrupt.h</filename>). |
247 | </para> | 239 | </para> |
@@ -288,11 +280,10 @@ | |||
288 | <term>A rigid stack limit</term> | 280 | <term>A rigid stack limit</term> |
289 | <listitem> | 281 | <listitem> |
290 | <para> | 282 | <para> |
291 | The kernel stack is about 6K in 2.2 (for most | 283 | Depending on configuration options the kernel stack is about 3K to 6K for most 32-bit architectures: it's |
292 | architectures: it's about 14K on the Alpha), and shared | 284 | about 14K on most 64-bit archs, and often shared with interrupts |
293 | with interrupts so you can't use it all. Avoid deep | 285 | so you can't use it all. Avoid deep recursion and huge local |
294 | recursion and huge local arrays on the stack (allocate | 286 | arrays on the stack (allocate them dynamically instead). |
295 | them dynamically instead). | ||
296 | </para> | 287 | </para> |
297 | </listitem> | 288 | </listitem> |
298 | </varlistentry> | 289 | </varlistentry> |
@@ -339,7 +330,7 @@ asmlinkage long sys_mycall(int arg) | |||
339 | 330 | ||
340 | <para> | 331 | <para> |
341 | If all your routine does is read or write some parameter, consider | 332 | If all your routine does is read or write some parameter, consider |
342 | implementing a <function>sysctl</function> interface instead. | 333 | implementing a <function>sysfs</function> interface instead. |
343 | </para> | 334 | </para> |
344 | 335 | ||
345 | <para> | 336 | <para> |
@@ -417,7 +408,10 @@ cond_resched(); /* Will sleep */ | |||
417 | </para> | 408 | </para> |
418 | 409 | ||
419 | <para> | 410 | <para> |
420 | You will eventually lock up your box if you break these rules. | 411 | You should always compile your kernel with |
412 | <symbol>CONFIG_DEBUG_SPINLOCK_SLEEP</symbol> on, and it will warn | ||
413 | you if you break these rules. If you <emphasis>do</emphasis> break | ||
414 | the rules, you will eventually lock up your box. | ||
421 | </para> | 415 | </para> |
422 | 416 | ||
423 | <para> | 417 | <para> |
@@ -515,8 +509,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
515 | success). | 509 | success). |
516 | </para> | 510 | </para> |
517 | </caution> | 511 | </caution> |
518 | [Yes, this moronic interface makes me cringe. Please submit a | 512 | [Yes, this moronic interface makes me cringe. The flamewar comes up every year or so. --RR.] |
519 | patch and become my hero --RR.] | ||
520 | </para> | 513 | </para> |
521 | <para> | 514 | <para> |
522 | The functions may sleep implicitly. This should never be called | 515 | The functions may sleep implicitly. This should never be called |
@@ -587,10 +580,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
587 | </variablelist> | 580 | </variablelist> |
588 | 581 | ||
589 | <para> | 582 | <para> |
590 | If you see a <errorname>kmem_grow: Called nonatomically from int | 583 | If you see a <errorname>sleeping function called from invalid |
591 | </errorname> warning message you called a memory allocation function | 584 | context</errorname> warning message, then maybe you called a |
592 | from interrupt context without <constant>GFP_ATOMIC</constant>. | 585 | sleeping allocation function from interrupt context without |
593 | You should really fix that. Run, don't walk. | 586 | <constant>GFP_ATOMIC</constant>. You should really fix that. |
587 | Run, don't walk. | ||
594 | </para> | 588 | </para> |
595 | 589 | ||
596 | <para> | 590 | <para> |
@@ -639,16 +633,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
639 | </sect1> | 633 | </sect1> |
640 | 634 | ||
641 | <sect1 id="routines-udelay"> | 635 | <sect1 id="routines-udelay"> |
642 | <title><function>udelay()</function>/<function>mdelay()</function> | 636 | <title><function>mdelay()</function>/<function>udelay()</function> |
643 | <filename class="headerfile">include/asm/delay.h</filename> | 637 | <filename class="headerfile">include/asm/delay.h</filename> |
644 | <filename class="headerfile">include/linux/delay.h</filename> | 638 | <filename class="headerfile">include/linux/delay.h</filename> |
645 | </title> | 639 | </title> |
646 | 640 | ||
647 | <para> | 641 | <para> |
648 | The <function>udelay()</function> function can be used for small pauses. | 642 | The <function>udelay()</function> and <function>ndelay()</function> functions can be used for small pauses. |
649 | Do not use large values with <function>udelay()</function> as you risk | 643 | Do not use large values with them as you risk |
650 | overflow - the helper function <function>mdelay()</function> is useful | 644 | overflow - the helper function <function>mdelay()</function> is useful |
651 | here, or even consider <function>schedule_timeout()</function>. | 645 | here, or consider <function>msleep()</function>. |
652 | </para> | 646 | </para> |
653 | </sect1> | 647 | </sect1> |
654 | 648 | ||
@@ -698,8 +692,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
698 | These routines disable soft interrupts on the local CPU, and | 692 | These routines disable soft interrupts on the local CPU, and |
699 | restore them. They are reentrant; if soft interrupts were | 693 | restore them. They are reentrant; if soft interrupts were |
700 | disabled before, they will still be disabled after this pair | 694 | disabled before, they will still be disabled after this pair |
701 | of functions has been called. They prevent softirqs, tasklets | 695 | of functions has been called. They prevent softirqs and tasklets |
702 | and bottom halves from running on the current CPU. | 696 | from running on the current CPU. |
703 | </para> | 697 | </para> |
704 | </sect1> | 698 | </sect1> |
705 | 699 | ||
@@ -708,10 +702,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
708 | <filename class="headerfile">include/asm/smp.h</filename></title> | 702 | <filename class="headerfile">include/asm/smp.h</filename></title> |
709 | 703 | ||
710 | <para> | 704 | <para> |
711 | <function>smp_processor_id()</function> returns the current | 705 | <function>get_cpu()</function> disables preemption (so you won't |
712 | processor number, between 0 and <symbol>NR_CPUS</symbol> (the | 706 | suddenly get moved to another CPU) and returns the current |
713 | maximum number of CPUs supported by Linux, currently 32). These | 707 | processor number, between 0 and <symbol>NR_CPUS</symbol>. Note |
714 | values are not necessarily continuous. | 708 | that the CPU numbers are not necessarily continuous. You return |
709 | it again with <function>put_cpu()</function> when you are done. | ||
710 | </para> | ||
711 | <para> | ||
712 | If you know you cannot be preempted by another task (i.e. you are | ||
713 | in interrupt context, or have preemption disabled) you can use | ||
714 | <function>smp_processor_id()</function>. | ||
715 | </para> | 715 | </para> |
716 | </sect1> | 716 | </sect1> |
717 | 717 | ||
@@ -722,19 +722,14 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
722 | <para> | 722 | <para> |
723 | After boot, the kernel frees up a special section; functions | 723 | After boot, the kernel frees up a special section; functions |
724 | marked with <type>__init</type> and data structures marked with | 724 | marked with <type>__init</type> and data structures marked with |
725 | <type>__initdata</type> are dropped after boot is complete (within | 725 | <type>__initdata</type> are dropped after boot is complete: similarly |
726 | modules this directive is currently ignored). <type>__exit</type> | 726 | modules discard this memory after initialization. <type>__exit</type> |
727 | is used to declare a function which is only required on exit: the | 727 | is used to declare a function which is only required on exit: the |
728 | function will be dropped if this file is not compiled as a module. | 728 | function will be dropped if this file is not compiled as a module. |
729 | See the header file for use. Note that it makes no sense for a function | 729 | See the header file for use. Note that it makes no sense for a function |
730 | marked with <type>__init</type> to be exported to modules with | 730 | marked with <type>__init</type> to be exported to modules with |
731 | <function>EXPORT_SYMBOL()</function> - this will break. | 731 | <function>EXPORT_SYMBOL()</function> - this will break. |
732 | </para> | 732 | </para> |
733 | <para> | ||
734 | Static data structures marked as <type>__initdata</type> must be initialised | ||
735 | (as opposed to ordinary static data which is zeroed BSS) and cannot be | ||
736 | <type>const</type>. | ||
737 | </para> | ||
738 | 733 | ||
739 | </sect1> | 734 | </sect1> |
740 | 735 | ||
@@ -762,9 +757,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
762 | <para> | 757 | <para> |
763 | The function can return a negative error number to cause | 758 | The function can return a negative error number to cause |
764 | module loading to fail (unfortunately, this has no effect if | 759 | module loading to fail (unfortunately, this has no effect if |
765 | the module is compiled into the kernel). For modules, this is | 760 | the module is compiled into the kernel). This function is |
766 | called in user context, with interrupts enabled, and the | 761 | called in user context with interrupts enabled, so it can sleep. |
767 | kernel lock held, so it can sleep. | ||
768 | </para> | 762 | </para> |
769 | </sect1> | 763 | </sect1> |
770 | 764 | ||
@@ -779,6 +773,34 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
779 | reached zero. This function can also sleep, but cannot fail: | 773 | reached zero. This function can also sleep, but cannot fail: |
780 | everything must be cleaned up by the time it returns. | 774 | everything must be cleaned up by the time it returns. |
781 | </para> | 775 | </para> |
776 | |||
777 | <para> | ||
778 | Note that this macro is optional: if it is not present, your | ||
779 | module will not be removable (except for 'rmmod -f'). | ||
780 | </para> | ||
781 | </sect1> | ||
782 | |||
783 | <sect1 id="routines-module-use-counters"> | ||
784 | <title> <function>try_module_get()</function>/<function>module_put()</function> | ||
785 | <filename class="headerfile">include/linux/module.h</filename></title> | ||
786 | |||
787 | <para> | ||
788 | These manipulate the module usage count, to protect against | ||
789 | removal (a module also can't be removed if another module uses one | ||
790 | of its exported symbols: see below). Before calling into module | ||
791 | code, you should call <function>try_module_get()</function> on | ||
792 | that module: if it fails, then the module is being removed and you | ||
793 | should act as if it wasn't there. Otherwise, you can safely enter | ||
794 | the module, and call <function>module_put()</function> when you're | ||
795 | finished. | ||
796 | </para> | ||
797 | |||
798 | <para> | ||
799 | Most registerable structures have an | ||
800 | <structfield>owner</structfield> field, such as in the | ||
801 | <structname>file_operations</structname> structure. Set this field | ||
802 | to the macro <symbol>THIS_MODULE</symbol>. | ||
803 | </para> | ||
782 | </sect1> | 804 | </sect1> |
783 | 805 | ||
784 | <!-- add info on new-style module refcounting here --> | 806 | <!-- add info on new-style module refcounting here --> |
@@ -821,7 +843,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
821 | There is a macro to do this: | 843 | There is a macro to do this: |
822 | <function>wait_event_interruptible()</function> | 844 | <function>wait_event_interruptible()</function> |
823 | 845 | ||
824 | <filename class="headerfile">include/linux/sched.h</filename> The | 846 | <filename class="headerfile">include/linux/wait.h</filename> The |
825 | first argument is the wait queue head, and the second is an | 847 | first argument is the wait queue head, and the second is an |
826 | expression which is evaluated; the macro returns | 848 | expression which is evaluated; the macro returns |
827 | <returnvalue>0</returnvalue> when this expression is true, or | 849 | <returnvalue>0</returnvalue> when this expression is true, or |
@@ -847,10 +869,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
847 | <para> | 869 | <para> |
848 | Call <function>wake_up()</function> | 870 | Call <function>wake_up()</function> |
849 | 871 | ||
850 | <filename class="headerfile">include/linux/sched.h</filename>;, | 872 | <filename class="headerfile">include/linux/wait.h</filename>;, |
851 | which will wake up every process in the queue. The exception is | 873 | which will wake up every process in the queue. The exception is |
852 | if one has <constant>TASK_EXCLUSIVE</constant> set, in which case | 874 | if one has <constant>TASK_EXCLUSIVE</constant> set, in which case |
853 | the remainder of the queue will not be woken. | 875 | the remainder of the queue will not be woken. There are other variants |
876 | of this basic function available in the same header. | ||
854 | </para> | 877 | </para> |
855 | </sect1> | 878 | </sect1> |
856 | </chapter> | 879 | </chapter> |
@@ -863,7 +886,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
863 | first class of operations work on <type>atomic_t</type> | 886 | first class of operations work on <type>atomic_t</type> |
864 | 887 | ||
865 | <filename class="headerfile">include/asm/atomic.h</filename>; this | 888 | <filename class="headerfile">include/asm/atomic.h</filename>; this |
866 | contains a signed integer (at least 24 bits long), and you must use | 889 | contains a signed integer (at least 32 bits long), and you must use |
867 | these functions to manipulate or read atomic_t variables. | 890 | these functions to manipulate or read atomic_t variables. |
868 | <function>atomic_read()</function> and | 891 | <function>atomic_read()</function> and |
869 | <function>atomic_set()</function> get and set the counter, | 892 | <function>atomic_set()</function> get and set the counter, |
@@ -882,13 +905,12 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
882 | 905 | ||
883 | <para> | 906 | <para> |
884 | Note that these functions are slower than normal arithmetic, and | 907 | Note that these functions are slower than normal arithmetic, and |
885 | so should not be used unnecessarily. On some platforms they | 908 | so should not be used unnecessarily. |
886 | are much slower, like 32-bit Sparc where they use a spinlock. | ||
887 | </para> | 909 | </para> |
888 | 910 | ||
889 | <para> | 911 | <para> |
890 | The second class of atomic operations is atomic bit operations on a | 912 | The second class of atomic operations is atomic bit operations on an |
891 | <type>long</type>, defined in | 913 | <type>unsigned long</type>, defined in |
892 | 914 | ||
893 | <filename class="headerfile">include/linux/bitops.h</filename>. These | 915 | <filename class="headerfile">include/linux/bitops.h</filename>. These |
894 | operations generally take a pointer to the bit pattern, and a bit | 916 | operations generally take a pointer to the bit pattern, and a bit |
@@ -899,7 +921,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
899 | <function>test_and_clear_bit()</function> and | 921 | <function>test_and_clear_bit()</function> and |
900 | <function>test_and_change_bit()</function> do the same thing, | 922 | <function>test_and_change_bit()</function> do the same thing, |
901 | except return true if the bit was previously set; these are | 923 | except return true if the bit was previously set; these are |
902 | particularly useful for very simple locking. | 924 | particularly useful for atomically setting flags. |
903 | </para> | 925 | </para> |
904 | 926 | ||
905 | <para> | 927 | <para> |
@@ -907,12 +929,6 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
907 | than BITS_PER_LONG. The resulting behavior is strange on big-endian | 929 | than BITS_PER_LONG. The resulting behavior is strange on big-endian |
908 | platforms though so it is a good idea not to do this. | 930 | platforms though so it is a good idea not to do this. |
909 | </para> | 931 | </para> |
910 | |||
911 | <para> | ||
912 | Note that the order of bits depends on the architecture, and in | ||
913 | particular, the bitfield passed to these operations must be at | ||
914 | least as large as a <type>long</type>. | ||
915 | </para> | ||
916 | </chapter> | 932 | </chapter> |
917 | 933 | ||
918 | <chapter id="symbols"> | 934 | <chapter id="symbols"> |
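The test-and-modify bit operations described in the hunks above can be sketched in userspace C. This is only an analogue built on C11 `<stdatomic.h>`, not the kernel's implementation: the `demo_` names, the `struct demo_bitmap` type and the 128-flag size are all assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* All demo_* names are hypothetical; they only mimic the kernel bit operations. */
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NR_FLAGS 128 /* illustrative bitmap size */
#define DEMO_NR_WORDS \
	((DEMO_NR_FLAGS + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* The bit pattern is an array of unsigned long, as the text requires. */
struct demo_bitmap {
	_Atomic unsigned long word[DEMO_NR_WORDS];
};

/* Atomically set bit nr; return true if it was already set. */
static bool demo_test_and_set_bit(struct demo_bitmap *map, size_t nr)
{
	unsigned long mask = 1UL << (nr % DEMO_BITS_PER_LONG);
	unsigned long old;

	assert(nr < DEMO_NR_FLAGS);
	old = atomic_fetch_or(&map->word[nr / DEMO_BITS_PER_LONG], mask);
	return (old & mask) != 0;
}

/* Atomically clear bit nr; return true if it was previously set. */
static bool demo_test_and_clear_bit(struct demo_bitmap *map, size_t nr)
{
	unsigned long mask = 1UL << (nr % DEMO_BITS_PER_LONG);
	unsigned long old;

	assert(nr < DEMO_NR_FLAGS);
	old = atomic_fetch_and(&map->word[nr / DEMO_BITS_PER_LONG], ~mask);
	return (old & mask) != 0;
}
```

Because the previous value comes back from the same atomic step that changes the bit, only one of several racing callers sees "was clear" and can claim the flag.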
@@ -932,11 +948,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
932 | <filename class="headerfile">include/linux/module.h</filename></title> | 948 | <filename class="headerfile">include/linux/module.h</filename></title> |
933 | 949 | ||
934 | <para> | 950 | <para> |
935 | This is the classic method of exporting a symbol, and it works | 951 | This is the classic method of exporting a symbol: dynamically |
936 | for both modules and non-modules. In the kernel all these | 952 | loaded modules will be able to use the symbol as normal. |
937 | declarations are often bundled into a single file to help | ||
938 | genksyms (which searches source files for these declarations). | ||
939 | See the comment on genksyms and Makefiles below. | ||
940 | </para> | 953 | </para> |
941 | </sect1> | 954 | </sect1> |
942 | 955 | ||
@@ -949,7 +962,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
949 | symbols exported by <function>EXPORT_SYMBOL_GPL()</function> can | 962 | symbols exported by <function>EXPORT_SYMBOL_GPL()</function> can |
950 | only be seen by modules with a | 963 | only be seen by modules with a |
951 | <function>MODULE_LICENSE()</function> that specifies a GPL | 964 | <function>MODULE_LICENSE()</function> that specifies a GPL |
952 | compatible license. | 965 | compatible license. It implies that the function is considered |
966 | an internal implementation issue, and not really an interface. | ||
953 | </para> | 967 | </para> |
954 | </sect1> | 968 | </sect1> |
955 | </chapter> | 969 | </chapter> |
@@ -962,12 +976,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
962 | <filename class="headerfile">include/linux/list.h</filename></title> | 976 | <filename class="headerfile">include/linux/list.h</filename></title> |
963 | 977 | ||
964 | <para> | 978 | <para> |
965 | There are three sets of linked-list routines in the kernel | 979 | There used to be three sets of linked-list routines in the kernel |
966 | headers, but this one seems to be winning out (and Linus has | 980 | headers, but this one is the winner. If you don't have some |
967 | used it). If you don't have some particular pressing need for | 981 | particular pressing need for a single list, it's a good choice. |
968 | a single list, it's a good choice. In fact, I don't care | 982 | </para> |
969 | whether it's a good choice or not, just use it so we can get | 983 | |
970 | rid of the others. | 984 | <para> |
985 | In particular, <function>list_for_each_entry</function> is useful. | ||
971 | </para> | 986 | </para> |
972 | </sect1> | 987 | </sect1> |
973 | 988 | ||
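To show why an entry-based iterator is convenient, here is a minimal userspace sketch of an intrusive doubly linked list. It is not the code from include/linux/list.h: every `demo_` name is hypothetical, and the iterator takes the entry type as an explicit argument so that the sketch stays within standard C, whereas the kernel macro infers it from the cursor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its embedded node. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Walk every entry of the given type on the list; the head is not an entry. */
#define demo_list_for_each_entry(pos, head, type, member) \
	for (pos = demo_container_of((head)->next, type, member); \
	     &pos->member != (head); \
	     pos = demo_container_of(pos->member.next, type, member))

/* An empty list is a head that points at itself. */
static void demo_list_init(struct demo_list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Link entry in just before head, i.e. at the tail of the list. */
static void demo_list_add_tail(struct demo_list_head *entry,
			       struct demo_list_head *head)
{
	assert(entry != NULL && head != NULL);
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Example payload with the list node embedded in it. */
struct demo_item {
	int value;
	struct demo_list_head node;
};

/* Sum all values; the loop body sees whole items, not bare nodes. */
static int demo_sum(struct demo_list_head *head)
{
	struct demo_item *it;
	int sum = 0;

	demo_list_for_each_entry(it, head, struct demo_item, node)
		sum += it->value;
	return sum;
}
```

The loop body works on `struct demo_item` directly, so callers do not have to repeat the node-to-structure conversion by hand.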
@@ -979,14 +994,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); | |||
979 | convention, and return <returnvalue>0</returnvalue> for success, | 994 | convention, and return <returnvalue>0</returnvalue> for success, |
980 | and a negative error number | 995 | and a negative error number |
981 | (eg. <returnvalue>-EFAULT</returnvalue>) for failure. This can be | 996 | (eg. <returnvalue>-EFAULT</returnvalue>) for failure. This can be |
982 | unintuitive at first, but it's fairly widespread in the networking | 997 | unintuitive at first, but it's fairly widespread in the kernel. |
983 | code, for example. | ||
984 | </para> | 998 | </para> |
985 | 999 | ||
986 | <para> | 1000 | <para> |
987 | The filesystem code uses <function>ERR_PTR()</function> | 1001 | Using <function>ERR_PTR()</function> |
988 | 1002 | ||
989 | <filename class="headerfile">include/linux/fs.h</filename>; to | 1003 | <filename class="headerfile">include/linux/err.h</filename>; to |
990 | encode a negative error number into a pointer, and | 1004 | encode a negative error number into a pointer, and |
991 | <function>IS_ERR()</function> and <function>PTR_ERR()</function> | 1005 | <function>IS_ERR()</function> and <function>PTR_ERR()</function> |
992 | to get it back out again: avoids a separate pointer parameter for | 1006 | to get it back out again: avoids a separate pointer parameter for |
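The pointer-encoding idea behind ERR_PTR(), IS_ERR() and PTR_ERR() can be sketched in userspace C as follows. The `demo_` helpers are hypothetical stand-ins, not the kernel's definitions, and the bound `DEMO_MAX_ERRNO` is an assumption: the sketch relies on the top few thousand addresses never being valid pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed upper bound on error numbers; chosen for this sketch only. */
#define DEMO_MAX_ERRNO 4095

/* Encode a negative error number as a pointer value. */
static void *demo_err_ptr(long err)
{
	assert(err < 0 && err >= -DEMO_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

/* True if the pointer lies in the range reserved for encoded errors. */
static int demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)(-DEMO_MAX_ERRNO);
}

/* Decode the error number; only meaningful when demo_is_err() is true. */
static long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Hypothetical allocator: returns a buffer or an encoded error number. */
static void *demo_alloc_buffer(size_t len)
{
	void *buf;

	if (len == 0)
		return demo_err_ptr(-EINVAL);
	buf = malloc(len);
	if (!buf)
		return demo_err_ptr(-ENOMEM);
	return buf;
}
```

A caller checks the result with `demo_is_err()` before using it and reads the reason with `demo_ptr_err()`, so one return value carries both the pointer and the failure code.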
@@ -1040,7 +1054,7 @@ static struct block_device_operations opt_fops = { | |||
1040 | supported, due to lack of general use, but the following are | 1054 | supported, due to lack of general use, but the following are |
1041 | considered standard (see the GCC info page section "C | 1055 | considered standard (see the GCC info page section "C |
1042 | Extensions" for more details - Yes, really the info page, the | 1056 | Extensions" for more details - Yes, really the info page, the |
1043 | man page is only a short summary of the stuff in info): | 1057 | man page is only a short summary of the stuff in info). |
1044 | </para> | 1058 | </para> |
1045 | <itemizedlist> | 1059 | <itemizedlist> |
1046 | <listitem> | 1060 | <listitem> |
@@ -1091,7 +1105,7 @@ static struct block_device_operations opt_fops = { | |||
1091 | </listitem> | 1105 | </listitem> |
1092 | <listitem> | 1106 | <listitem> |
1093 | <para> | 1107 | <para> |
1094 | Function names as strings (__FUNCTION__) | 1108 | Function names as strings (__func__). |
1095 | </para> | 1109 | </para> |
1096 | </listitem> | 1110 | </listitem> |
1097 | <listitem> | 1111 | <listitem> |
@@ -1164,63 +1178,35 @@ static struct block_device_operations opt_fops = { | |||
1164 | <listitem> | 1178 | <listitem> |
1165 | <para> | 1179 | <para> |
1166 | Usually you want a configuration option for your kernel hack. | 1180 | Usually you want a configuration option for your kernel hack. |
1167 | Edit <filename>Config.in</filename> in the appropriate directory | 1181 | Edit <filename>Kconfig</filename> in the appropriate directory. |
1168 | (but under <filename>arch/</filename> it's called | 1182 | The Config language is simple to use by cut and paste, and there's |
1169 | <filename>config.in</filename>). The Config Language used is not | 1183 | complete documentation in |
1170 | bash, even though it looks like bash; the safe way is to use only | 1184 | <filename>Documentation/kbuild/kconfig-language.txt</filename>. |
1171 | the constructs that you already see in | ||
1172 | <filename>Config.in</filename> files (see | ||
1173 | <filename>Documentation/kbuild/kconfig-language.txt</filename>). | ||
1174 | It's good to run "make xconfig" at least once to test (because | ||
1175 | it's the only one with a static parser). | ||
1176 | </para> | ||
1177 | |||
1178 | <para> | ||
1179 | Variables which can be Y or N use <type>bool</type> followed by a | ||
1180 | tagline and the config define name (which must start with | ||
1181 | CONFIG_). The <type>tristate</type> function is the same, but | ||
1182 | allows the answer M (which defines | ||
1183 | <symbol>CONFIG_foo_MODULE</symbol> in your source, instead of | ||
1184 | <symbol>CONFIG_FOO</symbol>) if <symbol>CONFIG_MODULES</symbol> | ||
1185 | is enabled. | ||
1186 | </para> | 1185 | </para> |
1187 | 1186 | ||
1188 | <para> | 1187 | <para> |
1189 | You may well want to make your CONFIG option only visible if | 1188 | You may well want to make your CONFIG option only visible if |
1190 | <symbol>CONFIG_EXPERIMENTAL</symbol> is enabled: this serves as a | 1189 | <symbol>CONFIG_EXPERIMENTAL</symbol> is enabled: this serves as a |
1191 | warning to users. There are many other fancy things you can do: see | 1190 | warning to users. There are many other fancy things you can do: see |
1192 | the various <filename>Config.in</filename> files for ideas. | 1191 | the various <filename>Kconfig</filename> files for ideas. |
1193 | </para> | 1192 | </para> |
1194 | </listitem> | ||
1195 | 1193 | ||
1196 | <listitem> | ||
1197 | <para> | 1194 | <para> |
1198 | Edit the <filename>Makefile</filename>: the CONFIG variables are | 1195 | In your description of the option, make sure you address both the |
1199 | exported here so you can conditionalize compilation with `ifeq'. | 1196 | expert user and the user who knows nothing about your feature. Mention |
1200 | If your file exports symbols then add the names to | 1197 | incompatibilities and issues here. <emphasis> Definitely |
1201 | <varname>export-objs</varname> so that genksyms will find them. | 1198 | </emphasis> end your description with <quote> if in doubt, say N |
1202 | <caution> | 1199 | </quote> (or, occasionally, `Y'); this is for people who have no |
1203 | <para> | 1200 | idea what you are talking about. |
1204 | There is a restriction on the kernel build system that objects | ||
1205 | which export symbols must have globally unique names. | ||
1206 | If your object does not have a globally unique name then the | ||
1207 | standard fix is to move the | ||
1208 | <function>EXPORT_SYMBOL()</function> statements to their own | ||
1209 | object with a unique name. | ||
1210 | This is why several systems have separate exporting objects, | ||
1211 | usually suffixed with ksyms. | ||
1212 | </para> | ||
1213 | </caution> | ||
1214 | </para> | 1201 | </para> |
1215 | </listitem> | 1202 | </listitem> |
1216 | 1203 | ||
1217 | <listitem> | 1204 | <listitem> |
1218 | <para> | 1205 | <para> |
1219 | Document your option in Documentation/Configure.help. Mention | 1206 | Edit the <filename>Makefile</filename>: the CONFIG variables are |
1220 | incompatibilities and issues here. <emphasis> Definitely | 1207 | exported here so you can usually just add a "obj-$(CONFIG_xxx) += |
1221 | </emphasis> end your description with <quote> if in doubt, say N | 1208 | xxx.o" line. The syntax is documented in |
1222 | </quote> (or, occasionally, `Y'); this is for people who have no | 1209 | <filename>Documentation/kbuild/makefiles.txt</filename>. |
1223 | idea what you are talking about. | ||
1224 | </para> | 1210 | </para> |
1225 | </listitem> | 1211 | </listitem> |
1226 | 1212 | ||
@@ -1253,20 +1239,12 @@ static struct block_device_operations opt_fops = { | |||
1253 | </para> | 1239 | </para> |
1254 | 1240 | ||
1255 | <para> | 1241 | <para> |
1256 | <filename>include/linux/brlock.h:</filename> | 1242 | <filename>include/asm-i386/delay.h:</filename> |
1257 | </para> | 1243 | </para> |
1258 | <programlisting> | 1244 | <programlisting> |
1259 | extern inline void br_read_lock (enum brlock_indices idx) | 1245 | #define ndelay(n) (__builtin_constant_p(n) ? \ |
1260 | { | 1246 | ((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \ |
1261 | /* | 1247 | __ndelay(n)) |
1262 | * This causes a link-time bug message if an | ||
1263 | * invalid index is used: | ||
1264 | */ | ||
1265 | if (idx >= __BR_END) | ||
1266 | __br_lock_usage_bug(); | ||
1267 | |||
1268 | read_lock(&__brlock_array[smp_processor_id()][idx]); | ||
1269 | } | ||
1270 | </programlisting> | 1248 | </programlisting> |
1271 | 1249 | ||
1272 | <para> | 1250 | <para> |
diff --git a/Documentation/DocBook/usb.tmpl b/Documentation/DocBook/usb.tmpl index f3ef0bf435e9..705c442c7bf4 100644 --- a/Documentation/DocBook/usb.tmpl +++ b/Documentation/DocBook/usb.tmpl | |||
@@ -841,7 +841,7 @@ usbdev_ioctl (int fd, int ifno, unsigned request, void *param) | |||
841 | File modification time is not updated by this request. | 841 | File modification time is not updated by this request. |
842 | </para><para> | 842 | </para><para> |
843 | Those struct members are from some interface descriptor | 843 | Those struct members are from some interface descriptor |
844 | applying to the the current configuration. | 844 | applying to the current configuration. |
845 | The interface number is the bInterfaceNumber value, and | 845 | The interface number is the bInterfaceNumber value, and |
846 | the altsetting number is the bAlternateSetting value. | 846 | the altsetting number is the bAlternateSetting value. |
847 | (This resets each endpoint in the interface.) | 847 | (This resets each endpoint in the interface.) |
diff --git a/Documentation/MSI-HOWTO.txt b/Documentation/MSI-HOWTO.txt index d5032eb480aa..63edc5f847c4 100644 --- a/Documentation/MSI-HOWTO.txt +++ b/Documentation/MSI-HOWTO.txt | |||
@@ -430,7 +430,7 @@ which may result in system hang. The software driver of specific | |||
430 | MSI-capable hardware is responsible for deciding whether to call | 430 | MSI-capable hardware is responsible for deciding whether to call |
431 | pci_enable_msi. A return of zero indicates that the kernel | 431 | pci_enable_msi. A return of zero indicates that the kernel |
432 | successfully initializes the MSI/MSI-X capability structure of the | 432 | successfully initializes the MSI/MSI-X capability structure of the |
433 | device funtion. The device function is now running on MSI/MSI-X mode. | 433 | device function. The device function is now running on MSI/MSI-X mode. |
434 | 434 | ||
435 | 5.6 How to tell whether MSI/MSI-X is enabled on device function | 435 | 5.6 How to tell whether MSI/MSI-X is enabled on device function |
436 | 436 | ||
diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt index 9c6d450138ea..fcbcbc35b122 100644 --- a/Documentation/RCU/RTFP.txt +++ b/Documentation/RCU/RTFP.txt | |||
@@ -2,7 +2,8 @@ Read the F-ing Papers! | |||
2 | 2 | ||
3 | 3 | ||
4 | This document describes RCU-related publications, and is followed by | 4 | This document describes RCU-related publications, and is followed by |
5 | the corresponding bibtex entries. | 5 | the corresponding bibtex entries. A number of the publications may |
6 | be found at http://www.rdrop.com/users/paulmck/RCU/. | ||
6 | 7 | ||
7 | The first thing resembling RCU was published in 1980, when Kung and Lehman | 8 | The first thing resembling RCU was published in 1980, when Kung and Lehman |
8 | [Kung80] recommended use of a garbage collector to defer destruction | 9 | [Kung80] recommended use of a garbage collector to defer destruction |
@@ -113,6 +114,10 @@ describing how to make RCU safe for soft-realtime applications [Sarma04c], | |||
113 | and a paper describing SELinux performance with RCU [JamesMorris04b]. | 114 | and a paper describing SELinux performance with RCU [JamesMorris04b]. |
114 | 115 | ||
115 | 116 | ||
117 | 2005 has seen further adaptation of RCU to realtime use, permitting | ||
118 | preemption of RCU realtime critical sections [PaulMcKenney05a, | ||
119 | PaulMcKenney05b]. | ||
120 | |||
116 | Bibtex Entries | 121 | Bibtex Entries |
117 | 122 | ||
118 | @article{Kung80 | 123 | @article{Kung80 |
@@ -410,3 +415,32 @@ Oregon Health and Sciences University" | |||
410 | \url{http://www.livejournal.com/users/james_morris/2153.html} | 415 | \url{http://www.livejournal.com/users/james_morris/2153.html} |
411 | [Viewed December 10, 2004]" | 416 | [Viewed December 10, 2004]" |
412 | } | 417 | } |
418 | |||
419 | @unpublished{PaulMcKenney05a | ||
420 | ,Author="Paul E. McKenney" | ||
421 | ,Title="{[RFC]} {RCU} and {CONFIG\_PREEMPT\_RT} progress" | ||
422 | ,month="May" | ||
423 | ,year="2005" | ||
424 | ,note="Available: | ||
425 | \url{http://lkml.org/lkml/2005/5/9/185} | ||
426 | [Viewed May 13, 2005]" | ||
427 | ,annotation=" | ||
428 | First publication of working lock-based deferred free patches | ||
429 | for the CONFIG_PREEMPT_RT environment. | ||
430 | " | ||
431 | } | ||
432 | |||
433 | @conference{PaulMcKenney05b | ||
434 | ,Author="Paul E. McKenney and Dipankar Sarma" | ||
435 | ,Title="Towards Hard Realtime Response from the Linux Kernel on SMP Hardware" | ||
436 | ,Booktitle="linux.conf.au 2005" | ||
437 | ,month="April" | ||
438 | ,year="2005" | ||
439 | ,address="Canberra, Australia" | ||
440 | ,note="Available: | ||
441 | \url{http://www.rdrop.com/users/paulmck/RCU/realtimeRCU.2005.04.23a.pdf} | ||
442 | [Viewed May 13, 2005]" | ||
443 | ,annotation=" | ||
444 | Realtime turns into making RCU yet more realtime friendly. | ||
445 | " | ||
446 | } | ||
diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt index 3bfb84b3b7db..aab4a9ec3931 100644 --- a/Documentation/RCU/UP.txt +++ b/Documentation/RCU/UP.txt | |||
@@ -8,7 +8,7 @@ is that since there is only one CPU, it should not be necessary to | |||
8 | wait for anything else to get done, since there are no other CPUs for | 8 | wait for anything else to get done, since there are no other CPUs for |
9 | anything else to be happening on. Although this approach will -sort- -of- | 9 | anything else to be happening on. Although this approach will -sort- -of- |
10 | work a surprising amount of the time, it is a very bad idea in general. | 10 | work a surprising amount of the time, it is a very bad idea in general. |
11 | This document presents two examples that demonstrate exactly how bad an | 11 | This document presents three examples that demonstrate exactly how bad an |
12 | idea this is. | 12 | idea this is. |
13 | 13 | ||
14 | 14 | ||
@@ -26,6 +26,9 @@ from softirq, the list scan would find itself referencing a newly freed | |||
26 | element B. This situation can greatly decrease the life expectancy of | 26 | element B. This situation can greatly decrease the life expectancy of |
27 | your kernel. | 27 | your kernel. |
28 | 28 | ||
29 | This same problem can occur if call_rcu() is invoked from a hardware | ||
30 | interrupt handler. | ||
31 | |||
29 | 32 | ||
30 | Example 2: Function-Call Fatality | 33 | Example 2: Function-Call Fatality |
31 | 34 | ||
@@ -44,8 +47,37 @@ its arguments would cause it to fail to make the fundamental guarantee | |||
44 | underlying RCU, namely that call_rcu() defers invoking its arguments until | 47 | underlying RCU, namely that call_rcu() defers invoking its arguments until |
45 | all RCU read-side critical sections currently executing have completed. | 48 | all RCU read-side critical sections currently executing have completed. |
46 | 49 | ||
47 | Quick Quiz: why is it -not- legal to invoke synchronize_rcu() in | 50 | Quick Quiz #1: why is it -not- legal to invoke synchronize_rcu() in |
48 | this case? | 51 | this case? |
52 | |||
53 | |||
54 | Example 3: Death by Deadlock | ||
55 | |||
56 | Suppose that call_rcu() is invoked while holding a lock, and that the | ||
57 | callback function must acquire this same lock. In this case, if | ||
58 | call_rcu() were to directly invoke the callback, the result would | ||
59 | be self-deadlock. | ||
60 | |||
61 | In some cases, it would be possible to restructure the code so that | ||
62 | the call_rcu() is delayed until after the lock is released. However, | ||
63 | there are cases where this can be quite ugly: | ||
64 | |||
65 | 1. If a number of items need to be passed to call_rcu() within | ||
66 | the same critical section, then the code would need to create | ||
67 | a list of them, then traverse the list once the lock was | ||
68 | released. | ||
69 | |||
70 | 2. In some cases, the lock will be held across some kernel API, | ||
71 | so that delaying the call_rcu() until the lock is released | ||
72 | requires that the data item be passed up via a common API. | ||
73 | It is far better to guarantee that callbacks are invoked | ||
74 | with no locks held than to have to modify such APIs to allow | ||
75 | arbitrary data items to be passed back up through them. | ||
76 | |||
77 | If call_rcu() directly invokes the callback, painful locking restrictions | ||
78 | or API changes would be required. | ||
79 | |||
80 | Quick Quiz #2: What locking restriction must RCU callbacks respect? | ||
49 | 81 | ||
50 | 82 | ||
51 | Summary | 83 | Summary |
@@ -53,12 +85,35 @@ Summary | |||
53 | Permitting call_rcu() to immediately invoke its arguments or permitting | 85 | Permitting call_rcu() to immediately invoke its arguments or permitting |
54 | synchronize_rcu() to immediately return breaks RCU, even on a UP system. | 86 | synchronize_rcu() to immediately return breaks RCU, even on a UP system. |
55 | So do not do it! Even on a UP system, the RCU infrastructure -must- | 87 | So do not do it! Even on a UP system, the RCU infrastructure -must- |
56 | respect grace periods. | 88 | respect grace periods, and -must- invoke callbacks from a known environment |
57 | 89 | in which no locks are held. | |
58 | 90 | ||
59 | Answer to Quick Quiz | 91 | |
60 | 92 | Answer to Quick Quiz #1: | |
61 | The calling function is scanning an RCU-protected linked list, and | 93 | Why is it -not- legal to invoke synchronize_rcu() in this case? |
62 | is therefore within an RCU read-side critical section. Therefore, | 94 | |
63 | the called function has been invoked within an RCU read-side critical | 95 | Because the calling function is scanning an RCU-protected linked |
64 | section, and is not permitted to block. | 96 | list, and is therefore within an RCU read-side critical section. |
97 | Therefore, the called function has been invoked within an RCU | ||
98 | read-side critical section, and is not permitted to block. | ||
99 | |||
100 | Answer to Quick Quiz #2: | ||
101 | What locking restriction must RCU callbacks respect? | ||
102 | |||
103 | Any lock that is acquired within an RCU callback must be | ||
104 | acquired elsewhere using an _irq variant of the spinlock | ||
105 | primitive. For example, if "mylock" is acquired by an | ||
106 | RCU callback, then a process-context acquisition of this | ||
107 | lock must use something like spin_lock_irqsave() to | ||
108 | acquire the lock. | ||
109 | |||
110 | If the process-context code were to simply use spin_lock(), | ||
111 | then, since RCU callbacks can be invoked from softirq context, | ||
112 | the callback might be called from a softirq that interrupted | ||
113 | the process-context critical section. This would result in | ||
114 | self-deadlock. | ||
115 | |||
116 | This restriction might seem gratuitous, since very few RCU | ||
117 | callbacks acquire locks directly. However, a great many RCU | ||
118 | callbacks do acquire locks -indirectly-, for example, via | ||
119 | the kfree() primitive. | ||
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index 8f3fb77c9cd3..e118a7c1a092 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt | |||
@@ -43,6 +43,10 @@ over a rather long period of time, but improvements are always welcome! | |||
43 | rcu_read_lock_bh()) in the read-side critical sections, | 43 | rcu_read_lock_bh()) in the read-side critical sections, |
44 | and are also an excellent aid to readability. | 44 | and are also an excellent aid to readability. |
45 | 45 | ||
46 | As a rough rule of thumb, any dereference of an RCU-protected | ||
47 | pointer must be covered by rcu_read_lock() or rcu_read_lock_bh() | ||
48 | or by the appropriate update-side lock. | ||
49 | |||
46 | 3. Does the update code tolerate concurrent accesses? | 50 | 3. Does the update code tolerate concurrent accesses? |
47 | 51 | ||
48 | The whole point of RCU is to permit readers to run without | 52 | The whole point of RCU is to permit readers to run without |
@@ -90,7 +94,11 @@ over a rather long period of time, but improvements are always welcome! | |||
90 | 94 | ||
91 | The rcu_dereference() primitive is used by the various | 95 | The rcu_dereference() primitive is used by the various |
92 | "_rcu()" list-traversal primitives, such as the | 96 | "_rcu()" list-traversal primitives, such as the |
93 | list_for_each_entry_rcu(). | 97 | list_for_each_entry_rcu(). Note that it is perfectly |
98 | legal (if redundant) for update-side code to use | ||
99 | rcu_dereference() and the "_rcu()" list-traversal | ||
100 | primitives. This is particularly useful in code | ||
101 | that is common to readers and updaters. | ||
94 | 102 | ||
95 | b. If the list macros are being used, the list_add_tail_rcu() | 103 | b. If the list macros are being used, the list_add_tail_rcu() |
96 | and list_add_rcu() primitives must be used in order | 104 | and list_add_rcu() primitives must be used in order |
@@ -150,16 +158,9 @@ over a rather long period of time, but improvements are always welcome! | |||
150 | 158 | ||
151 | Use of the _rcu() list-traversal primitives outside of an | 159 | Use of the _rcu() list-traversal primitives outside of an |
152 | RCU read-side critical section causes no harm other than | 160 | RCU read-side critical section causes no harm other than |
153 | a slight performance degradation on Alpha CPUs and some | 161 | a slight performance degradation on Alpha CPUs. It can |
154 | confusion on the part of people trying to read the code. | 162 | also be quite helpful in reducing code bloat when common |
155 | 163 | code is shared between readers and updaters. | |
156 | Another way of thinking of this is "If you are holding the | ||
157 | lock that prevents the data structure from changing, why do | ||
158 | you also need RCU-based protection?" That said, there may | ||
159 | well be situations where use of the _rcu() list-traversal | ||
160 | primitives while the update-side lock is held results in | ||
161 | simpler and more maintainable code. The jury is still out | ||
162 | on this question. | ||
163 | 164 | ||
164 | 10. Conversely, if you are in an RCU read-side critical section, | 165 | 10. Conversely, if you are in an RCU read-side critical section, |
165 | you -must- use the "_rcu()" variants of the list macros. | 166 | you -must- use the "_rcu()" variants of the list macros. |
diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt index eb444006683e..6fa092251586 100644 --- a/Documentation/RCU/rcu.txt +++ b/Documentation/RCU/rcu.txt | |||
@@ -64,6 +64,54 @@ o I hear that RCU is patented? What is with that? | |||
64 | Of these, one was allowed to lapse by the assignee, and the | 64 | Of these, one was allowed to lapse by the assignee, and the |
65 | others have been contributed to the Linux kernel under GPL. | 65 | others have been contributed to the Linux kernel under GPL. |
66 | 66 | ||
67 | o I hear that RCU needs work in order to support realtime kernels? | ||
68 | |||
69 | Yes, work in progress. | ||
70 | |||
67 | o Where can I find more information on RCU? | 71 | o Where can I find more information on RCU? |
68 | 72 | ||
69 | See the RTFP.txt file in this directory. | 73 | See the RTFP.txt file in this directory. |
74 | Or point your browser at http://www.rdrop.com/users/paulmck/RCU/. | ||
75 | |||
76 | o What are all these files in this directory? | ||
77 | |||
78 | |||
79 | NMI-RCU.txt | ||
80 | |||
81 | Describes how to use RCU to implement dynamic | ||
82 | NMI handlers, which can be revectored on the fly, | ||
83 | without rebooting. | ||
84 | |||
85 | RTFP.txt | ||
86 | |||
87 | List of RCU-related publications and web sites. | ||
88 | |||
89 | UP.txt | ||
90 | |||
91 | Discussion of RCU usage in UP kernels. | ||
92 | |||
93 | arrayRCU.txt | ||
94 | |||
95 | Describes how to use RCU to protect arrays, with | ||
96 | resizeable arrays whose elements reference other | ||
97 | data structures being of the most interest. | ||
98 | |||
99 | checklist.txt | ||
100 | |||
101 | Lists things to check for when inspecting code that | ||
102 | uses RCU. | ||
103 | |||
104 | listRCU.txt | ||
105 | |||
106 | Describes how to use RCU to protect linked lists. | ||
107 | This is the simplest and most common use of RCU | ||
108 | in the Linux kernel. | ||
109 | |||
110 | rcu.txt | ||
111 | |||
112 | You are reading it! | ||
113 | |||
114 | whatisRCU.txt | ||
115 | |||
116 | Overview of how the RCU implementation works. Along | ||
117 | the way, presents a conceptual view of RCU. | ||
diff --git a/Documentation/RCU/rcuref.txt b/Documentation/RCU/rcuref.txt new file mode 100644 index 000000000000..a23fee66064d --- /dev/null +++ b/Documentation/RCU/rcuref.txt | |||
@@ -0,0 +1,74 @@ | |||
1 | Refcounter framework for elements of lists/arrays protected by | ||
2 | RCU. | ||
3 | |||
4 | Refcounting on elements of lists which are protected by traditional | ||
5 | reader/writer spinlocks or semaphores is straightforward, as in: | ||
6 | |||
7 |  1.                                      2. | ||
8 |  add()                                   search_and_reference() | ||
9 |  {                                       { | ||
10 |     alloc_object                            read_lock(&list_lock); | ||
11 |     ...                                     search_for_element | ||
12 |     atomic_set(&el->rc, 1);                 atomic_inc(&el->rc); | ||
13 |     write_lock(&list_lock);                 ... | ||
14 |     add_element                             read_unlock(&list_lock); | ||
15 |     ...                                     ... | ||
16 |     write_unlock(&list_lock);           } | ||
17 | } | ||
18 | |||
19 | 3.                                      4. | ||
20 | release_referenced()                    delete() | ||
21 | {                                       { | ||
22 |     ...                                     write_lock(&list_lock); | ||
23 |     atomic_dec(&el->rc, relfunc)            ... | ||
24 |     ...                                     delete_element | ||
25 | }                                           write_unlock(&list_lock); | ||
26 |                                             ... | ||
27 |                                             if (atomic_dec_and_test(&el->rc)) | ||
28 |                                                 kfree(el); | ||
29 |                                             ... | ||
30 |                                         } | ||
31 | |||
32 | If this list/array is made lock free using rcu, as in changing the | ||
33 | write_lock in add() and delete() to spin_lock and changing read_lock | ||
34 | in search_and_reference to rcu_read_lock(), the atomic_inc in | ||
35 | search_and_reference could potentially take a reference to an element which | ||
36 | has already been deleted from the list/array. rcuref_inc_lf takes | ||
37 | care of this scenario. search_and_reference should look as follows: | ||
38 | |||
39 | 1.                                      2. | ||
40 | add()                                   search_and_reference() | ||
41 | {                                       { | ||
42 |     alloc_object                            rcu_read_lock(); | ||
43 |     ...                                     search_for_element | ||
44 |     atomic_set(&el->rc, 1);                 if (!rcuref_inc_lf(&el->rc)) { | ||
45 |     spin_lock(&list_lock);                      rcu_read_unlock(); | ||
46 |                                                 return FAIL; | ||
47 |     add_element                             } | ||
48 |     ...                                     ... | ||
49 |     spin_unlock(&list_lock);                rcu_read_unlock(); | ||
50 | }                                       } | ||
51 | 3.                                      4. | ||
52 | release_referenced()                    delete() | ||
53 | {                                       { | ||
54 |     ...                                     spin_lock(&list_lock); | ||
55 |     rcuref_dec(&el->rc, relfunc)            ... | ||
56 |     ...                                     delete_element | ||
57 | }                                           spin_unlock(&list_lock); | ||
58 |                                             ... | ||
59 |                                             if (rcuref_dec_and_test(&el->rc)) | ||
60 |                                                 call_rcu(&el->head, el_free); | ||
61 |                                             ... | ||
62 |                                         } | ||
63 | |||
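The lock-free reference acquisition in search_and_reference() depends on an
increment that refuses to succeed once the count has dropped to zero. The
following is a minimal user-space sketch of that idea using C11 atomics. It is
not the kernel's rcuref implementation; struct ref, ref_inc_not_zero() and
ref_dec_and_test() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reference count embedded in an element. */
struct ref {
	atomic_int count;
};

/*
 * Take a reference only if the count is still non-zero.
 * Returns true on success, false if the count already reached zero,
 * meaning the element is on its way to being freed.
 */
static bool ref_inc_not_zero(struct ref *r)
{
	int c = atomic_load(&r->count);

	while (c != 0) {
		/* On failure, c is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &c, c + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool ref_dec_and_test(struct ref *r)
{
	int old = atomic_fetch_sub(&r->count, 1);

	assert(old > 0);	/* dropping a reference never held is a bug */
	return old == 1;
}
```

A reader that gets false from ref_inc_not_zero() must treat the element as
already gone, which corresponds to the "return FAIL" path above.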
64 | Sometimes, a reference to the element needs to be obtained in the | ||
65 | update (write) stream. In such cases, rcuref_inc_lf might be overkill | ||
66 | since the spinlock serialising list updates is held. rcuref_inc | ||
67 | is to be used in such cases. | ||
68 | For arches which do not have cmpxchg, the rcuref_inc_lf | ||
69 | api uses a hashed spinlock implementation and the same hashed spinlock | ||
70 | is acquired in all rcuref_xxx primitives to preserve atomicity. | ||
71 | Note: Use the rcuref_inc api only if you need to use rcuref_inc_lf on the | ||
72 | refcounter in at least one place. Mixing rcuref_inc and atomic_xxx api | ||
73 | might lead to races. rcuref_inc_lf() must be used in lockfree | ||
74 | RCU critical sections only. | ||
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt new file mode 100644 index 000000000000..354d89c78377 --- /dev/null +++ b/Documentation/RCU/whatisRCU.txt | |||
@@ -0,0 +1,902 @@ | |||
1 | What is RCU? | ||
2 | |||
3 | RCU is a synchronization mechanism that was added to the Linux kernel | ||
4 | during the 2.5 development effort that is optimized for read-mostly | ||
5 | situations. Although RCU is actually quite simple once you understand it, | ||
6 | getting there can sometimes be a challenge. Part of the problem is that | ||
7 | most of the past descriptions of RCU have been written with the mistaken | ||
8 | assumption that there is "one true way" to describe RCU. Instead, | ||
9 | the experience has been that different people must take different paths | ||
10 | to arrive at an understanding of RCU. This document provides several | ||
11 | different paths, as follows: | ||
12 | |||
13 | 1. RCU OVERVIEW | ||
14 | 2. WHAT IS RCU'S CORE API? | ||
15 | 3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? | ||
16 | 4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? | ||
17 | 5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? | ||
18 | 6. ANALOGY WITH READER-WRITER LOCKING | ||
19 | 7. FULL LIST OF RCU APIs | ||
20 | 8. ANSWERS TO QUICK QUIZZES | ||
21 | |||
22 | People who prefer starting with a conceptual overview should focus on | ||
23 | Section 1, though most readers will profit by reading this section at | ||
24 | some point. People who prefer to start with an API that they can then | ||
25 | experiment with should focus on Section 2. People who prefer to start | ||
26 | with example uses should focus on Sections 3 and 4. People who need to | ||
27 | understand the RCU implementation should focus on Section 5, then dive | ||
28 | into the kernel source code. People who reason best by analogy should | ||
29 | focus on Section 6. Section 7 serves as an index to the docbook API | ||
30 | documentation, and Section 8 is the traditional answer key. | ||
31 | |||
32 | So, start with the section that makes the most sense to you and your | ||
33 | preferred method of learning. If you need to know everything about | ||
34 | everything, feel free to read the whole thing -- but if you are really | ||
35 | that type of person, you have perused the source code and will therefore | ||
36 | never need this document anyway. ;-) | ||
37 | |||
38 | |||
39 | 1. RCU OVERVIEW | ||
40 | |||
41 | The basic idea behind RCU is to split updates into "removal" and | ||
42 | "reclamation" phases. The removal phase removes references to data items | ||
43 | within a data structure (possibly by replacing them with references to | ||
44 | new versions of these data items), and can run concurrently with readers. | ||
45 | The reason that it is safe to run the removal phase concurrently with | ||
46 | readers is that the semantics of modern CPUs guarantee that readers will see | ||
47 | either the old or the new version of the data structure rather than a | ||
48 | partially updated reference. The reclamation phase does the work of reclaiming | ||
49 | (e.g., freeing) the data items removed from the data structure during the | ||
50 | removal phase. Because reclaiming data items can disrupt any readers | ||
51 | concurrently referencing those data items, the reclamation phase must | ||
52 | not start until readers no longer hold references to those data items. | ||
53 | |||
54 | Splitting the update into removal and reclamation phases permits the | ||
55 | updater to perform the removal phase immediately, and to defer the | ||
56 | reclamation phase until all readers active during the removal phase have | ||
57 | completed, either by blocking until they finish or by registering a | ||
58 | callback that is invoked after they finish. Only readers that are active | ||
59 | during the removal phase need be considered, because any reader starting | ||
60 | after the removal phase will be unable to gain a reference to the removed | ||
61 | data items, and therefore cannot be disrupted by the reclamation phase. | ||
62 | |||
63 | So the typical RCU update sequence goes something like the following: | ||
64 | |||
65 | a. Remove pointers to a data structure, so that subsequent | ||
66 | readers cannot gain a reference to it. | ||
67 | |||
68 | b. Wait for all previous readers to complete their RCU read-side | ||
69 | critical sections. | ||
70 | |||
71 | c. At this point, there cannot be any readers who hold references | ||
72 | to the data structure, so it now may safely be reclaimed | ||
73 | (e.g., kfree()d). | ||
74 | |||
75 | Step (b) above is the key idea underlying RCU's deferred destruction. | ||
76 | The ability to wait until all readers are done allows RCU readers to | ||
77 | use much lighter-weight synchronization, in some cases, absolutely no | ||
78 | synchronization at all. In contrast, in more conventional lock-based | ||
79 | schemes, readers must use heavy-weight synchronization in order to | ||
80 | prevent an updater from deleting the data structure out from under them. | ||
81 | This is because lock-based updaters typically update data items in place, | ||
82 | and must therefore exclude readers. In contrast, RCU-based updaters | ||
83 | typically take advantage of the fact that writes to single aligned | ||
84 | pointers are atomic on modern CPUs, allowing atomic insertion, removal, | ||
85 | and replacement of data items in a linked structure without disrupting | ||
86 | readers. Concurrent RCU readers can then continue accessing the old | ||
87 | versions, and can dispense with the atomic operations, memory barriers, | ||
88 | and communications cache misses that are so expensive on present-day | ||
89 | SMP computer systems, even in absence of lock contention. | ||
90 | |||
91 | In the three-step procedure shown above, the updater is performing both | ||
92 | the removal and the reclamation step, but it is often helpful for an | ||
93 | entirely different thread to do the reclamation, as is in fact the case | ||
94 | in the Linux kernel's directory-entry cache (dcache). Even if the same | ||
95 | thread performs both the update step (step (a) above) and the reclamation | ||
96 | step (step (c) above), it is often helpful to think of them separately. | ||
97 | For example, RCU readers and updaters need not communicate at all, | ||
98 | but RCU provides implicit low-overhead communication between readers | ||
99 | and reclaimers, namely, in step (b) above. | ||
100 | |||
101 | So how the heck can a reclaimer tell when a reader is done, given | ||
102 | that readers are not doing any sort of synchronization operations??? | ||
103 | Read on to learn about how RCU's API makes this easy. | ||
104 | |||
105 | |||
106 | 2. WHAT IS RCU'S CORE API? | ||
107 | |||
108 | The core RCU API is quite small: | ||
109 | |||
110 | a. rcu_read_lock() | ||
111 | b. rcu_read_unlock() | ||
112 | c. synchronize_rcu() / call_rcu() | ||
113 | d. rcu_assign_pointer() | ||
114 | e. rcu_dereference() | ||
115 | |||
116 | There are many other members of the RCU API, but the rest can be | ||
117 | expressed in terms of these five, though most implementations instead | ||
118 | express synchronize_rcu() in terms of the call_rcu() callback API. | ||
119 | |||
120 | The five core RCU APIs are described below; the other 18 will be enumerated | ||
121 | later. See the kernel docbook documentation for more info, or look directly | ||
122 | at the function header comments. | ||
123 | |||
124 | rcu_read_lock() | ||
125 | |||
126 | void rcu_read_lock(void); | ||
127 | |||
128 | Used by a reader to inform the reclaimer that the reader is | ||
129 | entering an RCU read-side critical section. It is illegal | ||
130 | to block while in an RCU read-side critical section, though | ||
131 | kernels built with CONFIG_PREEMPT_RCU can preempt RCU read-side | ||
132 | critical sections. Any RCU-protected data structure accessed | ||
133 | during an RCU read-side critical section is guaranteed to remain | ||
134 | unreclaimed for the full duration of that critical section. | ||
135 | Reference counts may be used in conjunction with RCU to maintain | ||
136 | longer-term references to data structures. | ||
137 | |||
138 | rcu_read_unlock() | ||
139 | |||
140 | void rcu_read_unlock(void); | ||
141 | |||
142 | Used by a reader to inform the reclaimer that the reader is | ||
143 | exiting an RCU read-side critical section. Note that RCU | ||
144 | read-side critical sections may be nested and/or overlapping. | ||
145 | |||
146 | synchronize_rcu() | ||
147 | |||
148 | void synchronize_rcu(void); | ||
149 | |||
150 | Marks the end of updater code and the beginning of reclaimer | ||
151 | code. It does this by blocking until all pre-existing RCU | ||
152 | read-side critical sections on all CPUs have completed. | ||
153 | Note that synchronize_rcu() will -not- necessarily wait for | ||
154 | any subsequent RCU read-side critical sections to complete. | ||
155 | For example, consider the following sequence of events: | ||
156 | |||
157 |          CPU 0                  CPU 1                 CPU 2 | ||
158 |      ----------------- ------------------------- --------------- | ||
159 |  1.  rcu_read_lock() | ||
160 |  2.                    enters synchronize_rcu() | ||
161 |  3.                                              rcu_read_lock() | ||
162 |  4.  rcu_read_unlock() | ||
163 |  5.                    exits synchronize_rcu() | ||
164 |  6.                                              rcu_read_unlock() | ||
165 | |||
166 | To reiterate, synchronize_rcu() waits only for ongoing RCU | ||
167 | read-side critical sections to complete, not necessarily for | ||
168 | any that begin after synchronize_rcu() is invoked. | ||
169 | |||
170 | Of course, synchronize_rcu() does not necessarily return | ||
171 | -immediately- after the last pre-existing RCU read-side critical | ||
172 | section completes. For one thing, there might well be scheduling | ||
173 | delays. For another thing, many RCU implementations process | ||
174 | requests in batches in order to improve efficiencies, which can | ||
175 | further delay synchronize_rcu(). | ||
176 | |||
177 | Since synchronize_rcu() is the API that must figure out when | ||
178 | readers are done, its implementation is key to RCU. For RCU | ||
179 | to be useful in all but the most read-intensive situations, | ||
180 | synchronize_rcu()'s overhead must also be quite small. | ||
181 | |||
182 | The call_rcu() API is a callback form of synchronize_rcu(), | ||
183 | and is described in more detail in a later section. Instead of | ||
184 | blocking, it registers a function and argument which are invoked | ||
185 | after all ongoing RCU read-side critical sections have completed. | ||
186 | This callback variant is particularly useful in situations where | ||
187 | it is illegal to block. | ||
188 | |||
189 | rcu_assign_pointer() | ||
190 | |||
191 | typeof(p) rcu_assign_pointer(p, typeof(p) v); | ||
192 | |||
193 | Yes, rcu_assign_pointer() -is- implemented as a macro, though it | ||
194 | would be cool to be able to declare a function in this manner. | ||
195 | (Compiler experts will no doubt disagree.) | ||
196 | |||
197 | The updater uses this function to assign a new value to an | ||
198 | RCU-protected pointer, in order to safely communicate the change | ||
199 | in value from the updater to the reader. This function returns | ||
200 | the new value, and also executes any memory-barrier instructions | ||
201 | required for a given CPU architecture. | ||
202 | |||
203 | Perhaps more important, it serves to document which pointers | ||
204 | are protected by RCU. That said, rcu_assign_pointer() is most | ||
205 | frequently used indirectly, via the _rcu list-manipulation | ||
206 | primitives such as list_add_rcu(). | ||
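The publish-side ordering described above can be imitated in user space with
C11 atomics. This is only an analogy under stated assumptions, not how the
kernel implements rcu_assign_pointer(): gbl_ptr, publish() and read_a() are
hypothetical names, and the acquire load on the reader side is stronger than
the dependency ordering that rcu_dereference() actually needs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct foo {
	int a;
};

/* Hypothetical global playing the role of an RCU-protected pointer. */
static _Atomic(struct foo *) gbl_ptr;

/* Updater: initialize the structure first, then publish the pointer. */
static void publish(struct foo *p, int a)
{
	assert(p != NULL);
	p->a = a;
	/* Release ordering keeps the store to p->a before the pointer store. */
	atomic_store_explicit(&gbl_ptr, p, memory_order_release);
}

/* Reader: returns field a, or -1 if nothing has been published yet. */
static int read_a(void)
{
	struct foo *p = atomic_load_explicit(&gbl_ptr, memory_order_acquire);

	return p ? p->a : -1;
}
```

A reader that observes the new pointer is therefore guaranteed to observe the
initialized contents as well, which is the property the barrier provides.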
207 | |||
208 | rcu_dereference() | ||
209 | |||
210 | typeof(p) rcu_dereference(p); | ||
211 | |||
212 | Like rcu_assign_pointer(), rcu_dereference() must be implemented | ||
213 | as a macro. | ||
214 | |||
215 | The reader uses rcu_dereference() to fetch an RCU-protected | ||
216 | pointer, which returns a value that may then be safely | ||
217 | dereferenced. Note that rcu_dereference() does not actually | ||
218 | dereference the pointer; instead, it protects the pointer for | ||
219 | later dereferencing. It also executes any needed memory-barrier | ||
220 | instructions for a given CPU architecture. Currently, only Alpha | ||
221 | needs memory barriers within rcu_dereference() -- on other CPUs, | ||
222 | it compiles to nothing, not even a compiler directive. | ||
223 | |||
224 | Common coding practice uses rcu_dereference() to copy an | ||
225 | RCU-protected pointer to a local variable, then dereferences | ||
226 | this local variable, for example as follows: | ||
227 | |||
228 | p = rcu_dereference(head.next); | ||
229 | return p->data; | ||
230 | |||
231 | However, in this case, one could just as easily combine these | ||
232 | into one statement: | ||
233 | |||
234 | return rcu_dereference(head.next)->data; | ||
235 | |||
236 | If you are going to be fetching multiple fields from the | ||
237 | RCU-protected structure, using the local variable is of | ||
238 | course preferred. Repeated rcu_dereference() calls look | ||
239 | ugly and incur unnecessary overhead on Alpha CPUs. | ||
240 | |||
241 | Note that the value returned by rcu_dereference() is valid | ||
242 | only within the enclosing RCU read-side critical section. | ||
243 | For example, the following is -not- legal: | ||
244 | |||
245 | rcu_read_lock(); | ||
246 | p = rcu_dereference(head.next); | ||
247 | rcu_read_unlock(); | ||
248 | x = p->address; | ||
249 | rcu_read_lock(); | ||
250 | y = p->data; | ||
251 | rcu_read_unlock(); | ||
252 | |||
253 | Holding a reference from one RCU read-side critical section | ||
254 | to another is just as illegal as holding a reference from | ||
255 | one lock-based critical section to another! Similarly, | ||
256 | using a reference outside of the critical section in which | ||
257 | it was acquired is just as illegal as doing so with normal | ||
258 | locking. | ||
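For contrast, a legal variant of the fragment above fetches both fields inside
a single critical section. The sketch below assumes user-space stand-in macros
so that it compiles outside the kernel; they are NOT the real primitives, and
struct node and read_first() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for illustration only; the kernel versions do real work. */
#define rcu_read_lock()		do { } while (0)
#define rcu_read_unlock()	do { } while (0)
#define rcu_dereference(p)	(p)

struct node {
	struct node *next;
	long address;
	int data;
};

static struct node head;

/* Copy both fields out while still inside the one critical section. */
static void read_first(long *x, int *y)
{
	struct node *p;

	assert(x != NULL && y != NULL);
	rcu_read_lock();
	p = rcu_dereference(head.next);
	*x = p->address;
	*y = p->data;
	rcu_read_unlock();
	/* p must not be used past this point. */
}
```

Only the copied values, never the pointer itself, leave the critical section.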
259 | |||
260 | As with rcu_assign_pointer(), an important function of | ||
261 | rcu_dereference() is to document which pointers are protected | ||
262 | by RCU. And, again like rcu_assign_pointer(), rcu_dereference() | ||
263 | is typically used indirectly, via the _rcu list-manipulation | ||
264 | primitives, such as list_for_each_entry_rcu(). | ||
265 | |||
266 | The following diagram shows how each API communicates among the | ||
267 | reader, updater, and reclaimer. | ||
268 | |||
269 | |||
270 |         rcu_assign_pointer() | ||
271 |                                 +--------+ | ||
272 |         +---------------------->| reader |---------+ | ||
273 |         |                       +--------+         | | ||
274 |         |                           |              | | ||
275 |         |                           |              | Protect: | ||
276 |         |                           |              | rcu_read_lock() | ||
277 |         |                           |              | rcu_read_unlock() | ||
278 |         |        rcu_dereference()  |              | | ||
279 |    +---------+                      |              | | ||
280 |    | updater |<---------------------+              | | ||
281 |    +---------+                                     V | ||
282 |         |                                    +-----------+ | ||
283 |         +----------------------------------->| reclaimer | | ||
284 |                                              +-----------+ | ||
285 |           Defer: | ||
286 |           synchronize_rcu() & call_rcu() | ||
287 | |||
288 | |||
289 | The RCU infrastructure observes the time sequence of rcu_read_lock(), | ||
290 | rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in | ||
291 | order to determine when (1) synchronize_rcu() invocations may return | ||
292 | to their callers and (2) call_rcu() callbacks may be invoked. Efficient | ||
293 | implementations of the RCU infrastructure make heavy use of batching in | ||
294 | order to amortize their overhead over many uses of the corresponding APIs. | ||
295 | |||
296 | There are no fewer than three RCU mechanisms in the Linux kernel; the | ||
297 | diagram above shows the first one, which is by far the most commonly used. | ||
298 | The rcu_dereference() and rcu_assign_pointer() primitives are used for | ||
299 | all three mechanisms, but different defer and protect primitives are | ||
300 | used as follows: | ||
301 | |||
302 |         Defer                   Protect | ||
303 | |||
304 | a.      synchronize_rcu()       rcu_read_lock() / rcu_read_unlock() | ||
305 |         call_rcu() | ||
306 | |||
307 | b.      call_rcu_bh()           rcu_read_lock_bh() / rcu_read_unlock_bh() | ||
308 | |||
309 | c.      synchronize_sched()     preempt_disable() / preempt_enable() | ||
310 |                                 local_irq_save() / local_irq_restore() | ||
311 |                                 hardirq enter / hardirq exit | ||
312 |                                 NMI enter / NMI exit | ||
313 | |||
314 | These three mechanisms are used as follows: | ||
315 | |||
316 | a. RCU applied to normal data structures. | ||
317 | |||
318 | b. RCU applied to networking data structures that may be subjected | ||
319 | to remote denial-of-service attacks. | ||
320 | |||
321 | c. RCU applied to scheduler and interrupt/NMI-handler tasks. | ||
322 | |||
323 | Again, most uses will be of (a). The (b) and (c) cases are important | ||
324 | for specialized uses, but are relatively uncommon. | ||
325 | |||
326 | |||
327 | 3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? | ||
328 | |||
329 | This section shows a simple use of the core RCU API to protect a | ||
330 | global pointer to a dynamically allocated structure. More typical | ||
331 | uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt. | ||
332 | |||
333 | struct foo { | ||
334 | 	int a; | ||
335 | 	char b; | ||
336 | 	long c; | ||
337 | }; | ||
338 | DEFINE_SPINLOCK(foo_mutex); | ||
339 | |||
340 | struct foo *gbl_foo; | ||
341 | |||
342 | /* | ||
343 | * Create a new struct foo that is the same as the one currently | ||
344 | * pointed to by gbl_foo, except that field "a" is replaced | ||
345 | * with "new_a". Points gbl_foo to the new structure, and | ||
346 | * frees up the old structure after a grace period. | ||
347 | * | ||
348 | * Uses rcu_assign_pointer() to ensure that concurrent readers | ||
349 | * see the initialized version of the new structure. | ||
350 | * | ||
351 | * Uses synchronize_rcu() to ensure that any readers that might | ||
352 | * have references to the old structure complete before freeing | ||
353 | * the old structure. | ||
354 | */ | ||
355 | void foo_update_a(int new_a) | ||
356 | { | ||
357 | 	struct foo *new_fp; | ||
358 | 	struct foo *old_fp; | ||
359 | |||
360 | 	new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL); | ||
361 | 	spin_lock(&foo_mutex); | ||
362 | 	old_fp = gbl_foo; | ||
363 | 	*new_fp = *old_fp; | ||
364 | 	new_fp->a = new_a; | ||
365 | 	rcu_assign_pointer(gbl_foo, new_fp); | ||
366 | 	spin_unlock(&foo_mutex); | ||
367 | 	synchronize_rcu(); | ||
368 | 	kfree(old_fp); | ||
369 | } | ||
370 | |||
371 | /* | ||
372 | * Return the value of field "a" of the current gbl_foo | ||
373 | * structure. Use rcu_read_lock() and rcu_read_unlock() | ||
374 | * to ensure that the structure does not get deleted out | ||
375 | * from under us, and use rcu_dereference() to ensure that | ||
376 | * we see the initialized version of the structure (important | ||
377 | * for DEC Alpha and for people reading the code). | ||
378 | */ | ||
379 | int foo_get_a(void) | ||
380 | { | ||
381 | 	int retval; | ||
382 | |||
383 | 	rcu_read_lock(); | ||
384 | 	retval = rcu_dereference(gbl_foo)->a; | ||
385 | 	rcu_read_unlock(); | ||
386 | 	return retval; | ||
387 | } | ||
388 | |||
389 | So, to sum up: | ||
390 | |||
391 | o Use rcu_read_lock() and rcu_read_unlock() to guard RCU | ||
392 | read-side critical sections. | ||
393 | |||
394 | o Within an RCU read-side critical section, use rcu_dereference() | ||
395 | to dereference RCU-protected pointers. | ||
396 | |||
397 | o Use some solid scheme (such as locks or semaphores) to | ||
398 | keep concurrent updates from interfering with each other. | ||
399 | |||
400 | o Use rcu_assign_pointer() to update an RCU-protected pointer. | ||
401 | This primitive protects concurrent readers from the updater, | ||
402 | -not- concurrent updates from each other! You therefore still | ||
403 | need to use locking (or something similar) to keep concurrent | ||
404 | rcu_assign_pointer() primitives from interfering with each other. | ||
405 | |||
406 | o Use synchronize_rcu() -after- removing a data element from an | ||
407 | RCU-protected data structure, but -before- reclaiming/freeing | ||
408 | the data element, in order to wait for the completion of all | ||
409 | RCU read-side critical sections that might be referencing that | ||
410 | data item. | ||
411 | |||
412 | See checklist.txt for additional rules to follow when using RCU. | ||
413 | |||
414 | |||
415 | 4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? | ||
416 | |||
417 | In the example above, foo_update_a() blocks until a grace period elapses. | ||
418 | This is quite simple, but in some cases one cannot afford to wait so | ||
419 | long -- there might be other high-priority work to be done. | ||
420 | |||
421 | In such cases, one uses call_rcu() rather than synchronize_rcu(). | ||
422 | The call_rcu() API is as follows: | ||
423 | |||
424 | void call_rcu(struct rcu_head * head, | ||
425 | void (*func)(struct rcu_head *head)); | ||
426 | |||
427 | This function invokes func(head) after a grace period has elapsed. | ||
428 | This invocation might happen from either softirq or process context, | ||
429 | so the function is not permitted to block. The foo struct needs to | ||
430 | have an rcu_head structure added, perhaps as follows: | ||
431 | |||
432 | struct foo { | ||
433 | 	int a; | ||
434 | 	char b; | ||
435 | 	long c; | ||
436 | 	struct rcu_head rcu; | ||
437 | }; | ||
438 | |||
439 | The foo_update_a() function might then be written as follows: | ||
440 | |||
441 | /* | ||
442 | * Create a new struct foo that is the same as the one currently | ||
443 | * pointed to by gbl_foo, except that field "a" is replaced | ||
444 | * with "new_a". Points gbl_foo to the new structure, and | ||
445 | * frees up the old structure after a grace period. | ||
446 | * | ||
447 | * Uses rcu_assign_pointer() to ensure that concurrent readers | ||
448 | * see the initialized version of the new structure. | ||
449 | * | ||
450 | * Uses call_rcu() to ensure that any readers that might have | ||
451 | * references to the old structure complete before freeing the | ||
452 | * old structure. | ||
453 | */ | ||
454 | void foo_update_a(int new_a) | ||
455 | { | ||
456 | 	struct foo *new_fp; | ||
457 | 	struct foo *old_fp; | ||
458 | |||
459 | 	new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL); | ||
460 | 	spin_lock(&foo_mutex); | ||
461 | 	old_fp = gbl_foo; | ||
462 | 	*new_fp = *old_fp; | ||
463 | 	new_fp->a = new_a; | ||
464 | 	rcu_assign_pointer(gbl_foo, new_fp); | ||
465 | 	spin_unlock(&foo_mutex); | ||
466 | 	call_rcu(&old_fp->rcu, foo_reclaim); | ||
467 | } | ||
468 | |||
469 | The foo_reclaim() function might appear as follows: | ||
470 | |||
471 | void foo_reclaim(struct rcu_head *rp) | ||
472 | { | ||
473 | 	struct foo *fp = container_of(rp, struct foo, rcu); | ||
474 | |||
475 | 	kfree(fp); | ||
476 | } | ||
477 | |||
478 | The container_of() primitive is a macro that, given a pointer into a | ||
479 | struct, the type of the struct, and the pointed-to field within the | ||
480 | struct, returns a pointer to the beginning of the struct. | ||
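To make the pointer arithmetic concrete, here is a small user-space sketch of
container_of(). The macro below is a simplified form built on offsetof(); the
kernel's version adds type checking. The struct rcu_head layout shown is a
stand-in for illustration only, and foo_from_rcu() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space form: subtract the member's offset from its address. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for the kernel's struct rcu_head; the layout is illustrative. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

struct foo {
	int a;
	char b;
	long c;
	struct rcu_head rcu;
};

/* Recover the enclosing struct foo from a pointer to its rcu member. */
static struct foo *foo_from_rcu(struct rcu_head *rp)
{
	assert(rp != NULL);
	return container_of(rp, struct foo, rcu);
}
```

This is the same step foo_reclaim() performs before freeing the structure: the
callback receives only the embedded rcu_head, so it must map back to the
enclosing object.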
481 | |||
482 | The use of call_rcu() permits the caller of foo_update_a() to | ||
483 | immediately regain control, without needing to worry further about the | ||
484 | old version of the newly updated element. It also clearly shows the | ||
485 | RCU distinction between updater, namely foo_update_a(), and reclaimer, | ||
486 | namely foo_reclaim(). | ||
487 | |||
488 | The summary of advice is the same as for the previous section, except | ||
489 | that we are now using call_rcu() rather than synchronize_rcu(): | ||
490 | |||
491 | o Use call_rcu() -after- removing a data element from an | ||
492 | RCU-protected data structure in order to register a callback | ||
493 | function that will be invoked after the completion of all RCU | ||
494 | read-side critical sections that might be referencing that | ||
495 | data item. | ||
496 | |||
497 | Again, see checklist.txt for additional rules governing the use of RCU. | ||
498 | |||
499 | |||
500 | 5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? | ||
501 | |||
502 | One of the nice things about RCU is that it has extremely simple "toy" | ||
503 | implementations that are a good first step towards understanding the | ||
504 | production-quality implementations in the Linux kernel. This section | ||
505 | presents two such "toy" implementations of RCU, one that is implemented | ||
506 | in terms of familiar locking primitives, and another that more closely | ||
507 | resembles "classic" RCU. Both are way too simple for real-world use, | ||
508 | lacking both functionality and performance. However, they are useful | ||
509 | in getting a feel for how RCU works. See kernel/rcupdate.c for a | ||
510 | production-quality implementation, and see: | ||
511 | |||
512 | http://www.rdrop.com/users/paulmck/RCU | ||
513 | |||
514 | for papers describing the Linux kernel RCU implementation. The OLS'01 | ||
515 | and OLS'02 papers are a good introduction, and the dissertation provides | ||
516 | more details on the current implementation. | ||
517 | |||
518 | |||
519 | 5A. "TOY" IMPLEMENTATION #1: LOCKING | ||
520 | |||
521 | This section presents a "toy" RCU implementation that is based on | ||
522 | familiar locking primitives. Its overhead makes it a non-starter for | ||
523 | real-life use, as does its lack of scalability. It is also unsuitable | ||
524 | for realtime use, since it allows scheduling latency to "bleed" from | ||
525 | one read-side critical section to another. | ||
526 | |||
527 | However, it is probably the easiest implementation to relate to, so is | ||
528 | a good starting point. | ||
529 | |||
530 | It is extremely simple: | ||
531 | |||
532 | static DEFINE_RWLOCK(rcu_gp_mutex); | ||
533 | |||
534 | void rcu_read_lock(void) | ||
535 | { | ||
536 | 	read_lock(&rcu_gp_mutex); | ||
537 | } | ||
538 | |||
539 | void rcu_read_unlock(void) | ||
540 | { | ||
541 | 	read_unlock(&rcu_gp_mutex); | ||
542 | } | ||
543 | |||
544 | void synchronize_rcu(void) | ||
545 | { | ||
546 | 	write_lock(&rcu_gp_mutex); | ||
547 | 	write_unlock(&rcu_gp_mutex); | ||
548 | } | ||
549 | |||
550 | [You can ignore rcu_assign_pointer() and rcu_dereference() without | ||
551 | missing much. But here they are anyway. And whatever you do, don't | ||
552 | forget about them when submitting patches making use of RCU!] | ||
553 | |||
554 | #define rcu_assign_pointer(p, v) ({ \ | ||
555 | 	smp_wmb(); \ | ||
556 | 	(p) = (v); \ | ||
557 | }) | ||
558 | |||
559 | #define rcu_dereference(p) ({ \ | ||
560 | 	typeof(p) _________p1 = p; \ | ||
561 | 	smp_read_barrier_depends(); \ | ||
562 | 	(_________p1); \ | ||
563 | }) | ||
564 | |||
565 | |||
566 | The rcu_read_lock() and rcu_read_unlock() primitive read-acquire | ||
567 | and release a global reader-writer lock. The synchronize_rcu() | ||
568 | primitive write-acquires this same lock, then immediately releases | ||
569 | it. This means that once synchronize_rcu() exits, all RCU read-side | ||
570 | critical sections that were in progress before synchronize_rcu() was | ||
571 | called are guaranteed to have completed -- there is no way that | ||
572 | synchronize_rcu() would have been able to write-acquire the lock | ||
573 | otherwise. | ||
574 | |||
575 | It is possible to nest rcu_read_lock(), since reader-writer locks may | ||
576 | be recursively acquired. Note also that rcu_read_lock() is immune | ||
577 | from deadlock (an important property of RCU). The reason for this is | ||
578 | that the only thing that can block rcu_read_lock() is a synchronize_rcu(). | ||
579 | But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex, | ||
580 | so there can be no deadlock cycle. | ||
581 | |||
582 | Quick Quiz #1: Why is this argument naive? How could a deadlock | ||
583 | occur when using this algorithm in a real-world Linux | ||
584 | kernel? How could this deadlock be avoided? | ||
585 | |||
586 | |||
587 | 5B. "TOY" EXAMPLE #2: CLASSIC RCU | ||
588 | |||
589 | This section presents a "toy" RCU implementation that is based on | ||
590 | "classic RCU". It is also short on performance (but only for updates) and | ||
591 | on features such as hotplug CPU and the ability to run in CONFIG_PREEMPT | ||
592 | kernels. The definitions of rcu_dereference() and rcu_assign_pointer() | ||
593 | are the same as those shown in the preceding section, so they are omitted. | ||
594 | |||
595 | void rcu_read_lock(void) { } | ||
596 | |||
597 | void rcu_read_unlock(void) { } | ||
598 | |||
599 | void synchronize_rcu(void) | ||
600 | { | ||
601 | int cpu; | ||
602 | |||
603 | for_each_cpu(cpu) | ||
604 | run_on(cpu); | ||
605 | } | ||
606 | |||
607 | Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing. | ||
608 | This is the great strength of classic RCU in a non-preemptive kernel: | ||
609 | read-side overhead is precisely zero, at least on non-Alpha CPUs. | ||
610 | And there is absolutely no way that rcu_read_lock() can possibly | ||
611 | participate in a deadlock cycle! | ||
612 | |||
613 | The implementation of synchronize_rcu() simply schedules itself on each | ||
614 | CPU in turn. The run_on() primitive can be implemented straightforwardly | ||
615 | in terms of the sched_setaffinity() primitive. Of course, a somewhat less | ||
616 | "toy" implementation would restore the affinity upon completion rather | ||
617 | than just leaving all tasks running on the last CPU, but when I said | ||
618 | "toy", I meant -toy-! | ||
619 | |||
620 | So how the heck is this supposed to work??? | ||
621 | |||
622 | Remember that it is illegal to block while in an RCU read-side critical | ||
623 | section. Therefore, if a given CPU executes a context switch, we know | ||
624 | that it must have completed all preceding RCU read-side critical sections. | ||
625 | Once -all- CPUs have executed a context switch, then -all- preceding | ||
626 | RCU read-side critical sections will have completed. | ||
627 | |||
628 | So, suppose that we remove a data item from its structure and then invoke | ||
629 | synchronize_rcu(). Once synchronize_rcu() returns, we are guaranteed | ||
630 | that there are no RCU read-side critical sections holding a reference | ||
631 | to that data item, so we can safely reclaim it. | ||
632 | |||
633 | Quick Quiz #2: Give an example where Classic RCU's read-side | ||
634 | overhead is -negative-. | ||
635 | |||
636 | Quick Quiz #3: If it is illegal to block in an RCU read-side | ||
637 | critical section, what the heck do you do in | ||
638 | PREEMPT_RT, where normal spinlocks can block??? | ||
639 | |||
640 | |||
641 | 6. ANALOGY WITH READER-WRITER LOCKING | ||
642 | |||
643 | Although RCU can be used in many different ways, a very common use of | ||
644 | RCU is analogous to reader-writer locking. The following unified | ||
645 | diff shows how closely related RCU and reader-writer locking can be. | ||
646 | |||
647 | @@ -13,15 +14,15 @@ | ||
648 | struct list_head *lp; | ||
649 | struct el *p; | ||
650 | |||
651 | - read_lock(); | ||
652 | - list_for_each_entry(p, head, lp) { | ||
653 | + rcu_read_lock(); | ||
654 | + list_for_each_entry_rcu(p, head, lp) { | ||
655 | if (p->key == key) { | ||
656 | *result = p->data; | ||
657 | - read_unlock(); | ||
658 | + rcu_read_unlock(); | ||
659 | return 1; | ||
660 | } | ||
661 | } | ||
662 | - read_unlock(); | ||
663 | + rcu_read_unlock(); | ||
664 | return 0; | ||
665 | } | ||
666 | |||
667 | @@ -29,15 +30,16 @@ | ||
668 | { | ||
669 | struct el *p; | ||
670 | |||
671 | - write_lock(&listmutex); | ||
672 | + spin_lock(&listmutex); | ||
673 | list_for_each_entry(p, head, lp) { | ||
674 | if (p->key == key) { | ||
675 | list_del(&p->list); | ||
676 | - write_unlock(&listmutex); | ||
677 | + spin_unlock(&listmutex); | ||
678 | + synchronize_rcu(); | ||
679 | kfree(p); | ||
680 | return 1; | ||
681 | } | ||
682 | } | ||
683 | - write_unlock(&listmutex); | ||
684 | + spin_unlock(&listmutex); | ||
685 | return 0; | ||
686 | } | ||
687 | |||
688 | Or, for those who prefer a side-by-side listing: | ||
689 | |||
690 | 1 struct el { 1 struct el { | ||
691 | 2 struct list_head list; 2 struct list_head list; | ||
692 | 3 long key; 3 long key; | ||
693 | 4 spinlock_t mutex; 4 spinlock_t mutex; | ||
694 | 5 int data; 5 int data; | ||
695 | 6 /* Other data fields */ 6 /* Other data fields */ | ||
696 | 7 }; 7 }; | ||
697 | 8 spinlock_t listmutex; 8 spinlock_t listmutex; | ||
698 | 9 struct el head; 9 struct el head; | ||
699 | |||
700 | 1 int search(long key, int *result) 1 int search(long key, int *result) | ||
701 | 2 { 2 { | ||
702 | 3 struct list_head *lp; 3 struct list_head *lp; | ||
703 | 4 struct el *p; 4 struct el *p; | ||
704 | 5 5 | ||
705 | 6 read_lock(); 6 rcu_read_lock(); | ||
706 | 7 list_for_each_entry(p, head, lp) { 7 list_for_each_entry_rcu(p, head, lp) { | ||
707 | 8 if (p->key == key) { 8 if (p->key == key) { | ||
708 | 9 *result = p->data; 9 *result = p->data; | ||
709 | 10 read_unlock(); 10 rcu_read_unlock(); | ||
710 | 11 return 1; 11 return 1; | ||
711 | 12 } 12 } | ||
712 | 13 } 13 } | ||
713 | 14 read_unlock(); 14 rcu_read_unlock(); | ||
714 | 15 return 0; 15 return 0; | ||
715 | 16 } 16 } | ||
716 | |||
717 | 1 int delete(long key) 1 int delete(long key) | ||
718 | 2 { 2 { | ||
719 | 3 struct el *p; 3 struct el *p; | ||
720 | 4 4 | ||
721 | 5 write_lock(&listmutex); 5 spin_lock(&listmutex); | ||
722 | 6 list_for_each_entry(p, head, lp) { 6 list_for_each_entry(p, head, lp) { | ||
723 | 7 if (p->key == key) { 7 if (p->key == key) { | ||
724 | 8 list_del(&p->list); 8 list_del(&p->list); | ||
725 | 9 write_unlock(&listmutex); 9 spin_unlock(&listmutex); | ||
726 | 10 synchronize_rcu(); | ||
727 | 10 kfree(p); 11 kfree(p); | ||
728 | 11 return 1; 12 return 1; | ||
729 | 12 } 13 } | ||
730 | 13 } 14 } | ||
731 | 14 write_unlock(&listmutex); 15 spin_unlock(&listmutex); | ||
732 | 15 return 0; 16 return 0; | ||
733 | 16 } 17 } | ||
734 | |||
735 | Either way, the differences are quite small. Read-side locking moves | ||
736 | to rcu_read_lock() and rcu_read_unlock(), update-side locking moves | ||
737 | from a reader-writer lock to a simple spinlock, and a synchronize_rcu() | ||
738 | precedes the kfree(). | ||
739 | |||
740 | However, there is one potential catch: the read-side and update-side | ||
741 | critical sections can now run concurrently. In many cases, this will | ||
742 | not be a problem, but it is necessary to check carefully regardless. | ||
743 | For example, if multiple independent list updates must be seen as | ||
744 | a single atomic update, converting to RCU will require special care. | ||
745 | |||
746 | Also, the presence of synchronize_rcu() means that the RCU version of | ||
747 | delete() can now block. If this is a problem, there is a callback-based | ||
748 | mechanism that never blocks, namely call_rcu(), that can be used in | ||
749 | place of synchronize_rcu(). | ||
750 | |||
751 | |||
752 | 7. FULL LIST OF RCU APIs | ||
753 | |||
754 | The RCU APIs are documented in docbook-format header comments in the | ||
755 | Linux-kernel source code, but it helps to have a full list of the | ||
756 | APIs, since there does not appear to be a way to categorize them | ||
757 | in docbook. Here is the list, by category. | ||
758 | |||
759 | Markers for RCU read-side critical sections: | ||
760 | |||
761 | rcu_read_lock | ||
762 | rcu_read_unlock | ||
763 | rcu_read_lock_bh | ||
764 | rcu_read_unlock_bh | ||
765 | |||
766 | RCU pointer/list traversal: | ||
767 | |||
768 | rcu_dereference | ||
769 | list_for_each_rcu (to be deprecated in favor of | ||
770 | list_for_each_entry_rcu) | ||
771 | list_for_each_safe_rcu (deprecated, not used) | ||
772 | list_for_each_entry_rcu | ||
773 | list_for_each_continue_rcu (to be deprecated in favor of new | ||
774 | list_for_each_entry_continue_rcu) | ||
775 | hlist_for_each_rcu (to be deprecated in favor of | ||
776 | hlist_for_each_entry_rcu) | ||
777 | hlist_for_each_entry_rcu | ||
778 | |||
779 | RCU pointer update: | ||
780 | |||
781 | rcu_assign_pointer | ||
782 | list_add_rcu | ||
783 | list_add_tail_rcu | ||
784 | list_del_rcu | ||
785 | list_replace_rcu | ||
786 | hlist_del_rcu | ||
787 | hlist_add_head_rcu | ||
788 | |||
789 | RCU grace period: | ||
790 | |||
791 | synchronize_kernel (deprecated) | ||
792 | synchronize_net | ||
793 | synchronize_sched | ||
794 | synchronize_rcu | ||
795 | call_rcu | ||
796 | call_rcu_bh | ||
797 | |||
798 | See the comment headers in the source code (or the docbook generated | ||
799 | from them) for more information. | ||
800 | |||
801 | |||
802 | 8. ANSWERS TO QUICK QUIZZES | ||
803 | |||
804 | Quick Quiz #1: Why is this argument naive? How could a deadlock | ||
805 | occur when using this algorithm in a real-world Linux | ||
806 | kernel? [Referring to the lock-based "toy" RCU | ||
807 | algorithm.] | ||
808 | |||
809 | Answer: Consider the following sequence of events: | ||
810 | |||
811 | 1. CPU 0 acquires some unrelated lock, call it | ||
812 | "problematic_lock". | ||
813 | |||
814 | 2. CPU 1 enters synchronize_rcu(), write-acquiring | ||
815 | rcu_gp_mutex. | ||
816 | |||
817 | 3. CPU 0 enters rcu_read_lock(), but must wait | ||
818 | because CPU 1 holds rcu_gp_mutex. | ||
819 | |||
820 | 4. CPU 1 is interrupted, and the irq handler | ||
821 | attempts to acquire problematic_lock. | ||
822 | |||
823 | The system is now deadlocked. | ||
824 | |||
825 | One way to avoid this deadlock is to use an approach like | ||
826 | that of CONFIG_PREEMPT_RT, where all normal spinlocks | ||
827 | become blocking locks, and all irq handlers execute in | ||
828 | the context of special tasks. In this case, in step 4 | ||
829 | above, the irq handler would block, allowing CPU 1 to | ||
830 | release rcu_gp_mutex, avoiding the deadlock. | ||
831 | |||
832 | Even in the absence of deadlock, this RCU implementation | ||
833 | allows latency to "bleed" from readers to other | ||
834 | readers through synchronize_rcu(). To see this, | ||
835 | consider task A in an RCU read-side critical section | ||
836 | (thus read-holding rcu_gp_mutex), task B blocked | ||
837 | attempting to write-acquire rcu_gp_mutex, and | ||
838 | task C blocked in rcu_read_lock() attempting to | ||
839 | read-acquire rcu_gp_mutex. Task A's RCU read-side | ||
840 | latency is holding up task C, albeit indirectly via | ||
841 | task B. | ||
842 | |||
843 | Realtime RCU implementations therefore use a counter-based | ||
844 | approach where tasks in RCU read-side critical sections | ||
845 | cannot be blocked by tasks executing synchronize_rcu(). | ||
846 | |||
847 | Quick Quiz #2: Give an example where Classic RCU's read-side | ||
848 | overhead is -negative-. | ||
849 | |||
850 | Answer: Imagine a single-CPU system with a non-CONFIG_PREEMPT | ||
851 | kernel where a routing table is used by process-context | ||
852 | code, but can be updated by irq-context code (for example, | ||
853 | by an "ICMP REDIRECT" packet). The usual way of handling | ||
854 | this would be to have the process-context code disable | ||
855 | interrupts while searching the routing table. Use of | ||
856 | RCU allows such interrupt-disabling to be dispensed with. | ||
857 | Thus, without RCU, you pay the cost of disabling interrupts, | ||
858 | and with RCU you don't. | ||
859 | |||
860 | One can argue that the overhead of RCU in this | ||
861 | case is negative with respect to the single-CPU | ||
862 | interrupt-disabling approach. Others might argue that | ||
863 | the overhead of RCU is merely zero, and that replacing | ||
864 | the positive overhead of the interrupt-disabling scheme | ||
865 | with the zero-overhead RCU scheme does not constitute | ||
866 | negative overhead. | ||
867 | |||
868 | In real life, of course, things are more complex. But | ||
869 | even the theoretical possibility of negative overhead for | ||
870 | a synchronization primitive is a bit unexpected. ;-) | ||
871 | |||
872 | Quick Quiz #3: If it is illegal to block in an RCU read-side | ||
873 | critical section, what the heck do you do in | ||
874 | PREEMPT_RT, where normal spinlocks can block??? | ||
875 | |||
876 | Answer: Just as PREEMPT_RT permits preemption of spinlock | ||
877 | critical sections, it permits preemption of RCU | ||
878 | read-side critical sections. It also permits | ||
879 | spinlocks blocking while in RCU read-side critical | ||
880 | sections. | ||
881 | |||
882 | Why the apparent inconsistency? Because it is | ||
883 | possible to use priority boosting to keep the RCU | ||
884 | grace periods short if need be (for example, if running | ||
885 | short of memory). In contrast, if blocking waiting | ||
886 | for (say) network reception, there is no way to know | ||
887 | what should be boosted. Especially given that the | ||
888 | process we need to boost might well be a human being | ||
889 | who just went out for a pizza or something. And although | ||
890 | a computer-operated cattle prod might arouse serious | ||
891 | interest, it might also provoke serious objections. | ||
892 | Besides, how does the computer know what pizza parlor | ||
893 | the human being went to??? | ||
894 | |||
895 | |||
896 | ACKNOWLEDGEMENTS | ||
897 | |||
898 | My thanks to the people who helped make this human-readable, including | ||
899 | Jon Walpole, Josh Triplett, Serge Hallyn, and Suzanne Wood. | ||
900 | |||
901 | |||
902 | For more information, see http://www.rdrop.com/users/paulmck/RCU. | ||
diff --git a/Documentation/applying-patches.txt b/Documentation/applying-patches.txt new file mode 100644 index 000000000000..681e426e2482 --- /dev/null +++ b/Documentation/applying-patches.txt | |||
@@ -0,0 +1,439 @@ | |||
1 | |||
2 | Applying Patches To The Linux Kernel | ||
3 | ------------------------------------ | ||
4 | |||
5 | (Written by Jesper Juhl, August 2005) | ||
6 | |||
7 | |||
8 | |||
9 | A frequently asked question on the Linux Kernel Mailing List is how to apply | ||
10 | a patch to the kernel or, more specifically, what base kernel a patch for | ||
11 | one of the many trees/branches should be applied to. Hopefully this document | ||
12 | will explain this to you. | ||
13 | |||
14 | In addition to explaining how to apply and revert patches, a brief | ||
15 | description of the different kernel trees (and examples of how to apply | ||
16 | their specific patches) is also provided. | ||
17 | |||
18 | |||
19 | What is a patch? | ||
20 | --- | ||
21 | A patch is a small text document containing a delta of changes between two | ||
22 | different versions of a source tree. Patches are created with the `diff' | ||
23 | program. | ||
24 | To correctly apply a patch you need to know what base it was generated from | ||
25 | and what new version the patch will change the source tree into. These | ||
26 | should both be present in the patch file metadata or be possible to deduce | ||
27 | from the filename. | ||
28 | |||
29 | |||
30 | How do I apply or revert a patch? | ||
31 | --- | ||
32 | You apply a patch with the `patch' program. The patch program reads a diff | ||
33 | (or patch) file and makes the changes to the source tree described in it. | ||
34 | |||
35 | Patches for the Linux kernel are generated relative to the parent directory | ||
36 | holding the kernel source dir. | ||
37 | |||
38 | This means that paths to files inside the patch file contain the name of the | ||
39 | kernel source directories it was generated against (or some other directory | ||
40 | names like "a/" and "b/"). | ||
41 | Since this is unlikely to match the name of the kernel source dir on your | ||
42 | local machine (but is often useful info to see what version an otherwise | ||
43 | unlabeled patch was generated against) you should change into your kernel | ||
44 | source directory and then strip the first element of the path from filenames | ||
45 | in the patch file when applying it (the -p1 argument to `patch' does this). | ||
46 | |||
47 | To revert a previously applied patch, use the -R argument to patch. | ||
48 | So, if you applied a patch like this: | ||
49 | patch -p1 < ../patch-x.y.z | ||
50 | |||
51 | You can revert (undo) it like this: | ||
52 | patch -R -p1 < ../patch-x.y.z | ||
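The apply/revert cycle above can be tried safely on a throwaway tree. This is only a sketch: the directory names (linux-old, linux-new, mytree) and the file kernel/demo.c are made up for illustration, and it assumes the usual `diff' and `patch' programs are installed.

```shell
# Build two tiny hypothetical "trees" and diff them from their parent dir,
# the same way kernel patches are generated.
work=$(mktemp -d)
cd "$work"
mkdir -p linux-old/kernel linux-new/kernel
printf 'int a;\nint b;\n' > linux-old/kernel/demo.c
printf 'int a;\nint c;\n' > linux-new/kernel/demo.c
# diff exits with status 1 when the trees differ, so ignore its status.
diff -urN linux-old linux-new > patch-demo || true

# The local tree has a different name than the one in the patch paths,
# which is exactly why the first path element gets stripped with -p1.
cp -r linux-old mytree
cd mytree
patch -p1 < ../patch-demo       # apply
applied=$(cat kernel/demo.c)
patch -R -p1 < ../patch-demo    # revert (undo)
reverted=$(cat kernel/demo.c)
```

After this runs, $applied holds the patched content and $reverted holds the original content again.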
53 | |||
54 | |||
55 | How do I feed a patch/diff file to `patch'? | ||
56 | --- | ||
57 | This (as usual with Linux and other UNIX-like operating systems) can be | ||
58 | done in several different ways. | ||
59 | In all the examples below I feed the file (in uncompressed form) to patch | ||
60 | via stdin using the following syntax: | ||
61 | patch -p1 < path/to/patch-x.y.z | ||
62 | |||
63 | If you just want to be able to follow the examples below and don't want to | ||
64 | know of more than one way to use patch, then you can stop reading this | ||
65 | section here. | ||
66 | |||
67 | Patch can also get the name of the file to use via the -i argument, like | ||
68 | this: | ||
69 | patch -p1 -i path/to/patch-x.y.z | ||
70 | |||
71 | If your patch file is compressed with gzip or bzip2 and you don't want to | ||
72 | uncompress it before applying it, then you can feed it to patch like this | ||
73 | instead: | ||
74 | zcat path/to/patch-x.y.z.gz | patch -p1 | ||
75 | bzcat path/to/patch-x.y.z.bz2 | patch -p1 | ||
76 | |||
77 | If you wish to uncompress the patch file by hand first before applying it | ||
78 | (what I assume you've done in the examples below), then you simply run | ||
79 | gunzip or bunzip2 on the file - like this: | ||
80 | gunzip patch-x.y.z.gz | ||
81 | bunzip2 patch-x.y.z.bz2 | ||
82 | |||
83 | Which will leave you with a plain text patch-x.y.z file that you can feed to | ||
84 | patch via stdin or the -i argument, as you prefer. | ||
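Here is a small sketch of feeding a compressed patch to patch through a pipe. The file names (demo.patch, f.txt) and directories are hypothetical, and I use `gzip -dc', which should behave like zcat for .gz files, purely so the sketch does not depend on which zcat is installed.

```shell
work=$(mktemp -d)
cd "$work"
mkdir a b
printf 'hello\n' > a/f.txt
printf 'hello, world\n' > b/f.txt
diff -urN a b > demo.patch || true   # diff returns 1 when files differ
gzip -c demo.patch > demo.patch.gz   # keep the plain file, write a .gz copy

cp -r a tree && cd tree
# Decompress to stdout and pipe straight into patch; nothing is unpacked on disk.
gzip -dc ../demo.patch.gz | patch -p1
result=$(cat f.txt)
```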
85 | |||
86 | A few other nice arguments for patch are -s, which causes patch to be silent | ||
87 | except for errors (nice for keeping errors from scrolling off the screen | ||
88 | too fast), and --dry-run, which causes patch to just print a listing of | ||
89 | what would happen without actually making any changes. Finally, --verbose | ||
90 | tells patch to print more information about the work being done. | ||
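A quick way to convince yourself that --dry-run really leaves the tree alone is to try it on scratch files. The names below (demo.patch, file.txt, tree) are invented for the example, and the sketch assumes GNU patch.

```shell
work=$(mktemp -d)
cd "$work"
mkdir a b
printf 'one\ntwo\n' > a/file.txt
printf 'one\n2\n' > b/file.txt
diff -urN a b > demo.patch || true      # diff returns 1 when files differ

cp -r a tree && cd tree
patch -p1 --dry-run -i ../demo.patch    # reports what would happen, changes nothing
after_dry=$(cat file.txt)
patch -p1 -s -i ../demo.patch           # the real run, silent unless there are errors
after_real=$(cat file.txt)
```

Running the dry run first and the real run second is a cheap habit when you are unsure whether a patch matches your tree.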
91 | |||
92 | |||
93 | Common errors when patching | ||
94 | --- | ||
95 | When patch applies a patch file it attempts to verify the sanity of the | ||
96 | file in different ways. | ||
97 | Checking that the file looks like a valid patch file and checking that the | ||
98 | code around the bits being modified matches the context provided in the | ||
99 | patch are just two of the basic sanity checks patch does. | ||
100 | |||
101 | If patch encounters something that doesn't look quite right it has two | ||
102 | options. It can either refuse to apply the changes and abort or it can try | ||
103 | to find a way to make the patch apply with a few minor changes. | ||
104 | |||
105 | One example of something that's not 'quite right' that patch will attempt to | ||
106 | fix up is if all the context matches, the lines being changed match, but the | ||
107 | line numbers are different. This can happen, for example, if the patch makes | ||
108 | a change in the middle of the file but for some reason a few lines have | ||
109 | been added or removed near the beginning of the file. In that case | ||
110 | everything looks good; it has just moved up or down a bit, and patch will | ||
111 | usually adjust the line numbers and apply the patch. | ||
112 | |||
113 | Whenever patch applies a patch that it had to modify a bit to make it fit | ||
114 | it'll tell you about it by saying the patch applied with 'fuzz'. | ||
115 | You should be wary of such changes since even though patch probably got it | ||
116 | right it doesn't /always/ get it right, and the result will sometimes be | ||
117 | wrong. | ||
118 | |||
119 | When patch encounters a change that it can't fix up with fuzz it rejects it | ||
120 | outright and leaves a file with a .rej extension (a reject file). You can | ||
121 | read this file to see exactly what change couldn't be applied, so you can | ||
122 | go fix it up by hand if you wish. | ||
123 | |||
124 | If you don't have any third party patches applied to your kernel source, but | ||
125 | only patches from kernel.org and you apply the patches in the correct order, | ||
126 | and have made no modifications yourself to the source files, then you should | ||
127 | never see a fuzz or reject message from patch. If you do see such messages | ||
128 | anyway, then there's a high risk that either your local source tree or the | ||
129 | patch file is corrupted in some way. In that case you should probably try | ||
130 | redownloading the patch and if things are still not OK then you'd be advised | ||
131 | to start with a fresh tree downloaded in full from kernel.org. | ||
132 | |||
133 | Let's look a bit more at some of the messages patch can produce. | ||
134 | |||
135 | If patch stops and presents a "File to patch:" prompt, then patch could not | ||
136 | find a file to be patched. Most likely you forgot to specify -p1 or you are | ||
137 | in the wrong directory. Less often, you'll find patches that need to be | ||
138 | applied with -p0 instead of -p1 (reading the patch file should reveal if | ||
139 | this is the case - if so, then this is an error by the person who created | ||
140 | the patch but is not fatal). | ||
141 | |||
142 | If you get "Hunk #2 succeeded at 1887 with fuzz 2 (offset 7 lines)." or a | ||
143 | message similar to that, then it means that patch had to adjust the location | ||
144 | of the change (in this example it needed to move 7 lines from where it | ||
145 | expected to make the change to make it fit). | ||
146 | The resulting file may or may not be OK, depending on the reason the file | ||
147 | was different than expected. | ||
148 | This often happens if you try to apply a patch that was generated against a | ||
149 | different kernel version than the one you are trying to patch. | ||
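The offset case is easy to reproduce on a scratch file. In this sketch (all names hypothetical, GNU patch assumed) the patch is generated against a 12-line file, but the file being patched has two extra lines at the top, so patch should report that the hunk succeeded with an offset.

```shell
work=$(mktemp -d)
cd "$work"
mkdir a b tree
seq 1 12 | sed 's/^/line /' > a/f.txt               # "line 1" .. "line 12"
sed 's/^line 6$/line six/' a/f.txt > b/f.txt         # change one line in the middle
diff -urN a b > demo.patch || true                   # diff returns 1 when files differ

# The tree to be patched has two extra lines near the beginning.
{ printf 'extra 1\nextra 2\n'; cat a/f.txt; } > tree/f.txt
cd tree
patch -p1 < ../demo.patch      # context still matches, just two lines further down
changed=$(sed -n '8p' f.txt)   # original line 6 now lives on line 8
total=$(wc -l < f.txt | tr -d ' ')
```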
150 | |||
151 | If you get a message like "Hunk #3 FAILED at 2387.", then it means that the | ||
152 | patch could not be applied correctly and the patch program was unable to | ||
153 | fuzz its way through. This will generate a .rej file with the change that | ||
154 | caused the patch to fail and also a .orig file showing you the original | ||
155 | content that couldn't be changed. | ||
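To see a failed hunk and its reject file without risking a real tree, something like the following sketch can be used. The file and directory names are made up, and GNU patch is assumed. The target file shares no lines with the file the patch was generated from, so there is nothing for patch to fuzz its way through.

```shell
work=$(mktemp -d)
cd "$work"
mkdir a b tree
printf 'alpha\nbeta\ngamma\n' > a/f.txt
printf 'alpha\nBETA\ngamma\n' > b/f.txt
diff -urN a b > demo.patch || true   # diff returns 1 when files differ

printf 'x\ny\nz\n' > tree/f.txt      # unrelated content: the hunk cannot match
cd tree
# patch exits non-zero when a hunk fails, so capture that instead of aborting.
if patch -p1 < ../demo.patch; then status=applied; else status=rejected; fi
ls                                   # f.txt.rej should now sit next to f.txt
```

The .rej file contains the hunk that could not be applied, ready to be merged by hand.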
156 | |||
157 | If you get "Reversed (or previously applied) patch detected! Assume -R? [n]" | ||
158 | then patch detected that the change contained in the patch seems to have | ||
159 | already been made. | ||
160 | If you actually did apply this patch previously and you just re-applied it | ||
161 | in error, then just say [n]o and abort this patch. If you applied this patch | ||
162 | previously and actually intended to revert it, but forgot to specify -R, | ||
163 | then you can say [y]es here to make patch revert it for you. | ||
164 | This can also happen if the creator of the patch reversed the source and | ||
165 | destination directories when creating the patch, and in that case reverting | ||
166 | the patch will in fact apply it. | ||
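The double-apply situation can be sketched as follows, with hypothetical file names. Note one assumption that goes beyond the text above: instead of answering the prompt, this uses GNU patch's -N (--forward) argument, which makes patch skip a patch that looks reversed or already applied rather than asking.

```shell
work=$(mktemp -d)
cd "$work"
mkdir a b
printf 'old line\n' > a/f.txt
printf 'new line\n' > b/f.txt
diff -urN a b > demo.patch || true     # diff returns 1 when files differ

cp -r a tree && cd tree
patch -p1 < ../demo.patch              # first application succeeds
# Applying it again by mistake: -N skips the patch instead of prompting.
# patch reports the skipped hunk and exits non-zero, hence the "|| true".
patch -N -p1 < ../demo.patch || true
twice=$(cat f.txt)
patch -R -p1 < ../demo.patch           # an intentional revert still works
undone=$(cat f.txt)
```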
167 | |||
168 | A message similar to "patch: **** unexpected end of file in patch" or "patch | ||
169 | unexpectedly ends in middle of line" means that patch could make no sense of | ||
170 | the file you fed to it. Either your download is broken or you tried to feed | ||
171 | patch a compressed patch file without uncompressing it first. | ||
172 | |||
173 | As I already mentioned above, these errors should never happen if you apply | ||
174 | a patch from kernel.org to the correct version of an unmodified source tree. | ||
175 | So if you get these errors with kernel.org patches then you should probably | ||
176 | assume that either your patch file or your tree is broken and I'd advise you | ||
177 | to start over with a fresh download of a full kernel tree and the patch you | ||
178 | wish to apply. | ||
179 | |||
180 | |||
181 | Are there any alternatives to `patch'? | ||
182 | --- | ||
183 | Yes, there are alternatives. You can use the `interdiff' program | ||
184 | (http://cyberelk.net/tim/patchutils/) to generate a patch representing the | ||
185 | differences between two patches and then apply the result. | ||
186 | This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single | ||
187 | step. The -z flag to interdiff will even let you feed it patches in gzip or | ||
188 | bzip2 compressed form directly without the use of zcat or bzcat or manual | ||
189 | decompression. | ||
190 | |||
191 | Here's how you'd go from 2.6.12.2 to 2.6.12.3 in a single step: | ||
192 | interdiff -z ../patch-2.6.12.2.bz2 ../patch-2.6.12.3.gz | patch -p1 | ||
193 | |||
194 | Although interdiff may save you a step or two you are generally advised to | ||
195 | do the additional steps since interdiff can get things wrong in some cases. | ||
196 | |||
197 | Another alternative is `ketchup', which is a python script for automatic | ||
198 | downloading and applying of patches (http://www.selenic.com/ketchup/). | ||
199 | |||
200 | Other nice tools are diffstat, which shows a summary of changes made by a | ||
201 | patch; lsdiff, which displays a short listing of affected files in a patch | ||
202 | file, along with (optionally) the line numbers of the start of each patch; | ||
203 | and grepdiff, which displays a list of the files modified by a patch where | ||
204 | the patch contains a given regular expression. | ||
205 | |||
206 | |||
207 | Where can I download the patches? | ||
208 | --- | ||
209 | The patches are available at http://kernel.org/ | ||
210 | Most recent patches are linked from the front page, but they also have | ||
211 | specific homes. | ||
212 | |||
213 | The 2.6.x.y (-stable) and 2.6.x patches live at | ||
214 | ftp://ftp.kernel.org/pub/linux/kernel/v2.6/ | ||
215 | |||
216 | The -rc patches live at | ||
217 | ftp://ftp.kernel.org/pub/linux/kernel/v2.6/testing/ | ||
218 | |||
219 | The -git patches live at | ||
220 | ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots/ | ||
221 | |||
222 | The -mm kernels live at | ||
223 | ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/ | ||
224 | |||
225 | In place of ftp.kernel.org you can use ftp.cc.kernel.org, where cc is a | ||
226 | country code. This way you'll be downloading from a mirror site that's most | ||
227 | likely geographically closer to you, resulting in faster downloads for you, | ||
228 | less bandwidth used globally and less load on the main kernel.org servers - | ||
229 | these are good things, do use mirrors when possible. | ||
230 | |||
231 | |||
232 | The 2.6.x kernels | ||
233 | --- | ||
234 | These are the base stable releases released by Linus. The highest numbered | ||
235 | release is the most recent. | ||
236 | |||
237 | If regressions or other serious flaws are found then a -stable fix patch | ||
238 | will be released (see below) on top of this base. Once a new 2.6.x base | ||
239 | kernel is released, a patch is made available that is a delta between the | ||
240 | previous 2.6.x kernel and the new one. | ||
241 | |||
242 | To apply a patch moving from 2.6.11 to 2.6.12 you'd do the following (note | ||
243 | that such patches do *NOT* apply on top of 2.6.x.y kernels but on top of the | ||
244 | base 2.6.x kernel - if you need to move from 2.6.x.y to 2.6.x+1 you need to | ||
245 | first revert the 2.6.x.y patch). | ||
246 | |||
247 | Here are some examples: | ||
248 | |||
249 | # moving from 2.6.11 to 2.6.12 | ||
250 | $ cd ~/linux-2.6.11 # change to kernel source dir | ||
251 | $ patch -p1 < ../patch-2.6.12 # apply the 2.6.12 patch | ||
252 | $ cd .. | ||
253 | $ mv linux-2.6.11 linux-2.6.12 # rename source dir | ||
254 | |||
255 | # moving from 2.6.11.1 to 2.6.12 | ||
256 | $ cd ~/linux-2.6.11.1 # change to kernel source dir | ||
257 | $ patch -p1 -R < ../patch-2.6.11.1 # revert the 2.6.11.1 patch | ||
258 | # source dir is now 2.6.11 | ||
259 | $ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch | ||
260 | $ cd .. | ||
261 | $ mv linux-2.6.11.1 linux-2.6.12 # rename source dir | ||
262 | |||
263 | |||
264 | The 2.6.x.y kernels | ||
265 | --- | ||
266 | Kernels with 4-digit versions are -stable kernels. They contain small(ish) | ||
267 | critical fixes for security problems or significant regressions discovered | ||
268 | in a given 2.6.x kernel. | ||
269 | |||
270 | This is the recommended branch for users who want the most recent stable | ||
271 | kernel and are not interested in helping test development/experimental | ||
272 | versions. | ||
273 | |||
274 | If no 2.6.x.y kernel is available, then the highest numbered 2.6.x kernel is | ||
275 | the current stable kernel. | ||
276 | |||
277 | These patches are not incremental, meaning that for example the 2.6.12.3 | ||
278 | patch does not apply on top of the 2.6.12.2 kernel source, but rather on top | ||
279 | of the base 2.6.12 kernel source. | ||
280 | So, in order to apply the 2.6.12.3 patch to your existing 2.6.12.2 kernel | ||
281 | source you have to first back out the 2.6.12.2 patch (so you are left with a | ||
282 | base 2.6.12 kernel source) and then apply the new 2.6.12.3 patch. | ||
283 | |||
284 | Here's a small example: | ||
285 | |||
286 | $ cd ~/linux-2.6.12.2 # change into the kernel source dir | ||
287 | $ patch -p1 -R < ../patch-2.6.12.2 # revert the 2.6.12.2 patch | ||
288 | $ patch -p1 < ../patch-2.6.12.3 # apply the new 2.6.12.3 patch | ||
289 | $ cd .. | ||
290 | $ mv linux-2.6.12.2 linux-2.6.12.3 # rename the kernel source dir | ||
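The "not incremental" rule can be rehearsed on a toy tree before touching real kernel source. In this sketch the directories base, v2 and v3 stand in for 2.6.12, 2.6.12.2 and 2.6.12.3, and patch-x.2 and patch-x.3 are hypothetical patches that are both generated against the base, just like the real -stable patches.

```shell
work=$(mktemp -d)
cd "$work"
mkdir base v2 v3
printf 'fix: none\n' > base/VERSION
printf 'fix: 2\n'    > v2/VERSION
printf 'fix: 3\n'    > v3/VERSION
# Both patches are deltas from the base, not from each other.
diff -urN base v2 > patch-x.2 || true
diff -urN base v3 > patch-x.3 || true

cp -r v2 tree && cd tree          # we currently run "x.2"
patch -p1 -R < ../patch-x.2       # back out x.2, leaving the base
at_base=$(cat VERSION)
patch -p1 < ../patch-x.3          # now x.3 applies cleanly
at_v3=$(cat VERSION)
```

Trying to apply patch-x.3 directly on top of the v2 tree would be rejected, since its context describes the base rather than v2.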
291 | |||
292 | |||
293 | The -rc kernels | ||
294 | --- | ||
295 | These are release-candidate kernels. These are development kernels released | ||
296 | by Linus whenever he deems the current git (the kernel's source management | ||
297 | tool) tree to be in a reasonably sane state adequate for testing. | ||
298 | |||
299 | These kernels are not stable and you should expect occasional breakage if | ||
300 | you intend to run them. This is however the most stable of the main | ||
301 | development branches and is also what will eventually turn into the next | ||
302 | stable kernel, so it is important that it be tested by as many people as | ||
303 | possible. | ||
304 | |||
305 | This is a good branch to run for people who want to help out testing | ||
306 | development kernels but do not want to run some of the really experimental | ||
307 | stuff (such people should see the sections about -git and -mm kernels below). | ||
308 | |||
309 | The -rc patches are not incremental, they apply to a base 2.6.x kernel, just | ||
310 | like the 2.6.x.y patches described above. The kernel version before the -rcN | ||
311 | suffix denotes the version of the kernel that this -rc kernel will eventually | ||
312 | turn into. | ||
313 | So, 2.6.13-rc5 means that this is the fifth release candidate for the 2.6.13 | ||
314 | kernel and the patch should be applied on top of the 2.6.12 kernel source. | ||
315 | |||
316 | Here are 3 examples of how to apply these patches: | ||
317 | |||
318 | # first an example of moving from 2.6.12 to 2.6.13-rc3 | ||
319 | $ cd ~/linux-2.6.12 # change into the 2.6.12 source dir | ||
320 | $ patch -p1 < ../patch-2.6.13-rc3 # apply the 2.6.13-rc3 patch | ||
321 | $ cd .. | ||
322 | $ mv linux-2.6.12 linux-2.6.13-rc3 # rename the source dir | ||
323 | |||
324 | # now let's move from 2.6.13-rc3 to 2.6.13-rc5 | ||
325 | $ cd ~/linux-2.6.13-rc3 # change into the 2.6.13-rc3 dir | ||
326 | $ patch -p1 -R < ../patch-2.6.13-rc3 # revert the 2.6.13-rc3 patch | ||
327 | $ patch -p1 < ../patch-2.6.13-rc5 # apply the new 2.6.13-rc5 patch | ||
328 | $ cd .. | ||
329 | $ mv linux-2.6.13-rc3 linux-2.6.13-rc5 # rename the source dir | ||
330 | |||
331 | # finally let's try and move from 2.6.12.3 to 2.6.13-rc5 | ||
332 | $ cd ~/linux-2.6.12.3 # change to the kernel source dir | ||
333 | $ patch -p1 -R < ../patch-2.6.12.3 # revert the 2.6.12.3 patch | ||
334 | $ patch -p1 < ../patch-2.6.13-rc5 # apply new 2.6.13-rc5 patch | ||
335 | $ cd .. | ||
336 | $ mv linux-2.6.12.3 linux-2.6.13-rc5 # rename the kernel source dir | ||
337 | |||
338 | |||
339 | The -git kernels | ||
340 | --- | ||
341 | These are daily snapshots of Linus' kernel tree (managed in a git | ||
342 | repository, hence the name). | ||
343 | |||
344 | These patches are usually released daily and represent the current state of | ||
345 | Linus' tree. They are more experimental than -rc kernels since they are | ||
346 | generated automatically without even a cursory glance to see if they are | ||
347 | sane. | ||
348 | |||
349 | -git patches are not incremental and apply either to a base 2.6.x kernel or | ||
350 | a base 2.6.x-rc kernel - you can see which from their name. | ||
351 | A patch named 2.6.12-git1 applies to the 2.6.12 kernel source and a patch | ||
352 | named 2.6.13-rc3-git2 applies to the source of the 2.6.13-rc3 kernel. | ||
353 | |||
354 | Here are some examples of how to apply these patches: | ||
355 | |||
356 | # moving from 2.6.12 to 2.6.12-git1 | ||
357 | $ cd ~/linux-2.6.12 # change to the kernel source dir | ||
358 | $ patch -p1 < ../patch-2.6.12-git1 # apply the 2.6.12-git1 patch | ||
359 | $ cd .. | ||
360 | $ mv linux-2.6.12 linux-2.6.12-git1 # rename the kernel source dir | ||
361 | |||
362 | # moving from 2.6.12-git1 to 2.6.13-rc2-git3 | ||
363 | $ cd ~/linux-2.6.12-git1 # change to the kernel source dir | ||
364 | $ patch -p1 -R < ../patch-2.6.12-git1 # revert the 2.6.12-git1 patch | ||
365 | # we now have a 2.6.12 kernel | ||
366 | $ patch -p1 < ../patch-2.6.13-rc2 # apply the 2.6.13-rc2 patch | ||
367 | # the kernel is now 2.6.13-rc2 | ||
368 | $ patch -p1 < ../patch-2.6.13-rc2-git3 # apply the 2.6.13-rc2-git3 patch | ||
369 | # the kernel is now 2.6.13-rc2-git3 | ||
370 | $ cd .. | ||
371 | $ mv linux-2.6.12-git1 linux-2.6.13-rc2-git3 # rename source dir | ||
372 | |||
373 | |||
374 | The -mm kernels | ||
375 | --- | ||
376 | These are experimental kernels released by Andrew Morton. | ||
377 | |||
378 | The -mm tree serves as a sort of proving ground for new features and other | ||
379 | experimental patches. | ||
380 | Once a patch has proved its worth in -mm for a while Andrew pushes it on to | ||
381 | Linus for inclusion in mainline. | ||
382 | |||
383 | Although it's encouraged that patches flow to Linus via the -mm tree, this | ||
384 | is not always enforced. | ||
385 | Subsystem maintainers (or individuals) sometimes push their patches directly | ||
386 | to Linus, even though (or after) they have been merged and tested in -mm (or | ||
387 | sometimes even without prior testing in -mm). | ||
388 | |||
389 | You should generally strive to get your patches into mainline via -mm to | ||
390 | ensure maximum testing. | ||
391 | |||
392 | This branch is in constant flux and contains many experimental features, a | ||
393 | lot of debugging patches not appropriate for mainline etc and is the most | ||
394 | experimental of the branches described in this document. | ||
395 | |||
396 | These kernels are not appropriate for use on systems that are supposed to be | ||
397 | stable and they are more risky to run than any of the other branches (make | ||
398 | sure you have up-to-date backups - that goes for any experimental kernel but | ||
399 | even more so for -mm kernels). | ||
400 | |||
401 | These kernels in addition to all the other experimental patches they contain | ||
402 | usually also contain any changes in the mainline -git kernels available at | ||
403 | the time of release. | ||
404 | |||
405 | Testing of -mm kernels is greatly appreciated since the whole point of the | ||
406 | tree is to weed out regressions, crashes, data corruption bugs, build | ||
407 | breakage (and any other bug in general) before changes are merged into the | ||
408 | more stable mainline Linus tree. | ||
409 | But testers of -mm should be aware that breakage in this tree is more common | ||
410 | than in any other tree. | ||
411 | |||
412 | The -mm kernels are not released on a fixed schedule, but usually a few -mm | ||
413 | kernels are released in between each -rc kernel (1 to 3 is common). | ||
414 | The -mm kernels apply to either a base 2.6.x kernel (when no -rc kernels | ||
415 | have been released yet) or to a Linus -rc kernel. | ||
416 | |||
417 | Here are some examples of applying the -mm patches: | ||
418 | |||
419 | # moving from 2.6.12 to 2.6.12-mm1 | ||
420 | $ cd ~/linux-2.6.12 # change to the 2.6.12 source dir | ||
421 | $ patch -p1 < ../2.6.12-mm1 # apply the 2.6.12-mm1 patch | ||
422 | $ cd .. | ||
423 | $ mv linux-2.6.12 linux-2.6.12-mm1 # rename the source appropriately | ||
424 | |||
425 | # moving from 2.6.12-mm1 to 2.6.13-rc3-mm3 | ||
426 | $ cd ~/linux-2.6.12-mm1 | ||
427 | $ patch -p1 -R < ../2.6.12-mm1 # revert the 2.6.12-mm1 patch | ||
428 | # we now have a 2.6.12 source | ||
429 | $ patch -p1 < ../patch-2.6.13-rc3 # apply the 2.6.13-rc3 patch | ||
430 | # we now have a 2.6.13-rc3 source | ||
431 | $ patch -p1 < ../2.6.13-rc3-mm3 # apply the 2.6.13-rc3-mm3 patch | ||
432 | $ cd .. | ||
433 | $ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3 # rename the source dir | ||
434 | |||
435 | |||
436 | This concludes this list of explanations of the various kernel trees and I | ||
437 | hope you are now crystal clear on how to apply the various patches and help | ||
438 | test the kernel. | ||
439 | |||
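The naming rules described in applying-patches.txt above (2.6.x.y, -rc, -git and -mm patches) can be summarised as a small helper that reports which source tree a patch expects. This is only a sketch under the rules stated in that text; the function name `base_for` is hypothetical and not part of any kernel tooling, and it does not handle the first release of a series (for example 2.6.0-rc1).

```shell
#!/bin/sh
# base_for VERSION
# Print the kernel version a patch for VERSION must be applied to,
# following the rules in applying-patches.txt:
#   2.6.x.y          -> 2.6.x
#   2.6.x-rcN        -> 2.6.(x-1)
#   <base>-gitN      -> <base>   (base may be 2.6.x or 2.6.x-rcN)
#   <base>-mmN       -> <base>
base_for() {
    v=$1
    case "$v" in
        *-git[0-9]*|*-mm[0-9]*)
            # strip the trailing -gitN or -mmN suffix
            printf '%s\n' "${v%-*}"
            ;;
        *-rc[0-9]*)
            rel=${v%-rc*}      # e.g. 2.6.13
            minor=${rel##*.}   # e.g. 13
            printf '%s.%s\n' "${rel%.*}" "$((minor - 1))"
            ;;
        *.*.*.*)
            # 2.6.x.y stable patch: drop the last component
            printf '%s\n' "${v%.*}"
            ;;
        *)
            # already a base 2.6.x release
            printf '%s\n' "$v"
            ;;
    esac
}

base_for 2.6.12.3
base_for 2.6.13-rc5
base_for 2.6.13-rc3-git2
```

Moving between two patched trees then amounts to reverting down to the common base with `patch -p1 -R` and applying each patch on the way back up, as the examples in the text do by hand.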
diff --git a/Documentation/cpu-freq/cpufreq-stats.txt b/Documentation/cpu-freq/cpufreq-stats.txt index e2d1e760b4ba..6a82948ff4bd 100644 --- a/Documentation/cpu-freq/cpufreq-stats.txt +++ b/Documentation/cpu-freq/cpufreq-stats.txt | |||
@@ -36,7 +36,7 @@ cpufreq stats provides following statistics (explained in detail below). | |||
36 | 36 | ||
37 | All the statistics will be from the time the stats driver has been inserted | 37 | All the statistics will be from the time the stats driver has been inserted |
38 | to the time when a read of a particular statistic is done. Obviously, stats | 38 | to the time when a read of a particular statistic is done. Obviously, stats |
39 | driver will not have any information about the the frequcny transitions before | 39 | driver will not have any information about the frequency transitions before |
40 | the stats driver insertion. | 40 | the stats driver insertion. |
41 | 41 | ||
42 | -------------------------------------------------------------------------------- | 42 | -------------------------------------------------------------------------------- |
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index 47f4114fbf54..d17b7d2dd771 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt | |||
@@ -277,7 +277,7 @@ rewritten to the 'tasks' file of its cpuset. This is done to avoid | |||
277 | impacting the scheduler code in the kernel with a check for changes | 277 | impacting the scheduler code in the kernel with a check for changes |
278 | in a tasks processor placement. | 278 | in a tasks processor placement. |
279 | 279 | ||
280 | There is an exception to the above. If hotplug funtionality is used | 280 | There is an exception to the above. If hotplug functionality is used |
281 | to remove all the CPUs that are currently assigned to a cpuset, | 281 | to remove all the CPUs that are currently assigned to a cpuset, |
282 | then the kernel will automatically update the cpus_allowed of all | 282 | then the kernel will automatically update the cpus_allowed of all |
283 | tasks attached to CPUs in that cpuset to allow all CPUs. When memory | 283 | tasks attached to CPUs in that cpuset to allow all CPUs. When memory |
diff --git a/Documentation/crypto/descore-readme.txt b/Documentation/crypto/descore-readme.txt index 166474c2ee0b..16e9e6350755 100644 --- a/Documentation/crypto/descore-readme.txt +++ b/Documentation/crypto/descore-readme.txt | |||
@@ -1,4 +1,4 @@ | |||
1 | Below is the orginal README file from the descore.shar package. | 1 | Below is the original README file from the descore.shar package. |
2 | ------------------------------------------------------------------------------ | 2 | ------------------------------------------------------------------------------ |
3 | 3 | ||
4 | des - fast & portable DES encryption & decryption. | 4 | des - fast & portable DES encryption & decryption. |
diff --git a/Documentation/dvb/bt8xx.txt b/Documentation/dvb/bt8xx.txt index 4b8c326c6aac..cb63b7a93c82 100644 --- a/Documentation/dvb/bt8xx.txt +++ b/Documentation/dvb/bt8xx.txt | |||
@@ -1,55 +1,74 @@ | |||
1 | How to get the Nebula Electronics DigiTV, Pinnacle PCTV Sat, Twinhan DST + clones working | 1 | How to get the Nebula, PCTV and Twinhan DST cards working |
2 | ========================================================================================= | 2 | ========================================================= |
3 | 3 | ||
4 | 1) General information | 4 | This class of cards has a bt878a as the PCI interface, and |
5 | ====================== | 5 | require the bttv driver. |
6 | 6 | ||
7 | This class of cards has a bt878a chip as the PCI interface. | 7 | Please pay close attention to the warning about the bttv module |
8 | The different card drivers require the bttv driver to provide the means | 8 | options below for the DST card. |
9 | to access the i2c bus and the gpio pins of the bt8xx chipset. | ||
10 | 9 | ||
11 | 2) Compilation rules for Kernel >= 2.6.12 | 10 | 1) General information |
12 | ========================================= | 11 | ====================== |
13 | 12 | ||
14 | Enable the following options: | 13 | These drivers require the bttv driver to provide the means to access |
14 | the i2c bus and the gpio pins of the bt8xx chipset. | ||
15 | 15 | ||
16 | Because of this, you need to enable | ||
16 | "Device drivers" => "Multimedia devices" | 17 | "Device drivers" => "Multimedia devices" |
17 | => "Video For Linux" => "BT848 Video For Linux" | 18 | => "Video For Linux" => "BT848 Video For Linux" |
19 | |||
20 | Furthermore you need to enable | ||
18 | "Device drivers" => "Multimedia devices" => "Digital Video Broadcasting Devices" | 21 | "Device drivers" => "Multimedia devices" => "Digital Video Broadcasting Devices" |
19 | => "DVB for Linux" "DVB Core Support" "BT8xx based PCI cards" | 22 | => "DVB for Linux" "DVB Core Support" "BT8xx based PCI cards" |
20 | 23 | ||
21 | 3) Loading Modules, described by two approaches | 24 | 2) Loading Modules |
22 | =============================================== | 25 | ================== |
23 | 26 | ||
24 | In general you need to load the bttv driver, which will handle the gpio and | 27 | In general you need to load the bttv driver, which will handle the gpio and |
25 | i2c communication for us, plus the common dvb-bt8xx device driver, | 28 | i2c communication for us, plus the common dvb-bt8xx device driver. |
26 | which is called the backend. | 29 | The frontends for Nebula (nxt6000), Pinnacle PCTV (cx24110) and |
27 | The frontends for Nebula DigiTV (nxt6000), Pinnacle PCTV Sat (cx24110), | 30 | TwinHan (dst) are loaded automatically by the dvb-bt8xx device driver. |
28 | TwinHan DST + clones (dst and dst-ca) are loaded automatically by the backend. | ||
29 | For further details about TwinHan DST + clones see /Documentation/dvb/ci.txt. | ||
30 | 31 | ||
31 | 3a) The manual approach | 32 | 3a) Nebula / Pinnacle PCTV |
32 | ----------------------- | 33 | -------------------------- |
33 | 34 | ||
34 | Loading modules: | 35 | $ modprobe bttv (normally bttv is being loaded automatically by kmod) |
35 | modprobe bttv | 36 | $ modprobe dvb-bt8xx (or just place dvb-bt8xx in /etc/modules for automatic loading) |
36 | modprobe dvb-bt8xx | ||
37 | 37 | ||
38 | Unloading modules: | ||
39 | modprobe -r dvb-bt8xx | ||
40 | modprobe -r bttv | ||
41 | 38 | ||
42 | 3b) The automatic approach | 39 | 3b) TwinHan and Clones |
43 | -------------------------- | 40 | -------------------------- |
44 | 41 | ||
45 | If not already done by installation, place a line either in | 42 | $ modprobe bttv i2c_hw=1 card=0x71 |
46 | /etc/modules.conf or in /etc/modprobe.conf containing this text: | 43 | $ modprobe dvb-bt8xx |
47 | alias char-major-81 bttv | 44 | $ modprobe dst |
45 | |||
46 | The value 0x71 will override the PCI type detection for dvb-bt8xx, | ||
47 | which is necessary for TwinHan cards. | ||
48 | |||
49 | If you have an older card (blue color circuit) and card=0x71 locks | ||
50 | your machine, try using 0x68, too. If that does not work, ask on the | ||
51 | mailing list. | ||
52 | |||
53 | The DST module takes a couple of useful parameters. | ||
54 | |||
55 | verbose takes values 0 to 4. These values control the verbosity level, | ||
56 | and can be used to debug also. | ||
57 | |||
58 | verbose=0 means complete disabling of messages | ||
59 | 1 only error messages are displayed | ||
60 | 2 notifications are also displayed | ||
61 | 3 informational messages are also displayed | ||
62 | 4 debug setting | ||
63 | |||
64 | dst_addons takes values 0 and 0x20. A value of 0 means it is a FTA card. | ||
65 | 0x20 means it has a Conditional Access slot. | ||
66 | |||
67 | The autodetected values are determined by the card's 'response | ||
68 | string' which you can see in your logs e.g. | ||
48 | 69 | ||
49 | Then place a line in /etc/modules containing this text: | 70 | dst_get_device_id: Recognise [DSTMCI] |
50 | dvb-bt8xx | ||
51 | 71 | ||
52 | Reboot your system and have fun! | ||
53 | 72 | ||
54 | -- | 73 | -- |
55 | Authors: Richard Walker, Jamie Honan, Michael Hunold, Manu Abraham, Uwe Bugla | 74 | Authors: Richard Walker, Jamie Honan, Michael Hunold, Manu Abraham |
diff --git a/Documentation/dvb/ci.txt b/Documentation/dvb/ci.txt index 62e0701b542a..95f0e73b2135 100644 --- a/Documentation/dvb/ci.txt +++ b/Documentation/dvb/ci.txt | |||
@@ -23,7 +23,6 @@ This application requires the following to function properly as of now. | |||
23 | eg: $ szap -c channels.conf -r "TMC" -x | 23 | eg: $ szap -c channels.conf -r "TMC" -x |
24 | 24 | ||
25 | (b) a channels.conf containing a valid PMT PID | 25 | (b) a channels.conf containing a valid PMT PID |
26 | |||
27 | eg: TMC:11996:h:0:27500:278:512:650:321 | 26 | eg: TMC:11996:h:0:27500:278:512:650:321 |
28 | 27 | ||
29 | here 278 is a valid PMT PID. the rest of the values are the | 28 | here 278 is a valid PMT PID. the rest of the values are the |
@@ -31,13 +30,7 @@ This application requires the following to function properly as of now. | |||
31 | 30 | ||
32 | (c) after running a szap, you have to run ca_zap, for the | 31 | (c) after running a szap, you have to run ca_zap, for the |
33 | descrambler to function, | 32 | descrambler to function, |
34 | 33 | eg: $ ca_zap channels.conf "TMC" | |
35 | eg: $ ca_zap patched_channels.conf "TMC" | ||
36 | |||
37 | The patched means a patch to apply to scan, such that scan can | ||
38 | generate a channels.conf_with pmt, which has this PMT PID info | ||
39 | (NOTE: szap cannot use this channels.conf with the PMT_PID) | ||
40 | |||
41 | 34 | ||
42 | (d) Hopefully enjoy your favourite subscribed channel as you do with | 35 | (d) Hopefully enjoy your favourite subscribed channel as you do with |
43 | a FTA card. | 36 | a FTA card. |
diff --git a/Documentation/fb/cyblafb/bugs b/Documentation/fb/cyblafb/bugs new file mode 100644 index 000000000000..f90cc66ea919 --- /dev/null +++ b/Documentation/fb/cyblafb/bugs | |||
@@ -0,0 +1,14 @@ | |||
1 | Bugs | ||
2 | ==== | ||
3 | |||
4 | I currently don't know of any bug. Please do send reports to: | ||
5 | - linux-fbdev-devel@lists.sourceforge.net | ||
6 | - Knut_Petersen@t-online.de. | ||
7 | |||
8 | |||
9 | Untested features | ||
10 | ================= | ||
11 | |||
12 | All LCD stuff is untested. If it worked in tridentfb, it should work in | ||
13 | cyblafb. Please test and report the results to Knut_Petersen@t-online.de. | ||
14 | |||
diff --git a/Documentation/fb/cyblafb/credits b/Documentation/fb/cyblafb/credits new file mode 100644 index 000000000000..0eb3b443dc2b --- /dev/null +++ b/Documentation/fb/cyblafb/credits | |||
@@ -0,0 +1,7 @@ | |||
1 | Thanks to | ||
2 | ========= | ||
3 | * Alan Hourihane, for writing the X trident driver | ||
4 | * Jani Monoses, for writing the tridentfb driver | ||
5 | * Antonino A. Daplas, for review of the first published | ||
6 | version of cyblafb and some code | ||
7 | * Jochen Hein, for testing and a helpful bug report | ||
diff --git a/Documentation/fb/cyblafb/documentation b/Documentation/fb/cyblafb/documentation new file mode 100644 index 000000000000..bb1aac048425 --- /dev/null +++ b/Documentation/fb/cyblafb/documentation | |||
@@ -0,0 +1,17 @@ | |||
1 | Available Documentation | ||
2 | ======================= | ||
3 | |||
4 | Apollo PLE 133 Chipset VT8601A North Bridge Datasheet, Rev. 1.82, October 22, | ||
5 | 2001, available from VIA: | ||
6 | |||
7 | http://www.viavpsd.com/product/6/15/DS8601A182.pdf | ||
8 | |||
9 | The datasheet is incomplete, some registers that need to be programmed are not | ||
10 | explained at all and important bits are listed as "reserved". But you really | ||
11 | need the datasheet to understand the code. "p. xxx" comments refer to page | ||
12 | numbers of this document. | ||
13 | |||
14 | XFree/XOrg drivers are available and of good quality, looking at the code | ||
15 | there is a good idea if the datasheet does not provide enough information | ||
16 | or if the datasheet seems to be wrong. | ||
17 | |||
diff --git a/Documentation/fb/cyblafb/fb.modes b/Documentation/fb/cyblafb/fb.modes new file mode 100644 index 000000000000..cf4351fc32ff --- /dev/null +++ b/Documentation/fb/cyblafb/fb.modes | |||
@@ -0,0 +1,155 @@ | |||
1 | # | ||
2 | # Sample fb.modes file | ||
3 | # | ||
4 | # Provides an incomplete list of working modes for | ||
5 | # the cyberblade/i1 graphics core. | ||
6 | # | ||
7 | # The value 4294967256 is used instead of -40. Of course, -40 is not | ||
8 | # a really reasonable value, but chip design does not always follow | ||
9 | # logic. Believe me, it's ok, and it's the way the BIOS does it. | ||
10 | # | ||
11 | # fbset requires 4294967256 in fb.modes and -40 as an argument to | ||
12 | # the -t parameter. That's also not too reasonable, and it might change | ||
13 | # in the future or might even be different for your current version. | ||
14 | # | ||
15 | |||
16 | mode "640x480-50" | ||
17 | geometry 640 480 640 3756 8 | ||
18 | timings 47619 4294967256 24 17 0 216 3 | ||
19 | endmode | ||
20 | |||
21 | mode "640x480-60" | ||
22 | geometry 640 480 640 3756 8 | ||
23 | timings 39682 4294967256 24 17 0 216 3 | ||
24 | endmode | ||
25 | |||
26 | mode "640x480-70" | ||
27 | geometry 640 480 640 3756 8 | ||
28 | timings 34013 4294967256 24 17 0 216 3 | ||
29 | endmode | ||
30 | |||
31 | mode "640x480-72" | ||
32 | geometry 640 480 640 3756 8 | ||
33 | timings 33068 4294967256 24 17 0 216 3 | ||
34 | endmode | ||
35 | |||
36 | mode "640x480-75" | ||
37 | geometry 640 480 640 3756 8 | ||
38 | timings 31746 4294967256 24 17 0 216 3 | ||
39 | endmode | ||
40 | |||
41 | mode "640x480-80" | ||
42 | geometry 640 480 640 3756 8 | ||
43 | timings 29761 4294967256 24 17 0 216 3 | ||
44 | endmode | ||
45 | |||
46 | mode "640x480-85" | ||
47 | geometry 640 480 640 3756 8 | ||
48 | timings 28011 4294967256 24 17 0 216 3 | ||
49 | endmode | ||
50 | |||
51 | mode "800x600-50" | ||
52 | geometry 800 600 800 3221 8 | ||
53 | timings 30303 96 24 14 0 136 11 | ||
54 | endmode | ||
55 | |||
56 | mode "800x600-60" | ||
57 | geometry 800 600 800 3221 8 | ||
58 | timings 25252 96 24 14 0 136 11 | ||
59 | endmode | ||
60 | |||
61 | mode "800x600-70" | ||
62 | geometry 800 600 800 3221 8 | ||
63 | timings 21645 96 24 14 0 136 11 | ||
64 | endmode | ||
65 | |||
66 | mode "800x600-72" | ||
67 | geometry 800 600 800 3221 8 | ||
68 | timings 21043 96 24 14 0 136 11 | ||
69 | endmode | ||
70 | |||
71 | mode "800x600-75" | ||
72 | geometry 800 600 800 3221 8 | ||
73 | timings 20202 96 24 14 0 136 11 | ||
74 | endmode | ||
75 | |||
76 | mode "800x600-80" | ||
77 | geometry 800 600 800 3221 8 | ||
78 | timings 18939 96 24 14 0 136 11 | ||
79 | endmode | ||
80 | |||
81 | mode "800x600-85" | ||
82 | geometry 800 600 800 3221 8 | ||
83 | timings 17825 96 24 14 0 136 11 | ||
84 | endmode | ||
85 | |||
86 | mode "1024x768-50" | ||
87 | geometry 1024 768 1024 2815 8 | ||
88 | timings 19054 144 24 29 0 120 3 | ||
89 | endmode | ||
90 | |||
91 | mode "1024x768-60" | ||
92 | geometry 1024 768 1024 2815 8 | ||
93 | timings 15880 144 24 29 0 120 3 | ||
94 | endmode | ||
95 | |||
96 | mode "1024x768-70" | ||
97 | geometry 1024 768 1024 2815 8 | ||
98 | timings 13610 144 24 29 0 120 3 | ||
99 | endmode | ||
100 | |||
101 | mode "1024x768-72" | ||
102 | geometry 1024 768 1024 2815 8 | ||
103 | timings 13232 144 24 29 0 120 3 | ||
104 | endmode | ||
105 | |||
106 | mode "1024x768-75" | ||
107 | geometry 1024 768 1024 2815 8 | ||
108 | timings 12703 144 24 29 0 120 3 | ||
109 | endmode | ||
110 | |||
111 | mode "1024x768-80" | ||
112 | geometry 1024 768 1024 2815 8 | ||
113 | timings 11910 144 24 29 0 120 3 | ||
114 | endmode | ||
115 | |||
116 | mode "1024x768-85" | ||
117 | geometry 1024 768 1024 2815 8 | ||
118 | timings 11209 144 24 29 0 120 3 | ||
119 | endmode | ||
120 | |||
121 | mode "1280x1024-50" | ||
122 | geometry 1280 1024 1280 2662 8 | ||
123 | timings 11114 232 16 39 0 160 3 | ||
124 | endmode | ||
125 | |||
126 | mode "1280x1024-60" | ||
127 | geometry 1280 1024 1280 2662 8 | ||
128 | timings 9262 232 16 39 0 160 3 | ||
129 | endmode | ||
130 | |||
131 | mode "1280x1024-70" | ||
132 | geometry 1280 1024 1280 2662 8 | ||
133 | timings 7939 232 16 39 0 160 3 | ||
134 | endmode | ||
135 | |||
136 | mode "1280x1024-72" | ||
137 | geometry 1280 1024 1280 2662 8 | ||
138 | timings 7719 232 16 39 0 160 3 | ||
139 | endmode | ||
140 | |||
141 | mode "1280x1024-75" | ||
142 | geometry 1280 1024 1280 2662 8 | ||
143 | timings 7410 232 16 39 0 160 3 | ||
144 | endmode | ||
145 | |||
146 | mode "1280x1024-80" | ||
147 | geometry 1280 1024 1280 2662 8 | ||
148 | timings 6946 232 16 39 0 160 3 | ||
149 | endmode | ||
150 | |||
151 | mode "1280x1024-85" | ||
152 | geometry 1280 1024 1280 2662 8 | ||
153 | timings 6538 232 16 39 0 160 3 | ||
154 | endmode | ||
155 | |||
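The header comment of the sample fb.modes above notes that 4294967256 stands in for -40. The two are the same 32-bit pattern: 4294967256 = 2^32 - 40. The sketch below converts in both directions; the helper names `to_u32` and `from_u32` are hypothetical, and the code assumes the shell evaluates arithmetic with at least 64-bit integers (true for bash and for dash on 64-bit systems).

```shell
#!/bin/sh
# to_u32 N: print N as an unsigned 32-bit value (the form used in fb.modes)
to_u32() {
    echo $(( $1 & 0xFFFFFFFF ))
}

# from_u32 N: print the signed 32-bit value that N represents
# (the form passed to fbset -t, according to the comment in fb.modes)
from_u32() {
    if [ "$1" -ge 2147483648 ]; then
        echo $(( $1 - 4294967296 ))
    else
        echo "$1"
    fi
}

to_u32 -40
from_u32 4294967256
```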
diff --git a/Documentation/fb/cyblafb/performance b/Documentation/fb/cyblafb/performance new file mode 100644 index 000000000000..eb4e47a9cea6 --- /dev/null +++ b/Documentation/fb/cyblafb/performance | |||
@@ -0,0 +1,80 @@ | |||
1 | Speed | ||
2 | ===== | ||
3 | |||
4 | CyBlaFB is much faster than tridentfb and vesafb. Compare the performance data | ||
5 | for mode 1280x1024-[8,16,32]@61 Hz. | ||
6 | |||
7 | Test 1: Cat a file with 2000 lines of 0 characters. | ||
8 | Test 2: Cat a file with 2000 lines of 80 characters. | ||
9 | Test 3: Cat a file with 2000 lines of 160 characters. | ||
10 | |||
11 | All values show system time use in seconds, kernel 2.6.12 was used for | ||
12 | the measurements. 2.6.13 is a bit slower, 2.6.14 hopefully will include a | ||
13 | patch that speeds up kernel bitblitting a lot ( > 20%). | ||
14 | |||
15 | +-----------+-----------------------------------------------------+ | ||
16 | | | not accelerated | | ||
17 | | TRIDENTFB +-----------------+-----------------+-----------------+ | ||
18 | | of 2.6.12 | 8 bpp | 16 bpp | 32 bpp | | ||
19 | | | noypan | ypan | noypan | ypan | noypan | ypan | | ||
20 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
21 | | Test 1 | 4.31 | 4.33 | 6.05 | 12.81 | ---- | ---- | | ||
22 | | Test 2 | 67.94 | 5.44 | 123.16 | 14.79 | ---- | ---- | | ||
23 | | Test 3 | 131.36 | 6.55 | 240.12 | 16.76 | ---- | ---- | | ||
24 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
25 | | Comments | | | completely bro- | | ||
26 | | | | | ken, monitor | | ||
27 | | | | | switches off | | ||
28 | +-----------+-----------------+-----------------+-----------------+ | ||
29 | |||
30 | |||
31 | +-----------+-----------------------------------------------------+ | ||
32 | | | accelerated | | ||
33 | | TRIDENTFB +-----------------+-----------------+-----------------+ | ||
34 | | of 2.6.12 | 8 bpp | 16 bpp | 32 bpp | | ||
35 | | | noypan | ypan | noypan | ypan | noypan | ypan | | ||
36 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
37 | | Test 1 | ---- | ---- | 20.62 | 1.22 | ---- | ---- | | ||
38 | | Test 2 | ---- | ---- | 22.61 | 3.19 | ---- | ---- | | ||
39 | | Test 3 | ---- | ---- | 24.59 | 5.16 | ---- | ---- | | ||
40 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
41 | | Comments | broken, writing | broken, ok only | completely bro- | | ||
42 | | | to wrong places | if bgcolor is | ken, monitor | | ||
43 | | | on screen + bug | black, bug in | switches off | | ||
44 | | | in fillrect() | fillrect() | | | ||
45 | +-----------+-----------------+-----------------+-----------------+ | ||
46 | |||
47 | |||
48 | +-----------+-----------------------------------------------------+ | ||
49 | | | not accelerated | | ||
50 | | VESAFB +-----------------+-----------------+-----------------+ | ||
51 | | of 2.6.12 | 8 bpp | 16 bpp | 32 bpp | | ||
52 | | | noypan | ypan | noypan | ypan | noypan | ypan | | ||
53 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
54 | | Test 1 | 4.26 | 3.76 | 5.99 | 7.23 | ---- | ---- | | ||
55 | | Test 2 | 65.65 | 4.89 | 120.88 | 9.08 | ---- | ---- | | ||
56 | | Test 3 | 126.91 | 5.94 | 235.77 | 11.03 | ---- | ---- | | ||
57 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
58 | | Comments | vga=0x307 | vga=0x31a | vga=0x31b not | | ||
59 | | | fh=80kHz | fh=80kHz | supported by | | ||
60 | | | fv=75Hz | fv=75Hz | video BIOS and | | ||
61 | | | | | hardware | | ||
62 | +-----------+-----------------+-----------------+-----------------+ | ||
63 | |||
64 | |||
65 | +-----------+-----------------------------------------------------+ | ||
66 | | | accelerated | | ||
67 | | CYBLAFB +-----------------+-----------------+-----------------+ | ||
68 | | | 8 bpp | 16 bpp | 32 bpp | | ||
69 | | | noypan | ypan | noypan | ypan | noypan | ypan | | ||
70 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
71 | | Test 1 | 8.02 | 0.23 | 19.04 | 0.61 | 57.12 | 2.74 | | ||
72 | | Test 2 | 8.38 | 0.55 | 19.39 | 0.92 | 57.54 | 3.13 | | ||
73 | | Test 3 | 8.73 | 0.86 | 19.74 | 1.24 | 57.95 | 3.51 | | ||
74 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
75 | | Comments | | | | | ||
76 | | | | | | | ||
77 | | | | | | | ||
78 | | | | | | | ||
79 | +-----------+-----------------+-----------------+-----------------+ | ||
80 | |||
diff --git a/Documentation/fb/cyblafb/todo b/Documentation/fb/cyblafb/todo new file mode 100644 index 000000000000..80fb2f89b6c1 --- /dev/null +++ b/Documentation/fb/cyblafb/todo | |||
@@ -0,0 +1,32 @@ | |||
1 | TODO / Missing features | ||
2 | ======================= | ||
3 | |||
4 | Verify LCD stuff "stretch" and "center" options are | ||
5 | completely untested ... this code needs to be | ||
6 | verified. As I don't have access to such | ||
7 | hardware, please contact me if you are | ||
8 | willing to run some tests. | ||
9 | |||
10 | Interlaced video modes The reason that interlaced | ||
11 | modes are disabled is that I do not know | ||
12 | the meaning of the vertical interlace | ||
13 | parameter. Also the datasheet mentions a | ||
14 | bit d8 of a horizontal interlace parameter, | ||
15 | but nowhere the lower 8 bits. Please help | ||
16 | if you can. | ||
17 | |||
18 | low-res double scan modes Who needs it? | ||
19 | |||
20 | accelerated color blitting Who needs it? The console driver does use color | ||
21 | blitting for nothing but drawing the penguin, | ||
22 | everything else is done using color expanding | ||
23 | blitting of 1bpp character bitmaps. | ||
24 | |||
25 | xpanning Who needs it? | ||
26 | |||
27 | ioctls Who needs it? | ||
28 | |||
29 | TV-out Will be done later | ||
30 | |||
31 | ??? Feel free to contact me if you have any | ||
32 | feature requests | ||
diff --git a/Documentation/fb/cyblafb/usage b/Documentation/fb/cyblafb/usage new file mode 100644 index 000000000000..e627c8f54211 --- /dev/null +++ b/Documentation/fb/cyblafb/usage | |||
@@ -0,0 +1,206 @@ | |||
1 | CyBlaFB is a framebuffer driver for the Cyberblade/i1 graphics core integrated | ||
2 | into the VIA Apollo PLE133 (aka vt8601) north bridge. It is developed and | ||
3 | tested using a VIA EPIA 5000 board. | ||
4 | |||
5 | Cyblafb - compiled into the kernel or as a module? | ||
6 | ================================================== | ||
7 | |||
8 | You might compile cyblafb either as a module or compile it permanently into the | ||
9 | kernel. | ||
10 | |||
11 | Unless you have a real reason to do so you should not compile both vesafb and | ||
12 | cyblafb permanently into the kernel. It's possible and it helps during the | ||
13 | development cycle, but it's useless and will at least block some otherwise | ||
14 | useful memory for ordinary users. | ||
15 | |||
16 | Selecting Modes | ||
17 | =============== | ||
18 | |||
19 | Startup Mode | ||
20 | ============ | ||
21 | |||
22 | First of all, you might use the "vga=???" boot parameter as it is | ||
23 | documented in vesafb.txt and svga.txt. Cyblafb will detect the video | ||
24 | mode selected and will use the geometry and timings found by | ||
25 | inspecting the hardware registers. | ||
26 | |||
27 | video=cyblafb vga=0x317 | ||
28 | |||
29 | Alternatively you might use a combination of the mode, ref and bpp | ||
30 | parameters. If you compiled the driver into the kernel, add something | ||
31 | like this to the kernel command line: | ||
32 | |||
33 | video=cyblafb:1280x1024,bpp=16,ref=50 ... | ||
34 | |||
35 | If you compiled the driver as a module, the same mode would be | ||
36 | selected by the following command: | ||
37 | |||
38 | modprobe cyblafb mode=1280x1024 bpp=16 ref=50 ... | ||
39 | |||
40 | None of the modes possible to select as startup modes are affected by | ||
41 | the problems described at the end of the next subsection. | ||
42 | |||
43 | Mode changes using fbset | ||
44 | ======================== | ||
45 | |||
46 | You might use fbset to change the video mode, see "man fbset". Cyblafb | ||
47 | generally does assume that you know what you are doing. But it does | ||
48 | some checks, especially those that are needed to prevent you from | ||
49 | damaging your hardware. | ||
50 | |||
51 | - only 8, 16, 24 and 32 bpp video modes are accepted | ||
52 | - interlaced video modes are not accepted | ||
53 | - double scan video modes are not accepted | ||
54 | - if a flat panel is found, cyblafb does not allow you | ||
55 | to program a resolution higher than the physical | ||
56 | resolution of the flat panel monitor | ||
57 | - cyblafb does not allow xres to differ from xres_virtual | ||
58 | - cyblafb does not allow vclk to exceed 230 MHz. As 32 bpp | ||
59 | and (currently) 24 bit modes use a doubled vclk internally, | ||
60 | the dotclock limit as seen by fbset is 115 MHz for those | ||
61 | modes and 230 MHz for 8 and 16 bpp modes. | ||
62 | |||
63 | Any request that violates the rules given above will be ignored and | ||
64 | fbset will return an error. | ||
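
The checks listed above can be sketched in a few lines of C. This is only an illustration, not code taken from the cyblafb source: the struct, its field names and the function name are hypothetical, and the flat panel resolution check is left out.

```c
#include <stdbool.h>

/* Hypothetical mode description; field names are illustrative only. */
struct mode_request {
	unsigned int xres, xres_virtual;
	unsigned int bpp;
	unsigned int dotclock_khz;	/* dotclock as seen by fbset */
	bool interlaced, doublescan;
};

#define VCLK_LIMIT_KHZ 230000u	/* 230 MHz, from the list above */

/* Returns true if the request passes the checks listed above. */
static bool mode_request_ok(const struct mode_request *m)
{
	unsigned int vclk_khz;

	if (m->bpp != 8 && m->bpp != 16 && m->bpp != 24 && m->bpp != 32)
		return false;
	if (m->interlaced || m->doublescan)
		return false;
	if (m->xres != m->xres_virtual)
		return false;

	/* 24 and 32 bpp modes use a doubled vclk internally. */
	vclk_khz = (m->bpp >= 24) ? 2 * m->dotclock_khz : m->dotclock_khz;
	return vclk_khz <= VCLK_LIMIT_KHZ;
}
```

With these assumptions a 135 MHz dotclock passes at 16 bpp but is rejected at 32 bpp, where the effective limit is 115 MHz.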
65 | |||
66 | If you program a virtual y resolution higher than the hardware limit, | ||
67 | cyblafb will silently decrease that value to the highest possible | ||
68 | value. | ||
69 | |||
70 | Attempts to disable acceleration are ignored. | ||
71 | |||
72 | Some video modes that should work do not work as expected. If you use | ||
73 | the standard fb.modes, fbset 640x480-60 will program that mode, but | ||
74 | you will see a vertical stripe, about two characters wide, in which | |
75 | characters appear much darker than elsewhere on the screen. | |
76 | Cyblafb does allow that mode to be set, as it does not violate the | |
77 | official specifications. It would need a lot of code to reliably sort | |
78 | out all invalid modes, whereas playing around with the margin values | |
79 | will give a valid mode quickly. And if cyblafb did detect such an invalid | |
80 | mode, should it silently alter the requested values or should it | ||
81 | report an error? Both options have some pros and cons. As stated | ||
82 | above, none of the startup modes are affected, and if you set | ||
83 | verbosity to 1 or higher, cyblafb will print the fbset command that | ||
84 | would be needed to program that mode using fbset. | ||
85 | |||
86 | |||
87 | Other Parameters | ||
88 | ================ | ||
89 | |||
90 | |||
91 | crt don't autodetect, assume monitor connected to | ||
92 | standard VGA connector | ||
93 | |||
94 | fp don't autodetect, assume flat panel display | ||
95 | connected to flat panel monitor interface | ||
96 | |||
97 | nativex inform driver about native x resolution of | ||
98 | flat panel monitor connected to special | ||
99 | interface (should be autodetected) | ||
100 | |||
101 | stretch stretch image to adapt low resolution modes to | ||
102 | higher resolutions of flat panel monitors | |
103 | connected to special interface | ||
104 | |||
105 | center center image to adapt low resolution modes to | ||
106 | higher resolutions of flat panel monitors | |
107 | connected to special interface | ||
108 | |||
109 | memsize use if autodetected memsize is wrong ... | ||
110 | should never be necessary | ||
111 | |||
112 | nopcirr disable PCI read retry | ||
113 | nopciwr disable PCI write retry | ||
114 | nopcirb disable PCI read bursts | ||
115 | nopciwb disable PCI write bursts | ||
116 | |||
117 | bpp bpp for specified modes | ||
118 | valid values: 8 || 16 || 24 || 32 | ||
119 | |||
120 | ref refresh rate for specified mode | ||
121 | valid values: 50 <= ref <= 85 | ||
122 | |||
123 | mode 640x480 or 800x600 or 1024x768 or 1280x1024 | ||
124 | if not specified, the startup mode will be detected | ||
125 | and used, so you might also use the vga=??? parameter | ||
126 | described in vesafb.txt. If you do not specify a mode, | ||
127 | bpp and ref parameters are ignored. | ||
128 | |||
129 | verbosity 0 is the default, increase to at least 2 for every | ||
130 | bug report! | ||
131 | |||
132 | vesafb allows cyblafb to be loaded after vesafb has been | ||
133 | loaded. See sections "Module unloading ...". | ||
134 | |||
135 | |||
136 | Development hints | ||
137 | ================= | ||
138 | |||
139 | It's much faster to compile a module and to load the new version after | |
140 | unloading the old module than to compile a new kernel and to reboot. So if you | ||
141 | try to work on cyblafb, it might be a good idea to use cyblafb as a module. | ||
142 | In real life, fast often means dangerous, and that's also the case here. If | ||
143 | you introduce a serious bug when cyblafb is compiled into the kernel, the | ||
144 | kernel will lock or oops with a high probability before the file system is | ||
145 | mounted, and the danger for your data is low. If you load your own broken | |
146 | version of cyblafb on a running system, the danger for the integrity of the file | |
147 | system is much higher as you might need a hard reset afterwards. Decide | ||
148 | yourself. | ||
149 | |||
150 | Module unloading, the vfb method | ||
151 | ================================ | ||
152 | |||
153 | If you want to unload/reload cyblafb using the virtual framebuffer, you need | ||
154 | to enable vfb support in the kernel first. After that, load the modules as | ||
155 | shown below: | ||
156 | |||
157 | modprobe vfb vfb_enable=1 | ||
158 | modprobe fbcon | ||
159 | modprobe cyblafb | ||
160 | fbset -fb /dev/fb1 1280x1024-60 -vyres 2662 | ||
161 | con2fb /dev/fb1 /dev/tty1 | ||
162 | ... | ||
163 | |||
164 | If you have now made some changes to cyblafb and want to reload it, you might | |
165 | do it as shown below: | |
166 | |||
167 | con2fb /dev/fb0 /dev/tty1 | ||
168 | ... | ||
169 | rmmod cyblafb | ||
170 | modprobe cyblafb | ||
171 | con2fb /dev/fb1 /dev/tty1 | ||
172 | ... | ||
173 | |||
174 | Of course, you might choose another mode, and most certainly you also want to | ||
175 | map some other /dev/tty* to the real framebuffer device. You might also choose | ||
176 | to compile fbcon as a kernel module or place it permanently in the kernel. | ||
177 | |||
178 | I do not know of any way to unload fbcon, and fbcon will prevent the | ||
179 | framebuffer device loaded first from unloading. [If there is a way, then | ||
180 | please add a description here!] | ||
181 | |||
182 | Module unloading, the vesafb method | ||
183 | =================================== | ||
184 | |||
185 | Configure the kernel: | ||
186 | |||
187 | <*> Support for frame buffer devices | ||
188 | [*] VESA VGA graphics support | ||
189 | <M> Cyberblade/i1 support | ||
190 | |||
191 | Add e.g. "video=vesafb:ypan vga=0x307" to the kernel parameters. The ypan | ||
192 | parameter is important; choose any vga parameter you like as long as it is | |
193 | a graphics mode. | ||
194 | |||
195 | After booting, load cyblafb without any mode and bpp parameter and assign | ||
196 | cyblafb to individual ttys using con2fb, e.g.: | ||
197 | |||
198 | modprobe cyblafb vesafb=1 | ||
199 | con2fb /dev/fb1 /dev/tty1 | ||
200 | |||
201 | Unloading cyblafb works without problems after you assign vesafb to all | ||
202 | ttys again, e.g.: | ||
203 | |||
204 | con2fb /dev/fb0 /dev/tty1 | ||
205 | rmmod cyblafb | ||
206 | |||
diff --git a/Documentation/fb/cyblafb/whycyblafb b/Documentation/fb/cyblafb/whycyblafb new file mode 100644 index 000000000000..a123bc11e698 --- /dev/null +++ b/Documentation/fb/cyblafb/whycyblafb | |||
@@ -0,0 +1,85 @@ | |||
1 | I tried the following framebuffer drivers: | ||
2 | |||
3 | - TRIDENTFB is full of bugs. Acceleration is broken for Blade3D | ||
4 | graphics cores like the cyberblade/i1. It claims to support a great | ||
5 | number of devices, but documentation for most of these devices is | ||
6 | unfortunately not available. There is _no_ reason to use tridentfb | ||
7 | for cyberblade/i1 + CRT users. VESAFB is faster, and the one | ||
8 | advantage, mode switching, is broken in tridentfb. | ||
9 | |||
10 | - VESAFB is used by many distributions as a standard. Vesafb does | ||
11 | not support mode switching. VESAFB is a bit faster than the working | ||
12 | configurations of TRIDENTFB, but it is still too slow, even if you | ||
13 | use ypan. | ||
14 | |||
15 | - EPIAFB (you'll find it on sourceforge) supports the Cyberblade/i1 | ||
16 | graphics core, but it still has serious bugs and development seems | |
17 | to have stopped. This is the one driver with TV-out support. If you | ||
18 | do need this feature, try epiafb. | ||
19 | |||
20 | None of these drivers was a real option for me. | ||
21 | |||
22 | I believe that it is unreasonable to change code that claims to support 20 | |
23 | devices if I only have more or less sufficient documentation for exactly one | ||
24 | of these. The risk of breaking device foo while fixing device bar is too high. | ||
25 | |||
26 | So I decided to start CyBlaFB as a stripped down tridentfb. | ||
27 | |||
28 | All code specific to other Trident chips has been removed. After that there | ||
29 | were a lot of cosmetic changes to increase the readability of the code. All | ||
30 | register names were changed to those mnemonics used in the datasheet. Function | ||
31 | and macro names were changed if they hindered easy understanding of the code. | ||
32 | |||
33 | After that I debugged the code and implemented some new features. I'll try to | ||
34 | give a little summary of the main changes: | ||
35 | |||
36 | - calculation of vertical and horizontal timings was fixed | ||
37 | |||
38 | - video signal quality has been improved dramatically | ||
39 | |||
40 | - acceleration: | ||
41 | |||
42 | - fillrect and copyarea were fixed and reenabled | ||
43 | |||
44 | - color expanding imageblit was newly implemented, color | ||
45 | imageblit (only used to draw the penguin) still uses the | |
46 | generic code. | ||
47 | |||
48 | - init of the acceleration engine was improved and moved to a | ||
49 | place where it really works ... | ||
50 | |||
51 | - sync function has a timeout now and tries to reset and | ||
52 | reinit the accel engine if necessary | ||
53 | |||
54 | - fewer slow copyarea calls when doing ypan scrolling by using | ||
55 | undocumented bit d21 of screen start address stored in | ||
56 | CR2B[5]. BIOS does use it also, so this should be safe. | ||
57 | |||
58 | - cyblafb rejects any attempt to set modes that would cause vclk | ||
59 | values above a reasonable 230 MHz. 32bit modes use a clock | |
60 | multiplier of 2, so fbset does show the correct values for | |
61 | pixclock but not for vclk in this case. The fbset limit is 115 MHz | ||
62 | for 32 bpp modes. | ||
63 | |||
64 | - cyblafb rejects modes known to be broken or unimplemented (all | ||
65 | interlaced modes, all doublescan modes for now) | ||
66 | |||
67 | - cyblafb now works independently of the video mode in effect at startup | |
68 | time (tridentfb does not init all needed registers to reasonable | ||
69 | values) | ||
70 | |||
71 | - switching between video modes does work reliably now | ||
72 | |||
73 | - the first video mode now is the one selected on startup using the | ||
74 | vga=???? mechanism or any of | ||
75 | - 640x480, 800x600, 1024x768, 1280x1024 | ||
76 | - 8, 16, 24 or 32 bpp | ||
77 | - refresh between 50 Hz and 85 Hz, 1 Hz steps (1280x1024-32 | ||
78 | is limited to 63Hz) | ||
79 | |||
80 | - pci retry and pci burst mode are settable (try to disable if you | ||
81 | experience latency problems) | ||
82 | |||
83 | - built as a module cyblafb might be unloaded and reloaded using | ||
84 | the vfb module and con2fb or might be used together with vesafb | |
85 | |||
diff --git a/Documentation/fb/intel810.txt b/Documentation/fb/intel810.txt index fd68b162e4a1..4f0d6bc789ef 100644 --- a/Documentation/fb/intel810.txt +++ b/Documentation/fb/intel810.txt | |||
@@ -5,6 +5,7 @@ Intel 810/815 Framebuffer driver | |||
5 | March 17, 2002 | 5 | March 17, 2002 |
6 | 6 | ||
7 | First Released: July 2001 | 7 | First Released: July 2001 |
8 | Last Update: September 12, 2005 | ||
8 | ================================================================ | 9 | ================================================================ |
9 | 10 | ||
10 | A. Introduction | 11 | A. Introduction |
@@ -44,6 +45,8 @@ B. Features | |||
44 | 45 | ||
45 | - Hardware Cursor Support | 46 | - Hardware Cursor Support |
46 | 47 | ||
48 | - Supports EDID probing either by DDC/I2C or through the BIOS | ||
49 | |||
47 | C. List of available options | 50 | C. List of available options |
48 | 51 | ||
49 | a. "video=i810fb" | 52 | a. "video=i810fb" |
@@ -52,14 +55,17 @@ C. List of available options | |||
52 | Recommendation: required | 55 | Recommendation: required |
53 | 56 | ||
54 | b. "xres:<value>" | 57 | b. "xres:<value>" |
55 | select horizontal resolution in pixels | 58 | select horizontal resolution in pixels. (This parameter will be |
59 | ignored if 'mode_option' is specified. See 'o' below). | ||
56 | 60 | ||
57 | Recommendation: user preference | 61 | Recommendation: user preference |
58 | (default = 640) | 62 | (default = 640) |
59 | 63 | ||
60 | c. "yres:<value>" | 64 | c. "yres:<value>" |
61 | select vertical resolution in scanlines. If Discrete Video Timings | 65 | select vertical resolution in scanlines. If Discrete Video Timings |
62 | is enabled, this will be ignored and computed as 3*xres/4. | 66 | is enabled, this will be ignored and computed as 3*xres/4. (This |
67 | parameter will be ignored if 'mode_option' is specified. See 'o' | ||
68 | below) | ||
63 | 69 | ||
64 | Recommendation: user preference | 70 | Recommendation: user preference |
65 | (default = 480) | 71 | (default = 480) |
@@ -86,7 +92,8 @@ C. List of available options | |||
86 | g. "hsync1/hsync2:<value>" | 92 | g. "hsync1/hsync2:<value>" |
87 | select the minimum and maximum Horizontal Sync Frequency of the | 93 | select the minimum and maximum Horizontal Sync Frequency of the |
88 | monitor in KHz. If using a fixed frequency monitor, hsync1 must | 94 | monitor in KHz. If using a fixed frequency monitor, hsync1 must |
89 | be equal to hsync2. | 95 | be equal to hsync2. If EDID probing is successful, these will be |
96 | ignored and values will be taken from the EDID block. | ||
90 | 97 | ||
91 | Recommendation: check monitor manual for correct values | 98 | Recommendation: check monitor manual for correct values |
92 | default (29/30) | 99 | default (29/30) |
@@ -94,7 +101,8 @@ C. List of available options | |||
94 | h. "vsync1/vsync2:<value>" | 101 | h. "vsync1/vsync2:<value>" |
95 | select the minimum and maximum Vertical Sync Frequency of the monitor | 102 | select the minimum and maximum Vertical Sync Frequency of the monitor |
96 | in Hz. You can also use this option to lock your monitor's refresh | 103 | in Hz. You can also use this option to lock your monitor's refresh |
97 | rate. | 104 | rate. If EDID probing is successful, these will be ignored and values |
105 | will be taken from the EDID block. | ||
98 | 106 | ||
99 | Recommendation: check monitor manual for correct values | 107 | Recommendation: check monitor manual for correct values |
100 | (default = 60/60) | 108 | (default = 60/60) |
@@ -154,7 +162,11 @@ C. List of available options | |||
154 | 162 | ||
155 | Recommendation: do not set | 163 | Recommendation: do not set |
156 | (default = not set) | 164 | (default = not set) |
157 | 165 | o. <xres>x<yres>[-<bpp>][@<refresh>] | |
166 | The driver will now accept a boot mode specification. If this | |
167 | is specified, the options 'xres' and 'yres' will be ignored. See | ||
168 | Documentation/fb/modedb.txt for usage. | ||
169 | |||
158 | D. Kernel booting | 170 | D. Kernel booting |
159 | 171 | ||
160 | Separate each option/option-pair by commas (,) and the option from its value | 172 | Separate each option/option-pair by commas (,) and the option from its value |
@@ -176,7 +188,10 @@ will be computed based on the hsync1/hsync2 and vsync1/vsync2 values. | |||
176 | 188 | ||
177 | IMPORTANT: | 189 | IMPORTANT: |
178 | You must include hsync1, hsync2, vsync1 and vsync2 to enable video modes | 190 | You must include hsync1, hsync2, vsync1 and vsync2 to enable video modes |
179 | better than 640x480 at 60Hz. | 191 | better than 640x480 at 60Hz. HOWEVER, if your chipset/display combination |
192 | supports I2C and has an EDID block, you can safely exclude hsync1, hsync2, | ||
193 | vsync1 and vsync2 parameters. These parameters will be taken from the EDID | ||
194 | block. | ||
180 | 195 | ||
181 | E. Module options | 196 | E. Module options |
182 | 197 | ||
@@ -217,32 +232,21 @@ F. Setup | |||
217 | This is required. The option is under "Character Devices" | 232 | This is required. The option is under "Character Devices" |
218 | 233 | ||
219 | d. Under "Graphics Support", select "Intel 810/815" either statically | 234 | d. Under "Graphics Support", select "Intel 810/815" either statically |
220 | or as a module. Choose "use VESA GTF for video timings" if you | 235 | or as a module. Choose "use VESA Generalized Timing Formula" if |
221 | need to maximize the capability of your display. To be on the | 236 | you need to maximize the capability of your display. To be on the |
222 | safe side, you can leave this unselected. | 237 | safe side, you can leave this unselected. |
223 | 238 | ||
224 | e. If you want a framebuffer console, enable it under "Console | 239 | e. If you want support for DDC/I2C probing (Plug and Play Displays), |
240 | set 'Enable DDC Support' to 'y'. To make this option appear, set | ||
241 | 'use VESA Generalized Timing Formula' to 'y'. | ||
242 | |||
243 | f. If you want a framebuffer console, enable it under "Console | ||
225 | Drivers" | 244 | Drivers" |
226 | 245 | ||
227 | f. Compile your kernel. | 246 | g. Compile your kernel. |
228 | 247 | ||
229 | g. Load the driver as described in section D and E. | 248 | h. Load the driver as described in section D and E. |
230 | 249 | ||
231 | Optional: | ||
232 | h. If you are going to run XFree86 with its native drivers, the | ||
233 | standard XFree86 4.1.0 and 4.2.0 drivers should work as is. | ||
234 | However, there's a bug in the XFree86 i810 drivers. It attempts | ||
235 | to use XAA even when switched to the console. This will crash | ||
236 | your server. I have a fix at this site: | ||
237 | |||
238 | http://i810fb.sourceforge.net. | ||
239 | |||
240 | You can either use the patch, or just replace | ||
241 | |||
242 | /usr/X11R6/lib/modules/drivers/i810_drv.o | ||
243 | |||
244 | with the one provided at the website. | ||
245 | |||
246 | i. Try the DirectFB (http://www.directfb.org) + the i810 gfxdriver | 250 | i. Try the DirectFB (http://www.directfb.org) + the i810 gfxdriver |
247 | patch to see the chipset in action (or inaction :-). | 251 | patch to see the chipset in action (or inaction :-). |
248 | 252 | ||
diff --git a/Documentation/fb/modedb.txt b/Documentation/fb/modedb.txt index e04458b319d5..4fcdb4cf4cca 100644 --- a/Documentation/fb/modedb.txt +++ b/Documentation/fb/modedb.txt | |||
@@ -20,12 +20,83 @@ in a video= option, fbmem considers that to be a global video mode option. | |||
20 | 20 | ||
21 | Valid mode specifiers (mode_option argument): | 21 | Valid mode specifiers (mode_option argument): |
22 | 22 | ||
23 | <xres>x<yres>[-<bpp>][@<refresh>] | 23 | <xres>x<yres>[M][R][-<bpp>][@<refresh>][i][m] |
24 | <name>[-<bpp>][@<refresh>] | 24 | <name>[-<bpp>][@<refresh>] |
25 | 25 | ||
26 | with <xres>, <yres>, <bpp> and <refresh> decimal numbers and <name> a string. | 26 | with <xres>, <yres>, <bpp> and <refresh> decimal numbers and <name> a string. |
27 | Things between square brackets are optional. | 27 | Things between square brackets are optional. |
28 | 28 | ||
29 | If 'M' is specified in the mode_option argument (after <yres> and before | ||
30 | <bpp> and <refresh>, if specified) the timings will be calculated using | ||
31 | VESA(TM) Coordinated Video Timings instead of looking up the mode from a table. | ||
32 | If 'R' is specified, do a 'reduced blanking' calculation for digital displays. | ||
33 | If 'i' is specified, calculate for an interlaced mode. And if 'm' is | ||
34 | specified, add margins to the calculation (1.8% of xres rounded down to 8 | ||
35 | pixels and 1.8% of yres). | ||
36 | |||
37 | Sample usage: 1024x768M@60m - CVT timing with margins | ||
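
To make the grammar of the first mode specifier concrete, here is a small userspace parser for it. It is only a sketch: the struct, the helper names and the strict left-to-right parsing are assumptions made for illustration and do not reproduce what fb_find_mode() does internally.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical result type; not a kernel structure. */
struct mode_spec {
	unsigned int xres, yres;
	unsigned int bpp, refresh;	/* 0 when not given */
	bool cvt, reduced, interlaced, margins;
};

/* Read a decimal number at *s and advance *s past it; -1 if none. */
static int parse_uint(const char **s, unsigned int *val)
{
	char *end;

	if (!isdigit((unsigned char)**s))
		return -1;
	*val = (unsigned int)strtoul(*s, &end, 10);
	*s = end;
	return 0;
}

/* Consume one optional flag character; true if it was present. */
static bool take_flag(const char **s, char flag)
{
	if (**s != flag)
		return false;
	(*s)++;
	return true;
}

/* Parse "<xres>x<yres>[M][R][-<bpp>][@<refresh>][i][m]"; 0 on success. */
static int parse_mode_spec(const char *s, struct mode_spec *out)
{
	struct mode_spec m = { 0 };

	if (parse_uint(&s, &m.xres) || !take_flag(&s, 'x') ||
	    parse_uint(&s, &m.yres))
		return -1;
	m.cvt = take_flag(&s, 'M');
	m.reduced = take_flag(&s, 'R');
	if (take_flag(&s, '-') && parse_uint(&s, &m.bpp))
		return -1;
	if (take_flag(&s, '@') && parse_uint(&s, &m.refresh))
		return -1;
	m.interlaced = take_flag(&s, 'i');
	m.margins = take_flag(&s, 'm');
	if (*s != '\0')
		return -1;	/* trailing characters are not allowed */
	*out = m;
	return 0;
}
```

For the sample string "1024x768M@60m" this yields xres 1024, yres 768, CVT and margins set, refresh 60, and bpp left at 0.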
38 | |||
39 | ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** | ||
40 | |||
41 | What is the VESA(TM) Coordinated Video Timings (CVT)? | ||
42 | |||
43 | From the VESA(TM) Website: | ||
44 | |||
45 | "The purpose of CVT is to provide a method for generating a consistent | ||
46 | and coordinated set of standard formats, display refresh rates, and | ||
47 | timing specifications for computer display products, both those | ||
48 | employing CRTs, and those using other display technologies. The | ||
49 | intention of CVT is to give both source and display manufacturers a | ||
50 | common set of tools to enable new timings to be developed in a | ||
51 | consistent manner that ensures greater compatibility." | ||
52 | |||
53 | This is the third standard approved by VESA(TM) concerning video timings. The | ||
54 | first was the Discrete Video Timings (DVT) which is a collection of | ||
55 | pre-defined modes approved by VESA(TM). The second is the Generalized Timing | ||
56 | Formula (GTF) which is an algorithm to calculate the timings, given the | ||
57 | pixelclock, the horizontal sync frequency, or the vertical refresh rate. | ||
58 | |||
59 | The GTF is limited by the fact that it is designed mainly for CRT displays. | ||
60 | It artificially increases the pixelclock because of its high blanking | ||
61 | requirement. This is inappropriate for digital display interface with its high | ||
62 | data rate which requires that it conserves the pixelclock as much as possible. | ||
63 | Also, GTF does not take into account the aspect ratio of the display. | ||
64 | |||
65 | The CVT addresses these limitations. If used with CRT's, the formula used | ||
66 | is a derivation of GTF with a few modifications. If used with digital | ||
67 | displays, the "reduced blanking" calculation can be used. | ||
68 | |||
69 | From the framebuffer subsystem perspective, new formats need not be added | ||
70 | to the global mode database whenever a new mode is released by display | ||
71 | manufacturers. Specifying for CVT will work for most, if not all, relatively | ||
72 | new CRT displays and probably with most flatpanels, if 'reduced blanking' | ||
73 | calculation is specified. (The CVT compatibility of the display can be | ||
74 | determined from its EDID. The version 1.3 of the EDID has extra 128-byte | ||
75 | blocks where additional timing information is placed. As of this time, there | ||
76 | is no support yet in the layer to parse these additional blocks.) | |
77 | |||
78 | CVT also introduced a new naming convention (should be seen from dmesg output): | ||
79 | |||
80 | <pix>M<a>[-R] | ||
81 | |||
82 | where: pix = total amount of pixels in megapixels (xres x yres) | |
83 | M = always present | ||
84 | a = aspect ratio (3 - 4:3; 4 - 5:4; 9 - 15:9, 16:9; A - 16:10) | ||
85 | -R = reduced blanking | ||
86 | |||
87 | example: .48M3-R - 800x600 with reduced blanking | ||
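
The naming convention can be illustrated with a short C helper. This is a sketch under stated assumptions: the function names are hypothetical, the pixel count is rounded to two decimals, and a leading zero is dropped to match the example above. The kernel's own formatting code may differ.

```c
#include <stdio.h>

/* Map an aspect ratio to the code from the table above; 0 if not listed. */
static char cvt_aspect_code(unsigned int xres, unsigned int yres)
{
	if (xres * 3 == yres * 4)
		return '3';	/* 4:3 */
	if (xres * 4 == yres * 5)
		return '4';	/* 5:4 */
	if (xres * 9 == yres * 15 || xres * 9 == yres * 16)
		return '9';	/* 15:9 or 16:9 */
	if (xres * 10 == yres * 16)
		return 'A';	/* 16:10 */
	return 0;
}

/* Write the <pix>M<a>[-R] name into buf; returns 0, or -1 if the
 * aspect ratio has no code. */
static int cvt_name(char *buf, size_t len, unsigned int xres,
		    unsigned int yres, int reduced)
{
	/* Megapixels in hundredths, rounded to nearest (an assumption). */
	unsigned long h = ((unsigned long)xres * yres + 5000) / 10000;
	char code = cvt_aspect_code(xres, yres);
	const char *suffix = reduced ? "-R" : "";

	if (!code)
		return -1;
	if (h >= 100)
		snprintf(buf, len, "%lu.%02luM%c%s", h / 100, h % 100,
			 code, suffix);
	else
		snprintf(buf, len, ".%02luM%c%s", h, code, suffix);
	return 0;
}
```

Called with 800x600 and reduced blanking, the helper produces ".48M3-R", the same name as in the example.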
88 | |||
89 | Note: VESA(TM) has restrictions on what is a standard CVT timing: | ||
90 | |||
91 | - aspect ratio can only be one of the above values | ||
92 | - acceptable refresh rates are 50, 60, 70 or 85 Hz only | ||
93 | - if reduced blanking, the refresh rate must be at 60Hz | ||
94 | |||
95 | If any of the above is not satisfied, the kernel will print a warning but the | |
96 | timings will still be calculated. | ||
97 | |||
98 | ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** | ||
99 | |||
29 | To find a suitable video mode, you just call | 100 | To find a suitable video mode, you just call |
30 | 101 | ||
31 | int __init fb_find_mode(struct fb_var_screeninfo *var, | 102 | int __init fb_find_mode(struct fb_var_screeninfo *var, |
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 5f95d4b3cab1..784e08c1c80a 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt | |||
@@ -17,14 +17,6 @@ Who: Greg Kroah-Hartman <greg@kroah.com> | |||
17 | 17 | ||
18 | --------------------------- | 18 | --------------------------- |
19 | 19 | ||
20 | What: ACPI S4bios support | ||
21 | When: May 2005 | ||
22 | Why: Noone uses it, and it probably does not work, anyway. swsusp is | ||
23 | faster, more reliable, and people are actually using it. | ||
24 | Who: Pavel Machek <pavel@suse.cz> | ||
25 | |||
26 | --------------------------- | ||
27 | |||
28 | What: io_remap_page_range() (macro or function) | 20 | What: io_remap_page_range() (macro or function) |
29 | When: September 2005 | 21 | When: September 2005 |
30 | Why: Replaced by io_remap_pfn_range() which allows more memory space | 22 | Why: Replaced by io_remap_pfn_range() which allows more memory space |
diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt new file mode 100644 index 000000000000..8c206f4e0250 --- /dev/null +++ b/Documentation/filesystems/files.txt | |||
@@ -0,0 +1,123 @@ | |||
1 | File management in the Linux kernel | ||
2 | ----------------------------------- | ||
3 | |||
4 | This document describes how locking for files (struct file) | ||
5 | and file descriptor table (struct files_struct) works. | |
6 | |||
7 | Up until 2.6.12, the file descriptor table has been protected | ||
8 | with a lock (files->file_lock) and reference count (files->count). | ||
9 | ->file_lock protected accesses to all the file related fields | ||
10 | of the table. ->count was used for sharing the file descriptor | ||
11 | table between tasks cloned with CLONE_FILES flag. Typically | ||
12 | this would be the case for posix threads. As with the common | ||
13 | refcounting model in the kernel, the last task doing | ||
14 | a put_files_struct() frees the file descriptor (fd) table. | ||
15 | The files (struct file) themselves are protected using | ||
16 | reference count (->f_count). | ||
17 | |||
18 | In the new lock-free model of file descriptor management, | ||
19 | the reference counting is similar, but the locking is | ||
20 | based on RCU. The file descriptor table contains multiple | ||
21 | elements - the fd sets (open_fds and close_on_exec), the | |
22 | array of file pointers, the sizes of the sets and the array | |
23 | etc. In order for the updates to appear atomic to | |
24 | a lock-free reader, all the elements of the file descriptor | ||
25 | table are in a separate structure - struct fdtable. | ||
26 | files_struct contains a pointer to struct fdtable through | ||
27 | which the actual fd table is accessed. Initially the | ||
28 | fdtable is embedded in files_struct itself. On a subsequent | ||
29 | expansion of fdtable, a new fdtable structure is allocated | ||
30 | and files->fdt points to the new structure. The fdtable | |
31 | structure is freed with RCU and lock-free readers either | ||
32 | see the old fdtable or the new fdtable making the update | ||
33 | appear atomic. Here are the locking rules for | ||
34 | the fdtable structure - | ||
35 | |||
36 | 1. All references to the fdtable must be done through | ||
37 | the files_fdtable() macro : | ||
38 | |||
39 | struct fdtable *fdt; | ||
40 | |||
41 | rcu_read_lock(); | ||
42 | |||
43 | fdt = files_fdtable(files); | ||
44 | .... | ||
45 | if (n <= fdt->max_fds) | ||
46 | .... | ||
47 | ... | ||
48 | rcu_read_unlock(); | ||
49 | |||
50 | files_fdtable() uses rcu_dereference() macro which takes care of | ||
51 | the memory barrier requirements for lock-free dereference. | ||
52 | The fdtable pointer must be read within the read-side | ||
53 | critical section. | ||
54 | |||
55 | 2. Reading of the fdtable as described above must be protected | ||
56 | by rcu_read_lock()/rcu_read_unlock(). | ||
57 | |||
58 | 3. For any update to the fd table, files->file_lock must | |
59 | be held. | ||
60 | |||
61 | 4. To look up the file structure given an fd, a reader | ||
62 | must use either fcheck() or fcheck_files() APIs. These | ||
63 | take care of barrier requirements due to lock-free lookup. | ||
64 | An example : | ||
65 | |||
66 | struct file *file; | ||
67 | |||
68 | rcu_read_lock(); | ||
69 | file = fcheck(fd); | ||
70 | if (file) { | ||
71 | ... | ||
72 | } | ||
73 | .... | ||
74 | rcu_read_unlock(); | ||
75 | |||
76 | 5. Handling of the file structures is special. Since the look-up | ||
77 | of the fd (fget()/fget_light()) is lock-free, it is possible | |
78 | that look-up may race with the last put() operation on the | ||
79 | file structure. This is avoided using the rcuref APIs | ||
80 | on ->f_count : | ||
81 | |||
82 | rcu_read_lock(); | ||
83 | file = fcheck_files(files, fd); | ||
84 | if (file) { | ||
85 | if (rcuref_inc_lf(&file->f_count)) | ||
86 | *fput_needed = 1; | ||
87 | else | ||
88 | /* Didn't get the reference, someone's freed */ | ||
89 | file = NULL; | ||
90 | } | ||
91 | rcu_read_unlock(); | ||
92 | .... | ||
93 | return file; | ||
94 | |||
95 | rcuref_inc_lf() detects if the refcount is already zero or | |
96 | goes to zero during increment. If it does, we fail | ||
97 | fget()/fget_light(). | ||
98 | |||
99 | 6. Since both fdtable and file structures can be looked up | ||
100 | lock-free, they must be installed using rcu_assign_pointer() | ||
101 | API. If they are looked up lock-free, rcu_dereference() | ||
102 | must be used. However it is advisable to use files_fdtable() | ||
103 | and fcheck()/fcheck_files() which take care of these issues. | ||
104 | |||
105 | 7. While updating, the fdtable pointer must be looked up while | ||
106 | holding files->file_lock. If ->file_lock is dropped, then | ||
107 | another thread may expand the files thereby creating a new | |
108 | fdtable and making the earlier fdtable pointer stale. | ||
109 | For example : | ||
110 | |||
111 | spin_lock(&files->file_lock); | ||
112 | fd = locate_fd(files, file, start); | ||
113 | if (fd >= 0) { | ||
114 | /* locate_fd() may have expanded fdtable, load the ptr */ | ||
115 | fdt = files_fdtable(files); | ||
116 | FD_SET(fd, fdt->open_fds); | ||
117 | FD_CLR(fd, fdt->close_on_exec); | ||
118 | spin_unlock(&files->file_lock); | ||
119 | ..... | ||
120 | |||
121 | Since locate_fd() can drop ->file_lock (and reacquire ->file_lock), | ||
122 | the fdtable pointer (fdt) must be loaded after locate_fd(). | ||
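
The "increment unless already zero" behaviour that rule 5 relies on can be sketched in userspace with C11 atomics. This is an analogue for illustration only: the function name is hypothetical and this is not the kernel's rcuref implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the count has not already dropped to zero.
 * Returns true if the reference was obtained. */
static bool ref_inc_not_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;	/* the object is already being freed */
}
```

A lock-free reader that gets false back treats the lookup as failed, just as fget()/fget_light() do when the increment fails.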
123 | |||
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt new file mode 100644 index 000000000000..6b5741e651a2 --- /dev/null +++ b/Documentation/filesystems/fuse.txt | |||
@@ -0,0 +1,315 @@ | |||
1 | Definitions | ||
2 | ~~~~~~~~~~~ | ||
3 | |||
4 | Userspace filesystem: | ||
5 | |||
6 | A filesystem in which data and metadata are provided by an ordinary | ||
7 | userspace process. The filesystem can be accessed normally through | ||
8 | the kernel interface. | ||
9 | |||
10 | Filesystem daemon: | ||
11 | |||
12 | The process(es) providing the data and metadata of the filesystem. | ||
13 | |||
14 | Non-privileged mount (or user mount): | ||
15 | |||
16 | A userspace filesystem mounted by a non-privileged (non-root) user. | ||
17 | The filesystem daemon is running with the privileges of the mounting | ||
18 | user. NOTE: this is not the same as mounts allowed with the "user" | ||
19 | option in /etc/fstab, which is not discussed here. | ||
20 | |||
21 | Mount owner: | ||
22 | |||
23 | The user who does the mounting. | ||
24 | |||
25 | User: | ||
26 | |||
27 | The user who is performing filesystem operations. | ||
28 | |||
29 | What is FUSE? | ||
30 | ~~~~~~~~~~~~~ | ||
31 | |||
32 | FUSE is a userspace filesystem framework. It consists of a kernel | ||
33 | module (fuse.ko), a userspace library (libfuse.*) and a mount utility | ||
34 | (fusermount). | ||
35 | |||
36 | One of the most important features of FUSE is allowing secure, | ||
37 | non-privileged mounts. This opens up new possibilities for the use of | ||
38 | filesystems. A good example is sshfs: a secure network filesystem | ||
39 | using the sftp protocol. | ||
40 | |||
41 | The userspace library and utilities are available from the FUSE | ||
42 | homepage: | ||
43 | |||
44 | http://fuse.sourceforge.net/ | ||
45 | |||
46 | Mount options | ||
47 | ~~~~~~~~~~~~~ | ||
48 | |||
49 | 'fd=N' | ||
50 | |||
51 | The file descriptor to use for communication between the userspace | ||
52 | filesystem and the kernel. The file descriptor must have been | ||
53 | obtained by opening the FUSE device ('/dev/fuse'). | ||
54 | |||
55 | 'rootmode=M' | ||
56 | |||
57 | The file mode of the filesystem's root in octal representation. | ||
58 | |||
59 | 'user_id=N' | ||
60 | |||
61 | The numeric user id of the mount owner. | ||
62 | |||
63 | 'group_id=N' | ||
64 | |||
65 | The numeric group id of the mount owner. | ||
66 | |||
67 | 'default_permissions' | ||
68 | |||
69 | By default FUSE doesn't check file access permissions; the | ||
70 | filesystem is free to implement its access policy or leave it to | ||
71 | the underlying file access mechanism (e.g. in case of network | ||
72 | filesystems). This option enables permission checking, restricting | ||
73 | access based on file mode. This option is usually useful | ||
74 | together with the 'allow_other' mount option. | ||
75 | |||
76 | 'allow_other' | ||
77 | |||
78 | This option overrides the security measure restricting file access | ||
79 | to the user mounting the filesystem. This option is by default only | ||
80 | allowed to root, but this restriction can be removed with a | ||
81 | (userspace) configuration option. | ||
82 | |||
83 | 'max_read=N' | ||
84 | |||
85 | With this option the maximum size of read operations can be set. | ||
86 | The default is infinite. Note that the size of read requests is | ||
87 | limited anyway to 32 pages (which is 128kbyte on i386). | ||
88 | |||
89 | How do non-privileged mounts work? | ||
90 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
91 | |||
92 | Since the mount() system call is a privileged operation, a helper | ||
93 | program (fusermount) is needed, which is installed setuid root. | ||
94 | |||
95 | The implication of providing non-privileged mounts is that the mount | ||
96 | owner must not be able to use this capability to compromise the | ||
97 | system. Obvious requirements arising from this are: | ||
98 | |||
99 | A) mount owner should not be able to get elevated privileges with the | ||
100 | help of the mounted filesystem | ||
101 | |||
102 | B) mount owner should not get illegitimate access to information from | ||
103 | other users' and the super user's processes | ||
104 | |||
105 | C) mount owner should not be able to induce undesired behavior in | ||
106 | other users' or the super user's processes | ||
107 | |||
108 | How are requirements fulfilled? | ||
109 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
110 | |||
111 | A) The mount owner could gain elevated privileges by either: | ||
112 | |||
113 | 1) creating a filesystem containing a device file, then opening | ||
114 | this device | ||
115 | |||
116 | 2) creating a filesystem containing a suid or sgid application, | ||
117 | then executing this application | ||
118 | |||
119 | The solution is not to allow opening device files and ignore | ||
120 | setuid and setgid bits when executing programs. To ensure this | ||
121 | fusermount always adds "nosuid" and "nodev" to the mount options | ||
122 | for non-privileged mounts. | ||
123 | |||
124 | B) If another user is accessing files or directories in the | ||
125 | filesystem, the filesystem daemon serving requests can record the | ||
126 | exact sequence and timing of operations performed. This | ||
127 | information is otherwise inaccessible to the mount owner, so this | ||
128 | counts as an information leak. | ||
129 | |||
130 | The solution to this problem will be presented in point 2) of C). | ||
131 | |||
132 | C) There are several ways in which the mount owner can induce | ||
133 | undesired behavior in other users' processes, such as: | ||
134 | |||
135 | 1) mounting a filesystem over a file or directory which the mount | ||
136 | owner could otherwise not be able to modify (or could only | ||
137 | make limited modifications). | ||
138 | |||
139 | This is solved in fusermount, by checking the access | ||
140 | permissions on the mountpoint and only allowing the mount if | ||
141 | the mount owner can do unlimited modification (has write | ||
142 | access to the mountpoint, and mountpoint is not a "sticky" | ||
143 | directory) | ||
144 | |||
145 | 2) Even if 1) is solved the mount owner can change the behavior | ||
146 | of other users' processes. | ||
147 | |||
148 | i) It can slow down or indefinitely delay the execution of a | ||
149 | filesystem operation creating a DoS against the user or the | ||
150 | whole system. For example a suid application locking a | ||
151 | system file, and then accessing a file on the mount owner's | ||
152 | filesystem could be stopped, and thus causing the system | ||
153 | file to be locked forever. | ||
154 | |||
155 | ii) It can present files or directories of unlimited length, or | ||
156 | directory structures of unlimited depth, possibly causing a | ||
157 | system process to eat up diskspace, memory or other | ||
158 | resources, again causing DoS. | ||
159 | |||
160 | The solution to this as well as B) is not to allow processes | ||
161 | to access the filesystem, which could otherwise not be | ||
162 | monitored or manipulated by the mount owner. Since if the | ||
163 | mount owner can ptrace a process, it can do all of the above | ||
164 | without using a FUSE mount, the same criteria as used in | ||
165 | ptrace can be used to check if a process is allowed to access | ||
166 | the filesystem or not. | ||
167 | |||
168 | Note that the ptrace check is not strictly necessary to | ||
169 | prevent C/2/i; it is enough to check if the mount owner has enough | ||
170 | privilege to send a signal to the process accessing the | ||
171 | filesystem, since SIGSTOP can be used to get a similar effect. | ||
172 | |||
173 | I think these limitations are unacceptable? | ||
174 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
175 | |||
176 | If a sysadmin trusts the users enough, or can ensure through other | ||
177 | measures that system processes will never enter non-privileged | ||
178 | mounts, the last limitation can be relaxed with a "user_allow_other" | ||
179 | config option. If this config option is set, the mounting user can | ||
182 | |||
183 | Kernel - userspace interface | ||
184 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
185 | |||
186 | The following diagram shows how a filesystem operation (in this | ||
187 | example unlink) is performed in FUSE. | ||
188 | |||
189 | NOTE: everything in this description is greatly simplified | ||
190 | |||
191 | | "rm /mnt/fuse/file" | FUSE filesystem daemon | ||
192 | | | | ||
193 | | | >sys_read() | ||
194 | | | >fuse_dev_read() | ||
195 | | | >request_wait() | ||
196 | | | [sleep on fc->waitq] | ||
197 | | | | ||
198 | | >sys_unlink() | | ||
199 | | >fuse_unlink() | | ||
200 | | [get request from | | ||
201 | | fc->unused_list] | | ||
202 | | >request_send() | | ||
203 | | [queue req on fc->pending] | | ||
204 | | [wake up fc->waitq] | [woken up] | ||
205 | | >request_wait_answer() | | ||
206 | | [sleep on req->waitq] | | ||
207 | | | <request_wait() | ||
208 | | | [remove req from fc->pending] | ||
209 | | | [copy req to read buffer] | ||
210 | | | [add req to fc->processing] | ||
211 | | | <fuse_dev_read() | ||
212 | | | <sys_read() | ||
213 | | | | ||
214 | | | [perform unlink] | ||
215 | | | | ||
216 | | | >sys_write() | ||
217 | | | >fuse_dev_write() | ||
218 | | | [look up req in fc->processing] | ||
219 | | | [remove from fc->processing] | ||
220 | | | [copy write buffer to req] | ||
221 | | [woken up] | [wake up req->waitq] | ||
222 | | | <fuse_dev_write() | ||
223 | | | <sys_write() | ||
224 | | <request_wait_answer() | | ||
225 | | <request_send() | | ||
226 | | [add request to | | ||
227 | | fc->unused_list] | | ||
228 | | <fuse_unlink() | | ||
229 | | <sys_unlink() | | ||
230 | |||
231 | There are a couple of ways in which to deadlock a FUSE filesystem. | ||
232 | Since we are talking about unprivileged userspace programs, | ||
233 | something must be done about these. | ||
234 | |||
235 | Scenario 1 - Simple deadlock | ||
236 | ----------------------------- | ||
237 | |||
238 | | "rm /mnt/fuse/file" | FUSE filesystem daemon | ||
239 | | | | ||
240 | | >sys_unlink("/mnt/fuse/file") | | ||
241 | | [acquire inode semaphore | | ||
242 | | for "file"] | | ||
243 | | >fuse_unlink() | | ||
244 | | [sleep on req->waitq] | | ||
245 | | | <sys_read() | ||
246 | | | >sys_unlink("/mnt/fuse/file") | ||
247 | | | [acquire inode semaphore | ||
248 | | | for "file"] | ||
249 | | | *DEADLOCK* | ||
250 | |||
251 | The solution for this is to allow requests to be interrupted while | ||
252 | they are in userspace: | ||
253 | |||
254 | | [interrupted by signal] | | ||
255 | | <fuse_unlink() | | ||
256 | | [release semaphore] | [semaphore acquired] | ||
257 | | <sys_unlink() | | ||
258 | | | >fuse_unlink() | ||
259 | | | [queue req on fc->pending] | ||
260 | | | [wake up fc->waitq] | ||
261 | | | [sleep on req->waitq] | ||
262 | |||
263 | If the filesystem daemon is single-threaded, this will stop here, | ||
264 | since there's no other thread to dequeue and execute the request. | ||
265 | In this case the solution is to kill the FUSE daemon as well. If | ||
266 | there are multiple serving threads, you just have to kill them as | ||
267 | long as any remain. | ||
268 | |||
269 | Moral: a filesystem which deadlocks can soon find itself dead. | ||
270 | |||
271 | Scenario 2 - Tricky deadlock | ||
272 | ---------------------------- | ||
273 | |||
274 | This one needs a carefully crafted filesystem. It's a variation on | ||
275 | the above, only the call back to the filesystem is not explicit, | ||
276 | but is caused by a pagefault. | ||
277 | |||
278 | | Kamikaze filesystem thread 1 | Kamikaze filesystem thread 2 | ||
279 | | | | ||
280 | | [fd = open("/mnt/fuse/file")] | [request served normally] | ||
281 | | [mmap fd to 'addr'] | | ||
282 | | [close fd] | [FLUSH triggers 'magic' flag] | ||
283 | | [read a byte from addr] | | ||
284 | | >do_page_fault() | | ||
285 | | [find or create page] | | ||
286 | | [lock page] | | ||
287 | | >fuse_readpage() | | ||
288 | | [queue READ request] | | ||
289 | | [sleep on req->waitq] | | ||
290 | | | [read request to buffer] | ||
291 | | | [create reply header before addr] | ||
292 | | | >sys_write(addr - headerlength) | ||
293 | | | >fuse_dev_write() | ||
294 | | | [look up req in fc->processing] | ||
295 | | | [remove from fc->processing] | ||
296 | | | [copy write buffer to req] | ||
297 | | | >do_page_fault() | ||
298 | | | [find or create page] | ||
299 | | | [lock page] | ||
300 | | | * DEADLOCK * | ||
301 | |||
302 | The solution is again to let the request be interrupted (not | ||
303 | elaborated further). | ||
304 | |||
305 | An additional problem is that while the write buffer is being | ||
306 | copied to the request, the request must not be interrupted. This | ||
307 | is because the destination address of the copy may not be valid | ||
308 | after the request is interrupted. | ||
309 | |||
310 | This is solved by doing the copy atomically, and allowing | ||
311 | interruption while the page(s) belonging to the write buffer are | ||
312 | faulted with get_user_pages(). The 'req->locked' flag indicates | ||
313 | when the copy is taking place, and interruption is delayed until | ||
314 | this flag is unset. | ||
315 | |||
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 5024ba7a592c..d4773565ea2f 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt | |||
@@ -1241,16 +1241,38 @@ swap-intensive. | |||
1241 | overcommit_memory | 1241 | overcommit_memory |
1242 | ----------------- | 1242 | ----------------- |
1243 | 1243 | ||
1244 | This file contains one value. The following algorithm is used to decide if | 1244 | Controls overcommit of system memory, possibly allowing processes |
1245 | there's enough memory: if the value of overcommit_memory is positive, then | 1245 | to allocate (but not use) more memory than is actually available. |
1246 | there's always enough memory. This is a useful feature, since programs often | 1246 | |
1247 | malloc() huge amounts of memory 'just in case', while they only use a small | 1247 | |
1248 | part of it. Leaving this value at 0 will lead to the failure of such a huge | 1248 | 0 - Heuristic overcommit handling. Obvious overcommits of |
1249 | malloc(), when in fact the system has enough memory for the program to run. | 1249 | address space are refused. Used for a typical system. It |
1250 | 1250 | ensures a seriously wild allocation fails while allowing | |
1251 | On the other hand, enabling this feature can cause you to run out of memory | 1251 | overcommit to reduce swap usage. root is allowed to |
1252 | and thrash the system to death, so large and/or important servers will want to | 1252 | allocate slightly more memory in this mode. This is the |
1253 | set this value to 0. | 1253 | default. |
1254 | |||
1255 | 1 - Always overcommit. Appropriate for some scientific | ||
1256 | applications. | ||
1257 | |||
1258 | 2 - Don't overcommit. The total address space commit | ||
1259 | for the system is not permitted to exceed swap plus a | ||
1260 | configurable percentage (default is 50) of physical RAM. | ||
1261 | Depending on the percentage you use, in most situations | ||
1262 | this means a process will not be killed while attempting | ||
1263 | to use already-allocated memory but will receive errors | ||
1264 | on memory allocation as appropriate. | ||
1265 | |||
1266 | overcommit_ratio | ||
1267 | ---------------- | ||
1268 | |||
1269 | Percentage of physical memory size to include in overcommit calculations | ||
1270 | (see above.) | ||
1271 | |||
1272 | Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100) | ||
1273 | |||
1274 | swapspace = total size of all swap areas | ||
1275 | physmem = size of physical memory in system | ||
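As a worked example of the formula above, the following sketch computes the limit for mode 2. The function name commit_limit() is made up for illustration; the kernel performs the equivalent calculation internally on pages, while this sketch works in whatever unit the caller passes in (kB here).

```c
/*
 * Illustrative only: allocation limit for overcommit_memory mode 2.
 * swapspace and physmem must use the same unit; ratio is the
 * overcommit_ratio percentage.
 */
unsigned long long commit_limit(unsigned long long swapspace,
				unsigned long long physmem,
				unsigned int ratio)
{
	/* multiply before dividing so integer division loses less precision */
	return swapspace + physmem * ratio / 100;
}
```

With 2 GB of swap (2097152 kB), 8 GB of RAM (8388608 kB) and the default ratio of 50, the limit is 2097152 + 4194304 = 6291456 kB, i.e. 6 GB.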
1254 | 1276 | ||
1255 | nr_hugepages and hugetlb_shm_group | 1277 | nr_hugepages and hugetlb_shm_group |
1256 | ---------------------------------- | 1278 | ---------------------------------- |
diff --git a/Documentation/filesystems/v9fs.txt b/Documentation/filesystems/v9fs.txt new file mode 100644 index 000000000000..4e92feb6b507 --- /dev/null +++ b/Documentation/filesystems/v9fs.txt | |||
@@ -0,0 +1,95 @@ | |||
1 | V9FS: 9P2000 for Linux | ||
2 | ====================== | ||
3 | |||
4 | ABOUT | ||
5 | ===== | ||
6 | |||
7 | v9fs is a Unix implementation of the Plan 9 9P remote filesystem protocol. | ||
8 | |||
9 | This software was originally developed by Ron Minnich <rminnich@lanl.gov> | ||
10 | and Maya Gokhale <maya@lanl.gov>. Additional development by Greg Watson | ||
11 | <gwatson@lanl.gov> and most recently Eric Van Hensbergen | ||
12 | <ericvh@gmail.com> and Latchesar Ionkov <lucho@ionkov.net>. | ||
13 | |||
14 | USAGE | ||
15 | ===== | ||
16 | |||
17 | For remote file server: | ||
18 | |||
19 | mount -t 9P 10.10.1.2 /mnt/9 | ||
20 | |||
21 | For Plan 9 From User Space applications (http://swtch.com/plan9) | ||
22 | |||
23 | mount -t 9P `namespace`/acme /mnt/9 -o proto=unix,name=$USER | ||
24 | |||
25 | OPTIONS | ||
26 | ======= | ||
27 | |||
28 | proto=name select an alternative transport. Valid options are | ||
29 | currently: | ||
30 | unix - specifying a named pipe mount point | ||
31 | tcp - specifying a normal TCP/IP connection | ||
32 | fd - use passed-in file descriptors for the connection | ||
33 | (see rfdno and wfdno) | ||
34 | |||
35 | name=name user name to attempt mount as on the remote server. The | ||
36 | server may override or ignore this value. Certain user | ||
37 | names may require authentication. | ||
38 | |||
39 | aname=name aname specifies the file tree to access when the server is | ||
40 | offering several exported file systems. | ||
41 | |||
42 | debug=n specifies debug level. The debug level is a bitmask. | ||
43 | 0x01 = display verbose error messages | ||
44 | 0x02 = developer debug (DEBUG_CURRENT) | ||
45 | 0x04 = display 9P trace | ||
46 | 0x08 = display VFS trace | ||
47 | 0x10 = display Marshalling debug | ||
48 | 0x20 = display RPC debug | ||
49 | 0x40 = display transport debug | ||
50 | 0x80 = display allocation debug | ||
51 | |||
52 | rfdno=n the file descriptor for reading with proto=fd | ||
53 | |||
54 | wfdno=n the file descriptor for writing with proto=fd | ||
55 | |||
56 | maxdata=n the number of bytes to use for 9P packet payload (msize) | ||
57 | |||
58 | port=n port to connect to on the remote server | ||
59 | |||
60 | timeout=n request timeouts (in ms) (default 60000ms) | ||
61 | |||
62 | noextend force legacy mode (no 9P2000.u semantics) | ||
63 | |||
64 | uid attempt to mount as a particular uid | ||
65 | |||
66 | gid attempt to mount with a particular gid | ||
67 | |||
68 | afid security channel - used by Plan 9 authentication protocols | ||
69 | |||
70 | nodevmap do not map special files - represent them as normal files. | ||
71 | This can be used to share devices/named pipes/sockets between | ||
72 | hosts. This functionality will be expanded in later versions. | ||
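Because the debug level is a bitmask, several kinds of output can be enabled at once by OR-ing the values from the table above. The sketch below uses made-up constant names (only the numeric values come from the table; the names are not the identifiers used in the v9fs sources) to show how a value for debug=n could be composed and tested.

```c
/* Hypothetical names; only the values are taken from the option table. */
enum v9fs_debug_bit {
	V9DBG_ERROR   = 0x01,	/* verbose error messages */
	V9DBG_CURRENT = 0x02,	/* developer debug */
	V9DBG_9P      = 0x04,	/* 9P trace */
	V9DBG_VFS     = 0x08,	/* VFS trace */
	V9DBG_MARSHAL = 0x10,	/* marshalling debug */
	V9DBG_RPC     = 0x20,	/* RPC debug */
	V9DBG_TRANS   = 0x40,	/* transport debug */
	V9DBG_ALLOC   = 0x80	/* allocation debug */
};

/* Nonzero if the given bit is enabled in a debug=n mask. */
int v9dbg_enabled(unsigned int mask, enum v9fs_debug_bit bit)
{
	return (mask & bit) != 0;
}
```

For example, V9DBG_ERROR | V9DBG_9P gives 0x05, which would request verbose error messages plus the 9P trace and nothing else.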
73 | |||
74 | RESOURCES | ||
75 | ========= | ||
76 | |||
77 | The Linux version of the 9P server, along with some client-side utilities | ||
78 | can be found at http://v9fs.sf.net (along with a CVS repository of the | ||
79 | development branch of this module). There are user and developer mailing | ||
80 | lists here, as well as a bug-tracker. | ||
81 | |||
82 | For more information on the Plan 9 Operating System check out | ||
83 | http://plan9.bell-labs.com/plan9 | ||
84 | |||
85 | For information on Plan 9 from User Space (Plan 9 applications and libraries | ||
86 | ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 | ||
87 | |||
88 | |||
89 | STATUS | ||
90 | ====== | ||
91 | |||
92 | The 2.6 kernel support is working on PPC and x86. | ||
93 | |||
94 | PLEASE USE THE SOURCEFORGE BUG-TRACKER TO REPORT PROBLEMS. | ||
95 | |||
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 3f318dd44c77..f042c12e0ed2 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt | |||
@@ -1,35 +1,27 @@ | |||
1 | /* -*- auto-fill -*- */ | ||
2 | 1 | ||
3 | Overview of the Virtual File System | 2 | Overview of the Linux Virtual File System |
4 | 3 | ||
5 | Richard Gooch <rgooch@atnf.csiro.au> | 4 | Original author: Richard Gooch <rgooch@atnf.csiro.au> |
6 | 5 | ||
7 | 5-JUL-1999 | 6 | Last updated on August 25, 2005 |
8 | 7 | ||
8 | Copyright (C) 1999 Richard Gooch | ||
9 | Copyright (C) 2005 Pekka Enberg | ||
9 | 10 | ||
10 | Conventions used in this document <section> | 11 | This file is released under the GPLv2. |
11 | ================================= | ||
12 | 12 | ||
13 | Each section in this document will have the string "<section>" at the | ||
14 | right-hand side of the section title. Each subsection will have | ||
15 | "<subsection>" at the right-hand side. These strings are meant to make | ||
16 | it easier to search through the document. | ||
17 | 13 | ||
18 | NOTE that the master copy of this document is available online at: | 14 | What is it? |
19 | http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt | ||
20 | |||
21 | |||
22 | What is it? <section> | ||
23 | =========== | 15 | =========== |
24 | 16 | ||
25 | The Virtual File System (otherwise known as the Virtual Filesystem | 17 | The Virtual File System (otherwise known as the Virtual Filesystem |
26 | Switch) is the software layer in the kernel that provides the | 18 | Switch) is the software layer in the kernel that provides the |
27 | filesystem interface to userspace programs. It also provides an | 19 | filesystem interface to userspace programs. It also provides an |
28 | abstraction within the kernel which allows different filesystem | 20 | abstraction within the kernel which allows different filesystem |
29 | implementations to co-exist. | 21 | implementations to coexist. |
30 | 22 | ||
31 | 23 | ||
32 | A Quick Look At How It Works <section> | 24 | A Quick Look At How It Works |
33 | ============================ | 25 | ============================ |
34 | 26 | ||
35 | In this section I'll briefly describe how things work, before | 27 | In this section I'll briefly describe how things work, before |
@@ -38,7 +30,8 @@ when user programs open and manipulate files, and then look from the | |||
38 | other view which is how a filesystem is supported and subsequently | 30 | other view which is how a filesystem is supported and subsequently |
39 | mounted. | 31 | mounted. |
40 | 32 | ||
41 | Opening a File <subsection> | 33 | |
34 | Opening a File | ||
42 | -------------- | 35 | -------------- |
43 | 36 | ||
44 | The VFS implements the open(2), stat(2), chmod(2) and similar system | 37 | The VFS implements the open(2), stat(2), chmod(2) and similar system |
@@ -77,7 +70,7 @@ back to userspace. | |||
77 | 70 | ||
78 | Opening a file requires another operation: allocation of a file | 71 | Opening a file requires another operation: allocation of a file |
79 | structure (this is the kernel-side implementation of file | 72 | structure (this is the kernel-side implementation of file |
80 | descriptors). The freshly allocated file structure is initialised with | 73 | descriptors). The freshly allocated file structure is initialized with |
81 | a pointer to the dentry and a set of file operation member functions. | 74 | a pointer to the dentry and a set of file operation member functions. |
82 | These are taken from the inode data. The open() file method is then | 75 | These are taken from the inode data. The open() file method is then |
83 | called so the specific filesystem implementation can do its work. You | 76 | called so the specific filesystem implementation can do its work. You |
@@ -102,7 +95,8 @@ filesystem or driver code at the same time, on different | |||
102 | processors. You should ensure that access to shared resources is | 95 | processors. You should ensure that access to shared resources is |
103 | protected by appropriate locks. | 96 | protected by appropriate locks. |
104 | 97 | ||
105 | Registering and Mounting a Filesystem <subsection> | 98 | |
99 | Registering and Mounting a Filesystem | ||
106 | ------------------------------------- | 100 | ------------------------------------- |
107 | 101 | ||
108 | If you want to support a new kind of filesystem in the kernel, all you | 102 | If you want to support a new kind of filesystem in the kernel, all you |
@@ -123,17 +117,21 @@ updated to point to the root inode for the new filesystem. | |||
123 | It's now time to look at things in more detail. | 117 | It's now time to look at things in more detail. |
124 | 118 | ||
125 | 119 | ||
126 | struct file_system_type <section> | 120 | struct file_system_type |
127 | ======================= | 121 | ======================= |
128 | 122 | ||
129 | This describes the filesystem. As of kernel 2.1.99, the following | 123 | This describes the filesystem. As of kernel 2.6.13, the following |
130 | members are defined: | 124 | members are defined: |
131 | 125 | ||
132 | struct file_system_type { | 126 | struct file_system_type { |
133 | const char *name; | 127 | const char *name; |
134 | int fs_flags; | 128 | int fs_flags; |
135 | struct super_block *(*read_super) (struct super_block *, void *, int); | 129 | struct super_block *(*get_sb) (struct file_system_type *, int, |
136 | struct file_system_type * next; | 130 | const char *, void *); |
131 | void (*kill_sb) (struct super_block *); | ||
132 | struct module *owner; | ||
133 | struct file_system_type * next; | ||
134 | struct list_head fs_supers; | ||
137 | }; | 135 | }; |
138 | 136 | ||
139 | name: the name of the filesystem type, such as "ext2", "iso9660", | 137 | name: the name of the filesystem type, such as "ext2", "iso9660", |
@@ -141,51 +139,97 @@ struct file_system_type { | |||
141 | 139 | ||
142 | fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) | 140 | fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) |
143 | 141 | ||
144 | read_super: the method to call when a new instance of this | 142 | get_sb: the method to call when a new instance of this |
145 | filesystem should be mounted | 143 | filesystem should be mounted |
146 | 144 | ||
147 | next: for internal VFS use: you should initialise this to NULL | 145 | kill_sb: the method to call when an instance of this filesystem |
146 | should be unmounted | ||
147 | |||
148 | owner: for internal VFS use: you should initialize this to THIS_MODULE in | ||
149 | most cases. | ||
148 | 150 | ||
149 | The read_super() method has the following arguments: | 151 | next: for internal VFS use: you should initialize this to NULL |
152 | |||
153 | The get_sb() method has the following arguments: | ||
150 | 154 | ||
151 | struct super_block *sb: the superblock structure. This is partially | 155 | struct super_block *sb: the superblock structure. This is partially |
152 | initialised by the VFS and the rest must be initialised by the | 156 | initialized by the VFS and the rest must be initialized by the |
153 | read_super() method | 157 | get_sb() method |
158 | |||
159 | int flags: mount flags | ||
160 | |||
161 | const char *dev_name: the device name we are mounting. | ||
154 | 162 | ||
155 | void *data: arbitrary mount options, usually comes as an ASCII | 163 | void *data: arbitrary mount options, usually comes as an ASCII |
156 | string | 164 | string |
157 | 165 | ||
158 | int silent: whether or not to be silent on error | 166 | int silent: whether or not to be silent on error |
159 | 167 | ||
160 | The read_super() method must determine if the block device specified | 168 | The get_sb() method must determine if the block device specified |
161 | in the superblock contains a filesystem of the type the method | 169 | in the superblock contains a filesystem of the type the method |
162 | supports. On success the method returns the superblock pointer, on | 170 | supports. On success the method returns the superblock pointer, on |
163 | failure it returns NULL. | 171 | failure it returns NULL. |
164 | 172 | ||
165 | The most interesting member of the superblock structure that the | 173 | The most interesting member of the superblock structure that the |
166 | read_super() method fills in is the "s_op" field. This is a pointer to | 174 | get_sb() method fills in is the "s_op" field. This is a pointer to |
167 | a "struct super_operations" which describes the next level of the | 175 | a "struct super_operations" which describes the next level of the |
168 | filesystem implementation. | 176 | filesystem implementation. |
169 | 177 | ||
178 | Usually, a filesystem uses one of the generic get_sb() | ||
179 | implementations and provides a fill_super() method instead. The | ||
180 | generic methods are: | ||
181 | |||
182 | get_sb_bdev: mount a filesystem residing on a block device | ||
170 | 183 | ||
171 | struct super_operations <section> | 184 | get_sb_nodev: mount a filesystem that is not backed by a device |
185 | |||
186 | get_sb_single: mount a filesystem which shares the instance between | ||
187 | all mounts | ||
188 | |||
189 | A fill_super() method implementation has the following arguments: | ||
190 | |||
191 | struct super_block *sb: the superblock structure. The method fill_super() | ||
192 | must initialize this properly. | ||
193 | |||
194 | void *data: arbitrary mount options, usually comes as an ASCII | ||
195 | string | ||
196 | |||
197 | int silent: whether or not to be silent on error | ||
198 | |||
199 | |||
200 | struct super_operations | ||
172 | ======================= | 201 | ======================= |
173 | 202 | ||
174 | This describes how the VFS can manipulate the superblock of your | 203 | This describes how the VFS can manipulate the superblock of your |
175 | filesystem. As of kernel 2.1.99, the following members are defined: | 204 | filesystem. As of kernel 2.6.13, the following members are defined: |
176 | 205 | ||
177 | struct super_operations { | 206 | struct super_operations { |
178 | void (*read_inode) (struct inode *); | 207 | struct inode *(*alloc_inode)(struct super_block *sb); |
179 | int (*write_inode) (struct inode *, int); | 208 | void (*destroy_inode)(struct inode *); |
180 | void (*put_inode) (struct inode *); | 209 | |
181 | void (*drop_inode) (struct inode *); | 210 | void (*read_inode) (struct inode *); |
182 | void (*delete_inode) (struct inode *); | 211 | |
183 | int (*notify_change) (struct dentry *, struct iattr *); | 212 | void (*dirty_inode) (struct inode *); |
184 | void (*put_super) (struct super_block *); | 213 | int (*write_inode) (struct inode *, int); |
185 | void (*write_super) (struct super_block *); | 214 | void (*put_inode) (struct inode *); |
186 | int (*statfs) (struct super_block *, struct statfs *, int); | 215 | void (*drop_inode) (struct inode *); |
187 | int (*remount_fs) (struct super_block *, int *, char *); | 216 | void (*delete_inode) (struct inode *); |
188 | void (*clear_inode) (struct inode *); | 217 | void (*put_super) (struct super_block *); |
218 | void (*write_super) (struct super_block *); | ||
219 | int (*sync_fs)(struct super_block *sb, int wait); | ||
220 | void (*write_super_lockfs) (struct super_block *); | ||
221 | void (*unlockfs) (struct super_block *); | ||
222 | int (*statfs) (struct super_block *, struct kstatfs *); | ||
223 | int (*remount_fs) (struct super_block *, int *, char *); | ||
224 | void (*clear_inode) (struct inode *); | ||
225 | void (*umount_begin) (struct super_block *); | ||
226 | |||
227 | void (*sync_inodes) (struct super_block *sb, | ||
228 | struct writeback_control *wbc); | ||
229 | int (*show_options)(struct seq_file *, struct vfsmount *); | ||
230 | |||
231 | ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); | ||
232 | ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); | ||
189 | }; | 233 | }; |
190 | 234 | ||
191 | All methods are called without any locks being held, unless otherwise | 235 | All methods are called without any locks being held, unless otherwise |
@@ -193,43 +237,62 @@ noted. This means that most methods can block safely. All methods are | |||
193 | only called from a process context (i.e. not from an interrupt handler | 237 | only called from a process context (i.e. not from an interrupt handler |
194 | or bottom half). | 238 | or bottom half). |
195 | 239 | ||
240 | alloc_inode: this method is called by alloc_inode() to allocate memory | ||
241 | for struct inode and initialize it. | ||
242 | |||
243 | destroy_inode: this method is called by destroy_inode() to release | ||
244 | resources allocated for struct inode. | ||
245 | |||
196 | read_inode: this method is called to read a specific inode from the | 246 | read_inode: this method is called to read a specific inode from the |
197 | mounted filesystem. The "i_ino" member in the "struct inode" | 247 | mounted filesystem. The i_ino member in the struct inode is |
198 | will be initialised by the VFS to indicate which inode to | 248 | initialized by the VFS to indicate which inode to read. Other |
199 | read. Other members are filled in by this method | 249 | members are filled in by this method. |
250 | |||
251 | You can set this to NULL and use iget5_locked() instead of iget() | ||
252 | to read inodes. This is necessary for filesystems for which the | ||
253 | inode number is not sufficient to identify an inode. | ||
254 | |||
255 | dirty_inode: this method is called by the VFS to mark an inode dirty. | ||
200 | 256 | ||
201 | write_inode: this method is called when the VFS needs to write an | 257 | write_inode: this method is called when the VFS needs to write an |
202 | inode to disc. The second parameter indicates whether the write | 258 | inode to disc. The second parameter indicates whether the write |
203 | should be synchronous or not, not all filesystems check this flag. | 259 | should be synchronous or not, not all filesystems check this flag. |
204 | 260 | ||
205 | put_inode: called when the VFS inode is removed from the inode | 261 | put_inode: called when the VFS inode is removed from the inode |
206 | cache. This method is optional | 262 | cache. |
207 | 263 | ||
208 | drop_inode: called when the last access to the inode is dropped, | 264 | drop_inode: called when the last access to the inode is dropped, |
209 | with the inode_lock spinlock held. | 265 | with the inode_lock spinlock held. |
210 | 266 | ||
211 | This method should be either NULL (normal unix filesystem | 267 | This method should be either NULL (normal UNIX filesystem |
212 | semantics) or "generic_delete_inode" (for filesystems that do not | 268 | semantics) or "generic_delete_inode" (for filesystems that do not |
213 | want to cache inodes - causing "delete_inode" to always be | 269 | want to cache inodes - causing "delete_inode" to always be |
214 | called regardless of the value of i_nlink) | 270 | called regardless of the value of i_nlink) |
215 | 271 | ||
216 | The "generic_delete_inode()" behaviour is equivalent to the | 272 | The "generic_delete_inode()" behavior is equivalent to the |
217 | old practice of using "force_delete" in the put_inode() case, | 273 | old practice of using "force_delete" in the put_inode() case, |
218 | but does not have the races that the "force_delete()" approach | 274 | but does not have the races that the "force_delete()" approach |
219 | had. | 275 | had. |
220 | 276 | ||
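The drop_inode rule above (NULL for normal UNIX semantics, or "generic_delete_inode" to bypass the inode cache) can be modelled in a few lines of userspace C. This is only a sketch of the dispatch idea, not kernel code: every demo_* name is hypothetical, and the "cached" and "deleted" flags stand in for what the real inode cache and delete_inode() would do.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins; these are NOT the kernel definitions. */
struct demo_inode {
	unsigned int i_nlink;	/* remaining hard links */
	int deleted;		/* set when the delete path ran */
	int cached;		/* set when the inode stays in the cache */
};

struct demo_super_operations {
	void (*drop_inode)(struct demo_inode *);
};

/* Plays the role of "generic_delete_inode": always delete. */
static void demo_delete_inode(struct demo_inode *inode)
{
	inode->deleted = 1;
	inode->cached = 0;
}

/* Default used when drop_inode is NULL: delete only if unlinked. */
static void demo_default_drop(struct demo_inode *inode)
{
	if (inode->i_nlink == 0)
		demo_delete_inode(inode);
	else
		inode->cached = 1;
}

/* Model of the last reference to an inode being dropped. */
static void demo_last_iput(const struct demo_super_operations *op,
			   struct demo_inode *inode)
{
	void (*drop)(struct demo_inode *) = demo_default_drop;

	assert(inode != NULL);
	if (op && op->drop_inode)
		drop = op->drop_inode;
	drop(inode);
}
```

With drop_inode left NULL, an inode that still has links stays cached; pointing it at the always-delete helper removes the inode regardless of i_nlink, which is the behavior the text describes for filesystems that do not want to cache inodes.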
221 | delete_inode: called when the VFS wants to delete an inode | 277 | delete_inode: called when the VFS wants to delete an inode |
222 | 278 | ||
223 | notify_change: called when VFS inode attributes are changed. If this | ||
224 | is NULL the VFS falls back to the write_inode() method. This | ||
225 | is called with the kernel lock held | ||
226 | |||
227 | put_super: called when the VFS wishes to free the superblock | 279 | put_super: called when the VFS wishes to free the superblock |
228 | (i.e. unmount). This is called with the superblock lock held | 280 | (i.e. unmount). This is called with the superblock lock held |
229 | 281 | ||
230 | write_super: called when the VFS superblock needs to be written to | 282 | write_super: called when the VFS superblock needs to be written to |
231 | disc. This method is optional | 283 | disc. This method is optional |
232 | 284 | ||
285 | sync_fs: called when VFS is writing out all dirty data associated with | ||
286 | a superblock. The second parameter indicates whether the method | ||
287 | should wait until the write out has been completed. Optional. | ||
288 | |||
289 | write_super_lockfs: called when VFS is locking a filesystem and forcing | ||
290 | it into a consistent state. This function is currently used by the | ||
291 | Logical Volume Manager (LVM). | ||
292 | |||
293 | unlockfs: called when VFS is unlocking a filesystem and making it writable | ||
294 | again. | ||
295 | |||
233 | statfs: called when the VFS needs to get filesystem statistics. This | 296 | statfs: called when the VFS needs to get filesystem statistics. This |
234 | is called with the kernel lock held | 297 | is called with the kernel lock held |
235 | 298 | ||
@@ -238,21 +301,31 @@ or bottom half). | |||
238 | 301 | ||
239 | clear_inode: called when the VFS clears the inode. Optional | 302 | clear_inode: called when the VFS clears the inode. Optional |
240 | 303 | ||
304 | umount_begin: called when the VFS is unmounting a filesystem. | ||
305 | |||
306 | sync_inodes: called when the VFS is writing out dirty data associated with | ||
307 | a superblock. | ||
308 | |||
309 | show_options: called by the VFS to show mount options for /proc/<pid>/mounts. | ||
310 | |||
311 | quota_read: called by the VFS to read from filesystem quota file. | ||
312 | |||
313 | quota_write: called by the VFS to write to filesystem quota file. | ||
314 | |||
241 | The read_inode() method is responsible for filling in the "i_op" | 315 | The read_inode() method is responsible for filling in the "i_op" |
242 | field. This is a pointer to a "struct inode_operations" which | 316 | field. This is a pointer to a "struct inode_operations" which |
243 | describes the methods that can be performed on individual inodes. | 317 | describes the methods that can be performed on individual inodes. |
244 | 318 | ||
245 | 319 | ||
246 | struct inode_operations <section> | 320 | struct inode_operations |
247 | ======================= | 321 | ======================= |
248 | 322 | ||
249 | This describes how the VFS can manipulate an inode in your | 323 | This describes how the VFS can manipulate an inode in your |
250 | filesystem. As of kernel 2.1.99, the following members are defined: | 324 | filesystem. As of kernel 2.6.13, the following members are defined: |
251 | 325 | ||
252 | struct inode_operations { | 326 | struct inode_operations { |
253 | struct file_operations * default_file_ops; | 327 | int (*create) (struct inode *,struct dentry *,int, struct nameidata *); |
254 | int (*create) (struct inode *,struct dentry *,int); | 328 | struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *); |
255 | int (*lookup) (struct inode *,struct dentry *); | ||
256 | int (*link) (struct dentry *,struct inode *,struct dentry *); | 329 | int (*link) (struct dentry *,struct inode *,struct dentry *); |
257 | int (*unlink) (struct inode *,struct dentry *); | 330 | int (*unlink) (struct inode *,struct dentry *); |
258 | int (*symlink) (struct inode *,struct dentry *,const char *); | 331 | int (*symlink) (struct inode *,struct dentry *,const char *); |
@@ -261,25 +334,22 @@ struct inode_operations { | |||
261 | int (*mknod) (struct inode *,struct dentry *,int,dev_t); | 334 | int (*mknod) (struct inode *,struct dentry *,int,dev_t); |
262 | int (*rename) (struct inode *, struct dentry *, | 335 | int (*rename) (struct inode *, struct dentry *, |
263 | struct inode *, struct dentry *); | 336 | struct inode *, struct dentry *); |
264 | int (*readlink) (struct dentry *, char *,int); | 337 | int (*readlink) (struct dentry *, char __user *,int); |
265 | struct dentry * (*follow_link) (struct dentry *, struct dentry *); | 338 | void * (*follow_link) (struct dentry *, struct nameidata *); |
266 | int (*readpage) (struct file *, struct page *); | 339 | void (*put_link) (struct dentry *, struct nameidata *, void *); |
267 | int (*writepage) (struct page *page, struct writeback_control *wbc); | ||
268 | int (*bmap) (struct inode *,int); | ||
269 | void (*truncate) (struct inode *); | 340 | void (*truncate) (struct inode *); |
270 | int (*permission) (struct inode *, int); | 341 | int (*permission) (struct inode *, int, struct nameidata *); |
271 | int (*smap) (struct inode *,int); | 342 | int (*setattr) (struct dentry *, struct iattr *); |
272 | int (*updatepage) (struct file *, struct page *, const char *, | 343 | int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); |
273 | unsigned long, unsigned int, int); | 344 | int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); |
274 | int (*revalidate) (struct dentry *); | 345 | ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); |
346 | ssize_t (*listxattr) (struct dentry *, char *, size_t); | ||
347 | int (*removexattr) (struct dentry *, const char *); | ||
275 | }; | 348 | }; |
276 | 349 | ||
277 | Again, all methods are called without any locks being held, unless | 350 | Again, all methods are called without any locks being held, unless |
278 | otherwise noted. | 351 | otherwise noted. |
279 | 352 | ||
280 | default_file_ops: this is a pointer to a "struct file_operations" | ||
281 | which describes how to open and then manipulate open files | ||
282 | |||
283 | create: called by the open(2) and creat(2) system calls. Only | 353 | create: called by the open(2) and creat(2) system calls. Only |
284 | required if you want to support regular files. The dentry you | 354 | required if you want to support regular files. The dentry you |
285 | get should not have an inode (i.e. it should be a negative | 355 | get should not have an inode (i.e. it should be a negative |
@@ -328,31 +398,143 @@ otherwise noted. | |||
328 | you want to support reading symbolic links | 398 | you want to support reading symbolic links |
329 | 399 | ||
330 | follow_link: called by the VFS to follow a symbolic link to the | 400 | follow_link: called by the VFS to follow a symbolic link to the |
331 | inode it points to. Only required if you want to support | 401 | inode it points to. Only required if you want to support |
332 | symbolic links | 402 | symbolic links. This function returns a void pointer cookie |
403 | that is passed to put_link(). | ||
404 | |||
405 | put_link: called by the VFS to release resources allocated by | ||
406 | follow_link(). The cookie returned by follow_link() is passed to | ||
407 | this function as the last parameter. It is used by filesystems | ||
408 | such as NFS where the page cache is not stable (i.e. a page that was | ||
409 | installed when the symbolic link walk started might not be in the | ||
410 | page cache at the end of the walk). | ||
411 | |||
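The follow_link()/put_link() pairing described above is a cookie pattern: the first call hands back an opaque pointer to whatever it allocated, and the second call receives that same pointer to free it. The userspace sketch below illustrates only that pattern. The demo_* names and the simplified signatures (a plain string instead of a dentry) are assumptions made for illustration and do not match the kernel prototypes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for struct nameidata; only carries the link body here. */
struct demo_nameidata {
	const char *link;
};

/*
 * follow_link analogue: take a private copy of the link target so it
 * stays valid for the whole walk, publish it through nd, and return
 * the allocation as the cookie.
 */
static void *demo_follow_link(const char *target, struct demo_nameidata *nd)
{
	size_t len;
	char *copy;

	assert(target != NULL && nd != NULL);
	len = strlen(target) + 1;
	copy = malloc(len);
	if (!copy)
		return NULL;
	memcpy(copy, target, len);
	nd->link = copy;
	return copy;
}

/* put_link analogue: release exactly what follow_link handed out. */
static void demo_put_link(struct demo_nameidata *nd, void *cookie)
{
	nd->link = NULL;
	free(cookie);
}
```

Because the cookie is owned by the caller between the two calls, the link body cannot disappear in the middle of the walk, which is the property the NFS example in the text relies on.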
412 | truncate: called by the VFS to change the size of a file. The i_size | ||
413 | field of the inode is set to the desired size by the VFS before | ||
414 | this function is called. This function is called by the truncate(2) | ||
415 | system call and related functionality. | ||
416 | |||
417 | permission: called by the VFS to check for access rights on a POSIX-like | ||
418 | filesystem. | ||
419 | |||
420 | setattr: called by the VFS to set attributes for a file. This function is | ||
421 | called by chmod(2) and related system calls. | ||
422 | |||
423 | getattr: called by the VFS to get attributes of a file. This function is | ||
424 | called by stat(2) and related system calls. | ||
425 | |||
426 | setxattr: called by the VFS to set an extended attribute for a file. | ||
427 | An extended attribute is a name:value pair associated with an inode. This | ||
428 | function is called by the setxattr(2) system call. | ||
429 | |||
430 | getxattr: called by the VFS to retrieve the value of an extended attribute | ||
431 | name. This function is called by the getxattr(2) system call. | ||
432 | |||
433 | listxattr: called by the VFS to list all extended attributes for a given | ||
434 | file. This function is called by the listxattr(2) system call. | ||
435 | |||
436 | removexattr: called by the VFS to remove an extended attribute from a file. | ||
437 | This function is called by the removexattr(2) system call. | ||
438 | |||
439 | |||
440 | struct address_space_operations | ||
441 | =============================== | ||
442 | |||
443 | This describes how the VFS can manipulate the mapping of a file to the page cache in | ||
444 | your filesystem. As of kernel 2.6.13, the following members are defined: | ||
445 | |||
446 | struct address_space_operations { | ||
447 | int (*writepage)(struct page *page, struct writeback_control *wbc); | ||
448 | int (*readpage)(struct file *, struct page *); | ||
449 | int (*sync_page)(struct page *); | ||
450 | int (*writepages)(struct address_space *, struct writeback_control *); | ||
451 | int (*set_page_dirty)(struct page *page); | ||
452 | int (*readpages)(struct file *filp, struct address_space *mapping, | ||
453 | struct list_head *pages, unsigned nr_pages); | ||
454 | int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); | ||
455 | int (*commit_write)(struct file *, struct page *, unsigned, unsigned); | ||
456 | sector_t (*bmap)(struct address_space *, sector_t); | ||
457 | int (*invalidatepage) (struct page *, unsigned long); | ||
458 | int (*releasepage) (struct page *, int); | ||
459 | ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, | ||
460 | loff_t offset, unsigned long nr_segs); | ||
461 | struct page* (*get_xip_page)(struct address_space *, sector_t, | ||
462 | int); | ||
463 | }; | ||
464 | |||
465 | writepage: called by the VM to write a dirty page to backing store. | ||
466 | |||
467 | readpage: called by the VM to read a page from backing store. | ||
468 | |||
469 | sync_page: called by the VM to notify the backing store to perform all | ||
470 | queued I/O operations for a page. I/O operations for other pages | ||
471 | associated with this address_space object may also be performed. | ||
472 | |||
473 | writepages: called by the VM to write out pages associated with the | ||
474 | address_space object. | ||
475 | |||
476 | set_page_dirty: called by the VM to set a page dirty. | ||
477 | |||
478 | readpages: called by the VM to read pages associated with the address_space | ||
479 | object. | ||
333 | 480 | ||
481 | prepare_write: called by the generic write path in VM to set up a write | ||
482 | request for a page. | ||
334 | 483 | ||
335 | struct file_operations <section> | 484 | commit_write: called by the generic write path in VM to write page to |
485 | its backing store. | ||
486 | |||
487 | bmap: called by the VFS to map a logical block offset within an object to | ||
488 | a physical block number. This method is used by the legacy FIBMAP | ||
489 | ioctl. Other uses are discouraged. | ||
490 | |||
491 | invalidatepage: called by the VM on truncate to disassociate a page from its | ||
492 | address_space mapping. | ||
493 | |||
494 | releasepage: called by the VFS to release filesystem specific metadata from | ||
495 | a page. | ||
496 | |||
497 | direct_IO: called by the VM for direct I/O writes and reads. | ||
498 | |||
499 | get_xip_page: called by the VM to translate a block number to a page. | ||
500 | The page is valid until the corresponding filesystem is unmounted. | ||
501 | Filesystems that want to use execute-in-place (XIP) need to implement | ||
502 | it. An example implementation can be found in fs/ext2/xip.c. | ||
503 | |||
504 | |||
505 | struct file_operations | ||
336 | ====================== | 506 | ====================== |
337 | 507 | ||
338 | This describes how the VFS can manipulate an open file. As of kernel | 508 | This describes how the VFS can manipulate an open file. As of kernel |
339 | 2.1.99, the following members are defined: | 509 | 2.6.13, the following members are defined: |
340 | 510 | ||
341 | struct file_operations { | 511 | struct file_operations { |
342 | loff_t (*llseek) (struct file *, loff_t, int); | 512 | loff_t (*llseek) (struct file *, loff_t, int); |
343 | ssize_t (*read) (struct file *, char *, size_t, loff_t *); | 513 | ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); |
344 | ssize_t (*write) (struct file *, const char *, size_t, loff_t *); | 514 | ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); |
515 | ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); | ||
516 | ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t); | ||
345 | int (*readdir) (struct file *, void *, filldir_t); | 517 | int (*readdir) (struct file *, void *, filldir_t); |
346 | unsigned int (*poll) (struct file *, struct poll_table_struct *); | 518 | unsigned int (*poll) (struct file *, struct poll_table_struct *); |
347 | int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); | 519 | int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); |
520 | long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); | ||
521 | long (*compat_ioctl) (struct file *, unsigned int, unsigned long); | ||
348 | int (*mmap) (struct file *, struct vm_area_struct *); | 522 | int (*mmap) (struct file *, struct vm_area_struct *); |
349 | int (*open) (struct inode *, struct file *); | 523 | int (*open) (struct inode *, struct file *); |
524 | int (*flush) (struct file *); | ||
350 | int (*release) (struct inode *, struct file *); | 525 | int (*release) (struct inode *, struct file *); |
351 | int (*fsync) (struct file *, struct dentry *); | 526 | int (*fsync) (struct file *, struct dentry *, int datasync); |
352 | int (*fasync) (struct file *, int); | 527 | int (*aio_fsync) (struct kiocb *, int datasync); |
353 | int (*check_media_change) (kdev_t dev); | 528 | int (*fasync) (int, struct file *, int); |
354 | int (*revalidate) (kdev_t dev); | ||
355 | int (*lock) (struct file *, int, struct file_lock *); | 529 | int (*lock) (struct file *, int, struct file_lock *); |
530 | ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *); | ||
531 | ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *); | ||
532 | ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *); | ||
533 | ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); | ||
534 | unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); | ||
535 | int (*check_flags)(int); | ||
536 | int (*dir_notify)(struct file *filp, unsigned long arg); | ||
537 | int (*flock) (struct file *, int, struct file_lock *); | ||
356 | }; | 538 | }; |
357 | 539 | ||
358 | Again, all methods are called without any locks being held, unless | 540 | Again, all methods are called without any locks being held, unless |
@@ -362,8 +544,12 @@ otherwise noted. | |||
362 | 544 | ||
363 | read: called by read(2) and related system calls | 545 | read: called by read(2) and related system calls |
364 | 546 | ||
547 | aio_read: called by io_submit(2) and other asynchronous I/O operations | ||
548 | |||
365 | write: called by write(2) and related system calls | 549 | write: called by write(2) and related system calls |
366 | 550 | ||
551 | aio_write: called by io_submit(2) and other asynchronous I/O operations | ||
552 | |||
367 | readdir: called when the VFS needs to read the directory contents | 553 | readdir: called when the VFS needs to read the directory contents |
368 | 554 | ||
369 | poll: called by the VFS when a process wants to check if there is | 555 | poll: called by the VFS when a process wants to check if there is |
@@ -372,18 +558,25 @@ otherwise noted. | |||
372 | 558 | ||
373 | ioctl: called by the ioctl(2) system call | 559 | ioctl: called by the ioctl(2) system call |
374 | 560 | ||
561 | unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not | ||
562 | require the BKL should use this method instead of the ioctl() above. | ||
563 | |||
564 | compat_ioctl: called by the ioctl(2) system call when 32 bit system calls | ||
565 | are used on 64 bit kernels. | ||
566 | |||
375 | mmap: called by the mmap(2) system call | 567 | mmap: called by the mmap(2) system call |
376 | 568 | ||
377 | open: called by the VFS when an inode should be opened. When the VFS | 569 | open: called by the VFS when an inode should be opened. When the VFS |
378 | opens a file, it creates a new "struct file" and initialises | 570 | opens a file, it creates a new "struct file". It then calls the |
379 | the "f_op" file operations member with the "default_file_ops" | 571 | open method for the newly allocated file structure. You might |
380 | field in the inode structure. It then calls the open method | 572 | think that the open method really belongs in |
381 | for the newly allocated file structure. You might think that | 573 | "struct inode_operations", and you may be right. I think it's |
382 | the open method really belongs in "struct inode_operations", | 574 | done the way it is because it makes filesystems simpler to |
383 | and you may be right. I think it's done the way it is because | 575 | implement. The open() method is a good place to initialize the |
384 | it makes filesystems simpler to implement. The open() method | 576 | "private_data" member in the file structure if you want to point |
385 | is a good place to initialise the "private_data" member in the | 577 | to a device structure |
386 | file structure if you want to point to a device structure | 578 | |
579 | flush: called by the close(2) system call to flush a file | ||
387 | 580 | ||
388 | release: called when the last reference to an open file is closed | 581 | release: called when the last reference to an open file is closed |
389 | 582 | ||
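As a rough illustration of the open() and release() descriptions above, the following userspace sketch shows an open method attaching per-open state to "private_data" and the matching release method freeing it. It is a model under stated assumptions, not driver code: all demo_* types and functions are hypothetical stand-ins, and demo_vfs_open() is a heavily simplified picture of what the VFS does when it creates a new "struct file".

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel types; NOT the real definitions. */
struct demo_inode {
	int minor;
};

struct demo_file;

struct demo_file_operations {
	int (*open)(struct demo_inode *, struct demo_file *);
	int (*release)(struct demo_inode *, struct demo_file *);
};

struct demo_file {
	const struct demo_file_operations *f_op;
	void *private_data;
};

/* Hypothetical device structure that private_data points to. */
struct demo_device {
	int minor;
};

static int demo_open(struct demo_inode *inode, struct demo_file *filp)
{
	struct demo_device *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return -ENOMEM;
	dev->minor = inode->minor;
	filp->private_data = dev;	/* later methods find the device here */
	return 0;
}

static int demo_release(struct demo_inode *inode, struct demo_file *filp)
{
	(void)inode;
	free(filp->private_data);
	filp->private_data = NULL;
	return 0;
}

static const struct demo_file_operations demo_fops = {
	.open = demo_open,
	.release = demo_release,
};

/* Simplified model of the VFS side: set f_op, then call open(). */
static int demo_vfs_open(struct demo_inode *inode, struct demo_file *filp)
{
	assert(inode != NULL && filp != NULL);
	filp->f_op = &demo_fops;
	filp->private_data = NULL;
	return filp->f_op->open ? filp->f_op->open(inode, filp) : 0;
}
```

Keeping the allocation in open() and the free in release() mirrors the lifetime of the file structure itself, so state is created once per open and torn down when the last reference goes away.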
@@ -392,6 +585,23 @@ otherwise noted. | |||
392 | fasync: called by the fcntl(2) system call when asynchronous | 585 | fasync: called by the fcntl(2) system call when asynchronous |
393 | (non-blocking) mode is enabled for a file | 586 | (non-blocking) mode is enabled for a file |
394 | 587 | ||
588 | lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW | ||
589 | commands | ||
590 | |||
591 | readv: called by the readv(2) system call | ||
592 | |||
593 | writev: called by the writev(2) system call | ||
594 | |||
595 | sendfile: called by the sendfile(2) system call | ||
596 | |||
597 | get_unmapped_area: called by the mmap(2) system call | ||
598 | |||
599 | check_flags: called by the fcntl(2) system call for F_SETFL command | ||
600 | |||
601 | dir_notify: called by the fcntl(2) system call for F_NOTIFY command | ||
602 | |||
603 | flock: called by the flock(2) system call | ||
604 | |||
395 | Note that the file operations are implemented by the specific | 605 | Note that the file operations are implemented by the specific |
396 | filesystem in which the inode resides. When opening a device node | 606 | filesystem in which the inode resides. When opening a device node |
397 | (character or block special) most filesystems will call special | 607 | (character or block special) most filesystems will call special |
@@ -400,29 +610,28 @@ driver information. These support routines replace the filesystem file | |||
400 | operations with those for the device driver, and then proceed to call | 610 | operations with those for the device driver, and then proceed to call |
401 | the new open() method for the file. This is how opening a device file | 611 | the new open() method for the file. This is how opening a device file |
402 | in the filesystem eventually ends up calling the device driver open() | 612 | in the filesystem eventually ends up calling the device driver open() |
403 | method. Note the devfs (the Device FileSystem) has a more direct path | 613 | method. |
404 | from device node to device driver (this is an unofficial kernel | ||
405 | patch). | ||
406 | 614 | ||
407 | 615 | ||
408 | Directory Entry Cache (dcache) <section> | 616 | Directory Entry Cache (dcache) |
409 | ------------------------------ | 617 | ============================== |
618 | |||
410 | 619 | ||
411 | struct dentry_operations | 620 | struct dentry_operations |
412 | ======================== | 621 | ------------------------ |
413 | 622 | ||
414 | This describes how a filesystem can overload the standard dentry | 623 | This describes how a filesystem can overload the standard dentry |
415 | operations. Dentries and the dcache are the domain of the VFS and the | 624 | operations. Dentries and the dcache are the domain of the VFS and the |
416 | individual filesystem implementations. Device drivers have no business | 625 | individual filesystem implementations. Device drivers have no business |
417 | here. These methods may be set to NULL, as they are either optional or | 626 | here. These methods may be set to NULL, as they are either optional or |
418 | the VFS uses a default. As of kernel 2.1.99, the following members are | 627 | the VFS uses a default. As of kernel 2.6.13, the following members are |
419 | defined: | 628 | defined: |
420 | 629 | ||
421 | struct dentry_operations { | 630 | struct dentry_operations { |
422 | int (*d_revalidate)(struct dentry *); | 631 | int (*d_revalidate)(struct dentry *, struct nameidata *); |
423 | int (*d_hash) (struct dentry *, struct qstr *); | 632 | int (*d_hash) (struct dentry *, struct qstr *); |
424 | int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); | 633 | int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); |
425 | void (*d_delete)(struct dentry *); | 634 | int (*d_delete)(struct dentry *); |
426 | void (*d_release)(struct dentry *); | 635 | void (*d_release)(struct dentry *); |
427 | void (*d_iput)(struct dentry *, struct inode *); | 636 | void (*d_iput)(struct dentry *, struct inode *); |
428 | }; | 637 | }; |
@@ -451,6 +660,7 @@ Each dentry has a pointer to its parent dentry, as well as a hash list | |||
451 | of child dentries. Child dentries are basically like files in a | 660 | of child dentries. Child dentries are basically like files in a |
452 | directory. | 661 | directory. |
453 | 662 | ||
663 | |||
454 | Directory Entry Cache APIs | 664 | Directory Entry Cache APIs |
455 | -------------------------- | 665 | -------------------------- |
456 | 666 | ||
@@ -471,7 +681,7 @@ manipulate dentries: | |||
471 | "d_delete" method is called | 681 | "d_delete" method is called |
472 | 682 | ||
473 | d_drop: this unhashes a dentry from its parents hash list. A | 683 | d_drop: this unhashes a dentry from its parents hash list. A |
474 | subsequent call to dput() will dellocate the dentry if its | 684 | subsequent call to dput() will deallocate the dentry if its |
475 | usage count drops to 0 | 685 | usage count drops to 0 |
476 | 686 | ||
477 | d_delete: delete a dentry. If there are no other open references to | 687 | d_delete: delete a dentry. If there are no other open references to |
@@ -507,16 +717,16 @@ up by walking the tree starting with the first component | |||
507 | of the pathname and using that dentry along with the next | 717 | of the pathname and using that dentry along with the next |
508 | component to look up the next level and so on. Since it | 718 | component to look up the next level and so on. Since it |
509 | is a frequent operation for workloads like multiuser | 719 | is a frequent operation for workloads like multiuser |
510 | environments and webservers, it is important to optimize | 720 | environments and web servers, it is important to optimize |
511 | this path. | 721 | this path. |
512 | 722 | ||
513 | Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus | 723 | Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus |
514 | in every component during path look-up. Since 2.5.10 onwards, | 724 | in every component during path look-up. Since 2.5.10 onwards, |
515 | fastwalk algorithm changed this by holding the dcache_lock | 725 | fast-walk algorithm changed this by holding the dcache_lock |
516 | at the beginning and walking as many cached path component | 726 | at the beginning and walking as many cached path component |
517 | dentries as possible. This signficantly decreases the number | 727 | dentries as possible. This significantly decreases the number |
518 | of acquisition of dcache_lock. However it also increases the | 728 | of acquisition of dcache_lock. However it also increases the |
519 | lock hold time signficantly and affects performance in large | 729 | lock hold time significantly and affects performance in large |
520 | SMP machines. Since 2.5.62 kernel, dcache has been using | 730 | SMP machines. Since 2.5.62 kernel, dcache has been using |
521 | a new locking model that uses RCU to make dcache look-up | 731 | a new locking model that uses RCU to make dcache look-up |
522 | lock-free. | 732 | lock-free. |
@@ -527,7 +737,7 @@ protected the hash chain, d_child, d_alias, d_lru lists as well | |||
527 | as d_inode and several other things like mount look-up. RCU-based | 737 | as d_inode and several other things like mount look-up. RCU-based |
528 | changes affect only the way the hash chain is protected. For everything | 738 | changes affect only the way the hash chain is protected. For everything |
529 | else the dcache_lock must be taken for both traversing as well as | 739 | else the dcache_lock must be taken for both traversing as well as |
530 | updating. The hash chain updations too take the dcache_lock. | 740 | updating. The hash chain updates too take the dcache_lock. |
531 | The significant change is the way d_lookup traverses the hash chain, | 741 | The significant change is the way d_lookup traverses the hash chain, |
532 | it doesn't acquire the dcache_lock for this and rely on RCU to | 742 | it doesn't acquire the dcache_lock for this and rely on RCU to |
533 | ensure that the dentry has not been *freed*. | 743 | ensure that the dentry has not been *freed*. |
@@ -535,14 +745,15 @@ ensure that the dentry has not been *freed*. | |||
535 | 745 | ||
536 | Dcache locking details | 746 | Dcache locking details |
537 | ---------------------- | 747 | ---------------------- |
748 | |||
538 | For many multi-user workloads, open() and stat() on files are | 749 | For many multi-user workloads, open() and stat() on files are |
539 | very frequently occurring operations. Both involve walking | 750 | very frequently occurring operations. Both involve walking |
540 | of path names to find the dentry corresponding to the | 751 | of path names to find the dentry corresponding to the |
541 | concerned file. In 2.4 kernel, dcache_lock was held | 752 | concerned file. In 2.4 kernel, dcache_lock was held |
542 | during look-up of each path component. Contention and | 753 | during look-up of each path component. Contention and |
543 | cacheline bouncing of this global lock caused significant | 754 | cache-line bouncing of this global lock caused significant |
544 | scalability problems. With the introduction of RCU | 755 | scalability problems. With the introduction of RCU |
545 | in linux kernel, this was worked around by making | 756 | in Linux kernel, this was worked around by making |
546 | the look-up of path components during path walking lock-free. | 757 | the look-up of path components during path walking lock-free. |
547 | 758 | ||
548 | 759 | ||
@@ -562,7 +773,7 @@ Some of the important changes are : | |||
562 | 2. Insertion of a dentry into the hash table is done using | 773 | 2. Insertion of a dentry into the hash table is done using |
563 | hlist_add_head_rcu() which take care of ordering the writes - | 774 | hlist_add_head_rcu() which take care of ordering the writes - |
564 | the writes to the dentry must be visible before the dentry | 775 | the writes to the dentry must be visible before the dentry |
565 | is inserted. This works in conjuction with hlist_for_each_rcu() | 776 | is inserted. This works in conjunction with hlist_for_each_rcu() |
566 | while walking the hash chain. The only requirement is that | 777 | while walking the hash chain. The only requirement is that |
567 | all initialization to the dentry must be done before hlist_add_head_rcu() | 778 | all initialization to the dentry must be done before hlist_add_head_rcu() |
568 | since we don't have dcache_lock protection while traversing | 779 | since we don't have dcache_lock protection while traversing |
@@ -584,7 +795,7 @@ Some of the important changes are : | |||
584 | the same. In some sense, dcache_rcu path walking looks like | 795 | the same. In some sense, dcache_rcu path walking looks like |
585 | the pre-2.5.10 version. | 796 | the pre-2.5.10 version. |
586 | 797 | ||
587 | 5. All dentry hash chain updations must take the dcache_lock as well as | 798 | 5. All dentry hash chain updates must take the dcache_lock as well as |
588 | the per-dentry lock in that order. dput() does this to ensure | 799 | the per-dentry lock in that order. dput() does this to ensure |
589 | that a dentry that has just been looked up in another CPU | 800 | that a dentry that has just been looked up in another CPU |
590 | doesn't get deleted before dget() can be done on it. | 801 | doesn't get deleted before dget() can be done on it. |
@@ -640,10 +851,10 @@ handled as described below : | |||
640 | Since we redo the d_parent check and compare name while holding | 851 | Since we redo the d_parent check and compare name while holding |
641 | d_lock, lock-free look-up will not race against d_move(). | 852 | d_lock, lock-free look-up will not race against d_move(). |
642 | 853 | ||
643 | 4. There can be a theoritical race when a dentry keeps coming back | 854 | 4. There can be a theoretical race when a dentry keeps coming back |
644 | to original bucket due to double moves. Due to this look-up may | 855 | to original bucket due to double moves. Due to this look-up may |
645 | consider that it has never moved and can end up in an infinite loop. | 856 | consider that it has never moved and can end up in an infinite loop. |
646 | But this is not any worse that theoritical livelocks we already | 857 | But this is not any worse that theoretical livelocks we already |
647 | have in the kernel. | 858 | have in the kernel. |
648 | 859 | ||
649 | 860 | ||
diff --git a/Documentation/ioctl/cdrom.txt b/Documentation/ioctl/cdrom.txt index 4ccdcc6fe364..8ec32cc49eb1 100644 --- a/Documentation/ioctl/cdrom.txt +++ b/Documentation/ioctl/cdrom.txt | |||
@@ -878,7 +878,7 @@ DVD_READ_STRUCT Read structure | |||
878 | 878 | ||
879 | error returns: | 879 | error returns: |
880 | EINVAL physical.layer_num exceeds number of layers | 880 | EINVAL physical.layer_num exceeds number of layers |
881 | EIO Recieved invalid response from drive | 881 | EIO Received invalid response from drive |
882 | 882 | ||
883 | 883 | ||
884 | 884 | ||
diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt index 9a1586590d82..d802ce88bedc 100644 --- a/Documentation/kbuild/makefiles.txt +++ b/Documentation/kbuild/makefiles.txt | |||
@@ -31,7 +31,7 @@ This document describes the Linux kernel Makefiles. | |||
31 | 31 | ||
32 | === 6 Architecture Makefiles | 32 | === 6 Architecture Makefiles |
33 | --- 6.1 Set variables to tweak the build to the architecture | 33 | --- 6.1 Set variables to tweak the build to the architecture |
34 | --- 6.2 Add prerequisites to prepare: | 34 | --- 6.2 Add prerequisites to archprepare: |
35 | --- 6.3 List directories to visit when descending | 35 | --- 6.3 List directories to visit when descending |
36 | --- 6.4 Architecture specific boot images | 36 | --- 6.4 Architecture specific boot images |
37 | --- 6.5 Building non-kbuild targets | 37 | --- 6.5 Building non-kbuild targets |
@@ -734,18 +734,18 @@ When kbuild executes the following steps are followed (roughly): | |||
734 | for loadable kernel modules. | 734 | for loadable kernel modules. |
735 | 735 | ||
736 | 736 | ||
737 | --- 6.2 Add prerequisites to prepare: | 737 | --- 6.2 Add prerequisites to archprepare: |
738 | 738 | ||
739 | The prepare: rule is used to list prerequisites that needs to be | 739 | The archprepare: rule is used to list prerequisites that needs to be |
740 | built before starting to descend down in the subdirectories. | 740 | built before starting to descend down in the subdirectories. |
741 | These are usually header files containing assembler constants. | 741 | These are usually header files containing assembler constants. |
742 | 742 | ||
743 | Example: | 743 | Example: |
744 | #arch/s390/Makefile | 744 | #arch/arm/Makefile |
745 | prepare: include/asm-$(ARCH)/offsets.h | 745 | archprepare: maketools |
746 | 746 | ||
747 | In this example the file include/asm-$(ARCH)/offsets.h will | 747 | In this example the file target maketools will be processed |
748 | be built before descending down in the subdirectories. | 748 | before descending down in the subdirectories. |
749 | See also chapter XXX-TODO that describes how kbuild supports | 749 | See also chapter XXX-TODO that describes how kbuild supports |
750 | generating offset header files. | 750 | generating offset header files. |
751 | 751 | ||
diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt index 7ff213f4becd..1f5f7d28c9e6 100644 --- a/Documentation/kdump/kdump.txt +++ b/Documentation/kdump/kdump.txt | |||
@@ -39,8 +39,7 @@ SETUP | |||
39 | and apply http://lse.sourceforge.net/kdump/patches/kexec-tools-1.101-kdump.patch | 39 | and apply http://lse.sourceforge.net/kdump/patches/kexec-tools-1.101-kdump.patch |
40 | and after that build the source. | 40 | and after that build the source. |
41 | 41 | ||
42 | 2) Download and build the appropriate (latest) kexec/kdump (-mm) kernel | 42 | 2) Download and build the appropriate (2.6.13-rc1 onwards) vanilla kernel. |
43 | patchset and apply it to the vanilla kernel tree. | ||
44 | 43 | ||
45 | Two kernels need to be built in order to get this feature working. | 44 | Two kernels need to be built in order to get this feature working. |
46 | 45 | ||
@@ -84,15 +83,16 @@ SETUP | |||
84 | 83 | ||
85 | 4) Load the second kernel to be booted using: | 84 | 4) Load the second kernel to be booted using: |
86 | 85 | ||
87 | kexec -p <second-kernel> --crash-dump --args-linux --append="root=<root-dev> | 86 | kexec -p <second-kernel> --args-linux --elf32-core-headers |
88 | init 1 irqpoll" | 87 | --append="root=<root-dev> init 1 irqpoll" |
89 | 88 | ||
90 | Note: i) <second-kernel> has to be a vmlinux image. bzImage will not work, | 89 | Note: i) <second-kernel> has to be a vmlinux image. bzImage will not work, |
91 | as of now. | 90 | as of now. |
92 | ii) By default ELF headers are stored in ELF32 format (for i386). This | 91 | ii) By default ELF headers are stored in ELF64 format. Option |
93 | is sufficient to represent the physical memory up to 4GB. To store | 92 | --elf32-core-headers forces generation of ELF32 headers. gdb can |
94 | headers in ELF64 format, specifiy "--elf64-core-headers" on the | 93 | not open ELF64 headers on 32 bit systems. So creating ELF32 |
95 | kexec command line additionally. | 94 | headers can come handy for users who have got non-PAE systems and |
95 | hence have memory less than 4GB. | ||
96 | iii) Specify "irqpoll" as command line parameter. This reduces driver | 96 | iii) Specify "irqpoll" as command line parameter. This reduces driver |
97 | initialization failures in second kernel due to shared interrupts. | 97 | initialization failures in second kernel due to shared interrupts. |
98 | 98 | ||
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index d2f0c67ba1fb..7086f0a90d14 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt | |||
@@ -164,6 +164,15 @@ running once the system is up. | |||
164 | over-ride platform specific driver. | 164 | over-ride platform specific driver. |
165 | See also Documentation/acpi-hotkey.txt. | 165 | See also Documentation/acpi-hotkey.txt. |
166 | 166 | ||
167 | enable_timer_pin_1 [i386,x86-64] | ||
168 | Enable PIN 1 of APIC timer | ||
169 | Can be useful to work around chipset bugs (in particular on some ATI chipsets) | ||
170 | The kernel tries to set a reasonable default. | ||
171 | |||
172 | disable_timer_pin_1 [i386,x86-64] | ||
173 | Disable PIN 1 of APIC timer | ||
174 | Can be useful to work around chipset bugs. | ||
175 | |||
167 | ad1816= [HW,OSS] | 176 | ad1816= [HW,OSS] |
168 | Format: <io>,<irq>,<dma>,<dma2> | 177 | Format: <io>,<irq>,<dma>,<dma2> |
169 | See also Documentation/sound/oss/AD1816. | 178 | See also Documentation/sound/oss/AD1816. |
@@ -549,6 +558,7 @@ running once the system is up. | |||
549 | keyboard and can not control its state | 558 | keyboard and can not control its state |
550 | (Don't attempt to blink the leds) | 559 | (Don't attempt to blink the leds) |
551 | i8042.noaux [HW] Don't check for auxiliary (== mouse) port | 560 | i8042.noaux [HW] Don't check for auxiliary (== mouse) port |
561 | i8042.nokbd [HW] Don't check/create keyboard port | ||
552 | i8042.nomux [HW] Don't check presence of an active multiplexing | 562 | i8042.nomux [HW] Don't check presence of an active multiplexing |
553 | controller | 563 | controller |
554 | i8042.nopnp [HW] Don't use ACPIPnP / PnPBIOS to discover KBD/AUX | 564 | i8042.nopnp [HW] Don't use ACPIPnP / PnPBIOS to discover KBD/AUX |
diff --git a/Documentation/mono.txt b/Documentation/mono.txt index 6739ab9615ef..807a0c7b4737 100644 --- a/Documentation/mono.txt +++ b/Documentation/mono.txt | |||
@@ -30,7 +30,7 @@ other program after you have done the following: | |||
30 | Read the file 'binfmt_misc.txt' in this directory to know | 30 | Read the file 'binfmt_misc.txt' in this directory to know |
31 | more about the configuration process. | 31 | more about the configuration process. |
32 | 32 | ||
33 | 3) Add the following enries to /etc/rc.local or similar script | 33 | 3) Add the following entries to /etc/rc.local or similar script |
34 | to be run at system startup: | 34 | to be run at system startup: |
35 | 35 | ||
36 | # Insert BINFMT_MISC module into the kernel | 36 | # Insert BINFMT_MISC module into the kernel |
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 24d029455baa..a55f0f95b171 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt | |||
@@ -1241,7 +1241,7 @@ traffic while still maintaining carrier on. | |||
1241 | 1241 | ||
1242 | If running SNMP agents, the bonding driver should be loaded | 1242 | If running SNMP agents, the bonding driver should be loaded |
1243 | before any network drivers participating in a bond. This requirement | 1243 | before any network drivers participating in a bond. This requirement |
1244 | is due to the the interface index (ipAdEntIfIndex) being associated to | 1244 | is due to the interface index (ipAdEntIfIndex) being associated to |
1245 | the first interface found with a given IP address. That is, there is | 1245 | the first interface found with a given IP address. That is, there is |
1246 | only one ipAdEntIfIndex for each IP address. For example, if eth0 and | 1246 | only one ipAdEntIfIndex for each IP address. For example, if eth0 and |
1247 | eth1 are slaves of bond0 and the driver for eth0 is loaded before the | 1247 | eth1 are slaves of bond0 and the driver for eth0 is loaded before the |
@@ -1937,7 +1937,7 @@ switches currently available support 802.3ad. | |||
1937 | If not explicitly configured (with ifconfig or ip link), the | 1937 | If not explicitly configured (with ifconfig or ip link), the |
1938 | MAC address of the bonding device is taken from its first slave | 1938 | MAC address of the bonding device is taken from its first slave |
1939 | device. This MAC address is then passed to all following slaves and | 1939 | device. This MAC address is then passed to all following slaves and |
1940 | remains persistent (even if the the first slave is removed) until the | 1940 | remains persistent (even if the first slave is removed) until the |
1941 | bonding device is brought down or reconfigured. | 1941 | bonding device is brought down or reconfigured. |
1942 | 1942 | ||
1943 | If you wish to change the MAC address, you can set it with | 1943 | If you wish to change the MAC address, you can set it with |
diff --git a/Documentation/networking/wan-router.txt b/Documentation/networking/wan-router.txt index aea20cd2a56e..c96897aa08b6 100644 --- a/Documentation/networking/wan-router.txt +++ b/Documentation/networking/wan-router.txt | |||
@@ -355,7 +355,7 @@ REVISION HISTORY | |||
355 | There is no functional difference between the two packages | 355 | There is no functional difference between the two packages |
356 | 356 | ||
357 | 2.0.7 Aug 26, 1999 o Merged X25API code into WANPIPE. | 357 | 2.0.7 Aug 26, 1999 o Merged X25API code into WANPIPE. |
358 | o Fixed a memeory leak for X25API | 358 | o Fixed a memory leak for X25API |
359 | o Updated the X25API code for 2.2.X kernels. | 359 | o Updated the X25API code for 2.2.X kernels. |
360 | o Improved NEM handling. | 360 | o Improved NEM handling. |
361 | 361 | ||
@@ -514,7 +514,7 @@ beta2-2.2.0 Jan 8 2001 | |||
514 | o Patches for 2.4.0 kernel | 514 | o Patches for 2.4.0 kernel |
515 | o Patches for 2.2.18 kernel | 515 | o Patches for 2.2.18 kernel |
516 | o Minor updates to PPP and CHLDC drivers. | 516 | o Minor updates to PPP and CHLDC drivers. |
517 | Note: No functinal difference. | 517 | Note: No functional difference. |
518 | 518 | ||
519 | beta3-2.2.9 Jan 10 2001 | 519 | beta3-2.2.9 Jan 10 2001 |
520 | o I missed the 2.2.18 kernel patches in beta2-2.2.0 | 520 | o I missed the 2.2.18 kernel patches in beta2-2.2.0 |
diff --git a/Documentation/pci.txt b/Documentation/pci.txt index 76d28d033657..711210b38f5f 100644 --- a/Documentation/pci.txt +++ b/Documentation/pci.txt | |||
@@ -84,7 +84,7 @@ Each entry consists of: | |||
84 | 84 | ||
85 | Most drivers don't need to use the driver_data field. Best practice | 85 | Most drivers don't need to use the driver_data field. Best practice |
86 | for use of driver_data is to use it as an index into a static list of | 86 | for use of driver_data is to use it as an index into a static list of |
87 | equivalant device types, not to use it as a pointer. | 87 | equivalent device types, not to use it as a pointer. |
88 | 88 | ||
89 | Have a table entry {PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID} | 89 | Have a table entry {PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID} |
90 | to have probe() called for every PCI device known to the system. | 90 | to have probe() called for every PCI device known to the system. |
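The driver_data advice in the pci.txt hunk above can be illustrated with a small userspace sketch. It is not kernel code: struct demo_id is a simplified, hypothetical stand-in for an ID table entry, and every demo_* name, vendor ID and device ID is made up for illustration. It only shows the idea of storing an index into a static list of device types rather than a pointer.

```c
#include <stddef.h>

/* Hypothetical device types; the enum values double as table indexes. */
enum demo_board { DEMO_BOARD_A, DEMO_BOARD_B, DEMO_BOARD_COUNT };

/* Per-type parameters, one slot per enum value. */
struct demo_board_info {
	const char *name;
	int num_ports;
};

static const struct demo_board_info demo_boards[DEMO_BOARD_COUNT] = {
	[DEMO_BOARD_A] = { "demo board A", 2 },
	[DEMO_BOARD_B] = { "demo board B", 4 },
};

/* Simplified stand-in for an ID table entry (assumption, not the kernel struct). */
struct demo_id {
	unsigned int vendor;
	unsigned int device;
	unsigned long driver_data;	/* index into demo_boards[], not a pointer */
};

static const struct demo_id demo_ids[] = {
	{ 0x1234, 0x0001, DEMO_BOARD_A },
	{ 0x1234, 0x0002, DEMO_BOARD_B },
	{ 0, 0, 0 }			/* all-zero terminator */
};

/* Return the board description for a matched entry, or NULL if the index is out of range. */
static const struct demo_board_info *demo_lookup(const struct demo_id *id)
{
	if (id->driver_data >= DEMO_BOARD_COUNT)
		return NULL;
	return &demo_boards[id->driver_data];
}
```

In a real driver the probe routine would receive the matched table entry and could do the equivalent of demo_lookup() on it. Because the field holds a small integer, it can be range-checked before use, which a raw pointer cast would not allow.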
diff --git a/Documentation/powerpc/eeh-pci-error-recovery.txt b/Documentation/powerpc/eeh-pci-error-recovery.txt index 2bfe71beec5b..e75d7474322c 100644 --- a/Documentation/powerpc/eeh-pci-error-recovery.txt +++ b/Documentation/powerpc/eeh-pci-error-recovery.txt | |||
@@ -134,7 +134,7 @@ pci_get_device_by_addr() will find the pci device associated | |||
134 | with that address (if any). | 134 | with that address (if any). |
135 | 135 | ||
136 | The default include/asm-ppc64/io.h macros readb(), inb(), insb(), | 136 | The default include/asm-ppc64/io.h macros readb(), inb(), insb(), |
137 | etc. include a check to see if the the i/o read returned all-0xff's. | 137 | etc. include a check to see if the i/o read returned all-0xff's. |
138 | If so, these make a call to eeh_dn_check_failure(), which in turn | 138 | If so, these make a call to eeh_dn_check_failure(), which in turn |
139 | asks the firmware if the all-ff's value is the sign of a true EEH | 139 | asks the firmware if the all-ff's value is the sign of a true EEH |
140 | error. If it is not, processing continues as normal. The grand | 140 | error. If it is not, processing continues as normal. The grand |
diff --git a/Documentation/s390/s390dbf.txt b/Documentation/s390/s390dbf.txt index e24fdeada970..e321a8ed2a2d 100644 --- a/Documentation/s390/s390dbf.txt +++ b/Documentation/s390/s390dbf.txt | |||
@@ -468,7 +468,7 @@ The hex_ascii view shows the data field in hex and ascii representation | |||
468 | The raw view returns a bytestream as the debug areas are stored in memory. | 468 | The raw view returns a bytestream as the debug areas are stored in memory. |
469 | 469 | ||
470 | The sprintf view formats the debug entries in the same way as the sprintf | 470 | The sprintf view formats the debug entries in the same way as the sprintf |
471 | function would do. The sprintf event/expection fuctions write to the | 471 | function would do. The sprintf event/expection functions write to the |
472 | debug entry a pointer to the format string (size = sizeof(long)) | 472 | debug entry a pointer to the format string (size = sizeof(long)) |
473 | and for each vararg a long value. So e.g. for a debug entry with a format | 473 | and for each vararg a long value. So e.g. for a debug entry with a format |
474 | string plus two varargs one would need to allocate a (3 * sizeof(long)) | 474 | string plus two varargs one would need to allocate a (3 * sizeof(long)) |
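The size arithmetic quoted in the s390dbf.txt hunk above (one long-sized slot for the format string pointer plus one long per vararg) can be written as a small helper. This is only a sketch; sprintf_entry_size() is a hypothetical name and not part of the s390 debug feature API.

```c
#include <stddef.h>

/*
 * Bytes needed in a debug entry for the sprintf view, following the
 * rule quoted above: one long-sized slot for the format string pointer
 * plus one long for each vararg.
 */
static size_t sprintf_entry_size(unsigned int nr_varargs)
{
	return (1 + (size_t)nr_varargs) * sizeof(long);
}
```

For a format string plus two varargs this yields 3 * sizeof(long), matching the example in the text.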
diff --git a/Documentation/scsi/ibmmca.txt b/Documentation/scsi/ibmmca.txt index 2814491600ff..2ffb3ae0ef4d 100644 --- a/Documentation/scsi/ibmmca.txt +++ b/Documentation/scsi/ibmmca.txt | |||
@@ -344,7 +344,7 @@ | |||
344 | /proc/scsi/ibmmca/<host_no>. ibmmca_proc_info() provides this information. | 344 | /proc/scsi/ibmmca/<host_no>. ibmmca_proc_info() provides this information. |
345 | 345 | ||
346 | This table is quite informative for interested users. It shows the load | 346 | This table is quite informative for interested users. It shows the load |
347 | of commands on the subsystem and wether you are running the bypassed | 347 | of commands on the subsystem and whether you are running the bypassed |
348 | (software) or integrated (hardware) SCSI-command set (see below). The | 348 | (software) or integrated (hardware) SCSI-command set (see below). The |
349 | amount of accesses is shown. Read, write, modeselect is shown separately | 349 | amount of accesses is shown. Read, write, modeselect is shown separately |
350 | in order to help debugging problems with CD-ROMs or tapedrives. | 350 | in order to help debugging problems with CD-ROMs or tapedrives. |
diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt index 5c49ba07e709..ebfcdf28485f 100644 --- a/Documentation/sound/alsa/ALSA-Configuration.txt +++ b/Documentation/sound/alsa/ALSA-Configuration.txt | |||
@@ -1459,7 +1459,7 @@ devices where %i is sound card number from zero to seven. | |||
1459 | To auto-load an ALSA driver for OSS services, define the string | 1459 | To auto-load an ALSA driver for OSS services, define the string |
1460 | 'sound-slot-%i' where %i means the slot number for OSS, which | 1460 | 'sound-slot-%i' where %i means the slot number for OSS, which |
1461 | corresponds to the card index of ALSA. Usually, define this | 1461 | corresponds to the card index of ALSA. Usually, define this |
1462 | as the the same card module. | 1462 | as the same card module. |
1463 | 1463 | ||
1464 | An example configuration for a single emu10k1 card is like below: | 1464 | An example configuration for a single emu10k1 card is like below: |
1465 | ----- /etc/modprobe.conf | 1465 | ----- /etc/modprobe.conf |
diff --git a/Documentation/sparse.txt b/Documentation/sparse.txt index f97841478459..5df44dc894e5 100644 --- a/Documentation/sparse.txt +++ b/Documentation/sparse.txt | |||
@@ -57,7 +57,7 @@ With BK, you can just get it from | |||
57 | 57 | ||
58 | and DaveJ has tar-balls at | 58 | and DaveJ has tar-balls at |
59 | 59 | ||
60 | http://www.codemonkey.org.uk/projects/bitkeeper/sparse/ | 60 | http://www.codemonkey.org.uk/projects/git-snapshots/sparse/ |
61 | 61 | ||
62 | 62 | ||
63 | Once you have it, just do | 63 | Once you have it, just do |
diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt index 136d817c01ba..baf17b381588 100644 --- a/Documentation/sysrq.txt +++ b/Documentation/sysrq.txt | |||
@@ -171,7 +171,7 @@ the header 'include/linux/sysrq.h', this will define everything else you need. | |||
171 | Next, you must create a sysrq_key_op struct, and populate it with A) the key | 171 | Next, you must create a sysrq_key_op struct, and populate it with A) the key |
172 | handler function you will use, B) a help_msg string, that will print when SysRQ | 172 | handler function you will use, B) a help_msg string, that will print when SysRQ |
173 | prints help, and C) an action_msg string, that will print right before your | 173 | prints help, and C) an action_msg string, that will print right before your |
174 | handler is called. Your handler must conform to the protoype in 'sysrq.h'. | 174 | handler is called. Your handler must conform to the prototype in 'sysrq.h'. |
175 | 175 | ||
176 | After the sysrq_key_op is created, you can call the macro | 176 | After the sysrq_key_op is created, you can call the macro |
177 | register_sysrq_key(int key, struct sysrq_key_op *op_p) that is defined in | 177 | register_sysrq_key(int key, struct sysrq_key_op *op_p) that is defined in |
diff --git a/Documentation/uml/UserModeLinux-HOWTO.txt b/Documentation/uml/UserModeLinux-HOWTO.txt index 0c7b654fec99..544430e39980 100644 --- a/Documentation/uml/UserModeLinux-HOWTO.txt +++ b/Documentation/uml/UserModeLinux-HOWTO.txt | |||
@@ -2176,7 +2176,7 @@ | |||
2176 | If you want to access files on the host machine from inside UML, you | 2176 | If you want to access files on the host machine from inside UML, you |
2177 | can treat it as a separate machine and either nfs mount directories | 2177 | can treat it as a separate machine and either nfs mount directories |
2178 | from the host or copy files into the virtual machine with scp or rcp. | 2178 | from the host or copy files into the virtual machine with scp or rcp. |
2179 | However, since UML is running on the the host, it can access those | 2179 | However, since UML is running on the host, it can access those |
2180 | files just like any other process and make them available inside the | 2180 | files just like any other process and make them available inside the |
2181 | virtual machine without needing to use the network. | 2181 | virtual machine without needing to use the network. |
2182 | 2182 | ||
diff --git a/Documentation/usb/gadget_serial.txt b/Documentation/usb/gadget_serial.txt index a938c3dd13d6..815f5c2301ff 100644 --- a/Documentation/usb/gadget_serial.txt +++ b/Documentation/usb/gadget_serial.txt | |||
@@ -20,7 +20,7 @@ License along with this program; if not, write to the Free | |||
20 | Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, | 20 | Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, |
21 | MA 02111-1307 USA. | 21 | MA 02111-1307 USA. |
22 | 22 | ||
23 | This document and the the gadget serial driver itself are | 23 | This document and the gadget serial driver itself are |
24 | Copyright (C) 2004 by Al Borchers (alborchers@steinerpoint.com). | 24 | Copyright (C) 2004 by Al Borchers (alborchers@steinerpoint.com). |
25 | 25 | ||
26 | If you have questions, problems, or suggestions for this driver | 26 | If you have questions, problems, or suggestions for this driver |
diff --git a/Documentation/video4linux/CARDLIST.bttv b/Documentation/video4linux/CARDLIST.bttv index 62a12a08e2ac..ec785f9f15a3 100644 --- a/Documentation/video4linux/CARDLIST.bttv +++ b/Documentation/video4linux/CARDLIST.bttv | |||
@@ -126,10 +126,12 @@ card=124 - AverMedia AverTV DVB-T 761 | |||
126 | card=125 - MATRIX Vision Sigma-SQ | 126 | card=125 - MATRIX Vision Sigma-SQ |
127 | card=126 - MATRIX Vision Sigma-SLC | 127 | card=126 - MATRIX Vision Sigma-SLC |
128 | card=127 - APAC Viewcomp 878(AMAX) | 128 | card=127 - APAC Viewcomp 878(AMAX) |
129 | card=128 - DVICO FusionHDTV DVB-T Lite | 129 | card=128 - DViCO FusionHDTV DVB-T Lite |
130 | card=129 - V-Gear MyVCD | 130 | card=129 - V-Gear MyVCD |
131 | card=130 - Super TV Tuner | 131 | card=130 - Super TV Tuner |
132 | card=131 - Tibet Systems 'Progress DVR' CS16 | 132 | card=131 - Tibet Systems 'Progress DVR' CS16 |
133 | card=132 - Kodicom 4400R (master) | 133 | card=132 - Kodicom 4400R (master) |
134 | card=133 - Kodicom 4400R (slave) | 134 | card=133 - Kodicom 4400R (slave) |
135 | card=134 - Adlink RTV24 | 135 | card=134 - Adlink RTV24 |
136 | card=135 - DViCO FusionHDTV 5 Lite | ||
137 | card=136 - Acorp Y878F | ||
diff --git a/Documentation/video4linux/CARDLIST.saa7134 b/Documentation/video4linux/CARDLIST.saa7134 index 1b5a3a9ffbe2..dc57225f39be 100644 --- a/Documentation/video4linux/CARDLIST.saa7134 +++ b/Documentation/video4linux/CARDLIST.saa7134 | |||
@@ -62,3 +62,6 @@ | |||
62 | 61 -> Philips TOUGH DVB-T reference design [1131:2004] | 62 | 61 -> Philips TOUGH DVB-T reference design [1131:2004] |
63 | 62 -> Compro VideoMate TV Gold+II | 63 | 62 -> Compro VideoMate TV Gold+II |
64 | 63 -> Kworld Xpert TV PVR7134 | 64 | 63 -> Kworld Xpert TV PVR7134 |
65 | 64 -> FlyTV mini Asus Digimatrix [1043:0210,1043:0210] | ||
66 | 65 -> V-Stream Studio TV Terminator | ||
67 | 66 -> Yuan TUN-900 (saa7135) | ||
diff --git a/Documentation/video4linux/CARDLIST.tuner b/Documentation/video4linux/CARDLIST.tuner index f3302e1b1b9c..f5876be658a6 100644 --- a/Documentation/video4linux/CARDLIST.tuner +++ b/Documentation/video4linux/CARDLIST.tuner | |||
@@ -64,3 +64,4 @@ tuner=62 - Philips TEA5767HN FM Radio | |||
64 | tuner=63 - Philips FMD1216ME MK3 Hybrid Tuner | 64 | tuner=63 - Philips FMD1216ME MK3 Hybrid Tuner |
65 | tuner=64 - LG TDVS-H062F/TUA6034 | 65 | tuner=64 - LG TDVS-H062F/TUA6034 |
66 | tuner=65 - Ymec TVF66T5-B/DFF | 66 | tuner=65 - Ymec TVF66T5-B/DFF |
67 | tuner=66 - LG NTSC (TALN mini series) | ||
diff --git a/Documentation/video4linux/Zoran b/Documentation/video4linux/Zoran index 01425c21986b..52c94bd7dca1 100644 --- a/Documentation/video4linux/Zoran +++ b/Documentation/video4linux/Zoran | |||
@@ -222,7 +222,7 @@ was introduced in 1991, is used in the DC10 old | |||
222 | can generate: PAL , NTSC , SECAM | 222 | can generate: PAL , NTSC , SECAM |
223 | 223 | ||
224 | The adv717x, should be able to produce PAL N. But you find nothing PAL N | 224 | The adv717x, should be able to produce PAL N. But you find nothing PAL N |
225 | specific in the the registers. Seem that you have to reuse a other standard | 225 | specific in the registers. Seem that you have to reuse a other standard |
226 | to generate PAL N, maybe it would work if you use the PAL M settings. | 226 | to generate PAL N, maybe it would work if you use the PAL M settings. |
227 | 227 | ||
228 | ========================== | 228 | ========================== |
diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt index 678e8f192db2..ffe1c062088b 100644 --- a/Documentation/x86_64/boot-options.txt +++ b/Documentation/x86_64/boot-options.txt | |||
@@ -11,6 +11,11 @@ Machine check | |||
11 | If your BIOS doesn't do that it's a good idea to enable though | 11 | If your BIOS doesn't do that it's a good idea to enable though |
12 | to make sure you log even machine check events that result | 12 | to make sure you log even machine check events that result |
13 | in a reboot. | 13 | in a reboot. |
14 | mce=tolerancelevel (number) | ||
15 | 0: always panic, 1: panic if deadlock possible, | ||
16 | 2: try to avoid panic, 3: never panic or exit (for testing) | ||
17 | default is 1 | ||
18 | Can be also set using sysfs which is preferable. | ||
14 | 19 | ||
15 | nomce (for compatibility with i386): same as mce=off | 20 | nomce (for compatibility with i386): same as mce=off |
16 | 21 | ||