-rw-r--r--  Documentation/DMA-API-HOWTO.txt           | 153
-rw-r--r--  Documentation/DMA-API.txt                 | 580
-rw-r--r--  Documentation/DMA-ISA-LPC.txt             | 71
-rw-r--r--  Documentation/DMA-attributes.txt          | 15
-rw-r--r--  Documentation/IPMI.txt                    | 76
-rw-r--r--  Documentation/IRQ-affinity.txt            | 75
-rw-r--r--  Documentation/IRQ-domain.txt              | 69
-rw-r--r--  Documentation/IRQ.txt                     | 2
-rw-r--r--  Documentation/Intel-IOMMU.txt             | 37
-rw-r--r--  Documentation/SAK.txt                     | 65
-rw-r--r--  Documentation/SM501.txt                   | 9
-rw-r--r--  Documentation/bcache.txt                  | 190
-rw-r--r--  Documentation/bt8xxgpio.txt               | 19
-rw-r--r--  Documentation/btmrvl.txt                  | 65
-rw-r--r--  Documentation/bus-virt-phys-mapping.txt   | 64
-rw-r--r--  Documentation/cachetlb.txt                | 92
-rw-r--r--  Documentation/cgroup-v2.txt               | 460
-rw-r--r--  Documentation/circular-buffers.txt        | 51
-rw-r--r--  Documentation/clk.txt                     | 189
-rw-r--r--  Documentation/cpu-load.txt                | 131
-rw-r--r--  Documentation/cputopology.txt             | 37
-rw-r--r--  Documentation/crc32.txt                   | 75
-rw-r--r--  Documentation/dcdbas.txt                  | 24
-rw-r--r--  Documentation/debugging-via-ohci1394.txt  | 21
-rw-r--r--  Documentation/dell_rbu.txt                | 81
-rw-r--r--  Documentation/digsig.txt                  | 131
-rw-r--r--  Documentation/efi-stub.txt                | 25
-rw-r--r--  Documentation/eisa.txt                    | 273
-rw-r--r--  Documentation/flexible-arrays.txt         | 25
-rw-r--r--  Documentation/futex-requeue-pi.txt        | 93
-rw-r--r--  Documentation/gcc-plugins.txt             | 58
-rw-r--r--  Documentation/highuid.txt                 | 47
-rw-r--r--  Documentation/hw_random.txt               | 159
-rw-r--r--  Documentation/hwspinlock.txt              | 527
-rw-r--r--  Documentation/intel_txt.txt               | 63
-rw-r--r--  Documentation/io-mapping.txt              | 67
-rw-r--r--  Documentation/io_ordering.txt             | 62
-rw-r--r--  Documentation/iostats.txt                 | 76
-rw-r--r--  Documentation/irqflags-tracing.txt        | 10
-rw-r--r--  Documentation/isa.txt                     | 53
-rw-r--r--  Documentation/isapnp.txt                  | 1
-rw-r--r--  Documentation/kernel-per-CPU-kthreads.txt | 156
-rw-r--r--  Documentation/kobject.txt                 | 69
-rw-r--r--  Documentation/kprobes.txt                 | 475
-rw-r--r--  Documentation/kref.txt                    | 295
-rw-r--r--  Documentation/ldm.txt                     | 56
-rw-r--r--  Documentation/lockup-watchdogs.txt        | 3
-rw-r--r--  Documentation/lzo.txt                     | 27
-rw-r--r--  Documentation/mailbox.txt                 | 185
-rw-r--r--  Documentation/memory-hotplug.txt          | 355
-rw-r--r--  Documentation/men-chameleon-bus.txt       | 330
-rw-r--r--  Documentation/nommu-mmap.txt              | 68
-rw-r--r--  Documentation/ntb.txt                     | 58
-rw-r--r--  Documentation/numastat.txt                | 5
-rw-r--r--  Documentation/padata.txt                  | 27
-rw-r--r--  Documentation/parport-lowlevel.txt        | 1321
-rw-r--r--  Documentation/percpu-rw-semaphore.txt     | 3
-rw-r--r--  Documentation/phy.txt                     | 106
-rw-r--r--  Documentation/pi-futex.txt                | 15
-rw-r--r--  Documentation/pnp.txt                     | 343
-rw-r--r--  Documentation/preempt-locking.txt         | 40
-rw-r--r--  Documentation/printk-formats.txt          | 384
-rw-r--r--  Documentation/rbtree.txt                  | 88
-rw-r--r--  Documentation/remoteproc.txt              | 328
-rw-r--r--  Documentation/rfkill.txt                  | 43
-rw-r--r--  Documentation/robust-futex-ABI.txt        | 14
-rw-r--r--  Documentation/robust-futexes.txt          | 12
-rw-r--r--  Documentation/rpmsg.txt                   | 348
-rw-r--r--  Documentation/sgi-ioc4.txt                | 4
-rw-r--r--  Documentation/siphash.txt                 | 164
-rw-r--r--  Documentation/smsc_ece1099.txt            | 4
-rw-r--r--  Documentation/static-keys.txt             | 207
-rw-r--r--  Documentation/svga.txt                    | 146
-rw-r--r--  Documentation/tee.txt                     | 53
-rw-r--r--  Documentation/this_cpu_ops.txt            | 49
-rw-r--r--  Documentation/unaligned-memory-access.txt | 57
-rw-r--r--  Documentation/vfio-mediated-device.txt    | 266
-rw-r--r--  Documentation/vfio.txt                    | 281
-rw-r--r--  Documentation/xillybus.txt                | 29
-rw-r--r--  Documentation/xz.txt                      | 200
-rw-r--r--  Documentation/zorro.txt                   | 59
81 files changed, 6263 insertions(+), 4731 deletions(-)
diff --git a/Documentation/DMA-API-HOWTO.txt b/Documentation/DMA-API-HOWTO.txt
index 4ed388356898..f0cc3f772265 100644
--- a/Documentation/DMA-API-HOWTO.txt
+++ b/Documentation/DMA-API-HOWTO.txt
@@ -1,22 +1,24 @@
-		     Dynamic DMA mapping Guide
-		     =========================
+=========================
+Dynamic DMA mapping Guide
+=========================
 
-		 David S. Miller <davem@redhat.com>
-		 Richard Henderson <rth@cygnus.com>
-		  Jakub Jelinek <jakub@redhat.com>
+:Author: David S. Miller <davem@redhat.com>
+:Author: Richard Henderson <rth@cygnus.com>
+:Author: Jakub Jelinek <jakub@redhat.com>
 
 This is a guide to device driver writers on how to use the DMA API
 with example pseudo-code.  For a concise description of the API, see
 DMA-API.txt.
 
-		     CPU and DMA addresses
+CPU and DMA addresses
+=====================
 
 There are several kinds of addresses involved in the DMA API, and it's
 important to understand the differences.
 
 The kernel normally uses virtual addresses.  Any address returned by
 kmalloc(), vmalloc(), and similar interfaces is a virtual address and can
-be stored in a "void *".
+be stored in a ``void *``.
 
 The virtual memory system (TLB, page tables, etc.) translates virtual
 addresses to CPU physical addresses, which are stored as "phys_addr_t" or
@@ -37,7 +39,7 @@ be restricted to a subset of that space. For example, even if a system
 supports 64-bit addresses for main memory and PCI BARs, it may use an IOMMU
 so devices only need to use 32-bit DMA addresses.
 
-Here's a picture and some examples:
+Here's a picture and some examples::
 
                CPU                  CPU                  Bus
              Virtual              Physical             Address
@@ -98,15 +100,16 @@ microprocessor architecture. You should use the DMA API rather than the
 bus-specific DMA API, i.e., use the dma_map_*() interfaces rather than the
 pci_map_*() interfaces.
 
-First of all, you should make sure
+First of all, you should make sure::
 
-#include <linux/dma-mapping.h>
+	#include <linux/dma-mapping.h>
 
 is in your driver, which provides the definition of dma_addr_t.  This type
 can hold any valid DMA address for the platform and should be used
 everywhere you hold a DMA address returned from the DMA mapping functions.
 
-		     What memory is DMA'able?
+What memory is DMA'able?
+========================
 
 The first piece of information you must know is what kernel memory can
 be used with the DMA mapping facilities.  There has been an unwritten
@@ -143,7 +146,8 @@ What about block I/O and networking buffers? The block I/O and
 networking subsystems make sure that the buffers they use are valid
 for you to DMA from/to.
 
-			DMA addressing limitations
+DMA addressing limitations
+==========================
 
 Does your device have any DMA addressing limitations?  For example, is
 your device only capable of driving the low order 24-bits of address?
@@ -166,7 +170,7 @@ style to do this even if your device holds the default setting,
 because this shows that you did think about these issues wrt. your
 device.
 
-The query is performed via a call to dma_set_mask_and_coherent():
+The query is performed via a call to dma_set_mask_and_coherent()::
 
 	int dma_set_mask_and_coherent(struct device *dev, u64 mask);
 
@@ -175,12 +179,12 @@ If you have some special requirements, then the following two separate
 queries can be used instead:
 
 	The query for streaming mappings is performed via a call to
-	dma_set_mask():
+	dma_set_mask()::
 
 		int dma_set_mask(struct device *dev, u64 mask);
 
 	The query for consistent allocations is performed via a call
-	to dma_set_coherent_mask():
+	to dma_set_coherent_mask()::
 
 		int dma_set_coherent_mask(struct device *dev, u64 mask);
 
@@ -209,7 +213,7 @@ of your driver reports that performance is bad or that the device is not
 even detected, you can ask them for the kernel messages to find out
 exactly why.
 
-The standard 32-bit addressing device would do something like this:
+The standard 32-bit addressing device would do something like this::
 
 	if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32))) {
 		dev_warn(dev, "mydev: No suitable DMA available\n");
@@ -225,7 +229,7 @@ than 64-bit addressing. For example, Sparc64 PCI SAC addressing is
 more efficient than DAC addressing.
 
 Here is how you would handle a 64-bit capable device which can drive
-all 64-bits when accessing streaming DMA:
+all 64-bits when accessing streaming DMA::
 
 	int using_dac;
 
@@ -239,7 +243,7 @@ all 64-bits when accessing streaming DMA:
 	}
 
 If a card is capable of using 64-bit consistent allocations as well,
-the case would look like this:
+the case would look like this::
 
 	int using_dac, consistent_using_dac;
 
@@ -260,7 +264,7 @@ uses consistent allocations, one would have to check the return value from
 dma_set_coherent_mask().
 
 Finally, if your device can only drive the low 24-bits of
-address you might do something like:
+address you might do something like::
 
 	if (dma_set_mask(dev, DMA_BIT_MASK(24))) {
 		dev_warn(dev, "mydev: 24-bit DMA addressing not available\n");
@@ -280,7 +284,7 @@ only provide the functionality which the machine can handle. It
 is important that the last call to dma_set_mask() be for the
 most specific mask.
 
-Here is pseudo-code showing how this might be done:
+Here is pseudo-code showing how this might be done::
 
 	#define PLAYBACK_ADDRESS_BITS	DMA_BIT_MASK(32)
 	#define RECORD_ADDRESS_BITS	DMA_BIT_MASK(24)
@@ -308,7 +312,8 @@ A sound card was used as an example here because this genre of PCI
 devices seems to be littered with ISA chips given a PCI front end,
 and thus retaining the 16MB DMA addressing limitations of ISA.
 
-			Types of DMA mappings
+Types of DMA mappings
+=====================
 
 There are two types of DMA mappings:
 
@@ -336,12 +341,14 @@ There are two types of DMA mappings:
               to memory is immediately visible to the device, and vice
               versa.  Consistent mappings guarantee this.
 
-	      IMPORTANT: Consistent DMA memory does not preclude the usage of
-	                 proper memory barriers.  The CPU may reorder stores to
+	      .. important::
+
+	         Consistent DMA memory does not preclude the usage of
+	         proper memory barriers.  The CPU may reorder stores to
 	         consistent memory just as it may normal memory.  Example:
 	         if it is important for the device to see the first word
 	         of a descriptor updated before the second, you must do
-	         something like:
+	         something like::
 
 			desc->word0 = address;
 			wmb();
@@ -377,16 +384,17 @@ Also, systems with caches that aren't DMA-coherent will work better
 when the underlying buffers don't share cache lines with other data.
 
 
-	 Using Consistent DMA mappings.
+Using Consistent DMA mappings
+=============================
 
 To allocate and map large (PAGE_SIZE or so) consistent DMA regions,
-you should do:
+you should do::
 
 	dma_addr_t dma_handle;
 
 	cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, gfp);
 
-where device is a struct device *. This may be called in interrupt
+where device is a ``struct device *``. This may be called in interrupt
 context with the GFP_ATOMIC flag.
 
 Size is the length of the region you want to allocate, in bytes.
@@ -415,7 +423,7 @@ exists (for example) to guarantee that if you allocate a chunk
 which is smaller than or equal to 64 kilobytes, the extent of the
 buffer you receive will not cross a 64K boundary.
 
-To unmap and free such a DMA region, you call:
+To unmap and free such a DMA region, you call::
 
 	dma_free_coherent(dev, size, cpu_addr, dma_handle);
 
@@ -430,7 +438,7 @@ a kmem_cache, but it uses dma_alloc_coherent(), not __get_free_pages().
 Also, it understands common hardware constraints for alignment,
 like queue heads needing to be aligned on N byte boundaries.
 
-Create a dma_pool like this:
+Create a dma_pool like this::
 
 	struct dma_pool *pool;
 
@@ -444,7 +452,7 @@ pass 0 for boundary; passing 4096 says memory allocated from this pool
 must not cross 4KByte boundaries (but at that time it may be better to
 use dma_alloc_coherent() directly instead).
 
-Allocate memory from a DMA pool like this:
+Allocate memory from a DMA pool like this::
 
 	cpu_addr = dma_pool_alloc(pool, flags, &dma_handle);
 
@@ -452,7 +460,7 @@ flags are GFP_KERNEL if blocking is permitted (not in_interrupt nor
 holding SMP locks), GFP_ATOMIC otherwise.  Like dma_alloc_coherent(),
 this returns two values, cpu_addr and dma_handle.
 
-Free memory that was allocated from a dma_pool like this:
+Free memory that was allocated from a dma_pool like this::
 
 	dma_pool_free(pool, cpu_addr, dma_handle);
 
@@ -460,7 +468,7 @@ where pool is what you passed to dma_pool_alloc(), and cpu_addr and
 dma_handle are the values dma_pool_alloc() returned.  This function
 may be called in interrupt context.
 
-Destroy a dma_pool by calling:
+Destroy a dma_pool by calling::
 
 	dma_pool_destroy(pool);
 
@@ -468,11 +476,12 @@ Make sure you've called dma_pool_free() for all memory allocated
 from a pool before you destroy the pool.  This function may not
 be called in interrupt context.
 
-			DMA Direction
+DMA Direction
+=============
 
 The interfaces described in subsequent portions of this document
 take a DMA direction argument, which is an integer and takes on
-one of the following values:
+one of the following values::
 
  DMA_BIDIRECTIONAL
  DMA_TO_DEVICE
@@ -521,14 +530,15 @@ packets, map/unmap them with the DMA_TO_DEVICE direction
 specifier.  For receive packets, just the opposite, map/unmap them
 with the DMA_FROM_DEVICE direction specifier.
 
-	  Using Streaming DMA mappings
+Using Streaming DMA mappings
+============================
 
 The streaming DMA mapping routines can be called from interrupt
 context.  There are two versions of each map/unmap, one which will
 map/unmap a single memory region, and one which will map/unmap a
 scatterlist.
 
-To map a single region, you do:
+To map a single region, you do::
 
 	struct device *dev = &my_dev->dev;
 	dma_addr_t dma_handle;
@@ -545,7 +555,7 @@ To map a single region, you do:
 		goto map_error_handling;
 	}
 
-and to unmap it:
+and to unmap it::
 
 	dma_unmap_single(dev, dma_handle, size, direction);
 
@@ -563,7 +573,7 @@ Using CPU pointers like this for single mappings has a disadvantage:
 you cannot reference HIGHMEM memory in this way.  Thus, there is a
 map/unmap interface pair akin to dma_{map,unmap}_single().  These
 interfaces deal with page/offset pairs instead of CPU pointers.
-Specifically:
+Specifically::
 
 	struct device *dev = &my_dev->dev;
 	dma_addr_t dma_handle;
@@ -593,7 +603,7 @@ error as outlined under the dma_map_single() discussion.
 You should call dma_unmap_page() when the DMA activity is finished, e.g.,
 from the interrupt which told you that the DMA transfer is done.
 
-With scatterlists, you map a region gathered from several regions by:
+With scatterlists, you map a region gathered from several regions by::
 
 	int i, count = dma_map_sg(dev, sglist, nents, direction);
 	struct scatterlist *sg;
@@ -617,16 +627,18 @@ Then you should loop count times (note: this can be less than nents times)
 and use sg_dma_address() and sg_dma_len() macros where you previously
 accessed sg->address and sg->length as shown above.
 
-To unmap a scatterlist, just call:
+To unmap a scatterlist, just call::
 
 	dma_unmap_sg(dev, sglist, nents, direction);
 
 Again, make sure DMA activity has already finished.
 
-PLEASE NOTE:  The 'nents' argument to the dma_unmap_sg call must be
-              the _same_ one you passed into the dma_map_sg call,
-              it should _NOT_ be the 'count' value _returned_ from the
-              dma_map_sg call.
+.. note::
+
+	The 'nents' argument to the dma_unmap_sg call must be
+	the _same_ one you passed into the dma_map_sg call,
+	it should _NOT_ be the 'count' value _returned_ from the
+	dma_map_sg call.
 
 Every dma_map_{single,sg}() call should have its dma_unmap_{single,sg}()
 counterpart, because the DMA address space is a shared resource and
@@ -638,11 +650,11 @@ properly in order for the CPU and device to see the most up-to-date and
 correct copy of the DMA buffer.
 
 So, firstly, just map it with dma_map_{single,sg}(), and after each DMA
-transfer call either:
+transfer call either::
 
 	dma_sync_single_for_cpu(dev, dma_handle, size, direction);
 
-or:
+or::
 
 	dma_sync_sg_for_cpu(dev, sglist, nents, direction);
 
@@ -650,17 +662,19 @@ as appropriate.
 
 Then, if you wish to let the device get at the DMA area again,
 finish accessing the data with the CPU, and then before actually
-giving the buffer to the hardware call either:
+giving the buffer to the hardware call either::
 
 	dma_sync_single_for_device(dev, dma_handle, size, direction);
 
-or:
+or::
 
 	dma_sync_sg_for_device(dev, sglist, nents, direction);
 
 as appropriate.
 
-PLEASE NOTE:  The 'nents' argument to dma_sync_sg_for_cpu() and
+.. note::
+
+	      The 'nents' argument to dma_sync_sg_for_cpu() and
 	      dma_sync_sg_for_device() must be the same passed to
 	      dma_map_sg(). It is _NOT_ the count returned by
 	      dma_map_sg().
@@ -671,7 +685,7 @@ dma_map_*() call till dma_unmap_*(), then you don't have to call the
 dma_sync_*() routines at all.
 
 Here is pseudo code which shows a situation in which you would need
-to use the dma_sync_*() interfaces.
+to use the dma_sync_*() interfaces::
 
 	my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len)
 	{
@@ -747,7 +761,8 @@ is planned to completely remove virt_to_bus() and bus_to_virt() as
 they are entirely deprecated.  Some ports already do not provide these
 as it is impossible to correctly support them.
 
-		       Handling Errors
+Handling Errors
+===============
 
 DMA address space is limited on some architectures and an allocation
 failure can be determined by:
@@ -755,7 +770,7 @@ failure can be determined by:
 - checking if dma_alloc_coherent() returns NULL or dma_map_sg returns 0
 
 - checking the dma_addr_t returned from dma_map_single() and dma_map_page()
-  by using dma_mapping_error():
+  by using dma_mapping_error()::
 
 	dma_addr_t dma_handle;
 
@@ -773,7 +788,8 @@ failure can be determined by:
      of a multiple page mapping attempt.  These example are applicable to
      dma_map_page() as well.
 
-Example 1:
+Example 1::
+
 	dma_addr_t dma_handle1;
 	dma_addr_t dma_handle2;
 
@@ -802,8 +818,12 @@ Example 1:
 		dma_unmap_single(dma_handle1);
 	map_error_handling1:
 
-Example 2: (if buffers are allocated in a loop, unmap all mapped buffers when
-	    mapping error is detected in the middle)
+Example 2::
+
+	/*
+	 * if buffers are allocated in a loop, unmap all mapped buffers when
+	 * mapping error is detected in the middle
+	 */
 
 	dma_addr_t dma_addr;
 	dma_addr_t array[DMA_BUFFERS];
@@ -846,7 +866,8 @@ SCSI drivers must return SCSI_MLQUEUE_HOST_BUSY if the DMA mapping
 fails in the queuecommand hook.  This means that the SCSI subsystem
 passes the command to the driver again later.
 
-		Optimizing Unmap State Space Consumption
+Optimizing Unmap State Space Consumption
+========================================
 
 On many platforms, dma_unmap_{single,page}() is simply a nop.
 Therefore, keeping track of the mapping address and length is a waste
@@ -858,7 +879,7 @@ Actually, instead of describing the macros one by one, we'll
 transform some example code.
 
 1) Use DEFINE_DMA_UNMAP_{ADDR,LEN} in state saving structures.
-   Example, before:
+   Example, before::
 
 	struct ring_state {
 		struct sk_buff *skb;
@@ -866,7 +887,7 @@ transform some example code.
 		__u32 len;
 	};
 
-   after:
+   after::
 
 	struct ring_state {
 		struct sk_buff *skb;
@@ -875,23 +896,23 @@ transform some example code.
 	};
 
 2) Use dma_unmap_{addr,len}_set() to set these values.
-   Example, before:
+   Example, before::
 
 	ringp->mapping = FOO;
 	ringp->len = BAR;
 
-   after:
+   after::
 
 	dma_unmap_addr_set(ringp, mapping, FOO);
 	dma_unmap_len_set(ringp, len, BAR);
 
 3) Use dma_unmap_{addr,len}() to access these values.
-   Example, before:
+   Example, before::
 
 	dma_unmap_single(dev, ringp->mapping, ringp->len,
 			 DMA_FROM_DEVICE);
 
-   after:
+   after::
 
 	dma_unmap_single(dev,
 			 dma_unmap_addr(ringp, mapping),
@@ -902,7 +923,8 @@ It really should be self-explanatory. We treat the ADDR and LEN
 separately, because it is possible for an implementation to only
 need the address in order to perform the unmap operation.
 
-			Platform Issues
+Platform Issues
+===============
 
 If you are just writing drivers for Linux and do not maintain
 an architecture port for the kernel, you can safely skip down
@@ -928,12 +950,13 @@ to "Closing".
    alignment constraints (e.g. the alignment constraints about 64-bit
    objects).
 
-			   Closing
+Closing
+=======
 
 This document, and the API itself, would not be in its current
 form without the feedback and suggestions from numerous individuals.
 We would like to specifically mention, in no particular order, the
-following people:
+following people::
 
 	Russell King <rmk@arm.linux.org.uk>
 	Leo Dagum <dagum@barrel.engr.sgi.com>
diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index 71200dfa0922..45b29326d719 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -1,7 +1,8 @@
-		Dynamic DMA mapping using the generic device
-		============================================
+============================================
+Dynamic DMA mapping using the generic device
+============================================
 
-	James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
+:Author: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
 
 This document describes the DMA API.  For a more gentle introduction
 of the API (and actual examples), see Documentation/DMA-API-HOWTO.txt.
@@ -12,10 +13,10 @@ machines. Unless you know that your driver absolutely has to support
 non-consistent platforms (this is usually only legacy platforms) you
 should only use the API described in part I.
 
-Part I - dma_ API
--------------------------------------
+Part I - dma_API
+----------------
 
-To get the dma_ API, you must #include <linux/dma-mapping.h>.  This
+To get the dma_API, you must #include <linux/dma-mapping.h>.  This
 provides dma_addr_t and the interfaces described below.
 
 A dma_addr_t can hold any valid DMA address for the platform.  It can be
@@ -26,9 +27,11 @@ address space and the DMA address space.
 Part Ia - Using large DMA-coherent buffers
 ------------------------------------------
 
-void *
-dma_alloc_coherent(struct device *dev, size_t size,
-		   dma_addr_t *dma_handle, gfp_t flag)
+::
+
+	void *
+	dma_alloc_coherent(struct device *dev, size_t size,
+			   dma_addr_t *dma_handle, gfp_t flag)
 
 Consistent memory is memory for which a write by either the device or
 the processor can immediately be read by the processor or device
@@ -51,20 +54,24 @@ consolidate your requests for consistent memory as much as possible.
 The simplest way to do that is to use the dma_pool calls (see below).
 
 The flag parameter (dma_alloc_coherent() only) allows the caller to
-specify the GFP_ flags (see kmalloc()) for the allocation (the
+specify the ``GFP_`` flags (see kmalloc()) for the allocation (the
 implementation may choose to ignore flags that affect the location of
 the returned memory, like GFP_DMA).
 
-void *
-dma_zalloc_coherent(struct device *dev, size_t size,
-		    dma_addr_t *dma_handle, gfp_t flag)
+::
+
+	void *
+	dma_zalloc_coherent(struct device *dev, size_t size,
+			    dma_addr_t *dma_handle, gfp_t flag)
 
 Wraps dma_alloc_coherent() and also zeroes the returned memory if the
 allocation attempt succeeded.
 
-void
-dma_free_coherent(struct device *dev, size_t size, void *cpu_addr,
-		  dma_addr_t dma_handle)
+::
+
+	void
+	dma_free_coherent(struct device *dev, size_t size, void *cpu_addr,
+			  dma_addr_t dma_handle)
 
 Free a region of consistent memory you previously allocated.  dev,
 size and dma_handle must all be the same as those passed into
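The allocate/zero/free contract described in this hunk (dma_zalloc_coherent() is dma_alloc_coherent() plus a memset on success) can be sketched as a userspace mock. Everything here is a stand-in for illustration: the mock_* names are invented, malloc() substitutes for the real allocator, and the "DMA handle equals the buffer address" shortcut is not true in the kernel, where CPU and DMA addresses may differ.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for dma_addr_t; purely illustrative. */
typedef unsigned long long mock_dma_addr_t;

/* Mock of dma_alloc_coherent(): returns a CPU pointer and fills in a
 * "DMA handle" (here just the same address, which a real platform does
 * not guarantee). */
static void *mock_alloc_coherent(size_t size, mock_dma_addr_t *dma_handle)
{
	void *cpu_addr = malloc(size);

	if (cpu_addr)
		*dma_handle = (mock_dma_addr_t)(uintptr_t)cpu_addr;
	return cpu_addr;
}

/* Mirrors the dma_zalloc_coherent() contract: wrap the allocator and
 * zero the memory only if the allocation attempt succeeded. */
static void *mock_zalloc_coherent(size_t size, mock_dma_addr_t *dma_handle)
{
	void *cpu_addr = mock_alloc_coherent(size, dma_handle);

	if (cpu_addr)
		memset(cpu_addr, 0, size);
	return cpu_addr;
}

/* Mock of dma_free_coherent(): the real call also needs dev, size and
 * the dma_handle returned at allocation time. */
static void mock_free_coherent(void *cpu_addr)
{
	free(cpu_addr);
}
```

The zero-only-on-success ordering matters: zeroing before checking the return value would dereference NULL on allocation failure.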
@@ -78,7 +85,7 @@ may only be called with IRQs enabled.
 Part Ib - Using small DMA-coherent buffers
 ------------------------------------------
 
-To get this part of the dma_ API, you must #include <linux/dmapool.h>
+To get this part of the dma_API, you must #include <linux/dmapool.h>
 
 Many drivers need lots of small DMA-coherent memory regions for DMA
 descriptors or I/O buffers.  Rather than allocating in units of a page
@@ -88,6 +95,8 @@ not __get_free_pages(). Also, they understand common hardware constraints
 for alignment, like queue heads needing to be aligned on N-byte boundaries.
 
 
+::
+
 	struct dma_pool *
 	dma_pool_create(const char *name, struct device *dev,
 			size_t size, size_t align, size_t alloc);
@@ -103,16 +112,21 @@ in bytes, and must be a power of two). If your device has no boundary
 crossing restrictions, pass 0 for alloc; passing 4096 says memory allocated
 from this pool must not cross 4KByte boundaries.
 
+::
 
-	void *dma_pool_zalloc(struct dma_pool *pool, gfp_t mem_flags,
-			dma_addr_t *handle)
+	void *
+	dma_pool_zalloc(struct dma_pool *pool, gfp_t mem_flags,
+			dma_addr_t *handle)
 
 Wraps dma_pool_alloc() and also zeroes the returned memory if the
 allocation attempt succeeded.
 
 
-	void *dma_pool_alloc(struct dma_pool *pool, gfp_t gfp_flags,
-			dma_addr_t *dma_handle);
+::
+
+	void *
+	dma_pool_alloc(struct dma_pool *pool, gfp_t gfp_flags,
+			dma_addr_t *dma_handle);
 
 This allocates memory from the pool; the returned memory will meet the
 size and alignment requirements specified at creation time.  Pass
@@ -122,16 +136,20 @@ blocking. Like dma_alloc_coherent(), this returns two values: an
 address usable by the CPU, and the DMA address usable by the pool's
 device.
 
+::
 
-	void dma_pool_free(struct dma_pool *pool, void *vaddr,
-			dma_addr_t addr);
+	void
+	dma_pool_free(struct dma_pool *pool, void *vaddr,
+		      dma_addr_t addr);
 
 This puts memory back into the pool.  The pool is what was passed to
 dma_pool_alloc(); the CPU (vaddr) and DMA addresses are what
 were returned when that routine allocated the memory being freed.
 
+::
 
-	void dma_pool_destroy(struct dma_pool *pool);
+	void
+	dma_pool_destroy(struct dma_pool *pool);
 
 dma_pool_destroy() frees the resources of the pool.  It must be
 called in a context which can sleep.  Make sure you've freed all allocated
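The two dma_pool_create() constraints discussed in the hunks above — a power-of-two alignment, and an optional "alloc" boundary that blocks may not cross (0 meaning no restriction, 4096 meaning no 4KByte-boundary crossing) — reduce to two small predicates. The helpers below are an illustrative sketch, not the kernel's implementation:

```c
#include <stdbool.h>
#include <stddef.h>

/* True iff addr is aligned to align, where align is a power of two. */
static bool mock_is_aligned(unsigned long addr, unsigned long align)
{
	return (addr & (align - 1)) == 0;
}

/* True iff a block of `size` bytes starting at addr straddles a
 * `boundary`-byte window, i.e. its first and last byte fall in
 * different windows.  boundary == 0 means "no restriction", matching
 * the dma_pool_create() convention described above. */
static bool mock_crosses_boundary(unsigned long addr, size_t size,
				  unsigned long boundary)
{
	if (boundary == 0)
		return false;
	return (addr & ~(boundary - 1)) !=
	       ((addr + size - 1) & ~(boundary - 1));
}
```

With boundary = 4096, a 16-byte block at 0x0ff0 ends at 0x0fff and is fine, while one at 0x0ff9 ends at 0x1008 and would cross into the next 4K window.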
@@ -141,32 +159,40 @@ memory back to the pool before you destroy it.
 Part Ic - DMA addressing limitations
 ------------------------------------
 
-int
-dma_set_mask_and_coherent(struct device *dev, u64 mask)
+::
+
+	int
+	dma_set_mask_and_coherent(struct device *dev, u64 mask)
 
 Checks to see if the mask is possible and updates the device
 streaming and coherent DMA mask parameters if it is.
 
 Returns: 0 if successful and a negative error if not.
 
-int
-dma_set_mask(struct device *dev, u64 mask)
+::
+
+	int
+	dma_set_mask(struct device *dev, u64 mask)
 
 Checks to see if the mask is possible and updates the device
 parameters if it is.
 
 Returns: 0 if successful and a negative error if not.
 
-int
-dma_set_coherent_mask(struct device *dev, u64 mask)
+::
+
+	int
+	dma_set_coherent_mask(struct device *dev, u64 mask)
 
 Checks to see if the mask is possible and updates the device
 parameters if it is.
 
 Returns: 0 if successful and a negative error if not.
 
-u64
-dma_get_required_mask(struct device *dev)
+::
+
+	u64
+	dma_get_required_mask(struct device *dev)
 
 This API returns the mask that the platform requires to
 operate efficiently.  Usually this means the returned mask
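The dma_mask test that the document states later in this patch — a device can DMA to an address iff ANDing the address with the mask leaves it unchanged — is worth seeing as code. This is a sketch of the rule only; the 24-bit ISA-style mask in the test is an example value, not something queried from a real device:

```c
#include <stdbool.h>

typedef unsigned long long u64;

/* The addressability rule from the text: the device can perform DMA
 * to dma_addr iff (dma_addr & dma_mask) == dma_addr. */
static bool mock_dma_capable(u64 dma_addr, u64 dma_mask)
{
	return (dma_addr & dma_mask) == dma_addr;
}
```

So for a 24-bit mask (0xffffff, the first 16MB as required by ISA devices), 0x00ffffff is reachable but 0x01000000 is not, which is why dma_set_mask() must be allowed to fail and return a negative error.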
@@ -182,94 +208,107 @@ call to set the mask to the value returned.
 Part Id - Streaming DMA mappings
 --------------------------------
 
-dma_addr_t
-dma_map_single(struct device *dev, void *cpu_addr, size_t size,
-	       enum dma_data_direction direction)
+::
+
+	dma_addr_t
+	dma_map_single(struct device *dev, void *cpu_addr, size_t size,
+		       enum dma_data_direction direction)
 
 Maps a piece of processor virtual memory so it can be accessed by the
 device and returns the DMA address of the memory.
 
 The direction for both APIs may be converted freely by casting.
-However the dma_ API uses a strongly typed enumerator for its
+However the dma_API uses a strongly typed enumerator for its
 direction:
 
+======================= =============================================
 DMA_NONE		no direction (used for debugging)
 DMA_TO_DEVICE		data is going from the memory to the device
 DMA_FROM_DEVICE		data is coming from the device to the memory
 DMA_BIDIRECTIONAL	direction isn't known
+======================= =============================================
+
+.. note::
+
+	Not all memory regions in a machine can be mapped by this API.
+	Further, contiguous kernel virtual space may not be contiguous as
+	physical memory.  Since this API does not provide any scatter/gather
+	capability, it will fail if the user tries to map a non-physically
+	contiguous piece of memory.  For this reason, memory to be mapped by
+	this API should be obtained from sources which guarantee it to be
+	physically contiguous (like kmalloc).
+
+	Further, the DMA address of the memory must be within the
+	dma_mask of the device (the dma_mask is a bit mask of the
+	addressable region for the device, i.e., if the DMA address of
+	the memory ANDed with the dma_mask is still equal to the DMA
+	address, then the device can perform DMA to the memory).  To
+	ensure that the memory allocated by kmalloc is within the dma_mask,
+	the driver may specify various platform-dependent flags to restrict
+	the DMA address range of the allocation (e.g., on x86, GFP_DMA
+	guarantees to be within the first 16MB of available DMA addresses,
+	as required by ISA devices).
+
+	Note also that the above constraints on physical contiguity and
+	dma_mask may not apply if the platform has an IOMMU (a device which
+	maps an I/O DMA address to a physical memory address).  However, to be
+	portable, device driver writers may *not* assume that such an IOMMU
+	exists.
+
+.. warning::
+
+	Memory coherency operates at a granularity called the cache
+	line width.  In order for memory mapped by this API to operate
+	correctly, the mapped region must begin exactly on a cache line
+	boundary and end exactly on one (to prevent two separately mapped
+	regions from sharing a single cache line).  Since the cache line size
+	may not be known at compile time, the API will not enforce this
+	requirement.  Therefore, it is recommended that driver writers who
+	don't take special care to determine the cache line size at run time
+	only map virtual regions that begin and end on page boundaries (which
+	are guaranteed also to be cache line boundaries).
+
+	DMA_TO_DEVICE synchronisation must be done after the last modification
+	of the memory region by the software and before it is handed off to
+	the device.  Once this primitive is used, memory covered by this
+	primitive should be treated as read-only by the device.  If the device
+	may write to it at any point, it should be DMA_BIDIRECTIONAL (see
+	below).
+
+	DMA_FROM_DEVICE synchronisation must be done before the driver
+	accesses data that may be changed by the device.  This memory should
+	be treated as read-only by the driver.  If the driver needs to write
+	to it at any point, it should be DMA_BIDIRECTIONAL (see below).
+
+	DMA_BIDIRECTIONAL requires special handling: it means that the driver
+	isn't sure if the memory was modified before being handed off to the
+	device and also isn't sure if the device will also modify it.  Thus,
+	you must always sync bidirectional memory twice: once before the
+	memory is handed off to the device (to make sure all memory changes
+	are flushed from the processor) and once before the data may be
+	accessed after being used by the device (to make sure any processor
+	cache lines are updated with data that the device may have changed).
+
+::
 
-Notes:  Not all memory regions in a machine can be mapped by this API.
-Further, contiguous kernel virtual space may not be contiguous as
-physical memory.  Since this API does not provide any scatter/gather
-capability, it will fail if the user tries to map a non-physically
-contiguous piece of memory.  For this reason, memory to be mapped by
-this API should be obtained from sources which guarantee it to be
-physically contiguous (like kmalloc).
-
-Further, the DMA address of the memory must be within the
-dma_mask of the device (the dma_mask is a bit mask of the
-addressable region for the device, i.e., if the DMA address of
-the memory ANDed with the dma_mask is still equal to the DMA
-address, then the device can perform DMA to the memory).  To
-ensure that the memory allocated by kmalloc is within the dma_mask,
-the driver may specify various platform-dependent flags to restrict
-the DMA address range of the allocation (e.g., on x86, GFP_DMA
-guarantees to be within the first 16MB of available DMA addresses,
-as required by ISA devices).
-
-Note also that the above constraints on physical contiguity and
-dma_mask may not apply if the platform has an IOMMU (a device which
-maps an I/O DMA address to a physical memory address).  However, to be
-portable, device driver writers may *not* assume that such an IOMMU
-exists.
-
-Warnings:  Memory coherency operates at a granularity called the cache
-line width.  In order for memory mapped by this API to operate
-correctly, the mapped region must begin exactly on a cache line
-boundary and end exactly on one (to prevent two separately mapped
-regions from sharing a single cache line).  Since the cache line size
-may not be known at compile time, the API will not enforce this
-requirement.  Therefore, it is recommended that driver writers who
-don't take special care to determine the cache line size at run time
-only map virtual regions that begin and end on page boundaries (which
-are guaranteed also to be cache line boundaries).
-
-DMA_TO_DEVICE synchronisation must be done after the last modification
-of the memory region by the software and before it is handed off to
-the device.  Once this primitive is used, memory covered by this
-primitive should be treated as read-only by the device.  If the device
-may write to it at any point, it should be DMA_BIDIRECTIONAL (see
-below).
-
-DMA_FROM_DEVICE synchronisation must be done before the driver
-accesses data that may be changed by the device.  This memory should
-be treated as read-only by the driver.  If the driver needs to write
-to it at any point, it should be DMA_BIDIRECTIONAL (see below).
-
-DMA_BIDIRECTIONAL requires special handling: it means that the driver
-isn't sure if the memory was modified before being handed off to the
-device and also isn't sure if the device will also modify it.  Thus,
-you must always sync bidirectional memory twice: once before the
-memory is handed off to the device (to make sure all memory changes
-are flushed from the processor) and once before the data may be
-accessed after being used by the device (to make sure any processor
-cache lines are updated with data that the device may have changed).
-
-void
-dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size,
-		 enum dma_data_direction direction)
+	void
+	dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size,
+			 enum dma_data_direction direction)
 
 Unmaps the region previously mapped.  All the parameters passed in
 must be identical to those passed in (and returned) by the mapping
 API.
 
-dma_addr_t
-dma_map_page(struct device *dev, struct page *page,
-	     unsigned long offset, size_t size,
-	     enum dma_data_direction direction)
-void
-dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
-	       enum dma_data_direction direction)
+::
+
+	dma_addr_t
+	dma_map_page(struct device *dev, struct page *page,
+		     unsigned long offset, size_t size,
+		     enum dma_data_direction direction)
+
+	void
+	dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
+		       enum dma_data_direction direction)
 
 API for mapping and unmapping for pages.  All the notes and warnings
 for the other mapping APIs apply here.  Also, although the <offset>
@@ -277,20 +316,24 @@ and <size> parameters are provided to do partial page mapping, it is
 recommended that you never use these unless you really know what the
 cache width is.
 
-dma_addr_t
-dma_map_resource(struct device *dev, phys_addr_t phys_addr, size_t size,
-		 enum dma_data_direction dir, unsigned long attrs)
+::
 
-void
-dma_unmap_resource(struct device *dev, dma_addr_t addr, size_t size,
-		   enum dma_data_direction dir, unsigned long attrs)
+	dma_addr_t
+	dma_map_resource(struct device *dev, phys_addr_t phys_addr, size_t size,
+			 enum dma_data_direction dir, unsigned long attrs)
+
+	void
+	dma_unmap_resource(struct device *dev, dma_addr_t addr, size_t size,
+			   enum dma_data_direction dir, unsigned long attrs)
 
 API for mapping and unmapping for MMIO resources.  All the notes and
 warnings for the other mapping APIs apply here.  The API should only be
 used to map device MMIO resources, mapping of RAM is not permitted.
 
-int
-dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
+::
+
+	int
+	dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
 
 In some circumstances dma_map_single(), dma_map_page() and dma_map_resource()
 will fail to create a mapping. A driver can check for these errors by testing
@@ -298,9 +341,11 @@ the returned DMA address with dma_mapping_error(). A non-zero return value
 means the mapping could not be created and the driver should take appropriate
 action (e.g. reduce current DMA mapping usage or delay and try again later).
 
+::
+
 	int
 	dma_map_sg(struct device *dev, struct scatterlist *sg,
 		   int nents, enum dma_data_direction direction)
 
 Returns: the number of DMA address segments mapped (this may be shorter
 than <nents> passed in if some elements of the scatter/gather list are
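The error-checking discipline described above — never use an address returned by the dma_map_*() family without testing it with dma_mapping_error() — can be sketched with userspace stand-ins. The mock_* names, the sentinel value, and the "NULL input fails" rule are all invented for illustration; the real kernel encodes its error value internally and a driver must only ever use dma_mapping_error() to test for it:

```c
#include <stdbool.h>
#include <stdint.h>

typedef unsigned long long mock_dma_addr_t;

/* Illustrative error sentinel; drivers must not rely on any
 * particular value in the real API. */
#define MOCK_MAPPING_ERROR ((mock_dma_addr_t)-1)

static bool mock_dma_mapping_error(mock_dma_addr_t dma_addr)
{
	return dma_addr == MOCK_MAPPING_ERROR;
}

/* A mapping routine that can fail, as dma_map_single() can; here we
 * simply pretend that mapping a NULL pointer fails. */
static mock_dma_addr_t mock_dma_map_single(void *cpu_addr)
{
	if (!cpu_addr)
		return MOCK_MAPPING_ERROR;
	return (mock_dma_addr_t)(uintptr_t)cpu_addr;
}
```

The point of the pattern is that a failed mapping is reported in-band through the returned address, so skipping the check silently hands a garbage address to the device.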
@@ -316,7 +361,7 @@ critical that the driver do something, in the case of a block driver
 aborting the request or even oopsing is better than doing nothing and
 corrupting the filesystem.
 
-With scatterlists, you use the resulting mapping like this:
+With scatterlists, you use the resulting mapping like this::
 
 	int i, count = dma_map_sg(dev, sglist, nents, direction);
 	struct scatterlist *sg;
@@ -337,9 +382,11 @@ Then you should loop count times (note: this can be less than nents times)
 and use sg_dma_address() and sg_dma_len() macros where you previously
 accessed sg->address and sg->length as shown above.
 
+::
+
 	void
 	dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 		     int nents, enum dma_data_direction direction)
 
 Unmap the previously mapped scatter/gather list.  All the parameters
 must be the same as those and passed in to the scatter/gather mapping
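The reason dma_map_sg() may return fewer segments than nents — and why the loop above runs count times, not nents times — is that the platform may coalesce adjacent entries. The sketch below is a userspace illustration only: struct mock_sg and the "merge when physically adjacent" rule are hypothetical stand-ins, not the kernel's scatterlist implementation, where merging depends on the IOMMU and platform.

```c
#include <stddef.h>

/* Stand-in for the fields sg_dma_address()/sg_dma_len() expose. */
struct mock_sg {
	unsigned long dma_address;
	unsigned int dma_length;
};

/* Illustrative mapper: coalesce entries whose DMA ranges happen to be
 * adjacent, returning the number of mapped segments (<= nents). */
static int mock_map_sg(struct mock_sg *sg, int nents)
{
	int count = 0;

	for (int i = 0; i < nents; i++) {
		if (count && sg[count - 1].dma_address +
			     sg[count - 1].dma_length == sg[i].dma_address) {
			/* Extend the previous segment instead of
			 * emitting a new one. */
			sg[count - 1].dma_length += sg[i].dma_length;
		} else {
			sg[count] = sg[i];
			count++;
		}
	}
	return count;
}
```

This is also why `<nents>`, not the returned count, must be passed back to dma_unmap_sg(): the unmap side needs the original list length, while only the first count entries carry valid DMA addresses and lengths.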
@@ -348,18 +395,27 @@ API.
 Note: <nents> must be the number you passed in, *not* the number of
 DMA address entries returned.
 
-void
-dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle, size_t size,
-			enum dma_data_direction direction)
-void
-dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle, size_t size,
-			   enum dma_data_direction direction)
-void
-dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int nents,
-		    enum dma_data_direction direction)
-void
-dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, int nents,
-		       enum dma_data_direction direction)
+::
+
+	void
+	dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle,
+				size_t size,
+				enum dma_data_direction direction)
+
+	void
+	dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle,
+				   size_t size,
+				   enum dma_data_direction direction)
+
+	void
+	dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
+			    int nents,
+			    enum dma_data_direction direction)
+
+	void
+	dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
+			       int nents,
+			       enum dma_data_direction direction)
 
 Synchronise a single contiguous or scatter/gather mapping for the CPU
 and device.  With the sync_sg API, all the parameters must be the same
@@ -367,36 +423,41 @@ as those passed into the single mapping API. With the sync_single API,
 you can use dma_handle and size parameters that aren't identical to
 those passed into the single mapping API to do a partial sync.
 
-Notes:  You must do this:
 
-- Before reading values that have been written by DMA from the device
-  (use the DMA_FROM_DEVICE direction)
-- After writing values that will be written to the device using DMA
-  (use the DMA_TO_DEVICE) direction
-- before *and* after handing memory to the device if the memory is
-  DMA_BIDIRECTIONAL
+.. note::
+
+	You must do this:
+
+	- Before reading values that have been written by DMA from the device
+	  (use the DMA_FROM_DEVICE direction)
+	- After writing values that will be written to the device using DMA
+	  (use the DMA_TO_DEVICE) direction
+	- before *and* after handing memory to the device if the memory is
+	  DMA_BIDIRECTIONAL
 
 See also dma_map_single().
 
-dma_addr_t
-dma_map_single_attrs(struct device *dev, void *cpu_addr, size_t size,
-		     enum dma_data_direction dir,
-		     unsigned long attrs)
+::
+
+	dma_addr_t
+	dma_map_single_attrs(struct device *dev, void *cpu_addr, size_t size,
+			     enum dma_data_direction dir,
+			     unsigned long attrs)
 
-void
-dma_unmap_single_attrs(struct device *dev, dma_addr_t dma_addr,
-		       size_t size, enum dma_data_direction dir,
-		       unsigned long attrs)
+	void
+	dma_unmap_single_attrs(struct device *dev, dma_addr_t dma_addr,
+			       size_t size, enum dma_data_direction dir,
+			       unsigned long attrs)
 
-int
-dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
-		 int nents, enum dma_data_direction dir,
-		 unsigned long attrs)
+	int
+	dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
+			 int nents, enum dma_data_direction dir,
+			 unsigned long attrs)
 
-void
-dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sgl,
-		   int nents, enum dma_data_direction dir,
-		   unsigned long attrs)
+	void
+	dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sgl,
+			   int nents, enum dma_data_direction dir,
+			   unsigned long attrs)
 
 The four functions above are just like the counterpart functions
 without the _attrs suffixes, except that they pass an optional
@@ -410,37 +471,38 @@ is identical to those of the corresponding function
 without the _attrs suffix.  As a result dma_map_single_attrs()
 can generally replace dma_map_single(), etc.
 
-As an example of the use of the *_attrs functions, here's how
+As an example of the use of the ``*_attrs`` functions, here's how
 you could pass an attribute DMA_ATTR_FOO when mapping memory
-for DMA:
+for DMA::
 
-#include <linux/dma-mapping.h>
-/* DMA_ATTR_FOO should be defined in linux/dma-mapping.h and
- * documented in Documentation/DMA-attributes.txt */
-...
+	#include <linux/dma-mapping.h>
+	/* DMA_ATTR_FOO should be defined in linux/dma-mapping.h and
+	 * documented in Documentation/DMA-attributes.txt */
+	...
 
 	unsigned long attr;
 	attr |= DMA_ATTR_FOO;
 	....
 	n = dma_map_sg_attrs(dev, sg, nents, DMA_TO_DEVICE, attr);
 	....
 
 Architectures that care about DMA_ATTR_FOO would check for its
 presence in their implementations of the mapping and unmapping
-routines, e.g.:
+routines, e.g.::
 
-void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr,
-			     size_t size, enum dma_data_direction dir,
-			     unsigned long attrs)
-{
-	....
-	if (attrs & DMA_ATTR_FOO)
-		/* twizzle the frobnozzle */
-	....
+	void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr,
+				     size_t size, enum dma_data_direction dir,
+				     unsigned long attrs)
+	{
+		....
+		if (attrs & DMA_ATTR_FOO)
+			/* twizzle the frobnozzle */
+		....
+	}
 
 
-Part II - Advanced dma_ usage
------------------------------
+Part II - Advanced dma usage
+----------------------------
 
 Warning: These pieces of the DMA API should not be used in the
 majority of cases, since they cater for unlikely corner cases that
@@ -450,9 +512,11 @@ If you don't understand how cache line coherency works between a
 processor and an I/O device, you should not be using this part of the
 API at all.
 
-void *
-dma_alloc_noncoherent(struct device *dev, size_t size,
-		      dma_addr_t *dma_handle, gfp_t flag)
+::
+
+	void *
+	dma_alloc_noncoherent(struct device *dev, size_t size,
+			      dma_addr_t *dma_handle, gfp_t flag)
 
 Identical to dma_alloc_coherent() except that the platform will
 choose to return either consistent or non-consistent memory as it sees
@@ -468,39 +532,49 @@ only use this API if you positively know your driver will be
468required to work on one of the rare (usually non-PCI) architectures 532required to work on one of the rare (usually non-PCI) architectures
469that simply cannot make consistent memory. 533that simply cannot make consistent memory.
470 534
471void 535::
472dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr, 536
473 dma_addr_t dma_handle) 537 void
538 dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr,
539 dma_addr_t dma_handle)
474 540
475Free memory allocated by the nonconsistent API. All parameters must 541Free memory allocated by the nonconsistent API. All parameters must
476be identical to those passed in (and returned by 542be identical to those passed in (and returned by
477dma_alloc_noncoherent()). 543dma_alloc_noncoherent()).
478 544
479int 545::
480dma_get_cache_alignment(void) 546
547 int
548 dma_get_cache_alignment(void)
481 549
482Returns the processor cache alignment. This is the absolute minimum 550Returns the processor cache alignment. This is the absolute minimum
483alignment *and* width that you must observe when either mapping 551alignment *and* width that you must observe when either mapping
484memory or doing partial flushes. 552memory or doing partial flushes.
485 553
486Notes: This API may return a number *larger* than the actual cache 554.. note::
487line, but it will guarantee that one or more cache lines fit exactly
488into the width returned by this call. It will also always be a power
489of two for easy alignment.
490 555
491void 556 This API may return a number *larger* than the actual cache
492dma_cache_sync(struct device *dev, void *vaddr, size_t size, 557 line, but it will guarantee that one or more cache lines fit exactly
493 enum dma_data_direction direction) 558 into the width returned by this call. It will also always be a power
559 of two for easy alignment.
560
::

	void
	dma_cache_sync(struct device *dev, void *vaddr, size_t size,
		       enum dma_data_direction direction)

Do a partial sync of memory that was allocated by
dma_alloc_noncoherent(), starting at virtual address vaddr and
continuing on for size. Again, you *must* observe the cache line
boundaries when doing this.

::

	int
	dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr,
				    dma_addr_t device_addr, size_t size, int
				    flags)

Declare region of memory to be handed out by dma_alloc_coherent() when
it's asked for coherent memory for this device.

@@ -516,21 +590,21 @@ size is the size of the area (must be multiples of PAGE_SIZE).

flags can be ORed together and are:

- DMA_MEMORY_MAP - request that the memory returned from
  dma_alloc_coherent() be directly writable.

- DMA_MEMORY_IO - request that the memory returned from
  dma_alloc_coherent() be addressable using read()/write()/memcpy_toio() etc.

One or both of these flags must be present.

- DMA_MEMORY_INCLUDES_CHILDREN - make the declared memory be allocated by
  dma_alloc_coherent of any child devices of this one (for memory residing
  on a bridge).

- DMA_MEMORY_EXCLUSIVE - only allocate memory from the declared regions.
  Do not allow dma_alloc_coherent() to fall back to system memory when
  it's out of memory in the declared region.

The return value will be either DMA_MEMORY_MAP or DMA_MEMORY_IO and
must correspond to a passed in flag (i.e. no returning DMA_MEMORY_IO
@@ -543,15 +617,17 @@ must be accessed using the correct bus functions. If your driver
isn't prepared to handle this contingency, it should not specify
DMA_MEMORY_IO in the input flags.

As a simplification for the platforms, only **one** such region of
memory may be declared per device.

For reasons of efficiency, most platforms choose to track the declared
region only at the granularity of a page. For smaller allocations,
you should use the dma_pool() API.

::

	void
	dma_release_declared_memory(struct device *dev)

Remove the memory region previously declared from the system. This
API performs *no* in-use checking for this region and will return
@@ -559,9 +635,11 @@ unconditionally having removed all the required structures. It is the
driver's job to ensure that no parts of this memory region are
currently in use.

::

	void *
	dma_mark_declared_memory_occupied(struct device *dev,
					  dma_addr_t device_addr, size_t size)

This is used to occupy specific regions of the declared space
(dma_alloc_coherent() will hand out the first free region it finds).
@@ -592,38 +670,37 @@ option has a performance impact. Do not enable it in production kernels.
If you boot the resulting kernel will contain code which does some bookkeeping
about what DMA memory was allocated for which device. If this code detects an
error it prints a warning message with some details into your kernel log. An
example warning message may look like this::

	WARNING: at /data2/repos/linux-2.6-iommu/lib/dma-debug.c:448
		check_unmap+0x203/0x490()
	Hardware name:
	forcedeth 0000:00:08.0: DMA-API: device driver frees DMA memory with wrong
		function [device address=0x00000000640444be] [size=66 bytes] [mapped as
		single] [unmapped as page]
	Modules linked in: nfsd exportfs bridge stp llc r8169
	Pid: 0, comm: swapper Tainted: G        W  2.6.28-dmatest-09289-g8bb99c0 #1
	Call Trace:
	<IRQ>  [<ffffffff80240b22>] warn_slowpath+0xf2/0x130
	[<ffffffff80647b70>] _spin_unlock+0x10/0x30
	[<ffffffff80537e75>] usb_hcd_link_urb_to_ep+0x75/0xc0
	[<ffffffff80647c22>] _spin_unlock_irqrestore+0x12/0x40
	[<ffffffff8055347f>] ohci_urb_enqueue+0x19f/0x7c0
	[<ffffffff80252f96>] queue_work+0x56/0x60
	[<ffffffff80237e10>] enqueue_task_fair+0x20/0x50
	[<ffffffff80539279>] usb_hcd_submit_urb+0x379/0xbc0
	[<ffffffff803b78c3>] cpumask_next_and+0x23/0x40
	[<ffffffff80235177>] find_busiest_group+0x207/0x8a0
	[<ffffffff8064784f>] _spin_lock_irqsave+0x1f/0x50
	[<ffffffff803c7ea3>] check_unmap+0x203/0x490
	[<ffffffff803c8259>] debug_dma_unmap_page+0x49/0x50
	[<ffffffff80485f26>] nv_tx_done_optimized+0xc6/0x2c0
	[<ffffffff80486c13>] nv_nic_irq_optimized+0x73/0x2b0
	[<ffffffff8026df84>] handle_IRQ_event+0x34/0x70
	[<ffffffff8026ffe9>] handle_edge_irq+0xc9/0x150
	[<ffffffff8020e3ab>] do_IRQ+0xcb/0x1c0
	[<ffffffff8020c093>] ret_from_intr+0x0/0xa
	<EOI> <4>---[ end trace f6435a98e2a38c0e ]---

The driver developer can find the driver and the device including a stacktrace
of the DMA-API call which caused this warning.
@@ -637,43 +714,42 @@ details.
The debugfs directory for the DMA-API debugging code is called dma-api/. In
this directory the following files can currently be found:

=============================== ===============================================
dma-api/all_errors              This file contains a numeric value. If this
                                value is not equal to zero the debugging code
                                will print a warning for every error it finds
                                into the kernel log. Be careful with this
                                option, as it can easily flood your logs.

dma-api/disabled                This read-only file contains the character 'Y'
                                if the debugging code is disabled. This can
                                happen when it runs out of memory or if it was
                                disabled at boot time

dma-api/error_count             This file is read-only and shows the total
                                number of errors found.

dma-api/num_errors              The number in this file shows how many
                                warnings will be printed to the kernel log
                                before it stops. This number is initialized to
                                one at system boot and can be set by writing
                                into this file

dma-api/min_free_entries        This read-only file can be read to get the
                                minimum number of free dma_debug_entries the
                                allocator has ever seen. If this value goes
                                down to zero the code will disable itself
                                because it is no longer reliable.

dma-api/num_free_entries        The current number of free dma_debug_entries
                                in the allocator.

dma-api/driver-filter           You can write a name of a driver into this file
                                to limit the debug output to requests from that
                                particular driver. Write an empty string to
                                that file to disable the filter and see
                                all errors again.
=============================== ===============================================

If you have this code compiled into your kernel it will be enabled by default.
If you want to boot without the bookkeeping anyway you can provide
@@ -692,7 +768,10 @@ of preallocated entries is defined per architecture. If it is too low for you
boot with 'dma_debug_entries=<your_desired_number>' to overwrite the
architectural default.

::

	void
	debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);

The dma-debug interface debug_dma_mapping_error() helps debug drivers that
fail to check DMA mapping errors on addresses returned by dma_map_single() and
@@ -702,4 +781,3 @@ the driver. When driver does unmap, debug_dma_unmap() checks the flag and if
this flag is still set, it prints a warning message that includes the call
trace leading up to the unmap. This interface can be called from
dma_mapping_error() routines to enable DMA mapping error check debugging.

diff --git a/Documentation/DMA-ISA-LPC.txt b/Documentation/DMA-ISA-LPC.txt
index 7a065ac4a9d1..8c2b8be6e45b 100644
--- a/Documentation/DMA-ISA-LPC.txt
+++ b/Documentation/DMA-ISA-LPC.txt
@@ -1,19 +1,20 @@
============================
DMA with ISA and LPC devices
============================

:Author: Pierre Ossman <drzeus@drzeus.cx>

This document describes how to do DMA transfers using the old ISA DMA
controller. Even though ISA is more or less dead today the LPC bus
uses the same DMA system so it will be around for quite some time.

Headers and dependencies
------------------------

To do ISA style DMA you need to include two headers::

	#include <linux/dma-mapping.h>
	#include <asm/dma.h>

The first is the generic DMA API used to convert virtual addresses to
bus addresses (see Documentation/DMA-API.txt for details).
@@ -23,8 +24,8 @@ this is not present on all platforms make sure you construct your
Kconfig to be dependent on ISA_DMA_API (not ISA) so that nobody tries
to build your driver on unsupported platforms.

Buffer allocation
-----------------

The ISA DMA controller has some very strict requirements on which
memory it can access so extra care must be taken when allocating
@@ -47,8 +48,8 @@ __GFP_RETRY_MAYFAIL and __GFP_NOWARN to make the allocator try a bit harder.
(This scarcity also means that you should allocate the buffer as
early as possible and not release it until the driver is unloaded.)

Address translation
-------------------

To translate the virtual address to a bus address, use the normal DMA
API. Do _not_ use isa_virt_to_phys() even though it does the same
@@ -61,8 +62,8 @@ Note: x86_64 had a broken DMA API when it came to ISA but has since
been fixed. If your arch has problems then fix the DMA API instead of
reverting to the ISA functions.

Channels
--------

A normal ISA DMA controller has 8 channels. The lower four are for
8-bit transfers and the upper four are for 16-bit transfers.
@@ -80,8 +81,8 @@ The ability to use 16-bit or 8-bit transfers is _not_ up to you as a
driver author but depends on what the hardware supports. Check your
specs or test different channels.

Transfer data
-------------

Now for the good stuff, the actual DMA transfer. :)

@@ -112,37 +113,37 @@ Once the DMA transfer is finished (or timed out) you should disable
the channel again. You should also check get_dma_residue() to make
sure that all data has been transferred.

Example::

	int flags, residue;

	flags = claim_dma_lock();

	clear_dma_ff();

	set_dma_mode(channel, DMA_MODE_WRITE);
	set_dma_addr(channel, phys_addr);
	set_dma_count(channel, num_bytes);

	dma_enable(channel);

	release_dma_lock(flags);

	while (!device_done());

	flags = claim_dma_lock();

	dma_disable(channel);

	residue = dma_get_residue(channel);
	if (residue != 0)
		printk(KERN_ERR "driver: Incomplete DMA transfer!"
			" %d bytes left!\n", residue);

	release_dma_lock(flags);

Suspend/resume
--------------

It is the driver's responsibility to make sure that the machine isn't
suspended while a DMA transfer is in progress. Also, all DMA settings
diff --git a/Documentation/DMA-attributes.txt b/Documentation/DMA-attributes.txt
index 44c6bc496eee..8f8d97f65d73 100644
--- a/Documentation/DMA-attributes.txt
+++ b/Documentation/DMA-attributes.txt
@@ -1,5 +1,6 @@
==============
DMA attributes
==============

This document describes the semantics of the DMA attributes that are
defined in linux/dma-mapping.h.
@@ -108,6 +109,7 @@ This is a hint to the DMA-mapping subsystem that it's probably not worth
the time to try to allocate memory to in a way that gives better TLB
efficiency (AKA it's not worth trying to build the mapping out of larger
pages). You might want to specify this if:

- You know that the accesses to this memory won't thrash the TLB.
  You might know that the accesses are likely to be sequential or
  that they aren't sequential but it's unlikely you'll ping-pong
@@ -121,11 +123,12 @@ pages). You might want to specify this if:
  the mapping to have a short lifetime then it may be worth it to
  optimize allocation (avoid coming up with large pages) instead of
  getting the slight performance win of larger pages.

Setting this hint doesn't guarantee that you won't get huge pages, but it
means that we won't try quite as hard to get them.

.. note:: At the moment DMA_ATTR_ALLOC_SINGLE_PAGES is only implemented on ARM,
	  though ARM64 patches will likely be posted soon.

DMA_ATTR_NO_WARN
----------------
@@ -142,10 +145,10 @@ problem at all, depending on the implementation of the retry mechanism.
So, this provides a way for drivers to avoid those error messages on calls
where allocation failures are not a problem, and shouldn't bother the logs.

.. note:: At the moment DMA_ATTR_NO_WARN is only implemented on PowerPC.

DMA_ATTR_PRIVILEGED
-------------------

Some advanced peripherals such as remote processors and GPUs perform
accesses to DMA buffers in both privileged "supervisor" and unprivileged
diff --git a/Documentation/IPMI.txt b/Documentation/IPMI.txt
index 6962cab997ef..aa77a25a0940 100644
--- a/Documentation/IPMI.txt
+++ b/Documentation/IPMI.txt
@@ -1,9 +1,8 @@
=====================
The Linux IPMI Driver
=====================

:Author: Corey Minyard <minyard@mvista.com> / <minyard@acm.org>

The Intelligent Platform Management Interface, or IPMI, is a
standard for controlling intelligent devices that monitor a system.
@@ -141,7 +140,7 @@ Addressing
----------

The IPMI addressing works much like IP addresses, you have an overlay
to handle the different address types. The overlay is::

	struct ipmi_addr
	{
@@ -153,7 +152,7 @@ to handle the different address types. The overlay is:
The addr_type determines what the address really is. The driver
currently understands two different types of addresses.

"System Interface" addresses are defined as::

	struct ipmi_system_interface_addr
	{
@@ -166,7 +165,7 @@ straight to the BMC on the current card. The channel must be
IPMI_BMC_CHANNEL.

Messages that are destined to go out on the IPMB bus use the
IPMI_IPMB_ADDR_TYPE address type. The format is::

	struct ipmi_ipmb_addr
	{
@@ -184,16 +183,16 @@ spec.
Messages
--------

Messages are defined as::

	struct ipmi_msg
	{
		unsigned char netfn;
		unsigned char lun;
		unsigned char cmd;
		unsigned char *data;
		int data_len;
	};

The driver takes care of adding/stripping the header information. The
data portion is just the data to be sent (do NOT put addressing info
@@ -208,7 +207,7 @@ block of data, even when receiving messages. Otherwise the driver
will have no place to put the message.

Messages coming up from the message handler in kernelland will come in
as::

	struct ipmi_recv_msg
	{
@@ -246,6 +245,7 @@ and the user should not have to care what type of SMI is below them.


Watching For Interfaces
^^^^^^^^^^^^^^^^^^^^^^^

When your code comes up, the IPMI driver may or may not have detected
if IPMI devices exist. So you might have to defer your setup until
@@ -256,6 +256,7 @@ and tell you when they come and go.


Creating the User
^^^^^^^^^^^^^^^^^

To use the message handler, you must first create a user using
ipmi_create_user. The interface number specifies which SMI you want
@@ -272,6 +273,7 @@ closing the device automatically destroys the user.


Messaging
^^^^^^^^^

To send a message from kernel-land, the ipmi_request_settime() call does
pretty much all message handling. Most of the parameters are
@@ -321,6 +323,7 @@ though, since it is tricky to manage your own buffers.


Events and Incoming Commands
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The driver takes care of polling for IPMI events and receiving
commands (commands are messages that are not responses, they are
@@ -367,7 +370,7 @@ in the system. It discovers interfaces through a host of different
methods, depending on the system.

You can specify up to four interfaces on the module load line and
control some module parameters::

	modprobe ipmi_si.o type=<type1>,<type2>....
		ports=<port1>,<port2>... addrs=<addr1>,<addr2>...
@@ -437,7 +440,7 @@ default is one. Setting to 0 is useful with the hotmod, but is
obviously only useful for modules.

When compiled into the kernel, the parameters can be specified on the
kernel command line as::

	ipmi_si.type=<type1>,<type2>...
	ipmi_si.ports=<port1>,<port2>... ipmi_si.addrs=<addr1>,<addr2>...
@@ -474,16 +477,22 @@ The driver supports a hot add and remove of interfaces. This way,
interfaces can be added or removed after the kernel is up and running.
This is done using /sys/module/ipmi_si/parameters/hotmod, which is a
write-only parameter. You write a string to this interface. The string
has the format::

	<op1>[:op2[:op3...]]

The "op"s are::

	add|remove,kcs|bt|smic,mem|i/o,<address>[,<opt1>[,<opt2>[,...]]]

You can specify more than one interface on the line. The "opt"s are::

	rsp=<regspacing>
	rsi=<regsize>
	rsh=<regshift>
	irq=<irq>
	ipmb=<ipmb slave addr>

and these have the same meanings as discussed above. Note that you
can also use this on the kernel command line for a more compact format
for specifying an interface. Note that when removing an interface,
@@ -496,7 +505,7 @@ The SMBus Driver (SSIF)
The SMBus driver allows up to 4 SMBus devices to be configured in the
system. By default, the driver will only register with something it
finds in DMI or ACPI tables. You can change this
at module load time (for a module) with::

	modprobe ipmi_ssif.o
		addr=<i2caddr1>[,<i2caddr2>[,...]]
@@ -535,7 +544,7 @@ the smb_addr parameter unless you have DMI or ACPI data to tell the
driver what to use.

When compiled into the kernel, the addresses can be specified on the
kernel command line as::

	ipmi_ssif.addr=<i2caddr1>[,<i2caddr2>[...]]
	ipmi_ssif.adapter=<adapter1>[,<adapter2>[...]]
@@ -565,9 +574,9 @@ Some users need more detailed information about a device, like where
 the address came from or the raw base device for the IPMI interface.
 You can use the IPMI smi_watcher to catch the IPMI interfaces as they
 come or go, and to grab the information, you can use the function
-ipmi_get_smi_info(), which returns the following structure:
+ipmi_get_smi_info(), which returns the following structure::
 
-struct ipmi_smi_info {
+  struct ipmi_smi_info {
 	enum ipmi_addr_src addr_src;
 	struct device *dev;
 	union {
@@ -575,7 +584,7 @@ struct ipmi_smi_info {
 			void *acpi_handle;
 		} acpi_info;
 	} addr_info;
-};
+  };
 
 Currently special info for only for SI_ACPI address sources is
 returned.  Others may be added as necessary.
@@ -590,7 +599,7 @@ Watchdog
 
 A watchdog timer is provided that implements the Linux-standard
 watchdog timer interface.  It has three module parameters that can be
-used to control it:
+used to control it::
 
 	modprobe ipmi_watchdog timeout=<t> pretimeout=<t> action=<action type>
 	    preaction=<preaction type> preop=<preop type> start_now=x
@@ -635,7 +644,7 @@ watchdog device is closed. The default value of nowayout is true
 if the CONFIG_WATCHDOG_NOWAYOUT option is enabled, or false if not.
 
 When compiled into the kernel, the kernel command line is available
-for configuring the watchdog:
+for configuring the watchdog::
 
 	ipmi_watchdog.timeout=<t> ipmi_watchdog.pretimeout=<t>
 	ipmi_watchdog.action=<action type>
@@ -675,6 +684,7 @@ also get a bunch of OEM events holding the panic string.
 
 
 The field settings of the events are:
+
 * Generator ID: 0x21 (kernel)
 * EvM Rev: 0x03 (this event is formatting in IPMI 1.0 format)
 * Sensor Type: 0x20 (OS critical stop sensor)
@@ -683,18 +693,20 @@ The field settings of the events are:
 * Event Data 1: 0xa1 (Runtime stop in OEM bytes 2 and 3)
 * Event data 2: second byte of panic string
 * Event data 3: third byte of panic string
+
 See the IPMI spec for the details of the event layout.  This event is
 always sent to the local management controller.  It will handle routing
 the message to the right place
 
 Other OEM events have the following format:
-Record ID (bytes 0-1): Set by the SEL.
-Record type (byte 2): 0xf0 (OEM non-timestamped)
-byte 3: The slave address of the card saving the panic
-byte 4: A sequence number (starting at zero)
-The rest of the bytes (11 bytes) are the panic string.  If the panic string
-is longer than 11 bytes, multiple messages will be sent with increasing
-sequence numbers.
+
+* Record ID (bytes 0-1): Set by the SEL.
+* Record type (byte 2): 0xf0 (OEM non-timestamped)
+* byte 3: The slave address of the card saving the panic
+* byte 4: A sequence number (starting at zero)
+  The rest of the bytes (11 bytes) are the panic string.  If the panic string
+  is longer than 11 bytes, multiple messages will be sent with increasing
+  sequence numbers.
 
 Because you cannot send OEM events using the standard interface, this
 function will attempt to find an SEL and add the events there.  It
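The OEM-event layout described in the hunk above (record type 0xf0, the card's slave address, a sequence number, and 11-byte panic-string chunks in 16-byte SEL records) can be sketched in Python to show how a long panic string fans out into multiple records. This is an illustration of the layout only; the helper name and the zeroed Record ID bytes are my assumptions, not kernel code:

```python
def panic_oem_records(panic: bytes, slave_addr: int):
    """Split a panic string into 16-byte OEM SEL records per the layout
    above: bytes 0-1 Record ID (left zero; the SEL assigns it), byte 2
    record type 0xf0, byte 3 slave address, byte 4 sequence number, and
    an 11-byte chunk of the panic string (zero-padded at the end)."""
    records = []
    for seq, off in enumerate(range(0, len(panic), 11)):
        chunk = panic[off:off + 11].ljust(11, b"\x00")
        records.append(bytes([0, 0, 0xF0, slave_addr, seq]) + chunk)
    return records
```

A 23-byte panic string therefore produces three records with sequence numbers 0, 1, 2, exactly the "multiple messages with increasing sequence numbers" case the text describes.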
diff --git a/Documentation/IRQ-affinity.txt b/Documentation/IRQ-affinity.txt
index 01a675175a36..29da5000836a 100644
--- a/Documentation/IRQ-affinity.txt
+++ b/Documentation/IRQ-affinity.txt
@@ -1,8 +1,11 @@
+================
+SMP IRQ affinity
+================
+
 ChangeLog:
-	Started by Ingo Molnar <mingo@redhat.com>
-	Update by Max Krasnyansky <maxk@qualcomm.com>
+	- Started by Ingo Molnar <mingo@redhat.com>
+	- Update by Max Krasnyansky <maxk@qualcomm.com>
 
-SMP IRQ affinity
 
 /proc/irq/IRQ#/smp_affinity and /proc/irq/IRQ#/smp_affinity_list specify
 which target CPUs are permitted for a given IRQ source.  It's a bitmask
@@ -16,50 +19,52 @@ will be set to the default mask. It can then be changed as described above.
 Default mask is 0xffffffff.
 
 Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting
-it to CPU4-7 (this is an 8-CPU SMP box):
+it to CPU4-7 (this is an 8-CPU SMP box)::
 
-[root@moon 44]# cd /proc/irq/44
-[root@moon 44]# cat smp_affinity
-ffffffff
+	[root@moon 44]# cd /proc/irq/44
+	[root@moon 44]# cat smp_affinity
+	ffffffff
 
-[root@moon 44]# echo 0f > smp_affinity
-[root@moon 44]# cat smp_affinity
-0000000f
-[root@moon 44]# ping -f h
-PING hell (195.4.7.3): 56 data bytes
-...
---- hell ping statistics ---
-6029 packets transmitted, 6027 packets received, 0% packet loss
-round-trip min/avg/max = 0.1/0.1/0.4 ms
-[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
-        CPU0       CPU1       CPU2       CPU3      CPU4       CPU5        CPU6       CPU7
- 44:    1068       1785       1785       1783         0          0           0          0    IO-APIC-level  eth1
+	[root@moon 44]# echo 0f > smp_affinity
+	[root@moon 44]# cat smp_affinity
+	0000000f
+	[root@moon 44]# ping -f h
+	PING hell (195.4.7.3): 56 data bytes
+	...
+	--- hell ping statistics ---
+	6029 packets transmitted, 6027 packets received, 0% packet loss
+	round-trip min/avg/max = 0.1/0.1/0.4 ms
+	[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
+	        CPU0       CPU1       CPU2       CPU3      CPU4       CPU5        CPU6       CPU7
+	 44:    1068       1785       1785       1783         0          0           0          0    IO-APIC-level  eth1
 
 As can be seen from the line above IRQ44 was delivered only to the first four
 processors (0-3).
 Now lets restrict that IRQ to CPU(4-7).
 
-[root@moon 44]# echo f0 > smp_affinity
-[root@moon 44]# cat smp_affinity
-000000f0
-[root@moon 44]# ping -f h
-PING hell (195.4.7.3): 56 data bytes
-..
---- hell ping statistics ---
-2779 packets transmitted, 2777 packets received, 0% packet loss
-round-trip min/avg/max = 0.1/0.5/585.4 ms
-[root@moon 44]# cat /proc/interrupts |  'CPU\|44:'
-        CPU0       CPU1       CPU2       CPU3      CPU4       CPU5        CPU6       CPU7
- 44:    1068       1785       1785       1783      1784       1069        1070       1069   IO-APIC-level  eth1
+::
+
+	[root@moon 44]# echo f0 > smp_affinity
+	[root@moon 44]# cat smp_affinity
+	000000f0
+	[root@moon 44]# ping -f h
+	PING hell (195.4.7.3): 56 data bytes
+	..
+	--- hell ping statistics ---
+	2779 packets transmitted, 2777 packets received, 0% packet loss
+	round-trip min/avg/max = 0.1/0.5/585.4 ms
+	[root@moon 44]# cat /proc/interrupts |  'CPU\|44:'
+	        CPU0       CPU1       CPU2       CPU3      CPU4       CPU5        CPU6       CPU7
+	 44:    1068       1785       1785       1783      1784       1069        1070       1069   IO-APIC-level  eth1
 
 This time around IRQ44 was delivered only to the last four processors.
 i.e counters for the CPU0-3 did not change.
 
-Here is an example of limiting that same irq (44) to cpus 1024 to 1031:
+Here is an example of limiting that same irq (44) to cpus 1024 to 1031::
 
-[root@moon 44]# echo 1024-1031 > smp_affinity_list
-[root@moon 44]# cat smp_affinity_list
-1024-1031
+	[root@moon 44]# echo 1024-1031 > smp_affinity_list
+	[root@moon 44]# cat smp_affinity_list
+	1024-1031
 
 Note that to do this with a bitmask would require 32 bitmasks of zero
 to follow the pertinent one.
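The smp_affinity values used in the session above are plain per-CPU bitmasks, one bit per permitted CPU. A short Python sketch (the helper name is mine, not from the document) reproduces the arithmetic and shows why smp_affinity_list is the friendlier interface for very high CPU numbers:

```python
def smp_affinity_mask(cpus):
    """Build the /proc/irq/<n>/smp_affinity hex mask for an iterable of
    CPU numbers: bit i set means IRQ delivery to CPU i is permitted.
    (The kernel zero-pads and comma-groups the value it prints back,
    e.g. "0000000f"; this returns the bare hex digits.)"""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")
```

For CPUs 1024-1031 the mask is "ff" followed by 256 hex zeros, i.e. the 32 all-zero 32-bit words the closing note mentions, which is exactly why `echo 1024-1031 > smp_affinity_list` is easier.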
diff --git a/Documentation/IRQ-domain.txt b/Documentation/IRQ-domain.txt
index 1f246eb25ca5..4a1cd7645d85 100644
--- a/Documentation/IRQ-domain.txt
+++ b/Documentation/IRQ-domain.txt
@@ -1,4 +1,6 @@
-irq_domain interrupt number mapping library
+===============================================
+The irq_domain interrupt number mapping library
+===============================================
 
 The current design of the Linux kernel uses a single large number
 space where each separate IRQ source is assigned a different number.
@@ -36,7 +38,9 @@ irq_domain also implements translation from an abstract irq_fwspec
 structure to hwirq numbers (Device Tree and ACPI GSI so far), and can
 be easily extended to support other IRQ topology data sources.
 
-=== irq_domain usage ===
+irq_domain usage
+================
+
 An interrupt controller driver creates and registers an irq_domain by
 calling one of the irq_domain_add_*() functions (each mapping method
 has a different allocator function, more on that later).  The function
@@ -62,15 +66,21 @@ If the driver has the Linux IRQ number or the irq_data pointer, and
 needs to know the associated hwirq number (such as in the irq_chip
 callbacks) then it can be directly obtained from irq_data->hwirq.
 
-=== Types of irq_domain mappings ===
+Types of irq_domain mappings
+============================
+
 There are several mechanisms available for reverse mapping from hwirq
 to Linux irq, and each mechanism uses a different allocation function.
 Which reverse map type should be used depends on the use case.  Each
 of the reverse map types are described below:
 
-==== Linear ====
-irq_domain_add_linear()
-irq_domain_create_linear()
+Linear
+------
+
+::
+
+	irq_domain_add_linear()
+	irq_domain_create_linear()
 
 The linear reverse map maintains a fixed size table indexed by the
 hwirq number.  When a hwirq is mapped, an irq_desc is allocated for
@@ -89,9 +99,13 @@ accepts a more general abstraction 'struct fwnode_handle'.
 
 The majority of drivers should use the linear map.
 
-==== Tree ====
-irq_domain_add_tree()
-irq_domain_create_tree()
+Tree
+----
+
+::
+
+	irq_domain_add_tree()
+	irq_domain_create_tree()
 
 The irq_domain maintains a radix tree map from hwirq numbers to Linux
 IRQs.  When an hwirq is mapped, an irq_desc is allocated and the
@@ -109,8 +123,12 @@ accepts a more general abstraction 'struct fwnode_handle'.
 
 Very few drivers should need this mapping.
 
-==== No Map ===-
-irq_domain_add_nomap()
+No Map
+------
+
+::
+
+	irq_domain_add_nomap()
 
 The No Map mapping is to be used when the hwirq number is
 programmable in the hardware.  In this case it is best to program the
@@ -121,10 +139,14 @@ Linux IRQ number into the hardware.
 
 Most drivers cannot use this mapping.
 
-==== Legacy ====
-irq_domain_add_simple()
-irq_domain_add_legacy()
-irq_domain_add_legacy_isa()
+Legacy
+------
+
+::
+
+	irq_domain_add_simple()
+	irq_domain_add_legacy()
+	irq_domain_add_legacy_isa()
 
 The Legacy mapping is a special case for drivers that already have a
 range of irq_descs allocated for the hwirqs.  It is used when the
@@ -163,14 +185,17 @@ that the driver using the simple domain call irq_create_mapping()
 before any irq_find_mapping() since the latter will actually work
 for the static IRQ assignment case.
 
-==== Hierarchy IRQ domain ====
+Hierarchy IRQ domain
+--------------------
+
 On some architectures, there may be multiple interrupt controllers
 involved in delivering an interrupt from the device to the target CPU.
-Let's look at a typical interrupt delivering path on x86 platforms:
+Let's look at a typical interrupt delivering path on x86 platforms::
 
-Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU
+  Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU
 
 There are three interrupt controllers involved:
+
 1) IOAPIC controller
 2) Interrupt remapping controller
 3) Local APIC controller
@@ -180,7 +205,8 @@ hardware architecture, an irq_domain data structure is built for each
 interrupt controller and those irq_domains are organized into hierarchy.
 When building irq_domain hierarchy, the irq_domain near to the device is
 child and the irq_domain near to CPU is parent. So a hierarchy structure
-as below will be built for the example above.
+as below will be built for the example above::
+
 	CPU Vector irq_domain (root irq_domain to manage CPU vectors)
 		^
 		|
@@ -190,6 +216,7 @@ as below will be built for the example above.
 	IOAPIC irq_domain (manage IOAPIC delivery entries/pins)
 
 There are four major interfaces to use hierarchy irq_domain:
+
 1) irq_domain_alloc_irqs(): allocate IRQ descriptors and interrupt
    controller related resources to deliver these interrupts.
 2) irq_domain_free_irqs(): free IRQ descriptors and interrupt controller
@@ -199,7 +226,8 @@ There are four major interfaces to use hierarchy irq_domain:
 4) irq_domain_deactivate_irq(): deactivate interrupt controller hardware
    to stop delivering the interrupt.
 
-Following changes are needed to support hierarchy irq_domain.
+Following changes are needed to support hierarchy irq_domain:
+
 1) a new field 'parent' is added to struct irq_domain; it's used to
    maintain irq_domain hierarchy information.
 2) a new field 'parent_data' is added to struct irq_data; it's used to
@@ -223,6 +251,7 @@ software architecture.
 
 For an interrupt controller driver to support hierarchy irq_domain, it
 needs to:
+
 1) Implement irq_domain_ops.alloc and irq_domain_ops.free
 2) Optionally implement irq_domain_ops.activate and
    irq_domain_ops.deactivate.
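The linear reverse map described in this file is a kernel-internal data structure; purely as a concept illustration (names and behavior are mine, not the kernel API), a toy Python model of a fixed-size table indexed by hwirq looks like this:

```python
class LinearIrqDomain:
    """Toy model of the linear reverse map: a fixed-size table indexed
    by hwirq number, mapping to allocated Linux IRQ numbers (virqs)."""

    def __init__(self, size):
        self.revmap = [0] * size   # hwirq -> virq; 0 means "unmapped"
        self._next_virq = 1        # stand-in for the kernel's IRQ allocator

    def create_mapping(self, hwirq):
        # Allocate a Linux IRQ on first use; idempotent afterwards.
        if self.revmap[hwirq] == 0:
            self.revmap[hwirq] = self._next_virq
            self._next_virq += 1
        return self.revmap[hwirq]

    def find_mapping(self, hwirq):
        # O(1) array lookup; this constant-time reverse map is why the
        # text recommends the linear type for most drivers.
        return self.revmap[hwirq]
```

The tree variant would replace the fixed array with a sparse map (radix tree in the kernel), trading the O(1) lookup for unbounded hwirq numbers.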
diff --git a/Documentation/IRQ.txt b/Documentation/IRQ.txt
index 1011e7175021..4273806a606b 100644
--- a/Documentation/IRQ.txt
+++ b/Documentation/IRQ.txt
@@ -1,4 +1,6 @@
+===============
 What is an IRQ?
+===============
 
 An IRQ is an interrupt request from a device.
 Currently they can come in over a pin, or over a packet.
diff --git a/Documentation/Intel-IOMMU.txt b/Documentation/Intel-IOMMU.txt
index 49585b6e1ea2..9dae6b47e398 100644
--- a/Documentation/Intel-IOMMU.txt
+++ b/Documentation/Intel-IOMMU.txt
@@ -1,3 +1,4 @@
+===================
 Linux IOMMU Support
 ===================
 
@@ -9,11 +10,11 @@ This guide gives a quick cheat sheet for some basic understanding.
 
 Some Keywords
 
-DMAR - DMA remapping
-DRHD - DMA Remapping Hardware Unit Definition
-RMRR - Reserved memory Region Reporting Structure
-ZLR - Zero length reads from PCI devices
-IOVA - IO Virtual address.
+- DMAR - DMA remapping
+- DRHD - DMA Remapping Hardware Unit Definition
+- RMRR - Reserved memory Region Reporting Structure
+- ZLR - Zero length reads from PCI devices
+- IOVA - IO Virtual address.
 
 Basic stuff
 -----------
@@ -33,7 +34,7 @@ devices that need to access these regions. OS is expected to setup
 unity mappings for these regions for these devices to access these regions.
 
 How is IOVA generated?
----------------------
+----------------------
 
 Well behaved drivers call pci_map_*() calls before sending command to device
 that needs to perform DMA.  Once DMA is completed and mapping is no longer
@@ -82,14 +83,14 @@ in ACPI.
 ACPI: DMAR (v001 A M I  OEMDMAR  0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0
 
 When DMAR is being processed and initialized by ACPI, prints DMAR locations
-and any RMRR's processed.
+and any RMRR's processed::
 
-ACPI DMAR:Host address width 36
-ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
-ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000
-ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
-ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff
-ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff
+	ACPI DMAR:Host address width 36
+	ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
+	ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000
+	ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
+	ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff
+	ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff
 
 When DMAR is enabled for use, you will notice..
@@ -98,10 +99,12 @@ PCI-DMA: Using DMAR IOMMU
 Fault reporting
 ---------------
 
-DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
-DMAR:[fault reason 05] PTE Write access is not set
-DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
-DMAR:[fault reason 05] PTE Write access is not set
+::
+
+	DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+	DMAR:[fault reason 05] PTE Write access is not set
+	DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+	DMAR:[fault reason 05] PTE Write access is not set
 
 TBD
 ----
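The fault reports shown above are easy to pull apart when post-processing logs; as a sketch only (the field names and helper are mine, not part of the driver), a regular expression for the first line of a report might look like:

```python
import re

# Matches e.g. "DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000"
FAULT_RE = re.compile(
    r"DMAR:\[(?P<kind>DMA (?:Read|Write))\] "
    r"Request device \[(?P<bdf>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\] "
    r"fault addr (?P<addr>[0-9a-f]+)"
)

def parse_dmar_fault(line):
    """Return (kind, bus:dev.fn, fault address) for a DMAR fault line,
    or None for lines that are not fault reports."""
    m = FAULT_RE.search(line)
    if not m:
        return None
    return m.group("kind"), m.group("bdf"), int(m.group("addr"), 16)
```

The second line of each report ("fault reason NN ...") carries the reason code and could be parsed the same way.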
diff --git a/Documentation/SAK.txt b/Documentation/SAK.txt
index 74be14679ed8..260e1d3687bd 100644
--- a/Documentation/SAK.txt
+++ b/Documentation/SAK.txt
@@ -1,5 +1,9 @@
-Linux 2.4.2 Secure Attention Key (SAK) handling
-18 March 2001, Andrew Morton
+=========================================
+Linux Secure Attention Key (SAK) handling
+=========================================
+
+:Date: 18 March 2001
+:Author: Andrew Morton
 
 An operating system's Secure Attention Key is a security tool which is
 provided as protection against trojan password capturing programs.  It
@@ -13,7 +17,7 @@ this sequence. It is only available if the kernel was compiled with
 sysrq support.
 
 The proper way of generating a SAK is to define the key sequence using
-`loadkeys'.  This will work whether or not sysrq support is compiled
+``loadkeys``.  This will work whether or not sysrq support is compiled
 into the kernel.
 
 SAK works correctly when the keyboard is in raw mode.  This means that
@@ -25,64 +29,63 @@ What key sequence should you use? Well, CTRL-ALT-DEL is used to reboot
 the machine.  CTRL-ALT-BACKSPACE is magical to the X server.  We'll
 choose CTRL-ALT-PAUSE.
 
-In your rc.sysinit (or rc.local) file, add the command
+In your rc.sysinit (or rc.local) file, add the command::
 
 	echo "control alt keycode 101 = SAK" | /bin/loadkeys
 
 And that's it!  Only the superuser may reprogram the SAK key.
 
 
-NOTES
-=====
+.. note::
 
-1: Linux SAK is said to be not a "true SAK" as is required by
+  1. Linux SAK is said to be not a "true SAK" as is required by
     systems which implement C2 level security.  This author does not
     know why.
 
 
-2: On the PC keyboard, SAK kills all applications which have
+  2. On the PC keyboard, SAK kills all applications which have
     /dev/console opened.
 
     Unfortunately this includes a number of things which you don't
     actually want killed.  This is because these applications are
     incorrectly holding /dev/console open.  Be sure to complain to your
     Linux distributor about this!
 
     You can identify processes which will be killed by SAK with the
-    command
+    command::
 
 	# ls -l /proc/[0-9]*/fd/* | grep console
 	l-wx------    1 root     root       64 Mar 18 00:46 /proc/579/fd/0 -> /dev/console
 
-    Then:
+    Then::
 
 	# ps aux|grep 579
 	root       579  0.0  0.1  1088  436  ?        S    00:43   0:00 gpm -t ps/2
 
-    So `gpm' will be killed by SAK.  This is a bug in gpm.  It should
+    So ``gpm`` will be killed by SAK.  This is a bug in gpm.  It should
     be closing standard input.  You can work around this by finding the
     initscript which launches gpm and changing it thusly:
 
-    Old:
+    Old::
 
 	daemon gpm
 
-    New:
+    New::
 
 	daemon gpm < /dev/null
 
     Vixie cron also seems to have this problem, and needs the same treatment.
 
     Also, one prominent Linux distribution has the following three
-    lines in its rc.sysinit and rc scripts:
+    lines in its rc.sysinit and rc scripts::
 
 	exec 3<&0
 	exec 4>&1
 	exec 5>&2
 
-    These commands cause *all* daemons which are launched by the
+    These commands cause **all** daemons which are launched by the
     initscripts to have file descriptors 3, 4 and 5 attached to
     /dev/console.  So SAK kills them all.  A workaround is to simply
     delete these lines, but this may cause system management
     applications to malfunction - test everything well.
 
diff --git a/Documentation/SM501.txt b/Documentation/SM501.txt
index 561826f82093..882507453ba4 100644
--- a/Documentation/SM501.txt
+++ b/Documentation/SM501.txt
@@ -1,7 +1,10 @@
-		SM501 Driver
-		============
+.. include:: <isonum.txt>
 
-Copyright 2006, 2007 Simtec Electronics
+============
+SM501 Driver
+============
+
+:Copyright: |copy| 2006, 2007 Simtec Electronics
 
 The Silicon Motion SM501 multimedia companion chip is a multifunction device
 which may provide numerous interfaces including USB host controller USB gadget,
diff --git a/Documentation/bcache.txt b/Documentation/bcache.txt
index a9259b562d5c..c0ce64d75bbf 100644
--- a/Documentation/bcache.txt
+++ b/Documentation/bcache.txt
@@ -1,10 +1,15 @@
+============================
+A block layer cache (bcache)
+============================
+
 Say you've got a big slow raid 6, and an ssd or three.  Wouldn't it be
 nice if you could use them as cache...  Hence bcache.
 
 Wiki and git repositories are at:
-  http://bcache.evilpiepirate.org
-  http://evilpiepirate.org/git/linux-bcache.git
-  http://evilpiepirate.org/git/bcache-tools.git
+
+  - http://bcache.evilpiepirate.org
+  - http://evilpiepirate.org/git/linux-bcache.git
+  - http://evilpiepirate.org/git/bcache-tools.git
 
 It's designed around the performance characteristics of SSDs - it only allocates
 in erase block sized buckets, and it uses a hybrid btree/log to track cached
@@ -37,17 +42,19 @@ to be flushed.
 
 Getting started:
 You'll need make-bcache from the bcache-tools repository.  Both the cache device
-and backing device must be formatted before use.
+and backing device must be formatted before use::
+
   make-bcache -B /dev/sdb
   make-bcache -C /dev/sdc
 
 make-bcache has the ability to format multiple devices at the same time - if
 you format your backing devices and cache device at the same time, you won't
-have to manually attach:
+have to manually attach::
+
   make-bcache -B /dev/sda /dev/sdb -C /dev/sdc
 
 bcache-tools now ships udev rules, and bcache devices are known to the kernel
-immediately.  Without udev, you can manually register devices like this:
+immediately.  Without udev, you can manually register devices like this::
 
   echo /dev/sdb > /sys/fs/bcache/register
   echo /dev/sdc > /sys/fs/bcache/register
@@ -60,16 +67,16 @@ slow devices as bcache backing devices without a cache, and you can choose to ad
 a caching device later.
 See 'ATTACHING' section below.
 
-The devices show up as:
+The devices show up as::
 
   /dev/bcache<N>
 
-As well as (with udev):
+As well as (with udev)::
 
   /dev/bcache/by-uuid/<uuid>
   /dev/bcache/by-label/<label>
 
-To get started:
+To get started::
 
   mkfs.ext4 /dev/bcache0
   mount /dev/bcache0 /mnt
@@ -81,13 +88,13 @@ Cache devices are managed as sets; multiple caches per set isn't supported yet
 but will allow for mirroring of metadata and dirty data in the future.  Your new
 cache set shows up as /sys/fs/bcache/<UUID>
 
-ATTACHING
+Attaching
 ---------
 
 After your cache device and backing device are registered, the backing device
 must be attached to your cache set to enable caching.  Attaching a backing
 device to a cache set is done thusly, with the UUID of the cache set in
-/sys/fs/bcache:
+/sys/fs/bcache::
 
   echo <CSET-UUID> > /sys/block/bcache0/bcache/attach
 
@@ -97,7 +104,7 @@ your bcache devices. If a backing device has data in a cache somewhere, the
 important if you have writeback caching turned on.
 
 If you're booting up and your cache device is gone and never coming back, you
-can force run the backing device:
+can force run the backing device::
 
  echo 1 > /sys/block/sdb/bcache/running
 
@@ -110,7 +117,7 @@ but all the cached data will be invalidated. If there was dirty data in the
 cache, don't expect the filesystem to be recoverable - you will have massive
 filesystem corruption, though ext4's fsck does work miracles.
 
-ERROR HANDLING
+Error Handling
 --------------
 
 Bcache tries to transparently handle IO errors to/from the cache device without
@@ -134,25 +141,27 @@ the backing devices to passthrough mode.
  read some of the dirty data, though.
 
 
-HOWTO/COOKBOOK
+Howto/cookbook
 --------------
 
 A) Starting a bcache with a missing caching device
 
 If registering the backing device doesn't help, it's already there, you just need
-to force it to run without the cache:
+to force it to run without the cache::
+
  host:~# echo /dev/sdb1 > /sys/fs/bcache/register
  [ 119.844831] bcache: register_bcache() error opening /dev/sdb1: device already registered
 
 Next, you try to register your caching device if it's present. However
 if it's absent, or registration fails for some reason, you can still
-start your bcache without its cache, like so:
+start your bcache without its cache, like so::
+
  host:/sys/block/sdb/sdb1/bcache# echo 1 > running
 
 Note that this may cause data loss if you were running in writeback mode.
 
 
-B) Bcache does not find its cache
+B) Bcache does not find its cache::
 
  host:/sys/block/md5/bcache# echo 0226553a-37cf-41d5-b3ce-8b1e944543a8 > attach
  [ 1933.455082] bcache: bch_cached_dev_attach() Couldn't find uuid for md5 in set
@@ -160,7 +169,8 @@ B) Bcache does not find its cache
  [ 1933.478179] : cache set not found
 
 In this case, the caching device was simply not registered at boot
-or disappeared and came back, and needs to be (re-)registered:
+or disappeared and came back, and needs to be (re-)registered::
+
  host:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register
 
 
@@ -180,7 +190,8 @@ device is still available at an 8KiB offset. So either via a loopdev
 of the backing device created with --offset 8K, or any value defined by
 --data-offset when you originally formatted bcache with `make-bcache`.
 
-For example:
+For example::
+
  losetup -o 8192 /dev/loop0 /dev/your_bcache_backing_dev
 
 This should present your unmodified backing device data in /dev/loop0
@@ -191,33 +202,38 @@ cache device without loosing data.
 
 E) Wiping a cache device
 
-host:~# wipefs -a /dev/sdh2
-16 bytes were erased at offset 0x1018 (bcache)
-they were: c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
+::
+
+ host:~# wipefs -a /dev/sdh2
+ 16 bytes were erased at offset 0x1018 (bcache)
+ they were: c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
+
+After you boot back with bcache enabled, you recreate the cache and attach it::
 
-After you boot back with bcache enabled, you recreate the cache and attach it:
-host:~# make-bcache -C /dev/sdh2
-UUID: 7be7e175-8f4c-4f99-94b2-9c904d227045
-Set UUID: 5bc072a8-ab17-446d-9744-e247949913c1
-version: 0
-nbuckets: 106874
-block_size: 1
-bucket_size: 1024
-nr_in_set: 1
-nr_this_dev: 0
-first_bucket: 1
-[ 650.511912] bcache: run_cache_set() invalidating existing data
-[ 650.549228] bcache: register_cache() registered cache device sdh2
+ host:~# make-bcache -C /dev/sdh2
+ UUID: 7be7e175-8f4c-4f99-94b2-9c904d227045
+ Set UUID: 5bc072a8-ab17-446d-9744-e247949913c1
+ version: 0
+ nbuckets: 106874
+ block_size: 1
+ bucket_size: 1024
+ nr_in_set: 1
+ nr_this_dev: 0
+ first_bucket: 1
+ [ 650.511912] bcache: run_cache_set() invalidating existing data
+ [ 650.549228] bcache: register_cache() registered cache device sdh2
 
-start backing device with missing cache:
-host:/sys/block/md5/bcache# echo 1 > running
+start backing device with missing cache::
 
-attach new cache:
-host:/sys/block/md5/bcache# echo 5bc072a8-ab17-446d-9744-e247949913c1 > attach
-[ 865.276616] bcache: bch_cached_dev_attach() Caching md5 as bcache0 on set 5bc072a8-ab17-446d-9744-e247949913c1
+ host:/sys/block/md5/bcache# echo 1 > running
 
+attach new cache::
 
-F) Remove or replace a caching device
+ host:/sys/block/md5/bcache# echo 5bc072a8-ab17-446d-9744-e247949913c1 > attach
+ [ 865.276616] bcache: bch_cached_dev_attach() Caching md5 as bcache0 on set 5bc072a8-ab17-446d-9744-e247949913c1
+
+
+F) Remove or replace a caching device::
 
  host:/sys/block/sda/sda7/bcache# echo 1 > detach
  [ 695.872542] bcache: cached_dev_detach_finish() Caching disabled for sda7
@@ -226,13 +242,15 @@ F) Remove or replace a caching device
  wipefs: error: /dev/nvme0n1p4: probing initialization failed: Device or resource busy
  Ooops, it's disabled, but not unregistered, so it's still protected
 
-We need to go and unregister it:
+We need to go and unregister it::
+
  host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# ls -l cache0
  lrwxrwxrwx 1 root root 0 Feb 25 18:33 cache0 -> ../../../devices/pci0000:00/0000:00:1d.0/0000:70:00.0/nvme/nvme0/nvme0n1/nvme0n1p4/bcache/
  host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# echo 1 > stop
  kernel: [ 917.041908] bcache: cache_set_free() Cache set b7ba27a1-2398-4649-8ae3-0959f57ba128 unregistered
 
-Now we can wipe it:
+Now we can wipe it::
+
  host:~# wipefs -a /dev/nvme0n1p4
  /dev/nvme0n1p4: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
 
@@ -252,40 +270,44 @@ if there are any active backing or caching devices left on it:
 
 1) Is it present in /dev/bcache* ? (there are times where it won't be)
 
-If so, it's easy:
+ If so, it's easy::
+
  host:/sys/block/bcache0/bcache# echo 1 > stop
 
-2) But if your backing device is gone, this won't work:
+2) But if your backing device is gone, this won't work::
+
  host:/sys/block/bcache0# cd bcache
  bash: cd: bcache: No such file or directory
 
-In this case, you may have to unregister the dmcrypt block device that
-references this bcache to free it up:
+ In this case, you may have to unregister the dmcrypt block device that
+ references this bcache to free it up::
+
  host:~# dmsetup remove oldds1
  bcache: bcache_device_free() bcache0 stopped
  bcache: cache_set_free() Cache set 5bc072a8-ab17-446d-9744-e247949913c1 unregistered
 
-This causes the backing bcache to be removed from /sys/fs/bcache and
-then it can be reused. This would be true of any block device stacking
-where bcache is a lower device.
+ This causes the backing bcache to be removed from /sys/fs/bcache and
+ then it can be reused. This would be true of any block device stacking
+ where bcache is a lower device.
+
+3) In other cases, you can also look in /sys/fs/bcache/::
 
-3) In other cases, you can also look in /sys/fs/bcache/:
+ host:/sys/fs/bcache# ls -l */{cache?,bdev?}
+ lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/bdev1 -> ../../../devices/virtual/block/dm-1/bcache/
+ lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/cache0 -> ../../../devices/virtual/block/dm-4/bcache/
+ lrwxrwxrwx 1 root root 0 Mar 5 09:39 5bc072a8-ab17-446d-9744-e247949913c1/cache0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/ata10/host9/target9:0:0/9:0:0:0/block/sdl/sdl2/bcache/
 
-host:/sys/fs/bcache# ls -l */{cache?,bdev?}
-lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/bdev1 -> ../../../devices/virtual/block/dm-1/bcache/
-lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/cache0 -> ../../../devices/virtual/block/dm-4/bcache/
-lrwxrwxrwx 1 root root 0 Mar 5 09:39 5bc072a8-ab17-446d-9744-e247949913c1/cache0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/ata10/host9/target9:0:0/9:0:0:0/block/sdl/sdl2/bcache/
+ The device names will show which UUID is relevant, cd in that directory
+ and stop the cache::
 
-The device names will show which UUID is relevant, cd in that directory
-and stop the cache:
  host:/sys/fs/bcache/5bc072a8-ab17-446d-9744-e247949913c1# echo 1 > stop
 
-This will free up bcache references and let you reuse the partition for
-other purposes.
+ This will free up bcache references and let you reuse the partition for
+ other purposes.
 
 
 
-TROUBLESHOOTING PERFORMANCE
+Troubleshooting performance
 ---------------------------
 
 Bcache has a bunch of config options and tunables. The defaults are intended to
@@ -301,11 +323,13 @@ want for getting the best possible numbers when benchmarking.
  raid stripe size to get the disk multiples that you would like.
 
  For example: If you have a 64k stripe size, then the following offset
- would provide alignment for many common RAID5 data spindle counts:
+ would provide alignment for many common RAID5 data spindle counts::
+
  64k * 2*2*2*3*3*5*7 bytes = 161280k
 
  That space is wasted, but for only 157.5MB you can grow your RAID 5
- volume to the following data-spindle counts without re-aligning:
+ volume to the following data-spindle counts without re-aligning::
+
  3,4,5,6,7,8,9,10,12,14,15,18,20,21 ...
 
  - Bad write performance
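The alignment arithmetic in the hunk above is quick to sanity-check with shell arithmetic; a small sketch, using only the numbers from the example (nothing here is bcache-specific):

```shell
#!/bin/sh
# Check the stripe-alignment example: 64k * (2*2*2*3*3*5*7) = 161280k.
# 2520 is divisible by every data-spindle count listed, so a 161280k
# offset stays aligned to n * 64k for each of them.
echo $((64 * 2*2*2*3*3*5*7))

for n in 3 4 5 6 7 8 9 10 12 14 15 18 20 21; do
    # remainder 0 means the offset is still stripe-aligned for n spindles
    echo "$n spindles: remainder $((161280 % (64 * n)))"
done
```

Every listed count prints remainder 0, which is exactly why the document calls 161280k a safe one-time choice.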
@@ -313,9 +337,9 @@ want for getting the best possible numbers when benchmarking.
  If write performance is not what you expected, you probably wanted to be
  running in writeback mode, which isn't the default (not due to a lack of
  maturity, but simply because in writeback mode you'll lose data if something
- happens to your SSD)
+ happens to your SSD)::
 
  # echo writeback > /sys/block/bcache0/bcache/cache_mode
 
  - Bad performance, or traffic not going to the SSD that you'd expect
 
@@ -325,13 +349,13 @@ want for getting the best possible numbers when benchmarking.
  accessed data out of your cache.
 
  But if you want to benchmark reads from cache, and you start out with fio
- writing an 8 gigabyte test file - so you want to disable that.
+ writing an 8 gigabyte test file - so you want to disable that::
 
  # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
 
- To set it back to the default (4 mb), do
+ To set it back to the default (4 mb), do::
 
  # echo 4M > /sys/block/bcache0/bcache/sequential_cutoff
 
  - Traffic's still going to the spindle/still getting cache misses
 
@@ -344,10 +368,10 @@ want for getting the best possible numbers when benchmarking.
  throttles traffic if the latency exceeds a threshold (it does this by
  cranking down the sequential bypass).
 
- You can disable this if you need to by setting the thresholds to 0:
+ You can disable this if you need to by setting the thresholds to 0::
 
  # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
  # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
 
  The default is 2000 us (2 milliseconds) for reads, and 20000 for writes.
 
@@ -369,7 +393,7 @@ want for getting the best possible numbers when benchmarking.
  a fix for the issue there).
 
 
-SYSFS - BACKING DEVICE
+Sysfs - backing device
 ----------------------
 
 Available at /sys/block/<bdev>/bcache, /sys/block/bcache*/bcache and
@@ -454,7 +478,8 @@ writeback_running
  still be added to the cache until it is mostly full; only meant for
  benchmarking. Defaults to on.
 
-SYSFS - BACKING DEVICE STATS:
+Sysfs - backing device stats
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 There are directories with these numbers for a running total, as well as
 versions that decay over the past day, hour and 5 minutes; they're also
@@ -463,14 +488,11 @@ aggregated in the cache set directory as well.
 bypassed
  Amount of IO (both reads and writes) that has bypassed the cache
 
-cache_hits
-cache_misses
-cache_hit_ratio
+cache_hits, cache_misses, cache_hit_ratio
  Hits and misses are counted per individual IO as bcache sees them; a
  partial hit is counted as a miss.
 
-cache_bypass_hits
-cache_bypass_misses
+cache_bypass_hits, cache_bypass_misses
  Hits and misses for IO that is intended to skip the cache are still counted,
  but broken out here.
 
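The counters in this hunk can be read straight from sysfs; a minimal sketch, assuming a registered bcache0 device and the stats_total directory holding the running totals described above (deriving the percentage by hand just mirrors what cache_hit_ratio reports):

```shell
#!/bin/sh
# Read the running-total hit/miss counters for bcache0 and compute a
# hit ratio; the stats_total path is assumed from the text above.
STATS=/sys/block/bcache0/bcache/stats_total
hits=$(cat "$STATS/cache_hits")
misses=$(cat "$STATS/cache_misses")
total=$((hits + misses))
# Guard against division by zero on a freshly attached device.
[ "$total" -gt 0 ] && echo "hit ratio: $((100 * hits / total))%"
```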
@@ -482,7 +504,8 @@ cache_miss_collisions
 cache_readaheads
  Count of times readahead occurred.
 
-SYSFS - CACHE SET:
+Sysfs - cache set
+~~~~~~~~~~~~~~~~~
 
 Available at /sys/fs/bcache/<cset-uuid>
 
@@ -520,8 +543,7 @@ flash_vol_create
  Echoing a size to this file (in human readable units, k/M/G) creates a thinly
  provisioned volume backed by the cache set.
 
-io_error_halflife
-io_error_limit
+io_error_halflife, io_error_limit
  These determines how many errors we accept before disabling the cache.
  Each error is decayed by the half life (in # ios). If the decaying count
  reaches io_error_limit dirty data is written out and the cache is disabled.
@@ -545,7 +567,8 @@ unregister
  Detaches all backing devices and closes the cache devices; if dirty data is
  present it will disable writeback caching and wait for it to be flushed.
 
-SYSFS - CACHE SET INTERNAL:
+Sysfs - cache set internal
+~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 This directory also exposes timings for a number of internal operations, with
 separate files for average duration, average frequency, last occurrence and max
@@ -574,7 +597,8 @@ cache_read_races
 trigger_gc
  Writing to this file forces garbage collection to run.
 
-SYSFS - CACHE DEVICE:
+Sysfs - Cache device
+~~~~~~~~~~~~~~~~~~~~
 
 Available at /sys/block/<cdev>/bcache
 
diff --git a/Documentation/bt8xxgpio.txt b/Documentation/bt8xxgpio.txt
index d8297e4ebd26..a845feb074de 100644
--- a/Documentation/bt8xxgpio.txt
+++ b/Documentation/bt8xxgpio.txt
@@ -1,12 +1,8 @@
-===============================================================
-== BT8XXGPIO driver ==
-== ==
-== A driver for a selfmade cheap BT8xx based PCI GPIO-card ==
-== ==
-== For advanced documentation, see ==
-== http://www.bu3sch.de/btgpio.php ==
-===============================================================
+===================================================================
+A driver for a selfmade cheap BT8xx based PCI GPIO-card (bt8xxgpio)
+===================================================================
 
+For advanced documentation, see http://www.bu3sch.de/btgpio.php
 
 A generic digital 24-port PCI GPIO card can be built out of an ordinary
 Brooktree bt848, bt849, bt878 or bt879 based analog TV tuner card. The
@@ -17,9 +13,8 @@ The bt8xx chip does have 24 digital GPIO ports.
 These ports are accessible via 24 pins on the SMD chip package.
 
 
-==============================================
-== How to physically access the GPIO pins ==
-==============================================
+How to physically access the GPIO pins
+======================================
 
 The are several ways to access these pins. One might unsolder the whole chip
 and put it on a custom PCI board, or one might only unsolder each individual
@@ -27,7 +22,7 @@ GPIO pin and solder that to some tiny wire. As the chip package really is tiny
 there are some advanced soldering skills needed in any case.
 
 The physical pinouts are drawn in the following ASCII art.
-The GPIO pins are marked with G00-G23
+The GPIO pins are marked with G00-G23::
 
  G G G G G G G G G G G G G G G G G G
  0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
diff --git a/Documentation/btmrvl.txt b/Documentation/btmrvl.txt
index 34916a46c099..ec57740ead0c 100644
--- a/Documentation/btmrvl.txt
+++ b/Documentation/btmrvl.txt
@@ -1,18 +1,16 @@
-=======================================================================
- README for btmrvl driver
-=======================================================================
-
+=============
+btmrvl driver
+=============
 
 All commands are used via debugfs interface.
 
-=====================
-Set/get driver configurations:
+Set/get driver configurations
+=============================
 
 Path: /debug/btmrvl/config/
 
-gpiogap=[n]
-hscfgcmd
- These commands are used to configure the host sleep parameters.
+gpiogap=[n], hscfgcmd
+ These commands are used to configure the host sleep parameters::
  bit 8:0 -- Gap
  bit 16:8 -- GPIO
 
@@ -23,7 +21,8 @@ hscfgcmd
  where Gap is the gap in milli seconds between wakeup signal and
  wakeup event, or 0xff for special host sleep setting.
 
- Usage:
+ Usage::
+
  # Use SDIO interface to wake up the host and set GAP to 0x80:
  echo 0xff80 > /debug/btmrvl/config/gpiogap
  echo 1 > /debug/btmrvl/config/hscfgcmd
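The gpiogap word packs the two host-sleep fields described in this hunk; a hedged sketch of decoding the 0xff80 example (the exact shift/mask values are an assumption read off the "bit 8:0 -- Gap, bit 16:8 -- GPIO" layout above):

```shell
#!/bin/sh
# Decode the example gpiogap value 0xff80 per the layout in the text:
# the high byte holds the GPIO number used to wake the host
# (0xff = use the SDIO interface), the low byte holds the Gap in ms.
val=$((0xff80))
printf 'GPIO: 0x%02x\n' $(( (val >> 8) & 0xff ))
printf 'Gap:  0x%02x\n' $(( val & 0xff ))
```

For the 0xff80 example this yields GPIO 0xff and Gap 0x80, matching the "wake via SDIO, GAP 0x80" usage shown above.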
@@ -32,15 +31,16 @@ hscfgcmd
  echo 0x03ff > /debug/btmrvl/config/gpiogap
  echo 1 > /debug/btmrvl/config/hscfgcmd
 
-psmode=[n]
-pscmd
+psmode=[n], pscmd
  These commands are used to enable/disable auto sleep mode
 
- where the option is:
+ where the option is::
+
  1 -- Enable auto sleep mode
  0 -- Disable auto sleep mode
 
- Usage:
+ Usage::
+
  # Enable auto sleep mode
  echo 1 > /debug/btmrvl/config/psmode
  echo 1 > /debug/btmrvl/config/pscmd
@@ -50,15 +50,16 @@ pscmd
  echo 1 > /debug/btmrvl/config/pscmd
 
 
-hsmode=[n]
-hscmd
+hsmode=[n], hscmd
  These commands are used to enable host sleep or wake up firmware
 
- where the option is:
+ where the option is::
+
  1 -- Enable host sleep
  0 -- Wake up firmware
 
- Usage:
+ Usage::
+
  # Enable host sleep
  echo 1 > /debug/btmrvl/config/hsmode
  echo 1 > /debug/btmrvl/config/hscmd
@@ -68,12 +69,13 @@ hscmd
  echo 1 > /debug/btmrvl/config/hscmd
 
 
-======================
-Get driver status:
+Get driver status
+=================
 
 Path: /debug/btmrvl/status/
 
-Usage:
+Usage::
+
  cat /debug/btmrvl/status/<args>
 
 where the args are:
@@ -90,14 +92,17 @@ hsstate
 txdnldrdy
  This command displays the value of Tx download ready flag.
 
-
-=====================
+Issuing a raw hci command
+=========================
 
 Use hcitool to issue raw hci command, refer to hcitool manual
 
- Usage: Hcitool cmd <ogf> <ocf> [Parameters]
+Usage::
+
+ Hcitool cmd <ogf> <ocf> [Parameters]
+
+Interface Control Command::
 
- Interface Control Command
  hcitool cmd 0x3f 0x5b 0xf5 0x01 0x00 --Enable All interface
  hcitool cmd 0x3f 0x5b 0xf5 0x01 0x01 --Enable Wlan interface
  hcitool cmd 0x3f 0x5b 0xf5 0x01 0x02 --Enable BT interface
@@ -105,13 +110,13 @@ Use hcitool to issue raw hci command, refer to hcitool manual
  hcitool cmd 0x3f 0x5b 0xf5 0x00 0x01 --Disable Wlan interface
  hcitool cmd 0x3f 0x5b 0xf5 0x00 0x02 --Disable BT interface
 
-=======================================================================
-
+SD8688 firmware
+===============
 
-SD8688 firmware:
+Images:
 
-/lib/firmware/sd8688_helper.bin
-/lib/firmware/sd8688.bin
+- /lib/firmware/sd8688_helper.bin
+- /lib/firmware/sd8688.bin
 
 
 The images can be downloaded from:
diff --git a/Documentation/bus-virt-phys-mapping.txt b/Documentation/bus-virt-phys-mapping.txt
index 2bc55ff3b4d1..4bb07c2f3e7d 100644
--- a/Documentation/bus-virt-phys-mapping.txt
+++ b/Documentation/bus-virt-phys-mapping.txt
@@ -1,17 +1,27 @@
-[ NOTE: The virt_to_bus() and bus_to_virt() functions have been
+==========================================================
+How to access I/O mapped memory from within device drivers
+==========================================================
+
+:Author: Linus
+
+.. warning::
+
+ The virt_to_bus() and bus_to_virt() functions have been
  superseded by the functionality provided by the PCI DMA interface
  (see Documentation/DMA-API-HOWTO.txt). They continue
  to be documented below for historical purposes, but new code
- must not use them. --davidm 00/12/12 ]
+ must not use them. --davidm 00/12/12
 
-[ This is a mail message in response to a query on IO mapping, thus the
- strange format for a "document" ]
+::
+
+ [ This is a mail message in response to a query on IO mapping, thus the
+ strange format for a "document" ]
 
 The AHA-1542 is a bus-master device, and your patch makes the driver give the
 controller the physical address of the buffers, which is correct on x86
 (because all bus master devices see the physical memory mappings directly).
 
-However, on many setups, there are actually _three_ different ways of looking
+However, on many setups, there are actually **three** different ways of looking
 at memory addresses, and in this case we actually want the third, the
 so-called "bus address".
 
@@ -38,7 +48,7 @@ because the memory and the devices share the same address space, and that is
 not generally necessarily true on other PCI/ISA setups.
 
 Now, just as an example, on the PReP (PowerPC Reference Platform), the
-CPU sees a memory map something like this (this is from memory):
+CPU sees a memory map something like this (this is from memory)::
 
  0-2 GB "real memory"
  2 GB-3 GB "system IO" (inb/out and similar accesses on x86)
@@ -52,7 +62,7 @@ So when the CPU wants any bus master to write to physical memory 0, it
 has to give the master address 0x80000000 as the memory address.
 
 So, for example, depending on how the kernel is actually mapped on the
-PPC, you can end up with a setup like this:
+PPC, you can end up with a setup like this::
 
  physical address: 0
  virtual address: 0xC0000000
@@ -61,7 +71,7 @@ PPC, you can end up with a setup like this:
 where all the addresses actually point to the same thing. It's just seen
 through different translations..
 
-Similarly, on the Alpha, the normal translation is
+Similarly, on the Alpha, the normal translation is::
 
  physical address: 0
  virtual address: 0xfffffc0000000000
@@ -70,7 +80,7 @@ Similarly, on the Alpha, the normal translation is
70(but there are also Alphas where the physical address and the bus address 80(but there are also Alphas where the physical address and the bus address
71are the same). 81are the same).
72 82
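On a setup like the ones quoted above, the three address spaces differ only by constant offsets, which can be sketched in plain userspace C. The constants and helper names below are invented for illustration (modeled on the PReP map above); they are not the kernel's real virt_to_phys()/virt_to_bus() implementations:

```c
#include <assert.h>

/*
 * Toy model of a PReP-style memory map: RAM is at physical 0,
 * the kernel maps it at a fixed virtual offset, and bus masters
 * see the same RAM at bus address 0x80000000.  All constants
 * here are illustrative, not real kernel symbols.
 */
#define TOY_PAGE_OFFSET 0xC0000000UL
#define TOY_BUS_OFFSET  0x80000000UL

static unsigned long toy_virt_to_phys(unsigned long vaddr)
{
	return vaddr - TOY_PAGE_OFFSET;
}

static unsigned long toy_phys_to_bus(unsigned long paddr)
{
	return paddr + TOY_BUS_OFFSET;
}

/* what a virt_to_bus()-like helper would compute on this setup */
static unsigned long toy_virt_to_bus(unsigned long vaddr)
{
	return toy_phys_to_bus(toy_virt_to_phys(vaddr));
}
```

On a real port these translations live behind asm/io.h precisely so that drivers never hard-code such offsets.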
-Anyway, the way to look up all these translations, you do
+Anyway, the way to look up all these translations, you do::

 	#include <asm/io.h>

@@ -81,8 +91,8 @@ Anyway, the way to look up all these translations, you do

 Now, when do you need these?

-You want the _virtual_ address when you are actually going to access that
-pointer from the kernel. So you can have something like this:
+You want the **virtual** address when you are actually going to access that
+pointer from the kernel. So you can have something like this::

 	/*
 	 * this is the hardware "mailbox" we use to communicate with
@@ -104,7 +114,7 @@ pointer from the kernel. So you can have something like this:
 	...

 on the other hand, you want the bus address when you have a buffer that
-you want to give to the controller:
+you want to give to the controller::

 	/* ask the controller to read the sense status into "sense_buffer" */
 	mbox.bufstart = virt_to_bus(&sense_buffer);
@@ -112,7 +122,7 @@ you want to give to the controller:
 	mbox.status = 0;
 	notify_controller(&mbox);

-And you generally _never_ want to use the physical address, because you can't
+And you generally **never** want to use the physical address, because you can't
 use that from the CPU (the CPU only uses translated virtual addresses), and
 you can't use it from the bus master.

@@ -124,8 +134,10 @@ be remapped as measured in units of pages, a.k.a. the pfn (the memory
 management layer doesn't know about devices outside the CPU, so it
 shouldn't need to know about "bus addresses" etc).

-NOTE NOTE NOTE! The above is only one part of the whole equation. The above
-only talks about "real memory", that is, CPU memory (RAM).
+.. note::
+
+   The above is only one part of the whole equation. The above
+   only talks about "real memory", that is, CPU memory (RAM).

 There is a completely different type of memory too, and that's the "shared
 memory" on the PCI or ISA bus. That's generally not RAM (although in the case
@@ -137,20 +149,22 @@ whatever, and there is only one way to access it: the readb/writeb and
 related functions. You should never take the address of such memory, because
 there is really nothing you can do with such an address: it's not
 conceptually in the same memory space as "real memory" at all, so you cannot
-just dereference a pointer. (Sadly, on x86 it _is_ in the same memory space,
+just dereference a pointer. (Sadly, on x86 it **is** in the same memory space,
 so on x86 it actually works to just deference a pointer, but it's not
 portable).

-For such memory, you can do things like
+For such memory, you can do things like:
+
+ - reading::

- - reading:
 	/*
 	 * read first 32 bits from ISA memory at 0xC0000, aka
 	 * C000:0000 in DOS terms
 	 */
 	unsigned int signature = isa_readl(0xC0000);

- - remapping and writing:
+ - remapping and writing::
+
 	/*
 	 * remap framebuffer PCI memory area at 0xFC000000,
 	 * size 1MB, so that we can access it: We can directly
@@ -165,7 +179,8 @@ For such memory, you can do things like
 	/* unmap when we unload the driver */
 	iounmap(baseptr);

- - copying and clearing:
+ - copying and clearing::
+
 	/* get the 6-byte Ethernet address at ISA address E000:0040 */
 	memcpy_fromio(kernel_buffer, 0xE0040, 6);
 	/* write a packet to the driver */
@@ -181,10 +196,10 @@ happy that your driver works ;)
 Note that kernel versions 2.0.x (and earlier) mistakenly called the
 ioremap() function "vremap()". ioremap() is the proper name, but I
 didn't think straight when I wrote it originally. People who have to
-support both can do something like:
+support both can do something like::

 	/* support old naming silliness */
 	#if LINUX_VERSION_CODE < 0x020100
 	#define ioremap vremap
 	#define iounmap vfree
 	#endif
@@ -196,13 +211,10 @@ And the above sounds worse than it really is. Most real drivers really
 don't do all that complex things (or rather: the complexity is not so
 much in the actual IO accesses as in error handling and timeouts etc).
 It's generally not hard to fix drivers, and in many cases the code
-actually looks better afterwards:
+actually looks better afterwards::

 	unsigned long signature = *(unsigned int *) 0xC0000;
 	vs
 	unsigned long signature = readl(0xC0000);

 I think the second version actually is more readable, no?
-
-		Linus
-
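The readl() rewrite recommended above can be imitated in ordinary userspace C to show the accessor shape. This is only a sketch: the toy function below reads from a normal array, whereas the real readl() performs an MMIO access with the architecture's required ordering and byte-swapping:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy stand-in for readl(): a volatile 32-bit load that goes
 * through an accessor function instead of a bare pointer
 * dereference.  The real readl() also handles barriers and
 * endianness; this simulation only reads ordinary memory.
 */
static uint32_t toy_readl(const volatile void *addr)
{
	return *(const volatile uint32_t *)addr;
}

/* stand-in for a mapped device region (e.g. an ISA ROM signature) */
static uint32_t fake_isa_rom[2] = { 0xAA55AA55u, 0x12345678u };
```

The point of the accessor is that every device access goes through one portable choke point, which the port can implement however its bus requires.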
diff --git a/Documentation/cachetlb.txt b/Documentation/cachetlb.txt
index 3f9f808b5119..6eb9d3f090cd 100644
--- a/Documentation/cachetlb.txt
+++ b/Documentation/cachetlb.txt
@@ -1,7 +1,8 @@
-		Cache and TLB Flushing
-		     Under Linux
+==================================
+Cache and TLB Flushing Under Linux
+==================================

-	David S. Miller <davem@redhat.com>
+:Author: David S. Miller <davem@redhat.com>

 This document describes the cache/tlb flushing interfaces called
 by the Linux VM subsystem. It enumerates over each interface,
@@ -28,7 +29,7 @@ Therefore when software page table changes occur, the kernel will
 invoke one of the following flush methods _after_ the page table
 changes occur:

-1) void flush_tlb_all(void)
+1) ``void flush_tlb_all(void)``

 	The most severe flush of all. After this interface runs,
 	any previous page table modification whatsoever will be
@@ -37,7 +38,7 @@ changes occur:
 	This is usually invoked when the kernel page tables are
 	changed, since such translations are "global" in nature.

-2) void flush_tlb_mm(struct mm_struct *mm)
+2) ``void flush_tlb_mm(struct mm_struct *mm)``

 	This interface flushes an entire user address space from
 	the TLB. After running, this interface must make sure that
@@ -49,8 +50,8 @@ changes occur:
 	page table operations such as what happens during
 	fork, and exec.

-3) void flush_tlb_range(struct vm_area_struct *vma,
-	unsigned long start, unsigned long end)
+3) ``void flush_tlb_range(struct vm_area_struct *vma,
+	unsigned long start, unsigned long end)``

 	Here we are flushing a specific range of (user) virtual
 	address translations from the TLB. After running, this
@@ -69,7 +70,7 @@ changes occur:
 	call flush_tlb_page (see below) for each entry which may be
 	modified.

-4) void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
+4) ``void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)``

 	This time we need to remove the PAGE_SIZE sized translation
 	from the TLB. The 'vma' is the backing structure used by
@@ -87,8 +88,8 @@ changes occur:

 	This is used primarily during fault processing.

-5) void update_mmu_cache(struct vm_area_struct *vma,
-	unsigned long address, pte_t *ptep)
+5) ``void update_mmu_cache(struct vm_area_struct *vma,
+	unsigned long address, pte_t *ptep)``

 	At the end of every page fault, this routine is invoked to
 	tell the architecture specific code that a translation
@@ -100,7 +101,7 @@ changes occur:
 	translations for software managed TLB configurations.
 	The sparc64 port currently does this.

-6) void tlb_migrate_finish(struct mm_struct *mm)
+6) ``void tlb_migrate_finish(struct mm_struct *mm)``

 	This interface is called at the end of an explicit
 	process migration. This interface provides a hook
@@ -112,7 +113,7 @@ changes occur:

 Next, we have the cache flushing interfaces. In general, when Linux
 is changing an existing virtual-->physical mapping to a new value,
-the sequence will be in one of the following forms:
+the sequence will be in one of the following forms::

 	1) flush_cache_mm(mm);
 	   change_all_page_tables_of(mm);
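The flush-before-change, flush-TLB-after ordering in the sequence above can be illustrated with a toy single-entry software TLB (plain userspace C, invented names, no kernel APIs). If the TLB flush were omitted after the page tables change, the cached translation would be served stale:

```c
#include <assert.h>

/*
 * Toy single-entry software TLB.  A real TLB caches many
 * virtual->physical translations in hardware; this model keeps
 * exactly one cached copy of one page table entry, which is
 * enough to show why a flush must follow a mapping change.
 */
static unsigned long page_table_entry;	/* the authoritative mapping */
static unsigned long tlb_entry;		/* cached copy */
static int tlb_valid;

static unsigned long toy_translate(void)
{
	if (!tlb_valid) {		/* miss: walk the "page table" */
		tlb_entry = page_table_entry;
		tlb_valid = 1;
	}
	return tlb_entry;		/* hit: stale unless flushed */
}

static void toy_flush_tlb(void)	/* stands in for flush_tlb_mm() */
{
	tlb_valid = 0;
}

/* the update sequence from the text (cache flush step elided) */
static void toy_change_mapping(unsigned long new_pte)
{
	page_table_entry = new_pte;	/* change_all_page_tables_of() */
	toy_flush_tlb();		/* flush_tlb_mm() */
}
```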
@@ -143,7 +144,7 @@ and have no dependency on translation information.

 Here are the routines, one by one:

-1) void flush_cache_mm(struct mm_struct *mm)
+1) ``void flush_cache_mm(struct mm_struct *mm)``

 	This interface flushes an entire user address space from
 	the caches. That is, after running, there will be no cache
@@ -152,7 +153,7 @@ Here are the routines, one by one:
 	This interface is used to handle whole address space
 	page table operations such as what happens during exit and exec.

-2) void flush_cache_dup_mm(struct mm_struct *mm)
+2) ``void flush_cache_dup_mm(struct mm_struct *mm)``

 	This interface flushes an entire user address space from
 	the caches. That is, after running, there will be no cache
@@ -164,8 +165,8 @@ Here are the routines, one by one:
 	This option is separate from flush_cache_mm to allow some
 	optimizations for VIPT caches.

-3) void flush_cache_range(struct vm_area_struct *vma,
-	unsigned long start, unsigned long end)
+3) ``void flush_cache_range(struct vm_area_struct *vma,
+	unsigned long start, unsigned long end)``

 	Here we are flushing a specific range of (user) virtual
 	addresses from the cache. After running, there will be no
@@ -181,7 +182,7 @@ Here are the routines, one by one:
 	call flush_cache_page (see below) for each entry which may be
 	modified.

-4) void flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)
+4) ``void flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)``

 	This time we need to remove a PAGE_SIZE sized range
 	from the cache. The 'vma' is the backing structure used by
@@ -202,7 +203,7 @@ Here are the routines, one by one:

 	This is used primarily during fault processing.

-5) void flush_cache_kmaps(void)
+5) ``void flush_cache_kmaps(void)``

 	This routine need only be implemented if the platform utilizes
 	highmem. It will be called right before all of the kmaps
@@ -214,8 +215,8 @@ Here are the routines, one by one:

 	This routing should be implemented in asm/highmem.h

-6) void flush_cache_vmap(unsigned long start, unsigned long end)
-   void flush_cache_vunmap(unsigned long start, unsigned long end)
+6) ``void flush_cache_vmap(unsigned long start, unsigned long end)``
+   ``void flush_cache_vunmap(unsigned long start, unsigned long end)``

 	Here in these two interfaces we are flushing a specific range
 	of (kernel) virtual addresses from the cache. After running,
@@ -243,8 +244,10 @@ size). This setting will force the SYSv IPC layer to only allow user
 processes to mmap shared memory at address which are a multiple of
 this value.

-NOTE: This does not fix shared mmaps, check out the sparc64 port for
-one way to solve this (in particular SPARC_FLAG_MMAPSHARED).
+.. note::
+
+  This does not fix shared mmaps, check out the sparc64 port for
+  one way to solve this (in particular SPARC_FLAG_MMAPSHARED).

 Next, you have to solve the D-cache aliasing issue for all
 other cases. Please keep in mind that fact that, for a given page
@@ -255,8 +258,8 @@ physical page into its address space, by implication the D-cache
 aliasing problem has the potential to exist since the kernel already
 maps this page at its virtual address.

-  void copy_user_page(void *to, void *from, unsigned long addr, struct page *page)
-  void clear_user_page(void *to, unsigned long addr, struct page *page)
+  ``void copy_user_page(void *to, void *from, unsigned long addr, struct page *page)``
+  ``void clear_user_page(void *to, unsigned long addr, struct page *page)``

 	These two routines store data in user anonymous or COW
 	pages. It allows a port to efficiently avoid D-cache alias
@@ -276,14 +279,16 @@ maps this page at its virtual address.
 	If D-cache aliasing is not an issue, these two routines may
 	simply call memcpy/memset directly and do nothing more.

-  void flush_dcache_page(struct page *page)
+  ``void flush_dcache_page(struct page *page)``

 	Any time the kernel writes to a page cache page, _OR_
 	the kernel is about to read from a page cache page and
 	user space shared/writable mappings of this page potentially
 	exist, this routine is called.

-	NOTE: This routine need only be called for page cache pages
+	.. note::
+
+	  This routine need only be called for page cache pages
 	  which can potentially ever be mapped into the address
 	  space of a user process. So for example, VFS layer code
 	  handling vfs symlinks in the page cache need not call
@@ -322,18 +327,19 @@ maps this page at its virtual address.
 	made of this flag bit, and if set the flush is done and the flag
 	bit is cleared.

-	IMPORTANT NOTE: It is often important, if you defer the flush,
+	.. important::
+
+	  It is often important, if you defer the flush,
 	  that the actual flush occurs on the same CPU
 	  as did the cpu stores into the page to make it
 	  dirty. Again, see sparc64 for examples of how
 	  to deal with this.

-  void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
-	unsigned long user_vaddr,
-	void *dst, void *src, int len)
-  void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
-	unsigned long user_vaddr,
-	void *dst, void *src, int len)
+  ``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
+	unsigned long user_vaddr, void *dst, void *src, int len)``
+  ``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
+	unsigned long user_vaddr, void *dst, void *src, int len)``
+
 	When the kernel needs to copy arbitrary data in and out
 	of arbitrary user pages (f.e. for ptrace()) it will use
 	these two routines.
@@ -344,8 +350,9 @@ maps this page at its virtual address.
 	likely that you will need to flush the instruction cache
 	for copy_to_user_page().

-  void flush_anon_page(struct vm_area_struct *vma, struct page *page,
-	unsigned long vmaddr)
+  ``void flush_anon_page(struct vm_area_struct *vma, struct page *page,
+	unsigned long vmaddr)``
+
 	When the kernel needs to access the contents of an anonymous
 	page, it calls this function (currently only
 	get_user_pages()). Note: flush_dcache_page() deliberately
@@ -354,7 +361,8 @@ maps this page at its virtual address.
 	architectures). For incoherent architectures, it should flush
 	the cache of the page at vmaddr.

-  void flush_kernel_dcache_page(struct page *page)
+  ``void flush_kernel_dcache_page(struct page *page)``
+
 	When the kernel needs to modify a user page is has obtained
 	with kmap, it calls this function after all modifications are
 	complete (but before kunmapping it) to bring the underlying
@@ -366,14 +374,16 @@ maps this page at its virtual address.
 	the kernel cache for page (using page_address(page)).


-  void flush_icache_range(unsigned long start, unsigned long end)
+  ``void flush_icache_range(unsigned long start, unsigned long end)``
+
 	When the kernel stores into addresses that it will execute
 	out of (eg when loading modules), this function is called.

 	If the icache does not snoop stores then this routine will need
 	to flush it.

-  void flush_icache_page(struct vm_area_struct *vma, struct page *page)
+  ``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``
+
 	All the functionality of flush_icache_page can be implemented in
 	flush_dcache_page and update_mmu_cache. In the future, the hope
 	is to remove this interface completely.
@@ -387,7 +397,8 @@ the kernel trying to do I/O to vmap areas must manually manage
 coherency. It must do this by flushing the vmap range before doing
 I/O and invalidating it after the I/O returns.

-  void flush_kernel_vmap_range(void *vaddr, int size)
+  ``void flush_kernel_vmap_range(void *vaddr, int size)``
+
 	flushes the kernel cache for a given virtual address range in
 	the vmap area. This is to make sure that any data the kernel
 	modified in the vmap range is made visible to the physical
@@ -395,7 +406,8 @@ I/O and invalidating it after the I/O returns.
 	Note that this API does *not* also flush the offset map alias
 	of the area.

-  void invalidate_kernel_vmap_range(void *vaddr, int size) invalidates
+  ``void invalidate_kernel_vmap_range(void *vaddr, int size) invalidates``
+
 	the cache for a given virtual address range in the vmap area
 	which prevents the processor from making the cache stale by
 	speculatively reading data while the I/O was occurring to the
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index e6101976e0f1..bde177103567 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1,7 +1,9 @@
-
+================
 Control Group v2
+================

-October, 2015		Tejun Heo <tj@kernel.org>
+:Date: October, 2015
+:Author: Tejun Heo <tj@kernel.org>

 This is the authoritative documentation on the design, interface and
 conventions of cgroup v2. It describes all userland-visible aspects
@@ -9,70 +11,72 @@ of cgroup including core and specific controller behaviors. All
 future changes must be reflected in this document. Documentation for
 v1 is available under Documentation/cgroup-v1/.

-CONTENTS
+.. CONTENTS

-1. Introduction
-  1-1. Terminology
-  1-2. What is cgroup?
-2. Basic Operations
-  2-1. Mounting
-  2-2. Organizing Processes
-  2-3. [Un]populated Notification
-  2-4. Controlling Controllers
-    2-4-1. Enabling and Disabling
-    2-4-2. Top-down Constraint
-    2-4-3. No Internal Process Constraint
-  2-5. Delegation
-    2-5-1. Model of Delegation
-    2-5-2. Delegation Containment
-  2-6. Guidelines
-    2-6-1. Organize Once and Control
-    2-6-2. Avoid Name Collisions
-3. Resource Distribution Models
-  3-1. Weights
-  3-2. Limits
-  3-3. Protections
-  3-4. Allocations
-4. Interface Files
-  4-1. Format
-  4-2. Conventions
-  4-3. Core Interface Files
-5. Controllers
-  5-1. CPU
-    5-1-1. CPU Interface Files
-  5-2. Memory
-    5-2-1. Memory Interface Files
-    5-2-2. Usage Guidelines
-    5-2-3. Memory Ownership
-  5-3. IO
-    5-3-1. IO Interface Files
-    5-3-2. Writeback
-  5-4. PID
-    5-4-1. PID Interface Files
-  5-5. RDMA
-    5-5-1. RDMA Interface Files
-  5-6. Misc
-    5-6-1. perf_event
-6. Namespace
-  6-1. Basics
-  6-2. The Root and Views
-  6-3. Migration and setns(2)
-  6-4. Interaction with Other Namespaces
-P. Information on Kernel Programming
-  P-1. Filesystem Support for Writeback
-D. Deprecated v1 Core Features
-R. Issues with v1 and Rationales for v2
-  R-1. Multiple Hierarchies
-  R-2. Thread Granularity
-  R-3. Competition Between Inner Nodes and Threads
-  R-4. Other Interface Issues
-  R-5. Controller Issues and Remedies
-    R-5-1. Memory
+   1. Introduction
+     1-1. Terminology
+     1-2. What is cgroup?
+   2. Basic Operations
+     2-1. Mounting
+     2-2. Organizing Processes
+     2-3. [Un]populated Notification
+     2-4. Controlling Controllers
+       2-4-1. Enabling and Disabling
+       2-4-2. Top-down Constraint
+       2-4-3. No Internal Process Constraint
+     2-5. Delegation
+       2-5-1. Model of Delegation
+       2-5-2. Delegation Containment
+     2-6. Guidelines
+       2-6-1. Organize Once and Control
+       2-6-2. Avoid Name Collisions
+   3. Resource Distribution Models
+     3-1. Weights
+     3-2. Limits
+     3-3. Protections
+     3-4. Allocations
+   4. Interface Files
+     4-1. Format
+     4-2. Conventions
+     4-3. Core Interface Files
+   5. Controllers
+     5-1. CPU
+       5-1-1. CPU Interface Files
+     5-2. Memory
+       5-2-1. Memory Interface Files
+       5-2-2. Usage Guidelines
+       5-2-3. Memory Ownership
+     5-3. IO
+       5-3-1. IO Interface Files
+       5-3-2. Writeback
+     5-4. PID
+       5-4-1. PID Interface Files
+     5-5. RDMA
+       5-5-1. RDMA Interface Files
+     5-6. Misc
+       5-6-1. perf_event
+   6. Namespace
+     6-1. Basics
+     6-2. The Root and Views
+     6-3. Migration and setns(2)
+     6-4. Interaction with Other Namespaces
+   P. Information on Kernel Programming
+     P-1. Filesystem Support for Writeback
+   D. Deprecated v1 Core Features
+   R. Issues with v1 and Rationales for v2
+     R-1. Multiple Hierarchies
+     R-2. Thread Granularity
+     R-3. Competition Between Inner Nodes and Threads
+     R-4. Other Interface Issues
+     R-5. Controller Issues and Remedies
+       R-5-1. Memory


-1. Introduction
+Introduction
+============

-1-1. Terminology
+Terminology
+-----------

 "cgroup" stands for "control group" and is never capitalized. The
 singular form is used to designate the whole feature and also as a
@@ -80,7 +84,8 @@ qualifier as in "cgroup controllers". When explicitly referring to
 multiple individual control groups, the plural form "cgroups" is used.


-1-2. What is cgroup?
+What is cgroup?
+---------------

 cgroup is a mechanism to organize processes hierarchically and
 distribute system resources along the hierarchy in a controlled and
@@ -110,12 +115,14 @@ restrictions set closer to the root in the hierarchy can not be
 overridden from further away.


-2. Basic Operations
+Basic Operations
+================

-2-1. Mounting
+Mounting
+--------

 Unlike v1, cgroup v2 has only single hierarchy. The cgroup v2
-hierarchy can be mounted with the following mount command.
+hierarchy can be mounted with the following mount command::

 	# mount -t cgroup2 none $MOUNT_POINT

@@ -160,10 +167,11 @@ cgroup v2 currently supports the following mount options.
 	Delegation section for details.


-2-2. Organizing Processes
+Organizing Processes
+--------------------

 Initially, only the root cgroup exists to which all processes belong.
-A child cgroup can be created by creating a sub-directory.
+A child cgroup can be created by creating a sub-directory::

 	# mkdir $CGROUP_NAME

@@ -190,28 +198,29 @@ moved to another cgroup.
 A cgroup which doesn't have any children or live processes can be
 destroyed by removing the directory. Note that a cgroup which doesn't
 have any children and is associated only with zombie processes is
-considered empty and can be removed.
+considered empty and can be removed::

 	# rmdir $CGROUP_NAME

 "/proc/$PID/cgroup" lists a process's cgroup membership. If legacy
 cgroup is in use in the system, this file may contain multiple lines,
 one for each hierarchy. The entry for cgroup v2 is always in the
-format "0::$PATH".
+format "0::$PATH"::

 	# cat /proc/842/cgroup
 	...
 	0::/test-cgroup/test-cgroup-nested

 If the process becomes a zombie and the cgroup it was associated with
-is removed subsequently, " (deleted)" is appended to the path.
+is removed subsequently, " (deleted)" is appended to the path::

 	# cat /proc/842/cgroup
 	...
 	0::/test-cgroup/test-cgroup-nested (deleted)

2142-3. [Un]populated Notification 222[Un]populated Notification
223--------------------------
215 224
216Each non-root cgroup has a "cgroup.events" file which contains 225Each non-root cgroup has a "cgroup.events" file which contains
217"populated" field indicating whether the cgroup's sub-hierarchy has 226"populated" field indicating whether the cgroup's sub-hierarchy has
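The "0::$PATH" format and the " (deleted)" suffix documented in the hunk above can be sketched with a short parser. This is an illustration only, not kernel code; the function name is hypothetical:

```python
def parse_cgroup_entry(line):
    """Parse one line of /proc/$PID/cgroup into (hierarchy_id, controllers, path, deleted)."""
    hier, controllers, path = line.rstrip("\n").split(":", 2)
    deleted = path.endswith(" (deleted)")
    if deleted:
        path = path[: -len(" (deleted)")]
    return int(hier), controllers, path, deleted

# The cgroup v2 entry always has hierarchy id 0 and an empty controller field.
print(parse_cgroup_entry("0::/test-cgroup/test-cgroup-nested"))
```

On a v1 hierarchy line such as "4:cpu:/foo" the same three-way split yields the controller list in the middle field.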
@@ -222,7 +231,7 @@ example, to start a clean-up operation after all processes of a given
 sub-hierarchy have exited.  The populated state updates and
 notifications are recursive.  Consider the following sub-hierarchy
 where the numbers in the parentheses represent the numbers of processes
-in each cgroup.
+in each cgroup::
 
   A(4) - B(0) - C(1)
               \ D(0)
@@ -233,18 +242,20 @@ file modified events will be generated on the "cgroup.events" files of
 both cgroups.
 
 
-2-4. Controlling Controllers
+Controlling Controllers
+-----------------------
 
-2-4-1. Enabling and Disabling
+Enabling and Disabling
+~~~~~~~~~~~~~~~~~~~~~~
 
 Each cgroup has a "cgroup.controllers" file which lists all
-controllers available for the cgroup to enable.
+controllers available for the cgroup to enable::
 
   # cat cgroup.controllers
   cpu io memory
 
 No controller is enabled by default.  Controllers can be enabled and
-disabled by writing to the "cgroup.subtree_control" file.
+disabled by writing to the "cgroup.subtree_control" file::
 
   # echo "+cpu +memory -io" > cgroup.subtree_control
 
@@ -256,7 +267,7 @@ are specified, the last one is effective.
 Enabling a controller in a cgroup indicates that the distribution of
 the target resource across its immediate children will be controlled.
 Consider the following sub-hierarchy.  The enabled controllers are
-listed in parentheses.
+listed in parentheses::
 
   A(cpu,memory) - B(memory) - C()
                             \ D()
@@ -276,7 +287,8 @@ controller interface files - anything which doesn't start with
 "cgroup." are owned by the parent rather than the cgroup itself.
 
 
-2-4-2. Top-down Constraint
+Top-down Constraint
+~~~~~~~~~~~~~~~~~~~
 
 Resources are distributed top-down and a cgroup can further distribute
 a resource only if the resource has been distributed to it from the
@@ -287,7 +299,8 @@ the parent has the controller enabled and a controller can't be
 disabled if one or more children have it enabled.
 
 
-2-4-3. No Internal Process Constraint
+No Internal Process Constraint
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Non-root cgroups can only distribute resources to their children when
 they don't have any processes of their own.  In other words, only
@@ -314,9 +327,11 @@ children before enabling controllers in its "cgroup.subtree_control"
 file.
 
 
-2-5. Delegation
+Delegation
+----------
 
-2-5-1. Model of Delegation
+Model of Delegation
+~~~~~~~~~~~~~~~~~~~
 
 A cgroup can be delegated in two ways.  First, to a less privileged
 user by granting write access of the directory and its "cgroup.procs"
@@ -345,7 +360,8 @@ cgroups in or nesting depth of a delegated sub-hierarchy; however,
 this may be limited explicitly in the future.
 
 
-2-5-2. Delegation Containment
+Delegation Containment
+~~~~~~~~~~~~~~~~~~~~~~
 
 A delegated sub-hierarchy is contained in the sense that processes
 can't be moved into or out of the sub-hierarchy by the delegatee.
@@ -366,7 +382,7 @@ in from or push out to outside the sub-hierarchy.
 
 For an example, let's assume cgroups C0 and C1 have been delegated to
 user U0 who created C00, C01 under C0 and C10 under C1 as follows and
-all processes under C0 and C1 belong to U0.
+all processes under C0 and C1 belong to U0::
 
   ~~~~~~~~~~~~~ - C0 - C00
   ~ cgroup    ~      \ C01
@@ -386,9 +402,11 @@ namespace of the process which is attempting the migration. If either
 is not reachable, the migration is rejected with -ENOENT.
 
 
-2-6. Guidelines
+Guidelines
+----------
 
-2-6-1. Organize Once and Control
+Organize Once and Control
+~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Migrating a process across cgroups is a relatively expensive operation
 and stateful resources such as memory are not moved together with the
@@ -404,7 +422,8 @@ distribution can be made by changing controller configuration through
 the interface files.
 
 
-2-6-2. Avoid Name Collisions
+Avoid Name Collisions
+~~~~~~~~~~~~~~~~~~~~~
 
 Interface files for a cgroup and its children cgroups occupy the same
 directory and it is possible to create children cgroups which collide
@@ -422,14 +441,16 @@ cgroup doesn't do anything to prevent name collisions and it's the
 user's responsibility to avoid them.
 
 
-3. Resource Distribution Models
+Resource Distribution Models
+============================
 
 cgroup controllers implement several resource distribution schemes
 depending on the resource type and expected use cases.  This section
 describes major schemes in use along with their expected behaviors.
 
 
-3-1. Weights
+Weights
+-------
 
 A parent's resource is distributed by adding up the weights of all
 active children and giving each the fraction matching the ratio of its
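The weight-based distribution described above (sum the weights of active children, give each its proportional fraction) reduces to a small calculation. A sketch for illustration, not the scheduler's actual implementation:

```python
def weighted_shares(total, weights):
    """Split `total` across active children in proportion to their weights.

    Children with weight 0 (or absent) are treated as inactive and get nothing,
    so inactive siblings don't dilute the shares of active ones.
    """
    active = {name: w for name, w in weights.items() if w > 0}
    weight_sum = sum(active.values())
    return {name: total * w / weight_sum for name, w in active.items()}

# Two active children with equal weight 100 each get half of the resource.
print(weighted_shares(100.0, {"A": 100, "B": 100, "C": 0}))
```

Note how a child's share grows automatically when siblings go idle, which is the "work-conserving" property this model provides.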
@@ -450,7 +471,8 @@ process migrations.
 and is an example of this type.
 
 
-3-2. Limits
+Limits
+------
 
 A child can only consume upto the configured amount of the resource.
 Limits can be over-committed - the sum of the limits of children can
@@ -466,7 +488,8 @@ process migrations.
 on an IO device and is an example of this type.
 
 
-3-3. Protections
+Protections
+-----------
 
 A cgroup is protected to be allocated upto the configured amount of
 the resource if the usages of all its ancestors are under their
@@ -486,7 +509,8 @@ process migrations.
 example of this type.
 
 
-3-4. Allocations
+Allocations
+-----------
 
 A cgroup is exclusively allocated a certain amount of a finite
 resource.  Allocations can't be over-committed - the sum of the
@@ -505,12 +529,14 @@ may be rejected.
 type.
 
 
-4. Interface Files
+Interface Files
+===============
 
-4-1. Format
+Format
+------
 
 All interface files should be in one of the following formats whenever
-possible.
+possible::
 
   New-line separated values
   (when only one value can be written at once)
@@ -545,7 +571,8 @@ can be written at a time. For nested keyed files, the sub key pairs
 may be specified in any order and not all pairs have to be specified.
 
 
-4-2. Conventions
+Conventions
+-----------
 
 - Settings for a single feature should be contained in a single file.
 
@@ -581,25 +608,25 @@ may be specified in any order and not all pairs have to be specified.
   with "default" as the value must not appear when read.
 
   For example, a setting which is keyed by major:minor device numbers
-  with integer values may look like the following.
+  with integer values may look like the following::
 
     # cat cgroup-example-interface-file
     default 150
     8:0 300
 
-  The default value can be updated by
+  The default value can be updated by::
 
     # echo 125 > cgroup-example-interface-file
 
-  or
+  or::
 
     # echo "default 125" > cgroup-example-interface-file
 
-  An override can be set by
+  An override can be set by::
 
     # echo "8:16 170" > cgroup-example-interface-file
 
-  and cleared by
+  and cleared by::
 
     # echo "8:0 default" > cgroup-example-interface-file
     # cat cgroup-example-interface-file
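The default/override convention walked through in this hunk (a bare value or "default N" updates the default, "KEY N" sets a per-key override, "KEY default" clears it) can be modeled directly. An illustrative Python sketch; the class name is invented:

```python
class KeyedSetting:
    """Model of a flat-keyed interface file with a "default" entry (illustration only)."""

    def __init__(self, default):
        self.default = default
        self.overrides = {}

    def write(self, line):
        parts = line.split()
        if len(parts) == 1:            # "echo 125" updates the default
            self.default = int(parts[0])
            return
        key, value = parts
        if key == "default":           # "default 125" also updates the default
            self.default = int(value)
        elif value == "default":       # "8:0 default" clears an override
            self.overrides.pop(key, None)
        else:                          # "8:16 170" sets a per-key override
            self.overrides[key] = int(value)

    def read(self):
        lines = ["default %d" % self.default]
        lines += ["%s %d" % kv for kv in sorted(self.overrides.items())]
        return "\n".join(lines)

f = KeyedSetting(150)
for w in ("8:0 300", "125", "8:16 170", "8:0 default"):
    f.write(w)
print(f.read())
```

Per the convention quoted above, a key whose value equals the default is simply not listed on read, which is why the cleared "8:0" entry disappears from the output.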
@@ -612,12 +639,12 @@ may be specified in any order and not all pairs have to be specified.
   generated on the file.
 
 
-4-3. Core Interface Files
+Core Interface Files
+--------------------
 
 All cgroup core files are prefixed with "cgroup."
 
   cgroup.procs
-
 	A read-write new-line separated values file which exists on
 	all cgroups.
 
@@ -643,7 +670,6 @@ All cgroup core files are prefixed with "cgroup."
 	should be granted along with the containing directory.
 
   cgroup.controllers
-
 	A read-only space separated values file which exists on all
 	cgroups.
 
@@ -651,7 +677,6 @@ All cgroup core files are prefixed with "cgroup."
 	the cgroup.  The controllers are not ordered.
 
   cgroup.subtree_control
-
 	A read-write space separated values file which exists on all
 	cgroups.  Starts out empty.
 
@@ -667,23 +692,25 @@ All cgroup core files are prefixed with "cgroup."
 	operations are specified, either all succeed or all fail.
 
   cgroup.events
-
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
 	otherwise, a value change in this file generates a file
 	modified event.
 
 	  populated
-
 		1 if the cgroup or its descendants contains any live
 		processes; otherwise, 0.
 
 
-5. Controllers
+Controllers
+===========
 
-5-1. CPU
+CPU
+---
 
-[NOTE: The interface for the cpu controller hasn't been merged yet]
+.. note::
+
+   The interface for the cpu controller hasn't been merged yet
 
 The "cpu" controllers regulates distribution of CPU cycles.  This
 controller implements weight and absolute bandwidth limit models for
@@ -691,36 +718,34 @@ normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
 
-5-1-1. CPU Interface Files
+CPU Interface Files
+~~~~~~~~~~~~~~~~~~~
 
 All time durations are in microseconds.
 
   cpu.stat
-
 	A read-only flat-keyed file which exists on non-root cgroups.
 
-	It reports the following six stats.
+	It reports the following six stats:
 
-	  usage_usec
-	  user_usec
-	  system_usec
-	  nr_periods
-	  nr_throttled
-	  throttled_usec
+	- usage_usec
+	- user_usec
+	- system_usec
+	- nr_periods
+	- nr_throttled
+	- throttled_usec
 
   cpu.weight
-
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "100".
 
 	The weight in the range [1, 10000].
 
   cpu.max
-
 	A read-write two value file which exists on non-root cgroups.
 	The default is "max 100000".
 
-	The maximum bandwidth limit.  It's in the following format.
+	The maximum bandwidth limit.  It's in the following format::
 
 	  $MAX $PERIOD
 
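The "$MAX $PERIOD" format and the rule that writing a single number updates only $MAX can be captured in a tiny helper. A hedged Python illustration; the function name is an assumption, not a kernel API:

```python
def update_cpu_max(current, write):
    """Apply a write to a cpu.max-style "$MAX $PERIOD" file.

    `current` is a (max, period) tuple where max is an int or the string "max".
    Writing a lone number updates $MAX and keeps the existing $PERIOD.
    """
    parts = write.split()
    new_max = "max" if parts[0] == "max" else int(parts[0])
    new_period = int(parts[1]) if len(parts) > 1 else current[1]
    return (new_max, new_period)

# Lone number: only $MAX changes, $PERIOD stays 100000.
print(update_cpu_max(("max", 100000), "50000"))
```

The same two-value convention applies to cpu.rt.max described below, with a different default.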
@@ -729,9 +754,10 @@ All time durations are in microseconds.
 	one number is written, $MAX is updated.
 
   cpu.rt.max
+	.. note::
 
-	[NOTE: The semantics of this file is still under discussion and the
-	interface hasn't been merged yet]
+	   The semantics of this file is still under discussion and the
+	   interface hasn't been merged yet
 
 	A read-write two value file which exists on all cgroups.
 	The default is "0 100000".
@@ -739,7 +765,7 @@ All time durations are in microseconds.
 	The maximum realtime runtime allocation.  Over-committing
 	configurations are disallowed and process migrations are
 	rejected if not enough bandwidth is available.  It's in the
-	following format.
+	following format::
 
 	  $MAX $PERIOD
 
@@ -748,7 +774,8 @@ All time durations are in microseconds.
 	updated.
 
 
-5-2. Memory
+Memory
+------
 
 The "memory" controller regulates distribution of memory.  Memory is
 stateful and implements both limit and protection models.  Due to the
@@ -770,14 +797,14 @@ following types of memory usages are tracked.
 The above list may expand in the future for better coverage.
 
 
-5-2-1. Memory Interface Files
+Memory Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
 
 All memory amounts are in bytes.  If a value which is not aligned to
 PAGE_SIZE is written, the value may be rounded up to the closest
 PAGE_SIZE multiple when read back.
 
   memory.current
-
 	A read-only single value file which exists on non-root
 	cgroups.
 
@@ -785,7 +812,6 @@ PAGE_SIZE multiple when read back.
 	and its descendants.
 
   memory.low
-
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "0".
 
@@ -798,7 +824,6 @@ PAGE_SIZE multiple when read back.
 	protection is discouraged.
 
   memory.high
-
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
 
@@ -811,7 +836,6 @@ PAGE_SIZE multiple when read back.
 	under extreme conditions the limit may be breached.
 
   memory.max
-
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
 
@@ -826,21 +850,18 @@ PAGE_SIZE multiple when read back.
 	utility is limited to providing the final safety net.
 
   memory.events
-
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
 	otherwise, a value change in this file generates a file
 	modified event.
 
 	  low
-
 		The number of times the cgroup is reclaimed due to
 		high memory pressure even though its usage is under
 		the low boundary.  This usually indicates that the low
 		boundary is over-committed.
 
 	  high
-
 		The number of times processes of the cgroup are
 		throttled and routed to perform direct memory reclaim
 		because the high memory boundary was exceeded.  For a
@@ -849,13 +870,11 @@ PAGE_SIZE multiple when read back.
 		occurrences are expected.
 
 	  max
-
 		The number of times the cgroup's memory usage was
 		about to go over the max boundary.  If direct reclaim
 		fails to bring it down, the cgroup goes to OOM state.
 
 	  oom
-
 		The number of time the cgroup's memory usage was
 		reached the limit and allocation was about to fail.
 
@@ -864,16 +883,14 @@ PAGE_SIZE multiple when read back.
 
 		Failed allocation in its turn could be returned into
 		userspace as -ENOMEM or siletly ignored in cases like
 		disk readahead.  For now OOM in memory cgroup kills
 		tasks iff shortage has happened inside page fault.
 
 	  oom_kill
-
 		The number of processes belonging to this cgroup
 		killed by any kind of OOM killer.
 
   memory.stat
-
 	A read-only flat-keyed file which exists on non-root cgroups.
 
 	This breaks down the cgroup's memory footprint into different
@@ -887,73 +904,55 @@ PAGE_SIZE multiple when read back.
 	fixed position; use the keys to look up specific values!
 
 	  anon
-
 		Amount of memory used in anonymous mappings such as
 		brk(), sbrk(), and mmap(MAP_ANONYMOUS)
 
 	  file
-
 		Amount of memory used to cache filesystem data,
 		including tmpfs and shared memory.
 
 	  kernel_stack
-
 		Amount of memory allocated to kernel stacks.
 
 	  slab
-
 		Amount of memory used for storing in-kernel data
 		structures.
 
 	  sock
-
 		Amount of memory used in network transmission buffers
 
 	  shmem
-
 		Amount of cached filesystem data that is swap-backed,
 		such as tmpfs, shm segments, shared anonymous mmap()s
 
 	  file_mapped
-
 		Amount of cached filesystem data mapped with mmap()
 
 	  file_dirty
-
 		Amount of cached filesystem data that was modified but
 		not yet written back to disk
 
 	  file_writeback
-
 		Amount of cached filesystem data that was modified and
 		is currently being written back to disk
 
-	  inactive_anon
-	  active_anon
-	  inactive_file
-	  active_file
-	  unevictable
-
+	  inactive_anon, active_anon, inactive_file, active_file, unevictable
 		Amount of memory, swap-backed and filesystem-backed,
 		on the internal memory management lists used by the
 		page reclaim algorithm
 
 	  slab_reclaimable
-
 		Part of "slab" that might be reclaimed, such as
 		dentries and inodes.
 
 	  slab_unreclaimable
-
 		Part of "slab" that cannot be reclaimed on memory
 		pressure.
 
 	  pgfault
-
 		Total number of page faults incurred
 
 	  pgmajfault
-
 		Number of major page faults incurred
 
 	  workingset_refault
@@ -997,7 +996,6 @@ PAGE_SIZE multiple when read back.
 		Amount of reclaimed lazyfree pages
 
   memory.swap.current
-
 	A read-only single value file which exists on non-root
 	cgroups.
 
@@ -1005,7 +1003,6 @@ PAGE_SIZE multiple when read back.
 	and its descendants.
 
   memory.swap.max
-
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
 
@@ -1013,7 +1010,8 @@ PAGE_SIZE multiple when read back.
 	limit, anonymous meomry of the cgroup will not be swapped out.
 
 
-5-2-2. Usage Guidelines
+Usage Guidelines
+~~~~~~~~~~~~~~~~
 
 "memory.high" is the main mechanism to control memory usage.
 Over-committing on high limit (sum of high limits > available memory)
@@ -1036,7 +1034,8 @@ memory; unfortunately, memory pressure monitoring mechanism isn't
 implemented yet.
 
 
-5-2-3. Memory Ownership
+Memory Ownership
+~~~~~~~~~~~~~~~~
 
 A memory area is charged to the cgroup which instantiated it and stays
 charged to the cgroup until the area is released.  Migrating a process
@@ -1054,7 +1053,8 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
 
-5-3. IO
+IO
+--
 
 The "io" controller regulates the distribution of IO resources.  This
 controller implements both weight based and absolute bandwidth or IOPS
@@ -1063,28 +1063,29 @@ only if cfq-iosched is in use and neither scheme is available for
 blk-mq devices.
 
 
-5-3-1. IO Interface Files
+IO Interface Files
+~~~~~~~~~~~~~~~~~~
 
   io.stat
-
 	A read-only nested-keyed file which exists on non-root
 	cgroups.
 
 	Lines are keyed by $MAJ:$MIN device numbers and not ordered.
 	The following nested keys are defined.
 
+	  ======	===================
 	  rbytes	Bytes read
 	  wbytes	Bytes written
 	  rios		Number of read IOs
 	  wios		Number of write IOs
+	  ======	===================
 
-	An example read output follows.
+	An example read output follows:
 
 	  8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
 	  8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
 
   io.weight
-
 	A read-write flat-keyed file which exists on non-root cgroups.
 	The default is "default 100".
 
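The nested-keyed format shown in the io.stat example above (one line per $MAJ:$MIN device, space-separated key=value pairs) is straightforward to consume. An illustrative Python sketch, not a kernel interface:

```python
def parse_io_stat(text):
    """Parse io.stat-style output into {"MAJ:MIN": {key: int}}."""
    stats = {}
    for line in text.splitlines():
        dev, *pairs = line.split()
        # Each nested key appears as "key=value"; values here are byte/IO counts.
        stats[dev] = {k: int(v) for k, v in (p.split("=") for p in pairs)}
    return stats

sample = (
    "8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353\n"
    "8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252"
)
print(parse_io_stat(sample)["8:16"]["rios"])
```

Since the lines are documented as unordered, a dictionary keyed by device number is the natural representation.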
@@ -1098,14 +1099,13 @@ blk-mq devices.
 	$WEIGHT" or simply "$WEIGHT".  Overrides can be set by writing
 	"$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
 
-	An example read output follows.
+	An example read output follows::
 
 	  default 100
 	  8:16 200
 	  8:0 50
 
   io.max
-
 	A read-write nested-keyed file which exists on non-root
 	cgroups.
 
@@ -1113,10 +1113,12 @@ blk-mq devices.
 	device numbers and not ordered.  The following nested keys are
 	defined.
 
+	  =====		==================================
 	  rbps		Max read bytes per second
 	  wbps		Max write bytes per second
 	  riops		Max read IO operations per second
 	  wiops		Max write IO operations per second
+	  =====		==================================
 
 	When writing, any number of nested key-value pairs can be
 	specified in any order.  "max" can be specified as the value
@@ -1126,24 +1128,25 @@ blk-mq devices.
1126 BPS and IOPS are measured in each IO direction and IOs are 1128 BPS and IOPS are measured in each IO direction and IOs are
1127 delayed if the limit is reached. Temporary bursts are allowed. 1129 delayed if the limit is reached. Temporary bursts are allowed.
1128 1130
1129 Setting read limit at 2M BPS and write at 120 IOPS for 8:16. 1131 Setting read limit at 2M BPS and write at 120 IOPS for 8:16::
1130 1132
1131 echo "8:16 rbps=2097152 wiops=120" > io.max 1133 echo "8:16 rbps=2097152 wiops=120" > io.max
1132 1134
1133 Reading returns the following. 1135 Reading returns the following::
1134 1136
1135 8:16 rbps=2097152 wbps=max riops=max wiops=120 1137 8:16 rbps=2097152 wbps=max riops=max wiops=120
1136 1138
1137 Write IOPS limit can be removed by writing the following. 1139 Write IOPS limit can be removed by writing the following::
1138 1140
1139 echo "8:16 wiops=max" > io.max 1141 echo "8:16 wiops=max" > io.max
1140 1142
1141 Reading now returns the following. 1143 Reading now returns the following::
1142 1144
1143 8:16 rbps=2097152 wbps=max riops=max wiops=max 1145 8:16 rbps=2097152 wbps=max riops=max wiops=max
1144 1146
1145 1147
11465-3-2. Writeback 1148Writeback
1149~~~~~~~~~
1147 1150
1148Page cache is dirtied through buffered writes and shared mmaps and 1151Page cache is dirtied through buffered writes and shared mmaps and
1149written asynchronously to the backing filesystem by the writeback 1152written asynchronously to the backing filesystem by the writeback
@@ -1191,22 +1194,19 @@ patterns.
1191The sysctl knobs which affect writeback behavior are applied to cgroup 1194The sysctl knobs which affect writeback behavior are applied to cgroup
1192writeback as follows. 1195writeback as follows.
1193 1196
1194 vm.dirty_background_ratio 1197 vm.dirty_background_ratio, vm.dirty_ratio
1195 vm.dirty_ratio
1196
1197 These ratios apply the same way to cgroup writeback, with the 1198 These ratios apply the same way to cgroup writeback, with the
1198 amount of available memory capped by limits imposed by the 1199 amount of available memory capped by limits imposed by the
1199 memory controller and system-wide clean memory. 1200 memory controller and system-wide clean memory.
1200 1201
1201 vm.dirty_background_bytes 1202 vm.dirty_background_bytes, vm.dirty_bytes
1202 vm.dirty_bytes
1203
1204 For cgroup writeback, this is calculated as a ratio against 1203 For cgroup writeback, this is calculated as a ratio against
1205 total available memory and applied the same way as 1204 total available memory and applied the same way as
1206 vm.dirty[_background]_ratio. 1205 vm.dirty[_background]_ratio.
1207 1206
1208 1207
12095-4. PID 1208PID
1209---
1210 1210
1211The process number controller is used to allow a cgroup to stop any 1211The process number controller is used to allow a cgroup to stop any
1212new tasks from being fork()'d or clone()'d after a specified limit is 1212new tasks from being fork()'d or clone()'d after a specified limit is
@@ -1221,17 +1221,16 @@ Note that PIDs used in this controller refer to TIDs, process IDs as
1221used by the kernel. 1221used by the kernel.
1222 1222
1223 1223
12245-4-1. PID Interface Files 1224PID Interface Files
1225~~~~~~~~~~~~~~~~~~~
1225 1226
1226 pids.max 1227 pids.max
1227
1228 A read-write single value file which exists on non-root 1228 A read-write single value file which exists on non-root
1229 cgroups. The default is "max". 1229 cgroups. The default is "max".
1230 1230
1231 Hard limit of number of processes. 1231 Hard limit of number of processes.
1232 1232
1233 pids.current 1233 pids.current
1234
1235 A read-only single value file which exists on all cgroups. 1234 A read-only single value file which exists on all cgroups.
1236 1235
1237 The number of processes currently in the cgroup and its 1236 The number of processes currently in the cgroup and its
@@ -1246,12 +1245,14 @@ through fork() or clone(). These will return -EAGAIN if the creation
1246of a new process would cause a cgroup policy to be violated. 1245of a new process would cause a cgroup policy to be violated.
1247 1246
1248 1247
12495-5. RDMA 1248RDMA
1249----
1250 1250
1251The "rdma" controller regulates the distribution and accounting of 1251The "rdma" controller regulates the distribution and accounting of
1252RDMA resources. 1252RDMA resources.
1253 1253
12545-5-1. RDMA Interface Files 1254RDMA Interface Files
1255~~~~~~~~~~~~~~~~~~~~
1255 1256
1256 rdma.max 1257 rdma.max
1257 A readwrite nested-keyed file that exists for all the cgroups 1258 A readwrite nested-keyed file that exists for all the cgroups
@@ -1264,10 +1265,12 @@ of RDMA resources.
1264 1265
1265 The following nested keys are defined. 1266 The following nested keys are defined.
1266 1267
1268 ========== =============================
1267 hca_handle Maximum number of HCA Handles 1269 hca_handle Maximum number of HCA Handles
1268 hca_object Maximum number of HCA Objects 1270 hca_object Maximum number of HCA Objects
1271 ========== =============================
1269 1272
1270 An example for mlx4 and ocrdma devices follows. 1273 An example for mlx4 and ocrdma devices follows::
1271 1274
1272 mlx4_0 hca_handle=2 hca_object=2000 1275 mlx4_0 hca_handle=2 hca_object=2000
1273 ocrdma1 hca_handle=3 hca_object=max 1276 ocrdma1 hca_handle=3 hca_object=max
@@ -1276,15 +1279,17 @@ of RDMA resources.
1276 A read-only file that describes current resource usage. 1279 A read-only file that describes current resource usage.
1277 It exists for all cgroups except the root. 1280 It exists for all cgroups except the root.
1278 1281
1279 An example for mlx4 and ocrdma devices follows. 1282 An example for mlx4 and ocrdma devices follows::
1280 1283
1281 mlx4_0 hca_handle=1 hca_object=20 1284 mlx4_0 hca_handle=1 hca_object=20
1282 ocrdma1 hca_handle=1 hca_object=23 1285 ocrdma1 hca_handle=1 hca_object=23
1283 1286
1284 1287
12855-6. Misc 1288Misc
1289----
1286 1290
12875-6-1. perf_event 1291perf_event
1292~~~~~~~~~~
1288 1293
1289 The perf_event controller, if not mounted on a legacy hierarchy, is 1294 The perf_event controller, if not mounted on a legacy hierarchy, is
1290automatically enabled on the v2 hierarchy so that perf events can 1295automatically enabled on the v2 hierarchy so that perf events can
@@ -1292,9 +1297,11 @@ always be filtered by cgroup v2 path. The controller can still be
1292moved to a legacy hierarchy after v2 hierarchy is populated. 1297moved to a legacy hierarchy after v2 hierarchy is populated.
1293 1298
1294 1299
12956. Namespace 1300Namespace
1301=========
1296 1302
12976-1. Basics 1303Basics
1304------
1298 1305
1299cgroup namespace provides a mechanism to virtualize the view of the 1306cgroup namespace provides a mechanism to virtualize the view of the
1300"/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone 1307"/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone
@@ -1308,7 +1315,7 @@ Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
1308complete path of the cgroup of a process. In a container setup where 1315complete path of the cgroup of a process. In a container setup where
1309 a set of cgroups and namespaces are intended to isolate processes, the 1316 a set of cgroups and namespaces are intended to isolate processes, the
1310 "/proc/$PID/cgroup" file may leak system-level information 1317 "/proc/$PID/cgroup" file may leak system-level information
1311 to the isolated processes. For example: 1318 to the isolated processes. For example::
1312 1319
1313 # cat /proc/self/cgroup 1320 # cat /proc/self/cgroup
1314 0::/batchjobs/container_id1 1321 0::/batchjobs/container_id1
@@ -1316,14 +1323,14 @@ to the isolated processes. For Example:
1316 The path '/batchjobs/container_id1' can be considered system data 1323 The path '/batchjobs/container_id1' can be considered system data
1317 that is undesirable to expose to the isolated processes. cgroup namespace 1324 that is undesirable to expose to the isolated processes. cgroup namespace
1318can be used to restrict visibility of this path. For example, before 1325can be used to restrict visibility of this path. For example, before
1319creating a cgroup namespace, one would see: 1326creating a cgroup namespace, one would see::
1320 1327
1321 # ls -l /proc/self/ns/cgroup 1328 # ls -l /proc/self/ns/cgroup
1322 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835] 1329 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
1323 # cat /proc/self/cgroup 1330 # cat /proc/self/cgroup
1324 0::/batchjobs/container_id1 1331 0::/batchjobs/container_id1
1325 1332
1326After unsharing a new namespace, the view changes. 1333After unsharing a new namespace, the view changes::
1327 1334
1328 # ls -l /proc/self/ns/cgroup 1335 # ls -l /proc/self/ns/cgroup
1329 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183] 1336 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
@@ -1341,7 +1348,8 @@ namespace is destroyed. The cgroupns root and the actual cgroups
1341remain. 1348remain.
1342 1349
1343 1350
13446-2. The Root and Views 1351The Root and Views
1352------------------
1345 1353
1346The 'cgroupns root' for a cgroup namespace is the cgroup in which the 1354The 'cgroupns root' for a cgroup namespace is the cgroup in which the
1347process calling unshare(2) is running. For example, if a process in 1355process calling unshare(2) is running. For example, if a process in
@@ -1350,7 +1358,7 @@ process calling unshare(2) is running. For example, if a process in
1350init_cgroup_ns, this is the real root ('/') cgroup. 1358init_cgroup_ns, this is the real root ('/') cgroup.
1351 1359
1352The cgroupns root cgroup does not change even if the namespace creator 1360The cgroupns root cgroup does not change even if the namespace creator
1353process later moves to a different cgroup. 1361process later moves to a different cgroup::
1354 1362
1355 # ~/unshare -c # unshare cgroupns in some cgroup 1363 # ~/unshare -c # unshare cgroupns in some cgroup
1356 # cat /proc/self/cgroup 1364 # cat /proc/self/cgroup
@@ -1364,7 +1372,7 @@ Each process gets its namespace-specific view of "/proc/$PID/cgroup"
1364 1372
1365Processes running inside the cgroup namespace will be able to see 1373Processes running inside the cgroup namespace will be able to see
1366cgroup paths (in /proc/self/cgroup) only inside their root cgroup. 1374cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
1367From within an unshared cgroupns: 1375From within an unshared cgroupns::
1368 1376
1369 # sleep 100000 & 1377 # sleep 100000 &
1370 [1] 7353 1378 [1] 7353
@@ -1373,7 +1381,7 @@ From within an unshared cgroupns:
1373 0::/sub_cgrp_1 1381 0::/sub_cgrp_1
1374 1382
1375From the initial cgroup namespace, the real cgroup path will be 1383From the initial cgroup namespace, the real cgroup path will be
1376visible: 1384visible::
1377 1385
1378 $ cat /proc/7353/cgroup 1386 $ cat /proc/7353/cgroup
1379 0::/batchjobs/container_id1/sub_cgrp_1 1387 0::/batchjobs/container_id1/sub_cgrp_1
@@ -1381,7 +1389,7 @@ visible:
1381From a sibling cgroup namespace (that is, a namespace rooted at a 1389From a sibling cgroup namespace (that is, a namespace rooted at a
1382different cgroup), the cgroup path relative to its own cgroup 1390different cgroup), the cgroup path relative to its own cgroup
1383namespace root will be shown. For instance, if PID 7353's cgroup 1391namespace root will be shown. For instance, if PID 7353's cgroup
1384namespace root is at '/batchjobs/container_id2', then it will see 1392namespace root is at '/batchjobs/container_id2', then it will see::
1385 1393
1386 # cat /proc/7353/cgroup 1394 # cat /proc/7353/cgroup
1387 0::/../container_id2/sub_cgrp_1 1395 0::/../container_id2/sub_cgrp_1
@@ -1390,13 +1398,14 @@ Note that the relative path always starts with '/' to indicate that
1390 it's relative to the cgroup namespace root of the caller. 1398 it's relative to the cgroup namespace root of the caller.
1391 1399
1392 1400
13936-3. Migration and setns(2) 1401Migration and setns(2)
1402----------------------
1394 1403
1395Processes inside a cgroup namespace can move into and out of the 1404Processes inside a cgroup namespace can move into and out of the
1396namespace root if they have proper access to external cgroups. For 1405namespace root if they have proper access to external cgroups. For
1397example, from inside a namespace with cgroupns root at 1406example, from inside a namespace with cgroupns root at
1398/batchjobs/container_id1, and assuming that the global hierarchy is 1407/batchjobs/container_id1, and assuming that the global hierarchy is
1399still accessible inside cgroupns: 1408still accessible inside cgroupns::
1400 1409
1401 # cat /proc/7353/cgroup 1410 # cat /proc/7353/cgroup
1402 0::/sub_cgrp_1 1411 0::/sub_cgrp_1
@@ -1418,10 +1427,11 @@ namespace. It is expected that someone moves the attaching
1418process under the target cgroup namespace root. 1427process under the target cgroup namespace root.
1419 1428
1420 1429
14216-4. Interaction with Other Namespaces 1430Interaction with Other Namespaces
1431---------------------------------
1422 1432
1423 A namespace-specific cgroup hierarchy can be mounted by a process 1433 A namespace-specific cgroup hierarchy can be mounted by a process
1424running inside a non-init cgroup namespace. 1434running inside a non-init cgroup namespace::
1425 1435
1426 # mount -t cgroup2 none $MOUNT_POINT 1436 # mount -t cgroup2 none $MOUNT_POINT
1427 1437
@@ -1434,27 +1444,27 @@ the view of cgroup hierarchy by namespace-private cgroupfs mount
1434provides a properly isolated cgroup view inside the container. 1444provides a properly isolated cgroup view inside the container.
1435 1445
1436 1446
1437P. Information on Kernel Programming 1447Information on Kernel Programming
1448=================================
1438 1449
1439This section contains kernel programming information in the areas 1450This section contains kernel programming information in the areas
1440where interacting with cgroup is necessary. cgroup core and 1451where interacting with cgroup is necessary. cgroup core and
1441controllers are not covered. 1452controllers are not covered.
1442 1453
1443 1454
1444P-1. Filesystem Support for Writeback 1455Filesystem Support for Writeback
1456--------------------------------
1445 1457
1446A filesystem can support cgroup writeback by updating 1458A filesystem can support cgroup writeback by updating
1447 address_space_operations->writepage[s]() to annotate bios using the 1459 address_space_operations->writepage[s]() to annotate bios using the
1448following two functions. 1460following two functions.
1449 1461
1450 wbc_init_bio(@wbc, @bio) 1462 wbc_init_bio(@wbc, @bio)
1451
1452 Should be called for each bio carrying writeback data and 1463 Should be called for each bio carrying writeback data and
1453 associates the bio with the inode's owner cgroup. Can be 1464 associates the bio with the inode's owner cgroup. Can be
1454 called anytime between bio allocation and submission. 1465 called anytime between bio allocation and submission.
1455 1466
1456 wbc_account_io(@wbc, @page, @bytes) 1467 wbc_account_io(@wbc, @page, @bytes)
1457
1458 Should be called for each data segment being written out. 1468 Should be called for each data segment being written out.
1459 While this function doesn't care exactly when it's called 1469 While this function doesn't care exactly when it's called
1460 during the writeback session, it's the easiest and most 1470 during the writeback session, it's the easiest and most
@@ -1475,7 +1485,8 @@ cases by skipping wbc_init_bio() or using bio_associate_blkcg()
1475directly. 1485directly.
1476 1486
1477 1487
1478D. Deprecated v1 Core Features 1488Deprecated v1 Core Features
1489===========================
1479 1490
1480- Multiple hierarchies including named ones are not supported. 1491- Multiple hierarchies including named ones are not supported.
1481 1492
@@ -1489,9 +1500,11 @@ D. Deprecated v1 Core Features
1489 at the root instead. 1500 at the root instead.
1490 1501
1491 1502
1492R. Issues with v1 and Rationales for v2 1503Issues with v1 and Rationales for v2
1504====================================
1493 1505
1494R-1. Multiple Hierarchies 1506Multiple Hierarchies
1507--------------------
1495 1508
1496cgroup v1 allowed an arbitrary number of hierarchies and each 1509cgroup v1 allowed an arbitrary number of hierarchies and each
1497hierarchy could host any number of controllers. While this seemed to 1510hierarchy could host any number of controllers. While this seemed to
@@ -1543,7 +1556,8 @@ how memory is distributed beyond a certain level while still wanting
1543to control how CPU cycles are distributed. 1556to control how CPU cycles are distributed.
1544 1557
1545 1558
1546R-2. Thread Granularity 1559Thread Granularity
1560------------------
1547 1561
1548cgroup v1 allowed threads of a process to belong to different cgroups. 1562cgroup v1 allowed threads of a process to belong to different cgroups.
1549This didn't make sense for some controllers and those controllers 1563This didn't make sense for some controllers and those controllers
@@ -1586,7 +1600,8 @@ misbehaving and poorly abstracted interfaces and kernel exposing and
1586locked into constructs inadvertently. 1600locked into constructs inadvertently.
1587 1601
1588 1602
1589R-3. Competition Between Inner Nodes and Threads 1603Competition Between Inner Nodes and Threads
1604-------------------------------------------
1590 1605
1591 cgroup v1 allowed threads to be in any cgroup, which created an 1606 cgroup v1 allowed threads to be in any cgroup, which created an
1592interesting problem where threads belonging to a parent cgroup and its 1607interesting problem where threads belonging to a parent cgroup and its
@@ -1605,7 +1620,7 @@ simply weren't available for threads.
1605 1620
1606The io controller implicitly created a hidden leaf node for each 1621The io controller implicitly created a hidden leaf node for each
1607cgroup to host the threads. The hidden leaf had its own copies of all 1622cgroup to host the threads. The hidden leaf had its own copies of all
1608the knobs with "leaf_" prefixed. While this allowed equivalent 1623the knobs with ``leaf_`` prefixed. While this allowed equivalent
1609 control over internal threads, it came with serious drawbacks. It 1624 control over internal threads, it came with serious drawbacks. It
1610always added an extra layer of nesting which wouldn't be necessary 1625always added an extra layer of nesting which wouldn't be necessary
1611otherwise, made the interface messy and significantly complicated the 1626otherwise, made the interface messy and significantly complicated the
@@ -1626,7 +1641,8 @@ This clearly is a problem which needs to be addressed from cgroup core
1626in a uniform way. 1641in a uniform way.
1627 1642
1628 1643
1629R-4. Other Interface Issues 1644Other Interface Issues
1645----------------------
1630 1646
1631cgroup v1 grew without oversight and developed a large number of 1647cgroup v1 grew without oversight and developed a large number of
1632idiosyncrasies and inconsistencies. One issue on the cgroup core side 1648idiosyncrasies and inconsistencies. One issue on the cgroup core side
@@ -1654,9 +1670,11 @@ cgroup v2 establishes common conventions where appropriate and updates
1654controllers so that they expose minimal and consistent interfaces. 1670controllers so that they expose minimal and consistent interfaces.
1655 1671
1656 1672
1657R-5. Controller Issues and Remedies 1673Controller Issues and Remedies
1674------------------------------
1658 1675
1659R-5-1. Memory 1676Memory
1677~~~~~~
1660 1678
1661The original lower boundary, the soft limit, is defined as a limit 1679The original lower boundary, the soft limit, is defined as a limit
1662 that is unset by default. As a result, the set of cgroups that 1680 that is unset by default. As a result, the set of cgroups that
diff --git a/Documentation/circular-buffers.txt b/Documentation/circular-buffers.txt
index 4a824d232472..d4628174b7c5 100644
--- a/Documentation/circular-buffers.txt
+++ b/Documentation/circular-buffers.txt
@@ -1,9 +1,9 @@
1 ================ 1================
2 CIRCULAR BUFFERS 2Circular Buffers
3 ================ 3================
4 4
5By: David Howells <dhowells@redhat.com> 5:Author: David Howells <dhowells@redhat.com>
6 Paul E. McKenney <paulmck@linux.vnet.ibm.com> 6:Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
7 7
8 8
9Linux provides a number of features that can be used to implement circular 9Linux provides a number of features that can be used to implement circular
@@ -20,7 +20,7 @@ producer and just one consumer. It is possible to handle multiple producers by
20serialising them, and to handle multiple consumers by serialising them. 20serialising them, and to handle multiple consumers by serialising them.
21 21
22 22
23Contents: 23.. Contents:
24 24
25 (*) What is a circular buffer? 25 (*) What is a circular buffer?
26 26
@@ -31,8 +31,8 @@ Contents:
31 - The consumer. 31 - The consumer.
32 32
33 33
34========================== 34
35WHAT IS A CIRCULAR BUFFER? 35What is a circular buffer?
36========================== 36==========================
37 37
38First of all, what is a circular buffer? A circular buffer is a buffer of 38First of all, what is a circular buffer? A circular buffer is a buffer of
@@ -60,9 +60,7 @@ buffer, provided that neither index overtakes the other. The implementer must
60be careful, however, as a region more than one unit in size may wrap the end of 60be careful, however, as a region more than one unit in size may wrap the end of
61the buffer and be broken into two segments. 61the buffer and be broken into two segments.
62 62
63 63Measuring power-of-2 buffers
64============================
65MEASURING POWER-OF-2 BUFFERS
66============================ 64============================
67 65
68Calculation of the occupancy or the remaining capacity of an arbitrarily sized 66Calculation of the occupancy or the remaining capacity of an arbitrarily sized
@@ -71,13 +69,13 @@ modulus (divide) instruction. However, if the buffer is of a power-of-2 size,
71then a much quicker bitwise-AND instruction can be used instead. 69then a much quicker bitwise-AND instruction can be used instead.
72 70
73Linux provides a set of macros for handling power-of-2 circular buffers. These 71Linux provides a set of macros for handling power-of-2 circular buffers. These
74can be made use of by: 72can be made use of by::
75 73
76 #include <linux/circ_buf.h> 74 #include <linux/circ_buf.h>
77 75
78The macros are: 76The macros are:
79 77
80 (*) Measure the remaining capacity of a buffer: 78 (#) Measure the remaining capacity of a buffer::
81 79
82 CIRC_SPACE(head_index, tail_index, buffer_size); 80 CIRC_SPACE(head_index, tail_index, buffer_size);
83 81
@@ -85,7 +83,7 @@ The macros are:
85 can be inserted. 83 can be inserted.
86 84
87 85
88 (*) Measure the maximum consecutive immediate space in a buffer: 86 (#) Measure the maximum consecutive immediate space in a buffer::
89 87
90 CIRC_SPACE_TO_END(head_index, tail_index, buffer_size); 88 CIRC_SPACE_TO_END(head_index, tail_index, buffer_size);
91 89
@@ -94,14 +92,14 @@ The macros are:
94 beginning of the buffer. 92 beginning of the buffer.
95 93
96 94
97 (*) Measure the occupancy of a buffer: 95 (#) Measure the occupancy of a buffer::
98 96
99 CIRC_CNT(head_index, tail_index, buffer_size); 97 CIRC_CNT(head_index, tail_index, buffer_size);
100 98
101 This returns the number of items currently occupying a buffer[2]. 99 This returns the number of items currently occupying a buffer[2].
102 100
103 101
104 (*) Measure the non-wrapping occupancy of a buffer: 102 (#) Measure the non-wrapping occupancy of a buffer::
105 103
106 CIRC_CNT_TO_END(head_index, tail_index, buffer_size); 104 CIRC_CNT_TO_END(head_index, tail_index, buffer_size);
107 105
@@ -112,7 +110,7 @@ The macros are:
112Each of these macros will nominally return a value between 0 and buffer_size-1, 110Each of these macros will nominally return a value between 0 and buffer_size-1,
113however: 111however:
114 112
115 [1] CIRC_SPACE*() are intended to be used in the producer. To the producer 113 (1) CIRC_SPACE*() are intended to be used in the producer. To the producer
116 they will return a lower bound as the producer controls the head index, 114 they will return a lower bound as the producer controls the head index,
117 but the consumer may still be depleting the buffer on another CPU and 115 but the consumer may still be depleting the buffer on another CPU and
118 moving the tail index. 116 moving the tail index.
@@ -120,7 +118,7 @@ however:
120 To the consumer it will show an upper bound as the producer may be busy 118 To the consumer it will show an upper bound as the producer may be busy
121 depleting the space. 119 depleting the space.
122 120
123 [2] CIRC_CNT*() are intended to be used in the consumer. To the consumer they 121 (2) CIRC_CNT*() are intended to be used in the consumer. To the consumer they
124 will return a lower bound as the consumer controls the tail index, but the 122 will return a lower bound as the consumer controls the tail index, but the
125 producer may still be filling the buffer on another CPU and moving the 123 producer may still be filling the buffer on another CPU and moving the
126 head index. 124 head index.
@@ -128,14 +126,12 @@ however:
128 To the producer it will show an upper bound as the consumer may be busy 126 To the producer it will show an upper bound as the consumer may be busy
129 emptying the buffer. 127 emptying the buffer.
130 128
131 [3] To a third party, the order in which the writes to the indices by the 129 (3) To a third party, the order in which the writes to the indices by the
132 producer and consumer become visible cannot be guaranteed as they are 130 producer and consumer become visible cannot be guaranteed as they are
133 independent and may be made on different CPUs - so the result in such a 131 independent and may be made on different CPUs - so the result in such a
134 situation will merely be a guess, and may even be negative. 132 situation will merely be a guess, and may even be negative.
135 133
136 134Using memory barriers with circular buffers
137===========================================
138USING MEMORY BARRIERS WITH CIRCULAR BUFFERS
139=========================================== 135===========================================
140 136
141By using memory barriers in conjunction with circular buffers, you can avoid 137By using memory barriers in conjunction with circular buffers, you can avoid
@@ -152,10 +148,10 @@ time, and only one thing should be emptying a buffer at any one time, but the
152two sides can operate simultaneously. 148two sides can operate simultaneously.
153 149
154 150
155THE PRODUCER 151The producer
156------------ 152------------
157 153
158The producer will look something like this: 154The producer will look something like this::
159 155
160 spin_lock(&producer_lock); 156 spin_lock(&producer_lock);
161 157
@@ -193,10 +189,10 @@ ordering between the read of the index indicating that the consumer has
193vacated a given element and the write by the producer to that same element. 189vacated a given element and the write by the producer to that same element.
194 190
195 191
196THE CONSUMER 192The Consumer
197------------ 193------------
198 194
199The consumer will look something like this: 195The consumer will look something like this::
200 196
201 spin_lock(&consumer_lock); 197 spin_lock(&consumer_lock);
202 198
@@ -235,8 +231,7 @@ prevents the compiler from tearing the store, and enforces ordering
235against previous accesses. 231against previous accesses.
236 232
237 233
238=============== 234Further reading
239FURTHER READING
240=============== 235===============
241 236
242See also Documentation/memory-barriers.txt for a description of Linux's memory 237See also Documentation/memory-barriers.txt for a description of Linux's memory
diff --git a/Documentation/clk.txt b/Documentation/clk.txt
index 22f026aa2f34..be909ed45970 100644
--- a/Documentation/clk.txt
+++ b/Documentation/clk.txt
@@ -1,12 +1,16 @@
1 The Common Clk Framework 1========================
2 Mike Turquette <mturquette@ti.com> 2The Common Clk Framework
3========================
4
5:Author: Mike Turquette <mturquette@ti.com>
3 6
4This document endeavours to explain the common clk framework details, 7This document endeavours to explain the common clk framework details,
5and how to port a platform over to this framework. It is not yet a 8and how to port a platform over to this framework. It is not yet a
6detailed explanation of the clock api in include/linux/clk.h, but 9detailed explanation of the clock api in include/linux/clk.h, but
7perhaps someday it will include that information. 10perhaps someday it will include that information.
8 11
9 Part 1 - introduction and interface split 12Introduction and interface split
13================================
10 14
11The common clk framework is an interface to control the clock nodes 15The common clk framework is an interface to control the clock nodes
12available on various devices today. This may come in the form of clock 16available on various devices today. This may come in the form of clock
@@ -35,10 +39,11 @@ is defined in struct clk_foo and pointed to within struct clk_core. This
35allows for easy navigation between the two discrete halves of the common 39allows for easy navigation between the two discrete halves of the common
36clock interface. 40clock interface.
37 41
38 Part 2 - common data structures and api 42Common data structures and api
43==============================
39 44
40Below is the common struct clk_core definition from 45Below is the common struct clk_core definition from
41drivers/clk/clk.c, modified for brevity: 46drivers/clk/clk.c, modified for brevity::
42 47
43 struct clk_core { 48 struct clk_core {
44 const char *name; 49 const char *name;
@@ -59,7 +64,7 @@ struct clk. That api is documented in include/linux/clk.h.
59 64
60Platforms and devices utilizing the common struct clk_core use the struct 65Platforms and devices utilizing the common struct clk_core use the struct
61clk_ops pointer in struct clk_core to perform the hardware-specific parts of 66clk_ops pointer in struct clk_core to perform the hardware-specific parts of
62the operations defined in clk-provider.h: 67the operations defined in clk-provider.h::
63 68
64 struct clk_ops { 69 struct clk_ops {
65 int (*prepare)(struct clk_hw *hw); 70 int (*prepare)(struct clk_hw *hw);
@@ -95,19 +100,20 @@ the operations defined in clk-provider.h:
95 struct dentry *dentry); 100 struct dentry *dentry);
96 }; 101 };
97 102
98 Part 3 - hardware clk implementations 103Hardware clk implementations
104============================
99 105
100The strength of the common struct clk_core comes from its .ops and .hw pointers 106The strength of the common struct clk_core comes from its .ops and .hw pointers
101which abstract the details of struct clk from the hardware-specific bits, and 107which abstract the details of struct clk from the hardware-specific bits, and
102 vice versa. To illustrate, consider the simple gateable clk implementation in 108 vice versa. To illustrate, consider the simple gateable clk implementation in
103drivers/clk/clk-gate.c: 109drivers/clk/clk-gate.c::
104 110
105struct clk_gate { 111 struct clk_gate {
106 struct clk_hw hw; 112 struct clk_hw hw;
107 void __iomem *reg; 113 void __iomem *reg;
108 u8 bit_idx; 114 u8 bit_idx;
109 ... 115 ...
110}; 116 };
111 117
112struct clk_gate contains struct clk_hw hw as well as hardware-specific 118struct clk_gate contains struct clk_hw hw as well as hardware-specific
113knowledge about which register and bit control this clk's gating. 119knowledge about which register and bit control this clk's gating.
@@ -115,7 +121,7 @@ Nothing about clock topology or accounting, such as enable_count or
115notifier_count, is needed here. That is all handled by the common 121notifier_count, is needed here. That is all handled by the common
116framework code and struct clk_core. 122framework code and struct clk_core.
117 123
118Let's walk through enabling this clk from driver code: 124Let's walk through enabling this clk from driver code::
119 125
120 struct clk *clk; 126 struct clk *clk;
121 clk = clk_get(NULL, "my_gateable_clk"); 127 clk = clk_get(NULL, "my_gateable_clk");
@@ -123,70 +129,71 @@ Let's walk through enabling this clk from driver code:
123 clk_prepare(clk); 129 clk_prepare(clk);
124 clk_enable(clk); 130 clk_enable(clk);
125 131
126The call graph for clk_enable is very simple: 132The call graph for clk_enable is very simple::
127 133
128clk_enable(clk); 134 clk_enable(clk);
129 clk->ops->enable(clk->hw); 135 clk->ops->enable(clk->hw);
130 [resolves to...] 136 [resolves to...]
131 clk_gate_enable(hw); 137 clk_gate_enable(hw);
132 [resolves struct clk gate with to_clk_gate(hw)] 138 [resolves struct clk gate with to_clk_gate(hw)]
133 clk_gate_set_bit(gate); 139 clk_gate_set_bit(gate);
134 140
135And the definition of clk_gate_set_bit: 141And the definition of clk_gate_set_bit::
136 142
137static void clk_gate_set_bit(struct clk_gate *gate) 143 static void clk_gate_set_bit(struct clk_gate *gate)
138{ 144 {
139 u32 reg; 145 u32 reg;
140 146
141 reg = __raw_readl(gate->reg); 147 reg = __raw_readl(gate->reg);
142 reg |= BIT(gate->bit_idx); 148 reg |= BIT(gate->bit_idx);
143 writel(reg, gate->reg); 149 writel(reg, gate->reg);
144} 150 }
145 151
146Note that to_clk_gate is defined as: 152Note that to_clk_gate is defined as::
147 153
148#define to_clk_gate(_hw) container_of(_hw, struct clk_gate, hw) 154 #define to_clk_gate(_hw) container_of(_hw, struct clk_gate, hw)
149 155
150This pattern of abstraction is used for every clock hardware 156This pattern of abstraction is used for every clock hardware
151representation. 157representation.
152 158
153 Part 4 - supporting your own clk hardware 159Supporting your own clk hardware
160================================
154 161
155When implementing support for a new type of clock, it is only necessary to 162When implementing support for a new type of clock, it is only necessary to
156include the following header: 163include the following header::
157 164
158#include <linux/clk-provider.h> 165 #include <linux/clk-provider.h>
159 166
160To construct a clk hardware structure for your platform you must define 167To construct a clk hardware structure for your platform you must define
161the following: 168the following::
162 169
163struct clk_foo { 170 struct clk_foo {
164 struct clk_hw hw; 171 struct clk_hw hw;
165 ... hardware specific data goes here ... 172 ... hardware specific data goes here ...
166}; 173 };
167 174
168To take advantage of your data you'll need to support valid operations 175To take advantage of your data you'll need to support valid operations
169for your clk: 176for your clk::
170 177
171struct clk_ops clk_foo_ops { 178 struct clk_ops clk_foo_ops {
172 .enable = &clk_foo_enable; 179 .enable = &clk_foo_enable;
173 .disable = &clk_foo_disable; 180 .disable = &clk_foo_disable;
174}; 181 };
175 182
176Implement the above functions using container_of: 183Implement the above functions using container_of::
177 184
178#define to_clk_foo(_hw) container_of(_hw, struct clk_foo, hw) 185 #define to_clk_foo(_hw) container_of(_hw, struct clk_foo, hw)
179 186
180int clk_foo_enable(struct clk_hw *hw) 187 int clk_foo_enable(struct clk_hw *hw)
181{ 188 {
182 struct clk_foo *foo; 189 struct clk_foo *foo;
183 190
184 foo = to_clk_foo(hw); 191 foo = to_clk_foo(hw);
185 192
186 ... perform magic on foo ... 193 ... perform magic on foo ...
187 194
188 return 0; 195 return 0;
189}; 196 };
190 197
191Below is a matrix detailing which clk_ops are mandatory based upon the 198Below is a matrix detailing which clk_ops are mandatory based upon the
192hardware capabilities of that clock. A cell marked as "y" means 199hardware capabilities of that clock. A cell marked as "y" means
@@ -194,41 +201,56 @@ mandatory, a cell marked as "n" implies that either including that
194callback is invalid or otherwise unnecessary. Empty cells are either 201callback is invalid or otherwise unnecessary. Empty cells are either
195optional or must be evaluated on a case-by-case basis. 202optional or must be evaluated on a case-by-case basis.
196 203
197 clock hardware characteristics 204.. table:: clock hardware characteristics
198 ----------------------------------------------------------- 205
199 | gate | change rate | single parent | multiplexer | root | 206 +----------------+------+-------------+---------------+-------------+------+
200 |------|-------------|---------------|-------------|------| 207 | | gate | change rate | single parent | multiplexer | root |
201.prepare | | | | | | 208 +================+======+=============+===============+=============+======+
202.unprepare | | | | | | 209 |.prepare | | | | | |
203 | | | | | | 210 +----------------+------+-------------+---------------+-------------+------+
204.enable | y | | | | | 211 |.unprepare | | | | | |
205.disable | y | | | | | 212 +----------------+------+-------------+---------------+-------------+------+
206.is_enabled | y | | | | | 213 +----------------+------+-------------+---------------+-------------+------+
207 | | | | | | 214 |.enable | y | | | | |
208.recalc_rate | | y | | | | 215 +----------------+------+-------------+---------------+-------------+------+
209.round_rate | | y [1] | | | | 216 |.disable | y | | | | |
210.determine_rate | | y [1] | | | | 217 +----------------+------+-------------+---------------+-------------+------+
211.set_rate | | y | | | | 218 |.is_enabled | y | | | | |
212 | | | | | | 219 +----------------+------+-------------+---------------+-------------+------+
213.set_parent | | | n | y | n | 220 +----------------+------+-------------+---------------+-------------+------+
214.get_parent | | | n | y | n | 221 |.recalc_rate | | y | | | |
215 | | | | | | 222 +----------------+------+-------------+---------------+-------------+------+
216.recalc_accuracy| | | | | | 223 |.round_rate | | y [1]_ | | | |
217 | | | | | | 224 +----------------+------+-------------+---------------+-------------+------+
218.init | | | | | | 225 |.determine_rate | | y [1]_ | | | |
219 ----------------------------------------------------------- 226 +----------------+------+-------------+---------------+-------------+------+
220[1] either one of round_rate or determine_rate is required. 227 |.set_rate | | y | | | |
228 +----------------+------+-------------+---------------+-------------+------+
229 +----------------+------+-------------+---------------+-------------+------+
230 |.set_parent | | | n | y | n |
231 +----------------+------+-------------+---------------+-------------+------+
232 |.get_parent | | | n | y | n |
233 +----------------+------+-------------+---------------+-------------+------+
234 +----------------+------+-------------+---------------+-------------+------+
235 |.recalc_accuracy| | | | | |
236 +----------------+------+-------------+---------------+-------------+------+
237 +----------------+------+-------------+---------------+-------------+------+
238 |.init | | | | | |
239 +----------------+------+-------------+---------------+-------------+------+
240
241.. [1] either one of round_rate or determine_rate is required.
221 242
222Finally, register your clock at run-time with a hardware-specific 243Finally, register your clock at run-time with a hardware-specific
223registration function. This function simply populates struct clk_foo's 244registration function. This function simply populates struct clk_foo's
224data and then passes the common struct clk parameters to the framework 245data and then passes the common struct clk parameters to the framework
225with a call to: 246with a call to::
226 247
227clk_register(...) 248 clk_register(...)
228 249
229See the basic clock types in drivers/clk/clk-*.c for examples. 250See the basic clock types in ``drivers/clk/clk-*.c`` for examples.
230 251
231 Part 5 - Disabling clock gating of unused clocks 252Disabling clock gating of unused clocks
253=======================================
232 254
233Sometimes during development it can be useful to bypass the 255Sometimes during development it can be useful to bypass the
234default disabling of unused clocks. For example, if drivers aren't enabling 256default disabling of unused clocks. For example, if drivers aren't enabling
@@ -239,7 +261,8 @@ are sorted out.
239To bypass this disabling, include "clk_ignore_unused" in the bootargs to the 261To bypass this disabling, include "clk_ignore_unused" in the bootargs to the
240kernel. 262kernel.
241 263
242 Part 6 - Locking 264Locking
265=======
243 266
244The common clock framework uses two global locks, the prepare lock and the 267The common clock framework uses two global locks, the prepare lock and the
245enable lock. 268enable lock.
diff --git a/Documentation/cpu-load.txt b/Documentation/cpu-load.txt
index 287224e57cfc..2d01ce43d2a2 100644
--- a/Documentation/cpu-load.txt
+++ b/Documentation/cpu-load.txt
@@ -1,9 +1,10 @@
1========
1CPU load 2CPU load
2-------- 3========
3 4
4Linux exports various bits of information via `/proc/stat' and 5Linux exports various bits of information via ``/proc/stat`` and
5`/proc/uptime' that userland tools, such as top(1), use to calculate 6``/proc/uptime`` that userland tools, such as top(1), use to calculate
6the average time system spent in a particular state, for example: 7the average time system spent in a particular state, for example::
7 8
8 $ iostat 9 $ iostat
9 Linux 2.6.18.3-exp (linmac) 02/20/2007 10 Linux 2.6.18.3-exp (linmac) 02/20/2007
@@ -17,7 +18,7 @@ Here the system thinks that over the default sampling period the
17system spent 10.01% of the time doing work in user space, 2.92% in the 18system spent 10.01% of the time doing work in user space, 2.92% in the
18kernel, and was overall 81.63% of the time idle. 19kernel, and was overall 81.63% of the time idle.
19 20
20In most cases the `/proc/stat' information reflects the reality quite 21In most cases the ``/proc/stat`` information reflects the reality quite
21closely; however, due to the nature of how/when the kernel collects 22closely; however, due to the nature of how/when the kernel collects
22this data, it sometimes cannot be trusted at all. 23this data, it sometimes cannot be trusted at all.
23 24
@@ -33,78 +34,78 @@ Example
33------- 34-------
34 35
35If we imagine the system with one task that periodically burns cycles 36If we imagine the system with one task that periodically burns cycles
36in the following manner: 37in the following manner::
37 38
38 time line between two timer interrupts 39 time line between two timer interrupts
39|--------------------------------------| 40 |--------------------------------------|
40 ^ ^ 41 ^ ^
41 |_ something begins working | 42 |_ something begins working |
42 |_ something goes to sleep 43 |_ something goes to sleep
43 (only to be awakened quite soon) 44 (only to be awakened quite soon)
44 45
45In the above situation the system will be 0% loaded according to the 46In the above situation the system will be 0% loaded according to the
46`/proc/stat' (since the timer interrupt will always happen when the 47``/proc/stat`` (since the timer interrupt will always happen when the
47system is executing the idle handler), but in reality the load is 48system is executing the idle handler), but in reality the load is
48closer to 99%. 49closer to 99%.
49 50
50One can imagine many more situations where this behavior of the kernel 51One can imagine many more situations where this behavior of the kernel
51will lead to quite erratic information inside `/proc/stat'. 52will lead to quite erratic information inside ``/proc/stat``::
52 53
53 54
54/* gcc -o hog smallhog.c */ 55 /* gcc -o hog smallhog.c */
55#include <time.h> 56 #include <time.h>
56#include <limits.h> 57 #include <limits.h>
57#include <signal.h> 58 #include <signal.h>
58#include <sys/time.h> 59 #include <sys/time.h>
59#define HIST 10 60 #define HIST 10
60 61
61static volatile sig_atomic_t stop; 62 static volatile sig_atomic_t stop;
62 63
63static void sighandler (int signr) 64 static void sighandler (int signr)
64{ 65 {
65 (void) signr; 66 (void) signr;
66 stop = 1; 67 stop = 1;
67} 68 }
68static unsigned long hog (unsigned long niters) 69 static unsigned long hog (unsigned long niters)
69{ 70 {
70 stop = 0; 71 stop = 0;
71 while (!stop && --niters); 72 while (!stop && --niters);
72 return niters; 73 return niters;
73} 74 }
74int main (void) 75 int main (void)
75{ 76 {
76 int i; 77 int i;
77 struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 }, 78 struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 },
78 .it_value = { .tv_sec = 0, .tv_usec = 1 } }; 79 .it_value = { .tv_sec = 0, .tv_usec = 1 } };
79 sigset_t set; 80 sigset_t set;
80 unsigned long v[HIST]; 81 unsigned long v[HIST];
81 double tmp = 0.0; 82 double tmp = 0.0;
82 unsigned long n; 83 unsigned long n;
83 signal (SIGALRM, &sighandler); 84 signal (SIGALRM, &sighandler);
84 setitimer (ITIMER_REAL, &it, NULL); 85 setitimer (ITIMER_REAL, &it, NULL);
85 86
86 hog (ULONG_MAX); 87 hog (ULONG_MAX);
87 for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX); 88 for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX);
88 for (i = 0; i < HIST; ++i) tmp += v[i]; 89 for (i = 0; i < HIST; ++i) tmp += v[i];
89 tmp /= HIST; 90 tmp /= HIST;
90 n = tmp - (tmp / 3.0); 91 n = tmp - (tmp / 3.0);
91 92
92 sigemptyset (&set); 93 sigemptyset (&set);
93 sigaddset (&set, SIGALRM); 94 sigaddset (&set, SIGALRM);
94 95
95 for (;;) { 96 for (;;) {
96 hog (n); 97 hog (n);
97 sigwait (&set, &i); 98 sigwait (&set, &i);
98 } 99 }
99 return 0; 100 return 0;
100} 101 }
101 102
102 103
103References 104References
104---------- 105----------
105 106
106http://lkml.org/lkml/2007/2/12/6 107- http://lkml.org/lkml/2007/2/12/6
107Documentation/filesystems/proc.txt (1.8) 108- Documentation/filesystems/proc.txt (1.8)
108 109
109 110
110Thanks 111Thanks
diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index 127c9d8c2174..c6e7e9196a8b 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -1,3 +1,6 @@
1===========================================
2How CPU topology info is exported via sysfs
3===========================================
1 4
2Export CPU topology info via sysfs. Items (attributes) are similar 5Export CPU topology info via sysfs. Items (attributes) are similar
3to /proc/cpuinfo output of some architectures: 6to /proc/cpuinfo output of some architectures:
@@ -75,24 +78,26 @@ CONFIG_SCHED_BOOK and CONFIG_DRAWER are currently only used on s390, where
75they reflect the cpu and cache hierarchy. 78they reflect the cpu and cache hierarchy.
76 79
77For an architecture to support this feature, it must define some of 80For an architecture to support this feature, it must define some of
78these macros in include/asm-XXX/topology.h: 81these macros in include/asm-XXX/topology.h::
79#define topology_physical_package_id(cpu) 82
80#define topology_core_id(cpu) 83 #define topology_physical_package_id(cpu)
81#define topology_book_id(cpu) 84 #define topology_core_id(cpu)
82#define topology_drawer_id(cpu) 85 #define topology_book_id(cpu)
83#define topology_sibling_cpumask(cpu) 86 #define topology_drawer_id(cpu)
84#define topology_core_cpumask(cpu) 87 #define topology_sibling_cpumask(cpu)
85#define topology_book_cpumask(cpu) 88 #define topology_core_cpumask(cpu)
86#define topology_drawer_cpumask(cpu) 89 #define topology_book_cpumask(cpu)
87 90 #define topology_drawer_cpumask(cpu)
88The type of **_id macros is int. 91
89The type of **_cpumask macros is (const) struct cpumask *. The latter 92The type of ``**_id`` macros is int.
90correspond with appropriate **_siblings sysfs attributes (except for 93The type of ``**_cpumask`` macros is ``(const) struct cpumask *``. The latter
94correspond with appropriate ``**_siblings`` sysfs attributes (except for
91topology_sibling_cpumask() which corresponds with thread_siblings). 95topology_sibling_cpumask() which corresponds with thread_siblings).
92 96
93To be consistent on all architectures, include/linux/topology.h 97To be consistent on all architectures, include/linux/topology.h
94provides default definitions for any of the above macros that are 98provides default definitions for any of the above macros that are
95not defined by include/asm-XXX/topology.h: 99not defined by include/asm-XXX/topology.h:
100
961) physical_package_id: -1 1011) physical_package_id: -1
972) core_id: 0 1022) core_id: 0
983) sibling_cpumask: just the given CPU 1033) sibling_cpumask: just the given CPU
@@ -107,6 +112,7 @@ Additionally, CPU topology information is provided under
107/sys/devices/system/cpu and includes these files. The internal 112/sys/devices/system/cpu and includes these files. The internal
108source for the output is in brackets ("[]"). 113source for the output is in brackets ("[]").
109 114
115 =========== ==========================================================
110 kernel_max: the maximum CPU index allowed by the kernel configuration. 116 kernel_max: the maximum CPU index allowed by the kernel configuration.
111 [NR_CPUS-1] 117 [NR_CPUS-1]
112 118
@@ -122,6 +128,7 @@ source for the output is in brackets ("[]").
122 128
123 present: CPUs that have been identified as being present in the 129 present: CPUs that have been identified as being present in the
124 system. [cpu_present_mask] 130 system. [cpu_present_mask]
131 =========== ==========================================================
125 132
126The format for the above output is compatible with cpulist_parse() 133The format for the above output is compatible with cpulist_parse()
127[see <linux/cpumask.h>]. Some examples follow. 134[see <linux/cpumask.h>]. Some examples follow.
@@ -129,7 +136,7 @@ The format for the above output is compatible with cpulist_parse()
129In this example, there are 64 CPUs in the system but cpus 32-63 exceed 136In this example, there are 64 CPUs in the system but cpus 32-63 exceed
130the kernel max which is limited to 0..31 by the NR_CPUS config option 137the kernel max which is limited to 0..31 by the NR_CPUS config option
131being 32. Note also that CPUs 2 and 4-31 are not online but could be 138being 32. Note also that CPUs 2 and 4-31 are not online but could be
132brought online as they are both present and possible. 139brought online as they are both present and possible::
133 140
134 kernel_max: 31 141 kernel_max: 31
135 offline: 2,4-31,32-63 142 offline: 2,4-31,32-63
@@ -140,7 +147,7 @@ brought online as they are both present and possible.
140In this example, the NR_CPUS config option is 128, but the kernel was 147In this example, the NR_CPUS config option is 128, but the kernel was
141started with possible_cpus=144. There are 4 CPUs in the system and cpu2 148started with possible_cpus=144. There are 4 CPUs in the system and cpu2
142was manually taken offline (and is the only CPU that can be brought 149was manually taken offline (and is the only CPU that can be brought
143online.) 150online.)::
144 151
145 kernel_max: 127 152 kernel_max: 127
146 offline: 2,4-127,128-143 153 offline: 2,4-127,128-143
diff --git a/Documentation/crc32.txt b/Documentation/crc32.txt
index a08a7dd9d625..8a6860f33b4e 100644
--- a/Documentation/crc32.txt
+++ b/Documentation/crc32.txt
@@ -1,4 +1,6 @@
1A brief CRC tutorial. 1=================================
2brief tutorial on CRC computation
3=================================
2 4
3A CRC is a long-division remainder. You add the CRC to the message, 5A CRC is a long-division remainder. You add the CRC to the message,
4and the whole thing (message+CRC) is a multiple of the given 6and the whole thing (message+CRC) is a multiple of the given
@@ -8,7 +10,8 @@ remainder computed on the message+CRC is 0. This latter approach
8is used by a lot of hardware implementations, and is why so many 10is used by a lot of hardware implementations, and is why so many
9protocols put the end-of-frame flag after the CRC. 11protocols put the end-of-frame flag after the CRC.
10 12
11It's actually the same long division you learned in school, except that 13It's actually the same long division you learned in school, except that:
14
12- We're working in binary, so the digits are only 0 and 1, and 15- We're working in binary, so the digits are only 0 and 1, and
13- When dividing polynomials, there are no carries. Rather than add and 16- When dividing polynomials, there are no carries. Rather than add and
14 subtract, we just xor. Thus, we tend to get a bit sloppy about 17 subtract, we just xor. Thus, we tend to get a bit sloppy about
@@ -40,11 +43,12 @@ throw the quotient bit away, but subtract the appropriate multiple of
40the polynomial from the remainder and we're back to where we started, 43the polynomial from the remainder and we're back to where we started,
41ready to process the next bit. 44ready to process the next bit.
42 45
43A big-endian CRC written this way would be coded like: 46A big-endian CRC written this way would be coded like::
44for (i = 0; i < input_bits; i++) { 47
45 multiple = remainder & 0x80000000 ? CRCPOLY : 0; 48 for (i = 0; i < input_bits; i++) {
46 remainder = (remainder << 1 | next_input_bit()) ^ multiple; 49 multiple = remainder & 0x80000000 ? CRCPOLY : 0;
47} 50 remainder = (remainder << 1 | next_input_bit()) ^ multiple;
51 }
48 52
49Notice how, to get at bit 32 of the shifted remainder, we look 53Notice how, to get at bit 32 of the shifted remainder, we look
50at bit 31 of the remainder *before* shifting it. 54at bit 31 of the remainder *before* shifting it.
@@ -54,25 +58,26 @@ the remainder don't actually affect any decision-making until
5432 bits later. Thus, the first 32 cycles of this are pretty boring. 5832 bits later. Thus, the first 32 cycles of this are pretty boring.
55Also, to add the CRC to a message, we need a 32-bit-long hole for it at 59Also, to add the CRC to a message, we need a 32-bit-long hole for it at
56the end, so we have to add 32 extra cycles shifting in zeros at the 60the end, so we have to add 32 extra cycles shifting in zeros at the
57end of every message, 61end of every message.
58 62
59These details lead to a standard trick: rearrange merging in the 63These details lead to a standard trick: rearrange merging in the
60next_input_bit() until the moment it's needed. Then the first 32 cycles 64next_input_bit() until the moment it's needed. Then the first 32 cycles
61can be precomputed, and merging in the final 32 zero bits to make room 65can be precomputed, and merging in the final 32 zero bits to make room
62for the CRC can be skipped entirely. This changes the code to: 66for the CRC can be skipped entirely. This changes the code to::
63 67
64for (i = 0; i < input_bits; i++) { 68 for (i = 0; i < input_bits; i++) {
65 remainder ^= next_input_bit() << 31; 69 remainder ^= next_input_bit() << 31;
66 multiple = (remainder & 0x80000000) ? CRCPOLY : 0; 70 multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
67 remainder = (remainder << 1) ^ multiple; 71 remainder = (remainder << 1) ^ multiple;
68} 72 }
69 73
70With this optimization, the little-endian code is particularly simple: 74With this optimization, the little-endian code is particularly simple::
71for (i = 0; i < input_bits; i++) { 75
72 remainder ^= next_input_bit(); 76 for (i = 0; i < input_bits; i++) {
73 multiple = (remainder & 1) ? CRCPOLY : 0; 77 remainder ^= next_input_bit();
74 remainder = (remainder >> 1) ^ multiple; 78 multiple = (remainder & 1) ? CRCPOLY : 0;
75} 79 remainder = (remainder >> 1) ^ multiple;
80 }
76 81
77The most significant coefficient of the remainder polynomial is stored 82The most significant coefficient of the remainder polynomial is stored
78in the least significant bit of the binary "remainder" variable. 83in the least significant bit of the binary "remainder" variable.
@@ -81,23 +86,25 @@ be bit-reversed) and next_input_bit().
81 86
82As long as next_input_bit is returning the bits in a sensible order, we don't 87As long as next_input_bit is returning the bits in a sensible order, we don't
83*have* to wait until the last possible moment to merge in additional bits. 88*have* to wait until the last possible moment to merge in additional bits.
84We can do it 8 bits at a time rather than 1 bit at a time: 89We can do it 8 bits at a time rather than 1 bit at a time::
85for (i = 0; i < input_bytes; i++) { 90
86 remainder ^= next_input_byte() << 24; 91 for (i = 0; i < input_bytes; i++) {
87 for (j = 0; j < 8; j++) { 92 remainder ^= next_input_byte() << 24;
88 multiple = (remainder & 0x80000000) ? CRCPOLY : 0; 93 for (j = 0; j < 8; j++) {
89 remainder = (remainder << 1) ^ multiple; 94 multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
95 remainder = (remainder << 1) ^ multiple;
96 }
90 } 97 }
91}
92 98
93Or in little-endian: 99Or in little-endian::
94for (i = 0; i < input_bytes; i++) { 100
95 remainder ^= next_input_byte(); 101 for (i = 0; i < input_bytes; i++) {
96 for (j = 0; j < 8; j++) { 102 remainder ^= next_input_byte();
97 multiple = (remainder & 1) ? CRCPOLY : 0; 103 for (j = 0; j < 8; j++) {
98 remainder = (remainder >> 1) ^ multiple; 104 multiple = (remainder & 1) ? CRCPOLY : 0;
105 remainder = (remainder >> 1) ^ multiple;
106 }
99 } 107 }
100}
101 108
102If the input is a multiple of 32 bits, you can even XOR in a 32-bit 109If the input is a multiple of 32 bits, you can even XOR in a 32-bit
103word at a time and increase the inner loop count to 32. 110word at a time and increase the inner loop count to 32.
diff --git a/Documentation/dcdbas.txt b/Documentation/dcdbas.txt
index e1c52e2dc361..309cc57a7c1c 100644
--- a/Documentation/dcdbas.txt
+++ b/Documentation/dcdbas.txt
@@ -1,4 +1,9 @@
1===================================
2Dell Systems Management Base Driver
3===================================
4
1Overview 5Overview
6========
2 7
3The Dell Systems Management Base Driver provides a sysfs interface for 8The Dell Systems Management Base Driver provides a sysfs interface for
4systems management software such as Dell OpenManage to perform system 9systems management software such as Dell OpenManage to perform system
@@ -17,6 +22,7 @@ more information about the libsmbios project.
17 22
18 23
19System Management Interrupt 24System Management Interrupt
25===========================
20 26
21On some Dell systems, systems management software must access certain 27On some Dell systems, systems management software must access certain
22management information via a system management interrupt (SMI). The SMI data 28management information via a system management interrupt (SMI). The SMI data
@@ -24,12 +30,12 @@ buffer must reside in 32-bit address space, and the physical address of the
24buffer is required for the SMI. The driver maintains the memory required for 30buffer is required for the SMI. The driver maintains the memory required for
25the SMI and provides a way for the application to generate the SMI. 31the SMI and provides a way for the application to generate the SMI.
26The driver creates the following sysfs entries for systems management 32The driver creates the following sysfs entries for systems management
27software to perform these system management interrupts: 33software to perform these system management interrupts::
28 34
29/sys/devices/platform/dcdbas/smi_data 35 /sys/devices/platform/dcdbas/smi_data
30/sys/devices/platform/dcdbas/smi_data_buf_phys_addr 36 /sys/devices/platform/dcdbas/smi_data_buf_phys_addr
31/sys/devices/platform/dcdbas/smi_data_buf_size 37 /sys/devices/platform/dcdbas/smi_data_buf_size
32/sys/devices/platform/dcdbas/smi_request 38 /sys/devices/platform/dcdbas/smi_request
33 39
34Systems management software must perform the following steps to execute 40Systems management software must perform the following steps to execute
35an SMI using this driver: 41an SMI using this driver:
@@ -43,6 +49,7 @@ a SMI using this driver:
43 49
44 50
45Host Control Action 51Host Control Action
52===================
46 53
47Dell OpenManage supports a host control feature that allows the administrator 54Dell OpenManage supports a host control feature that allows the administrator
48to perform a power cycle or power off of the system after the OS has finished 55to perform a power cycle or power off of the system after the OS has finished
@@ -69,12 +76,14 @@ power off host control action using this driver:
69 76
70 77
71Host Control SMI Type 78Host Control SMI Type
79=====================
72 80
73The following table shows the value to write to host_control_smi_type to 81The following table shows the value to write to host_control_smi_type to
74perform a power cycle or power off host control action: 82perform a power cycle or power off host control action:
75 83
84=================== =====================
76PowerEdge System Host Control SMI Type 85PowerEdge System Host Control SMI Type
77---------------- --------------------- 86=================== =====================
78 300 HC_SMITYPE_TYPE1 87 300 HC_SMITYPE_TYPE1
79 1300 HC_SMITYPE_TYPE1 88 1300 HC_SMITYPE_TYPE1
80 1400 HC_SMITYPE_TYPE2 89 1400 HC_SMITYPE_TYPE2
@@ -87,5 +96,4 @@ PowerEdge System Host Control SMI Type
87 1655MC HC_SMITYPE_TYPE2 96 1655MC HC_SMITYPE_TYPE2
88 700 HC_SMITYPE_TYPE3 97 700 HC_SMITYPE_TYPE3
89 750 HC_SMITYPE_TYPE3 98 750 HC_SMITYPE_TYPE3
90 99=================== =====================
91
diff --git a/Documentation/debugging-via-ohci1394.txt b/Documentation/debugging-via-ohci1394.txt
index 9ff026d22b75..981ad4f89fd3 100644
--- a/Documentation/debugging-via-ohci1394.txt
+++ b/Documentation/debugging-via-ohci1394.txt
@@ -1,6 +1,6 @@
1 1===========================================================================
2 Using physical DMA provided by OHCI-1394 FireWire controllers for debugging 2Using physical DMA provided by OHCI-1394 FireWire controllers for debugging
3 --------------------------------------------------------------------------- 3===========================================================================
4 4
5Introduction 5Introduction
6------------ 6------------
@@ -91,10 +91,10 @@ Step-by-step instructions for using firescope with early OHCI initialization:
911) Verify that your hardware is supported: 911) Verify that your hardware is supported:
92 92
93 Load the firewire-ohci module and check your kernel logs. 93 Load the firewire-ohci module and check your kernel logs.
94 You should see a line similar to 94 You should see a line similar to::
95 95
96 firewire_ohci 0000:15:00.1: added OHCI v1.0 device as card 2, 4 IR + 4 IT 96 firewire_ohci 0000:15:00.1: added OHCI v1.0 device as card 2, 4 IR + 4 IT
97 ... contexts, quirks 0x11 97 ... contexts, quirks 0x11
98 98
99 when loading the driver. If you have no supported controller, many PCI, 99 when loading the driver. If you have no supported controller, many PCI,
100 CardBus and even some Express cards which are fully compliant with OHCI-1394 100 CardBus and even some Express cards which are fully compliant with OHCI-1394
@@ -113,9 +113,9 @@ Step-by-step instructions for using firescope with early OHCI initialization:
113 stable connection and has matching connectors (there are small 4-pin and 113 stable connection and has matching connectors (there are small 4-pin and
114 large 6-pin FireWire ports) will do. 114 large 6-pin FireWire ports) will do.
115 115
116 If a driver is running on both machines you should see a line like 116 If a driver is running on both machines you should see a line like::
117 117
118 firewire_core 0000:15:00.1: created device fw1: GUID 00061b0020105917, S400 118 firewire_core 0000:15:00.1: created device fw1: GUID 00061b0020105917, S400
119 119
120 on both machines in the kernel log when the cable is plugged in 120 on both machines in the kernel log when the cable is plugged in
121 and connects the two machines. 121 and connects the two machines.
@@ -123,7 +123,7 @@ Step-by-step instructions for using firescope with early OHCI initialization:
1233) Test physical DMA using firescope: 1233) Test physical DMA using firescope:
124 124
125 On the debug host, make sure that /dev/fw* is accessible, 125 On the debug host, make sure that /dev/fw* is accessible,
126 then start firescope: 126 then start firescope::
127 127
128 $ firescope 128 $ firescope
129 Port 0 (/dev/fw1) opened, 2 nodes detected 129 Port 0 (/dev/fw1) opened, 2 nodes detected
@@ -163,7 +163,7 @@ Step-by-step instructions for using firescope with early OHCI initialization:
163 host loaded, reboot the debugged machine, booting the kernel which has 163 host loaded, reboot the debugged machine, booting the kernel which has
164 CONFIG_PROVIDE_OHCI1394_DMA_INIT enabled, with the option ohci1394_dma=early. 164 CONFIG_PROVIDE_OHCI1394_DMA_INIT enabled, with the option ohci1394_dma=early.
165 165
166 Then, on the debugging host, run firescope, for example by using -A: 166 Then, on the debugging host, run firescope, for example by using -A::
167 167
168 firescope -A System.map-of-debug-target-kernel 168 firescope -A System.map-of-debug-target-kernel
169 169
@@ -178,6 +178,7 @@ Step-by-step instructions for using firescope with early OHCI initialization:
178 178
179Notes 179Notes
180----- 180-----
181
181Documentation and specifications: http://halobates.de/firewire/ 182Documentation and specifications: http://halobates.de/firewire/
182 183
183FireWire is a trademark of Apple Inc. - for more information please refer to: 184FireWire is a trademark of Apple Inc. - for more information please refer to:
diff --git a/Documentation/dell_rbu.txt b/Documentation/dell_rbu.txt
index d262e22bddec..0fdb6aa2704c 100644
--- a/Documentation/dell_rbu.txt
+++ b/Documentation/dell_rbu.txt
@@ -1,18 +1,30 @@
1Purpose: 1=============================================================
2Demonstrate the usage of the new open sourced rbu (Remote BIOS Update) driver 2Usage of the new open sourced rbu (Remote BIOS Update) driver
3=============================================================
4
5Purpose
6=======
7
8Document demonstrating the use of the Dell Remote BIOS Update driver
3for updating BIOS images on Dell servers and desktops. 9for updating BIOS images on Dell servers and desktops.
4 10
5Scope: 11Scope
12=====
13
6This document discusses the functionality of the rbu driver only. 14This document discusses the functionality of the rbu driver only.
7It does not cover the support needed from applications to enable the BIOS to 15It does not cover the support needed from applications to enable the BIOS to
8update itself with the image downloaded into memory. 16update itself with the image downloaded into memory.
9 17
10Overview: 18Overview
19========
20
11This driver works with Dell OpenManage or Dell Update Packages for updating 21This driver works with Dell OpenManage or Dell Update Packages for updating
12the BIOS on Dell servers (starting from servers sold since 1999), desktops 22the BIOS on Dell servers (starting from servers sold since 1999), desktops
13and notebooks (starting from those sold in 2005). 23and notebooks (starting from those sold in 2005).
24
14Please go to http://support.dell.com and register; there you can find info on 25Please go to http://support.dell.com and register; there you can find info on
15OpenManage and Dell Update packages (DUP). 26OpenManage and Dell Update packages (DUP).
27
16Libsmbios can also be used to update BIOS on Dell systems; go to 28Libsmbios can also be used to update BIOS on Dell systems; go to
17http://linux.dell.com/libsmbios/ for details. 29http://linux.dell.com/libsmbios/ for details.
18 30
@@ -22,6 +34,7 @@ of physical pages having the BIOS image. In case of packetized the app
22using the driver breaks the image into packets of fixed sizes and the driver 34using the driver breaks the image into packets of fixed sizes and the driver
23would place each packet in contiguous physical memory. The driver also 35would place each packet in contiguous physical memory. The driver also
24maintains a linked list of packets for reading them back. 36maintains a linked list of packets for reading them back.
37
25If the dell_rbu driver is unloaded all the allocated memory is freed. 38If the dell_rbu driver is unloaded all the allocated memory is freed.
26 39
27The rbu driver needs to have an application (as mentioned above) which will 40The rbu driver needs to have an application (as mentioned above) which will
@@ -30,28 +43,33 @@ inform the BIOS to enable the update in the next system reboot.
30The user should not unload the rbu driver after downloading the BIOS image 43The user should not unload the rbu driver after downloading the BIOS image
31or updating. 44or updating.
32 45
33The driver load creates the following directories under the /sys file system. 46The driver load creates the following directories under the /sys file system::
34/sys/class/firmware/dell_rbu/loading 47
35/sys/class/firmware/dell_rbu/data 48 /sys/class/firmware/dell_rbu/loading
36/sys/devices/platform/dell_rbu/image_type 49 /sys/class/firmware/dell_rbu/data
37/sys/devices/platform/dell_rbu/data 50 /sys/devices/platform/dell_rbu/image_type
38/sys/devices/platform/dell_rbu/packet_size 51 /sys/devices/platform/dell_rbu/data
52 /sys/devices/platform/dell_rbu/packet_size
39 53
40The driver supports two types of update mechanism: monolithic and packetized. 54The driver supports two types of update mechanism: monolithic and packetized.
41These update mechanisms depend upon the BIOS currently running on the system. 55These update mechanisms depend upon the BIOS currently running on the system.
42Most of the Dell systems support a monolithic update where the BIOS image is 56Most of the Dell systems support a monolithic update where the BIOS image is
43copied to a single contiguous block of physical memory. 57copied to a single contiguous block of physical memory.
58
44In case of the packet mechanism the memory can be broken into smaller chunks 59In case of the packet mechanism the memory can be broken into smaller chunks
45of contiguous memory and the BIOS image is scattered in these packets. 60of contiguous memory and the BIOS image is scattered in these packets.
46 61
47By default the driver uses monolithic memory for the update type. This can be 62By default the driver uses monolithic memory for the update type. This can be
48changed to packets during the driver load time by specifying the load 63changed to packets during the driver load time by specifying the load
49parameter image_type=packet. This can also be changed later as below 64parameter image_type=packet. This can also be changed later as below::
50echo packet > /sys/devices/platform/dell_rbu/image_type 65
66 echo packet > /sys/devices/platform/dell_rbu/image_type
51 67
52In packet update mode the packet size has to be given before any packets can 68In packet update mode the packet size has to be given before any packets can
53be downloaded. It is done as below 69be downloaded. It is done as below::
54echo XXXX > /sys/devices/platform/dell_rbu/packet_size 70
71 echo XXXX > /sys/devices/platform/dell_rbu/packet_size
72
55In the packet update mechanism, the user needs to create a new file having 73In the packet update mechanism, the user needs to create a new file having
56packets of data arranged back to back. It can be done as follows. 74packets of data arranged back to back. It can be done as follows.
57The user creates a packet header, gets a chunk of the BIOS image and 75The user creates a packet header, gets a chunk of the BIOS image and
@@ -60,41 +78,54 @@ added together should match the specified packet_size. This makes one
60packet, the user needs to create more such packets out of the entire BIOS 78packet, the user needs to create more such packets out of the entire BIOS
61image file and then arrange all these packets back to back into one single 79image file and then arrange all these packets back to back into one single
62file. 80file.
81
63This file is then copied to /sys/class/firmware/dell_rbu/data. 82This file is then copied to /sys/class/firmware/dell_rbu/data.
64Once this file gets to the driver, the driver extracts packet_size data from 83Once this file gets to the driver, the driver extracts packet_size data from
65the file and spreads it across the physical memory in contiguous packet_sized 84the file and spreads it across the physical memory in contiguous packet_sized
66space. 85space.
86
67This method makes sure that all the packets get to the driver in a single operation. 87This method makes sure that all the packets get to the driver in a single operation.
68 88
69In monolithic update the user simply gets the BIOS image (.hdr file) and copies 89In monolithic update the user simply gets the BIOS image (.hdr file) and copies
70it to the data file as is, without any change to the BIOS image itself. 90it to the data file as is, without any change to the BIOS image itself.
71 91
72Do the steps below to download the BIOS image. 92Do the steps below to download the BIOS image.
93
731) echo 1 > /sys/class/firmware/dell_rbu/loading 941) echo 1 > /sys/class/firmware/dell_rbu/loading
742) cp bios_image.hdr /sys/class/firmware/dell_rbu/data 952) cp bios_image.hdr /sys/class/firmware/dell_rbu/data
753) echo 0 > /sys/class/firmware/dell_rbu/loading 963) echo 0 > /sys/class/firmware/dell_rbu/loading
76 97
77The /sys/class/firmware/dell_rbu/ entries will remain until the following is 98The /sys/class/firmware/dell_rbu/ entries will remain until the following is
78done. 99done.
79echo -1 > /sys/class/firmware/dell_rbu/loading 100
101::
102
103 echo -1 > /sys/class/firmware/dell_rbu/loading
104
80Until this step is completed the driver cannot be unloaded. 105Until this step is completed the driver cannot be unloaded.
106
81Also echoing either mono, packet or init into image_type will free up the 107Also echoing either mono, packet or init into image_type will free up the
82memory allocated by the driver. 108memory allocated by the driver.
83 109
84If a user by accident executes steps 1 and 3 above without executing step 2, 110If a user by accident executes steps 1 and 3 above without executing step 2,
85it will make the /sys/class/firmware/dell_rbu/ entries disappear. 111it will make the /sys/class/firmware/dell_rbu/ entries disappear.
86The entries can be recreated by doing the following 112
87echo init > /sys/devices/platform/dell_rbu/image_type 113The entries can be recreated by doing the following::
88NOTE: echoing init in image_type does not change its original value. 114
115 echo init > /sys/devices/platform/dell_rbu/image_type
116
117.. note:: echoing init in image_type does not change its original value.
89 118
90Also the driver provides /sys/devices/platform/dell_rbu/data readonly file to 119Also the driver provides /sys/devices/platform/dell_rbu/data readonly file to
91read back the image downloaded. 120read back the image downloaded.
92 121
93NOTE: 122.. note::
94This driver requires a patch for firmware_class.c which has the modified 123
95request_firmware_nowait function. 124 This driver requires a patch for firmware_class.c which has the modified
96Also after updating the BIOS image a user mode application needs to execute 125 request_firmware_nowait function.
97code which sends the BIOS update request to the BIOS. So on the next reboot 126
98the BIOS knows about the new image downloaded and it updates itself. 127 Also after updating the BIOS image a user mode application needs to execute
99Also don't unload the rbu driver if the image has to be updated. 128 code which sends the BIOS update request to the BIOS. So on the next reboot
129 the BIOS knows about the new image downloaded and it updates itself.
130 Also don't unload the rbu driver if the image has to be updated.
100 131
diff --git a/Documentation/digsig.txt b/Documentation/digsig.txt
index 3f682889068b..f6a8902d3ef7 100644
--- a/Documentation/digsig.txt
+++ b/Documentation/digsig.txt
@@ -1,13 +1,20 @@
1==================================
1Digital Signature Verification API 2Digital Signature Verification API
3==================================
2 4
3CONTENTS 5:Author: Dmitry Kasatkin
6:Date: 06.10.2011
4 7
51. Introduction
62. API
73. User-space utilities
8 8
9.. CONTENTS
9 10
101. Introduction 11 1. Introduction
12 2. API
13 3. User-space utilities
14
15
16Introduction
17============
11 18
12Digital signature verification API provides a method to verify digital signatures. 19Digital signature verification API provides a method to verify digital signatures.
13Currently digital signatures are used by the IMA/EVM integrity protection subsystem. 20Currently digital signatures are used by the IMA/EVM integrity protection subsystem.
@@ -17,25 +24,25 @@ GnuPG multi-precision integers (MPI) library. The kernel port provides
17memory allocation error handling, has been refactored according to kernel 24memory allocation error handling, has been refactored according to kernel
18coding style, and checkpatch.pl-reported errors and warnings have been fixed. 25coding style, and checkpatch.pl-reported errors and warnings have been fixed.
19 26
20Public key and signature consist of header and MPIs. 27Public key and signature consist of header and MPIs::
21 28
22struct pubkey_hdr { 29 struct pubkey_hdr {
23 uint8_t version; /* key format version */ 30 uint8_t version; /* key format version */
24 time_t timestamp; /* key made, always 0 for now */ 31 time_t timestamp; /* key made, always 0 for now */
25 uint8_t algo; 32 uint8_t algo;
26 uint8_t nmpi; 33 uint8_t nmpi;
27 char mpi[0]; 34 char mpi[0];
28} __packed; 35 } __packed;
29 36
30struct signature_hdr { 37 struct signature_hdr {
31 uint8_t version; /* signature format version */ 38 uint8_t version; /* signature format version */
32 time_t timestamp; /* signature made */ 39 time_t timestamp; /* signature made */
33 uint8_t algo; 40 uint8_t algo;
34 uint8_t hash; 41 uint8_t hash;
35 uint8_t keyid[8]; 42 uint8_t keyid[8];
36 uint8_t nmpi; 43 uint8_t nmpi;
37 char mpi[0]; 44 char mpi[0];
38} __packed; 45 } __packed;
39 46
40keyid equals SHA1[12-19] over the total key content. 47keyid equals SHA1[12-19] over the total key content.
41The signature header is used as an input to generate a signature. 48The signature header is used as an input to generate a signature.
@@ -43,31 +50,33 @@ Such an approach ensures that the key or signature header cannot be changed.
43It protects the timestamp from being changed and can be used for rollback 50It protects the timestamp from being changed and can be used for rollback
44protection. 51protection.
45 52
462. API 53API
54===
47 55
48API currently includes only 1 function: 56API currently includes only 1 function::
49 57
50 digsig_verify() - digital signature verification with public key 58 digsig_verify() - digital signature verification with public key
51 59
52 60
53/** 61 /**
54 * digsig_verify() - digital signature verification with public key 62 * digsig_verify() - digital signature verification with public key
55 * @keyring: keyring to search key in 63 * @keyring: keyring to search key in
56 * @sig: digital signature 64 * @sig: digital signature
57 * @sigen: length of the signature 65 * @sigen: length of the signature
58 * @data: data 66 * @data: data
59 * @datalen: length of the data 67 * @datalen: length of the data
60 * @return: 0 on success, -EINVAL otherwise 68 * @return: 0 on success, -EINVAL otherwise
61 * 69 *
62 * Verifies data integrity against digital signature. 70 * Verifies data integrity against digital signature.
63 * Currently only RSA is supported. 71 * Currently only RSA is supported.
64 * Normally hash of the content is used as a data for this function. 72 * Normally hash of the content is used as a data for this function.
65 * 73 *
66 */ 74 */
67int digsig_verify(struct key *keyring, const char *sig, int siglen, 75 int digsig_verify(struct key *keyring, const char *sig, int siglen,
68 const char *data, int datalen); 76 const char *data, int datalen);
69 77
703. User-space utilities 78User-space utilities
79====================
71 80
72The signing and key management utilities evm-utils provide functionality 81The signing and key management utilities evm-utils provide functionality
73to generate signatures and to load keys into the kernel keyring. 82to generate signatures and to load keys into the kernel keyring.
@@ -75,22 +84,18 @@ Keys can be in PEM or converted to the kernel format.
75When the key is added to the kernel keyring, the keyid defines the name 84When the key is added to the kernel keyring, the keyid defines the name
76of the key: 5D2B05FC633EE3E8 in the example below. 85of the key: 5D2B05FC633EE3E8 in the example below.
77 86
78Here is example output of the keyctl utility. 87Here is example output of the keyctl utility::
79 88
80$ keyctl show 89 $ keyctl show
81Session Keyring 90 Session Keyring
82 -3 --alswrv 0 0 keyring: _ses 91 -3 --alswrv 0 0 keyring: _ses
83603976250 --alswrv 0 -1 \_ keyring: _uid.0 92 603976250 --alswrv 0 -1 \_ keyring: _uid.0
84817777377 --alswrv 0 0 \_ user: kmk 93 817777377 --alswrv 0 0 \_ user: kmk
85891974900 --alswrv 0 0 \_ encrypted: evm-key 94 891974900 --alswrv 0 0 \_ encrypted: evm-key
86170323636 --alswrv 0 0 \_ keyring: _module 95 170323636 --alswrv 0 0 \_ keyring: _module
87548221616 --alswrv 0 0 \_ keyring: _ima 96 548221616 --alswrv 0 0 \_ keyring: _ima
88128198054 --alswrv 0 0 \_ keyring: _evm 97 128198054 --alswrv 0 0 \_ keyring: _evm
89 98
90$ keyctl list 128198054 99 $ keyctl list 128198054
911 key in keyring: 100 1 key in keyring:
92620789745: --alswrv 0 0 user: 5D2B05FC633EE3E8 101 620789745: --alswrv 0 0 user: 5D2B05FC633EE3E8
93
94
95Dmitry Kasatkin
9606.10.2011
diff --git a/Documentation/efi-stub.txt b/Documentation/efi-stub.txt
index e15746988261..41df801f9a50 100644
--- a/Documentation/efi-stub.txt
+++ b/Documentation/efi-stub.txt
@@ -1,5 +1,6 @@
1 The EFI Boot Stub 1=================
2 --------------------------- 2The EFI Boot Stub
3=================
3 4
4On the x86 and ARM platforms, a kernel zImage/bzImage can masquerade 5On the x86 and ARM platforms, a kernel zImage/bzImage can masquerade
5as a PE/COFF image, thereby convincing EFI firmware loaders to load 6as a PE/COFF image, thereby convincing EFI firmware loaders to load
@@ -25,7 +26,8 @@ a certain sense it *IS* the boot loader.
25The EFI boot stub is enabled with the CONFIG_EFI_STUB kernel option. 26The EFI boot stub is enabled with the CONFIG_EFI_STUB kernel option.
26 27
27 28
28**** How to install bzImage.efi 29How to install bzImage.efi
30--------------------------
29 31
30The bzImage located in arch/x86/boot/bzImage must be copied to the EFI 32The bzImage located in arch/x86/boot/bzImage must be copied to the EFI
31System Partition (ESP) and renamed with the extension ".efi". Without 33System Partition (ESP) and renamed with the extension ".efi". Without
@@ -37,14 +39,16 @@ may not need to be renamed. Similarly for arm64, arch/arm64/boot/Image
37should be copied but not necessarily renamed. 39should be copied but not necessarily renamed.
38 40
39 41
40**** Passing kernel parameters from the EFI shell 42Passing kernel parameters from the EFI shell
43--------------------------------------------
41 44
42Arguments to the kernel can be passed after bzImage.efi, e.g. 45Arguments to the kernel can be passed after bzImage.efi, e.g.::
43 46
44 fs0:> bzImage.efi console=ttyS0 root=/dev/sda4 47 fs0:> bzImage.efi console=ttyS0 root=/dev/sda4
45 48
46 49
47**** The "initrd=" option 50The "initrd=" option
51--------------------
48 52
49Like most boot loaders, the EFI stub allows the user to specify 53Like most boot loaders, the EFI stub allows the user to specify
50multiple initrd files using the "initrd=" option. This is the only EFI 54multiple initrd files using the "initrd=" option. This is the only EFI
@@ -54,9 +58,9 @@ kernel when it boots.
54The path to the initrd file must be an absolute path from the 58The path to the initrd file must be an absolute path from the
55beginning of the ESP, relative path names do not work. Also, the path 59beginning of the ESP, relative path names do not work. Also, the path
56is an EFI-style path and directory elements must be separated with 60is an EFI-style path and directory elements must be separated with
57backslashes (\). For example, given the following directory layout, 61backslashes (\). For example, given the following directory layout::
58 62
59fs0:> 63 fs0:>
60 Kernels\ 64 Kernels\
61 bzImage.efi 65 bzImage.efi
62 initrd-large.img 66 initrd-large.img
@@ -66,7 +70,7 @@ fs0:>
66 initrd-medium.img 70 initrd-medium.img
67 71
68to boot with the initrd-large.img file if the current working 72to boot with the initrd-large.img file if the current working
69directory is fs0:\Kernels, the following command must be used, 73directory is fs0:\Kernels, the following command must be used::
70 74
71 fs0:\Kernels> bzImage.efi initrd=\Kernels\initrd-large.img 75 fs0:\Kernels> bzImage.efi initrd=\Kernels\initrd-large.img
72 76
@@ -76,7 +80,8 @@ which understands relative paths, whereas the rest of the command line
76is passed to bzImage.efi. 80is passed to bzImage.efi.
77 81
78 82
79**** The "dtb=" option 83The "dtb=" option
84-----------------
80 85
81For the ARM and arm64 architectures, we also need to be able to provide a 86For the ARM and arm64 architectures, we also need to be able to provide a
82device tree to the kernel. This is done with the "dtb=" command line option, 87device tree to the kernel. This is done with the "dtb=" command line option,
diff --git a/Documentation/eisa.txt b/Documentation/eisa.txt
index a55e4910924e..2806e5544e43 100644
--- a/Documentation/eisa.txt
+++ b/Documentation/eisa.txt
@@ -1,4 +1,8 @@
1EISA bus support (Marc Zyngier <maz@wild-wind.fr.eu.org>) 1================
2EISA bus support
3================
4
5:Author: Marc Zyngier <maz@wild-wind.fr.eu.org>
2 6
3This document groups random notes about porting EISA drivers to the 7This document groups random notes about porting EISA drivers to the
4new EISA/sysfs API. 8new EISA/sysfs API.
@@ -14,168 +18,189 @@ detection code is generally also used to probe ISA cards). Moreover,
14most EISA drivers are among the oldest Linux drivers so, as you can 18most EISA drivers are among the oldest Linux drivers so, as you can
15imagine, some dust has settled here over the years. 19imagine, some dust has settled here over the years.
16 20
17The EISA infrastructure is made up of three parts : 21The EISA infrastructure is made up of three parts:
18 22
19 - The bus code implements most of the generic code. It is shared 23 - The bus code implements most of the generic code. It is shared
20 among all the architectures that the EISA code runs on. It 24 among all the architectures that the EISA code runs on. It
21 implements bus probing (detecting EISA cards available on the bus), 25 implements bus probing (detecting EISA cards available on the bus),
22 allocates I/O resources, allows fancy naming through sysfs, and 26 allocates I/O resources, allows fancy naming through sysfs, and
23 offers interfaces for drivers to register. 27 offers interfaces for drivers to register.
24 28
25 - The bus root driver implements the glue between the bus hardware 29 - The bus root driver implements the glue between the bus hardware
26 and the generic bus code. It is responsible for discovering the 30 and the generic bus code. It is responsible for discovering the
27 device implementing the bus, and setting it up to be latter probed 31 device implementing the bus, and setting it up to be latter probed
28 by the bus code. This can go from something as simple as reserving 32 by the bus code. This can go from something as simple as reserving
29 an I/O region on x86, to the rather more complex, like the hppa 33 an I/O region on x86, to the rather more complex, like the hppa
30 EISA code. This is the part to implement in order to have EISA 34 EISA code. This is the part to implement in order to have EISA
31 running on a "new" platform. 35 running on a "new" platform.
32 36
33 - The driver offers the bus a list of devices that it manages, and 37 - The driver offers the bus a list of devices that it manages, and
34 implements the necessary callbacks to probe and release devices 38 implements the necessary callbacks to probe and release devices
35 whenever told to. 39 whenever told to.
36 40
37Every function/structure below lives in <linux/eisa.h>, which depends 41Every function/structure below lives in <linux/eisa.h>, which depends
38heavily on <linux/device.h>. 42heavily on <linux/device.h>.
39 43
40** Bus root driver : 44Bus root driver
45===============
46
47::
41 48
42int eisa_root_register (struct eisa_root_device *root); 49 int eisa_root_register (struct eisa_root_device *root);
43 50
44The eisa_root_register function is used to declare a device as the 51The eisa_root_register function is used to declare a device as the
45root of an EISA bus. The eisa_root_device structure holds a reference 52root of an EISA bus. The eisa_root_device structure holds a reference
46to this device, as well as some parameters for probing purposes. 53to this device, as well as some parameters for probing purposes::
47 54
48struct eisa_root_device { 55 struct eisa_root_device {
49 struct device *dev; /* Pointer to bridge device */ 56 struct device *dev; /* Pointer to bridge device */
50 struct resource *res; 57 struct resource *res;
51 unsigned long bus_base_addr; 58 unsigned long bus_base_addr;
52 int slots; /* Max slot number */ 59 int slots; /* Max slot number */
53 int force_probe; /* Probe even when no slot 0 */ 60 int force_probe; /* Probe even when no slot 0 */
54 u64 dma_mask; /* from bridge device */ 61 u64 dma_mask; /* from bridge device */
55 int bus_nr; /* Set by eisa_root_register */ 62 int bus_nr; /* Set by eisa_root_register */
56 struct resource eisa_root_res; /* ditto */ 63 struct resource eisa_root_res; /* ditto */
57}; 64 };
58 65
59node : used for eisa_root_register internal purpose 66============= ======================================================
60dev : pointer to the root device 67node used for eisa_root_register internal purpose
61res : root device I/O resource 68dev pointer to the root device
62bus_base_addr : slot 0 address on this bus 69res root device I/O resource
63slots : max slot number to probe 70bus_base_addr slot 0 address on this bus
64force_probe : Probe even when slot 0 is empty (no EISA mainboard) 71slots max slot number to probe
65dma_mask : Default DMA mask. Usually the bridge device dma_mask. 72force_probe Probe even when slot 0 is empty (no EISA mainboard)
66bus_nr : unique bus id, set by eisa_root_register 73dma_mask Default DMA mask. Usually the bridge device dma_mask.
67 74bus_nr unique bus id, set by eisa_root_register
68** Driver : 75============= ======================================================
69 76
70int eisa_driver_register (struct eisa_driver *edrv); 77Driver
71void eisa_driver_unregister (struct eisa_driver *edrv); 78======
79
80::
81
82 int eisa_driver_register (struct eisa_driver *edrv);
83 void eisa_driver_unregister (struct eisa_driver *edrv);
72 84
73Clear enough? 85Clear enough?
74 86
75struct eisa_device_id { 87::
76 char sig[EISA_SIG_LEN]; 88
77 unsigned long driver_data; 89 struct eisa_device_id {
78}; 90 char sig[EISA_SIG_LEN];
79 91 unsigned long driver_data;
80struct eisa_driver { 92 };
81 const struct eisa_device_id *id_table; 93
82 struct device_driver driver; 94 struct eisa_driver {
83}; 95 const struct eisa_device_id *id_table;
84 96 struct device_driver driver;
85id_table : an array of NULL terminated EISA id strings, 97 };
86 followed by an empty string. Each string can 98
87 optionally be paired with a driver-dependent value 99=============== ====================================================
88 (driver_data). 100id_table an array of NULL terminated EISA id strings,
89 101 followed by an empty string. Each string can
90driver : a generic driver, such as described in 102 optionally be paired with a driver-dependent value
91 Documentation/driver-model/driver.txt. Only .name, 103 (driver_data).
92 .probe and .remove members are mandatory. 104
93 105driver a generic driver, such as described in
94An example is the 3c59x driver : 106 Documentation/driver-model/driver.txt. Only .name,
95 107 .probe and .remove members are mandatory.
96static struct eisa_device_id vortex_eisa_ids[] = { 108=============== ====================================================
97 { "TCM5920", EISA_3C592_OFFSET }, 109
98 { "TCM5970", EISA_3C597_OFFSET }, 110An example is the 3c59x driver::
99 { "" } 111
100}; 112 static struct eisa_device_id vortex_eisa_ids[] = {
101 113 { "TCM5920", EISA_3C592_OFFSET },
102static struct eisa_driver vortex_eisa_driver = { 114 { "TCM5970", EISA_3C597_OFFSET },
103 .id_table = vortex_eisa_ids, 115 { "" }
104 .driver = { 116 };
105 .name = "3c59x", 117
106 .probe = vortex_eisa_probe, 118 static struct eisa_driver vortex_eisa_driver = {
107 .remove = vortex_eisa_remove 119 .id_table = vortex_eisa_ids,
108 } 120 .driver = {
109}; 121 .name = "3c59x",
110 122 .probe = vortex_eisa_probe,
111** Device : 123 .remove = vortex_eisa_remove
124 }
125 };
126
127Device
128======
112 129
113The sysfs framework calls .probe and .remove functions upon device 130The sysfs framework calls .probe and .remove functions upon device
114discovery and removal (note that the .remove function is only called 131discovery and removal (note that the .remove function is only called
115when driver is built as a module). 132when driver is built as a module).
116 133
117Both functions are passed a pointer to a 'struct device', which is 134Both functions are passed a pointer to a 'struct device', which is
118encapsulated in a 'struct eisa_device' described as follows : 135encapsulated in a 'struct eisa_device' described as follows::
119 136
120struct eisa_device { 137 struct eisa_device {
121 struct eisa_device_id id; 138 struct eisa_device_id id;
122 int slot; 139 int slot;
123 int state; 140 int state;
124 unsigned long base_addr; 141 unsigned long base_addr;
125 struct resource res[EISA_MAX_RESOURCES]; 142 struct resource res[EISA_MAX_RESOURCES];
126 u64 dma_mask; 143 u64 dma_mask;
127 struct device dev; /* generic device */ 144 struct device dev; /* generic device */
128}; 145 };
129 146
130id : EISA id, as read from device. id.driver_data is set from the 147======== ============================================================
131 matching driver EISA id. 148id EISA id, as read from device. id.driver_data is set from the
132slot : slot number which the device was detected on 149 matching driver EISA id.
133state : set of flags indicating the state of the device. Current 150slot slot number which the device was detected on
134 flags are EISA_CONFIG_ENABLED and EISA_CONFIG_FORCED. 151state set of flags indicating the state of the device. Current
135res : set of four 256-byte I/O regions allocated to this device 152 flags are EISA_CONFIG_ENABLED and EISA_CONFIG_FORCED.
136dma_mask: DMA mask set from the parent device. 153res set of four 256-byte I/O regions allocated to this device
137dev : generic device (see Documentation/driver-model/device.txt) 154dma_mask DMA mask set from the parent device.
155dev generic device (see Documentation/driver-model/device.txt)
156======== ============================================================
138 157
139You can get the 'struct eisa_device' from 'struct device' using the 158You can get the 'struct eisa_device' from 'struct device' using the
140'to_eisa_device' macro. 159'to_eisa_device' macro.
141 160
142** Misc stuff : 161Misc stuff
162==========
163
164::
143 165
144void eisa_set_drvdata (struct eisa_device *edev, void *data); 166 void eisa_set_drvdata (struct eisa_device *edev, void *data);
145 167
146Stores data into the device's driver_data area. 168Stores data into the device's driver_data area.
147 169
148void *eisa_get_drvdata (struct eisa_device *edev): 170::
171
172 void *eisa_get_drvdata (struct eisa_device *edev):
149 173
150Gets the pointer previously stored into the device's driver_data area. 174Gets the pointer previously stored into the device's driver_data area.
151 175
152int eisa_get_region_index (void *addr); 176::
177
178 int eisa_get_region_index (void *addr);
153 179
154Returns the region number (0 <= x < EISA_MAX_RESOURCES) of a given 180Returns the region number (0 <= x < EISA_MAX_RESOURCES) of a given
155address. 181address.
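
In a driver, the two driver_data helpers are typically paired across the
probe and remove callbacks. A minimal sketch (the foo names and the
private structure are invented for illustration; they are not part of the
EISA core)::

    struct foo_priv {
            int irq;
    };

    static int foo_probe(struct device *dev)
    {
            struct eisa_device *edev = to_eisa_device(dev);
            struct foo_priv *priv;

            priv = kzalloc(sizeof(*priv), GFP_KERNEL);
            if (!priv)
                    return -ENOMEM;

            /* stash our state; fetched again in foo_remove() */
            eisa_set_drvdata(edev, priv);
            return 0;
    }

    static int foo_remove(struct device *dev)
    {
            struct eisa_device *edev = to_eisa_device(dev);
            struct foo_priv *priv = eisa_get_drvdata(edev);

            kfree(priv);
            return 0;
    }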

Kernel parameters
=================

eisa_bus.enable_dev
    A comma-separated list of slots to be enabled, even if the firmware
    set the card as disabled. The driver must be able to properly
    initialize the device in such conditions.

eisa_bus.disable_dev
    A comma-separated list of slots to be disabled, even if the firmware
    set the card as enabled. The driver won't be called to handle this
    device.

virtual_root.force_probe
    Force the probing code to probe EISA slots even when it cannot find an
    EISA compliant mainboard (nothing appears on slot 0). Defaults to 0
    (don't force), and set to 1 (force probing) when either
    CONFIG_ALPHA_JENSEN or CONFIG_EISA_VLB_PRIMING are set.
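
As an illustration (the slot numbers here are hypothetical), forcing
slots 3 and 5 on while keeping the driver away from slot 7 would look
like this on the kernel command line::

    eisa_bus.enable_dev=3,5 eisa_bus.disable_dev=7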

Random notes
============
Converting an EISA driver to the new API mostly involves *deleting*
code (since probing is now in the core EISA code). Unfortunately, most
@@ -194,9 +219,11 @@ routine.
For example, switching your favorite EISA SCSI card to the "hotplug"
model is "the right thing"(tm).

Thanks
======

I'd like to thank the following people for their help:

- Xavier Benigni for lending me a wonderful Alpha Jensen,
- James Bottomley, Jeff Garzik for getting this stuff into the kernel,
- Andries Brouwer for contributing numerous EISA ids,
diff --git a/Documentation/flexible-arrays.txt b/Documentation/flexible-arrays.txt
index df904aec9904..a0f2989dd804 100644
--- a/Documentation/flexible-arrays.txt
+++ b/Documentation/flexible-arrays.txt
@@ -1,6 +1,9 @@
===================================
Using flexible arrays in the kernel
===================================

:Updated: Last updated for 2.6.32
:Author: Jonathan Corbet <corbet@lwn.net>

Large contiguous memory allocations can be unreliable in the Linux kernel.
Kernel programmers will sometimes respond to this problem by allocating
@@ -26,7 +29,7 @@ operation. It's also worth noting that flexible arrays do no internal
locking at all; if concurrent access to an array is possible, then the
caller must arrange for appropriate mutual exclusion.

The creation of a flexible array is done with::

    #include <linux/flex_array.h>

@@ -40,14 +43,14 @@ argument is passed directly to the internal memory allocation calls. With
the current code, using flags to ask for high memory is likely to lead to
notably unpleasant side effects.

It is also possible to define flexible arrays at compile time with::

    DEFINE_FLEX_ARRAY(name, element_size, total);

This macro will result in a definition of an array with the given name; the
element size and total will be checked for validity at compile time.

Storing data into a flexible array is accomplished with a call to::

    int flex_array_put(struct flex_array *array, unsigned int element_nr,
                       void *src, gfp_t flags);
@@ -63,7 +66,7 @@ running in some sort of atomic context; in this situation, sleeping in the
memory allocator would be a bad thing. That can be avoided by using
GFP_ATOMIC for the flags value, but, often, there is a better way. The
trick is to ensure that any needed memory allocations are done before
entering atomic context, using::

    int flex_array_prealloc(struct flex_array *array, unsigned int start,
                            unsigned int nr_elements, gfp_t flags);
@@ -73,7 +76,7 @@ defined by start and nr_elements has been allocated. Thereafter, a
flex_array_put() call on an element in that range is guaranteed not to
block.

Getting data back out of the array is done with::

    void *flex_array_get(struct flex_array *fa, unsigned int element_nr);

@@ -89,7 +92,7 @@ involving that number probably result from use of unstored array entries.
Note that, if array elements are allocated with __GFP_ZERO, they will be
initialized to zero and this poisoning will not happen.

Individual elements in the array can be cleared with::

    int flex_array_clear(struct flex_array *array, unsigned int element_nr);

@@ -97,7 +100,7 @@ This function will set the given element to FLEX_ARRAY_FREE and return
zero. If storage for the indicated element is not allocated for the array,
flex_array_clear() will return -EINVAL instead. Note that clearing an
element does not release the storage associated with it; to reduce the
allocated size of an array, call::

    int flex_array_shrink(struct flex_array *array);

@@ -106,12 +109,12 @@ This function works by scanning the array for pages containing nothing but
FLEX_ARRAY_FREE bytes, so (1) it can be expensive, and (2) it will not work
if the array's pages are allocated with __GFP_ZERO.

It is possible to remove all elements of an array with a call to::

    void flex_array_free_parts(struct flex_array *array);

This call frees all elements, but leaves the array itself in place.
Freeing the entire array is done with::

    void flex_array_free(struct flex_array *array);
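
The calls above combine into a simple allocate/put/get/free cycle. A
sketch (flex_array_alloc() belongs to the same API but its full
description falls outside this excerpt; the element count and index are
arbitrary)::

    static int flex_demo(void)
    {
            struct flex_array *fa;
            int value = 42, ret;
            int *p;

            /* room for 1000 ints; pages are allocated only as used */
            fa = flex_array_alloc(sizeof(int), 1000, GFP_KERNEL);
            if (!fa)
                    return -ENOMEM;

            ret = flex_array_put(fa, 13, &value, GFP_KERNEL); /* copies value in */
            if (!ret) {
                    p = flex_array_get(fa, 13); /* points into array storage */
                    if (p)
                            pr_info("element 13 = %d\n", *p);
            }

            flex_array_free(fa); /* releases element pages and the array */
            return ret;
    }
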
diff --git a/Documentation/futex-requeue-pi.txt b/Documentation/futex-requeue-pi.txt
index 77b36f59d16b..14ab5787b9a7 100644
--- a/Documentation/futex-requeue-pi.txt
+++ b/Documentation/futex-requeue-pi.txt
@@ -1,5 +1,6 @@
================
Futex Requeue PI
================

Requeueing of tasks from a non-PI futex to a PI futex requires
special handling in order to ensure the underlying rt_mutex is never
@@ -20,28 +21,28 @@ implementation would wake the highest-priority waiter, and leave the
rest to the natural wakeup inherent in unlocking the mutex
associated with the condvar.

Consider the simplified glibc calls::

    /* caller must lock mutex */
    pthread_cond_wait(cond, mutex)
    {
        lock(cond->__data.__lock);
        unlock(mutex);
        do {
            unlock(cond->__data.__lock);
            futex_wait(cond->__data.__futex);
            lock(cond->__data.__lock);
        } while(...)
        unlock(cond->__data.__lock);
        lock(mutex);
    }

    pthread_cond_broadcast(cond)
    {
        lock(cond->__data.__lock);
        unlock(cond->__data.__lock);
        futex_requeue(cond->data.__futex, cond->mutex);
    }

Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
has waiters. Note that pthread_cond_wait() attempts to lock the
@@ -53,29 +54,29 @@ In order to support PI-aware pthread_condvar's, the kernel needs to
be able to requeue tasks to PI futexes. This support implies that
upon a successful futex_wait system call, the caller would return to
user space already holding the PI futex. The glibc implementation
would be modified as follows::

    /* caller must lock mutex */
    pthread_cond_wait_pi(cond, mutex)
    {
        lock(cond->__data.__lock);
        unlock(mutex);
        do {
            unlock(cond->__data.__lock);
            futex_wait_requeue_pi(cond->__data.__futex);
            lock(cond->__data.__lock);
        } while(...)
        unlock(cond->__data.__lock);
        /* the kernel acquired the mutex for us */
    }

    pthread_cond_broadcast_pi(cond)
    {
        lock(cond->__data.__lock);
        unlock(cond->__data.__lock);
        futex_requeue_pi(cond->data.__futex, cond->mutex);
    }

The actual glibc implementation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new
diff --git a/Documentation/gcc-plugins.txt b/Documentation/gcc-plugins.txt
index 433eaefb4aa1..8502f24396fb 100644
--- a/Documentation/gcc-plugins.txt
+++ b/Documentation/gcc-plugins.txt
@@ -1,14 +1,15 @@
=========================
GCC plugin infrastructure
=========================


Introduction
============

GCC plugins are loadable modules that provide extra features to the
compiler [1]_. They are useful for runtime instrumentation and static analysis.
We can analyse, change and add further code during compilation via
callbacks [2]_, GIMPLE [3]_, IPA [4]_ and RTL passes [5]_.

The GCC plugin infrastructure of the kernel supports all gcc versions from
4.5 to 6.0, building out-of-tree modules, cross-compilation and building in a
@@ -21,56 +22,61 @@ and versions 4.8+ can only be compiled by a C++ compiler.
Currently the GCC plugin infrastructure supports only the x86, arm, arm64 and
powerpc architectures.

This infrastructure was ported from grsecurity [6]_ and PaX [7]_.

--

.. [1] https://gcc.gnu.org/onlinedocs/gccint/Plugins.html
.. [2] https://gcc.gnu.org/onlinedocs/gccint/Plugin-API.html#Plugin-API
.. [3] https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html
.. [4] https://gcc.gnu.org/onlinedocs/gccint/IPA.html
.. [5] https://gcc.gnu.org/onlinedocs/gccint/RTL.html
.. [6] https://grsecurity.net/
.. [7] https://pax.grsecurity.net/


Files
=====

**$(src)/scripts/gcc-plugins**

    This is the directory of the GCC plugins.

**$(src)/scripts/gcc-plugins/gcc-common.h**

    This is a compatibility header for GCC plugins.
    It should always be included instead of individual gcc headers.

**$(src)/scripts/gcc-plugin.sh**

    This script checks the availability of the included headers in
    gcc-common.h and chooses the proper host compiler to build the plugins
    (gcc-4.7 can be built by either gcc or g++).

**$(src)/scripts/gcc-plugins/gcc-generate-gimple-pass.h,
$(src)/scripts/gcc-plugins/gcc-generate-ipa-pass.h,
$(src)/scripts/gcc-plugins/gcc-generate-simple_ipa-pass.h,
$(src)/scripts/gcc-plugins/gcc-generate-rtl-pass.h**

    These headers automatically generate the registration structures for
    GIMPLE, SIMPLE_IPA, IPA and RTL passes. They support all gcc versions
    from 4.5 to 6.0.
    They should be preferred to creating the structures by hand.

62======== 68=====
63 69
64You must install the gcc plugin headers for your gcc version, 70You must install the gcc plugin headers for your gcc version,
65e.g., on Ubuntu for gcc-4.9: 71e.g., on Ubuntu for gcc-4.9::
66 72
67 apt-get install gcc-4.9-plugin-dev 73 apt-get install gcc-4.9-plugin-dev
68 74
69Enable a GCC plugin based feature in the kernel config: 75Enable a GCC plugin based feature in the kernel config::
70 76
71 CONFIG_GCC_PLUGIN_CYC_COMPLEXITY = y 77 CONFIG_GCC_PLUGIN_CYC_COMPLEXITY = y
72 78
73To compile only the plugin(s): 79To compile only the plugin(s)::
74 80
75 make gcc-plugins 81 make gcc-plugins
76 82
diff --git a/Documentation/highuid.txt b/Documentation/highuid.txt
index 6bad6f1d1cac..6ee70465c0ea 100644
--- a/Documentation/highuid.txt
+++ b/Documentation/highuid.txt
@@ -1,4 +1,9 @@
===================================================
Notes on the change from 16-bit UIDs to 32-bit UIDs
===================================================

:Author: Chris Wing <wingc@umich.edu>
:Last updated: January 11, 2000

- kernel code MUST take into account __kernel_uid_t and __kernel_uid32_t
  when communicating between user and kernel space in an ioctl or data
@@ -28,30 +33,34 @@ What's left to be done for 32-bit UIDs on all Linux architectures:
  uses the 32-bit UID system calls properly otherwise.

  This affects at least:

  - iBCS on Intel

  - sparc32 emulation on sparc64
    (need to support whatever new 32-bit UID system calls are added to
    sparc32)

- Validate that all filesystems behave properly.

  At present, 32-bit UIDs _should_ work for:

  - ext2
  - ufs
  - isofs
  - nfs
  - coda
  - udf

  Ioctl() fixups have been made for:

  - ncpfs
  - smbfs

  Filesystems with simple fixups to prevent 16-bit UID wraparound:

  - minix
  - sysv
  - qnx4

  Other filesystems have not been checked yet.

@@ -69,9 +78,3 @@ What's left to be done for 32-bit UIDs on all Linux architectures:
- make sure that the UID mapping feature of AX25 networking works properly
  (it should be safe because it has always used a 32-bit integer to
  communicate between user and kernel)
diff --git a/Documentation/hw_random.txt b/Documentation/hw_random.txt
index fce1634907d0..121de96e395e 100644
--- a/Documentation/hw_random.txt
+++ b/Documentation/hw_random.txt
@@ -1,90 +1,105 @@
==========================================================
Linux support for random number generator in i8xx chipsets
==========================================================

Introduction
============

The hw_random framework is software that makes use of a
special hardware feature on your CPU or motherboard,
a Random Number Generator (RNG). The software has two parts:
a core providing the /dev/hwrng character device and its
sysfs support, plus a hardware-specific driver that plugs
into that core.

To make the most effective use of these mechanisms, you
should download the support software as well. Download the
latest version of the "rng-tools" package from the
hw_random driver's official Web site:

    http://sourceforge.net/projects/gkernel/

Those tools use /dev/hwrng to fill the kernel entropy pool,
which is used internally and exported by the /dev/urandom and
/dev/random special files.

Theory of operation
===================

CHARACTER DEVICE. Using the standard open()
and read() system calls, you can read random data from
the hardware RNG device. This data is NOT CHECKED by any
fitness tests, and could potentially be bogus (if the
hardware is faulty or has been tampered with). Data is only
output if the hardware "has-data" flag is set, but nevertheless
a security-conscious person would run fitness tests on the
data before assuming it is truly random.

The rng-tools package uses such tests in "rngd", and lets you
run them by hand with a "rngtest" utility.

/dev/hwrng is char device major 10, minor 183.

CLASS DEVICE. There is a /sys/class/misc/hw_random node with
two unique attributes, "rng_available" and "rng_current". The
"rng_available" attribute lists the hardware-specific drivers
available, while "rng_current" lists the one which is currently
connected to /dev/hwrng. If your system has more than one
RNG available, you may change the one used by writing a name from
the list in "rng_available" into "rng_current".
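
For instance, a small userspace program can pull bytes straight from the
character device (the 16-byte count is arbitrary, and the read may fail
or block if no driver is bound; remember the data is unchecked)::

    /* Read a handful of raw bytes from the hardware RNG. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            unsigned char buf[16];
            ssize_t n;
            int fd = open("/dev/hwrng", O_RDONLY);

            if (fd < 0) {
                    perror("/dev/hwrng");
                    return 1;
            }
            n = read(fd, buf, sizeof(buf)); /* may block until data is ready */
            close(fd);
            if (n < 0) {
                    perror("read");
                    return 1;
            }
            printf("read %zd unchecked random byte(s)\n", n);
            return 0;
    }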

==========================================================================


Hardware driver for Intel/AMD/VIA Random Number Generators (RNG)
    - Copyright 2000,2001 Jeff Garzik <jgarzik@pobox.com>
    - Copyright 2000,2001 Philipp Rumpf <prumpf@mandrakesoft.com>


About the Intel RNG hardware, from the firmware hub datasheet
=============================================================

The Firmware Hub integrates a Random Number Generator (RNG)
using thermal noise generated from inherently random quantum
mechanical properties of silicon. When not generating new random
bits the RNG circuitry will enter a low power state. Intel will
provide a binary software driver to give third party software
access to our RNG for use as a security feature. At this time,
the RNG is only to be used with a system in an OS-present state.

Intel RNG Driver notes
======================

FIXME: support poll(2)

.. note::

    request_mem_region was removed, for three reasons:

    1) Only one RNG is supported by this driver;
    2) The location used by the RNG is a fixed location in
       MMIO-addressable memory;
    3) users with properly working BIOS e820 handling will always
       have the region in which the RNG is located reserved, so
       request_mem_region calls always fail for proper setups.
       However, for people who use mem=XX, BIOS e820 information is
       **not** in /proc/iomem, and request_mem_region(RNG_ADDR) can
       succeed.

Driver details
==============

Based on:
    Intel 82802AB/82802AC Firmware Hub (FWH) Datasheet
    May 1999 Order Number: 290658-002 R

Intel 82802 Firmware Hub:
    Random Number Generator
    Programmer's Reference Manual
    December 1999 Order Number: 298029-001 R

Intel 82802 Firmware HUB Random Number Generator Driver
    Copyright (c) 2000 Matt Sottek <msottek@quiknet.com>

Special thanks to Matt Sottek. I did the "guts", he
did the "brains" and all the testing.
diff --git a/Documentation/hwspinlock.txt b/Documentation/hwspinlock.txt
index 61c1ee98e59f..ed640a278185 100644
--- a/Documentation/hwspinlock.txt
+++ b/Documentation/hwspinlock.txt
@@ -1,6 +1,9 @@
===========================
Hardware Spinlock Framework
===========================

Introduction
============

Hardware spinlock modules provide hardware assistance for synchronization
and mutual exclusion between heterogeneous processors and those not operating
@@ -32,286 +35,370 @@ structure).
32A common hwspinlock interface makes it possible to have generic, platform- 35A common hwspinlock interface makes it possible to have generic, platform-
33independent, drivers. 36independent, drivers.
34 37
352. User API 38User API
39========
40
41::
36 42
37 struct hwspinlock *hwspin_lock_request(void); 43 struct hwspinlock *hwspin_lock_request(void);
38 - dynamically assign an hwspinlock and return its address, or NULL 44
39 in case an unused hwspinlock isn't available. Users of this 45Dynamically assign an hwspinlock and return its address, or NULL
40 API will usually want to communicate the lock's id to the remote core 46in case an unused hwspinlock isn't available. Users of this
41 before it can be used to achieve synchronization. 47API will usually want to communicate the lock's id to the remote core
42 Should be called from a process context (might sleep). 48before it can be used to achieve synchronization.
49
50Should be called from a process context (might sleep).
51
52::
43 53
44 struct hwspinlock *hwspin_lock_request_specific(unsigned int id); 54 struct hwspinlock *hwspin_lock_request_specific(unsigned int id);
45 - assign a specific hwspinlock id and return its address, or NULL 55
46 if that hwspinlock is already in use. Usually board code will 56Assign a specific hwspinlock id and return its address, or NULL
47 be calling this function in order to reserve specific hwspinlock 57if that hwspinlock is already in use. Usually board code will
48 ids for predefined purposes. 58be calling this function in order to reserve specific hwspinlock
49 Should be called from a process context (might sleep). 59ids for predefined purposes.
60
61Should be called from a process context (might sleep).

::

  int of_hwspin_lock_get_id(struct device_node *np, int index);

Retrieve the global lock id for an OF phandle-based specific lock.
This function provides a means for DT users of a hwspinlock module
to get the global lock id of a specific hwspinlock, so that it can
be requested using the normal hwspin_lock_request_specific() API.

The function returns a lock id number on success, -EPROBE_DEFER if
the hwspinlock device is not yet registered with the core, or other
error values.

Should be called from a process context (might sleep).

::

  int hwspin_lock_free(struct hwspinlock *hwlock);

Free a previously-assigned hwspinlock; returns 0 on success, or an
appropriate error code on failure (e.g. -EINVAL if the hwspinlock
is already free).

Should be called from a process context (might sleep).

::

  int hwspin_lock_timeout(struct hwspinlock *hwlock, unsigned int timeout);

Lock a previously-assigned hwspinlock with a timeout limit (specified in
msecs). If the hwspinlock is already taken, the function will busy loop
waiting for it to be released, but give up when the timeout elapses.
Upon a successful return from this function, preemption is disabled so
the caller must not sleep, and is advised to release the hwspinlock as
soon as possible, in order to minimize remote cores polling on the
hardware interconnect.

Returns 0 when successful and an appropriate error code otherwise (most
notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
The function will never sleep.
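
The timeout variants are thin wrappers around the single-attempt trylock
primitive: the core simply retries until it succeeds or the deadline
passes. Below is a minimal userspace model of that loop (a hedged sketch,
not kernel code: the ``hw_``-prefixed names are hypothetical, and a C11
atomic flag stands in for the hardware lock register):

```c
#include <stdatomic.h>
#include <time.h>

#define MODEL_ETIMEDOUT 110	/* stand-in for the kernel's -ETIMEDOUT */

static atomic_flag hw_lock_bit = ATOMIC_FLAG_INIT;

/* one attempt, like a vendor ->trylock(): 1 on success, 0 on failure */
static int hw_trylock(void)
{
	return !atomic_flag_test_and_set_explicit(&hw_lock_bit,
						  memory_order_acquire);
}

static void hw_unlock(void)
{
	atomic_flag_clear_explicit(&hw_lock_bit, memory_order_release);
}

/* elapsed processor time in msecs; good enough for a busy-wait model */
static long now_ms(void)
{
	return (long)(clock() * 1000 / CLOCKS_PER_SEC);
}

/* retry the trylock until success or until 'timeout' msecs elapse */
static int hw_lock_timeout(unsigned int timeout)
{
	long deadline = now_ms() + timeout;

	while (!hw_trylock()) {
		if (now_ms() > deadline)
			return -MODEL_ETIMEDOUT;
		/* the hwspinlock core would call the optional ->relax() here */
	}
	return 0;
}
```

The irq/irqsave variants follow the same loop; they only differ in what
local CPU state is disabled or saved around it.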

::

  int hwspin_lock_timeout_irq(struct hwspinlock *hwlock, unsigned int timeout);

Lock a previously-assigned hwspinlock with a timeout limit (specified in
msecs). If the hwspinlock is already taken, the function will busy loop
waiting for it to be released, but give up when the timeout elapses.
Upon a successful return from this function, preemption and the local
interrupts are disabled, so the caller must not sleep, and is advised to
release the hwspinlock as soon as possible.

Returns 0 when successful and an appropriate error code otherwise (most
notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
The function will never sleep.

::

  int hwspin_lock_timeout_irqsave(struct hwspinlock *hwlock, unsigned int to,
                                  unsigned long *flags);

Lock a previously-assigned hwspinlock with a timeout limit (specified in
msecs). If the hwspinlock is already taken, the function will busy loop
waiting for it to be released, but give up when the timeout elapses.
Upon a successful return from this function, preemption is disabled,
local interrupts are disabled and their previous state is saved at the
given flags placeholder. The caller must not sleep, and is advised to
release the hwspinlock as soon as possible.

Returns 0 when successful and an appropriate error code otherwise (most
notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
The function will never sleep.

::

  int hwspin_trylock(struct hwspinlock *hwlock);

Attempt to lock a previously-assigned hwspinlock, but immediately fail if
it is already taken.

Upon a successful return from this function, preemption is disabled so
caller must not sleep, and is advised to release the hwspinlock as soon as
possible, in order to minimize remote cores polling on the hardware
interconnect.

Returns 0 on success and an appropriate error code otherwise (most
notably -EBUSY if the hwspinlock was already taken).
The function will never sleep.

::

  int hwspin_trylock_irq(struct hwspinlock *hwlock);

Attempt to lock a previously-assigned hwspinlock, but immediately fail if
it is already taken.

Upon a successful return from this function, preemption and the local
interrupts are disabled so caller must not sleep, and is advised to
release the hwspinlock as soon as possible.

Returns 0 on success and an appropriate error code otherwise (most
notably -EBUSY if the hwspinlock was already taken).
The function will never sleep.

::

  int hwspin_trylock_irqsave(struct hwspinlock *hwlock, unsigned long *flags);

Attempt to lock a previously-assigned hwspinlock, but immediately fail if
it is already taken.

Upon a successful return from this function, preemption is disabled,
the local interrupts are disabled and their previous state is saved
at the given flags placeholder. The caller must not sleep, and is advised
to release the hwspinlock as soon as possible.

Returns 0 on success and an appropriate error code otherwise (most
notably -EBUSY if the hwspinlock was already taken).
The function will never sleep.

::

  void hwspin_unlock(struct hwspinlock *hwlock);

Unlock a previously-locked hwspinlock. Always succeeds, and can be called
from any context (the function never sleeps).

.. note::

  code should **never** unlock an hwspinlock which is already unlocked
  (there is no protection against this).

::

  void hwspin_unlock_irq(struct hwspinlock *hwlock);

Unlock a previously-locked hwspinlock and enable local interrupts.
The caller should **never** unlock an hwspinlock which is already unlocked.

Doing so is considered a bug (there is no protection against this).
Upon a successful return from this function, preemption and local
interrupts are enabled. This function will never sleep.

::

  void
  hwspin_unlock_irqrestore(struct hwspinlock *hwlock, unsigned long *flags);

Unlock a previously-locked hwspinlock.

The caller should **never** unlock an hwspinlock which is already unlocked.
Doing so is considered a bug (there is no protection against this).
Upon a successful return from this function, preemption is reenabled,
and the state of the local interrupts is restored to the state saved at
the given flags. This function will never sleep.

::

  int hwspin_lock_get_id(struct hwspinlock *hwlock);

Retrieve id number of a given hwspinlock. This is needed when an
hwspinlock is dynamically assigned: before it can be used to achieve
mutual exclusion with a remote cpu, the id number should be communicated
to the remote task with which we want to synchronize.

Returns the hwspinlock id number, or -EINVAL if hwlock is null.

Typical usage
=============

::

    #include <linux/hwspinlock.h>
    #include <linux/err.h>

    int hwspinlock_example1(void)
    {
        struct hwspinlock *hwlock;
        int ret;
        int id;

        /* dynamically assign a hwspinlock */
        hwlock = hwspin_lock_request();
        if (!hwlock)
            ...

        id = hwspin_lock_get_id(hwlock);
        /* probably need to communicate id to a remote processor now */

        /* take the lock, spin for 1 sec if it's already taken */
        ret = hwspin_lock_timeout(hwlock, 1000);
        if (ret)
            ...

        /*
         * we took the lock, do our thing now, but do NOT sleep
         */

        /* release the lock */
        hwspin_unlock(hwlock);

        /* free the lock */
        ret = hwspin_lock_free(hwlock);
        if (ret)
            ...

        return ret;
    }

    int hwspinlock_example2(void)
    {
        struct hwspinlock *hwlock;
        int ret;

        /*
         * assign a specific hwspinlock id - this should be called early
         * by board init code.
         */
        hwlock = hwspin_lock_request_specific(PREDEFINED_LOCK_ID);
        if (!hwlock)
            ...

        /*
         * try to take it, but don't spin on it; hwspin_trylock()
         * returns 0 on success, so nonzero means the lock is taken
         */
        ret = hwspin_trylock(hwlock);
        if (ret) {
            pr_info("lock is already taken\n");
            return -EBUSY;
        }

        /*
         * we took the lock, do our thing now, but do NOT sleep
         */

        /* release the lock */
        hwspin_unlock(hwlock);

        /* free the lock */
        ret = hwspin_lock_free(hwlock);
        if (ret)
            ...

        return ret;
    }

API for implementors
====================

::

  int hwspin_lock_register(struct hwspinlock_device *bank, struct device *dev,
                const struct hwspinlock_ops *ops, int base_id, int num_locks);

To be called from the underlying platform-specific implementation, in
order to register a new hwspinlock device (which is usually a bank of
numerous locks). Should be called from a process context (this function
might sleep).

Returns 0 on success, or appropriate error code on failure.

::

  int hwspin_lock_unregister(struct hwspinlock_device *bank);

To be called from the underlying vendor-specific implementation, in order
to unregister an hwspinlock device (which is usually a bank of numerous
locks).

Should be called from a process context (this function might sleep).

Returns the address of hwspinlock on success, or NULL on error (e.g.
if the hwspinlock is still in use).

Important structs
=================

struct hwspinlock_device is a device which usually contains a bank
of hardware locks. It is registered by the underlying hwspinlock
implementation using the hwspin_lock_register() API.

::

  /**
   * struct hwspinlock_device - a device which usually spans numerous hwspinlocks
   * @dev: underlying device, will be used to invoke runtime PM api
   * @ops: platform-specific hwspinlock handlers
   * @base_id: id index of the first lock in this device
   * @num_locks: number of locks in this device
   * @lock: dynamically allocated array of 'struct hwspinlock'
   */
  struct hwspinlock_device {
        struct device *dev;
        const struct hwspinlock_ops *ops;
        int base_id;
        int num_locks;
        struct hwspinlock lock[0];
  };

struct hwspinlock_device contains an array of hwspinlock structs, each
of which represents a single hardware lock::

  /**
   * struct hwspinlock - this struct represents a single hwspinlock instance
   * @bank: the hwspinlock_device structure which owns this lock
   * @lock: initialized and used by hwspinlock core
   * @priv: private data, owned by the underlying platform-specific hwspinlock drv
   */
  struct hwspinlock {
        struct hwspinlock_device *bank;
        spinlock_t lock;
        void *priv;
  };

When registering a bank of locks, the hwspinlock driver only needs to
set the priv members of the locks. The rest of the members are set and
initialized by the hwspinlock core itself.
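
The base_id/num_locks pair is what lets the core map a single global lock
id space onto multiple registered banks. The lookup can be sketched as
follows (a hypothetical userspace model with made-up ``model_`` names, not
the kernel's actual code):

```c
#include <stddef.h>

/* minimal stand-in for the bank bookkeeping described above */
struct model_bank {
	int base_id;   /* id of the first lock in this bank */
	int num_locks; /* number of locks in this bank */
};

/*
 * Resolve a global lock id to the owning bank and the lock's index
 * inside that bank's lock[] array; returns -1 if no bank matches.
 */
static int model_lookup(const struct model_bank *banks, size_t nbanks,
			int id, size_t *bank_idx, int *local_idx)
{
	for (size_t i = 0; i < nbanks; i++) {
		const struct model_bank *b = &banks[i];

		if (id >= b->base_id && id < b->base_id + b->num_locks) {
			*bank_idx = i;
			*local_idx = id - b->base_id;
			return 0;
		}
	}
	return -1;
}
```

For example, with one 8-lock bank at base_id 0 and a 32-lock bank at
base_id 8, global id 10 resolves to the second bank, local index 2.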

Implementation callbacks
========================

There are three possible callbacks defined in 'struct hwspinlock_ops'::

  struct hwspinlock_ops {
        int (*trylock)(struct hwspinlock *lock);
        void (*unlock)(struct hwspinlock *lock);
        void (*relax)(struct hwspinlock *lock);
  };

The first two callbacks are mandatory:

The ->trylock() callback should make a single attempt to take the lock, and
return 0 on failure and 1 on success. This callback may **not** sleep.

The ->unlock() callback releases the lock. It always succeeds, and it, too,
may **not** sleep.

The ->relax() callback is optional. It is called by hwspinlock core while
spinning on a lock, and can be used by the underlying implementation to force
a delay between two successive invocations of ->trylock(). It may **not** sleep.

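For illustration, here is what a software-only implementation of these
callbacks could look like. This is a hedged userspace sketch under invented
``model_`` names: a C11 atomic flag stands in for the vendor's lock
register, whereas a real driver would poke device registers through the
lock's priv pointer:

```c
#include <stdatomic.h>

/* stand-in for the per-lock state a real driver reaches via ->priv */
struct model_hwspinlock {
	atomic_flag taken;
};

/* ->trylock(): one attempt; 1 on success, 0 on failure, never sleeps */
static int model_trylock(struct model_hwspinlock *lock)
{
	return !atomic_flag_test_and_set_explicit(&lock->taken,
						  memory_order_acquire);
}

/* ->unlock(): always succeeds, never sleeps */
static void model_unlock(struct model_hwspinlock *lock)
{
	atomic_flag_clear_explicit(&lock->taken, memory_order_release);
}

/* ->relax(): optional delay between trylock attempts (a no-op here;
 * hardware implementations might insert a short bus-friendly delay) */
static void model_relax(struct model_hwspinlock *lock)
{
	(void)lock;
}
```

Note the acquire/release ordering: it mirrors the memory-barrier semantics
callers expect from a lock, so that protected data accesses cannot leak
outside the critical section.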
diff --git a/Documentation/intel_txt.txt b/Documentation/intel_txt.txt
index 91d89c540709..d83c1a2122c9 100644
--- a/Documentation/intel_txt.txt
+++ b/Documentation/intel_txt.txt
@@ -1,4 +1,5 @@
-Intel(R) TXT Overview:
+=====================
+Intel(R) TXT Overview
 =====================
 
 Intel's technology for safer computing, Intel(R) Trusted Execution
@@ -8,9 +9,10 @@ provide the building blocks for creating trusted platforms.
 Intel TXT was formerly known by the code name LaGrande Technology (LT).
 
 Intel TXT in Brief:
-o Provides dynamic root of trust for measurement (DRTM)
-o Data protection in case of improper shutdown
-o Measurement and verification of launched environment
+
+- Provides dynamic root of trust for measurement (DRTM)
+- Data protection in case of improper shutdown
+- Measurement and verification of launched environment
 
 Intel TXT is part of the vPro(TM) brand and is also available some
 non-vPro systems. It is currently available on desktop systems
@@ -24,16 +26,21 @@ which has been updated for the new released platforms.
 
 Intel TXT has been presented at various events over the past few
 years, some of which are:
- LinuxTAG 2008:
+
+- LinuxTAG 2008:
   http://www.linuxtag.org/2008/en/conf/events/vp-donnerstag.html
- TRUST2008:
+
+- TRUST2008:
   http://www.trust-conference.eu/downloads/Keynote-Speakers/
   3_David-Grawrock_The-Front-Door-of-Trusted-Computing.pdf
- IDF, Shanghai:
+
+- IDF, Shanghai:
   http://www.prcidf.com.cn/index_en.html
- IDFs 2006, 2007 (I'm not sure if/where they are online)
 
-Trusted Boot Project Overview:
+- IDFs 2006, 2007
+  (I'm not sure if/where they are online)
+
+Trusted Boot Project Overview
 =============================
 
 Trusted Boot (tboot) is an open source, pre-kernel/VMM module that
@@ -87,11 +94,12 @@ Intel-provided firmware).
 How Does it Work?
 =================
 
-o Tboot is an executable that is launched by the bootloader as
+- Tboot is an executable that is launched by the bootloader as
   the "kernel" (the binary the bootloader executes).
-o It performs all of the work necessary to determine if the
+- It performs all of the work necessary to determine if the
   platform supports Intel TXT and, if so, executes the GETSEC[SENTER]
   processor instruction that initiates the dynamic root of trust.
+
   - If tboot determines that the system does not support Intel TXT
     or is not configured correctly (e.g. the SINIT AC Module was
     incorrect), it will directly launch the kernel with no changes
@@ -99,12 +107,14 @@ o It performs all of the work necessary to determine if the
   - Tboot will output various information about its progress to the
     terminal, serial port, and/or an in-memory log; the output
     locations can be configured with a command line switch.
-o The GETSEC[SENTER] instruction will return control to tboot and
+
+- The GETSEC[SENTER] instruction will return control to tboot and
   tboot then verifies certain aspects of the environment (e.g. TPM NV
   lock, e820 table does not have invalid entries, etc.).
-o It will wake the APs from the special sleep state the GETSEC[SENTER]
+- It will wake the APs from the special sleep state the GETSEC[SENTER]
   instruction had put them in and place them into a wait-for-SIPI
   state.
+
   - Because the processors will not respond to an INIT or SIPI when
     in the TXT environment, it is necessary to create a small VT-x
     guest for the APs. When they run in this guest, they will
@@ -112,8 +122,10 @@ o It will wake the APs from the special sleep state the GETSEC[SENTER]
     VMEXITs, and then disable VT and jump to the SIPI vector. This
     approach seemed like a better choice than having to insert
     special code into the kernel's MP wakeup sequence.
-o Tboot then applies an (optional) user-defined launch policy to
+
+- Tboot then applies an (optional) user-defined launch policy to
   verify the kernel and initrd.
+
   - This policy is rooted in TPM NV and is described in the tboot
     project. The tboot project also contains code for tools to
     create and provision the policy.
@@ -121,30 +133,34 @@ o Tboot then applies an (optional) user-defined launch policy to
     then any kernel will be launched.
   - Policy action is flexible and can include halting on failures
     or simply logging them and continuing.
-o Tboot adjusts the e820 table provided by the bootloader to reserve
+
+- Tboot adjusts the e820 table provided by the bootloader to reserve
   its own location in memory as well as to reserve certain other
   TXT-related regions.
-o As part of its launch, tboot DMA protects all of RAM (using the
+- As part of its launch, tboot DMA protects all of RAM (using the
   VT-d PMRs). Thus, the kernel must be booted with 'intel_iommu=on'
   in order to remove this blanket protection and use VT-d's
   page-level protection.
-o Tboot will populate a shared page with some data about itself and
+- Tboot will populate a shared page with some data about itself and
   pass this to the Linux kernel as it transfers control.
+
   - The location of the shared page is passed via the boot_params
     struct as a physical address.
-o The kernel will look for the tboot shared page address and, if it
+
+- The kernel will look for the tboot shared page address and, if it
   exists, map it.
-o As one of the checks/protections provided by TXT, it makes a copy
+- As one of the checks/protections provided by TXT, it makes a copy
   of the VT-d DMARs in a DMA-protected region of memory and verifies
   them for correctness. The VT-d code will detect if the kernel was
   launched with tboot and use this copy instead of the one in the
   ACPI table.
-o At this point, tboot and TXT are out of the picture until a
+- At this point, tboot and TXT are out of the picture until a
   shutdown (S<n>)
-o In order to put a system into any of the sleep states after a TXT
+- In order to put a system into any of the sleep states after a TXT
   launch, TXT must first be exited. This is to prevent attacks that
   attempt to crash the system to gain control on reboot and steal
   data left in memory.
+
   - The kernel will perform all of its sleep preparation and
     populate the shared page with the ACPI data needed to put the
     platform in the desired sleep state.
@@ -172,7 +188,7 @@ o In order to put a system into any of the sleep states after a TXT
 That's pretty much it for TXT support.
 
 
-Configuring the System:
+Configuring the System
 ======================
 
 This code works with 32bit, 32bit PAE, and 64bit (x86_64) kernels.
@@ -181,7 +197,8 @@ In BIOS, the user must enable: TPM, TXT, VT-x, VT-d. Not all BIOSes
 allow these to be individually enabled/disabled and the screens in
 which to find them are BIOS-specific.

-grub.conf needs to be modified as follows:
+grub.conf needs to be modified as follows::
+
         title Linux 2.6.29-tip w/ tboot
         root (hd0,0)
         kernel /tboot.gz logging=serial,vga,memory
diff --git a/Documentation/io-mapping.txt b/Documentation/io-mapping.txt
index 5ca78426f54c..a966239f04e4 100644
--- a/Documentation/io-mapping.txt
+++ b/Documentation/io-mapping.txt
@@ -1,66 +1,81 @@
+========================
+The io_mapping functions
+========================
+
+API
+===
+
 The io_mapping functions in linux/io-mapping.h provide an abstraction for
 efficiently mapping small regions of an I/O device to the CPU. The initial
 usage is to support the large graphics aperture on 32-bit processors where
 ioremap_wc cannot be used to statically map the entire aperture to the CPU
 as it would consume too much of the kernel address space.

-A mapping object is created during driver initialization using
+A mapping object is created during driver initialization using::

         struct io_mapping *io_mapping_create_wc(unsigned long base,
                                                 unsigned long size)

-        'base' is the bus address of the region to be made
-        mappable, while 'size' indicates how large a mapping region to
-        enable. Both are in bytes.
+'base' is the bus address of the region to be made
+mappable, while 'size' indicates how large a mapping region to
+enable. Both are in bytes.

-        This _wc variant provides a mapping which may only be used
-        with the io_mapping_map_atomic_wc or io_mapping_map_wc.
+This _wc variant provides a mapping which may only be used
+with the io_mapping_map_atomic_wc or io_mapping_map_wc.

 With this mapping object, individual pages can be mapped either atomically
 or not, depending on the necessary scheduling environment. Of course, atomic
-maps are more efficient:
+maps are more efficient::

         void *io_mapping_map_atomic_wc(struct io_mapping *mapping,
                                        unsigned long offset)

-        'offset' is the offset within the defined mapping region.
-        Accessing addresses beyond the region specified in the
-        creation function yields undefined results. Using an offset
-        which is not page aligned yields an undefined result. The
-        return value points to a single page in CPU address space.
+'offset' is the offset within the defined mapping region.
+Accessing addresses beyond the region specified in the
+creation function yields undefined results. Using an offset
+which is not page aligned yields an undefined result. The
+return value points to a single page in CPU address space.
+
+This _wc variant returns a write-combining map to the
+page and may only be used with mappings created by
+io_mapping_create_wc

-        This _wc variant returns a write-combining map to the
-        page and may only be used with mappings created by
-        io_mapping_create_wc
+Note that the task may not sleep while holding this page
+mapped.

-        Note that the task may not sleep while holding this page
-        mapped.
+::

         void io_mapping_unmap_atomic(void *vaddr)

-        'vaddr' must be the value returned by the last
-        io_mapping_map_atomic_wc call. This unmaps the specified
-        page and allows the task to sleep once again.
+'vaddr' must be the value returned by the last
+io_mapping_map_atomic_wc call. This unmaps the specified
+page and allows the task to sleep once again.

 If you need to sleep while holding the lock, you can use the non-atomic
 variant, although they may be significantly slower.

+::
+
         void *io_mapping_map_wc(struct io_mapping *mapping,
                                 unsigned long offset)

-        This works like io_mapping_map_atomic_wc except it allows
-        the task to sleep while holding the page mapped.
+This works like io_mapping_map_atomic_wc except it allows
+the task to sleep while holding the page mapped.
+
+
+::

         void io_mapping_unmap(void *vaddr)

-        This works like io_mapping_unmap_atomic, except it is used
-        for pages mapped with io_mapping_map_wc.
+This works like io_mapping_unmap_atomic, except it is used
+for pages mapped with io_mapping_map_wc.

-At driver close time, the io_mapping object must be freed:
+At driver close time, the io_mapping object must be freed::

         void io_mapping_free(struct io_mapping *mapping)

-Current Implementation:
+Current Implementation
+======================

 The initial implementation of these functions uses existing mapping
 mechanisms and so provides only an abstraction layer and no new
diff --git a/Documentation/io_ordering.txt b/Documentation/io_ordering.txt
index 9faae6f26d32..2ab303ce9a0d 100644
--- a/Documentation/io_ordering.txt
+++ b/Documentation/io_ordering.txt
@@ -1,3 +1,7 @@
+==============================================
+Ordering I/O writes to memory-mapped addresses
+==============================================
+
 On some platforms, so-called memory-mapped I/O is weakly ordered. On such
 platforms, driver writers are responsible for ensuring that I/O writes to
 memory-mapped addresses on their device arrive in the order intended. This is
@@ -8,39 +12,39 @@ critical section of code protected by spinlocks. This would ensure that
 subsequent writes to I/O space arrived only after all prior writes (much like a
 memory barrier op, mb(), only with respect to I/O).

-A more concrete example from a hypothetical device driver:
+A more concrete example from a hypothetical device driver::

-        ...
-CPU A:  spin_lock_irqsave(&dev_lock, flags)
-CPU A:  val = readl(my_status);
-CPU A:  ...
-CPU A:  writel(newval, ring_ptr);
-CPU A:  spin_unlock_irqrestore(&dev_lock, flags)
-        ...
-CPU B:  spin_lock_irqsave(&dev_lock, flags)
-CPU B:  val = readl(my_status);
-CPU B:  ...
-CPU B:  writel(newval2, ring_ptr);
-CPU B:  spin_unlock_irqrestore(&dev_lock, flags)
-        ...
+        ...
+        CPU A:  spin_lock_irqsave(&dev_lock, flags)
+        CPU A:  val = readl(my_status);
+        CPU A:  ...
+        CPU A:  writel(newval, ring_ptr);
+        CPU A:  spin_unlock_irqrestore(&dev_lock, flags)
+        ...
+        CPU B:  spin_lock_irqsave(&dev_lock, flags)
+        CPU B:  val = readl(my_status);
+        CPU B:  ...
+        CPU B:  writel(newval2, ring_ptr);
+        CPU B:  spin_unlock_irqrestore(&dev_lock, flags)
+        ...

 In the case above, the device may receive newval2 before it receives newval,
-which could cause problems. Fixing it is easy enough though:
+which could cause problems. Fixing it is easy enough though::

-        ...
-CPU A:  spin_lock_irqsave(&dev_lock, flags)
-CPU A:  val = readl(my_status);
-CPU A:  ...
-CPU A:  writel(newval, ring_ptr);
-CPU A:  (void)readl(safe_register); /* maybe a config register? */
-CPU A:  spin_unlock_irqrestore(&dev_lock, flags)
-        ...
-CPU B:  spin_lock_irqsave(&dev_lock, flags)
-CPU B:  val = readl(my_status);
-CPU B:  ...
-CPU B:  writel(newval2, ring_ptr);
-CPU B:  (void)readl(safe_register); /* maybe a config register? */
-CPU B:  spin_unlock_irqrestore(&dev_lock, flags)
+        ...
+        CPU A:  spin_lock_irqsave(&dev_lock, flags)
+        CPU A:  val = readl(my_status);
+        CPU A:  ...
+        CPU A:  writel(newval, ring_ptr);
+        CPU A:  (void)readl(safe_register); /* maybe a config register? */
+        CPU A:  spin_unlock_irqrestore(&dev_lock, flags)
+        ...
+        CPU B:  spin_lock_irqsave(&dev_lock, flags)
+        CPU B:  val = readl(my_status);
+        CPU B:  ...
+        CPU B:  writel(newval2, ring_ptr);
+        CPU B:  (void)readl(safe_register); /* maybe a config register? */
+        CPU B:  spin_unlock_irqrestore(&dev_lock, flags)

 Here, the reads from safe_register will cause the I/O chipset to flush any
 pending writes before actually posting the read to the chipset, preventing
diff --git a/Documentation/iostats.txt b/Documentation/iostats.txt
index 65f694f2d1c9..04d394a2e06c 100644
--- a/Documentation/iostats.txt
+++ b/Documentation/iostats.txt
@@ -1,49 +1,50 @@
+=====================
 I/O statistics fields
----------------
+=====================

 Since 2.4.20 (and some versions before, with patches), and 2.5.45,
 more extensive disk statistics have been introduced to help measure disk
-activity. Tools such as sar and iostat typically interpret these and do
+activity. Tools such as ``sar`` and ``iostat`` typically interpret these and do
 the work for you, but in case you are interested in creating your own
 tools, the fields are explained here.

 In 2.4 now, the information is found as additional fields in
-/proc/partitions. In 2.6, the same information is found in two
-places: one is in the file /proc/diskstats, and the other is within
+``/proc/partitions``. In 2.6 and later, the same information is found in two
+places: one is in the file ``/proc/diskstats``, and the other is within
 the sysfs file system, which must be mounted in order to obtain
 the information. Throughout this document we'll assume that sysfs
-is mounted on /sys, although of course it may be mounted anywhere.
-Both /proc/diskstats and sysfs use the same source for the information
+is mounted on ``/sys``, although of course it may be mounted anywhere.
+Both ``/proc/diskstats`` and sysfs use the same source for the information
 and so should not differ.

-Here are examples of these different formats:
+Here are examples of these different formats::

-2.4:
-   3     0   39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
-   3     1    9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030
+   2.4:
+      3     0   39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
+      3     1    9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030

+   2.6+ sysfs:
+      446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
+      35486 38030 38030 38030

-2.6 sysfs:
-   446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
-   35486 38030 38030 38030
+   2.6+ diskstats:
+      3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
+      3    1   hda1 35486 38030 38030 38030

-2.6 diskstats:
-   3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
-   3    1   hda1 35486 38030 38030 38030
+On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have
+a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``.

-On 2.4 you might execute "grep 'hda ' /proc/partitions". On 2.6, you have
-a choice of "cat /sys/block/hda/stat" or "grep 'hda ' /proc/diskstats".
 The advantage of one over the other is that the sysfs choice works well
-if you are watching a known, small set of disks. /proc/diskstats may
+if you are watching a known, small set of disks. ``/proc/diskstats`` may
 be a better choice if you are watching a large number of disks because
 you'll avoid the overhead of 50, 100, or 500 or more opens/closes with
 each snapshot of your disk statistics.

 In 2.4, the statistics fields are those after the device name. In
 the above example, the first field of statistics would be 446216.
-By contrast, in 2.6 if you look at /sys/block/hda/stat, you'll
+By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll
 find just the eleven fields, beginning with 446216. If you look at
-/proc/diskstats, the eleven fields will be preceded by the major and
+``/proc/diskstats``, the eleven fields will be preceded by the major and
 minor device numbers, and device name. Each of these formats provides
 eleven fields of statistics, each meaning exactly the same things.
 All fields except field 9 are cumulative since boot. Field 9 should
@@ -59,30 +60,40 @@ system-wide stats you'll have to find all the devices and sum them all up.

 Field  1 -- # of reads completed
     This is the total number of reads completed successfully.
+
 Field  2 -- # of reads merged, field 6 -- # of writes merged
     Reads and writes which are adjacent to each other may be merged for
     efficiency. Thus two 4K reads may become one 8K read before it is
     ultimately handed to the disk, and so it will be counted (and queued)
     as only one I/O. This field lets you know how often this was done.
+
 Field  3 -- # of sectors read
     This is the total number of sectors read successfully.
+
 Field  4 -- # of milliseconds spent reading
     This is the total number of milliseconds spent by all reads (as
     measured from __make_request() to end_that_request_last()).
+
 Field  5 -- # of writes completed
     This is the total number of writes completed successfully.
+
 Field  6 -- # of writes merged
     See the description of field 2.
+
 Field  7 -- # of sectors written
     This is the total number of sectors written successfully.
+
 Field  8 -- # of milliseconds spent writing
     This is the total number of milliseconds spent by all writes (as
     measured from __make_request() to end_that_request_last()).
+
 Field  9 -- # of I/Os currently in progress
     The only field that should go to zero. Incremented as requests are
     given to appropriate struct request_queue and decremented as they finish.
+
 Field 10 -- # of milliseconds spent doing I/Os
     This field increases so long as field 9 is nonzero.
+
 Field 11 -- weighted # of milliseconds spent doing I/Os
     This field is incremented at each I/O start, I/O completion, I/O
     merge, or read of these stats by the number of I/Os in progress
@@ -97,7 +108,7 @@ introduced when changes collide, so (for instance) adding up all the
 read I/Os issued per partition should equal those made to the disks ...
 but due to the lack of locking it may only be very close.

-In 2.6, there are counters for each CPU, which make the lack of locking
+In 2.6+, there are counters for each CPU, which make the lack of locking
 almost a non-issue. When the statistics are read, the per-CPU counters
 are summed (possibly overflowing the unsigned long variable they are
 summed to) and the result given to the user. There is no convenient
@@ -106,22 +117,25 @@ user interface for accessing the per-CPU counters themselves.
 Disks vs Partitions
 -------------------

-There were significant changes between 2.4 and 2.6 in the I/O subsystem.
+There were significant changes between 2.4 and 2.6+ in the I/O subsystem.
 As a result, some statistic information disappeared. The translation from
 a disk address relative to a partition to the disk address relative to
 the host disk happens much earlier. All merges and timings now happen
 at the disk level rather than at both the disk and partition level as
-in 2.4. Consequently, you'll see a different statistics output on 2.6 for
+in 2.4. Consequently, you'll see a different statistics output on 2.6+ for
 partitions from that for disks. There are only *four* fields available
-for partitions on 2.6 machines. This is reflected in the examples above.
+for partitions on 2.6+ machines. This is reflected in the examples above.

 Field  1 -- # of reads issued
     This is the total number of reads issued to this partition.
+
 Field  2 -- # of sectors read
     This is the total number of sectors requested to be read from this
     partition.
+
 Field  3 -- # of writes issued
     This is the total number of writes issued to this partition.
+
 Field  4 -- # of sectors written
     This is the total number of sectors requested to be written to
     this partition.
@@ -149,16 +163,16 @@ to some (probably insignificant) inaccuracy.
 Additional notes
 ----------------

-In 2.6, sysfs is not mounted by default. If your distribution of
+In 2.6+, sysfs is not mounted by default. If your distribution of
 Linux hasn't added it already, here's the line you'll want to add to
-your /etc/fstab:
+your ``/etc/fstab``::

-none /sys sysfs defaults 0 0
+        none /sys sysfs defaults 0 0


-In 2.6, all disk statistics were removed from /proc/stat. In 2.4, they
-appear in both /proc/partitions and /proc/stat, although the ones in
-/proc/stat take a very different format from those in /proc/partitions
+In 2.6+, all disk statistics were removed from ``/proc/stat``. In 2.4, they
+appear in both ``/proc/partitions`` and ``/proc/stat``, although the ones in
+``/proc/stat`` take a very different format from those in ``/proc/partitions``
 (see proc(5), if your system has it.)

 -- ricklind@us.ibm.com
diff --git a/Documentation/irqflags-tracing.txt b/Documentation/irqflags-tracing.txt
index f6da05670e16..bdd208259fb3 100644
--- a/Documentation/irqflags-tracing.txt
+++ b/Documentation/irqflags-tracing.txt
@@ -1,8 +1,10 @@
+=======================
 IRQ-flags state tracing
+=======================

-started by Ingo Molnar <mingo@redhat.com>
+:Author: started by Ingo Molnar <mingo@redhat.com>

-the "irq-flags tracing" feature "traces" hardirq and softirq state, in
+The "irq-flags tracing" feature "traces" hardirq and softirq state, in
 that it gives interested subsystems an opportunity to be notified of
 every hardirqs-off/hardirqs-on, softirqs-off/softirqs-on event that
 happens in the kernel.
@@ -14,7 +16,7 @@ CONFIG_PROVE_RWSEM_LOCKING will be offered on an architecture - these
 are locking APIs that are not used in IRQ context. (the one exception
 for rwsems is worked around)

-architecture support for this is certainly not in the "trivial"
+Architecture support for this is certainly not in the "trivial"
 category, because lots of lowlevel assembly code deal with irq-flags
 state changes. But an architecture can be irq-flags-tracing enabled in a
 rather straightforward and risk-free manner.
@@ -41,7 +43,7 @@ irq-flags-tracing support:
   excluded from the irq-tracing [and lock validation] mechanism via
   lockdep_off()/lockdep_on().

-in general there is no risk from having an incomplete irq-flags-tracing
+In general there is no risk from having an incomplete irq-flags-tracing
 implementation in an architecture: lockdep will detect that and will
 turn itself off. I.e. the lock validator will still be reliable. There
 should be no crashes due to irq-tracing bugs. (except if the assembly
diff --git a/Documentation/isa.txt b/Documentation/isa.txt
index f232c26a40be..def4a7b690b5 100644
--- a/Documentation/isa.txt
+++ b/Documentation/isa.txt
@@ -1,5 +1,6 @@
+===========
 ISA Drivers
------------
+===========

 The following text is adapted from the commit message of the initial
 commit of the ISA bus driver authored by Rene Herman.
@@ -23,17 +24,17 @@ that all device creation has been made internal as well.

 The usage model this provides is nice, and has been acked from the ALSA
 side by Takashi Iwai and Jaroslav Kysela. The ALSA driver module_init's
-now (for oldisa-only drivers) become:
+now (for oldisa-only drivers) become::

-static int __init alsa_card_foo_init(void)
-{
-        return isa_register_driver(&snd_foo_isa_driver, SNDRV_CARDS);
-}
+        static int __init alsa_card_foo_init(void)
+        {
+                return isa_register_driver(&snd_foo_isa_driver, SNDRV_CARDS);
+        }

-static void __exit alsa_card_foo_exit(void)
-{
-        isa_unregister_driver(&snd_foo_isa_driver);
-}
+        static void __exit alsa_card_foo_exit(void)
+        {
+                isa_unregister_driver(&snd_foo_isa_driver);
+        }

 Quite like the other bus models therefore. This removes a lot of
 duplicated init code from the ALSA ISA drivers.
@@ -47,11 +48,11 @@ parameter, indicating how many devices to create and call our methods
 with.

 The platform_driver callbacks are called with a platform_device param;
-the isa_driver callbacks are being called with a "struct device *dev,
-unsigned int id" pair directly -- with the device creation completely
+the isa_driver callbacks are being called with a ``struct device *dev,
+unsigned int id`` pair directly -- with the device creation completely
 internal to the bus it's much cleaner to not leak isa_dev's by passing
 them in at all. The id is the only thing we ever want other than the
-struct device * anyways, and it makes for nicer code in the callbacks as
+struct device anyways, and it makes for nicer code in the callbacks as
 well.
@@ -75,20 +76,20 @@ This exports only two functions; isa_{,un}register_driver().

 isa_register_driver() registers the struct device_driver, and then
 loops over the passed in ndev creating devices and registering them.
-This causes the bus match method to be called for them, which is:
+This causes the bus match method to be called for them, which is::

-int isa_bus_match(struct device *dev, struct device_driver *driver)
-{
-        struct isa_driver *isa_driver = to_isa_driver(driver);
-
-        if (dev->platform_data == isa_driver) {
-                if (!isa_driver->match ||
-                        isa_driver->match(dev, to_isa_dev(dev)->id))
-                        return 1;
-                dev->platform_data = NULL;
-        }
-        return 0;
-}
+        int isa_bus_match(struct device *dev, struct device_driver *driver)
+        {
+                struct isa_driver *isa_driver = to_isa_driver(driver);
+
+                if (dev->platform_data == isa_driver) {
+                        if (!isa_driver->match ||
+                                isa_driver->match(dev, to_isa_dev(dev)->id))
+                                return 1;
+                        dev->platform_data = NULL;
+                }
+                return 0;
+        }

 The first thing this does is check if this device is in fact one of this
 driver's devices by seeing if the device's platform_data pointer is set
@@ -102,7 +103,7 @@ well.
 Then, if the driver did not provide a .match, it matches. If it did,
 the driver match() method is called to determine a match.

-If it did _not_ match, dev->platform_data is reset to indicate this to
+If it did **not** match, dev->platform_data is reset to indicate this to
 isa_register_driver which can then unregister the device again.

 If during all this, there's any error, or no devices matched at all
diff --git a/Documentation/isapnp.txt b/Documentation/isapnp.txt
index 400d1b5b523d..8d0840ac847b 100644
--- a/Documentation/isapnp.txt
+++ b/Documentation/isapnp.txt
@@ -1,3 +1,4 @@
+==========================================================
 ISA Plug & Play support by Jaroslav Kysela <perex@suse.cz>
 ==========================================================

diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index 2cb7dc5c0e0d..0f00f9c164ac 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -1,27 +1,29 @@
-REDUCING OS JITTER DUE TO PER-CPU KTHREADS
+==========================================
+Reducing OS jitter due to per-cpu kthreads
+==========================================

 This document lists per-CPU kthreads in the Linux kernel and presents
 options to control their OS jitter. Note that non-per-CPU kthreads are
 not listed here. To reduce OS jitter from non-per-CPU kthreads, bind
 them to a "housekeeping" CPU dedicated to such work.

+References
+==========

-REFERENCES
-
-o       Documentation/IRQ-affinity.txt:  Binding interrupts to sets of CPUs.
+- Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs.

-o       Documentation/cgroup-v1:  Using cgroups to bind tasks to sets of CPUs.
+- Documentation/cgroup-v1: Using cgroups to bind tasks to sets of CPUs.

-o       man taskset:  Using the taskset command to bind tasks to sets
+- man taskset: Using the taskset command to bind tasks to sets
         of CPUs.

-o       man sched_setaffinity:  Using the sched_setaffinity() system
+- man sched_setaffinity: Using the sched_setaffinity() system
         call to bind tasks to sets of CPUs.

-o       /sys/devices/system/cpu/cpuN/online:  Control CPU N's hotplug state,
+- /sys/devices/system/cpu/cpuN/online: Control CPU N's hotplug state,
         writing "0" to offline and "1" to online.

-o       In order to locate kernel-generated OS jitter on CPU N:
+- In order to locate kernel-generated OS jitter on CPU N:

         cd /sys/kernel/debug/tracing
         echo 1 > max_graph_depth # Increase the "1" for more detail
@@ -29,12 +31,17 @@ o In order to locate kernel-generated OS jitter on CPU N:
     # run workload
     cat per_cpu/cpuN/trace

+kthreads
+========
+
+Name:
+  ehca_comp/%u

-KTHREADS
+Purpose:
+  Periodically process Infiniband-related work.

-Name: ehca_comp/%u
-Purpose: Periodically process Infiniband-related work.
 To reduce its OS jitter, do any of the following:
+
 1. Don't use eHCA Infiniband hardware, instead choosing hardware
    that does not require per-CPU kthreads. This will prevent these
    kthreads from being created in the first place. (This will
@@ -46,26 +53,45 @@ To reduce its OS jitter, do any of the following:
    provisioned only on selected CPUs.


-Name: irq/%d-%s
-Purpose: Handle threaded interrupts.
+Name:
+  irq/%d-%s
+
+Purpose:
+  Handle threaded interrupts.
+
 To reduce its OS jitter, do the following:
+
 1. Use irq affinity to force the irq threads to execute on
    some other CPU.

-Name: kcmtpd_ctr_%d
-Purpose: Handle Bluetooth work.
+Name:
+  kcmtpd_ctr_%d
+
+Purpose:
+  Handle Bluetooth work.
+
 To reduce its OS jitter, do one of the following:
+
 1. Don't use Bluetooth, in which case these kthreads won't be
    created in the first place.
 2. Use irq affinity to force Bluetooth-related interrupts to
    occur on some other CPU and furthermore initiate all
    Bluetooth activity on some other CPU.

-Name: ksoftirqd/%u
-Purpose: Execute softirq handlers when threaded or when under heavy load.
+Name:
+  ksoftirqd/%u
+
+Purpose:
+  Execute softirq handlers when threaded or when under heavy load.
+
 To reduce its OS jitter, each softirq vector must be handled
 separately as follows:
-TIMER_SOFTIRQ: Do all of the following:
+
+TIMER_SOFTIRQ
+-------------
+
+Do all of the following:
+
 1. To the extent possible, keep the CPU out of the kernel when it
    is non-idle, for example, by avoiding system calls and by forcing
    both kernel threads and interrupts to execute elsewhere.
@@ -76,34 +102,59 @@ TIMER_SOFTIRQ: Do all of the following:
    first one back online. Once you have onlined the CPUs in question,
    do not offline any other CPUs, because doing so could force the
    timer back onto one of the CPUs in question.
-NET_TX_SOFTIRQ and NET_RX_SOFTIRQ: Do all of the following:
+
+NET_TX_SOFTIRQ and NET_RX_SOFTIRQ
+---------------------------------
+
+Do all of the following:
+
 1. Force networking interrupts onto other CPUs.
 2. Initiate any network I/O on other CPUs.
 3. Once your application has started, prevent CPU-hotplug operations
    from being initiated from tasks that might run on the CPU to
    be de-jittered. (It is OK to force this CPU offline and then
    bring it back online before you start your application.)
-BLOCK_SOFTIRQ: Do all of the following:
+
+BLOCK_SOFTIRQ
+-------------
+
+Do all of the following:
+
 1. Force block-device interrupts onto some other CPU.
 2. Initiate any block I/O on other CPUs.
 3. Once your application has started, prevent CPU-hotplug operations
    from being initiated from tasks that might run on the CPU to
    be de-jittered. (It is OK to force this CPU offline and then
    bring it back online before you start your application.)
-IRQ_POLL_SOFTIRQ: Do all of the following:
+
+IRQ_POLL_SOFTIRQ
+----------------
+
+Do all of the following:
+
 1. Force block-device interrupts onto some other CPU.
 2. Initiate any block I/O and block-I/O polling on other CPUs.
 3. Once your application has started, prevent CPU-hotplug operations
    from being initiated from tasks that might run on the CPU to
    be de-jittered. (It is OK to force this CPU offline and then
    bring it back online before you start your application.)
-TASKLET_SOFTIRQ: Do one or more of the following:
+
+TASKLET_SOFTIRQ
+---------------
+
+Do one or more of the following:
+
 1. Avoid use of drivers that use tasklets. (Such drivers will contain
    calls to things like tasklet_schedule().)
 2. Convert all drivers that you must use from tasklets to workqueues.
 3. Force interrupts for drivers using tasklets onto other CPUs,
    and also do I/O involving these drivers on other CPUs.
-SCHED_SOFTIRQ: Do all of the following:
+
+SCHED_SOFTIRQ
+-------------
+
+Do all of the following:
+
 1. Avoid sending scheduler IPIs to the CPU to be de-jittered,
    for example, ensure that at most one runnable kthread is present
    on that CPU. If a thread that expects to run on the de-jittered
@@ -120,7 +171,12 @@ SCHED_SOFTIRQ: Do all of the following:
    forcing both kernel threads and interrupts to execute elsewhere.
    This further reduces the number of scheduler-clock interrupts
    received by the de-jittered CPU.
-HRTIMER_SOFTIRQ: Do all of the following:
+
+HRTIMER_SOFTIRQ
+---------------
+
+Do all of the following:
+
 1. To the extent possible, keep the CPU out of the kernel when it
    is non-idle. For example, avoid system calls and force both
    kernel threads and interrupts to execute elsewhere.
@@ -131,9 +187,15 @@ HRTIMER_SOFTIRQ: Do all of the following:
    back online. Once you have onlined the CPUs in question, do not
    offline any other CPUs, because doing so could force the timer
    back onto one of the CPUs in question.
-RCU_SOFTIRQ: Do at least one of the following:
+
+RCU_SOFTIRQ
+-----------
+
+Do at least one of the following:
+
 1. Offload callbacks and keep the CPU in either dyntick-idle or
    adaptive-ticks state by doing all of the following:
+
    a. CONFIG_NO_HZ_FULL=y and ensure that the CPU to be
       de-jittered is marked as an adaptive-ticks CPU using the
       "nohz_full=" boot parameter. Bind the rcuo kthreads to
@@ -142,8 +204,10 @@ RCU_SOFTIRQ: Do at least one of the following:
       when it is non-idle, for example, by avoiding system
       calls and by forcing both kernel threads and interrupts
       to execute elsewhere.
+
 2. Enable RCU to do its processing remotely via dyntick-idle by
    doing all of the following:
+
    a. Build with CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y.
    b. Ensure that the CPU goes idle frequently, allowing other
       CPUs to detect that it has passed through an RCU quiescent
@@ -155,15 +219,20 @@ RCU_SOFTIRQ: Do at least one of the following:
       calls and by forcing both kernel threads and interrupts
       to execute elsewhere.

-Name: kworker/%u:%d%s (cpu, id, priority)
-Purpose: Execute workqueue requests
+Name:
+  kworker/%u:%d%s (cpu, id, priority)
+
+Purpose:
+  Execute workqueue requests
+
 To reduce its OS jitter, do any of the following:
+
 1. Run your workload at a real-time priority, which will allow
    preempting the kworker daemons.
 2. A given workqueue can be made visible in the sysfs filesystem
    by passing the WQ_SYSFS to that workqueue's alloc_workqueue().
    Such a workqueue can be confined to a given subset of the
-   CPUs using the /sys/devices/virtual/workqueue/*/cpumask sysfs
+   CPUs using the ``/sys/devices/virtual/workqueue/*/cpumask`` sysfs
    files. The set of WQ_SYSFS workqueues can be displayed using
    "ls sys/devices/virtual/workqueue". That said, the workqueues
    maintainer would like to caution people against indiscriminately
@@ -173,6 +242,7 @@ To reduce its OS jitter, do any of the following:
    to remove it, even if its addition was a mistake.
 3. Do any of the following needed to avoid jitter that your
    application cannot tolerate:
+
    a. Build your kernel with CONFIG_SLUB=y rather than
       CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
       use of each CPU's workqueues to run its cache_reap()
@@ -186,6 +256,7 @@ To reduce its OS jitter, do any of the following:
       be able to build your kernel with CONFIG_CPU_FREQ=n to
       avoid the CPU-frequency governor periodically running
       on each CPU, including cs_dbs_timer() and od_dbs_timer().
+
       WARNING: Please check your CPU specifications to
       make sure that this is safe on your particular system.
    d. As of v3.18, Christoph Lameter's on-demand vmstat workers
@@ -222,9 +293,14 @@ To reduce its OS jitter, do any of the following:
       CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
       avoiding OS jitter from rackmeter_do_timer().

-Name: rcuc/%u
-Purpose: Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels.
+Name:
+  rcuc/%u
+
+Purpose:
+  Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels.
+
 To reduce its OS jitter, do at least one of the following:
+
 1. Build the kernel with CONFIG_PREEMPT=n. This prevents these
    kthreads from being created in the first place, and also obviates
    the need for RCU priority boosting. This approach is feasible
@@ -244,9 +320,14 @@ To reduce its OS jitter, do at least one of the following:
    CPU, again preventing the rcuc/%u kthreads from having any work
    to do.

-Name: rcuob/%d, rcuop/%d, and rcuos/%d
-Purpose: Offload RCU callbacks from the corresponding CPU.
+Name:
+  rcuob/%d, rcuop/%d, and rcuos/%d
+
+Purpose:
+  Offload RCU callbacks from the corresponding CPU.
+
 To reduce its OS jitter, do at least one of the following:
+
 1. Use affinity, cgroups, or other mechanism to force these kthreads
    to execute on some other CPU.
 2. Build with CONFIG_RCU_NOCB_CPU=n, which will prevent these
@@ -254,9 +335,14 @@ To reduce its OS jitter, do at least one of the following:
    note that this will not eliminate OS jitter, but will instead
    shift it to RCU_SOFTIRQ.

-Name: watchdog/%u
-Purpose: Detect software lockups on each CPU.
+Name:
+  watchdog/%u
+
+Purpose:
+  Detect software lockups on each CPU.
+
 To reduce its OS jitter, do at least one of the following:
+
 1. Build with CONFIG_LOCKUP_DETECTOR=n, which will prevent these
    kthreads from being created in the first place.
 2. Boot with "nosoftlockup=0", which will also prevent these kthreads
diff --git a/Documentation/kobject.txt b/Documentation/kobject.txt
index 1be59a3a521c..fc9485d79061 100644
--- a/Documentation/kobject.txt
+++ b/Documentation/kobject.txt
@@ -1,13 +1,13 @@
+=====================================================================
 Everything you never wanted to know about kobjects, ksets, and ktypes
+=====================================================================

-Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+:Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+:Last updated: December 19, 2007

 Based on an original article by Jon Corbet for lwn.net written October 1,
 2003 and located at http://lwn.net/Articles/51437/

-Last updated December 19, 2007
-
-
 Part of the difficulty in understanding the driver model - and the kobject
 abstraction upon which it is built - is that there is no obvious starting
 place. Dealing with kobjects requires understanding a few different types,
@@ -47,6 +47,7 @@ approach will be taken, so we'll go back to kobjects.


 Embedding kobjects
+==================

 It is rare for kernel code to create a standalone kobject, with one major
 exception explained below. Instead, kobjects are used to control access to
@@ -65,7 +66,7 @@ their own, but are invariably found embedded in the larger objects of
 interest.)

 So, for example, the UIO code in drivers/uio/uio.c has a structure that
-defines the memory region associated with a uio device:
+defines the memory region associated with a uio device::

     struct uio_map {
         struct kobject kobj;
@@ -77,7 +78,7 @@ just a matter of using the kobj member. Code that works with kobjects will
 often have the opposite problem, however: given a struct kobject pointer,
 what is the pointer to the containing structure? You must avoid tricks
 (such as assuming that the kobject is at the beginning of the structure)
-and, instead, use the container_of() macro, found in <linux/kernel.h>:
+and, instead, use the container_of() macro, found in <linux/kernel.h>::

     container_of(pointer, type, member)

@@ -90,13 +91,13 @@ where:
 The return value from container_of() is a pointer to the corresponding
 container type. So, for example, a pointer "kp" to a struct kobject
 embedded *within* a struct uio_map could be converted to a pointer to the
-*containing* uio_map structure with:
+*containing* uio_map structure with::

     struct uio_map *u_map = container_of(kp, struct uio_map, kobj);

 For convenience, programmers often define a simple macro for "back-casting"
 kobject pointers to the containing type. Exactly this happens in the
-earlier drivers/uio/uio.c, as you can see here:
+earlier drivers/uio/uio.c, as you can see here::

     struct uio_map {
         struct kobject kobj;
@@ -106,23 +107,25 @@ earlier drivers/uio/uio.c, as you can see here:
     #define to_map(map) container_of(map, struct uio_map, kobj)

 where the macro argument "map" is a pointer to the struct kobject in
-question. That macro is subsequently invoked with:
+question. That macro is subsequently invoked with::

     struct uio_map *map = to_map(kobj);


 Initialization of kobjects
+==========================

 Code which creates a kobject must, of course, initialize that object. Some
-of the internal fields are setup with a (mandatory) call to kobject_init():
+of the internal fields are setup with a (mandatory) call to kobject_init()::

     void kobject_init(struct kobject *kobj, struct kobj_type *ktype);

 The ktype is required for a kobject to be created properly, as every kobject
 must have an associated kobj_type. After calling kobject_init(), to
-register the kobject with sysfs, the function kobject_add() must be called:
+register the kobject with sysfs, the function kobject_add() must be called::

-    int kobject_add(struct kobject *kobj, struct kobject *parent, const char *fmt, ...);
+    int kobject_add(struct kobject *kobj, struct kobject *parent,
+                    const char *fmt, ...);

 This sets up the parent of the kobject and the name for the kobject
 properly. If the kobject is to be associated with a specific kset,
@@ -133,7 +136,7 @@ kset itself.

 As the name of the kobject is set when it is added to the kernel, the name
 of the kobject should never be manipulated directly. If you must change
-the name of the kobject, call kobject_rename():
+the name of the kobject, call kobject_rename()::

     int kobject_rename(struct kobject *kobj, const char *new_name);

@@ -146,12 +149,12 @@ is being removed. If your code needs to call this function, it is
 incorrect and needs to be fixed.

 To properly access the name of the kobject, use the function
-kobject_name():
+kobject_name()::

     const char *kobject_name(const struct kobject * kobj);

 There is a helper function to both initialize and add the kobject to the
-kernel at the same time, called surprisingly enough kobject_init_and_add():
+kernel at the same time, called surprisingly enough kobject_init_and_add()::

     int kobject_init_and_add(struct kobject *kobj, struct kobj_type *ktype,
                              struct kobject *parent, const char *fmt, ...);
@@ -161,10 +164,11 @@ kobject_add() functions described above.


 Uevents
+=======

 After a kobject has been registered with the kobject core, you need to
 announce to the world that it has been created. This can be done with a
-call to kobject_uevent():
+call to kobject_uevent()::

     int kobject_uevent(struct kobject *kobj, enum kobject_action action);

@@ -180,11 +184,12 @@ hand.


 Reference counts
+================

 One of the key functions of a kobject is to serve as a reference counter
 for the object in which it is embedded. As long as references to the object
 exist, the object (and the code which supports it) must continue to exist.
-The low-level functions for manipulating a kobject's reference counts are:
+The low-level functions for manipulating a kobject's reference counts are::

     struct kobject *kobject_get(struct kobject *kobj);
     void kobject_put(struct kobject *kobj);
@@ -209,21 +214,24 @@ file Documentation/kref.txt in the Linux kernel source tree.


 Creating "simple" kobjects
+==========================

 Sometimes all that a developer wants is a way to create a simple directory
 in the sysfs hierarchy, and not have to mess with the whole complication of
 ksets, show and store functions, and other details. This is the one
 exception where a single kobject should be created. To create such an
-entry, use the function:
+entry, use the function::

     struct kobject *kobject_create_and_add(char *name, struct kobject *parent);

 This function will create a kobject and place it in sysfs in the location
 underneath the specified parent kobject. To create simple attributes
-associated with this kobject, use:
+associated with this kobject, use::

     int sysfs_create_file(struct kobject *kobj, struct attribute *attr);
-or
+
+or::
+
     int sysfs_create_group(struct kobject *kobj, struct attribute_group *grp);

 Both types of attributes used here, with a kobject that has been created
@@ -236,6 +244,7 @@ implementation of a simple kobject and attributes.


 ktypes and release methods
+==========================

 One important thing still missing from the discussion is what happens to a
 kobject when its reference count reaches zero. The code which created the
@@ -257,7 +266,7 @@ is good practice to always use kobject_put() after kobject_init() to avoid
 errors creeping in.

 This notification is done through a kobject's release() method. Usually
-such a method has a form like:
+such a method has a form like::

     void my_object_release(struct kobject *kobj)
     {
@@ -281,7 +290,7 @@ leak in the kobject core, which makes people unhappy.

 Interestingly, the release() method is not stored in the kobject itself;
 instead, it is associated with the ktype. So let us introduce struct
-kobj_type:
+kobj_type::

     struct kobj_type {
         void (*release)(struct kobject *kobj);
@@ -306,6 +315,7 @@ automatically created for any kobject that is registered with this ktype.


 ksets
+=====

 A kset is merely a collection of kobjects that want to be associated with
 each other. There is no restriction that they be of the same ktype, but be
@@ -335,13 +345,16 @@ kobject) in their parent.

 As a kset contains a kobject within it, it should always be dynamically
 created and never declared statically or on the stack. To create a new
-kset use:
+kset use::
+
     struct kset *kset_create_and_add(const char *name,
                                      struct kset_uevent_ops *u,
                                      struct kobject *parent);

-When you are finished with the kset, call:
+When you are finished with the kset, call::
+
     void kset_unregister(struct kset *kset);
+
 to destroy it. This removes the kset from sysfs and decrements its reference
 count. When the reference count goes to zero, the kset will be released.
 Because other references to the kset may still exist, the release may happen
@@ -351,14 +364,14 @@ An example of using a kset can be seen in the
 samples/kobject/kset-example.c file in the kernel tree.
 
 If a kset wishes to control the uevent operations of the kobjects
-associated with it, it can use the struct kset_uevent_ops to handle it:
+associated with it, it can use the struct kset_uevent_ops to handle it::
 
-struct kset_uevent_ops {
-	int (*filter)(struct kset *kset, struct kobject *kobj);
-	const char *(*name)(struct kset *kset, struct kobject *kobj);
-	int (*uevent)(struct kset *kset, struct kobject *kobj,
-		      struct kobj_uevent_env *env);
-};
+	struct kset_uevent_ops {
+		int (*filter)(struct kset *kset, struct kobject *kobj);
+		const char *(*name)(struct kset *kset, struct kobject *kobj);
+		int (*uevent)(struct kset *kset, struct kobject *kobj,
+			      struct kobj_uevent_env *env);
+	};
 
 
 The filter function allows a kset to prevent a uevent from being emitted to
@@ -386,6 +399,7 @@ added below the parent kobject.
 
 
 Kobject removal
+===============
 
 After a kobject has been registered with the kobject core successfully, it
 must be cleaned up when the code is finished with it. To do that, call
@@ -409,6 +423,7 @@ called, and the objects in the former circle release each other.
 
 
 Example code to copy from
+=========================
 
 For a more complete example of using ksets and kobjects properly, see the
 example programs samples/kobject/{kobject-example.c,kset-example.c},
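As a companion to the kobject/kset conversion above, the kset API that the document walks through (kset_create_and_add() at module init, kset_unregister() at exit) can be sketched as a minimal kernel-side fragment. This is an untested illustration loosely following samples/kobject/kset-example.c, not part of the patch; the names example_init/example_exit and the sysfs directory name "kset_example" are illustrative:

```c
/* Minimal kset lifecycle sketch (kernel module code; builds only
 * against a kernel tree, shown here for illustration only). */
#include <linux/kobject.h>
#include <linux/module.h>

static struct kset *example_kset;

static int __init example_init(void)
{
	/* Creates /sys/kernel/kset_example; NULL uevent ops means the
	 * kset does not filter or rename uevents of its members. */
	example_kset = kset_create_and_add("kset_example", NULL, kernel_kobj);
	if (!example_kset)
		return -ENOMEM;
	return 0;
}

static void __exit example_exit(void)
{
	/* Removes the kset from sysfs and drops its reference; the
	 * actual release happens when the refcount reaches zero. */
	kset_unregister(example_kset);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");
```

Kobjects that should appear under this kset would set their kobj.kset pointer to example_kset before calling kobject_init_and_add(), as the document describes.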
diff --git a/Documentation/kprobes.txt b/Documentation/kprobes.txt
index 1f6d45abfe42..2335715bf471 100644
--- a/Documentation/kprobes.txt
+++ b/Documentation/kprobes.txt
@@ -1,30 +1,36 @@
-Title	: Kernel Probes (Kprobes)
-Authors	: Jim Keniston <jkenisto@us.ibm.com>
-	: Prasanna S Panchamukhi <prasanna.panchamukhi@gmail.com>
-	: Masami Hiramatsu <mhiramat@redhat.com>
-
-CONTENTS
-
-1. Concepts: Kprobes, Jprobes, Return Probes
-2. Architectures Supported
-3. Configuring Kprobes
-4. API Reference
-5. Kprobes Features and Limitations
-6. Probe Overhead
-7. TODO
-8. Kprobes Example
-9. Jprobes Example
-10. Kretprobes Example
-Appendix A: The kprobes debugfs interface
-Appendix B: The kprobes sysctl interface
-
-1. Concepts: Kprobes, Jprobes, Return Probes
+=======================
+Kernel Probes (Kprobes)
+=======================
+
+:Author: Jim Keniston <jkenisto@us.ibm.com>
+:Author: Prasanna S Panchamukhi <prasanna.panchamukhi@gmail.com>
+:Author: Masami Hiramatsu <mhiramat@redhat.com>
+
+.. CONTENTS
+
+  1. Concepts: Kprobes, Jprobes, Return Probes
+  2. Architectures Supported
+  3. Configuring Kprobes
+  4. API Reference
+  5. Kprobes Features and Limitations
+  6. Probe Overhead
+  7. TODO
+  8. Kprobes Example
+  9. Jprobes Example
+  10. Kretprobes Example
+  Appendix A: The kprobes debugfs interface
+  Appendix B: The kprobes sysctl interface
+
+Concepts: Kprobes, Jprobes, Return Probes
+=========================================
 
 Kprobes enables you to dynamically break into any kernel routine and
 collect debugging and performance information non-disruptively. You
-can trap at almost any kernel code address(*), specifying a handler
+can trap at almost any kernel code address [1]_, specifying a handler
 routine to be invoked when the breakpoint is hit.
-(*: some parts of the kernel code can not be trapped, see 1.5 Blacklist)
+
+.. [1] some parts of the kernel code can not be trapped, see
+   :ref:`kprobes_blacklist`)
 
 There are currently three types of probes: kprobes, jprobes, and
 kretprobes (also called return probes). A kprobe can be inserted
@@ -40,8 +46,8 @@ registration function such as register_kprobe() specifies where
 the probe is to be inserted and what handler is to be called when
 the probe is hit.
 
-There are also register_/unregister_*probes() functions for batch
-registration/unregistration of a group of *probes. These functions
+There are also ``register_/unregister_*probes()`` functions for batch
+registration/unregistration of a group of ``*probes``. These functions
 can speed up unregistration process when you have to unregister
 a lot of probes at once.
 
@@ -51,9 +57,10 @@ things that you'll need to know in order to make the best use of
 Kprobes -- e.g., the difference between a pre_handler and
 a post_handler, and how to use the maxactive and nmissed fields of
 a kretprobe. But if you're in a hurry to start using Kprobes, you
-can skip ahead to section 2.
+can skip ahead to :ref:`kprobes_archs_supported`.
 
-1.1 How Does a Kprobe Work?
+How Does a Kprobe Work?
+-----------------------
 
 When a kprobe is registered, Kprobes makes a copy of the probed
 instruction and replaces the first byte(s) of the probed instruction
@@ -75,7 +82,8 @@ After the instruction is single-stepped, Kprobes executes the
 "post_handler," if any, that is associated with the kprobe.
 Execution then continues with the instruction following the probepoint.
 
-1.2 How Does a Jprobe Work?
+How Does a Jprobe Work?
+-----------------------
 
 A jprobe is implemented using a kprobe that is placed on a function's
 entry point. It employs a simple mirroring principle to allow
@@ -113,9 +121,11 @@ more than eight function arguments, an argument of more than sixteen
 bytes, or more than 64 bytes of argument data, depending on
 architecture).
 
-1.3 Return Probes
+Return Probes
+-------------
 
-1.3.1 How Does a Return Probe Work?
+How Does a Return Probe Work?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 When you call register_kretprobe(), Kprobes establishes a kprobe at
 the entry to the function. When the probed function is called and this
@@ -150,7 +160,8 @@ zero when the return probe is registered, and is incremented every
 time the probed function is entered but there is no kretprobe_instance
 object available for establishing the return probe.
 
-1.3.2 Kretprobe entry-handler
+Kretprobe entry-handler
+^^^^^^^^^^^^^^^^^^^^^^^
 
 Kretprobes also provides an optional user-specified handler which runs
 on function entry. This handler is specified by setting the entry_handler
@@ -174,7 +185,10 @@ In case probed function is entered but there is no kretprobe_instance
 object available, then in addition to incrementing the nmissed count,
 the user entry_handler invocation is also skipped.
 
-1.4 How Does Jump Optimization Work?
+.. _kprobes_jump_optimization:
+
+How Does Jump Optimization Work?
+--------------------------------
 
 If your kernel is built with CONFIG_OPTPROBES=y (currently this flag
 is automatically set 'y' on x86/x86-64, non-preemptive kernel) and
@@ -182,53 +196,60 @@ the "debug.kprobes_optimization" kernel parameter is set to 1 (see
 sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump
 instruction instead of a breakpoint instruction at each probepoint.
 
-1.4.1 Init a Kprobe
+Init a Kprobe
+^^^^^^^^^^^^^
 
 When a probe is registered, before attempting this optimization,
 Kprobes inserts an ordinary, breakpoint-based kprobe at the specified
 address. So, even if it's not possible to optimize this particular
 probepoint, there'll be a probe there.
 
-1.4.2 Safety Check
+Safety Check
+^^^^^^^^^^^^
 
 Before optimizing a probe, Kprobes performs the following safety checks:
 
 - Kprobes verifies that the region that will be replaced by the jump
-instruction (the "optimized region") lies entirely within one function.
-(A jump instruction is multiple bytes, and so may overlay multiple
-instructions.)
+  instruction (the "optimized region") lies entirely within one function.
+  (A jump instruction is multiple bytes, and so may overlay multiple
+  instructions.)
 
 - Kprobes analyzes the entire function and verifies that there is no
-jump into the optimized region. Specifically:
+  jump into the optimized region. Specifically:
+
   - the function contains no indirect jump;
   - the function contains no instruction that causes an exception (since
     the fixup code triggered by the exception could jump back into the
     optimized region -- Kprobes checks the exception tables to verify this);
-    and
   - there is no near jump to the optimized region (other than to the first
     byte).
 
 - For each instruction in the optimized region, Kprobes verifies that
-the instruction can be executed out of line.
+  the instruction can be executed out of line.
 
-1.4.3 Preparing Detour Buffer
+Preparing Detour Buffer
+^^^^^^^^^^^^^^^^^^^^^^^
 
 Next, Kprobes prepares a "detour" buffer, which contains the following
 instruction sequence:
+
 - code to push the CPU's registers (emulating a breakpoint trap)
 - a call to the trampoline code which calls user's probe handlers.
 - code to restore registers
 - the instructions from the optimized region
 - a jump back to the original execution path.
 
-1.4.4 Pre-optimization
+Pre-optimization
+^^^^^^^^^^^^^^^^
 
 After preparing the detour buffer, Kprobes verifies that none of the
 following situations exist:
+
 - The probe has either a break_handler (i.e., it's a jprobe) or a
-post_handler.
+  post_handler.
 - Other instructions in the optimized region are probed.
 - The probe is disabled.
+
 In any of the above cases, Kprobes won't start optimizing the probe.
 Since these are temporary situations, Kprobes tries to start
 optimizing it again if the situation is changed.
@@ -240,21 +261,23 @@ Kprobes returns control to the original instruction path by setting
 the CPU's instruction pointer to the copied code in the detour buffer
 -- thus at least avoiding the single-step.
 
-1.4.5 Optimization
+Optimization
+^^^^^^^^^^^^
 
 The Kprobe-optimizer doesn't insert the jump instruction immediately;
 rather, it calls synchronize_sched() for safety first, because it's
 possible for a CPU to be interrupted in the middle of executing the
-optimized region(*). As you know, synchronize_sched() can ensure
+optimized region [3]_. As you know, synchronize_sched() can ensure
 that all interruptions that were active when synchronize_sched()
 was called are done, but only if CONFIG_PREEMPT=n. So, this version
-of kprobe optimization supports only kernels with CONFIG_PREEMPT=n.(**)
+of kprobe optimization supports only kernels with CONFIG_PREEMPT=n [4]_.
 
 After that, the Kprobe-optimizer calls stop_machine() to replace
 the optimized region with a jump instruction to the detour buffer,
 using text_poke_smp().
 
-1.4.6 Unoptimization
+Unoptimization
+^^^^^^^^^^^^^^
 
 When an optimized kprobe is unregistered, disabled, or blocked by
 another kprobe, it will be unoptimized. If this happens before
@@ -263,15 +286,15 @@ optimized list. If the optimization has been done, the jump is
 replaced with the original code (except for an int3 breakpoint in
 the first byte) by using text_poke_smp().
 
-(*)Please imagine that the 2nd instruction is interrupted and then
-the optimizer replaces the 2nd instruction with the jump *address*
-while the interrupt handler is running. When the interrupt
-returns to original address, there is no valid instruction,
-and it causes an unexpected result.
+.. [3] Please imagine that the 2nd instruction is interrupted and then
+   the optimizer replaces the 2nd instruction with the jump *address*
+   while the interrupt handler is running. When the interrupt
+   returns to original address, there is no valid instruction,
+   and it causes an unexpected result.
 
-(**)This optimization-safety checking may be replaced with the
-stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y
-kernel.
+.. [4] This optimization-safety checking may be replaced with the
+   stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y
+   kernel.
 
 NOTE for geeks:
 The jump optimization changes the kprobe's pre_handler behavior.
@@ -280,11 +303,17 @@ path by changing regs->ip and returning 1. However, when the probe
 is optimized, that modification is ignored. Thus, if you want to
 tweak the kernel's execution path, you need to suppress optimization,
 using one of the following techniques:
+
 - Specify an empty function for the kprobe's post_handler or break_handler.
-   or
+
+or
+
 - Execute 'sysctl -w debug.kprobes_optimization=n'
 
-1.5 Blacklist
+.. _kprobes_blacklist:
+
+Blacklist
+---------
 
 Kprobes can probe most of the kernel except itself. This means
 that there are some functions where kprobes cannot probe. Probing
@@ -297,7 +326,10 @@ to specify a blacklisted function.
 Kprobes checks the given probe address against the blacklist and
 rejects registering it, if the given address is in the blacklist.
 
-2. Architectures Supported
+.. _kprobes_archs_supported:
+
+Architectures Supported
+=======================
 
 Kprobes, jprobes, and return probes are implemented on the following
 architectures:
@@ -312,7 +344,8 @@ architectures:
 - mips
 - s390
 
-3. Configuring Kprobes
+Configuring Kprobes
+===================
 
 When configuring the kernel using make menuconfig/xconfig/oldconfig,
 ensure that CONFIG_KPROBES is set to "y". Under "General setup", look
@@ -331,7 +364,8 @@ it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO),
 so you can use "objdump -d -l vmlinux" to see the source-to-object
 code mapping.
 
-4. API Reference
+API Reference
+=============
 
 The Kprobes API includes a "register" function and an "unregister"
 function for each type of probe. The API also includes "register_*probes"
@@ -340,10 +374,13 @@ Here are terse, mini-man-page specifications for these functions and
 the associated probe handlers that you'll write. See the files in the
 samples/kprobes/ sub-directory for examples.
 
-4.1 register_kprobe
+register_kprobe
+---------------
+
+::
 
-#include <linux/kprobes.h>
-int register_kprobe(struct kprobe *kp);
+	#include <linux/kprobes.h>
+	int register_kprobe(struct kprobe *kp);
 
 Sets a breakpoint at the address kp->addr. When the breakpoint is
 hit, Kprobes calls kp->pre_handler. After the probed instruction
@@ -354,61 +391,68 @@ kp->fault_handler. Any or all handlers can be NULL. If kp->flags
 is set KPROBE_FLAG_DISABLED, that kp will be registered but disabled,
 so, its handlers aren't hit until calling enable_kprobe(kp).
 
-NOTE:
-1. With the introduction of the "symbol_name" field to struct kprobe,
-the probepoint address resolution will now be taken care of by the kernel.
-The following will now work:
+.. note::
+
+   1. With the introduction of the "symbol_name" field to struct kprobe,
+      the probepoint address resolution will now be taken care of by the kernel.
+      The following will now work::
 
 	kp.symbol_name = "symbol_name";
 
-(64-bit powerpc intricacies such as function descriptors are handled
-transparently)
+      (64-bit powerpc intricacies such as function descriptors are handled
+      transparently)
 
-2. Use the "offset" field of struct kprobe if the offset into the symbol
-to install a probepoint is known. This field is used to calculate the
-probepoint.
+   2. Use the "offset" field of struct kprobe if the offset into the symbol
+      to install a probepoint is known. This field is used to calculate the
+      probepoint.
 
-3. Specify either the kprobe "symbol_name" OR the "addr". If both are
-specified, kprobe registration will fail with -EINVAL.
+   3. Specify either the kprobe "symbol_name" OR the "addr". If both are
+      specified, kprobe registration will fail with -EINVAL.
 
-4. With CISC architectures (such as i386 and x86_64), the kprobes code
-does not validate if the kprobe.addr is at an instruction boundary.
-Use "offset" with caution.
+   4. With CISC architectures (such as i386 and x86_64), the kprobes code
+      does not validate if the kprobe.addr is at an instruction boundary.
+      Use "offset" with caution.
 
 register_kprobe() returns 0 on success, or a negative errno otherwise.
 
-User's pre-handler (kp->pre_handler):
-#include <linux/kprobes.h>
-#include <linux/ptrace.h>
-int pre_handler(struct kprobe *p, struct pt_regs *regs);
+User's pre-handler (kp->pre_handler)::
+
+	#include <linux/kprobes.h>
+	#include <linux/ptrace.h>
+	int pre_handler(struct kprobe *p, struct pt_regs *regs);
 
 Called with p pointing to the kprobe associated with the breakpoint,
 and regs pointing to the struct containing the registers saved when
 the breakpoint was hit. Return 0 here unless you're a Kprobes geek.
 
-User's post-handler (kp->post_handler):
-#include <linux/kprobes.h>
-#include <linux/ptrace.h>
-void post_handler(struct kprobe *p, struct pt_regs *regs,
-		  unsigned long flags);
+User's post-handler (kp->post_handler)::
+
+	#include <linux/kprobes.h>
+	#include <linux/ptrace.h>
+	void post_handler(struct kprobe *p, struct pt_regs *regs,
+			  unsigned long flags);
 
 p and regs are as described for the pre_handler. flags always seems
 to be zero.
 
-User's fault-handler (kp->fault_handler):
-#include <linux/kprobes.h>
-#include <linux/ptrace.h>
-int fault_handler(struct kprobe *p, struct pt_regs *regs, int trapnr);
+User's fault-handler (kp->fault_handler)::
+
+	#include <linux/kprobes.h>
+	#include <linux/ptrace.h>
+	int fault_handler(struct kprobe *p, struct pt_regs *regs, int trapnr);
 
 p and regs are as described for the pre_handler. trapnr is the
 architecture-specific trap number associated with the fault (e.g.,
 on i386, 13 for a general protection fault or 14 for a page fault).
 Returns 1 if it successfully handled the exception.
 
-4.2 register_jprobe
+register_jprobe
+---------------
 
-#include <linux/kprobes.h>
-int register_jprobe(struct jprobe *jp)
+::
+
+	#include <linux/kprobes.h>
+	int register_jprobe(struct jprobe *jp)
 
 Sets a breakpoint at the address jp->kp.addr, which must be the address
 of the first instruction of a function. When the breakpoint is hit,
@@ -423,10 +467,13 @@ declaration must match.
 
 register_jprobe() returns 0 on success, or a negative errno otherwise.
 
-4.3 register_kretprobe
+register_kretprobe
+------------------
+
+::
 
-#include <linux/kprobes.h>
-int register_kretprobe(struct kretprobe *rp);
+	#include <linux/kprobes.h>
+	int register_kretprobe(struct kretprobe *rp);
 
 Establishes a return probe for the function whose address is
 rp->kp.addr. When that function returns, Kprobes calls rp->handler.
@@ -436,14 +483,17 @@ register_kretprobe(); see "How Does a Return Probe Work?" for details.
 register_kretprobe() returns 0 on success, or a negative errno
 otherwise.
 
-User's return-probe handler (rp->handler):
-#include <linux/kprobes.h>
-#include <linux/ptrace.h>
-int kretprobe_handler(struct kretprobe_instance *ri, struct pt_regs *regs);
+User's return-probe handler (rp->handler)::
+
+	#include <linux/kprobes.h>
+	#include <linux/ptrace.h>
+	int kretprobe_handler(struct kretprobe_instance *ri,
+			      struct pt_regs *regs);
 
 regs is as described for kprobe.pre_handler. ri points to the
 kretprobe_instance object, of which the following fields may be
 of interest:
+
 - ret_addr: the return address
 - rp: points to the corresponding kretprobe object
 - task: points to the corresponding task struct
@@ -456,74 +506,94 @@ the architecture's ABI.
456 506
457The handler's return value is currently ignored. 507The handler's return value is currently ignored.
458 508
4594.4 unregister_*probe 509unregister_*probe
510------------------
511
512::
460 513
461#include <linux/kprobes.h> 514 #include <linux/kprobes.h>
462void unregister_kprobe(struct kprobe *kp); 515 void unregister_kprobe(struct kprobe *kp);
463void unregister_jprobe(struct jprobe *jp); 516 void unregister_jprobe(struct jprobe *jp);
464void unregister_kretprobe(struct kretprobe *rp); 517 void unregister_kretprobe(struct kretprobe *rp);
465 518
466Removes the specified probe. The unregister function can be called 519Removes the specified probe. The unregister function can be called
467at any time after the probe has been registered. 520at any time after the probe has been registered.
468 521
469NOTE: 522.. note::
470If the functions find an incorrect probe (ex. an unregistered probe), 523
471they clear the addr field of the probe. 524 If the functions find an incorrect probe (ex. an unregistered probe),
525 they clear the addr field of the probe.
526
527register_*probes
528----------------
472 529
4734.5 register_*probes 530::
474 531
475#include <linux/kprobes.h> 532 #include <linux/kprobes.h>
476int register_kprobes(struct kprobe **kps, int num); 533 int register_kprobes(struct kprobe **kps, int num);
477int register_kretprobes(struct kretprobe **rps, int num); 534 int register_kretprobes(struct kretprobe **rps, int num);
478int register_jprobes(struct jprobe **jps, int num); 535 int register_jprobes(struct jprobe **jps, int num);
479 536
480Registers each of the num probes in the specified array. If any 537Registers each of the num probes in the specified array. If any
481error occurs during registration, all probes in the array, up to 538error occurs during registration, all probes in the array, up to
482the bad probe, are safely unregistered before the register_*probes 539the bad probe, are safely unregistered before the register_*probes
483function returns. 540function returns.
484- kps/rps/jps: an array of pointers to *probe data structures 541
542- kps/rps/jps: an array of pointers to ``*probe`` data structures
485- num: the number of the array entries. 543- num: the number of the array entries.
486 544
487NOTE: 545.. note::
488You have to allocate(or define) an array of pointers and set all 546
489of the array entries before using these functions. 547 You have to allocate(or define) an array of pointers and set all
548 of the array entries before using these functions.
490 549
unregister_*probes
------------------

::

	#include <linux/kprobes.h>
	void unregister_kprobes(struct kprobe **kps, int num);
	void unregister_kretprobes(struct kretprobe **rps, int num);
	void unregister_jprobes(struct jprobe **jps, int num);

Removes each of the num probes in the specified array at once.

.. note::

   If the functions find some incorrect probes (e.g. unregistered
   probes) in the specified array, they clear the addr field of those
   incorrect probes. However, other probes in the array are
   unregistered correctly.

disable_*probe
--------------

::

	#include <linux/kprobes.h>
	int disable_kprobe(struct kprobe *kp);
	int disable_kretprobe(struct kretprobe *rp);
	int disable_jprobe(struct jprobe *jp);

Temporarily disables the specified ``*probe``. You can enable it again by using
enable_*probe(). You must specify the probe which has been registered.

enable_*probe
-------------

::

	#include <linux/kprobes.h>
	int enable_kprobe(struct kprobe *kp);
	int enable_kretprobe(struct kretprobe *rp);
	int enable_jprobe(struct jprobe *jp);

Enables ``*probe`` which has been disabled by disable_*probe(). You must specify
the probe which has been registered.

Kprobes Features and Limitations
================================

Kprobes allows multiple probes at the same address. Currently,
however, there cannot be multiple jprobes on the same function at
@@ -538,7 +608,7 @@ are discussed in this section.

The register_*probe functions will return -EINVAL if you attempt
to install a probe in the code that implements Kprobes (mostly
kernel/kprobes.c and ``arch/*/kernel/kprobes.c``, but also functions such
as do_page_fault and notifier_call_chain).

If you install a probe in an inline-able function, Kprobes makes
@@ -602,19 +672,21 @@ explain it, we introduce some terminology. Imagine a 3-instruction
sequence consisting of two 2-byte instructions and one 3-byte
instruction.

::

	         IA
	          |
	[-2][-1][0][1][2][3][4][5][6][7]
	        [ins1][ins2][  ins3 ]
	       [<-     DCR       ->]
	          [<- JTPR ->]

	ins1: 1st Instruction
	ins2: 2nd Instruction
	ins3: 3rd Instruction
	IA:  Insertion Address
	JTPR: Jump Target Prohibition Region
	DCR: Detoured Code Region

The instructions in DCR are copied to the out-of-line buffer
of the kprobe, because the bytes in DCR are replaced by
@@ -628,7 +700,8 @@ d) DCR must not straddle the border between functions.
Anyway, these limitations are checked by the in-kernel instruction
decoder, so you don't need to worry about that.

Probe Overhead
==============

On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0
microseconds to process. Specifically, a benchmark that hits the same
@@ -638,70 +711,80 @@ return-probe hit typically takes 50-75% longer than a kprobe hit.
When you have a return probe set on a function, adding a kprobe at
the entry to that function adds essentially no overhead.

Here are sample overhead figures (in usec) for different architectures::

	k = kprobe; j = jprobe; r = return probe; kr = kprobe + return probe
	on same function; jr = jprobe + return probe on same function

	i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips
	k = 0.57 usec; j = 1.00; r = 0.92; kr = 0.99; jr = 1.40

	x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips
	k = 0.49 usec; j = 0.76; r = 0.80; kr = 0.82; jr = 1.07

	ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU)
	k = 0.77 usec; j = 1.31; r = 1.26; kr = 1.45; jr = 1.99

Optimized Probe Overhead
------------------------

Typically, an optimized kprobe hit takes 0.07 to 0.1 microseconds to
process. Here are sample overhead figures (in usec) for x86 architectures::

	k = unoptimized kprobe, b = boosted (single-step skipped), o = optimized kprobe,
	r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe.

	i386: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
	k = 0.80 usec; b = 0.33; o = 0.05; r = 1.10; rb = 0.61; ro = 0.33

	x86-64: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
	k = 0.99 usec; b = 0.43; o = 0.06; r = 1.24; rb = 0.68; ro = 0.30

TODO
====

a. SystemTap (http://sourceware.org/systemtap): Provides a simplified
   programming interface for probe-based instrumentation. Try it out.
b. Kernel return probes for sparc64.
c. Support for other architectures.
d. User-space probes.
e. Watchpoint probes (which fire on data references).

Kprobes Example
===============

See samples/kprobes/kprobe_example.c

Jprobes Example
===============

See samples/kprobes/jprobe_example.c

Kretprobes Example
==================

See samples/kprobes/kretprobe_example.c

For additional information on Kprobes, refer to the following URLs:

- http://www-106.ibm.com/developerworks/library/l-kprobes.html?ca=dgr-lnxw42Kprobe
- http://www.redhat.com/magazine/005mar05/features/kprobes/
- http://www-users.cs.umn.edu/~boutcher/kprobes/
- http://www.linuxsymposium.org/2006/linuxsymposium_procv2.pdf (pages 101-115)


The kprobes debugfs interface
=============================

With recent kernels (> 2.6.20) the list of registered kprobes is visible
under the /sys/kernel/debug/kprobes/ directory (assuming debugfs is
mounted at /sys/kernel/debug).

/sys/kernel/debug/kprobes/list: Lists all registered probes on the system::

	c015d71a  k  vfs_read+0x0
	c011a316  j  do_fork+0x0
	c03dedc5  r  tcp_v4_rcv+0x0

The first column provides the kernel address where the probe is inserted.
The second column identifies the type of probe (k - kprobe, r - kretprobe
@@ -725,17 +808,19 @@ change each probe's disabling state. This means that disabled kprobes (marked
[DISABLED]) will not be enabled if you turn ON all kprobes by this knob.


The kprobes sysctl interface
============================

/proc/sys/debug/kprobes-optimization: Turn kprobes optimization ON/OFF.

When CONFIG_OPTPROBES=y, this sysctl interface appears and it provides
a knob to globally and forcibly turn jump optimization (see section
:ref:`kprobes_jump_optimization`) ON or OFF. By default, jump optimization
is allowed (ON). If you echo "0" to this file or set
"debug.kprobes_optimization" to 0 via sysctl, all optimized probes will be
unoptimized, and any new probes registered after that will not be optimized.

Note that this knob *changes* the optimized state. This means that optimized
probes (marked [OPTIMIZED]) will be unoptimized ([OPTIMIZED] tag will be
removed). If the knob is turned on, they will be optimized again.

diff --git a/Documentation/kref.txt b/Documentation/kref.txt
index d26a27ca964d..3af384156d7e 100644
--- a/Documentation/kref.txt
+++ b/Documentation/kref.txt
@@ -1,24 +1,42 @@
===================================================
Adding reference counters (krefs) to kernel objects
===================================================

:Author: Corey Minyard <minyard@acm.org>
:Author: Thomas Hellstrom <thellstrom@vmware.com>

A lot of this was lifted from Greg Kroah-Hartman's 2004 OLS paper and
presentation on krefs, which can be found at:

  - http://www.kroah.com/linux/talks/ols_2004_kref_paper/Reprint-Kroah-Hartman-OLS2004.pdf
  - http://www.kroah.com/linux/talks/ols_2004_kref_talk/

Introduction
============

krefs allow you to add reference counters to your objects. If you
have objects that are used in multiple places and passed around, and
you don't have refcounts, your code is almost certainly broken. If
you want refcounts, krefs are the way to go.

To use a kref, add one to your data structures like::

	struct my_data
	{
		.
		.
		struct kref refcount;
		.
		.
	};

The kref can occur anywhere within the data structure.

Initialization
==============

You must initialize the kref after you allocate it. To do this, call
kref_init as so::

	struct my_data *data;

@@ -29,18 +47,25 @@ kref_init as so:

This sets the refcount in the kref to 1.
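The init/get/put life cycle can be sketched in plain userspace C. The block below is a hedged analogue, not the kernel implementation (the real kref lives in include/linux/kref.h and is built on refcount_t); ``ukref``, ``demo()`` and the release callback are illustrative names invented for this sketch:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Userspace stand-in for struct kref: just an atomic counter. */
struct ukref { atomic_int refcount; };

static void ukref_init(struct ukref *k) { atomic_store(&k->refcount, 1); }
static void ukref_get(struct ukref *k)  { atomic_fetch_add(&k->refcount, 1); }

/* Returns 1 if this put dropped the last reference (and ran release). */
static int ukref_put(struct ukref *k, void (*release)(struct ukref *))
{
	if (atomic_fetch_sub(&k->refcount, 1) == 1) {
		release(k);
		return 1;
	}
	return 0;
}

struct my_data { struct ukref refcount; int payload; };

static int released;

static void data_release(struct ukref *ref)
{
	/* The kernel would use container_of(); here the kref is the
	 * first member, so a cast is enough for the sketch. */
	free((struct my_data *)ref);
	released = 1;
}

int demo(void)
{
	struct my_data *data = malloc(sizeof(*data));

	ukref_init(&data->refcount);              /* refcount == 1 */
	ukref_get(&data->refcount);               /* second holder: == 2 */
	ukref_put(&data->refcount, data_release); /* back to 1, no release */
	ukref_put(&data->refcount, data_release); /* last put: release runs */
	return released;
}
```

The same shape carries over to the kernel API: one kref_init() per object, one kref_get() per extra holder, and exactly one kref_put() per holder.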

Kref rules
==========

Once you have an initialized kref, you must follow these rules:

1) If you make a non-temporary copy of a pointer, especially if
   it can be passed to another thread of execution, you must
   increment the refcount with kref_get() before passing it off::

	kref_get(&data->refcount);

   If you already have a valid pointer to a kref-ed structure (the
   refcount cannot go to zero) you may do this without a lock.

2) When you are done with a pointer, you must call kref_put()::

	kref_put(&data->refcount, data_release);

   If this is the last reference to the pointer, the release
   routine will be called. If the code never tries to get
   a valid pointer to a kref-ed structure without already
@@ -53,25 +78,25 @@ rules:
   structure must remain valid during the kref_get().

For example, if you allocate some data and then pass it to another
thread to process::

	void data_release(struct kref *ref)
	{
		struct my_data *data = container_of(ref, struct my_data, refcount);
		kfree(data);
	}

	void more_data_handling(void *cb_data)
	{
		struct my_data *data = cb_data;
		.
		. do stuff with data here
		.
		kref_put(&data->refcount, data_release);
	}

	int my_data_handler(void)
	{
		int rv = 0;
		struct my_data *data;
		struct task_struct *task;
@@ -91,10 +116,10 @@ int my_data_handler(void)
		.
		. do stuff with data here
		.
	out:
		kref_put(&data->refcount, data_release);
		return rv;
	}

This way, it doesn't matter what order the two threads handle the
data, the kref_put() handles knowing when the data is not referenced
@@ -104,7 +129,7 @@ put needs no lock because nothing tries to get the data without
already holding a pointer.

Note that the "before" in rule 1 is very important. You should never
do something like::

	task = kthread_run(more_data_handling, data, "more_data_handling");
	if (task == ERR_PTR(-ENOMEM)) {
@@ -124,14 +149,14 @@ bad style. Don't do it.
There are some situations where you can optimize the gets and puts.
For instance, if you are done with an object and enqueuing it for
something else or passing it off to something else, there is no reason
to do a get then a put::

	/* Silly extra get and put */
	kref_get(&obj->ref);
	enqueue(obj);
	kref_put(&obj->ref, obj_cleanup);

Just do the enqueue. A comment about this is always welcome::

	enqueue(obj);
	/* We are done with obj, so we pass our refcount off
@@ -142,109 +167,99 @@ instance, you have a list of items that are each kref-ed, and you wish
to get the first one. You can't just pull the first item off the list
and kref_get() it. That violates rule 3 because you are not already
holding a valid pointer. You must add a mutex (or some other lock).
For instance::

	static DEFINE_MUTEX(mutex);
	static LIST_HEAD(q);
	struct my_data
	{
		struct kref refcount;
		struct list_head link;
	};

	static struct my_data *get_entry()
	{
		struct my_data *entry = NULL;
		mutex_lock(&mutex);
		if (!list_empty(&q)) {
			entry = container_of(q.next, struct my_data, link);
			kref_get(&entry->refcount);
		}
		mutex_unlock(&mutex);
		return entry;
	}

	static void release_entry(struct kref *ref)
	{
		struct my_data *entry = container_of(ref, struct my_data, refcount);

		list_del(&entry->link);
		kfree(entry);
	}

	static void put_entry(struct my_data *entry)
	{
		mutex_lock(&mutex);
		kref_put(&entry->refcount, release_entry);
		mutex_unlock(&mutex);
	}

The kref_put() return value is useful if you do not want to hold the
lock during the whole release operation. Say you didn't want to call
kfree() with the lock held in the example above (since it is kind of
pointless to do so). You could use kref_put() as follows::

	static void release_entry(struct kref *ref)
	{
		/* All work is done after the return from kref_put(). */
	}

	static void put_entry(struct my_data *entry)
	{
		mutex_lock(&mutex);
		if (kref_put(&entry->refcount, release_entry)) {
			list_del(&entry->link);
			mutex_unlock(&mutex);
			kfree(entry);
		} else
			mutex_unlock(&mutex);
	}

This is really more useful if you have to call other routines as part
of the free operations that could take a long time or might claim the
same lock. Note that doing everything in the release routine is still
preferred as it is a little neater.
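The shape of this pattern can be checked in plain userspace C. The sketch below is an analogue, not kernel code: the invented ``take_lock()``/``drop_lock()`` helpers stand in for mutex_lock()/mutex_unlock(), a plain counter stands in for the kref, and the assertion inside ``free_entry()`` demonstrates the point of the pattern, namely that the final free runs with the lock already dropped:

```c
#include <assert.h>
#include <stdlib.h>

struct entry {
	int refcount;           /* stand-in for struct kref */
	struct entry *next;     /* toy singly linked list link */
};

static int lock_held;
static void take_lock(void) { lock_held = 1; }  /* stand-in for mutex_lock() */
static void drop_lock(void) { lock_held = 0; }  /* stand-in for mutex_unlock() */

static struct entry *head;
static int frees;

static void free_entry(struct entry *e)
{
	assert(!lock_held);     /* the pattern guarantees we free unlocked */
	free(e);
	frees++;
}

/* Mirrors put_entry() above: unlink under the lock, free outside it. */
static void put_entry(struct entry *e)
{
	take_lock();
	if (--e->refcount == 0) {
		head = e->next;         /* unlink while still locked... */
		drop_lock();
		free_entry(e);          /* ...then free without the lock */
	} else {
		drop_lock();
	}
}

int demo(void)
{
	struct entry *e = calloc(1, sizeof(*e));

	e->refcount = 2;        /* two holders of the pointer */
	head = e;
	put_entry(e);           /* first put: still referenced, nothing freed */
	put_entry(e);           /* last put: unlink, then free unlocked */
	return frees;
}
```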

The above example could also be optimized using kref_get_unless_zero() in
the following way::

	static struct my_data *get_entry()
	{
		struct my_data *entry = NULL;
		mutex_lock(&mutex);
		if (!list_empty(&q)) {
			entry = container_of(q.next, struct my_data, link);
			if (!kref_get_unless_zero(&entry->refcount))
				entry = NULL;
		}
		mutex_unlock(&mutex);
		return entry;
	}

	static void release_entry(struct kref *ref)
	{
		struct my_data *entry = container_of(ref, struct my_data, refcount);

		mutex_lock(&mutex);
		list_del(&entry->link);
		mutex_unlock(&mutex);
		kfree(entry);
	}

	static void put_entry(struct my_data *entry)
	{
		kref_put(&entry->refcount, release_entry);
	}

This is useful to remove the mutex lock around kref_put() in put_entry(), but
it's important that kref_get_unless_zero is enclosed in the same critical
@@ -254,51 +269,51 @@ Note that it is illegal to use kref_get_unless_zero without checking its
return value. If you are sure (by already having a valid pointer) that
kref_get_unless_zero() will return true, then use kref_get() instead.

Krefs and RCU
=============

The function kref_get_unless_zero also makes it possible to use rcu
locking for lookups in the above example::

	struct my_data
	{
		struct rcu_head rhead;
		.
		struct kref refcount;
		.
		.
	};

	static struct my_data *get_entry_rcu()
	{
		struct my_data *entry = NULL;
		rcu_read_lock();
		if (!list_empty(&q)) {
			entry = container_of(q.next, struct my_data, link);
			if (!kref_get_unless_zero(&entry->refcount))
				entry = NULL;
		}
		rcu_read_unlock();
		return entry;
	}

	static void release_entry_rcu(struct kref *ref)
	{
		struct my_data *entry = container_of(ref, struct my_data, refcount);

		mutex_lock(&mutex);
		list_del_rcu(&entry->link);
		mutex_unlock(&mutex);
		kfree_rcu(entry, rhead);
	}

	static void put_entry(struct my_data *entry)
	{
		kref_put(&entry->refcount, release_entry_rcu);
	}

But note that the struct kref member needs to remain in valid memory for an
RCU grace period after release_entry_rcu was called. That can be accomplished
by using kfree_rcu(entry, rhead) as done above, or by calling synchronize_rcu()
before using kfree, but note that synchronize_rcu() may sleep for a
substantial amount of time.
diff --git a/Documentation/ldm.txt b/Documentation/ldm.txt
index 4f80edd14d0a..12c571368e73 100644
--- a/Documentation/ldm.txt
+++ b/Documentation/ldm.txt
@@ -1,9 +1,9 @@
==========================================
LDM - Logical Disk Manager (Dynamic Disks)
==========================================

:Author: Originally Written by FlatCap - Richard Russon <ldm@flatcap.org>.
:Last Updated: Anton Altaparmakov on 30 March 2007 for Windows Vista.

Overview
--------
@@ -37,24 +37,36 @@ Example
-------

Below we have a 50MiB disk, divided into seven partitions.

.. note::

   The missing 1MiB at the end of the disk is where the LDM database is
   stored.

+-------++--------------+---------+-----++--------------+---------+----+
|Device || Offset Bytes | Sectors | MiB || Size Bytes   | Sectors | MiB|
+=======++==============+=========+=====++==============+=========+====+
|hda    ||            0 |       0 |   0 ||     52428800 |  102400 |  50|
+-------++--------------+---------+-----++--------------+---------+----+
|hda1   ||     51380224 |  100352 |  49 ||      1048576 |    2048 |   1|
+-------++--------------+---------+-----++--------------+---------+----+
|hda2   ||        16384 |      32 |   0 ||      6979584 |   13632 |   6|
+-------++--------------+---------+-----++--------------+---------+----+
|hda3   ||      6995968 |   13664 |   6 ||     10485760 |   20480 |  10|
+-------++--------------+---------+-----++--------------+---------+----+
|hda4   ||     17481728 |   34144 |  16 ||      4194304 |    8192 |   4|
+-------++--------------+---------+-----++--------------+---------+----+
|hda5   ||     21676032 |   42336 |  20 ||      5242880 |   10240 |   5|
+-------++--------------+---------+-----++--------------+---------+----+
|hda6   ||     26918912 |   52576 |  25 ||     10485760 |   20480 |  10|
+-------++--------------+---------+-----++--------------+---------+----+
|hda7   ||     37404672 |   73056 |  35 ||     13959168 |   27264 |  13|
+-------++--------------+---------+-----++--------------+---------+----+

The LDM Database may not store the partitions in the order that they appear on
disk, but the driver will sort them.

When Linux boots, you will see something like::

	hda: 102400 sectors w/32KiB Cache, CHS=50/64/32
	hda: [LDM] hda1 hda2 hda3 hda4 hda5 hda6 hda7
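The byte and sector columns in the table above are related by the standard 512-byte sector size, which the table itself implies (52428800 bytes / 102400 sectors = 512). A quick check of that arithmetic, with an illustrative helper name:

```c
#include <stdint.h>

/* 512 bytes per sector, as implied by the table:
 * 52428800 bytes / 102400 sectors = 512. */
#define SECTOR_SIZE 512

static uint64_t sectors_to_bytes(uint64_t sectors)
{
	return sectors * SECTOR_SIZE;
}
```

For example, hda1's offset of 100352 sectors works out to 100352 * 512 = 51380224 bytes, matching the table.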
@@ -65,13 +77,13 @@ Compiling LDM Support

To enable LDM, choose the following two options:

  - "Advanced partition selection" CONFIG_PARTITION_ADVANCED
  - "Windows Logical Disk Manager (Dynamic Disk) support" CONFIG_LDM_PARTITION

If you believe the driver isn't working as it should, you can enable the extra
debugging code. This will produce a LOT of output. The option is:

  - "Windows LDM extra logging" CONFIG_LDM_DEBUG

N.B. The partition code cannot be compiled as a module.

diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
index c8b8378513d6..290840c160af 100644
--- a/Documentation/lockup-watchdogs.txt
+++ b/Documentation/lockup-watchdogs.txt
@@ -30,7 +30,8 @@ timeout is set through the confusingly named "kernel.panic" sysctl),
30to cause the system to reboot automatically after a specified amount 30to cause the system to reboot automatically after a specified amount
31of time. 31of time.
32 32
33=== Implementation === 33Implementation
34==============
34 35
35The soft and hard lockup detectors are built on top of the hrtimer and 36The soft and hard lockup detectors are built on top of the hrtimer and
36perf subsystems, respectively. A direct consequence of this is that, 37perf subsystems, respectively. A direct consequence of this is that,
diff --git a/Documentation/lzo.txt b/Documentation/lzo.txt
index 285c54f66779..6fa6a93d0949 100644
--- a/Documentation/lzo.txt
+++ b/Documentation/lzo.txt
@@ -1,8 +1,9 @@
1 1===========================================================
2LZO stream format as understood by Linux's LZO decompressor 2LZO stream format as understood by Linux's LZO decompressor
3=========================================================== 3===========================================================
4 4
5Introduction 5Introduction
6============
6 7
7 This is not a specification. No specification seems to be publicly available 8 This is not a specification. No specification seems to be publicly available
8 for the LZO stream format. This document describes what input format the LZO 9 for the LZO stream format. This document describes what input format the LZO
@@ -14,12 +15,13 @@ Introduction
14 for future bug reports. 15 for future bug reports.
15 16
16Description 17Description
18===========
17 19
18 The stream is composed of a series of instructions, operands, and data. The 20 The stream is composed of a series of instructions, operands, and data. The
19 instructions consist of a few bits representing an opcode, and bits forming 21 instructions consist of a few bits representing an opcode, and bits forming
20 the operands for the instruction, whose size and position depend on the 22 the operands for the instruction, whose size and position depend on the
21 opcode and on the number of literals copied by the previous instruction. The 23 opcode and on the number of literals copied by the previous instruction. The
22 operands are used to indicate : 24 operands are used to indicate:
23 25
24 - a distance when copying data from the dictionary (past output buffer) 26 - a distance when copying data from the dictionary (past output buffer)
25 - a length (number of bytes to copy from dictionary) 27 - a length (number of bytes to copy from dictionary)
@@ -38,7 +40,7 @@ Description
38 of bits in the operand. If the number of bits isn't enough to represent the 40 of bits in the operand. If the number of bits isn't enough to represent the
39 length, up to 255 may be added in increments by consuming more bytes with a 41 length, up to 255 may be added in increments by consuming more bytes with a
40 rate of at most 255 per extra byte (thus the compression ratio cannot exceed 42 rate of at most 255 per extra byte (thus the compression ratio cannot exceed
41 around 255:1). The variable length encoding using #bits is always the same : 43 around 255:1). The variable length encoding using #bits is always the same::
42 44
43 length = byte & ((1 << #bits) - 1) 45 length = byte & ((1 << #bits) - 1)
44 if (!length) { 46 if (!length) {
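The variable-length scheme shown in the hunk above can be sketched as a small C helper (a hypothetical function for illustration, not taken from the patch or from the kernel's lzo1x decompressor):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the LZO variable-length decoding described above.
 * `insn` is the instruction byte whose low `nbits` bits hold the length,
 * `in` walks the bytes that follow it, and `base` stands for the
 * instruction-dependent constant added at the end.
 */
static size_t lzo_decode_len(unsigned char insn, const unsigned char **in,
			     unsigned nbits, size_t base)
{
	size_t mask = (1u << nbits) - 1;
	size_t len = insn & mask;

	if (len == 0) {			/* low bits all zero: extended length */
		len = mask;
		while (**in == 0) {	/* each zero byte adds 255 */
			len += 255;
			(*in)++;
		}
		len += *(*in)++;	/* final, non-zero byte */
	}
	return len + base;
}
```

The 255-per-extra-byte growth in the loop is what bounds the compression ratio to roughly 255:1, as noted above.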
@@ -67,15 +69,19 @@ Description
67 instruction may encode this distance (0001HLLL), it takes one LE16 operand 69 instruction may encode this distance (0001HLLL), it takes one LE16 operand
68 for the distance, thus requiring 3 bytes. 70 for the distance, thus requiring 3 bytes.
69 71
70 IMPORTANT NOTE : in the code some length checks are missing because certain 72 .. important::
71 instructions are called under the assumption that a certain number of bytes 73
72 follow because it has already been guaranteed before parsing the instructions. 74 In the code some length checks are missing because certain instructions
73 They just have to "refill" this credit if they consume extra bytes. This is 75 are called under the assumption that a certain number of bytes follow
74 an implementation design choice independent on the algorithm or encoding. 76 because it has already been guaranteed before parsing the instructions.
77 They just have to "refill" this credit if they consume extra bytes. This
78 is an implementation design choice independent of the algorithm or
79 encoding.
75 80
76Byte sequences 81Byte sequences
82==============
77 83
78 First byte encoding : 84 First byte encoding::
79 85
80 0..17 : follow regular instruction encoding, see below. It is worth 86 0..17 : follow regular instruction encoding, see below. It is worth
81 noting that codes 16 and 17 will represent a block copy from 87 noting that codes 16 and 17 will represent a block copy from
@@ -91,7 +97,7 @@ Byte sequences
91 state = 4 [ don't copy extra literals ] 97 state = 4 [ don't copy extra literals ]
92 skip byte 98 skip byte
93 99
94 Instruction encoding : 100 Instruction encoding::
95 101
96 0 0 0 0 X X X X (0..15) 102 0 0 0 0 X X X X (0..15)
97 Depends on the number of literals copied by the last instruction. 103 Depends on the number of literals copied by the last instruction.
@@ -156,6 +162,7 @@ Byte sequences
156 distance = (H << 3) + D + 1 162 distance = (H << 3) + D + 1
157 163
158Authors 164Authors
165=======
159 166
160 This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an 167 This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an
161 analysis of the decompression code available in Linux 3.16-rc5. The code is 168 analysis of the decompression code available in Linux 3.16-rc5. The code is
diff --git a/Documentation/mailbox.txt b/Documentation/mailbox.txt
index 7ed371c85204..0ed95009cc30 100644
--- a/Documentation/mailbox.txt
+++ b/Documentation/mailbox.txt
@@ -1,7 +1,10 @@
1 The Common Mailbox Framework 1============================
2 Jassi Brar <jaswinder.singh@linaro.org> 2The Common Mailbox Framework
3============================
3 4
4 This document aims to help developers write client and controller 5:Author: Jassi Brar <jaswinder.singh@linaro.org>
6
7This document aims to help developers write client and controller
5drivers for the API. But before we start, let us note that the 8drivers for the API. But before we start, let us note that the
6client (especially) and controller drivers are likely going to be 9client (especially) and controller drivers are likely going to be
7very platform specific because the remote firmware is likely to be 10very platform specific because the remote firmware is likely to be
@@ -13,14 +16,17 @@ similar copies of code written for each platform. Having said that,
13nothing prevents the remote f/w from also being Linux based and using the 16nothing prevents the remote f/w from also being Linux based and using the
14same API there. However, none of that helps us locally because we only 17same API there. However, none of that helps us locally because we only
15ever deal at the client's protocol level. 18ever deal at the client's protocol level.
16 Some of the choices made during implementation are the result of this 19
20Some of the choices made during implementation are the result of this
17peculiarity of the "common" framework. 21peculiarity of the "common" framework.
18 22
19 23
20 24
21 Part 1 - Controller Driver (See include/linux/mailbox_controller.h) 25Controller Driver (See include/linux/mailbox_controller.h)
26==========================================================
27
22 28
23 Allocate mbox_controller and the array of mbox_chan. 29Allocate mbox_controller and the array of mbox_chan.
24Populate mbox_chan_ops; all are mandatory except peek_data(). 30Populate mbox_chan_ops; all are mandatory except peek_data().
25The controller driver might know a message has been consumed 31The controller driver might know a message has been consumed
26by the remote by getting an IRQ or polling some hardware flag 32by the remote by getting an IRQ or polling some hardware flag
@@ -30,91 +36,94 @@ the controller driver should set via 'txdone_irq' or 'txdone_poll'
30or neither. 36or neither.
31 37
32 38
33 Part 2 - Client Driver (See include/linux/mailbox_client.h) 39Client Driver (See include/linux/mailbox_client.h)
40==================================================
34 41
35 The client might want to operate in blocking mode (synchronously 42
43The client might want to operate in blocking mode (synchronously
36send a message through before returning) or non-blocking/async mode (submit 44send a message through before returning) or non-blocking/async mode (submit
37a message and a callback function to the API and return immediately). 45a message and a callback function to the API and return immediately).
38 46
39 47::
40struct demo_client { 48
41 struct mbox_client cl; 49 struct demo_client {
42 struct mbox_chan *mbox; 50 struct mbox_client cl;
43 struct completion c; 51 struct mbox_chan *mbox;
44 bool async; 52 struct completion c;
45 /* ... */ 53 bool async;
46}; 54 /* ... */
47 55 };
48/* 56
49 * This is the handler for data received from remote. The behaviour is purely 57 /*
50 * dependent upon the protocol. This is just an example. 58 * This is the handler for data received from remote. The behaviour is purely
51 */ 59 * dependent upon the protocol. This is just an example.
52static void message_from_remote(struct mbox_client *cl, void *mssg) 60 */
53{ 61 static void message_from_remote(struct mbox_client *cl, void *mssg)
54 struct demo_client *dc = container_of(cl, struct demo_client, cl); 62 {
55 if (dc->async) { 63 struct demo_client *dc = container_of(cl, struct demo_client, cl);
56 if (is_an_ack(mssg)) { 64 if (dc->async) {
57 /* An ACK to our last sample sent */ 65 if (is_an_ack(mssg)) {
58 return; /* Or do something else here */ 66 /* An ACK to our last sample sent */
59 } else { /* A new message from remote */ 67 return; /* Or do something else here */
60 queue_req(mssg); 68 } else { /* A new message from remote */
69 queue_req(mssg);
70 }
71 } else {
72 /* Remote f/w sends only ACK packets on this channel */
73 return;
61 } 74 }
62 } else {
63 /* Remote f/w sends only ACK packets on this channel */
64 return;
65 } 75 }
66} 76
67 77 static void sample_sent(struct mbox_client *cl, void *mssg, int r)
68static void sample_sent(struct mbox_client *cl, void *mssg, int r) 78 {
69{ 79 struct demo_client *dc = container_of(cl, struct demo_client, cl);
70 struct demo_client *dc = container_of(cl, struct demo_client, cl); 80 complete(&dc->c);
71 complete(&dc->c); 81 }
72} 82
73 83 static void client_demo(struct platform_device *pdev)
74static void client_demo(struct platform_device *pdev) 84 {
75{ 85 struct demo_client *dc_sync, *dc_async;
76 struct demo_client *dc_sync, *dc_async; 86 /* The controller already knows async_pkt and sync_pkt */
77 /* The controller already knows async_pkt and sync_pkt */ 87 struct async_pkt ap;
78 struct async_pkt ap; 88 struct sync_pkt sp;
79 struct sync_pkt sp; 89
80 90 dc_sync = kzalloc(sizeof(*dc_sync), GFP_KERNEL);
81 dc_sync = kzalloc(sizeof(*dc_sync), GFP_KERNEL); 91 dc_async = kzalloc(sizeof(*dc_async), GFP_KERNEL);
82 dc_async = kzalloc(sizeof(*dc_async), GFP_KERNEL); 92
83 93 /* Populate non-blocking mode client */
84 /* Populate non-blocking mode client */ 94 dc_async->cl.dev = &pdev->dev;
85 dc_async->cl.dev = &pdev->dev; 95 dc_async->cl.rx_callback = message_from_remote;
86 dc_async->cl.rx_callback = message_from_remote; 96 dc_async->cl.tx_done = sample_sent;
87 dc_async->cl.tx_done = sample_sent; 97 dc_async->cl.tx_block = false;
88 dc_async->cl.tx_block = false; 98 dc_async->cl.tx_tout = 0; /* doesn't matter here */
89 dc_async->cl.tx_tout = 0; /* doesn't matter here */ 99 dc_async->cl.knows_txdone = false; /* depending upon protocol */
90 dc_async->cl.knows_txdone = false; /* depending upon protocol */ 100 dc_async->async = true;
91 dc_async->async = true; 101 init_completion(&dc_async->c);
92 init_completion(&dc_async->c); 102
93 103 /* Populate blocking mode client */
94 /* Populate blocking mode client */ 104 dc_sync->cl.dev = &pdev->dev;
95 dc_sync->cl.dev = &pdev->dev; 105 dc_sync->cl.rx_callback = message_from_remote;
96 dc_sync->cl.rx_callback = message_from_remote; 106 dc_sync->cl.tx_done = NULL; /* operate in blocking mode */
97 dc_sync->cl.tx_done = NULL; /* operate in blocking mode */ 107 dc_sync->cl.tx_block = true;
98 dc_sync->cl.tx_block = true; 108 dc_sync->cl.tx_tout = 500; /* by half a second */
99 dc_sync->cl.tx_tout = 500; /* by half a second */ 109 dc_sync->cl.knows_txdone = false; /* depending upon protocol */
100 dc_sync->cl.knows_txdone = false; /* depending upon protocol */ 110 dc_sync->async = false;
101 dc_sync->async = false; 111
102 112 /* ASync mailbox is listed second in 'mboxes' property */
103 /* ASync mailbox is listed second in 'mboxes' property */ 113 dc_async->mbox = mbox_request_channel(&dc_async->cl, 1);
104 dc_async->mbox = mbox_request_channel(&dc_async->cl, 1); 114 /* Populate data packet */
105 /* Populate data packet */ 115 /* ap.xxx = 123; etc */
106 /* ap.xxx = 123; etc */ 116 /* Send async message to remote */
107 /* Send async message to remote */ 117 mbox_send_message(dc_async->mbox, &ap);
108 mbox_send_message(dc_async->mbox, &ap); 118
109 119 /* Sync mailbox is listed first in 'mboxes' property */
110 /* Sync mailbox is listed first in 'mboxes' property */ 120 dc_sync->mbox = mbox_request_channel(&dc_sync->cl, 0);
111 dc_sync->mbox = mbox_request_channel(&dc_sync->cl, 0); 121 /* Populate data packet */
112 /* Populate data packet */ 122 /* sp.abc = 123; etc */
113 /* sp.abc = 123; etc */ 123 /* Send message to remote in blocking mode */
114 /* Send message to remote in blocking mode */ 124 mbox_send_message(dc_sync->mbox, &sp);
115 mbox_send_message(dc_sync->mbox, &sp); 125 /* At this point 'sp' has been sent */
116 /* At this point 'sp' has been sent */ 126
117 127 /* Now wait for async chan to be done */
118 /* Now wait for async chan to be done */ 128 wait_for_completion(&dc_async->c);
119 wait_for_completion(&dc_async->c); 129 }
120}
diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 5c628e19d6cd..7f49ebf3ddb2 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -2,43 +2,48 @@
2Memory Hotplug 2Memory Hotplug
3============== 3==============
4 4
5Created: Jul 28 2007 5:Created: Jul 28 2007
6Add description of notifier of memory hotplug Oct 11 2007 6:Updated: Add description of notifier of memory hotplug: Oct 11 2007
7 7
8This document is about memory hotplug, including how to use it and its 8This document is about memory hotplug, including how to use it and its
9current status. Because Memory Hotplug is still under development, the 9current status. Because Memory Hotplug is still under development, the
10contents of this text will change often. 10contents of this text will change often.
11 11
121. Introduction 12.. CONTENTS
13 1.1 purpose of memory hotplug
14 1.2. Phases of memory hotplug
15 1.3. Unit of Memory online/offline operation
162. Kernel Configuration
173. sysfs files for memory hotplug
184. Physical memory hot-add phase
19 4.1 Hardware(Firmware) Support
20 4.2 Notify memory hot-add event by hand
215. Logical Memory hot-add phase
22 5.1. State of memory
23 5.2. How to online memory
246. Logical memory remove
25 6.1 Memory offline and ZONE_MOVABLE
26 6.2. How to offline memory
277. Physical memory remove
288. Memory hotplug event notifier
299. Future Work List
30
31Note(1): x86_64's has special implementation for memory hotplug.
32 This text does not describe it.
33Note(2): This text assumes that sysfs is mounted at /sys.
34 13
14 1. Introduction
15 1.1 purpose of memory hotplug
16 1.2. Phases of memory hotplug
17 1.3. Unit of Memory online/offline operation
18 2. Kernel Configuration
19 3. sysfs files for memory hotplug
20 4. Physical memory hot-add phase
21 4.1 Hardware(Firmware) Support
22 4.2 Notify memory hot-add event by hand
23 5. Logical Memory hot-add phase
24 5.1. State of memory
25 5.2. How to online memory
26 6. Logical memory remove
27 6.1 Memory offline and ZONE_MOVABLE
28 6.2. How to offline memory
29 7. Physical memory remove
30 8. Memory hotplug event notifier
31 9. Future Work List
35 32
36---------------
371. Introduction
38---------------
39 33
401.1 purpose of memory hotplug 34.. note::
41------------ 35
36 (1) x86_64 has a special implementation for memory hotplug.
37 This text does not describe it.
38 (2) This text assumes that sysfs is mounted at /sys.
39
40
41Introduction
42============
43
44purpose of memory hotplug
45-------------------------
46
42Memory Hotplug allows users to increase/decrease the amount of memory. 47Memory Hotplug allows users to increase/decrease the amount of memory.
43Generally, there are two purposes. 48Generally, there are two purposes.
44 49
@@ -53,9 +58,11 @@ hardware which supports memory power management.
53Linux memory hotplug is designed for both purposes. 58Linux memory hotplug is designed for both purposes.
54 59
55 60
561.2. Phases of memory hotplug 61Phases of memory hotplug
57--------------- 62------------------------
58There are 2 phases in Memory Hotplug. 63
64There are 2 phases in Memory Hotplug:
65
59 1) Physical Memory Hotplug phase 66 1) Physical Memory Hotplug phase
60 2) Logical Memory Hotplug phase. 67 2) Logical Memory Hotplug phase.
61 68
@@ -70,7 +77,7 @@ management tables, and makes sysfs files for new memory's operation.
70If firmware supports notification of connection of new memory to OS, 77If firmware supports notification of connection of new memory to OS,
71this phase is triggered automatically. ACPI can notify this event. If not, 78this phase is triggered automatically. ACPI can notify this event. If not,
72"probe" operation by system administration is used instead. 79"probe" operation by system administration is used instead.
73(see Section 4.). 80(see :ref:`memory_hotplug_physical_mem`).
74 81
75Logical Memory Hotplug phase is to change memory state into 82Logical Memory Hotplug phase is to change memory state into
76available/unavailable for users. Amount of memory from user's view is 83available/unavailable for users. Amount of memory from user's view is
@@ -83,11 +90,12 @@ Logical Memory Hotplug phase is triggered by write of sysfs file by system
83administrator. For the hot-add case, it must be executed after Physical Hotplug 90administrator. For the hot-add case, it must be executed after Physical Hotplug
84phase by hand. 91phase by hand.
85(However, if you write udev's hotplug scripts for memory hotplug, these 92(However, if you write udev's hotplug scripts for memory hotplug, these
86 phases can be executed in a seamless way.) 93phases can be executed in a seamless way.)
94
87 95
96Unit of Memory online/offline operation
97---------------------------------------
88 98
891.3. Unit of Memory online/offline operation
90------------
91Memory hotplug uses SPARSEMEM memory model which allows memory to be divided 99Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
92into chunks of the same size. These chunks are called "sections". The size of 100into chunks of the same size. These chunks are called "sections". The size of
93a memory section is architecture dependent. For example, power uses 16MiB, ia64 101a memory section is architecture dependent. For example, power uses 16MiB, ia64
@@ -97,46 +105,50 @@ Memory sections are combined into chunks referred to as "memory blocks". The
97size of a memory block is architecture dependent and represents the logical 105size of a memory block is architecture dependent and represents the logical
98unit upon which memory online/offline operations are to be performed. The 106unit upon which memory online/offline operations are to be performed. The
99default size of a memory block is the same as memory section size unless an 107default size of a memory block is the same as memory section size unless an
100architecture specifies otherwise. (see Section 3.) 108architecture specifies otherwise. (see :ref:`memory_hotplug_sysfs_files`.)
101 109
102To determine the size (in bytes) of a memory block please read this file: 110To determine the size (in bytes) of a memory block please read this file:
103 111
104/sys/devices/system/memory/block_size_bytes 112/sys/devices/system/memory/block_size_bytes
105 113
106 114
107----------------------- 115Kernel Configuration
1082. Kernel Configuration 116====================
109----------------------- 117
110To use the memory hotplug feature, the kernel must be compiled with the 118To use the memory hotplug feature, the kernel must be compiled with the
111following config options. 119following config options.
112 120
113- For all memory hotplug 121- For all memory hotplug:
114 Memory model -> Sparse Memory (CONFIG_SPARSEMEM) 122 - Memory model -> Sparse Memory (CONFIG_SPARSEMEM)
115 Allow for memory hot-add (CONFIG_MEMORY_HOTPLUG) 123 - Allow for memory hot-add (CONFIG_MEMORY_HOTPLUG)
116 124
117- To enable memory removal, the following are also necessary 125- To enable memory removal, the following are also necessary:
118 Allow for memory hot remove (CONFIG_MEMORY_HOTREMOVE) 126 - Allow for memory hot remove (CONFIG_MEMORY_HOTREMOVE)
119 Page Migration (CONFIG_MIGRATION) 127 - Page Migration (CONFIG_MIGRATION)
120 128
121- For ACPI memory hotplug, the following are also necessary 129- For ACPI memory hotplug, the following are also necessary:
122 Memory hotplug (under ACPI Support menu) (CONFIG_ACPI_HOTPLUG_MEMORY) 130 - Memory hotplug (under ACPI Support menu) (CONFIG_ACPI_HOTPLUG_MEMORY)
123 This option can be kernel module. 131 - This option can be kernel module.
124 132
125- As a related configuration, if your box supports NUMA-node hotplug via 133- As a related configuration, if your box supports NUMA-node hotplug via
126 ACPI, then this option is necessary too. 134 ACPI, then this option is necessary too.
127 ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
128 (CONFIG_ACPI_CONTAINER).
129 This option can be kernel module too.
130 135
136 - ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
137 (CONFIG_ACPI_CONTAINER).
138
139 This option can be kernel module too.
140
141
142.. _memory_hotplug_sysfs_files:
143
144sysfs files for memory hotplug
145==============================
131 146
132--------------------------------
1333 sysfs files for memory hotplug
134--------------------------------
135All memory blocks have their device information in sysfs. Each memory block 147All memory blocks have their device information in sysfs. Each memory block
136is described under /sys/devices/system/memory as 148is described under /sys/devices/system/memory as:
137 149
138/sys/devices/system/memory/memoryXXX 150 /sys/devices/system/memory/memoryXXX
139(XXX is the memory block id.) 151 (XXX is the memory block id.)
140 152
141For the memory block covered by the sysfs directory, it is expected that all 153memory sections in this range are present and no memory holes exist in the
142memory sections in this range are present and no memory holes exist in the 154memory sections in this range are present and no memory holes exist in the
@@ -145,43 +157,53 @@ the existence of one should not affect the hotplug capabilities of the memory
145block. 157block.
146 158
147For example, assume 1GiB memory block size. A device for a memory starting at 159For example, assume 1GiB memory block size. A device for a memory starting at
1480x100000000 is /sys/devices/system/memory/memory4 1600x100000000 is /sys/devices/system/memory/memory4::
149(0x100000000 / 1Gib = 4) 161
 162	(0x100000000 / 1GiB = 4)
163
150This device covers address range [0x100000000 ... 0x140000000) 164This device covers address range [0x100000000 ... 0x140000000)
151 165
152Under each memory block, you can see 5 files: 166Under each memory block, you can see 5 files:
153 167
154/sys/devices/system/memory/memoryXXX/phys_index 168- /sys/devices/system/memory/memoryXXX/phys_index
155/sys/devices/system/memory/memoryXXX/phys_device 169- /sys/devices/system/memory/memoryXXX/phys_device
156/sys/devices/system/memory/memoryXXX/state 170- /sys/devices/system/memory/memoryXXX/state
157/sys/devices/system/memory/memoryXXX/removable 171- /sys/devices/system/memory/memoryXXX/removable
158/sys/devices/system/memory/memoryXXX/valid_zones 172- /sys/devices/system/memory/memoryXXX/valid_zones
173
174=================== ============================================================
175``phys_index`` read-only and contains memory block id, same as XXX.
176``state`` read-write
177
178 - at read: contains online/offline state of memory.
179 - at write: user can specify "online_kernel",
159 180
160'phys_index' : read-only and contains memory block id, same as XXX.
161'state' : read-write
162 at read: contains online/offline state of memory.
163 at write: user can specify "online_kernel",
164 "online_movable", "online", "offline" command 181 "online_movable", "online", "offline" command
165 which will be performed on all sections in the block. 182 which will be performed on all sections in the block.
166'phys_device' : read-only: designed to show the name of physical memory 183``phys_device`` read-only: designed to show the name of physical memory
167 device. This is not well implemented now. 184 device. This is not well implemented now.
168'removable' : read-only: contains an integer value indicating 185``removable`` read-only: contains an integer value indicating
169 whether the memory block is removable or not. 186 whether the memory block is removable or not.
170 A value of 1 indicates that the memory 187 A value of 1 indicates that the memory
171 block is removable and a value of 0 indicates that 188 block is removable and a value of 0 indicates that
172 it is not removable. A memory block is removable only if 189 it is not removable. A memory block is removable only if
173 every section in the block is removable. 190 every section in the block is removable.
174'valid_zones' : read-only: designed to show which zones this memory block 191``valid_zones`` read-only: designed to show which zones this memory block
175 can be onlined to. 192 can be onlined to.
176 The first column shows it's default zone. 193
 194			The first column shows its default zone.
195
177 "memory6/valid_zones: Normal Movable" shows this memoryblock 196 "memory6/valid_zones: Normal Movable" shows this memoryblock
178 can be onlined to ZONE_NORMAL by default and to ZONE_MOVABLE 197 can be onlined to ZONE_NORMAL by default and to ZONE_MOVABLE
179 by online_movable. 198 by online_movable.
199
180 "memory7/valid_zones: Movable Normal" shows this memoryblock 200 "memory7/valid_zones: Movable Normal" shows this memoryblock
181 can be onlined to ZONE_MOVABLE by default and to ZONE_NORMAL 201 can be onlined to ZONE_MOVABLE by default and to ZONE_NORMAL
182 by online_kernel. 202 by online_kernel.
203=================== ============================================================
204
205.. note::
183 206
184NOTE:
185 These directories/files appear after physical memory hotplug phase. 207 These directories/files appear after physical memory hotplug phase.
186 208
187If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed 209If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed
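The block-id arithmetic in the example above (0x100000000 / 1 GiB = memory4) can be sketched as follows (a hypothetical helper for illustration, not kernel code):

```c
#include <assert.h>

/*
 * Sketch: map a physical address to its memory block id, i.e. the XXX in
 * /sys/devices/system/memory/memoryXXX.  In practice block_size would be
 * read from /sys/devices/system/memory/block_size_bytes.
 */
static unsigned long memory_block_id(unsigned long long phys_addr,
				     unsigned long long block_size)
{
	return (unsigned long)(phys_addr / block_size);
}
```

With a 1 GiB block size, every address in [0x100000000 ... 0x140000000) maps to block id 4, matching the address range quoted above.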
@@ -193,13 +215,14 @@ For example:
193A backlink will also be created: 215A backlink will also be created:
194/sys/devices/system/memory/memory9/node0 -> ../../node/node0 216/sys/devices/system/memory/memory9/node0 -> ../../node/node0
195 217
218.. _memory_hotplug_physical_mem:
219
220Physical memory hot-add phase
221=============================
196 222
197-------------------------------- 223Hardware(Firmware) Support
1984. Physical memory hot-add phase 224--------------------------
199--------------------------------
200 225
2014.1 Hardware(Firmware) Support
202------------
203On x86_64/ia64 platform, memory hotplug by ACPI is supported. 226On x86_64/ia64 platform, memory hotplug by ACPI is supported.
204 227
205In general, the firmware (ACPI) which supports memory hotplug defines 228In general, the firmware (ACPI) which supports memory hotplug defines
@@ -209,7 +232,8 @@ script. This will be done automatically.
209 232
210But scripts for memory hotplug are not currently in the generic udev package. 233But scripts for memory hotplug are not currently in the generic udev package.
211You may have to write them yourself or online/offline memory by hand. 234You may have to write them yourself or online/offline memory by hand.
212Please see "How to online memory", "How to offline memory" in this text. 235Please see :ref:`memory_hotplug_how_to_online_memory` and
236:ref:`memory_hotplug_how_to_offline_memory`.
213 237
214If firmware supports NUMA-node hotplug, and defines an object _HID "ACPI0004", 238If firmware supports NUMA-node hotplug, and defines an object _HID "ACPI0004",
215"PNP0A05", or "PNP0A06", notification is asserted to it, and ACPI handler 239"PNP0A05", or "PNP0A06", notification is asserted to it, and ACPI handler
@@ -217,8 +241,9 @@ calls hotplug code for all of objects which are defined in it.
217If memory device is found, memory hotplug code will be called. 241If memory device is found, memory hotplug code will be called.
218 242
219 243
2204.2 Notify memory hot-add event by hand 244Notify memory hot-add event by hand
221------------ 245-----------------------------------
246
222On some architectures, the firmware may not notify the kernel of a memory 247On some architectures, the firmware may not notify the kernel of a memory
223hotplug event. Therefore, the memory "probe" interface is supported to 248hotplug event. Therefore, the memory "probe" interface is supported to
224explicitly notify the kernel. This interface depends on 249explicitly notify the kernel. This interface depends on
@@ -229,45 +254,48 @@ notification.
229Probe interface is located at 254Probe interface is located at
230/sys/devices/system/memory/probe 255/sys/devices/system/memory/probe
231 256
232You can tell the physical address of new memory to the kernel by 257You can tell the physical address of new memory to the kernel by::
233 258
234% echo start_address_of_new_memory > /sys/devices/system/memory/probe 259 % echo start_address_of_new_memory > /sys/devices/system/memory/probe
235 260
236Then, [start_address_of_new_memory, start_address_of_new_memory + 261Then, [start_address_of_new_memory, start_address_of_new_memory +
237memory_block_size] memory range is hot-added. In this case, hotplug script is 262memory_block_size] memory range is hot-added. In this case, hotplug script is
238not called (in current implementation). You'll have to online memory by 263not called (in current implementation). You'll have to online memory by
239yourself. Please see "How to online memory" in this text. 264yourself. Please see :ref:`memory_hotplug_how_to_online_memory`.
240 265
241 266
242------------------------------ 267Logical Memory hot-add phase
2435. Logical Memory hot-add phase 268============================
244------------------------------
245 269
2465.1. State of memory 270State of memory
247------------ 271---------------
248To see (online/offline) state of a memory block, read 'state' file. 272
273To see (online/offline) state of a memory block, read 'state' file::
274
 275	% cat /sys/devices/system/memory/memoryXXX/state
249 276
250% cat /sys/device/system/memory/memoryXXX/state
251 277
278- If the memory block is online, you'll read "online".
279- If the memory block is offline, you'll read "offline".
252 280
253If the memory block is online, you'll read "online".
254If the memory block is offline, you'll read "offline".
255 281
282.. _memory_hotplug_how_to_online_memory:
283
284How to online memory
285--------------------
256 286
2575.2. How to online memory
258------------
259When the memory is hot-added, the kernel decides whether or not to "online" 287When the memory is hot-added, the kernel decides whether or not to "online"
260it according to the policy which can be read from "auto_online_blocks" file: 288it according to the policy which can be read from "auto_online_blocks" file::
261 289
262% cat /sys/devices/system/memory/auto_online_blocks 290 % cat /sys/devices/system/memory/auto_online_blocks
263 291
264The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config 292The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
265option. If it is disabled the default is "offline" which means the newly added 293option. If it is disabled the default is "offline" which means the newly added
266memory is not in a ready-to-use state and you have to "online" the newly added 294memory is not in a ready-to-use state and you have to "online" the newly added
267memory blocks manually. Automatic onlining can be requested by writing "online" 295memory blocks manually. Automatic onlining can be requested by writing "online"
268to "auto_online_blocks" file: 296to "auto_online_blocks" file::
269 297
270% echo online > /sys/devices/system/memory/auto_online_blocks 298 % echo online > /sys/devices/system/memory/auto_online_blocks
271 299
272This sets a global policy and impacts all memory blocks that will subsequently 300This sets a global policy and impacts all memory blocks that will subsequently
273be hotplugged. Currently offline blocks keep their state. It is possible, under 301be hotplugged. Currently offline blocks keep their state. It is possible, under
@@ -277,24 +305,26 @@ online. User space tools can check their "state" files
277 305
278If the automatic onlining wasn't requested, failed, or some memory block was 306If the automatic onlining wasn't requested, failed, or some memory block was
279offlined it is possible to change the individual block's state by writing to the 307offlined it is possible to change the individual block's state by writing to the
280"state" file: 308"state" file::
281 309
282% echo online > /sys/devices/system/memory/memoryXXX/state 310 % echo online > /sys/devices/system/memory/memoryXXX/state
283 311
284This onlining will not change the ZONE type of the target memory block, 312This onlining will not change the ZONE type of the target memory block,
285If the memory block doesn't belong to any zone an appropriate kernel zone 313If the memory block doesn't belong to any zone an appropriate kernel zone
286(usually ZONE_NORMAL) will be used unless movable_node kernel command line 314(usually ZONE_NORMAL) will be used unless movable_node kernel command line
287option is specified when ZONE_MOVABLE will be used. 315option is specified when ZONE_MOVABLE will be used.
288 316
289You can explicitly request to associate it with ZONE_MOVABLE by 317You can explicitly request to associate it with ZONE_MOVABLE by::
318
319 % echo online_movable > /sys/devices/system/memory/memoryXXX/state
290 320
291% echo online_movable > /sys/devices/system/memory/memoryXXX/state 321.. note:: current limit: this memory block must be adjacent to ZONE_MOVABLE
292(NOTE: current limit: this memory block must be adjacent to ZONE_MOVABLE)
293 322
294Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by: 323Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by::
295 324
296% echo online_kernel > /sys/devices/system/memory/memoryXXX/state 325 % echo online_kernel > /sys/devices/system/memory/memoryXXX/state
297(NOTE: current limit: this memory block must be adjacent to ZONE_NORMAL) 326
327.. note:: current limit: this memory block must be adjacent to ZONE_NORMAL
298 328
299An explicit zone onlining can fail (e.g. when the range is already within 329An explicit zone onlining can fail (e.g. when the range is already within
300and existing and incompatible zone already). 330and existing and incompatible zone already).
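The "online", "online_movable" and "online_kernel" requests above are all plain writes to a block's "state" file. A hedged C sketch (not from the patch); the `sysfs_root` parameter is an assumption made here so the helper can be exercised against a scratch directory instead of `/sys/devices/system/memory`:

```c
#include <stdio.h>

/* Hedged sketch, not from the patch: request a state change for memory
 * block block_id by writing to its sysfs "state" file. sysfs_root is
 * parameterized (an assumption for testability); on a real system it
 * would be "/sys/devices/system/memory". Valid requests include
 * "online", "online_movable", "online_kernel" and "offline". */
static int set_memory_block_state(const char *sysfs_root,
				  unsigned int block_id, const char *state)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory%u/state", sysfs_root, block_id);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "%s\n", state) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

On a real system the write can fail even with correct syntax (e.g. when the requested zone association violates the adjacency limits noted above), so the return value must be checked.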
@@ -306,12 +336,12 @@ This may be changed in future.
 
 
 
-------------------------
-6. Logical memory remove
-------------------------
+Logical memory remove
+=====================
+
+Memory offline and ZONE_MOVABLE
+-------------------------------
 
-6.1 Memory offline and ZONE_MOVABLE
-------------
 Memory offlining is more complicated than memory online. Because memory offline
 has to make the whole memory block be unused, memory offline can fail if
 the memory block includes memory which cannot be freed.
@@ -336,24 +366,27 @@ Assume the system has "TOTAL" amount of memory at boot time, this boot option
 creates ZONE_MOVABLE as following.
 
 1) When kernelcore=YYYY boot option is used,
    Size of memory not for movable pages (not for offline) is YYYY.
    Size of memory for movable pages (for offline) is TOTAL-YYYY.
 
 2) When movablecore=ZZZZ boot option is used,
    Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
    Size of memory for movable pages (for offline) is ZZZZ.
+
+.. note::
 
-Note: Unfortunately, there is no information to show which memory block belongs
-to ZONE_MOVABLE. This is TBD.
+   Unfortunately, there is no information to show which memory block belongs
+   to ZONE_MOVABLE. This is TBD.
+
+.. _memory_hotplug_how_to_offline_memory:
 
+How to offline memory
+---------------------
 
-6.2. How to offline memory
-------------
 You can offline a memory block by using the same sysfs interface that was used
-in memory onlining.
+in memory onlining::
 
-% echo offline > /sys/devices/system/memory/memoryXXX/state
+	% echo offline > /sys/devices/system/memory/memoryXXX/state
 
 If offline succeeds, the state of the memory block is changed to be "offline".
 If it fails, some error code (like -EBUSY) will be returned by the kernel.
@@ -367,22 +400,22 @@ able to offline it (or not). (For example, a page is referred to by some kernel
 internal call and released soon.)
 
 Consideration:
-Memory hotplug's design direction is to make the possibility of memory offlining
-higher and to guarantee unplugging memory under any situation. But it needs
-more work. Returning -EBUSY under some situation may be good because the user
-can decide to retry more or not by himself. Currently, memory offlining code
-does some amount of retry with 120 seconds timeout.
+  Memory hotplug's design direction is to make the possibility of memory
+  offlining higher and to guarantee unplugging memory under any situation. But
+  it needs more work. Returning -EBUSY under some situation may be good because
+  the user can decide to retry more or not by himself. Currently, memory
+  offlining code does some amount of retry with 120 seconds timeout.
+
+Physical memory remove
+======================
 
--------------------------
-7. Physical memory remove
--------------------------
 Need more implementation yet....
  - Notification completion of remove works by OS to firmware.
  - Guard from remove if not yet.
 
---------------------------------
-8. Memory hotplug event notifier
---------------------------------
+Memory hotplug event notifier
+=============================
+
 Hotplugging events are sent to a notification queue.
 
 There are six types of notification defined in include/linux/memory.h:
@@ -412,14 +445,14 @@ MEM_CANCEL_OFFLINE
 MEM_OFFLINE
   Generated after offlining memory is complete.
 
-A callback routine can be registered by calling
+A callback routine can be registered by calling::
 
   hotplug_memory_notifier(callback_func, priority)
 
 Callback functions with higher values of priority are called before callback
 functions with lower values.
 
-A callback function must have the following prototype:
+A callback function must have the following prototype::
 
   int callback_func(
     struct notifier_block *self, unsigned long action, void *arg);
@@ -427,27 +460,28 @@ A callback function must have the following prototype:
 The first argument of the callback function (self) is a pointer to the block
 of the notifier chain that points to the callback function itself.
 The second argument (action) is one of the event types described above.
-The third argument (arg) passes a pointer of struct memory_notify.
+The third argument (arg) passes a pointer of struct memory_notify::
 
-struct memory_notify {
-	unsigned long start_pfn;
-	unsigned long nr_pages;
-	int status_change_nid_normal;
-	int status_change_nid_high;
-	int status_change_nid;
-}
+	struct memory_notify {
+		unsigned long start_pfn;
+		unsigned long nr_pages;
+		int status_change_nid_normal;
+		int status_change_nid_high;
+		int status_change_nid;
+	}
 
-start_pfn is start_pfn of online/offline memory.
-nr_pages is # of pages of online/offline memory.
-status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
-is (will be) set/clear, if this is -1, then nodemask status is not changed.
-status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask
-is (will be) set/clear, if this is -1, then nodemask status is not changed.
-status_change_nid is set node id when N_MEMORY of nodemask is (will be)
-set/clear. It means a new(memoryless) node gets new memory by online and a
-node loses all memory. If this is -1, then nodemask status is not changed.
-If status_changed_nid* >= 0, callback should create/discard structures for the
-node if necessary.
+- start_pfn is start_pfn of online/offline memory.
+- nr_pages is # of pages of online/offline memory.
+- status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
+  is (will be) set/clear, if this is -1, then nodemask status is not changed.
+- status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask
+  is (will be) set/clear, if this is -1, then nodemask status is not changed.
+- status_change_nid is set node id when N_MEMORY of nodemask is (will be)
+  set/clear. It means a new(memoryless) node gets new memory by online and a
+  node loses all memory. If this is -1, then nodemask status is not changed.
+
+  If status_changed_nid* >= 0, callback should create/discard structures for the
+  node if necessary.
 
 The callback routine shall return one of the values
 NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP
@@ -461,9 +495,9 @@ further processing of the notification queue.
 
 NOTIFY_STOP stops further processing of the notification queue.
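The callback contract described above can be sketched in plain C. Everything below is a compilable mock, not kernel code: the MEM_* and NOTIFY_* constants and struct memory_notify mirror only the shapes described in the text (in the kernel they come from <linux/memory.h> and <linux/notifier.h>, with different numeric values), and struct notifier_block is left opaque.

```c
#include <stddef.h>

/* Mock constants; in the kernel these come from <linux/memory.h> and
 * <linux/notifier.h> and have different numeric values. */
#define MEM_GOING_ONLINE	1
#define MEM_ONLINE		2
#define MEM_GOING_OFFLINE	3
#define MEM_OFFLINE		4

#define NOTIFY_DONE		0
#define NOTIFY_OK		1
#define NOTIFY_BAD		2

struct memory_notify {
	unsigned long start_pfn;
	unsigned long nr_pages;
	int status_change_nid_normal;
	int status_change_nid_high;
	int status_change_nid;
};

struct notifier_block;	/* opaque here; the kernel passes the chain entry */

/* Example callback following the prototype in the text: dispatch on the
 * action and return one of the NOTIFY_* values. */
static int example_mem_callback(struct notifier_block *self,
				unsigned long action, void *arg)
{
	struct memory_notify *mn = arg;

	(void)self;

	switch (action) {
	case MEM_GOING_ONLINE:
		/* If status_change_nid >= 0 a (memoryless) node is about
		 * to gain memory; per-node structures would be created
		 * here. Refusing a zero-page request is illustrative only. */
		return mn->nr_pages ? NOTIFY_OK : NOTIFY_BAD;
	case MEM_OFFLINE:
		/* Discard per-node structures if the node lost all memory. */
		return NOTIFY_OK;
	default:
		/* Events this callback does not care about. */
		return NOTIFY_DONE;
	}
}
```

In a real module the function would be registered with hotplug_memory_notifier(example_mem_callback, priority) as shown above.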
463 497
464-------------- 498Future Work
4659. Future Work 499===========
466-------------- 500
467 - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like 501 - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
468 sysctl or new control file. 502 sysctl or new control file.
469 - showing memory block and physical device relationship. 503 - showing memory block and physical device relationship.
@@ -471,4 +505,3 @@ NOTIFY_STOP stops further processing of the notification queue.
471 - support HugeTLB page migration and offlining. 505 - support HugeTLB page migration and offlining.
472 - memmap removing at memory offline. 506 - memmap removing at memory offline.
473 - physical remove memory. 507 - physical remove memory.
474
diff --git a/Documentation/men-chameleon-bus.txt b/Documentation/men-chameleon-bus.txt
index 30ded732027e..1b1f048aa748 100644
--- a/Documentation/men-chameleon-bus.txt
+++ b/Documentation/men-chameleon-bus.txt
@@ -1,163 +1,175 @@
-MEN Chameleon Bus
-=================
-
-Table of Contents
-=================
-1 Introduction
-   1.1 Scope of this Document
-   1.2 Limitations of the current implementation
-2 Architecture
-   2.1 MEN Chameleon Bus
-   2.2 Carrier Devices
-   2.3 Parser
-3 Resource handling
-   3.1 Memory Resources
-   3.2 IRQs
-4 Writing an MCB driver
-   4.1 The driver structure
-   4.2 Probing and attaching
-   4.3 Initializing the driver
-
-
-1 Introduction
-===============
-  This document describes the architecture and implementation of the MEN
-  Chameleon Bus (called MCB throughout this document).
-
-1.1 Scope of this Document
---------------------------
-  This document is intended to be a short overview of the current
-  implementation and does by no means describe the complete possibilities of MCB
-  based devices.
-
-1.2 Limitations of the current implementation
----------------------------------------------
-  The current implementation is limited to PCI and PCIe based carrier devices
-  that only use a single memory resource and share the PCI legacy IRQ. Not
-  implemented are:
-  - Multi-resource MCB devices like the VME Controller or M-Module carrier.
-  - MCB devices that need another MCB device, like SRAM for a DMA Controller's
-    buffer descriptors or a video controller's video memory.
-  - A per-carrier IRQ domain for carrier devices that have one (or more) IRQs
-    per MCB device like PCIe based carriers with MSI or MSI-X support.
-
-2 Architecture
-===============
-  MCB is divided into 3 functional blocks:
-  - The MEN Chameleon Bus itself,
-  - drivers for MCB Carrier Devices and
-  - the parser for the Chameleon table.
-
-2.1 MEN Chameleon Bus
----------------------
-  The MEN Chameleon Bus is an artificial bus system that attaches to a so
-  called Chameleon FPGA device found on some hardware produced by MEN Mikro
-  Elektronik GmbH. These devices are multi-function devices implemented in a
-  single FPGA and usually attached via some sort of PCI or PCIe link. Each
-  FPGA contains a header section describing the content of the FPGA. The
-  header lists the device id, PCI BAR, offset from the beginning of the PCI
-  BAR, size in the FPGA, interrupt number and some other properties currently
-  not handled by the MCB implementation.
-
-2.2 Carrier Devices
--------------------
-  A carrier device is just an abstraction for the real world physical bus the
-  Chameleon FPGA is attached to. Some IP Core drivers may need to interact with
-  properties of the carrier device (like querying the IRQ number of a PCI
-  device). To provide abstraction from the real hardware bus, an MCB carrier
-  device provides callback methods to translate the driver's MCB function calls
-  to hardware related function calls. For example a carrier device may
-  implement the get_irq() method which can be translated into a hardware bus
-  query for the IRQ number the device should use.
-
-2.3 Parser
-----------
-  The parser reads the first 512 bytes of a Chameleon device and parses the
-  Chameleon table. Currently the parser only supports the Chameleon v2 variant
-  of the Chameleon table but can easily be adapted to support an older or
-  possible future variant. While parsing the table's entries new MCB devices
-  are allocated and their resources are assigned according to the resource
-  assignment in the Chameleon table. After resource assignment is finished, the
-  MCB devices are registered at the MCB and thus at the driver core of the
-  Linux kernel.
-
-3 Resource handling
-===================
-  The current implementation assigns exactly one memory and one IRQ resource
-  per MCB device. But this is likely going to change in the future.
-
-3.1 Memory Resources
--------------------
-  Each MCB device has exactly one memory resource, which can be requested from
-  the MCB bus. This memory resource is the physical address of the MCB device
-  inside the carrier and is intended to be passed to ioremap() and friends. It
-  is already requested from the kernel by calling request_mem_region().
-
-3.2 IRQs
---------
-  Each MCB device has exactly one IRQ resource, which can be requested from the
-  MCB bus. If a carrier device driver implements the ->get_irq() callback
-  method, the IRQ number assigned by the carrier device will be returned,
-  otherwise the IRQ number inside the Chameleon table will be returned. This
-  number is suitable to be passed to request_irq().
-
-4 Writing an MCB driver
-=======================
-
-4.1 The driver structure
------------------------
-  Each MCB driver has a structure to identify the device driver as well as
-  device ids which identify the IP Core inside the FPGA. The driver structure
-  also contains callback methods which get executed on driver probe and
-  removal from the system.
-
-
-  static const struct mcb_device_id foo_ids[] = {
-          { .device = 0x123 },
-          { }
-  };
-  MODULE_DEVICE_TABLE(mcb, foo_ids);
-
-  static struct mcb_driver foo_driver = {
-          .driver = {
-                  .name = "foo-bar",
-                  .owner = THIS_MODULE,
-          },
-          .probe = foo_probe,
-          .remove = foo_remove,
-          .id_table = foo_ids,
-  };
-
-4.2 Probing and attaching
-------------------------
-  When a driver is loaded and the MCB devices it services are found, the MCB
-  core will call the driver's probe callback method. When the driver is removed
-  from the system, the MCB core will call the driver's remove callback method.
-
-
-  static int foo_probe(struct mcb_device *mdev, const struct mcb_device_id *id);
-  static void foo_remove(struct mcb_device *mdev);
-
-4.3 Initializing the driver
---------------------------
-  When the kernel is booted or your foo driver module is inserted, you have to
-  perform driver initialization. Usually it is enough to register your driver
-  module at the MCB core.
-
-
-  static int __init foo_init(void)
-  {
-          return mcb_register_driver(&foo_driver);
-  }
-  module_init(foo_init);
-
-  static void __exit foo_exit(void)
-  {
-          mcb_unregister_driver(&foo_driver);
-  }
-  module_exit(foo_exit);
-
-  The module_mcb_driver() macro can be used to reduce the above code.
-
-
-  module_mcb_driver(foo_driver);
+=================
+MEN Chameleon Bus
+=================
+
+.. Table of Contents
+   =================
+   1 Introduction
+       1.1 Scope of this Document
+       1.2 Limitations of the current implementation
+   2 Architecture
+       2.1 MEN Chameleon Bus
+       2.2 Carrier Devices
+       2.3 Parser
+   3 Resource handling
+       3.1 Memory Resources
+       3.2 IRQs
+   4 Writing an MCB driver
+       4.1 The driver structure
+       4.2 Probing and attaching
+       4.3 Initializing the driver
+
+
+Introduction
+============
+
+This document describes the architecture and implementation of the MEN
+Chameleon Bus (called MCB throughout this document).
+
+Scope of this Document
+----------------------
+
+This document is intended to be a short overview of the current
+implementation and does by no means describe the complete possibilities of MCB
+based devices.
+
+Limitations of the current implementation
+-----------------------------------------
+
+The current implementation is limited to PCI and PCIe based carrier devices
+that only use a single memory resource and share the PCI legacy IRQ. Not
+implemented are:
+
+- Multi-resource MCB devices like the VME Controller or M-Module carrier.
+- MCB devices that need another MCB device, like SRAM for a DMA Controller's
+  buffer descriptors or a video controller's video memory.
+- A per-carrier IRQ domain for carrier devices that have one (or more) IRQs
+  per MCB device like PCIe based carriers with MSI or MSI-X support.
+
+Architecture
+============
+
+MCB is divided into 3 functional blocks:
+
+- The MEN Chameleon Bus itself,
+- drivers for MCB Carrier Devices and
+- the parser for the Chameleon table.
+
+MEN Chameleon Bus
+-----------------
+
+The MEN Chameleon Bus is an artificial bus system that attaches to a so
+called Chameleon FPGA device found on some hardware produced by MEN Mikro
+Elektronik GmbH. These devices are multi-function devices implemented in a
+single FPGA and usually attached via some sort of PCI or PCIe link. Each
+FPGA contains a header section describing the content of the FPGA. The
+header lists the device id, PCI BAR, offset from the beginning of the PCI
+BAR, size in the FPGA, interrupt number and some other properties currently
+not handled by the MCB implementation.
+
+Carrier Devices
+---------------
+
+A carrier device is just an abstraction for the real world physical bus the
+Chameleon FPGA is attached to. Some IP Core drivers may need to interact with
+properties of the carrier device (like querying the IRQ number of a PCI
+device). To provide abstraction from the real hardware bus, an MCB carrier
+device provides callback methods to translate the driver's MCB function calls
+to hardware related function calls. For example a carrier device may
+implement the get_irq() method which can be translated into a hardware bus
+query for the IRQ number the device should use.
+
+Parser
+------
+
+The parser reads the first 512 bytes of a Chameleon device and parses the
+Chameleon table. Currently the parser only supports the Chameleon v2 variant
+of the Chameleon table but can easily be adapted to support an older or
+possible future variant. While parsing the table's entries new MCB devices
+are allocated and their resources are assigned according to the resource
+assignment in the Chameleon table. After resource assignment is finished, the
+MCB devices are registered at the MCB and thus at the driver core of the
+Linux kernel.
+
+Resource handling
+=================
+
+The current implementation assigns exactly one memory and one IRQ resource
+per MCB device. But this is likely going to change in the future.
+
+Memory Resources
+----------------
+
+Each MCB device has exactly one memory resource, which can be requested from
+the MCB bus. This memory resource is the physical address of the MCB device
+inside the carrier and is intended to be passed to ioremap() and friends. It
+is already requested from the kernel by calling request_mem_region().
+
+IRQs
+----
+
+Each MCB device has exactly one IRQ resource, which can be requested from the
+MCB bus. If a carrier device driver implements the ->get_irq() callback
+method, the IRQ number assigned by the carrier device will be returned,
+otherwise the IRQ number inside the Chameleon table will be returned. This
+number is suitable to be passed to request_irq().
+
+Writing an MCB driver
+=====================
+
+The driver structure
+--------------------
+
+Each MCB driver has a structure to identify the device driver as well as
+device ids which identify the IP Core inside the FPGA. The driver structure
+also contains callback methods which get executed on driver probe and
+removal from the system::
+
+	static const struct mcb_device_id foo_ids[] = {
+		{ .device = 0x123 },
+		{ }
+	};
+	MODULE_DEVICE_TABLE(mcb, foo_ids);
+
+	static struct mcb_driver foo_driver = {
+		.driver = {
+			.name = "foo-bar",
+			.owner = THIS_MODULE,
+		},
+		.probe = foo_probe,
+		.remove = foo_remove,
+		.id_table = foo_ids,
+	};
+
+Probing and attaching
+---------------------
+
+When a driver is loaded and the MCB devices it services are found, the MCB
+core will call the driver's probe callback method. When the driver is removed
+from the system, the MCB core will call the driver's remove callback method::
+
+	static int foo_probe(struct mcb_device *mdev, const struct mcb_device_id *id);
+	static void foo_remove(struct mcb_device *mdev);
+
+Initializing the driver
+-----------------------
+
+When the kernel is booted or your foo driver module is inserted, you have to
+perform driver initialization. Usually it is enough to register your driver
+module at the MCB core::
+
+	static int __init foo_init(void)
+	{
+		return mcb_register_driver(&foo_driver);
+	}
+	module_init(foo_init);
+
+	static void __exit foo_exit(void)
+	{
+		mcb_unregister_driver(&foo_driver);
+	}
+	module_exit(foo_exit);
+
+The module_mcb_driver() macro can be used to reduce the above code::
+
+	module_mcb_driver(foo_driver);
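The document above shows the foo_probe()/foo_remove() prototypes but no body. A speculative sketch of what a probe body could look like follows; the mcb types and the mcb_request_mem()/mcb_get_irq() helpers are mocked here with just enough shape to compile outside the kernel, so treat every signature below as illustrative (the real definitions live in include/linux/mcb.h and differ in detail).

```c
#include <stddef.h>

/* Mocked types, not the real mcb API: just enough shape to show the flow. */
struct resource { unsigned long start, end; };
struct mcb_device { struct resource mem; int irq; };
struct mcb_device_id { unsigned long device; };

/* Mock of mcb_request_mem(): the real helper reserves the device's memory
 * region from the MCB bus and can fail. */
static struct resource *mcb_request_mem(struct mcb_device *mdev)
{
	return &mdev->mem;
}

/* Mock of mcb_get_irq(): the real helper may also ask the carrier device
 * via its ->get_irq() callback, as described in the IRQs section. */
static int mcb_get_irq(struct mcb_device *mdev)
{
	return mdev->irq;
}

/* Illustrative probe body: claim the memory resource, fetch the IRQ, and
 * (in a real driver) ioremap() the region and request_irq() the number. */
static int foo_probe(struct mcb_device *mdev, const struct mcb_device_id *id)
{
	struct resource *mem;
	int irq;

	(void)id;	/* a real driver might branch on id->device */

	mem = mcb_request_mem(mdev);
	if (!mem)
		return -1;	/* a real driver would return -EBUSY */

	irq = mcb_get_irq(mdev);
	if (irq < 0)
		return irq;

	return 0;
}
```

The point of the sketch is the ordering: the memory resource is claimed before the IRQ is looked up, mirroring the Resource handling section above.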
diff --git a/Documentation/nommu-mmap.txt b/Documentation/nommu-mmap.txt
index ae57b9ea0d41..69556f0d494b 100644
--- a/Documentation/nommu-mmap.txt
+++ b/Documentation/nommu-mmap.txt
@@ -1,6 +1,6 @@
1 ============================= 1=============================
2 NO-MMU MEMORY MAPPING SUPPORT 2No-MMU memory mapping support
3 ============================= 3=============================
4 4
5The kernel has limited support for memory mapping under no-MMU conditions, such 5The kernel has limited support for memory mapping under no-MMU conditions, such
6as are used in uClinux environments. From the userspace point of view, memory 6as are used in uClinux environments. From the userspace point of view, memory
@@ -16,7 +16,7 @@ the CLONE_VM flag.
16The behaviour is similar between the MMU and no-MMU cases, but not identical; 16The behaviour is similar between the MMU and no-MMU cases, but not identical;
17and it's also much more restricted in the latter case: 17and it's also much more restricted in the latter case:
18 18
19 (*) Anonymous mapping, MAP_PRIVATE 19 (#) Anonymous mapping, MAP_PRIVATE
20 20
21 In the MMU case: VM regions backed by arbitrary pages; copy-on-write 21 In the MMU case: VM regions backed by arbitrary pages; copy-on-write
22 across fork. 22 across fork.
@@ -24,14 +24,14 @@ and it's also much more restricted in the latter case:
24 In the no-MMU case: VM regions backed by arbitrary contiguous runs of 24 In the no-MMU case: VM regions backed by arbitrary contiguous runs of
25 pages. 25 pages.
26 26
27 (*) Anonymous mapping, MAP_SHARED 27 (#) Anonymous mapping, MAP_SHARED
28 28
29 These behave very much like private mappings, except that they're 29 These behave very much like private mappings, except that they're
30 shared across fork() or clone() without CLONE_VM in the MMU case. Since 30 shared across fork() or clone() without CLONE_VM in the MMU case. Since
31 the no-MMU case doesn't support these, behaviour is identical to 31 the no-MMU case doesn't support these, behaviour is identical to
32 MAP_PRIVATE there. 32 MAP_PRIVATE there.
33 33
34 (*) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, !PROT_WRITE 34 (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, !PROT_WRITE
35 35
36 In the MMU case: VM regions backed by pages read from file; changes to 36 In the MMU case: VM regions backed by pages read from file; changes to
37 the underlying file are reflected in the mapping; copied across fork. 37 the underlying file are reflected in the mapping; copied across fork.
@@ -56,7 +56,7 @@ and it's also much more restricted in the latter case:
56 are visible in other processes (no MMU protection), but should not 56 are visible in other processes (no MMU protection), but should not
57 happen. 57 happen.
58 58
59 (*) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, PROT_WRITE 59 (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, PROT_WRITE
60 60
61 In the MMU case: like the non-PROT_WRITE case, except that the pages in 61 In the MMU case: like the non-PROT_WRITE case, except that the pages in
62 question get copied before the write actually happens. From that point 62 question get copied before the write actually happens. From that point
@@ -66,7 +66,7 @@ and it's also much more restricted in the latter case:
66 In the no-MMU case: works much like the non-PROT_WRITE case, except 66 In the no-MMU case: works much like the non-PROT_WRITE case, except
67 that a copy is always taken and never shared. 67 that a copy is always taken and never shared.
68 68
69 (*) Regular file / blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE 69 (#) Regular file / blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE
70 70
71 In the MMU case: VM regions backed by pages read from file; changes to 71 In the MMU case: VM regions backed by pages read from file; changes to
72 pages written back to file; writes to file reflected into pages backing 72 pages written back to file; writes to file reflected into pages backing
@@ -74,7 +74,7 @@ and it's also much more restricted in the latter case:
74 74
75 In the no-MMU case: not supported. 75 In the no-MMU case: not supported.
76 76
77 (*) Memory backed regular file, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE 77 (#) Memory backed regular file, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE
78 78
79 In the MMU case: As for ordinary regular files. 79 In the MMU case: As for ordinary regular files.
80 80
@@ -85,7 +85,7 @@ and it's also much more restricted in the latter case:
85 as for the MMU case. If the filesystem does not provide any such 85 as for the MMU case. If the filesystem does not provide any such
86 support, then the mapping request will be denied. 86 support, then the mapping request will be denied.
87 87
88 (*) Memory backed blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE 88 (#) Memory backed blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE
89 89
90 In the MMU case: As for ordinary regular files. 90 In the MMU case: As for ordinary regular files.
91 91
@@ -94,7 +94,7 @@ and it's also much more restricted in the latter case:
94 truncate being called. The ramdisk driver could do this if it allocated 94 truncate being called. The ramdisk driver could do this if it allocated
95 all its memory as a contiguous array upfront. 95 all its memory as a contiguous array upfront.
96 96
97 (*) Memory backed chardev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE 97 (#) Memory backed chardev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE
98 98
99 In the MMU case: As for ordinary regular files. 99 In the MMU case: As for ordinary regular files.
100 100
@@ -105,21 +105,20 @@ and it's also much more restricted in the latter case:
      provide any such support, then the mapping request will be denied.
 
 
-============================
-FURTHER NOTES ON NO-MMU MMAP
+Further notes on no-MMU MMAP
 ============================
 
- (*) A request for a private mapping of a file may return a buffer that is not
+ (#) A request for a private mapping of a file may return a buffer that is not
      page-aligned. This is because XIP may take place, and the data may not be
      paged aligned in the backing store.
 
- (*) A request for an anonymous mapping will always be page aligned. If
+ (#) A request for an anonymous mapping will always be page aligned. If
      possible the size of the request should be a power of two otherwise some
      of the space may be wasted as the kernel must allocate a power-of-2
      granule but will only discard the excess if appropriately configured as
      this has an effect on fragmentation.
 
- (*) The memory allocated by a request for an anonymous mapping will normally
+ (#) The memory allocated by a request for an anonymous mapping will normally
      be cleared by the kernel before being returned in accordance with the
      Linux man pages (ver 2.22 or later).
 
@@ -145,24 +144,23 @@ FURTHER NOTES ON NO-MMU MMAP
      uClibc uses this to speed up malloc(), and the ELF-FDPIC binfmt uses this
      to allocate the brk and stack region.
 
- (*) A list of all the private copy and anonymous mappings on the system is
+ (#) A list of all the private copy and anonymous mappings on the system is
      visible through /proc/maps in no-MMU mode.
 
- (*) A list of all the mappings in use by a process is visible through
+ (#) A list of all the mappings in use by a process is visible through
      /proc/<pid>/maps in no-MMU mode.
 
- (*) Supplying MAP_FIXED or a requesting a particular mapping address will
+ (#) Supplying MAP_FIXED or requesting a particular mapping address will
      result in an error.
 
- (*) Files mapped privately usually have to have a read method provided by the
+ (#) Files mapped privately usually have to have a read method provided by the
      driver or filesystem so that the contents can be read into the memory
      allocated if mmap() chooses not to map the backing device directly. An
      error will result if they don't. This is most likely to be encountered
      with character device files, pipes, fifos and sockets.
 
 
-==========================
-INTERPROCESS SHARED MEMORY
+Interprocess shared memory
 ==========================
 
 Both SYSV IPC SHM shared memory and POSIX shared memory is supported in NOMMU
@@ -170,8 +168,7 @@ mode. The former through the usual mechanism, the latter through files created
 on ramfs or tmpfs mounts.
 
 
-=======
-FUTEXES
+Futexes
 =======
 
 Futexes are supported in NOMMU mode if the arch supports them. An error will
@@ -180,12 +177,11 @@ mappings made by a process or if the mapping in which the address lies does not
 support futexes (such as an I/O chardev mapping).
 
 
-=============
-NO-MMU MREMAP
+No-MMU mremap
 =============
 
 The mremap() function is partially supported. It may change the size of a
-mapping, and may move it[*] if MREMAP_MAYMOVE is specified and if the new size
+mapping, and may move it [#]_ if MREMAP_MAYMOVE is specified and if the new size
 of the mapping exceeds the size of the slab object currently occupied by the
 memory to which the mapping refers, or if a smaller slab object could be used.
 
@@ -200,11 +196,10 @@ a previously mapped object. It may not be used to create holes in existing
 mappings, move parts of existing mappings or resize parts of mappings. It must
 act on a complete mapping.
 
-[*] Not currently supported.
+.. [#] Not currently supported.
 
 
-============================================
-PROVIDING SHAREABLE CHARACTER DEVICE SUPPORT
+Providing shareable character device support
 ============================================
 
 To provide shareable character device support, a driver must provide a
@@ -235,7 +230,7 @@ direct the call to the device-specific driver. Under such circumstances, the
 mapping request will be rejected if NOMMU_MAP_COPY is not specified, and a
 copy mapped otherwise.
 
-IMPORTANT NOTE:
+.. important::
 
 	Some types of device may present a different appearance to anyone
 	looking at them in certain modes. Flash chips can be like this; for
@@ -249,8 +244,7 @@ IMPORTANT NOTE:
 	circumstances!
 
 
-==============================================
-PROVIDING SHAREABLE MEMORY-BACKED FILE SUPPORT
+Providing shareable memory-backed file support
 ==============================================
 
 Provision of shared mappings on memory backed files is similar to the provision
@@ -267,8 +261,7 @@ Memory backed devices are indicated by the mapping's backing device info having
 the memory_backed flag set.
 
 
-========================================
-PROVIDING SHAREABLE BLOCK DEVICE SUPPORT
+Providing shareable block device support
 ========================================
 
 Provision of shared mappings on block device files is exactly the same as for
@@ -276,8 +269,7 @@ character devices. If there isn't a real device underneath, then the driver
 should allocate sufficient contiguous memory to honour any supported mapping.
 
 
-=================================
-ADJUSTING PAGE TRIMMING BEHAVIOUR
+Adjusting page trimming behaviour
 =================================
 
 NOMMU mmap automatically rounds up to the nearest power-of-2 number of pages
@@ -288,4 +280,4 @@ allocator. In order to retain finer-grained control over fragmentation, this
 behaviour can either be disabled completely, or bumped up to a higher page
 watermark where trimming begins.
 
-Page trimming behaviour is configurable via the sysctl `vm.nr_trim_pages'.
+Page trimming behaviour is configurable via the sysctl ``vm.nr_trim_pages``.
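The power-of-2 granule behaviour described above can be sketched numerically. The helper below is an illustrative user-space model only, not kernel code, and assumes a 4 KiB page size:

```python
PAGE_SIZE = 4096  # assumed page size; architectures differ

def nommu_granule(request_bytes):
    """Model the NOMMU allocator: round a request up to a whole
    number of pages, then up to a power-of-2 page count."""
    pages = -(-request_bytes // PAGE_SIZE)      # ceiling division
    granule = 1
    while granule < pages:
        granule *= 2
    return granule * PAGE_SIZE

# A 3-page (12 KiB) request costs a 4-page (16 KiB) granule, so a
# quarter of the allocation is waste unless trimming reclaims it;
# a power-of-2-sized request wastes nothing.
print(nommu_granule(3 * PAGE_SIZE))   # 16384
print(nommu_granule(4 * PAGE_SIZE))   # 16384
```

This is why the text recommends power-of-two request sizes, and why ``vm.nr_trim_pages`` controls whether the excess pages are returned to the allocator.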
diff --git a/Documentation/ntb.txt b/Documentation/ntb.txt
index a5af4f0159f3..a043854d28df 100644
--- a/Documentation/ntb.txt
+++ b/Documentation/ntb.txt
@@ -1,4 +1,6 @@
-# NTB Drivers
+===========
+NTB Drivers
+===========
 
 NTB (Non-Transparent Bridge) is a type of PCI-Express bridge chip that connects
 the separate memory systems of two or more computers to the same PCI-Express
@@ -12,7 +14,8 @@ special status bits to make sure the information isn't rewritten by another
 peer. Doorbell registers provide a way for peers to send interrupt events.
 Memory windows allow translated read and write access to the peer memory.
 
-## NTB Core Driver (ntb)
+NTB Core Driver (ntb)
+=====================
 
 The NTB core driver defines an api wrapping the common feature set, and allows
 clients interested in NTB features to discover NTB the devices supported by
@@ -20,7 +23,8 @@ hardware drivers. The term "client" is used here to mean an upper layer
 component making use of the NTB api. The term "driver," or "hardware driver,"
 is used here to mean a driver for a specific vendor and model of NTB hardware.
 
-## NTB Client Drivers
+NTB Client Drivers
+==================
 
 NTB client drivers should register with the NTB core driver. After
 registering, the client probe and remove functions will be called appropriately
@@ -28,7 +32,8 @@ as ntb hardware, or hardware drivers, are inserted and removed. The
 registration uses the Linux Device framework, so it should feel familiar to
 anyone who has written a pci driver.
 
-### NTB Typical client driver implementation
+NTB Typical client driver implementation
+----------------------------------------
 
 Primary purpose of NTB is to share some peace of memory between at least two
 systems. So the NTB device features like Scratchpad/Message registers are
@@ -109,7 +114,8 @@ follows:
 Also it is worth to note, that method ntb_mw_count(pidx) should return the
 same value as ntb_peer_mw_count() on the peer with port index - pidx.
 
-### NTB Transport Client (ntb\_transport) and NTB Netdev (ntb\_netdev)
+NTB Transport Client (ntb\_transport) and NTB Netdev (ntb\_netdev)
+------------------------------------------------------------------
 
 The primary client for NTB is the Transport client, used in tandem with NTB
 Netdev. These drivers function together to create a logical link to the peer,
@@ -120,7 +126,8 @@ Transport queue pair. Network data is copied between socket buffers and the
 Transport queue pair buffer. The Transport client may be used for other things
 besides Netdev, however no other applications have yet been written.
 
-### NTB Ping Pong Test Client (ntb\_pingpong)
+NTB Ping Pong Test Client (ntb\_pingpong)
+-----------------------------------------
 
 The Ping Pong test client serves as a demonstration to exercise the doorbell
 and scratchpad registers of NTB hardware, and as an example simple NTB client.
@@ -147,7 +154,8 @@ Module Parameters:
 * dyndbg - It is suggested to specify dyndbg=+p when loading this module, and
 	then to observe debugging output on the console.
 
-### NTB Tool Test Client (ntb\_tool)
+NTB Tool Test Client (ntb\_tool)
+--------------------------------
 
 The Tool test client serves for debugging, primarily, ntb hardware and drivers.
 The Tool provides access through debugfs for reading, setting, and clearing the
@@ -157,48 +165,60 @@ The Tool does not currently have any module parameters.
 
 Debugfs Files:
 
-* *debugfs*/ntb\_tool/*hw*/ - A directory in debugfs will be created for each
+* *debugfs*/ntb\_tool/*hw*/
+	A directory in debugfs will be created for each
 	NTB device probed by the tool. This directory is shortened to *hw*
 	below.
-* *hw*/db - This file is used to read, set, and clear the local doorbell. Not
+* *hw*/db
+	This file is used to read, set, and clear the local doorbell. Not
 	all operations may be supported by all hardware. To read the doorbell,
 	read the file. To set the doorbell, write `s` followed by the bits to
 	set (eg: `echo 's 0x0101' > db`). To clear the doorbell, write `c`
 	followed by the bits to clear.
-* *hw*/mask - This file is used to read, set, and clear the local doorbell mask.
+* *hw*/mask
+	This file is used to read, set, and clear the local doorbell mask.
 	See *db* for details.
-* *hw*/peer\_db - This file is used to read, set, and clear the peer doorbell.
+* *hw*/peer\_db
+	This file is used to read, set, and clear the peer doorbell.
 	See *db* for details.
-* *hw*/peer\_mask - This file is used to read, set, and clear the peer doorbell
+* *hw*/peer\_mask
+	This file is used to read, set, and clear the peer doorbell
 	mask. See *db* for details.
-* *hw*/spad - This file is used to read and write local scratchpads. To read
+* *hw*/spad
+	This file is used to read and write local scratchpads. To read
 	the values of all scratchpads, read the file. To write values, write a
 	series of pairs of scratchpad number and value
 	(eg: `echo '4 0x123 7 0xabc' > spad`
 	# to set scratchpads `4` and `7` to `0x123` and `0xabc`, respectively).
-* *hw*/peer\_spad - This file is used to read and write peer scratchpads. See
+* *hw*/peer\_spad
+	This file is used to read and write peer scratchpads. See
 	*spad* for details.
 
-## NTB Hardware Drivers
+NTB Hardware Drivers
+====================
 
 NTB hardware drivers should register devices with the NTB core driver. After
 registering, clients probe and remove functions will be called.
 
-### NTB Intel Hardware Driver (ntb\_hw\_intel)
+NTB Intel Hardware Driver (ntb\_hw\_intel)
+------------------------------------------
 
 The Intel hardware driver supports NTB on Xeon and Atom CPUs.
 
 Module Parameters:
 
-* b2b\_mw\_idx - If the peer ntb is to be accessed via a memory window, then use
+* b2b\_mw\_idx
+	If the peer ntb is to be accessed via a memory window, then use
 	this memory window to access the peer ntb. A value of zero or positive
 	starts from the first mw idx, and a negative value starts from the last
 	mw idx. Both sides MUST set the same value here! The default value is
 	`-1`.
-* b2b\_mw\_share - If the peer ntb is to be accessed via a memory window, and if
+* b2b\_mw\_share
+	If the peer ntb is to be accessed via a memory window, and if
 	the memory window is large enough, still allow the client to use the
 	second half of the memory window for address translation to the peer.
-* xeon\_b2b\_usd\_bar2\_addr64 - If using B2B topology on Xeon hardware, use
+* xeon\_b2b\_usd\_bar2\_addr64
+	If using B2B topology on Xeon hardware, use
 	this 64 bit address on the bus between the NTB devices for the window
 	at BAR2, on the upstream side of the link.
 * xeon\_b2b\_usd\_bar4\_addr64 - See *xeon\_b2b\_bar2\_addr64*.
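As a small illustration of the *spad* write format documented above (`'4 0x123 7 0xabc'`), a hypothetical helper that builds such a command string might look like this; the function name and the debugfs path in the comment are assumptions for illustration:

```python
def spad_write_cmd(values):
    """Build the 'index value' pair string that ntb_tool's spad
    debugfs file expects, e.g. {4: 0x123, 7: 0xabc} -> '4 0x123 7 0xabc'."""
    return " ".join(f"{idx} {val:#x}" for idx, val in sorted(values.items()))

cmd = spad_write_cmd({4: 0x123, 7: 0xabc})
print(cmd)  # 4 0x123 7 0xabc
# One would then write it as root, e.g.:
#   echo '4 0x123 7 0xabc' > /sys/kernel/debug/ntb_tool/<hw>/spad
```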
diff --git a/Documentation/numastat.txt b/Documentation/numastat.txt
index 520327790d54..aaf1667489f8 100644
--- a/Documentation/numastat.txt
+++ b/Documentation/numastat.txt
@@ -1,10 +1,12 @@
-
+===============================
 Numa policy hit/miss statistics
+===============================
 
 /sys/devices/system/node/node*/numastat
 
 All units are pages. Hugepages have separate counters.
 
+=============== ============================================================
 numa_hit	A process wanted to allocate memory from this node,
 		and succeeded.
 
@@ -20,6 +22,7 @@ other_node A process ran on this node and got memory from another node.
 
 interleave_hit	Interleaving wanted to allocate from this node
 		and succeeded.
+=============== ============================================================
 
 For easier reading you can use the numastat utility from the numactl package
 (http://oss.sgi.com/projects/libnuma/). Note that it only works
diff --git a/Documentation/padata.txt b/Documentation/padata.txt
index 7ddfe216a0aa..b103d0c82000 100644
--- a/Documentation/padata.txt
+++ b/Documentation/padata.txt
@@ -1,5 +1,8 @@
+=======================================
 The padata parallel execution mechanism
-Last updated for 2.6.36
+=======================================
+
+:Last updated: for 2.6.36
 
 Padata is a mechanism by which the kernel can farm work out to be done in
 parallel on multiple CPUs while retaining the ordering of tasks. It was
@@ -9,7 +12,7 @@ those packets. The crypto developers made a point of writing padata in a
 sufficiently general fashion that it could be put to other uses as well.
 
 The first step in using padata is to set up a padata_instance structure for
-overall control of how tasks are to be run:
+overall control of how tasks are to be run::
 
     #include <linux/padata.h>
 
@@ -24,7 +27,7 @@ The workqueue wq is where the work will actually be done; it should be
 a multithreaded queue, naturally.
 
 To allocate a padata instance with the cpu_possible_mask for both
-cpumasks this helper function can be used:
+cpumasks this helper function can be used::
 
    struct padata_instance *padata_alloc_possible(struct workqueue_struct *wq);
 
@@ -36,7 +39,7 @@ it is legal to supply a cpumask to padata that contains offline CPUs.
 Once an offline CPU in the user supplied cpumask comes online, padata
 is going to use it.
 
-There are functions for enabling and disabling the instance:
+There are functions for enabling and disabling the instance::
 
    int padata_start(struct padata_instance *pinst);
   void padata_stop(struct padata_instance *pinst);
@@ -48,7 +51,7 @@ padata cpumask contains no active CPU (flag not set).
 padata_stop clears the flag and blocks until the padata instance
 is unused.
 
-The list of CPUs to be used can be adjusted with these functions:
+The list of CPUs to be used can be adjusted with these functions::
 
    int padata_set_cpumasks(struct padata_instance *pinst,
 			   cpumask_var_t pcpumask,
@@ -71,12 +74,12 @@ padata_add_cpu/padata_remove_cpu are used. cpu specifies the CPU to add or
 remove and mask is one of PADATA_CPU_SERIAL, PADATA_CPU_PARALLEL.
 
 If a user is interested in padata cpumask changes, he can register to
-the padata cpumask change notifier:
+the padata cpumask change notifier::
 
    int padata_register_cpumask_notifier(struct padata_instance *pinst,
 					struct notifier_block *nblock);
 
-To unregister from that notifier:
+To unregister from that notifier::
 
    int padata_unregister_cpumask_notifier(struct padata_instance *pinst,
 					  struct notifier_block *nblock);
@@ -84,7 +87,7 @@ To unregister from that notifier:
 The padata cpumask change notifier notifies about changes of the usable
 cpumasks, i.e. the subset of active CPUs in the user supplied cpumask.
 
-Padata calls the notifier chain with:
+Padata calls the notifier chain with::
 
    blocking_notifier_call_chain(&pinst->cpumask_change_notifier,
 				notification_mask,
@@ -95,7 +98,7 @@ is one of PADATA_CPU_SERIAL, PADATA_CPU_PARALLEL and cpumask is a pointer
 to a struct padata_cpumask that contains the new cpumask information.
 
 Actually submitting work to the padata instance requires the creation of a
-padata_priv structure:
+padata_priv structure::
 
    struct padata_priv {
        /* Other stuff here... */
@@ -110,7 +113,7 @@ parallel() and serial() functions should be provided. Those functions will
 be called in the process of getting the work done as we will see
 momentarily.
 
-The submission of work is done with:
+The submission of work is done with::
 
    int padata_do_parallel(struct padata_instance *pinst,
 			  struct padata_priv *padata, int cb_cpu);
@@ -138,7 +141,7 @@ need not be completed during this call, but, if parallel() leaves work
 outstanding, it should be prepared to be called again with a new job before
 the previous one completes. When a task does complete, parallel() (or
 whatever function actually finishes the job) should inform padata of the
-fact with a call to:
+fact with a call to::
 
    void padata_do_serial(struct padata_priv *padata);
 
@@ -151,7 +154,7 @@ pains to ensure that tasks are completed in the order in which they were
 submitted.
 
 The one remaining function in the padata API should be called to clean up
-when a padata instance is no longer needed:
+when a padata instance is no longer needed::
 
    void padata_free(struct padata_instance *pinst);
 
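The parallel/serial contract above can be modelled in a few lines of user-space code. This is a conceptual sketch of the reordering padata performs (work runs in parallel, but the serial stage sees completions in submission order); it is not the kernel API:

```python
import concurrent.futures

def padata_model(items, parallel_fn, workers=4):
    """Run parallel_fn over items on a thread pool, but deliver results
    to the serial stage strictly in submission order, as padata does."""
    serial_order = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        # ~ padata_do_parallel(): tag each job with a submission sequence number
        fut_to_seq = {pool.submit(parallel_fn, item): seq
                      for seq, item in enumerate(items)}
        pending, next_seq = {}, 0
        for fut in concurrent.futures.as_completed(fut_to_seq):
            pending[fut_to_seq[fut]] = fut.result()
            # ~ padata_do_serial(): release only the next item in order
            while next_seq in pending:
                serial_order.append(pending.pop(next_seq))
                next_seq += 1
    return serial_order

print(padata_model(range(8), lambda x: x * x))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

However the workers race, `serial_order` always matches submission order, which is the guarantee the text describes.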
diff --git a/Documentation/parport-lowlevel.txt b/Documentation/parport-lowlevel.txt
index 120eb20dbb09..0633d70ffda7 100644
--- a/Documentation/parport-lowlevel.txt
+++ b/Documentation/parport-lowlevel.txt
@@ -1,11 +1,12 @@
+===============================
 PARPORT interface documentation
--------------------------------
+===============================
 
-Time-stamp: <2000-02-24 13:30:20 twaugh>
+:Time-stamp: <2000-02-24 13:30:20 twaugh>
 
 Described here are the following functions:
 
-Global functions:
+Global functions::
 	parport_register_driver
 	parport_unregister_driver
 	parport_enumerate
@@ -31,7 +32,8 @@ Global functions:
 	parport_set_timeout
 
 Port functions (can be overridden by low-level drivers):
-	SPP:
+
+	SPP::
 	port->ops->read_data
 	port->ops->write_data
 	port->ops->read_status
@@ -43,23 +45,23 @@ Port functions (can be overridden by low-level drivers):
 	port->ops->data_forward
 	port->ops->data_reverse
 
-	EPP:
+	EPP::
 	port->ops->epp_write_data
 	port->ops->epp_read_data
 	port->ops->epp_write_addr
 	port->ops->epp_read_addr
 
-	ECP:
+	ECP::
 	port->ops->ecp_write_data
 	port->ops->ecp_read_data
 	port->ops->ecp_write_addr
 
-	Other:
+	Other::
 	port->ops->nibble_read_data
 	port->ops->byte_read_data
 	port->ops->compat_write_data
 
-The parport subsystem comprises 'parport' (the core port-sharing
+The parport subsystem comprises ``parport`` (the core port-sharing
 code), and a variety of low-level drivers that actually do the port
 accesses. Each low-level driver handles a particular style of port
 (PC, Amiga, and so on).
@@ -70,14 +72,14 @@ into global functions and port functions.
 The global functions are mostly for communicating between the device
 driver and the parport subsystem: acquiring a list of available ports,
 claiming a port for exclusive use, and so on. They also include
-'generic' functions for doing standard things that will work on any
+``generic`` functions for doing standard things that will work on any
 IEEE 1284-capable architecture.
 
 The port functions are provided by the low-level drivers, although the
-core parport module provides generic 'defaults' for some routines.
+core parport module provides generic ``defaults`` for some routines.
 The port functions can be split into three groups: SPP, EPP, and ECP.
 
-SPP (Standard Parallel Port) functions modify so-called 'SPP'
+SPP (Standard Parallel Port) functions modify so-called ``SPP``
 registers: data, status, and control. The hardware may not actually
 have registers exactly like that, but the PC does and this interface is
 modelled after common PC implementations. Other low-level drivers may
@@ -95,58 +97,63 @@ to cope with peripherals that only tenuously support IEEE 1284, a
low-level driver specific function is provided, for altering 'fudge
factors'.

Global functions
================

parport_register_driver - register a device driver with parport
---------------------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_driver {
        const char *name;
        void (*attach) (struct parport *);
        void (*detach) (struct parport *);
        struct parport_driver *next;
    };
    int parport_register_driver (struct parport_driver *driver);

DESCRIPTION
^^^^^^^^^^^

In order to be notified about parallel ports when they are detected,
parport_register_driver should be called. Your driver will
immediately be notified of all ports that have already been detected,
and of each new port as low-level drivers are loaded.

A ``struct parport_driver`` contains the textual name of your driver,
a pointer to a function to handle new ports, and a pointer to a
function to handle ports going away due to a low-level driver
unloading. Ports will only be detached if they are not being used
(i.e. there are no devices registered on them).

The visible parts of the ``struct parport *`` argument given to
attach/detach are::

    struct parport
    {
        struct parport *next; /* next parport in list */
        const char *name;     /* port's name */
        unsigned int modes;   /* bitfield of hardware modes */
        struct parport_device_info probe_info;
                              /* IEEE1284 info */
        int number;           /* parport index */
        struct parport_operations *ops;
        ...
    };

There are other members of the structure, but they should not be
touched.

The ``modes`` member summarises the capabilities of the underlying
hardware. It consists of flags which may be bitwise-ored together:

  ============================= ===============================================
  PARPORT_MODE_PCSPP            IBM PC registers are available,
                                i.e. functions that act on data,
                                control and status registers are
@@ -169,297 +176,351 @@ hardware. It consists of flags which may be bitwise-ored together:
                                GFP_DMA flag with kmalloc) to the
                                low-level driver in order to take
                                advantage of it.
  ============================= ===============================================

There may be other flags in ``modes`` as well.

The contents of ``modes`` are advisory only. For example, if the
hardware is capable of DMA, and PARPORT_MODE_DMA is in ``modes``, it
doesn't necessarily mean that DMA will always be used when possible.
Similarly, hardware that is capable of assisting ECP transfers won't
necessarily be used.
RETURN VALUE
^^^^^^^^^^^^

Zero on success, otherwise an error code.

ERRORS
^^^^^^

None. (Can it fail? Why return int?)

EXAMPLE
^^^^^^^

::

    static void lp_attach (struct parport *port)
    {
        ...
        private = kmalloc (...);
        dev[count++] = parport_register_device (...);
        ...
    }

    static void lp_detach (struct parport *port)
    {
        ...
    }

    static struct parport_driver lp_driver = {
        "lp",
        lp_attach,
        lp_detach,
        NULL /* always put NULL here */
    };

    int lp_init (void)
    {
        ...
        if (parport_register_driver (&lp_driver)) {
            /* Failed; nothing we can do. */
            return -EIO;
        }
        ...
    }

SEE ALSO
^^^^^^^^

parport_unregister_driver, parport_register_device, parport_enumerate


parport_unregister_driver - tell parport to forget about this driver
--------------------------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_driver {
        const char *name;
        void (*attach) (struct parport *);
        void (*detach) (struct parport *);
        struct parport_driver *next;
    };
    void parport_unregister_driver (struct parport_driver *driver);

DESCRIPTION
^^^^^^^^^^^

This tells parport not to notify the device driver of new ports or of
ports going away. Registered devices belonging to that driver are NOT
unregistered: parport_unregister_device must be used for each one.

EXAMPLE
^^^^^^^

::

    void cleanup_module (void)
    {
        ...
        /* Stop notifications. */
        parport_unregister_driver (&lp_driver);

        /* Unregister devices. */
        for (i = 0; i < NUM_DEVS; i++)
            parport_unregister_device (dev[i]);
        ...
    }

SEE ALSO
^^^^^^^^

parport_register_driver, parport_enumerate


parport_enumerate - retrieve a list of parallel ports (DEPRECATED)
------------------------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport *parport_enumerate (void);

DESCRIPTION
^^^^^^^^^^^

Retrieve the first of a list of valid parallel ports for this machine.
Successive parallel ports can be found using the ``struct parport
*next`` element of the ``struct parport *`` that is returned. If ``next``
is NULL, there are no more parallel ports in the list. The number of
ports in the list will not exceed PARPORT_MAX.

RETURN VALUE
^^^^^^^^^^^^

A ``struct parport *`` describing a valid parallel port for the machine,
or NULL if there are none.

ERRORS
^^^^^^

This function can return NULL to indicate that there are no parallel
ports to use.

EXAMPLE
^^^^^^^

::

    int detect_device (void)
    {
        struct parport *port;

        for (port = parport_enumerate ();
             port != NULL;
             port = port->next) {
            /* Try to detect a device on the port... */
            ...
        }

        ...
    }

NOTES
^^^^^

parport_enumerate is deprecated; parport_register_driver should be
used instead.

SEE ALSO
^^^^^^^^

parport_register_driver, parport_unregister_driver


parport_register_device - register to use a port
------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    typedef int (*preempt_func) (void *handle);
    typedef void (*wakeup_func) (void *handle);
    typedef int (*irq_func) (int irq, void *handle, struct pt_regs *);

    struct pardevice *parport_register_device(struct parport *port,
                                              const char *name,
                                              preempt_func preempt,
                                              wakeup_func wakeup,
                                              irq_func irq,
                                              int flags,
                                              void *handle);

DESCRIPTION
^^^^^^^^^^^

Use this function to register your device driver on a parallel port
(``port``). Once you have done that, you will be able to use
parport_claim and parport_release in order to use the port.

The ``name`` argument is the name of the device that appears in the
/proc filesystem. The string must be valid for the whole lifetime of
the device (until parport_unregister_device is called).

This function will register three callbacks into your driver:
``preempt``, ``wakeup`` and ``irq``. Each of these may be NULL in order to
indicate that you do not want a callback.

When the ``preempt`` function is called, it is because another driver
wishes to use the parallel port. The ``preempt`` function should return
non-zero if the parallel port cannot be released yet -- if zero is
returned, the port is lost to another driver and the port must be
re-claimed before use.

The ``wakeup`` function is called once another driver has released the
port and no other driver has yet claimed it. You can claim the
parallel port from within the ``wakeup`` function (in which case the
claim is guaranteed to succeed), or choose not to if you don't need it
now.

If an interrupt occurs on the parallel port your driver has claimed,
the ``irq`` function will be called. (Write something about shared
interrupts here.)

The ``handle`` is a pointer to driver-specific data, and is passed to
the callback functions.

``flags`` may be a bitwise combination of the following flags:

  ===================== =================================================
  Flag                  Meaning
  ===================== =================================================
  PARPORT_DEV_EXCL      The device cannot share the parallel port at all.
                        Use this only when absolutely necessary.
  ===================== =================================================

The typedefs are not actually defined -- they are only shown in order
to make the function prototype more readable.

The visible parts of the returned ``struct pardevice`` are::

    struct pardevice {
        struct parport *port;   /* Associated port */
        void *private;          /* Device driver's 'handle' */
        ...
    };

RETURN VALUE
^^^^^^^^^^^^

A ``struct pardevice *``: a handle to the registered parallel port
device that can be used for parport_claim, parport_release, etc.

ERRORS
^^^^^^

A return value of NULL indicates that there was a problem registering
a device on that port.

EXAMPLE
^^^^^^^

::

    static int preempt (void *handle)
    {
        if (busy_right_now)
            return 1;

        must_reclaim_port = 1;
        return 0;
    }

    static void wakeup (void *handle)
    {
        struct toaster *private = handle;
        struct pardevice *dev = private->dev;
        if (!dev) return; /* avoid races */

        if (want_port)
            parport_claim (dev);
    }

    static int toaster_detect (struct toaster *private, struct parport *port)
    {
        private->dev = parport_register_device (port, "toaster", preempt,
                                                wakeup, NULL, 0,
                                                private);
        if (!private->dev)
            /* Couldn't register with parport. */
            return -EIO;

        must_reclaim_port = 0;
        busy_right_now = 1;
        parport_claim_or_block (private->dev);
        ...
        /* Don't need the port while the toaster warms up. */
        busy_right_now = 0;
        ...
        busy_right_now = 1;
        if (must_reclaim_port) {
            parport_claim_or_block (private->dev);
            must_reclaim_port = 0;
        }
        ...
    }

SEE ALSO
^^^^^^^^

parport_unregister_device, parport_claim


parport_unregister_device - finish using a port
-----------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    void parport_unregister_device (struct pardevice *dev);

DESCRIPTION
^^^^^^^^^^^

This function is the opposite of parport_register_device. After using
parport_unregister_device, ``dev`` is no longer a valid device handle.

You should not unregister a device that is currently claimed, although
if you do it will be released automatically.

EXAMPLE
^^^^^^^

::

    ...
    kfree (dev->private); /* before we lose the pointer */
@@ -467,460 +528,602 @@ EXAMPLE
    ...

SEE ALSO
^^^^^^^^

parport_unregister_driver

parport_claim, parport_claim_or_block - claim the parallel port for a device
----------------------------------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    int parport_claim (struct pardevice *dev);
    int parport_claim_or_block (struct pardevice *dev);

DESCRIPTION
^^^^^^^^^^^

These functions attempt to gain control of the parallel port on which
``dev`` is registered. ``parport_claim`` does not block, but
``parport_claim_or_block`` may do. (Put something here about blocking
interruptibly or non-interruptibly.)

You should not try to claim a port that you have already claimed.

RETURN VALUE
^^^^^^^^^^^^

A return value of zero indicates that the port was successfully
claimed, and the caller now has possession of the parallel port.

If ``parport_claim_or_block`` blocks before returning successfully, the
return value is positive.

ERRORS
^^^^^^

========== ==========================================================
 -EAGAIN   The port is unavailable at the moment, but another attempt
           to claim it may succeed.
========== ==========================================================

SEE ALSO
^^^^^^^^

parport_release
parport_release - release the parallel port
-------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    void parport_release (struct pardevice *dev);

DESCRIPTION
^^^^^^^^^^^

Once a parallel port device has been claimed, it can be released using
``parport_release``. It cannot fail, but you should not release a
device that you do not have possession of.

EXAMPLE
^^^^^^^

::

    static size_t write (struct pardevice *dev, const void *buf,
                         size_t len)
    {
        ...
        written = dev->port->ops->write_ecp_data (dev->port, buf,
                                                  len);
        parport_release (dev);
        ...
    }


SEE ALSO
^^^^^^^^

change_mode, parport_claim, parport_claim_or_block, parport_yield


parport_yield, parport_yield_blocking - temporarily release a parallel port
---------------------------------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    int parport_yield (struct pardevice *dev)
    int parport_yield_blocking (struct pardevice *dev);

DESCRIPTION
^^^^^^^^^^^

When a driver has control of a parallel port, it may allow another
driver to temporarily ``borrow`` it. ``parport_yield`` does not block;
``parport_yield_blocking`` may do.

RETURN VALUE
^^^^^^^^^^^^

A return value of zero indicates that the caller still owns the port
and the call did not block.

A positive return value from ``parport_yield_blocking`` indicates that
the caller still owns the port and the call blocked.

A return value of -EAGAIN indicates that the caller no longer owns the
port, and it must be re-claimed before use.

ERRORS
^^^^^^

========= ==========================================================
 -EAGAIN  Ownership of the parallel port was given away.
========= ==========================================================

SEE ALSO
^^^^^^^^

parport_release


parport_wait_peripheral - wait for status lines, up to 35ms
-----------------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    int parport_wait_peripheral (struct parport *port,
                                 unsigned char mask,
                                 unsigned char val);

DESCRIPTION
^^^^^^^^^^^

Wait for the status lines in mask to match the values in val.

RETURN VALUE
^^^^^^^^^^^^

======== ==========================================================
 -EINTR  a signal is pending
  0      the status lines in mask have values in val
  1      timed out while waiting (35ms elapsed)
======== ==========================================================

SEE ALSO
^^^^^^^^

parport_poll_peripheral


parport_poll_peripheral - wait for status lines, in usec
--------------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    int parport_poll_peripheral (struct parport *port,
                                 unsigned char mask,
                                 unsigned char val,
                                 int usec);

DESCRIPTION
^^^^^^^^^^^

Wait for the status lines in mask to match the values in val.

RETURN VALUE
^^^^^^^^^^^^

======== ==========================================================
 -EINTR  a signal is pending
  0      the status lines in mask have values in val
  1      timed out while waiting (usec microseconds have elapsed)
======== ==========================================================

SEE ALSO
^^^^^^^^

parport_wait_peripheral


parport_wait_event - wait for an event on a port
------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    int parport_wait_event (struct parport *port, signed long timeout)

DESCRIPTION
^^^^^^^^^^^

Wait for an event (e.g. interrupt) on a port. The timeout is in
jiffies.

RETURN VALUE
^^^^^^^^^^^^

======= ==========================================================
 0      success
 <0     error (exit as soon as possible)
 >0     timed out
======= ==========================================================

parport_negotiate - perform IEEE 1284 negotiation
-------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    int parport_negotiate (struct parport *, int mode);

DESCRIPTION
^^^^^^^^^^^

Perform IEEE 1284 negotiation.

RETURN VALUE
^^^^^^^^^^^^

======= ==========================================================
 0      handshake OK; IEEE 1284 peripheral and mode available
 -1     handshake failed; peripheral not compliant (or none present)
 1      handshake OK; IEEE 1284 peripheral present but mode not
        available
======= ==========================================================

SEE ALSO
^^^^^^^^

parport_read, parport_write


parport_read - read data from device
------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    ssize_t parport_read (struct parport *, void *buf, size_t len);

DESCRIPTION
^^^^^^^^^^^

Read data from device in current IEEE 1284 transfer mode. This only
works for modes that support reverse data transfer.

RETURN VALUE
^^^^^^^^^^^^

If negative, an error code; otherwise the number of bytes transferred.

SEE ALSO
^^^^^^^^

parport_write, parport_negotiate


parport_write - write data to device
------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    ssize_t parport_write (struct parport *, const void *buf, size_t len);

DESCRIPTION
^^^^^^^^^^^

Write data to device in current IEEE 1284 transfer mode.  This only
works for modes that support forward data transfer.

RETURN VALUE
^^^^^^^^^^^^

If negative, an error code; otherwise the number of bytes transferred.

SEE ALSO
^^^^^^^^

parport_read, parport_negotiate


parport_open - register device for particular device number
-----------------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct pardevice *parport_open (int devnum, const char *name,
                                    int (*pf) (void *),
                                    void (*kf) (void *),
                                    void (*irqf) (int, void *,
                                                  struct pt_regs *),
                                    int flags, void *handle);

DESCRIPTION
^^^^^^^^^^^

This is like parport_register_device but takes a device number instead
of a pointer to a struct parport.

RETURN VALUE
^^^^^^^^^^^^

See parport_register_device.  If no device is associated with devnum,
NULL is returned.

SEE ALSO
^^^^^^^^

parport_register_device


parport_close - unregister device for particular device number
--------------------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    void parport_close (struct pardevice *dev);

DESCRIPTION
^^^^^^^^^^^

This is the equivalent of parport_unregister_device for parport_open.

SEE ALSO
^^^^^^^^

parport_unregister_device, parport_open


parport_device_id - obtain IEEE 1284 Device ID
----------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    ssize_t parport_device_id (int devnum, char *buffer, size_t len);

DESCRIPTION
^^^^^^^^^^^

Obtains the IEEE 1284 Device ID associated with a given device.

RETURN VALUE
^^^^^^^^^^^^

If negative, an error code; otherwise, the number of bytes of buffer
that contain the device ID.  The format of the device ID is as
follows::

    [length][ID]

The first two bytes indicate the inclusive length of the entire Device
ID, and are in big-endian order.  The ID is a sequence of pairs of the
form::

    key:value;

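The layout described above can be decoded in ordinary C.  The sketch
below is a user-space illustration, not part of the parport API; the
sample buffer contents (MFG/MDL/CLS values) are made up for the
example::

    #include <stdio.h>
    #include <string.h>

    /* Extract the inclusive length from the first two bytes,
     * which the text above says are in big-endian order. */
    static size_t device_id_length(const unsigned char *buffer)
    {
        return ((size_t) buffer[0] << 8) | buffer[1];
    }

    int main(void)
    {
        /* Hypothetical Device ID: length 0x0022 (34, counting the two
         * length bytes themselves) followed by key:value; pairs. */
        const unsigned char id[] = "\x00\x22"
            "MFG:ACME;MDL:Widget;CLS:PRINTER;";
        size_t total = device_id_length(id);
        const char *p = (const char *) id + 2;
        const char *end = (const char *) id + total;

        printf("inclusive length: %zu\n", total);

        /* Walk the key:value; pairs, one per ';' terminator. */
        while (p < end) {
            const char *sep = memchr(p, ';', end - p);
            if (!sep)
                break;
            printf("pair: %.*s\n", (int) (sep - p), p);
            p = sep + 1;
        }
        return 0;
    }

Note that the length is inclusive: 2 bytes of header plus 32 bytes of
pairs gives 34 (0x0022) in the example.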
NOTES
^^^^^

Many devices have ill-formed IEEE 1284 Device IDs.

SEE ALSO
^^^^^^^^

parport_find_class, parport_find_device


parport_device_coords - convert device number to device coordinates
-------------------------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    int parport_device_coords (int devnum, int *parport, int *mux,
                               int *daisy);

DESCRIPTION
^^^^^^^^^^^

Convert between device number (zero-based) and device coordinates
(port, multiplexor, daisy chain address).

RETURN VALUE
^^^^^^^^^^^^

Zero on success, in which case the coordinates are (``*parport``,
``*mux``, ``*daisy``).

SEE ALSO
^^^^^^^^

parport_open, parport_device_id


parport_find_class - find a device by its class
-----------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    typedef enum {
        PARPORT_CLASS_LEGACY = 0,       /* Non-IEEE1284 device */
        PARPORT_CLASS_PRINTER,
        PARPORT_CLASS_MODEM,
        PARPORT_CLASS_NET,
        PARPORT_CLASS_HDC,              /* Hard disk controller */
        PARPORT_CLASS_PCMCIA,
        PARPORT_CLASS_MEDIA,            /* Multimedia device */
        PARPORT_CLASS_FDC,              /* Floppy disk controller */
        PARPORT_CLASS_PORTS,
        PARPORT_CLASS_SCANNER,
        PARPORT_CLASS_DIGCAM,
        PARPORT_CLASS_OTHER,            /* Anything else */
        PARPORT_CLASS_UNSPEC,           /* No CLS field in ID */
        PARPORT_CLASS_SCSIADAPTER
    } parport_device_class;

    int parport_find_class (parport_device_class cls, int from);

DESCRIPTION
^^^^^^^^^^^

Find a device by class.  The search starts from device number from+1.

RETURN VALUE
^^^^^^^^^^^^

The device number of the next device in that class, or -1 if no such
device exists.

NOTES
^^^^^

Example usage::

    int devnum = -1;
    while ((devnum = parport_find_class (PARPORT_CLASS_DIGCAM, devnum)) != -1) {
        struct pardevice *dev = parport_open (devnum, ...);
        ...
    }

SEE ALSO
^^^^^^^^

parport_find_device, parport_open, parport_device_id


parport_find_device - find a device by vendor and model
-------------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    int parport_find_device (const char *mfg, const char *mdl, int from);

DESCRIPTION
^^^^^^^^^^^

Find a device by vendor and model.  The search starts from device
number from+1.

RETURN VALUE
^^^^^^^^^^^^

The device number of the next device matching the specifications, or
-1 if no such device exists.

NOTES
^^^^^

Example usage::

    int devnum = -1;
    while ((devnum = parport_find_device ("IOMEGA", "ZIP+", devnum)) != -1) {
        struct pardevice *dev = parport_open (devnum, ...);
        ...
    }

SEE ALSO
^^^^^^^^

parport_find_class, parport_open, parport_device_id


parport_set_timeout - set the inactivity timeout
------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    long parport_set_timeout (struct pardevice *dev, long inactivity);

DESCRIPTION
^^^^^^^^^^^

Set the inactivity timeout, in jiffies, for a registered device.  The
previous timeout is returned.

RETURN VALUE
^^^^^^^^^^^^

The previous timeout, in jiffies.

NOTES
^^^^^

Some of the port->ops functions for a parport may take time, owing to
delays at the peripheral.  After the peripheral has not responded for
``inactivity`` jiffies, a timeout will occur and the blocking function
will return.

A timeout of 0 jiffies is a special case: the function must do as much
as it can without blocking.

Once set for a registered device, the timeout will remain at the set
value until set again.

SEE ALSO
^^^^^^^^

port->ops->xxx_read/write_yyy




PORT FUNCTIONS
==============

The functions in the port->ops structure (struct parport_operations)
are provided by the low-level driver responsible for that port.

port->ops->read_data - read the data register
---------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        unsigned char (*read_data) (struct parport *port);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

If port->modes contains the PARPORT_MODE_TRISTATE flag and the
PARPORT_CONTROL_DIRECTION bit in the control register is set, this
returns the value on the data pins.  If the DIRECTION bit is
not set, the return value _may_ be the last value written to the data
register.  Otherwise the return value is undefined.

SEE ALSO
^^^^^^^^

write_data, read_status, write_control


port->ops->write_data - write the data register
-----------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        void (*write_data) (struct parport *port, unsigned char d);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

Writes to the data register.  May have side-effects (a STROBE pulse,
for instance).

SEE ALSO
^^^^^^^^

read_data, read_status, write_control


port->ops->read_status - read the status register
-------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        unsigned char (*read_status) (struct parport *port);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

Reads from the status register.  This is a bitmask:

- PARPORT_STATUS_ERROR
- PARPORT_STATUS_SELECT
- PARPORT_STATUS_PAPEROUT
- PARPORT_STATUS_ACK
- PARPORT_STATUS_BUSY

There may be other bits set.

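A status byte can be decoded with ordinary bit tests.  The sketch
below is a user-space illustration; the bit values are assumed from
include/uapi/linux/parport.h and should be checked against your
kernel headers::

    #include <stdio.h>

    /* Assumed values, taken from include/uapi/linux/parport.h. */
    #define PARPORT_STATUS_ERROR    0x8
    #define PARPORT_STATUS_SELECT   0x10
    #define PARPORT_STATUS_PAPEROUT 0x20
    #define PARPORT_STATUS_ACK      0x40
    #define PARPORT_STATUS_BUSY     0x80

    /* Print each status bit of a byte read from the port. */
    static void decode_status(unsigned char s)
    {
        printf("error=%d select=%d paperout=%d ack=%d busy=%d\n",
               !!(s & PARPORT_STATUS_ERROR),
               !!(s & PARPORT_STATUS_SELECT),
               !!(s & PARPORT_STATUS_PAPEROUT),
               !!(s & PARPORT_STATUS_ACK),
               !!(s & PARPORT_STATUS_BUSY));
    }

    int main(void)
    {
        decode_status(0x90);    /* hypothetical reading: select + busy */
        return 0;
    }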
SEE ALSO
^^^^^^^^

read_data, write_data, write_control


port->ops->read_control - read the control register
---------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        unsigned char (*read_control) (struct parport *port);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

Returns the last value written to the control register (either from
write_control or frob_control).  No port access is performed.

SEE ALSO
^^^^^^^^

read_data, write_data, read_status, write_control


port->ops->write_control - write the control register
-----------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        void (*write_control) (struct parport *port, unsigned char s);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

Writes to the control register.  This is a bitmask::

                              _______
    - PARPORT_CONTROL_STROBE (nStrobe)
                              _______
    - PARPORT_CONTROL_AUTOFD (nAutoFd)
                            _____
    - PARPORT_CONTROL_INIT (nInit)
                              _________
    - PARPORT_CONTROL_SELECT (nSelectIn)

SEE ALSO
^^^^^^^^

read_data, write_data, read_status, frob_control


port->ops->frob_control - write control register bits
-----------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        unsigned char (*frob_control) (struct parport *port,
                                       unsigned char mask,
                                       unsigned char val);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

This is equivalent to reading from the control register, masking out
the bits in mask, exclusive-or'ing with the bits in val, and writing
the result back to the control register.  Since some ports do not
allow reads from the control register, a soft copy
of its contents is maintained, so frob_control is in fact only one
port access.

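The mask/XOR semantics and the soft-copy trick can be sketched in a
small user-space mock.  This is not kernel code: ``soft`` stands in
for the driver's soft copy, ``write_control`` for the single hardware
write, and the return value is assumed to be the updated soft copy::

    #include <assert.h>
    #include <stdio.h>

    static unsigned char soft;      /* soft copy of the control register */
    static unsigned char hw;        /* stand-in for the hardware register */

    static void write_control(unsigned char s)
    {
        hw = s;                     /* the one real port access */
    }

    static unsigned char frob_control(unsigned char mask, unsigned char val)
    {
        /* Mask out the bits in mask, XOR in the bits in val. */
        soft = (soft & ~mask) ^ val;
        write_control(soft);
        return soft;
    }

    int main(void)
    {
        soft = 0x0c;                /* arbitrary starting state */
        /* Clear the low nibble, then XOR in 0x02. */
        unsigned char r = frob_control(0x0f, 0x02);
        printf("control is now 0x%02x\n", r);
        assert(hw == 0x02);
        return 0;
    }

Because no hardware read is involved, the whole operation costs a
single port write, as the description says.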
SEE ALSO
^^^^^^^^

read_data, write_data, read_status, write_control


port->ops->enable_irq - enable interrupt generation
---------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        void (*enable_irq) (struct parport *port);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

The parallel port hardware is instructed to generate interrupts at
appropriate moments, although those moments are
architecture-specific.  For the PC architecture, interrupts are
commonly generated on the rising edge of nAck.

SEE ALSO
^^^^^^^^

disable_irq


port->ops->disable_irq - disable interrupt generation
-----------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        void (*disable_irq) (struct parport *port);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

The parallel port hardware is instructed not to generate interrupts.
The interrupt itself is not masked.

SEE ALSO
^^^^^^^^

enable_irq


port->ops->data_forward - enable data drivers
---------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        void (*data_forward) (struct parport *port);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

Enables the data line drivers, for 8-bit host-to-peripheral
communications.

SEE ALSO
^^^^^^^^

data_reverse


port->ops->data_reverse - tristate the buffer
---------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        void (*data_reverse) (struct parport *port);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

Places the data bus in a high impedance state, if port->modes has the
PARPORT_MODE_TRISTATE bit set.

SEE ALSO
^^^^^^^^

data_forward


port->ops->epp_write_data - write EPP data
------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        size_t (*epp_write_data) (struct parport *port, const void *buf,
                                  size_t len, int flags);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

Writes data in EPP mode, and returns the number of bytes written.

The ``flags`` parameter may be one or more of the following,
bitwise-or'ed together:

======================= =================================================
PARPORT_EPP_FAST        Use fast transfers.  Some chips provide 16-bit
                        and 32-bit registers.  However, if a transfer
                        times out, the return value may be unreliable.
======================= =================================================

SEE ALSO
^^^^^^^^

epp_read_data, epp_write_addr, epp_read_addr


port->ops->epp_read_data - read EPP data
----------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        size_t (*epp_read_data) (struct parport *port, void *buf,
                                 size_t len, int flags);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

Reads data in EPP mode, and returns the number of bytes read.

The ``flags`` parameter may be one or more of the following,
bitwise-or'ed together:

======================= =================================================
PARPORT_EPP_FAST        Use fast transfers.  Some chips provide 16-bit
                        and 32-bit registers.  However, if a transfer
                        times out, the return value may be unreliable.
======================= =================================================

SEE ALSO
^^^^^^^^

epp_write_data, epp_write_addr, epp_read_addr


port->ops->epp_write_addr - write EPP address
---------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        size_t (*epp_write_addr) (struct parport *port,
                                  const void *buf, size_t len, int flags);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

Writes EPP addresses (8 bits each), and returns the number written.

The ``flags`` parameter may be one or more of the following,
bitwise-or'ed together:

======================= =================================================
PARPORT_EPP_FAST        Use fast transfers.  Some chips provide 16-bit
                        and 32-bit registers.  However, if a transfer
                        times out, the return value may be unreliable.
======================= =================================================

(Does PARPORT_EPP_FAST make sense for this function?)

SEE ALSO
^^^^^^^^

epp_write_data, epp_read_data, epp_read_addr


port->ops->epp_read_addr - read EPP address
-------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        size_t (*epp_read_addr) (struct parport *port, void *buf,
                                 size_t len, int flags);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

Reads EPP addresses (8 bits each), and returns the number read.

The ``flags`` parameter may be one or more of the following,
bitwise-or'ed together:

======================= =================================================
PARPORT_EPP_FAST        Use fast transfers.  Some chips provide 16-bit
                        and 32-bit registers.  However, if a transfer
                        times out, the return value may be unreliable.
======================= =================================================

(Does PARPORT_EPP_FAST make sense for this function?)

SEE ALSO
^^^^^^^^

epp_write_data, epp_read_data, epp_write_addr


port->ops->ecp_write_data - write a block of ECP data
-----------------------------------------------------

SYNOPSIS
^^^^^^^^

::

    #include <linux/parport.h>

    struct parport_operations {
        ...
        size_t (*ecp_write_data) (struct parport *port,
                                  const void *buf, size_t len, int flags);
        ...
    };

DESCRIPTION
^^^^^^^^^^^

Writes a block of ECP data.  The ``flags`` parameter is ignored.

RETURN VALUE
^^^^^^^^^^^^

The number of bytes written.

SEE ALSO
^^^^^^^^

ecp_read_data, ecp_write_addr


1337port->ops->ecp_read_data - read a block of ECP data 1659port->ops->ecp_read_data - read a block of ECP data
1338------------------------ 1660---------------------------------------------------
1339 1661
1340SYNOPSIS 1662SYNOPSIS
1663^^^^^^^^
1341 1664
1342#include <linux/parport.h> 1665::
1343 1666
1344struct parport_operations { 1667 #include <linux/parport.h>
1345 ... 1668
1346 size_t (*ecp_read_data) (struct parport *port, 1669 struct parport_operations {
1347 void *buf, size_t len, int flags); 1670 ...
1348 ... 1671 size_t (*ecp_read_data) (struct parport *port,
1349}; 1672 void *buf, size_t len, int flags);
1673 ...
1674 };
1350 1675
1351DESCRIPTION 1676DESCRIPTION
1677^^^^^^^^^^^
1352 1678
1353Reads a block of ECP data. The 'flags' parameter is ignored. 1679Reads a block of ECP data. The ``flags`` parameter is ignored.
1354 1680
1355RETURN VALUE 1681RETURN VALUE
1682^^^^^^^^^^^^
1356 1683
1357The number of bytes read. NB. There may be more unread data in a 1684The number of bytes read. NB. There may be more unread data in a
1358FIFO. Is there a way of stunning the FIFO to prevent this? 1685FIFO. Is there a way of stunning the FIFO to prevent this?
1359 1686
1360SEE ALSO 1687SEE ALSO
1688^^^^^^^^
1361 1689
1362ecp_write_block, ecp_write_addr 1690ecp_write_block, ecp_write_addr
1363 1691
1692
1693
1364port->ops->ecp_write_addr - write a block of ECP addresses 1694port->ops->ecp_write_addr - write a block of ECP addresses
1365------------------------- 1695----------------------------------------------------------
1366 1696
1367SYNOPSIS 1697SYNOPSIS
1698^^^^^^^^
1368 1699
1369#include <linux/parport.h> 1700::
1370 1701
1371struct parport_operations { 1702 #include <linux/parport.h>
1372 ... 1703
1373 size_t (*ecp_write_addr) (struct parport *port, 1704 struct parport_operations {
1374 const void *buf, size_t len, int flags); 1705 ...
1375 ... 1706 size_t (*ecp_write_addr) (struct parport *port,
1376}; 1707 const void *buf, size_t len, int flags);
1708 ...
1709 };
1377 1710
1378DESCRIPTION 1711DESCRIPTION
1712^^^^^^^^^^^
1379 1713
1380Writes a block of ECP addresses. The 'flags' parameter is ignored. 1714Writes a block of ECP addresses. The ``flags`` parameter is ignored.
1381 1715
1382RETURN VALUE 1716RETURN VALUE
1717^^^^^^^^^^^^
1383 1718
1384The number of bytes written. 1719The number of bytes written.
1385 1720
1386NOTES 1721NOTES
1722^^^^^
1387 1723
1388This may use a FIFO, and if so shall not return until the FIFO is empty. 1724This may use a FIFO, and if so shall not return until the FIFO is empty.
1389 1725
1390SEE ALSO 1726SEE ALSO
1727^^^^^^^^
1391 1728
1392ecp_read_data, ecp_write_data 1729ecp_read_data, ecp_write_data
1393 1730
1731
1732
1394port->ops->nibble_read_data - read a block of data in nibble mode 1733port->ops->nibble_read_data - read a block of data in nibble mode
1395--------------------------- 1734-----------------------------------------------------------------
1396 1735
1397SYNOPSIS 1736SYNOPSIS
1737^^^^^^^^
1398 1738
1399#include <linux/parport.h> 1739::
1400 1740
1401struct parport_operations { 1741 #include <linux/parport.h>
1402 ... 1742
1403 size_t (*nibble_read_data) (struct parport *port, 1743 struct parport_operations {
1404 void *buf, size_t len, int flags); 1744 ...
1405 ... 1745 size_t (*nibble_read_data) (struct parport *port,
1406}; 1746 void *buf, size_t len, int flags);
1747 ...
1748 };
1407 1749
1408DESCRIPTION 1750DESCRIPTION
1751^^^^^^^^^^^
1409 1752
1410Reads a block of data in nibble mode. The 'flags' parameter is ignored. 1753Reads a block of data in nibble mode. The ``flags`` parameter is ignored.
1411 1754
1412RETURN VALUE 1755RETURN VALUE
1756^^^^^^^^^^^^
1413 1757
1414The number of whole bytes read. 1758The number of whole bytes read.
1415 1759
1416SEE ALSO 1760SEE ALSO
1761^^^^^^^^
1417 1762
1418byte_read_data, compat_write_data 1763byte_read_data, compat_write_data
1764
1765
1419 1766
1420port->ops->byte_read_data - read a block of data in byte mode 1767port->ops->byte_read_data - read a block of data in byte mode
1421------------------------- 1768-------------------------------------------------------------
1422 1769
1423SYNOPSIS 1770SYNOPSIS
1771^^^^^^^^
1424 1772
1425#include <linux/parport.h> 1773::
1426 1774
1427struct parport_operations { 1775 #include <linux/parport.h>
1428 ... 1776
1429 size_t (*byte_read_data) (struct parport *port, 1777 struct parport_operations {
1430 void *buf, size_t len, int flags); 1778 ...
1431 ... 1779 size_t (*byte_read_data) (struct parport *port,
1432}; 1780 void *buf, size_t len, int flags);
1781 ...
1782 };
1433 1783
1434DESCRIPTION 1784DESCRIPTION
1785^^^^^^^^^^^
1435 1786
1436Reads a block of data in byte mode. The 'flags' parameter is ignored. 1787Reads a block of data in byte mode. The ``flags`` parameter is ignored.
1437 1788
1438RETURN VALUE 1789RETURN VALUE
1790^^^^^^^^^^^^
1439 1791
1440The number of bytes read. 1792The number of bytes read.
1441 1793
1442SEE ALSO 1794SEE ALSO
1795^^^^^^^^
1443 1796
1444nibble_read_data, compat_write_data 1797nibble_read_data, compat_write_data
1798
1799
1445 1800
1446port->ops->compat_write_data - write a block of data in compatibility mode 1801port->ops->compat_write_data - write a block of data in compatibility mode
1447---------------------------- 1802--------------------------------------------------------------------------
1448 1803
1449SYNOPSIS 1804SYNOPSIS
1805^^^^^^^^
1450 1806
1451#include <linux/parport.h> 1807::
1452 1808
1453struct parport_operations { 1809 #include <linux/parport.h>
1454 ... 1810
1455 size_t (*compat_write_data) (struct parport *port, 1811 struct parport_operations {
1456 const void *buf, size_t len, int flags); 1812 ...
1457 ... 1813 size_t (*compat_write_data) (struct parport *port,
1458}; 1814 const void *buf, size_t len, int flags);
1815 ...
1816 };
1459 1817
1460DESCRIPTION 1818DESCRIPTION
1819^^^^^^^^^^^
1461 1820
1462Writes a block of data in compatibility mode. The 'flags' parameter 1821Writes a block of data in compatibility mode. The ``flags`` parameter
1463is ignored. 1822is ignored.
1464 1823
1465RETURN VALUE 1824RETURN VALUE
1825^^^^^^^^^^^^
1466 1826
1467The number of bytes written. 1827The number of bytes written.
1468 1828
1469SEE ALSO 1829SEE ALSO
1830^^^^^^^^
1470 1831
1471nibble_read_data, byte_read_data 1832nibble_read_data, byte_read_data
diff --git a/Documentation/percpu-rw-semaphore.txt b/Documentation/percpu-rw-semaphore.txt
index 7d3c82431909..247de6410855 100644
--- a/Documentation/percpu-rw-semaphore.txt
+++ b/Documentation/percpu-rw-semaphore.txt
@@ -1,5 +1,6 @@
+====================
 Percpu rw semaphores
---------------------
+====================
 
 Percpu rw semaphores is a new read-write semaphore design that is
 optimized for locking for reading.
diff --git a/Documentation/phy.txt b/Documentation/phy.txt
index 383cdd863f08..457c3e0f86d6 100644
--- a/Documentation/phy.txt
+++ b/Documentation/phy.txt
@@ -1,10 +1,14 @@
-			PHY SUBSYSTEM
-		  Kishon Vijay Abraham I <kishon@ti.com>
+=============
+PHY subsystem
+=============
+
+:Author: Kishon Vijay Abraham I <kishon@ti.com>
 
 This document explains the Generic PHY Framework along with the APIs provided,
 and how-to-use.
 
-1. Introduction
+Introduction
+============
 
 *PHY* is the abbreviation for physical layer. It is used to connect a device
 to the physical medium e.g., the USB controller has a PHY to provide functions
@@ -21,7 +25,8 @@ better code maintainability.
 This framework will be of use only to devices that use external PHY (PHY
 functionality is not embedded within the controller).
 
-2. Registering/Unregistering the PHY provider
+Registering/Unregistering the PHY provider
+==========================================
 
 PHY provider refers to an entity that implements one or more PHY instances.
 For the simple case where the PHY provider implements only a single instance of
@@ -30,11 +35,14 @@ of_phy_simple_xlate. If the PHY provider implements multiple instances, it
 should provide its own implementation of of_xlate. of_xlate is used only for
 dt boot case.
 
-#define of_phy_provider_register(dev, xlate)	\
-	__of_phy_provider_register((dev), NULL, THIS_MODULE, (xlate))
+::
+
+	#define of_phy_provider_register(dev, xlate)	\
+		__of_phy_provider_register((dev), NULL, THIS_MODULE, (xlate))
 
-#define devm_of_phy_provider_register(dev, xlate)	\
-	__devm_of_phy_provider_register((dev), NULL, THIS_MODULE, (xlate))
+	#define devm_of_phy_provider_register(dev, xlate)	\
+		__devm_of_phy_provider_register((dev), NULL, THIS_MODULE,
+						(xlate))
 
 of_phy_provider_register and devm_of_phy_provider_register macros can be used to
 register the phy_provider and it takes device and of_xlate as
@@ -47,28 +55,35 @@ nodes within extra levels for context and extensibility, in which case the low
 level of_phy_provider_register_full() and devm_of_phy_provider_register_full()
 macros can be used to override the node containing the children.
 
-#define of_phy_provider_register_full(dev, children, xlate) \
-	__of_phy_provider_register(dev, children, THIS_MODULE, xlate)
+::
+
+	#define of_phy_provider_register_full(dev, children, xlate) \
+		__of_phy_provider_register(dev, children, THIS_MODULE, xlate)
 
-#define devm_of_phy_provider_register_full(dev, children, xlate) \
-	__devm_of_phy_provider_register_full(dev, children, THIS_MODULE, xlate)
+	#define devm_of_phy_provider_register_full(dev, children, xlate) \
+		__devm_of_phy_provider_register_full(dev, children,
+						     THIS_MODULE, xlate)
 
-void devm_of_phy_provider_unregister(struct device *dev,
-	struct phy_provider *phy_provider);
-void of_phy_provider_unregister(struct phy_provider *phy_provider);
+	void devm_of_phy_provider_unregister(struct device *dev,
+		struct phy_provider *phy_provider);
+	void of_phy_provider_unregister(struct phy_provider *phy_provider);
 
 devm_of_phy_provider_unregister and of_phy_provider_unregister can be used to
 unregister the PHY.
 
-3. Creating the PHY
+Creating the PHY
+================
 
 The PHY driver should create the PHY in order for other peripheral controllers
 to make use of it. The PHY framework provides 2 APIs to create the PHY.
 
-struct phy *phy_create(struct device *dev, struct device_node *node,
-		       const struct phy_ops *ops);
-struct phy *devm_phy_create(struct device *dev, struct device_node *node,
-			    const struct phy_ops *ops);
+::
+
+	struct phy *phy_create(struct device *dev, struct device_node *node,
+			       const struct phy_ops *ops);
+	struct phy *devm_phy_create(struct device *dev,
+				    struct device_node *node,
+				    const struct phy_ops *ops);
 
 The PHY drivers can use one of the above 2 APIs to create the PHY by passing
 the device pointer and phy ops.
@@ -84,12 +99,16 @@ phy_ops to get back the private data.
 Before the controller can make use of the PHY, it has to get a reference to
 it. This framework provides the following APIs to get a reference to the PHY.
 
-struct phy *phy_get(struct device *dev, const char *string);
-struct phy *phy_optional_get(struct device *dev, const char *string);
-struct phy *devm_phy_get(struct device *dev, const char *string);
-struct phy *devm_phy_optional_get(struct device *dev, const char *string);
-struct phy *devm_of_phy_get_by_index(struct device *dev, struct device_node *np,
-				     int index);
+::
+
+	struct phy *phy_get(struct device *dev, const char *string);
+	struct phy *phy_optional_get(struct device *dev, const char *string);
+	struct phy *devm_phy_get(struct device *dev, const char *string);
+	struct phy *devm_phy_optional_get(struct device *dev,
+					  const char *string);
+	struct phy *devm_of_phy_get_by_index(struct device *dev,
+					     struct device_node *np,
+					     int index);
 
 phy_get, phy_optional_get, devm_phy_get and devm_phy_optional_get can
 be used to get the PHY. In the case of dt boot, the string arguments
@@ -111,30 +130,35 @@ the phy_init() and phy_exit() calls, and phy_power_on() and
 phy_power_off() calls are all NOP when applied to a NULL phy. The NULL
 phy is useful in devices for handling optional phy devices.
 
-5. Releasing a reference to the PHY
+Releasing a reference to the PHY
+================================
 
 When the controller no longer needs the PHY, it has to release the reference
 to the PHY it has obtained using the APIs mentioned in the above section. The
 PHY framework provides 2 APIs to release a reference to the PHY.
 
-void phy_put(struct phy *phy);
-void devm_phy_put(struct device *dev, struct phy *phy);
+::
+
+	void phy_put(struct phy *phy);
+	void devm_phy_put(struct device *dev, struct phy *phy);
 
 Both these APIs are used to release a reference to the PHY and devm_phy_put
 destroys the devres associated with this PHY.
 
-6. Destroying the PHY
+Destroying the PHY
+==================
 
 When the driver that created the PHY is unloaded, it should destroy the PHY it
-created using one of the following 2 APIs.
+created using one of the following 2 APIs::
 
-void phy_destroy(struct phy *phy);
-void devm_phy_destroy(struct device *dev, struct phy *phy);
+	void phy_destroy(struct phy *phy);
+	void devm_phy_destroy(struct device *dev, struct phy *phy);
 
 Both these APIs destroy the PHY and devm_phy_destroy destroys the devres
 associated with this PHY.
 
-7. PM Runtime
+PM Runtime
+==========
 
 This subsystem is pm runtime enabled. So while creating the PHY,
 pm_runtime_enable of the phy device created by this subsystem is called and
@@ -150,7 +174,8 @@ There are exported APIs like phy_pm_runtime_get, phy_pm_runtime_get_sync,
 phy_pm_runtime_put, phy_pm_runtime_put_sync, phy_pm_runtime_allow and
 phy_pm_runtime_forbid for performing PM operations.
 
-8. PHY Mappings
+PHY Mappings
+============
 
 In order to get reference to a PHY without help from DeviceTree, the framework
 offers lookups which can be compared to clkdev that allow clk structures to be
@@ -158,12 +183,15 @@ bound to devices. A lookup can be made be made during runtime when a handle to
 the struct phy already exists.
 
 The framework offers the following API for registering and unregistering the
-lookups.
+lookups::
 
-int phy_create_lookup(struct phy *phy, const char *con_id, const char *dev_id);
-void phy_remove_lookup(struct phy *phy, const char *con_id, const char *dev_id);
+	int phy_create_lookup(struct phy *phy, const char *con_id,
+			      const char *dev_id);
+	void phy_remove_lookup(struct phy *phy, const char *con_id,
+			       const char *dev_id);
 
-9. DeviceTree Binding
+DeviceTree Binding
+==================
 
 The documentation for PHY dt binding can be found @
 Documentation/devicetree/bindings/phy/phy-bindings.txt
diff --git a/Documentation/pi-futex.txt b/Documentation/pi-futex.txt
index 9a5bc8651c29..aafddbee7377 100644
--- a/Documentation/pi-futex.txt
+++ b/Documentation/pi-futex.txt
@@ -1,5 +1,6 @@
+======================
 Lightweight PI-futexes
-----------------------
+======================
 
 We are calling them lightweight for 3 reasons:
 
@@ -25,8 +26,8 @@ determinism and well-bound latencies. Even in the worst-case, PI will
 improve the statistical distribution of locking related application
 delays.
 
-The longer reply:
------------------
+The longer reply
+----------------
 
 Firstly, sharing locks between multiple tasks is a common programming
 technique that often cannot be replaced with lockless algorithms. As we
@@ -71,8 +72,8 @@ deterministic execution of the high-prio task: any medium-priority task
 could preempt the low-prio task while it holds the shared lock and
 executes the critical section, and could delay it indefinitely.
 
-Implementation:
----------------
+Implementation
+--------------
 
 As mentioned before, the userspace fastpath of PI-enabled pthread
 mutexes involves no kernel work at all - they behave quite similarly to
@@ -83,8 +84,8 @@ entering the kernel.
 
 To handle the slowpath, we have added two new futex ops:
 
-  FUTEX_LOCK_PI
-  FUTEX_UNLOCK_PI
+  - FUTEX_LOCK_PI
+  - FUTEX_UNLOCK_PI
 
 If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 to
 TID fails], then FUTEX_LOCK_PI is called. The kernel does all the
diff --git a/Documentation/pnp.txt b/Documentation/pnp.txt
index 763e4659bf18..bab2d10631f0 100644
--- a/Documentation/pnp.txt
+++ b/Documentation/pnp.txt
@@ -1,98 +1,118 @@
+=================================
 Linux Plug and Play Documentation
-by Adam Belay <ambx1@neo.rr.com>
-last updated: Oct. 16, 2002
----------------------------------------------------------------------------------------
+=================================
 
+:Author: Adam Belay <ambx1@neo.rr.com>
+:Last updated: Oct. 16, 2002
 
 
 Overview
 --------
-	Plug and Play provides a means of detecting and setting resources for legacy or
+
+Plug and Play provides a means of detecting and setting resources for legacy or
 otherwise unconfigurable devices. The Linux Plug and Play Layer provides these
 services to compatible drivers.
 
 
-
 The User Interface
 ------------------
-	The Linux Plug and Play user interface provides a means to activate PnP devices
+
+The Linux Plug and Play user interface provides a means to activate PnP devices
 for legacy and user level drivers that do not support Linux Plug and Play. The
 user interface is integrated into sysfs.
 
 In addition to the standard sysfs file the following are created in each
 device's directory:
-id - displays a list of support EISA IDs
-options - displays possible resource configurations
-resources - displays currently allocated resources and allows resource changes
+- id - displays a list of support EISA IDs
+- options - displays possible resource configurations
+- resources - displays currently allocated resources and allows resource changes
 
--activating a device
+activating a device
+^^^^^^^^^^^^^^^^^^^
 
-#echo "auto" > resources
+::
+
+	# echo "auto" > resources
 
 this will invoke the automatic resource config system to activate the device
 
--manually activating a device
+manually activating a device
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+	# echo "manual <depnum> <mode>" > resources
 
-#echo "manual <depnum> <mode>" > resources
-<depnum> - the configuration number
-<mode> - static or dynamic
-	static = for next boot
-	dynamic = now
+	<depnum> - the configuration number
+	<mode> - static or dynamic
+		static = for next boot
+		dynamic = now
 
--disabling a device
+disabling a device
+^^^^^^^^^^^^^^^^^^
 
-#echo "disable" > resources
+::
+
+	# echo "disable" > resources
 
 
 EXAMPLE:
 
 Suppose you need to activate the floppy disk controller.
-1.) change to the proper directory, in my case it is
-/driver/bus/pnp/devices/00:0f
-# cd /driver/bus/pnp/devices/00:0f
-# cat name
-PC standard floppy disk controller
-
-2.) check if the device is already active
-# cat resources
-DISABLED
-
-- Notice the string "DISABLED". This means the device is not active.
-
-3.) check the device's possible configurations (optional)
-# cat options
-Dependent: 01 - Priority acceptable
-   port 0x3f0-0x3f0, align 0x7, size 0x6, 16-bit address decoding
-   port 0x3f7-0x3f7, align 0x0, size 0x1, 16-bit address decoding
-   irq 6
-   dma 2 8-bit compatible
-Dependent: 02 - Priority acceptable
-   port 0x370-0x370, align 0x7, size 0x6, 16-bit address decoding
-   port 0x377-0x377, align 0x0, size 0x1, 16-bit address decoding
-   irq 6
-   dma 2 8-bit compatible
-
-4.) now activate the device
-# echo "auto" > resources
-
-5.) finally check if the device is active
-# cat resources
-io 0x3f0-0x3f5
-io 0x3f7-0x3f7
-irq 6
-dma 2
-
-also there are a series of kernel parameters:
-pnp_reserve_irq=irq1[,irq2] ....
-pnp_reserve_dma=dma1[,dma2] ....
-pnp_reserve_io=io1,size1[,io2,size2] ....
-pnp_reserve_mem=mem1,size1[,mem2,size2] ....
+
+1. change to the proper directory, in my case it is
+   /driver/bus/pnp/devices/00:0f::
+
+	# cd /driver/bus/pnp/devices/00:0f
+	# cat name
+	PC standard floppy disk controller
+
+2. check if the device is already active::
+
+	# cat resources
+	DISABLED
+
+  - Notice the string "DISABLED". This means the device is not active.
+
+3. check the device's possible configurations (optional)::
+
+	# cat options
+	Dependent: 01 - Priority acceptable
+	   port 0x3f0-0x3f0, align 0x7, size 0x6, 16-bit address decoding
+	   port 0x3f7-0x3f7, align 0x0, size 0x1, 16-bit address decoding
+	   irq 6
+	   dma 2 8-bit compatible
+	Dependent: 02 - Priority acceptable
+	   port 0x370-0x370, align 0x7, size 0x6, 16-bit address decoding
+	   port 0x377-0x377, align 0x0, size 0x1, 16-bit address decoding
+	   irq 6
+	   dma 2 8-bit compatible
+
+4. now activate the device::
+
+	# echo "auto" > resources
+
+5. finally check if the device is active::
+
+	# cat resources
+	io 0x3f0-0x3f5
+	io 0x3f7-0x3f7
+	irq 6
+	dma 2
+
+also there are a series of kernel parameters::
+
+	pnp_reserve_irq=irq1[,irq2] ....
+	pnp_reserve_dma=dma1[,dma2] ....
+	pnp_reserve_io=io1,size1[,io2,size2] ....
+	pnp_reserve_mem=mem1,size1[,mem2,size2] ....
 
 
 
 The Unified Plug and Play Layer
 -------------------------------
-	All Plug and Play drivers, protocols, and services meet at a central location
+
+All Plug and Play drivers, protocols, and services meet at a central location
 called the Plug and Play Layer. This layer is responsible for the exchange of
 information between PnP drivers and PnP protocols. Thus it automatically
 forwards commands to the proper protocol. This makes writing PnP drivers
@@ -101,64 +121,73 @@ significantly easier.
 The following functions are available from the Plug and Play Layer:
 
 pnp_get_protocol
-- increments the number of uses by one
+  increments the number of uses by one
 
 pnp_put_protocol
-- deincrements the number of uses by one
+  deincrements the number of uses by one
 
 pnp_register_protocol
-- use this to register a new PnP protocol
+  use this to register a new PnP protocol
 
 pnp_unregister_protocol
-- use this function to remove a PnP protocol from the Plug and Play Layer
+  use this function to remove a PnP protocol from the Plug and Play Layer
 
 pnp_register_driver
-- adds a PnP driver to the Plug and Play Layer
-- this includes driver model integration
-- returns zero for success or a negative error number for failure; count
+  adds a PnP driver to the Plug and Play Layer
+
+  this includes driver model integration
+  returns zero for success or a negative error number for failure; count
   calls to the .add() method if you need to know how many devices bind to
   the driver
 
 pnp_unregister_driver
-- removes a PnP driver from the Plug and Play Layer
+  removes a PnP driver from the Plug and Play Layer
 
 
 
 Plug and Play Protocols
 -----------------------
-	This section contains information for PnP protocol developers.
+
+This section contains information for PnP protocol developers.
 
 The following Protocols are currently available in the computing world:
-- PNPBIOS: used for system devices such as serial and parallel ports.
-- ISAPNP: provides PnP support for the ISA bus
-- ACPI: among its many uses, ACPI provides information about system level
-devices.
+
+- PNPBIOS:
+    used for system devices such as serial and parallel ports.
+- ISAPNP:
+    provides PnP support for the ISA bus
+- ACPI:
+    among its many uses, ACPI provides information about system level
+    devices.
+
 It is meant to replace the PNPBIOS. It is not currently supported by Linux
 Plug and Play but it is planned to be in the near future.
 
 
 Requirements for a Linux PnP protocol:
-1.) the protocol must use EISA IDs
-2.) the protocol must inform the PnP Layer of a device's current configuration
+1. the protocol must use EISA IDs
+2. the protocol must inform the PnP Layer of a device's current configuration
+
 - the ability to set resources is optional but preferred.
 
 The following are PnP protocol related functions:
 
 pnp_add_device
-- use this function to add a PnP device to the PnP layer
-- only call this function when all wanted values are set in the pnp_dev
-structure
+  use this function to add a PnP device to the PnP layer
+
+  only call this function when all wanted values are set in the pnp_dev
+  structure
 
 pnp_init_device
-- call this to initialize the PnP structure
+  call this to initialize the PnP structure
 
 pnp_remove_device
-- call this to remove a device from the Plug and Play Layer.
-- it will fail if the device is still in use.
-- automatically will free mem used by the device and related structures
+  call this to remove a device from the Plug and Play Layer.
+  it will fail if the device is still in use.
+  automatically will free mem used by the device and related structures
 
 pnp_add_id
-- adds an EISA ID to the list of supported IDs for the specified device
+  adds an EISA ID to the list of supported IDs for the specified device
 
 For more information consult the source of a protocol such as
 /drivers/pnp/pnpbios/core.c.
@@ -167,85 +196,97 @@ For more information consult the source of a protocol such as
 
 Linux Plug and Play Drivers
 ---------------------------
-	This section contains information for Linux PnP driver developers.
+
+This section contains information for Linux PnP driver developers.
 
 The New Way
-...........
-1.) first make a list of supported EISA IDS
-ex:
-static const struct pnp_id pnp_dev_table[] = {
-	/* Standard LPT Printer Port */
-	{.id = "PNP0400", .driver_data = 0},
-	/* ECP Printer Port */
-	{.id = "PNP0401", .driver_data = 0},
-	{.id = ""}
-};
-
-Please note that the character 'X' can be used as a wild card in the function
-portion (last four characters).
-ex:
+^^^^^^^^^^^
+
+1. first make a list of supported EISA IDS
+
+   ex::
+
+	static const struct pnp_id pnp_dev_table[] = {
+		/* Standard LPT Printer Port */
+		{.id = "PNP0400", .driver_data = 0},
+		/* ECP Printer Port */
+		{.id = "PNP0401", .driver_data = 0},
+		{.id = ""}
+	};
+
+   Please note that the character 'X' can be used as a wild card in the function
+   portion (last four characters).
+
+   ex::
+
 	/* Unknown PnP modems */
 	{ "PNPCXXX", UNKNOWN_DEV },
 
-Supported PnP card IDs can optionally be defined.
-ex:
-static const struct pnp_id pnp_card_table[] = {
-	{ "ANYDEVS", 0 },
-	{ "", 0 }
-};
-
-2.) Optionally define probe and remove functions. It may make sense not to
-define these functions if the driver already has a reliable method of detecting
-the resources, such as the parport_pc driver.
-ex:
-static int
-serial_pnp_probe(struct pnp_dev * dev, const struct pnp_id *card_id, const
-		struct pnp_id *dev_id)
-{
-. . .
-
-ex:
-static void serial_pnp_remove(struct pnp_dev * dev)
-{
-. . .
-
-consult /drivers/serial/8250_pnp.c for more information.
-
-3.) create a driver structure
-ex:
-
-static struct pnp_driver serial_pnp_driver = {
-	.name = "serial",
-	.card_id_table = pnp_card_table,
-	.id_table = pnp_dev_table,
-	.probe = serial_pnp_probe,
-	.remove = serial_pnp_remove,
-};
-
-* name and id_table cannot be NULL.
-
-4.) register the driver
-ex:
-
-static int __init serial8250_pnp_init(void)
-{
-	return pnp_register_driver(&serial_pnp_driver);
-}
+   Supported PnP card IDs can optionally be defined.
+   ex::
+
+	static const struct pnp_id pnp_card_table[] = {
+		{ "ANYDEVS", 0 },
+		{ "", 0 }
+	};
+
+2. Optionally define probe and remove functions. It may make sense not to
+   define these functions if the driver already has a reliable method of detecting
+   the resources, such as the parport_pc driver.
+
+   ex::
+
+	static int
+	serial_pnp_probe(struct pnp_dev * dev, const struct pnp_id *card_id, const
+			struct pnp_id *dev_id)
+	{
+	. . .
+
+   ex::
+
+	static void serial_pnp_remove(struct pnp_dev * dev)
+	{
+	. . .
+
+   consult /drivers/serial/8250_pnp.c for more information.
+
+3. create a driver structure
+
+   ex::
+
+	static struct pnp_driver serial_pnp_driver = {
+		.name = "serial",
+		.card_id_table = pnp_card_table,
+		.id_table = pnp_dev_table,
+		.probe = serial_pnp_probe,
+		.remove = serial_pnp_remove,
+	};
+
+   * name and id_table cannot be NULL.
+
+4. register the driver
+
+   ex::
+
+	static int __init serial8250_pnp_init(void)
+	{
+		return pnp_register_driver(&serial_pnp_driver);
+	}
 
 The Old Way
-...........
+^^^^^^^^^^^
 
 A series of compatibility functions have been created to make it easy to convert
 ISAPNP drivers. They should serve as a temporary solution only.
 
-They are as follows:
+They are as follows::
 
-struct pnp_card *pnp_find_card(unsigned short vendor,
-			       unsigned short device,
-			       struct pnp_card *from)
+	struct pnp_card *pnp_find_card(unsigned short vendor,
+				       unsigned short device,
+				       struct pnp_card *from)
 
-struct pnp_dev *pnp_find_dev(struct pnp_card *card,
-			     unsigned short vendor,
-			     unsigned short function,
-			     struct pnp_dev *from)
+	struct pnp_dev *pnp_find_dev(struct pnp_card *card,
+				     unsigned short vendor,
+				     unsigned short function,
+				     struct pnp_dev *from)
 
diff --git a/Documentation/preempt-locking.txt b/Documentation/preempt-locking.txt
index e89ce6624af2..c945062be66c 100644
--- a/Documentation/preempt-locking.txt
+++ b/Documentation/preempt-locking.txt
@@ -1,10 +1,13 @@
-	Proper Locking Under a Preemptible Kernel:
-		Keeping Kernel Code Preempt-Safe
-		Robert Love <rml@tech9.net>
-		 Last Updated: 28 Aug 2002
+===========================================================================
+Proper Locking Under a Preemptible Kernel: Keeping Kernel Code Preempt-Safe
+===========================================================================
 
+:Author: Robert Love <rml@tech9.net>
+:Last Updated: 28 Aug 2002
 
-INTRODUCTION
+
+Introduction
+============
 
 
 A preemptible kernel creates new locking issues.  The issues are the same as
@@ -17,9 +20,10 @@ requires protecting these situations.
 
 
 RULE #1: Per-CPU data structures need explicit protection
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 
-Two similar problems arise. An example code snippet:
+Two similar problems arise. An example code snippet::
 
 	struct this_needs_locking tux[NR_CPUS];
 	tux[smp_processor_id()] = some_value;
@@ -35,6 +39,7 @@ You can also use put_cpu() and get_cpu(), which will disable preemption.
 
 
 RULE #2: CPU state must be protected.
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 
 Under preemption, the state of the CPU must be protected. This is arch-
@@ -52,6 +57,7 @@ However, fpu__restore() must be called with preemption disabled.
 
 
 RULE #3: Lock acquire and release must be performed by same task
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 
 A lock acquired in one task must be released by the same task. This
@@ -61,17 +67,20 @@ like this, acquire and release the task in the same code path and
 have the caller wait on an event by the other task.
 
 
-SOLUTION
+Solution
+========
 
 
 Data protection under preemption is achieved by disabling preemption for the
 duration of the critical region.
 
-preempt_enable()		decrement the preempt counter
-preempt_disable()		increment the preempt counter
-preempt_enable_no_resched()	decrement, but do not immediately preempt
-preempt_check_resched()		if needed, reschedule
-preempt_count()			return the preempt counter
+::
+
+	preempt_enable()		decrement the preempt counter
+	preempt_disable()		increment the preempt counter
+	preempt_enable_no_resched()	decrement, but do not immediately preempt
+	preempt_check_resched()		if needed, reschedule
+	preempt_count()			return the preempt counter
 
 The functions are nestable. In other words, you can call preempt_disable
 n-times in a code path, and preemption will not be reenabled until the n-th
@@ -89,7 +98,7 @@ So use this implicit preemption-disabling property only if you know that the
 affected codepath does not do any of this. Best policy is to use this only for
 small, atomic code that you wrote and which calls no complex functions.
 
-Example:
+Example::
 
 	cpucache_t *cc; /* this is per-CPU */
 	preempt_disable();
@@ -102,7 +111,7 @@ Example:
 	return 0;
 
 Notice how the preemption statements must encompass every reference of the
-critical variables. Another example:
+critical variables. Another example::
 
 	int buf[NR_CPUS];
 	set_cpu_val(buf);
@@ -114,7 +123,8 @@ This code is not preempt-safe, but see how easily we can fix it by simply
 moving the spin_lock up two lines.
 
 
-PREVENTING PREEMPTION USING INTERRUPT DISABLING
+Preventing preemption using interrupt disabling
+===============================================
 
 
 It is possible to prevent a preemption event using local_irq_disable and
diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt
index 619cdffa5d44..65ea5915178b 100644
--- a/Documentation/printk-formats.txt
+++ b/Documentation/printk-formats.txt
@@ -1,5 +1,18 @@
-If variable is of Type,		use printk format specifier:
----------------------------------------------------------
+=========================================
+How to get printk format specifiers right
+=========================================
+
+:Author: Randy Dunlap <rdunlap@infradead.org>
+:Author: Andrew Murray <amurray@mpc-data.co.uk>
+
+
+Integer types
+=============
+
+::
+
+	If variable is of Type,		use printk format specifier:
+	------------------------------------------------------------
 		int			%d or %x
 		unsigned int		%u or %x
 		long			%ld or %lx
@@ -13,25 +26,29 @@ If variable is of Type, use printk format specifier:
 		s64			%lld or %llx
 		u64			%llu or %llx
 
-If <type> is dependent on a config option for its size (e.g., sector_t,
-blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a
-format specifier of its largest possible type and explicitly cast to it.
-Example:
+If <type> is dependent on a config option for its size (e.g., ``sector_t``,
+``blkcnt_t``) or is architecture-dependent for its size (e.g., ``tcflag_t``),
+use a format specifier of its largest possible type and explicitly cast to it.
+
+Example::
 
 	printk("test: sector number/total blocks: %llu/%llu\n",
 		(unsigned long long)sector, (unsigned long long)blockcount);
 
-Reminder: sizeof() result is of type size_t.
+Reminder: ``sizeof()`` result is of type ``size_t``.
 
-The kernel's printf does not support %n. For obvious reasons, floating
-point formats (%e, %f, %g, %a) are also not recognized. Use of any
+The kernel's printf does not support ``%n``. For obvious reasons, floating
+point formats (``%e, %f, %g, %a``) are also not recognized. Use of any
 unsupported specifier or length qualifier results in a WARN and early
 return from vsnprintf.
 
 Raw pointer value SHOULD be printed with %p. The kernel supports
 the following extended format specifiers for pointer types:
 
-Symbols/Function Pointers:
+Symbols/Function Pointers
+=========================
+
+::
 
 	%pF	versatile_init+0x0/0x110
 	%pf	versatile_init
@@ -41,99 +58,122 @@ Symbols/Function Pointers:
 	%ps	versatile_init
 	%pB	prev_fn_of_versatile_init+0x88/0x88
 
-	For printing symbols and function pointers. The 'S' and 's' specifiers
-	result in the symbol name with ('S') or without ('s') offsets. Where
-	this is used on a kernel without KALLSYMS - the symbol address is
-	printed instead.
+For printing symbols and function pointers. The ``S`` and ``s`` specifiers
+result in the symbol name with (``S``) or without (``s``) offsets. Where
+this is used on a kernel without KALLSYMS - the symbol address is
+printed instead.
+
+The ``B`` specifier results in the symbol name with offsets and should be
+used when printing stack backtraces. The specifier takes into
+consideration the effect of compiler optimisations which may occur
+when tail-calls are used and marked with the noreturn GCC attribute.
 
-	The 'B' specifier results in the symbol name with offsets and should be
-	used when printing stack backtraces. The specifier takes into
-	consideration the effect of compiler optimisations which may occur
-	when tail-call's are used and marked with the noreturn GCC attribute.
+On ia64, ppc64 and parisc64 architectures function pointers are
+actually function descriptors which must first be resolved. The ``F`` and
+``f`` specifiers perform this resolution and then provide the same
+functionality as the ``S`` and ``s`` specifiers.
 
-	On ia64, ppc64 and parisc64 architectures function pointers are
-	actually function descriptors which must first be resolved. The 'F' and
-	'f' specifiers perform this resolution and then provide the same
-	functionality as the 'S' and 's' specifiers.
+Kernel Pointers
+===============
 
-Kernel Pointers:
+::
 
 	%pK	0x01234567 or 0x0123456789abcdef
 
-	For printing kernel pointers which should be hidden from unprivileged
-	users. The behaviour of %pK depends on the kptr_restrict sysctl - see
-	Documentation/sysctl/kernel.txt for more details.
+For printing kernel pointers which should be hidden from unprivileged
+users. The behaviour of ``%pK`` depends on the ``kptr_restrict`` sysctl - see
+Documentation/sysctl/kernel.txt for more details.
+
+Struct Resources
+================
 
-Struct Resources:
+::
 
 	%pr	[mem 0x60000000-0x6fffffff flags 0x2200] or
 		[mem 0x0000000060000000-0x000000006fffffff flags 0x2200]
 	%pR	[mem 0x60000000-0x6fffffff pref] or
 		[mem 0x0000000060000000-0x000000006fffffff pref]
 
-	For printing struct resources. The 'R' and 'r' specifiers result in a
-	printed resource with ('R') or without ('r') a decoded flags member.
-	Passed by reference.
+For printing struct resources. The ``R`` and ``r`` specifiers result in a
+printed resource with (``R``) or without (``r``) a decoded flags member.
+Passed by reference.
+
+Physical addresses types ``phys_addr_t``
+========================================
 
-Physical addresses types phys_addr_t:
+::
 
 	%pa[p]	0x01234567 or 0x0123456789abcdef
 
-	For printing a phys_addr_t type (and its derivatives, such as
-	resource_size_t) which can vary based on build options, regardless of
-	the width of the CPU data path. Passed by reference.
+For printing a ``phys_addr_t`` type (and its derivatives, such as
+``resource_size_t``) which can vary based on build options, regardless of
+the width of the CPU data path. Passed by reference.
 
-DMA addresses types dma_addr_t:
+DMA addresses types ``dma_addr_t``
+==================================
+
+::
 
 	%pad	0x01234567 or 0x0123456789abcdef
 
-	For printing a dma_addr_t type which can vary based on build options,
-	regardless of the width of the CPU data path. Passed by reference.
+For printing a ``dma_addr_t`` type which can vary based on build options,
+regardless of the width of the CPU data path. Passed by reference.
+
+Raw buffer as an escaped string
+===============================
 
-Raw buffer as an escaped string:
+::
 
 	%*pE[achnops]
 
-	For printing raw buffer as an escaped string. For the following buffer
+For printing raw buffer as an escaped string. For the following buffer::
 
 		1b 62 20 5c 43 07 22 90 0d 5d
 
-	few examples show how the conversion would be done (the result string
-	without surrounding quotes):
+few examples show how the conversion would be done (the result string
+without surrounding quotes)::
 
 		%*pE		"\eb \C\a"\220\r]"
 		%*pEhp		"\x1bb \C\x07"\x90\x0d]"
 		%*pEa		"\e\142\040\\\103\a\042\220\r\135"
 
-	The conversion rules are applied according to an optional combination
-	of flags (see string_escape_mem() kernel documentation for the
-	details):
-		a - ESCAPE_ANY
-		c - ESCAPE_SPECIAL
-		h - ESCAPE_HEX
-		n - ESCAPE_NULL
-		o - ESCAPE_OCTAL
-		p - ESCAPE_NP
-		s - ESCAPE_SPACE
-	By default ESCAPE_ANY_NP is used.
+The conversion rules are applied according to an optional combination
+of flags (see :c:func:`string_escape_mem` kernel documentation for the
+details):
+
+	- ``a`` - ESCAPE_ANY
+	- ``c`` - ESCAPE_SPECIAL
+	- ``h`` - ESCAPE_HEX
+	- ``n`` - ESCAPE_NULL
+	- ``o`` - ESCAPE_OCTAL
+	- ``p`` - ESCAPE_NP
+	- ``s`` - ESCAPE_SPACE
+
+By default ESCAPE_ANY_NP is used.
 
-	ESCAPE_ANY_NP is the sane choice for many cases, in particularly for
-	printing SSIDs.
+ESCAPE_ANY_NP is the sane choice for many cases, particularly for
+printing SSIDs.
 
-	If field width is omitted the 1 byte only will be escaped.
+If field width is omitted then 1 byte only will be escaped.
+
+Raw buffer as a hex string
+==========================
 
-Raw buffer as a hex string:
+::
 
 	%*ph	00 01 02  ...  3f
 	%*phC	00:01:02: ... :3f
 	%*phD	00-01-02- ... -3f
 	%*phN	000102 ... 3f
 
-	For printing a small buffers (up to 64 bytes long) as a hex string with
-	certain separator. For the larger buffers consider to use
-	print_hex_dump().
+For printing small buffers (up to 64 bytes long) as a hex string with a
+certain separator. For larger buffers consider using
+:c:func:`print_hex_dump`.
+
+MAC/FDDI addresses
+==================
 
-MAC/FDDI addresses:
+::
 
 	%pM	00:01:02:03:04:05
 	%pMR	05:04:03:02:01:00
@@ -141,53 +181,62 @@ MAC/FDDI addresses:
 	%pm	000102030405
 	%pmR	050403020100
 
-	For printing 6-byte MAC/FDDI addresses in hex notation. The 'M' and 'm'
-	specifiers result in a printed address with ('M') or without ('m') byte
-	separators. The default byte separator is the colon (':').
+For printing 6-byte MAC/FDDI addresses in hex notation. The ``M`` and ``m``
+specifiers result in a printed address with (``M``) or without (``m``) byte
+separators. The default byte separator is the colon (``:``).
 
-	Where FDDI addresses are concerned the 'F' specifier can be used after
-	the 'M' specifier to use dash ('-') separators instead of the default
-	separator.
+Where FDDI addresses are concerned the ``F`` specifier can be used after
+the ``M`` specifier to use dash (``-``) separators instead of the default
+separator.
 
-	For Bluetooth addresses the 'R' specifier shall be used after the 'M'
-	specifier to use reversed byte order suitable for visual interpretation
-	of Bluetooth addresses which are in the little endian order.
+For Bluetooth addresses the ``R`` specifier shall be used after the ``M``
+specifier to use reversed byte order suitable for visual interpretation
+of Bluetooth addresses which are in the little endian order.
 
-	Passed by reference.
+Passed by reference.
+
+IPv4 addresses
+==============
 
-IPv4 addresses:
+::
 
 	%pI4	1.2.3.4
 	%pi4	001.002.003.004
 	%p[Ii]4[hnbl]
 
-	For printing IPv4 dot-separated decimal addresses. The 'I4' and 'i4'
-	specifiers result in a printed address with ('i4') or without ('I4')
-	leading zeros.
+For printing IPv4 dot-separated decimal addresses. The ``I4`` and ``i4``
+specifiers result in a printed address with (``i4``) or without (``I4``)
+leading zeros.
 
-	The additional 'h', 'n', 'b', and 'l' specifiers are used to specify
-	host, network, big or little endian order addresses respectively. Where
-	no specifier is provided the default network/big endian order is used.
+The additional ``h``, ``n``, ``b``, and ``l`` specifiers are used to specify
+host, network, big or little endian order addresses respectively. Where
+no specifier is provided the default network/big endian order is used.
 
-	Passed by reference.
+Passed by reference.
 
-IPv6 addresses:
+IPv6 addresses
+==============
+
+::
 
 	%pI6	0001:0002:0003:0004:0005:0006:0007:0008
 	%pi6	00010002000300040005000600070008
 	%pI6c	1:2:3:4:5:6:7:8
 
-	For printing IPv6 network-order 16-bit hex addresses. The 'I6' and 'i6'
-	specifiers result in a printed address with ('I6') or without ('i6')
-	colon-separators. Leading zeros are always used.
+For printing IPv6 network-order 16-bit hex addresses. The ``I6`` and ``i6``
+specifiers result in a printed address with (``I6``) or without (``i6``)
+colon-separators. Leading zeros are always used.
 
-	The additional 'c' specifier can be used with the 'I' specifier to
-	print a compressed IPv6 address as described by
-	http://tools.ietf.org/html/rfc5952
+The additional ``c`` specifier can be used with the ``I`` specifier to
+print a compressed IPv6 address as described by
+http://tools.ietf.org/html/rfc5952
 
-	Passed by reference.
+Passed by reference.
 
-IPv4/IPv6 addresses (generic, with port, flowinfo, scope):
+IPv4/IPv6 addresses (generic, with port, flowinfo, scope)
+=========================================================
+
+::
 
 	%pIS	1.2.3.4		or 0001:0002:0003:0004:0005:0006:0007:0008
 	%piS	001.002.003.004	or 00010002000300040005000600070008
@@ -195,87 +244,103 @@ IPv4/IPv6 addresses (generic, with port, flowinfo, scope):
 	%pISpc	1.2.3.4:12345	or [1:2:3:4:5:6:7:8]:12345
 	%p[Ii]S[pfschnbl]
 
-	For printing an IP address without the need to distinguish whether it's
-	of type AF_INET or AF_INET6, a pointer to a valid 'struct sockaddr',
-	specified through 'IS' or 'iS', can be passed to this format specifier.
+For printing an IP address without the need to distinguish whether it's
+of type AF_INET or AF_INET6, a pointer to a valid ``struct sockaddr``,
+specified through ``IS`` or ``iS``, can be passed to this format specifier.
 
-	The additional 'p', 'f', and 's' specifiers are used to specify port
-	(IPv4, IPv6), flowinfo (IPv6) and scope (IPv6). Ports have a ':' prefix,
-	flowinfo a '/' and scope a '%', each followed by the actual value.
+The additional ``p``, ``f``, and ``s`` specifiers are used to specify port
+(IPv4, IPv6), flowinfo (IPv6) and scope (IPv6). Ports have a ``:`` prefix,
+flowinfo a ``/`` and scope a ``%``, each followed by the actual value.
 
-	In case of an IPv6 address the compressed IPv6 address as described by
-	http://tools.ietf.org/html/rfc5952 is being used if the additional
-	specifier 'c' is given. The IPv6 address is surrounded by '[', ']' in
-	case of additional specifiers 'p', 'f' or 's' as suggested by
-	https://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-07
+In case of an IPv6 address the compressed IPv6 address as described by
+http://tools.ietf.org/html/rfc5952 is being used if the additional
+specifier ``c`` is given. The IPv6 address is surrounded by ``[``, ``]`` in
+case of additional specifiers ``p``, ``f`` or ``s`` as suggested by
+https://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-07
 
-	In case of IPv4 addresses, the additional 'h', 'n', 'b', and 'l'
-	specifiers can be used as well and are ignored in case of an IPv6
-	address.
+In case of IPv4 addresses, the additional ``h``, ``n``, ``b``, and ``l``
+specifiers can be used as well and are ignored in case of an IPv6
+address.
 
-	Passed by reference.
+Passed by reference.
 
-	Further examples:
+Further examples::
 
 	%pISfc		1.2.3.4		or [1:2:3:4:5:6:7:8]/123456789
 	%pISsc		1.2.3.4		or [1:2:3:4:5:6:7:8]%1234567890
 	%pISpfc		1.2.3.4:12345	or [1:2:3:4:5:6:7:8]:12345/123456789
 
-UUID/GUID addresses:
+UUID/GUID addresses
+===================
+
+::
 
 	%pUb	00010203-0405-0607-0809-0a0b0c0d0e0f
 	%pUB	00010203-0405-0607-0809-0A0B0C0D0E0F
 	%pUl	03020100-0504-0706-0809-0a0b0c0e0e0f
 	%pUL	03020100-0504-0706-0809-0A0B0C0E0E0F
 
-	For printing 16-byte UUID/GUIDs addresses. The additional 'l', 'L',
-	'b' and 'B' specifiers are used to specify a little endian order in
-	lower ('l') or upper case ('L') hex characters - and big endian order
-	in lower ('b') or upper case ('B') hex characters.
+For printing 16-byte UUID/GUID addresses. The additional 'l', 'L',
+'b' and 'B' specifiers are used to specify a little endian order in
+lower ('l') or upper case ('L') hex characters - and big endian order
+in lower ('b') or upper case ('B') hex characters.
 
-	Where no additional specifiers are used the default big endian
-	order with lower case hex characters will be printed.
+Where no additional specifiers are used the default big endian
+order with lower case hex characters will be printed.
 
-	Passed by reference.
+Passed by reference.
+
+dentry names
+============
 
-dentry names:
+::
 
 	%pd{,2,3,4}
 	%pD{,2,3,4}
 
-	For printing dentry name; if we race with d_move(), the name might be
-	a mix of old and new ones, but it won't oops. %pd dentry is a safer
-	equivalent of %s dentry->d_name.name we used to use, %pd<n> prints
-	n last components. %pD does the same thing for struct file.
+For printing dentry name; if we race with :c:func:`d_move`, the name might be
+a mix of old and new ones, but it won't oops. ``%pd`` dentry is a safer
+equivalent of ``%s`` ``dentry->d_name.name`` we used to use, ``%pd<n>`` prints
+``n`` last components. ``%pD`` does the same thing for struct file.
 
-	Passed by reference.
+Passed by reference.
 
-block_device names:
+block_device names
+==================
+
+::
 
 	%pg	sda, sda1 or loop0p1
 
-	For printing name of block_device pointers.
+For printing the name of block_device pointers.
+
+struct va_format
+================
 
-struct va_format:
+::
 
 	%pV
 
-	For printing struct va_format structures. These contain a format string
-	and va_list as follows:
+For printing struct va_format structures. These contain a format string
+and va_list as follows::
 
 	struct va_format {
 		const char *fmt;
 		va_list *va;
 	};
 
-	Implements a "recursive vsnprintf".
+Implements a "recursive vsnprintf".
 
-	Do not use this feature without some mechanism to verify the
-	correctness of the format string and va_list arguments.
+Do not use this feature without some mechanism to verify the
+correctness of the format string and va_list arguments.
 
-	Passed by reference.
+Passed by reference.
+
+kobjects
+========
+
+::
 
-kobjects:
 	%pO
 
 	Base specifier for kobject based structs. Must be followed with
@@ -311,61 +376,70 @@ kobjects:
311 376
312 Passed by reference. 377 Passed by reference.
313 378
314struct clk: 379
380struct clk
381==========
382
383::
315 384
316 %pC pll1 385 %pC pll1
317 %pCn pll1 386 %pCn pll1
318 %pCr 1560000000 387 %pCr 1560000000
319 388
320 For printing struct clk structures. '%pC' and '%pCn' print the name 389For printing struct clk structures. ``%pC`` and ``%pCn`` print the name
321 (Common Clock Framework) or address (legacy clock framework) of the 390(Common Clock Framework) or address (legacy clock framework) of the
322 structure; '%pCr' prints the current clock rate. 391structure; ``%pCr`` prints the current clock rate.
323 392
324 Passed by reference. 393Passed by reference.
325 394
326bitmap and its derivatives such as cpumask and nodemask: 395bitmap and its derivatives such as cpumask and nodemask
396=======================================================
397
398::
327 399
328 %*pb 0779 400 %*pb 0779
329 %*pbl 0,3-6,8-10 401 %*pbl 0,3-6,8-10
330 402
331 For printing bitmap and its derivatives such as cpumask and nodemask, 403For printing bitmap and its derivatives such as cpumask and nodemask,
332 %*pb output the bitmap with field width as the number of bits and %*pbl 404``%*pb`` output the bitmap with field width as the number of bits and ``%*pbl``
333 output the bitmap as range list with field width as the number of bits. 405output the bitmap as range list with field width as the number of bits.
334 406
335 Passed by reference. 407Passed by reference.

Flags bitfields such as page flags, gfp_flags
=============================================

::

	%pGp	referenced|uptodate|lru|active|private
	%pGg	GFP_USER|GFP_DMA32|GFP_NOWARN
	%pGv	read|exec|mayread|maywrite|mayexec|denywrite

For printing flags bitfields as a collection of symbolic constants that
would construct the value. The type of flags is given by the third
character. Currently supported are [p]age flags, [v]ma_flags (both
expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag
names and print order depend on the particular type.

Note that this format should not be used directly in the :c:func:`TP_printk()`
part of a tracepoint. Instead, use the ``show_*_flags()`` functions from
<trace/events/mmflags.h>.

Passed by reference.

Network device features
=======================

::

	%pNF	0x000000000000c000

For printing netdev_features_t.

Passed by reference.

If you add other ``%p`` extensions, please extend lib/test_printf.c with
one or more test cases, if at all feasible.


Thank you for your cooperation and attention.


By Randy Dunlap <rdunlap@infradead.org> and
Andrew Murray <amurray@mpc-data.co.uk>
diff --git a/Documentation/rbtree.txt b/Documentation/rbtree.txt
index b9d9cc57be18..b8a8c70b0188 100644
--- a/Documentation/rbtree.txt
+++ b/Documentation/rbtree.txt
@@ -1,7 +1,10 @@
=================================
Red-black Trees (rbtree) in Linux
=================================


:Date: January 18, 2007
:Author: Rob Landley <rob@landley.net>

What are red-black trees, and what are they for?
------------------------------------------------
@@ -56,7 +59,7 @@ user of the rbtree code.
Creating a new rbtree
---------------------

Data nodes in an rbtree tree are structures containing a struct rb_node member::

    struct mytype {
        struct rb_node node;
@@ -78,7 +81,7 @@ Searching for a value in an rbtree
Writing a search function for your tree is fairly straightforward: start at the
root, compare each value, and follow the left or right branch as necessary.

Example::

    struct mytype *my_search(struct rb_root *root, char *string)
    {
@@ -110,7 +113,7 @@ The search for insertion differs from the previous search by finding the
location of the pointer on which to graft the new node. The new node also
needs a link to its parent node for rebalancing purposes.

Example::

    int my_insert(struct rb_root *root, struct mytype *data)
    {
@@ -140,11 +143,11 @@ Example:
Removing or replacing existing data in an rbtree
------------------------------------------------

To remove an existing node from a tree, call::

    void rb_erase(struct rb_node *victim, struct rb_root *tree);

Example::

    struct mytype *data = mysearch(&mytree, "walrus");

@@ -153,7 +156,7 @@ Example:
        myfree(data);
    }

To replace an existing node in a tree with a new one with the same key, call::

    void rb_replace_node(struct rb_node *old, struct rb_node *new,
                         struct rb_root *tree);
@@ -166,7 +169,7 @@ Iterating through the elements stored in an rbtree (in sort order)

Four functions are provided for iterating through an rbtree's contents in
sorted order. These work on arbitrary trees, and should not need to be
modified or wrapped (except for locking purposes)::

    struct rb_node *rb_first(struct rb_root *tree);
    struct rb_node *rb_last(struct rb_root *tree);
@@ -184,7 +187,7 @@ which the containing data structure may be accessed with the container_of()
macro, and individual members may be accessed directly via
rb_entry(node, type, member).

Example::

    struct rb_node *node;
    for (node = rb_first(&mytree); node; node = rb_next(node))
@@ -241,7 +244,8 @@ user should have a single rb_erase_augmented() call site in order to limit
compiled code size.


Sample usage
^^^^^^^^^^^^

Interval tree is an example of augmented rb tree. Reference -
"Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein.
@@ -259,12 +263,12 @@ This "extra information" stored in each node is the maximum hi
information can be maintained at each node just by looking at the node
and its immediate children. And this will be used in O(log n) lookup
for lowest match (lowest start address among all possible matches)
with something like::

    struct interval_tree_node *
    interval_tree_first_match(struct rb_root *root,
                              unsigned long start, unsigned long last)
    {
        struct interval_tree_node *node;

        if (!root->rb_node)
@@ -301,13 +305,13 @@ interval_tree_first_match(struct rb_root *root,
            }
            return NULL; /* No match */
        }
    }

Insertion/removal are defined using the following augmented callbacks::

    static inline unsigned long
    compute_subtree_last(struct interval_tree_node *node)
    {
        unsigned long max = node->last, subtree_last;
        if (node->rb.rb_left) {
            subtree_last = rb_entry(node->rb.rb_left,
@@ -322,10 +326,10 @@ compute_subtree_last(struct interval_tree_node *node)
            max = subtree_last;
        }
        return max;
    }

    static void augment_propagate(struct rb_node *rb, struct rb_node *stop)
    {
        while (rb != stop) {
            struct interval_tree_node *node =
                rb_entry(rb, struct interval_tree_node, rb);
@@ -335,20 +339,20 @@ static void augment_propagate(struct rb_node *rb, struct rb_node *stop)
            node->__subtree_last = subtree_last;
            rb = rb_parent(&node->rb);
        }
    }

    static void augment_copy(struct rb_node *rb_old, struct rb_node *rb_new)
    {
        struct interval_tree_node *old =
            rb_entry(rb_old, struct interval_tree_node, rb);
        struct interval_tree_node *new =
            rb_entry(rb_new, struct interval_tree_node, rb);

        new->__subtree_last = old->__subtree_last;
    }

    static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new)
    {
        struct interval_tree_node *old =
            rb_entry(rb_old, struct interval_tree_node, rb);
        struct interval_tree_node *new =
@@ -356,15 +360,15 @@ static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new)

        new->__subtree_last = old->__subtree_last;
        old->__subtree_last = compute_subtree_last(old);
    }

    static const struct rb_augment_callbacks augment_callbacks = {
        augment_propagate, augment_copy, augment_rotate
    };

    void interval_tree_insert(struct interval_tree_node *node,
                              struct rb_root *root)
    {
        struct rb_node **link = &root->rb_node, *rb_parent = NULL;
        unsigned long start = node->start, last = node->last;
        struct interval_tree_node *parent;
@@ -383,10 +387,10 @@ void interval_tree_insert(struct interval_tree_node *node,
        node->__subtree_last = last;
        rb_link_node(&node->rb, rb_parent, link);
        rb_insert_augmented(&node->rb, root, &augment_callbacks);
    }

    void interval_tree_remove(struct interval_tree_node *node,
                              struct rb_root *root)
    {
        rb_erase_augmented(&node->rb, root, &augment_callbacks);
    }
diff --git a/Documentation/remoteproc.txt b/Documentation/remoteproc.txt
index f07597482351..77fb03acdbb4 100644
--- a/Documentation/remoteproc.txt
+++ b/Documentation/remoteproc.txt
@@ -1,6 +1,9 @@
==========================
Remote Processor Framework
==========================

Introduction
============

Modern SoCs typically have heterogeneous remote processor devices in asymmetric
multiprocessing (AMP) configurations, which may be running different instances
@@ -26,44 +29,62 @@ remoteproc will add those devices. This makes it possible to reuse the
existing virtio drivers with remote processor backends at a minimal development
cost.

User API
========

::

    int rproc_boot(struct rproc *rproc)

Boot a remote processor (i.e. load its firmware, power it on, ...).

If the remote processor is already powered on, this function immediately
returns (successfully).

Returns 0 on success, and an appropriate error value otherwise.
Note: to use this function you should already have a valid rproc
handle. There are several ways to achieve that cleanly (devres, pdata,
the way remoteproc_rpmsg.c does this, or, if this becomes prevalent, we
might also consider using dev_archdata for this).

::

    void rproc_shutdown(struct rproc *rproc)

Power off a remote processor (previously booted with rproc_boot()).
In case @rproc is still being used by an additional user(s), then
this function will just decrement the power refcount and exit,
without really powering off the device.

Every call to rproc_boot() must (eventually) be accompanied by a call
to rproc_shutdown(). Calling rproc_shutdown() redundantly is a bug.

.. note::

   We're not decrementing the rproc's refcount, only the power refcount,
   which means that the @rproc handle stays valid even after
   rproc_shutdown() returns, and users can still use it with a subsequent
   rproc_boot(), if needed.

::

    struct rproc *rproc_get_by_phandle(phandle phandle)

Find an rproc handle using a device tree phandle. Returns the rproc
handle on success, and NULL on failure. This function increments
the remote processor's refcount, so always use rproc_put() to
decrement it back once rproc isn't needed anymore.

Typical usage
=============

::

    #include <linux/remoteproc.h>

    /* in case we were given a valid 'rproc' handle */
    int dummy_rproc_example(struct rproc *my_rproc)
    {
        int ret;

        /* let's power on and boot our remote processor */
@@ -80,84 +101,111 @@ int dummy_rproc_example(struct rproc *my_rproc)

        /* let's shut it down now */
        rproc_shutdown(my_rproc);
    }

API for implementors
====================

::

    struct rproc *rproc_alloc(struct device *dev, const char *name,
                              const struct rproc_ops *ops,
                              const char *firmware, int len)

Allocate a new remote processor handle, but don't register
it yet. Required parameters are the underlying device, the
name of this remote processor, platform-specific ops handlers,
the name of the firmware to boot this rproc with, and the
length of private data needed by the allocating rproc driver (in bytes).

This function should be used by rproc implementations during
initialization of the remote processor.

After creating an rproc handle using this function, and when ready,
implementations should then call rproc_add() to complete
the registration of the remote processor.

On success, the new rproc is returned, and on failure, NULL.

.. note::

   **Never** directly deallocate @rproc, even if it was not registered
   yet. Instead, when you need to unroll rproc_alloc(), use rproc_free().

::

    void rproc_free(struct rproc *rproc)

Free an rproc handle that was allocated by rproc_alloc().

This function essentially unrolls rproc_alloc(), by decrementing the
rproc's refcount. It doesn't directly free rproc; that would happen
only if there are no other references to rproc and its refcount has
dropped to zero.

::

    int rproc_add(struct rproc *rproc)

Register @rproc with the remoteproc framework, after it has been
allocated with rproc_alloc().

This is called by the platform-specific rproc implementation, whenever
a new remote processor device is probed.

Returns 0 on success and an appropriate error code otherwise.
Note: this function initiates an asynchronous firmware loading
context, which will look for virtio devices supported by the rproc's
firmware.

If found, those virtio devices will be created and added, so as a result
of registering this remote processor, additional virtio drivers might get
probed.

::

    int rproc_del(struct rproc *rproc)

Unroll rproc_add().

This function should be called when the platform-specific rproc
implementation decides to remove the rproc device. It should
**only** be called if a previous invocation of rproc_add()
has completed successfully.

After rproc_del() returns, @rproc is still valid, and its
last refcount should be decremented by calling rproc_free().

Returns 0 on success and -EINVAL if @rproc isn't valid.

::

    void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)

Report a crash in a remoteproc.

This function must be called every time a crash is detected by the
platform-specific rproc implementation. This should not be called from a
non-remoteproc driver. This function can be called from atomic/interrupt
context.

Implementation callbacks
========================

These callbacks should be provided by platform-specific remoteproc
drivers::

    /**
     * struct rproc_ops - platform-specific device handlers
     * @start: power on the device and boot it
     * @stop: power off the device
     * @kick: kick a virtqueue (virtqueue id given as a parameter)
     */
    struct rproc_ops {
        int (*start)(struct rproc *rproc);
        int (*stop)(struct rproc *rproc);
        void (*kick)(struct rproc *rproc, int vqid);
    };

Every remoteproc implementation should at least provide the ->start and ->stop
handlers. If rpmsg/virtio functionality is also desired, then the ->kick handler
@@ -179,7 +227,8 @@ the exact virtqueue index to look in is optional: it is easy (and not
too expensive) to go through the existing virtqueues and look for new buffers
in the used rings.

Binary Firmware Structure
=========================

At this point remoteproc only supports ELF32 firmware binaries. However,
it is quite expected that other platforms/devices which we'd want to
@@ -207,43 +256,43 @@ resource entries that publish the existence of supported features
or configurations by the remote processor, such as trace buffers and
supported virtio devices (and their configurations).

The resource table begins with this header::

    /**
     * struct resource_table - firmware resource table header
     * @ver: version number
     * @num: number of resource entries
     * @reserved: reserved (must be zero)
     * @offset: array of offsets pointing at the various resource entries
     *
     * The header of the resource table, as expressed by this structure,
     * contains a version number (should we need to change this format in the
     * future), the number of available resource entries, and their offsets
     * in the table.
     */
    struct resource_table {
        u32 ver;
        u32 num;
        u32 reserved[2];
        u32 offset[0];
    } __packed;

Immediately following this header are the resource entries themselves,
each of which begins with the following resource entry header::

    /**
     * struct fw_rsc_hdr - firmware resource entry header
     * @type: resource type
     * @data: resource data
     *
     * Every resource entry begins with a 'struct fw_rsc_hdr' header providing
     * its @type. The content of the entry itself will immediately follow
     * this header, and it should be parsed according to the resource type.
     */
    struct fw_rsc_hdr {
        u32 type;
        u8 data[0];
    } __packed;

Some resource entries are mere announcements, where the host is informed
of specific remoteproc configuration. Other entries require the host to
@@ -252,32 +301,32 @@ is expected, where the firmware requests a resource, and once allocated,
the host should provide back its details (e.g. address of an allocated
memory region).

Here are the various resource types that are currently supported::

    /**
     * enum fw_resource_type - types of resource entries
     *
     * @RSC_CARVEOUT: request for allocation of a physically contiguous
     *                memory region.
     * @RSC_DEVMEM: request to iommu_map a memory-based peripheral.
     * @RSC_TRACE: announces the availability of a trace buffer into which
     *             the remote processor will be writing logs.
     * @RSC_VDEV: declare support for a virtio device, and serve as its
     *            virtio header.
     * @RSC_LAST: just keep this one at the end
     *
     * Please note that these values are used as indices to the rproc_handle_rsc
     * lookup table, so please keep them sane. Moreover, @RSC_LAST is used to
     * check the validity of an index before the lookup table is accessed, so
     * please update it as needed.
     */
    enum fw_resource_type {
        RSC_CARVEOUT    = 0,
        RSC_DEVMEM      = 1,
        RSC_TRACE       = 2,
        RSC_VDEV        = 3,
        RSC_LAST        = 4,
    };

For more details regarding a specific resource type, please see its
dedicated structure in include/linux/remoteproc.h.
@@ -286,7 +335,8 @@ We also expect that platform-specific resource entries will show up
at some point. When that happens, we could easily add a new RSC_PLATFORM
type, and hand those resources to the platform-specific rproc driver to handle.

Virtio and remoteproc
=====================

The firmware should provide remoteproc information about virtio devices
that it supports, and their configurations: a RSC_VDEV resource entry
diff --git a/Documentation/rfkill.txt b/Documentation/rfkill.txt
index 8c174063b3f0..a289285d2412 100644
--- a/Documentation/rfkill.txt
+++ b/Documentation/rfkill.txt
@@ -1,13 +1,13 @@
===============================
rfkill - RF kill switch support
===============================


.. contents::
   :depth: 2

Introduction
============

The rfkill subsystem provides a generic interface to disabling any radio
transmitter in the system. When a transmitter is blocked, it shall not
@@ -21,17 +21,24 @@ aircraft.
The rfkill subsystem has a concept of "hard" and "soft" block, which
differ little in their meaning (block == transmitters off) but rather in
whether they can be changed or not:

 - hard block
	read-only radio block that cannot be overridden by software

 - soft block
	writable radio block (need not be readable) that is set by
	the system software.
27 31
28The rfkill subsystem has two parameters, rfkill.default_state and 32The rfkill subsystem has two parameters, rfkill.default_state and
29rfkill.master_switch_mode, which are documented in admin-guide/kernel-parameters.rst. 33rfkill.master_switch_mode, which are documented in
34admin-guide/kernel-parameters.rst.
30 35
31 36
322. Implementation details 37Implementation details
38======================
33 39
34The rfkill subsystem is composed of three main components: 40The rfkill subsystem is composed of three main components:
41
35 * the rfkill core, 42 * the rfkill core,
36 * the deprecated rfkill-input module (an input layer handler, being 43 * the deprecated rfkill-input module (an input layer handler, being
37 replaced by userspace policy code) and 44 replaced by userspace policy code) and
@@ -55,7 +62,8 @@ use the return value of rfkill_set_hw_state() unless the hardware actually
55keeps track of soft and hard block separately. 62keeps track of soft and hard block separately.
56 63
57 64
583. Kernel API 65Kernel API
66==========
59 67
60 68
61Drivers for radio transmitters normally implement an rfkill driver. 69Drivers for radio transmitters normally implement an rfkill driver.
@@ -69,7 +77,7 @@ For some platforms, it is possible that the hardware state changes during
69suspend/hibernation, in which case it will be necessary to update the rfkill 77suspend/hibernation, in which case it will be necessary to update the rfkill
70core with the current state at resume time. 78core with the current state at resume time.
71 79
72To create an rfkill driver, the driver's Kconfig needs to have 80To create an rfkill driver, the driver's Kconfig needs to have::
73 81
74 depends on RFKILL || !RFKILL 82 depends on RFKILL || !RFKILL
75 83
@@ -87,7 +95,8 @@ RFKill provides per-switch LED triggers, which can be used to drive LEDs
87according to the switch state (LED_FULL when blocked, LED_OFF otherwise). 95according to the switch state (LED_FULL when blocked, LED_OFF otherwise).
88 96
89 97
905. Userspace support 98Userspace support
99=================
91 100
92The recommended userspace interface to use is /dev/rfkill, which is a misc 101The recommended userspace interface to use is /dev/rfkill, which is a misc
93character device that allows userspace to obtain and set the state of rfkill 102character device that allows userspace to obtain and set the state of rfkill
@@ -112,11 +121,11 @@ rfkill core framework.
112Additionally, each rfkill device is registered in sysfs and emits uevents. 121Additionally, each rfkill device is registered in sysfs and emits uevents.
113 122
114rfkill devices issue uevents (with an action of "change"), with the following 123rfkill devices issue uevents (with an action of "change"), with the following
115environment variables set: 124environment variables set::
116 125
117RFKILL_NAME 126 RFKILL_NAME
118RFKILL_STATE 127 RFKILL_STATE
119RFKILL_TYPE 128 RFKILL_TYPE
120 129
121The contents of these variables correspond to the "name", "state" and 130The contents of these variables correspond to the "name", "state" and
122"type" sysfs files explained above. 131"type" sysfs files explained above.
diff --git a/Documentation/robust-futex-ABI.txt b/Documentation/robust-futex-ABI.txt
index 16eb314f56cc..8a5d34abf726 100644
--- a/Documentation/robust-futex-ABI.txt
+++ b/Documentation/robust-futex-ABI.txt
@@ -1,7 +1,9 @@
1Started by Paul Jackson <pj@sgi.com> 1====================
2
3The robust futex ABI 2The robust futex ABI
4-------------------- 3====================
4
5:Author: Started by Paul Jackson <pj@sgi.com>
6
5 7
6Robust_futexes provide a mechanism that is used in addition to normal 8Robust_futexes provide a mechanism that is used in addition to normal
7futexes, for kernel assist of cleanup of held locks on task exit. 9futexes, for kernel assist of cleanup of held locks on task exit.
@@ -32,7 +34,7 @@ probably causing deadlock or other such failure of the other threads
32waiting on the same locks. 34waiting on the same locks.
33 35
34A thread that anticipates possibly using robust_futexes should first 36A thread that anticipates possibly using robust_futexes should first
35issue the system call: 37issue the system call::
36 38
37 asmlinkage long 39 asmlinkage long
38 sys_set_robust_list(struct robust_list_head __user *head, size_t len); 40 sys_set_robust_list(struct robust_list_head __user *head, size_t len);
@@ -91,7 +93,7 @@ that lock using the futex mechanism.
91When a thread has invoked the above system call to indicate it 93When a thread has invoked the above system call to indicate it
92anticipates using robust_futexes, the kernel stores the passed in 'head' 94anticipates using robust_futexes, the kernel stores the passed in 'head'
93pointer for that task. The task may retrieve that value later on by 95pointer for that task. The task may retrieve that value later on by
94using the system call: 96using the system call::
95 97
96 asmlinkage long 98 asmlinkage long
97 sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr, 99 sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
@@ -135,6 +137,7 @@ manipulating this list), the user code must observe the following
135protocol on 'lock entry' insertion and removal: 137protocol on 'lock entry' insertion and removal:
136 138
137On insertion: 139On insertion:
140
138 1) set the 'list_op_pending' word to the address of the 'lock entry' 141 1) set the 'list_op_pending' word to the address of the 'lock entry'
139 to be inserted, 142 to be inserted,
140 2) acquire the futex lock, 143 2) acquire the futex lock,
@@ -143,6 +146,7 @@ On insertion:
143 4) clear the 'list_op_pending' word. 146 4) clear the 'list_op_pending' word.
144 147
145On removal: 148On removal:
149
146 1) set the 'list_op_pending' word to the address of the 'lock entry' 150 1) set the 'list_op_pending' word to the address of the 'lock entry'
147 to be removed, 151 to be removed,
148 2) remove the lock entry for this lock from the 'head' list, 152 2) remove the lock entry for this lock from the 'head' list,
diff --git a/Documentation/robust-futexes.txt b/Documentation/robust-futexes.txt
index 61c22d608759..6c42c75103eb 100644
--- a/Documentation/robust-futexes.txt
+++ b/Documentation/robust-futexes.txt
@@ -1,4 +1,8 @@
1Started by: Ingo Molnar <mingo@redhat.com> 1========================================
2A description of what robust futexes are
3========================================
4
5:Started by: Ingo Molnar <mingo@redhat.com>
2 6
3Background 7Background
4---------- 8----------
@@ -163,7 +167,7 @@ Implementation details
163---------------------- 167----------------------
164 168
165The patch adds two new syscalls: one to register the userspace list, and 169The patch adds two new syscalls: one to register the userspace list, and
166one to query the registered list pointer: 170one to query the registered list pointer::
167 171
168 asmlinkage long 172 asmlinkage long
169 sys_set_robust_list(struct robust_list_head __user *head, 173 sys_set_robust_list(struct robust_list_head __user *head,
@@ -185,7 +189,7 @@ straightforward. The kernel doesn't have any internal distinction between
185robust and normal futexes. 189robust and normal futexes.
186 190
187If a futex is found to be held at exit time, the kernel sets the 191If a futex is found to be held at exit time, the kernel sets the
188following bit of the futex word: 192following bit of the futex word::
189 193
190 #define FUTEX_OWNER_DIED 0x40000000 194 #define FUTEX_OWNER_DIED 0x40000000
191 195
@@ -193,7 +197,7 @@ and wakes up the next futex waiter (if any). User-space does the rest of
193the cleanup. 197the cleanup.
194 198
195Otherwise, robust futexes are acquired by glibc by putting the TID into 199Otherwise, robust futexes are acquired by glibc by putting the TID into
196the futex field atomically. Waiters set the FUTEX_WAITERS bit: 200the futex field atomically. Waiters set the FUTEX_WAITERS bit::
197 201
198 #define FUTEX_WAITERS 0x80000000 202 #define FUTEX_WAITERS 0x80000000
199 203
diff --git a/Documentation/rpmsg.txt b/Documentation/rpmsg.txt
index a95e36a43288..24b7a9e1a5f9 100644
--- a/Documentation/rpmsg.txt
+++ b/Documentation/rpmsg.txt
@@ -1,10 +1,15 @@
1============================================
1Remote Processor Messaging (rpmsg) Framework 2Remote Processor Messaging (rpmsg) Framework
3============================================
2 4
3Note: this document describes the rpmsg bus and how to write rpmsg drivers. 5.. note::
4To learn how to add rpmsg support for new platforms, check out remoteproc.txt
5(also a resident of Documentation/).
6 6
71. Introduction 7 This document describes the rpmsg bus and how to write rpmsg drivers.
8 To learn how to add rpmsg support for new platforms, check out remoteproc.txt
9 (also a resident of Documentation/).
10
11Introduction
12============
8 13
9Modern SoCs typically employ heterogeneous remote processor devices in 14Modern SoCs typically employ heterogeneous remote processor devices in
10asymmetric multiprocessing (AMP) configurations, which may be running 15asymmetric multiprocessing (AMP) configurations, which may be running
@@ -58,170 +63,222 @@ to their destination address (this is done by invoking the driver's rx handler
58with the payload of the inbound message). 63with the payload of the inbound message).
59 64
60 65
612. User API 66User API
67========
68
69::
62 70
63 int rpmsg_send(struct rpmsg_channel *rpdev, void *data, int len); 71 int rpmsg_send(struct rpmsg_channel *rpdev, void *data, int len);
64 - sends a message across to the remote processor on a given channel. 72
65 The caller should specify the channel, the data it wants to send, 73sends a message across to the remote processor on a given channel.
66 and its length (in bytes). The message will be sent on the specified 74The caller should specify the channel, the data it wants to send,
67 channel, i.e. its source and destination address fields will be 75and its length (in bytes). The message will be sent on the specified
68 set to the channel's src and dst addresses. 76channel, i.e. its source and destination address fields will be
69 77set to the channel's src and dst addresses.
70 In case there are no TX buffers available, the function will block until 78
71 one becomes available (i.e. until the remote processor consumes 79In case there are no TX buffers available, the function will block until
72 a tx buffer and puts it back on virtio's used descriptor ring), 80one becomes available (i.e. until the remote processor consumes
73 or a timeout of 15 seconds elapses. When the latter happens, 81a tx buffer and puts it back on virtio's used descriptor ring),
74 -ERESTARTSYS is returned. 82or a timeout of 15 seconds elapses. When the latter happens,
75 The function can only be called from a process context (for now). 83-ERESTARTSYS is returned.
76 Returns 0 on success and an appropriate error value on failure. 84
85The function can only be called from a process context (for now).
86Returns 0 on success and an appropriate error value on failure.
87
88::
77 89
78 int rpmsg_sendto(struct rpmsg_channel *rpdev, void *data, int len, u32 dst); 90 int rpmsg_sendto(struct rpmsg_channel *rpdev, void *data, int len, u32 dst);
79 - sends a message across to the remote processor on a given channel, 91
80 to a destination address provided by the caller. 92sends a message across to the remote processor on a given channel,
81 The caller should specify the channel, the data it wants to send, 93to a destination address provided by the caller.
82 its length (in bytes), and an explicit destination address. 94
83 The message will then be sent to the remote processor to which the 95The caller should specify the channel, the data it wants to send,
84 channel belongs, using the channel's src address, and the user-provided 96its length (in bytes), and an explicit destination address.
85 dst address (thus the channel's dst address will be ignored). 97
86 98The message will then be sent to the remote processor to which the
87 In case there are no TX buffers available, the function will block until 99channel belongs, using the channel's src address, and the user-provided
88 one becomes available (i.e. until the remote processor consumes 100dst address (thus the channel's dst address will be ignored).
89 a tx buffer and puts it back on virtio's used descriptor ring), 101
90 or a timeout of 15 seconds elapses. When the latter happens, 102In case there are no TX buffers available, the function will block until
91 -ERESTARTSYS is returned. 103one becomes available (i.e. until the remote processor consumes
92 The function can only be called from a process context (for now). 104a tx buffer and puts it back on virtio's used descriptor ring),
93 Returns 0 on success and an appropriate error value on failure. 105or a timeout of 15 seconds elapses. When the latter happens,
106-ERESTARTSYS is returned.
107
108The function can only be called from a process context (for now).
109Returns 0 on success and an appropriate error value on failure.
110
111::
94 112
95 int rpmsg_send_offchannel(struct rpmsg_channel *rpdev, u32 src, u32 dst, 113 int rpmsg_send_offchannel(struct rpmsg_channel *rpdev, u32 src, u32 dst,
96 void *data, int len); 114 void *data, int len);
97 - sends a message across to the remote processor, using the src and dst 115
98 addresses provided by the user. 116
99 The caller should specify the channel, the data it wants to send, 117sends a message across to the remote processor, using the src and dst
100 its length (in bytes), and explicit source and destination addresses. 118addresses provided by the user.
101 The message will then be sent to the remote processor to which the 119
102 channel belongs, but the channel's src and dst addresses will be 120The caller should specify the channel, the data it wants to send,
103 ignored (and the user-provided addresses will be used instead). 121its length (in bytes), and explicit source and destination addresses.
104 122The message will then be sent to the remote processor to which the
105 In case there are no TX buffers available, the function will block until 123channel belongs, but the channel's src and dst addresses will be
106 one becomes available (i.e. until the remote processor consumes 124ignored (and the user-provided addresses will be used instead).
107 a tx buffer and puts it back on virtio's used descriptor ring), 125
108 or a timeout of 15 seconds elapses. When the latter happens, 126In case there are no TX buffers available, the function will block until
109 -ERESTARTSYS is returned. 127one becomes available (i.e. until the remote processor consumes
110 The function can only be called from a process context (for now). 128a tx buffer and puts it back on virtio's used descriptor ring),
111 Returns 0 on success and an appropriate error value on failure. 129or a timeout of 15 seconds elapses. When the latter happens,
130-ERESTARTSYS is returned.
131
132The function can only be called from a process context (for now).
133Returns 0 on success and an appropriate error value on failure.
134
135::
112 136
113 int rpmsg_trysend(struct rpmsg_channel *rpdev, void *data, int len); 137 int rpmsg_trysend(struct rpmsg_channel *rpdev, void *data, int len);
114 - sends a message across to the remote processor on a given channel.
115 The caller should specify the channel, the data it wants to send,
116 and its length (in bytes). The message will be sent on the specified
117 channel, i.e. its source and destination address fields will be
118 set to the channel's src and dst addresses.
119 138
120 In case there are no TX buffers available, the function will immediately 139sends a message across to the remote processor on a given channel.
121 return -ENOMEM without waiting until one becomes available. 140The caller should specify the channel, the data it wants to send,
122 The function can only be called from a process context (for now). 141and its length (in bytes). The message will be sent on the specified
123 Returns 0 on success and an appropriate error value on failure. 142channel, i.e. its source and destination address fields will be
143set to the channel's src and dst addresses.
144
145In case there are no TX buffers available, the function will immediately
146return -ENOMEM without waiting until one becomes available.
147
148The function can only be called from a process context (for now).
149Returns 0 on success and an appropriate error value on failure.
150
151::
124 152
125 int rpmsg_trysendto(struct rpmsg_channel *rpdev, void *data, int len, u32 dst) 153 int rpmsg_trysendto(struct rpmsg_channel *rpdev, void *data, int len, u32 dst)
126 - sends a message across to the remote processor on a given channel, 154
127 to a destination address provided by the user. 155
128 The user should specify the channel, the data it wants to send, 156sends a message across to the remote processor on a given channel,
129 its length (in bytes), and an explicit destination address. 157to a destination address provided by the user.
130 The message will then be sent to the remote processor to which the 158
131 channel belongs, using the channel's src address, and the user-provided 159The user should specify the channel, the data it wants to send,
132 dst address (thus the channel's dst address will be ignored). 160its length (in bytes), and an explicit destination address.
133 161
134 In case there are no TX buffers available, the function will immediately 162The message will then be sent to the remote processor to which the
135 return -ENOMEM without waiting until one becomes available. 163channel belongs, using the channel's src address, and the user-provided
136 The function can only be called from a process context (for now). 164dst address (thus the channel's dst address will be ignored).
137 Returns 0 on success and an appropriate error value on failure. 165
166In case there are no TX buffers available, the function will immediately
167return -ENOMEM without waiting until one becomes available.
168
169The function can only be called from a process context (for now).
170Returns 0 on success and an appropriate error value on failure.
171
172::
138 173
139 int rpmsg_trysend_offchannel(struct rpmsg_channel *rpdev, u32 src, u32 dst, 174 int rpmsg_trysend_offchannel(struct rpmsg_channel *rpdev, u32 src, u32 dst,
140 void *data, int len); 175 void *data, int len);
141 - sends a message across to the remote processor, using source and 176
142 destination addresses provided by the user. 177
143 The user should specify the channel, the data it wants to send, 178sends a message across to the remote processor, using source and
144 its length (in bytes), and explicit source and destination addresses. 179destination addresses provided by the user.
145 The message will then be sent to the remote processor to which the 180
146 channel belongs, but the channel's src and dst addresses will be 181The user should specify the channel, the data it wants to send,
147 ignored (and the user-provided addresses will be used instead). 182its length (in bytes), and explicit source and destination addresses.
148 183The message will then be sent to the remote processor to which the
149 In case there are no TX buffers available, the function will immediately 184channel belongs, but the channel's src and dst addresses will be
150 return -ENOMEM without waiting until one becomes available. 185ignored (and the user-provided addresses will be used instead).
151 The function can only be called from a process context (for now). 186
152 Returns 0 on success and an appropriate error value on failure. 187In case there are no TX buffers available, the function will immediately
188return -ENOMEM without waiting until one becomes available.
189
190The function can only be called from a process context (for now).
191Returns 0 on success and an appropriate error value on failure.
192
193::
153 194
154 struct rpmsg_endpoint *rpmsg_create_ept(struct rpmsg_channel *rpdev, 195 struct rpmsg_endpoint *rpmsg_create_ept(struct rpmsg_channel *rpdev,
155 void (*cb)(struct rpmsg_channel *, void *, int, void *, u32), 196 void (*cb)(struct rpmsg_channel *, void *, int, void *, u32),
156 void *priv, u32 addr); 197 void *priv, u32 addr);
157 - every rpmsg address in the system is bound to an rx callback (so when 198
158 inbound messages arrive, they are dispatched by the rpmsg bus using the 199every rpmsg address in the system is bound to an rx callback (so when
159 appropriate callback handler) by means of an rpmsg_endpoint struct. 200inbound messages arrive, they are dispatched by the rpmsg bus using the
160 201appropriate callback handler) by means of an rpmsg_endpoint struct.
161 This function allows drivers to create such an endpoint, and by that, 202
162 bind a callback, and possibly some private data too, to an rpmsg address 203This function allows drivers to create such an endpoint, and by that,
163 (either one that is known in advance, or one that will be dynamically 204bind a callback, and possibly some private data too, to an rpmsg address
164 assigned for them). 205(either one that is known in advance, or one that will be dynamically
165 206assigned for them).
166 Simple rpmsg drivers need not call rpmsg_create_ept, because an endpoint 207
167 is already created for them when they are probed by the rpmsg bus 208Simple rpmsg drivers need not call rpmsg_create_ept, because an endpoint
168 (using the rx callback they provide when they registered to the rpmsg bus). 209is already created for them when they are probed by the rpmsg bus
169 210(using the rx callback they provide when they registered to the rpmsg bus).
170 So things should just work for simple drivers: they already have an 211
171 endpoint, their rx callback is bound to their rpmsg address, and when 212So things should just work for simple drivers: they already have an
172 relevant inbound messages arrive (i.e. messages which their dst address 213endpoint, their rx callback is bound to their rpmsg address, and when
173 equals to the src address of their rpmsg channel), the driver's handler 214relevant inbound messages arrive (i.e. messages which their dst address
174 is invoked to process it. 215equals to the src address of their rpmsg channel), the driver's handler
175 216is invoked to process it.
176 That said, more complicated drivers might need to allocate 217
177 additional rpmsg addresses, and bind them to different rx callbacks. 218That said, more complicated drivers might need to allocate
178 To accomplish that, those drivers need to call this function. 219additional rpmsg addresses, and bind them to different rx callbacks.
179 Drivers should provide their channel (so the new endpoint would bind 220To accomplish that, those drivers need to call this function.
180 to the same remote processor their channel belongs to), an rx callback 221Drivers should provide their channel (so the new endpoint would bind
181 function, optional private data (which is provided back when the 222to the same remote processor their channel belongs to), an rx callback
182 rx callback is invoked), and an address they want to bind with the 223function, optional private data (which is provided back when the
183 callback. If addr is RPMSG_ADDR_ANY, then rpmsg_create_ept will 224rx callback is invoked), and an address they want to bind with the
184 dynamically assign them an available rpmsg address (drivers should have 225callback. If addr is RPMSG_ADDR_ANY, then rpmsg_create_ept will
185 a very good reason why not to always use RPMSG_ADDR_ANY here). 226dynamically assign them an available rpmsg address (drivers should have
186 227a very good reason why not to always use RPMSG_ADDR_ANY here).
187 Returns a pointer to the endpoint on success, or NULL on error. 228
229Returns a pointer to the endpoint on success, or NULL on error.
230
231::
188 232
189 void rpmsg_destroy_ept(struct rpmsg_endpoint *ept); 233 void rpmsg_destroy_ept(struct rpmsg_endpoint *ept);
190 - destroys an existing rpmsg endpoint. The user should provide a pointer 234
191 to an rpmsg endpoint that was previously created with rpmsg_create_ept(). 235
236destroys an existing rpmsg endpoint. The user should provide a pointer
237to an rpmsg endpoint that was previously created with rpmsg_create_ept().
238
239::
192 240
193 int register_rpmsg_driver(struct rpmsg_driver *rpdrv); 241 int register_rpmsg_driver(struct rpmsg_driver *rpdrv);
194 - registers an rpmsg driver with the rpmsg bus. The user should provide 242
195 a pointer to an rpmsg_driver struct, which contains the driver's 243
196 ->probe() and ->remove() functions, an rx callback, and an id_table 244registers an rpmsg driver with the rpmsg bus. The user should provide
197 specifying the names of the channels this driver wants to 245a pointer to an rpmsg_driver struct, which contains the driver's
198 be probed with. 246->probe() and ->remove() functions, an rx callback, and an id_table
247specifying the names of the channels this driver wants to
248be probed with.
249
250::
199 251
200 void unregister_rpmsg_driver(struct rpmsg_driver *rpdrv); 252 void unregister_rpmsg_driver(struct rpmsg_driver *rpdrv);
201 - unregisters an rpmsg driver from the rpmsg bus. The user should provide
202 a pointer to a previously-registered rpmsg_driver struct.
203 Returns 0 on success, and an appropriate error value on failure.
204 253
205 254
2063. Typical usage 255unregisters an rpmsg driver from the rpmsg bus. The user should provide
256a pointer to a previously-registered rpmsg_driver struct.
257Returns 0 on success, and an appropriate error value on failure.
258
259
260Typical usage
261=============
207 262
208The following is a simple rpmsg driver that sends a "hello!" message 263The following is a simple rpmsg driver that sends a "hello!" message
209on probe(), and whenever it receives an incoming message, it dumps its 264on probe(), and whenever it receives an incoming message, it dumps its
210content to the console. 265content to the console.
211 266
212#include <linux/kernel.h> 267::
213#include <linux/module.h> 268
214#include <linux/rpmsg.h> 269 #include <linux/kernel.h>
270 #include <linux/module.h>
271 #include <linux/rpmsg.h>
215 272
216static void rpmsg_sample_cb(struct rpmsg_channel *rpdev, void *data, int len, 273 static void rpmsg_sample_cb(struct rpmsg_channel *rpdev, void *data, int len,
217 void *priv, u32 src) 274 void *priv, u32 src)
218{ 275 {
219 print_hex_dump(KERN_INFO, "incoming message:", DUMP_PREFIX_NONE, 276 print_hex_dump(KERN_INFO, "incoming message:", DUMP_PREFIX_NONE,
220 16, 1, data, len, true); 277 16, 1, data, len, true);
221} 278 }
222 279
223static int rpmsg_sample_probe(struct rpmsg_channel *rpdev) 280 static int rpmsg_sample_probe(struct rpmsg_channel *rpdev)
224{ 281 {
225 int err; 282 int err;
226 283
227 dev_info(&rpdev->dev, "chnl: 0x%x -> 0x%x\n", rpdev->src, rpdev->dst); 284 dev_info(&rpdev->dev, "chnl: 0x%x -> 0x%x\n", rpdev->src, rpdev->dst);
@@ -234,32 +291,35 @@ static int rpmsg_sample_probe(struct rpmsg_channel *rpdev)
234 } 291 }
235 292
236 return 0; 293 return 0;
237} 294 }
238 295
239static void rpmsg_sample_remove(struct rpmsg_channel *rpdev) 296 static void rpmsg_sample_remove(struct rpmsg_channel *rpdev)
240{ 297 {
241 dev_info(&rpdev->dev, "rpmsg sample client driver is removed\n"); 298 dev_info(&rpdev->dev, "rpmsg sample client driver is removed\n");
242} 299 }
243 300
244static struct rpmsg_device_id rpmsg_driver_sample_id_table[] = { 301 static struct rpmsg_device_id rpmsg_driver_sample_id_table[] = {
245 { .name = "rpmsg-client-sample" }, 302 { .name = "rpmsg-client-sample" },
246 { }, 303 { },
247}; 304 };
248MODULE_DEVICE_TABLE(rpmsg, rpmsg_driver_sample_id_table); 305 MODULE_DEVICE_TABLE(rpmsg, rpmsg_driver_sample_id_table);
249 306
250static struct rpmsg_driver rpmsg_sample_client = { 307 static struct rpmsg_driver rpmsg_sample_client = {
251 .drv.name = KBUILD_MODNAME, 308 .drv.name = KBUILD_MODNAME,
252 .id_table = rpmsg_driver_sample_id_table, 309 .id_table = rpmsg_driver_sample_id_table,
253 .probe = rpmsg_sample_probe, 310 .probe = rpmsg_sample_probe,
254 .callback = rpmsg_sample_cb, 311 .callback = rpmsg_sample_cb,
255 .remove = rpmsg_sample_remove, 312 .remove = rpmsg_sample_remove,
256}; 313 };
257module_rpmsg_driver(rpmsg_sample_client); 314 module_rpmsg_driver(rpmsg_sample_client);
315
316.. note::
258 317
259Note: a similar sample which can be built and loaded can be found 318 a similar sample which can be built and loaded can be found
260in samples/rpmsg/. 319 in samples/rpmsg/.
261 320
2624. Allocations of rpmsg channels: 321Allocations of rpmsg channels
322=============================
263 323
264At this point we only support dynamic allocations of rpmsg channels. 324At this point we only support dynamic allocations of rpmsg channels.
265 325
diff --git a/Documentation/sgi-ioc4.txt b/Documentation/sgi-ioc4.txt
index 876c96ae38db..72709222d3c0 100644
--- a/Documentation/sgi-ioc4.txt
+++ b/Documentation/sgi-ioc4.txt
@@ -1,3 +1,7 @@
1====================================
2SGI IOC4 PCI (multi function) device
3====================================
4
1The SGI IOC4 PCI device is a bit of a strange beast, so some notes on 5The SGI IOC4 PCI device is a bit of a strange beast, so some notes on
2it are in order. 6it are in order.
3 7
diff --git a/Documentation/siphash.txt b/Documentation/siphash.txt
index 908d348ff777..9965821ab333 100644
--- a/Documentation/siphash.txt
+++ b/Documentation/siphash.txt
@@ -1,6 +1,8 @@
1 SipHash - a short input PRF 1===========================
2----------------------------------------------- 2SipHash - a short input PRF
3Written by Jason A. Donenfeld <jason@zx2c4.com> 3===========================
4
5:Author: Written by Jason A. Donenfeld <jason@zx2c4.com>
4 6
5SipHash is a cryptographically secure PRF -- a keyed hash function -- that 7SipHash is a cryptographically secure PRF -- a keyed hash function -- that
6performs very well for short inputs, hence the name. It was designed by 8performs very well for short inputs, hence the name. It was designed by
@@ -13,58 +15,61 @@ an input buffer or several input integers. It spits out an integer that is
13indistinguishable from random. You may then use that integer as part of secure 15indistinguishable from random. You may then use that integer as part of secure
14sequence numbers, secure cookies, or mask it off for use in a hash table. 16sequence numbers, secure cookies, or mask it off for use in a hash table.
15 17
161. Generating a key 18Generating a key
19================
17 20
18Keys should always be generated from a cryptographically secure source of 21Keys should always be generated from a cryptographically secure source of
19random numbers, either using get_random_bytes or get_random_once: 22random numbers, either using get_random_bytes or get_random_once::
20 23
21siphash_key_t key; 24 siphash_key_t key;
22get_random_bytes(&key, sizeof(key)); 25 get_random_bytes(&key, sizeof(key));
23 26
24If you're not deriving your key from here, you're doing it wrong. 27If you're not deriving your key from here, you're doing it wrong.
25 28
262. Using the functions 29Using the functions
30===================
27 31
28There are two variants of the function, one that takes a list of integers, and 32There are two variants of the function, one that takes a list of integers, and
29one that takes a buffer: 33one that takes a buffer::
30 34
31u64 siphash(const void *data, size_t len, const siphash_key_t *key); 35 u64 siphash(const void *data, size_t len, const siphash_key_t *key);
32 36
33And: 37And::
34 38
35u64 siphash_1u64(u64, const siphash_key_t *key); 39 u64 siphash_1u64(u64, const siphash_key_t *key);
36u64 siphash_2u64(u64, u64, const siphash_key_t *key); 40 u64 siphash_2u64(u64, u64, const siphash_key_t *key);
37u64 siphash_3u64(u64, u64, u64, const siphash_key_t *key); 41 u64 siphash_3u64(u64, u64, u64, const siphash_key_t *key);
38u64 siphash_4u64(u64, u64, u64, u64, const siphash_key_t *key); 42 u64 siphash_4u64(u64, u64, u64, u64, const siphash_key_t *key);
39u64 siphash_1u32(u32, const siphash_key_t *key); 43 u64 siphash_1u32(u32, const siphash_key_t *key);
40u64 siphash_2u32(u32, u32, const siphash_key_t *key); 44 u64 siphash_2u32(u32, u32, const siphash_key_t *key);
41u64 siphash_3u32(u32, u32, u32, const siphash_key_t *key); 45 u64 siphash_3u32(u32, u32, u32, const siphash_key_t *key);
42u64 siphash_4u32(u32, u32, u32, u32, const siphash_key_t *key); 46 u64 siphash_4u32(u32, u32, u32, u32, const siphash_key_t *key);
43 47
44If you pass the generic siphash function something of a constant length, it 48If you pass the generic siphash function something of a constant length, it
45will constant fold at compile-time and automatically choose one of the 49will constant fold at compile-time and automatically choose one of the
46optimized functions. 50optimized functions.
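The constant-folding into the fixed-width wrappers is a property of the kernel's inlined helpers; the underlying algorithm itself is small. As a rough userspace sketch of SipHash-2-4 (little-endian host assumed; the function name `siphash24` is hypothetical — the kernel's authoritative implementation lives in lib/siphash.c):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define ROTL64(x, b) (uint64_t)(((x) << (b)) | ((x) >> (64 - (b))))

#define SIPROUND do {                                              \
    v0 += v1; v1 = ROTL64(v1, 13); v1 ^= v0; v0 = ROTL64(v0, 32); \
    v2 += v3; v3 = ROTL64(v3, 16); v3 ^= v2;                      \
    v0 += v3; v3 = ROTL64(v3, 21); v3 ^= v0;                      \
    v2 += v1; v1 = ROTL64(v1, 17); v1 ^= v2; v2 = ROTL64(v2, 32); \
} while (0)

/* SipHash-2-4 over an arbitrary buffer (little-endian host assumed). */
uint64_t siphash24(const void *data, size_t len, const uint64_t key[2])
{
    uint64_t v0 = 0x736f6d6570736575ULL ^ key[0];
    uint64_t v1 = 0x646f72616e646f6dULL ^ key[1];
    uint64_t v2 = 0x6c7967656e657261ULL ^ key[0];
    uint64_t v3 = 0x7465646279746573ULL ^ key[1];
    const unsigned char *p = data;
    const unsigned char *end = p + (len & ~(size_t)7);
    uint64_t b = (uint64_t)len << 56;   /* length byte rides in the final word */

    for (; p != end; p += 8) {          /* 2 compression rounds per 8-byte word */
        uint64_t m;
        memcpy(&m, p, 8);
        v3 ^= m; SIPROUND; SIPROUND; v0 ^= m;
    }
    for (size_t i = 0; i < (len & 7); i++)
        b |= (uint64_t)p[i] << (8 * i); /* leftover tail bytes, little-endian */
    v3 ^= b; SIPROUND; SIPROUND; v0 ^= b;

    v2 ^= 0xff;                         /* 4 finalization rounds */
    SIPROUND; SIPROUND; SIPROUND; SIPROUND;
    return v0 ^ v1 ^ v2 ^ v3;
}
```

The "2-4" in the name is visible here: two SIPROUNDs per message word, four at finalization. When the kernel's generic siphash() sees a compile-time-constant length it simply dispatches to the matching siphash_Nu64()/Nu32() wrapper over this same core.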
47 51
483. Hashtable key function usage: 52Hashtable key function usage::
49 53
50struct some_hashtable { 54 struct some_hashtable {
51 DECLARE_HASHTABLE(hashtable, 8); 55 DECLARE_HASHTABLE(hashtable, 8);
52 siphash_key_t key; 56 siphash_key_t key;
53}; 57 };
54 58
55void init_hashtable(struct some_hashtable *table) 59 void init_hashtable(struct some_hashtable *table)
56{ 60 {
57 get_random_bytes(&table->key, sizeof(table->key)); 61 get_random_bytes(&table->key, sizeof(table->key));
58} 62 }
59 63
60static inline hlist_head *some_hashtable_bucket(struct some_hashtable *table, struct interesting_input *input) 64 static inline hlist_head *some_hashtable_bucket(struct some_hashtable *table, struct interesting_input *input)
61{ 65 {
62 return &table->hashtable[siphash(input, sizeof(*input), &table->key) & (HASH_SIZE(table->hashtable) - 1)]; 66 return &table->hashtable[siphash(input, sizeof(*input), &table->key) & (HASH_SIZE(table->hashtable) - 1)];
63} 67 }
64 68
65You may then iterate like usual over the returned hash bucket. 69You may then iterate like usual over the returned hash bucket.
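The bucket expression above relies on the table size being a power of two, so masking the 64-bit hash with HASH_SIZE() - 1 is equivalent to a modulo. A minimal userspace sketch of just that indexing step (names and TABLE_BITS are illustrative, not the kernel API):

```c
#include <stdint.h>
#include <stddef.h>

#define TABLE_BITS 8                    /* mirrors DECLARE_HASHTABLE(hashtable, 8) */
#define TABLE_SIZE ((size_t)1 << TABLE_BITS)

/* Map a 64-bit SipHash output to a bucket index by masking off the
 * low bits; for power-of-two sizes this equals h % TABLE_SIZE. */
static size_t bucket_index(uint64_t h)
{
    return (size_t)(h & (TABLE_SIZE - 1));
}
```

Because every bit of a SipHash output is uniformly distributed, keeping only the low TABLE_BITS bits still gives an attacker-unpredictable bucket, which is the whole point of keying the table.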
66 70
674. Security 71Security
72========
68 73
69SipHash has a very high security margin, with its 128-bit key. So long as the 74SipHash has a very high security margin, with its 128-bit key. So long as the
70key is kept secret, it is impossible for an attacker to guess the outputs of 75key is kept secret, it is impossible for an attacker to guess the outputs of
@@ -73,7 +78,8 @@ is significant.
73 78
74Linux implements the "2-4" variant of SipHash. 79Linux implements the "2-4" variant of SipHash.
75 80
765. Struct-passing Pitfalls 81Struct-passing Pitfalls
82=======================
77 83
78Often the XuY functions will not be large enough, and instead you'll 84Often the XuY functions will not be large enough, and instead you'll
79want to pass a pre-filled struct to siphash. When doing this, it's important 85want to pass a pre-filled struct to siphash. When doing this, it's important
@@ -81,30 +87,32 @@ to always ensure the struct has no padding holes. The easiest way to do this
81is to simply arrange the members of the struct in descending order of size, 87is to simply arrange the members of the struct in descending order of size,
82and to use offsetofend() instead of sizeof() for getting the size. For 88and to use offsetofend() instead of sizeof() for getting the size. For
83performance reasons, if possible, it's probably a good thing to align the 89performance reasons, if possible, it's probably a good thing to align the
84struct to the right boundary. Here's an example: 90struct to the right boundary. Here's an example::
85 91
86const struct { 92 const struct {
87 struct in6_addr saddr; 93 struct in6_addr saddr;
88 u32 counter; 94 u32 counter;
89 u16 dport; 95 u16 dport;
90} __aligned(SIPHASH_ALIGNMENT) combined = { 96 } __aligned(SIPHASH_ALIGNMENT) combined = {
91 .saddr = *(struct in6_addr *)saddr, 97 .saddr = *(struct in6_addr *)saddr,
92 .counter = counter, 98 .counter = counter,
93 .dport = dport 99 .dport = dport
94}; 100 };
95u64 h = siphash(&combined, offsetofend(typeof(combined), dport), &secret); 101 u64 h = siphash(&combined, offsetofend(typeof(combined), dport), &secret);
96 102
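The reason for offsetofend() over sizeof() is trailing padding: hashing sizeof() bytes would feed compiler-inserted (and possibly uninitialized) pad bytes into the hash. A userspace sketch of the layout arithmetic, with offsetofend() defined by hand (the kernel provides it) and a scalar stand-in for struct in6_addr, assuming the usual LP64 padding rules:

```c
#include <stdint.h>
#include <stddef.h>

/* Same definition the kernel uses: offset of the member plus its size. */
#define offsetofend(type, member) \
    (offsetof(type, member) + sizeof(((type *)0)->member))

struct combined {
    uint64_t saddr;    /* stand-in for the larger struct in6_addr */
    uint32_t counter;
    uint16_t dport;
    /* members in descending size order => the only hole is tail padding */
};
```

On an LP64 target the members occupy bytes 0-13, but the struct is padded to 16 bytes for 8-byte alignment; offsetofend(struct combined, dport) stops at 14, so the pad bytes never reach siphash().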
976. Resources 103Resources
104=========
98 105
99Read the SipHash paper if you're interested in learning more: 106Read the SipHash paper if you're interested in learning more:
100https://131002.net/siphash/siphash.pdf 107https://131002.net/siphash/siphash.pdf
101 108
109-------------------------------------------------------------------------------
102 110
103~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~ 111===============================================
104
105HalfSipHash - SipHash's insecure younger cousin 112HalfSipHash - SipHash's insecure younger cousin
106----------------------------------------------- 113===============================================
107Written by Jason A. Donenfeld <jason@zx2c4.com> 114
115:Author: Written by Jason A. Donenfeld <jason@zx2c4.com>
108 116
109On the off-chance that SipHash is not fast enough for your needs, you might be 117On the off-chance that SipHash is not fast enough for your needs, you might be
110able to justify using HalfSipHash, a terrifying but potentially useful 118able to justify using HalfSipHash, a terrifying but potentially useful
@@ -120,7 +128,8 @@ then when you can be absolutely certain that the outputs will never be
120transmitted out of the kernel. This is only remotely useful over `jhash` as a 128transmitted out of the kernel. This is only remotely useful over `jhash` as a
121means of mitigating hashtable flooding denial of service attacks. 129means of mitigating hashtable flooding denial of service attacks.
122 130
1231. Generating a key 131Generating a key
132================
124 133
125Keys should always be generated from a cryptographically secure source of 134Keys should always be generated from a cryptographically secure source of
126random numbers, either using get_random_bytes or get_random_once: 135random numbers, either using get_random_bytes or get_random_once:
@@ -130,44 +139,49 @@ get_random_bytes(&key, sizeof(key));
130 139
131If you're not deriving your key from here, you're doing it wrong. 140If you're not deriving your key from here, you're doing it wrong.
132 141
1332. Using the functions 142Using the functions
143===================
134 144
135There are two variants of the function, one that takes a list of integers, and 145There are two variants of the function, one that takes a list of integers, and
136one that takes a buffer: 146one that takes a buffer::
137 147
138u32 hsiphash(const void *data, size_t len, const hsiphash_key_t *key); 148 u32 hsiphash(const void *data, size_t len, const hsiphash_key_t *key);
139 149
140And: 150And::
141 151
142u32 hsiphash_1u32(u32, const hsiphash_key_t *key); 152 u32 hsiphash_1u32(u32, const hsiphash_key_t *key);
143u32 hsiphash_2u32(u32, u32, const hsiphash_key_t *key); 153 u32 hsiphash_2u32(u32, u32, const hsiphash_key_t *key);
144u32 hsiphash_3u32(u32, u32, u32, const hsiphash_key_t *key); 154 u32 hsiphash_3u32(u32, u32, u32, const hsiphash_key_t *key);
145u32 hsiphash_4u32(u32, u32, u32, u32, const hsiphash_key_t *key); 155 u32 hsiphash_4u32(u32, u32, u32, u32, const hsiphash_key_t *key);
146 156
147If you pass the generic hsiphash function something of a constant length, it 157If you pass the generic hsiphash function something of a constant length, it
148will constant fold at compile-time and automatically choose one of the 158will constant fold at compile-time and automatically choose one of the
149optimized functions. 159optimized functions.
150 160
1513. Hashtable key function usage: 161Hashtable key function usage
162============================
163
164::
152 165
153struct some_hashtable { 166 struct some_hashtable {
154 DECLARE_HASHTABLE(hashtable, 8); 167 DECLARE_HASHTABLE(hashtable, 8);
155 hsiphash_key_t key; 168 hsiphash_key_t key;
156}; 169 };
157 170
158void init_hashtable(struct some_hashtable *table) 171 void init_hashtable(struct some_hashtable *table)
159{ 172 {
160 get_random_bytes(&table->key, sizeof(table->key)); 173 get_random_bytes(&table->key, sizeof(table->key));
161} 174 }
162 175
163static inline hlist_head *some_hashtable_bucket(struct some_hashtable *table, struct interesting_input *input) 176 static inline hlist_head *some_hashtable_bucket(struct some_hashtable *table, struct interesting_input *input)
164{ 177 {
165 return &table->hashtable[hsiphash(input, sizeof(*input), &table->key) & (HASH_SIZE(table->hashtable) - 1)]; 178 return &table->hashtable[hsiphash(input, sizeof(*input), &table->key) & (HASH_SIZE(table->hashtable) - 1)];
166} 179 }
167 180
168You may then iterate like usual over the returned hash bucket. 181You may then iterate like usual over the returned hash bucket.
169 182
1704. Performance 183Performance
184===========
171 185
172HalfSipHash is roughly 3 times slower than JenkinsHash. For many replacements, 186HalfSipHash is roughly 3 times slower than JenkinsHash. For many replacements,
173this will not be a problem, as the hashtable lookup isn't the bottleneck. And 187this will not be a problem, as the hashtable lookup isn't the bottleneck. And
diff --git a/Documentation/smsc_ece1099.txt b/Documentation/smsc_ece1099.txt
index 6b492e82b43d..079277421eaf 100644
--- a/Documentation/smsc_ece1099.txt
+++ b/Documentation/smsc_ece1099.txt
@@ -1,3 +1,7 @@
1=================================================
2Msc Keyboard Scan Expansion/GPIO Expansion device
3=================================================
4
1What is smsc-ece1099? 5What is smsc-ece1099?
2---------------------- 6----------------------
3 7
diff --git a/Documentation/static-keys.txt b/Documentation/static-keys.txt
index ef419fd0897f..b83dfa1c0602 100644
--- a/Documentation/static-keys.txt
+++ b/Documentation/static-keys.txt
@@ -1,30 +1,34 @@
1 Static Keys 1===========
2 ----------- 2Static Keys
3===========
3 4
4DEPRECATED API: 5.. warning::
5 6
6The use of 'struct static_key' directly, is now DEPRECATED. In addition 7 DEPRECATED API:
7static_key_{true,false}() is also DEPRECATED. IE DO NOT use the following:
8 8
9struct static_key false = STATIC_KEY_INIT_FALSE; 9 The use of 'struct static_key' directly, is now DEPRECATED. In addition
10struct static_key true = STATIC_KEY_INIT_TRUE; 10 static_key_{true,false}() is also DEPRECATED. IE DO NOT use the following::
11static_key_true()
12static_key_false()
13 11
14The updated API replacements are: 12 struct static_key false = STATIC_KEY_INIT_FALSE;
13 struct static_key true = STATIC_KEY_INIT_TRUE;
14 static_key_true()
15 static_key_false()
15 16
16DEFINE_STATIC_KEY_TRUE(key); 17 The updated API replacements are::
17DEFINE_STATIC_KEY_FALSE(key);
18DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);
19DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count);
20static_branch_likely()
21static_branch_unlikely()
22 18
230) Abstract 19 DEFINE_STATIC_KEY_TRUE(key);
20 DEFINE_STATIC_KEY_FALSE(key);
21 DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);
22 DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count);
23 static_branch_likely()
24 static_branch_unlikely()
25
26Abstract
27========
24 28
25Static keys allow the inclusion of seldom used features in 29Static keys allow the inclusion of seldom used features in
26performance-sensitive fast-path kernel code, via a GCC feature and a code 30performance-sensitive fast-path kernel code, via a GCC feature and a code
27patching technique. A quick example: 31patching technique. A quick example::
28 32
29 DEFINE_STATIC_KEY_FALSE(key); 33 DEFINE_STATIC_KEY_FALSE(key);
30 34
@@ -45,7 +49,8 @@ The static_branch_unlikely() branch will be generated into the code with as litt
45impact to the likely code path as possible. 49impact to the likely code path as possible.
46 50
47 51
481) Motivation 52Motivation
53==========
49 54
50 55
51Currently, tracepoints are implemented using a conditional branch. The 56Currently, tracepoints are implemented using a conditional branch. The
@@ -60,7 +65,8 @@ possible. Although tracepoints are the original motivation for this work, other
60kernel code paths should be able to make use of the static keys facility. 65kernel code paths should be able to make use of the static keys facility.
61 66
62 67
632) Solution 68Solution
69========
64 70
65 71
66gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label: 72gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label:
@@ -71,7 +77,7 @@ Using the 'asm goto', we can create branches that are either taken or not taken
71by default, without the need to check memory. Then, at run-time, we can patch 77by default, without the need to check memory. Then, at run-time, we can patch
72the branch site to change the branch direction. 78the branch site to change the branch direction.
73 79
74For example, if we have a simple branch that is disabled by default: 80For example, if we have a simple branch that is disabled by default::
75 81
76 if (static_branch_unlikely(&key)) 82 if (static_branch_unlikely(&key))
77 printk("I am the true branch\n"); 83 printk("I am the true branch\n");
@@ -87,14 +93,15 @@ optimization.
87This lowlevel patching mechanism is called 'jump label patching', and it gives 93This lowlevel patching mechanism is called 'jump label patching', and it gives
88the basis for the static keys facility. 94the basis for the static keys facility.
89 95
903) Static key label API, usage and examples: 96Static key label API, usage and examples
97========================================
91 98
92 99
93In order to make use of this optimization you must first define a key: 100In order to make use of this optimization you must first define a key::
94 101
95 DEFINE_STATIC_KEY_TRUE(key); 102 DEFINE_STATIC_KEY_TRUE(key);
96 103
97or: 104or::
98 105
99 DEFINE_STATIC_KEY_FALSE(key); 106 DEFINE_STATIC_KEY_FALSE(key);
100 107
@@ -102,14 +109,14 @@ or:
102The key must be global, that is, it can't be allocated on the stack or dynamically 109The key must be global, that is, it can't be allocated on the stack or dynamically
103allocated at run-time. 110allocated at run-time.
104 111
105The key is then used in code as: 112The key is then used in code as::
106 113
107 if (static_branch_unlikely(&key)) 114 if (static_branch_unlikely(&key))
108 do unlikely code 115 do unlikely code
109 else 116 else
110 do likely code 117 do likely code
111 118
112Or: 119Or::
113 120
114 if (static_branch_likely(&key)) 121 if (static_branch_likely(&key))
115 do likely code 122 do likely code
@@ -120,15 +127,15 @@ Keys defined via DEFINE_STATIC_KEY_TRUE(), or DEFINE_STATIC_KEY_FALSE, may
120be used in either static_branch_likely() or static_branch_unlikely() 127be used in either static_branch_likely() or static_branch_unlikely()
121statements. 128statements.
122 129
123Branch(es) can be set true via: 130Branch(es) can be set true via::
124 131
125static_branch_enable(&key); 132 static_branch_enable(&key);
126 133
127or false via: 134or false via::
128 135
129static_branch_disable(&key); 136 static_branch_disable(&key);
130 137
131The branch(es) can then be switched via reference counts: 138The branch(es) can then be switched via reference counts::
132 139
133 static_branch_inc(&key); 140 static_branch_inc(&key);
134 ... 141 ...
@@ -142,11 +149,11 @@ static_branch_inc(), will change the branch back to true. Likewise, if the
142key is initialized false, a 'static_branch_inc()' will change the branch to 149key is initialized false, a 'static_branch_inc()' will change the branch to
143true. And then a 'static_branch_dec()' will again make the branch false. 150true. And then a 'static_branch_dec()' will again make the branch false.
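The reference-count semantics described above can be modelled in userspace. This toy sketch only mirrors the counting rules — the real API patches the branch instruction itself rather than testing a counter, and every `fake_*` name here is hypothetical:

```c
#include <stdbool.h>

/* Toy model of static-key reference counting: a key defined true starts
 * at count 1, a key defined false at count 0; the branch is "enabled"
 * whenever the count is positive. */
struct fake_static_key { int count; };

#define FAKE_STATIC_KEY_INIT_TRUE  { .count = 1 }
#define FAKE_STATIC_KEY_INIT_FALSE { .count = 0 }

static void fake_static_branch_inc(struct fake_static_key *k) { k->count++; }
static void fake_static_branch_dec(struct fake_static_key *k) { k->count--; }

static bool fake_static_branch_enabled(const struct fake_static_key *k)
{
    return k->count > 0;
}
```

In the real facility the enabled/disabled state is baked into the instruction stream, so fake_static_branch_enabled() corresponds to which way the patched jump currently points, not to a load-and-test at the call site.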
144 151
145Where an array of keys is required, it can be defined as: 152Where an array of keys is required, it can be defined as::
146 153
147 DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count); 154 DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);
148 155
149or: 156or::
150 157
151 DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count); 158 DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count);
152 159
@@ -159,96 +166,98 @@ simply fall back to a traditional, load, test, and jump sequence. Also, the
159struct jump_entry table must be at least 4-byte aligned because the 166struct jump_entry table must be at least 4-byte aligned because the
160static_key->entry field makes use of the two least significant bits. 167static_key->entry field makes use of the two least significant bits.
161 168
162* select HAVE_ARCH_JUMP_LABEL, see: arch/x86/Kconfig 169* ``select HAVE_ARCH_JUMP_LABEL``,
163 170 see: arch/x86/Kconfig
164* #define JUMP_LABEL_NOP_SIZE, see: arch/x86/include/asm/jump_label.h
165 171
166* __always_inline bool arch_static_branch(struct static_key *key, bool branch), see: 172* ``#define JUMP_LABEL_NOP_SIZE``,
167 arch/x86/include/asm/jump_label.h 173 see: arch/x86/include/asm/jump_label.h
168 174
169* __always_inline bool arch_static_branch_jump(struct static_key *key, bool branch), 175* ``__always_inline bool arch_static_branch(struct static_key *key, bool branch)``,
170 see: arch/x86/include/asm/jump_label.h 176 see: arch/x86/include/asm/jump_label.h
171 177
172* void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type), 178* ``__always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)``,
173 see: arch/x86/kernel/jump_label.c 179 see: arch/x86/include/asm/jump_label.h
174 180
175* __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type), 181* ``void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type)``,
176 see: arch/x86/kernel/jump_label.c 182 see: arch/x86/kernel/jump_label.c
177 183
184* ``__init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type)``,
185 see: arch/x86/kernel/jump_label.c
178 186
179* struct jump_entry, see: arch/x86/include/asm/jump_label.h 187* ``struct jump_entry``,
188 see: arch/x86/include/asm/jump_label.h
180 189
181 190
1825) Static keys / jump label analysis, results (x86_64): 1915) Static keys / jump label analysis, results (x86_64):
183 192
184 193
185As an example, let's add the following branch to 'getppid()', such that the 194As an example, let's add the following branch to 'getppid()', such that the
186system call now looks like: 195system call now looks like::
187 196
188SYSCALL_DEFINE0(getppid) 197 SYSCALL_DEFINE0(getppid)
189{ 198 {
190 int pid; 199 int pid;
191 200
192+ if (static_branch_unlikely(&key)) 201 + if (static_branch_unlikely(&key))
193+ printk("I am the true branch\n"); 202 + printk("I am the true branch\n");
194 203
195 rcu_read_lock(); 204 rcu_read_lock();
196 pid = task_tgid_vnr(rcu_dereference(current->real_parent)); 205 pid = task_tgid_vnr(rcu_dereference(current->real_parent));
197 rcu_read_unlock(); 206 rcu_read_unlock();
198 207
199 return pid; 208 return pid;
200} 209 }
201 210
202The resulting instructions with jump labels generated by GCC are: 211The resulting instructions with jump labels generated by GCC are::
203 212
204ffffffff81044290 <sys_getppid>: 213 ffffffff81044290 <sys_getppid>:
205ffffffff81044290: 55 push %rbp 214 ffffffff81044290: 55 push %rbp
206ffffffff81044291: 48 89 e5 mov %rsp,%rbp 215 ffffffff81044291: 48 89 e5 mov %rsp,%rbp
207ffffffff81044294: e9 00 00 00 00 jmpq ffffffff81044299 <sys_getppid+0x9> 216 ffffffff81044294: e9 00 00 00 00 jmpq ffffffff81044299 <sys_getppid+0x9>
208ffffffff81044299: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax 217 ffffffff81044299: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax
209ffffffff810442a0: 00 00 218 ffffffff810442a0: 00 00
210ffffffff810442a2: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax 219 ffffffff810442a2: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax
211ffffffff810442a9: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax 220 ffffffff810442a9: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax
212ffffffff810442b0: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi 221 ffffffff810442b0: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi
213ffffffff810442b7: e8 f4 d9 00 00 callq ffffffff81051cb0 <pid_vnr> 222 ffffffff810442b7: e8 f4 d9 00 00 callq ffffffff81051cb0 <pid_vnr>
214ffffffff810442bc: 5d pop %rbp 223 ffffffff810442bc: 5d pop %rbp
215ffffffff810442bd: 48 98 cltq 224 ffffffff810442bd: 48 98 cltq
216ffffffff810442bf: c3 retq 225 ffffffff810442bf: c3 retq
217ffffffff810442c0: 48 c7 c7 e3 54 98 81 mov $0xffffffff819854e3,%rdi 226 ffffffff810442c0: 48 c7 c7 e3 54 98 81 mov $0xffffffff819854e3,%rdi
218ffffffff810442c7: 31 c0 xor %eax,%eax 227 ffffffff810442c7: 31 c0 xor %eax,%eax
219ffffffff810442c9: e8 71 13 6d 00 callq ffffffff8171563f <printk> 228 ffffffff810442c9: e8 71 13 6d 00 callq ffffffff8171563f <printk>
220ffffffff810442ce: eb c9 jmp ffffffff81044299 <sys_getppid+0x9> 229 ffffffff810442ce: eb c9 jmp ffffffff81044299 <sys_getppid+0x9>
221 230
222Without the jump label optimization it looks like: 231Without the jump label optimization it looks like::
223 232
224ffffffff810441f0 <sys_getppid>: 233 ffffffff810441f0 <sys_getppid>:
225ffffffff810441f0: 8b 05 8a 52 d8 00 mov 0xd8528a(%rip),%eax # ffffffff81dc9480 <key> 234 ffffffff810441f0: 8b 05 8a 52 d8 00 mov 0xd8528a(%rip),%eax # ffffffff81dc9480 <key>
226ffffffff810441f6: 55 push %rbp 235 ffffffff810441f6: 55 push %rbp
227ffffffff810441f7: 48 89 e5 mov %rsp,%rbp 236 ffffffff810441f7: 48 89 e5 mov %rsp,%rbp
228ffffffff810441fa: 85 c0 test %eax,%eax 237 ffffffff810441fa: 85 c0 test %eax,%eax
229ffffffff810441fc: 75 27 jne ffffffff81044225 <sys_getppid+0x35> 238 ffffffff810441fc: 75 27 jne ffffffff81044225 <sys_getppid+0x35>
230ffffffff810441fe: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax 239 ffffffff810441fe: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax
231ffffffff81044205: 00 00 240 ffffffff81044205: 00 00
232ffffffff81044207: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax 241 ffffffff81044207: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax
233ffffffff8104420e: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax 242 ffffffff8104420e: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax
234ffffffff81044215: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi 243 ffffffff81044215: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi
235ffffffff8104421c: e8 2f da 00 00 callq ffffffff81051c50 <pid_vnr> 244 ffffffff8104421c: e8 2f da 00 00 callq ffffffff81051c50 <pid_vnr>
236ffffffff81044221: 5d pop %rbp 245 ffffffff81044221: 5d pop %rbp
237ffffffff81044222: 48 98 cltq 246 ffffffff81044222: 48 98 cltq
238ffffffff81044224: c3 retq 247 ffffffff81044224: c3 retq
239ffffffff81044225: 48 c7 c7 13 53 98 81 mov $0xffffffff81985313,%rdi 248 ffffffff81044225: 48 c7 c7 13 53 98 81 mov $0xffffffff81985313,%rdi
240ffffffff8104422c: 31 c0 xor %eax,%eax 249 ffffffff8104422c: 31 c0 xor %eax,%eax
241ffffffff8104422e: e8 60 0f 6d 00 callq ffffffff81715193 <printk> 250 ffffffff8104422e: e8 60 0f 6d 00 callq ffffffff81715193 <printk>
242ffffffff81044233: eb c9 jmp ffffffff810441fe <sys_getppid+0xe> 251 ffffffff81044233: eb c9 jmp ffffffff810441fe <sys_getppid+0xe>
243ffffffff81044235: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1) 252 ffffffff81044235: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1)
244ffffffff8104423c: 00 00 00 00 253 ffffffff8104423c: 00 00 00 00
245 254
246Thus, the disabled jump label case adds a 'mov', 'test' and 'jne' instruction 255Thus, the disabled jump label case adds a 'mov', 'test' and 'jne' instruction
247vs. the jump label case just has a 'no-op' or 'jmp 0'. (The jmp 0, is patched 256vs. the jump label case just has a 'no-op' or 'jmp 0'. (The jmp 0, is patched
248to a 5 byte atomic no-op instruction at boot-time.) Thus, the disabled jump 257to a 5 byte atomic no-op instruction at boot-time.) Thus, the disabled jump
249label case adds: 258label case adds::
250 259
2516 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 additional bytes. 260 6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 additional bytes.
252 261
253If we then include the padding bytes, the jump label code saves 16 total bytes 262If we then include the padding bytes, the jump label code saves 16 total bytes
254of instruction memory for this small function. In this case the non-jump label 263of instruction memory for this small function. In this case the non-jump label
@@ -262,7 +271,7 @@ Since there are a number of static key API uses in the scheduler paths,
262'pipe-test' (also known as 'perf bench sched pipe') can be used to show the 271'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
263performance improvement. Testing done on 3.3.0-rc2: 272performance improvement. Testing done on 3.3.0-rc2:
264 273
265jump label disabled: 274jump label disabled::
266 275
267 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs): 276 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
268 277
@@ -279,7 +288,7 @@ jump label disabled:
279 288
280 1.601607384 seconds time elapsed ( +- 0.07% ) 289 1.601607384 seconds time elapsed ( +- 0.07% )
281 290
282jump label enabled: 291jump label enabled::
283 292
284 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs): 293 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
285 294
diff --git a/Documentation/svga.txt b/Documentation/svga.txt
index cd66ec836e4f..119f1515b1ac 100644
--- a/Documentation/svga.txt
+++ b/Documentation/svga.txt
@@ -1,24 +1,31 @@
1 Video Mode Selection Support 2.13 1.. include:: <isonum.txt>
2 (c) 1995--1999 Martin Mares, <mj@ucw.cz>
3--------------------------------------------------------------------------------
4 2
51. Intro 3=================================
6~~~~~~~~ 4Video Mode Selection Support 2.13
7 This small document describes the "Video Mode Selection" feature which 5=================================
6
7:Copyright: |copy| 1995--1999 Martin Mares, <mj@ucw.cz>
8
9Intro
10~~~~~
11
12This small document describes the "Video Mode Selection" feature which
8allows the use of various special video modes supported by the video BIOS. Due 13allows the use of various special video modes supported by the video BIOS. Due
9to usage of the BIOS, the selection is limited to boot time (before the 14to usage of the BIOS, the selection is limited to boot time (before the
10kernel decompression starts) and works only on 80X86 machines. 15kernel decompression starts) and works only on 80X86 machines.
11 16
12 ** Short intro for the impatient: Just use vga=ask for the first time, 17.. note::
13 ** enter `scan' on the video mode prompt, pick the mode you want to use,
14 ** remember its mode ID (the four-digit hexadecimal number) and then
15 ** set the vga parameter to this number (converted to decimal first).
16 18
17 The video mode to be used is selected by a kernel parameter which can be 19 Short intro for the impatient: Just use vga=ask for the first time,
20 enter ``scan`` on the video mode prompt, pick the mode you want to use,
21 remember its mode ID (the four-digit hexadecimal number) and then
22 set the vga parameter to this number (converted to decimal first).
23
24The video mode to be used is selected by a kernel parameter which can be
18specified in the kernel Makefile (the SVGA_MODE=... line) or by the "vga=..." 25specified in the kernel Makefile (the SVGA_MODE=... line) or by the "vga=..."
19option of LILO (or some other boot loader you use) or by the "vidmode" utility 26option of LILO (or some other boot loader you use) or by the "vidmode" utility
20(present in standard Linux utility packages). You can use the following values 27(present in standard Linux utility packages). You can use the following values
21of this parameter: 28of this parameter::
22 29
23 NORMAL_VGA - Standard 80x25 mode available on all display adapters. 30 NORMAL_VGA - Standard 80x25 mode available on all display adapters.
24 31
@@ -37,77 +44,79 @@ of this parameter:
37 for exact meaning of the ID). Warning: rdev and LILO don't support 44 for exact meaning of the ID). Warning: rdev and LILO don't support
38 hexadecimal numbers -- you have to convert it to decimal manually. 45 hexadecimal numbers -- you have to convert it to decimal manually.
39 46
402. Menu 47Menu
41~~~~~~~ 48~~~~
42 The ASK_VGA mode causes the kernel to offer a video mode menu upon 49
50The ASK_VGA mode causes the kernel to offer a video mode menu upon
43bootup. It displays a "Press <RETURN> to see video modes available, <SPACE> 51bootup. It displays a "Press <RETURN> to see video modes available, <SPACE>
44to continue or wait 30 secs" message. If you press <RETURN>, you enter the 52to continue or wait 30 secs" message. If you press <RETURN>, you enter the
45menu, if you press <SPACE> or wait 30 seconds, the kernel will boot up in 53menu, if you press <SPACE> or wait 30 seconds, the kernel will boot up in
46the standard 80x25 mode. 54the standard 80x25 mode.
47 55
48 The menu looks like: 56The menu looks like::
49 57
50Video adapter: <name-of-detected-video-adapter> 58 Video adapter: <name-of-detected-video-adapter>
51Mode: COLSxROWS: 59 Mode: COLSxROWS:
520 0F00 80x25 60 0 0F00 80x25
531 0F01 80x50 61 1 0F01 80x50
542 0F02 80x43 62 2 0F02 80x43
553 0F03 80x26 63 3 0F03 80x26
56.... 64 ....
57Enter mode number or `scan': <flashing-cursor-here> 65 Enter mode number or ``scan``: <flashing-cursor-here>
58 66
59 <name-of-detected-video-adapter> tells you which video adapter Linux detected 67<name-of-detected-video-adapter> tells you which video adapter Linux detected
-- it's either a generic adapter name (MDA, CGA, HGC, EGA, VGA, VESA VGA [a VGA
with VESA-compliant BIOS]) or a chipset name (e.g., Trident). Direct detection
of chipsets is turned off by default (see CONFIG_VIDEO_SVGA in chapter 4 to see
how to enable it if you really want) as it's inherently unreliable due to
absolutely insane PC design.

"0 0F00 80x25" means that the first menu item (the menu items are numbered
from "0" to "9" and from "a" to "z") is an 80x25 mode with ID=0x0f00 (see the
next section for a description of mode IDs).

<flashing-cursor-here> encourages you to enter the item number or mode ID
you wish to set and press <RETURN>. If the computer complains something about
"Unknown mode ID", it is trying to tell you that it isn't possible to set such
a mode. It's also possible to press only <RETURN> which leaves the current mode.

The mode list usually contains a few basic modes and some VESA modes. In
case your chipset has been detected, some chipset-specific modes are shown as
well (some of these might be missing or unusable on your machine as different
BIOSes are often shipped with the same card and the mode numbers depend purely
on the VGA BIOS).

The modes displayed on the menu are partially sorted: The list starts with
the standard modes (80x25 and 80x50) followed by "special" modes (80x28 and
80x43), local modes (if the local modes feature is enabled), VESA modes and
finally SVGA modes for the auto-detected adapter.

If you are not happy with the mode list offered (e.g., if you think your card
is able to do more), you can enter "scan" instead of item number / mode ID. The
program will try to ask the BIOS for all possible video mode numbers and test
what happens then. The screen will probably be flashing wildly for some time and
strange noises will be heard from inside the monitor and so on and then, really
all consistent video modes supported by your BIOS will appear (plus maybe some
``ghost modes``). If you are afraid this could damage your monitor, don't use
this function.

After scanning, the mode ordering is a bit different: the auto-detected SVGA
modes are not listed at all and the modes revealed by ``scan`` are shown before
all VESA modes.

Mode IDs
~~~~~~~~

Because of the complexity of all the video stuff, the video mode IDs
used here are also a bit complex. A video mode ID is a 16-bit number usually
expressed in a hexadecimal notation (starting with "0x"). You can set a mode
by entering its ID directly if you know it even if it isn't shown on the menu.

The ID numbers can be divided into these regions::

  0x0000 to 0x00ff - menu item references. 0x0000 is the first item. Don't use
     outside the menu as this can change from boot to boot (especially if you
     have used the ``scan`` feature).

  0x0100 to 0x017f - standard BIOS modes. The ID is a BIOS video mode number
     (as presented to INT 10, function 00) increased by 0x0100.
@@ -142,53 +151,54 @@ The ID numbers can be divided to three regions:
  0xffff  equivalent to 0x0f00 (standard 80x25)
  0xfffe  equivalent to 0x0f01 (EGA 80x43 or VGA 80x50)

If you add 0x8000 to the mode ID, the program will try to recalculate
vertical display timing according to mode parameters, which can be used to
eliminate some annoying bugs of certain VGA BIOSes (usually those used for
cards with S3 chipsets and old Cirrus Logic BIOSes) -- mainly extra lines at the
end of the display.

Options
~~~~~~~

Some options can be set in the source text (in arch/i386/boot/video.S).
All of them are simple #define's -- change them to #undef's when you want to
switch them off. Currently supported:

CONFIG_VIDEO_SVGA - enables autodetection of SVGA cards. This is switched
off by default as it's a bit unreliable due to terribly bad PC design. If you
really want to have the adapter autodetected (maybe in case the ``scan`` feature
doesn't work on your machine), switch this on and don't cry if the results
are not completely sane. In case you really need this feature, please drop me
a mail as I think of removing it some day.

CONFIG_VIDEO_VESA - enables autodetection of VESA modes. If it doesn't work
on your machine (or displays an "Error: Scanning of VESA modes failed" message),
you can switch it off and report it as a bug.

CONFIG_VIDEO_COMPACT - enables compacting of the video mode list. If there
are more modes with the same screen size, only the first one is kept (see above
for more info on mode ordering). However, in very strange cases it's possible
that the first "version" of the mode doesn't work although some of the others
do -- in this case turn this switch off to see the rest.

CONFIG_VIDEO_RETAIN - enables retaining of screen contents when switching
video modes. Works only with some boot loaders which leave enough room for the
buffer. (If you have old LILO, you can adjust heap_end_ptr and loadflags
in setup.S, but it's better to upgrade the boot loader...)

CONFIG_VIDEO_LOCAL - enables inclusion of "local modes" in the list. The
local modes are added automatically to the beginning of the list not depending
on hardware configuration. The local modes are listed in the source text after
the "local_mode_table:" line. The comment before this line describes the format
of the table (which also includes a video card name to be displayed on the
top of the menu).

CONFIG_VIDEO_400_HACK - force setting of 400 scan lines for standard VGA
modes. This option is intended to be used on certain buggy BIOSes which draw
some useless logo using font download and then fail to reset the correct mode.
Don't use unless needed as it forces resetting the video card.

CONFIG_VIDEO_GFX_HACK - includes special hack for setting of graphics modes
to be used later by special drivers (e.g., 800x600 on IBM ThinkPad -- see
ftp://ftp.phys.keio.ac.jp/pub/XFree86/800x600/XF86Configs/XF86Config.IBM_TP560).
Allows setting _any_ BIOS mode including graphic ones and forcing specific
@@ -196,33 +206,36 @@ text screen resolution instead of peeking it from BIOS variables. Don't use
unless you think you know what you're doing. To activate this setup, use
mode number 0x0f08 (see section 3).

Still doesn't work?
~~~~~~~~~~~~~~~~~~~

When the mode detection doesn't work (e.g., the mode list is incorrect or
the machine hangs instead of displaying the menu), try to switch off some of
the configuration options listed in section 4. If it fails, you can still use
your kernel with the video mode set directly via the kernel parameter.

In either case, please send me a bug report containing what _exactly_
happens and how the configuration switches affect the behaviour of the bug.

If you start Linux from M$-DOS, you might also use some DOS tools for
video mode setting. In this case, you must specify the 0x0f04 mode ("leave
current settings") to Linux, because if you don't and you use any non-standard
mode, Linux will switch to 80x25 automatically.

If you set some extended mode and there's one or more extra lines on the
bottom of the display containing already scrolled-out text, your VGA BIOS
contains the most common video BIOS bug called "incorrect vertical display
end setting". Adding 0x8000 to the mode ID might fix the problem. Unfortunately,
this must be done manually -- no autodetection mechanisms are available.

If you have a VGA card and your display still looks as on EGA, your BIOS
is probably broken and you need to set the CONFIG_VIDEO_400_HACK switch to
force setting of the correct mode.

History
~~~~~~~

=============== ================================================================
1.0 (??-Nov-95) First version supporting all adapters supported by the old
                setup.S + Cirrus Logic 54XX. Present in some 1.3.4? kernels
                and then removed due to instability on some machines.
@@ -260,17 +273,18 @@ force setting of the correct mode.
                original version written by hhanemaa@cs.ruu.nl, patched by
                Jeff Chua, rewritten by me).
                - Screen store/restore fixed.
2.8 (14-Apr-96) - Previous release was not compilable without CONFIG_VIDEO_SVGA.
                - Better recognition of text modes during mode scan.
2.9 (12-May-96) - Ignored VESA modes 0x80 - 0xff (more VESA BIOS bugs!)
2.10(11-Nov-96) - The whole thing made optional.
                - Added the CONFIG_VIDEO_400_HACK switch.
                - Added the CONFIG_VIDEO_GFX_HACK switch.
                - Code cleanup.
2.11(03-May-97) - Yet another cleanup, now including also the documentation.
                - Direct testing of SVGA adapters turned off by default, ``scan``
                  offered explicitly on the prompt line.
                - Removed the doc section describing adding of new probing
                  functions as I try to get rid of _all_ hardware probing here.
2.12(25-May-98) Added support for VESA frame buffer graphics.
2.13(14-May-99) Minor documentation fixes.
=============== ================================================================
diff --git a/Documentation/tee.txt b/Documentation/tee.txt
index 718599357596..56ea85ffebf2 100644
--- a/Documentation/tee.txt
+++ b/Documentation/tee.txt
@@ -1,4 +1,7 @@
=============
TEE subsystem
=============

This document describes the TEE subsystem in Linux.

A TEE (Trusted Execution Environment) is a trusted OS running in some
@@ -80,27 +83,27 @@ The GlobalPlatform TEE Client API [5] is implemented on top of the generic
TEE API.

Picture of the relationship between the different components in the
OP-TEE architecture::

   User space                 Kernel                  Secure world
   ~~~~~~~~~~                 ~~~~~~                  ~~~~~~~~~~~~
   +--------+                                         +-------------+
   | Client |                                         | Trusted     |
   +--------+                                         | Application |
      /\                                              +-------------+
      ||                +----------+                        /\
      ||                |tee-      |                        ||
      ||                |supplicant|                        \/
      ||                +----------+                  +-------------+
      \/                     /\                       | TEE Internal|
   +-------+                 ||                       | API         |
   + TEE   |                 ||   +--------+--------+ +-------------+
   | Client|                 ||   | TEE    | OP-TEE | | OP-TEE      |
   | API   |                 \/   | subsys | driver | | Trusted OS  |
   +-------+------------------+---+--------+--------+-+-------------+
   |      Generic TEE API        |    |        OP-TEE MSG           |
   |  IOCTL (TEE_IOC_*)          |    | SMCCC (OPTEE_SMC_CALL_*)    |
   +-----------------------------+    +-----------------------------+

RPCs (Remote Procedure Calls) are requests from secure world to kernel driver
or tee-supplicant. An RPC is identified by a special range of SMCCC return
@@ -109,10 +112,16 @@ kernel are handled by the kernel driver. Other RPC messages will be forwarded to
tee-supplicant without further involvement of the driver, except switching
shared memory buffer representation.

References
==========

[1] https://github.com/OP-TEE/optee_os

[2] http://infocenter.arm.com/help/topic/com.arm.doc.den0028a/index.html

[3] drivers/tee/optee/optee_smc.h

[4] drivers/tee/optee/optee_msg.h

[5] http://www.globalplatform.org/specificationsdevice.asp look for
    "TEE Client API Specification v1.0" and click download.
diff --git a/Documentation/this_cpu_ops.txt b/Documentation/this_cpu_ops.txt
index 2cbf71975381..5cb8b883ae83 100644
--- a/Documentation/this_cpu_ops.txt
+++ b/Documentation/this_cpu_ops.txt
@@ -1,5 +1,9 @@
===================
this_cpu operations
===================

:Author: Christoph Lameter, August 4th, 2014
:Author: Pranith Kumar, Aug 2nd, 2014

this_cpu operations are a way of optimizing access to per cpu
variables associated with the *currently* executing processor. This is
@@ -39,7 +43,7 @@ operations.

The following this_cpu() operations with implied preemption protection
are defined. These operations can be used without worrying about
preemption and interrupts::

    this_cpu_read(pcp)
    this_cpu_write(pcp, val)
@@ -67,14 +71,14 @@ to relocate a per cpu relative address to the proper per cpu area for
the processor. So the relocation to the per cpu base is encoded in the
instruction via a segment register prefix.

For example::

    DEFINE_PER_CPU(int, x);
    int z;

    z = this_cpu_read(x);

results in a single instruction::

    mov ax, gs:[x]

@@ -84,16 +88,16 @@ this_cpu_ops such sequence also required preempt disable/enable to
prevent the kernel from moving the thread to a different processor
while the calculation is performed.

Consider the following this_cpu operation::

    this_cpu_inc(x)

The above results in the following single instruction (no lock prefix!)::

    inc gs:[x]

instead of the following operations required if there is no segment
register::

    int *y;
    int cpu;
@@ -121,8 +125,10 @@ has to be paid for this optimization is the need to add up the per cpu
counters when the value of a counter is needed.


Special operations
------------------

::

    y = this_cpu_ptr(&x)

@@ -153,11 +159,15 @@ Therefore the use of x or &x outside of the context of per cpu
operations is invalid and will generally be treated like a NULL
pointer dereference.

::

    DEFINE_PER_CPU(int, x);

In the context of per cpu operations the above implies that x is a per
cpu variable. Most this_cpu operations take a cpu variable.

::

    int __percpu *p = &x;

&x and hence p is the *offset* of a per cpu variable. this_cpu_ptr()
@@ -168,7 +178,7 @@ strange.
Operations on a field of a per cpu structure
--------------------------------------------

Let's say we have a percpu structure::

    struct s {
        int n,m;
@@ -177,14 +187,14 @@ Let's say we have a percpu structure
    DEFINE_PER_CPU(struct s, p);


Operations on these fields are straightforward::

    this_cpu_inc(p.m)

    z = this_cpu_cmpxchg(p.m, 0, 1);


If we have an offset to struct s::

    struct s __percpu *ps = &p;

@@ -194,7 +204,7 @@ If we have an offset to struct s:


The calculation of the pointer may require the use of this_cpu_ptr()
if we do not make use of this_cpu ops later to manipulate fields::

    struct s *pp;

@@ -206,7 +216,7 @@ if we do not make use of this_cpu ops later to manipulate fields:


Variants of this_cpu ops
------------------------

this_cpu ops are interrupt safe. Some architectures do not support
these per cpu local operations. In that case the operation must be
@@ -222,7 +232,7 @@ preemption. If a per cpu variable is not used in an interrupt context
and the scheduler cannot preempt, then they are safe. If any interrupts
still occur while an operation is in progress and if the interrupt too
modifies the variable, then RMW actions can not be guaranteed to be
safe::

    __this_cpu_read(pcp)
    __this_cpu_write(pcp, val)
@@ -279,7 +289,7 @@ unless absolutely necessary. Please consider using an IPI to wake up
the remote CPU and perform the update to its per cpu area.

To access per-cpu data structure remotely, typically the per_cpu_ptr()
function is used::


    DEFINE_PER_CPU(struct data, datap);
@@ -289,7 +299,7 @@ function is used:
This makes it explicit that we are getting ready to access a percpu
area remotely.

You can also do the following to convert the datap offset to an address::

    struct data *p = this_cpu_ptr(&datap);

@@ -305,7 +315,7 @@ the following scenario that occurs because two per cpu variables
share a cache-line but the relaxed synchronization is applied to
only one process updating the cache-line.

Consider the following example::


    struct test {
@@ -327,6 +337,3 @@ mind that a remote write will evict the cache line from the processor
that most likely will access it. If the processor wakes up and finds a
missing local cache line of a per cpu area, its performance and hence
the wake up times will be affected.
diff --git a/Documentation/unaligned-memory-access.txt b/Documentation/unaligned-memory-access.txt
index 3f76c0c37920..51b4ff031586 100644
--- a/Documentation/unaligned-memory-access.txt
+++ b/Documentation/unaligned-memory-access.txt
@@ -1,6 +1,15 @@
=========================
UNALIGNED MEMORY ACCESSES
=========================

:Author: Daniel Drake <dsd@gentoo.org>,
:Author: Johannes Berg <johannes@sipsolutions.net>

:With help from: Alan Cox, Avuton Olrich, Heikki Orsila, Jan Engelhardt,
  Kyle McMartin, Kyle Moffett, Randy Dunlap, Robert Hancock, Uli Kunitz,
  Vadim Lobanov


Linux runs on a wide variety of architectures which have varying behaviour
when it comes to memory access. This document presents some details about
unaligned accesses, why you need to write code that doesn't cause them,
@@ -73,7 +82,7 @@ memory addresses of certain variables, etc.

Fortunately things are not too complex, as in most cases, the compiler
ensures that things will work for you. For example, take the following
structure::

    struct foo {
        u16 field1;
@@ -106,7 +115,7 @@ On a related topic, with the above considerations in mind you may observe
that you could reorder the fields in the structure in order to place fields
where padding would otherwise be inserted, and hence reduce the overall
resident memory size of structure instances. The optimal layout of the
above example is::

    struct foo {
        u32 field2;
@@ -139,21 +148,21 @@ Code that causes unaligned access
With the above in mind, let's move onto a real life example of a function
that can cause an unaligned memory access. The following function taken
from include/linux/etherdevice.h is an optimized routine to compare two
ethernet MAC addresses for equality::

    bool ether_addr_equal(const u8 *addr1, const u8 *addr2)
    {
    #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
            u32 fold = ((*(const u32 *)addr1) ^ (*(const u32 *)addr2)) |
                       ((*(const u16 *)(addr1 + 4)) ^ (*(const u16 *)(addr2 + 4)));

            return fold == 0;
    #else
            const u16 *a = (const u16 *)addr1;
            const u16 *b = (const u16 *)addr2;
            return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) == 0;
    #endif
    }

In the above function, when the hardware has efficient unaligned access
capability, there is no issue with this code. But when the hardware isn't
@@ -171,7 +180,8 @@ as it is a decent optimization for the cases when you can ensure alignment,
which is true almost all of the time in ethernet networking context.


174Here is another example of some code that could cause unaligned accesses: 183Here is another example of some code that could cause unaligned accesses::
184
175 void myfunc(u8 *data, u32 value) 185 void myfunc(u8 *data, u32 value)
176 { 186 {
177 [...] 187 [...]
@@ -184,6 +194,7 @@ to an address that is not evenly divisible by 4.
 
 In summary, the 2 main scenarios where you may run into unaligned access
 problems involve:
+
  1. Casting variables to types of different lengths
  2. Pointer arithmetic followed by access to at least 2 bytes of data
 
@@ -195,7 +206,7 @@ The easiest way to avoid unaligned access is to use the get_unaligned() and
 put_unaligned() macros provided by the <asm/unaligned.h> header file.
 
 Going back to an earlier example of code that potentially causes unaligned
-access:
+access::
 
 	void myfunc(u8 *data, u32 value)
 	{
@@ -204,7 +215,7 @@ access:
 		[...]
 	}
 
-To avoid the unaligned memory access, you would rewrite it as follows:
+To avoid the unaligned memory access, you would rewrite it as follows::
 
 	void myfunc(u8 *data, u32 value)
 	{
@@ -215,7 +226,7 @@ To avoid the unaligned memory access, you would rewrite it as follows:
 	}
 
 The get_unaligned() macro works similarly. Assuming 'data' is a pointer to
-memory and you wish to avoid unaligned access, its usage is as follows:
+memory and you wish to avoid unaligned access, its usage is as follows::
 
 	u32 value = get_unaligned((u32 *) data);
 
@@ -245,18 +256,10 @@ For some ethernet hardware that cannot DMA to unaligned addresses like
 4*n+2 or non-ethernet hardware, this can be a problem, and it is then
 required to copy the incoming frame into an aligned buffer. Because this is
 unnecessary on architectures that can do unaligned accesses, the code can be
-made dependent on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS like so:
-
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-	skb = original skb
-#else
-	skb = copy skb
-#endif
-
---
-Authors: Daniel Drake <dsd@gentoo.org>,
-         Johannes Berg <johannes@sipsolutions.net>
-With help from: Alan Cox, Avuton Olrich, Heikki Orsila, Jan Engelhardt,
-Kyle McMartin, Kyle Moffett, Randy Dunlap, Robert Hancock, Uli Kunitz,
-Vadim Lobanov
+made dependent on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS like so::
 
+	#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	skb = original skb
+	#else
+	skb = copy skb
+	#endif
diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt
index e5e57b40f8af..1b3950346532 100644
--- a/Documentation/vfio-mediated-device.txt
+++ b/Documentation/vfio-mediated-device.txt
@@ -1,14 +1,17 @@
1/* 1.. include:: <isonum.txt>
2 * VFIO Mediated devices 2
3 * 3=====================
4 * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved. 4VFIO Mediated devices
5 * Author: Neo Jia <cjia@nvidia.com> 5=====================
6 * Kirti Wankhede <kwankhede@nvidia.com> 6
7 * 7:Copyright: |copy| 2016, NVIDIA CORPORATION. All rights reserved.
8 * This program is free software; you can redistribute it and/or modify 8:Author: Neo Jia <cjia@nvidia.com>
9 * it under the terms of the GNU General Public License version 2 as 9:Author: Kirti Wankhede <kwankhede@nvidia.com>
10 * published by the Free Software Foundation. 10
11 */ 11This program is free software; you can redistribute it and/or modify
12it under the terms of the GNU General Public License version 2 as
13published by the Free Software Foundation.
14
12 15
13Virtual Function I/O (VFIO) Mediated devices[1] 16Virtual Function I/O (VFIO) Mediated devices[1]
14=============================================== 17===============================================
@@ -42,7 +45,7 @@ removes it from a VFIO group.
 
 The following high-level block diagram shows the main components and interfaces
 in the VFIO mediated driver framework. The diagram shows NVIDIA, Intel, and IBM
-devices as examples, as these devices are the first devices to use this module.
+devices as examples, as these devices are the first devices to use this module::
 
  +---------------+
  |               |
@@ -91,7 +94,7 @@ Registration Interface for a Mediated Bus Driver
 ------------------------------------------------
 
 The registration interface for a mediated bus driver provides the following
-structure to represent a mediated device's driver:
+structure to represent a mediated device's driver::
 
 	/*
 	 * struct mdev_driver [2] - Mediated device's driver
@@ -110,14 +113,14 @@ structure to represent a mediated device's driver:
 A mediated bus driver for mdev should use this structure in the function calls
 to register and unregister itself with the core driver:
 
-* Register:
+* Register::
 
 	extern int mdev_register_driver(struct mdev_driver *drv,
 					struct module *owner);
 
-* Unregister:
+* Unregister::
 
 	extern void mdev_unregister_driver(struct mdev_driver *drv);
 
 The mediated bus driver is responsible for adding mediated devices to the VFIO
 group when devices are bound to the driver and removing mediated devices from
@@ -152,15 +155,15 @@ The callbacks in the mdev_parent_ops structure are as follows:
 * mmap: mmap emulation callback
 
 A driver should use the mdev_parent_ops structure in the function call to
-register itself with the mdev core driver:
+register itself with the mdev core driver::
 
-extern int mdev_register_device(struct device *dev,
-				const struct mdev_parent_ops *ops);
+	extern int mdev_register_device(struct device *dev,
+					const struct mdev_parent_ops *ops);
 
 However, the mdev_parent_ops structure is not required in the function call
-that a driver should use to unregister itself with the mdev core driver:
+that a driver should use to unregister itself with the mdev core driver::
 
-extern void mdev_unregister_device(struct device *dev);
+	extern void mdev_unregister_device(struct device *dev);
 
 
 Mediated Device Management Interface Through sysfs
@@ -183,30 +186,32 @@ with the mdev core driver.
 Directories and files under the sysfs for Each Physical Device
 --------------------------------------------------------------
 
-|- [parent physical device]
-|--- Vendor-specific-attributes [optional]
-|--- [mdev_supported_types]
-|     |--- [<type-id>]
-|     |   |--- create
-|     |   |--- name
-|     |   |--- available_instances
-|     |   |--- device_api
-|     |   |--- description
-|     |   |--- [devices]
-|     |--- [<type-id>]
-|     |   |--- create
-|     |   |--- name
-|     |   |--- available_instances
-|     |   |--- device_api
-|     |   |--- description
-|     |   |--- [devices]
-|     |--- [<type-id>]
-|          |--- create
-|          |--- name
-|          |--- available_instances
-|          |--- device_api
-|          |--- description
-|          |--- [devices]
+::
+
+  |- [parent physical device]
+  |--- Vendor-specific-attributes [optional]
+  |--- [mdev_supported_types]
+  |     |--- [<type-id>]
+  |     |   |--- create
+  |     |   |--- name
+  |     |   |--- available_instances
+  |     |   |--- device_api
+  |     |   |--- description
+  |     |   |--- [devices]
+  |     |--- [<type-id>]
+  |     |   |--- create
+  |     |   |--- name
+  |     |   |--- available_instances
+  |     |   |--- device_api
+  |     |   |--- description
+  |     |   |--- [devices]
+  |     |--- [<type-id>]
+  |          |--- create
+  |          |--- name
+  |          |--- available_instances
+  |          |--- device_api
+  |          |--- description
+  |          |--- [devices]
 
 * [mdev_supported_types]
 
@@ -219,12 +224,12 @@ Directories and files under the sysfs for Each Physical Device
 
   The [<type-id>] name is created by adding the device driver string as a prefix
   to the string provided by the vendor driver. This format of this name is as
-  follows:
+  follows::
 
 	sprintf(buf, "%s-%s", dev_driver_string(parent->dev), group->name);
 
   (or using mdev_parent_dev(mdev) to arrive at the parent device outside
   of the core mdev code)
 
 * device_api
 
@@ -239,7 +244,7 @@ Directories and files under the sysfs for Each Physical Device
 * [device]
 
   This directory contains links to the devices of type <type-id> that have been
-created.
+  created.
 
 * name
 
@@ -253,21 +258,25 @@ created.
 Directories and Files Under the sysfs for Each mdev Device
 ----------------------------------------------------------
 
-|- [parent phy device]
-|--- [$MDEV_UUID]
+::
+
+  |- [parent phy device]
+  |--- [$MDEV_UUID]
          |--- remove
          |--- mdev_type {link to its type}
          |--- vendor-specific-attributes [optional]
 
 * remove (write only)
+
 Writing '1' to the 'remove' file destroys the mdev device. The vendor driver can
 fail the remove() callback if that device is active and the vendor driver
 doesn't support hot unplug.
 
-Example:
+Example::
+
 	# echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove
 
-Mediated device Hot plug:
+Mediated device Hot plug
 ------------------------
 
 Mediated devices can be created and assigned at runtime. The procedure to hot
@@ -277,13 +286,13 @@ Translation APIs for Mediated Devices
 =====================================
 
 The following APIs are provided for translating user pfn to host pfn in a VFIO
-driver:
+driver::
 
-extern int vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
-			  int npage, int prot, unsigned long *phys_pfn);
+	extern int vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+				  int npage, int prot, unsigned long *phys_pfn);
 
-extern int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn,
-			    int npage);
+	extern int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn,
+				    int npage);
 
 These functions call back into the back-end IOMMU module by using the pin_pages
 and unpin_pages callbacks of the struct vfio_iommu_driver_ops[4]. Currently
@@ -304,81 +313,80 @@ card.
 
    This step creates a dummy device, /sys/devices/virtual/mtty/mtty/
 
-   Files in this device directory in sysfs are similar to the following:
+   Files in this device directory in sysfs are similar to the following::
 
 	# tree /sys/devices/virtual/mtty/mtty/
 	/sys/devices/virtual/mtty/mtty/
 	|-- mdev_supported_types
 	|   |-- mtty-1
 	|   |   |-- available_instances
 	|   |   |-- create
 	|   |   |-- device_api
 	|   |   |-- devices
 	|   |   `-- name
 	|   `-- mtty-2
 	|       |-- available_instances
 	|       |-- create
 	|       |-- device_api
 	|       |-- devices
 	|       `-- name
 	|-- mtty_dev
 	|   `-- sample_mtty_dev
 	|-- power
 	|   |-- autosuspend_delay_ms
 	|   |-- control
 	|   |-- runtime_active_time
 	|   |-- runtime_status
 	|   `-- runtime_suspended_time
 	|-- subsystem -> ../../../../class/mtty
 	`-- uevent
 
 2. Create a mediated device by using the dummy device that you created in the
-   previous step.
+   previous step::
 
 	# echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" >	\
 		/sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create
 
-3. Add parameters to qemu-kvm.
+3. Add parameters to qemu-kvm::
 
 	-device vfio-pci,\
 	 sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
 
 4. Boot the VM.
 
    In the Linux guest VM, with no hardware on the host, the device appears
-   as follows:
+   as follows::
 
 	# lspci -s 00:05.0 -xxvv
 	00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550])
		Subsystem: Device 4348:3253
		Physical Slot: 5
		Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
			Stepping- SERR- FastB2B- DisINTx-
		Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
			<TAbort- <MAbort- >SERR- <PERR- INTx-
		Interrupt: pin A routed to IRQ 10
		Region 0: I/O ports at c150 [size=8]
		Region 1: I/O ports at c158 [size=8]
		Kernel driver in use: serial
	00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00
	10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00
	20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32
	30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00
 
    In the Linux guest VM, dmesg output for the device is as follows:
 
-	serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ
-	10
-	0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
-	0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A
-
-
-5. In the Linux guest VM, check the serial ports.
-
-	# setserial -g /dev/ttyS*
-	/dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
-	/dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10
-	/dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10
+	serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10
+	0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
+	0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A
+
+
+5. In the Linux guest VM, check the serial ports::
+
+	# setserial -g /dev/ttyS*
+	/dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
+	/dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10
+	/dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10
 
 6. Using minicom or any terminal emulation program, open port /dev/ttyS1 or
    /dev/ttyS2 with hardware flow control disabled.
@@ -388,14 +396,14 @@ card.
 
    Data is loop backed from hosts mtty driver.
 
-8. Destroy the mediated device that you created.
+8. Destroy the mediated device that you created::
 
 	# echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001/remove
 
 References
 ==========
 
-[1] See Documentation/vfio.txt for more information on VFIO.
-[2] struct mdev_driver in include/linux/mdev.h
-[3] struct mdev_parent_ops in include/linux/mdev.h
-[4] struct vfio_iommu_driver_ops in include/linux/vfio.h
+1. See Documentation/vfio.txt for more information on VFIO.
+2. struct mdev_driver in include/linux/mdev.h
+3. struct mdev_parent_ops in include/linux/mdev.h
+4. struct vfio_iommu_driver_ops in include/linux/vfio.h
diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 1dd3fddfd3a1..ef6a5111eaa1 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -1,5 +1,7 @@
-VFIO - "Virtual Function I/O"[1]
--------------------------------------------------------------------------------
+==================================
+VFIO - "Virtual Function I/O" [1]_
+==================================
+
 Many modern system now provide DMA and interrupt remapping facilities
 to help ensure I/O devices behave within the boundaries they've been
 allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
@@ -7,14 +9,14 @@ POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
 systems such as Freescale PAMU.  The VFIO driver is an IOMMU/device
 agnostic framework for exposing direct device access to userspace, in
 a secure, IOMMU protected environment.  In other words, this allows
-safe[2], non-privileged, userspace drivers.
+safe [2]_, non-privileged, userspace drivers.
 
 Why do we want that?  Virtual machines often make use of direct device
 access ("device assignment") when configured for the highest possible
 I/O performance.  From a device and host perspective, this simply
 turns the VM into a userspace driver, with the benefits of
 significantly reduced latency, higher bandwidth, and direct use of
-bare-metal device drivers[3].
+bare-metal device drivers [3]_.
 
 Some applications, particularly in the high performance computing
 field, also benefit from low-overhead, direct device access from
@@ -31,7 +33,7 @@ KVM PCI specific device assignment code as well as provide a more
 secure, more featureful userspace driver environment than UIO.
 
 Groups, Devices, and IOMMUs
--------------------------------------------------------------------------------
+---------------------------
 
 Devices are the main target of any I/O driver.  Devices typically
 create a programming interface made up of I/O access, interrupts,
@@ -114,40 +116,40 @@ well as mechanisms for describing and registering interrupt
 notifications.
 
 VFIO Usage Example
--------------------------------------------------------------------------------
+------------------
 
-Assume user wants to access PCI device 0000:06:0d.0
+Assume user wants to access PCI device 0000:06:0d.0::
 
-$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
-../../../../kernel/iommu_groups/26
+	$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
+	../../../../kernel/iommu_groups/26
 
 This device is therefore in IOMMU group 26.  This device is on the
 pci bus, therefore the user will make use of vfio-pci to manage the
-group:
+group::
 
-# modprobe vfio-pci
+	# modprobe vfio-pci
 
 Binding this device to the vfio-pci driver creates the VFIO group
-character devices for this group:
+character devices for this group::
 
-$ lspci -n -s 0000:06:0d.0
-06:0d.0 0401: 1102:0002 (rev 08)
-# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
-# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
+	$ lspci -n -s 0000:06:0d.0
+	06:0d.0 0401: 1102:0002 (rev 08)
+	# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
+	# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
 
 Now we need to look at what other devices are in the group to free
-it for use by VFIO:
+it for use by VFIO::
 
-$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
-total 0
-lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
-	../../../../devices/pci0000:00/0000:00:1e.0
-lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
-	../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
-lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
-	../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
+	$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
+	total 0
+	lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
+		../../../../devices/pci0000:00/0000:00:1e.0
+	lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
+		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
+	lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
+		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
 
-This device is behind a PCIe-to-PCI bridge[4], therefore we also
+This device is behind a PCIe-to-PCI bridge [4]_, therefore we also
 need to add device 0000:06:0d.1 to the group following the same
 procedure as above.  Device 0000:00:1e.0 is a bridge that does
 not currently have a host driver, therefore it's not required to
@@ -157,12 +159,12 @@ support PCI bridges).
 The final step is to provide the user with access to the group if
 unprivileged operation is desired (note that /dev/vfio/vfio provides
 no capabilities on its own and is therefore expected to be set to
-mode 0666 by the system).
+mode 0666 by the system)::
 
-# chown user:user /dev/vfio/26
+	# chown user:user /dev/vfio/26
 
 The user now has full access to all the devices and the iommu for this
-group and can access them as follows:
+group and can access them as follows::
 
 	int container, group, device, i;
 	struct vfio_group_status group_status =
@@ -248,31 +250,31 @@ VFIO bus driver API
 VFIO bus drivers, such as vfio-pci make use of only a few interfaces
 into VFIO core.  When devices are bound and unbound to the driver,
 the driver should call vfio_add_group_dev() and vfio_del_group_dev()
-respectively:
+respectively::
 
-extern int vfio_add_group_dev(struct iommu_group *iommu_group,
-			      struct device *dev,
-			      const struct vfio_device_ops *ops,
-			      void *device_data);
+	extern int vfio_add_group_dev(struct iommu_group *iommu_group,
+				      struct device *dev,
+				      const struct vfio_device_ops *ops,
+				      void *device_data);
 
-extern void *vfio_del_group_dev(struct device *dev);
+	extern void *vfio_del_group_dev(struct device *dev);
 
 vfio_add_group_dev() indicates to the core to begin tracking the
 specified iommu_group and register the specified dev as owned by
 a VFIO bus driver.  The driver provides an ops structure for callbacks
-similar to a file operations structure:
+similar to a file operations structure::
 
-struct vfio_device_ops {
-	int	(*open)(void *device_data);
-	void	(*release)(void *device_data);
-	ssize_t	(*read)(void *device_data, char __user *buf,
-			size_t count, loff_t *ppos);
-	ssize_t	(*write)(void *device_data, const char __user *buf,
-			size_t size, loff_t *ppos);
-	long	(*ioctl)(void *device_data, unsigned int cmd,
-			unsigned long arg);
-	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
-};
+	struct vfio_device_ops {
+		int	(*open)(void *device_data);
+		void	(*release)(void *device_data);
+		ssize_t	(*read)(void *device_data, char __user *buf,
+				size_t count, loff_t *ppos);
+		ssize_t	(*write)(void *device_data, const char __user *buf,
+				size_t size, loff_t *ppos);
+		long	(*ioctl)(void *device_data, unsigned int cmd,
+				unsigned long arg);
+		int	(*mmap)(void *device_data, struct vm_area_struct *vma);
+	};
 
 Each function is passed the device_data that was originally registered
 in the vfio_add_group_dev() call above.  This allows the bus driver
@@ -285,50 +287,55 @@ own VFIO_DEVICE_GET_REGION_INFO ioctl.
285 287
286 288
287PPC64 sPAPR implementation note 289PPC64 sPAPR implementation note
288------------------------------------------------------------------------------- 290-------------------------------
289 291
This implementation has some specifics:

1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
   container is supported, as an IOMMU table is allocated at boot time,
   one table per IOMMU group which is a Partitionable Endpoint (PE)
   (a PE is often a PCI domain, but not always).

   Newer systems (POWER8 with IODA2) have an improved hardware design
   which removes this limitation and allows multiple IOMMU groups per
   VFIO container.

2) The hardware supports so-called DMA windows - the PCI address ranges
   within which DMA transfers are allowed; any attempt to access address
   space outside a window leads to isolation of the whole PE.

3) PPC64 guests are paravirtualized but not fully emulated. There is an API
   to map/unmap pages for DMA, which normally maps 1..32 pages per call, and
   currently there is no way to reduce the number of calls. To make things
   faster, the map/unmap handling has been implemented in real mode, which
   provides excellent performance but has limitations such as the inability
   to do locked-pages accounting in real time.

4) According to the sPAPR specification, a Partitionable Endpoint (PE) is an
   I/O subtree that can be treated as a unit for the purposes of partitioning
   and error recovery. A PE may be a single- or multi-function IOA (IO
   Adapter), a function of a multi-function IOA, or multiple IOAs (possibly
   including switch and bridge structures above the multiple IOAs). PPC64
   guests detect PCI errors and recover from them via EEH RTAS services,
   which work on the basis of additional ioctl commands.

   So 4 additional ioctls have been added:

      VFIO_IOMMU_SPAPR_TCE_GET_INFO
         returns the size and the start of the DMA window on the PCI bus.

      VFIO_IOMMU_ENABLE
         enables the container. The locked pages accounting
         is done at this point. This lets the user first learn what
         the DMA window is and adjust the rlimit before doing any real job.

      VFIO_IOMMU_DISABLE
         disables the container.

      VFIO_EEH_PE_OP
         provides an API for EEH setup, error detection and recovery.

   The code flow from the example above should be slightly changed::

      struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };

@@ -442,73 +449,73 @@ The code flow from the example above should be slightly changed:
      ....

5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
   VFIO_IOMMU_DISABLE and implements 2 new ioctls:
   VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
   (which are unsupported in the v1 IOMMU).

   PPC64 paravirtualized guests generate a lot of map/unmap requests,
   and the handling of those includes pinning/unpinning pages and updating
   the mm::locked_vm counter to make sure we do not exceed the rlimit.
   The v2 IOMMU splits accounting and pinning into separate operations:

   - The VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
     ioctls receive a user space address and size of the block to be pinned.
     Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected
     to be called with the exact address and size used for registering the
     memory block. Userspace is not expected to call these often. The ranges
     are stored in a linked list in a VFIO container.

   - The VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the
     actual IOMMU table and do not do pinning; instead they check that the
     userspace address is from a pre-registered range.

   This separation helps in optimizing DMA for guests.

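Before pre-registering memory with VFIO_IOMMU_SPAPR_REGISTER_MEMORY, userspace
typically checks (or raises) its memlock limit, since the whole block will be
pinned and charged against it. A minimal userspace sketch of such a check
(Python for brevity; ``memlock_headroom_ok`` is a hypothetical helper, not part
of any VFIO API):

```python
import resource

def memlock_headroom_ok(window_bytes: int) -> bool:
    """Return True if the current RLIMIT_MEMLOCK soft limit can cover
    pinning `window_bytes` of guest memory (unlimited always passes)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY or soft >= window_bytes

# A 2 GB default 32-bit DMA window would need this much locked memory:
window = 2 * 1024 * 1024 * 1024
print("can pin 2GB window:", memlock_headroom_ok(window))
```

Default soft limits are often small, which is why management stacks raise them
before starting a VFIO guest.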
6) The sPAPR specification allows guests to have additional DMA window(s) on
   a PCI bus with a variable page size. Two ioctls have been added to support
   this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
   The platform has to support the functionality or an error will be returned
   to userspace. The existing hardware supports up to 2 DMA windows: one is
   2GB long, uses 4K pages and is called the "default 32bit window"; the
   other can be as big as the entire RAM, can use a different page size, and
   is optional - guests create those at run-time if the guest driver
   supports 64bit DMA.

   VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
   a number of TCE table levels (if a TCE table is going to be big enough,
   the kernel may not be able to allocate enough physically contiguous
   memory). It creates a new window in the available slot and returns the bus
   address where the new window starts. Due to a hardware limitation, the
   user space cannot choose the location of DMA windows.

   VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
   and removes it.

-------------------------------------------------------------------------------

.. [1] VFIO was originally an acronym for "Virtual Function I/O" in its
   initial implementation by Tom Lyon while at Cisco. We've since
   outgrown the acronym, but it's catchy.

.. [2] "safe" also depends upon a device being "well behaved". It's
   possible for multi-function devices to have backdoors between
   functions and even for single-function devices to have alternative
   access to things like PCI config space through MMIO registers. To
   guard against the former we can include additional precautions in the
   IOMMU driver to group multi-function PCI devices together
   (iommu=group_mf). The latter we can't prevent, but the IOMMU should
   still provide isolation. For PCI, SR-IOV Virtual Functions are the
   best indicator of "well behaved", as these are designed for
   virtualization usage models.

.. [3] As always there are trade-offs to virtual machine device
   assignment that are beyond the scope of VFIO. It's expected that
   future IOMMU technologies will reduce some, but maybe not all, of
   these trade-offs.

.. [4] In this case the device is below a PCI bridge, so transactions
   from either function of the device are indistinguishable to the iommu::

      -[0000:00]-+-1e.0-[06]--+-0d.0
                              \-0d.1

      00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
diff --git a/Documentation/xillybus.txt b/Documentation/xillybus.txt
index 1660145b9969..2446ee303c09 100644
--- a/Documentation/xillybus.txt
+++ b/Documentation/xillybus.txt
@@ -1,12 +1,11 @@
==========================================
Xillybus driver for generic FPGA interface
==========================================

:Author: Eli Billauer, Xillybus Ltd. (http://xillybus.com)
:Email:  eli.billauer@gmail.com or as advertised on Xillybus' site.

.. Contents:

 - Introduction
  -- Background
@@ -17,7 +16,7 @@ Contents:
  -- Synchronization
  -- Seekable pipes

 - Internals
  -- Source code organization
  -- Pipe attributes
  -- Host never reads from the FPGA
@@ -29,7 +28,7 @@ Contents:
  -- The "nonempty" message (supporting poll)


Introduction
============

Background
@@ -105,7 +104,7 @@ driver is used to work out of the box with any Xillybus IP core.
The data structure just mentioned should not be confused with PCI's
configuration space or the Flattened Device Tree.

Usage
=====

User interface
@@ -117,11 +116,11 @@ names of these files depend on the IP core that is loaded in the FPGA (see
Probing below). To communicate with the FPGA, open the device file that
corresponds to the hardware FIFO you want to send data to or receive data
from, and use plain write() or read() calls, just like with a regular pipe.
In particular, it makes perfect sense to go::

    $ cat mydata > /dev/xillybus_thisfifo

    $ cat /dev/xillybus_thatfifo > hisdata

possibly pressing CTRL-C at some stage, even though the xillybus_* pipes have
the capability to send an EOF (but may not use it).
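Because the device files behave like ordinary pipes, plain file I/O is all a
program needs. A small sketch (Python here; the device paths are examples, so
any xillybus_* node advertised by the loaded IP core works the same way):

```python
def send_file(src_path: str, dev_path: str, chunk: int = 65536) -> int:
    """Stream a file into a write FIFO using plain file I/O;
    returns the number of bytes written."""
    total = 0
    with open(src_path, "rb") as src, open(dev_path, "wb") as dev:
        while True:
            buf = src.read(chunk)
            if not buf:
                break
            dev.write(buf)
            total += len(buf)
    return total

# e.g. send_file("mydata", "/dev/xillybus_thisfifo")
```

Reading from a read FIFO is symmetric: open the device for reading and
consume it with read() calls until EOF (if the core sends one) or until the
program is interrupted.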
@@ -178,7 +177,7 @@ the attached memory is done by seeking to the desired address, and calling
read() or write() as required.


Internals
=========

Source code organization
@@ -365,7 +364,7 @@ into that page. It can be shown that all pages requested from the kernel
(except possibly for the last) are 100% utilized this way.

The "nonempty" message (supporting poll)
----------------------------------------

In order to support the "poll" method (and hence select() ), there is a small
catch regarding the FPGA to host direction: The FPGA may have filled a DMA
diff --git a/Documentation/xz.txt b/Documentation/xz.txt
index 2cf3e2608de3..b2220d03aa50 100644
--- a/Documentation/xz.txt
+++ b/Documentation/xz.txt
@@ -1,121 +1,127 @@
============================
XZ data compression in Linux
============================

Introduction
============

XZ is a general purpose data compression format with high compression
ratio and relatively fast decompression. The primary compression
algorithm (filter) is LZMA2. Additional filters can be used to improve
compression ratio even further. E.g. Branch/Call/Jump (BCJ) filters
improve compression ratio of executable data.

The XZ decompressor in Linux is called XZ Embedded. It supports
the LZMA2 filter and optionally also BCJ filters. CRC32 is supported
for integrity checking. The home page of XZ Embedded is at
<http://tukaani.org/xz/embedded.html>, where you can find the
latest version and also information about using the code outside
the Linux kernel.

For userspace, XZ Utils provide a zlib-like compression library
and a gzip-like command line tool. XZ Utils can be downloaded from
<http://tukaani.org/xz/>.

XZ related components in the kernel
===================================

The xz_dec module provides an XZ decompressor with single-call (buffer
to buffer) and multi-call (stateful) APIs. The usage of the xz_dec
module is documented in include/linux/xz.h.
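The difference between the two APIs can be illustrated with a userspace
analogy. Python's lzma module (a liblzma binding, used here purely as an
analogy to xz_dec's modes, not the kernel API) offers the same split:
single-call decodes a whole buffer at once, while a stateful decoder consumes
the stream chunk by chunk.

```python
import lzma

data = b"example payload " * 1000
blob = lzma.compress(data, check=lzma.CHECK_CRC32)

# Single-call: one buffer in, one buffer out.
assert lzma.decompress(blob) == data

# Multi-call: a stateful decoder is fed the stream in small chunks,
# as a kernel consumer would feed xz_dec_run() between refills.
dec = lzma.LZMADecompressor()
out = b"".join(dec.decompress(blob[i:i + 64])
               for i in range(0, len(blob), 64))
assert out == data and dec.eof
print("single-call and multi-call outputs match")
```

The multi-call form is what matters when the whole compressed stream cannot be
held in memory at once; the single-call form trades that flexibility for
simplicity and lower state overhead.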
The xz_dec_test module is for testing xz_dec. xz_dec_test is not
useful unless you are hacking the XZ decompressor. xz_dec_test
allocates a char device major dynamically to which one can write
.xz files from userspace. The decompressed output is thrown away.
Keep an eye on dmesg to see diagnostics printed by xz_dec_test.
See the xz_dec_test source code for the details.

For decompressing the kernel image, initramfs, and initrd, there
is a wrapper function in lib/decompress_unxz.c. Its API is the
same as in other decompress_*.c files, which is defined in
include/linux/decompress/generic.h.

scripts/xz_wrap.sh is a wrapper for the xz command line tool found
in XZ Utils. The wrapper sets compression options to values suitable
for compressing the kernel image.

For kernel makefiles, two commands are provided for use with
$(call if_needed). The kernel image should be compressed with
$(call if_needed,xzkern), which will use a BCJ filter and a big LZMA2
dictionary. It will also append a four-byte trailer containing the
uncompressed size of the file, which is needed by the boot code.
Other things should be compressed with $(call if_needed,xzmisc),
which will use no BCJ filter and a 1 MiB LZMA2 dictionary.

Notes on compression options
============================

Since XZ Embedded supports only streams with no integrity check or
CRC32, make sure that you don't use some other integrity check type
when encoding files that are supposed to be decoded by the kernel. With
liblzma, you need to use either LZMA_CHECK_NONE or LZMA_CHECK_CRC32
when encoding. With the xz command line tool, use --check=none or
--check=crc32.

Using CRC32 is strongly recommended unless there is some other layer
which will verify the integrity of the uncompressed data anyway.
Double checking the integrity would probably be a waste of CPU cycles.
Note that the headers will always have a CRC32 which will be validated
by the decoder; you can only change the integrity check type (or
disable it) for the actual uncompressed data.

In userspace, LZMA2 is typically used with dictionary sizes of several
megabytes. The decoder needs to have the dictionary in RAM, thus big
dictionaries cannot be used for files that are intended to be decoded
by the kernel. 1 MiB is probably the maximum reasonable dictionary
size for in-kernel use (maybe more is OK for initramfs). The presets
in XZ Utils may not be optimal when creating files for the kernel,
so don't hesitate to use custom settings. Example::

    xz --check=crc32 --lzma2=dict=512KiB inputfile

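The same settings can also be produced programmatically through liblzma
bindings; for instance, a Python sketch equivalent to the command above:

```python
import lzma

# Equivalent of: xz --check=crc32 --lzma2=dict=512KiB inputfile
filters = [{"id": lzma.FILTER_LZMA2, "dict_size": 512 * 1024}]
payload = b"kernel-bound data " * 4096

blob = lzma.compress(payload, format=lzma.FORMAT_XZ,
                     check=lzma.CHECK_CRC32, filters=filters)

# Round-trip to confirm the stream decodes; a decoder limited to a
# 512 KiB dictionary can handle a stream encoded with these settings.
assert lzma.decompress(blob) == payload
print("compressed", len(payload), "->", len(blob), "bytes")
```

Keeping the dictionary at or below the limit the in-kernel decoder is built
with is what makes such streams safe for kernel consumption.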
An exception to the above dictionary size limitation is when the decoder
is used in single-call mode. Decompressing the kernel itself is an
example of this situation. In single-call mode, the memory usage
doesn't depend on the dictionary size, and it is perfectly fine to
use a big dictionary: for maximum compression, the dictionary should
be at least as big as the uncompressed data itself.

Future plans
============

Creating a limited XZ encoder may be considered if people think it is
useful. LZMA2 is slower to compress than e.g. Deflate or LZO even at
the fastest settings, so it isn't clear if an LZMA2 encoder is wanted
in the kernel.

Support for limited random-access reading is planned for the
decompression code. I don't know if it could have any use in the
kernel, but I know that it would be useful in some embedded projects
outside the Linux kernel.

Conformance to the .xz file format specification
================================================

There are a couple of corner cases where things have been simplified
at the expense of detecting errors as early as possible. These should not
matter in practice at all, since they don't cause security issues. But
it is good to know this if testing the code e.g. with the test files
from XZ Utils.

Reporting bugs
==============

Before reporting a bug, please check that it's not fixed already
upstream. See <http://tukaani.org/xz/embedded.html> to get the
latest code.

Report bugs to <lasse.collin@tukaani.org> or visit #tukaani on
Freenode and talk to Larhzu. I don't actively read LKML or other
kernel-related mailing lists, so if there's something I should know,
you should email me personally or use IRC.

Don't bother Igor Pavlov with questions about the XZ implementation
in the kernel or about XZ Utils. While these two implementations
include essential code that is directly based on Igor Pavlov's code,
these implementations aren't maintained or supported by him.
diff --git a/Documentation/zorro.txt b/Documentation/zorro.txt
index d530971beb00..664072b017e3 100644
--- a/Documentation/zorro.txt
+++ b/Documentation/zorro.txt
@@ -1,12 +1,13 @@
========================================
Writing Device Drivers for Zorro Devices
========================================

:Author: Geert Uytterhoeven <geert@linux-m68k.org>
:Last revised: September 5, 2003


Introduction
------------

The Zorro bus is the bus used in the Amiga family of computers. Thanks to
AutoConfig(tm), it's 100% Plug-and-Play.
@@ -20,12 +21,12 @@ There are two types of Zorro buses, Zorro II and Zorro III:
   with Zorro II. The Zorro III address space lies outside the first 16 MB.


Probing for Zorro Devices
-------------------------

Zorro devices are found by calling ``zorro_find_device()``, which returns a
pointer to the ``next`` Zorro device with the specified Zorro ID. A probe loop
for the board with Zorro ID ``ZORRO_PROD_xxx`` looks like::

    struct zorro_dev *z = NULL;

@@ -35,8 +36,8 @@ for the board with Zorro ID `ZORRO_PROD_xxx' looks like:
        ...
    }

``ZORRO_WILDCARD`` acts as a wildcard and finds any Zorro device. If your driver
supports different types of boards, you can use a construct like::

    struct zorro_dev *z = NULL;

@@ -49,24 +50,24 @@ supports different types of boards, you can use a construct like:
    }


Zorro Resources
---------------

Before you can access a Zorro device's registers, you have to make sure it's
not yet in use. This is done using the I/O memory space resource management
functions::

    request_mem_region()
    release_mem_region()

Shortcuts to claim the whole device's address space are provided as well::

    zorro_request_device
    zorro_release_device


Accessing the Zorro Address Space
---------------------------------

The address regions in the Zorro device resources are Zorro bus address
regions. Due to the identity bus-physical address mapping on the Zorro bus,
@@ -78,26 +79,26 @@ The treatment of these regions depends on the type of Zorro space:
   explicitly using z_ioremap().

   Conversion from bus/physical Zorro II addresses to kernel virtual addresses
   and vice versa is done using::

      virt_addr = ZTWO_VADDR(bus_addr);
      bus_addr = ZTWO_PADDR(virt_addr);

 - Zorro III address space must be mapped explicitly using z_ioremap() first
   before it can be accessed::

      virt_addr = z_ioremap(bus_addr, size);
      ...
      z_iounmap(virt_addr);


References
----------

#. linux/include/linux/zorro.h
#. linux/include/uapi/linux/zorro.h
#. linux/include/uapi/linux/zorro_ids.h
#. linux/arch/m68k/include/asm/zorro.h
#. linux/drivers/zorro
#. /proc/bus/zorro
