From 7ad920b504a980adcab4d3f6b85695526e6fd7bb Mon Sep 17 00:00:00 2001 From: Sage Weil Date: Tue, 6 Oct 2009 11:31:05 -0700 Subject: ceph: documentation Mount options, syntax. Signed-off-by: Sage Weil --- Documentation/filesystems/ceph.txt | 139 +++++++++++++++++++++++++++++++++++++ 1 file changed, 139 insertions(+) create mode 100644 Documentation/filesystems/ceph.txt (limited to 'Documentation') diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt new file mode 100644 index 000000000000..6e03917316bd --- /dev/null +++ b/Documentation/filesystems/ceph.txt @@ -0,0 +1,139 @@ +Ceph Distributed File System +============================ + +Ceph is a distributed network file system designed to provide good +performance, reliability, and scalability. + +Basic features include: + + * POSIX semantics + * Seamless scaling from 1 to many thousands of nodes + * High availability and reliability. No single points of failure. + * N-way replication of data across storage nodes + * Fast recovery from node failures + * Automatic rebalancing of data on node addition/removal + * Easy deployment: most FS components are userspace daemons + +Also, + * Flexible snapshots (on any directory) + * Recursive accounting (nested files, directories, bytes) + +In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely +on symmetric access by all clients to shared block devices, Ceph +separates data and metadata management into independent server +clusters, similar to Lustre. Unlike Lustre, however, metadata and +storage nodes run entirely as user space daemons. Storage nodes +utilize btrfs to store data objects, leveraging its advanced features +(checksumming, metadata replication, etc.). File data is striped +across storage nodes in large chunks to distribute workload and +facilitate high throughputs. When storage nodes fail, data is +re-replicated in a distributed fashion by the storage nodes themselves +(with some minimal coordination from a cluster monitor), making the +system extremely efficient and scalable. + +Metadata servers effectively form a large, consistent, distributed +in-memory cache above the file namespace that is extremely scalable, +dynamically redistributes metadata in response to workload changes, +and can tolerate arbitrary (well, non-Byzantine) node failures. The +metadata server takes a somewhat unconventional approach to metadata +storage to significantly improve performance for common workloads. In +particular, inodes with only a single link are embedded in +directories, allowing entire directories of dentries and inodes to be +loaded into its cache with a single I/O operation. The contents of +extremely large directories can be fragmented and managed by +independent metadata servers, allowing scalable concurrent access. + +The system offers automatic data rebalancing/migration when scaling +from a small cluster of just a few nodes to many hundreds, without +requiring an administrator carve the data set into static volumes or +go through the tedious process of migrating data between servers. +When the file system approaches full, new nodes can be easily added +and things will "just work." + +Ceph includes flexible snapshot mechanism that allows a user to create +a snapshot on any subdirectory (and its nested contents) in the +system. Snapshot creation and deletion are as simple as 'mkdir +.snap/foo' and 'rmdir .snap/foo'. + +Ceph also provides some recursive accounting on directories for nested +files and bytes. That is, a 'getfattr -d foo' on any directory in the +system will reveal the total number of nested regular files and +subdirectories, and a summation of all nested file sizes. This makes +the identification of large disk space consumers relatively quick, as +no 'du' or similar recursive scan of the file system is required. + + +Mount Syntax +============ + +The basic mount syntax is: + + # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt + +You only need to specify a single monitor, as the client will get the +full list when it connects. (However, if the monitor you specify +happens to be down, the mount won't succeed.) The port can be left +off if the monitor is using the default. So if the monitor is at +1.2.3.4, + + # mount -t ceph 1.2.3.4:/ /mnt/ceph + +is sufficient. If /sbin/mount.ceph is installed, a hostname can be +used instead of an IP address. + + + +Mount Options +============= + + ip=A.B.C.D[:N] + Specify the IP and/or port the client should bind to locally. + There is normally not much reason to do this. If the IP is not + specified, the client's IP address is determined by looking at the + address it's connection to the monitor originates from. + + wsize=X + Specify the maximum write size in bytes. By default there is no + maximu. Ceph will normally size writes based on the file stripe + size. + + rsize=X + Specify the maximum readahead. + + mount_timeout=X + Specify the timeout value for mount (in seconds), in the case + of a non-responsive Ceph file system. The default is 30 + seconds. + + rbytes + When stat() is called on a directory, set st_size to 'rbytes', + the summation of file sizes over all files nested beneath that + directory. This is the default. + + norbytes + When stat() is called on a directory, set st_size to the + number of entries in that directory. + + nocrc + Disable CRC32C calculation for data writes. If set, the OSD + must rely on TCP's error correction to detect data corruption + in the data payload. + + noasyncreaddir + Disable client's use its local cache to satisfy readdir + requests. (This does not change correctness; the client uses + cached metadata only when a lease or capability ensures it is + valid.) + + +More Information +================ + +For more information on Ceph, see the home page at + http://ceph.newdream.net/ + +The Linux kernel client source tree is available at + git://ceph.newdream.net/linux-ceph-client.git + +and the source for the full system is at + git://ceph.newdream.net/ceph.git -- cgit v1.2.2 From 8f4e91dee2a245e4be6942f4a8d83a769e13a47d Mon Sep 17 00:00:00 2001 From: Sage Weil Date: Tue, 6 Oct 2009 11:31:14 -0700 Subject: ceph: ioctls A few Ceph ioctls for getting and setting file layout (striping) parameters, and learning the identity and network address of the OSD a given region of a file is stored on. Signed-off-by: Sage Weil --- Documentation/ioctl/ioctl-number.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index 947374977ca5..91cfdd76131e 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -182,6 +182,7 @@ Code Seq# Include File Comments 0x90 00 drivers/cdrom/sbpcd.h 0x93 60-7F linux/auto_fs.h +0x97 00-7F fs/ceph/ioctl.h Ceph file system 0x99 00-0F 537-Addinboard driver 0xA0 all linux/sdp/sdp.h Industrial Device Project -- cgit v1.2.2 From c2282adbdea3548ae0271c1b1b2deec5f56ad224 Mon Sep 17 00:00:00 2001 From: FUJITA Tomonori Date: Mon, 8 Mar 2010 09:11:07 +0100 Subject: Documentation: fix block/biodoc.txt dma mapping description - It looks incorrect to use blk_rq_map_sg with pci_map_page here about DMA mappings. dma_map_sg? - better to use dma_map_page instead of pci_map_page. http://marc.info/?l=linux-kernel&m=126596737604808&w=2 Signed-off-by: FUJITA Tomonori Signed-off-by: Jens Axboe --- Documentation/block/biodoc.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index 6fab97ea7e6b..508b5b2b0289 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt @@ -1162,8 +1162,8 @@ where a driver received a request ala this before: As mentioned, there is no virtual mapping of a bio. For DMA, this is not a problem as the driver probably never will need a virtual mapping. -Instead it needs a bus mapping (pci_map_page for a single segment or -use blk_rq_map_sg for scatter gather) to be able to ship it to the driver. For +Instead it needs a bus mapping (dma_map_page for a single segment or +use dma_map_sg for scatter gather) to be able to ship it to the driver. For PIO drivers (or drivers that need to revert to PIO transfer once in a while (IDE for example)), where the CPU is doing the actual data transfer a virtual mapping is needed. If the driver supports highmem I/O, -- cgit v1.2.2 From 881245dcff29df992d8431392a41fb81549129f9 Mon Sep 17 00:00:00 2001 From: William Cohen Date: Tue, 9 Mar 2010 09:26:04 +0100 Subject: Add DocBook documentation for the block tracepoints. This patch adds a simple description of the various block tracepoints available in the kernel. Signed-off-by: William Cohen Acked-by: Randy Dunlap Signed-off-by: Jens Axboe --- Documentation/DocBook/tracepoint.tmpl | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'Documentation') diff --git a/Documentation/DocBook/tracepoint.tmpl b/Documentation/DocBook/tracepoint.tmpl index 8bca1d5cec09..e8473eae2a20 100644 --- a/Documentation/DocBook/tracepoint.tmpl +++ b/Documentation/DocBook/tracepoint.tmpl @@ -16,6 +16,15 @@ + + William + Cohen + +
+ wcohen@redhat.com +
+
+
@@ -91,4 +100,8 @@ !Iinclude/trace/events/signal.h + + Block IO +!Iinclude/trace/events/block.h + -- cgit v1.2.2 From 4c81ba4900ab4eb24c7d2ba1aca594c644b6ce4c Mon Sep 17 00:00:00 2001 From: Len Brown Date: Sun, 14 Mar 2010 16:28:46 -0400 Subject: ACPI: plan to delete "acpi=ht" boot option Signed-off-by: Len Brown --- Documentation/feature-removal-schedule.txt | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'Documentation') diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index a5cc0db63d7a..ed511af0f79a 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -582,3 +582,10 @@ Why: The paravirt mmu host support is slower than non-paravirt mmu, both Who: Avi Kivity ---------------------------- + +What: "acpi=ht" boot option +When: 2.6.35 +Why: Useful in 2003, implementation is a hack. + Generally invoked by accident today. + Seen as doing more harm than good. +Who: Len Brown -- cgit v1.2.2 From 3b1da4c5d1032ebc29fec8bd8f592ba6589be8ed Mon Sep 17 00:00:00 2001 From: Alex Chiang Date: Mon, 22 Feb 2010 12:11:34 -0700 Subject: ACPI: processor: remove early _PDC optin quirks Now that we check for physically present processors before blindly evaluating _PDC, we no longer need to maintain a DMI opt-in table nor a kernel param. Acked-by: Venkatesh Pallipadi Signed-off-by: Alex Chiang Signed-off-by: Len Brown --- Documentation/kernel-parameters.txt | 4 ---- 1 file changed, 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 3bc48b0bd3a9..e4cbca58536c 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -200,10 +200,6 @@ and is between 256 and 4096 characters. It is defined in the file acpi_display_output=video See above. - acpi_early_pdc_eval [HW,ACPI] Evaluate processor _PDC methods - early. Needed on some platforms to properly - initialize the EC. - acpi_irq_balance [HW,ACPI] ACPI will balance active IRQs default in APIC mode -- cgit v1.2.2 From f28f9e43b3d81d1e69da0ebf77c8a6780cb5e0c8 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 12 Mar 2010 19:23:27 +0000 Subject: timestamping: fix example build Fix Makefiles so that Documentation/networking/timestamping/timestamping.c will build when using the CONFIG_BUILD_DOCSRC kconfig option. (timestamping.c does not build currently with its simple Makefile.) Also fix printf format warnings. Signed-off-by: Randy Dunlap Cc: Patrick Ohly Signed-off-by: David S. Miller --- Documentation/networking/Makefile | 2 ++ Documentation/networking/timestamping/Makefile | 11 +++++++++-- Documentation/networking/timestamping/timestamping.c | 10 +++++----- 3 files changed, 16 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/Makefile b/Documentation/networking/Makefile index 6d8af1ac56c4..5aba7a33aeeb 100644 --- a/Documentation/networking/Makefile +++ b/Documentation/networking/Makefile @@ -6,3 +6,5 @@ hostprogs-y := ifenslave # Tell kbuild to always build the programs always := $(hostprogs-y) + +obj-m := timestamping/ diff --git a/Documentation/networking/timestamping/Makefile b/Documentation/networking/timestamping/Makefile index 2a1489fdc036..e79973443e9f 100644 --- a/Documentation/networking/timestamping/Makefile +++ b/Documentation/networking/timestamping/Makefile @@ -1,6 +1,13 @@ -CPPFLAGS = -I../../../include +# kbuild trick to avoid linker error. Can be omitted if a module is built. +obj- := dummy.o -timestamping: timestamping.c +# List of programs to build +hostprogs-y := timestamping + +# Tell kbuild to always build the programs +always := $(hostprogs-y) + +HOSTCFLAGS_timestamping.o += -I$(objtree)/usr/include clean: rm -f timestamping diff --git a/Documentation/networking/timestamping/timestamping.c b/Documentation/networking/timestamping/timestamping.c index a7936fe8444a..7b81d169cb0f 100644 --- a/Documentation/networking/timestamping/timestamping.c +++ b/Documentation/networking/timestamping/timestamping.c @@ -41,9 +41,9 @@ #include #include -#include "asm/types.h" -#include "linux/net_tstamp.h" -#include "linux/errqueue.h" +#include +#include +#include #ifndef SO_TIMESTAMPING # define SO_TIMESTAMPING 37 @@ -164,7 +164,7 @@ static void printpacket(struct msghdr *msg, int res, gettimeofday(&now, 0); - printf("%ld.%06ld: received %s data, %d bytes from %s, %d bytes control messages\n", + printf("%ld.%06ld: received %s data, %d bytes from %s, %zu bytes control messages\n", (long)now.tv_sec, (long)now.tv_usec, (recvmsg_flags & MSG_ERRQUEUE) ? "error" : "regular", res, @@ -173,7 +173,7 @@ static void printpacket(struct msghdr *msg, int res, for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) { - printf(" cmsg len %d: ", cmsg->cmsg_len); + printf(" cmsg len %zu: ", cmsg->cmsg_len); switch (cmsg->cmsg_level) { case SOL_SOCKET: printf("SOL_SOCKET "); -- cgit v1.2.2 From 462bd295a3d74c7d1641501bda549bccf404be5b Mon Sep 17 00:00:00 2001 From: "Robert P. J. Day" Date: Thu, 11 Mar 2010 07:59:09 -0500 Subject: kobject: documentation: Fix erroneous example in kobject doc. Replace uio_mem example for kobjects with uio_map, since the uio_mem struct no longer contains a kobject. Signed-off-by: Robert P. J. Day Signed-off-by: Greg Kroah-Hartman --- Documentation/kobject.txt | 57 +++++++++++++++++++++++++++++++---------------- 1 file changed, 38 insertions(+), 19 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kobject.txt b/Documentation/kobject.txt index bdb13817e1e9..668cb83d9561 100644 --- a/Documentation/kobject.txt +++ b/Documentation/kobject.txt @@ -59,37 +59,56 @@ nice to have in other objects. The C language does not allow for the direct expression of inheritance, so other techniques - such as structure embedding - must be used. -So, for example, the UIO code has a structure that defines the memory -region associated with a uio device: +(As an aside, for those familiar with the kernel linked list implementation, +this is analogous as to how "list_head" structs are rarely useful on +their own, but are invariably found embedded in the larger objects of +interest.) -struct uio_mem { +So, for example, the UIO code in drivers/uio/uio.c has a structure that +defines the memory region associated with a uio device: + + struct uio_map { struct kobject kobj; - unsigned long addr; - unsigned long size; - int memtype; - void __iomem *internal_addr; -}; + struct uio_mem *mem; + }; -If you have a struct uio_mem structure, finding its embedded kobject is +If you have a struct uio_map structure, finding its embedded kobject is just a matter of using the kobj member. Code that works with kobjects will often have the opposite problem, however: given a struct kobject pointer, what is the pointer to the containing structure? You must avoid tricks (such as assuming that the kobject is at the beginning of the structure) and, instead, use the container_of() macro, found in : - container_of(pointer, type, member) + container_of(pointer, type, member) + +where: + + * "pointer" is the pointer to the embedded kobject, + * "type" is the type of the containing structure, and + * "member" is the name of the structure field to which "pointer" points. + +The return value from container_of() is a pointer to the corresponding +container type. So, for example, a pointer "kp" to a struct kobject +embedded *within* a struct uio_map could be converted to a pointer to the +*containing* uio_map structure with: + + struct uio_map *u_map = container_of(kp, struct uio_map, kobj); + +For convenience, programmers often define a simple macro for "back-casting" +kobject pointers to the containing type. Exactly this happens in the +earlier drivers/uio/uio.c, as you can see here: + + struct uio_map { + struct kobject kobj; + struct uio_mem *mem; + }; -where pointer is the pointer to the embedded kobject, type is the type of -the containing structure, and member is the name of the structure field to -which pointer points. The return value from container_of() is a pointer to -the given type. So, for example, a pointer "kp" to a struct kobject -embedded within a struct uio_mem could be converted to a pointer to the -containing uio_mem structure with: + #define to_map(map) container_of(map, struct uio_map, kobj) - struct uio_mem *u_mem = container_of(kp, struct uio_mem, kobj); +where the macro argument "map" is a pointer to the struct kobject in +question. That macro is subsequently invoked with: -Programmers often define a simple macro for "back-casting" kobject pointers -to the containing type. + struct uio_map *map = to_map(kobj); Initialization of kobjects -- cgit v1.2.2 From 178a5b35b2777346206d4b577b36d10061732f8c Mon Sep 17 00:00:00 2001 From: "Robert P. J. Day" Date: Fri, 12 Mar 2010 07:30:35 -0500 Subject: kobject: documentation: Update to refer to kset-example.c. Signed-off-by: Robert P. J. Day Signed-off-by: Greg Kroah-Hartman --- Documentation/kobject.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/kobject.txt b/Documentation/kobject.txt index 668cb83d9561..3ab2472509cb 100644 --- a/Documentation/kobject.txt +++ b/Documentation/kobject.txt @@ -406,4 +406,5 @@ called, and the objects in the former circle release each other. Example code to copy from For a more complete example of using ksets and kobjects properly, see the -sample/kobject/kset-example.c code. +example programs samples/kobject/{kobject-example.c,kset-example.c}, +which will be built as loadable modules if you select CONFIG_SAMPLE_KOBJECT. -- cgit v1.2.2 From 1e63ef0e0c2cfb5deb9331420c9857fbe04bea73 Mon Sep 17 00:00:00 2001 From: Oliver Neukum Date: Fri, 12 Mar 2010 11:27:21 +0100 Subject: USB: Fix documentation for avoid_reset_quirk The name used in the documentation doesn't match reality. Signed-off-by: Oliver Neukum Signed-off-by: Greg Kroah-Hartman --- Documentation/ABI/testing/sysfs-bus-usb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-bus-usb b/Documentation/ABI/testing/sysfs-bus-usb index a986e9bbba3d..bcebb9eaedce 100644 --- a/Documentation/ABI/testing/sysfs-bus-usb +++ b/Documentation/ABI/testing/sysfs-bus-usb @@ -160,7 +160,7 @@ Description: match the driver to the device. For example: # echo "046d c315" > /sys/bus/usb/drivers/foo/remove_id -What: /sys/bus/usb/device/.../avoid_reset +What: /sys/bus/usb/device/.../avoid_reset_quirk Date: December 2009 Contact: Oliver Neukum Description: -- cgit v1.2.2 From 97065e033ec5fd824d0133ecd0c614fe1e5c20bd Mon Sep 17 00:00:00 2001 From: Henrik Rydberg Date: Sun, 21 Mar 2010 22:31:26 -0700 Subject: Input: clarify the no-finger event in multitouch protocol The current documentation does not explicitly specify how to report zero fingers, leaving a potential problem in the driver implementations and giving no parsing directive to userland. This patch defines two equally valid ways. Signed-off-by: Henrik Rydberg Signed-off-by: Dmitry Torokhov --- Documentation/input/multi-touch-protocol.txt | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'Documentation') diff --git a/Documentation/input/multi-touch-protocol.txt b/Documentation/input/multi-touch-protocol.txt index 8490480ce432..eacb22606664 100644 --- a/Documentation/input/multi-touch-protocol.txt +++ b/Documentation/input/multi-touch-protocol.txt @@ -68,6 +68,22 @@ like: SYN_MT_REPORT SYN_REPORT +Here is the sequence after lifting one of the fingers: + + ABS_MT_POSITION_X + ABS_MT_POSITION_Y + SYN_MT_REPORT + SYN_REPORT + +And here is the sequence after lifting the remaining finger: + + SYN_MT_REPORT + SYN_REPORT + +If the driver reports one of BTN_TOUCH or ABS_PRESSURE in addition to the +ABS_MT events, the last SYN_MT_REPORT event may be omitted. Otherwise, the +last SYN_REPORT will be dropped by the input core, resulting in no +zero-finger event reaching userland. Event Semantics --------------- -- cgit v1.2.2 From 13bad37b04c779d98983307a27f97e9caa36f9b1 Mon Sep 17 00:00:00 2001 From: Henrik Rydberg Date: Sun, 21 Mar 2010 22:31:26 -0700 Subject: Input: update the status of the Multitouch X driver project The Multitouch X driver project has moved to alpha status. This patch updates the documentation accordingly. Signed-off-by: Henrik Rydberg Signed-off-by: Dmitry Torokhov --- Documentation/input/multi-touch-protocol.txt | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/input/multi-touch-protocol.txt b/Documentation/input/multi-touch-protocol.txt index eacb22606664..c0fc1c75fd88 100644 --- a/Documentation/input/multi-touch-protocol.txt +++ b/Documentation/input/multi-touch-protocol.txt @@ -233,11 +233,6 @@ where examples can be found. difference between the contact position and the approaching tool position could be used to derive tilt. [2] The list can of course be extended. -[3] The multi-touch X driver is currently in the prototyping stage. At the -time of writing (April 2009), the MT protocol is not yet merged, and the -prototype implements finger matching, basic mouse support and two-finger -scrolling. The project aims at improving the quality of current multi-touch -functionality available in the Synaptics X driver, and in addition -implement more advanced gestures. +[3] Multitouch X driver project: http://bitmath.org/code/multitouch/. [4] See the section on event computation. [5] See the section on finger tracking. -- cgit v1.2.2 From 23ab15ad7a9d042afa7303b735b6e24faa607241 Mon Sep 17 00:00:00 2001 From: Sage Weil Date: Mon, 22 Mar 2010 09:37:14 -0700 Subject: ceph: avoid loaded term 'OSD' in documention 'OSD' means different things to different people; avoid it here to avoid confusion. Signed-off-by: Sage Weil --- Documentation/filesystems/ceph.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt index 6e03917316bd..523fdf0828db 100644 --- a/Documentation/filesystems/ceph.txt +++ b/Documentation/filesystems/ceph.txt @@ -115,7 +115,7 @@ Mount Options number of entries in that directory. nocrc - Disable CRC32C calculation for data writes. If set, the OSD + Disable CRC32C calculation for data writes. If set, the storage node must rely on TCP's error correction to detect data corruption in the data payload. -- cgit v1.2.2 From 091e635e6735fa4496c4a18e7e967b58e961303c Mon Sep 17 00:00:00 2001 From: Russell King Date: Tue, 23 Mar 2010 13:35:16 -0700 Subject: Documentation/volatile-considered-harmful.txt: correct cpu_relax() documentation cpu_relax() is documented in volatile-considered-harmful.txt to be a memory barrier. However, everyone with the exception of Blackfin and possibly ia64 defines cpu_relax() to be a compiler barrier. Make the documentation reflect the general concensus. Linus sayeth: : I don't think it was ever the intention that it would be seen as anything : but a compiler barrier, although it is obviously implied that it might : well perform some per-architecture actions that have "memory barrier-like" : semantics. : : After all, the whole and only point of the "cpu_relax()" thing is to tell : the CPU that we're busy-looping on some event. : : And that "event" might be (and often is) about reading the same memory : location over and over until it changes to what we want it to be. So it's : quite possible that on various architectures the "cpu_relax()" could be : about making sure that such a tight loop on loads doesn't starve cache : transactions, for example - and as such look a bit like a memory barrier : from a CPU standpoint. : : But it's not meant to have any kind of architectural memory ordering : semantics as far as the kernel is concerned - those must come from other : sources. Signed-off-by: Russell King Cc: Acked-by: Linus Torvalds Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/volatile-considered-harmful.txt | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/volatile-considered-harmful.txt b/Documentation/volatile-considered-harmful.txt index 991c26a6ef64..db0cb228d64a 100644 --- a/Documentation/volatile-considered-harmful.txt +++ b/Documentation/volatile-considered-harmful.txt @@ -63,9 +63,9 @@ way to perform a busy wait is: cpu_relax(); The cpu_relax() call can lower CPU power consumption or yield to a -hyperthreaded twin processor; it also happens to serve as a memory barrier, -so, once again, volatile is unnecessary. Of course, busy-waiting is -generally an anti-social act to begin with. +hyperthreaded twin processor; it also happens to serve as a compiler +barrier, so, once again, volatile is unnecessary. Of course, busy- +waiting is generally an anti-social act to begin with. There are still a few rare situations where volatile makes sense in the kernel: -- cgit v1.2.2 From 5ca9ea9a17a14c68611d3774d1e8a7ab6c7f4763 Mon Sep 17 00:00:00 2001 From: Greg Thelen Date: Tue, 23 Mar 2010 13:35:19 -0700 Subject: memcg: fix typo in memcg documentation Update memory.txt to be more consistent: s/swapiness/swappiness/ Signed-off-by: Greg Thelen Acked-by: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index f8bc802d70b9..3a6aecd078ba 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -340,7 +340,7 @@ Note: 5.3 swappiness Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. - Following cgroups' swapiness can't be changed. + Following cgroups' swappiness can't be changed. - root cgroup (uses /proc/sys/vm/swappiness). - a cgroup which uses hierarchy and it has child cgroup. - a cgroup which uses hierarchy and not the root of hierarchy. -- cgit v1.2.2 From 5e07c2c7301bd2c82e55cf5cbb36f7b5bddeb8e9 Mon Sep 17 00:00:00 2001 From: FUJITA Tomonori Date: Tue, 23 Mar 2010 13:35:23 -0700 Subject: Documentation: rename PCI/PCI-DMA-mapping.txt to DMA-API-HOWTO.txt This patch renames PCI/PCI-DMA-mapping.txt to DMA-API-HOWTO.txt. The commit 51e7364ef281e540371f084008732b13292622f0 "Documentation: rename PCI-DMA-mapping.txt to DMA-API-HOWTO.txt" was supposed to do this but it didn't. Signed-off-by: FUJITA Tomonori Acked-by: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/DMA-API-HOWTO.txt | 758 ++++++++++++++++++++++++++++++++++ Documentation/PCI/PCI-DMA-mapping.txt | 758 ---------------------------------- 2 files changed, 758 insertions(+), 758 deletions(-) create mode 100644 Documentation/DMA-API-HOWTO.txt delete mode 100644 Documentation/PCI/PCI-DMA-mapping.txt (limited to 'Documentation') diff --git a/Documentation/DMA-API-HOWTO.txt b/Documentation/DMA-API-HOWTO.txt new file mode 100644 index 000000000000..52618ab069ad --- /dev/null +++ b/Documentation/DMA-API-HOWTO.txt @@ -0,0 +1,758 @@ + Dynamic DMA mapping Guide + ========================= + + David S. Miller + Richard Henderson + Jakub Jelinek + +This is a guide to device driver writers on how to use the DMA API +with example pseudo-code. For a concise description of the API, see +DMA-API.txt. + +Most of the 64bit platforms have special hardware that translates bus +addresses (DMA addresses) into physical addresses. This is similar to +how page tables and/or a TLB translates virtual addresses to physical +addresses on a CPU. This is needed so that e.g. PCI devices can +access with a Single Address Cycle (32bit DMA address) any page in the +64bit physical address space. Previously in Linux those 64bit +platforms had to set artificial limits on the maximum RAM size in the +system, so that the virt_to_bus() static scheme works (the DMA address +translation tables were simply filled on bootup to map each bus +address to the physical page __pa(bus_to_virt())). + +So that Linux can use the dynamic DMA mapping, it needs some help from the +drivers, namely it has to take into account that DMA addresses should be +mapped only for the time they are actually used and unmapped after the DMA +transfer. + +The following API will work of course even on platforms where no such +hardware exists. + +Note that the DMA API works with any bus independent of the underlying +microprocessor architecture. You should use the DMA API rather than +the bus specific DMA API (e.g. pci_dma_*). + +First of all, you should make sure + +#include + +is in your driver. This file will obtain for you the definition of the +dma_addr_t (which can hold any valid DMA address for the platform) +type which should be used everywhere you hold a DMA (bus) address +returned from the DMA mapping functions. + + What memory is DMA'able? + +The first piece of information you must know is what kernel memory can +be used with the DMA mapping facilities. There has been an unwritten +set of rules regarding this, and this text is an attempt to finally +write them down. + +If you acquired your memory via the page allocator +(i.e. __get_free_page*()) or the generic memory allocators +(i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from +that memory using the addresses returned from those routines. + +This means specifically that you may _not_ use the memory/addresses +returned from vmalloc() for DMA. It is possible to DMA to the +_underlying_ memory mapped into a vmalloc() area, but this requires +walking page tables to get the physical addresses, and then +translating each of those pages back to a kernel address using +something like __va(). [ EDIT: Update this when we integrate +Gerd Knorr's generic code which does this. ] + +This rule also means that you may use neither kernel image addresses +(items in data/text/bss segments), nor module image addresses, nor +stack addresses for DMA. These could all be mapped somewhere entirely +different than the rest of physical memory. Even if those classes of +memory could physically work with DMA, you'd need to ensure the I/O +buffers were cacheline-aligned. Without that, you'd see cacheline +sharing problems (data corruption) on CPUs with DMA-incoherent caches. +(The CPU could write to one word, DMA would write to a different one +in the same cache line, and one of them could be overwritten.) + +Also, this means that you cannot take the return of a kmap() +call and DMA to/from that. This is similar to vmalloc(). + +What about block I/O and networking buffers? The block I/O and +networking subsystems make sure that the buffers they use are valid +for you to DMA from/to. + + DMA addressing limitations + +Does your device have any DMA addressing limitations? For example, is +your device only capable of driving the low order 24-bits of address? +If so, you need to inform the kernel of this fact. + +By default, the kernel assumes that your device can address the full +32-bits. For a 64-bit capable device, this needs to be increased. +And for a device with limitations, as discussed in the previous +paragraph, it needs to be decreased. + +Special note about PCI: PCI-X specification requires PCI-X devices to +support 64-bit addressing (DAC) for all transactions. And at least +one platform (SGI SN2) requires 64-bit consistent allocations to +operate correctly when the IO bus is in PCI-X mode. + +For correct operation, you must interrogate the kernel in your device +probe routine to see if the DMA controller on the machine can properly +support the DMA addressing limitation your device has. It is good +style to do this even if your device holds the default setting, +because this shows that you did think about these issues wrt. your +device. + +The query is performed via a call to dma_set_mask(): + + int dma_set_mask(struct device *dev, u64 mask); + +The query for consistent allocations is performed via a call to +dma_set_coherent_mask(): + + int dma_set_coherent_mask(struct device *dev, u64 mask); + +Here, dev is a pointer to the device struct of your device, and mask +is a bit mask describing which bits of an address your device +supports. It returns zero if your card can perform DMA properly on +the machine given the address mask you provided. In general, the +device struct of your device is embedded in the bus specific device +struct of your device. For example, a pointer to the device struct of +your PCI device is pdev->dev (pdev is a pointer to the PCI device +struct of your device). + +If it returns non-zero, your device cannot perform DMA properly on +this platform, and attempting to do so will result in undefined +behavior. You must either use a different mask, or not use DMA. + +This means that in the failure case, you have three options: + +1) Use another DMA mask, if possible (see below). +2) Use some non-DMA mode for data transfer, if possible. +3) Ignore this device and do not initialize it. + +It is recommended that your driver print a kernel KERN_WARNING message +when you end up performing either #2 or #3. In this manner, if a user +of your driver reports that performance is bad or that the device is not +even detected, you can ask them for the kernel messages to find out +exactly why. + +The standard 32-bit addressing device would do something like this: + + if (dma_set_mask(dev, DMA_BIT_MASK(32))) { + printk(KERN_WARNING + "mydev: No suitable DMA available.\n"); + goto ignore_this_device; + } + +Another common scenario is a 64-bit capable device. The approach here +is to try for 64-bit addressing, but back down to a 32-bit mask that +should not fail. The kernel may fail the 64-bit mask not because the +platform is not capable of 64-bit addressing. Rather, it may fail in +this case simply because 32-bit addressing is done more efficiently +than 64-bit addressing. For example, Sparc64 PCI SAC addressing is +more efficient than DAC addressing. + +Here is how you would handle a 64-bit capable device which can drive +all 64-bits when accessing streaming DMA: + + int using_dac; + + if (!dma_set_mask(dev, DMA_BIT_MASK(64))) { + using_dac = 1; + } else if (!dma_set_mask(dev, DMA_BIT_MASK(32))) { + using_dac = 0; + } else { + printk(KERN_WARNING + "mydev: No suitable DMA available.\n"); + goto ignore_this_device; + } + +If a card is capable of using 64-bit consistent allocations as well, +the case would look like this: + + int using_dac, consistent_using_dac; + + if (!dma_set_mask(dev, DMA_BIT_MASK(64))) { + using_dac = 1; + consistent_using_dac = 1; + dma_set_coherent_mask(dev, DMA_BIT_MASK(64)); + } else if (!dma_set_mask(dev, DMA_BIT_MASK(32))) { + using_dac = 0; + consistent_using_dac = 0; + dma_set_coherent_mask(dev, DMA_BIT_MASK(32)); + } else { + printk(KERN_WARNING + "mydev: No suitable DMA available.\n"); + goto ignore_this_device; + } + +dma_set_coherent_mask() will always be able to set the same or a +smaller mask as dma_set_mask(). However for the rare case that a +device driver only uses consistent allocations, one would have to +check the return value from dma_set_coherent_mask(). + +Finally, if your device can only drive the low 24-bits of +address you might do something like: + + if (dma_set_mask(dev, DMA_BIT_MASK(24))) { + printk(KERN_WARNING + "mydev: 24-bit DMA addressing not available.\n"); + goto ignore_this_device; + } + +When dma_set_mask() is successful, and returns zero, the kernel saves +away this mask you have provided. The kernel will use this +information later when you make DMA mappings. + +There is a case which we are aware of at this time, which is worth +mentioning in this documentation. If your device supports multiple +functions (for example a sound card provides playback and record +functions) and the various different functions have _different_ +DMA addressing limitations, you may wish to probe each mask and +only provide the functionality which the machine can handle. It +is important that the last call to dma_set_mask() be for the +most specific mask. + +Here is pseudo-code showing how this might be done: + + #define PLAYBACK_ADDRESS_BITS DMA_BIT_MASK(32) + #define RECORD_ADDRESS_BITS DMA_BIT_MASK(24) + + struct my_sound_card *card; + struct device *dev; + + ... + if (!dma_set_mask(dev, PLAYBACK_ADDRESS_BITS)) { + card->playback_enabled = 1; + } else { + card->playback_enabled = 0; + printk(KERN_WARNING "%s: Playback disabled due to DMA limitations.\n", + card->name); + } + if (!dma_set_mask(dev, RECORD_ADDRESS_BITS)) { + card->record_enabled = 1; + } else { + card->record_enabled = 0; + printk(KERN_WARNING "%s: Record disabled due to DMA limitations.\n", + card->name); + } + +A sound card was used as an example here because this genre of PCI +devices seems to be littered with ISA chips given a PCI front end, +and thus retaining the 16MB DMA addressing limitations of ISA. + + Types of DMA mappings + +There are two types of DMA mappings: + +- Consistent DMA mappings which are usually mapped at driver + initialization, unmapped at the end and for which the hardware should + guarantee that the device and the CPU can access the data + in parallel and will see updates made by each other without any + explicit software flushing. + + Think of "consistent" as "synchronous" or "coherent". + + The current default is to return consistent memory in the low 32 + bits of the bus space. However, for future compatibility you should + set the consistent mask even if this default is fine for your + driver. + + Good examples of what to use consistent mappings for are: + + - Network card DMA ring descriptors. + - SCSI adapter mailbox command data structures. + - Device firmware microcode executed out of + main memory. + + The invariant these examples all require is that any CPU store + to memory is immediately visible to the device, and vice + versa. Consistent mappings guarantee this. + + IMPORTANT: Consistent DMA memory does not preclude the usage of + proper memory barriers. The CPU may reorder stores to + consistent memory just as it may normal memory. Example: + if it is important for the device to see the first word + of a descriptor updated before the second, you must do + something like: + + desc->word0 = address; + wmb(); + desc->word1 = DESC_VALID; + + in order to get correct behavior on all platforms. + + Also, on some platforms your driver may need to flush CPU write + buffers in much the same way as it needs to flush write buffers + found in PCI bridges (such as by reading a register's value + after writing it). + +- Streaming DMA mappings which are usually mapped for one DMA + transfer, unmapped right after it (unless you use dma_sync_* below) + and for which hardware can optimize for sequential accesses. + + This of "streaming" as "asynchronous" or "outside the coherency + domain". + + Good examples of what to use streaming mappings for are: + + - Networking buffers transmitted/received by a device. + - Filesystem buffers written/read by a SCSI device. + + The interfaces for using this type of mapping were designed in + such a way that an implementation can make whatever performance + optimizations the hardware allows. To this end, when using + such mappings you must be explicit about what you want to happen. + +Neither type of DMA mapping has alignment restrictions that come from +the underlying bus, although some devices may have such restrictions. +Also, systems with caches that aren't DMA-coherent will work better +when the underlying buffers don't share cache lines with other data. + + + Using Consistent DMA mappings. + +To allocate and map large (PAGE_SIZE or so) consistent DMA regions, +you should do: + + dma_addr_t dma_handle; + + cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, gfp); + +where device is a struct device *. This may be called in interrupt +context with the GFP_ATOMIC flag. + +Size is the length of the region you want to allocate, in bytes. + +This routine will allocate RAM for that region, so it acts similarly to +__get_free_pages (but takes size instead of a page order). If your +driver needs regions sized smaller than a page, you may prefer using +the dma_pool interface, described below. + +The consistent DMA mapping interfaces, for non-NULL dev, will by +default return a DMA address which is 32-bit addressable. Even if the +device indicates (via DMA mask) that it may address the upper 32-bits, +consistent allocation will only return > 32-bit addresses for DMA if +the consistent DMA mask has been explicitly changed via +dma_set_coherent_mask(). This is true of the dma_pool interface as +well. + +dma_alloc_coherent returns two values: the virtual address which you +can use to access it from the CPU and dma_handle which you pass to the +card. + +The cpu return address and the DMA bus master address are both +guaranteed to be aligned to the smallest PAGE_SIZE order which +is greater than or equal to the requested size. This invariant +exists (for example) to guarantee that if you allocate a chunk +which is smaller than or equal to 64 kilobytes, the extent of the +buffer you receive will not cross a 64K boundary. + +To unmap and free such a DMA region, you call: + + dma_free_coherent(dev, size, cpu_addr, dma_handle); + +where dev, size are the same as in the above call and cpu_addr and +dma_handle are the values dma_alloc_coherent returned to you. +This function may not be called in interrupt context. + +If your driver needs lots of smaller memory regions, you can write +custom code to subdivide pages returned by dma_alloc_coherent, +or you can use the dma_pool API to do that. A dma_pool is like +a kmem_cache, but it uses dma_alloc_coherent not __get_free_pages. +Also, it understands common hardware constraints for alignment, +like queue heads needing to be aligned on N byte boundaries. + +Create a dma_pool like this: + + struct dma_pool *pool; + + pool = dma_pool_create(name, dev, size, align, alloc); + +The "name" is for diagnostics (like a kmem_cache name); dev and size +are as above. The device's hardware alignment requirement for this +type of data is "align" (which is expressed in bytes, and must be a +power of two). If your device has no boundary crossing restrictions, +pass 0 for alloc; passing 4096 says memory allocated from this pool +must not cross 4KByte boundaries (but at that time it may be better to +go for dma_alloc_coherent directly instead). + +Allocate memory from a dma pool like this: + + cpu_addr = dma_pool_alloc(pool, flags, &dma_handle); + +flags are SLAB_KERNEL if blocking is permitted (not in_interrupt nor +holding SMP locks), SLAB_ATOMIC otherwise. Like dma_alloc_coherent, +this returns two values, cpu_addr and dma_handle. + +Free memory that was allocated from a dma_pool like this: + + dma_pool_free(pool, cpu_addr, dma_handle); + +where pool is what you passed to dma_pool_alloc, and cpu_addr and +dma_handle are the values dma_pool_alloc returned. This function +may be called in interrupt context. + +Destroy a dma_pool by calling: + + dma_pool_destroy(pool); + +Make sure you've called dma_pool_free for all memory allocated +from a pool before you destroy the pool. This function may not +be called in interrupt context. + + DMA Direction + +The interfaces described in subsequent portions of this document +take a DMA direction argument, which is an integer and takes on +one of the following values: + + DMA_BIDIRECTIONAL + DMA_TO_DEVICE + DMA_FROM_DEVICE + DMA_NONE + +One should provide the exact DMA direction if you know it. + +DMA_TO_DEVICE means "from main memory to the device" +DMA_FROM_DEVICE means "from the device to main memory" +It is the direction in which the data moves during the DMA +transfer. + +You are _strongly_ encouraged to specify this as precisely +as you possibly can. + +If you absolutely cannot know the direction of the DMA transfer, +specify DMA_BIDIRECTIONAL. It means that the DMA can go in +either direction. The platform guarantees that you may legally +specify this, and that it will work, but this may be at the +cost of performance for example. + +The value DMA_NONE is to be used for debugging. One can +hold this in a data structure before you come to know the +precise direction, and this will help catch cases where your +direction tracking logic has failed to set things up properly. + +Another advantage of specifying this value precisely (outside of +potential platform-specific optimizations of such) is for debugging. +Some platforms actually have a write permission boolean which DMA +mappings can be marked with, much like page protections in the user +program address space. Such platforms can and do report errors in the +kernel logs when the DMA controller hardware detects violation of the +permission setting. + +Only streaming mappings specify a direction, consistent mappings +implicitly have a direction attribute setting of +DMA_BIDIRECTIONAL. + +The SCSI subsystem tells you the direction to use in the +'sc_data_direction' member of the SCSI command your driver is +working on. + +For Networking drivers, it's a rather simple affair. For transmit +packets, map/unmap them with the DMA_TO_DEVICE direction +specifier. For receive packets, just the opposite, map/unmap them +with the DMA_FROM_DEVICE direction specifier. + + Using Streaming DMA mappings + +The streaming DMA mapping routines can be called from interrupt +context. There are two versions of each map/unmap, one which will +map/unmap a single memory region, and one which will map/unmap a +scatterlist. + +To map a single region, you do: + + struct device *dev = &my_dev->dev; + dma_addr_t dma_handle; + void *addr = buffer->ptr; + size_t size = buffer->len; + + dma_handle = dma_map_single(dev, addr, size, direction); + +and to unmap it: + + dma_unmap_single(dev, dma_handle, size, direction); + +You should call dma_unmap_single when the DMA activity is finished, e.g. +from the interrupt which told you that the DMA transfer is done. + +Using cpu pointers like this for single mappings has a disadvantage, +you cannot reference HIGHMEM memory in this way. Thus, there is a +map/unmap interface pair akin to dma_{map,unmap}_single. These +interfaces deal with page/offset pairs instead of cpu pointers. +Specifically: + + struct device *dev = &my_dev->dev; + dma_addr_t dma_handle; + struct page *page = buffer->page; + unsigned long offset = buffer->offset; + size_t size = buffer->len; + + dma_handle = dma_map_page(dev, page, offset, size, direction); + + ... + + dma_unmap_page(dev, dma_handle, size, direction); + +Here, "offset" means byte offset within the given page. + +With scatterlists, you map a region gathered from several regions by: + + int i, count = dma_map_sg(dev, sglist, nents, direction); + struct scatterlist *sg; + + for_each_sg(sglist, sg, count, i) { + hw_address[i] = sg_dma_address(sg); + hw_len[i] = sg_dma_len(sg); + } + +where nents is the number of entries in the sglist. + +The implementation is free to merge several consecutive sglist entries +into one (e.g. if DMA mapping is done with PAGE_SIZE granularity, any +consecutive sglist entries can be merged into one provided the first one +ends and the second one starts on a page boundary - in fact this is a huge +advantage for cards which either cannot do scatter-gather or have very +limited number of scatter-gather entries) and returns the actual number +of sg entries it mapped them to. On failure 0 is returned. + +Then you should loop count times (note: this can be less than nents times) +and use sg_dma_address() and sg_dma_len() macros where you previously +accessed sg->address and sg->length as shown above. + +To unmap a scatterlist, just call: + + dma_unmap_sg(dev, sglist, nents, direction); + +Again, make sure DMA activity has already finished. + +PLEASE NOTE: The 'nents' argument to the dma_unmap_sg call must be + the _same_ one you passed into the dma_map_sg call, + it should _NOT_ be the 'count' value _returned_ from the + dma_map_sg call. + +Every dma_map_{single,sg} call should have its dma_unmap_{single,sg} +counterpart, because the bus address space is a shared resource (although +in some ports the mapping is per each BUS so less devices contend for the +same bus address space) and you could render the machine unusable by eating +all bus addresses. + +If you need to use the same streaming DMA region multiple times and touch +the data in between the DMA transfers, the buffer needs to be synced +properly in order for the cpu and device to see the most uptodate and +correct copy of the DMA buffer. + +So, firstly, just map it with dma_map_{single,sg}, and after each DMA +transfer call either: + + dma_sync_single_for_cpu(dev, dma_handle, size, direction); + +or: + + dma_sync_sg_for_cpu(dev, sglist, nents, direction); + +as appropriate. + +Then, if you wish to let the device get at the DMA area again, +finish accessing the data with the cpu, and then before actually +giving the buffer to the hardware call either: + + dma_sync_single_for_device(dev, dma_handle, size, direction); + +or: + + dma_sync_sg_for_device(dev, sglist, nents, direction); + +as appropriate. + +After the last DMA transfer call one of the DMA unmap routines +dma_unmap_{single,sg}. If you don't touch the data from the first dma_map_* +call till dma_unmap_*, then you don't have to call the dma_sync_* +routines at all. + +Here is pseudo code which shows a situation in which you would need +to use the dma_sync_*() interfaces. + + my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len) + { + dma_addr_t mapping; + + mapping = dma_map_single(cp->dev, buffer, len, DMA_FROM_DEVICE); + + cp->rx_buf = buffer; + cp->rx_len = len; + cp->rx_dma = mapping; + + give_rx_buf_to_card(cp); + } + + ... + + my_card_interrupt_handler(int irq, void *devid, struct pt_regs *regs) + { + struct my_card *cp = devid; + + ... + if (read_card_status(cp) == RX_BUF_TRANSFERRED) { + struct my_card_header *hp; + + /* Examine the header to see if we wish + * to accept the data. But synchronize + * the DMA transfer with the CPU first + * so that we see updated contents. + */ + dma_sync_single_for_cpu(&cp->dev, cp->rx_dma, + cp->rx_len, + DMA_FROM_DEVICE); + + /* Now it is safe to examine the buffer. */ + hp = (struct my_card_header *) cp->rx_buf; + if (header_is_ok(hp)) { + dma_unmap_single(&cp->dev, cp->rx_dma, cp->rx_len, + DMA_FROM_DEVICE); + pass_to_upper_layers(cp->rx_buf); + make_and_setup_new_rx_buf(cp); + } else { + /* Just sync the buffer and give it back + * to the card. + */ + dma_sync_single_for_device(&cp->dev, + cp->rx_dma, + cp->rx_len, + DMA_FROM_DEVICE); + give_rx_buf_to_card(cp); + } + } + } + +Drivers converted fully to this interface should not use virt_to_bus any +longer, nor should they use bus_to_virt. Some drivers have to be changed a +little bit, because there is no longer an equivalent to bus_to_virt in the +dynamic DMA mapping scheme - you have to always store the DMA addresses +returned by the dma_alloc_coherent, dma_pool_alloc, and dma_map_single +calls (dma_map_sg stores them in the scatterlist itself if the platform +supports dynamic DMA mapping in hardware) in your driver structures and/or +in the card registers. + +All drivers should be using these interfaces with no exceptions. It +is planned to completely remove virt_to_bus() and bus_to_virt() as +they are entirely deprecated. Some ports already do not provide these +as it is impossible to correctly support them. + + Optimizing Unmap State Space Consumption + +On many platforms, dma_unmap_{single,page}() is simply a nop. +Therefore, keeping track of the mapping address and length is a waste +of space. Instead of filling your drivers up with ifdefs and the like +to "work around" this (which would defeat the whole purpose of a +portable API) the following facilities are provided. + +Actually, instead of describing the macros one by one, we'll +transform some example code. + +1) Use DEFINE_DMA_UNMAP_{ADDR,LEN} in state saving structures. + Example, before: + + struct ring_state { + struct sk_buff *skb; + dma_addr_t mapping; + __u32 len; + }; + + after: + + struct ring_state { + struct sk_buff *skb; + DEFINE_DMA_UNMAP_ADDR(mapping); + DEFINE_DMA_UNMAP_LEN(len); + }; + +2) Use dma_unmap_{addr,len}_set to set these values. + Example, before: + + ringp->mapping = FOO; + ringp->len = BAR; + + after: + + dma_unmap_addr_set(ringp, mapping, FOO); + dma_unmap_len_set(ringp, len, BAR); + +3) Use dma_unmap_{addr,len} to access these values. + Example, before: + + dma_unmap_single(dev, ringp->mapping, ringp->len, + DMA_FROM_DEVICE); + + after: + + dma_unmap_single(dev, + dma_unmap_addr(ringp, mapping), + dma_unmap_len(ringp, len), + DMA_FROM_DEVICE); + +It really should be self-explanatory. We treat the ADDR and LEN +separately, because it is possible for an implementation to only +need the address in order to perform the unmap operation. + + Platform Issues + +If you are just writing drivers for Linux and do not maintain +an architecture port for the kernel, you can safely skip down +to "Closing". + +1) Struct scatterlist requirements. + + Struct scatterlist must contain, at a minimum, the following + members: + + struct page *page; + unsigned int offset; + unsigned int length; + + The base address is specified by a "page+offset" pair. + + Previous versions of struct scatterlist contained a "void *address" + field that was sometimes used instead of page+offset. As of Linux + 2.5., page+offset is always used, and the "address" field has been + deleted. + +2) More to come... + + Handling Errors + +DMA address space is limited on some architectures and an allocation +failure can be determined by: + +- checking if dma_alloc_coherent returns NULL or dma_map_sg returns 0 + +- checking the returned dma_addr_t of dma_map_single and dma_map_page + by using dma_mapping_error(): + + dma_addr_t dma_handle; + + dma_handle = dma_map_single(dev, addr, size, direction); + if (dma_mapping_error(dev, dma_handle)) { + /* + * reduce current DMA mapping usage, + * delay and try again later or + * reset driver. + */ + } + + Closing + +This document, and the API itself, would not be in it's current +form without the feedback and suggestions from numerous individuals. +We would like to specifically mention, in no particular order, the +following people: + + Russell King + Leo Dagum + Ralf Baechle + Grant Grundler + Jay Estabrook + Thomas Sailer + Andrea Arcangeli + Jens Axboe + David Mosberger-Tang diff --git a/Documentation/PCI/PCI-DMA-mapping.txt b/Documentation/PCI/PCI-DMA-mapping.txt deleted file mode 100644 index 52618ab069ad..000000000000 --- a/Documentation/PCI/PCI-DMA-mapping.txt +++ /dev/null @@ -1,758 +0,0 @@ - Dynamic DMA mapping Guide - ========================= - - David S. Miller - Richard Henderson - Jakub Jelinek - -This is a guide to device driver writers on how to use the DMA API -with example pseudo-code. For a concise description of the API, see -DMA-API.txt. - -Most of the 64bit platforms have special hardware that translates bus -addresses (DMA addresses) into physical addresses. This is similar to -how page tables and/or a TLB translates virtual addresses to physical -addresses on a CPU. This is needed so that e.g. PCI devices can -access with a Single Address Cycle (32bit DMA address) any page in the -64bit physical address space. Previously in Linux those 64bit -platforms had to set artificial limits on the maximum RAM size in the -system, so that the virt_to_bus() static scheme works (the DMA address -translation tables were simply filled on bootup to map each bus -address to the physical page __pa(bus_to_virt())). - -So that Linux can use the dynamic DMA mapping, it needs some help from the -drivers, namely it has to take into account that DMA addresses should be -mapped only for the time they are actually used and unmapped after the DMA -transfer. - -The following API will work of course even on platforms where no such -hardware exists. - -Note that the DMA API works with any bus independent of the underlying -microprocessor architecture. You should use the DMA API rather than -the bus specific DMA API (e.g. pci_dma_*). - -First of all, you should make sure - -#include - -is in your driver. This file will obtain for you the definition of the -dma_addr_t (which can hold any valid DMA address for the platform) -type which should be used everywhere you hold a DMA (bus) address -returned from the DMA mapping functions. - - What memory is DMA'able? - -The first piece of information you must know is what kernel memory can -be used with the DMA mapping facilities. There has been an unwritten -set of rules regarding this, and this text is an attempt to finally -write them down. - -If you acquired your memory via the page allocator -(i.e. __get_free_page*()) or the generic memory allocators -(i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from -that memory using the addresses returned from those routines. - -This means specifically that you may _not_ use the memory/addresses -returned from vmalloc() for DMA. It is possible to DMA to the -_underlying_ memory mapped into a vmalloc() area, but this requires -walking page tables to get the physical addresses, and then -translating each of those pages back to a kernel address using -something like __va(). [ EDIT: Update this when we integrate -Gerd Knorr's generic code which does this. ] - -This rule also means that you may use neither kernel image addresses -(items in data/text/bss segments), nor module image addresses, nor -stack addresses for DMA. These could all be mapped somewhere entirely -different than the rest of physical memory. Even if those classes of -memory could physically work with DMA, you'd need to ensure the I/O -buffers were cacheline-aligned. Without that, you'd see cacheline -sharing problems (data corruption) on CPUs with DMA-incoherent caches. -(The CPU could write to one word, DMA would write to a different one -in the same cache line, and one of them could be overwritten.) - -Also, this means that you cannot take the return of a kmap() -call and DMA to/from that. This is similar to vmalloc(). - -What about block I/O and networking buffers? The block I/O and -networking subsystems make sure that the buffers they use are valid -for you to DMA from/to. - - DMA addressing limitations - -Does your device have any DMA addressing limitations? For example, is -your device only capable of driving the low order 24-bits of address? -If so, you need to inform the kernel of this fact. - -By default, the kernel assumes that your device can address the full -32-bits. For a 64-bit capable device, this needs to be increased. -And for a device with limitations, as discussed in the previous -paragraph, it needs to be decreased. - -Special note about PCI: PCI-X specification requires PCI-X devices to -support 64-bit addressing (DAC) for all transactions. And at least -one platform (SGI SN2) requires 64-bit consistent allocations to -operate correctly when the IO bus is in PCI-X mode. - -For correct operation, you must interrogate the kernel in your device -probe routine to see if the DMA controller on the machine can properly -support the DMA addressing limitation your device has. It is good -style to do this even if your device holds the default setting, -because this shows that you did think about these issues wrt. your -device. - -The query is performed via a call to dma_set_mask(): - - int dma_set_mask(struct device *dev, u64 mask); - -The query for consistent allocations is performed via a call to -dma_set_coherent_mask(): - - int dma_set_coherent_mask(struct device *dev, u64 mask); - -Here, dev is a pointer to the device struct of your device, and mask -is a bit mask describing which bits of an address your device -supports. It returns zero if your card can perform DMA properly on -the machine given the address mask you provided. In general, the -device struct of your device is embedded in the bus specific device -struct of your device. For example, a pointer to the device struct of -your PCI device is pdev->dev (pdev is a pointer to the PCI device -struct of your device). - -If it returns non-zero, your device cannot perform DMA properly on -this platform, and attempting to do so will result in undefined -behavior. You must either use a different mask, or not use DMA. - -This means that in the failure case, you have three options: - -1) Use another DMA mask, if possible (see below). -2) Use some non-DMA mode for data transfer, if possible. -3) Ignore this device and do not initialize it. - -It is recommended that your driver print a kernel KERN_WARNING message -when you end up performing either #2 or #3. In this manner, if a user -of your driver reports that performance is bad or that the device is not -even detected, you can ask them for the kernel messages to find out -exactly why. - -The standard 32-bit addressing device would do something like this: - - if (dma_set_mask(dev, DMA_BIT_MASK(32))) { - printk(KERN_WARNING - "mydev: No suitable DMA available.\n"); - goto ignore_this_device; - } - -Another common scenario is a 64-bit capable device. The approach here -is to try for 64-bit addressing, but back down to a 32-bit mask that -should not fail. The kernel may fail the 64-bit mask not because the -platform is not capable of 64-bit addressing. Rather, it may fail in -this case simply because 32-bit addressing is done more efficiently -than 64-bit addressing. For example, Sparc64 PCI SAC addressing is -more efficient than DAC addressing. - -Here is how you would handle a 64-bit capable device which can drive -all 64-bits when accessing streaming DMA: - - int using_dac; - - if (!dma_set_mask(dev, DMA_BIT_MASK(64))) { - using_dac = 1; - } else if (!dma_set_mask(dev, DMA_BIT_MASK(32))) { - using_dac = 0; - } else { - printk(KERN_WARNING - "mydev: No suitable DMA available.\n"); - goto ignore_this_device; - } - -If a card is capable of using 64-bit consistent allocations as well, -the case would look like this: - - int using_dac, consistent_using_dac; - - if (!dma_set_mask(dev, DMA_BIT_MASK(64))) { - using_dac = 1; - consistent_using_dac = 1; - dma_set_coherent_mask(dev, DMA_BIT_MASK(64)); - } else if (!dma_set_mask(dev, DMA_BIT_MASK(32))) { - using_dac = 0; - consistent_using_dac = 0; - dma_set_coherent_mask(dev, DMA_BIT_MASK(32)); - } else { - printk(KERN_WARNING - "mydev: No suitable DMA available.\n"); - goto ignore_this_device; - } - -dma_set_coherent_mask() will always be able to set the same or a -smaller mask as dma_set_mask(). However for the rare case that a -device driver only uses consistent allocations, one would have to -check the return value from dma_set_coherent_mask(). - -Finally, if your device can only drive the low 24-bits of -address you might do something like: - - if (dma_set_mask(dev, DMA_BIT_MASK(24))) { - printk(KERN_WARNING - "mydev: 24-bit DMA addressing not available.\n"); - goto ignore_this_device; - } - -When dma_set_mask() is successful, and returns zero, the kernel saves -away this mask you have provided. The kernel will use this -information later when you make DMA mappings. - -There is a case which we are aware of at this time, which is worth -mentioning in this documentation. If your device supports multiple -functions (for example a sound card provides playback and record -functions) and the various different functions have _different_ -DMA addressing limitations, you may wish to probe each mask and -only provide the functionality which the machine can handle. It -is important that the last call to dma_set_mask() be for the -most specific mask. - -Here is pseudo-code showing how this might be done: - - #define PLAYBACK_ADDRESS_BITS DMA_BIT_MASK(32) - #define RECORD_ADDRESS_BITS DMA_BIT_MASK(24) - - struct my_sound_card *card; - struct device *dev; - - ... - if (!dma_set_mask(dev, PLAYBACK_ADDRESS_BITS)) { - card->playback_enabled = 1; - } else { - card->playback_enabled = 0; - printk(KERN_WARNING "%s: Playback disabled due to DMA limitations.\n", - card->name); - } - if (!dma_set_mask(dev, RECORD_ADDRESS_BITS)) { - card->record_enabled = 1; - } else { - card->record_enabled = 0; - printk(KERN_WARNING "%s: Record disabled due to DMA limitations.\n", - card->name); - } - -A sound card was used as an example here because this genre of PCI -devices seems to be littered with ISA chips given a PCI front end, -and thus retaining the 16MB DMA addressing limitations of ISA. - - Types of DMA mappings - -There are two types of DMA mappings: - -- Consistent DMA mappings which are usually mapped at driver - initialization, unmapped at the end and for which the hardware should - guarantee that the device and the CPU can access the data - in parallel and will see updates made by each other without any - explicit software flushing. - - Think of "consistent" as "synchronous" or "coherent". - - The current default is to return consistent memory in the low 32 - bits of the bus space. However, for future compatibility you should - set the consistent mask even if this default is fine for your - driver. - - Good examples of what to use consistent mappings for are: - - - Network card DMA ring descriptors. - - SCSI adapter mailbox command data structures. - - Device firmware microcode executed out of - main memory. - - The invariant these examples all require is that any CPU store - to memory is immediately visible to the device, and vice - versa. Consistent mappings guarantee this. - - IMPORTANT: Consistent DMA memory does not preclude the usage of - proper memory barriers. The CPU may reorder stores to - consistent memory just as it may normal memory. Example: - if it is important for the device to see the first word - of a descriptor updated before the second, you must do - something like: - - desc->word0 = address; - wmb(); - desc->word1 = DESC_VALID; - - in order to get correct behavior on all platforms. - - Also, on some platforms your driver may need to flush CPU write - buffers in much the same way as it needs to flush write buffers - found in PCI bridges (such as by reading a register's value - after writing it). - -- Streaming DMA mappings which are usually mapped for one DMA - transfer, unmapped right after it (unless you use dma_sync_* below) - and for which hardware can optimize for sequential accesses. - - This of "streaming" as "asynchronous" or "outside the coherency - domain". - - Good examples of what to use streaming mappings for are: - - - Networking buffers transmitted/received by a device. - - Filesystem buffers written/read by a SCSI device. - - The interfaces for using this type of mapping were designed in - such a way that an implementation can make whatever performance - optimizations the hardware allows. To this end, when using - such mappings you must be explicit about what you want to happen. - -Neither type of DMA mapping has alignment restrictions that come from -the underlying bus, although some devices may have such restrictions. -Also, systems with caches that aren't DMA-coherent will work better -when the underlying buffers don't share cache lines with other data. - - - Using Consistent DMA mappings. - -To allocate and map large (PAGE_SIZE or so) consistent DMA regions, -you should do: - - dma_addr_t dma_handle; - - cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, gfp); - -where device is a struct device *. This may be called in interrupt -context with the GFP_ATOMIC flag. - -Size is the length of the region you want to allocate, in bytes. - -This routine will allocate RAM for that region, so it acts similarly to -__get_free_pages (but takes size instead of a page order). If your -driver needs regions sized smaller than a page, you may prefer using -the dma_pool interface, described below. - -The consistent DMA mapping interfaces, for non-NULL dev, will by -default return a DMA address which is 32-bit addressable. Even if the -device indicates (via DMA mask) that it may address the upper 32-bits, -consistent allocation will only return > 32-bit addresses for DMA if -the consistent DMA mask has been explicitly changed via -dma_set_coherent_mask(). This is true of the dma_pool interface as -well. - -dma_alloc_coherent returns two values: the virtual address which you -can use to access it from the CPU and dma_handle which you pass to the -card. - -The cpu return address and the DMA bus master address are both -guaranteed to be aligned to the smallest PAGE_SIZE order which -is greater than or equal to the requested size. This invariant -exists (for example) to guarantee that if you allocate a chunk -which is smaller than or equal to 64 kilobytes, the extent of the -buffer you receive will not cross a 64K boundary. - -To unmap and free such a DMA region, you call: - - dma_free_coherent(dev, size, cpu_addr, dma_handle); - -where dev, size are the same as in the above call and cpu_addr and -dma_handle are the values dma_alloc_coherent returned to you. -This function may not be called in interrupt context. - -If your driver needs lots of smaller memory regions, you can write -custom code to subdivide pages returned by dma_alloc_coherent, -or you can use the dma_pool API to do that. A dma_pool is like -a kmem_cache, but it uses dma_alloc_coherent not __get_free_pages. -Also, it understands common hardware constraints for alignment, -like queue heads needing to be aligned on N byte boundaries. - -Create a dma_pool like this: - - struct dma_pool *pool; - - pool = dma_pool_create(name, dev, size, align, alloc); - -The "name" is for diagnostics (like a kmem_cache name); dev and size -are as above. The device's hardware alignment requirement for this -type of data is "align" (which is expressed in bytes, and must be a -power of two). If your device has no boundary crossing restrictions, -pass 0 for alloc; passing 4096 says memory allocated from this pool -must not cross 4KByte boundaries (but at that time it may be better to -go for dma_alloc_coherent directly instead). - -Allocate memory from a dma pool like this: - - cpu_addr = dma_pool_alloc(pool, flags, &dma_handle); - -flags are SLAB_KERNEL if blocking is permitted (not in_interrupt nor -holding SMP locks), SLAB_ATOMIC otherwise. Like dma_alloc_coherent, -this returns two values, cpu_addr and dma_handle. - -Free memory that was allocated from a dma_pool like this: - - dma_pool_free(pool, cpu_addr, dma_handle); - -where pool is what you passed to dma_pool_alloc, and cpu_addr and -dma_handle are the values dma_pool_alloc returned. This function -may be called in interrupt context. - -Destroy a dma_pool by calling: - - dma_pool_destroy(pool); - -Make sure you've called dma_pool_free for all memory allocated -from a pool before you destroy the pool. This function may not -be called in interrupt context. - - DMA Direction - -The interfaces described in subsequent portions of this document -take a DMA direction argument, which is an integer and takes on -one of the following values: - - DMA_BIDIRECTIONAL - DMA_TO_DEVICE - DMA_FROM_DEVICE - DMA_NONE - -One should provide the exact DMA direction if you know it. - -DMA_TO_DEVICE means "from main memory to the device" -DMA_FROM_DEVICE means "from the device to main memory" -It is the direction in which the data moves during the DMA -transfer. - -You are _strongly_ encouraged to specify this as precisely -as you possibly can. - -If you absolutely cannot know the direction of the DMA transfer, -specify DMA_BIDIRECTIONAL. It means that the DMA can go in -either direction. The platform guarantees that you may legally -specify this, and that it will work, but this may be at the -cost of performance for example. - -The value DMA_NONE is to be used for debugging. One can -hold this in a data structure before you come to know the -precise direction, and this will help catch cases where your -direction tracking logic has failed to set things up properly. - -Another advantage of specifying this value precisely (outside of -potential platform-specific optimizations of such) is for debugging. -Some platforms actually have a write permission boolean which DMA -mappings can be marked with, much like page protections in the user -program address space. Such platforms can and do report errors in the -kernel logs when the DMA controller hardware detects violation of the -permission setting. - -Only streaming mappings specify a direction, consistent mappings -implicitly have a direction attribute setting of -DMA_BIDIRECTIONAL. - -The SCSI subsystem tells you the direction to use in the -'sc_data_direction' member of the SCSI command your driver is -working on. - -For Networking drivers, it's a rather simple affair. For transmit -packets, map/unmap them with the DMA_TO_DEVICE direction -specifier. For receive packets, just the opposite, map/unmap them -with the DMA_FROM_DEVICE direction specifier. - - Using Streaming DMA mappings - -The streaming DMA mapping routines can be called from interrupt -context. There are two versions of each map/unmap, one which will -map/unmap a single memory region, and one which will map/unmap a -scatterlist. - -To map a single region, you do: - - struct device *dev = &my_dev->dev; - dma_addr_t dma_handle; - void *addr = buffer->ptr; - size_t size = buffer->len; - - dma_handle = dma_map_single(dev, addr, size, direction); - -and to unmap it: - - dma_unmap_single(dev, dma_handle, size, direction); - -You should call dma_unmap_single when the DMA activity is finished, e.g. -from the interrupt which told you that the DMA transfer is done. - -Using cpu pointers like this for single mappings has a disadvantage, -you cannot reference HIGHMEM memory in this way. Thus, there is a -map/unmap interface pair akin to dma_{map,unmap}_single. These -interfaces deal with page/offset pairs instead of cpu pointers. -Specifically: - - struct device *dev = &my_dev->dev; - dma_addr_t dma_handle; - struct page *page = buffer->page; - unsigned long offset = buffer->offset; - size_t size = buffer->len; - - dma_handle = dma_map_page(dev, page, offset, size, direction); - - ... - - dma_unmap_page(dev, dma_handle, size, direction); - -Here, "offset" means byte offset within the given page. - -With scatterlists, you map a region gathered from several regions by: - - int i, count = dma_map_sg(dev, sglist, nents, direction); - struct scatterlist *sg; - - for_each_sg(sglist, sg, count, i) { - hw_address[i] = sg_dma_address(sg); - hw_len[i] = sg_dma_len(sg); - } - -where nents is the number of entries in the sglist. - -The implementation is free to merge several consecutive sglist entries -into one (e.g. if DMA mapping is done with PAGE_SIZE granularity, any -consecutive sglist entries can be merged into one provided the first one -ends and the second one starts on a page boundary - in fact this is a huge -advantage for cards which either cannot do scatter-gather or have very -limited number of scatter-gather entries) and returns the actual number -of sg entries it mapped them to. On failure 0 is returned. - -Then you should loop count times (note: this can be less than nents times) -and use sg_dma_address() and sg_dma_len() macros where you previously -accessed sg->address and sg->length as shown above. - -To unmap a scatterlist, just call: - - dma_unmap_sg(dev, sglist, nents, direction); - -Again, make sure DMA activity has already finished. - -PLEASE NOTE: The 'nents' argument to the dma_unmap_sg call must be - the _same_ one you passed into the dma_map_sg call, - it should _NOT_ be the 'count' value _returned_ from the - dma_map_sg call. - -Every dma_map_{single,sg} call should have its dma_unmap_{single,sg} -counterpart, because the bus address space is a shared resource (although -in some ports the mapping is per each BUS so less devices contend for the -same bus address space) and you could render the machine unusable by eating -all bus addresses. - -If you need to use the same streaming DMA region multiple times and touch -the data in between the DMA transfers, the buffer needs to be synced -properly in order for the cpu and device to see the most uptodate and -correct copy of the DMA buffer. - -So, firstly, just map it with dma_map_{single,sg}, and after each DMA -transfer call either: - - dma_sync_single_for_cpu(dev, dma_handle, size, direction); - -or: - - dma_sync_sg_for_cpu(dev, sglist, nents, direction); - -as appropriate. - -Then, if you wish to let the device get at the DMA area again, -finish accessing the data with the cpu, and then before actually -giving the buffer to the hardware call either: - - dma_sync_single_for_device(dev, dma_handle, size, direction); - -or: - - dma_sync_sg_for_device(dev, sglist, nents, direction); - -as appropriate. - -After the last DMA transfer call one of the DMA unmap routines -dma_unmap_{single,sg}. If you don't touch the data from the first dma_map_* -call till dma_unmap_*, then you don't have to call the dma_sync_* -routines at all. - -Here is pseudo code which shows a situation in which you would need -to use the dma_sync_*() interfaces. - - my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len) - { - dma_addr_t mapping; - - mapping = dma_map_single(cp->dev, buffer, len, DMA_FROM_DEVICE); - - cp->rx_buf = buffer; - cp->rx_len = len; - cp->rx_dma = mapping; - - give_rx_buf_to_card(cp); - } - - ... - - my_card_interrupt_handler(int irq, void *devid, struct pt_regs *regs) - { - struct my_card *cp = devid; - - ... - if (read_card_status(cp) == RX_BUF_TRANSFERRED) { - struct my_card_header *hp; - - /* Examine the header to see if we wish - * to accept the data. But synchronize - * the DMA transfer with the CPU first - * so that we see updated contents. - */ - dma_sync_single_for_cpu(&cp->dev, cp->rx_dma, - cp->rx_len, - DMA_FROM_DEVICE); - - /* Now it is safe to examine the buffer. */ - hp = (struct my_card_header *) cp->rx_buf; - if (header_is_ok(hp)) { - dma_unmap_single(&cp->dev, cp->rx_dma, cp->rx_len, - DMA_FROM_DEVICE); - pass_to_upper_layers(cp->rx_buf); - make_and_setup_new_rx_buf(cp); - } else { - /* Just sync the buffer and give it back - * to the card. - */ - dma_sync_single_for_device(&cp->dev, - cp->rx_dma, - cp->rx_len, - DMA_FROM_DEVICE); - give_rx_buf_to_card(cp); - } - } - } - -Drivers converted fully to this interface should not use virt_to_bus any -longer, nor should they use bus_to_virt. Some drivers have to be changed a -little bit, because there is no longer an equivalent to bus_to_virt in the -dynamic DMA mapping scheme - you have to always store the DMA addresses -returned by the dma_alloc_coherent, dma_pool_alloc, and dma_map_single -calls (dma_map_sg stores them in the scatterlist itself if the platform -supports dynamic DMA mapping in hardware) in your driver structures and/or -in the card registers. - -All drivers should be using these interfaces with no exceptions. It -is planned to completely remove virt_to_bus() and bus_to_virt() as -they are entirely deprecated. Some ports already do not provide these -as it is impossible to correctly support them. - - Optimizing Unmap State Space Consumption - -On many platforms, dma_unmap_{single,page}() is simply a nop. -Therefore, keeping track of the mapping address and length is a waste -of space. Instead of filling your drivers up with ifdefs and the like -to "work around" this (which would defeat the whole purpose of a -portable API) the following facilities are provided. - -Actually, instead of describing the macros one by one, we'll -transform some example code. - -1) Use DEFINE_DMA_UNMAP_{ADDR,LEN} in state saving structures. - Example, before: - - struct ring_state { - struct sk_buff *skb; - dma_addr_t mapping; - __u32 len; - }; - - after: - - struct ring_state { - struct sk_buff *skb; - DEFINE_DMA_UNMAP_ADDR(mapping); - DEFINE_DMA_UNMAP_LEN(len); - }; - -2) Use dma_unmap_{addr,len}_set to set these values. - Example, before: - - ringp->mapping = FOO; - ringp->len = BAR; - - after: - - dma_unmap_addr_set(ringp, mapping, FOO); - dma_unmap_len_set(ringp, len, BAR); - -3) Use dma_unmap_{addr,len} to access these values. - Example, before: - - dma_unmap_single(dev, ringp->mapping, ringp->len, - DMA_FROM_DEVICE); - - after: - - dma_unmap_single(dev, - dma_unmap_addr(ringp, mapping), - dma_unmap_len(ringp, len), - DMA_FROM_DEVICE); - -It really should be self-explanatory. We treat the ADDR and LEN -separately, because it is possible for an implementation to only -need the address in order to perform the unmap operation. - - Platform Issues - -If you are just writing drivers for Linux and do not maintain -an architecture port for the kernel, you can safely skip down -to "Closing". - -1) Struct scatterlist requirements. - - Struct scatterlist must contain, at a minimum, the following - members: - - struct page *page; - unsigned int offset; - unsigned int length; - - The base address is specified by a "page+offset" pair. - - Previous versions of struct scatterlist contained a "void *address" - field that was sometimes used instead of page+offset. As of Linux - 2.5., page+offset is always used, and the "address" field has been - deleted. - -2) More to come... - - Handling Errors - -DMA address space is limited on some architectures and an allocation -failure can be determined by: - -- checking if dma_alloc_coherent returns NULL or dma_map_sg returns 0 - -- checking the returned dma_addr_t of dma_map_single and dma_map_page - by using dma_mapping_error(): - - dma_addr_t dma_handle; - - dma_handle = dma_map_single(dev, addr, size, direction); - if (dma_mapping_error(dev, dma_handle)) { - /* - * reduce current DMA mapping usage, - * delay and try again later or - * reset driver. - */ - } - - Closing - -This document, and the API itself, would not be in it's current -form without the feedback and suggestions from numerous individuals. -We would like to specifically mention, in no particular order, the -following people: - - Russell King - Leo Dagum - Ralf Baechle - Grant Grundler - Jay Estabrook - Thomas Sailer - Andrea Arcangeli - Jens Axboe - David Mosberger-Tang -- cgit v1.2.2 From 5574169613b40b85d6f4c67208fa4846b897a0a1 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 23 Mar 2010 13:35:33 -0700 Subject: doc: add the documentation for mpol=local commit 3f226aa1c (mempolicy: support mpol=local tmpfs mount option) added new mpol=local mount option. but it didn't add a documentation. This patch does it. Signed-off-by: KOSAKI Motohiro Cc: Ravikiran Thirumalai Cc: Christoph Lameter Cc: Mel Gorman Acked-by: Lee Schermerhorn Cc: Hugh Dickins Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/tmpfs.txt | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index 3015da0c6b2a..fe09a2cb1858 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt @@ -82,11 +82,13 @@ tmpfs has a mount option to set the NUMA memory allocation policy for all files in that instance (if CONFIG_NUMA is enabled) - which can be adjusted on the fly via 'mount -o remount ...' -mpol=default prefers to allocate memory from the local node +mpol=default use the process allocation policy + (see set_mempolicy(2)) mpol=prefer:Node prefers to allocate memory from the given Node mpol=bind:NodeList allocates memory only from nodes in NodeList mpol=interleave prefers to allocate from each node in turn mpol=interleave:NodeList allocates from each node of NodeList in turn +mpol=local prefers to allocate memory from the local node NodeList format is a comma-separated list of decimal numbers and ranges, a range being two hyphen-separated decimal numbers, the smallest and @@ -134,3 +136,5 @@ Author: Christoph Rohland , 1.12.01 Updated: Hugh Dickins, 4 June 2007 +Updated: + KOSAKI Motohiro, 16 Mar 2010 -- cgit v1.2.2 From 90fddabf5818367c6bd1fe1b256a10e01827862f Mon Sep 17 00:00:00 2001 From: David Howells Date: Wed, 24 Mar 2010 09:43:00 +0000 Subject: Document Linux's circular buffering capabilities Document the circular buffering capabilities available in Linux. Signed-off-by: David Howells Signed-off-by: Paul E. McKenney Reviewed-by: Randy Dunlap Reviewed-by: Stefan Richter Signed-off-by: Linus Torvalds --- Documentation/circular-buffers.txt | 234 +++++++++++++++++++++++++++++++++++++ Documentation/memory-barriers.txt | 20 ++++ 2 files changed, 254 insertions(+) create mode 100644 Documentation/circular-buffers.txt (limited to 'Documentation') diff --git a/Documentation/circular-buffers.txt b/Documentation/circular-buffers.txt new file mode 100644 index 000000000000..8117e5bf6065 --- /dev/null +++ b/Documentation/circular-buffers.txt @@ -0,0 +1,234 @@ + ================ + CIRCULAR BUFFERS + ================ + +By: David Howells + Paul E. McKenney + + +Linux provides a number of features that can be used to implement circular +buffering. There are two sets of such features: + + (1) Convenience functions for determining information about power-of-2 sized + buffers. + + (2) Memory barriers for when the producer and the consumer of objects in the + buffer don't want to share a lock. + +To use these facilities, as discussed below, there needs to be just one +producer and just one consumer. It is possible to handle multiple producers by +serialising them, and to handle multiple consumers by serialising them. + + +Contents: + + (*) What is a circular buffer? + + (*) Measuring power-of-2 buffers. + + (*) Using memory barriers with circular buffers. + - The producer. + - The consumer. + + +========================== +WHAT IS A CIRCULAR BUFFER? +========================== + +First of all, what is a circular buffer? A circular buffer is a buffer of +fixed, finite size into which there are two indices: + + (1) A 'head' index - the point at which the producer inserts items into the + buffer. + + (2) A 'tail' index - the point at which the consumer finds the next item in + the buffer. + +Typically when the tail pointer is equal to the head pointer, the buffer is +empty; and the buffer is full when the head pointer is one less than the tail +pointer. + +The head index is incremented when items are added, and the tail index when +items are removed. The tail index should never jump the head index, and both +indices should be wrapped to 0 when they reach the end of the buffer, thus +allowing an infinite amount of data to flow through the buffer. + +Typically, items will all be of the same unit size, but this isn't strictly +required to use the techniques below. The indices can be increased by more +than 1 if multiple items or variable-sized items are to be included in the +buffer, provided that neither index overtakes the other. The implementer must +be careful, however, as a region more than one unit in size may wrap the end of +the buffer and be broken into two segments. + + +============================ +MEASURING POWER-OF-2 BUFFERS +============================ + +Calculation of the occupancy or the remaining capacity of an arbitrarily sized +circular buffer would normally be a slow operation, requiring the use of a +modulus (divide) instruction. However, if the buffer is of a power-of-2 size, +then a much quicker bitwise-AND instruction can be used instead. + +Linux provides a set of macros for handling power-of-2 circular buffers. These +can be made use of by: + + #include + +The macros are: + + (*) Measure the remaining capacity of a buffer: + + CIRC_SPACE(head_index, tail_index, buffer_size); + + This returns the amount of space left in the buffer[1] into which items + can be inserted. + + + (*) Measure the maximum consecutive immediate space in a buffer: + + CIRC_SPACE_TO_END(head_index, tail_index, buffer_size); + + This returns the amount of consecutive space left in the buffer[1] into + which items can be immediately inserted without having to wrap back to the + beginning of the buffer. + + + (*) Measure the occupancy of a buffer: + + CIRC_CNT(head_index, tail_index, buffer_size); + + This returns the number of items currently occupying a buffer[2]. + + + (*) Measure the non-wrapping occupancy of a buffer: + + CIRC_CNT_TO_END(head_index, tail_index, buffer_size); + + This returns the number of consecutive items[2] that can be extracted from + the buffer without having to wrap back to the beginning of the buffer. + + +Each of these macros will nominally return a value between 0 and buffer_size-1, +however: + + [1] CIRC_SPACE*() are intended to be used in the producer. To the producer + they will return a lower bound as the producer controls the head index, + but the consumer may still be depleting the buffer on another CPU and + moving the tail index. + + To the consumer it will show an upper bound as the producer may be busy + depleting the space. + + [2] CIRC_CNT*() are intended to be used in the consumer. To the consumer they + will return a lower bound as the consumer controls the tail index, but the + producer may still be filling the buffer on another CPU and moving the + head index. + + To the producer it will show an upper bound as the consumer may be busy + emptying the buffer. + + [3] To a third party, the order in which the writes to the indices by the + producer and consumer become visible cannot be guaranteed as they are + independent and may be made on different CPUs - so the result in such a + situation will merely be a guess, and may even be negative. + + +=========================================== +USING MEMORY BARRIERS WITH CIRCULAR BUFFERS +=========================================== + +By using memory barriers in conjunction with circular buffers, you can avoid +the need to: + + (1) use a single lock to govern access to both ends of the buffer, thus + allowing the buffer to be filled and emptied at the same time; and + + (2) use atomic counter operations. + +There are two sides to this: the producer that fills the buffer, and the +consumer that empties it. Only one thing should be filling a buffer at any one +time, and only one thing should be emptying a buffer at any one time, but the +two sides can operate simultaneously. + + +THE PRODUCER +------------ + +The producer will look something like this: + + spin_lock(&producer_lock); + + unsigned long head = buffer->head; + unsigned long tail = ACCESS_ONCE(buffer->tail); + + if (CIRC_SPACE(head, tail, buffer->size) >= 1) { + /* insert one item into the buffer */ + struct item *item = buffer[head]; + + produce_item(item); + + smp_wmb(); /* commit the item before incrementing the head */ + + buffer->head = (head + 1) & (buffer->size - 1); + + /* wake_up() will make sure that the head is committed before + * waking anyone up */ + wake_up(consumer); + } + + spin_unlock(&producer_lock); + +This will instruct the CPU that the contents of the new item must be written +before the head index makes it available to the consumer and then instructs the +CPU that the revised head index must be written before the consumer is woken. + +Note that wake_up() doesn't have to be the exact mechanism used, but whatever +is used must guarantee a (write) memory barrier between the update of the head +index and the change of state of the consumer, if a change of state occurs. + + +THE CONSUMER +------------ + +The consumer will look something like this: + + spin_lock(&consumer_lock); + + unsigned long head = ACCESS_ONCE(buffer->head); + unsigned long tail = buffer->tail; + + if (CIRC_CNT(head, tail, buffer->size) >= 1) { + /* read index before reading contents at that index */ + smp_read_barrier_depends(); + + /* extract one item from the buffer */ + struct item *item = buffer[tail]; + + consume_item(item); + + smp_mb(); /* finish reading descriptor before incrementing tail */ + + buffer->tail = (tail + 1) & (buffer->size - 1); + } + + spin_unlock(&consumer_lock); + +This will instruct the CPU to make sure the index is up to date before reading +the new item, and then it shall make sure the CPU has finished reading the item +before it writes the new tail pointer, which will erase the item. + + +Note the use of ACCESS_ONCE() in both algorithms to read the opposition index. +This prevents the compiler from discarding and reloading its cached value - +which some compilers will do across smp_read_barrier_depends(). This isn't +strictly needed if you can be sure that the opposition index will _only_ be +used the once. + + +=============== +FURTHER READING +=============== + +See also Documentation/memory-barriers.txt for a description of Linux's memory +barrier facilities. diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 7f5809eddee6..631ad2f1b229 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -3,6 +3,7 @@ ============================ By: David Howells + Paul E. McKenney Contents: @@ -60,6 +61,10 @@ Contents: - And then there's the Alpha. + (*) Example uses. + + - Circular buffers. + (*) References. @@ -2226,6 +2231,21 @@ The Alpha defines the Linux kernel's memory barrier model. See the subsection on "Cache Coherency" above. +============ +EXAMPLE USES +============ + +CIRCULAR BUFFERS +---------------- + +Memory barriers can be used to implement circular buffering without the need +of a lock to serialise the producer with the consumer. See: + + Documentation/circular-buffers.txt + +for details. + + ========== REFERENCES ========== -- cgit v1.2.2 From 95d2c8ef08a902953d1ea2cad14928909e91e5d1 Mon Sep 17 00:00:00 2001 From: Timur Tabi Date: Fri, 26 Mar 2010 22:09:57 -0600 Subject: powerpc/fsl: add device tree binding for QE firmware Define a binding for embedding a QE firmware blob into the device tree. Also define a new property for the QE node that links to a firmware node. Signed-off-by: Timur Tabi Signed-off-by: Grant Likely --- .../powerpc/dts-bindings/fsl/cpm_qe/qe.txt | 54 ++++++++++++++++++++++ 1 file changed, 54 insertions(+) (limited to 'Documentation') diff --git a/Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe.txt b/Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe.txt index 6e37be1eeb2d..4f8930263dd9 100644 --- a/Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe.txt +++ b/Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe.txt @@ -21,6 +21,15 @@ Required properties: - fsl,qe-num-snums: define how many serial number(SNUM) the QE can use for the threads. +Optional properties: +- fsl,firmware-phandle: + Usage: required only if there is no fsl,qe-firmware child node + Value type: + Definition: Points to a firmware node (see "QE Firmware Node" below) + that contains the firmware that should be uploaded for this QE. + The compatible property for the firmware node should say, + "fsl,qe-firmware". + Recommended properties - brg-frequency : the internal clock source frequency for baud-rate generators in Hz. @@ -59,3 +68,48 @@ Example: reg = <0 c000>; }; }; + +* QE Firmware Node + +This node defines a firmware binary that is embedded in the device tree, for +the purpose of passing the firmware from bootloader to the kernel, or from +the hypervisor to the guest. + +The firmware node itself contains the firmware binary contents, a compatible +property, and any firmware-specific properties. The node should be placed +inside a QE node that needs it. Doing so eliminates the need for a +fsl,firmware-phandle property. Other QE nodes that need the same firmware +should define an fsl,firmware-phandle property that points to the firmware node +in the first QE node. + +The fsl,firmware property can be specified in the DTS (possibly using incbin) +or can be inserted by the boot loader at boot time. + +Required properties: + - compatible + Usage: required + Value type: + Definition: A standard property. Specify a string that indicates what + kind of firmware it is. For QE, this should be "fsl,qe-firmware". + + - fsl,firmware + Usage: required + Value type: , encoded as an array of bytes + Definition: A standard property. This property contains the firmware + binary "blob". + +Example: + qe1@e0080000 { + compatible = "fsl,qe"; + qe_firmware:qe-firmware { + compatible = "fsl,qe-firmware"; + fsl,firmware = [0x70 0xcd 0x00 0x00 0x01 0x46 0x45 ...]; + }; + ... + }; + + qe2@e0090000 { + compatible = "fsl,qe"; + fsl,firmware-phandle = <&qe_firmware>; + ... + }; -- cgit v1.2.2 From 8136b58dd0fce0b4cb649ac690e0493fb6fdacdb Mon Sep 17 00:00:00 2001 From: Cheng Renquan Date: Mon, 29 Mar 2010 19:05:57 +0800 Subject: ceph: some documentations fixes New documentation should have an entry in the 00-INDEX. Correct git urls. Signed-off-by: Cheng Renquan Signed-off-by: Sage Weil --- Documentation/filesystems/00-INDEX | 2 ++ Documentation/filesystems/ceph.txt | 9 +++++---- 2 files changed, 7 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index 3bae418c6ad3..4303614b5add 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -16,6 +16,8 @@ befs.txt - information about the BeOS filesystem for Linux. bfs.txt - info for the SCO UnixWare Boot Filesystem (BFS). +ceph.txt + - info for the Ceph Distributed File System cifs.txt - description of the CIFS filesystem. coda.txt diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt index 523fdf0828db..0660c9f5deef 100644 --- a/Documentation/filesystems/ceph.txt +++ b/Documentation/filesystems/ceph.txt @@ -8,7 +8,7 @@ Basic features include: * POSIX semantics * Seamless scaling from 1 to many thousands of nodes - * High availability and reliability. No single points of failure. + * High availability and reliability. No single point of failure. * N-way replication of data across storage nodes * Fast recovery from node failures * Automatic rebalancing of data on node addition/removal @@ -94,7 +94,7 @@ Mount Options wsize=X Specify the maximum write size in bytes. By default there is no - maximu. Ceph will normally size writes based on the file stripe + maximum. Ceph will normally size writes based on the file stripe size. rsize=X @@ -133,7 +133,8 @@ For more information on Ceph, see the home page at http://ceph.newdream.net/ The Linux kernel client source tree is available at - git://ceph.newdream.net/linux-ceph-client.git + git://ceph.newdream.net/git/ceph-client.git + git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git and the source for the full system is at - git://ceph.newdream.net/ceph.git + git://ceph.newdream.net/git/ceph.git -- cgit v1.2.2 From 5a0e3ad6af8660be21ca98a971cd00f331318c05 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Wed, 24 Mar 2010 17:04:11 +0900 Subject: include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo Guess-its-ok-by: Christoph Lameter Cc: Ingo Molnar Cc: Lee Schermerhorn --- Documentation/connector/cn_test.c | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/connector/cn_test.c b/Documentation/connector/cn_test.c index b07add3467f1..7764594778d4 100644 --- a/Documentation/connector/cn_test.c +++ b/Documentation/connector/cn_test.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include -- cgit v1.2.2 From 91cb17314e74d0e5ab572b4b84b9398c61b71abb Mon Sep 17 00:00:00 2001 From: Takashi Iwai Date: Thu, 1 Apr 2010 18:08:29 +0200 Subject: ALSA: hda - Update document about MSI and interrupts Signed-off-by: Takashi Iwai --- Documentation/sound/alsa/HD-Audio.txt | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sound/alsa/HD-Audio.txt b/Documentation/sound/alsa/HD-Audio.txt index f4dd3bf99d12..98d14cb8a85d 100644 --- a/Documentation/sound/alsa/HD-Audio.txt +++ b/Documentation/sound/alsa/HD-Audio.txt @@ -119,10 +119,18 @@ the codec slots 0 and 1 no matter what the hardware reports. Interrupt Handling ~~~~~~~~~~~~~~~~~~ -In rare but some cases, the interrupt isn't properly handled as -default. You would notice this by the DMA transfer error reported by -ALSA PCM core, for example. Using MSI might help in such a case. -Pass `enable_msi=1` option for enabling MSI. +HD-audio driver uses MSI as default (if available) since 2.6.33 +kernel as MSI works better on some machines, and in general, it's +better for performance. However, Nvidia controllers showed bad +regressions with MSI (especially in a combination with AMD chipset), +thus we disabled MSI for them. + +There seem also still other devices that don't work with MSI. If you +see a regression wrt the sound quality (stuttering, etc) or a lock-up +in the recent kernel, try to pass `enable_msi=0` option to disable +MSI. If it works, you can add the known bad device to the blacklist +defined in hda_intel.c. In such a case, please report and give the +patch back to the upstream developer. HD-AUDIO CODEC -- cgit v1.2.2 From a1d6f3f65512cc90a636e6ec653b7bc9e2238753 Mon Sep 17 00:00:00 2001 From: Giuseppe CAVALLARO Date: Wed, 31 Mar 2010 21:44:04 +0000 Subject: stmmac: add documentation for the driver. Add Documentation/networking/stmmac.txt for the stmmac network driver. Signed-off-by: Giuseppe Cavallaro Signed-off-by: David S. Miller --- Documentation/networking/stmmac.txt | 143 ++++++++++++++++++++++++++++++++++++ 1 file changed, 143 insertions(+) create mode 100644 Documentation/networking/stmmac.txt (limited to 'Documentation') diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt new file mode 100644 index 000000000000..7ee770b5ef5f --- /dev/null +++ b/Documentation/networking/stmmac.txt @@ -0,0 +1,143 @@ + STMicroelectronics 10/100/1000 Synopsys Ethernet driver + +Copyright (C) 2007-2010 STMicroelectronics Ltd +Author: Giuseppe Cavallaro + +This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers +(Synopsys IP blocks); it has been fully tested on STLinux platforms. + +Currently this network device driver is for all STM embedded MAC/GMAC +(7xxx SoCs). + +DWC Ether MAC 10/100/1000 Universal version 3.41a and DWC Ether MAC 10/100 +Universal version 4.0 have been used for developing the first code +implementation. + +Please, for more information also visit: www.stlinux.com + +1) Kernel Configuration +The kernel configuration option is STMMAC_ETH: + Device Drivers ---> Network device support ---> Ethernet (1000 Mbit) ---> + STMicroelectronics 10/100/1000 Ethernet driver (STMMAC_ETH) + +2) Driver parameters list: + debug: message level (0: no output, 16: all); + phyaddr: to manually provide the physical address to the PHY device; + dma_rxsize: DMA rx ring size; + dma_txsize: DMA tx ring size; + buf_sz: DMA buffer size; + tc: control the HW FIFO threshold; + tx_coe: Enable/Disable Tx Checksum Offload engine; + watchdog: transmit timeout (in milliseconds); + flow_ctrl: Flow control ability [on/off]; + pause: Flow Control Pause Time; + tmrate: timer period (only if timer optimisation is configured). + +3) Command line options +Driver parameters can be also passed in command line by using: + stmmaceth=dma_rxsize:128,dma_txsize:512 + +4) Driver information and notes + +4.1) Transmit process +The xmit method is invoked when the kernel needs to transmit a packet; it sets +the descriptors in the ring and informs the DMA engine that there is a packet +ready to be transmitted. +Once the controller has finished transmitting the packet, an interrupt is +triggered; So the driver will be able to release the socket buffers. +By default, the driver sets the NETIF_F_SG bit in the features field of the +net_device structure enabling the scatter/gather feature. + +4.2) Receive process +When one or more packets are received, an interrupt happens. The interrupts +are not queued so the driver has to scan all the descriptors in the ring during +the receive process. +This is based on NAPI so the interrupt handler signals only if there is work to be +done, and it exits. +Then the poll method will be scheduled at some future point. +The incoming packets are stored, by the DMA, in a list of pre-allocated socket +buffers in order to avoid the memcpy (Zero-copy). + +4.3) Timer-Driver Interrupt +Instead of having the device that asynchronously notifies the frame receptions, the +driver configures a timer to generate an interrupt at regular intervals. +Based on the granularity of the timer, the frames that are received by the device +will experience different levels of latency. Some NICs have dedicated timer +device to perform this task. STMMAC can use either the RTC device or the TMU +channel 2 on STLinux platforms. +The timers frequency can be passed to the driver as parameter; when change it, +take care of both hardware capability and network stability/performance impact. +Several performance tests on STM platforms showed this optimisation allows to spare +the CPU while having the maximum throughput. + +4.4) WOL +Wake up on Lan feature through Magic Frame is only supported for the GMAC +core. + +4.5) DMA descriptors +Driver handles both normal and enhanced descriptors. The latter has been only +tested on DWC Ether MAC 10/100/1000 Universal version 3.41a. + +4.6) Ethtool support +Ethtool is supported. Driver statistics and internal errors can be taken using: +ethtool -S ethX command. It is possible to dump registers etc. + +4.7) Jumbo and Segmentation Offloading +Jumbo frames are supported and tested for the GMAC. +The GSO has been also added but it's performed in software. +LRO is not supported. + +4.8) Physical +The driver is compatible with PAL to work with PHY and GPHY devices. + +4.9) Platform information +Several information came from the platform; please refer to the +driver's Header file in include/linux directory. + +struct plat_stmmacenet_data { + int bus_id; + int pbl; + int has_gmac; + void (*fix_mac_speed)(void *priv, unsigned int speed); + void (*bus_setup)(unsigned long ioaddr); +#ifdef CONFIG_STM_DRIVERS + struct stm_pad_config *pad_config; +#endif + void *bsp_priv; +}; + +Where: +- pbl (Programmable Burst Length) is maximum number of + beats to be transferred in one DMA transaction. + GMAC also enables the 4xPBL by default. +- fix_mac_speed and bus_setup are used to configure internal target + registers (on STM platforms); +- has_gmac: GMAC core is on board (get it at run-time in the next step); +- bus_id: bus identifier. + +struct plat_stmmacphy_data { + int bus_id; + int phy_addr; + unsigned int phy_mask; + int interface; + int (*phy_reset)(void *priv); + void *priv; +}; + +Where: +- bus_id: bus identifier; +- phy_addr: physical address used for the attached phy device; + set it to -1 to get it at run-time; +- interface: physical MII interface mode; +- phy_reset: hook to reset HW function. + +TODO: +- Continue to make the driver more generic and suitable for other Synopsys + Ethernet controllers used on other architectures (i.e. ARM). +- 10G controllers are not supported. +- MAC uses Normal descriptors and GMAC uses enhanced ones. + This is a limit that should be reviewed. MAC could want to + use the enhanced structure. +- Checksumming: Rx/Tx csum is done in HW in case of GMAC only. +- Review the timer optimisation code to use an embedded device that seems to be + available in new chip generations. -- cgit v1.2.2 From cf9cf9aed19f529ff313c3e0901ae3b2972eaf4e Mon Sep 17 00:00:00 2001 From: Giel van Schijndel Date: Mon, 29 Mar 2010 21:12:09 +0200 Subject: [WATCHDOG] doc: watchdog simple example: don't fail on fsync() Don't terminate the watchdog daemon when fsync() fails because no watchdog driver actually implements the fsync() syscall. Signed-off-by: Giel van Schijndel Signed-off-by: Wim Van Sebroeck --- Documentation/watchdog/src/watchdog-simple.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/watchdog/src/watchdog-simple.c b/Documentation/watchdog/src/watchdog-simple.c index 4cf72f3fa8e9..ba45803a2216 100644 --- a/Documentation/watchdog/src/watchdog-simple.c +++ b/Documentation/watchdog/src/watchdog-simple.c @@ -17,9 +17,6 @@ int main(void) ret = -1; break; } - ret = fsync(fd); - if (ret) - break; sleep(10); } close(fd); -- cgit v1.2.2 From 9208d24253e5e644f8cb1b87b69de44897668303 Mon Sep 17 00:00:00 2001 From: Sripathi Kodi Date: Thu, 18 Mar 2010 08:01:33 +0000 Subject: 9p: documentation update This patch adds documentation for new 9P options introduced in 2.6.34. Signed-off-by: Sripathi Kodi Signed-off-by: Eric Van Hensbergen --- Documentation/filesystems/9p.txt | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt index 57e0b80a5274..c0236e753bc8 100644 --- a/Documentation/filesystems/9p.txt +++ b/Documentation/filesystems/9p.txt @@ -37,6 +37,15 @@ For Plan 9 From User Space applications (http://swtch.com/plan9) mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER +For server running on QEMU host with virtio transport: + + mount -t 9p -o trans=virtio /mnt/9 + +where mount_tag is the tag associated by the server to each of the exported +mount points. Each 9P export is seen by the client as a virtio device with an +associated "mount_tag" property. Available mount tags can be +seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio/mount_tag files. + OPTIONS ======= @@ -47,7 +56,7 @@ OPTIONS fd - used passed file descriptors for connection (see rfdno and wfdno) virtio - connect to the next virtio channel available - (from lguest or KVM with trans_virtio module) + (from QEMU with trans_virtio module) rdma - connect to a specified RDMA channel uname=name user name to attempt mount as on the remote server. The @@ -85,7 +94,12 @@ OPTIONS port=n port to connect to on the remote server - noextend force legacy mode (no 9p2000.u semantics) + noextend force legacy mode (no 9p2000.u or 9p2000.L semantics) + + version=name Select 9P protocol version. Valid options are: + 9p2000 - Legacy mode (same as noextend) + 9p2000.u - Use 9P2000.u protocol + 9p2000.L - Use 9P2000.L protocol dfltuid attempt to mount as a particular uid -- cgit v1.2.2 From dfc333834cb86805485920dc77ee0f2fbb484462 Mon Sep 17 00:00:00 2001 From: James Hogan Date: Mon, 5 Apr 2010 11:31:29 +0100 Subject: [WATCHDOG] doc: Fix use of WDIOC_SETOPTIONS ioctl. In the watchdog-test program and watchdog-api.txt, pass the values to the WDIOC_SETOPTIONS ioctl as a pointer to an integer containing the values intead of directly in the third ioctl argument. The actual watchdog drivers in drivers/watchdog don't read the options directly from the argument but use get_user and copy_from_user. Signed-off-by: James Hogan Signed-off-by: Wim Van Sebroeck Signed-off-by: Andrew Morton --- Documentation/watchdog/src/watchdog-test.c | 8 ++++++-- Documentation/watchdog/watchdog-api.txt | 5 ++--- 2 files changed, 8 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/watchdog/src/watchdog-test.c b/Documentation/watchdog/src/watchdog-test.c index a750532ffcf8..63fdc34ceb98 100644 --- a/Documentation/watchdog/src/watchdog-test.c +++ b/Documentation/watchdog/src/watchdog-test.c @@ -31,6 +31,8 @@ static void keep_alive(void) */ int main(int argc, char *argv[]) { + int flags; + fd = open("/dev/watchdog", O_WRONLY); if (fd == -1) { @@ -41,12 +43,14 @@ int main(int argc, char *argv[]) if (argc > 1) { if (!strncasecmp(argv[1], "-d", 2)) { - ioctl(fd, WDIOC_SETOPTIONS, WDIOS_DISABLECARD); + flags = WDIOS_DISABLECARD; + ioctl(fd, WDIOC_SETOPTIONS, &flags); fprintf(stderr, "Watchdog card disabled.\n"); fflush(stderr); exit(0); } else if (!strncasecmp(argv[1], "-e", 2)) { - ioctl(fd, WDIOC_SETOPTIONS, WDIOS_ENABLECARD); + flags = WDIOS_ENABLECARD; + ioctl(fd, WDIOC_SETOPTIONS, &flags); fprintf(stderr, "Watchdog card enabled.\n"); fflush(stderr); exit(0); diff --git a/Documentation/watchdog/watchdog-api.txt b/Documentation/watchdog/watchdog-api.txt index 4cc4ba9d7150..eb7132ed8bbc 100644 --- a/Documentation/watchdog/watchdog-api.txt +++ b/Documentation/watchdog/watchdog-api.txt @@ -222,11 +222,10 @@ returned value is the temperature in degrees fahrenheit. ioctl(fd, WDIOC_GETTEMP, &temperature); Finally the SETOPTIONS ioctl can be used to control some aspects of -the cards operation; right now the pcwd driver is the only one -supporting this ioctl. +the cards operation. int options = 0; - ioctl(fd, WDIOC_SETOPTIONS, options); + ioctl(fd, WDIOC_SETOPTIONS, &options); The following options are available: -- cgit v1.2.2 From 20a1cfba340f23a7ca62391e199c40c86b762ea3 Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Wed, 7 Apr 2010 14:28:26 +0200 Subject: x86/amd-iommu: Remove obsolete parameter documentation Support for the share and fullflush parameters was removed. Remove the documentation about them too. Signed-off-by: Joerg Roedel --- Documentation/kernel-parameters.txt | 5 ----- 1 file changed, 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index e7848a0d99eb..ccea846164f0 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -323,11 +323,6 @@ and is between 256 and 4096 characters. It is defined in the file amd_iommu= [HW,X86-84] Pass parameters to the AMD IOMMU driver in the system. Possible values are: - isolate - enable device isolation (each device, as far - as possible, will get its own protection - domain) [default] - share - put every device behind one IOMMU into the - same protection domain fullflush - enable flushing of IO/TLB entries when they are unmapped. Otherwise they are flushed before they will be reused, which -- cgit v1.2.2 From a8557dc71949e80c298ec298b902ac6ebbc5d9dd Mon Sep 17 00:00:00 2001 From: "Justin P. Mattock" Date: Tue, 6 Apr 2010 14:34:45 -0700 Subject: fbdev: rename imacfb.txt to efifb.txt and change imacfb to efifb. Rename imacfb.txt to efifb.txt since imacfb was moved to efifb,and change imacfb to efifb. Signed-off-by: Justin P. Mattock Cc: Geert Uytterhoeven Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/fb/efifb.txt | 31 +++++++++++++++++++++++++++++++ Documentation/fb/imacfb.txt | 31 ------------------------------- 2 files changed, 31 insertions(+), 31 deletions(-) create mode 100644 Documentation/fb/efifb.txt delete mode 100644 Documentation/fb/imacfb.txt (limited to 'Documentation') diff --git a/Documentation/fb/efifb.txt b/Documentation/fb/efifb.txt new file mode 100644 index 000000000000..a59916c29b33 --- /dev/null +++ b/Documentation/fb/efifb.txt @@ -0,0 +1,31 @@ + +What is efifb? +=============== + +This is a generic EFI platform driver for Intel based Apple computers. +efifb is only for EFI booted Intel Macs. + +Supported Hardware +================== + +iMac 17"/20" +Macbook +Macbook Pro 15"/17" +MacMini + +How to use it? +============== + +efifb does not have any kind of autodetection of your machine. +You have to add the following kernel parameters in your elilo.conf: + Macbook : + video=efifb:macbook + MacMini : + video=efifb:mini + Macbook Pro 15", iMac 17" : + video=efifb:i17 + Macbook Pro 17", iMac 20" : + video=efifb:i20 + +-- +Edgar Hucek diff --git a/Documentation/fb/imacfb.txt b/Documentation/fb/imacfb.txt deleted file mode 100644 index 316ec9bb7deb..000000000000 --- a/Documentation/fb/imacfb.txt +++ /dev/null @@ -1,31 +0,0 @@ - -What is imacfb? -=============== - -This is a generic EFI platform driver for Intel based Apple computers. -Imacfb is only for EFI booted Intel Macs. - -Supported Hardware -================== - -iMac 17"/20" -Macbook -Macbook Pro 15"/17" -MacMini - -How to use it? -============== - -Imacfb does not have any kind of autodetection of your machine. -You have to add the following kernel parameters in your elilo.conf: - Macbook : - video=imacfb:macbook - MacMini : - video=imacfb:mini - Macbook Pro 15", iMac 17" : - video=imacfb:i17 - Macbook Pro 17", iMac 20" : - video=imacfb:i20 - --- -Edgar Hucek -- cgit v1.2.2 From 69298698c2453c2f8cd1d7d2a4cae39eeec3b66e Mon Sep 17 00:00:00 2001 From: Patrick Loschmidt Date: Wed, 7 Apr 2010 21:52:07 -0700 Subject: net: corrected documentation for hardware time stamping The current documentation for hardware time stamping does not correctly specify the available kernel functions since the implementation was changed later on. Signed-off-by: Patrick Loschmidt Signed-off-by: David S. Miller --- Documentation/networking/timestamping.txt | 76 +++++++++++++++++++------------ 1 file changed, 46 insertions(+), 30 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt index 0e58b4539176..e8c8f4f06c67 100644 --- a/Documentation/networking/timestamping.txt +++ b/Documentation/networking/timestamping.txt @@ -41,11 +41,12 @@ SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in SOF_TIMESTAMPING_TX/RX determine how time stamps are generated. SOF_TIMESTAMPING_RAW/SYS determine how they are reported in the following control message: - struct scm_timestamping { - struct timespec systime; - struct timespec hwtimetrans; - struct timespec hwtimeraw; - }; + +struct scm_timestamping { + struct timespec systime; + struct timespec hwtimetrans; + struct timespec hwtimeraw; +}; recvmsg() can be used to get this control message for regular incoming packets. For send time stamps the outgoing packet is looped back to @@ -87,12 +88,13 @@ by the network device and will be empty without that support. SIOCSHWTSTAMP: Hardware time stamping must also be initialized for each device driver -that is expected to do hardware time stamping. The parameter is: +that is expected to do hardware time stamping. The parameter is defined in +/include/linux/net_tstamp.h as: struct hwtstamp_config { - int flags; /* no flags defined right now, must be zero */ - int tx_type; /* HWTSTAMP_TX_* */ - int rx_filter; /* HWTSTAMP_FILTER_* */ + int flags; /* no flags defined right now, must be zero */ + int tx_type; /* HWTSTAMP_TX_* */ + int rx_filter; /* HWTSTAMP_FILTER_* */ }; Desired behavior is passed into the kernel and to a specific device by @@ -139,42 +141,56 @@ enum { /* time stamp any incoming packet */ HWTSTAMP_FILTER_ALL, - /* return value: time stamp all packets requested plus some others */ - HWTSTAMP_FILTER_SOME, + /* return value: time stamp all packets requested plus some others */ + HWTSTAMP_FILTER_SOME, /* PTP v1, UDP, any kind of event packet */ HWTSTAMP_FILTER_PTP_V1_L4_EVENT, - ... + /* for the complete list of values, please check + * the include file /include/linux/net_tstamp.h + */ }; DEVICE IMPLEMENTATION A driver which supports hardware time stamping must support the -SIOCSHWTSTAMP ioctl. Time stamps for received packets must be stored -in the skb with skb_hwtstamp_set(). +SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with +the actual values as described in the section on SIOCSHWTSTAMP. + +Time stamps for received packets must be stored in the skb. To get a pointer +to the shared time stamp structure of the skb call skb_hwtstamps(). Then +set the time stamps in the structure: + +struct skb_shared_hwtstamps { + /* hardware time stamp transformed into duration + * since arbitrary point in time + */ + ktime_t hwtstamp; + ktime_t syststamp; /* hwtstamp transformed to system time base */ +}; Time stamps for outgoing packets are to be generated as follows: -- In hard_start_xmit(), check if skb_hwtstamp_check_tx_hardware() - returns non-zero. If yes, then the driver is expected - to do hardware time stamping. +- In hard_start_xmit(), check if skb_tx(skb)->hardware is set no-zero. + If yes, then the driver is expected to do hardware time stamping. - If this is possible for the skb and requested, then declare - that the driver is doing the time stamping by calling - skb_hwtstamp_tx_in_progress(). A driver not supporting - hardware time stamping doesn't do that. A driver must never - touch sk_buff::tstamp! It is used to store how time stamping - for an outgoing packets is to be done. + that the driver is doing the time stamping by setting the field + skb_tx(skb)->in_progress non-zero. You might want to keep a pointer + to the associated skb for the next step and not free the skb. A driver + not supporting hardware time stamping doesn't do that. A driver must + never touch sk_buff::tstamp! It is used to store software generated + time stamps by the network subsystem. - As soon as the driver has sent the packet and/or obtained a hardware time stamp for it, it passes the time stamp back by calling skb_hwtstamp_tx() with the original skb, the raw - hardware time stamp and a handle to the device (necessary - to convert the hardware time stamp to system time). If obtaining - the hardware time stamp somehow fails, then the driver should - not fall back to software time stamping. The rationale is that - this would occur at a later time in the processing pipeline - than other software time stamping and therefore could lead - to unexpected deltas between time stamps. -- If the driver did not call skb_hwtstamp_tx_in_progress(), then + hardware time stamp. skb_hwtstamp_tx() clones the original skb and + adds the timestamps, therefore the original skb has to be freed now. + If obtaining the hardware time stamp somehow fails, then the driver + should not fall back to software time stamping. The rationale is that + this would occur at a later time in the processing pipeline than other + software time stamping and therefore could lead to unexpected deltas + between time stamps. +- If the driver did not call set skb_tx(skb)->in_progress, then dev_hard_start_xmit() checks whether software time stamping is wanted as fallback and potentially generates the time stamp. -- cgit v1.2.2 From 50aec0024eccb1d5f540ab64a1958eebcdb9340c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 9 Apr 2010 15:39:12 -0700 Subject: rcu: Update docs for rcu_access_pointer and rcu_dereference_protected Update examples and lists of APIs to include these new primitives. Signed-off-by: Paul E. McKenney Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: eric.dumazet@gmail.com LKML-Reference: <1270852752-25278-3-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar --- Documentation/RCU/NMI-RCU.txt | 39 ++++++++++++++++++++++----------------- Documentation/RCU/checklist.txt | 7 ++++--- Documentation/RCU/lockdep.txt | 28 ++++++++++++++++++++++++++-- Documentation/RCU/whatisRCU.txt | 6 ++++++ 4 files changed, 58 insertions(+), 22 deletions(-) (limited to 'Documentation') diff --git a/Documentation/RCU/NMI-RCU.txt b/Documentation/RCU/NMI-RCU.txt index a6d32e65d222..a8536cb88091 100644 --- a/Documentation/RCU/NMI-RCU.txt +++ b/Documentation/RCU/NMI-RCU.txt @@ -34,7 +34,7 @@ NMI handler. cpu = smp_processor_id(); ++nmi_count(cpu); - if (!rcu_dereference(nmi_callback)(regs, cpu)) + if (!rcu_dereference_sched(nmi_callback)(regs, cpu)) default_do_nmi(regs); nmi_exit(); @@ -47,12 +47,13 @@ function pointer. If this handler returns zero, do_nmi() invokes the default_do_nmi() function to handle a machine-specific NMI. Finally, preemption is restored. -Strictly speaking, rcu_dereference() is not needed, since this code runs -only on i386, which does not need rcu_dereference() anyway. However, -it is a good documentation aid, particularly for anyone attempting to -do something similar on Alpha. +In theory, rcu_dereference_sched() is not needed, since this code runs +only on i386, which in theory does not need rcu_dereference_sched() +anyway. However, in practice it is a good documentation aid, particularly +for anyone attempting to do something similar on Alpha or on systems +with aggressive optimizing compilers. -Quick Quiz: Why might the rcu_dereference() be necessary on Alpha, +Quick Quiz: Why might the rcu_dereference_sched() be necessary on Alpha, given that the code referenced by the pointer is read-only? @@ -99,17 +100,21 @@ invoke irq_enter() and irq_exit() on NMI entry and exit, respectively. Answer to Quick Quiz - Why might the rcu_dereference() be necessary on Alpha, given + Why might the rcu_dereference_sched() be necessary on Alpha, given that the code referenced by the pointer is read-only? Answer: The caller to set_nmi_callback() might well have - initialized some data that is to be used by the - new NMI handler. In this case, the rcu_dereference() - would be needed, because otherwise a CPU that received - an NMI just after the new handler was set might see - the pointer to the new NMI handler, but the old - pre-initialized version of the handler's data. - - More important, the rcu_dereference() makes it clear - to someone reading the code that the pointer is being - protected by RCU. + initialized some data that is to be used by the new NMI + handler. In this case, the rcu_dereference_sched() would + be needed, because otherwise a CPU that received an NMI + just after the new handler was set might see the pointer + to the new NMI handler, but the old pre-initialized + version of the handler's data. + + This same sad story can happen on other CPUs when using + a compiler with aggressive pointer-value speculation + optimizations. + + More important, the rcu_dereference_sched() makes it + clear to someone reading the code that the pointer is + being protected by RCU-sched. diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index cbc180f90194..790d1a812376 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -260,7 +260,8 @@ over a rather long period of time, but improvements are always welcome! The reason that it is permissible to use RCU list-traversal primitives when the update-side lock is held is that doing so can be quite helpful in reducing code bloat when common code is - shared between readers and updaters. + shared between readers and updaters. Additional primitives + are provided for this case, as discussed in lockdep.txt. 10. Conversely, if you are in an RCU read-side critical section, and you don't hold the appropriate update-side lock, you -must- @@ -344,8 +345,8 @@ over a rather long period of time, but improvements are always welcome! requiring SRCU's read-side deadlock immunity or low read-side realtime latency. - Note that, rcu_assign_pointer() and rcu_dereference() relate to - SRCU just as they do to other forms of RCU. + Note that, rcu_assign_pointer() relates to SRCU just as they do + to other forms of RCU. 15. The whole point of call_rcu(), synchronize_rcu(), and friends is to wait until all pre-existing readers have finished before diff --git a/Documentation/RCU/lockdep.txt b/Documentation/RCU/lockdep.txt index fe24b58627bd..d7a49b2f6994 100644 --- a/Documentation/RCU/lockdep.txt +++ b/Documentation/RCU/lockdep.txt @@ -32,9 +32,20 @@ checking of rcu_dereference() primitives: srcu_dereference(p, sp): Check for SRCU read-side critical section. rcu_dereference_check(p, c): - Use explicit check expression "c". + Use explicit check expression "c". This is useful in + code that is invoked by both readers and updaters. rcu_dereference_raw(p) Don't check. (Use sparingly, if at all.) + rcu_dereference_protected(p, c): + Use explicit check expression "c", and omit all barriers + and compiler constraints. This is useful when the data + structure cannot change, for example, in code that is + invoked only by updaters. + rcu_access_pointer(p): + Return the value of the pointer and omit all barriers, + but retain the compiler constraints that prevent duplicating + or coalescsing. This is useful when when testing the + value of the pointer itself, for example, against NULL. The rcu_dereference_check() check expression can be any boolean expression, but would normally include one of the rcu_read_lock_held() @@ -59,7 +70,20 @@ In case (1), the pointer is picked up in an RCU-safe manner for vanilla RCU read-side critical sections, in case (2) the ->file_lock prevents any change from taking place, and finally, in case (3) the current task is the only task accessing the file_struct, again preventing any change -from taking place. +from taking place. If the above statement was invoked only from updater +code, it could instead be written as follows: + + file = rcu_dereference_protected(fdt->fd[fd], + lockdep_is_held(&files->file_lock) || + atomic_read(&files->count) == 1); + +This would verify cases #2 and #3 above, and furthermore lockdep would +complain if this was used in an RCU read-side critical section unless one +of these two cases held. Because rcu_dereference_protected() omits all +barriers and compiler constraints, it generates better code than do the +other flavors of rcu_dereference(). On the other hand, it is illegal +to use rcu_dereference_protected() if either the RCU-protected pointer +or the RCU-protected data that it points to can change concurrently. There are currently only "universal" versions of the rcu_assign_pointer() and RCU list-/tree-traversal primitives, which do not (yet) check for diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 1dc00ee97163..cfaac34c4557 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -840,6 +840,12 @@ SRCU: Initialization/cleanup init_srcu_struct cleanup_srcu_struct +All: lockdep-checked RCU-protected pointer access + + rcu_dereference_check + rcu_dereference_protected + rcu_access_pointer + See the comment headers in the source code (or the docbook generated from them) for more information. -- cgit v1.2.2