From ded4926aa28992efcb67dd27a642ddf139ac572b Mon Sep 17 00:00:00 2001 From: Jonathan Corbet Date: Fri, 28 Mar 2008 11:19:56 -0600 Subject: Add the seq_file documentation This is an updated version of the document describing the seq_file interface. Signed-off-by: Jonathan Corbet --- Documentation/filesystems/00-INDEX | 2 + Documentation/filesystems/seq_file.txt | 283 +++++++++++++++++++++++++++++++++ 2 files changed, 285 insertions(+) create mode 100644 Documentation/filesystems/seq_file.txt (limited to 'Documentation') diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index e68021c08fbd..e731196410b3 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -82,6 +82,8 @@ relay.txt - info on relay, for efficient streaming from kernel to user space. romfs.txt - description of the ROMFS filesystem. +seq_file.txt + - how to use the seq_file API sharedsubtree.txt - a description of shared subtrees for namespaces. smbfs.txt diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.txt new file mode 100644 index 000000000000..92975ee7942c --- /dev/null +++ b/Documentation/filesystems/seq_file.txt @@ -0,0 +1,283 @@ +The seq_file interface + + Copyright 2003 Jonathan Corbet + This file is originally from the LWN.net Driver Porting series at + http://lwn.net/Articles/driver-porting/ + + +There are numerous ways for a device driver (or other kernel component) to +provide information to the user or system administrator. One useful +technique is the creation of virtual files, in debugfs, /proc or elsewhere. +Virtual files can provide human-readable output that is easy to get at +without any special utility programs; they can also make life easier for +script writers. It is not surprising that the use of virtual files has +grown over the years. + +Creating those files correctly has always been a bit of a challenge, +however. It is not that hard to make a virtual file which returns a +string. But life gets trickier if the output is long - anything greater +than an application is likely to read in a single operation. Handling +multiple reads (and seeks) requires careful attention to the reader's +position within the virtual file - that position is, likely as not, in the +middle of a line of output. The kernel has traditionally had a number of +implementations that got this wrong. + +The 2.6 kernel contains a set of functions (implemented by Alexander Viro) +which are designed to make it easy for virtual file creators to get it +right. + +The seq_file interface is available via . There are +three aspects to seq_file: + + * An iterator interface which lets a virtual file implementation + step through the objects it is presenting. + + * Some utility functions for formatting objects for output without + needing to worry about things like output buffers. + + * A set of canned file_operations which implement most operations on + the virtual file. + +We'll look at the seq_file interface via an extremely simple example: a +loadable module which creates a file called /proc/sequence. The file, when +read, simply produces a set of increasing integer values, one per line. The +sequence will continue until the user loses patience and finds something +better to do. The file is seekable, in that one can do something like the +following: + + dd if=/proc/sequence of=out1 count=1 + dd if=/proc/sequence skip=1 out=out2 count=1 + +Then concatenate the output files out1 and out2 and get the right +result. Yes, it is a thoroughly useless module, but the point is to show +how the mechanism works without getting lost in other details. (Those +wanting to see the full source for this module can find it at +http://lwn.net/Articles/22359/). + + +The iterator interface + +Modules implementing a virtual file with seq_file must implement a simple +iterator object that allows stepping through the data of interest. +Iterators must be able to move to a specific position - like the file they +implement - but the interpretation of that position is up to the iterator +itself. A seq_file implementation that is formatting firewall rules, for +example, could interpret position N as the Nth rule in the chain. +Positioning can thus be done in whatever way makes the most sense for the +generator of the data, which need not be aware of how a position translates +to an offset in the virtual file. The one obvious exception is that a +position of zero should indicate the beginning of the file. + +The /proc/sequence iterator just uses the count of the next number it +will output as its position. + +Four functions must be implemented to make the iterator work. The first, +called start() takes a position as an argument and returns an iterator +which will start reading at that position. For our simple sequence example, +the start() function looks like: + + static void *ct_seq_start(struct seq_file *s, loff_t *pos) + { + loff_t *spos = kmalloc(sizeof(loff_t), GFP_KERNEL); + if (! spos) + return NULL; + *spos = *pos; + return spos; + } + +The entire data structure for this iterator is a single loff_t value +holding the current position. There is no upper bound for the sequence +iterator, but that will not be the case for most other seq_file +implementations; in most cases the start() function should check for a +"past end of file" condition and return NULL if need be. + +For more complicated applications, the private field of the seq_file +structure can be used. There is also a special value whch can be returned +by the start() function called SEQ_START_TOKEN; it can be used if you wish +to instruct your show() function (described below) to print a header at the +top of the output. SEQ_START_TOKEN should only be used if the offset is +zero, however. + +The next function to implement is called, amazingly, next(); its job is to +move the iterator forward to the next position in the sequence. The +example module can simply increment the position by one; more useful +modules will do what is needed to step through some data structure. The +next() function returns a new iterator, or NULL if the sequence is +complete. Here's the example version: + + static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos) + { + loff_t *spos = (loff_t *) v; + *pos = ++(*spos); + return spos; + } + +The stop() function is called when iteration is complete; its job, of +course, is to clean up. If dynamic memory is allocated for the iterator, +stop() is the place to free it. + + static void ct_seq_stop(struct seq_file *s, void *v) + { + kfree(v); + } + +Finally, the show() function should format the object currently pointed to +by the iterator for output. It should return zero, or an error code if +something goes wrong. The example module's show() function is: + + static int ct_seq_show(struct seq_file *s, void *v) + { + loff_t *spos = (loff_t *) v; + seq_printf(s, "%Ld\n", *spos); + return 0; + } + +We will look at seq_printf() in a moment. But first, the definition of the +seq_file iterator is finished by creating a seq_operations structure with +the four functions we have just defined: + + static struct seq_operations ct_seq_ops = { + .start = ct_seq_start, + .next = ct_seq_next, + .stop = ct_seq_stop, + .show = ct_seq_show + }; + +This structure will be needed to tie our iterator to the /proc file in +a little bit. + +It's worth noting that the interator value returned by start() and +manipulated by the other functions is considered to be completely opaque by +the seq_file code. It can thus be anything that is useful in stepping +through the data to be output. Counters can be useful, but it could also be +a direct pointer into an array or linked list. Anything goes, as long as +the programmer is aware that things can happen between calls to the +iterator function. However, the seq_file code (by design) will not sleep +between the calls to start() and stop(), so holding a lock during that time +is a reasonable thing to do. The seq_file code will also avoid taking any +other locks while the iterator is active. + + +Formatted output + +The seq_file code manages positioning within the output created by the +iterator and getting it into the user's buffer. But, for that to work, that +output must be passed to the seq_file code. Some utility functions have +been defined which make this task easy. + +Most code will simply use seq_printf(), which works pretty much like +printk(), but which requires the seq_file pointer as an argument. It is +common to ignore the return value from seq_printf(), but a function +producing complicated output may want to check that value and quit if +something non-zero is returned; an error return means that the seq_file +buffer has been filled and further output will be discarded. + +For straight character output, the following functions may be used: + + int seq_putc(struct seq_file *m, char c); + int seq_puts(struct seq_file *m, const char *s); + int seq_escape(struct seq_file *m, const char *s, const char *esc); + +The first two output a single character and a string, just like one would +expect. seq_escape() is like seq_puts(), except that any character in s +which is in the string esc will be represented in octal form in the output. + +There is also a function for printing filenames: + + int seq_path(struct seq_file *m, struct path *path, char *esc); + +Here, path indicates the file of interest, and esc is a set of characters +which should be escaped in the output. + + +Making it all work + +So far, we have a nice set of functions which can produce output within the +seq_file system, but we have not yet turned them into a file that a user +can see. Creating a file within the kernel requires, of course, the +creation of a set of file_operations which implement the operations on that +file. The seq_file interface provides a set of canned operations which do +most of the work. The virtual file author still must implement the open() +method, however, to hook everything up. The open function is often a single +line, as in the example module: + + static int ct_open(struct inode *inode, struct file *file) + { + return seq_open(file, &ct_seq_ops); + }; + +Here, the call to seq_open() takes the seq_operations structure we created +before, and gets set up to iterate through the virtual file. + +On a successful open, seq_open() stores the struct seq_file pointer in +file->private_data. If you have an application where the same iterator can +be used for more than one file, you can store an arbitrary pointer in the +private field of the seq_file structure; that value can then be retrieved +by the iterator functions. + +The other operations of interest - read(), llseek(), and release() - are +all implemented by the seq_file code itself. So a virtual file's +file_operations structure will look like: + + static struct file_operations ct_file_ops = { + .owner = THIS_MODULE, + .open = ct_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release + }; + +There is also a seq_release_private() which passes the contents of the +seq_file private field to kfree() before releasing the structure. + +The final step is the creation of the /proc file itself. In the example +code, that is done in the initialization code in the usual way: + + static int ct_init(void) + { + struct proc_dir_entry *entry; + + entry = create_proc_entry("sequence", 0, NULL); + if (entry) + entry->proc_fops = &ct_file_ops; + return 0; + } + + module_init(ct_init); + +And that is pretty much it. + + +seq_list + +If your file will be iterating through a linked list, you may find these +routines useful: + + struct list_head *seq_list_start(struct list_head *head, + loff_t pos); + struct list_head *seq_list_start_head(struct list_head *head, + loff_t pos); + struct list_head *seq_list_next(void *v, struct list_head *head, + loff_t *ppos); + +These helpers will interpret pos as a position within the list and iterate +accordingly. Your start() and next() functions need only invoke the +seq_list_* helpers with a pointer to the appropriate list_head structure. + + +The extra-simple version + +For extremely simple virtual files, there is an even easier interface. A +module can define only the show() function, which should create all the +output that the virtual file will contain. The file's open() method then +calls: + + int single_open(struct file *file, + int (*show)(struct seq_file *m, void *p), + void *data); + +When output time comes, the show() function will be called once. The data +value given to single_open() can be found in the private field of the +seq_file structure. When using single_open(), the programmer should use +single_release() instead of seq_release() in the file_operations structure +to avoid a memory leak. -- cgit v1.2.2 From ef40203a09823bc2c69168ffa626c46365e3ca2c Mon Sep 17 00:00:00 2001 From: Jonathan Corbet Date: Fri, 28 Mar 2008 11:22:38 -0600 Subject: Fill out information on patch tags in SubmittingPatches Add more information about the various patch tags in use, and try to establish a meaning for Reviewed-by: Signed-off-by: Jonathan Corbet --- Documentation/SubmittingPatches | 54 ++++++++++++++++++++++++++++++++++++++--- 1 file changed, 51 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches index 08a1ed1cb5d8..cc00c8e8f040 100644 --- a/Documentation/SubmittingPatches +++ b/Documentation/SubmittingPatches @@ -328,7 +328,7 @@ now, but you can do this to mark internal company procedures or just point out some special detail about the sign-off. -13) When to use Acked-by: +13) When to use Acked-by: and Cc: The Signed-off-by: tag indicates that the signer was involved in the development of the patch, or that he/she was in the patch's delivery path. @@ -349,11 +349,59 @@ Acked-by: does not necessarily indicate acknowledgement of the entire patch. For example, if a patch affects multiple subsystems and has an Acked-by: from one subsystem maintainer then this usually indicates acknowledgement of just the part which affects that maintainer's code. Judgement should be used here. - When in doubt people should refer to the original discussion in the mailing +When in doubt people should refer to the original discussion in the mailing list archives. +If a person has had the opportunity to comment on a patch, but has not +provided such comments, you may optionally add a "Cc:" tag to the patch. +This is the only tag which might be added without an explicit action by the +person it names. This tag documents that potentially interested parties +have been included in the discussion -14) The canonical patch format + +14) Using Test-by: and Reviewed-by: + +A Tested-by: tag indicates that the patch has been successfully tested (in +some environment) by the person named. This tag informs maintainers that +some testing has been performed, provides a means to locate testers for +future patches, and ensures credit for the testers. + +Reviewed-by:, instead, indicates that the patch has been reviewed and found +acceptable according to the Reviewer's Statement: + + Reviewer's statement of oversight + + By offering my Reviewed-by: tag, I state that: + + (a) I have carried out a technical review of this patch to + evaluate its appropriateness and readiness for inclusion into + the mainline kernel. + + (b) Any problems, concerns, or questions relating to the patch + have been communicated back to the submitter. I am satisfied + with the submitter's response to my comments. + + (c) While there may be things that could be improved with this + submission, I believe that it is, at this time, (1) a + worthwhile modification to the kernel, and (2) free of known + issues which would argue against its inclusion. + + (d) While I have reviewed the patch and believe it to be sound, I + do not (unless explicitly stated elsewhere) make any + warranties or guarantees that it will achieve its stated + purpose or function properly in any given situation. + +A Reviewed-by tag is a statement of opinion that the patch is an +appropriate modification of the kernel without any remaining serious +technical issues. Any interested reviewer (who has done the work) can +offer a Reviewed-by tag for a patch. This tag serves to give credit to +reviewers and to inform maintainers of the degree of review which has been +done on the patch. Reviewed-by: tags, when supplied by reviewers known to +understand the subject area and to perform thorough reviews, will normally +increase the liklihood of your patch getting into the kernel. + + +15) The canonical patch format The canonical patch subject line is: -- cgit v1.2.2 From f3271f656458063e9bb0da9ba920771ecc6f024c Mon Sep 17 00:00:00 2001 From: Jan Engelhardt Date: Fri, 28 Mar 2008 20:09:39 +0100 Subject: Fixes to the seq_file document On Friday 2008-03-28 19:20, Jonathan Corbet wrote: >commit 9756ccfda31b4c4544aa010aacf71b6672d668e8 >Date: Fri Mar 28 11:19:56 2008 -0600 > > Add the seq_file documentation patch on top: - add const qualifiers - remove void* casts - use proper specifier (%Ld is not valid) Signed-off-by: Jonathan Corbet Signed-off-by: Jan Engelhardt --- Documentation/filesystems/seq_file.txt | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.txt index 92975ee7942c..cc6cdb95b73a 100644 --- a/Documentation/filesystems/seq_file.txt +++ b/Documentation/filesystems/seq_file.txt @@ -107,8 +107,8 @@ complete. Here's the example version: static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos) { - loff_t *spos = (loff_t *) v; - *pos = ++(*spos); + loff_t *spos = v; + *pos = ++*spos; return spos; } @@ -127,8 +127,8 @@ something goes wrong. The example module's show() function is: static int ct_seq_show(struct seq_file *s, void *v) { - loff_t *spos = (loff_t *) v; - seq_printf(s, "%Ld\n", *spos); + loff_t *spos = v; + seq_printf(s, "%lld\n", (long long)*spos); return 0; } @@ -136,7 +136,7 @@ We will look at seq_printf() in a moment. But first, the definition of the seq_file iterator is finished by creating a seq_operations structure with the four functions we have just defined: - static struct seq_operations ct_seq_ops = { + static const struct seq_operations ct_seq_ops = { .start = ct_seq_start, .next = ct_seq_next, .stop = ct_seq_stop, @@ -204,7 +204,7 @@ line, as in the example module: static int ct_open(struct inode *inode, struct file *file) { return seq_open(file, &ct_seq_ops); - }; + } Here, the call to seq_open() takes the seq_operations structure we created before, and gets set up to iterate through the virtual file. @@ -219,7 +219,7 @@ The other operations of interest - read(), llseek(), and release() - are all implemented by the seq_file code itself. So a virtual file's file_operations structure will look like: - static struct file_operations ct_file_ops = { + static const struct file_operations ct_file_ops = { .owner = THIS_MODULE, .open = ct_open, .read = seq_read, -- cgit v1.2.2 From b40b5162ac4e5b94d16cd9fb0a87168b1633c7dd Mon Sep 17 00:00:00 2001 From: Will Newton Date: Fri, 28 Mar 2008 19:39:17 +0000 Subject: Fix a typo in highres.txt A small patch for the highres timers documentation: Signed-off-by: Will newton Signed-off-by: Jonathan Corbet --- Documentation/hrtimers/highres.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/hrtimers/highres.txt b/Documentation/hrtimers/highres.txt index ce0e9a91e157..a73ecf5b4bdb 100644 --- a/Documentation/hrtimers/highres.txt +++ b/Documentation/hrtimers/highres.txt @@ -98,7 +98,7 @@ System-level global event devices are used for the Linux periodic tick. Per-CPU event devices are used to provide local CPU functionality such as process accounting, profiling, and high resolution timers. -The management layer assignes one or more of the folliwing functions to a clock +The management layer assigns one or more of the following functions to a clock event device: - system global periodic tick (jiffies update) - cpu local update_process_times -- cgit v1.2.2 From 8bab8dded67d026c39367bbd5e27d2f6c556c38e Mon Sep 17 00:00:00 2001 From: Paul Menage Date: Fri, 4 Apr 2008 14:29:57 -0700 Subject: cgroups: add cgroup support for enabling controllers at boot time The effects of cgroup_disable=foo are: - foo isn't auto-mounted if you mount all cgroups in a single hierarchy - foo isn't visible as an individually mountable subsystem As a result there will only ever be one call to foo->create(), at init time; all processes will stay in this group, and the group will never be mounted on a visible hierarchy. Any additional effects (e.g. not allocating metadata) are up to the foo subsystem. This doesn't handle early_init subsystems (their "disabled" bit isn't set be, but it could easily be extended to do so if any of the early_init systems wanted it - I think it would just involve some nastier parameter processing since it would occur before the command-line argument parser had been run. Hugh said: Ballpark figures, I'm trying to get this question out rather than processing the exact numbers: CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead to the affected paths, booting with cgroup_disable=memory cuts that back to 1% overhead (due to slightly bigger struct page). I'm no expert on distros, they may have no interest whatever in CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or without it, or apply the cgroup_disable=memory patches. Unix bench's execl test result on x86_64 was == just after boot without mounting any cgroup fs.== mem_cgorup=off : Execl Throughput 43.0 3150.1 732.6 mem_cgroup=on : Execl Throughput 43.0 2932.6 682.0 == [lizf@cn.fujitsu.com: fix boot option parsing] Signed-off-by: Balbir Singh Cc: Paul Menage Cc: Balbir Singh Cc: Pavel Emelyanov Cc: KAMEZAWA Hiroyuki Cc: Hugh Dickins Cc: Sudhir Kumar Cc: YAMAMOTO Takashi Cc: David Rientjes Signed-off-by: Li Zefan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 4cd1a5da80a4..32e9297ef747 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -375,6 +375,10 @@ and is between 256 and 4096 characters. It is defined in the file ccw_timeout_log [S390] See Documentation/s390/CommonIO for details. + cgroup_disable= [KNL] Disable a particular controller + Format: {name of the controller(s) to disable} + {Currently supported controllers - "memory"} + checkreqprot [SELINUX] Set initial checkreqprot flag value. Format: { "0" | "1" } See security/selinux/Kconfig help text. -- cgit v1.2.2 From 6395bee7e92bf34e95dc67c1da5acc30e8b98244 Mon Sep 17 00:00:00 2001 From: David Brownell Date: Tue, 8 Apr 2008 17:41:58 -0700 Subject: spi: documentation tweaks Update SPI documentation to clarify some areas of recent confusion: clock polarity takes effect when chipselect goes active; and zero length buffers are OK in certain cases. Signed-off-by: David Brownell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/spi/spi-summary | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/spi/spi-summary b/Documentation/spi/spi-summary index 8861e47e5a2d..6d5f18143c50 100644 --- a/Documentation/spi/spi-summary +++ b/Documentation/spi/spi-summary @@ -116,6 +116,13 @@ low order bit. So when a chip's timing diagram shows the clock starting low (CPOL=0) and data stabilized for sampling during the trailing clock edge (CPHA=1), that's SPI mode 1. +Note that the clock mode is relevant as soon as the chipselect goes +active. So the master must set the clock to inactive before selecting +a slave, and the slave can tell the chosen polarity by sampling the +clock level when its select line goes active. That's why many devices +support for example both modes 0 and 3: they don't care about polarity, +and alway clock data in/out on rising clock edges. + How do these driver programming interfaces work? ------------------------------------------------ @@ -379,8 +386,14 @@ any more such messages. + when bidirectional reads and writes start ... by how its sequence of spi_transfer requests is arranged; + + which I/O buffers are used ... each spi_transfer wraps a + buffer for each transfer direction, supporting full duplex + (two pointers, maybe the same one in both cases) and half + duplex (one pointer is NULL) transfers; + + optionally defining short delays after transfers ... using - the spi_transfer.delay_usecs setting; + the spi_transfer.delay_usecs setting (this delay can be the + only protocol effect, if the buffer length is zero); + whether the chipselect becomes inactive after a transfer and any delay ... by using the spi_transfer.cs_change flag; -- cgit v1.2.2 From 6ded55da6be9f186ae1022724a5881b43846c164 Mon Sep 17 00:00:00 2001 From: "J. Bruce Fields" Date: Mon, 7 Apr 2008 15:59:03 -0400 Subject: Documentation: move nfsroot.txt to filesystems/ Documentation/ is a little large, and filesystems/ seems an obvious place for this file. Signed-off-by: J. Bruce Fields Signed-off-by: Jonathan Corbet --- Documentation/00-INDEX | 2 - Documentation/filesystems/00-INDEX | 2 + Documentation/filesystems/nfsroot.txt | 270 ++++++++++++++++++++++++++++++++++ Documentation/kernel-parameters.txt | 6 +- Documentation/nfsroot.txt | 270 ---------------------------------- 5 files changed, 275 insertions(+), 275 deletions(-) create mode 100644 Documentation/filesystems/nfsroot.txt delete mode 100644 Documentation/nfsroot.txt (limited to 'Documentation') diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index fc8e7c7d182f..08a39cdb27f2 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -271,8 +271,6 @@ netlabel/ - directory with information on the NetLabel subsystem. networking/ - directory with info on various aspects of networking with Linux. -nfsroot.txt - - short guide on setting up a diskless box with NFS root filesystem. nmi_watchdog.txt - info on NMI watchdog for SMP systems. nommu-mmap.txt diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index e731196410b3..2ec174c992f1 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -66,6 +66,8 @@ mandatory-locking.txt - info on the Linux implementation of Sys V mandatory file locking. ncpfs.txt - info on Novell Netware(tm) filesystem using NCP protocol. +nfsroot.txt + - short guide on setting up a diskless box with NFS root filesystem. ntfs.txt - info and mount options for the NTFS filesystem (Windows NT). ocfs2.txt diff --git a/Documentation/filesystems/nfsroot.txt b/Documentation/filesystems/nfsroot.txt new file mode 100644 index 000000000000..31b329172343 --- /dev/null +++ b/Documentation/filesystems/nfsroot.txt @@ -0,0 +1,270 @@ +Mounting the root filesystem via NFS (nfsroot) +=============================================== + +Written 1996 by Gero Kuhlmann +Updated 1997 by Martin Mares +Updated 2006 by Nico Schottelius +Updated 2006 by Horms + + + +In order to use a diskless system, such as an X-terminal or printer server +for example, it is necessary for the root filesystem to be present on a +non-disk device. This may be an initramfs (see Documentation/filesystems/ +ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a +filesystem mounted via NFS. The following text describes on how to use NFS +for the root filesystem. For the rest of this text 'client' means the +diskless system, and 'server' means the NFS server. + + + + +1.) Enabling nfsroot capabilities + ----------------------------- + +In order to use nfsroot, NFS client support needs to be selected as +built-in during configuration. Once this has been selected, the nfsroot +option will become available, which should also be selected. + +In the networking options, kernel level autoconfiguration can be selected, +along with the types of autoconfiguration to support. Selecting all of +DHCP, BOOTP and RARP is safe. + + + + +2.) Kernel command line + ------------------- + +When the kernel has been loaded by a boot loader (see below) it needs to be +told what root fs device to use. And in the case of nfsroot, where to find +both the server and the name of the directory on the server to mount as root. +This can be established using the following kernel command line parameters: + + +root=/dev/nfs + + This is necessary to enable the pseudo-NFS-device. Note that it's not a + real device but just a synonym to tell the kernel to use NFS instead of + a real device. + + +nfsroot=[:][,] + + If the `nfsroot' parameter is NOT given on the command line, + the default "/tftpboot/%s" will be used. + + Specifies the IP address of the NFS server. + The default address is determined by the `ip' parameter + (see below). This parameter allows the use of different + servers for IP autoconfiguration and NFS. + + Name of the directory on the server to mount as root. + If there is a "%s" token in the string, it will be + replaced by the ASCII-representation of the client's + IP address. + + Standard NFS options. All options are separated by commas. + The following defaults are used: + port = as given by server portmap daemon + rsize = 4096 + wsize = 4096 + timeo = 7 + retrans = 3 + acregmin = 3 + acregmax = 60 + acdirmin = 30 + acdirmax = 60 + flags = hard, nointr, noposix, cto, ac + + +ip=:::::: + + This parameter tells the kernel how to configure IP addresses of devices + and also how to set up the IP routing table. It was originally called + `nfsaddrs', but now the boot-time IP configuration works independently of + NFS, so it was renamed to `ip' and the old name remained as an alias for + compatibility reasons. + + If this parameter is missing from the kernel command line, all fields are + assumed to be empty, and the defaults mentioned below apply. In general + this means that the kernel tries to configure everything using + autoconfiguration. + + The parameter can appear alone as the value to the `ip' + parameter (without all the ':' characters before). If the value is + "ip=off" or "ip=none", no autoconfiguration will take place, otherwise + autoconfiguration will take place. The most common way to use this + is "ip=dhcp". + + IP address of the client. + + Default: Determined using autoconfiguration. + + IP address of the NFS server. If RARP is used to determine + the client address and this parameter is NOT empty only + replies from the specified server are accepted. + + Only required for for NFS root. That is autoconfiguration + will not be triggered if it is missing and NFS root is not + in operation. + + Default: Determined using autoconfiguration. + The address of the autoconfiguration server is used. + + IP address of a gateway if the server is on a different subnet. + + Default: Determined using autoconfiguration. + + Netmask for local network interface. If unspecified + the netmask is derived from the client IP address assuming + classful addressing. + + Default: Determined using autoconfiguration. + + Name of the client. May be supplied by autoconfiguration, + but its absence will not trigger autoconfiguration. + + Default: Client IP address is used in ASCII notation. + + Name of network device to use. + + Default: If the host only has one device, it is used. + Otherwise the device is determined using + autoconfiguration. This is done by sending + autoconfiguration requests out of all devices, + and using the device that received the first reply. + + Method to use for autoconfiguration. In the case of options + which specify multiple autoconfiguration protocols, + requests are sent using all protocols, and the first one + to reply is used. + + Only autoconfiguration protocols that have been compiled + into the kernel will be used, regardless of the value of + this option. + + off or none: don't use autoconfiguration + (do static IP assignment instead) + on or any: use any protocol available in the kernel + (default) + dhcp: use DHCP + bootp: use BOOTP + rarp: use RARP + both: use both BOOTP and RARP but not DHCP + (old option kept for backwards compatibility) + + Default: any + + + + +3.) Boot Loader + ---------- + +To get the kernel into memory different approaches can be used. +They depend on various facilities being available: + + +3.1) Booting from a floppy using syslinux + + When building kernels, an easy way to create a boot floppy that uses + syslinux is to use the zdisk or bzdisk make targets which use + and bzimage images respectively. Both targets accept the + FDARGS parameter which can be used to set the kernel command line. + + e.g. + make bzdisk FDARGS="root=/dev/nfs" + + Note that the user running this command will need to have + access to the floppy drive device, /dev/fd0 + + For more information on syslinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + + N.B: Previously it was possible to write a kernel directly to + a floppy using dd, configure the boot device using rdev, and + boot using the resulting floppy. Linux no longer supports this + method of booting. + +3.2) Booting from a cdrom using isolinux + + When building kernels, an easy way to create a bootable cdrom that + uses isolinux is to use the isoimage target which uses a bzimage + image. Like zdisk and bzdisk, this target accepts the FDARGS + parameter which can be used to set the kernel command line. + + e.g. + make isoimage FDARGS="root=/dev/nfs" + + The resulting iso image will be arch//boot/image.iso + This can be written to a cdrom using a variety of tools including + cdrecord. + + e.g. + cdrecord dev=ATAPI:1,0,0 arch/i386/boot/image.iso + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + +3.2) Using LILO + When using LILO all the necessary command line parameters may be + specified using the 'append=' directive in the LILO configuration + file. + + However, to use the 'root=' directive you also need to create + a dummy root device, which may be removed after LILO is run. + + mknod /dev/boot255 c 0 255 + + For information on configuring LILO, please refer to its documentation. + +3.3) Using GRUB + When using GRUB, kernel parameter are simply appended after the kernel + specification: kernel + +3.4) Using loadlin + loadlin may be used to boot Linux from a DOS command prompt without + requiring a local hard disk to mount as root. This has not been + thoroughly tested by the authors of this document, but in general + it should be possible configure the kernel command line similarly + to the configuration of LILO. + + Please refer to the loadlin documentation for further information. + +3.5) Using a boot ROM + This is probably the most elegant way of booting a diskless client. + With a boot ROM the kernel is loaded using the TFTP protocol. The + authors of this document are not aware of any no commercial boot + ROMs that support booting Linux over the network. However, there + are two free implementations of a boot ROM, netboot-nfs and + etherboot, both of which are available on sunsite.unc.edu, and both + of which contain everything you need to boot a diskless Linux client. + +3.6) Using pxelinux + Pxelinux may be used to boot linux using the PXE boot loader + which is present on many modern network cards. + + When using pxelinux, the kernel image is specified using + "kernel ". The nfsroot parameters + are passed to the kernel by adding them to the "append" line. + It is common to use serial console in conjunction with pxeliunx, + see Documentation/serial-console.txt for more information. + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + + + + +4.) Credits + ------- + + The nfsroot code in the kernel and the RARP support have been written + by Gero Kuhlmann . + + The rest of the IP layer autoconfiguration code has been written + by Martin Mares . + + In order to write the initial version of nfsroot I would like to thank + Jens-Uwe Mager for his help. diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 508e2a2c9864..57709e472b9b 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -845,7 +845,7 @@ and is between 256 and 4096 characters. It is defined in the file arch/alpha/kernel/core_marvel.c. ip= [IP_PNP] - See Documentation/nfsroot.txt. + See Documentation/filesystems/nfsroot.txt. ip2= [HW] Set IO/IRQ pairs for up to 4 IntelliPort boards See comment before ip2_setup() in @@ -1199,10 +1199,10 @@ and is between 256 and 4096 characters. It is defined in the file file if at all. nfsaddrs= [NFS] - See Documentation/nfsroot.txt. + See Documentation/filesystems/nfsroot.txt. nfsroot= [NFS] nfs root filesystem for disk-less boxes. - See Documentation/nfsroot.txt. + See Documentation/filesystems/nfsroot.txt. nfs.callback_tcpport= [NFS] set the TCP port on which the NFSv4 callback diff --git a/Documentation/nfsroot.txt b/Documentation/nfsroot.txt deleted file mode 100644 index 31b329172343..000000000000 --- a/Documentation/nfsroot.txt +++ /dev/null @@ -1,270 +0,0 @@ -Mounting the root filesystem via NFS (nfsroot) -=============================================== - -Written 1996 by Gero Kuhlmann -Updated 1997 by Martin Mares -Updated 2006 by Nico Schottelius -Updated 2006 by Horms - - - -In order to use a diskless system, such as an X-terminal or printer server -for example, it is necessary for the root filesystem to be present on a -non-disk device. This may be an initramfs (see Documentation/filesystems/ -ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a -filesystem mounted via NFS. The following text describes on how to use NFS -for the root filesystem. For the rest of this text 'client' means the -diskless system, and 'server' means the NFS server. - - - - -1.) Enabling nfsroot capabilities - ----------------------------- - -In order to use nfsroot, NFS client support needs to be selected as -built-in during configuration. Once this has been selected, the nfsroot -option will become available, which should also be selected. - -In the networking options, kernel level autoconfiguration can be selected, -along with the types of autoconfiguration to support. Selecting all of -DHCP, BOOTP and RARP is safe. - - - - -2.) Kernel command line - ------------------- - -When the kernel has been loaded by a boot loader (see below) it needs to be -told what root fs device to use. And in the case of nfsroot, where to find -both the server and the name of the directory on the server to mount as root. -This can be established using the following kernel command line parameters: - - -root=/dev/nfs - - This is necessary to enable the pseudo-NFS-device. Note that it's not a - real device but just a synonym to tell the kernel to use NFS instead of - a real device. - - -nfsroot=[:][,] - - If the `nfsroot' parameter is NOT given on the command line, - the default "/tftpboot/%s" will be used. - - Specifies the IP address of the NFS server. - The default address is determined by the `ip' parameter - (see below). This parameter allows the use of different - servers for IP autoconfiguration and NFS. - - Name of the directory on the server to mount as root. - If there is a "%s" token in the string, it will be - replaced by the ASCII-representation of the client's - IP address. - - Standard NFS options. All options are separated by commas. - The following defaults are used: - port = as given by server portmap daemon - rsize = 4096 - wsize = 4096 - timeo = 7 - retrans = 3 - acregmin = 3 - acregmax = 60 - acdirmin = 30 - acdirmax = 60 - flags = hard, nointr, noposix, cto, ac - - -ip=:::::: - - This parameter tells the kernel how to configure IP addresses of devices - and also how to set up the IP routing table. It was originally called - `nfsaddrs', but now the boot-time IP configuration works independently of - NFS, so it was renamed to `ip' and the old name remained as an alias for - compatibility reasons. - - If this parameter is missing from the kernel command line, all fields are - assumed to be empty, and the defaults mentioned below apply. In general - this means that the kernel tries to configure everything using - autoconfiguration. - - The parameter can appear alone as the value to the `ip' - parameter (without all the ':' characters before). If the value is - "ip=off" or "ip=none", no autoconfiguration will take place, otherwise - autoconfiguration will take place. The most common way to use this - is "ip=dhcp". - - IP address of the client. - - Default: Determined using autoconfiguration. - - IP address of the NFS server. If RARP is used to determine - the client address and this parameter is NOT empty only - replies from the specified server are accepted. - - Only required for for NFS root. That is autoconfiguration - will not be triggered if it is missing and NFS root is not - in operation. - - Default: Determined using autoconfiguration. - The address of the autoconfiguration server is used. - - IP address of a gateway if the server is on a different subnet. - - Default: Determined using autoconfiguration. - - Netmask for local network interface. If unspecified - the netmask is derived from the client IP address assuming - classful addressing. - - Default: Determined using autoconfiguration. - - Name of the client. May be supplied by autoconfiguration, - but its absence will not trigger autoconfiguration. - - Default: Client IP address is used in ASCII notation. - - Name of network device to use. - - Default: If the host only has one device, it is used. - Otherwise the device is determined using - autoconfiguration. This is done by sending - autoconfiguration requests out of all devices, - and using the device that received the first reply. - - Method to use for autoconfiguration. In the case of options - which specify multiple autoconfiguration protocols, - requests are sent using all protocols, and the first one - to reply is used. - - Only autoconfiguration protocols that have been compiled - into the kernel will be used, regardless of the value of - this option. - - off or none: don't use autoconfiguration - (do static IP assignment instead) - on or any: use any protocol available in the kernel - (default) - dhcp: use DHCP - bootp: use BOOTP - rarp: use RARP - both: use both BOOTP and RARP but not DHCP - (old option kept for backwards compatibility) - - Default: any - - - - -3.) Boot Loader - ---------- - -To get the kernel into memory different approaches can be used. -They depend on various facilities being available: - - -3.1) Booting from a floppy using syslinux - - When building kernels, an easy way to create a boot floppy that uses - syslinux is to use the zdisk or bzdisk make targets which use - and bzimage images respectively. Both targets accept the - FDARGS parameter which can be used to set the kernel command line. - - e.g. - make bzdisk FDARGS="root=/dev/nfs" - - Note that the user running this command will need to have - access to the floppy drive device, /dev/fd0 - - For more information on syslinux, including how to create bootdisks - for prebuilt kernels, see http://syslinux.zytor.com/ - - N.B: Previously it was possible to write a kernel directly to - a floppy using dd, configure the boot device using rdev, and - boot using the resulting floppy. Linux no longer supports this - method of booting. - -3.2) Booting from a cdrom using isolinux - - When building kernels, an easy way to create a bootable cdrom that - uses isolinux is to use the isoimage target which uses a bzimage - image. Like zdisk and bzdisk, this target accepts the FDARGS - parameter which can be used to set the kernel command line. - - e.g. - make isoimage FDARGS="root=/dev/nfs" - - The resulting iso image will be arch//boot/image.iso - This can be written to a cdrom using a variety of tools including - cdrecord. - - e.g. - cdrecord dev=ATAPI:1,0,0 arch/i386/boot/image.iso - - For more information on isolinux, including how to create bootdisks - for prebuilt kernels, see http://syslinux.zytor.com/ - -3.2) Using LILO - When using LILO all the necessary command line parameters may be - specified using the 'append=' directive in the LILO configuration - file. - - However, to use the 'root=' directive you also need to create - a dummy root device, which may be removed after LILO is run. - - mknod /dev/boot255 c 0 255 - - For information on configuring LILO, please refer to its documentation. - -3.3) Using GRUB - When using GRUB, kernel parameter are simply appended after the kernel - specification: kernel - -3.4) Using loadlin - loadlin may be used to boot Linux from a DOS command prompt without - requiring a local hard disk to mount as root. This has not been - thoroughly tested by the authors of this document, but in general - it should be possible configure the kernel command line similarly - to the configuration of LILO. - - Please refer to the loadlin documentation for further information. - -3.5) Using a boot ROM - This is probably the most elegant way of booting a diskless client. - With a boot ROM the kernel is loaded using the TFTP protocol. The - authors of this document are not aware of any no commercial boot - ROMs that support booting Linux over the network. However, there - are two free implementations of a boot ROM, netboot-nfs and - etherboot, both of which are available on sunsite.unc.edu, and both - of which contain everything you need to boot a diskless Linux client. - -3.6) Using pxelinux - Pxelinux may be used to boot linux using the PXE boot loader - which is present on many modern network cards. - - When using pxelinux, the kernel image is specified using - "kernel ". The nfsroot parameters - are passed to the kernel by adding them to the "append" line. - It is common to use serial console in conjunction with pxeliunx, - see Documentation/serial-console.txt for more information. - - For more information on isolinux, including how to create bootdisks - for prebuilt kernels, see http://syslinux.zytor.com/ - - - - -4.) Credits - ------- - - The nfsroot code in the kernel and the RARP support have been written - by Gero Kuhlmann . - - The rest of the IP layer autoconfiguration code has been written - by Martin Mares . - - In order to write the initial version of nfsroot I would like to thank - Jens-Uwe Mager for his help. -- cgit v1.2.2 From 8bcd1cc293f4e76edbfd8f422770c80a018b82d9 Mon Sep 17 00:00:00 2001 From: "J. Bruce Fields" Date: Mon, 7 Apr 2008 15:59:04 -0400 Subject: Documentation: move rpc-cache.txt to filesystems/ This file is nfs-related. (Maybe Documentation/filesystems/ would benefit from a separate nfs/ directory at some point.) Signed-off-by: J. Bruce Fields Signed-off-by: Jonathan Corbet --- Documentation/00-INDEX | 2 - Documentation/filesystems/00-INDEX | 2 + Documentation/filesystems/rpc-cache.txt | 202 ++++++++++++++++++++++++++++++++ Documentation/rpc-cache.txt | 202 -------------------------------- 4 files changed, 204 insertions(+), 204 deletions(-) create mode 100644 Documentation/filesystems/rpc-cache.txt delete mode 100644 Documentation/rpc-cache.txt (limited to 'Documentation') diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index 08a39cdb27f2..e8fb24671967 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -319,8 +319,6 @@ robust-futexes.txt - a description of what robust futexes are. rocket.txt - info on the Comtrol RocketPort multiport serial driver. -rpc-cache.txt - - introduction to the caching mechanisms in the sunrpc layer. rt-mutex-design.txt - description of the RealTime mutex implementation design. rt-mutex.txt diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index 2ec174c992f1..52cd611277a3 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -84,6 +84,8 @@ relay.txt - info on relay, for efficient streaming from kernel to user space. romfs.txt - description of the ROMFS filesystem. +rpc-cache.txt + - introduction to the caching mechanisms in the sunrpc layer. seq_file.txt - how to use the seq_file API sharedsubtree.txt diff --git a/Documentation/filesystems/rpc-cache.txt b/Documentation/filesystems/rpc-cache.txt new file mode 100644 index 000000000000..8a382bea6808 --- /dev/null +++ b/Documentation/filesystems/rpc-cache.txt @@ -0,0 +1,202 @@ + This document gives a brief introduction to the caching +mechanisms in the sunrpc layer that is used, in particular, +for NFS authentication. + +CACHES +====== +The caching replaces the old exports table and allows for +a wide variety of values to be caches. + +There are a number of caches that are similar in structure though +quite possibly very different in content and use. There is a corpus +of common code for managing these caches. + +Examples of caches that are likely to be needed are: + - mapping from IP address to client name + - mapping from client name and filesystem to export options + - mapping from UID to list of GIDs, to work around NFS's limitation + of 16 gids. + - mappings between local UID/GID and remote UID/GID for sites that + do not have uniform uid assignment + - mapping from network identify to public key for crypto authentication. + +The common code handles such things as: + - general cache lookup with correct locking + - supporting 'NEGATIVE' as well as positive entries + - allowing an EXPIRED time on cache items, and removing + items after they expire, and are no longer in-use. + - making requests to user-space to fill in cache entries + - allowing user-space to directly set entries in the cache + - delaying RPC requests that depend on as-yet incomplete + cache entries, and replaying those requests when the cache entry + is complete. + - clean out old entries as they expire. + +Creating a Cache +---------------- + +1/ A cache needs a datum to store. This is in the form of a + structure definition that must contain a + struct cache_head + as an element, usually the first. + It will also contain a key and some content. + Each cache element is reference counted and contains + expiry and update times for use in cache management. +2/ A cache needs a "cache_detail" structure that + describes the cache. This stores the hash table, some + parameters for cache management, and some operations detailing how + to work with particular cache items. + The operations requires are: + struct cache_head *alloc(void) + This simply allocates appropriate memory and returns + a pointer to the cache_detail embedded within the + structure + void cache_put(struct kref *) + This is called when the last reference to an item is + dropped. The pointer passed is to the 'ref' field + in the cache_head. cache_put should release any + references create by 'cache_init' and, if CACHE_VALID + is set, any references created by cache_update. + It should then release the memory allocated by + 'alloc'. + int match(struct cache_head *orig, struct cache_head *new) + test if the keys in the two structures match. Return + 1 if they do, 0 if they don't. + void init(struct cache_head *orig, struct cache_head *new) + Set the 'key' fields in 'new' from 'orig'. This may + include taking references to shared objects. + void update(struct cache_head *orig, struct cache_head *new) + Set the 'content' fileds in 'new' from 'orig'. + int cache_show(struct seq_file *m, struct cache_detail *cd, + struct cache_head *h) + Optional. Used to provide a /proc file that lists the + contents of a cache. This should show one item, + usually on just one line. + int cache_request(struct cache_detail *cd, struct cache_head *h, + char **bpp, int *blen) + Format a request to be send to user-space for an item + to be instantiated. *bpp is a buffer of size *blen. + bpp should be moved forward over the encoded message, + and *blen should be reduced to show how much free + space remains. Return 0 on success or <0 if not + enough room or other problem. + int cache_parse(struct cache_detail *cd, char *buf, int len) + A message from user space has arrived to fill out a + cache entry. It is in 'buf' of length 'len'. + cache_parse should parse this, find the item in the + cache with sunrpc_cache_lookup, and update the item + with sunrpc_cache_update. + + +3/ A cache needs to be registered using cache_register(). This + includes it on a list of caches that will be regularly + cleaned to discard old data. + +Using a cache +------------- + +To find a value in a cache, call sunrpc_cache_lookup passing a pointer +to the cache_head in a sample item with the 'key' fields filled in. +This will be passed to ->match to identify the target entry. If no +entry is found, a new entry will be create, added to the cache, and +marked as not containing valid data. + +The item returned is typically passed to cache_check which will check +if the data is valid, and may initiate an up-call to get fresh data. +cache_check will return -ENOENT in the entry is negative or if an up +call is needed but not possible, -EAGAIN if an upcall is pending, +or 0 if the data is valid; + +cache_check can be passed a "struct cache_req *". This structure is +typically embedded in the actual request and can be used to create a +deferred copy of the request (struct cache_deferred_req). This is +done when the found cache item is not uptodate, but the is reason to +believe that userspace might provide information soon. When the cache +item does become valid, the deferred copy of the request will be +revisited (->revisit). It is expected that this method will +reschedule the request for processing. + +The value returned by sunrpc_cache_lookup can also be passed to +sunrpc_cache_update to set the content for the item. A second item is +passed which should hold the content. If the item found by _lookup +has valid data, then it is discarded and a new item is created. This +saves any user of an item from worrying about content changing while +it is being inspected. If the item found by _lookup does not contain +valid data, then the content is copied across and CACHE_VALID is set. + +Populating a cache +------------------ + +Each cache has a name, and when the cache is registered, a directory +with that name is created in /proc/net/rpc + +This directory contains a file called 'channel' which is a channel +for communicating between kernel and user for populating the cache. +This directory may later contain other files of interacting +with the cache. + +The 'channel' works a bit like a datagram socket. Each 'write' is +passed as a whole to the cache for parsing and interpretation. +Each cache can treat the write requests differently, but it is +expected that a message written will contain: + - a key + - an expiry time + - a content. +with the intention that an item in the cache with the give key +should be create or updated to have the given content, and the +expiry time should be set on that item. + +Reading from a channel is a bit more interesting. When a cache +lookup fails, or when it succeeds but finds an entry that may soon +expire, a request is lodged for that cache item to be updated by +user-space. These requests appear in the channel file. + +Successive reads will return successive requests. +If there are no more requests to return, read will return EOF, but a +select or poll for read will block waiting for another request to be +added. + +Thus a user-space helper is likely to: + open the channel. + select for readable + read a request + write a response + loop. + +If it dies and needs to be restarted, any requests that have not been +answered will still appear in the file and will be read by the new +instance of the helper. + +Each cache should define a "cache_parse" method which takes a message +written from user-space and processes it. It should return an error +(which propagates back to the write syscall) or 0. + +Each cache should also define a "cache_request" method which +takes a cache item and encodes a request into the buffer +provided. + +Note: If a cache has no active readers on the channel, and has had not +active readers for more than 60 seconds, further requests will not be +added to the channel but instead all lookups that do not find a valid +entry will fail. This is partly for backward compatibility: The +previous nfs exports table was deemed to be authoritative and a +failed lookup meant a definite 'no'. + +request/response format +----------------------- + +While each cache is free to use it's own format for requests +and responses over channel, the following is recommended as +appropriate and support routines are available to help: +Each request or response record should be printable ASCII +with precisely one newline character which should be at the end. +Fields within the record should be separated by spaces, normally one. +If spaces, newlines, or nul characters are needed in a field they +much be quoted. two mechanisms are available: +1/ If a field begins '\x' then it must contain an even number of + hex digits, and pairs of these digits provide the bytes in the + field. +2/ otherwise a \ in the field must be followed by 3 octal digits + which give the code for a byte. Other characters are treated + as them selves. At the very least, space, newline, nul, and + '\' must be quoted in this way. diff --git a/Documentation/rpc-cache.txt b/Documentation/rpc-cache.txt deleted file mode 100644 index 8a382bea6808..000000000000 --- a/Documentation/rpc-cache.txt +++ /dev/null @@ -1,202 +0,0 @@ - This document gives a brief introduction to the caching -mechanisms in the sunrpc layer that is used, in particular, -for NFS authentication. - -CACHES -====== -The caching replaces the old exports table and allows for -a wide variety of values to be caches. - -There are a number of caches that are similar in structure though -quite possibly very different in content and use. There is a corpus -of common code for managing these caches. - -Examples of caches that are likely to be needed are: - - mapping from IP address to client name - - mapping from client name and filesystem to export options - - mapping from UID to list of GIDs, to work around NFS's limitation - of 16 gids. - - mappings between local UID/GID and remote UID/GID for sites that - do not have uniform uid assignment - - mapping from network identify to public key for crypto authentication. - -The common code handles such things as: - - general cache lookup with correct locking - - supporting 'NEGATIVE' as well as positive entries - - allowing an EXPIRED time on cache items, and removing - items after they expire, and are no longer in-use. - - making requests to user-space to fill in cache entries - - allowing user-space to directly set entries in the cache - - delaying RPC requests that depend on as-yet incomplete - cache entries, and replaying those requests when the cache entry - is complete. - - clean out old entries as they expire. - -Creating a Cache ----------------- - -1/ A cache needs a datum to store. This is in the form of a - structure definition that must contain a - struct cache_head - as an element, usually the first. - It will also contain a key and some content. - Each cache element is reference counted and contains - expiry and update times for use in cache management. -2/ A cache needs a "cache_detail" structure that - describes the cache. This stores the hash table, some - parameters for cache management, and some operations detailing how - to work with particular cache items. - The operations requires are: - struct cache_head *alloc(void) - This simply allocates appropriate memory and returns - a pointer to the cache_detail embedded within the - structure - void cache_put(struct kref *) - This is called when the last reference to an item is - dropped. The pointer passed is to the 'ref' field - in the cache_head. cache_put should release any - references create by 'cache_init' and, if CACHE_VALID - is set, any references created by cache_update. - It should then release the memory allocated by - 'alloc'. - int match(struct cache_head *orig, struct cache_head *new) - test if the keys in the two structures match. Return - 1 if they do, 0 if they don't. - void init(struct cache_head *orig, struct cache_head *new) - Set the 'key' fields in 'new' from 'orig'. This may - include taking references to shared objects. - void update(struct cache_head *orig, struct cache_head *new) - Set the 'content' fileds in 'new' from 'orig'. - int cache_show(struct seq_file *m, struct cache_detail *cd, - struct cache_head *h) - Optional. Used to provide a /proc file that lists the - contents of a cache. This should show one item, - usually on just one line. - int cache_request(struct cache_detail *cd, struct cache_head *h, - char **bpp, int *blen) - Format a request to be send to user-space for an item - to be instantiated. *bpp is a buffer of size *blen. - bpp should be moved forward over the encoded message, - and *blen should be reduced to show how much free - space remains. Return 0 on success or <0 if not - enough room or other problem. - int cache_parse(struct cache_detail *cd, char *buf, int len) - A message from user space has arrived to fill out a - cache entry. It is in 'buf' of length 'len'. - cache_parse should parse this, find the item in the - cache with sunrpc_cache_lookup, and update the item - with sunrpc_cache_update. - - -3/ A cache needs to be registered using cache_register(). This - includes it on a list of caches that will be regularly - cleaned to discard old data. - -Using a cache -------------- - -To find a value in a cache, call sunrpc_cache_lookup passing a pointer -to the cache_head in a sample item with the 'key' fields filled in. -This will be passed to ->match to identify the target entry. If no -entry is found, a new entry will be create, added to the cache, and -marked as not containing valid data. - -The item returned is typically passed to cache_check which will check -if the data is valid, and may initiate an up-call to get fresh data. -cache_check will return -ENOENT in the entry is negative or if an up -call is needed but not possible, -EAGAIN if an upcall is pending, -or 0 if the data is valid; - -cache_check can be passed a "struct cache_req *". This structure is -typically embedded in the actual request and can be used to create a -deferred copy of the request (struct cache_deferred_req). This is -done when the found cache item is not uptodate, but the is reason to -believe that userspace might provide information soon. When the cache -item does become valid, the deferred copy of the request will be -revisited (->revisit). It is expected that this method will -reschedule the request for processing. - -The value returned by sunrpc_cache_lookup can also be passed to -sunrpc_cache_update to set the content for the item. A second item is -passed which should hold the content. If the item found by _lookup -has valid data, then it is discarded and a new item is created. This -saves any user of an item from worrying about content changing while -it is being inspected. If the item found by _lookup does not contain -valid data, then the content is copied across and CACHE_VALID is set. - -Populating a cache ------------------- - -Each cache has a name, and when the cache is registered, a directory -with that name is created in /proc/net/rpc - -This directory contains a file called 'channel' which is a channel -for communicating between kernel and user for populating the cache. -This directory may later contain other files of interacting -with the cache. - -The 'channel' works a bit like a datagram socket. Each 'write' is -passed as a whole to the cache for parsing and interpretation. -Each cache can treat the write requests differently, but it is -expected that a message written will contain: - - a key - - an expiry time - - a content. -with the intention that an item in the cache with the give key -should be create or updated to have the given content, and the -expiry time should be set on that item. - -Reading from a channel is a bit more interesting. When a cache -lookup fails, or when it succeeds but finds an entry that may soon -expire, a request is lodged for that cache item to be updated by -user-space. These requests appear in the channel file. - -Successive reads will return successive requests. -If there are no more requests to return, read will return EOF, but a -select or poll for read will block waiting for another request to be -added. - -Thus a user-space helper is likely to: - open the channel. - select for readable - read a request - write a response - loop. - -If it dies and needs to be restarted, any requests that have not been -answered will still appear in the file and will be read by the new -instance of the helper. - -Each cache should define a "cache_parse" method which takes a message -written from user-space and processes it. It should return an error -(which propagates back to the write syscall) or 0. - -Each cache should also define a "cache_request" method which -takes a cache item and encodes a request into the buffer -provided. - -Note: If a cache has no active readers on the channel, and has had not -active readers for more than 60 seconds, further requests will not be -added to the channel but instead all lookups that do not find a valid -entry will fail. This is partly for backward compatibility: The -previous nfs exports table was deemed to be authoritative and a -failed lookup meant a definite 'no'. - -request/response format ------------------------ - -While each cache is free to use it's own format for requests -and responses over channel, the following is recommended as -appropriate and support routines are available to help: -Each request or response record should be printable ASCII -with precisely one newline character which should be at the end. -Fields within the record should be separated by spaces, normally one. -If spaces, newlines, or nul characters are needed in a field they -much be quoted. two mechanisms are available: -1/ If a field begins '\x' then it must contain an even number of - hex digits, and pairs of these digits provide the bytes in the - field. -2/ otherwise a \ in the field must be followed by 3 octal digits - which give the code for a byte. Other characters are treated - as them selves. At the very least, space, newline, nul, and - '\' must be quoted in this way. -- cgit v1.2.2 From d396c5f158547e50c2b78bc984cb4a72d76e969b Mon Sep 17 00:00:00 2001 From: "J. Bruce Fields" Date: Mon, 7 Apr 2008 16:39:38 -0400 Subject: Move sched-rt-group.txt to scheduler/ Signed-off-by: J. Bruce Fields Signed-off-by: Jonathan Corbet --- Documentation/sched-rt-group.txt | 59 ------------------------------ Documentation/scheduler/00-INDEX | 2 + Documentation/scheduler/sched-rt-group.txt | 59 ++++++++++++++++++++++++++++++ 3 files changed, 61 insertions(+), 59 deletions(-) delete mode 100644 Documentation/sched-rt-group.txt create mode 100644 Documentation/scheduler/sched-rt-group.txt (limited to 'Documentation') diff --git a/Documentation/sched-rt-group.txt b/Documentation/sched-rt-group.txt deleted file mode 100644 index 1c6332f4543c..000000000000 --- a/Documentation/sched-rt-group.txt +++ /dev/null @@ -1,59 +0,0 @@ - - -Real-Time group scheduling. - -The problem space: - -In order to schedule multiple groups of realtime tasks each group must -be assigned a fixed portion of the CPU time available. Without a minimum -guarantee a realtime group can obviously fall short. A fuzzy upper limit -is of no use since it cannot be relied upon. Which leaves us with just -the single fixed portion. - -CPU time is divided by means of specifying how much time can be spent -running in a given period. Say a frame fixed realtime renderer must -deliver 25 frames a second, which yields a period of 0.04s. Now say -it will also have to play some music and respond to input, leaving it -with around 80% for the graphics. We can then give this group a runtime -of 0.8 * 0.04s = 0.032s. - -This way the graphics group will have a 0.04s period with a 0.032s runtime -limit. - -Now if the audio thread needs to refill the DMA buffer every 0.005s, but -needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s -= 0.00015s. - - -The Interface: - -system wide: - -/proc/sys/kernel/sched_rt_period_ms -/proc/sys/kernel/sched_rt_runtime_us - -CONFIG_FAIR_USER_SCHED - -/sys/kernel/uids//cpu_rt_runtime_us - -or - -CONFIG_FAIR_CGROUP_SCHED - -/cgroup//cpu.rt_runtime_us - -[ time is specified in us because the interface is s32; this gives an - operating range of ~35m to 1us ] - -The period takes values in [ 1, INT_MAX ], runtime in [ -1, INT_MAX - 1 ]. - -A runtime of -1 specifies runtime == period, ie. no limit. - -New groups get the period from /proc/sys/kernel/sched_rt_period_us and -a runtime of 0. - -Settings are constrained to: - - \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period - -in order to keep the configuration schedulable. diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX index b5f5ca069b2d..fc234d093fbf 100644 --- a/Documentation/scheduler/00-INDEX +++ b/Documentation/scheduler/00-INDEX @@ -12,5 +12,7 @@ sched-domains.txt - information on scheduling domains. sched-nice-design.txt - How and why the scheduler's nice levels are implemented. +sched-rt-group.txt + - real-time group scheduling. sched-stats.txt - information on schedstats (Linux Scheduler Statistics). diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt new file mode 100644 index 000000000000..1c6332f4543c --- /dev/null +++ b/Documentation/scheduler/sched-rt-group.txt @@ -0,0 +1,59 @@ + + +Real-Time group scheduling. + +The problem space: + +In order to schedule multiple groups of realtime tasks each group must +be assigned a fixed portion of the CPU time available. Without a minimum +guarantee a realtime group can obviously fall short. A fuzzy upper limit +is of no use since it cannot be relied upon. Which leaves us with just +the single fixed portion. + +CPU time is divided by means of specifying how much time can be spent +running in a given period. Say a frame fixed realtime renderer must +deliver 25 frames a second, which yields a period of 0.04s. Now say +it will also have to play some music and respond to input, leaving it +with around 80% for the graphics. We can then give this group a runtime +of 0.8 * 0.04s = 0.032s. + +This way the graphics group will have a 0.04s period with a 0.032s runtime +limit. + +Now if the audio thread needs to refill the DMA buffer every 0.005s, but +needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s += 0.00015s. + + +The Interface: + +system wide: + +/proc/sys/kernel/sched_rt_period_ms +/proc/sys/kernel/sched_rt_runtime_us + +CONFIG_FAIR_USER_SCHED + +/sys/kernel/uids//cpu_rt_runtime_us + +or + +CONFIG_FAIR_CGROUP_SCHED + +/cgroup//cpu.rt_runtime_us + +[ time is specified in us because the interface is s32; this gives an + operating range of ~35m to 1us ] + +The period takes values in [ 1, INT_MAX ], runtime in [ -1, INT_MAX - 1 ]. + +A runtime of -1 specifies runtime == period, ie. no limit. + +New groups get the period from /proc/sys/kernel/sched_rt_period_us and +a runtime of 0. + +Settings are constrained to: + + \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period + +in order to keep the configuration schedulable. -- cgit v1.2.2 From 14dadf1d5eb5bea2dd115852cfee880505c1c169 Mon Sep 17 00:00:00 2001 From: Mark Fasheh Date: Thu, 10 Apr 2008 13:55:21 -0700 Subject: Add additional examples in Documentation/spinlocks.txt Checkpatch will throw an error if code doesn't use the correct initializers for static spinlocks: ERROR: Use of SPIN_LOCK_UNLOCKED is deprecated: see Documentation/spinlocks.txt This is fine, but Documentation/spinlocks.txt isn't very clear on how to _use_ the new initializers for static variables. To save people time in the future, I added two small examples of how to fix old-style static initializers to be more lockdep friendly. Signed-off-by: Mark Fasheh Signed-off-by: Jonathan Corbet --- Documentation/spinlocks.txt | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) (limited to 'Documentation') diff --git a/Documentation/spinlocks.txt b/Documentation/spinlocks.txt index 471e75389778..619699dde593 100644 --- a/Documentation/spinlocks.txt +++ b/Documentation/spinlocks.txt @@ -5,6 +5,28 @@ Please use DEFINE_SPINLOCK()/DEFINE_RWLOCK() or __SPIN_LOCK_UNLOCKED()/__RW_LOCK_UNLOCKED() as appropriate for static initialization. +Most of the time, you can simply turn: + + static spinlock_t xxx_lock = SPIN_LOCK_UNLOCKED; + +into: + + static DEFINE_SPINLOCK(xxx_lock); + +Static structure member variables go from: + + struct foo bar { + .lock = SPIN_LOCK_UNLOCKED; + }; + +to: + + struct foo bar { + .lock = __SPIN_LOCK_UNLOCKED(bar.lock); + }; + +Declaration of static rw_locks undergo a similar transformation. + Dynamic initialization, when necessary, may be performed as demonstrated below. -- cgit v1.2.2 From 56690c2151d33534f0537fd03c533eda81d96f0f Mon Sep 17 00:00:00 2001 From: Oliver Hartkopp Date: Tue, 15 Apr 2008 00:46:38 -0700 Subject: [CAN]: Update documentation of struct sockaddr_can The struct sockaddr_can has been simplified in the code review process. This patch updates this simplification also in the associated documentation in can.txt . Signed-off-by: Oliver Hartkopp Signed-off-by: David S. Miller --- Documentation/networking/can.txt | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/can.txt b/Documentation/networking/can.txt index f1b2de170929..641d2afacffa 100644 --- a/Documentation/networking/can.txt +++ b/Documentation/networking/can.txt @@ -281,10 +281,10 @@ solution for a couple of reasons: sa_family_t can_family; int can_ifindex; union { - struct { canid_t rx_id, tx_id; } tp16; - struct { canid_t rx_id, tx_id; } tp20; - struct { canid_t rx_id, tx_id; } mcnet; - struct { canid_t rx_id, tx_id; } isotp; + /* transport protocol class address info (e.g. ISOTP) */ + struct { canid_t rx_id, tx_id; } tp; + + /* reserved for future CAN protocols address information */ } can_addr; }; -- cgit v1.2.2 From b82d4043b3550df00a036f6aa2c8ab9578a283ef Mon Sep 17 00:00:00 2001 From: Dmitri Vorobiev Date: Tue, 15 Apr 2008 14:34:40 -0700 Subject: Fix typos in Documentation/filesystems/seq_file.txt A couple of typos crept into the newly added document about the seq_file interface. This patch corrects those typos and simultaneously deletes unnecessary trailing spaces. Signed-off-by: Dmitri Vorobiev Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/seq_file.txt | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.txt index cc6cdb95b73a..7fb8e6dc62bf 100644 --- a/Documentation/filesystems/seq_file.txt +++ b/Documentation/filesystems/seq_file.txt @@ -92,7 +92,7 @@ implementations; in most cases the start() function should check for a "past end of file" condition and return NULL if need be. For more complicated applications, the private field of the seq_file -structure can be used. There is also a special value whch can be returned +structure can be used. There is also a special value which can be returned by the start() function called SEQ_START_TOKEN; it can be used if you wish to instruct your show() function (described below) to print a header at the top of the output. SEQ_START_TOKEN should only be used if the offset is @@ -146,7 +146,7 @@ the four functions we have just defined: This structure will be needed to tie our iterator to the /proc file in a little bit. -It's worth noting that the interator value returned by start() and +It's worth noting that the iterator value returned by start() and manipulated by the other functions is considered to be completely opaque by the seq_file code. It can thus be anything that is useful in stepping through the data to be output. Counters can be useful, but it could also be @@ -262,7 +262,7 @@ routines useful: These helpers will interpret pos as a position within the list and iterate accordingly. Your start() and next() functions need only invoke the -seq_list_* helpers with a pointer to the appropriate list_head structure. +seq_list_* helpers with a pointer to the appropriate list_head structure. The extra-simple version -- cgit v1.2.2 From 423bec43079a2942a3004034df7aad76469758d8 Mon Sep 17 00:00:00 2001 From: Nishanth Aravamudan Date: Tue, 15 Apr 2008 14:34:43 -0700 Subject: Documentation: correct overcommit caveat in hugetlbpage.txt As shown by Gurudas Pai recently, we can put hugepages into the surplus state (by echo 0 > /proc/sys/vm/nr_hugepages), even when /proc/sys/vm/nr_overcommit_hugepages is 0. This is actually correct, to allow the original goal (shrink the static pool to 0) to succeed (we are converting hugepages to surplus because they are in use). However, the documentation does not accurately reflect this case. Update it. Signed-off-by: Nishanth Aravamudan Acked-by: Andy Whitcroft Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/hugetlbpage.txt | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt index f962d01bea2a..3102b81bef88 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.txt @@ -88,10 +88,9 @@ hugepages from the buddy allocator, if the normal pool is exhausted. As these surplus hugepages go out of use, they are freed back to the buddy allocator. -Caveat: Shrinking the pool via nr_hugepages while a surplus is in effect -will allow the number of surplus huge pages to exceed the overcommit -value, as the pool hugepages (which must have been in use for a surplus -hugepages to be allocated) will become surplus hugepages. As long as +Caveat: Shrinking the pool via nr_hugepages such that it becomes less +than the number of hugepages in use will convert the balance to surplus +huge pages even if it would exceed the overcommit value. As long as this condition holds, however, no more surplus huge pages will be allowed on the system until one of the two sysctls are increased sufficiently, or the surplus huge pages go out of use and are freed. -- cgit v1.2.2