Diffstat (limited to 'Documentation/filesystems/nfs')
-rw-r--r--  Documentation/filesystems/nfs/00-INDEX          |  16
-rw-r--r--  Documentation/filesystems/nfs/Exporting         | 147
-rw-r--r--  Documentation/filesystems/nfs/knfsd-stats.txt   | 159
-rw-r--r--  Documentation/filesystems/nfs/nfs-rdma.txt      | 271
-rw-r--r--  Documentation/filesystems/nfs/nfs.txt           |  98
-rw-r--r--  Documentation/filesystems/nfs/nfs41-server.txt  | 222
-rw-r--r--  Documentation/filesystems/nfs/nfsroot.txt       | 270
-rw-r--r--  Documentation/filesystems/nfs/rpc-cache.txt     | 202
8 files changed, 1385 insertions(+), 0 deletions(-)
diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX
new file mode 100644
index 000000000000..2f68cd688769
--- /dev/null
+++ b/Documentation/filesystems/nfs/00-INDEX
@@ -0,0 +1,16 @@
100-INDEX
2 - this file (nfs-related documentation).
3Exporting
4 - explanation of how to make filesystems exportable.
5knfsd-stats.txt
6 - statistics which the NFS server makes available to user space.
7nfs.txt
8 - nfs client, and DNS resolution for fs_locations.
9nfs41-server.txt
10 - info on the Linux server implementation of NFSv4 minor version 1.
11nfs-rdma.txt
12 - how to install and set up the Linux NFS/RDMA client and server software.
13nfsroot.txt
14 - short guide on setting up a diskless box with NFS root filesystem.
15rpc-cache.txt
16 - introduction to the caching mechanisms in the sunrpc layer.
diff --git a/Documentation/filesystems/nfs/Exporting b/Documentation/filesystems/nfs/Exporting
new file mode 100644
index 000000000000..87019d2b5981
--- /dev/null
+++ b/Documentation/filesystems/nfs/Exporting
@@ -0,0 +1,147 @@
1
2Making Filesystems Exportable
3=============================
4
5Overview
6--------
7
8All filesystem operations require a dentry (or two) as a starting
9point. Local applications have a reference-counted hold on suitable
10dentries via open file descriptors or cwd/root. However remote
11applications that access a filesystem via a remote filesystem protocol
12such as NFS may not be able to hold such a reference, and so need a
13different way to refer to a particular dentry. As the alternative
14form of reference needs to be stable across renames, truncates, and
15server-reboot (among other things, though these tend to be the most
16problematic), there is no simple answer like 'filename'.
17
18The mechanism discussed here allows each filesystem implementation to
19specify how to generate an opaque (outside of the filesystem) byte
20string for any dentry, and how to find an appropriate dentry for any
21given opaque byte string.
22This byte string will be called a "filehandle fragment" as it
23corresponds to part of an NFS filehandle.
24
25A filesystem which supports the mapping between filehandle fragments
26and dentries will be termed "exportable".
27
28
29
30Dcache Issues
31-------------
32
33The dcache normally contains a proper prefix of any given filesystem
34tree. This means that if any filesystem object is in the dcache, then
35all of the ancestors of that filesystem object are also in the dcache.
36As normal access is by filename this prefix is created naturally and
37maintained easily (by each object maintaining a reference count on
38its parent).
39
40However when objects are included into the dcache by interpreting a
41filehandle fragment, there is no automatic creation of a path prefix
42for the object. This leads to two related but distinct features of
43the dcache that are not needed for normal filesystem access.
44
451/ The dcache must sometimes contain objects that are not part of the
46 proper prefix, i.e. that are not connected to the root.
472/ The dcache must be prepared for a newly found (via ->lookup) directory
48 to already have a (non-connected) dentry, and must be able to move
49 that dentry into place (based on the parent and name in the
50 ->lookup). This is particularly needed for directories as
51 it is a dcache invariant that directories only have one dentry.
52
53To implement these features, the dcache has:
54
55a/ A dentry flag DCACHE_DISCONNECTED which is set on
56 any dentry that might not be part of the proper prefix.
57 This is set when anonymous dentries are created, and cleared when a
58 dentry is noticed to be a child of a dentry which is in the proper
59 prefix.
60
61b/ A per-superblock list "s_anon" of dentries which are the roots of
62 subtrees that are not in the proper prefix. These dentries, as
63 well as the proper prefix, need to be released at unmount time. As
64 these dentries will not be hashed, they are linked together on the
65 d_hash list_head.
66
67c/ Helper routines to allocate anonymous dentries, and to help attach
68 loose directory dentries at lookup time. They are:
69 d_alloc_anon(inode) will return a dentry for the given inode.
70 If the inode already has a dentry, one of those is returned.
71 If it doesn't, a new anonymous (IS_ROOT and
72 DCACHE_DISCONNECTED) dentry is allocated and attached.
73 In the case of a directory, care is taken that only one dentry
74 can ever be attached.
75 d_splice_alias(inode, dentry) will make sure that there is a
76 dentry with the same name and parent as the given dentry, and
77 which refers to the given inode.
78 If the inode is a directory and already has a dentry, then that
79 dentry is d_moved over the given dentry.
80 If the passed dentry gets attached, care is taken that this is
81 mutually exclusive to a d_alloc_anon operation.
82 If the passed dentry is used, NULL is returned, else the used
83 dentry is returned. This corresponds to the calling pattern of
84 ->lookup.
85
86
87Filesystem Issues
88-----------------
89
90For a filesystem to be exportable it must:
91
92 1/ provide the filehandle fragment routines described below.
93 2/ make sure that d_splice_alias is used rather than d_add
94 when ->lookup finds an inode for a given parent and name.
95 Typically the ->lookup routine will end with a call like the following (a fuller sketch appears after this list):
96
97 return d_splice_alias(inode, dentry);
98 }
99
100
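Expanding on item 2/ above, a ->lookup method that follows this pattern
might look roughly like the sketch below.  This is only an illustration:
"examplefs" and its helpers (examplefs_find_ino, examplefs_iget) are
hypothetical, not part of any real filesystem; see an in-tree filesystem
such as ext2 for a real example.

  static struct dentry *examplefs_lookup(struct inode *dir,
                                         struct dentry *dentry,
                                         struct nameidata *nd)
  {
          struct inode *inode = NULL;
          unsigned long ino;

          /* Hypothetical helpers: map the name to an inode number and
           * then read that inode. */
          ino = examplefs_find_ino(dir, &dentry->d_name);
          if (ino) {
                  inode = examplefs_iget(dir->i_sb, ino);
                  if (IS_ERR(inode))
                          return ERR_CAST(inode);
          }
          /* d_splice_alias() rather than d_add(): for a directory inode
           * it will reuse and reconnect any existing (possibly
           * disconnected) alias; a NULL inode (negative lookup) is
           * handled like d_add(). */
          return d_splice_alias(inode, dentry);
  }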
101
102 A file system implementation declares that instances of the filesystem
103are exportable by setting the s_export_op field in the struct
104super_block. This field must point to a "struct export_operations"
105struct which has the following members:
106
107 encode_fh (optional)
108 Takes a dentry and creates a filehandle fragment which can later be used
109 to find or create a dentry for the same object. The default
110 implementation creates a filehandle fragment that encodes a 32-bit inode
111 number and generation number for the inode, and if necessary the
112 same information for the parent.
113
114 fh_to_dentry (mandatory)
115 Given a filehandle fragment, this should find the implied object and
116 create a dentry for it (possibly with d_alloc_anon).
117
118 fh_to_parent (optional but strongly recommended)
119 Given a filehandle fragment, this should find the parent of the
120 implied object and create a dentry for it (possibly with d_alloc_anon).
121 May fail if the filehandle fragment is too small.
122
123 get_parent (optional but strongly recommended)
124 When given a dentry for a directory, this should return a dentry for
125 the parent. Quite possibly the parent dentry will have been allocated
126 by d_alloc_anon. The default get_parent function just returns an error
127 so any filehandle lookup that requires finding a parent will fail.
128 ->lookup("..") is *not* used as a default as it can leave ".." entries
129 in the dcache which are too messy to work with.
130
131 get_name (optional)
132 When given a parent dentry and a child dentry, this should find a name
133 in the directory identified by the parent dentry, which leads to the
134 object identified by the child dentry. If no get_name function is
135 supplied, a default implementation is provided which uses vfs_readdir
136 to find potential names, and matches inode numbers to find the correct
137 match.
138
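As an illustration of how a filesystem wires these up, here is a hedged
sketch for a hypothetical "examplefs".  The examplefs_* functions are
assumptions, not real code; the member names follow struct
export_operations in include/linux/exportfs.h.

  static const struct export_operations examplefs_export_ops = {
          /* encode_fh omitted: the default 32-bit inode number and
           * generation number encoding described above is used. */
          .fh_to_dentry = examplefs_fh_to_dentry,
          .fh_to_parent = examplefs_fh_to_parent,
          .get_parent   = examplefs_get_parent,
          /* get_name omitted: the vfs_readdir based default is used. */
  };

  /* ... and in examplefs_fill_super(): */
  sb->s_export_op = &examplefs_export_ops;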
139
140A filehandle fragment consists of an array of 1 or more 4-byte words,
141together with a one-byte "type".
142The decode_fh routine should not depend on the stated size that is
143passed to it. This size may be larger than the original filehandle
144generated by encode_fh, in which case it will have been padded with
145nuls. Rather, the encode_fh routine should choose a "type" which
146indicates to the decode_fh routine how much of the filehandle is valid, and how
147it should be interpreted.
diff --git a/Documentation/filesystems/nfs/knfsd-stats.txt b/Documentation/filesystems/nfs/knfsd-stats.txt
new file mode 100644
index 000000000000..64ced5149d37
--- /dev/null
+++ b/Documentation/filesystems/nfs/knfsd-stats.txt
@@ -0,0 +1,159 @@
1
2Kernel NFS Server Statistics
3============================
4
5This document describes the format and semantics of the statistics
6which the kernel NFS server makes available to userspace. These
7statistics are available in several text form pseudo files, each of
8which is described separately below.
9
10In most cases you don't need to know these formats, as the nfsstat(8)
11program from the nfs-utils distribution provides a helpful command-line
12interface for extracting and printing them.
13
14All the files described here are formatted as a sequence of text lines,
15separated by newline '\n' characters. Lines beginning with a hash
16'#' character are comments intended for humans and should be ignored
17by parsing routines. All other lines contain a sequence of fields
18separated by whitespace.
19
20/proc/fs/nfsd/pool_stats
21------------------------
22
23This file is available in kernels from 2.6.30 onwards, if the
24/proc/fs/nfsd filesystem is mounted (it almost always should be).
25
26The first line is a comment which describes the fields present in
27all the other lines. The other lines present the following data as
28a sequence of unsigned decimal numeric fields. One line is shown
29for each NFS thread pool.
30
31All counters are 64 bits wide and wrap naturally. There is no way
32to zero these counters; instead, applications should do their own
33rate conversion.
34
35pool
36 The id number of the NFS thread pool to which this line applies.
37 This number does not change.
38
39 Thread pool ids are a contiguous set of small integers starting
40 at zero. The maximum value depends on the thread pool mode, but
41 currently cannot be larger than the number of CPUs in the system.
42 Note that in the default case there will be a single thread pool
43 which contains all the nfsd threads and all the CPUs in the system,
44 and thus this file will have a single line with a pool id of "0".
45
46packets-arrived
47 Counts how many NFS packets have arrived. More precisely, this
48 is the number of times that the network stack has notified the
49 sunrpc server layer that new data may be available on a transport
50 (e.g. a TCP or UDP socket or an NFS/RDMA endpoint).
51
52 Depending on the NFS workload patterns and various network stack
53 effects (such as Large Receive Offload) which can combine packets
54 on the wire, this may be either more or less than the number
55 of NFS calls received (a statistic which is available elsewhere).
56 However this is a more accurate and less workload-dependent measure
57 of how much CPU load is being placed on the sunrpc server layer
58 due to NFS network traffic.
59
60sockets-enqueued
61 Counts how many times an NFS transport is enqueued to wait for
62 an nfsd thread to service it, i.e. no nfsd thread was considered
63 available.
64
65 The circumstance this statistic tracks indicates that there was NFS
66 network-facing work to be done but it couldn't be done immediately,
67 thus introducing a small delay in servicing NFS calls. The ideal
68 rate of change for this counter is zero; significantly non-zero
69 values may indicate a performance limitation.
70
71 This can happen either because there are too few nfsd threads in the
72 thread pool for the NFS workload (the workload is thread-limited),
73 or because the NFS workload needs more CPU time than is available in
74 the thread pool (the workload is CPU-limited). In the former case,
75 configuring more nfsd threads will probably improve the performance
76 of the NFS workload. In the latter case, the sunrpc server layer is
77 already choosing not to wake idle nfsd threads because there are too
78 many nfsd threads which want to run but cannot, so configuring more
79 nfsd threads will make no difference whatsoever. The overloads-avoided
80 statistic (see below) can be used to distinguish these cases.
81
82threads-woken
83 Counts how many times an idle nfsd thread is woken to try to
84 receive some data from an NFS transport.
85
86 This statistic tracks the circumstance where incoming
87 network-facing NFS work is being handled quickly, which is a good
88 thing. The ideal rate of change for this counter will be close
89 to but less than the rate of change of the packets-arrived counter.
90
91overloads-avoided
92 Counts how many times the sunrpc server layer chose not to wake an
93 nfsd thread, despite the presence of idle nfsd threads, because
94 too many nfsd threads had been recently woken but could not get
95 enough CPU time to actually run.
96
97 This statistic counts a circumstance where the sunrpc layer
98 heuristically avoids overloading the CPU scheduler with too many
99 runnable nfsd threads. The ideal rate of change for this counter
100 is zero. Significant non-zero values indicate that the workload
101 is CPU limited. Usually this is associated with heavy CPU usage
102 on all the CPUs in the nfsd thread pool.
103
104 If a sustained large overloads-avoided rate is detected on a pool,
105 the top(1) utility should be used to check for the following
106 pattern of CPU usage on all the CPUs associated with the given
107 nfsd thread pool.
108
109 - %us ~= 0 (as you're *NOT* running applications on your NFS server)
110
111 - %wa ~= 0
112
113 - %id ~= 0
114
115 - %sy + %hi + %si ~= 100
116
117 If this pattern is seen, configuring more nfsd threads will *not*
118 improve the performance of the workload. If this pattern is not
119 seen, then something more subtle is wrong.
120
121threads-timedout
122 Counts how many times an nfsd thread triggered an idle timeout,
123 i.e. was not woken to handle any incoming network packets for
124 some time.
125
126 This statistic counts a circumstance where there are more nfsd
127 threads configured than can be used by the NFS workload. This is
128 a clue that the number of nfsd threads can be reduced without
129 affecting performance. Unfortunately, it's only a clue and not
130 a strong indication, for a couple of reasons:
131
132 - Currently the rate at which the counter is incremented is quite
133 slow; the idle timeout is 60 minutes. Unless the NFS workload
134 remains constant for hours at a time, this counter is unlikely
135 to be providing information that is still useful.
136
137 - It is usually a wise policy to provide some slack,
138 i.e. configure a few more nfsds than are currently needed,
139 to allow for future spikes in load.
140
141
142Note that incoming packets on NFS transports will be dealt with in
143one of three ways. An nfsd thread can be woken (threads-woken counts
144this case), or the transport can be enqueued for later attention
145(sockets-enqueued counts this case), or the packet can be temporarily
146deferred because the transport is currently being used by an nfsd
147thread. This last case is not very interesting and is not explicitly
148counted, but can be inferred from the other counters thus:
149
150packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken )
151
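For example, packets-deferred for each pool could be derived by a small
user-space program along the following lines.  This is only a sketch; it
assumes the field order matches the comment line printed at the top of
the file (pool, packets-arrived, sockets-enqueued, threads-woken, ...).

  #include <stdio.h>

  int main(void)
  {
          FILE *f = fopen("/proc/fs/nfsd/pool_stats", "r");
          char line[256];

          if (!f)
                  return 1;
          while (fgets(line, sizeof(line), f)) {
                  unsigned long long pool, arrived, enqueued, woken;

                  if (line[0] == '#')     /* skip the header comment */
                          continue;
                  if (sscanf(line, "%llu %llu %llu %llu",
                             &pool, &arrived, &enqueued, &woken) == 4)
                          printf("pool %llu: packets-deferred = %llu\n",
                                 pool, arrived - (enqueued + woken));
          }
          fclose(f);
          return 0;
  }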
152
153More
154----
155Descriptions of the other statistics file should go here.
156
157
158Greg Banks <gnb@sgi.com>
15926 Mar 2009
diff --git a/Documentation/filesystems/nfs/nfs-rdma.txt b/Documentation/filesystems/nfs/nfs-rdma.txt
new file mode 100644
index 000000000000..e386f7e4bcee
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfs-rdma.txt
@@ -0,0 +1,271 @@
1################################################################################
2# #
3# NFS/RDMA README #
4# #
5################################################################################
6
7 Author: NetApp and Open Grid Computing
8 Date: May 29, 2008
9
10Table of Contents
11~~~~~~~~~~~~~~~~~
12 - Overview
13 - Getting Help
14 - Installation
15 - Check RDMA and NFS Setup
16 - NFS/RDMA Setup
17
18Overview
19~~~~~~~~
20
21 This document describes how to install and set up the Linux NFS/RDMA client
22 and server software.
23
24 The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server
25 was first included in the following release, Linux 2.6.25.
26
27 In our testing, we have obtained excellent performance results (full 10Gbit
28 wire bandwidth at minimal client CPU) under many workloads. The code passes
29 the full Connectathon test suite and operates over both Infiniband and iWARP
30 RDMA adapters.
31
32Getting Help
33~~~~~~~~~~~~
34
35 If you get stuck, you can ask questions on the
36
37 nfs-rdma-devel@lists.sourceforge.net
38
39 mailing list.
40
41Installation
42~~~~~~~~~~~~
43
44 These instructions are a step-by-step guide to building a machine for
45 use with NFS/RDMA.
46
47 - Install an RDMA device
48
49 Any device supported by the drivers in drivers/infiniband/hw is acceptable.
50
51 Testing has been performed using several Mellanox-based IB cards, the
52 Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter.
53
54 - Install a Linux distribution and tools
55
56 The first kernel release to contain both the NFS/RDMA client and server was
57 Linux 2.6.25. Therefore, a distribution compatible with this and subsequent
58 Linux kernel releases should be installed.
59
60 The procedures described in this document have been tested with
61 distributions from Red Hat's Fedora Project (http://fedora.redhat.com/).
62
63 - Install nfs-utils-1.1.2 or greater on the client
64
65 An NFS/RDMA mount point can be obtained by using the mount.nfs command in
66 nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils
67 version with support for NFS/RDMA mounts, but for various reasons we
68 recommend using nfs-utils-1.1.2 or greater). To see which version of
69 mount.nfs you are using, type:
70
71 $ /sbin/mount.nfs -V
72
73 If the version is less than 1.1.2 or the command does not exist,
74 you should install the latest version of nfs-utils.
75
76 Download the latest package from:
77
78 http://www.kernel.org/pub/linux/utils/nfs
79
80 Uncompress the package and follow the installation instructions.
81
82 If you will not need the idmapper and gssd executables (you do not need
83 these to create an NFS/RDMA enabled mount command), the installation
84 process can be simplified by disabling these features when running
85 configure:
86
87 $ ./configure --disable-gss --disable-nfsv4
88
89 To build nfs-utils you will need the tcp_wrappers package installed. For
90 more information on this see the package's README and INSTALL files.
91
92 After building the nfs-utils package, there will be a mount.nfs binary in
93 the utils/mount directory. This binary can be used to initiate NFS v2, v3,
94 or v4 mounts. To initiate a v4 mount, the binary must be called
95 mount.nfs4. The standard technique is to create a symlink called
96 mount.nfs4 to mount.nfs.
97
98 This mount.nfs binary should be installed at /sbin/mount.nfs as follows:
99
100 $ sudo cp utils/mount/mount.nfs /sbin/mount.nfs
101
102 In this location, mount.nfs will be invoked automatically for NFS mounts
103 by the system mount command.
104
105 NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed
106 on the NFS client machine. You do not need this specific version of
107 nfs-utils on the server. Furthermore, only the mount.nfs command from
108 nfs-utils-1.1.2 is needed on the client.
109
110 - Install a Linux kernel with NFS/RDMA
111
112 The NFS/RDMA client and server are both included in the mainline Linux
113 kernel version 2.6.25 and later. This and other versions of the 2.6 Linux
114 kernel can be found at:
115
116 ftp://ftp.kernel.org/pub/linux/kernel/v2.6/
117
118 Download the sources and place them in an appropriate location.
119
120 - Configure the RDMA stack
121
122 Make sure your kernel configuration has RDMA support enabled. Under
123 Device Drivers -> InfiniBand support, update the kernel configuration
124 to enable InfiniBand support [NOTE: the option name is misleading. Enabling
125 InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)].
126
127 Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or
128 iWARP adapter support (amso, cxgb3, etc.).
129
130 If you are using InfiniBand, be sure to enable IP-over-InfiniBand support.
131
132 - Configure the NFS client and server
133
134 Your kernel configuration must also have NFS file system support and/or
135 NFS server support enabled. These and other NFS related configuration
136 options can be found under File Systems -> Network File Systems.
137
138 - Build, install, reboot
139
140 The NFS/RDMA code will be enabled automatically if NFS and RDMA
141 are turned on. The NFS/RDMA client and server are configured via the hidden
142 SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The
143 value of SUNRPC_XPRT_RDMA will be:
144
145 - N if either SUNRPC or INFINIBAND is N; in this case the NFS/RDMA client
146 and server will not be built
147 - M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M;
148 in this case the NFS/RDMA client and server will be built as modules
149 - Y if both SUNRPC and INFINIBAND are Y; in this case the NFS/RDMA client
150 and server will be built into the kernel
151
152 Therefore, if you have followed the steps above and turned on NFS and RDMA,
153 the NFS/RDMA client and server will be built.
154
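  As an illustration only (not a recommendation), a configuration fragment
  like the following would produce modular NFS/RDMA support, because
  SUNRPC is modular while INFINIBAND is built in:

    CONFIG_SUNRPC=m
    CONFIG_INFINIBAND=y
    CONFIG_SUNRPC_XPRT_RDMA=m
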
155 Build a new kernel, install it, boot it.
156
157Check RDMA and NFS Setup
158~~~~~~~~~~~~~~~~~~~~~~~~
159
160 Before configuring the NFS/RDMA software, it is a good idea to test
161 your new kernel to ensure that the kernel is working correctly.
162 In particular, it is a good idea to verify that the RDMA stack
163 is functioning as expected and standard NFS over TCP/IP and/or UDP/IP
164 is working properly.
165
166 - Check RDMA Setup
167
168 If you built the RDMA components as modules, load them at
169 this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel
170 card:
171
172 $ modprobe ib_mthca
173 $ modprobe ib_ipoib
174
175 If you are using InfiniBand, make sure there is a Subnet Manager (SM)
176 running on the network. If your IB switch has an embedded SM, you can
177 use it. Otherwise, you will need to run an SM, such as OpenSM, on one
178 of your end nodes.
179
180 If an SM is running on your network, you should see the following:
181
182 $ cat /sys/class/infiniband/driverX/ports/1/state
183 4: ACTIVE
184
185 where driverX is mthca0, ipath5, ehca3, etc.
186
187 To further test the InfiniBand software stack, use IPoIB (this
188 assumes you have two IB hosts named host1 and host2):
189
190 host1$ ifconfig ib0 a.b.c.x
191 host2$ ifconfig ib0 a.b.c.y
192 host1$ ping a.b.c.y
193 host2$ ping a.b.c.x
194
195 For other device types, follow the appropriate procedures.
196
197 - Check NFS Setup
198
199 For the NFS components enabled above (client and/or server),
200 test their functionality over standard Ethernet using TCP/IP or UDP/IP.
201
202NFS/RDMA Setup
203~~~~~~~~~~~~~~
204
205 We recommend that you use two machines, one to act as the client and
206 one to act as the server.
207
208 One time configuration:
209
210 - On the server system, configure the /etc/exports file and
211 start the NFS/RDMA server.
212
213 Exports entries with the following formats have been tested:
214
215 /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash)
216 /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
217
218 The IP address(es) is(are) the client's IPoIB address for an InfiniBand
219 HCA or the client's iWARP address(es) for an RNIC.
220
221 NOTE: The "insecure" option must be used because the NFS/RDMA client does
222 not use a reserved port.
223
224 Each time a machine boots:
225
226 - Load and configure the RDMA drivers
227
228 For InfiniBand using a Mellanox adapter:
229
230 $ modprobe ib_mthca
231 $ modprobe ib_ipoib
232 $ ifconfig ib0 a.b.c.d
233
234 NOTE: use unique addresses for the client and server
235
236 - Start the NFS server
237
238 If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
239 kernel config), load the RDMA transport module:
240
241 $ modprobe svcrdma
242
243 Regardless of how the server was built (module or built-in), start the
244 server:
245
246 $ /etc/init.d/nfs start
247
248 or
249
250 $ service nfs start
251
252 Instruct the server to listen on the RDMA transport:
253
254 $ echo rdma 20049 > /proc/fs/nfsd/portlist
255
256 - On the client system
257
258 If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
259 kernel config), load the RDMA client module:
260
261 $ modprobe xprtrdma
262
263 Regardless of how the client was built (module or built-in), use this
264 command to mount the NFS/RDMA server:
265
266 $ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt
267
268 To verify that the mount is using RDMA, run "cat /proc/mounts" and check
269 the "proto" field for the given mount.
270
271 Congratulations! You're using NFS/RDMA!
diff --git a/Documentation/filesystems/nfs/nfs.txt b/Documentation/filesystems/nfs/nfs.txt
new file mode 100644
index 000000000000..f50f26ce6cd0
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfs.txt
@@ -0,0 +1,98 @@
1
2The NFS client
3==============
4
5The NFS version 2 protocol was first documented in RFC1094 (March 1989).
6Since then two more major releases of NFS have been published, with NFSv3
7being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April
82003).
9
10The Linux NFS client currently supports all the above published versions,
11and work is in progress on adding support for minor version 1 of the NFSv4
12protocol.
13
14The purpose of this document is to describe some of the upcall
15interfaces that are used to provide the NFS client with the
16information that it requires in order to fully comply with
17the NFS spec.
18
19The DNS resolver
20================
21
22NFSv4 allows for one server to refer the NFS client to data that has been
23migrated onto another server by means of the special "fs_locations"
24attribute. See
25 http://tools.ietf.org/html/rfc3530#section-6
26and
27 http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00
28
29The fs_locations information can take the form of either an IP address and
30a path, or a DNS hostname and a path. The latter requires the NFS client to
31do a DNS lookup in order to mount the new volume, and hence the need for an
32upcall to allow userland to provide this service.
33
34Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual
35/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps:
36
37 (1) The process checks the dns_resolve cache to see if it contains a
38 valid entry. If so, it returns that entry and exits.
39
40 (2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent'
41 (may be changed using the 'nfs.cache_getent' kernel boot parameter)
42 is run, with two arguments:
43 - the cache name, "dns_resolve"
44 - the hostname to resolve
45
46 (3) After looking up the corresponding ip address, the helper script
47 writes the result into the rpc_pipefs pseudo-file
48 '/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel'
49 in the following (text) format:
50
51 "<ip address> <hostname> <ttl>\n"
52
53 Where <ip address> is in the usual IPv4 (192.168.78.90) or IPv6
54 (ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format.
55 <hostname> is identical to the second argument of the helper
56 script, and <ttl> is the 'time to live' of this cache entry (in
57 units of seconds).
58
59 Note: If <ip address> is invalid, say the string "0", then a negative
60 entry is created, which will cause the kernel to treat the hostname
61 as having no valid DNS translation.
62
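      For example, to indicate that the hostname "nfsserver.example.com"
      (a made-up name) resolves to 192.168.1.50 and that the result may
      be cached for ten minutes, the helper script would write the line:

		"192.168.1.50 nfsserver.example.com 600\n"
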
63
64
65
66A basic sample /sbin/nfs_cache_getent
67=====================================
68
69#!/bin/bash
70#
71ttl=600
72#
73cut=/usr/bin/cut
74getent=/usr/bin/getent
75rpc_pipefs=/var/lib/nfs/rpc_pipefs
76#
77die()
78{
79 echo "Usage: $0 cache_name entry_name"
80 exit 1
81}
82
83[ $# -lt 2 ] && die
84cachename="$1"
85cache_path=${rpc_pipefs}/cache/${cachename}/channel
86
87case "${cachename}" in
88 dns_resolve)
89 name="$2"
90 result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )"
91 [ -z "${result}" ] && result="0"
92 ;;
93 *)
94 die
95 ;;
96esac
97echo "${result} ${name} ${ttl}" >${cache_path}
98
diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt
new file mode 100644
index 000000000000..1bd0d0c05171
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfs41-server.txt
@@ -0,0 +1,222 @@
1NFSv4.1 Server Implementation
2
3Server support for minorversion 1 can be controlled using the
4/proc/fs/nfsd/versions control file. The string returned by reading
5this file will contain "+4.1" if minorversion 1 support is enabled,
6or "-4.1" if it is disabled.
7
8Currently, server support for minorversion 1 is disabled by default.
9It can be enabled at run time by writing the string "+4.1" to
10the /proc/fs/nfsd/versions control file. Note that to write this
11control file, the nfsd service must be taken down. Use your user-mode
12nfs-utils to set this up; see rpc.nfsd(8).
13
14(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and
15"-4", respectively. Therefore, code meant to work on both new and old
16kernels must turn 4.1 on or off *before* turning support for version 4
17on or off; rpc.nfsd does this correctly.)
18
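For example (an illustration only, not a recommended procedure), with the
nfsd service stopped one could enable minorversion 1 by hand and then
restart the server:

	$ rpc.nfsd 0
	$ echo "+4.1" > /proc/fs/nfsd/versions
	$ rpc.nfsd 8

As noted above, your nfs-utils' rpc.nfsd can take care of this ordering
for you; see rpc.nfsd(8).
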
19The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
20on the latest NFSv4.1 Internet Draft:
21http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29
22
23Of the many new features in NFSv4.1, the current implementation
24focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
25"exactly once" semantics and better control and throttling of the
26resources allocated for each client.
27
28Other NFSv4.1 features, Parallel NFS operations in particular,
29are still under development out of tree.
30See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design
31for more information.
32
33The current implementation is intended for developers only: while it
34does support ordinary file operations on clients we have tested against
35(including the linux client), it is incomplete in ways which may limit
36features unexpectedly, cause known bugs in rare cases, or cause
37interoperability problems with future clients. Known issues:
38
39 - gss support is questionable: currently mounts with kerberos
40 from a linux client are possible, but we aren't really
41 conformant with the spec (for example, we don't use kerberos
42 on the backchannel correctly).
43 - no trunking support: no clients currently take advantage of
44 trunking, but this is a mandatory feature, and its use is
45 recommended to clients in a number of places. (E.g. to ensure
46 timely renewal in case an existing connection's retry timeouts
47 have gotten too long; see section 8.3 of the draft.)
48 Therefore, lack of this feature may cause future clients to
49 fail.
50 - Incomplete backchannel support: incomplete backchannel gss
51 support and no support for BACKCHANNEL_CTL mean that
52 callbacks (hence delegations and layouts) may not be
53 available and clients confused by the incomplete
54 implementation may fail.
55 - Server reboot recovery is unsupported; if the server reboots,
56 clients may fail.
57 - We do not support SSV, which provides security for shared
58 client-server state (thus preventing unauthorized tampering
59 with locks and opens, for example). It is mandatory for
60 servers to support this, though no clients use it yet.
61 - Mandatory operations which we do not support, such as
62 DESTROY_CLIENTID, FREE_STATEID, SECINFO_NO_NAME, and
63 TEST_STATEID, are not currently used by clients, but will be
64 (and the spec recommends their use in common cases), and
65 clients should not be expected to know how to recover from the
66 case where they are not supported. This will eventually cause
67 interoperability failures.
68
69In addition, some limitations are inherited from the current NFSv4
70implementation:
71
72 - Incomplete delegation enforcement: if a file is renamed or
73 unlinked, a client holding a delegation may continue to
74 indefinitely allow opens of the file under the old name.
75
76The table below, taken from the NFSv4.1 document, lists
77the operations that are mandatory to implement (REQ), optional
78(OPT), and NFSv4.0 operations that are required not to implement (MNI)
79in minor version 1. The first column indicates the linux server
80implementation status of each operation, using the abbreviations below.
81
82The OPTIONAL features identified and their abbreviations are as follows:
83 pNFS Parallel NFS
84 FDELG File Delegations
85 DDELG Directory Delegations
86
87The following abbreviations indicate the linux server implementation status.
88 I Implemented NFSv4.1 operations.
89 NS Not Supported.
90 NS* unimplemented optional feature.
91 P pNFS features implemented out of tree.
92 PNS pNFS features that are not supported yet (out of tree).
93
94Operations
95
96 +----------------------+------------+--------------+----------------+
97 | Operation | REQ, REC, | Feature | Definition |
98 | | OPT, or | (REQ, REC, | |
99 | | MNI | or OPT) | |
100 +----------------------+------------+--------------+----------------+
101 | ACCESS | REQ | | Section 18.1 |
102NS | BACKCHANNEL_CTL | REQ | | Section 18.33 |
103NS | BIND_CONN_TO_SESSION | REQ | | Section 18.34 |
104 | CLOSE | REQ | | Section 18.2 |
105 | COMMIT | REQ | | Section 18.3 |
106 | CREATE | REQ | | Section 18.4 |
107I | CREATE_SESSION | REQ | | Section 18.36 |
108NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 |
109 | DELEGRETURN | OPT | FDELG, | Section 18.6 |
110 | | | DDELG, pNFS | |
111 | | | (REQ) | |
112NS | DESTROY_CLIENTID | REQ | | Section 18.50 |
113I | DESTROY_SESSION | REQ | | Section 18.37 |
114I | EXCHANGE_ID | REQ | | Section 18.35 |
115NS | FREE_STATEID | REQ | | Section 18.38 |
116 | GETATTR | REQ | | Section 18.7 |
117P | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 |
118P | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 |
119 | GETFH | REQ | | Section 18.8 |
120NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 |
121P | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 |
122P | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 |
123P | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 |
124 | LINK | OPT | | Section 18.9 |
125 | LOCK | REQ | | Section 18.10 |
126 | LOCKT | REQ | | Section 18.11 |
127 | LOCKU | REQ | | Section 18.12 |
128 | LOOKUP | REQ | | Section 18.13 |
129 | LOOKUPP | REQ | | Section 18.14 |
130 | NVERIFY | REQ | | Section 18.15 |
131 | OPEN | REQ | | Section 18.16 |
132NS*| OPENATTR | OPT | | Section 18.17 |
133 | OPEN_CONFIRM | MNI | | N/A |
134 | OPEN_DOWNGRADE | REQ | | Section 18.18 |
135 | PUTFH | REQ | | Section 18.19 |
136 | PUTPUBFH | REQ | | Section 18.20 |
137 | PUTROOTFH | REQ | | Section 18.21 |
138 | READ | REQ | | Section 18.22 |
139 | READDIR | REQ | | Section 18.23 |
140 | READLINK | OPT | | Section 18.24 |
141NS | RECLAIM_COMPLETE | REQ | | Section 18.51 |
142 | RELEASE_LOCKOWNER | MNI | | N/A |
143 | REMOVE | REQ | | Section 18.25 |
144 | RENAME | REQ | | Section 18.26 |
145 | RENEW | MNI | | N/A |
146 | RESTOREFH | REQ | | Section 18.27 |
147 | SAVEFH | REQ | | Section 18.28 |
148 | SECINFO | REQ | | Section 18.29 |
149NS | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, |
150 | | | layout (REQ) | Section 13.12 |
151I | SEQUENCE | REQ | | Section 18.46 |
152 | SETATTR | REQ | | Section 18.30 |
153 | SETCLIENTID | MNI | | N/A |
154 | SETCLIENTID_CONFIRM | MNI | | N/A |
155NS | SET_SSV | REQ | | Section 18.47 |
156NS | TEST_STATEID | REQ | | Section 18.48 |
157 | VERIFY | REQ | | Section 18.31 |
158NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 |
159 | WRITE | REQ | | Section 18.32 |
160
161Callback Operations
162
163 +-------------------------+-----------+-------------+---------------+
164 | Operation | REQ, REC, | Feature | Definition |
165 | | OPT, or | (REQ, REC, | |
166 | | MNI | or OPT) | |
167 +-------------------------+-----------+-------------+---------------+
168 | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 |
169P | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 |
170NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 |
171P | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 |
172NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 |
173NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 |
174 | CB_RECALL | OPT | FDELG, | Section 20.2 |
175 | | | DDELG, pNFS | |
176 | | | (REQ) | |
177NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 |
178 | | | DDELG, pNFS | |
179 | | | (REQ) | |
180NS | CB_RECALL_SLOT | REQ | | Section 20.8 |
181NS*| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 |
182 | | | (REQ) | |
183I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 |
184 | | | DDELG, pNFS | |
185 | | | (REQ) | |
186NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 |
187 | | | DDELG, pNFS | |
188 | | | (REQ) | |
189 +-------------------------+-----------+-------------+---------------+
190
191Implementation notes:
192
193DELEGPURGE:
194* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or
195 CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that
196 persist across client reboots). Thus we need not implement this for
197 now.
198
199EXCHANGE_ID:
200* only SP4_NONE state protection supported
201* implementation ids are ignored
202
203CREATE_SESSION:
204* backchannel attributes are ignored
205* backchannel security parameters are ignored
206
207SEQUENCE:
208* no support for dynamic slot table renegotiation (optional)
209
210nfsv4.1 COMPOUND rules:
211The following cases aren't supported yet:
212* Enforcing of NFS4ERR_NOT_ONLY_OP for: BIND_CONN_TO_SESSION, CREATE_SESSION,
213 DESTROY_CLIENTID, DESTROY_SESSION, EXCHANGE_ID.
214* DESTROY_SESSION MUST be the final operation in the COMPOUND request.
215
216Nonstandard compound limitations:
217* No support for a sessions fore channel RPC compound that requires both a
218 ca_maxrequestsize request and a ca_maxresponsesize reply, so we may
219 fail to live up to the promise we made in CREATE_SESSION fore channel
220 negotiation.
221* No more than one IO operation (read, write, readdir) allowed per
222 compound.
diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt
new file mode 100644
index 000000000000..3ba0b945aaf8
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfsroot.txt
@@ -0,0 +1,270 @@
1Mounting the root filesystem via NFS (nfsroot)
2===============================================
3
4Written 1996 by Gero Kuhlmann <gero@gkminix.han.de>
5Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz>
6Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org>
7Updated 2006 by Horms <horms@verge.net.au>
8
9
10
11In order to use a diskless system, such as an X-terminal or printer server
12for example, it is necessary for the root filesystem to be present on a
13non-disk device. This may be an initramfs (see Documentation/filesystems/
14ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a
15filesystem mounted via NFS. The following text describes how to use NFS
16for the root filesystem. For the rest of this text 'client' means the
17diskless system, and 'server' means the NFS server.
18
19
20
21
221.) Enabling nfsroot capabilities
23 -----------------------------
24
25In order to use nfsroot, NFS client support needs to be selected as
26built-in during configuration. Once this has been selected, the nfsroot
27option will become available, which should also be selected.
28
29In the networking options, kernel level autoconfiguration can be selected,
30along with the types of autoconfiguration to support. Selecting all of
31DHCP, BOOTP and RARP is safe.
32
33
34
35
362.) Kernel command line
37 -------------------
38
39When the kernel has been loaded by a boot loader (see below) it needs to be
40told what root fs device to use. In the case of nfsroot it must also be told
41where to find both the server and the name of the directory on the server to mount as root.
42This can be established using the following kernel command line parameters:
43
44
45root=/dev/nfs
46
47 This is necessary to enable the pseudo-NFS-device. Note that it's not a
48 real device but just a synonym to tell the kernel to use NFS instead of
49 a real device.
50
51
52nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]
53
54 If the `nfsroot' parameter is NOT given on the command line,
55 the default "/tftpboot/%s" will be used.
56
57 <server-ip> Specifies the IP address of the NFS server.
58 The default address is determined by the `ip' parameter
59 (see below). This parameter allows the use of different
60 servers for IP autoconfiguration and NFS.
61
62 <root-dir> Name of the directory on the server to mount as root.
63 If there is a "%s" token in the string, it will be
64 replaced by the ASCII-representation of the client's
65 IP address.
66
67 <nfs-options> Standard NFS options. All options are separated by commas.
68 The following defaults are used:
69 port = as given by server portmap daemon
70 rsize = 4096
71 wsize = 4096
72 timeo = 7
73 retrans = 3
74 acregmin = 3
75 acregmax = 60
76 acdirmin = 30
77 acdirmax = 60
78 flags = hard, nointr, noposix, cto, ac
79
80
81ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>
82
83 This parameter tells the kernel how to configure IP addresses of devices
84 and also how to set up the IP routing table. It was originally called
85 `nfsaddrs', but now the boot-time IP configuration works independently of
86 NFS, so it was renamed to `ip' and the old name remained as an alias for
87 compatibility reasons.
88
89 If this parameter is missing from the kernel command line, all fields are
90 assumed to be empty, and the defaults mentioned below apply. In general
91 this means that the kernel tries to configure everything using
92 autoconfiguration.
93
94 The <autoconf> parameter can appear alone as the value to the `ip'
95 parameter (without all the ':' characters before). If the value is
96 "ip=off" or "ip=none", no autoconfiguration will take place, otherwise
97 autoconfiguration will take place. The most common way to use this
98 is "ip=dhcp".
99
100 <client-ip> IP address of the client.
101
102 Default: Determined using autoconfiguration.
103
104 <server-ip> IP address of the NFS server. If RARP is used to determine
105 the client address and this parameter is NOT empty only
106 replies from the specified server are accepted.
107
108 Only required for NFS root. That is autoconfiguration
109 will not be triggered if it is missing and NFS root is not
110 in operation.
111
112 Default: Determined using autoconfiguration.
113 The address of the autoconfiguration server is used.
114
115 <gw-ip> IP address of a gateway if the server is on a different subnet.
116
117 Default: Determined using autoconfiguration.
118
119 <netmask> Netmask for local network interface. If unspecified
120 the netmask is derived from the client IP address assuming
121 classful addressing.
122
123 Default: Determined using autoconfiguration.
124
125 <hostname> Name of the client. May be supplied by autoconfiguration,
126 but its absence will not trigger autoconfiguration.
127
128 Default: Client IP address is used in ASCII notation.
129
130 <device> Name of network device to use.
131
132 Default: If the host only has one device, it is used.
133 Otherwise the device is determined using
134 autoconfiguration. This is done by sending
135 autoconfiguration requests out of all devices,
136 and using the device that received the first reply.
137
138 <autoconf> Method to use for autoconfiguration. In the case of options
139 which specify multiple autoconfiguration protocols,
140 requests are sent using all protocols, and the first one
141 to reply is used.
142
143 Only autoconfiguration protocols that have been compiled
144 into the kernel will be used, regardless of the value of
145 this option.
146
147 off or none: don't use autoconfiguration
148 (do static IP assignment instead)
149 on or any: use any protocol available in the kernel
150 (default)
151 dhcp: use DHCP
152 bootp: use BOOTP
153 rarp: use RARP
154 both: use both BOOTP and RARP but not DHCP
155 (old option kept for backwards compatibility)
156
157 Default: any
158
159
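As a worked example combining the parameters above (all addresses and
paths here are made up), the following kernel command line configures
the client via DHCP and mounts /export/client1 from the server
192.168.1.1 as the root filesystem:

   root=/dev/nfs nfsroot=192.168.1.1:/export/client1,rsize=8192,wsize=8192 ip=dhcp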
160
161
1623.) Boot Loader
163 ----------
164
165To get the kernel into memory different approaches can be used.
166They depend on various facilities being available:
167
168
1693.1) Booting from a floppy using syslinux
170
171 When building kernels, an easy way to create a boot floppy that uses
172 syslinux is to use the zdisk or bzdisk make targets which use zImage
173 and bzImage images respectively. Both targets accept the
174 FDARGS parameter which can be used to set the kernel command line.
175
176 e.g.
177 make bzdisk FDARGS="root=/dev/nfs"
178
179 Note that the user running this command will need to have
180 access to the floppy drive device, /dev/fd0
181
182 For more information on syslinux, including how to create bootdisks
183 for prebuilt kernels, see http://syslinux.zytor.com/
184
185 N.B: Previously it was possible to write a kernel directly to
186 a floppy using dd, configure the boot device using rdev, and
187 boot using the resulting floppy. Linux no longer supports this
188 method of booting.
189
1903.2) Booting from a cdrom using isolinux
191
192 When building kernels, an easy way to create a bootable cdrom that
193 uses isolinux is to use the isoimage target which uses a bzimage
194 image. Like zdisk and bzdisk, this target accepts the FDARGS
195 parameter which can be used to set the kernel command line.
196
197 e.g.
198 make isoimage FDARGS="root=/dev/nfs"
199
200 The resulting iso image will be arch/<ARCH>/boot/image.iso
201 This can be written to a cdrom using a variety of tools including
202 cdrecord.
203
204 e.g.
205 cdrecord dev=ATAPI:1,0,0 arch/i386/boot/image.iso
206
207 For more information on isolinux, including how to create bootdisks
208 for prebuilt kernels, see http://syslinux.zytor.com/
209
2103.3) Using LILO
211 When using LILO all the necessary command line parameters may be
212 specified using the 'append=' directive in the LILO configuration
213 file.
214
215 However, to use the 'root=' directive you also need to create
216 a dummy root device, which may be removed after LILO is run.
217
218 mknod /dev/boot255 c 0 255
219
220 For information on configuring LILO, please refer to its documentation.
221
2223.4) Using GRUB
223 When using GRUB, kernel parameters are simply appended after the kernel
224 specification: kernel <kernel> <parameters>
225
2263.5) Using loadlin
227 loadlin may be used to boot Linux from a DOS command prompt without
228 requiring a local hard disk to mount as root. This has not been
229 thoroughly tested by the authors of this document, but in general
230 it should be possible to configure the kernel command line similarly
231 to the configuration of LILO.
232
233 Please refer to the loadlin documentation for further information.
234
2353.6) Using a boot ROM
236 This is probably the most elegant way of booting a diskless client.
237 With a boot ROM the kernel is loaded using the TFTP protocol. The
238 authors of this document are not aware of any commercial boot
239 ROMs that support booting Linux over the network. However, there
240 are two free implementations of a boot ROM, netboot-nfs and
241 etherboot, both of which are available on sunsite.unc.edu, and both
242 of which contain everything you need to boot a diskless Linux client.
243
2443.7) Using pxelinux
245 Pxelinux may be used to boot linux using the PXE boot loader
246 which is present on many modern network cards.
247
248 When using pxelinux, the kernel image is specified using
249 "kernel <relative-path-below /tftpboot>". The nfsroot parameters
250 are passed to the kernel by adding them to the "append" line.
251 It is common to use a serial console in conjunction with pxelinux,
252 see Documentation/serial-console.txt for more information.
253
254 For more information on pxelinux, including how to create bootdisks
255 for prebuilt kernels, see http://syslinux.zytor.com/
256
257
258
259
2604.) Credits
261 -------
262
263 The nfsroot code in the kernel and the RARP support have been written
264 by Gero Kuhlmann <gero@gkminix.han.de>.
265
266 The rest of the IP layer autoconfiguration code has been written
267 by Martin Mares <mj@atrey.karlin.mff.cuni.cz>.
268
269 In order to write the initial version of nfsroot I would like to thank
270 Jens-Uwe Mager <jum@anubis.han.de> for his help.
diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.txt
new file mode 100644
index 000000000000..8a382bea6808
--- /dev/null
+++ b/Documentation/filesystems/nfs/rpc-cache.txt
@@ -0,0 +1,202 @@
1 This document gives a brief introduction to the caching
2mechanisms in the sunrpc layer that are used, in particular,
3for NFS authentication.
4
5CACHES
6======
7The caching replaces the old exports table and allows for
8a wide variety of values to be cached.
9
10There are a number of caches that are similar in structure though
11quite possibly very different in content and use. There is a corpus
12of common code for managing these caches.
13
14Examples of caches that are likely to be needed are:
15 - mapping from IP address to client name
16 - mapping from client name and filesystem to export options
17 - mapping from UID to list of GIDs, to work around NFS's limitation
18 of 16 gids.
19 - mappings between local UID/GID and remote UID/GID for sites that
20 do not have uniform uid assignment
21 - mapping from network identity to public key for crypto authentication.
22
23The common code handles such things as:
24 - general cache lookup with correct locking
25 - supporting 'NEGATIVE' as well as positive entries
26 - allowing an expiry time on cache items, and removing
27 items after they expire and are no longer in use.
28 - making requests to user-space to fill in cache entries
29 - allowing user-space to directly set entries in the cache
30 - delaying RPC requests that depend on as-yet incomplete
31 cache entries, and replaying those requests when the cache entry
32 is complete.
33 - cleaning out old entries as they expire.
34
35Creating a Cache
36----------------
37
381/ A cache needs a datum to store. This is in the form of a
39 structure definition that must contain a
40 struct cache_head
41 as an element, usually the first.
42 It will also contain a key and some content.
43 Each cache element is reference counted and contains
44 expiry and update times for use in cache management.
452/ A cache needs a "cache_detail" structure that
46 describes the cache. This stores the hash table, some
47 parameters for cache management, and some operations detailing how
48 to work with particular cache items.
49 The operations required are (a sketch combining them follows this list):
50 struct cache_head *alloc(void)
51 This simply allocates appropriate memory and returns
52 a pointer to the cache_head embedded within the
53 structure.
54 void cache_put(struct kref *)
55 This is called when the last reference to an item is
56 dropped. The pointer passed is to the 'ref' field
57 in the cache_head. cache_put should release any
58 references created by 'cache_init' and, if CACHE_VALID
59 is set, any references created by cache_update.
60 It should then release the memory allocated by
61 'alloc'.
62 int match(struct cache_head *orig, struct cache_head *new)
63 test if the keys in the two structures match. Return
64 1 if they do, 0 if they don't.
65 void init(struct cache_head *orig, struct cache_head *new)
66 Set the 'key' fields in 'new' from 'orig'. This may
67 include taking references to shared objects.
68 void update(struct cache_head *orig, struct cache_head *new)
69 Set the 'content' fields in 'new' from 'orig'.
70 int cache_show(struct seq_file *m, struct cache_detail *cd,
71 struct cache_head *h)
72 Optional. Used to provide a /proc file that lists the
73 contents of a cache. This should show one item,
74 usually on just one line.
75 int cache_request(struct cache_detail *cd, struct cache_head *h,
76 char **bpp, int *blen)
77 Format a request to be sent to user-space for an item
78 to be instantiated. *bpp is a buffer of size *blen.
79 *bpp should be moved forward over the encoded message,
80 and *blen should be reduced to show how much free
81 space remains. Return 0 on success or <0 if not
82 enough room or other problem.
83 int cache_parse(struct cache_detail *cd, char *buf, int len)
84 A message from user space has arrived to fill out a
85 cache entry. It is in 'buf' of length 'len'.
86 cache_parse should parse this, find the item in the
87 cache with sunrpc_cache_lookup, and update the item
88 with sunrpc_cache_update.
89
90
913/ A cache needs to be registered using cache_register(). This
92 includes it on a list of caches that will be regularly
93 cleaned to discard old data.
94
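The pieces described in 1/ and 2/ fit together roughly as in the sketch
below.  This is only an illustration: the example_* names are
hypothetical, several cache_detail members (the hash table and the cache
management parameters) and most of the operations' bodies are omitted,
and include/linux/sunrpc/cache.h is authoritative for the exact
signatures.

  struct example_map {
          struct cache_head h;          /* must be embedded, usually first */
          char              key[20];    /* the lookup key */
          char              value[20];  /* the cached content */
  };

  static struct cache_head *example_alloc(void)
  {
          struct example_map *item = kzalloc(sizeof(*item), GFP_KERNEL);

          return item ? &item->h : NULL;
  }

  static int example_match(struct cache_head *a, struct cache_head *b)
  {
          struct example_map *ia = container_of(a, struct example_map, h);
          struct example_map *ib = container_of(b, struct example_map, h);

          return strcmp(ia->key, ib->key) == 0;
  }

  static struct cache_detail example_cache = {
          .name          = "example_map",
          .alloc         = example_alloc,
          .match         = example_match,
          .init          = example_init,    /* copies the key fields */
          .update        = example_update,  /* copies the content fields */
          .cache_put     = example_put,     /* frees an example_map */
          .cache_request = example_request, /* formats an upcall request */
          .cache_parse   = example_parse,   /* handles writes to 'channel' */
  };

  /* registered from module init, per 3/ above: cache_register(&example_cache); */
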
95Using a cache
96-------------
97
98To find a value in a cache, call sunrpc_cache_lookup passing a pointer
99to the cache_head in a sample item with the 'key' fields filled in.
100This will be passed to ->match to identify the target entry. If no
101entry is found, a new entry will be created, added to the cache, and
102marked as not containing valid data.
103
104The item returned is typically passed to cache_check which will check
105if the data is valid, and may initiate an up-call to get fresh data.
106cache_check will return -ENOENT if the entry is negative or if an
107upcall is needed but not possible, -EAGAIN if an upcall is pending,
108or 0 if the data is valid.
109
110cache_check can be passed a "struct cache_req *". This structure is
111typically embedded in the actual request and can be used to create a
112deferred copy of the request (struct cache_deferred_req). This is
113done when the found cache item is not up to date, but there is reason to
114believe that userspace might provide information soon. When the cache
115item does become valid, the deferred copy of the request will be
116revisited (->revisit). It is expected that this method will
117reschedule the request for processing.
118
119The value returned by sunrpc_cache_lookup can also be passed to
120sunrpc_cache_update to set the content for the item. A second item is
121passed which should hold the content. If the item found by _lookup
122has valid data, then it is discarded and a new item is created. This
123saves any user of an item from worrying about content changing while
124it is being inspected. If the item found by _lookup does not contain
125valid data, then the content is copied across and CACHE_VALID is set.
126
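Putting the lookup path together, a caller might look roughly like the
following (again only a sketch, reusing the hypothetical example_map and
example_cache from the earlier sketch; hash_str() is the hashing helper
from linux/sunrpc/svcauth.h):

  static struct example_map *example_find(struct cache_req *rqstp,
                                          const char *key)
  {
          struct example_map sample, *found;
          struct cache_head *ch;

          strlcpy(sample.key, key, sizeof(sample.key));
          ch = sunrpc_cache_lookup(&example_cache, &sample.h,
                                   hash_str(sample.key, 6));
          if (!ch)
                  return NULL;            /* allocation failed */
          found = container_of(ch, struct example_map, h);

          /* 0: data valid; -EAGAIN: an upcall is pending (and the request
           * was deferred if rqstp was supplied); -ENOENT: negative entry
           * or no upcall possible. */
          if (cache_check(&example_cache, &found->h, rqstp) < 0)
                  return NULL;
          return found;
  }
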
127Populating a cache
128------------------
129
130Each cache has a name, and when the cache is registered, a directory
131with that name is created in /proc/net/rpc.
132
133This directory contains a file called 'channel' which is a channel
134for communicating between kernel and user for populating the cache.
135This directory may later contain other files for interacting
136with the cache.
137
138The 'channel' works a bit like a datagram socket. Each 'write' is
139passed as a whole to the cache for parsing and interpretation.
140Each cache can treat the write requests differently, but it is
141expected that a message written will contain:
142 - a key
143 - an expiry time
144 - a content.
145with the intention that an item in the cache with the given key
146should be created or updated to have the given content, and the
147expiry time should be set on that item.
148
149Reading from a channel is a bit more interesting. When a cache
150lookup fails, or when it succeeds but finds an entry that may soon
151expire, a request is lodged for that cache item to be updated by
152user-space. These requests appear in the channel file.
153
154Successive reads will return successive requests.
155If there are no more requests to return, read will return EOF, but a
156select or poll for read will block waiting for another request to be
157added.
158
159Thus a user-space helper is likely to:
160 open the channel.
161 select for readable
162 read a request
163 write a response
164 loop.
165
166If it dies and needs to be restarted, any requests that have not been
167answered will still appear in the file and will be read by the new
168instance of the helper.
169
170Each cache should define a "cache_parse" method which takes a message
171written from user-space and processes it. It should return an error
172(which propagates back to the write syscall) or 0.
173
174Each cache should also define a "cache_request" method which
175takes a cache item and encodes a request into the buffer
176provided.
177
178Note: If a cache has no active readers on the channel, and has had no
179active readers for more than 60 seconds, further requests will not be
180added to the channel but instead all lookups that do not find a valid
181entry will fail. This is partly for backward compatibility: The
182previous nfs exports table was deemed to be authoritative and a
183failed lookup meant a definite 'no'.
184
185request/response format
186-----------------------
187
188While each cache is free to use its own format for requests
189and responses over the channel, the following is recommended as
190appropriate and support routines are available to help:
191Each request or response record should be printable ASCII
192with precisely one newline character which should be at the end.
193Fields within the record should be separated by spaces, normally one.
194If spaces, newlines, or nul characters are needed in a field they
195must be quoted. Two mechanisms are available:
1961/ If a field begins '\x' then it must contain an even number of
197 hex digits, and pairs of these digits provide the bytes in the
198 field.
1992/ Otherwise a \ in the field must be followed by 3 octal digits
200 which give the code for a byte. Other characters are treated
201 as themselves. At the very least, space, newline, nul, and
202 '\' must be quoted in this way.
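
As an illustration (using a purely hypothetical cache, not one of the
real sunrpc caches), a helper for a cache keyed on a host name might see
an exchange like the following, where the response demonstrates octal
quoting of an embedded space and the large number is an absolute expiry
time in seconds:

  request read from 'channel':    example_map host1
  response written to 'channel':  example_map host1 1263508019 a\040value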