author     J. Bruce Fields <bfields@citi.umich.edu>  2009-11-06 13:59:43 -0500
committer  J. Bruce Fields <bfields@citi.umich.edu>  2009-11-06 14:01:02 -0500
commit     ea4878a24d7e6a467d369b962bab95bd6a12cbe0
tree       4f937b8dfa658b16779ae2267d450b53fb035fe7 /Documentation/filesystems/nfs
parent     8c10cbdb4af642d9a2efb45ea89251aaab905360
nfs: move more to Documentation/filesystems/nfs
Oops: I missed two files in the first commit that created this directory.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Diffstat (limited to 'Documentation/filesystems/nfs')
-rw-r--r--  Documentation/filesystems/nfs/00-INDEX            4
-rw-r--r--  Documentation/filesystems/nfs/knfsd-stats.txt   159
-rw-r--r--  Documentation/filesystems/nfs/rpc-cache.txt     202
3 files changed, 365 insertions, 0 deletions
diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX
index 6ff3d212027..2f68cd68876 100644
--- a/Documentation/filesystems/nfs/00-INDEX
+++ b/Documentation/filesystems/nfs/00-INDEX
@@ -2,6 +2,8 @@
         - this file (nfs-related documentation).
 Exporting
         - explanation of how to make filesystems exportable.
+knfsd-stats.txt
+        - statistics which the NFS server makes available to user space.
 nfs.txt
         - nfs client, and DNS resolution for fs_locations.
 nfs41-server.txt
@@ -10,3 +12,5 @@ nfs-rdma.txt
         - how to install and setup the Linux NFS/RDMA client and server software
 nfsroot.txt
         - short guide on setting up a diskless box with NFS root filesystem.
+rpc-cache.txt
+        - introduction to the caching mechanisms in the sunrpc layer.
diff --git a/Documentation/filesystems/nfs/knfsd-stats.txt b/Documentation/filesystems/nfs/knfsd-stats.txt
new file mode 100644
index 00000000000..64ced5149d3
--- /dev/null
+++ b/Documentation/filesystems/nfs/knfsd-stats.txt
@@ -0,0 +1,159 @@
Kernel NFS Server Statistics
============================

This document describes the format and semantics of the statistics
which the kernel NFS server makes available to userspace. These
statistics are available in several text form pseudo files, each of
which is described separately below.

In most cases you don't need to know these formats, as the nfsstat(8)
program from the nfs-utils distribution provides a helpful command-line
interface for extracting and printing them.

All the files described here are formatted as a sequence of text lines,
separated by newline '\n' characters. Lines beginning with a hash
'#' character are comments intended for humans and should be ignored
by parsing routines. All other lines contain a sequence of fields
separated by whitespace.
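
The line format described above can be handled with a few lines of C.
The following is only a sketch of one possible parsing routine; the
function name split_stats_line is made up for this example and is not
part of nfs-utils or the kernel.

```c
#include <string.h>

/*
 * Split one line of a statistics file into whitespace-separated fields.
 * The line buffer is modified in place. Returns the number of fields
 * stored in fields[], or 0 for a comment line or an empty line.
 * (Hypothetical helper, for illustration only.)
 */
static int split_stats_line(char *line, char *fields[], int max_fields)
{
    int n = 0;

    /* Lines beginning with '#' are comments for humans. */
    if (line[0] == '#')
        return 0;

    while (n < max_fields) {
        line += strspn(line, " \t\n");      /* skip separators */
        if (*line == '\0')
            break;
        fields[n++] = line;                 /* start of a field */
        line += strcspn(line, " \t\n");     /* find end of the field */
        if (*line != '\0')
            *line++ = '\0';                 /* terminate the field */
    }
    return n;
}
```

A caller would typically read the file with fgets() and pass each line
to this routine, converting the resulting fields with strtoull().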

/proc/fs/nfsd/pool_stats
------------------------

This file is available in kernels from 2.6.30 onwards, if the
/proc/fs/nfsd filesystem is mounted (it almost always should be).

The first line is a comment which describes the fields present in
all the other lines. The other lines present the following data as
a sequence of unsigned decimal numeric fields. One line is shown
for each NFS thread pool.

All counters are 64 bits wide and wrap naturally. There is no way
to zero these counters; instead, applications should do their own
rate conversion.
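
As a sketch of the rate conversion mentioned above, an application can
sample a counter twice and divide the difference by the elapsed time.
The function name and the choice of seconds as the time unit are
assumptions made for this example.

```c
#include <stdint.h>

/*
 * Rate of change of a 64-bit wrapping counter between two samples.
 * Unsigned subtraction is performed modulo 2^64, so the delta is still
 * correct if the counter wrapped once between the samples.
 * (Hypothetical helper, for illustration only.)
 */
static double counter_rate(uint64_t prev, uint64_t curr, double interval_secs)
{
    uint64_t delta = curr - prev;

    if (interval_secs <= 0.0)
        return 0.0;
    return (double)delta / interval_secs;
}
```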

pool
        The id number of the NFS thread pool to which this line applies.
        This number does not change.

        Thread pool ids are a contiguous set of small integers starting
        at zero. The maximum value depends on the thread pool mode, but
        currently cannot be larger than the number of CPUs in the system.
        Note that in the default case there will be a single thread pool
        which contains all the nfsd threads and all the CPUs in the system,
        and thus this file will have a single line with a pool id of "0".

packets-arrived
        Counts how many NFS packets have arrived. More precisely, this
        is the number of times that the network stack has notified the
        sunrpc server layer that new data may be available on a transport
        (e.g. a TCP or UDP socket or an NFS/RDMA endpoint).

        Depending on the NFS workload patterns and various network stack
        effects (such as Large Receive Offload) which can combine packets
        on the wire, this may be either more or less than the number
        of NFS calls received (a statistic which is available elsewhere).
        However, this is a more accurate and less workload-dependent measure
        of how much CPU load is being placed on the sunrpc server layer
        due to NFS network traffic.

sockets-enqueued
        Counts how many times an NFS transport is enqueued to wait for
        an nfsd thread to service it, i.e. no nfsd thread was considered
        available.

        The circumstance this statistic tracks indicates that there was NFS
        network-facing work to be done but it couldn't be done immediately,
        thus introducing a small delay in servicing NFS calls. The ideal
        rate of change for this counter is zero; significantly non-zero
        values may indicate a performance limitation.

        This can happen either because there are too few nfsd threads in the
        thread pool for the NFS workload (the workload is thread-limited),
        or because the NFS workload needs more CPU time than is available in
        the thread pool (the workload is CPU-limited). In the former case,
        configuring more nfsd threads will probably improve the performance
        of the NFS workload. In the latter case, the sunrpc server layer is
        already choosing not to wake idle nfsd threads because there are too
        many nfsd threads which want to run but cannot, so configuring more
        nfsd threads will make no difference whatsoever. The overloads-avoided
        statistic (see below) can be used to distinguish these cases.

threads-woken
        Counts how many times an idle nfsd thread is woken to try to
        receive some data from an NFS transport.

        This statistic tracks the circumstance where incoming
        network-facing NFS work is being handled quickly, which is a good
        thing. The ideal rate of change for this counter will be close
        to but less than the rate of change of the packets-arrived counter.

overloads-avoided
        Counts how many times the sunrpc server layer chose not to wake an
        nfsd thread, despite the presence of idle nfsd threads, because
        too many nfsd threads had been recently woken but could not get
        enough CPU time to actually run.

        This statistic counts a circumstance where the sunrpc layer
        heuristically avoids overloading the CPU scheduler with too many
        runnable nfsd threads. The ideal rate of change for this counter
        is zero. Significant non-zero values indicate that the workload
        is CPU limited. Usually this is associated with heavy CPU usage
        on all the CPUs in the nfsd thread pool.

        If a sustained large overloads-avoided rate is detected on a pool,
        the top(1) utility should be used to check for the following
        pattern of CPU usage on all the CPUs associated with the given
        nfsd thread pool.

         - %us ~= 0 (as you're *NOT* running applications on your NFS server)

         - %wa ~= 0

         - %id ~= 0

         - %sy + %hi + %si ~= 100

        If this pattern is seen, configuring more nfsd threads will *not*
        improve the performance of the workload. If this pattern is not
        seen, then something more subtle is wrong.

threads-timedout
        Counts how many times an nfsd thread triggered an idle timeout,
        i.e. was not woken to handle any incoming network packets for
        some time.

        This statistic counts a circumstance where there are more nfsd
        threads configured than can be used by the NFS workload. This is
        a clue that the number of nfsd threads can be reduced without
        affecting performance. Unfortunately, it's only a clue and not
        a strong indication, for a couple of reasons:

         - Currently the rate at which the counter is incremented is quite
           slow; the idle timeout is 60 minutes. Unless the NFS workload
           remains constant for hours at a time, this counter is unlikely
           to be providing information that is still useful.

         - It is usually a wise policy to provide some slack,
           i.e. configure a few more nfsds than are currently needed,
           to allow for future spikes in load.


Note that incoming packets on NFS transports will be dealt with in
one of three ways. An nfsd thread can be woken (threads-woken counts
this case), or the transport can be enqueued for later attention
(sockets-enqueued counts this case), or the packet can be temporarily
deferred because the transport is currently being used by an nfsd
thread. This last case is not very interesting and is not explicitly
counted, but can be inferred from the other counters thus:

packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken )
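
The relationships between the counters described above can be sketched
in C as follows. The structure and function names are made up for this
example, and treating any non-zero delta as significant is a
simplification; a real tool would probably apply a threshold to the
rates.

```c
#include <stdint.h>

/* Per-pool counter deltas over one sampling interval (hypothetical type). */
struct pool_delta {
    uint64_t packets_arrived;
    uint64_t sockets_enqueued;
    uint64_t threads_woken;
    uint64_t overloads_avoided;
};

enum pool_limit { POOL_OK, POOL_THREAD_LIMITED, POOL_CPU_LIMITED };

/* packets-deferred = packets-arrived - (sockets-enqueued + threads-woken) */
static uint64_t packets_deferred(const struct pool_delta *d)
{
    return d->packets_arrived - (d->sockets_enqueued + d->threads_woken);
}

/*
 * If transports had to be enqueued, use overloads-avoided to tell a
 * CPU-limited workload from a thread-limited one.
 */
static enum pool_limit classify_pool(const struct pool_delta *d)
{
    if (d->sockets_enqueued == 0)
        return POOL_OK;
    return d->overloads_avoided > 0 ? POOL_CPU_LIMITED
                                    : POOL_THREAD_LIMITED;
}
```

Only a POOL_THREAD_LIMITED result suggests that configuring more nfsd
threads is likely to help.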


More
----
Descriptions of the other statistics files should go here.


Greg Banks <gnb@sgi.com>
26 Mar 2009
diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.txt
new file mode 100644
index 00000000000..8a382bea680
--- /dev/null
+++ b/Documentation/filesystems/nfs/rpc-cache.txt
@@ -0,0 +1,202 @@
This document gives a brief introduction to the caching
mechanisms in the sunrpc layer that are used, in particular,
for NFS authentication.

CACHES
======
The caching replaces the old exports table and allows for
a wide variety of values to be cached.

There are a number of caches that are similar in structure though
quite possibly very different in content and use. There is a corpus
of common code for managing these caches.

Examples of caches that are likely to be needed are:
  - mapping from IP address to client name
  - mapping from client name and filesystem to export options
  - mapping from UID to list of GIDs, to work around NFS's limitation
    of 16 gids.
  - mappings between local UID/GID and remote UID/GID for sites that
    do not have uniform uid assignment
  - mapping from network identity to public key for crypto authentication.

The common code handles such things as:
  - general cache lookup with correct locking
  - supporting 'NEGATIVE' as well as positive entries
  - allowing an EXPIRED time on cache items, and removing
    items after they expire and are no longer in use.
  - making requests to user-space to fill in cache entries
  - allowing user-space to directly set entries in the cache
  - delaying RPC requests that depend on as-yet incomplete
    cache entries, and replaying those requests when the cache entry
    is complete.
  - cleaning out old entries as they expire.

Creating a Cache
----------------

1/ A cache needs a datum to store. This is in the form of a
   structure definition that must contain a
     struct cache_head
   as an element, usually the first.
   It will also contain a key and some content.
   Each cache element is reference counted and contains
   expiry and update times for use in cache management.
2/ A cache needs a "cache_detail" structure that
   describes the cache. This stores the hash table, some
   parameters for cache management, and some operations detailing how
   to work with particular cache items.
   The operations required are:
        struct cache_head *alloc(void)
                This simply allocates appropriate memory and returns
                a pointer to the cache_head embedded within the
                structure.
        void cache_put(struct kref *)
                This is called when the last reference to an item is
                dropped. The pointer passed is to the 'ref' field
                in the cache_head. cache_put should release any
                references created by 'cache_init' and, if CACHE_VALID
                is set, any references created by cache_update.
                It should then release the memory allocated by
                'alloc'.
        int match(struct cache_head *orig, struct cache_head *new)
                Test if the keys in the two structures match. Return
                1 if they do, 0 if they don't.
        void init(struct cache_head *orig, struct cache_head *new)
                Set the 'key' fields in 'new' from 'orig'. This may
                include taking references to shared objects.
        void update(struct cache_head *orig, struct cache_head *new)
                Set the 'content' fields in 'new' from 'orig'.
        int cache_show(struct seq_file *m, struct cache_detail *cd,
                        struct cache_head *h)
                Optional. Used to provide a /proc file that lists the
                contents of a cache. This should show one item,
                usually on just one line.
        int cache_request(struct cache_detail *cd, struct cache_head *h,
                char **bpp, int *blen)
                Format a request to be sent to user-space for an item
                to be instantiated. *bpp is a buffer of size *blen.
                *bpp should be moved forward over the encoded message,
                and *blen should be reduced to show how much free
                space remains. Return 0 on success or <0 if not
                enough room or other problem.
        int cache_parse(struct cache_detail *cd, char *buf, int len)
                A message from user space has arrived to fill out a
                cache entry. It is in 'buf' of length 'len'.
                cache_parse should parse this, find the item in the
                cache with sunrpc_cache_lookup, and update the item
                with sunrpc_cache_update.


3/ A cache needs to be registered using cache_register(). This
   includes it on a list of caches that will be regularly
   cleaned to discard old data.
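
To make the shape of a datum and its alloc, match, init and update
operations more concrete, here is a user-space sketch. It is not kernel
code: struct cache_head is replaced by a minimal stand-in, and the
name_ent structure and all name_* functions are hypothetical. The
argument order follows the descriptions above.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the kernel's struct cache_head (assumption: the real
 * structure also carries the reference count and expiry times). */
struct cache_head {
    unsigned long flags;
};

/* Hypothetical datum: client name (key) mapped to a numeric id (content). */
struct name_ent {
    struct cache_head h;    /* embedded cache_head, first element */
    char name[32];          /* key */
    int id;                 /* content */
};

/* Recover the enclosing datum from a pointer to its embedded cache_head. */
static struct name_ent *to_name_ent(struct cache_head *h)
{
    return (struct name_ent *)((char *)h - offsetof(struct name_ent, h));
}

/* alloc: allocate a datum and return its embedded cache_head. */
static struct cache_head *name_alloc(void)
{
    struct name_ent *e = calloc(1, sizeof(*e));
    return e ? &e->h : NULL;
}

/* match: 1 if the keys of the two items are equal, 0 otherwise. */
static int name_match(struct cache_head *orig, struct cache_head *new_item)
{
    return strcmp(to_name_ent(orig)->name,
                  to_name_ent(new_item)->name) == 0;
}

/* init: set the key fields in new_item from orig. */
static void name_init(struct cache_head *orig, struct cache_head *new_item)
{
    memcpy(to_name_ent(new_item)->name, to_name_ent(orig)->name,
           sizeof(to_name_ent(new_item)->name));
}

/* update: set the content fields in new_item from orig. */
static void name_update(struct cache_head *orig, struct cache_head *new_item)
{
    to_name_ent(new_item)->id = to_name_ent(orig)->id;
}
```

Keeping the cache_head as the first element means the conversion in
to_name_ent() is a no-op, but using offsetof() keeps the sketch correct
if the element is moved.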

Using a cache
-------------

To find a value in a cache, call sunrpc_cache_lookup passing a pointer
to the cache_head in a sample item with the 'key' fields filled in.
This will be passed to ->match to identify the target entry. If no
entry is found, a new entry will be created, added to the cache, and
marked as not containing valid data.

The item returned is typically passed to cache_check which will check
if the data is valid, and may initiate an upcall to get fresh data.
cache_check will return -ENOENT if the entry is negative or if an
upcall is needed but not possible, -EAGAIN if an upcall is pending,
or 0 if the data is valid.

cache_check can be passed a "struct cache_req *". This structure is
typically embedded in the actual request and can be used to create a
deferred copy of the request (struct cache_deferred_req). This is
done when the found cache item is not uptodate, but there is reason to
believe that userspace might provide information soon. When the cache
item does become valid, the deferred copy of the request will be
revisited (->revisit). It is expected that this method will
reschedule the request for processing.

The value returned by sunrpc_cache_lookup can also be passed to
sunrpc_cache_update to set the content for the item. A second item is
passed which should hold the content. If the item found by _lookup
has valid data, then it is discarded and a new item is created. This
saves any user of an item from worrying about content changing while
it is being inspected. If the item found by _lookup does not contain
valid data, then the content is copied across and CACHE_VALID is set.
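
The three cache_check results described above map naturally onto three
outcomes for the request being processed. The following sketch shows
one way a caller might express that mapping; the enum and the function
name are hypothetical, and treating unknown values as a denial is an
assumption of this example.

```c
#include <errno.h>

/* What to do with a request after cache_check (hypothetical type). */
enum req_action { REQ_PROCEED, REQ_DENY, REQ_DEFER };

static enum req_action action_for_cache_check(int rv)
{
    switch (rv) {
    case 0:             /* data is valid: use the cache item */
        return REQ_PROCEED;
    case -EAGAIN:       /* upcall pending: wait for user-space */
        return REQ_DEFER;
    case -ENOENT:       /* negative entry, or no upcall possible */
    default:            /* assumption: anything else is a failure */
        return REQ_DENY;
    }
}
```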

Populating a cache
------------------

Each cache has a name, and when the cache is registered, a directory
with that name is created in /proc/net/rpc.

This directory contains a file called 'channel' which is a channel
for communicating between kernel and user for populating the cache.
This directory may later contain other files for interacting
with the cache.

The 'channel' works a bit like a datagram socket. Each 'write' is
passed as a whole to the cache for parsing and interpretation.
Each cache can treat the write requests differently, but it is
expected that a message written will contain:
  - a key
  - an expiry time
  - a content.
with the intention that an item in the cache with the given key
should be created or updated to have the given content, and the
expiry time should be set on that item.

Reading from a channel is a bit more interesting. When a cache
lookup fails, or when it succeeds but finds an entry that may soon
expire, a request is lodged for that cache item to be updated by
user-space. These requests appear in the channel file.

Successive reads will return successive requests.
If there are no more requests to return, read will return EOF, but a
select or poll for read will block waiting for another request to be
added.

Thus a user-space helper is likely to:
  open the channel.
    select for readable
    read a request
    write a response
  loop.
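
One iteration of the helper loop outlined above might look like the
following sketch, which uses poll(2) in place of select(2). The
function names, the buffer sizes and the response produced by
example_answer are all invented for this example; a real helper must
use the message format of the particular cache it serves.

```c
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical callback: build a response to the NUL-terminated request
 * 'req' in 'resp'. Returns the response length, or -1 on failure. */
typedef int (*answer_fn)(const char *req, char *resp, size_t resp_cap);

/* Illustrative only: echo the key with a made-up expiry and content. */
static int example_answer(const char *req, char *resp, size_t resp_cap)
{
    int key_len = (int)strcspn(req, "\n");
    int n = snprintf(resp, resp_cap, "%.*s 2000000000 example\n",
                     key_len, req);

    return (n < 0 || (size_t)n >= resp_cap) ? -1 : n;
}

/*
 * Wait for one request on an open channel fd, read it, and write one
 * response. Returns 1 if a request was answered, 0 if none was
 * available (timeout or EOF), -1 on error.
 */
static int serve_one_request(int fd, answer_fn answer, int timeout_ms)
{
    char req[4096], resp[4096];
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    ssize_t n;
    int len, ready;

    ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;

    n = read(fd, req, sizeof(req) - 1);
    if (n < 0)
        return -1;
    if (n == 0)
        return 0;               /* EOF: no request pending */
    req[n] = '\0';

    len = answer(req, resp, sizeof(resp));
    if (len < 0)
        return -1;

    /* Each write is passed to the cache as one whole message. */
    if (write(fd, resp, (size_t)len) != len)
        return -1;
    return 1;
}
```

A helper would open the channel file for reading and writing and call
serve_one_request() in a loop.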

If it dies and needs to be restarted, any requests that have not been
answered will still appear in the file and will be read by the new
instance of the helper.

Each cache should define a "cache_parse" method which takes a message
written from user-space and processes it. It should return an error
(which propagates back to the write syscall) or 0.

Each cache should also define a "cache_request" method which
takes a cache item and encodes a request into the buffer
provided.

Note: If a cache has no active readers on the channel, and has had no
active readers for more than 60 seconds, further requests will not be
added to the channel but instead all lookups that do not find a valid
entry will fail. This is partly for backward compatibility: The
previous nfs exports table was deemed to be authoritative and a
failed lookup meant a definite 'no'.

request/response format
-----------------------

While each cache is free to use its own format for requests
and responses over the channel, the following is recommended as
appropriate and support routines are available to help:
Each request or response record should be printable ASCII
with precisely one newline character which should be at the end.
Fields within the record should be separated by spaces, normally one.
If spaces, newlines, or nul characters are needed in a field they
must be quoted. Two mechanisms are available:
1/ If a field begins '\x' then it must contain an even number of
   hex digits, and pairs of these digits provide the bytes in the
   field.
2/ Otherwise a \ in the field must be followed by 3 octal digits
   which give the code for a byte. Other characters are treated
   as themselves. At the very least, space, newline, nul, and
   '\' must be quoted in this way.
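
The two quoting mechanisms above can be illustrated with a small
user-space encoder and decoder. This is a sketch written for this
document, not the kernel's support routines; the function names and
the error conventions are assumptions.

```c
#include <stddef.h>

/* Quote len bytes of src using \ooo for space, newline, nul and '\'.
 * Returns the number of bytes written to dst, or -1 if dst is too small. */
static int quote_field(const unsigned char *src, size_t len,
                       char *dst, size_t cap)
{
    size_t o = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];

        if (c == ' ' || c == '\n' || c == '\0' || c == '\\') {
            if (o + 4 > cap)
                return -1;
            dst[o++] = '\\';
            dst[o++] = (char)('0' + ((c >> 6) & 3));
            dst[o++] = (char)('0' + ((c >> 3) & 7));
            dst[o++] = (char)('0' + (c & 7));
        } else {
            if (o + 1 > cap)
                return -1;
            dst[o++] = (char)c;
        }
    }
    return (int)o;
}

static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode one quoted field of len bytes. Returns the number of decoded
 * bytes, or -1 if the field is malformed or dst is too small. */
static int unquote_field(const char *src, size_t len,
                         unsigned char *dst, size_t cap)
{
    size_t o = 0, i = 0;

    if (len >= 2 && src[0] == '\\' && src[1] == 'x') {
        if ((len - 2) % 2 != 0)         /* need whole pairs of digits */
            return -1;
        for (i = 2; i < len; i += 2) {
            int hi = hexval(src[i]), lo = hexval(src[i + 1]);

            if (hi < 0 || lo < 0 || o >= cap)
                return -1;
            dst[o++] = (unsigned char)((hi << 4) | lo);
        }
        return (int)o;
    }
    while (i < len) {
        if (o >= cap)
            return -1;
        if (src[i] == '\\') {
            int v = 0;

            if (i + 3 >= len)           /* need 3 octal digits */
                return -1;
            for (int k = 1; k <= 3; k++) {
                if (src[i + k] < '0' || src[i + k] > '7')
                    return -1;
                v = v * 8 + (src[i + k] - '0');
            }
            if (v > 255)
                return -1;
            dst[o++] = (unsigned char)v;
            i += 4;
        } else {
            dst[o++] = (unsigned char)src[i++];
        }
    }
    return (int)o;
}
```

For example, the five bytes "a b\c" are quoted as "a\040b\134c", and
the field "\x6120" decodes to the two bytes 'a' and space.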