author | Andrea Bastoni <bastoni@cs.unc.edu> | 2010-05-30 19:16:45 -0400 |
---|---|---|
committer | Andrea Bastoni <bastoni@cs.unc.edu> | 2010-05-30 19:16:45 -0400 |
commit | ada47b5fe13d89735805b566185f4885f5a3f750 (patch) | |
tree | 644b88f8a71896307d71438e9b3af49126ffb22b /Documentation/filesystems | |
parent | 43e98717ad40a4ae64545b5ba047c7b86aa44f4f (diff) | |
parent | 3280f21d43ee541f97f8cda5792150d2dbec20d5 (diff) |
Merge branch 'wip-2.6.34' into old-private-master
Diffstat (limited to 'Documentation/filesystems')
28 files changed, 610 insertions, 110 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index f15621ee5599..4303614b5add 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX | |||
@@ -1,7 +1,5 @@ | |||
1 | 00-INDEX | 1 | 00-INDEX |
2 | - this file (info on some of the filesystems supported by linux). | 2 | - this file (info on some of the filesystems supported by linux). |
3 | Exporting | ||
4 | - explanation of how to make filesystems exportable. | ||
5 | Locking | 3 | Locking |
6 | - info on locking rules as they pertain to Linux VFS. | 4 | - info on locking rules as they pertain to Linux VFS. |
7 | 9p.txt | 5 | 9p.txt |
@@ -18,6 +16,8 @@ befs.txt | |||
18 | - information about the BeOS filesystem for Linux. | 16 | - information about the BeOS filesystem for Linux. |
19 | bfs.txt | 17 | bfs.txt |
20 | - info for the SCO UnixWare Boot Filesystem (BFS). | 18 | - info for the SCO UnixWare Boot Filesystem (BFS). |
19 | ceph.txt | ||
20 | - info for the Ceph Distributed File System | ||
21 | cifs.txt | 21 | cifs.txt |
22 | - description of the CIFS filesystem. | 22 | - description of the CIFS filesystem. |
23 | coda.txt | 23 | coda.txt |
@@ -34,8 +34,12 @@ dlmfs.txt | |||
34 | - info on the userspace interface to the OCFS2 DLM. | 34 | - info on the userspace interface to the OCFS2 DLM. |
35 | dnotify.txt | 35 | dnotify.txt |
36 | - info about directory notification in Linux. | 36 | - info about directory notification in Linux. |
37 | dnotify_test.c | ||
38 | - example program for dnotify | ||
37 | ecryptfs.txt | 39 | ecryptfs.txt |
38 | - docs on eCryptfs: stacked cryptographic filesystem for Linux. | 40 | - docs on eCryptfs: stacked cryptographic filesystem for Linux. |
41 | exofs.txt | ||
42 | - info, usage, mount options, design about EXOFS. | ||
39 | ext2.txt | 43 | ext2.txt |
40 | - info, mount options and specifications for the Ext2 filesystem. | 44 | - info, mount options and specifications for the Ext2 filesystem. |
41 | ext3.txt | 45 | ext3.txt |
@@ -62,16 +66,14 @@ jfs.txt | |||
62 | - info and mount options for the JFS filesystem. | 66 | - info and mount options for the JFS filesystem. |
63 | locks.txt | 67 | locks.txt |
64 | - info on file locking implementations, flock() vs. fcntl(), etc. | 68 | - info on file locking implementations, flock() vs. fcntl(), etc. |
69 | logfs.txt | ||
70 | - info on the LogFS flash filesystem. | ||
65 | mandatory-locking.txt | 71 | mandatory-locking.txt |
66 | - info on the Linux implementation of Sys V mandatory file locking. | 72 | - info on the Linux implementation of Sys V mandatory file locking. |
67 | ncpfs.txt | 73 | ncpfs.txt |
68 | - info on Novell Netware(tm) filesystem using NCP protocol. | 74 | - info on Novell Netware(tm) filesystem using NCP protocol. |
69 | nfs41-server.txt | 75 | nfs/ |
70 | - info on the Linux server implementation of NFSv4 minor version 1. | 76 | - nfs-related documentation. |
71 | nfs-rdma.txt | ||
72 | - how to install and setup the Linux NFS/RDMA client and server software. | ||
73 | nfsroot.txt | ||
74 | - short guide on setting up a diskless box with NFS root filesystem. | ||
75 | nilfs2.txt | 77 | nilfs2.txt |
76 | - info and mount options for the NILFS2 filesystem. | 78 | - info and mount options for the NILFS2 filesystem. |
77 | ntfs.txt | 79 | ntfs.txt |
@@ -90,8 +92,6 @@ relay.txt | |||
90 | - info on relay, for efficient streaming from kernel to user space. | 92 | - info on relay, for efficient streaming from kernel to user space. |
91 | romfs.txt | 93 | romfs.txt |
92 | - description of the ROMFS filesystem. | 94 | - description of the ROMFS filesystem. |
93 | rpc-cache.txt | ||
94 | - introduction to the caching mechanisms in the sunrpc layer. | ||
95 | seq_file.txt | 95 | seq_file.txt |
96 | - how to use the seq_file API | 96 | - how to use the seq_file API |
97 | sharedsubtree.txt | 97 | sharedsubtree.txt |
diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt index 57e0b80a5274..c0236e753bc8 100644 --- a/Documentation/filesystems/9p.txt +++ b/Documentation/filesystems/9p.txt | |||
@@ -37,6 +37,15 @@ For Plan 9 From User Space applications (http://swtch.com/plan9) | |||
37 | 37 | ||
38 | mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER | 38 | mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER |
39 | 39 | ||
40 | For server running on QEMU host with virtio transport: | ||
41 | |||
42 | mount -t 9p -o trans=virtio <mount_tag> /mnt/9 | ||
43 | |||
44 | where mount_tag is the tag associated by the server to each of the exported | ||
45 | mount points. Each 9P export is seen by the client as a virtio device with an | ||
46 | associated "mount_tag" property. Available mount tags can be | ||
47 | seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files. | ||
48 | |||
40 | OPTIONS | 49 | OPTIONS |
41 | ======= | 50 | ======= |
42 | 51 | ||
@@ -47,7 +56,7 @@ OPTIONS | |||
47 | fd - use passed file descriptors for connection | 56 | fd - use passed file descriptors for connection |
48 | (see rfdno and wfdno) | 57 | (see rfdno and wfdno) |
49 | virtio - connect to the next virtio channel available | 58 | virtio - connect to the next virtio channel available |
50 | (from lguest or KVM with trans_virtio module) | 59 | (from QEMU with trans_virtio module) |
51 | rdma - connect to a specified RDMA channel | 60 | rdma - connect to a specified RDMA channel |
52 | 61 | ||
53 | uname=name user name to attempt mount as on the remote server. The | 62 | uname=name user name to attempt mount as on the remote server. The |
@@ -85,7 +94,12 @@ OPTIONS | |||
85 | 94 | ||
86 | port=n port to connect to on the remote server | 95 | port=n port to connect to on the remote server |
87 | 96 | ||
88 | noextend force legacy mode (no 9p2000.u semantics) | 97 | noextend force legacy mode (no 9p2000.u or 9p2000.L semantics) |
98 | |||
99 | version=name Select 9P protocol version. Valid options are: | ||
100 | 9p2000 - Legacy mode (same as noextend) | ||
101 | 9p2000.u - Use 9P2000.u protocol | ||
102 | 9p2000.L - Use 9P2000.L protocol | ||
89 | 103 | ||
90 | dfltuid attempt to mount as a particular uid | 104 | dfltuid attempt to mount as a particular uid |
91 | 105 | ||
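The version= option added in the hunk above accepts three strings, with 9p2000 documented as equivalent to noextend. A minimal C sketch of that mapping follows. The enum and function names are hypothetical, chosen for illustration only, and exact case-sensitive matching is an assumption rather than something the documentation states.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for illustration; not the kernel's own identifiers. */
enum p9_version_sketch {
	P9_SKETCH_LEGACY,	/* "9p2000": legacy mode, same as noextend */
	P9_SKETCH_DOTU,		/* "9p2000.u" */
	P9_SKETCH_DOTL,		/* "9p2000.L" */
	P9_SKETCH_INVALID	/* anything else */
};

/* Map the value of a version= mount option to a protocol variant. */
enum p9_version_sketch parse_9p_version(const char *name)
{
	assert(name != NULL);

	/* Assumption: the strings are matched exactly, including case. */
	if (strcmp(name, "9p2000") == 0)
		return P9_SKETCH_LEGACY;
	if (strcmp(name, "9p2000.u") == 0)
		return P9_SKETCH_DOTU;
	if (strcmp(name, "9p2000.L") == 0)
		return P9_SKETCH_DOTL;
	return P9_SKETCH_INVALID;
}
```

A caller would pass the text after "version=" and reject the mount when the result is P9_SKETCH_INVALID.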
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 18b9d0ca0630..06bbbed71206 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking | |||
@@ -460,13 +460,6 @@ in sys_read() and friends. | |||
460 | 460 | ||
461 | --------------------------- dquot_operations ------------------------------- | 461 | --------------------------- dquot_operations ------------------------------- |
462 | prototypes: | 462 | prototypes: |
463 | int (*initialize) (struct inode *, int); | ||
464 | int (*drop) (struct inode *); | ||
465 | int (*alloc_space) (struct inode *, qsize_t, int); | ||
466 | int (*alloc_inode) (const struct inode *, unsigned long); | ||
467 | int (*free_space) (struct inode *, qsize_t); | ||
468 | int (*free_inode) (const struct inode *, unsigned long); | ||
469 | int (*transfer) (struct inode *, struct iattr *); | ||
470 | int (*write_dquot) (struct dquot *); | 463 | int (*write_dquot) (struct dquot *); |
471 | int (*acquire_dquot) (struct dquot *); | 464 | int (*acquire_dquot) (struct dquot *); |
472 | int (*release_dquot) (struct dquot *); | 465 | int (*release_dquot) (struct dquot *); |
@@ -479,13 +472,6 @@ a proper locking wrt the filesystem and call the generic quota operations. | |||
479 | What filesystem should expect from the generic quota functions: | 472 | What filesystem should expect from the generic quota functions: |
480 | 473 | ||
481 | FS recursion Held locks when called | 474 | FS recursion Held locks when called |
482 | initialize: yes maybe dqonoff_sem | ||
483 | drop: yes - | ||
484 | alloc_space: ->mark_dirty() - | ||
485 | alloc_inode: ->mark_dirty() - | ||
486 | free_space: ->mark_dirty() - | ||
487 | free_inode: ->mark_dirty() - | ||
488 | transfer: yes - | ||
489 | write_dquot: yes dqonoff_sem or dqptr_sem | 475 | write_dquot: yes dqonoff_sem or dqptr_sem |
490 | acquire_dquot: yes dqonoff_sem or dqptr_sem | 476 | acquire_dquot: yes dqonoff_sem or dqptr_sem |
491 | release_dquot: yes dqonoff_sem or dqptr_sem | 477 | release_dquot: yes dqonoff_sem or dqptr_sem |
@@ -495,10 +481,6 @@ write_info: yes dqonoff_sem | |||
495 | FS recursion means calling ->quota_read() and ->quota_write() from superblock | 481 | FS recursion means calling ->quota_read() and ->quota_write() from superblock |
496 | operations. | 482 | operations. |
497 | 483 | ||
498 | ->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called | ||
499 | only directly by the filesystem and do not call any fs functions only | ||
500 | the ->mark_dirty() operation. | ||
501 | |||
502 | More details about quota locking can be found in fs/dquot.c. | 484 | More details about quota locking can be found in fs/dquot.c. |
503 | 485 | ||
504 | --------------------------- vm_operations_struct ----------------------------- | 486 | --------------------------- vm_operations_struct ----------------------------- |
diff --git a/Documentation/filesystems/Makefile b/Documentation/filesystems/Makefile new file mode 100644 index 000000000000..a5dd114da14f --- /dev/null +++ b/Documentation/filesystems/Makefile | |||
@@ -0,0 +1,8 @@ | |||
1 | # kbuild trick to avoid linker error. Can be omitted if a module is built. | ||
2 | obj- := dummy.o | ||
3 | |||
4 | # List of programs to build | ||
5 | hostprogs-y := dnotify_test | ||
6 | |||
7 | # Tell kbuild to always build the programs | ||
8 | always := $(hostprogs-y) | ||
diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt new file mode 100644 index 000000000000..0660c9f5deef --- /dev/null +++ b/Documentation/filesystems/ceph.txt | |||
@@ -0,0 +1,140 @@ | |||
1 | Ceph Distributed File System | ||
2 | ============================ | ||
3 | |||
4 | Ceph is a distributed network file system designed to provide good | ||
5 | performance, reliability, and scalability. | ||
6 | |||
7 | Basic features include: | ||
8 | |||
9 | * POSIX semantics | ||
10 | * Seamless scaling from 1 to many thousands of nodes | ||
11 | * High availability and reliability. No single point of failure. | ||
12 | * N-way replication of data across storage nodes | ||
13 | * Fast recovery from node failures | ||
14 | * Automatic rebalancing of data on node addition/removal | ||
15 | * Easy deployment: most FS components are userspace daemons | ||
16 | |||
17 | Also, | ||
18 | * Flexible snapshots (on any directory) | ||
19 | * Recursive accounting (nested files, directories, bytes) | ||
20 | |||
21 | In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely | ||
22 | on symmetric access by all clients to shared block devices, Ceph | ||
23 | separates data and metadata management into independent server | ||
24 | clusters, similar to Lustre. Unlike Lustre, however, metadata and | ||
25 | storage nodes run entirely as user space daemons. Storage nodes | ||
26 | utilize btrfs to store data objects, leveraging its advanced features | ||
27 | (checksumming, metadata replication, etc.). File data is striped | ||
28 | across storage nodes in large chunks to distribute workload and | ||
29 | facilitate high throughputs. When storage nodes fail, data is | ||
30 | re-replicated in a distributed fashion by the storage nodes themselves | ||
31 | (with some minimal coordination from a cluster monitor), making the | ||
32 | system extremely efficient and scalable. | ||
33 | |||
34 | Metadata servers effectively form a large, consistent, distributed | ||
35 | in-memory cache above the file namespace that is extremely scalable, | ||
36 | dynamically redistributes metadata in response to workload changes, | ||
37 | and can tolerate arbitrary (well, non-Byzantine) node failures. The | ||
38 | metadata server takes a somewhat unconventional approach to metadata | ||
39 | storage to significantly improve performance for common workloads. In | ||
40 | particular, inodes with only a single link are embedded in | ||
41 | directories, allowing entire directories of dentries and inodes to be | ||
42 | loaded into its cache with a single I/O operation. The contents of | ||
43 | extremely large directories can be fragmented and managed by | ||
44 | independent metadata servers, allowing scalable concurrent access. | ||
45 | |||
46 | The system offers automatic data rebalancing/migration when scaling | ||
47 | from a small cluster of just a few nodes to many hundreds, without | ||
48 | requiring an administrator to carve the data set into static volumes or | ||
49 | go through the tedious process of migrating data between servers. | ||
50 | When the file system approaches full, new nodes can be easily added | ||
51 | and things will "just work." | ||
52 | |||
53 | Ceph includes a flexible snapshot mechanism that allows a user to create | ||
54 | a snapshot on any subdirectory (and its nested contents) in the | ||
55 | system. Snapshot creation and deletion are as simple as 'mkdir | ||
56 | .snap/foo' and 'rmdir .snap/foo'. | ||
57 | |||
58 | Ceph also provides some recursive accounting on directories for nested | ||
59 | files and bytes. That is, a 'getfattr -d foo' on any directory in the | ||
60 | system will reveal the total number of nested regular files and | ||
61 | subdirectories, and a summation of all nested file sizes. This makes | ||
62 | the identification of large disk space consumers relatively quick, as | ||
63 | no 'du' or similar recursive scan of the file system is required. | ||
64 | |||
65 | |||
66 | Mount Syntax | ||
67 | ============ | ||
68 | |||
69 | The basic mount syntax is: | ||
70 | |||
71 | # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt | ||
72 | |||
73 | You only need to specify a single monitor, as the client will get the | ||
74 | full list when it connects. (However, if the monitor you specify | ||
75 | happens to be down, the mount won't succeed.) The port can be left | ||
76 | off if the monitor is using the default. So if the monitor is at | ||
77 | 1.2.3.4, | ||
78 | |||
79 | # mount -t ceph 1.2.3.4:/ /mnt/ceph | ||
80 | |||
81 | is sufficient. If /sbin/mount.ceph is installed, a hostname can be | ||
82 | used instead of an IP address. | ||
83 | |||
84 | |||
85 | |||
86 | Mount Options | ||
87 | ============= | ||
88 | |||
89 | ip=A.B.C.D[:N] | ||
90 | Specify the IP and/or port the client should bind to locally. | ||
91 | There is normally not much reason to do this. If the IP is not | ||
92 | specified, the client's IP address is determined by looking at the | ||
93 | address its connection to the monitor originates from. | ||
94 | |||
95 | wsize=X | ||
96 | Specify the maximum write size in bytes. By default there is no | ||
97 | maximum. Ceph will normally size writes based on the file stripe | ||
98 | size. | ||
99 | |||
100 | rsize=X | ||
101 | Specify the maximum readahead. | ||
102 | |||
103 | mount_timeout=X | ||
104 | Specify the timeout value for mount (in seconds), in the case | ||
105 | of a non-responsive Ceph file system. The default is 30 | ||
106 | seconds. | ||
107 | |||
108 | rbytes | ||
109 | When stat() is called on a directory, set st_size to 'rbytes', | ||
110 | the summation of file sizes over all files nested beneath that | ||
111 | directory. This is the default. | ||
112 | |||
113 | norbytes | ||
114 | When stat() is called on a directory, set st_size to the | ||
115 | number of entries in that directory. | ||
116 | |||
117 | nocrc | ||
118 | Disable CRC32C calculation for data writes. If set, the storage node | ||
119 | must rely on TCP's error correction to detect data corruption | ||
120 | in the data payload. | ||
121 | |||
122 | noasyncreaddir | ||
123 | Disable the client's use of its local cache to satisfy readdir | ||
124 | requests. (This does not change correctness; the client uses | ||
125 | cached metadata only when a lease or capability ensures it is | ||
126 | valid.) | ||
127 | |||
128 | |||
129 | More Information | ||
130 | ================ | ||
131 | |||
132 | For more information on Ceph, see the home page at | ||
133 | http://ceph.newdream.net/ | ||
134 | |||
135 | The Linux kernel client source tree is available at | ||
136 | git://ceph.newdream.net/git/ceph-client.git | ||
137 | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git | ||
138 | |||
139 | and the source for the full system is at | ||
140 | git://ceph.newdream.net/git/ceph.git | ||
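The mount syntax in ceph.txt, monip[:port][,monip2[:port]...]:/[subdir], separates the monitor list from the path at the first ":/". The C sketch below shows one way to split such a string. The function names are hypothetical, this is not how mount.ceph or the kernel client is known to parse the source, and only IPv4-style addresses as used in the documentation's example are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the path part ("/" or "/subdir") of a Ceph mount
 * source, or NULL if the string has no ":/" or no monitor before it.
 * Hypothetical helper; assumes IPv4-style monitor addresses.
 */
const char *ceph_source_path(const char *src)
{
	const char *sep;

	assert(src != NULL);
	sep = strstr(src, ":/");	/* first ":/" ends the monitor list */
	if (sep == NULL || sep == src)
		return NULL;
	return sep + 1;			/* keep the leading '/' */
}

/* Length of the monitor list that precedes ":/", or 0 if malformed. */
size_t ceph_monitor_list_len(const char *src)
{
	const char *path = ceph_source_path(src);

	return path ? (size_t)(path - src) - 1 : 0;
}
```

For the documentation's own example, "1.2.3.4:/", the path part is "/" and the monitor list is the first 7 characters.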
diff --git a/Documentation/filesystems/dentry-locking.txt b/Documentation/filesystems/dentry-locking.txt index 4c0c575a4012..79334ed5daa7 100644 --- a/Documentation/filesystems/dentry-locking.txt +++ b/Documentation/filesystems/dentry-locking.txt | |||
@@ -62,7 +62,8 @@ changes are : | |||
62 | 2. Insertion of a dentry into the hash table is done using | 62 | 2. Insertion of a dentry into the hash table is done using |
63 | hlist_add_head_rcu() which takes care of ordering the writes - the | 63 | hlist_add_head_rcu() which takes care of ordering the writes - the |
64 | writes to the dentry must be visible before the dentry is | 64 | writes to the dentry must be visible before the dentry is |
65 | inserted. This works in conjunction with hlist_for_each_rcu() while | 65 | inserted. This works in conjunction with hlist_for_each_rcu(), |
66 | which has since been replaced by hlist_for_each_entry_rcu(), while | ||
66 | walking the hash chain. The only requirement is that all | 67 | walking the hash chain. The only requirement is that all |
67 | initialization to the dentry must be done before | 68 | initialization to the dentry must be done before |
68 | hlist_add_head_rcu() since we don't have dcache_lock protection | 69 | hlist_add_head_rcu() since we don't have dcache_lock protection |
diff --git a/Documentation/filesystems/dnotify.txt b/Documentation/filesystems/dnotify.txt index 9f5d338ddbb8..6baf88f46859 100644 --- a/Documentation/filesystems/dnotify.txt +++ b/Documentation/filesystems/dnotify.txt | |||
@@ -62,38 +62,9 @@ disabled, fcntl(fd, F_NOTIFY, ...) will return -EINVAL. | |||
62 | 62 | ||
63 | Example | 63 | Example |
64 | ------- | 64 | ------- |
65 | See Documentation/filesystems/dnotify_test.c for an example. | ||
65 | 66 | ||
66 | #define _GNU_SOURCE /* needed to get the defines */ | 67 | NOTE |
67 | #include <fcntl.h> /* in glibc 2.2 this has the needed | 68 | ---- |
68 | values defined */ | 69 | Beginning with Linux 2.6.13, dnotify has been replaced by inotify. |
69 | #include <signal.h> | 70 | See Documentation/filesystems/inotify.txt for more information on it. |
70 | #include <stdio.h> | ||
71 | #include <unistd.h> | ||
72 | |||
73 | static volatile int event_fd; | ||
74 | |||
75 | static void handler(int sig, siginfo_t *si, void *data) | ||
76 | { | ||
77 | event_fd = si->si_fd; | ||
78 | } | ||
79 | |||
80 | int main(void) | ||
81 | { | ||
82 | struct sigaction act; | ||
83 | int fd; | ||
84 | |||
85 | act.sa_sigaction = handler; | ||
86 | sigemptyset(&act.sa_mask); | ||
87 | act.sa_flags = SA_SIGINFO; | ||
88 | sigaction(SIGRTMIN + 1, &act, NULL); | ||
89 | |||
90 | fd = open(".", O_RDONLY); | ||
91 | fcntl(fd, F_SETSIG, SIGRTMIN + 1); | ||
92 | fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT); | ||
93 | /* we will now be notified if any of the files | ||
94 | in "." is modified or new files are created */ | ||
95 | while (1) { | ||
96 | pause(); | ||
97 | printf("Got event on fd=%d\n", event_fd); | ||
98 | } | ||
99 | } | ||
diff --git a/Documentation/filesystems/dnotify_test.c b/Documentation/filesystems/dnotify_test.c new file mode 100644 index 000000000000..8b37b4a1e18d --- /dev/null +++ b/Documentation/filesystems/dnotify_test.c | |||
@@ -0,0 +1,34 @@ | |||
1 | #define _GNU_SOURCE /* needed to get the defines */ | ||
2 | #include <fcntl.h> /* in glibc 2.2 this has the needed | ||
3 | values defined */ | ||
4 | #include <signal.h> | ||
5 | #include <stdio.h> | ||
6 | #include <unistd.h> | ||
7 | |||
8 | static volatile int event_fd; | ||
9 | |||
10 | static void handler(int sig, siginfo_t *si, void *data) | ||
11 | { | ||
12 | event_fd = si->si_fd; | ||
13 | } | ||
14 | |||
15 | int main(void) | ||
16 | { | ||
17 | struct sigaction act; | ||
18 | int fd; | ||
19 | |||
20 | act.sa_sigaction = handler; | ||
21 | sigemptyset(&act.sa_mask); | ||
22 | act.sa_flags = SA_SIGINFO; | ||
23 | sigaction(SIGRTMIN + 1, &act, NULL); | ||
24 | |||
25 | fd = open(".", O_RDONLY); | ||
26 | fcntl(fd, F_SETSIG, SIGRTMIN + 1); | ||
27 | fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT); | ||
28 | /* we will now be notified if any of the files | ||
29 | in "." is modified or new files are created */ | ||
30 | while (1) { | ||
31 | pause(); | ||
32 | printf("Got event on fd=%d\n", event_fd); | ||
33 | } | ||
34 | } | ||
diff --git a/Documentation/filesystems/exofs.txt b/Documentation/filesystems/exofs.txt index 0ced74c2f73c..abd2a9b5b787 100644 --- a/Documentation/filesystems/exofs.txt +++ b/Documentation/filesystems/exofs.txt | |||
@@ -60,13 +60,13 @@ USAGE | |||
60 | 60 | ||
61 | mkfs.exofs --pid=65536 --format /dev/osd0 | 61 | mkfs.exofs --pid=65536 --format /dev/osd0 |
62 | 62 | ||
63 | The --format is optional if not specified no OSD_FORMAT will be | 63 | The --format is optional. If not specified, no OSD_FORMAT will be |
64 | preformed and a clean file system will be created in the specified pid, | 64 | performed and a clean file system will be created in the specified pid, |
65 | in the available space of the target. (Use --format=size_in_meg to limit | 65 | in the available space of the target. (Use --format=size_in_meg to limit |
66 | the total LUN space available) | 66 | the total LUN space available) |
67 | 67 | ||
68 | If pid already exist it will be deleted and a new one will be created in it's | 68 | If pid already exists, it will be deleted and a new one will be created in |
69 | place. Be careful. | 69 | its place. Be careful. |
70 | 70 | ||
71 | An exofs lives inside a single OSD partition. You can create multiple exofs | 71 | An exofs lives inside a single OSD partition. You can create multiple exofs |
72 | filesystems on the same device using multiple pids. | 72 | filesystems on the same device using multiple pids. |
@@ -81,7 +81,7 @@ USAGE | |||
81 | 81 | ||
82 | 7. For reference (See do-exofs example script): | 82 | 7. For reference (See do-exofs example script): |
83 | do-exofs start - an example of how to perform the above steps. | 83 | do-exofs start - an example of how to perform the above steps. |
84 | do-exofs stop - an example of how to unmount the file system. | 84 | do-exofs stop - an example of how to unmount the file system. |
85 | do-exofs format - an example of how to format and mkfs a new exofs. | 85 | do-exofs format - an example of how to format and mkfs a new exofs. |
86 | 86 | ||
87 | 8. Extra compilation flags (uncomment in fs/exofs/Kbuild): | 87 | 8. Extra compilation flags (uncomment in fs/exofs/Kbuild): |
@@ -104,8 +104,8 @@ Where: | |||
104 | exofs specific options: Options are separated by commas (,) | 104 | exofs specific options: Options are separated by commas (,) |
105 | pid=<integer> - The partition number to mount/create as | 105 | pid=<integer> - The partition number to mount/create as |
106 | container of the filesystem. | 106 | container of the filesystem. |
107 | This option is mandatory | 107 | This option is mandatory. |
108 | to=<integer> - Timeout in ticks for a single command | 108 | to=<integer> - Timeout in ticks for a single command. |
109 | default is (60 * HZ) [for debugging only] | 109 | default is (60 * HZ) [for debugging only] |
110 | 110 | ||
111 | =============================================================================== | 111 | =============================================================================== |
@@ -116,7 +116,7 @@ DESIGN | |||
116 | with a special ID (defined in common.h). | 116 | with a special ID (defined in common.h). |
117 | Information included in the file system control block is used to fill the | 117 | Information included in the file system control block is used to fill the |
118 | in-memory superblock structure at mount time. This object is created before | 118 | in-memory superblock structure at mount time. This object is created before |
119 | the file system is used by mkexofs.c It contains information such as: | 119 | the file system is used by mkexofs.c. It contains information such as: |
120 | - The file system's magic number | 120 | - The file system's magic number |
121 | - The next inode number to be allocated | 121 | - The next inode number to be allocated |
122 | 122 | ||
@@ -134,8 +134,8 @@ DESIGN | |||
134 | attributes. This applies to both regular files and other types (directories, | 134 | attributes. This applies to both regular files and other types (directories, |
135 | device files, symlinks, etc.). | 135 | device files, symlinks, etc.). |
136 | 136 | ||
137 | * Credentials are generated per object (inode and superblock) when they is | 137 | * Credentials are generated per object (inode and superblock) when they are |
138 | created in memory (read off disk or created). The credential works for all | 138 | created in memory (read from disk or created). The credential works for all |
139 | operations and is used as long as the object remains in memory. | 139 | operations and is used as long as the object remains in memory. |
140 | 140 | ||
141 | * Async OSD operations are used whenever possible, but the target may execute | 141 | * Async OSD operations are used whenever possible, but the target may execute |
@@ -145,7 +145,8 @@ DESIGN | |||
145 | from executing in reverse order: | 145 | from executing in reverse order: |
146 | - The following are handled with the OBJ_CREATED and OBJ_2BCREATED | 146 | - The following are handled with the OBJ_CREATED and OBJ_2BCREATED |
147 | flags. OBJ_CREATED is set when we know the object exists on the OSD - | 147 | flags. OBJ_CREATED is set when we know the object exists on the OSD - |
148 | in create's callback function, and when we successfully do a read_inode. | 148 | in create's callback function, and when we successfully do a |
149 | read_inode. | ||
149 | OBJ_2BCREATED is set in the beginning of the create function, so we | 150 | OBJ_2BCREATED is set in the beginning of the create function, so we |
150 | know that we should wait. | 151 | know that we should wait. |
151 | - create/delete: delete should wait until the object is created | 152 | - create/delete: delete should wait until the object is created |
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 05d5cf1d743f..867c5b50cb42 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt | |||
@@ -32,8 +32,8 @@ journal_dev=devnum When the external journal device's major/minor numbers | |||
32 | identified through its new major/minor numbers encoded | 32 | identified through its new major/minor numbers encoded |
33 | in devnum. | 33 | in devnum. |
34 | 34 | ||
35 | noload Don't load the journal on mounting. Note that this forces | 35 | norecovery Don't load the journal on mounting. Note that this forces |
36 | mount of inconsistent filesystem, which can lead to | 36 | noload mount of inconsistent filesystem, which can lead to |
37 | various problems. | 37 | various problems. |
38 | 38 | ||
39 | data=journal All data are committed into the journal prior to being | 39 | data=journal All data are committed into the journal prior to being |
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 6d94e0696f8c..e1def1786e50 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt | |||
@@ -153,8 +153,8 @@ journal_dev=devnum When the external journal device's major/minor numbers | |||
153 | identified through its new major/minor numbers encoded | 153 | identified through its new major/minor numbers encoded |
154 | in devnum. | 154 | in devnum. |
155 | 155 | ||
156 | noload Don't load the journal on mounting. Note that | 156 | norecovery Don't load the journal on mounting. Note that |
157 | if the filesystem was not unmounted cleanly, | 157 | noload if the filesystem was not unmounted cleanly, |
158 | skipping the journal replay will lead to the | 158 | skipping the journal replay will lead to the |
159 | filesystem containing inconsistencies that can | 159 | filesystem containing inconsistencies that can |
160 | lead to any number of problems. | 160 | lead to any number of problems. |
@@ -196,7 +196,7 @@ nobarrier This also requires an IO stack which can support | |||
196 | also be used to enable or disable barriers, for | 196 | also be used to enable or disable barriers, for |
197 | consistency with other ext4 mount options. | 197 | consistency with other ext4 mount options. |
198 | 198 | ||
199 | inode_readahead=n This tuning parameter controls the maximum | 199 | inode_readahead_blks=n This tuning parameter controls the maximum |
200 | number of inode table blocks that ext4's inode | 200 | number of inode table blocks that ext4's inode |
201 | table readahead algorithm will pre-read into | 201 | table readahead algorithm will pre-read into |
202 | the buffer cache. The default value is 32 blocks. | 202 | the buffer cache. The default value is 32 blocks. |
@@ -353,6 +353,12 @@ noauto_da_alloc replacing existing files via patterns such as | |||
353 | system crashes before the delayed allocation | 353 | system crashes before the delayed allocation |
354 | blocks are forced to disk. | 354 | blocks are forced to disk. |
355 | 355 | ||
356 | discard Controls whether ext4 should issue discard/TRIM | ||
357 | nodiscard(*) commands to the underlying block device when | ||
358 | blocks are freed. This is useful for SSD devices | ||
359 | and sparse/thinly-provisioned LUNs, but it is off | ||
360 | by default until sufficient testing has been done. | ||
361 | |||
356 | Data Mode | 362 | Data Mode |
357 | ========= | 363 | ========= |
358 | There are 3 different data modes: | 364 | There are 3 different data modes: |
diff --git a/Documentation/filesystems/logfs.txt b/Documentation/filesystems/logfs.txt new file mode 100644 index 000000000000..e64c94ba401a --- /dev/null +++ b/Documentation/filesystems/logfs.txt | |||
@@ -0,0 +1,241 @@ | |||
1 | |||
2 | The LogFS Flash Filesystem | ||
3 | ========================== | ||
4 | |||
5 | Specification | ||
6 | ============= | ||
7 | |||
8 | Superblocks | ||
9 | ----------- | ||
10 | |||
11 | Two superblocks exist at the beginning and end of the filesystem. | ||
12 | Each superblock is 256 Bytes large, with another 3840 Bytes reserved | ||
13 | for future purposes, making a total of 4096 Bytes. | ||
14 | |||
15 | Superblock locations may differ for MTD and block devices. On MTD the | ||
16 | first non-bad block contains a superblock in the first 4096 Bytes and | ||
17 | the last non-bad block contains a superblock in the last 4096 Bytes. | ||
18 | On block devices, the first 4096 Bytes of the device contain the first | ||
19 | superblock and the last aligned 4096 Byte-block contains the second | ||
20 | superblock. | ||
21 | |||
22 | For the most part, the superblocks can be considered read-only. They | ||
23 | are written only to correct errors detected within the superblocks, | ||
24 | move the journal and change the filesystem parameters through tunefs. | ||
25 | As a result, the superblock does not contain any fields that require | ||
26 | constant updates, like the amount of free space, etc. | ||
27 | |||
28 | Segments | ||
29 | -------- | ||
30 | |||
31 | The space in the device is split up into equal-sized segments. | ||
32 | Segments are the primary write unit of LogFS. Within each segment, | ||
33 | writes happen from front (low addresses) to back (high addresses). If | ||
34 | only a partial segment has been written, the segment number, the | ||
35 | current position within and optionally a write buffer are stored in | ||
36 | the journal. | ||
37 | |||
38 | Segments are erased as a whole. Therefore Garbage Collection may be | ||
39 | required to completely free a segment before doing so. | ||
40 | |||
41 | Journal | ||
42 | -------- | ||
43 | |||
44 | The journal contains all global information about the filesystem that | ||
45 | is subject to frequent change. At mount time, it has to be scanned | ||
46 | for the most recent commit entry, which contains a list of pointers to | ||
47 | all currently valid entries. | ||
48 | |||
49 | Object Store | ||
50 | ------------ | ||
51 | |||
52 | All space except for the superblocks and journal is part of the object | ||
53 | store. Each segment contains a segment header and a number of | ||
54 | objects, each consisting of the object header and the payload. | ||
55 | Objects are either inodes, directory entries (dentries), file data | ||
56 | blocks or indirect blocks. | ||
57 | |||
58 | Levels | ||
59 | ------ | ||
60 | |||
61 | Garbage collection (GC) may fail if all data is written | ||
62 | indiscriminately. One requirement of GC is that data is separated | ||
63 | roughly according to the distance between the tree root and the data. | ||
64 | Effectively that means all file data is on level 0, indirect blocks | ||
65 | are on levels 1, 2, 3, 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks, | ||
66 | respectively. Inode file data is on level 6 for the inodes and 7-11 | ||
67 | for indirect blocks. | ||
68 | |||
69 | Each segment contains objects of a single level only. As a result, | ||
70 | each level requires its own separate segment to be open for writing. | ||
71 | |||
72 | Inode File | ||
73 | ---------- | ||
74 | |||
75 | All inodes are stored in a special file, the inode file. The single | ||
76 | exception is the inode file's inode (master inode) which for obvious | ||
77 | reasons is stored in the journal instead. Instead of data blocks, the | ||
78 | leaf nodes of the inode file are inodes. | ||
79 | |||
80 | Aliases | ||
81 | ------- | ||
82 | |||
83 | Writes in LogFS are done by means of a wandering tree. A naïve | ||
84 | implementation would require that for each write of a block, all | ||
85 | parent blocks are written as well, since the block pointers have | ||
86 | changed. Such an implementation would not be very efficient. | ||
87 | |||
88 | In LogFS, the block pointer changes are cached in the journal by means | ||
89 | of alias entries. Each alias consists of its logical address - inode | ||
90 | number, block index, level and child number (index into block) - and | ||
91 | the changed data. Any 8-byte word can be changed in this manner. | ||
92 | |||
93 | Currently aliases are used for block pointers, file size, file used | ||
94 | bytes and the height of an inode's indirect tree. | ||
95 | |||
96 | Segment Aliases | ||
97 | --------------- | ||
98 | |||
99 | Related to regular aliases, these are used to handle bad blocks. | ||
100 | Initially, bad blocks are handled by moving the affected segment | ||
101 | content to a spare segment and noting this move in the journal with a | ||
102 | segment alias, a simple (to, from) tuple. GC will later empty this | ||
103 | segment and the alias can be removed again. This is used on MTD only. | ||
104 | |||
105 | Vim | ||
106 | --- | ||
107 | |||
108 | By cleverly predicting the life time of data, it is possible to | ||
109 | separate long-living data from short-living data and thereby reduce | ||
110 | the GC overhead later. Each type of distinct life expectancy (vim) can | ||
111 | have a separate segment open for writing. Each (level, vim) tuple can | ||
112 | be open just once. If an open segment with unknown vim is encountered | ||
113 | at mount time, it is closed and ignored henceforth. | ||
114 | |||
115 | Indirect Tree | ||
116 | ------------- | ||
117 | |||
118 | Inodes in LogFS are similar to FFS-style filesystems with direct and | ||
119 | indirect block pointers. One difference is that LogFS uses a single | ||
120 | indirect pointer that can be either a 1x, 2x, etc. indirect pointer. | ||
121 | A height field in the inode defines the height of the indirect tree | ||
122 | and thereby the indirection of the pointer. | ||
123 | |||
124 | Another difference is the addressing of indirect blocks. In LogFS, | ||
125 | the first 16 pointers in the first indirect block are left empty, | ||
126 | corresponding to the 16 direct pointers in the inode. In ext2 (maybe | ||
127 | others as well) the first pointer in the first indirect block | ||
128 | corresponds to logical block 12, skipping the 12 direct pointers. | ||
129 | So where ext2 is using arithmetic to better utilize space, LogFS keeps | ||
130 | arithmetic simple and uses compression to save space. | ||
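The difference in addressing can be illustrated with a short sketch. The pointers-per-block defaults are assumptions (512 for LogFS, i.e. 4096-byte blocks with 8-byte pointers, and 1024 for ext2 with 4096-byte blocks and 4-byte pointers), and the function names are hypothetical.

```python
LOGFS_DIRECT = 16  # direct pointers in a LogFS inode
EXT2_DIRECT = 12   # direct pointers in an ext2 inode


def logfs_slot(block, ptrs_per_block=512):
    """Slot in the first indirect block for a logical block in LogFS.

    Returns None when the block is reached through a direct pointer.
    Slots 0..15 of the indirect block stay empty, so no subtraction
    is needed: the slot number equals the logical block number.
    """
    if block < LOGFS_DIRECT:
        return None
    if block >= ptrs_per_block:
        raise ValueError("block lies beyond the first indirect block")
    return block


def ext2_slot(block, ptrs_per_block=1024):
    """Slot in the first indirect block for a logical block in ext2.

    ext2 skips the 12 direct pointers, so slot 0 maps to logical block 12.
    """
    if block < EXT2_DIRECT:
        return None
    slot = block - EXT2_DIRECT
    if slot >= ptrs_per_block:
        raise ValueError("block lies beyond the first indirect block")
    return slot
```

The unused leading slots in LogFS cost nothing in practice if, as the text says, the mostly empty block compresses well.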
131 | |||
132 | Compression | ||
133 | ----------- | ||
134 | |||
135 | Both file data and metadata can be compressed. Compression for file | ||
136 | data can be enabled with chattr +c and disabled with chattr -c. Doing | ||
137 | so has no effect on existing data, but new data will be stored | ||
138 | accordingly. New inodes will inherit the compression flag of the | ||
139 | parent directory. | ||
140 | |||
141 | Metadata is always compressed. However, the space accounting ignores | ||
142 | this and charges for the uncompressed size. Failing to do so could | ||
143 | result in GC failures when, after moving some data, indirect blocks | ||
144 | compress worse than previously. Even on a 100% full medium, GC may | ||
145 | not consume any extra space, so the compression gains are lost space | ||
146 | to the user. | ||
147 | |||
148 | However, they are not lost space to the filesystem internals. By | ||
149 | cheating the user for those bytes, the filesystem gained some slack | ||
150 | space and GC will run less often and faster. | ||
151 | |||
152 | Garbage Collection and Wear Leveling | ||
153 | ------------------------------------ | ||
154 | |||
155 | Garbage collection is invoked whenever the number of free segments | ||
156 | falls below a threshold. The best (known) candidate is picked based | ||
157 | on the least amount of valid data contained in the segment. All | ||
158 | remaining valid data is copied elsewhere, thereby invalidating it. | ||
159 | |||
160 | The GC code also checks for aliases and writes them back if their | ||
161 | number gets too large. | ||
162 | |||
163 | Wear leveling is done by occasionally picking a suboptimal segment for | ||
164 | garbage collection. If a stale segment's erase count is significantly | ||
165 | lower than the active segments' erase counts, it will be picked. Wear | ||
166 | leveling is rate limited, so it will never monopolize the device for | ||
167 | more than one segment worth at a time. | ||
168 | |||
169 | Values for "occasionally" and "significantly lower" are compile-time | ||
170 | constants. | ||
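A minimal sketch of the two selection policies described above follows. The Segment record, the WL_DELTA threshold and the choice to compare against the lowest active erase count are all assumptions made for illustration; the real compile-time constants and data structures are not given in this text.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segno: int
    valid_bytes: int   # amount of still-valid data in the segment
    erase_count: int   # how often the segment has been erased


WL_DELTA = 64  # hypothetical stand-in for "significantly lower"


def pick_gc_candidate(candidates):
    """Pick the known candidate holding the least valid data."""
    return min(candidates, key=lambda s: s.valid_bytes)


def pick_wear_level_candidate(stale, active, delta=WL_DELTA):
    """Return a stale segment worth recycling for wear leveling, or None.

    A stale segment qualifies when its erase count is lower than the
    lowest active erase count by more than delta.
    """
    if not stale or not active:
        return None
    coldest = min(stale, key=lambda s: s.erase_count)
    lowest_active = min(s.erase_count for s in active)
    if coldest.erase_count + delta < lowest_active:
        return coldest
    return None
```

Rate limiting (at most one segment's worth of wear-leveling work at a time) is left out of the sketch.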
171 | |||
172 | Hashed directories | ||
173 | ------------------ | ||
174 | |||
175 | To satisfy efficient lookup(), directory entries are hashed and | ||
176 | located based on the hash. In order to both support large directories | ||
177 | and not be overly inefficient for small directories, several hash | ||
178 | tables of increasing size are used. For each table, the hash value | ||
179 | modulo the table size gives the table index. | ||
180 | |||
181 | Table sizes are chosen to limit the number of indirect blocks with a | ||
182 | fully populated table to 0, 1, 2 or 3 respectively. So the first | ||
183 | table contains 16 entries, the second 512-16, etc. | ||
184 | |||
185 | The last table is special in several ways. First its size depends on | ||
186 | the effective 32bit limit on telldir/seekdir cookies. Since logfs | ||
187 | uses the upper half of the address space for indirect blocks, the size | ||
188 | is limited to 2^31. Secondly the table contains hash buckets with 16 | ||
189 | entries each. | ||
190 | |||
191 | Using single-entry buckets would result in birthday "attacks". At | ||
192 | just 2^16 used entries, hash collisions would be likely (P >= 0.5). | ||
193 | My math skills are insufficient to do the combinatorics for the 17x | ||
194 | collisions necessary to overflow a bucket, but testing showed that in | ||
195 | 10,000 runs the lowest directory fill before a bucket overflow was | ||
196 | 188,057,130 entries with an average of 315,149,915 entries. So for | ||
197 | directory sizes of up to a million, bucket overflows should be | ||
198 | virtually impossible under normal circumstances. | ||
199 | |||
200 | With carefully chosen filenames, it is obviously possible to cause an | ||
201 | overflow with just 21 entries (4 higher tables + 16 entries + 1). So | ||
202 | there may be a security concern if a malicious user has write access | ||
203 | to a directory. | ||
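The birthday estimate for single-entry buckets can be checked with the usual approximation P = 1 - exp(-n(n-1)/(2N)), where n is the number of entries and N the number of buckets. The sketch below assumes uniformly distributed hash values; the function name is hypothetical.

```python
import math


def collision_probability(entries, buckets):
    """Approximate probability that at least two entries share a bucket."""
    return 1.0 - math.exp(-entries * (entries - 1) / (2.0 * buckets))


# 2^16 entries hashed into the 2^31 slots of the last table.
p = collision_probability(2 ** 16, 2 ** 31)
print(f"P(collision) at 2^16 entries: {p:.3f}")
```

Under these assumptions the probability at 2^16 entries is roughly 0.63, consistent with the P >= 0.5 claim above. The approximation covers only the first collision and says nothing about the 17-fold collisions needed to overflow a 16-entry bucket.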
204 | |||
205 | Open For Discussion | ||
206 | =================== | ||
207 | |||
208 | Device Address Space | ||
209 | -------------------- | ||
210 | |||
211 | A device address space is used for caching. Both block devices and | ||
212 | MTD provide functions to either read a single page or write a segment. | ||
213 | Partial segments may be written for data integrity, but where possible | ||
214 | complete segments are written for performance on simple block device | ||
215 | flash media. | ||
216 | |||
217 | Meta Inodes | ||
218 | ----------- | ||
219 | |||
220 | Inodes are stored in the inode file, which is just a regular file for | ||
221 | most purposes. At umount time, however, the inode file needs to | ||
222 | remain open until all dirty inodes are written. So | ||
223 | generic_shutdown_super() may not close this inode, but shouldn't | ||
224 | complain about remaining inodes due to the inode file either. Same | ||
225 | goes for the mapping inode of the device address space. | ||
226 | |||
227 | Currently logfs uses a hack that essentially copies part of fs/inode.c | ||
228 | code over. A general solution would be preferred. | ||
229 | |||
230 | Indirect block mapping | ||
231 | ---------------------- | ||
232 | |||
233 | With compression, the block device (or mapping inode) cannot be used | ||
234 | to cache indirect blocks. Some other place is required. Currently | ||
235 | logfs uses the top half of each inode's address space. The low 8TB | ||
236 | (on 32bit) are filled with file data, the high 8TB are used for | ||
237 | indirect blocks. | ||
238 | |||
239 | One problem is that 16TB files created on 64bit systems actually have | ||
240 | data in the top 8TB. But files >16TB would cause problems anyway, so | ||
241 | only the limit has changed. | ||
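The 8TB figures follow from simple arithmetic, sketched here under the assumption of a 32-bit page index and 4096-byte pages (neither value is stated explicitly in the text). The helper name is hypothetical.

```python
PAGE_SIZE = 4096   # assumed page size in bytes
INDEX_BITS = 32    # assumed width of the page index on 32bit
TIB = 1 << 40

# Total addressable bytes per inode address space, and the half used for data.
address_space = (1 << INDEX_BITS) * PAGE_SIZE
data_half = address_space // 2


def is_indirect_index(index):
    """True if a page index falls into the upper half used for indirect blocks."""
    return index >= 1 << (INDEX_BITS - 1)


print(address_space // TIB, "TiB total,", data_half // TIB, "TiB for file data")
```

With these numbers the address space is 16 TiB, split into 8 TiB for file data and 8 TiB for indirect blocks.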
diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX new file mode 100644 index 000000000000..2f68cd688769 --- /dev/null +++ b/Documentation/filesystems/nfs/00-INDEX | |||
@@ -0,0 +1,16 @@ | |||
1 | 00-INDEX | ||
2 | - this file (nfs-related documentation). | ||
3 | Exporting | ||
4 | - explanation of how to make filesystems exportable. | ||
5 | knfsd-stats.txt | ||
6 | - statistics which the NFS server makes available to user space. | ||
7 | nfs.txt | ||
8 | - nfs client, and DNS resolution for fs_locations. | ||
9 | nfs41-server.txt | ||
10 | - info on the Linux server implementation of NFSv4 minor version 1. | ||
11 | nfs-rdma.txt | ||
12 | - how to install and setup the Linux NFS/RDMA client and server software | ||
13 | nfsroot.txt | ||
14 | - short guide on setting up a diskless box with NFS root filesystem. | ||
15 | rpc-cache.txt | ||
16 | - introduction to the caching mechanisms in the sunrpc layer. | ||
diff --git a/Documentation/filesystems/Exporting b/Documentation/filesystems/nfs/Exporting index 87019d2b5981..87019d2b5981 100644 --- a/Documentation/filesystems/Exporting +++ b/Documentation/filesystems/nfs/Exporting | |||
diff --git a/Documentation/filesystems/knfsd-stats.txt b/Documentation/filesystems/nfs/knfsd-stats.txt index 64ced5149d37..64ced5149d37 100644 --- a/Documentation/filesystems/knfsd-stats.txt +++ b/Documentation/filesystems/nfs/knfsd-stats.txt | |||
diff --git a/Documentation/filesystems/nfs-rdma.txt b/Documentation/filesystems/nfs/nfs-rdma.txt index e386f7e4bcee..e386f7e4bcee 100644 --- a/Documentation/filesystems/nfs-rdma.txt +++ b/Documentation/filesystems/nfs/nfs-rdma.txt | |||
diff --git a/Documentation/filesystems/nfs.txt b/Documentation/filesystems/nfs/nfs.txt index f50f26ce6cd0..f50f26ce6cd0 100644 --- a/Documentation/filesystems/nfs.txt +++ b/Documentation/filesystems/nfs/nfs.txt | |||
diff --git a/Documentation/filesystems/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt index 5920fe26e6ff..6a53a84afc72 100644 --- a/Documentation/filesystems/nfs41-server.txt +++ b/Documentation/filesystems/nfs/nfs41-server.txt | |||
@@ -17,8 +17,7 @@ kernels must turn 4.1 on or off *before* turning support for version 4 | |||
17 | on or off; rpc.nfsd does this correctly.) | 17 | on or off; rpc.nfsd does this correctly.) |
18 | 18 | ||
19 | The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based | 19 | The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based |
20 | on the latest NFSv4.1 Internet Draft: | 20 | on RFC 5661. |
21 | http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29 | ||
22 | 21 | ||
23 | From the many new features in NFSv4.1 the current implementation | 22 | From the many new features in NFSv4.1 the current implementation |
24 | focuses on the mandatory-to-implement NFSv4.1 Sessions, providing | 23 | focuses on the mandatory-to-implement NFSv4.1 Sessions, providing |
@@ -41,10 +40,10 @@ interoperability problems with future clients. Known issues: | |||
41 | conformant with the spec (for example, we don't use kerberos | 40 | conformant with the spec (for example, we don't use kerberos |
42 | on the backchannel correctly). | 41 | on the backchannel correctly). |
43 | - no trunking support: no clients currently take advantage of | 42 | - no trunking support: no clients currently take advantage of |
44 | trunking, but this is a mandatory failure, and its use is | 43 | trunking, but this is a mandatory feature, and its use is |
45 | recommended to clients in a number of places. (E.g. to ensure | 44 | recommended to clients in a number of places. (E.g. to ensure |
46 | timely renewal in case an existing connection's retry timeouts | 45 | timely renewal in case an existing connection's retry timeouts |
47 | have gotten too long; see section 8.3 of the draft.) | 46 | have gotten too long; see section 8.3 of the RFC.) |
48 | Therefore, lack of this feature may cause future clients to | 47 | Therefore, lack of this feature may cause future clients to |
49 | fail. | 48 | fail. |
50 | - Incomplete backchannel support: incomplete backchannel gss | 49 | - Incomplete backchannel support: incomplete backchannel gss |
@@ -213,3 +212,10 @@ The following cases aren't supported yet: | |||
213 | DESTROY_CLIENTID, DESTROY_SESSION, EXCHANGE_ID. | 212 | DESTROY_CLIENTID, DESTROY_SESSION, EXCHANGE_ID. |
214 | * DESTROY_SESSION MUST be the final operation in the COMPOUND request. | 213 | * DESTROY_SESSION MUST be the final operation in the COMPOUND request. |
215 | 214 | ||
215 | Nonstandard compound limitations: | ||
216 | * No support for a sessions fore channel RPC compound that requires both a | ||
217 | ca_maxrequestsize request and a ca_maxresponsesize reply, so we may | ||
218 | fail to live up to the promise we made in CREATE_SESSION fore channel | ||
219 | negotiation. | ||
220 | * No more than one IO operation (read, write, readdir) allowed per | ||
221 | compound. | ||
diff --git a/Documentation/filesystems/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt index 3ba0b945aaf8..3ba0b945aaf8 100644 --- a/Documentation/filesystems/nfsroot.txt +++ b/Documentation/filesystems/nfs/nfsroot.txt | |||
diff --git a/Documentation/filesystems/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.txt index 8a382bea6808..8a382bea6808 100644 --- a/Documentation/filesystems/rpc-cache.txt +++ b/Documentation/filesystems/nfs/rpc-cache.txt | |||
diff --git a/Documentation/filesystems/nilfs2.txt b/Documentation/filesystems/nilfs2.txt index 01539f410676..cf6d0d85ca82 100644 --- a/Documentation/filesystems/nilfs2.txt +++ b/Documentation/filesystems/nilfs2.txt | |||
@@ -28,7 +28,7 @@ described in the man pages included in the package. | |||
28 | Project web page: http://www.nilfs.org/en/ | 28 | Project web page: http://www.nilfs.org/en/ |
29 | Download page: http://www.nilfs.org/en/download.html | 29 | Download page: http://www.nilfs.org/en/download.html |
30 | Git tree web page: http://www.nilfs.org/git/ | 30 | Git tree web page: http://www.nilfs.org/git/ |
31 | NILFS mailing lists: http://www.nilfs.org/mailman/listinfo/users | 31 | List info: http://vger.kernel.org/vger-lists.html#linux-nilfs |
32 | 32 | ||
33 | Caveats | 33 | Caveats |
34 | ======= | 34 | ======= |
@@ -49,8 +49,7 @@ Mount options | |||
49 | NILFS2 supports the following mount options: | 49 | NILFS2 supports the following mount options: |
50 | (*) == default | 50 | (*) == default |
51 | 51 | ||
52 | barrier=on(*) This enables/disables barriers. barrier=off disables | 52 | nobarrier Disables barriers. |
53 | it, barrier=on enables it. | ||
54 | errors=continue(*) Keep going on a filesystem error. | 53 | errors=continue(*) Keep going on a filesystem error. |
55 | errors=remount-ro Remount the filesystem read-only on an error. | 54 | errors=remount-ro Remount the filesystem read-only on an error. |
56 | errors=panic Panic and halt the machine if an error occurs. | 55 | errors=panic Panic and halt the machine if an error occurs. |
@@ -71,6 +70,13 @@ order=strict Apply strict in-order semantics that preserves sequence | |||
71 | blocks. That means, it is guaranteed that no | 70 | blocks. That means, it is guaranteed that no |
72 | overtaking of events occurs in the recovered file | 71 | overtaking of events occurs in the recovered file |
73 | system after a crash. | 72 | system after a crash. |
73 | norecovery Disable recovery of the filesystem on mount. | ||
74 | This disables every write access on the device for | ||
75 | read-only mounts or snapshots. This option will fail | ||
76 | for r/w mounts on an unclean volume. | ||
77 | discard Issue discard/TRIM commands to the underlying block | ||
78 | device when blocks are freed. This is useful for SSD | ||
79 | devices and sparse/thinly-provisioned LUNs. | ||
74 | 80 | ||
75 | NILFS2 usage | 81 | NILFS2 usage |
76 | ============ | 82 | ============ |
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 92b888d540a6..a7e9746ee7ea 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting | |||
@@ -140,7 +140,7 @@ Callers of notify_change() need ->i_mutex now. | |||
140 | New super_block field "struct export_operations *s_export_op" for | 140 | New super_block field "struct export_operations *s_export_op" for |
141 | explicit support for exporting, e.g. via NFS. The structure is fully | 141 | explicit support for exporting, e.g. via NFS. The structure is fully |
142 | documented at its declaration in include/linux/fs.h, and in | 142 | documented at its declaration in include/linux/fs.h, and in |
143 | Documentation/filesystems/Exporting. | 143 | Documentation/filesystems/nfs/Exporting. |
144 | 144 | ||
145 | Briefly it allows for the definition of decode_fh and encode_fh operations | 145 | Briefly it allows for the definition of decode_fh and encode_fh operations |
146 | to encode and decode filehandles, and allows the filesystem to use | 146 | to encode and decode filehandles, and allows the filesystem to use |
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 2c48f945546b..1e359b62c40a 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt | |||
@@ -38,6 +38,7 @@ Table of Contents | |||
38 | 3.3 /proc/<pid>/io - Display the IO accounting fields | 38 | 3.3 /proc/<pid>/io - Display the IO accounting fields |
39 | 3.4 /proc/<pid>/coredump_filter - Core dump filtering settings | 39 | 3.4 /proc/<pid>/coredump_filter - Core dump filtering settings |
40 | 3.5 /proc/<pid>/mountinfo - Information about mounts | 40 | 3.5 /proc/<pid>/mountinfo - Information about mounts |
41 | 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm | ||
41 | 42 | ||
42 | 43 | ||
43 | ------------------------------------------------------------------------------ | 44 | ------------------------------------------------------------------------------ |
@@ -163,6 +164,7 @@ read the file /proc/PID/status: | |||
163 | VmExe: 68 kB | 164 | VmExe: 68 kB |
164 | VmLib: 1412 kB | 165 | VmLib: 1412 kB |
165 | VmPTE: 20 kb | 166 | VmPTE: 20 kb |
167 | VmSwap: 0 kB | ||
166 | Threads: 1 | 168 | Threads: 1 |
167 | SigQ: 0/28578 | 169 | SigQ: 0/28578 |
168 | SigPnd: 0000000000000000 | 170 | SigPnd: 0000000000000000 |
@@ -176,7 +178,6 @@ read the file /proc/PID/status: | |||
176 | CapBnd: ffffffffffffffff | 178 | CapBnd: ffffffffffffffff |
177 | voluntary_ctxt_switches: 0 | 179 | voluntary_ctxt_switches: 0 |
178 | nonvoluntary_ctxt_switches: 1 | 180 | nonvoluntary_ctxt_switches: 1 |
179 | Stack usage: 12 kB | ||
180 | 181 | ||
181 | This shows you nearly the same information you would get if you viewed it with | 182 | This shows you nearly the same information you would get if you viewed it with |
182 | the ps command. In fact, ps uses the proc file system to obtain its | 183 | the ps command. In fact, ps uses the proc file system to obtain its |
@@ -188,7 +189,13 @@ memory usage. Its seven fields are explained in Table 1-3. The stat file | |||
188 | contains detailed information about the process itself. Its fields are | 189 | contains detailed information about the process itself. Its fields are |
189 | explained in Table 1-4. | 190 | explained in Table 1-4. |
190 | 191 | ||
191 | Table 1-2: Contents of the statm files (as of 2.6.30-rc7) | 192 | (for SMP CONFIG users) |
193 | To keep accounting scalable, RSS-related information is handled | ||
194 | asynchronously and the value may not be very precise. To see a precise | ||
195 | snapshot of a moment, read the /proc/<pid>/smaps file, which scans the page table. | ||
196 | It's slow but very precise. | ||
197 | |||
198 | Table 1-2: Contents of the status files (as of 2.6.30-rc7) | ||
192 | .............................................................................. | 199 | .............................................................................. |
193 | Field Content | 200 | Field Content |
194 | Name filename of the executable | 201 | Name filename of the executable |
@@ -213,6 +220,7 @@ Table 1-2: Contents of the statm files (as of 2.6.30-rc7) | |||
213 | VmExe size of text segment | 220 | VmExe size of text segment |
214 | VmLib size of shared library code | 221 | VmLib size of shared library code |
215 | VmPTE size of page table entries | 222 | VmPTE size of page table entries |
223 | VmSwap size of swap usage (the number of referred swapents) | ||
216 | Threads number of threads | 224 | Threads number of threads |
217 | SigQ number of signals queued/max. number for queue | 225 | SigQ number of signals queued/max. number for queue |
218 | SigPnd bitmap of pending signals for the thread | 226 | SigPnd bitmap of pending signals for the thread |
@@ -230,7 +238,6 @@ Table 1-2: Contents of the statm files (as of 2.6.30-rc7) | |||
230 | Mems_allowed_list Same as previous, but in "list format" | 238 | Mems_allowed_list Same as previous, but in "list format" |
231 | voluntary_ctxt_switches number of voluntary context switches | 239 | voluntary_ctxt_switches number of voluntary context switches |
232 | nonvoluntary_ctxt_switches number of non voluntary context switches | 240 | nonvoluntary_ctxt_switches number of non voluntary context switches |
233 | Stack usage: stack usage high water mark (round up to page size) | ||
234 | .............................................................................. | 241 | .............................................................................. |
235 | 242 | ||
236 | Table 1-3: Contents of the statm files (as of 2.6.8-rc3) | 243 | Table 1-3: Contents of the statm files (as of 2.6.8-rc3) |
@@ -309,7 +316,7 @@ address perms offset dev inode pathname | |||
309 | 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test | 316 | 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test |
310 | 0804a000-0806b000 rw-p 00000000 00:00 0 [heap] | 317 | 0804a000-0806b000 rw-p 00000000 00:00 0 [heap] |
311 | a7cb1000-a7cb2000 ---p 00000000 00:00 0 | 318 | a7cb1000-a7cb2000 ---p 00000000 00:00 0 |
312 | a7cb2000-a7eb2000 rw-p 00000000 00:00 0 [threadstack:001ff4b4] | 319 | a7cb2000-a7eb2000 rw-p 00000000 00:00 0 |
313 | a7eb2000-a7eb3000 ---p 00000000 00:00 0 | 320 | a7eb2000-a7eb3000 ---p 00000000 00:00 0 |
314 | a7eb3000-a7ed5000 rw-p 00000000 00:00 0 | 321 | a7eb3000-a7ed5000 rw-p 00000000 00:00 0 |
315 | a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 | 322 | a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 |
@@ -345,7 +352,6 @@ is not associated with a file: | |||
345 | [stack] = the stack of the main process | 352 | [stack] = the stack of the main process |
346 | [vdso] = the "virtual dynamic shared object", | 353 | [vdso] = the "virtual dynamic shared object", |
347 | the kernel system call handler | 354 | the kernel system call handler |
348 | [threadstack:xxxxxxxx] = the stack of the thread, xxxxxxxx is the stack size | ||
349 | 355 | ||
350 | or if empty, the mapping is anonymous. | 356 | or if empty, the mapping is anonymous. |
351 | 357 | ||
@@ -431,6 +437,7 @@ Table 1-5: Kernel info in /proc | |||
431 | modules List of loaded modules | 437 | modules List of loaded modules |
432 | mounts Mounted filesystems | 438 | mounts Mounted filesystems |
433 | net Networking info (see text) | 439 | net Networking info (see text) |
440 | pagetypeinfo Additional page allocator information (see text) (2.5) | ||
434 | partitions Table of partitions known to the system | 441 | partitions Table of partitions known to the system |
435 | pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, | 442 | pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, |
436 | decoupled by lspci (2.4) | 443 | decoupled by lspci (2.4) |
@@ -585,7 +592,7 @@ Node 0, zone DMA 0 4 5 4 4 3 ... | |||
585 | Node 0, zone Normal 1 0 0 1 101 8 ... | 592 | Node 0, zone Normal 1 0 0 1 101 8 ... |
586 | Node 0, zone HighMem 2 0 0 1 1 0 ... | 593 | Node 0, zone HighMem 2 0 0 1 1 0 ... |
587 | 594 | ||
588 | Memory fragmentation is a problem under some workloads, and buddyinfo is a | 595 | External fragmentation is a problem under some workloads, and buddyinfo is a |
589 | useful tool for helping diagnose these problems. Buddyinfo will give you a | 596 | useful tool for helping diagnose these problems. Buddyinfo will give you a |
590 | clue as to how big an area you can safely allocate, or why a previous | 597 | clue as to how big an area you can safely allocate, or why a previous |
591 | allocation failed. | 598 | allocation failed. |
@@ -595,6 +602,48 @@ available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in | |||
595 | ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE | 602 | ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE |
596 | available in ZONE_NORMAL, etc... | 603 | available in ZONE_NORMAL, etc... |
597 | 604 | ||
605 | More information relevant to external fragmentation can be found in | ||
606 | pagetypeinfo. | ||
607 | |||
608 | > cat /proc/pagetypeinfo | ||
609 | Page block order: 9 | ||
610 | Pages per block: 512 | ||
611 | |||
612 | Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 | ||
613 | Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0 | ||
614 | Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 | ||
615 | Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2 | ||
616 | Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0 | ||
617 | Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 | ||
618 | Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9 | ||
619 | Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0 | ||
620 | Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452 | ||
621 | Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0 | ||
622 | Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0 | ||
623 | |||
624 | Number of blocks type Unmovable Reclaimable Movable Reserve Isolate | ||
625 | Node 0, zone DMA 2 0 5 1 0 | ||
626 | Node 0, zone DMA32 41 6 967 2 0 | ||
627 | |||
628 | Fragmentation avoidance in the kernel works by grouping pages of different | ||
629 | migrate types into the same contiguous regions of memory called page blocks. | ||
630 | A page block is typically the size of the default hugepage size e.g. 2MB on | ||
631 | X86-64. By keeping pages grouped based on their ability to move, the kernel | ||
632 | can reclaim pages within a page block to satisfy a high-order allocation. | ||
633 | |||
634 | The pagetypeinfo begins with information on the size of a page block. It | ||
635 | then gives the same type of information as buddyinfo except broken down | ||
636 | by migrate-type and finishes with details on how many page blocks of each | ||
637 | type exist. | ||
638 | |||
639 | If min_free_kbytes has been tuned correctly (recommendations made by hugeadm | ||
640 | from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can | ||
641 | make an estimate of the likely number of huge pages that can be allocated | ||
642 | at a given point in time. All the "Movable" blocks should be allocatable | ||
643 | unless memory has been mlock()'d. Some of the Reclaimable blocks should | ||
644 | also be allocatable although a lot of filesystem metadata may have to be | ||
645 | reclaimed to achieve this. | ||
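As a rough illustration of the estimate described above, the sketch below parses the "Number of blocks type" table and sums the Movable blocks. It works on a copy of the sample output shown earlier rather than on the live file, the function name is hypothetical, and the result is only an upper bound under the assumption that a page block is the size of one huge page.

```python
SAMPLE = """\
Number of blocks type     Unmovable  Reclaimable      Movable      Reserve      Isolate
Node 0, zone      DMA            2            0            5            1            0
Node 0, zone    DMA32           41            6          967            2            0
"""


def parse_block_counts(text):
    """Map (node, zone) to a dict of page block counts per migrate type."""
    counts = {}
    header = None
    for line in text.splitlines():
        if line.startswith("Number of blocks type"):
            # Column names follow the four words of the table title.
            header = line.split()[4:]
            continue
        if header and line.startswith("Node"):
            fields = line.replace(",", " ").split()
            node, zone = int(fields[1]), fields[3]
            values = [int(v) for v in fields[4:4 + len(header)]]
            counts[(node, zone)] = dict(zip(header, values))
    return counts


blocks = parse_block_counts(SAMPLE)
movable = sum(zone_counts["Movable"] for zone_counts in blocks.values())
print("Movable page blocks:", movable)
```

Because the column names are taken from the header line, the parser does not depend on a fixed set of migrate types.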
646 | |||
598 | .............................................................................. | 647 | .............................................................................. |
599 | 648 | ||
600 | meminfo: | 649 | meminfo: |
@@ -1072,7 +1121,8 @@ second). The meanings of the columns are as follows, from left to right: | |||
1072 | - irq: servicing interrupts | 1121 | - irq: servicing interrupts |
1073 | - softirq: servicing softirqs | 1122 | - softirq: servicing softirqs |
1074 | - steal: involuntary wait | 1123 | - steal: involuntary wait |
1075 | - guest: running a guest | 1124 | - guest: running a normal guest |
1125 | - guest_nice: running a niced guest | ||
1076 | 1126 | ||
1077 | The "intr" line gives counts of interrupts serviced since boot time, for each | 1127 | The "intr" line gives counts of interrupts serviced since boot time, for each |
1078 | of the possible system interrupts. The first column is the total of all | 1128 | of the possible system interrupts. The first column is the total of all |
@@ -1088,8 +1138,8 @@ The "processes" line gives the number of processes and threads created, which | |||
1088 | includes (but is not limited to) those created by calls to the fork() and | 1138 | includes (but is not limited to) those created by calls to the fork() and |
1089 | clone() system calls. | 1139 | clone() system calls. |
1090 | 1140 | ||
1091 | The "procs_running" line gives the number of processes currently running on | 1141 | The "procs_running" line gives the total number of threads that are |
1092 | CPUs. | 1142 | running or ready to run (i.e., the total number of runnable threads). |
1093 | 1143 | ||
1094 | The "procs_blocked" line gives the number of processes currently blocked, | 1144 | The "procs_blocked" line gives the number of processes currently blocked, |
1095 | waiting for I/O to complete. | 1145 | waiting for I/O to complete. |
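
As a rough illustration of the column order this hunk documents, the following Python sketch parses the text of /proc/stat and picks out the "guest" and "guest_nice" columns together with "procs_running". The function name `parse_proc_stat` and the `SAMPLE` text are hypothetical and exist only for illustration; the first five column names (user, nice, system, idle, iowait) are assumed from the part of proc.txt that lies outside this hunk. Kernels that predate the guest_nice column print fewer fields, so trailing columns that are absent are simply left out of the result.

```python
# Column names for the aggregate "cpu" line, from left to right.
CPU_COLUMNS = ("user", "nice", "system", "idle", "iowait",
               "irq", "softirq", "steal", "guest", "guest_nice")


def parse_proc_stat(text):
    """Return the aggregate cpu columns and a few scalar counters."""
    result = {"cpu": {}}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key, values = fields[0], fields[1:]
        if key == "cpu":
            # zip() stops at the shorter sequence, so older kernels that
            # print fewer columns yield a shorter dictionary.
            result["cpu"] = dict(zip(CPU_COLUMNS, map(int, values)))
        elif key in ("processes", "procs_running", "procs_blocked"):
            result[key] = int(values[0])
    return result


# Made-up sample input, not captured from a real system.
SAMPLE = """\
cpu  2255 34 2290 22625563 6290 127 456 0 12 3
cpu0 1132 34 1441 11311718 3675 127 438 0 12 3
intr 114930548 113199788 3 0 5
processes 2915
procs_running 1
procs_blocked 0
"""
```

On a Linux system the same function could be fed the live file, for example `parse_proc_stat(open("/proc/stat").read())`.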
@@ -1408,3 +1458,11 @@ For more information on mount propagation see: | |||
1408 | 1458 | ||
1409 | Documentation/filesystems/sharedsubtree.txt | 1459 | Documentation/filesystems/sharedsubtree.txt |
1410 | 1460 | ||
1461 | |||
1462 | 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm | ||
1463 | -------------------------------------------------------- | ||
1464 | These files provide a method to access a task's comm value. They also allow | ||
1465 | a task to set its own or one of its thread siblings' comm value. The comm value | ||
1466 | is limited in size compared to the cmdline value, so writing anything longer | ||
1467 | than the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated | ||
1468 | comm value. | ||
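
The truncation rule in section 3.6 can be sketched in Python as follows. This is a minimal model and not kernel code: `expected_comm` and `set_own_comm` are hypothetical helper names, and the sketch assumes that the 16-character TASK_COMM_LEN includes the terminating NUL byte, so at most 15 bytes of the written name survive. `set_own_comm` only works on Linux kernels that provide a writable /proc/self/comm.

```python
TASK_COMM_LEN = 16  # value quoted in the text; assumed to include the NUL


def expected_comm(name, task_comm_len=TASK_COMM_LEN):
    """Model the truncation: keep at most task_comm_len - 1 bytes."""
    raw = name.encode("utf-8")
    return raw[:task_comm_len - 1]


def set_own_comm(name):
    """Linux only: write the calling task's comm value and read it back."""
    with open("/proc/self/comm", "w") as f:
        f.write(name)
    with open("/proc/self/comm") as f:
        return f.read().rstrip("\n")
```

Calling `set_own_comm("a-very-long-thread-name")` on such a system would be expected to return a shortened name, which can be compared against `expected_comm` for the same input.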
diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.txt index 0d15ebccf5b0..a1e2e0dda907 100644 --- a/Documentation/filesystems/seq_file.txt +++ b/Documentation/filesystems/seq_file.txt | |||
@@ -248,9 +248,7 @@ code, that is done in the initialization code in the usual way: | |||
248 | { | 248 | { |
249 | struct proc_dir_entry *entry; | 249 | struct proc_dir_entry *entry; |
250 | 250 | ||
251 | entry = create_proc_entry("sequence", 0, NULL); | 251 | proc_create("sequence", 0, NULL, &ct_file_ops); |
252 | if (entry) | ||
253 | entry->proc_fops = &ct_file_ops; | ||
254 | return 0; | 252 | return 0; |
255 | } | 253 | } |
256 | 254 | ||
diff --git a/Documentation/filesystems/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.txt index 23a181074f94..fc0e39af43c3 100644 --- a/Documentation/filesystems/sharedsubtree.txt +++ b/Documentation/filesystems/sharedsubtree.txt | |||
@@ -837,6 +837,9 @@ replicas continue to be exactly same. | |||
837 | individual lists does not affect propagation or the way propagation | 837 | individual lists does not affect propagation or the way propagation |
838 | tree is modified by operations. | 838 | tree is modified by operations. |
839 | 839 | ||
840 | All vfsmounts in a peer group have the same ->mnt_master. If it is | ||
841 | non-NULL, they form a contiguous (ordered) segment of slave list. | ||
842 | |||
840 | An example propagation tree looks as shown in the figure below. | 843 | An example propagation tree looks as shown in the figure below. |
841 | [ NOTE: Though it looks like a forest, if we consider all the shared | 844 | [ NOTE: Though it looks like a forest, if we consider all the shared |
842 | mounts as a conceptual entity called 'pnode', it becomes a tree] | 845 | mounts as a conceptual entity called 'pnode', it becomes a tree] |
@@ -874,8 +877,19 @@ replicas continue to be exactly same. | |||
874 | 877 | ||
875 | NOTE: The propagation tree is orthogonal to the mount tree. | 878 | NOTE: The propagation tree is orthogonal to the mount tree. |
876 | 879 | ||
880 | 8B Locking: | ||
881 | |||
882 | ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected | ||
883 | by namespace_sem (exclusive for modifications, shared for reading). | ||
884 | |||
885 | Normally we have ->mnt_flags modifications serialized by vfsmount_lock. | ||
886 | There are two exceptions: do_add_mount() and clone_mnt(). | ||
887 | The former modifies a vfsmount that has not been visible in any shared | ||
888 | data structures yet. | ||
889 | The latter holds namespace_sem and the only references to vfsmount | ||
890 | are in lists that can't be traversed without namespace_sem. | ||
877 | 891 | ||
878 | 8B Algorithm: | 892 | 8C Algorithm: |
879 | 893 | ||
880 | The crux of the implementation resides in rbind/move operation. | 894 | The crux of the implementation resides in rbind/move operation. |
881 | 895 | ||
diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt index b245d524d568..931c806642c5 100644 --- a/Documentation/filesystems/sysfs.txt +++ b/Documentation/filesystems/sysfs.txt | |||
@@ -91,8 +91,8 @@ struct device_attribute { | |||
91 | const char *buf, size_t count); | 91 | const char *buf, size_t count); |
92 | }; | 92 | }; |
93 | 93 | ||
94 | int device_create_file(struct device *, struct device_attribute *); | 94 | int device_create_file(struct device *, const struct device_attribute *); |
95 | void device_remove_file(struct device *, struct device_attribute *); | 95 | void device_remove_file(struct device *, const struct device_attribute *); |
96 | 96 | ||
97 | It also defines this helper for defining device attributes: | 97 | It also defines this helper for defining device attributes: |
98 | 98 | ||
@@ -316,8 +316,8 @@ DEVICE_ATTR(_name, _mode, _show, _store); | |||
316 | 316 | ||
317 | Creation/Removal: | 317 | Creation/Removal: |
318 | 318 | ||
319 | int device_create_file(struct device *device, struct device_attribute * attr); | 319 | int device_create_file(struct device *dev, const struct device_attribute * attr); |
320 | void device_remove_file(struct device * dev, struct device_attribute * attr); | 320 | void device_remove_file(struct device *dev, const struct device_attribute * attr); |
321 | 321 | ||
322 | 322 | ||
323 | - bus drivers (include/linux/device.h) | 323 | - bus drivers (include/linux/device.h) |
@@ -358,7 +358,7 @@ DRIVER_ATTR(_name, _mode, _show, _store) | |||
358 | 358 | ||
359 | Creation/Removal: | 359 | Creation/Removal: |
360 | 360 | ||
361 | int driver_create_file(struct device_driver *, struct driver_attribute *); | 361 | int driver_create_file(struct device_driver *, const struct driver_attribute *); |
362 | void driver_remove_file(struct device_driver *, struct driver_attribute *); | 362 | void driver_remove_file(struct device_driver *, const struct driver_attribute *); |
363 | 363 | ||
364 | 364 | ||
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index 3015da0c6b2a..fe09a2cb1858 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt | |||
@@ -82,11 +82,13 @@ tmpfs has a mount option to set the NUMA memory allocation policy for | |||
82 | all files in that instance (if CONFIG_NUMA is enabled) - which can be | 82 | all files in that instance (if CONFIG_NUMA is enabled) - which can be |
83 | adjusted on the fly via 'mount -o remount ...' | 83 | adjusted on the fly via 'mount -o remount ...' |
84 | 84 | ||
85 | mpol=default prefers to allocate memory from the local node | 85 | mpol=default use the process allocation policy |
86 | (see set_mempolicy(2)) | ||
86 | mpol=prefer:Node prefers to allocate memory from the given Node | 87 | mpol=prefer:Node prefers to allocate memory from the given Node |
87 | mpol=bind:NodeList allocates memory only from nodes in NodeList | 88 | mpol=bind:NodeList allocates memory only from nodes in NodeList |
88 | mpol=interleave prefers to allocate from each node in turn | 89 | mpol=interleave prefers to allocate from each node in turn |
89 | mpol=interleave:NodeList allocates from each node of NodeList in turn | 90 | mpol=interleave:NodeList allocates from each node of NodeList in turn |
91 | mpol=local prefers to allocate memory from the local node | ||
90 | 92 | ||
91 | NodeList format is a comma-separated list of decimal numbers and ranges, | 93 | NodeList format is a comma-separated list of decimal numbers and ranges, |
92 | a range being two hyphen-separated decimal numbers, the smallest and | 94 | a range being two hyphen-separated decimal numbers, the smallest and |
@@ -134,3 +136,5 @@ Author: | |||
134 | Christoph Rohland <cr@sap.com>, 1.12.01 | 136 | Christoph Rohland <cr@sap.com>, 1.12.01 |
135 | Updated: | 137 | Updated: |
136 | Hugh Dickins, 4 June 2007 | 138 | Hugh Dickins, 4 June 2007 |
139 | Updated: | ||
140 | KOSAKI Motohiro, 16 Mar 2010 | ||
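
The mpol= options and the NodeList format touched by the tmpfs.txt hunks above can be sketched in Python as follows. The helper names `parse_nodelist` and `mpol_option` are hypothetical. The sketch assumes that a range such as 0-3 is inclusive and written low-to-high, which the truncated context line above does not fully spell out, and it accepts only the policy names that appear in the hunk.

```python
def parse_nodelist(nodelist):
    """Expand a NodeList such as "0-3,5" into a sorted list of node numbers."""
    nodes = set()
    for part in nodelist.split(","):
        if "-" in part:
            low_text, high_text = part.split("-")
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError("range must be low-high: %r" % part)
            nodes.update(range(low, high + 1))  # assumed inclusive range
        else:
            nodes.add(int(part))
    return sorted(nodes)


def mpol_option(policy, nodelist=None):
    """Build an "mpol=..." mount option string for the policies listed above."""
    if policy not in ("default", "prefer", "bind", "interleave", "local"):
        raise ValueError("unknown policy: %r" % policy)
    if nodelist is None:
        return "mpol=" + policy
    parse_nodelist(nodelist)  # raises ValueError on malformed input
    return "mpol=%s:%s" % (policy, nodelist)
```

The resulting string is meant to be passed with the other mount options, for instance as part of the 'mount -o remount ...' invocation mentioned in the text.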
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 623f094c9d8d..3de2f32edd90 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt | |||
@@ -472,7 +472,7 @@ __sync_single_inode) to check if ->writepages has been successful in | |||
472 | writing out the whole address_space. | 472 | writing out the whole address_space. |
473 | 473 | ||
474 | The Writeback tag is used by filemap*wait* and sync_page* functions, | 474 | The Writeback tag is used by filemap*wait* and sync_page* functions, |
475 | via wait_on_page_writeback_range, to wait for all writeback to | 475 | via filemap_fdatawait_range, to wait for all writeback to |
476 | complete. While waiting ->sync_page (if defined) will be called on | 476 | complete. While waiting ->sync_page (if defined) will be called on |
477 | each page that is found to require writeback. | 477 | each page that is found to require writeback. |
478 | 478 | ||