author     Michal Marek <mmarek@suse.cz>   2010-08-04 08:05:07 -0400
committer  Michal Marek <mmarek@suse.cz>   2010-08-04 08:05:07 -0400
commit     7a996d3ab150bb0e1b71fa182f70199a703efdd1 (patch)
tree       96a36947d90c9b96580899abd38cb3b70cd9d40b /Documentation/filesystems
parent     7cf3d73b4360e91b14326632ab1aeda4cb26308d (diff)
parent     9fe6206f400646a2322096b56c59891d530e8d51 (diff)

Merge commit 'v2.6.35' into kbuild/kconfig

Conflicts:
        scripts/kconfig/Makefile

Diffstat (limited to 'Documentation/filesystems'):
28 files changed, 1482 insertions, 104 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 875d49696b6e..4303614b5add 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -16,6 +16,8 @@ befs.txt
 	- information about the BeOS filesystem for Linux.
 bfs.txt
 	- info for the SCO UnixWare Boot Filesystem (BFS).
+ceph.txt
+	- info for the Ceph Distributed File System
 cifs.txt
 	- description of the CIFS filesystem.
 coda.txt
@@ -32,6 +34,8 @@ dlmfs.txt
 	- info on the userspace interface to the OCFS2 DLM.
 dnotify.txt
 	- info about directory notification in Linux.
+dnotify_test.c
+	- example program for dnotify
 ecryptfs.txt
 	- docs on eCryptfs: stacked cryptographic filesystem for Linux.
 exofs.txt
@@ -62,6 +66,8 @@ jfs.txt
 	- info and mount options for the JFS filesystem.
 locks.txt
 	- info on file locking implementations, flock() vs. fcntl(), etc.
+logfs.txt
+	- info on the LogFS flash filesystem.
 mandatory-locking.txt
 	- info on the Linux implementation of Sys V mandatory file locking.
 ncpfs.txt
diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt
index 57e0b80a5274..c0236e753bc8 100644
--- a/Documentation/filesystems/9p.txt
+++ b/Documentation/filesystems/9p.txt
@@ -37,6 +37,15 @@ For Plan 9 From User Space applications (http://swtch.com/plan9)
 
 	mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER
 
+For server running on QEMU host with virtio transport:
+
+	mount -t 9p -o trans=virtio <mount_tag> /mnt/9
+
+where mount_tag is the tag associated by the server to each of the exported
+mount points. Each 9P export is seen by the client as a virtio device with an
+associated "mount_tag" property. Available mount tags can be
+seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.
+
 OPTIONS
 =======
 
@@ -47,7 +56,7 @@ OPTIONS
 		fd     - used passed file descriptors for connection
                          (see rfdno and wfdno)
 		virtio - connect to the next virtio channel available
-		         (from lguest or KVM with trans_virtio module)
+		         (from QEMU with trans_virtio module)
 		rdma   - connect to a specified RDMA channel
 
   uname=name	user name to attempt mount as on the remote server.  The
@@ -85,7 +94,12 @@ OPTIONS
 
   port=n	port to connect to on the remote server
 
-  noextend	force legacy mode (no 9p2000.u semantics)
+  noextend	force legacy mode (no 9p2000.u or 9p2000.L semantics)
+
+  version=name	Select 9P protocol version.  Valid options are:
+		9p2000   - Legacy mode (same as noextend)
+		9p2000.u - Use 9P2000.u protocol
+		9p2000.L - Use 9P2000.L protocol
 
   dfltuid	attempt to mount as a particular uid
 
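As a rough illustration of the virtio transport and version options documented in the 9p.txt hunk above, the same mount can be issued from C via mount(2). This is a sketch only; the tag name "share0" and the mount point are placeholders, not values taken from this commit.

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* "share0" stands in for the mount_tag exported by the server, as
         * listed under /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag */
        if (mount("share0", "/mnt/9", "9p", 0,
                  "trans=virtio,version=9p2000.L") == -1) {
            perror("mount");
            return 1;
        }
        return 0;
    }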
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 18b9d0ca0630..96d4293607ec 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -178,7 +178,7 @@ prototypes:
 locking rules:
 	All except set_page_dirty may block
 
-			BKL	PageLocked(page)	i_sem
+			BKL	PageLocked(page)	i_mutex
 writepage:		no	yes, unlocks (see below)
 readpage:		no	yes, unlocks
 sync_page:		no	maybe
@@ -380,7 +380,7 @@ prototypes:
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *);
 	int (*release) (struct inode *, struct file *);
-	int (*fsync) (struct file *, struct dentry *, int datasync);
+	int (*fsync) (struct file *, int datasync);
 	int (*aio_fsync) (struct kiocb *, int datasync);
 	int (*fasync) (int, struct file *, int);
 	int (*lock) (struct file *, int, struct file_lock *);
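The hunk above drops the struct dentry * argument from ->fsync(). A minimal sketch of an implementation matching the new prototype might look like the following; the examplefs_* names are invented for illustration and are not part of this commit.

    #include <linux/fs.h>

    /* ->fsync() per the new prototype: (struct file *, int datasync) */
    static int examplefs_fsync(struct file *file, int datasync)
    {
        struct inode *inode = file->f_path.dentry->d_inode;

        /* write back dirty data/metadata for inode here; when datasync is
         * non-zero only the data (not e.g. timestamps) must reach disk */
        (void)inode;
        return 0;
    }

    static const struct file_operations examplefs_fops = {
        .fsync = examplefs_fsync,
    };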
@@ -429,8 +429,9 @@ check_flags: no
 implementations.  If your fs is not using generic_file_llseek, you
 need to acquire and release the appropriate locks in your ->llseek().
 For many filesystems, it is probably safe to acquire the inode
-semaphore.  Note some filesystems (i.e. remote ones) provide no
-protection for i_size so you will need to use the BKL.
+mutex or just to use i_size_read() instead.
+Note: this does not protect the file->f_pos against concurrent modifications
+since this is something the userspace has to take care about.
 
 Note: ext2_release() was *the* source of contention on fs-intensive
 loads and dropping BKL on ->release() helps to get rid of that (we still
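A hypothetical ->llseek() along the lines suggested in the hunk above, using i_size_read() instead of taking the inode mutex; as the text notes, file->f_pos itself remains unprotected against concurrent updates. The name is invented for illustration.

    #include <linux/errno.h>
    #include <linux/fs.h>

    static loff_t examplefs_llseek(struct file *file, loff_t offset, int origin)
    {
        struct inode *inode = file->f_path.dentry->d_inode;

        switch (origin) {
        case SEEK_END:
            offset += i_size_read(inode);   /* no i_mutex needed for i_size */
            break;
        case SEEK_CUR:
            offset += file->f_pos;
            break;
        }
        if (offset < 0)
            return -EINVAL;
        file->f_pos = offset;               /* not protected here, see above */
        return offset;
    }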
@@ -460,13 +461,6 @@ in sys_read() and friends.
 
 --------------------------- dquot_operations -------------------------------
 prototypes:
-	int (*initialize) (struct inode *, int);
-	int (*drop) (struct inode *);
-	int (*alloc_space) (struct inode *, qsize_t, int);
-	int (*alloc_inode) (const struct inode *, unsigned long);
-	int (*free_space) (struct inode *, qsize_t);
-	int (*free_inode) (const struct inode *, unsigned long);
-	int (*transfer) (struct inode *, struct iattr *);
 	int (*write_dquot) (struct dquot *);
 	int (*acquire_dquot) (struct dquot *);
 	int (*release_dquot) (struct dquot *);
@@ -479,13 +473,6 @@ a proper locking wrt the filesystem and call the generic quota operations.
 What filesystem should expect from the generic quota functions:
 
 		FS recursion	Held locks when called
-initialize:	yes		maybe dqonoff_sem
-drop:		yes		-
-alloc_space:	->mark_dirty()	-
-alloc_inode:	->mark_dirty()	-
-free_space:	->mark_dirty()	-
-free_inode:	->mark_dirty()	-
-transfer:	yes		-
 write_dquot:	yes		dqonoff_sem or dqptr_sem
 acquire_dquot:	yes		dqonoff_sem or dqptr_sem
 release_dquot:	yes		dqonoff_sem or dqptr_sem
@@ -495,10 +482,6 @@ write_info:	yes		dqonoff_sem
 FS recursion means calling ->quota_read() and ->quota_write() from superblock
 operations.
 
-->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called
-only directly by the filesystem and do not call any fs functions only
-the ->mark_dirty() operation.
-
 More details about quota locking can be found in fs/dquot.c.
 
 --------------------------- vm_operations_struct -----------------------------
diff --git a/Documentation/filesystems/Makefile b/Documentation/filesystems/Makefile
new file mode 100644
index 000000000000..a5dd114da14f
--- /dev/null
+++ b/Documentation/filesystems/Makefile
@@ -0,0 +1,8 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-y := dnotify_test
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
diff --git a/Documentation/filesystems/autofs4-mount-control.txt b/Documentation/filesystems/autofs4-mount-control.txt
index 8f78ded4b648..51986bf08a4d 100644
--- a/Documentation/filesystems/autofs4-mount-control.txt
+++ b/Documentation/filesystems/autofs4-mount-control.txt
@@ -146,7 +146,7 @@ found to be inadequate, in this case. The Generic Netlink system was
 used for this as raw Netlink would lead to a significant increase in
 complexity. There's no question that the Generic Netlink system is an
 elegant solution for common case ioctl functions but it's not a complete
-replacement probably because it's primary purpose in life is to be a
+replacement probably because its primary purpose in life is to be a
 message bus implementation rather than specifically an ioctl replacement.
 While it would be possible to work around this there is one concern
 that lead to the decision to not use it. This is that the autofs
diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt
new file mode 100644
index 000000000000..763d8ebbbebd
--- /dev/null
+++ b/Documentation/filesystems/ceph.txt
@@ -0,0 +1,140 @@
+Ceph Distributed File System
+============================
+
+Ceph is a distributed network file system designed to provide good
+performance, reliability, and scalability.
+
+Basic features include:
+
+ * POSIX semantics
+ * Seamless scaling from 1 to many thousands of nodes
+ * High availability and reliability.  No single point of failure.
+ * N-way replication of data across storage nodes
+ * Fast recovery from node failures
+ * Automatic rebalancing of data on node addition/removal
+ * Easy deployment: most FS components are userspace daemons
+
+Also,
+ * Flexible snapshots (on any directory)
+ * Recursive accounting (nested files, directories, bytes)
+
+In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
+on symmetric access by all clients to shared block devices, Ceph
+separates data and metadata management into independent server
+clusters, similar to Lustre.  Unlike Lustre, however, metadata and
+storage nodes run entirely as user space daemons.  Storage nodes
+utilize btrfs to store data objects, leveraging its advanced features
+(checksumming, metadata replication, etc.).  File data is striped
+across storage nodes in large chunks to distribute workload and
+facilitate high throughputs.  When storage nodes fail, data is
+re-replicated in a distributed fashion by the storage nodes themselves
+(with some minimal coordination from a cluster monitor), making the
+system extremely efficient and scalable.
+
+Metadata servers effectively form a large, consistent, distributed
+in-memory cache above the file namespace that is extremely scalable,
+dynamically redistributes metadata in response to workload changes,
+and can tolerate arbitrary (well, non-Byzantine) node failures.  The
+metadata server takes a somewhat unconventional approach to metadata
+storage to significantly improve performance for common workloads.  In
+particular, inodes with only a single link are embedded in
+directories, allowing entire directories of dentries and inodes to be
+loaded into its cache with a single I/O operation.  The contents of
+extremely large directories can be fragmented and managed by
+independent metadata servers, allowing scalable concurrent access.
+
+The system offers automatic data rebalancing/migration when scaling
+from a small cluster of just a few nodes to many hundreds, without
+requiring an administrator to carve the data set into static volumes or
+go through the tedious process of migrating data between servers.
+When the file system approaches full, new nodes can be easily added
+and things will "just work."
+
+Ceph includes a flexible snapshot mechanism that allows a user to create
+a snapshot on any subdirectory (and its nested contents) in the
+system.  Snapshot creation and deletion are as simple as 'mkdir
+.snap/foo' and 'rmdir .snap/foo'.
+
+Ceph also provides some recursive accounting on directories for nested
+files and bytes.  That is, a 'getfattr -d foo' on any directory in the
+system will reveal the total number of nested regular files and
+subdirectories, and a summation of all nested file sizes.  This makes
+the identification of large disk space consumers relatively quick, as
+no 'du' or similar recursive scan of the file system is required.
+
+
+Mount Syntax
+============
+
+The basic mount syntax is:
+
+ # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt
+
+You only need to specify a single monitor, as the client will get the
+full list when it connects.  (However, if the monitor you specify
+happens to be down, the mount won't succeed.)  The port can be left
+off if the monitor is using the default.  So if the monitor is at
+1.2.3.4,
+
+ # mount -t ceph 1.2.3.4:/ /mnt/ceph
+
+is sufficient.  If /sbin/mount.ceph is installed, a hostname can be
+used instead of an IP address.
+
+
+
+Mount Options
+=============
+
+  ip=A.B.C.D[:N]
+	Specify the IP and/or port the client should bind to locally.
+	There is normally not much reason to do this.  If the IP is not
+	specified, the client's IP address is determined by looking at the
+	address its connection to the monitor originates from.
+
+  wsize=X
+	Specify the maximum write size in bytes.  By default there is no
+	maximum.  Ceph will normally size writes based on the file stripe
+	size.
+
+  rsize=X
+	Specify the maximum readahead.
+
+  mount_timeout=X
+	Specify the timeout value for mount (in seconds), in the case
+	of a non-responsive Ceph file system.  The default is 30
+	seconds.
+
+  rbytes
+	When stat() is called on a directory, set st_size to 'rbytes',
+	the summation of file sizes over all files nested beneath that
+	directory.  This is the default.
+
+  norbytes
+	When stat() is called on a directory, set st_size to the
+	number of entries in that directory.
+
+  nocrc
+	Disable CRC32C calculation for data writes.  If set, the storage node
+	must rely on TCP's error correction to detect data corruption
+	in the data payload.
+
+  noasyncreaddir
+	Disable the client's use of its local cache to satisfy readdir
+	requests.  (This does not change correctness; the client uses
+	cached metadata only when a lease or capability ensures it is
+	valid.)
+
+
+More Information
+================
+
+For more information on Ceph, see the home page at
+	http://ceph.newdream.net/
+
+The Linux kernel client source tree is available at
+	git://ceph.newdream.net/git/ceph-client.git
+	git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
+
+and the source for the full system is at
+	git://ceph.newdream.net/git/ceph.git
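The new ceph.txt above describes recursive accounting exposed as extended attributes ('getfattr -d foo'). A rough userspace sketch of the same query from C, simply listing whatever attributes the client exposes on a directory, is shown below; the directory path is a placeholder and no specific attribute names are assumed.

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/xattr.h>

    int main(void)
    {
        char names[4096];
        ssize_t off, len = listxattr("/mnt/ceph/some/dir", names, sizeof(names));

        if (len < 0) {
            perror("listxattr");
            return 1;
        }
        /* the buffer holds a sequence of NUL-terminated attribute names */
        for (off = 0; off < len; off += strlen(names + off) + 1)
            printf("%s\n", names + off);
        return 0;
    }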
diff --git a/Documentation/filesystems/dentry-locking.txt b/Documentation/filesystems/dentry-locking.txt
index 4c0c575a4012..79334ed5daa7 100644
--- a/Documentation/filesystems/dentry-locking.txt
+++ b/Documentation/filesystems/dentry-locking.txt
@@ -62,7 +62,8 @@ changes are :
 2. Insertion of a dentry into the hash table is done using
    hlist_add_head_rcu() which take care of ordering the writes - the
    writes to the dentry must be visible before the dentry is
-   inserted. This works in conjunction with hlist_for_each_rcu() while
+   inserted. This works in conjunction with hlist_for_each_rcu(),
+   which has since been replaced by hlist_for_each_entry_rcu(), while
    walking the hash chain. The only requirement is that all
    initialization to the dentry must be done before
    hlist_add_head_rcu() since we don't have dcache_lock protection
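To illustrate the insert/lookup pairing the dentry-locking hunk above talks about, here is a generic (non-dcache) sketch of hlist_add_head_rcu() paired with an RCU-protected chain walk. It assumes the 4-argument hlist_for_each_entry_rcu() form of 2.6-era kernels (later kernels dropped the separate hlist_node iterator), and a real caller would take a reference on the object before leaving the RCU read-side critical section.

    #include <linux/rculist.h>
    #include <linux/rcupdate.h>
    #include <linux/spinlock.h>

    struct obj {
        int key;
        struct hlist_node hash;
    };

    static void obj_insert(struct hlist_head *chain, struct obj *o, spinlock_t *lock)
    {
        spin_lock(lock);
        /* all fields of *o must be initialized before this point */
        hlist_add_head_rcu(&o->hash, chain);
        spin_unlock(lock);
    }

    static struct obj *obj_lookup(struct hlist_head *chain, int key)
    {
        struct obj *o, *found = NULL;
        struct hlist_node *pos;

        rcu_read_lock();
        hlist_for_each_entry_rcu(o, pos, chain, hash) {
            if (o->key == key) {
                found = o;      /* take a reference here in real code */
                break;
            }
        }
        rcu_read_unlock();
        return found;
    }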
diff --git a/Documentation/filesystems/dlmfs.txt b/Documentation/filesystems/dlmfs.txt
index c50bbb2d52b4..1b528b2ad809 100644
--- a/Documentation/filesystems/dlmfs.txt
+++ b/Documentation/filesystems/dlmfs.txt
@@ -47,7 +47,7 @@ You'll want to start heartbeating on a volume which all the nodes in
 your lockspace can access. The easiest way to do this is via
 ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires
 that an OCFS2 file system be in place so that it can automatically
-find it's heartbeat area, though it will eventually support heartbeat
+find its heartbeat area, though it will eventually support heartbeat
 against raw disks.
 
 Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed
diff --git a/Documentation/filesystems/dnotify.txt b/Documentation/filesystems/dnotify.txt
index 9f5d338ddbb8..6baf88f46859 100644
--- a/Documentation/filesystems/dnotify.txt
+++ b/Documentation/filesystems/dnotify.txt
@@ -62,38 +62,9 @@ disabled, fcntl(fd, F_NOTIFY, ...) will return -EINVAL.
 
 Example
 -------
+See Documentation/filesystems/dnotify_test.c for an example.
 
-	#define _GNU_SOURCE	/* needed to get the defines */
-	#include <fcntl.h>	/* in glibc 2.2 this has the needed
-				   values defined */
-	#include <signal.h>
-	#include <stdio.h>
-	#include <unistd.h>
-
-	static volatile int event_fd;
-
-	static void handler(int sig, siginfo_t *si, void *data)
-	{
-		event_fd = si->si_fd;
-	}
-
-	int main(void)
-	{
-		struct sigaction act;
-		int fd;
-
-		act.sa_sigaction = handler;
-		sigemptyset(&act.sa_mask);
-		act.sa_flags = SA_SIGINFO;
-		sigaction(SIGRTMIN + 1, &act, NULL);
-
-		fd = open(".", O_RDONLY);
-		fcntl(fd, F_SETSIG, SIGRTMIN + 1);
-		fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT);
-		/* we will now be notified if any of the files
-		   in "." is modified or new files are created */
-		while (1) {
-			pause();
-			printf("Got event on fd=%d\n", event_fd);
-		}
-	}
+NOTE
+----
+Beginning with Linux 2.6.13, dnotify has been replaced by inotify.
+See Documentation/filesystems/inotify.txt for more information on it.
diff --git a/Documentation/filesystems/dnotify_test.c b/Documentation/filesystems/dnotify_test.c
new file mode 100644
index 000000000000..8b37b4a1e18d
--- /dev/null
+++ b/Documentation/filesystems/dnotify_test.c
@@ -0,0 +1,34 @@
+#define _GNU_SOURCE	/* needed to get the defines */
+#include <fcntl.h>	/* in glibc 2.2 this has the needed
+			   values defined */
+#include <signal.h>
+#include <stdio.h>
+#include <unistd.h>
+
+static volatile int event_fd;
+
+static void handler(int sig, siginfo_t *si, void *data)
+{
+	event_fd = si->si_fd;
+}
+
+int main(void)
+{
+	struct sigaction act;
+	int fd;
+
+	act.sa_sigaction = handler;
+	sigemptyset(&act.sa_mask);
+	act.sa_flags = SA_SIGINFO;
+	sigaction(SIGRTMIN + 1, &act, NULL);
+
+	fd = open(".", O_RDONLY);
+	fcntl(fd, F_SETSIG, SIGRTMIN + 1);
+	fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT);
+	/* we will now be notified if any of the files
+	   in "." is modified or new files are created */
+	while (1) {
+		pause();
+		printf("Got event on fd=%d\n", event_fd);
+	}
+}
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 867c5b50cb42..272f80d5f966 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -59,8 +59,19 @@ commit=nrsec	(*)	Ext3 can be told to sync all its data and metadata
 			Setting it to very large values will improve
 			performance.
 
-barrier=1		This enables/disables barriers.  barrier=0 disables
-			it, barrier=1 enables it.
+barrier=<0(*)|1>	This enables/disables the use of write barriers in
+barrier			the jbd code.  barrier=0 disables, barrier=1 enables.
+nobarrier	(*)	This also requires an IO stack which can support
+			barriers, and if jbd gets an error on a barrier
+			write, it will disable again with a warning.
+			Write barriers enforce proper on-disk ordering
+			of journal commits, making volatile disk write caches
+			safe to use, at some performance penalty.  If
+			your disks are battery-backed in one way or another,
+			disabling barriers may safely improve performance.
+			The mount options "barrier" and "nobarrier" can
+			also be used to enable or disable barriers, for
+			consistency with other ext3 mount options.
 
 orlov		(*)	This enables the new Orlov block allocator. It is
 			enabled by default.
diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.txt
index 606233cd4618..1b805a0efbb0 100644
--- a/Documentation/filesystems/fiemap.txt
+++ b/Documentation/filesystems/fiemap.txt
@@ -38,7 +38,7 @@ flags, it will return EBADR and the contents of fm_flags will contain
 the set of flags which caused the error. If the kernel is compatible
 with all flags passed, the contents of fm_flags will be unmodified.
 It is up to userspace to determine whether rejection of a particular
-flag is fatal to it's operation. This scheme is intended to allow the
+flag is fatal to its operation. This scheme is intended to allow the
 fiemap interface to grow in the future but without losing
 compatibility with old software.
 
@@ -56,7 +56,7 @@ If this flag is set, the kernel will sync the file before mapping extents.
 
 * FIEMAP_FLAG_XATTR
 If this flag is set, the extents returned will describe the inodes
-extended attribute lookup tree, instead of it's data tree.
+extended attribute lookup tree, instead of its data tree.
 
 
 Extent Mapping
@@ -89,7 +89,7 @@ struct fiemap_extent {
 };
 
 All offsets and lengths are in bytes and mirror those on disk.  It is valid
-for an extents logical offset to start before the request or it's logical
+for an extents logical offset to start before the request or its logical
 length to extend past the request.  Unless FIEMAP_EXTENT_NOT_ALIGNED is
 returned, fe_logical, fe_physical, and fe_length will be aligned to the
 block size of the file system.  With the exception of extents flagged as
@@ -125,7 +125,7 @@ been allocated for the file yet.
 
 * FIEMAP_EXTENT_DELALLOC
   - This will also set FIEMAP_EXTENT_UNKNOWN.
-Delayed allocation - while there is data for this extent, it's
+Delayed allocation - while there is data for this extent, its
 physical location has not been allocated yet.
 
 * FIEMAP_EXTENT_ENCODED
@@ -159,7 +159,7 @@ Data is located within a meta data block.
 Data is packed into a block with data from other files.
 
 * FIEMAP_EXTENT_UNWRITTEN
-Unwritten extent - the extent is allocated but it's data has not been
+Unwritten extent - the extent is allocated but its data has not been
 initialized. This indicates the extent's data will be all zero if read
 through the filesystem but the contents are undefined if read directly from
 the device.
@@ -176,7 +176,7 @@ VFS -> File System Implementation
 
 File systems wishing to support fiemap must implement a ->fiemap callback on
 their inode_operations structure. The fs ->fiemap call is responsible for
-defining it's set of supported fiemap flags, and calling a helper function on
-each discovered extent:
+defining its set of supported fiemap flags, and calling a helper function on
 each discovered extent:
 
 struct inode_operations {
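Since the fiemap.txt hunks above concern the userspace-facing ioctl, a small userspace sketch of invoking it may help. The file path is a placeholder and error handling is minimal; this is an illustration of the interface described in the document, not part of the commit.

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(void)
    {
        struct fiemap *fm;
        unsigned int i;
        int fd = open("/some/file", O_RDONLY);

        if (fd < 0) { perror("open"); return 1; }

        /* room for up to 32 extents after the fixed header */
        fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
        if (!fm) return 1;
        fm->fm_start = 0;
        fm->fm_length = ~0ULL;            /* map the whole file */
        fm->fm_flags = FIEMAP_FLAG_SYNC;  /* sync the file first */
        fm->fm_extent_count = 32;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) { perror("ioctl"); return 1; }

        for (i = 0; i < fm->fm_mapped_extents; i++)
            printf("logical %llu physical %llu length %llu flags 0x%x\n",
                   (unsigned long long)fm->fm_extents[i].fe_logical,
                   (unsigned long long)fm->fm_extents[i].fe_physical,
                   (unsigned long long)fm->fm_extents[i].fe_length,
                   fm->fm_extents[i].fe_flags);
        return 0;
    }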
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt
index 397a41adb4c3..13af4a49e7db 100644
--- a/Documentation/filesystems/fuse.txt
+++ b/Documentation/filesystems/fuse.txt
@@ -91,7 +91,7 @@ Mount options
 'default_permissions'
 
   By default FUSE doesn't check file access permissions, the
-  filesystem is free to implement it's access policy or leave it to
+  filesystem is free to implement its access policy or leave it to
   the underlying file access mechanism (e.g. in case of network
   filesystems).  This option enables permission checking, restricting
   access based on file mode.  It is usually useful together with the
@@ -171,7 +171,7 @@ or may honor them by sending a reply to the _original_ request, with
 the error set to EINTR.
 
 It is also possible that there's a race between processing the
-original request and it's INTERRUPT request.  There are two possibilities:
+original request and its INTERRUPT request.  There are two possibilities:
 
 1) The INTERRUPT request is processed before the original request is
    processed
diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.txt
index 5e3ab8f3beff..0b59c0200912 100644
--- a/Documentation/filesystems/gfs2.txt
+++ b/Documentation/filesystems/gfs2.txt
@@ -1,7 +1,7 @@
 Global File System
 ------------------
 
-http://sources.redhat.com/cluster/
+http://sources.redhat.com/cluster/wiki/
 
 GFS is a cluster file system. It allows a cluster of computers to
 simultaneously use a block device that is shared between them (with FC,
@@ -36,11 +36,11 @@ GFS2 is not on-disk compatible with previous versions of GFS, but it
 is pretty close.
 
 The following man pages can be found at the URL above:
-  fsck.gfs2	to repair a filesystem
-  gfs2_grow	to expand a filesystem online
-  gfs2_jadd	to add journals to a filesystem online
-  gfs2_tool	to manipulate, examine and tune a filesystem
-  gfs2_quota	to examine and change quota values in a filesystem
+  fsck.gfs2		to repair a filesystem
+  gfs2_grow		to expand a filesystem online
+  gfs2_jadd		to add journals to a filesystem online
+  gfs2_tool		to manipulate, examine and tune a filesystem
+  gfs2_quota		to examine and change quota values in a filesystem
   gfs2_convert		to convert a gfs filesystem to gfs2 in-place
   mount.gfs2		to help mount(8) mount a filesystem
   mkfs.gfs2		to make a filesystem
diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.txt
index fa45c3baed98..74630bd504fb 100644
--- a/Documentation/filesystems/hpfs.txt
+++ b/Documentation/filesystems/hpfs.txt
@@ -103,7 +103,7 @@ to analyze or change OS2SYS.INI.
 Codepages
 
 HPFS can contain several uppercasing tables for several codepages and each
-file has a pointer to codepage it's name is in. However OS/2 was created in
+file has a pointer to codepage its name is in. However OS/2 was created in
 America where people don't care much about codepages and so multiple codepages
 support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk.
 Once I booted English OS/2 working in cp 850 and I created a file on my 852
diff --git a/Documentation/filesystems/logfs.txt b/Documentation/filesystems/logfs.txt
new file mode 100644
index 000000000000..bca42c22a143
--- /dev/null
+++ b/Documentation/filesystems/logfs.txt
@@ -0,0 +1,241 @@
+
+The LogFS Flash Filesystem
+==========================
+
+Specification
+=============
+
+Superblocks
+-----------
+
+Two superblocks exist at the beginning and end of the filesystem.
+Each superblock is 256 Bytes large, with another 3840 Bytes reserved
+for future purposes, making a total of 4096 Bytes.
+
+Superblock locations may differ for MTD and block devices.  On MTD the
+first non-bad block contains a superblock in the first 4096 Bytes and
+the last non-bad block contains a superblock in the last 4096 Bytes.
+On block devices, the first 4096 Bytes of the device contain the first
+superblock and the last aligned 4096 Byte-block contains the second
+superblock.
+
+For the most part, the superblocks can be considered read-only.  They
+are written only to correct errors detected within the superblocks,
+move the journal and change the filesystem parameters through tunefs.
+As a result, the superblock does not contain any fields that require
+constant updates, like the amount of free space, etc.
+
+Segments
+--------
+
+The space in the device is split up into equal-sized segments.
+Segments are the primary write unit of LogFS.  Within each segment,
+writes happen from front (low addresses) to back (high addresses).  If
+only a partial segment has been written, the segment number, the
+current position within and optionally a write buffer are stored in
+the journal.
+
+Segments are erased as a whole.  Therefore Garbage Collection may be
+required to completely free a segment before doing so.
+
+Journal
+--------
+
+The journal contains all global information about the filesystem that
+is subject to frequent change.  At mount time, it has to be scanned
+for the most recent commit entry, which contains a list of pointers to
+all currently valid entries.
+
+Object Store
+------------
+
+All space except for the superblocks and journal is part of the object
+store.  Each segment contains a segment header and a number of
+objects, each consisting of the object header and the payload.
+Objects are either inodes, directory entries (dentries), file data
+blocks or indirect blocks.
+
+Levels
+------
+
+Garbage collection (GC) may fail if all data is written
+indiscriminately.  One requirement of GC is that data is separated
+roughly according to the distance between the tree root and the data.
+Effectively that means all file data is on level 0, indirect blocks
+are on levels 1, 2, 3, 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks,
+respectively.  Inode file data is on level 6 for the inodes and 7-11
+for indirect blocks.
+
+Each segment contains objects of a single level only.  As a result,
+each level requires its own separate segment to be open for writing.
+
+Inode File
+----------
+
+All inodes are stored in a special file, the inode file.  The single
+exception is the inode file's inode (master inode) which for obvious
+reasons is stored in the journal instead.  Instead of data blocks, the
+leaf nodes of the inode file are inodes.
+
+Aliases
+-------
+
+Writes in LogFS are done by means of a wandering tree.  A naïve
+implementation would require that for each write of a block, all
+parent blocks are written as well, since the block pointers have
+changed.  Such an implementation would not be very efficient.
+
+In LogFS, the block pointer changes are cached in the journal by means
+of alias entries.  Each alias consists of its logical address - inode
+number, block index, level and child number (index into block) - and
+the changed data.  Any 8-byte word can be changed in this manner.
+
+Currently aliases are used for block pointers, file size, file used
+bytes and the height of an inode's indirect tree.
+
+Segment Aliases
+---------------
+
+Related to regular aliases, these are used to handle bad blocks.
+Initially, bad blocks are handled by moving the affected segment
+content to a spare segment and noting this move in the journal with a
+segment alias, a simple (to, from) tuple.  GC will later empty this
+segment and the alias can be removed again.  This is used on MTD only.
+
+Vim
+---
+
+By cleverly predicting the life time of data, it is possible to
+separate long-living data from short-living data and thereby reduce
+the GC overhead later.  Each type of distinct life expectancy (vim) can
+have a separate segment open for writing.  Each (level, vim) tuple can
+be open just once.  If an open segment with unknown vim is encountered
+at mount time, it is closed and ignored henceforth.
+
+Indirect Tree
+-------------
+
+Inodes in LogFS are similar to FFS-style filesystems with direct and
+indirect block pointers.  One difference is that LogFS uses a single
+indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
+A height field in the inode defines the height of the indirect tree
+and thereby the indirection of the pointer.
+
+Another difference is the addressing of indirect blocks.  In LogFS,
+the first 16 pointers in the first indirect block are left empty,
+corresponding to the 16 direct pointers in the inode.  In ext2 (maybe
+others as well) the first pointer in the first indirect block
+corresponds to logical block 12, skipping the 12 direct pointers.
+So where ext2 is using arithmetic to better utilize space, LogFS keeps
+arithmetic simple and uses compression to save space.
+
+Compression
+-----------
+
+Both file data and metadata can be compressed.  Compression for file
+data can be enabled with chattr +c and disabled with chattr -c.  Doing
+so has no effect on existing data, but new data will be stored
+accordingly.  New inodes will inherit the compression flag of the
+parent directory.
+
+Metadata is always compressed.  However, the space accounting ignores
+this and charges for the uncompressed size.  Failing to do so could
+result in GC failures when, after moving some data, indirect blocks
+compress worse than previously.  Even on a 100% full medium, GC may
+not consume any extra space, so the compression gains are lost space
+to the user.
+
+However, they are not lost space to the filesystem internals.  By
+cheating the user for those bytes, the filesystem gained some slack
+space and GC will run less often and faster.
+
+Garbage Collection and Wear Leveling
+------------------------------------
+
+Garbage collection is invoked whenever the number of free segments
+falls below a threshold.  The best (known) candidate is picked based
+on the least amount of valid data contained in the segment.  All
+remaining valid data is copied elsewhere, thereby invalidating it.
+
+The GC code also checks for aliases and writes them back if their
+number gets too large.
+
+Wear leveling is done by occasionally picking a suboptimal segment for
+garbage collection.  If a stale segment's erase count is significantly
+lower than the active segments' erase counts, it will be picked.  Wear
+leveling is rate limited, so it will never monopolize the device for
+more than one segment worth at a time.
+
+Values for "occasionally" and "significantly lower" are compile time
+constants.
+
+Hashed directories
+------------------
+
+To satisfy efficient lookup(), directory entries are hashed and
+located based on the hash.  In order to both support large directories
+and not be overly inefficient for small directories, several hash
+tables of increasing size are used.  For each table, the hash value
+modulo the table size gives the table index.
+
+Table sizes are chosen to limit the number of indirect blocks with a
+fully populated table to 0, 1, 2 or 3 respectively.  So the first
+table contains 16 entries, the second 512-16, etc.
+
+The last table is special in several ways.  First its size depends on
+the effective 32bit limit on telldir/seekdir cookies.  Since logfs
+uses the upper half of the address space for indirect blocks, the size
+is limited to 2^31.  Secondly the table contains hash buckets with 16
+entries each.
+
+Using single-entry buckets would result in birthday "attacks".  At
+just 2^16 used entries, hash collisions would be likely (P >= 0.5).
+My math skills are insufficient to do the combinatorics for the 17x
+collisions necessary to overflow a bucket, but testing showed that in
+10,000 runs the lowest directory fill before a bucket overflow was
+188,057,130 entries with an average of 315,149,915 entries.  So for
+directory sizes of up to a million, bucket overflows should be
+virtually impossible under normal circumstances.
+
+With carefully chosen filenames, it is obviously possible to cause an
+overflow with just 21 entries (4 higher tables + 16 entries + 1).  So
+there may be a security concern if a malicious user has write access
+to a directory.
+
+Open For Discussion
+===================
+
+Device Address Space
+--------------------
+
+A device address space is used for caching.  Both block devices and
+MTD provide functions to either read a single page or write a segment.
+Partial segments may be written for data integrity, but where possible
+complete segments are written for performance on simple block device
+flash media.
+
+Meta Inodes
+-----------
+
+Inodes are stored in the inode file, which is just a regular file for
+most purposes.  At umount time, however, the inode file needs to
+remain open until all dirty inodes are written.  So
+generic_shutdown_super() may not close this inode, but shouldn't
+complain about remaining inodes due to the inode file either.  Same
+goes for the mapping inode of the device address space.
+
+Currently logfs uses a hack that essentially copies part of fs/inode.c
+code over.  A general solution would be preferred.
+
+Indirect block mapping
+----------------------
+
+With compression, the block device (or mapping inode) cannot be used
+to cache indirect blocks.  Some other place is required.  Currently
+logfs uses the top half of each inode's address space.  The low 8TB
+(on 32bit) are filled with file data, the high 8TB are used for
+indirect blocks.
+
+One problem is that 16TB files created on 64bit systems actually have
+data in the top 8TB.  But files >16TB would cause problems anyway, so
+only the limit has changed.
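The "Hashed directories" section of logfs.txt above describes a set of hash tables of increasing size, indexed by hash modulo table size. A toy C sketch of that index selection is shown below; the table sizes beyond the first two and all names are invented for illustration and do not reflect the actual logfs on-disk layout.

    #include <stdint.h>
    #include <stdio.h>

    /* first table: 16 entries, second: 512-16, larger tables are made up */
    static const uint32_t table_size[] = { 16, 512 - 16, 1u << 13, 1u << 22 };
    #define NUM_TABLES (sizeof(table_size) / sizeof(table_size[0]))

    static void locate(uint32_t hash)
    {
        unsigned t;

        for (t = 0; t < NUM_TABLES; t++) {
            uint32_t index = hash % table_size[t];
            printf("table %u: would probe slot %u\n", t, index);
            /* a real lookup stops at the table that holds the entry and
             * only falls through to the next, larger table otherwise */
        }
    }

    int main(void)
    {
        locate(0xdeadbeef);
        return 0;
    }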
diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt
index 1bd0d0c05171..04884914a1c8 100644
--- a/Documentation/filesystems/nfs/nfs41-server.txt
+++ b/Documentation/filesystems/nfs/nfs41-server.txt
@@ -17,8 +17,7 @@ kernels must turn 4.1 on or off *before* turning support for version 4
 on or off; rpc.nfsd does this correctly.)
 
 The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
-on the latest NFSv4.1 Internet Draft:
-http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29
+on RFC 5661.
 
 From the many new features in NFSv4.1 the current implementation
 focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
@@ -44,7 +43,7 @@ interoperability problems with future clients.  Known issues:
 	  trunking, but this is a mandatory feature, and its use is
 	  recommended to clients in a number of places.  (E.g. to ensure
 	  timely renewal in case an existing connection's retry timeouts
-	  have gotten too long; see section 8.3 of the draft.)
+	  have gotten too long; see section 8.3 of the RFC.)
 	  Therefore, lack of this feature may cause future clients to
 	  fail.
 	- Incomplete backchannel support: incomplete backchannel gss
@@ -138,7 +137,7 @@ NS*| OPENATTR | OPT | | Section 18.17 |
    | READ | REQ | | Section 18.22 |
    | READDIR | REQ | | Section 18.23 |
    | READLINK | OPT | | Section 18.24 |
-NS | RECLAIM_COMPLETE | REQ | | Section 18.51 |
+   | RECLAIM_COMPLETE | REQ | | Section 18.51 |
    | RELEASE_LOCKOWNER | MNI | | N/A |
    | REMOVE | REQ | | Section 18.25 |
    | RENAME | REQ | | Section 18.26 |
diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.txt
index 8a382bea6808..ebcaaee21616 100644
--- a/Documentation/filesystems/nfs/rpc-cache.txt
+++ b/Documentation/filesystems/nfs/rpc-cache.txt
@@ -185,7 +185,7 @@ failed lookup meant a definite 'no'.
 request/response format
 -----------------------
 
-While each cache is free to use it's own format for requests
+While each cache is free to use its own format for requests
 and responses over channel, the following is recommended as
 appropriate and support routines are available to help:
 Each request or response record should be printable ASCII
diff --git a/Documentation/filesystems/nilfs2.txt b/Documentation/filesystems/nilfs2.txt
index 839efd8a8a8c..d3e7673995eb 100644
--- a/Documentation/filesystems/nilfs2.txt
+++ b/Documentation/filesystems/nilfs2.txt
@@ -50,8 +50,8 @@ NILFS2 supports the following mount options:
 (*) == default
 
 nobarrier		Disables barriers.
-errors=continue(*)	Keep going on a filesystem error.
-errors=remount-ro	Remount the filesystem read-only on an error.
+errors=continue		Keep going on a filesystem error.
+errors=remount-ro(*)	Remount the filesystem read-only on an error.
 errors=panic		Panic and halt the machine if an error occurs.
 cp=n			Specify the checkpoint-number of the snapshot to be
 			mounted. Checkpoints and snapshots are listed by lscp
@@ -74,6 +74,9 @@ norecovery		Disable recovery of the filesystem on mount.
 			This disables every write access on the device for
 			read-only mounts or snapshots. This option will fail
 			for r/w mounts on an unclean volume.
+discard			Issue discard/TRIM commands to the underlying block
+			device when blocks are freed.  This is useful for SSD
+			devices and sparse/thinly-provisioned LUNs.
 
 NILFS2 usage
 ============
diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt
index c58b9f5ba002..1f7ae144f6d8 100644
--- a/Documentation/filesystems/ocfs2.txt
+++ b/Documentation/filesystems/ocfs2.txt
@@ -80,3 +80,10 @@ user_xattr	(*)	Enables Extended User Attributes.
 nouser_xattr		Disables Extended User Attributes.
 acl			Enables POSIX Access Control Lists support.
 noacl		(*)	Disables POSIX Access Control Lists support.
+resv_level=2	(*)	Set how aggressive allocation reservations will be.
+			Valid values are between 0 (reservations off) to 8
+			(maximum space for reservations).
+dir_resv_level=	(*)	By default, directory reservations will scale with file
+			reservations - users should rarely need to change this
+			value. If allocation reservations are turned off, this
+			option will have no effect.
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 0d07513a67a6..9fb6cbe70bde 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -164,6 +164,7 @@ read the file /proc/PID/status:
   VmExe:        68 kB
   VmLib:      1412 kB
   VmPTE:        20 kb
+  VmSwap:        0 kB
   Threads:        1
   SigQ:   0/28578
   SigPnd: 0000000000000000
@@ -188,7 +189,13 @@ memory usage. Its seven fields are explained in Table 1-3.  The stat file
 contains details information about the process itself.  Its fields are
 explained in Table 1-4.
 
-Table 1-2: Contents of the statm files (as of 2.6.30-rc7)
+(for SMP CONFIG users)
+To make accounting scalable, RSS-related information is handled in an
+asynchronous manner and the value may not be very precise. To see a precise
+snapshot of a moment, you can read the /proc/<pid>/smaps file and scan the
+page table.  It is slow but very precise.
+
+Table 1-2: Contents of the status files (as of 2.6.30-rc7)
 ..............................................................................
  Field                       Content
  Name                        filename of the executable
@@ -213,6 +220,7 @@ Table 1-2: Contents of the status files (as of 2.6.30-rc7)
  VmExe                       size of text segment
  VmLib                       size of shared library code
  VmPTE                       size of page table entries
+ VmSwap                      size of swap usage (the number of referred swapents)
 Threads                     number of threads
 SigQ                        number of signals queued/max. number for queue
 SigPnd                      bitmap of pending signals for the thread
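The hunks above add a VmSwap: line to /proc/PID/status. A small illustrative C reader of that field is sketched below; it is not part of the commit and simply shows how the new line can be picked out of the status file.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");

        if (!f) { perror("fopen"); return 1; }
        while (fgets(line, sizeof(line), f)) {
            /* print only the swap-usage line added by this change */
            if (strncmp(line, "VmSwap:", 7) == 0)
                fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }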
@@ -297,7 +305,7 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7) | |||
297 | cgtime guest time of the task children in jiffies | 305 | cgtime guest time of the task children in jiffies |
298 | .............................................................................. | 306 | .............................................................................. |
299 | 307 | ||
300 | The /proc/PID/map file containing the currently mapped memory regions and | 308 | The /proc/PID/maps file contains the currently mapped memory regions and |
301 | their access permissions. | 309 | their access permissions. |
302 | 310 | ||
303 | The format is: | 311 | The format is: |
@@ -308,7 +316,7 @@ address perms offset dev inode pathname | |||
308 | 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test | 316 | 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test |
309 | 0804a000-0806b000 rw-p 00000000 00:00 0 [heap] | 317 | 0804a000-0806b000 rw-p 00000000 00:00 0 [heap] |
310 | a7cb1000-a7cb2000 ---p 00000000 00:00 0 | 318 | a7cb1000-a7cb2000 ---p 00000000 00:00 0 |
311 | a7cb2000-a7eb2000 rw-p 00000000 00:00 0 [threadstack:001ff4b4] | 319 | a7cb2000-a7eb2000 rw-p 00000000 00:00 0 |
312 | a7eb2000-a7eb3000 ---p 00000000 00:00 0 | 320 | a7eb2000-a7eb3000 ---p 00000000 00:00 0 |
313 | a7eb3000-a7ed5000 rw-p 00000000 00:00 0 | 321 | a7eb3000-a7ed5000 rw-p 00000000 00:00 0 |
314 | a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 | 322 | a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 |
@@ -344,7 +352,6 @@ is not associated with a file: | |||
344 | [stack] = the stack of the main process | 352 | [stack] = the stack of the main process |
345 | [vdso] = the "virtual dynamic shared object", | 353 | [vdso] = the "virtual dynamic shared object", |
346 | the kernel system call handler | 354 | the kernel system call handler |
347 | [threadstack:xxxxxxxx] = the stack of the thread, xxxxxxxx is the stack size | ||
348 | 355 | ||
349 | or if empty, the mapping is anonymous. | 356 | or if empty, the mapping is anonymous. |
350 | 357 | ||
@@ -430,6 +437,7 @@ Table 1-5: Kernel info in /proc | |||
430 | modules List of loaded modules | 437 | modules List of loaded modules |
431 | mounts Mounted filesystems | 438 | mounts Mounted filesystems |
432 | net Networking info (see text) | 439 | net Networking info (see text) |
440 | pagetypeinfo Additional page allocator information (see text) (2.5) | ||
433 | partitions Table of partitions known to the system | 441 | partitions Table of partitions known to the system |
434 | pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, | 442 | pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, |
435 | decoupled by lspci (2.4) | 443 | decoupled by lspci (2.4) |
@@ -557,6 +565,10 @@ The default_smp_affinity mask applies to all non-active IRQs, which are the | |||
557 | IRQs which have not yet been allocated/activated, and hence which lack a | 565 | IRQs which have not yet been allocated/activated, and hence which lack a |
558 | /proc/irq/[0-9]* directory. | 566 | /proc/irq/[0-9]* directory. |
559 | 567 | ||
568 | The node file on an SMP system shows the node to which the device using the IRQ | ||
569 | reports itself as being attached. This hardware locality information does not | ||
570 | include information about any possible driver locality preference. | ||
571 | |||
560 | prof_cpu_mask specifies which CPUs are to be profiled by the system wide | 572 | prof_cpu_mask specifies which CPUs are to be profiled by the system wide |
561 | profiler. Default value is ffffffff (all cpus). | 573 | profiler. Default value is ffffffff (all cpus). |
562 | 574 | ||
@@ -584,7 +596,7 @@ Node 0, zone DMA 0 4 5 4 4 3 ... | |||
584 | Node 0, zone Normal 1 0 0 1 101 8 ... | 596 | Node 0, zone Normal 1 0 0 1 101 8 ... |
585 | Node 0, zone HighMem 2 0 0 1 1 0 ... | 597 | Node 0, zone HighMem 2 0 0 1 1 0 ... |
586 | 598 | ||
587 | Memory fragmentation is a problem under some workloads, and buddyinfo is a | 599 | External fragmentation is a problem under some workloads, and buddyinfo is a |
588 | useful tool for helping diagnose these problems. Buddyinfo will give you a | 600 | useful tool for helping diagnose these problems. Buddyinfo will give you a |
589 | clue as to how big an area you can safely allocate, or why a previous | 601 | clue as to how big an area you can safely allocate, or why a previous |
590 | allocation failed. | 602 | allocation failed. |
@@ -594,6 +606,48 @@ available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in | |||
594 | ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE | 606 | ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE |
595 | available in ZONE_NORMAL, etc... | 607 | available in ZONE_NORMAL, etc... |
596 | 608 | ||
609 | More information relevant to external fragmentation can be found in | ||
610 | pagetypeinfo. | ||
611 | |||
612 | > cat /proc/pagetypeinfo | ||
613 | Page block order: 9 | ||
614 | Pages per block: 512 | ||
615 | |||
616 | Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 | ||
617 | Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0 | ||
618 | Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 | ||
619 | Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2 | ||
620 | Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0 | ||
621 | Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 | ||
622 | Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9 | ||
623 | Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0 | ||
624 | Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452 | ||
625 | Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0 | ||
626 | Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0 | ||
627 | |||
628 | Number of blocks type Unmovable Reclaimable Movable Reserve Isolate | ||
629 | Node 0, zone DMA 2 0 5 1 0 | ||
630 | Node 0, zone DMA32 41 6 967 2 0 | ||
631 | |||
632 | Fragmentation avoidance in the kernel works by grouping pages with the same | ||
633 | migrate type into contiguous regions of memory called page blocks. | ||
634 | A page block is typically the default hugepage size, e.g. 2MB on | ||
635 | x86-64. By keeping pages grouped based on their ability to move, the kernel | ||
636 | can reclaim pages within a page block to satisfy a high-order allocation. | ||
637 | |||
638 | The pagetypeinfo begins with information on the size of a page block. It | ||
639 | then gives the same type of information as buddyinfo except broken down | ||
640 | by migrate-type and finishes with details on how many page blocks of each | ||
641 | type exist. | ||
642 | |||
643 | If min_free_kbytes has been tuned correctly (recommendations made by hugeadm | ||
644 | from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can | ||
645 | make an estimate of the likely number of huge pages that can be allocated | ||
646 | at a given point in time. All the "Movable" blocks should be allocatable | ||
647 | unless memory has been mlock()'d. Some of the Reclaimable blocks should | ||
648 | also be allocatable although a lot of filesystem metadata may have to be | ||
649 | reclaimed to achieve this. | ||
650 | |||
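As a rough worked example based on the sample output above (assuming 4KB pages, so a 512-page block is 2MB):

        Node 0, zone DMA32: 967 Movable blocks x 2MB per block ~= 1934MB,
        i.e. roughly 967 allocatable 2MB huge pages, provided none of that
        memory has been mlock()'d.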
597 | .............................................................................. | 651 | .............................................................................. |
598 | 652 | ||
599 | meminfo: | 653 | meminfo: |
@@ -914,7 +968,7 @@ your system and how much traffic was routed over those devices: | |||
914 | ...] 1375103 17405 0 0 0 0 0 0 | 968 | ...] 1375103 17405 0 0 0 0 0 0 |
915 | ...] 1703981 5535 0 0 0 3 0 0 | 969 | ...] 1703981 5535 0 0 0 3 0 0 |
916 | 970 | ||
917 | In addition, each Channel Bond interface has it's own directory. For | 971 | In addition, each Channel Bond interface has its own directory. For |
918 | example, the bond0 device will have a directory called /proc/net/bond0/. | 972 | example, the bond0 device will have a directory called /proc/net/bond0/. |
919 | It will contain information that is specific to that bond, such as the | 973 | It will contain information that is specific to that bond, such as the |
920 | current slaves of the bond, the link status of the slaves, and how | 974 | current slaves of the bond, the link status of the slaves, and how |
@@ -1311,7 +1365,7 @@ been accounted as having caused 1MB of write. | |||
1311 | In other words: The number of bytes which this process caused to not happen, | 1365 | In other words: The number of bytes which this process caused to not happen, |
1312 | by truncating pagecache. A task can cause "negative" IO too. If this task | 1366 | by truncating pagecache. A task can cause "negative" IO too. If this task |
1313 | truncates some dirty pagecache, some IO which another task has been accounted | 1367 | truncates some dirty pagecache, some IO which another task has been accounted |
1314 | for (in it's write_bytes) will not be happening. We _could_ just subtract that | 1368 | for (in its write_bytes) will not be happening. We _could_ just subtract that |
1315 | from the truncating task's write_bytes, but there is information loss in doing | 1369 | from the truncating task's write_bytes, but there is information loss in doing |
1316 | that. | 1370 | that. |
1317 | 1371 | ||
diff --git a/Documentation/filesystems/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.txt index 23a181074f94..fc0e39af43c3 100644 --- a/Documentation/filesystems/sharedsubtree.txt +++ b/Documentation/filesystems/sharedsubtree.txt | |||
@@ -837,6 +837,9 @@ replicas continue to be exactly same. | |||
837 | individual lists does not affect propagation or the way propagation | 837 | individual lists does not affect propagation or the way propagation |
838 | tree is modified by operations. | 838 | tree is modified by operations. |
839 | 839 | ||
840 | All vfsmounts in a peer group have the same ->mnt_master. If it is | ||
841 | non-NULL, they form a contiguous (ordered) segment of the slave list. | ||
842 | |||
840 | A example propagation tree looks as shown in the figure below. | 843 | A example propagation tree looks as shown in the figure below. |
841 | [ NOTE: Though it looks like a forest, if we consider all the shared | 844 | [ NOTE: Though it looks like a forest, if we consider all the shared |
842 | mounts as a conceptual entity called 'pnode', it becomes a tree] | 845 | mounts as a conceptual entity called 'pnode', it becomes a tree] |
@@ -874,8 +877,19 @@ replicas continue to be exactly same. | |||
874 | 877 | ||
875 | NOTE: The propagation tree is orthogonal to the mount tree. | 878 | NOTE: The propagation tree is orthogonal to the mount tree. |
876 | 879 | ||
880 | 8B Locking: | ||
881 | |||
882 | ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected | ||
883 | by namespace_sem (exclusive for modifications, shared for reading). | ||
884 | |||
885 | Normally we have ->mnt_flags modifications serialized by vfsmount_lock. | ||
886 | There are two exceptions: do_add_mount() and clone_mnt(). | ||
887 | The former modifies a vfsmount that has not been visible in any shared | ||
888 | data structures yet. | ||
889 | The latter holds namespace_sem and the only references to vfsmount | ||
890 | are in lists that can't be traversed without namespace_sem. | ||
877 | 891 | ||
878 | 8B Algorithm: | 892 | 8C Algorithm: |
879 | 893 | ||
880 | The crux of the implementation resides in rbind/move operation. | 894 | The crux of the implementation resides in rbind/move operation. |
881 | 895 | ||
diff --git a/Documentation/filesystems/smbfs.txt b/Documentation/filesystems/smbfs.txt index f673ef0de0f7..194fb0decd2c 100644 --- a/Documentation/filesystems/smbfs.txt +++ b/Documentation/filesystems/smbfs.txt | |||
@@ -3,6 +3,6 @@ protocol used by Windows for Workgroups, Windows 95 and Windows NT. | |||
3 | Smbfs was inspired by Samba, the program written by Andrew Tridgell | 3 | Smbfs was inspired by Samba, the program written by Andrew Tridgell |
4 | that turns any Unix host into a file server for DOS or Windows clients. | 4 | that turns any Unix host into a file server for DOS or Windows clients. |
5 | 5 | ||
6 | Smbfs is a SMB client, but uses parts of samba for it's operation. For | 6 | Smbfs is a SMB client, but uses parts of samba for its operation. For |
7 | more info on samba, including documentation, please go to | 7 | more info on samba, including documentation, please go to |
8 | http://www.samba.org/ and then on to your nearest mirror. | 8 | http://www.samba.org/ and then on to your nearest mirror. |
diff --git a/Documentation/filesystems/squashfs.txt b/Documentation/filesystems/squashfs.txt index b324c033035a..203f7202cc9e 100644 --- a/Documentation/filesystems/squashfs.txt +++ b/Documentation/filesystems/squashfs.txt | |||
@@ -38,7 +38,8 @@ Hard link support: yes no | |||
38 | Real inode numbers: yes no | 38 | Real inode numbers: yes no |
39 | 32-bit uids/gids: yes no | 39 | 32-bit uids/gids: yes no |
40 | File creation time: yes no | 40 | File creation time: yes no |
41 | Xattr and ACL support: no no | 41 | Xattr support: yes no |
42 | ACL support: no no | ||
42 | 43 | ||
43 | Squashfs compresses data, inodes and directories. In addition, inode and | 44 | Squashfs compresses data, inodes and directories. In addition, inode and |
44 | directory data are highly compacted, and packed on byte boundaries. Each | 45 | directory data are highly compacted, and packed on byte boundaries. Each |
@@ -58,7 +59,7 @@ obtained from this site also. | |||
58 | 3. SQUASHFS FILESYSTEM DESIGN | 59 | 3. SQUASHFS FILESYSTEM DESIGN |
59 | ----------------------------- | 60 | ----------------------------- |
60 | 61 | ||
61 | A squashfs filesystem consists of seven parts, packed together on a byte | 62 | A squashfs filesystem consists of a maximum of eight parts, packed together on a byte |
62 | alignment: | 63 | alignment: |
63 | 64 | ||
64 | --------------- | 65 | --------------- |
@@ -80,6 +81,9 @@ alignment: | |||
80 | |---------------| | 81 | |---------------| |
81 | | uid/gid | | 82 | | uid/gid | |
82 | | lookup table | | 83 | | lookup table | |
84 | |---------------| | ||
85 | | xattr | | ||
86 | | table | | ||
83 | --------------- | 87 | --------------- |
84 | 88 | ||
85 | Compressed data blocks are written to the filesystem as files are read from | 89 | Compressed data blocks are written to the filesystem as files are read from |
@@ -192,6 +196,26 @@ This table is stored compressed into metadata blocks. A second index table is | |||
192 | used to locate these. This second index table for speed of access (and because | 196 | used to locate these. This second index table for speed of access (and because |
193 | it is small) is read at mount time and cached in memory. | 197 | it is small) is read at mount time and cached in memory. |
194 | 198 | ||
199 | 3.7 Xattr table | ||
200 | --------------- | ||
201 | |||
202 | The xattr table contains extended attributes for each inode. The xattrs | ||
203 | for each inode are stored in a list, each list entry containing a type, | ||
204 | name and value field. The type field encodes the xattr prefix | ||
205 | ("user.", "trusted." etc) and it also encodes how the name/value fields | ||
206 | should be interpreted. Currently the type indicates whether the value | ||
207 | is stored inline (in which case the value field contains the xattr value), | ||
208 | or if it is stored out of line (in which case the value field stores a | ||
209 | reference to where the actual value is stored). This allows large values | ||
210 | to be stored out of line, improving scanning and lookup performance, and it | ||
211 | also allows values to be de-duplicated, the value being stored once, and | ||
212 | all other occurrences holding an out-of-line reference to that value. | ||
213 | |||
214 | The xattr lists are packed into compressed 8K metadata blocks. | ||
215 | To reduce overhead in inodes, rather than storing the on-disk | ||
216 | location of the xattr list inside each inode, a 32-bit xattr id | ||
217 | is stored. This xattr id is mapped into the location of the xattr | ||
218 | list using a second xattr id lookup table. | ||
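A purely illustrative sketch of these relationships (the structure and field names below are invented for explanation and do not describe the real Squashfs on-disk format):

        #include <stdint.h>

        struct example_xattr_entry {
                uint16_t type;       /* encodes the prefix ("user.", "trusted."
                                        etc.) and whether the value is inline
                                        or an out-of-line reference */
                /* name and value (or out-of-line reference) follow */
        };

        struct example_inode {
                uint32_t xattr_id;   /* mapped to the location of this inode's
                                        xattr list via the xattr id lookup
                                        table */
        };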
195 | 219 | ||
196 | 4. TODOS AND OUTSTANDING ISSUES | 220 | 4. TODOS AND OUTSTANDING ISSUES |
197 | ------------------------------- | 221 | ------------------------------- |
@@ -199,9 +223,7 @@ it is small) is read at mount time and cached in memory. | |||
199 | 4.1 Todo list | 223 | 4.1 Todo list |
200 | ------------- | 224 | ------------- |
201 | 225 | ||
202 | Implement Xattr and ACL support. The Squashfs 4.0 filesystem layout has hooks | 226 | Implement ACL support. |
203 | for these but the code has not been written. Once the code has been written | ||
204 | the existing layout should not require modification. | ||
205 | 227 | ||
206 | 4.2 Squashfs internal cache | 228 | 4.2 Squashfs internal cache |
207 | --------------------------- | 229 | --------------------------- |
diff --git a/Documentation/filesystems/sysfs-tagging.txt b/Documentation/filesystems/sysfs-tagging.txt new file mode 100644 index 000000000000..caaaf1266d8f --- /dev/null +++ b/Documentation/filesystems/sysfs-tagging.txt | |||
@@ -0,0 +1,42 @@ | |||
1 | Sysfs tagging | ||
2 | ------------- | ||
3 | |||
4 | (Taken almost verbatim from Eric Biederman's netns tagging patch | ||
5 | commit msg) | ||
6 | |||
7 | The problem: network devices show up in sysfs, and with the network | ||
8 | namespace active, multiple devices with the same name can show up in | ||
9 | the same directory, ouch! | ||
10 | |||
11 | To avoid that problem and allow existing applications in network | ||
12 | namespaces to see the same interface that is currently presented in | ||
13 | sysfs, sysfs now has tagging directory support. | ||
14 | |||
15 | By using the network namespace pointers as tags to separate out | ||
16 | the sysfs directory entries, we ensure that we don't have conflicts | ||
17 | in the directories and applications only see a limited set of | ||
18 | the network devices. | ||
19 | |||
20 | Each sysfs directory entry may be tagged with zero or one | ||
21 | namespaces. A sysfs_dirent is augmented with a void *s_ns. If a | ||
22 | directory entry is tagged, then sysfs_dirent->s_flags will have a | ||
23 | flag between KOBJ_NS_TYPE_NONE and KOBJ_NS_TYPES, and s_ns will | ||
24 | point to the namespace to which it belongs. | ||
25 | |||
26 | Each sysfs superblock's sysfs_super_info contains an array void | ||
27 | *ns[KOBJ_NS_TYPES]. When a task in a tagging namespace | ||
28 | kobj_nstype first mounts sysfs, a new superblock is created. It | ||
29 | will be differentiated from other sysfs mounts by having its | ||
30 | s_fs_info->ns[kobj_nstype] set to the new namespace. Note that | ||
31 | through bind mounting and mounts propagation, a task can easily view | ||
32 | the contents of other namespaces' sysfs mounts. Therefore, when a | ||
33 | namespace exits, it will call kobj_ns_exit() to invalidate any | ||
34 | sysfs_dirent->s_ns pointers pointing to it. | ||
35 | |||
36 | Users of this interface: | ||
37 | - define a type in the kobj_ns_type enumeration. | ||
38 | - call kobj_ns_type_register() with its kobj_ns_type_operations which has | ||
39 | - current_ns() which returns current's namespace | ||
40 | - netlink_ns() which returns a socket's namespace | ||
41 | - initial_ns() which returns the initial namespace | ||
42 | - call kobj_ns_exit() when an individual tag is no longer valid | ||
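Roughly, and with the callback names and exact prototypes abridged (check the kobject headers for the authoritative definitions), a user of this interface ends up with something like:

        static const struct kobj_ns_type_operations example_ns_type_operations = {
                .type       = KOBJ_NS_TYPE_NET,     /* type from the enumeration */
                .current_ns = example_current_ns,   /* current's namespace       */
                .netlink_ns = example_netlink_ns,   /* a socket's namespace      */
                .initial_ns = example_initial_ns,   /* the initial namespace     */
        };

        kobj_ns_type_register(&example_ns_type_operations);
        ...
        /* and, when an individual tag is no longer valid: */
        kobj_ns_exit(KOBJ_NS_TYPE_NET, ns);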
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index 3015da0c6b2a..98ef55124158 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt | |||
@@ -82,21 +82,31 @@ tmpfs has a mount option to set the NUMA memory allocation policy for | |||
82 | all files in that instance (if CONFIG_NUMA is enabled) - which can be | 82 | all files in that instance (if CONFIG_NUMA is enabled) - which can be |
83 | adjusted on the fly via 'mount -o remount ...' | 83 | adjusted on the fly via 'mount -o remount ...' |
84 | 84 | ||
85 | mpol=default prefers to allocate memory from the local node | 85 | mpol=default use the process allocation policy |
86 | (see set_mempolicy(2)) | ||
86 | mpol=prefer:Node prefers to allocate memory from the given Node | 87 | mpol=prefer:Node prefers to allocate memory from the given Node |
87 | mpol=bind:NodeList allocates memory only from nodes in NodeList | 88 | mpol=bind:NodeList allocates memory only from nodes in NodeList |
88 | mpol=interleave prefers to allocate from each node in turn | 89 | mpol=interleave prefers to allocate from each node in turn |
89 | mpol=interleave:NodeList allocates from each node of NodeList in turn | 90 | mpol=interleave:NodeList allocates from each node of NodeList in turn |
91 | mpol=local prefers to allocate memory from the local node | ||
90 | 92 | ||
91 | NodeList format is a comma-separated list of decimal numbers and ranges, | 93 | NodeList format is a comma-separated list of decimal numbers and ranges, |
92 | a range being two hyphen-separated decimal numbers, the smallest and | 94 | a range being two hyphen-separated decimal numbers, the smallest and |
93 | largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15 | 95 | largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15 |
94 | 96 | ||
97 | A memory policy with a valid NodeList will be saved, as specified, for | ||
98 | use at file creation time. When a task allocates a file in the file | ||
99 | system, the mount option memory policy will be applied with a NodeList, | ||
100 | if any, modified by the calling task's cpuset constraints | ||
101 | [See Documentation/cgroups/cpusets.txt] and any optional flags, listed | ||
102 | below. If the resulting NodeList is the empty set, the effective memory | ||
103 | policy for the file will revert to "default" policy. | ||
104 | |||
95 | NUMA memory allocation policies have optional flags that can be used in | 105 | NUMA memory allocation policies have optional flags that can be used in |
96 | conjunction with their modes. These optional flags can be specified | 106 | conjunction with their modes. These optional flags can be specified |
97 | when tmpfs is mounted by appending them to the mode before the NodeList. | 107 | when tmpfs is mounted by appending them to the mode before the NodeList. |
98 | See Documentation/vm/numa_memory_policy.txt for a list of all available | 108 | See Documentation/vm/numa_memory_policy.txt for a list of all available |
99 | memory allocation policy mode flags. | 109 | memory allocation policy mode flags and their effect on memory policy. |
100 | 110 | ||
101 | =static is equivalent to MPOL_F_STATIC_NODES | 111 | =static is equivalent to MPOL_F_STATIC_NODES |
102 | =relative is equivalent to MPOL_F_RELATIVE_NODES | 112 | =relative is equivalent to MPOL_F_RELATIVE_NODES |
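As an illustrative command line (the size and NodeList are arbitrary), a memory policy is simply passed as another tmpfs mount option:

        mount -t tmpfs -o size=1g,mpol=interleave:0-3 tmpfs /mnt/mem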
@@ -134,3 +144,5 @@ Author: | |||
134 | Christoph Rohland <cr@sap.com>, 1.12.01 | 144 | Christoph Rohland <cr@sap.com>, 1.12.01 |
135 | Updated: | 145 | Updated: |
136 | Hugh Dickins, 4 June 2007 | 146 | Hugh Dickins, 4 June 2007 |
147 | Updated: | ||
148 | KOSAKI Motohiro, 16 Mar 2010 | ||
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 3de2f32edd90..94677e7dcb13 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt | |||
@@ -72,7 +72,7 @@ structure (this is the kernel-side implementation of file | |||
72 | descriptors). The freshly allocated file structure is initialized with | 72 | descriptors). The freshly allocated file structure is initialized with |
73 | a pointer to the dentry and a set of file operation member functions. | 73 | a pointer to the dentry and a set of file operation member functions. |
74 | These are taken from the inode data. The open() file method is then | 74 | These are taken from the inode data. The open() file method is then |
75 | called so the specific filesystem implementation can do it's work. You | 75 | called so the specific filesystem implementation can do its work. You |
76 | can see that this is another switch performed by the VFS. The file | 76 | can see that this is another switch performed by the VFS. The file |
77 | structure is placed into the file descriptor table for the process. | 77 | structure is placed into the file descriptor table for the process. |
78 | 78 | ||
@@ -401,11 +401,16 @@ otherwise noted. | |||
401 | started might not be in the page cache at the end of the | 401 | started might not be in the page cache at the end of the |
402 | walk). | 402 | walk). |
403 | 403 | ||
404 | truncate: called by the VFS to change the size of a file. The | 404 | truncate: Deprecated. This will not be called if ->setsize is defined. |
405 | Called by the VFS to change the size of a file. The | ||
405 | i_size field of the inode is set to the desired size by the | 406 | i_size field of the inode is set to the desired size by the |
406 | VFS before this method is called. This method is called by | 407 | VFS before this method is called. This method is called by |
407 | the truncate(2) system call and related functionality. | 408 | the truncate(2) system call and related functionality. |
408 | 409 | ||
410 | Note: ->truncate and vmtruncate are deprecated. Do not add new | ||
411 | instances/calls of these. Filesystems should be converted to do their | ||
412 | truncate sequence via ->setattr(). | ||
413 | |||
409 | permission: called by the VFS to check for access rights on a POSIX-like | 414 | permission: called by the VFS to check for access rights on a POSIX-like |
410 | filesystem. | 415 | filesystem. |
411 | 416 | ||
@@ -729,7 +734,7 @@ struct file_operations { | |||
729 | int (*open) (struct inode *, struct file *); | 734 | int (*open) (struct inode *, struct file *); |
730 | int (*flush) (struct file *); | 735 | int (*flush) (struct file *); |
731 | int (*release) (struct inode *, struct file *); | 736 | int (*release) (struct inode *, struct file *); |
732 | int (*fsync) (struct file *, struct dentry *, int datasync); | 737 | int (*fsync) (struct file *, int datasync); |
733 | int (*aio_fsync) (struct kiocb *, int datasync); | 738 | int (*aio_fsync) (struct kiocb *, int datasync); |
734 | int (*fasync) (int, struct file *, int); | 739 | int (*fasync) (int, struct file *, int); |
735 | int (*lock) (struct file *, int, struct file_lock *); | 740 | int (*lock) (struct file *, int, struct file_lock *); |
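With the dentry argument gone from ->fsync(), a filesystem's implementation takes the inode from the file itself; a minimal sketch (examplefs_sync_inode() is a hypothetical helper):

        static int examplefs_fsync(struct file *file, int datasync)
        {
                struct inode *inode = file->f_mapping->host;

                return examplefs_sync_inode(inode, datasync);
        }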
diff --git a/Documentation/filesystems/xfs-delayed-logging-design.txt b/Documentation/filesystems/xfs-delayed-logging-design.txt new file mode 100644 index 000000000000..96d0df28bed3 --- /dev/null +++ b/Documentation/filesystems/xfs-delayed-logging-design.txt | |||
@@ -0,0 +1,811 @@ | |||
1 | XFS Delayed Logging Design | ||
2 | -------------------------- | ||
3 | |||
4 | Introduction to Re-logging in XFS | ||
5 | --------------------------------- | ||
6 | |||
7 | XFS logging is a combination of logical and physical logging. Some objects, | ||
8 | such as inodes and dquots, are logged in logical format where the details | ||
9 | logged are made up of the changes to in-core structures rather than on-disk | ||
10 | structures. Other objects - typically buffers - have their physical changes | ||
11 | logged. The reason for these differences is to reduce the amount of log space | ||
12 | required for objects that are frequently logged. Some parts of inodes are more | ||
13 | frequently logged than others, and inodes are typically more frequently logged | ||
14 | than any other object (except maybe the superblock buffer) so keeping the | ||
15 | amount of metadata logged low is of prime importance. | ||
16 | |||
17 | The reason that this is such a concern is that XFS allows multiple separate | ||
18 | modifications to a single object to be carried in the log at any given time. | ||
19 | This allows the log to avoid needing to flush each change to disk before | ||
20 | recording a new change to the object. XFS does this via a method called | ||
21 | "re-logging". Conceptually, this is quite simple - all it requires is that any | ||
22 | new change to the object is recorded with a *new copy* of all the existing | ||
23 | changes in the new transaction that is written to the log. | ||
24 | |||
25 | That is, if we have a sequence of changes A through to F, and the object was | ||
26 | written to disk after change D, we would see in the log the following series | ||
27 | of transactions, their contents and the log sequence number (LSN) of the | ||
28 | transaction: | ||
29 | |||
30 | Transaction Contents LSN | ||
31 | A A X | ||
32 | B A+B X+n | ||
33 | C A+B+C X+n+m | ||
34 | D A+B+C+D X+n+m+o | ||
35 | <object written to disk> | ||
36 | E E Y (> X+n+m+o) | ||
37 | F E+F Y+p | ||
38 | |||
39 | In other words, each time an object is relogged, the new transaction contains | ||
40 | the aggregation of all the previous changes currently held only in the log. | ||
41 | |||
42 | This relogging technique also allows objects to be moved forward in the log so | ||
43 | that an object being relogged does not prevent the tail of the log from ever | ||
44 | moving forward. This can be seen in the table above by the changing | ||
45 | (increasing) LSN of each subsequent transaction - the LSN is effectively a | ||
46 | direct encoding of the location in the log of the transaction. | ||
47 | |||
48 | This relogging is also used to implement long-running, multiple-commit | ||
49 | transactions. These transactions are known as rolling transactions, and require | ||
50 | a special log reservation known as a permanent transaction reservation. A | ||
51 | typical example of a rolling transaction is the removal of extents from an | ||
52 | inode which can only be done at a rate of two extents per transaction because | ||
53 | of reservation size limitations. Hence a rolling extent removal transaction | ||
54 | keeps relogging the inode and btree buffers as they get modified in each | ||
55 | removal operation. This keeps them moving forward in the log as the operation | ||
56 | progresses, ensuring that the current operation never gets blocked by itself | ||
57 | if the log wraps around. | ||
58 | |||
59 | Hence it can be seen that the relogging operation is fundamental to the correct | ||
60 | working of the XFS journalling subsystem. From the above description, most | ||
61 | people should be able to see why XFS metadata operations write so much to | ||
62 | the log - repeated operations to the same objects write the same changes to | ||
63 | the log over and over again. Worse is the fact that objects tend to get | ||
64 | dirtier as they get relogged, so each subsequent transaction is writing more | ||
65 | metadata into the log. | ||
66 | |||
67 | Another feature of the XFS transaction subsystem is that most transactions are | ||
68 | asynchronous. That is, they don't commit to disk until either a log buffer is | ||
69 | filled (a log buffer can hold multiple transactions) or a synchronous operation | ||
70 | forces the log buffers holding the transactions to disk. This means that XFS is | ||
71 | doing aggregation of transactions in memory - batching them, if you like - to | ||
72 | minimise the impact of the log IO on transaction throughput. | ||
73 | |||
74 | The limitation on asynchronous transaction throughput is the number and size of | ||
75 | log buffers made available by the log manager. By default there are 8 log | ||
76 | buffers available and the size of each is 32kB - the size can be increased up | ||
77 | to 256kB by use of a mount option. | ||
78 | |||
79 | Effectively, this gives us the maximum bound of outstanding metadata changes | ||
80 | that can be made to the filesystem at any point in time - if all the log | ||
81 | buffers are full and under IO, then no more transactions can be committed until | ||
82 | the current batch completes. It is now common for a single CPU core to | ||
83 | be able to issue enough transactions to keep the log buffers full and under | ||
84 | IO permanently. Hence the XFS journalling subsystem can be considered to be IO | ||
85 | bound. | ||
86 | |||
87 | Delayed Logging: Concepts | ||
88 | ------------------------- | ||
89 | |||
90 | The key thing to note about the asynchronous logging combined with the | ||
91 | relogging technique XFS uses is that we can be relogging changed objects | ||
92 | multiple times before they are committed to disk in the log buffers. If we | ||
93 | return to the previous relogging example, it is entirely possible that | ||
94 | transactions A through D are committed to disk in the same log buffer. | ||
95 | |||
96 | That is, a single log buffer may contain multiple copies of the same object, | ||
97 | but only one of those copies needs to be there - the last one "D", as it | ||
98 | contains all the previous changes. In other words, we have one | ||
99 | necessary copy in the log buffer, and three stale copies that are simply | ||
100 | wasting space. When we are doing repeated operations on the same set of | ||
101 | objects, these "stale objects" can be over 90% of the space used in the log | ||
102 | buffers. It is clear that reducing the number of stale objects written to the | ||
103 | log would greatly reduce the amount of metadata we write to the log, and this | ||
104 | is the fundamental goal of delayed logging. | ||
105 | |||
106 | From a conceptual point of view, XFS is already doing relogging in memory (where | ||
107 | memory == log buffer), only it is doing it extremely inefficiently. It is using | ||
108 | logical to physical formatting to do the relogging because there is no | ||
109 | infrastructure to keep track of logical changes in memory prior to physically | ||
110 | formatting the changes in a transaction to the log buffer. Hence we cannot avoid | ||
111 | accumulating stale objects in the log buffers. | ||
112 | |||
113 | Delayed logging is the name we've given to keeping and tracking transactional | ||
114 | changes to objects in memory outside the log buffer infrastructure. Because of | ||
115 | the relogging concept fundamental to the XFS journalling subsystem, this is | ||
116 | actually relatively easy to do - all the changes to logged items are already | ||
117 | tracked in the current infrastructure. The big problem is how to accumulate | ||
118 | them and get them to the log in a consistent, recoverable manner. | ||
119 | Describing the problems and how they have been solved is the focus of this | ||
120 | document. | ||
121 | |||
122 | One of the key changes that delayed logging makes to the operation of the | ||
123 | journalling subsystem is that it disassociates the amount of outstanding | ||
124 | metadata changes from the size and number of log buffers available. In other | ||
125 | words, instead of there only being a maximum of 2MB of transaction changes not | ||
126 | written to the log at any point in time, there may be a much greater amount | ||
127 | being accumulated in memory. Hence the potential for loss of metadata on a | ||
128 | crash is much greater than for the existing logging mechanism. | ||
129 | |||
130 | It should be noted that this does not change the guarantee that log recovery | ||
131 | will result in a consistent filesystem. What it does mean is that as far as the | ||
132 | recovered filesystem is concerned, there may be many thousands of transactions | ||
133 | that simply did not occur as a result of the crash. This makes it even more | ||
134 | important that applications that care about their data use fsync() where they | ||
135 | need to ensure application level data integrity is maintained. | ||
136 | |||
137 | It should be noted that delayed logging is not an innovative new concept that | ||
138 | warrants rigorous proofs to determine whether it is correct or not. The method | ||
139 | of accumulating changes in memory for some period before writing them to the | ||
140 | log is used effectively in many filesystems including ext3 and ext4. Hence | ||
141 | no time is spent in this document trying to convince the reader that the | ||
142 | concept is sound. Instead it is simply considered a "solved problem" and as | ||
143 | such implementing it in XFS is purely an exercise in software engineering. | ||
144 | |||
145 | The fundamental requirements for delayed logging in XFS are simple: | ||
146 | |||
147 | 1. Reduce the amount of metadata written to the log by at least | ||
148 | an order of magnitude. | ||
149 | 2. Supply sufficient statistics to validate Requirement #1. | ||
150 | 3. Supply sufficient new tracing infrastructure to be able to debug | ||
151 | problems with the new code. | ||
152 | 4. No on-disk format change (metadata or log format). | ||
153 | 5. Enable and disable with a mount option. | ||
154 | 6. No performance regressions for synchronous transaction workloads. | ||
155 | |||
156 | Delayed Logging: Design | ||
157 | ----------------------- | ||
158 | |||
159 | Storing Changes | ||
160 | |||
161 | The problem with accumulating changes at a logical level (i.e. just using the | ||
162 | existing log item dirty region tracking) is that when it comes to writing the | ||
163 | changes to the log buffers, we need to ensure that the object we are formatting | ||
164 | is not changing while we do this. This requires locking the object to prevent | ||
165 | concurrent modification. Hence flushing the logical changes to the log would | ||
166 | require us to lock every object, format them, and then unlock them again. | ||
167 | |||
168 | This introduces lots of scope for deadlocks with transactions that are already | ||
169 | running. For example, a transaction has object A locked and modified, but needs | ||
170 | the delayed logging tracking lock to commit the transaction. However, the | ||
171 | flushing thread has the delayed logging tracking lock already held, and is | ||
172 | trying to get the lock on object A to flush it to the log buffer. This appears | ||
173 | to be an unsolvable deadlock condition, and it was solving this problem that | ||
174 | was the barrier to implementing delayed logging for so long. | ||
175 | |||
176 | The solution is relatively simple - it just took a long time to recognise it. | ||
177 | Put simply, the current logging code formats the changes to each item into a | ||
178 | vector array that points to the changed regions in the item. The log write code | ||
179 | simply copies the memory these vectors point to into the log buffer during | ||
180 | transaction commit while the item is locked in the transaction. Instead of | ||
181 | using the log buffer as the destination of the formatting code, we can use an | ||
182 | allocated memory buffer big enough to fit the formatted vector. | ||
183 | |||
184 | If we then copy the vector into the memory buffer and rewrite the vector to | ||
185 | point to the memory buffer rather than the object itself, we now have a copy of | ||
186 | the changes in a format that is compatible with the log buffer writing code | ||
187 | and that does not require us to lock the item to access it. This formatting and | ||
188 | rewriting can all be done while the object is locked during transaction commit, | ||
189 | resulting in a vector that is transactionally consistent and can be accessed | ||
190 | without needing to lock the owning item. | ||
191 | |||
192 | Hence we avoid the need to lock items when we need to flush outstanding | ||
193 | asynchronous transactions to the log. The differences between the existing | ||
194 | formatting method and the delayed logging formatting can be seen in the | ||
195 | diagram below. | ||
196 | |||
197 | Current format log vector: | ||
198 | |||
199 | Object +---------------------------------------------+ | ||
200 | Vector 1 +----+ | ||
201 | Vector 2 +----+ | ||
202 | Vector 3 +----------+ | ||
203 | |||
204 | After formatting: | ||
205 | |||
206 | Log Buffer +-V1-+-V2-+----V3----+ | ||
207 | |||
208 | Delayed logging vector: | ||
209 | |||
210 | Object +---------------------------------------------+ | ||
211 | Vector 1 +----+ | ||
212 | Vector 2 +----+ | ||
213 | Vector 3 +----------+ | ||
214 | |||
215 | After formatting: | ||
216 | |||
217 | Memory Buffer +-V1-+-V2-+----V3----+ | ||
218 | Vector 1 +----+ | ||
219 | Vector 2 +----+ | ||
220 | Vector 3 +----------+ | ||
221 | |||
222 | The memory buffer and associated vector need to be passed as a single object, | ||
223 | but still need to be associated with the parent object so if the object is | ||
224 | relogged we can replace the current memory buffer with a new memory buffer that | ||
225 | contains the latest changes. | ||
226 | |||
227 | The reason for keeping the vector around after we've formatted the memory | ||
228 | buffer is to support splitting vectors across log buffer boundaries correctly. | ||
229 | If we don't keep the vector around, we do not know where the region boundaries | ||
230 | are in the item, so we'd need a new encapsulation method for regions in the log | ||
231 | buffer writing (i.e. double encapsulation). This would be an on-disk format | ||
232 | change and as such is not desirable. It also means we'd have to write the log | ||
233 | region headers in the formatting stage, which is problematic as there is per | ||
234 | region state that needs to be placed into the headers during the log write. | ||
235 | |||
236 | Hence we need to keep the vector, but by attaching the memory buffer to it and | ||
237 | rewriting the vector addresses to point at the memory buffer we end up with a | ||
238 | self-describing object that can be passed to the log buffer write code to be | ||
239 | handled in exactly the same manner as the existing log vectors are handled. | ||
240 | Hence we avoid needing a new on-disk format to handle items that have been | ||
241 | relogged in memory. | ||
242 | |||
243 | |||
244 | Tracking Changes | ||
245 | |||
246 | Now that we can record transactional changes in memory in a form that allows | ||
247 | them to be used without limitations, we need to be able to track and accumulate | ||
248 | them so that they can be written to the log at some later point in time. The | ||
249 | log item is the natural place to store this vector and buffer, and it also | ||
250 | makes sense as the object used to track committed objects as it will always | ||
251 | exist once the object has been included in a transaction. | ||
252 | |||
253 | The log item is already used to track the log items that have been written to | ||
254 | the log but not yet written to disk. Such log items are considered "active" | ||
255 | and as such are stored in the Active Item List (AIL) which is an LSN-ordered | ||
256 | doubly linked list. Items are inserted into this list during log buffer IO | ||
257 | completion, after which they are unpinned and can be written to disk. An object | ||
258 | that is in the AIL can be relogged, which causes the object to be pinned again | ||
259 | and then moved forward in the AIL when the log buffer IO completes for that | ||
260 | transaction. | ||
261 | |||
262 | Essentially, this shows that an item that is in the AIL can still be modified | ||
263 | and relogged, so any tracking must be separate from the AIL infrastructure. As | ||
264 | such, we cannot reuse the AIL list pointers for tracking committed items, nor | ||
265 | can we store state in any field that is protected by the AIL lock. Hence the | ||
266 | committed item tracking needs its own locks, lists and state fields in the log | ||
267 | item. | ||
268 | |||
269 | Similar to the AIL, tracking of committed items is done through a new list | ||
270 | called the Committed Item List (CIL). The list tracks log items that have been | ||
271 | committed and have formatted memory buffers attached to them. It tracks objects | ||
272 | in transaction commit order, so when an object is relogged it is removed from | ||
273 | its place in the list and re-inserted at the tail. This is entirely arbitrary | ||
274 | and done to make it easy for debugging - the last items in the list are the | ||
275 | ones that are most recently modified. Ordering of the CIL is not necessary for | ||
276 | transactional integrity (as discussed in the next section) so the ordering is | ||
277 | done for convenience/sanity of the developers. | ||
278 | |||
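In list terms this is nothing more than a move-to-tail on relogging; a sketch with placeholder field names (not the actual XFS code):

        /* on relogging an item that is already on the CIL */
        list_move_tail(&item->cil_entry, &cil->items);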
279 | |||
280 | Delayed Logging: Checkpoints | ||
281 | |||
282 | When we have a log synchronisation event, commonly known as a "log force", | ||
283 | all the items in the CIL must be written into the log via the log buffers. | ||
284 | We need to write these items in the order that they exist in the CIL, and they | ||
285 | need to be written as an atomic transaction. The need for all the objects to be | ||
286 | written as an atomic transaction comes from the requirements of relogging and | ||
287 | log replay - all the changes in all the objects in a given transaction must | ||
288 | either be completely replayed during log recovery, or not replayed at all. If | ||
289 | a transaction is not replayed because it is not complete in the log, then | ||
290 | no later transactions should be replayed, either. | ||
291 | |||
292 | To fulfill this requirement, we need to write the entire CIL in a single log | ||
293 | transaction. Fortunately, the XFS log code has no fixed limit on the size of a | ||
294 | transaction, nor does the log replay code. The only fundamental limit is that | ||
295 | the transaction cannot be larger than just under half the size of the log. The | ||
296 | reason for this limit is that to find the head and tail of the log, there must | ||
297 | be at least one complete transaction in the log at any given time. If a | ||
298 | transaction is larger than half the log, then there is the possibility that a | ||
299 | crash during the write of such a transaction could partially overwrite the | ||
300 | only complete previous transaction in the log. This will result in a recovery | ||
301 | failure and an inconsistent filesystem and hence we must enforce the maximum | ||
302 | size of a checkpoint to be slightly less than half the log. | ||
303 | |||
304 | Apart from this size requirement, a checkpoint transaction looks no different | ||
305 | to any other transaction - it contains a transaction header, a series of | ||
306 | formatted log items and a commit record at the tail. From a recovery | ||
307 | perspective, the checkpoint transaction is also no different - just a lot | ||
308 | bigger with a lot more items in it. The worst case effect of this is that we | ||
309 | might need to tune the recovery transaction object hash size. | ||
310 | |||
311 | Because the checkpoint is just another transaction and all the changes to log | ||
312 | items are stored as log vectors, we can use the existing log buffer writing | ||
313 | code to write the changes into the log. To do this efficiently, we need to | ||
314 | minimise the time we hold the CIL locked while writing the checkpoint | ||
315 | transaction. The current log write code enables us to do this easily with the | ||
316 | way it separates the writing of the transaction contents (the log vectors) from | ||
317 | the transaction commit record, but tracking this requires us to have a | ||
318 | per-checkpoint context that travels through the log write process through to | ||
319 | checkpoint completion. | ||
320 | |||
321 | Hence a checkpoint has a context that tracks the state of the current | ||
322 | checkpoint from initiation to checkpoint completion. A new context is initiated | ||
323 | at the same time a checkpoint transaction is started. That is, when we remove | ||
324 | all the current items from the CIL during a checkpoint operation, we move all | ||
325 | those changes into the current checkpoint context. We then initialise a new | ||
326 | context and attach that to the CIL for aggregation of new transactions. | ||
327 | |||
328 | This allows us to unlock the CIL immediately after transfer of all the | ||
329 | committed items and effectively allow new transactions to be issued while we | ||
330 | are formatting the checkpoint into the log. It also allows concurrent | ||
331 | checkpoints to be written into the log buffers in the case of log force heavy | ||
332 | workloads, just like the existing transaction commit code does. This, however, | ||
333 | requires that we strictly order the commit records in the log so that | ||
334 | checkpoint sequence order is maintained during log replay. | ||
335 | |||
336 | To allow an item to be written into a checkpoint transaction at the same | ||
337 | time another transaction modifies the item and inserts the log item into | ||
338 | the new CIL, the checkpoint transaction commit code cannot use log items | ||
339 | to store the list of log vectors that need to be written into the transaction. | ||
340 | Hence log vectors need to be able to be chained together to allow them to be | ||
341 | detached from the log items. That is, when the CIL is flushed, the memory | ||
342 | buffer and log vector attached to each log item needs to be attached to the | ||
343 | checkpoint context so that the log item can be released. In diagrammatic form, | ||
344 | the CIL would look like this before the flush: | ||
345 | |||
346 | CIL Head | ||
347 | | | ||
348 | V | ||
349 | Log Item <-> log vector 1 -> memory buffer | ||
350 | | -> vector array | ||
351 | V | ||
352 | Log Item <-> log vector 2 -> memory buffer | ||
353 | | -> vector array | ||
354 | V | ||
355 | ...... | ||
356 | | | ||
357 | V | ||
358 | Log Item <-> log vector N-1 -> memory buffer | ||
359 | | -> vector array | ||
360 | V | ||
361 | Log Item <-> log vector N -> memory buffer | ||
362 | -> vector array | ||
363 | |||
364 | And after the flush the CIL head is empty, and the checkpoint context log | ||
365 | vector list would look like: | ||
366 | |||
367 | Checkpoint Context | ||
368 | | | ||
369 | V | ||
370 | log vector 1 -> memory buffer | ||
371 | | -> vector array | ||
372 | | -> Log Item | ||
373 | V | ||
374 | log vector 2 -> memory buffer | ||
375 | | -> vector array | ||
376 | | -> Log Item | ||
377 | V | ||
378 | ...... | ||
379 | | | ||
380 | V | ||
381 | log vector N-1 -> memory buffer | ||
382 | | -> vector array | ||
383 | | -> Log Item | ||
384 | V | ||
385 | log vector N -> memory buffer | ||
386 | -> vector array | ||
387 | -> Log Item | ||
388 | |||
389 | Once this transfer is done, the CIL can be unlocked and new transactions can | ||
390 | start, while the checkpoint flush code works over the log vector chain to | ||
391 | commit the checkpoint. | ||
392 | |||
393 | Once the checkpoint is written into the log buffers, the checkpoint context is | ||
394 | attached to the log buffer that the commit record was written to along with a | ||
395 | completion callback. Log IO completion will call that callback, which can then | ||
396 | run transaction committed processing for the log items (i.e. insert into AIL | ||
397 | and unpin) in the log vector chain and then free the log vector chain and | ||
398 | checkpoint context. | ||
399 | |||
400 | Discussion Point: I am uncertain as to whether the log item is the most | ||
401 | efficient way to track vectors, even though it seems like the natural way to do | ||
402 | it. The fact that we walk the log items (in the CIL) just to chain the log | ||
403 | vectors and break the link between the log item and the log vector means that | ||
404 | we take a cache line hit for the log item list modification, then another for | ||
405 | the log vector chaining. If we track by the log vectors, then we only need to | ||
406 | break the link between the log item and the log vector, which means we should | ||
407 | dirty only the log item cachelines. Normally I wouldn't be concerned about one | ||
408 | vs two dirty cachelines except for the fact I've seen upwards of 80,000 log | ||
409 | vectors in one checkpoint transaction. I'd guess this is a "measure and | ||
410 | compare" situation that can be done after a working and reviewed implementation | ||
411 | is in the dev tree.... | ||
412 | |||
413 | Delayed Logging: Checkpoint Sequencing | ||
414 | |||
415 | One of the key aspects of the XFS transaction subsystem is that it tags | ||
416 | committed transactions with the log sequence number of the transaction commit. | ||
417 | This allows transactions to be issued asynchronously even though there may be | ||
418 | future operations that cannot be completed until that transaction is fully | ||
419 | committed to the log. In the rare case that a dependent operation occurs (e.g. | ||
420 | re-using a freed metadata extent for a data extent), a special, optimised log | ||
421 | force can be issued to force the dependent transaction to disk immediately. | ||
422 | |||
423 | To do this, transactions need to record the LSN of the commit record of the | ||
424 | transaction. This LSN comes directly from the log buffer the transaction is | ||
425 | written into. While this works just fine for the existing transaction | ||
426 | mechanism, it does not work for delayed logging because transactions are not | ||
427 | written directly into the log buffers. Hence some other method of sequencing | ||
428 | transactions is required. | ||
429 | |||
430 | As discussed in the checkpoint section, delayed logging uses per-checkpoint | ||
431 | contexts, and as such it is simple to assign a sequence number to each | ||
432 | checkpoint. Because the switching of checkpoint contexts must be done | ||
433 | atomically, it is simple to ensure that each new context has a monotonically | ||
434 | increasing sequence number assigned to it without the need for an external | ||
435 | atomic counter - we can just take the current context sequence number and add | ||
436 | one to it for the new context. | ||
437 | |||
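As a sketch with placeholder names (not the actual XFS implementation), the context switch and sequence assignment reduce to:

        /* under whatever lock serialises checkpoint context switches */
        new_ctx->sequence = cil->current_ctx->sequence + 1;
        cil->current_ctx  = new_ctx;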
438 | Then, instead of assigning a log buffer LSN to the transaction commit LSN | ||
439 | during the commit, we can assign the current checkpoint sequence. This allows | ||
440 | operations that track transactions that have not yet completed to know what | ||
441 | checkpoint sequence needs to be committed before they can continue. As a | ||
442 | result, the code that forces the log to a specific LSN now needs to ensure that | ||
443 | the log forces to a specific checkpoint. | ||
444 | |||
445 | To ensure that we can do this, we need to track all the checkpoint contexts | ||
446 | that are currently committing to the log. When we flush a checkpoint, the | ||
447 | context gets added to a "committing" list which can be searched. When a | ||
448 | checkpoint commit completes, it is removed from the committing list. Because | ||
449 | the checkpoint context records the LSN of the commit record for the checkpoint, | ||
450 | we can also wait on the log buffer that contains the commit record, thereby | ||
451 | using the existing log force mechanisms to execute synchronous forces. | ||
452 | |||
453 | It should be noted that the synchronous forces may need to be extended with | ||
454 | mitigation algorithms similar to the current log buffer code to allow | ||
455 | aggregation of multiple synchronous transactions if there are already | ||
456 | synchronous transactions being flushed. Investigation of the performance of the | ||
457 | current design is needed before making any decisions here. | ||
458 | |||
459 | The main concern with log forces is to ensure that all the previous checkpoints | ||
460 | are also committed to disk before the one we need to wait for. Therefore we | ||
461 | need to check that all the prior contexts in the committing list are also | ||
462 | complete before waiting on the one we need to complete. We do this | ||
463 | synchronisation in the log force code so that we don't need to wait anywhere | ||
464 | else for such serialisation - it only matters when we do a log force. | ||
465 | |||
466 | The only remaining complexity is that a log force now also has to handle the | ||
467 | case where the forcing sequence number is the same as the current context. That | ||
468 | is, we need to flush the CIL and potentially wait for it to complete. This is a | ||
469 | simple addition to the existing log forcing code to check the sequence numbers | ||
470 | and push if required. Indeed, placing the current sequence checkpoint flush in | ||
471 | the log force code enables the current mechanism for issuing synchronous | ||
472 | transactions to remain untouched (i.e. commit an asynchronous transaction, then | ||
473 | force the log at the LSN of that transaction) and so the higher level code | ||
474 | behaves the same regardless of whether delayed logging is being used or not. | ||
475 | |||
476 | Delayed Logging: Checkpoint Log Space Accounting | ||
477 | |||
478 | The big issue for a checkpoint transaction is the log space reservation for the | ||
479 | transaction. We don't know how big a checkpoint transaction is going to be | ||
480 | ahead of time, nor how many log buffers it will take to write out, nor the | ||
481 | number of split log vector regions that are going to be used. We can track the | ||
482 | amount of log space required as we add items to the commit item list, but we | ||
483 | still need to reserve the space in the log for the checkpoint. | ||
484 | |||
485 | A typical transaction reserves enough space in the log for the worst case space | ||
486 | usage of the transaction. The reservation accounts for log record headers, | ||
487 | transaction and region headers, headers for split regions, buffer tail padding, | ||
488 | etc. as well as the actual space for all the changed metadata in the | ||
489 | transaction. While some of this is fixed overhead, much of it is dependent on | ||
490 | the size of the transaction and the number of regions being logged (the number | ||
491 | of log vectors in the transaction). | ||
492 | |||
493 | An example of the differences would be logging directory changes versus logging | ||
494 | inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then | ||
495 | there are lots of transactions that only contain an inode core and an inode log | ||
496 | format structure. That is, two vectors totaling roughly 150 bytes. If we modify | ||
497 | 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each | ||
498 | vector is 12 bytes, so the total to be logged is approximately 1.75MB. In | ||
499 | comparison, if we are logging full directory buffers, they are typically 4KB | ||
500 | each, so in 1.5MB of directory buffers we'd have roughly 400 buffers and a | ||
501 | buffer format structure for each buffer - roughly 800 vectors or 1.51MB total | ||
502 | space. From this, it should be obvious that a static log space reservation is | ||
503 | not particularly flexible and that it is difficult to select an "optimal value" | ||
504 | that suits all workloads. | ||
505 | |||
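The rough numbers above can be reproduced with a back-of-the-envelope calculation. The sizes below are the approximate figures quoted in the example rather than exact on-disk structure sizes, and "MB" is used as loosely as it is in the text:

	#include <stdio.h>

	int main(void)
	{
		const double MB = 1000.0 * 1000.0;
		const double inode_bytes = 150;	/* inode core + inode log format */
		const double vector_hdr = 12;	/* per log vector overhead */
		const double dir_buf = 4096;	/* directory buffer size */

		/* 10,000 modified inodes, two vectors each. */
		double inodes = 10000;
		double inode_total = inodes * inode_bytes + inodes * 2 * vector_hdr;

		/* The same ~1.5MB of metadata as full directory buffers, each
		 * logged with a small buffer format structure, i.e. roughly
		 * two vectors per buffer (format structure size ignored). */
		double dir_data = 1.5 * MB;
		double bufs = dir_data / dir_buf;
		double dir_total = dir_data + bufs * 2 * vector_hdr;

		printf("inode example:     ~%.2f MB logged in %.0f vectors\n",
		       inode_total / MB, inodes * 2);
		printf("directory example: ~%.2f MB logged in ~%.0f vectors\n",
		       dir_total / MB, bufs * 2);
		return 0;
	}
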
506 | Further, if we are going to use a static reservation, which bit of the entire | ||
507 | reservation does it cover? We account for space used by the transaction | ||
508 | reservation by tracking the space currently used by the object in the CIL and | ||
509 | then calculating the increase or decrease in space used as the object is | ||
510 | relogged. This allows for a checkpoint reservation to only have to account for | ||
511 | log buffer metadata used such as log header records. | ||
512 | |||
513 | However, even using a static reservation for just the log metadata is | ||
514 | problematic. Typically log record headers use at least 16KB of log space per | ||
515 | 1MB of log space consumed (512 bytes per 32k) and the reservation needs to be | ||
516 | large enough to handle arbitrary sized checkpoint transactions. This | ||
517 | reservation needs to be made before the checkpoint is started, and we need to | ||
518 | be able to reserve the space without sleeping. For an 8MB checkpoint, we need a | ||
519 | reservation of around 150KB, which is a non-trivial amount of space. | ||
520 | |||
521 | A static reservation needs to manipulate the log grant counters - we can take a | ||
522 | permanent reservation on the space, but we still need to make sure we refresh | ||
523 | the write reservation (the actual space available to the transaction) after | ||
524 | every checkpoint transaction completion. Unfortunately, if this space is not | ||
525 | available when required, then the regrant code will sleep waiting for it. | ||
526 | |||
527 | The problem with this is that it can lead to deadlocks as we may need to commit | ||
528 | checkpoints to be able to free up log space (refer back to the description of | ||
529 | rolling transactions for an example of this). Hence we *must* always have | ||
530 | space available in the log if we are to use static reservations, and that is | ||
531 | very difficult and complex to arrange. It is possible to do, but there is a | ||
532 | simpler way. | ||
533 | |||
534 | The simpler way of doing this is tracking the entire log space used by the | ||
535 | items in the CIL and using this to dynamically calculate the amount of log | ||
536 | space required by the log metadata. If this log metadata space changes as a | ||
537 | result of a transaction commit inserting a new memory buffer into the CIL, then | ||
538 | the difference in space required is removed from the transaction that causes | ||
539 | the change. Transactions at this level will *always* have enough space | ||
540 | available in their reservation for this as they have already reserved the | ||
541 | maximal amount of log metadata space they require, and such a delta reservation | ||
542 | will always be less than or equal to the maximal amount in the reservation. | ||
543 | |||
544 | Hence we can grow the checkpoint transaction reservation dynamically as items | ||
545 | are added to the CIL and avoid the need for reserving and regranting log space | ||
546 | up front. This avoids deadlocks and removes a blocking point from the | ||
547 | checkpoint flush code. | ||
548 | |||
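The accounting described above might be sketched as follows. The structure layout and names are hypothetical, and only growth is handled (a shrinking item would hand space back); the point is simply that the delta always fits inside the committing transaction's existing worst-case reservation:

	#include <stdint.h>

	struct cil_item {
		uint32_t	space_used;	/* bytes this item currently uses in the CIL */
	};

	struct cil {
		uint32_t	space_used;	/* total bytes accounted to the pending checkpoint */
	};

	struct transaction {
		uint32_t	reservation;	/* unused log space reservation, bytes */
	};

	/*
	 * Account for an item being inserted or relogged into the CIL. The
	 * committing transaction has already reserved worst-case space for
	 * the change, so any growth in the item's CIL footprint (and the log
	 * metadata that scales with it) is "stolen" from that reservation
	 * and accumulated into the checkpoint's space usage. No separate
	 * up-front checkpoint reservation is needed.
	 */
	static void cil_account_item(struct cil *cil, struct cil_item *item,
				     struct transaction *tp, uint32_t new_size)
	{
		if (new_size > item->space_used) {
			uint32_t delta = new_size - item->space_used;

			tp->reservation -= delta;	/* always <= the worst-case reservation */
			cil->space_used += delta;
		}
		item->space_used = new_size;
	}
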
549 | As mentioned earlier, transactions can't grow to more than half the size of the | ||
550 | log. Hence as part of the reservation growing, we need to also check the size | ||
551 | of the reservation against the maximum allowed transaction size. If we reach | ||
552 | the maximum threshold, we need to push the CIL to the log. This is effectively | ||
553 | a "background flush" and is done on demand. This is identical to | ||
554 | a CIL push triggered by a log force, except that there is no waiting for the | ||
555 | checkpoint commit to complete. This background push is checked for and executed | ||
556 | by the transaction commit code. | ||
557 | |||
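The background push check itself is trivial. A sketch with a hypothetical threshold (the real limit would be derived from the log size and must stay well under the half-the-log transaction limit) and an assumed asynchronous push helper:

	#include <stdint.h>

	/* Hypothetical: push the CIL once it uses more than 1/8 of the log. */
	#define CIL_SPACE_LIMIT(log_size)	((log_size) / 8)

	struct cil {
		uint64_t	space_used;	/* bytes currently accounted to the CIL */
		uint64_t	log_size;	/* size of the on-disk log, bytes */
	};

	extern void cil_push_background(struct cil *cil);	/* assumed async flush */

	/*
	 * Called at the end of transaction commit: if the pending checkpoint
	 * has grown too large, trigger a background CIL push. Nothing waits
	 * for the resulting checkpoint commit to complete.
	 */
	void cil_commit_check_push(struct cil *cil)
	{
		if (cil->space_used >= CIL_SPACE_LIMIT(cil->log_size))
			cil_push_background(cil);
	}
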
558 | If the transaction subsystem goes idle while we still have items in the CIL, | ||
559 | they will be flushed by the periodic log force issued by the xfssyncd. This log | ||
560 | force will push the CIL to disk, and if the transaction subsystem stays idle, | ||
561 | allow the idle log to be covered (effectively marked clean) in exactly the same | ||
562 | manner that is done for the existing logging method. A discussion point is | ||
563 | whether this log force needs to be done more frequently than the current rate, | ||
564 | which is once every 30s. | ||
565 | |||
566 | |||
567 | Delayed Logging: Log Item Pinning | ||
568 | |||
569 | Currently log items are pinned during transaction commit while the items are | ||
570 | still locked. This happens just after the items are formatted, though it could | ||
571 | be done any time before the items are unlocked. The result of this mechanism is | ||
572 | that items get pinned once for every transaction that is committed to the log | ||
573 | buffers. Hence items that are relogged in the log buffers will have a pin count | ||
574 | for every outstanding transaction they were dirtied in. When each of these | ||
575 | transactions is completed, they will unpin the item once. As a result, the item | ||
576 | only becomes unpinned when all the transactions complete and there are no | ||
577 | pending transactions. Thus the pinning and unpinning of a log item is symmetric | ||
578 | as there is a 1:1 relationship between transaction commit and log item completion. | ||
579 | |||
580 | For delayed logging, however, we have an asymmetric transaction commit to | ||
581 | completion relationship. Every time an object is relogged in the CIL it goes | ||
582 | through the commit process without a corresponding completion being registered. | ||
583 | That is, we now have a many-to-one relationship between transaction commit and | ||
584 | log item completion. The result of this is that pinning and unpinning of the | ||
585 | log items becomes unbalanced if we retain the "pin on transaction commit, unpin | ||
586 | on transaction completion" model. | ||
587 | |||
588 | To keep pin/unpin symmetry, the algorithm needs to change to a "pin on | ||
589 | insertion into the CIL, unpin on checkpoint completion". In other words, the | ||
590 | pinning and unpinning becomes symmetric around a checkpoint context. We have to | ||
591 | pin the object the first time it is inserted into the CIL - if it is already in | ||
592 | the CIL during a transaction commit, then we do not pin it again. Because there | ||
593 | can be multiple outstanding checkpoint contexts, we can still see elevated pin | ||
594 | counts, but as each checkpoint completes the pin count will retain the correct | ||
595 | value according to its context. | ||
596 | |||
597 | Just to make matters slightly more complex, this checkpoint-level context | ||
598 | for the pin count means that the pinning of an item must take place under the | ||
599 | CIL commit/flush lock. If we pin the object outside this lock, we cannot | ||
600 | guarantee which context the pin count is associated with. This is because | ||
601 | pinning the item depends on whether the item is present in the current CIL | ||
602 | or not. If we don't lock the CIL first before we check and pin the | ||
603 | object, we have a race with the CIL being flushed between the check and the pin | ||
604 | (or not pinning, as the case may be). Hence we must hold the CIL flush/commit | ||
605 | lock to guarantee that we pin the items correctly. | ||
606 | |||
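In code, the "pin on first insertion, under the flush lock" rule might look like the sketch below. A plain mutex stands in for the CIL commit/flush lock and the names are hypothetical; in the real design the lock is shared by all committers, as discussed in the following section:

	#include <pthread.h>
	#include <stdbool.h>

	/* Hypothetical log item state relevant to CIL insertion. */
	struct log_item {
		bool	in_cil;		/* already in the current CIL? */
		int	pin_count;	/* raised once per checkpoint context */
	};

	/* Stand-in for the CIL commit/flush lock. */
	static pthread_mutex_t cil_flush_lock = PTHREAD_MUTEX_INITIALIZER;

	/*
	 * Insert an item into the CIL at transaction commit. The pin must be
	 * taken under the same lock that excludes CIL flushes: checking
	 * "already in the CIL" and pinning outside that lock races with a
	 * flush emptying the CIL between the check and the pin.
	 */
	static void cil_insert_item(struct log_item *item)
	{
		pthread_mutex_lock(&cil_flush_lock);
		if (!item->in_cil) {
			item->pin_count++;	/* pin on first insertion only */
			item->in_cil = true;	/* cleared when the CIL is flushed */
		}
		/* ... add or replace the item's formatted buffer in the CIL ... */
		pthread_mutex_unlock(&cil_flush_lock);
	}

	/* Checkpoint completion issues the matching unpin, once per context. */
	static void cil_checkpoint_unpin(struct log_item *item)
	{
		item->pin_count--;
	}
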
607 | Delayed Logging: Concurrent Scalability | ||
608 | |||
609 | A fundamental requirement for the CIL is that accesses through transaction | ||
610 | commits must scale to many concurrent commits. The current transaction commit | ||
611 | code does not break down even when there are transactions coming from 2048 | ||
612 | processors at once. The current transaction code does not go any faster than if | ||
613 | there was only one CPU using it, but it does not slow down either. | ||
614 | |||
615 | As a result, the delayed logging transaction commit code needs to be designed | ||
616 | for concurrency from the ground up. It is obvious that there are serialisation | ||
617 | points in the design - the three important ones are: | ||
618 | |||
619 | 1. Locking out new transaction commits while flushing the CIL | ||
620 | 2. Adding items to the CIL and updating item space accounting | ||
621 | 3. Checkpoint commit ordering | ||
622 | |||
623 | Looking at the transaction commit and CIL flushing interactions, it is clear | ||
624 | that we have a many-to-one interaction here. That is, the only restriction on | ||
625 | the number of concurrent transactions that can be trying to commit at once is | ||
626 | the amount of space available in the log for their reservations. The practical | ||
627 | limit here is in the order of several hundred concurrent transactions for a | ||
628 | 128MB log, which means that it is generally one per CPU in a machine. | ||
629 | |||
630 | A transaction commit needs to hold out a flush for a relatively long period | ||
631 | of time - the pinning of log items needs to be done | ||
632 | while we are holding out a CIL flush, so at the moment that means it is held | ||
633 | across the formatting of the objects into memory buffers (i.e. while memcpy()s | ||
634 | are in progress). Ultimately, a two-pass algorithm where the formatting is done | ||
635 | separately from the pinning of objects could be used to reduce the hold time of | ||
636 | the transaction commit side. | ||
637 | |||
638 | Because of the number of potential transaction commit side holders, the lock | ||
639 | really needs to be a sleeping lock - if the CIL flush takes the lock, we do not | ||
640 | want every other CPU in the machine spinning on the CIL lock. Given that | ||
641 | flushing the CIL could involve walking a list of tens of thousands of log | ||
642 | items, it will get held for a significant time and so spin contention is a | ||
643 | significant concern. Preventing lots of CPUs spinning doing nothing is the | ||
644 | main reason for choosing a sleeping lock even though nothing in either the | ||
645 | transaction commit or CIL flush side sleeps with the lock held. | ||
646 | |||
647 | It should also be noted that CIL flushing is a relatively rare operation | ||
648 | compared to transaction commit for asynchronous transaction workloads - only | ||
649 | time will tell if using a read-write semaphore for exclusion will limit | ||
650 | transaction commit concurrency due to cache line bouncing of the lock on the | ||
651 | read side. | ||
652 | |||
653 | The second serialisation point is on the transaction commit side where items | ||
654 | are inserted into the CIL. Because transactions can enter this code | ||
655 | concurrently, the CIL needs to be protected separately from the above | ||
656 | commit/flush exclusion. It also needs to be an exclusive lock but it is only | ||
657 | held for a very short time and so a spin lock is appropriate here. It is | ||
658 | possible that this lock will become a contention point, but given the short | ||
659 | hold time - once per transaction - contention is unlikely. | ||
660 | |||
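The two lock types described above could be modelled in user space as follows; the rwlock stands in for a sleeping rw-semaphore, the names are again hypothetical, and lock initialisation is omitted:

	#include <pthread.h>

	struct cil {
		pthread_rwlock_t	ctx_lock;	/* commit (shared) vs push (exclusive) */
		pthread_spinlock_t	list_lock;	/* protects the CIL item list */
	};

	/*
	 * Transaction commit: many run concurrently, so they take the context
	 * lock shared and hold the list lock only for the brief insertion.
	 * In the design above the outer lock is a sleeping lock so that
	 * hundreds of committers don't spin while a push walks tens of
	 * thousands of items.
	 */
	void cil_commit_insert(struct cil *cil /* , items ... */)
	{
		pthread_rwlock_rdlock(&cil->ctx_lock);
		/* ... pin and format items into memory buffers ... */
		pthread_spin_lock(&cil->list_lock);
		/* ... splice items onto the CIL, update space accounting ... */
		pthread_spin_unlock(&cil->list_lock);
		pthread_rwlock_unlock(&cil->ctx_lock);
	}

	/* CIL push: excludes all committers only while the context is swapped. */
	void cil_push(struct cil *cil)
	{
		pthread_rwlock_wrlock(&cil->ctx_lock);
		/* ... detach the item list, switch checkpoint contexts ... */
		pthread_rwlock_unlock(&cil->ctx_lock);
		/* ... format and write the checkpoint without blocking commits ... */
	}
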
661 | The final serialisation point is the checkpoint commit record ordering code | ||
662 | that is run as part of the checkpoint commit and log force sequencing. The code | ||
663 | path that triggers a CIL flush (i.e. whatever triggers the log force) will enter | ||
664 | an ordering loop after writing all the log vectors into the log buffers but | ||
665 | before writing the commit record. This loop walks the list of committing | ||
666 | checkpoints and needs to block waiting for checkpoints to complete their commit | ||
667 | record write. As a result it needs a lock and a wait variable. Log force | ||
668 | sequencing also requires the same lock, list walk, and blocking mechanism to | ||
669 | ensure completion of checkpoints. | ||
670 | |||
671 | These two sequencing operations can use the same mechanism even though the | ||
672 | events they are waiting for are different. The checkpoint commit record | ||
673 | sequencing needs to wait until checkpoint contexts contain a commit LSN | ||
674 | (obtained through completion of a commit record write) while log force | ||
675 | sequencing needs to wait until previous checkpoint contexts are removed from | ||
676 | the committing list (i.e. they've completed). A simple wait variable and | ||
677 | broadcast wakeups (thundering herds) have been used to implement these two | ||
678 | serialisation queues. They use the same lock as the CIL, too. If we see too | ||
679 | much contention on the CIL lock, or too many context switches as a result of | ||
680 | the broadcast wakeups, these operations can be put under a new spinlock and | ||
681 | given separate wait lists to reduce lock contention and the number of processes | ||
682 | woken by the wrong event. | ||
683 | |||
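A sketch of the two ordering waits, sharing one lock and a broadcast condition variable as described. The structures repeat the hypothetical committing-list layout from the earlier sketch and initialisation is again omitted:

	#include <pthread.h>
	#include <stdint.h>
	#include <sys/queue.h>

	struct cil_ctx {
		uint64_t		sequence;
		uint64_t		commit_lsn;	/* 0 until the commit record is written */
		LIST_ENTRY(cil_ctx)	committing;
	};

	struct cil {
		pthread_mutex_t		lock;
		pthread_cond_t		commit_wait;	/* broadcast on every state change */
		LIST_HEAD(, cil_ctx)	committing;
	};

	/*
	 * Commit record ordering: before writing our own commit record, wait
	 * until every earlier committing checkpoint has written its commit
	 * record, so commit records hit the log in sequence order.
	 */
	void cil_order_commit_record(struct cil *cil, struct cil_ctx *ctx)
	{
		struct cil_ctx *c;

		pthread_mutex_lock(&cil->lock);
	restart:
		LIST_FOREACH(c, &cil->committing, committing) {
			if (c->sequence < ctx->sequence && c->commit_lsn == 0) {
				pthread_cond_wait(&cil->commit_wait, &cil->lock);
				goto restart;
			}
		}
		pthread_mutex_unlock(&cil->lock);
	}

	/* Log force sequencing waits for contexts to leave the list entirely. */
	void cil_wait_for_completion(struct cil *cil, uint64_t sequence)
	{
		struct cil_ctx *c;

		pthread_mutex_lock(&cil->lock);
	restart:
		LIST_FOREACH(c, &cil->committing, committing) {
			if (c->sequence <= sequence) {
				pthread_cond_wait(&cil->commit_wait, &cil->lock);
				goto restart;
			}
		}
		pthread_mutex_unlock(&cil->lock);
	}
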
684 | |||
685 | Lifecycle Changes | ||
686 | |||
687 | The existing log item life cycle is as follows: | ||
688 | |||
689 | 1. Transaction allocate | ||
690 | 2. Transaction reserve | ||
691 | 3. Lock item | ||
692 | 4. Join item to transaction | ||
693 | If not already attached, | ||
694 | Allocate log item | ||
695 | Attach log item to owner item | ||
696 | Attach log item to transaction | ||
697 | 5. Modify item | ||
698 | Record modifications in log item | ||
699 | 6. Transaction commit | ||
700 | Pin item in memory | ||
701 | Format item into log buffer | ||
702 | Write commit LSN into transaction | ||
703 | Unlock item | ||
704 | Attach transaction to log buffer | ||
705 | |||
706 | <log buffer IO dispatched> | ||
707 | <log buffer IO completes> | ||
708 | |||
709 | 7. Transaction completion | ||
710 | Mark log item committed | ||
711 | Insert log item into AIL | ||
712 | Write commit LSN into log item | ||
713 | Unpin log item | ||
714 | 8. AIL traversal | ||
715 | Lock item | ||
716 | Mark log item clean | ||
717 | Flush item to disk | ||
718 | |||
719 | <item IO completion> | ||
720 | |||
721 | 9. Log item removed from AIL | ||
722 | Moves log tail | ||
723 | Item unlocked | ||
724 | |||
725 | Essentially, steps 1-6 operate independently from step 7, which is also | ||
726 | independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 | ||
727 | at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur | ||
728 | at the same time. If the log item is in the AIL or between steps 6 and 7 | ||
729 | and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 | ||
730 | are entered and completed is the object considered clean. | ||
731 | |||
732 | With delayed logging, there are new steps inserted into the life cycle: | ||
733 | |||
734 | 1. Transaction allocate | ||
735 | 2. Transaction reserve | ||
736 | 3. Lock item | ||
737 | 4. Join item to transaction | ||
738 | If not already attached, | ||
739 | Allocate log item | ||
740 | Attach log item to owner item | ||
741 | Attach log item to transaction | ||
742 | 5. Modify item | ||
743 | Record modifications in log item | ||
744 | 6. Transaction commit | ||
745 | Pin item in memory if not pinned in CIL | ||
746 | Format item into log vector + buffer | ||
747 | Attach log vector and buffer to log item | ||
748 | Insert log item into CIL | ||
749 | Write CIL context sequence into transaction | ||
750 | Unlock item | ||
751 | |||
752 | <next log force> | ||
753 | |||
754 | 7. CIL push | ||
755 | lock CIL flush | ||
756 | Chain log vectors and buffers together | ||
757 | Remove items from CIL | ||
758 | unlock CIL flush | ||
759 | write log vectors into log | ||
760 | sequence commit records | ||
761 | attach checkpoint context to log buffer | ||
762 | |||
763 | <log buffer IO dispatched> | ||
764 | <log buffer IO completes> | ||
765 | |||
766 | 8. Checkpoint completion | ||
767 | Mark log item committed | ||
768 | Insert item into AIL | ||
769 | Write commit LSN into log item | ||
770 | Unpin log item | ||
771 | 9. AIL traversal | ||
772 | Lock item | ||
773 | Mark log item clean | ||
774 | Flush item to disk | ||
775 | <item IO completion> | ||
776 | 10. Log item removed from AIL | ||
777 | Moves log tail | ||
778 | Item unlocked | ||
779 | |||
780 | From this, it can be seen that the only life cycle differences between the two | ||
781 | logging methods are in the middle of the life cycle - they still have the same | ||
782 | beginning and end and execution constraints. The only differences are in the | ||
783 | committing of the log items to the log itself and the completion processing. | ||
784 | Hence delayed logging should not introduce any constraints on log item | ||
785 | behaviour, allocation or freeing that don't already exist. | ||
786 | |||
787 | As a result of this zero-impact "insertion" of delayed logging infrastructure | ||
788 | and the design of the internal structures to avoid on-disk format changes, we | ||
789 | can basically switch between delayed logging and the existing mechanism with a | ||
790 | mount option. Fundamentally, there is no reason why the log manager would not | ||
791 | be able to swap methods automatically and transparently depending on load | ||
792 | characteristics, but this should not be necessary if delayed logging works as | ||
793 | designed. | ||
794 | |||
795 | Roadmap: | ||
796 | |||
797 | 2.6.37 Remove experimental tag from mount option | ||
798 | => should be roughly 6 months after initial merge | ||
799 | => enough time to: | ||
800 | => gain confidence and fix problems reported by early | ||
801 | adopters (a.k.a. guinea pigs) | ||
802 | => address worst performance regressions and undesired | ||
803 | behaviours | ||
804 | => start tuning/optimising code for parallelism | ||
805 | => start tuning/optimising algorithms consuming | ||
806 | excessive CPU time | ||
807 | |||
808 | 2.6.39 Switch default mount option to use delayed logging | ||
809 | => should be roughly 12 months after initial merge | ||
810 | => enough time to shake out remaining problems before next round of | ||
811 | enterprise distro kernel rebases | ||