diff options
Diffstat (limited to 'Documentation/filesystems')
21 files changed, 953 insertions, 42 deletions
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 06bbbed71206..61c98f03baa1 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking | |||
@@ -178,7 +178,7 @@ prototypes: | |||
178 | locking rules: | 178 | locking rules: |
179 | All except set_page_dirty may block | 179 | All except set_page_dirty may block |
180 | 180 | ||
181 | BKL PageLocked(page) i_sem | 181 | BKL PageLocked(page) i_mutex |
182 | writepage: no yes, unlocks (see below) | 182 | writepage: no yes, unlocks (see below) |
183 | readpage: no yes, unlocks | 183 | readpage: no yes, unlocks |
184 | sync_page: no maybe | 184 | sync_page: no maybe |
@@ -429,8 +429,9 @@ check_flags: no | |||
429 | implementations. If your fs is not using generic_file_llseek, you | 429 | implementations. If your fs is not using generic_file_llseek, you |
430 | need to acquire and release the appropriate locks in your ->llseek(). | 430 | need to acquire and release the appropriate locks in your ->llseek(). |
431 | For many filesystems, it is probably safe to acquire the inode | 431 | For many filesystems, it is probably safe to acquire the inode |
432 | semaphore. Note some filesystems (i.e. remote ones) provide no | 432 | mutex or just to use i_size_read() instead. |
433 | protection for i_size so you will need to use the BKL. | 433 | Note: this does not protect the file->f_pos against concurrent modifications |
434 | since this is something the userspace has to take care about. | ||
434 | 435 | ||
435 | Note: ext2_release() was *the* source of contention on fs-intensive | 436 | Note: ext2_release() was *the* source of contention on fs-intensive |
436 | loads and dropping BKL on ->release() helps to get rid of that (we still | 437 | loads and dropping BKL on ->release() helps to get rid of that (we still |
diff --git a/Documentation/filesystems/autofs4-mount-control.txt b/Documentation/filesystems/autofs4-mount-control.txt index 8f78ded4b648..51986bf08a4d 100644 --- a/Documentation/filesystems/autofs4-mount-control.txt +++ b/Documentation/filesystems/autofs4-mount-control.txt | |||
@@ -146,7 +146,7 @@ found to be inadequate, in this case. The Generic Netlink system was | |||
146 | used for this as raw Netlink would lead to a significant increase in | 146 | used for this as raw Netlink would lead to a significant increase in |
147 | complexity. There's no question that the Generic Netlink system is an | 147 | complexity. There's no question that the Generic Netlink system is an |
148 | elegant solution for common case ioctl functions but it's not a complete | 148 | elegant solution for common case ioctl functions but it's not a complete |
149 | replacement probably because it's primary purpose in life is to be a | 149 | replacement probably because its primary purpose in life is to be a |
150 | message bus implementation rather than specifically an ioctl replacement. | 150 | message bus implementation rather than specifically an ioctl replacement. |
151 | While it would be possible to work around this there is one concern | 151 | While it would be possible to work around this there is one concern |
152 | that lead to the decision to not use it. This is that the autofs | 152 | that lead to the decision to not use it. This is that the autofs |
diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt index 0660c9f5deef..763d8ebbbebd 100644 --- a/Documentation/filesystems/ceph.txt +++ b/Documentation/filesystems/ceph.txt | |||
@@ -90,7 +90,7 @@ Mount Options | |||
90 | Specify the IP and/or port the client should bind to locally. | 90 | Specify the IP and/or port the client should bind to locally. |
91 | There is normally not much reason to do this. If the IP is not | 91 | There is normally not much reason to do this. If the IP is not |
92 | specified, the client's IP address is determined by looking at the | 92 | specified, the client's IP address is determined by looking at the |
93 | address it's connection to the monitor originates from. | 93 | address its connection to the monitor originates from. |
94 | 94 | ||
95 | wsize=X | 95 | wsize=X |
96 | Specify the maximum write size in bytes. By default there is no | 96 | Specify the maximum write size in bytes. By default there is no |
diff --git a/Documentation/filesystems/dlmfs.txt b/Documentation/filesystems/dlmfs.txt index c50bbb2d52b4..1b528b2ad809 100644 --- a/Documentation/filesystems/dlmfs.txt +++ b/Documentation/filesystems/dlmfs.txt | |||
@@ -47,7 +47,7 @@ You'll want to start heartbeating on a volume which all the nodes in | |||
47 | your lockspace can access. The easiest way to do this is via | 47 | your lockspace can access. The easiest way to do this is via |
48 | ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires | 48 | ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires |
49 | that an OCFS2 file system be in place so that it can automatically | 49 | that an OCFS2 file system be in place so that it can automatically |
50 | find it's heartbeat area, though it will eventually support heartbeat | 50 | find its heartbeat area, though it will eventually support heartbeat |
51 | against raw disks. | 51 | against raw disks. |
52 | 52 | ||
53 | Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed | 53 | Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed |
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 867c5b50cb42..272f80d5f966 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt | |||
@@ -59,8 +59,19 @@ commit=nrsec (*) Ext3 can be told to sync all its data and metadata | |||
59 | Setting it to very large values will improve | 59 | Setting it to very large values will improve |
60 | performance. | 60 | performance. |
61 | 61 | ||
62 | barrier=1 This enables/disables barriers. barrier=0 disables | 62 | barrier=<0(*)|1> This enables/disables the use of write barriers in |
63 | it, barrier=1 enables it. | 63 | barrier the jbd code. barrier=0 disables, barrier=1 enables. |
64 | nobarrier (*) This also requires an IO stack which can support | ||
65 | barriers, and if jbd gets an error on a barrier | ||
66 | write, it will disable again with a warning. | ||
67 | Write barriers enforce proper on-disk ordering | ||
68 | of journal commits, making volatile disk write caches | ||
69 | safe to use, at some performance penalty. If | ||
70 | your disks are battery-backed in one way or another, | ||
71 | disabling barriers may safely improve performance. | ||
72 | The mount options "barrier" and "nobarrier" can | ||
73 | also be used to enable or disable barriers, for | ||
74 | consistency with other ext3 mount options. | ||
64 | 75 | ||
65 | orlov (*) This enables the new Orlov block allocator. It is | 76 | orlov (*) This enables the new Orlov block allocator. It is |
66 | enabled by default. | 77 | enabled by default. |
diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.txt index 606233cd4618..1b805a0efbb0 100644 --- a/Documentation/filesystems/fiemap.txt +++ b/Documentation/filesystems/fiemap.txt | |||
@@ -38,7 +38,7 @@ flags, it will return EBADR and the contents of fm_flags will contain | |||
38 | the set of flags which caused the error. If the kernel is compatible | 38 | the set of flags which caused the error. If the kernel is compatible |
39 | with all flags passed, the contents of fm_flags will be unmodified. | 39 | with all flags passed, the contents of fm_flags will be unmodified. |
40 | It is up to userspace to determine whether rejection of a particular | 40 | It is up to userspace to determine whether rejection of a particular |
41 | flag is fatal to it's operation. This scheme is intended to allow the | 41 | flag is fatal to its operation. This scheme is intended to allow the |
42 | fiemap interface to grow in the future but without losing | 42 | fiemap interface to grow in the future but without losing |
43 | compatibility with old software. | 43 | compatibility with old software. |
44 | 44 | ||
@@ -56,7 +56,7 @@ If this flag is set, the kernel will sync the file before mapping extents. | |||
56 | 56 | ||
57 | * FIEMAP_FLAG_XATTR | 57 | * FIEMAP_FLAG_XATTR |
58 | If this flag is set, the extents returned will describe the inodes | 58 | If this flag is set, the extents returned will describe the inodes |
59 | extended attribute lookup tree, instead of it's data tree. | 59 | extended attribute lookup tree, instead of its data tree. |
60 | 60 | ||
61 | 61 | ||
62 | Extent Mapping | 62 | Extent Mapping |
@@ -89,7 +89,7 @@ struct fiemap_extent { | |||
89 | }; | 89 | }; |
90 | 90 | ||
91 | All offsets and lengths are in bytes and mirror those on disk. It is valid | 91 | All offsets and lengths are in bytes and mirror those on disk. It is valid |
92 | for an extents logical offset to start before the request or it's logical | 92 | for an extents logical offset to start before the request or its logical |
93 | length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is | 93 | length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is |
94 | returned, fe_logical, fe_physical, and fe_length will be aligned to the | 94 | returned, fe_logical, fe_physical, and fe_length will be aligned to the |
95 | block size of the file system. With the exception of extents flagged as | 95 | block size of the file system. With the exception of extents flagged as |
@@ -125,7 +125,7 @@ been allocated for the file yet. | |||
125 | 125 | ||
126 | * FIEMAP_EXTENT_DELALLOC | 126 | * FIEMAP_EXTENT_DELALLOC |
127 | - This will also set FIEMAP_EXTENT_UNKNOWN. | 127 | - This will also set FIEMAP_EXTENT_UNKNOWN. |
128 | Delayed allocation - while there is data for this extent, it's | 128 | Delayed allocation - while there is data for this extent, its |
129 | physical location has not been allocated yet. | 129 | physical location has not been allocated yet. |
130 | 130 | ||
131 | * FIEMAP_EXTENT_ENCODED | 131 | * FIEMAP_EXTENT_ENCODED |
@@ -159,7 +159,7 @@ Data is located within a meta data block. | |||
159 | Data is packed into a block with data from other files. | 159 | Data is packed into a block with data from other files. |
160 | 160 | ||
161 | * FIEMAP_EXTENT_UNWRITTEN | 161 | * FIEMAP_EXTENT_UNWRITTEN |
162 | Unwritten extent - the extent is allocated but it's data has not been | 162 | Unwritten extent - the extent is allocated but its data has not been |
163 | initialized. This indicates the extent's data will be all zero if read | 163 | initialized. This indicates the extent's data will be all zero if read |
164 | through the filesystem but the contents are undefined if read directly from | 164 | through the filesystem but the contents are undefined if read directly from |
165 | the device. | 165 | the device. |
@@ -176,7 +176,7 @@ VFS -> File System Implementation | |||
176 | 176 | ||
177 | File systems wishing to support fiemap must implement a ->fiemap callback on | 177 | File systems wishing to support fiemap must implement a ->fiemap callback on |
178 | their inode_operations structure. The fs ->fiemap call is responsible for | 178 | their inode_operations structure. The fs ->fiemap call is responsible for |
179 | defining it's set of supported fiemap flags, and calling a helper function on | 179 | defining its set of supported fiemap flags, and calling a helper function on |
180 | each discovered extent: | 180 | each discovered extent: |
181 | 181 | ||
182 | struct inode_operations { | 182 | struct inode_operations { |
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt index 397a41adb4c3..13af4a49e7db 100644 --- a/Documentation/filesystems/fuse.txt +++ b/Documentation/filesystems/fuse.txt | |||
@@ -91,7 +91,7 @@ Mount options | |||
91 | 'default_permissions' | 91 | 'default_permissions' |
92 | 92 | ||
93 | By default FUSE doesn't check file access permissions, the | 93 | By default FUSE doesn't check file access permissions, the |
94 | filesystem is free to implement it's access policy or leave it to | 94 | filesystem is free to implement its access policy or leave it to |
95 | the underlying file access mechanism (e.g. in case of network | 95 | the underlying file access mechanism (e.g. in case of network |
96 | filesystems). This option enables permission checking, restricting | 96 | filesystems). This option enables permission checking, restricting |
97 | access based on file mode. It is usually useful together with the | 97 | access based on file mode. It is usually useful together with the |
@@ -171,7 +171,7 @@ or may honor them by sending a reply to the _original_ request, with | |||
171 | the error set to EINTR. | 171 | the error set to EINTR. |
172 | 172 | ||
173 | It is also possible that there's a race between processing the | 173 | It is also possible that there's a race between processing the |
174 | original request and it's INTERRUPT request. There are two possibilities: | 174 | original request and its INTERRUPT request. There are two possibilities: |
175 | 175 | ||
176 | 1) The INTERRUPT request is processed before the original request is | 176 | 1) The INTERRUPT request is processed before the original request is |
177 | processed | 177 | processed |
diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.txt index 5e3ab8f3beff..0b59c0200912 100644 --- a/Documentation/filesystems/gfs2.txt +++ b/Documentation/filesystems/gfs2.txt | |||
@@ -1,7 +1,7 @@ | |||
1 | Global File System | 1 | Global File System |
2 | ------------------ | 2 | ------------------ |
3 | 3 | ||
4 | http://sources.redhat.com/cluster/ | 4 | http://sources.redhat.com/cluster/wiki/ |
5 | 5 | ||
6 | GFS is a cluster file system. It allows a cluster of computers to | 6 | GFS is a cluster file system. It allows a cluster of computers to |
7 | simultaneously use a block device that is shared between them (with FC, | 7 | simultaneously use a block device that is shared between them (with FC, |
@@ -36,11 +36,11 @@ GFS2 is not on-disk compatible with previous versions of GFS, but it | |||
36 | is pretty close. | 36 | is pretty close. |
37 | 37 | ||
38 | The following man pages can be found at the URL above: | 38 | The following man pages can be found at the URL above: |
39 | fsck.gfs2 to repair a filesystem | 39 | fsck.gfs2 to repair a filesystem |
40 | gfs2_grow to expand a filesystem online | 40 | gfs2_grow to expand a filesystem online |
41 | gfs2_jadd to add journals to a filesystem online | 41 | gfs2_jadd to add journals to a filesystem online |
42 | gfs2_tool to manipulate, examine and tune a filesystem | 42 | gfs2_tool to manipulate, examine and tune a filesystem |
43 | gfs2_quota to examine and change quota values in a filesystem | 43 | gfs2_quota to examine and change quota values in a filesystem |
44 | gfs2_convert to convert a gfs filesystem to gfs2 in-place | 44 | gfs2_convert to convert a gfs filesystem to gfs2 in-place |
45 | mount.gfs2 to help mount(8) mount a filesystem | 45 | mount.gfs2 to help mount(8) mount a filesystem |
46 | mkfs.gfs2 to make a filesystem | 46 | mkfs.gfs2 to make a filesystem |
diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.txt index fa45c3baed98..74630bd504fb 100644 --- a/Documentation/filesystems/hpfs.txt +++ b/Documentation/filesystems/hpfs.txt | |||
@@ -103,7 +103,7 @@ to analyze or change OS2SYS.INI. | |||
103 | Codepages | 103 | Codepages |
104 | 104 | ||
105 | HPFS can contain several uppercasing tables for several codepages and each | 105 | HPFS can contain several uppercasing tables for several codepages and each |
106 | file has a pointer to codepage it's name is in. However OS/2 was created in | 106 | file has a pointer to codepage its name is in. However OS/2 was created in |
107 | America where people don't care much about codepages and so multiple codepages | 107 | America where people don't care much about codepages and so multiple codepages |
108 | support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk. | 108 | support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk. |
109 | Once I booted English OS/2 working in cp 850 and I created a file on my 852 | 109 | Once I booted English OS/2 working in cp 850 and I created a file on my 852 |
diff --git a/Documentation/filesystems/logfs.txt b/Documentation/filesystems/logfs.txt index e64c94ba401a..bca42c22a143 100644 --- a/Documentation/filesystems/logfs.txt +++ b/Documentation/filesystems/logfs.txt | |||
@@ -59,7 +59,7 @@ Levels | |||
59 | ------ | 59 | ------ |
60 | 60 | ||
61 | Garbage collection (GC) may fail if all data is written | 61 | Garbage collection (GC) may fail if all data is written |
62 | indiscriminately. One requirement of GC is that data is seperated | 62 | indiscriminately. One requirement of GC is that data is separated |
63 | roughly according to the distance between the tree root and the data. | 63 | roughly according to the distance between the tree root and the data. |
64 | Effectively that means all file data is on level 0, indirect blocks | 64 | Effectively that means all file data is on level 0, indirect blocks |
65 | are on levels 1, 2, 3 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks, | 65 | are on levels 1, 2, 3 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks, |
@@ -67,7 +67,7 @@ respectively. Inode file data is on level 6 for the inodes and 7-11 | |||
67 | for indirect blocks. | 67 | for indirect blocks. |
68 | 68 | ||
69 | Each segment contains objects of a single level only. As a result, | 69 | Each segment contains objects of a single level only. As a result, |
70 | each level requires its own seperate segment to be open for writing. | 70 | each level requires its own separate segment to be open for writing. |
71 | 71 | ||
72 | Inode File | 72 | Inode File |
73 | ---------- | 73 | ---------- |
@@ -106,9 +106,9 @@ Vim | |||
106 | --- | 106 | --- |
107 | 107 | ||
108 | By cleverly predicting the life time of data, it is possible to | 108 | By cleverly predicting the life time of data, it is possible to |
109 | seperate long-living data from short-living data and thereby reduce | 109 | separate long-living data from short-living data and thereby reduce |
110 | the GC overhead later. Each type of distinc life expectency (vim) can | 110 | the GC overhead later. Each type of distinc life expectency (vim) can |
111 | have a seperate segment open for writing. Each (level, vim) tupel can | 111 | have a separate segment open for writing. Each (level, vim) tupel can |
112 | be open just once. If an open segment with unknown vim is encountered | 112 | be open just once. If an open segment with unknown vim is encountered |
113 | at mount time, it is closed and ignored henceforth. | 113 | at mount time, it is closed and ignored henceforth. |
114 | 114 | ||
diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt index 6a53a84afc72..04884914a1c8 100644 --- a/Documentation/filesystems/nfs/nfs41-server.txt +++ b/Documentation/filesystems/nfs/nfs41-server.txt | |||
@@ -137,7 +137,7 @@ NS*| OPENATTR | OPT | | Section 18.17 | | |||
137 | | READ | REQ | | Section 18.22 | | 137 | | READ | REQ | | Section 18.22 | |
138 | | READDIR | REQ | | Section 18.23 | | 138 | | READDIR | REQ | | Section 18.23 | |
139 | | READLINK | OPT | | Section 18.24 | | 139 | | READLINK | OPT | | Section 18.24 | |
140 | NS | RECLAIM_COMPLETE | REQ | | Section 18.51 | | 140 | | RECLAIM_COMPLETE | REQ | | Section 18.51 | |
141 | | RELEASE_LOCKOWNER | MNI | | N/A | | 141 | | RELEASE_LOCKOWNER | MNI | | N/A | |
142 | | REMOVE | REQ | | Section 18.25 | | 142 | | REMOVE | REQ | | Section 18.25 | |
143 | | RENAME | REQ | | Section 18.26 | | 143 | | RENAME | REQ | | Section 18.26 | |
diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.txt index 8a382bea6808..ebcaaee21616 100644 --- a/Documentation/filesystems/nfs/rpc-cache.txt +++ b/Documentation/filesystems/nfs/rpc-cache.txt | |||
@@ -185,7 +185,7 @@ failed lookup meant a definite 'no'. | |||
185 | request/response format | 185 | request/response format |
186 | ----------------------- | 186 | ----------------------- |
187 | 187 | ||
188 | While each cache is free to use it's own format for requests | 188 | While each cache is free to use its own format for requests |
189 | and responses over channel, the following is recommended as | 189 | and responses over channel, the following is recommended as |
190 | appropriate and support routines are available to help: | 190 | appropriate and support routines are available to help: |
191 | Each request or response record should be printable ASCII | 191 | Each request or response record should be printable ASCII |
diff --git a/Documentation/filesystems/nilfs2.txt b/Documentation/filesystems/nilfs2.txt index cf6d0d85ca82..d3e7673995eb 100644 --- a/Documentation/filesystems/nilfs2.txt +++ b/Documentation/filesystems/nilfs2.txt | |||
@@ -50,8 +50,8 @@ NILFS2 supports the following mount options: | |||
50 | (*) == default | 50 | (*) == default |
51 | 51 | ||
52 | nobarrier Disables barriers. | 52 | nobarrier Disables barriers. |
53 | errors=continue(*) Keep going on a filesystem error. | 53 | errors=continue Keep going on a filesystem error. |
54 | errors=remount-ro Remount the filesystem read-only on an error. | 54 | errors=remount-ro(*) Remount the filesystem read-only on an error. |
55 | errors=panic Panic and halt the machine if an error occurs. | 55 | errors=panic Panic and halt the machine if an error occurs. |
56 | cp=n Specify the checkpoint-number of the snapshot to be | 56 | cp=n Specify the checkpoint-number of the snapshot to be |
57 | mounted. Checkpoints and snapshots are listed by lscp | 57 | mounted. Checkpoints and snapshots are listed by lscp |
diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt index c58b9f5ba002..1f7ae144f6d8 100644 --- a/Documentation/filesystems/ocfs2.txt +++ b/Documentation/filesystems/ocfs2.txt | |||
@@ -80,3 +80,10 @@ user_xattr (*) Enables Extended User Attributes. | |||
80 | nouser_xattr Disables Extended User Attributes. | 80 | nouser_xattr Disables Extended User Attributes. |
81 | acl Enables POSIX Access Control Lists support. | 81 | acl Enables POSIX Access Control Lists support. |
82 | noacl (*) Disables POSIX Access Control Lists support. | 82 | noacl (*) Disables POSIX Access Control Lists support. |
83 | resv_level=2 (*) Set how agressive allocation reservations will be. | ||
84 | Valid values are between 0 (reservations off) to 8 | ||
85 | (maximum space for reservations). | ||
86 | dir_resv_level= (*) By default, directory reservations will scale with file | ||
87 | reservations - users should rarely need to change this | ||
88 | value. If allocation reservations are turned off, this | ||
89 | option will have no effect. | ||
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 1e359b62c40a..9fb6cbe70bde 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt | |||
@@ -305,7 +305,7 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7) | |||
305 | cgtime guest time of the task children in jiffies | 305 | cgtime guest time of the task children in jiffies |
306 | .............................................................................. | 306 | .............................................................................. |
307 | 307 | ||
308 | The /proc/PID/map file containing the currently mapped memory regions and | 308 | The /proc/PID/maps file containing the currently mapped memory regions and |
309 | their access permissions. | 309 | their access permissions. |
310 | 310 | ||
311 | The format is: | 311 | The format is: |
@@ -565,6 +565,10 @@ The default_smp_affinity mask applies to all non-active IRQs, which are the | |||
565 | IRQs which have not yet been allocated/activated, and hence which lack a | 565 | IRQs which have not yet been allocated/activated, and hence which lack a |
566 | /proc/irq/[0-9]* directory. | 566 | /proc/irq/[0-9]* directory. |
567 | 567 | ||
568 | The node file on an SMP system shows the node to which the device using the IRQ | ||
569 | reports itself as being attached. This hardware locality information does not | ||
570 | include information about any possible driver locality preference. | ||
571 | |||
568 | prof_cpu_mask specifies which CPUs are to be profiled by the system wide | 572 | prof_cpu_mask specifies which CPUs are to be profiled by the system wide |
569 | profiler. Default value is ffffffff (all cpus). | 573 | profiler. Default value is ffffffff (all cpus). |
570 | 574 | ||
@@ -964,7 +968,7 @@ your system and how much traffic was routed over those devices: | |||
964 | ...] 1375103 17405 0 0 0 0 0 0 | 968 | ...] 1375103 17405 0 0 0 0 0 0 |
965 | ...] 1703981 5535 0 0 0 3 0 0 | 969 | ...] 1703981 5535 0 0 0 3 0 0 |
966 | 970 | ||
967 | In addition, each Channel Bond interface has it's own directory. For | 971 | In addition, each Channel Bond interface has its own directory. For |
968 | example, the bond0 device will have a directory called /proc/net/bond0/. | 972 | example, the bond0 device will have a directory called /proc/net/bond0/. |
969 | It will contain information that is specific to that bond, such as the | 973 | It will contain information that is specific to that bond, such as the |
970 | current slaves of the bond, the link status of the slaves, and how | 974 | current slaves of the bond, the link status of the slaves, and how |
@@ -1361,7 +1365,7 @@ been accounted as having caused 1MB of write. | |||
1361 | In other words: The number of bytes which this process caused to not happen, | 1365 | In other words: The number of bytes which this process caused to not happen, |
1362 | by truncating pagecache. A task can cause "negative" IO too. If this task | 1366 | by truncating pagecache. A task can cause "negative" IO too. If this task |
1363 | truncates some dirty pagecache, some IO which another task has been accounted | 1367 | truncates some dirty pagecache, some IO which another task has been accounted |
1364 | for (in it's write_bytes) will not be happening. We _could_ just subtract that | 1368 | for (in its write_bytes) will not be happening. We _could_ just subtract that |
1365 | from the truncating task's write_bytes, but there is information loss in doing | 1369 | from the truncating task's write_bytes, but there is information loss in doing |
1366 | that. | 1370 | that. |
1367 | 1371 | ||
diff --git a/Documentation/filesystems/smbfs.txt b/Documentation/filesystems/smbfs.txt index f673ef0de0f7..194fb0decd2c 100644 --- a/Documentation/filesystems/smbfs.txt +++ b/Documentation/filesystems/smbfs.txt | |||
@@ -3,6 +3,6 @@ protocol used by Windows for Workgroups, Windows 95 and Windows NT. | |||
3 | Smbfs was inspired by Samba, the program written by Andrew Tridgell | 3 | Smbfs was inspired by Samba, the program written by Andrew Tridgell |
4 | that turns any Unix host into a file server for DOS or Windows clients. | 4 | that turns any Unix host into a file server for DOS or Windows clients. |
5 | 5 | ||
6 | Smbfs is a SMB client, but uses parts of samba for it's operation. For | 6 | Smbfs is a SMB client, but uses parts of samba for its operation. For |
7 | more info on samba, including documentation, please go to | 7 | more info on samba, including documentation, please go to |
8 | http://www.samba.org/ and then on to your nearest mirror. | 8 | http://www.samba.org/ and then on to your nearest mirror. |
diff --git a/Documentation/filesystems/squashfs.txt b/Documentation/filesystems/squashfs.txt index b324c033035a..203f7202cc9e 100644 --- a/Documentation/filesystems/squashfs.txt +++ b/Documentation/filesystems/squashfs.txt | |||
@@ -38,7 +38,8 @@ Hard link support: yes no | |||
38 | Real inode numbers: yes no | 38 | Real inode numbers: yes no |
39 | 32-bit uids/gids: yes no | 39 | 32-bit uids/gids: yes no |
40 | File creation time: yes no | 40 | File creation time: yes no |
41 | Xattr and ACL support: no no | 41 | Xattr support: yes no |
42 | ACL support: no no | ||
42 | 43 | ||
43 | Squashfs compresses data, inodes and directories. In addition, inode and | 44 | Squashfs compresses data, inodes and directories. In addition, inode and |
44 | directory data are highly compacted, and packed on byte boundaries. Each | 45 | directory data are highly compacted, and packed on byte boundaries. Each |
@@ -58,7 +59,7 @@ obtained from this site also. | |||
58 | 3. SQUASHFS FILESYSTEM DESIGN | 59 | 3. SQUASHFS FILESYSTEM DESIGN |
59 | ----------------------------- | 60 | ----------------------------- |
60 | 61 | ||
61 | A squashfs filesystem consists of seven parts, packed together on a byte | 62 | A squashfs filesystem consists of a maximum of eight parts, packed together on a byte |
62 | alignment: | 63 | alignment: |
63 | 64 | ||
64 | --------------- | 65 | --------------- |
@@ -80,6 +81,9 @@ alignment: | |||
80 | |---------------| | 81 | |---------------| |
81 | | uid/gid | | 82 | | uid/gid | |
82 | | lookup table | | 83 | | lookup table | |
84 | |---------------| | ||
85 | | xattr | | ||
86 | | table | | ||
83 | --------------- | 87 | --------------- |
84 | 88 | ||
85 | Compressed data blocks are written to the filesystem as files are read from | 89 | Compressed data blocks are written to the filesystem as files are read from |
@@ -192,6 +196,26 @@ This table is stored compressed into metadata blocks. A second index table is | |||
192 | used to locate these. This second index table for speed of access (and because | 196 | used to locate these. This second index table for speed of access (and because |
193 | it is small) is read at mount time and cached in memory. | 197 | it is small) is read at mount time and cached in memory. |
194 | 198 | ||
199 | 3.7 Xattr table | ||
200 | --------------- | ||
201 | |||
202 | The xattr table contains extended attributes for each inode. The xattrs | ||
203 | for each inode are stored in a list, each list entry containing a type, | ||
204 | name and value field. The type field encodes the xattr prefix | ||
205 | ("user.", "trusted." etc) and it also encodes how the name/value fields | ||
206 | should be interpreted. Currently the type indicates whether the value | ||
207 | is stored inline (in which case the value field contains the xattr value), | ||
208 | or if it is stored out of line (in which case the value field stores a | ||
209 | reference to where the actual value is stored). This allows large values | ||
210 | to be stored out of line improving scanning and lookup performance and it | ||
211 | also allows values to be de-duplicated, the value being stored once, and | ||
212 | all other occurences holding an out of line reference to that value. | ||
213 | |||
214 | The xattr lists are packed into compressed 8K metadata blocks. | ||
215 | To reduce overhead in inodes, rather than storing the on-disk | ||
216 | location of the xattr list inside each inode, a 32-bit xattr id | ||
217 | is stored. This xattr id is mapped into the location of the xattr | ||
218 | list using a second xattr id lookup table. | ||
195 | 219 | ||
196 | 4. TODOS AND OUTSTANDING ISSUES | 220 | 4. TODOS AND OUTSTANDING ISSUES |
197 | ------------------------------- | 221 | ------------------------------- |
@@ -199,9 +223,7 @@ it is small) is read at mount time and cached in memory. | |||
199 | 4.1 Todo list | 223 | 4.1 Todo list |
200 | ------------- | 224 | ------------- |
201 | 225 | ||
202 | Implement Xattr and ACL support. The Squashfs 4.0 filesystem layout has hooks | 226 | Implement ACL support. |
203 | for these but the code has not been written. Once the code has been written | ||
204 | the existing layout should not require modification. | ||
205 | 227 | ||
206 | 4.2 Squashfs internal cache | 228 | 4.2 Squashfs internal cache |
207 | --------------------------- | 229 | --------------------------- |
diff --git a/Documentation/filesystems/sysfs-tagging.txt b/Documentation/filesystems/sysfs-tagging.txt new file mode 100644 index 000000000000..caaaf1266d8f --- /dev/null +++ b/Documentation/filesystems/sysfs-tagging.txt | |||
@@ -0,0 +1,42 @@ | |||
1 | Sysfs tagging | ||
2 | ------------- | ||
3 | |||
4 | (Taken almost verbatim from Eric Biederman's netns tagging patch | ||
5 | commit msg) | ||
6 | |||
7 | The problem. Network devices show up in sysfs and with the network | ||
8 | namespace active multiple devices with the same name can show up in | ||
9 | the same directory, ouch! | ||
10 | |||
11 | To avoid that problem and allow existing applications in network | ||
12 | namespaces to see the same interface that is currently presented in | ||
13 | sysfs, sysfs now has tagging directory support. | ||
14 | |||
15 | By using the network namespace pointers as tags to separate out the | ||
16 | the sysfs directory entries we ensure that we don't have conflicts | ||
17 | in the directories and applications only see a limited set of | ||
18 | the network devices. | ||
19 | |||
20 | Each sysfs directory entry may be tagged with zero or one | ||
21 | namespaces. A sysfs_dirent is augmented with a void *s_ns. If a | ||
22 | directory entry is tagged, then sysfs_dirent->s_flags will have a | ||
23 | flag between KOBJ_NS_TYPE_NONE and KOBJ_NS_TYPES, and s_ns will | ||
24 | point to the namespace to which it belongs. | ||
25 | |||
26 | Each sysfs superblock's sysfs_super_info contains an array void | ||
27 | *ns[KOBJ_NS_TYPES]. When a a task in a tagging namespace | ||
28 | kobj_nstype first mounts sysfs, a new superblock is created. It | ||
29 | will be differentiated from other sysfs mounts by having its | ||
30 | s_fs_info->ns[kobj_nstype] set to the new namespace. Note that | ||
31 | through bind mounting and mounts propagation, a task can easily view | ||
32 | the contents of other namespaces' sysfs mounts. Therefore, when a | ||
33 | namespace exits, it will call kobj_ns_exit() to invalidate any | ||
34 | sysfs_dirent->s_ns pointers pointing to it. | ||
35 | |||
36 | Users of this interface: | ||
37 | - define a type in the kobj_ns_type enumeration. | ||
38 | - call kobj_ns_type_register() with its kobj_ns_type_operations which has | ||
39 | - current_ns() which returns current's namespace | ||
40 | - netlink_ns() which returns a socket's namespace | ||
41 | - initial_ns() which returns the initial namesapce | ||
42 | - call kobj_ns_exit() when an individual tag is no longer valid | ||
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index fe09a2cb1858..98ef55124158 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt | |||
@@ -94,11 +94,19 @@ NodeList format is a comma-separated list of decimal numbers and ranges, | |||
94 | a range being two hyphen-separated decimal numbers, the smallest and | 94 | a range being two hyphen-separated decimal numbers, the smallest and |
95 | largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15 | 95 | largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15 |
96 | 96 | ||
97 | A memory policy with a valid NodeList will be saved, as specified, for | ||
98 | use at file creation time. When a task allocates a file in the file | ||
99 | system, the mount option memory policy will be applied with a NodeList, | ||
100 | if any, modified by the calling task's cpuset constraints | ||
101 | [See Documentation/cgroups/cpusets.txt] and any optional flags, listed | ||
102 | below. If the resulting NodeLists is the empty set, the effective memory | ||
103 | policy for the file will revert to "default" policy. | ||
104 | |||
97 | NUMA memory allocation policies have optional flags that can be used in | 105 | NUMA memory allocation policies have optional flags that can be used in |
98 | conjunction with their modes. These optional flags can be specified | 106 | conjunction with their modes. These optional flags can be specified |
99 | when tmpfs is mounted by appending them to the mode before the NodeList. | 107 | when tmpfs is mounted by appending them to the mode before the NodeList. |
100 | See Documentation/vm/numa_memory_policy.txt for a list of all available | 108 | See Documentation/vm/numa_memory_policy.txt for a list of all available |
101 | memory allocation policy mode flags. | 109 | memory allocation policy mode flags and their effect on memory policy. |
102 | 110 | ||
103 | =static is equivalent to MPOL_F_STATIC_NODES | 111 | =static is equivalent to MPOL_F_STATIC_NODES |
104 | =relative is equivalent to MPOL_F_RELATIVE_NODES | 112 | =relative is equivalent to MPOL_F_RELATIVE_NODES |
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 3de2f32edd90..b66858538df5 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt | |||
@@ -72,7 +72,7 @@ structure (this is the kernel-side implementation of file | |||
72 | descriptors). The freshly allocated file structure is initialized with | 72 | descriptors). The freshly allocated file structure is initialized with |
73 | a pointer to the dentry and a set of file operation member functions. | 73 | a pointer to the dentry and a set of file operation member functions. |
74 | These are taken from the inode data. The open() file method is then | 74 | These are taken from the inode data. The open() file method is then |
75 | called so the specific filesystem implementation can do it's work. You | 75 | called so the specific filesystem implementation can do its work. You |
76 | can see that this is another switch performed by the VFS. The file | 76 | can see that this is another switch performed by the VFS. The file |
77 | structure is placed into the file descriptor table for the process. | 77 | structure is placed into the file descriptor table for the process. |
78 | 78 | ||
diff --git a/Documentation/filesystems/xfs-delayed-logging-design.txt b/Documentation/filesystems/xfs-delayed-logging-design.txt new file mode 100644 index 000000000000..d8119e9d2d60 --- /dev/null +++ b/Documentation/filesystems/xfs-delayed-logging-design.txt | |||
@@ -0,0 +1,816 @@ | |||
1 | XFS Delayed Logging Design | ||
2 | -------------------------- | ||
3 | |||
4 | Introduction to Re-logging in XFS | ||
5 | --------------------------------- | ||
6 | |||
7 | XFS logging is a combination of logical and physical logging. Some objects, | ||
8 | such as inodes and dquots, are logged in logical format where the details | ||
9 | logged are made up of the changes to in-core structures rather than on-disk | ||
10 | structures. Other objects - typically buffers - have their physical changes | ||
11 | logged. The reason for these differences is to reduce the amount of log space | ||
12 | required for objects that are frequently logged. Some parts of inodes are more | ||
13 | frequently logged than others, and inodes are typically more frequently logged | ||
14 | than any other object (except maybe the superblock buffer) so keeping the | ||
15 | amount of metadata logged low is of prime importance. | ||
16 | |||
17 | The reason that this is such a concern is that XFS allows multiple separate | ||
18 | modifications to a single object to be carried in the log at any given time. | ||
19 | This allows the log to avoid needing to flush each change to disk before | ||
20 | recording a new change to the object. XFS does this via a method called | ||
21 | "re-logging". Conceptually, this is quite simple - all it requires is that any | ||
22 | new change to the object is recorded with a *new copy* of all the existing | ||
23 | changes in the new transaction that is written to the log. | ||
24 | |||
25 | That is, if we have a sequence of changes A through to F, and the object was | ||
26 | written to disk after change D, we would see in the log the following series | ||
27 | of transactions, their contents and the log sequence number (LSN) of the | ||
28 | transaction: | ||
29 | |||
30 | Transaction Contents LSN | ||
31 | A A X | ||
32 | B A+B X+n | ||
33 | C A+B+C X+n+m | ||
34 | D A+B+C+D X+n+m+o | ||
35 | <object written to disk> | ||
36 | E E Y (> X+n+m+o) | ||
37 | F E+F Yٍ+p | ||
38 | |||
39 | In other words, each time an object is relogged, the new transaction contains | ||
40 | the aggregation of all the previous changes currently held only in the log. | ||
41 | |||
42 | This relogging technique also allows objects to be moved forward in the log so | ||
43 | that an object being relogged does not prevent the tail of the log from ever | ||
44 | moving forward. This can be seen in the table above by the changing | ||
45 | (increasing) LSN of each subsquent transaction - the LSN is effectively a | ||
46 | direct encoding of the location in the log of the transaction. | ||
47 | |||
48 | This relogging is also used to implement long-running, multiple-commit | ||
49 | transactions. These transaction are known as rolling transactions, and require | ||
50 | a special log reservation known as a permanent transaction reservation. A | ||
51 | typical example of a rolling transaction is the removal of extents from an | ||
52 | inode which can only be done at a rate of two extents per transaction because | ||
53 | of reservation size limitations. Hence a rolling extent removal transaction | ||
54 | keeps relogging the inode and btree buffers as they get modified in each | ||
55 | removal operation. This keeps them moving forward in the log as the operation | ||
56 | progresses, ensuring that current operation never gets blocked by itself if the | ||
57 | log wraps around. | ||
58 | |||
59 | Hence it can be seen that the relogging operation is fundamental to the correct | ||
60 | working of the XFS journalling subsystem. From the above description, most | ||
61 | people should be able to see why the XFS metadata operations writes so much to | ||
62 | the log - repeated operations to the same objects write the same changes to | ||
63 | the log over and over again. Worse is the fact that objects tend to get | ||
64 | dirtier as they get relogged, so each subsequent transaction is writing more | ||
65 | metadata into the log. | ||
66 | |||
67 | Another feature of the XFS transaction subsystem is that most transactions are | ||
68 | asynchronous. That is, they don't commit to disk until either a log buffer is | ||
69 | filled (a log buffer can hold multiple transactions) or a synchronous operation | ||
70 | forces the log buffers holding the transactions to disk. This means that XFS is | ||
71 | doing aggregation of transactions in memory - batching them, if you like - to | ||
72 | minimise the impact of the log IO on transaction throughput. | ||
73 | |||
74 | The limitation on asynchronous transaction throughput is the number and size of | ||
75 | log buffers made available by the log manager. By default there are 8 log | ||
76 | buffers available and the size of each is 32kB - the size can be increased up | ||
77 | to 256kB by use of a mount option. | ||
78 | |||
79 | Effectively, this gives us the maximum bound of outstanding metadata changes | ||
80 | that can be made to the filesystem at any point in time - if all the log | ||
81 | buffers are full and under IO, then no more transactions can be committed until | ||
82 | the current batch completes. It is now common for a single current CPU core to | ||
83 | be to able to issue enough transactions to keep the log buffers full and under | ||
84 | IO permanently. Hence the XFS journalling subsystem can be considered to be IO | ||
85 | bound. | ||
86 | |||
87 | Delayed Logging: Concepts | ||
88 | ------------------------- | ||
89 | |||
90 | The key thing to note about the asynchronous logging combined with the | ||
91 | relogging technique XFS uses is that we can be relogging changed objects | ||
92 | multiple times before they are committed to disk in the log buffers. If we | ||
93 | return to the previous relogging example, it is entirely possible that | ||
94 | transactions A through D are committed to disk in the same log buffer. | ||
95 | |||
96 | That is, a single log buffer may contain multiple copies of the same object, | ||
97 | but only one of those copies needs to be there - the last one "D", as it | ||
98 | contains all the changes from the previous changes. In other words, we have one | ||
99 | necessary copy in the log buffer, and three stale copies that are simply | ||
100 | wasting space. When we are doing repeated operations on the same set of | ||
101 | objects, these "stale objects" can be over 90% of the space used in the log | ||
102 | buffers. It is clear that reducing the number of stale objects written to the | ||
103 | log would greatly reduce the amount of metadata we write to the log, and this | ||
104 | is the fundamental goal of delayed logging. | ||
105 | |||
106 | From a conceptual point of view, XFS is already doing relogging in memory (where | ||
107 | memory == log buffer), only it is doing it extremely inefficiently. It is using | ||
108 | logical to physical formatting to do the relogging because there is no | ||
109 | infrastructure to keep track of logical changes in memory prior to physically | ||
110 | formatting the changes in a transaction to the log buffer. Hence we cannot avoid | ||
111 | accumulating stale objects in the log buffers. | ||
112 | |||
113 | Delayed logging is the name we've given to keeping and tracking transactional | ||
114 | changes to objects in memory outside the log buffer infrastructure. Because of | ||
115 | the relogging concept fundamental to the XFS journalling subsystem, this is | ||
116 | actually relatively easy to do - all the changes to logged items are already | ||
117 | tracked in the current infrastructure. The big problem is how to accumulate | ||
118 | them and get them to the log in a consistent, recoverable manner. | ||
119 | Describing the problems and how they have been solved is the focus of this | ||
120 | document. | ||
121 | |||
122 | One of the key changes that delayed logging makes to the operation of the | ||
123 | journalling subsystem is that it disassociates the amount of outstanding | ||
124 | metadata changes from the size and number of log buffers available. In other | ||
125 | words, instead of there only being a maximum of 2MB of transaction changes not | ||
126 | written to the log at any point in time, there may be a much greater amount | ||
127 | being accumulated in memory. Hence the potential for loss of metadata on a | ||
128 | crash is much greater than for the existing logging mechanism. | ||
129 | |||
130 | It should be noted that this does not change the guarantee that log recovery | ||
131 | will result in a consistent filesystem. What it does mean is that as far as the | ||
132 | recovered filesystem is concerned, there may be many thousands of transactions | ||
133 | that simply did not occur as a result of the crash. This makes it even more | ||
134 | important that applications that care about their data use fsync() where they | ||
135 | need to ensure application level data integrity is maintained. | ||
136 | |||
137 | It should be noted that delayed logging is not an innovative new concept that | ||
138 | warrants rigorous proofs to determine whether it is correct or not. The method | ||
139 | of accumulating changes in memory for some period before writing them to the | ||
140 | log is used effectively in many filesystems including ext3 and ext4. Hence | ||
141 | no time is spent in this document trying to convince the reader that the | ||
142 | concept is sound. Instead it is simply considered a "solved problem" and as | ||
143 | such implementing it in XFS is purely an exercise in software engineering. | ||
144 | |||
145 | The fundamental requirements for delayed logging in XFS are simple: | ||
146 | |||
147 | 1. Reduce the amount of metadata written to the log by at least | ||
148 | an order of magnitude. | ||
149 | 2. Supply sufficient statistics to validate Requirement #1. | ||
150 | 3. Supply sufficient new tracing infrastructure to be able to debug | ||
151 | problems with the new code. | ||
152 | 4. No on-disk format change (metadata or log format). | ||
153 | 5. Enable and disable with a mount option. | ||
154 | 6. No performance regressions for synchronous transaction workloads. | ||
155 | |||
156 | Delayed Logging: Design | ||
157 | ----------------------- | ||
158 | |||
159 | Storing Changes | ||
160 | |||
161 | The problem with accumulating changes at a logical level (i.e. just using the | ||
162 | existing log item dirty region tracking) is that when it comes to writing the | ||
163 | changes to the log buffers, we need to ensure that the object we are formatting | ||
164 | is not changing while we do this. This requires locking the object to prevent | ||
165 | concurrent modification. Hence flushing the logical changes to the log would | ||
166 | require us to lock every object, format them, and then unlock them again. | ||
167 | |||
168 | This introduces lots of scope for deadlocks with transactions that are already | ||
169 | running. For example, a transaction has object A locked and modified, but needs | ||
170 | the delayed logging tracking lock to commit the transaction. However, the | ||
171 | flushing thread has the delayed logging tracking lock already held, and is | ||
172 | trying to get the lock on object A to flush it to the log buffer. This appears | ||
173 | to be an unsolvable deadlock condition, and it was solving this problem that | ||
174 | was the barrier to implementing delayed logging for so long. | ||
175 | |||
176 | The solution is relatively simple - it just took a long time to recognise it. | ||
177 | Put simply, the current logging code formats the changes to each item into an | ||
178 | vector array that points to the changed regions in the item. The log write code | ||
179 | simply copies the memory these vectors point to into the log buffer during | ||
180 | transaction commit while the item is locked in the transaction. Instead of | ||
181 | using the log buffer as the destination of the formatting code, we can use an | ||
182 | allocated memory buffer big enough to fit the formatted vector. | ||
183 | |||
184 | If we then copy the vector into the memory buffer and rewrite the vector to | ||
185 | point to the memory buffer rather than the object itself, we now have a copy of | ||
186 | the changes in a format that is compatible with the log buffer writing code. | ||
187 | that does not require us to lock the item to access. This formatting and | ||
188 | rewriting can all be done while the object is locked during transaction commit, | ||
189 | resulting in a vector that is transactionally consistent and can be accessed | ||
190 | without needing to lock the owning item. | ||
191 | |||
192 | Hence we avoid the need to lock items when we need to flush outstanding | ||
193 | asynchronous transactions to the log. The differences between the existing | ||
194 | formatting method and the delayed logging formatting can be seen in the | ||
195 | diagram below. | ||
196 | |||
197 | Current format log vector: | ||
198 | |||
199 | Object +---------------------------------------------+ | ||
200 | Vector 1 +----+ | ||
201 | Vector 2 +----+ | ||
202 | Vector 3 +----------+ | ||
203 | |||
204 | After formatting: | ||
205 | |||
206 | Log Buffer +-V1-+-V2-+----V3----+ | ||
207 | |||
208 | Delayed logging vector: | ||
209 | |||
210 | Object +---------------------------------------------+ | ||
211 | Vector 1 +----+ | ||
212 | Vector 2 +----+ | ||
213 | Vector 3 +----------+ | ||
214 | |||
215 | After formatting: | ||
216 | |||
217 | Memory Buffer +-V1-+-V2-+----V3----+ | ||
218 | Vector 1 +----+ | ||
219 | Vector 2 +----+ | ||
220 | Vector 3 +----------+ | ||
221 | |||
222 | The memory buffer and associated vector need to be passed as a single object, | ||
223 | but still need to be associated with the parent object so if the object is | ||
224 | relogged we can replace the current memory buffer with a new memory buffer that | ||
225 | contains the latest changes. | ||
226 | |||
227 | The reason for keeping the vector around after we've formatted the memory | ||
228 | buffer is to support splitting vectors across log buffer boundaries correctly. | ||
229 | If we don't keep the vector around, we do not know where the region boundaries | ||
230 | are in the item, so we'd need a new encapsulation method for regions in the log | ||
231 | buffer writing (i.e. double encapsulation). This would be an on-disk format | ||
232 | change and as such is not desirable. It also means we'd have to write the log | ||
233 | region headers in the formatting stage, which is problematic as there is per | ||
234 | region state that needs to be placed into the headers during the log write. | ||
235 | |||
236 | Hence we need to keep the vector, but by attaching the memory buffer to it and | ||
237 | rewriting the vector addresses to point at the memory buffer we end up with a | ||
238 | self-describing object that can be passed to the log buffer write code to be | ||
239 | handled in exactly the same manner as the existing log vectors are handled. | ||
240 | Hence we avoid needing a new on-disk format to handle items that have been | ||
241 | relogged in memory. | ||
242 | |||
243 | |||
244 | Tracking Changes | ||
245 | |||
246 | Now that we can record transactional changes in memory in a form that allows | ||
247 | them to be used without limitations, we need to be able to track and accumulate | ||
248 | them so that they can be written to the log at some later point in time. The | ||
249 | log item is the natural place to store this vector and buffer, and also makes sense | ||
250 | to be the object that is used to track committed objects as it will always | ||
251 | exist once the object has been included in a transaction. | ||
252 | |||
253 | The log item is already used to track the log items that have been written to | ||
254 | the log but not yet written to disk. Such log items are considered "active" | ||
255 | and as such are stored in the Active Item List (AIL) which is a LSN-ordered | ||
256 | double linked list. Items are inserted into this list during log buffer IO | ||
257 | completion, after which they are unpinned and can be written to disk. An object | ||
258 | that is in the AIL can be relogged, which causes the object to be pinned again | ||
259 | and then moved forward in the AIL when the log buffer IO completes for that | ||
260 | transaction. | ||
261 | |||
262 | Essentially, this shows that an item that is in the AIL can still be modified | ||
263 | and relogged, so any tracking must be separate to the AIL infrastructure. As | ||
264 | such, we cannot reuse the AIL list pointers for tracking committed items, nor | ||
265 | can we store state in any field that is protected by the AIL lock. Hence the | ||
266 | committed item tracking needs it's own locks, lists and state fields in the log | ||
267 | item. | ||
268 | |||
269 | Similar to the AIL, tracking of committed items is done through a new list | ||
270 | called the Committed Item List (CIL). The list tracks log items that have been | ||
271 | committed and have formatted memory buffers attached to them. It tracks objects | ||
272 | in transaction commit order, so when an object is relogged it is removed from | ||
273 | it's place in the list and re-inserted at the tail. This is entirely arbitrary | ||
274 | and done to make it easy for debugging - the last items in the list are the | ||
275 | ones that are most recently modified. Ordering of the CIL is not necessary for | ||
276 | transactional integrity (as discussed in the next section) so the ordering is | ||
277 | done for convenience/sanity of the developers. | ||
278 | |||
279 | |||
280 | Delayed Logging: Checkpoints | ||
281 | |||
282 | When we have a log synchronisation event, commonly known as a "log force", | ||
283 | all the items in the CIL must be written into the log via the log buffers. | ||
284 | We need to write these items in the order that they exist in the CIL, and they | ||
285 | need to be written as an atomic transaction. The need for all the objects to be | ||
286 | written as an atomic transaction comes from the requirements of relogging and | ||
287 | log replay - all the changes in all the objects in a given transaction must | ||
288 | either be completely replayed during log recovery, or not replayed at all. If | ||
289 | a transaction is not replayed because it is not complete in the log, then | ||
290 | no later transactions should be replayed, either. | ||
291 | |||
292 | To fulfill this requirement, we need to write the entire CIL in a single log | ||
293 | transaction. Fortunately, the XFS log code has no fixed limit on the size of a | ||
294 | transaction, nor does the log replay code. The only fundamental limit is that | ||
295 | the transaction cannot be larger than just under half the size of the log. The | ||
296 | reason for this limit is that to find the head and tail of the log, there must | ||
297 | be at least one complete transaction in the log at any given time. If a | ||
298 | transaction is larger than half the log, then there is the possibility that a | ||
299 | crash during the write of a such a transaction could partially overwrite the | ||
300 | only complete previous transaction in the log. This will result in a recovery | ||
301 | failure and an inconsistent filesystem and hence we must enforce the maximum | ||
302 | size of a checkpoint to be slightly less than a half the log. | ||
303 | |||
304 | Apart from this size requirement, a checkpoint transaction looks no different | ||
305 | to any other transaction - it contains a transaction header, a series of | ||
306 | formatted log items and a commit record at the tail. From a recovery | ||
307 | perspective, the checkpoint transaction is also no different - just a lot | ||
308 | bigger with a lot more items in it. The worst case effect of this is that we | ||
309 | might need to tune the recovery transaction object hash size. | ||
310 | |||
311 | Because the checkpoint is just another transaction and all the changes to log | ||
312 | items are stored as log vectors, we can use the existing log buffer writing | ||
313 | code to write the changes into the log. To do this efficiently, we need to | ||
314 | minimise the time we hold the CIL locked while writing the checkpoint | ||
315 | transaction. The current log write code enables us to do this easily with the | ||
316 | way it separates the writing of the transaction contents (the log vectors) from | ||
317 | the transaction commit record, but tracking this requires us to have a | ||
318 | per-checkpoint context that travels through the log write process through to | ||
319 | checkpoint completion. | ||
320 | |||
321 | Hence a checkpoint has a context that tracks the state of the current | ||
322 | checkpoint from initiation to checkpoint completion. A new context is initiated | ||
323 | at the same time a checkpoint transaction is started. That is, when we remove | ||
324 | all the current items from the CIL during a checkpoint operation, we move all | ||
325 | those changes into the current checkpoint context. We then initialise a new | ||
326 | context and attach that to the CIL for aggregation of new transactions. | ||
327 | |||
328 | This allows us to unlock the CIL immediately after transfer of all the | ||
329 | committed items and effectively allow new transactions to be issued while we | ||
330 | are formatting the checkpoint into the log. It also allows concurrent | ||
331 | checkpoints to be written into the log buffers in the case of log force heavy | ||
332 | workloads, just like the existing transaction commit code does. This, however, | ||
333 | requires that we strictly order the commit records in the log so that | ||
334 | checkpoint sequence order is maintained during log replay. | ||
335 | |||
336 | To ensure that we can be writing an item into a checkpoint transaction at | ||
337 | the same time another transaction modifies the item and inserts the log item | ||
338 | into the new CIL, then checkpoint transaction commit code cannot use log items | ||
339 | to store the list of log vectors that need to be written into the transaction. | ||
340 | Hence log vectors need to be able to be chained together to allow them to be | ||
341 | detatched from the log items. That is, when the CIL is flushed the memory | ||
342 | buffer and log vector attached to each log item needs to be attached to the | ||
343 | checkpoint context so that the log item can be released. In diagrammatic form, | ||
344 | the CIL would look like this before the flush: | ||
345 | |||
346 | CIL Head | ||
347 | | | ||
348 | V | ||
349 | Log Item <-> log vector 1 -> memory buffer | ||
350 | | -> vector array | ||
351 | V | ||
352 | Log Item <-> log vector 2 -> memory buffer | ||
353 | | -> vector array | ||
354 | V | ||
355 | ...... | ||
356 | | | ||
357 | V | ||
358 | Log Item <-> log vector N-1 -> memory buffer | ||
359 | | -> vector array | ||
360 | V | ||
361 | Log Item <-> log vector N -> memory buffer | ||
362 | -> vector array | ||
363 | |||
364 | And after the flush the CIL head is empty, and the checkpoint context log | ||
365 | vector list would look like: | ||
366 | |||
367 | Checkpoint Context | ||
368 | | | ||
369 | V | ||
370 | log vector 1 -> memory buffer | ||
371 | | -> vector array | ||
372 | | -> Log Item | ||
373 | V | ||
374 | log vector 2 -> memory buffer | ||
375 | | -> vector array | ||
376 | | -> Log Item | ||
377 | V | ||
378 | ...... | ||
379 | | | ||
380 | V | ||
381 | log vector N-1 -> memory buffer | ||
382 | | -> vector array | ||
383 | | -> Log Item | ||
384 | V | ||
385 | log vector N -> memory buffer | ||
386 | -> vector array | ||
387 | -> Log Item | ||
388 | |||
389 | Once this transfer is done, the CIL can be unlocked and new transactions can | ||
390 | start, while the checkpoint flush code works over the log vector chain to | ||
391 | commit the checkpoint. | ||
392 | |||
393 | Once the checkpoint is written into the log buffers, the checkpoint context is | ||
394 | attached to the log buffer that the commit record was written to along with a | ||
395 | completion callback. Log IO completion will call that callback, which can then | ||
396 | run transaction committed processing for the log items (i.e. insert into AIL | ||
397 | and unpin) in the log vector chain and then free the log vector chain and | ||
398 | checkpoint context. | ||
399 | |||
400 | Discussion Point: I am uncertain as to whether the log item is the most | ||
401 | efficient way to track vectors, even though it seems like the natural way to do | ||
402 | it. The fact that we walk the log items (in the CIL) just to chain the log | ||
403 | vectors and break the link between the log item and the log vector means that | ||
404 | we take a cache line hit for the log item list modification, then another for | ||
405 | the log vector chaining. If we track by the log vectors, then we only need to | ||
406 | break the link between the log item and the log vector, which means we should | ||
407 | dirty only the log item cachelines. Normally I wouldn't be concerned about one | ||
408 | vs two dirty cachelines except for the fact I've seen upwards of 80,000 log | ||
409 | vectors in one checkpoint transaction. I'd guess this is a "measure and | ||
410 | compare" situation that can be done after a working and reviewed implementation | ||
411 | is in the dev tree.... | ||
412 | |||
413 | Delayed Logging: Checkpoint Sequencing | ||
414 | |||
415 | One of the key aspects of the XFS transaction subsystem is that it tags | ||
416 | committed transactions with the log sequence number of the transaction commit. | ||
417 | This allows transactions to be issued asynchronously even though there may be | ||
418 | future operations that cannot be completed until that transaction is fully | ||
419 | committed to the log. In the rare case that a dependent operation occurs (e.g. | ||
420 | re-using a freed metadata extent for a data extent), a special, optimised log | ||
421 | force can be issued to force the dependent transaction to disk immediately. | ||
422 | |||
423 | To do this, transactions need to record the LSN of the commit record of the | ||
424 | transaction. This LSN comes directly from the log buffer the transaction is | ||
425 | written into. While this works just fine for the existing transaction | ||
426 | mechanism, it does not work for delayed logging because transactions are not | ||
427 | written directly into the log buffers. Hence some other method of sequencing | ||
428 | transactions is required. | ||
429 | |||
430 | As discussed in the checkpoint section, delayed logging uses per-checkpoint | ||
431 | contexts, and as such it is simple to assign a sequence number to each | ||
432 | checkpoint. Because the switching of checkpoint contexts must be done | ||
433 | atomically, it is simple to ensure that each new context has a monotonically | ||
434 | increasing sequence number assigned to it without the need for an external | ||
435 | atomic counter - we can just take the current context sequence number and add | ||
436 | one to it for the new context. | ||
437 | |||
438 | Then, instead of assigning a log buffer LSN to the transaction commit LSN | ||
439 | during the commit, we can assign the current checkpoint sequence. This allows | ||
440 | operations that track transactions that have not yet completed know what | ||
441 | checkpoint sequence needs to be committed before they can continue. As a | ||
442 | result, the code that forces the log to a specific LSN now needs to ensure that | ||
443 | the log forces to a specific checkpoint. | ||
444 | |||
445 | To ensure that we can do this, we need to track all the checkpoint contexts | ||
446 | that are currently committing to the log. When we flush a checkpoint, the | ||
447 | context gets added to a "committing" list which can be searched. When a | ||
448 | checkpoint commit completes, it is removed from the committing list. Because | ||
449 | the checkpoint context records the LSN of the commit record for the checkpoint, | ||
450 | we can also wait on the log buffer that contains the commit record, thereby | ||
451 | using the existing log force mechanisms to execute synchronous forces. | ||
452 | |||
453 | It should be noted that the synchronous forces may need to be extended with | ||
454 | mitigation algorithms similar to the current log buffer code to allow | ||
455 | aggregation of multiple synchronous transactions if there are already | ||
456 | synchronous transactions being flushed. Investigation of the performance of the | ||
457 | current design is needed before making any decisions here. | ||
458 | |||
459 | The main concern with log forces is to ensure that all the previous checkpoints | ||
460 | are also committed to disk before the one we need to wait for. Therefore we | ||
461 | need to check that all the prior contexts in the committing list are also | ||
462 | complete before waiting on the one we need to complete. We do this | ||
463 | synchronisation in the log force code so that we don't need to wait anywhere | ||
464 | else for such serialisation - it only matters when we do a log force. | ||
465 | |||
466 | The only remaining complexity is that a log force now also has to handle the | ||
467 | case where the forcing sequence number is the same as the current context. That | ||
468 | is, we need to flush the CIL and potentially wait for it to complete. This is a | ||
469 | simple addition to the existing log forcing code to check the sequence numbers | ||
470 | and push if required. Indeed, placing the current sequence checkpoint flush in | ||
471 | the log force code enables the current mechanism for issuing synchronous | ||
472 | transactions to remain untouched (i.e. commit an asynchronous transaction, then | ||
473 | force the log at the LSN of that transaction) and so the higher level code | ||
474 | behaves the same regardless of whether delayed logging is being used or not. | ||
475 | |||
476 | Delayed Logging: Checkpoint Log Space Accounting | ||
477 | |||
478 | The big issue for a checkpoint transaction is the log space reservation for the | ||
479 | transaction. We don't know how big a checkpoint transaction is going to be | ||
480 | ahead of time, nor how many log buffers it will take to write out, nor the | ||
481 | number of split log vector regions are going to be used. We can track the | ||
482 | amount of log space required as we add items to the commit item list, but we | ||
483 | still need to reserve the space in the log for the checkpoint. | ||
484 | |||
485 | A typical transaction reserves enough space in the log for the worst case space | ||
486 | usage of the transaction. The reservation accounts for log record headers, | ||
487 | transaction and region headers, headers for split regions, buffer tail padding, | ||
488 | etc. as well as the actual space for all the changed metadata in the | ||
489 | transaction. While some of this is fixed overhead, much of it is dependent on | ||
490 | the size of the transaction and the number of regions being logged (the number | ||
491 | of log vectors in the transaction). | ||
492 | |||
493 | An example of the differences would be logging directory changes versus logging | ||
494 | inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then | ||
495 | there are lots of transactions that only contain an inode core and an inode log | ||
496 | format structure. That is, two vectors totaling roughly 150 bytes. If we modify | ||
497 | 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each | ||
498 | vector is 12 bytes, so the total to be logged is approximately 1.75MB. In | ||
499 | comparison, if we are logging full directory buffers, they are typically 4KB | ||
500 | each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a | ||
501 | buffer format structure for each buffer - roughly 800 vectors or 1.51MB total | ||
502 | space. From this, it should be obvious that a static log space reservation is | ||
503 | not particularly flexible and is difficult to select the "optimal value" for | ||
504 | all workloads. | ||
505 | |||
506 | Further, if we are going to use a static reservation, which bit of the entire | ||
507 | reservation does it cover? We account for space used by the transaction | ||
508 | reservation by tracking the space currently used by the object in the CIL and | ||
509 | then calculating the increase or decrease in space used as the object is | ||
510 | relogged. This allows for a checkpoint reservation to only have to account for | ||
511 | log buffer metadata used such as log header records. | ||
512 | |||
513 | However, even using a static reservation for just the log metadata is | ||
514 | problematic. Typically log record headers use at least 16KB of log space per | ||
515 | 1MB of log space consumed (512 bytes per 32k) and the reservation needs to be | ||
516 | large enough to handle arbitrary sized checkpoint transactions. This | ||
517 | reservation needs to be made before the checkpoint is started, and we need to | ||
518 | be able to reserve the space without sleeping. For a 8MB checkpoint, we need a | ||
519 | reservation of around 150KB, which is a non-trivial amount of space. | ||
520 | |||
521 | A static reservation needs to manipulate the log grant counters - we can take a | ||
522 | permanent reservation on the space, but we still need to make sure we refresh | ||
523 | the write reservation (the actual space available to the transaction) after | ||
524 | every checkpoint transaction completion. Unfortunately, if this space is not | ||
525 | available when required, then the regrant code will sleep waiting for it. | ||
526 | |||
527 | The problem with this is that it can lead to deadlocks as we may need to commit | ||
528 | checkpoints to be able to free up log space (refer back to the description of | ||
529 | rolling transactions for an example of this). Hence we *must* always have | ||
530 | space available in the log if we are to use static reservations, and that is | ||
531 | very difficult and complex to arrange. It is possible to do, but there is a | ||
532 | simpler way. | ||
533 | |||
534 | The simpler way of doing this is tracking the entire log space used by the | ||
535 | items in the CIL and using this to dynamically calculate the amount of log | ||
536 | space required by the log metadata. If this log metadata space changes as a | ||
537 | result of a transaction commit inserting a new memory buffer into the CIL, then | ||
538 | the difference in space required is removed from the transaction that causes | ||
539 | the change. Transactions at this level will *always* have enough space | ||
540 | available in their reservation for this as they have already reserved the | ||
541 | maximal amount of log metadata space they require, and such a delta reservation | ||
542 | will always be less than or equal to the maximal amount in the reservation. | ||
543 | |||
544 | Hence we can grow the checkpoint transaction reservation dynamically as items | ||
545 | are added to the CIL and avoid the need for reserving and regranting log space | ||
546 | up front. This avoids deadlocks and removes a blocking point from the | ||
547 | checkpoint flush code. | ||
548 | |||
549 | As mentioned early, transactions can't grow to more than half the size of the | ||
550 | log. Hence as part of the reservation growing, we need to also check the size | ||
551 | of the reservation against the maximum allowed transaction size. If we reach | ||
552 | the maximum threshold, we need to push the CIL to the log. This is effectively | ||
553 | a "background flush" and is done on demand. This is identical to | ||
554 | a CIL push triggered by a log force, only that there is no waiting for the | ||
555 | checkpoint commit to complete. This background push is checked and executed by | ||
556 | transaction commit code. | ||
557 | |||
558 | If the transaction subsystem goes idle while we still have items in the CIL, | ||
559 | they will be flushed by the periodic log force issued by the xfssyncd. This log | ||
560 | force will push the CIL to disk, and if the transaction subsystem stays idle, | ||
561 | allow the idle log to be covered (effectively marked clean) in exactly the same | ||
562 | manner that is done for the existing logging method. A discussion point is | ||
563 | whether this log force needs to be done more frequently than the current rate | ||
564 | which is once every 30s. | ||
565 | |||
566 | |||
567 | Delayed Logging: Log Item Pinning | ||
568 | |||
569 | Currently log items are pinned during transaction commit while the items are | ||
570 | still locked. This happens just after the items are formatted, though it could | ||
571 | be done any time before the items are unlocked. The result of this mechanism is | ||
572 | that items get pinned once for every transaction that is committed to the log | ||
573 | buffers. Hence items that are relogged in the log buffers will have a pin count | ||
574 | for every outstanding transaction they were dirtied in. When each of these | ||
575 | transactions is completed, they will unpin the item once. As a result, the item | ||
576 | only becomes unpinned when all the transactions complete and there are no | ||
577 | pending transactions. Thus the pinning and unpinning of a log item is symmetric | ||
578 | as there is a 1:1 relationship with transaction commit and log item completion. | ||
579 | |||
580 | For delayed logging, however, we have an assymetric transaction commit to | ||
581 | completion relationship. Every time an object is relogged in the CIL it goes | ||
582 | through the commit process without a corresponding completion being registered. | ||
583 | That is, we now have a many-to-one relationship between transaction commit and | ||
584 | log item completion. The result of this is that pinning and unpinning of the | ||
585 | log items becomes unbalanced if we retain the "pin on transaction commit, unpin | ||
586 | on transaction completion" model. | ||
587 | |||
588 | To keep pin/unpin symmetry, the algorithm needs to change to a "pin on | ||
589 | insertion into the CIL, unpin on checkpoint completion". In other words, the | ||
590 | pinning and unpinning becomes symmetric around a checkpoint context. We have to | ||
591 | pin the object the first time it is inserted into the CIL - if it is already in | ||
592 | the CIL during a transaction commit, then we do not pin it again. Because there | ||
593 | can be multiple outstanding checkpoint contexts, we can still see elevated pin | ||
594 | counts, but as each checkpoint completes the pin count will retain the correct | ||
595 | value according to it's context. | ||
596 | |||
597 | Just to make matters more slightly more complex, this checkpoint level context | ||
598 | for the pin count means that the pinning of an item must take place under the | ||
599 | CIL commit/flush lock. If we pin the object outside this lock, we cannot | ||
600 | guarantee which context the pin count is associated with. This is because of | ||
601 | the fact pinning the item is dependent on whether the item is present in the | ||
602 | current CIL or not. If we don't pin the CIL first before we check and pin the | ||
603 | object, we have a race with CIL being flushed between the check and the pin | ||
604 | (or not pinning, as the case may be). Hence we must hold the CIL flush/commit | ||
605 | lock to guarantee that we pin the items correctly. | ||
606 | |||
607 | Delayed Logging: Concurrent Scalability | ||
608 | |||
609 | A fundamental requirement for the CIL is that accesses through transaction | ||
610 | commits must scale to many concurrent commits. The current transaction commit | ||
611 | code does not break down even when there are transactions coming from 2048 | ||
612 | processors at once. The current transaction code does not go any faster than if | ||
613 | there was only one CPU using it, but it does not slow down either. | ||
614 | |||
615 | As a result, the delayed logging transaction commit code needs to be designed | ||
616 | for concurrency from the ground up. It is obvious that there are serialisation | ||
617 | points in the design - the three important ones are: | ||
618 | |||
619 | 1. Locking out new transaction commits while flushing the CIL | ||
620 | 2. Adding items to the CIL and updating item space accounting | ||
621 | 3. Checkpoint commit ordering | ||
622 | |||
623 | Looking at the transaction commit and CIL flushing interactions, it is clear | ||
624 | that we have a many-to-one interaction here. That is, the only restriction on | ||
625 | the number of concurrent transactions that can be trying to commit at once is | ||
626 | the amount of space available in the log for their reservations. The practical | ||
627 | limit here is in the order of several hundred concurrent transactions for a | ||
628 | 128MB log, which means that it is generally one per CPU in a machine. | ||
629 | |||
630 | The amount of time a transaction commit needs to hold out a flush is a | ||
631 | relatively long period of time - the pinning of log items needs to be done | ||
632 | while we are holding out a CIL flush, so at the moment that means it is held | ||
633 | across the formatting of the objects into memory buffers (i.e. while memcpy()s | ||
634 | are in progress). Ultimately a two pass algorithm where the formatting is done | ||
635 | separately to the pinning of objects could be used to reduce the hold time of | ||
636 | the transaction commit side. | ||
637 | |||
638 | Because of the number of potential transaction commit side holders, the lock | ||
639 | really needs to be a sleeping lock - if the CIL flush takes the lock, we do not | ||
640 | want every other CPU in the machine spinning on the CIL lock. Given that | ||
641 | flushing the CIL could involve walking a list of tens of thousands of log | ||
642 | items, it will get held for a significant time and so spin contention is a | ||
643 | significant concern. Preventing lots of CPUs spinning doing nothing is the | ||
644 | main reason for choosing a sleeping lock even though nothing in either the | ||
645 | transaction commit or CIL flush side sleeps with the lock held. | ||
646 | |||
647 | It should also be noted that CIL flushing is also a relatively rare operation | ||
648 | compared to transaction commit for asynchronous transaction workloads - only | ||
649 | time will tell if using a read-write semaphore for exclusion will limit | ||
650 | transaction commit concurrency due to cache line bouncing of the lock on the | ||
651 | read side. | ||
652 | |||
653 | The second serialisation point is on the transaction commit side where items | ||
654 | are inserted into the CIL. Because transactions can enter this code | ||
655 | concurrently, the CIL needs to be protected separately from the above | ||
656 | commit/flush exclusion. It also needs to be an exclusive lock but it is only | ||
657 | held for a very short time and so a spin lock is appropriate here. It is | ||
658 | possible that this lock will become a contention point, but given the short | ||
659 | hold time once per transaction I think that contention is unlikely. | ||
660 | |||
661 | The final serialisation point is the checkpoint commit record ordering code | ||
662 | that is run as part of the checkpoint commit and log force sequencing. The code | ||
663 | path that triggers a CIL flush (i.e. whatever triggers the log force) will enter | ||
664 | an ordering loop after writing all the log vectors into the log buffers but | ||
665 | before writing the commit record. This loop walks the list of committing | ||
666 | checkpoints and needs to block waiting for checkpoints to complete their commit | ||
667 | record write. As a result it needs a lock and a wait variable. Log force | ||
668 | sequencing also requires the same lock, list walk, and blocking mechanism to | ||
669 | ensure completion of checkpoints. | ||
670 | |||
671 | These two sequencing operations can use the mechanism even though the | ||
672 | events they are waiting for are different. The checkpoint commit record | ||
673 | sequencing needs to wait until checkpoint contexts contain a commit LSN | ||
674 | (obtained through completion of a commit record write) while log force | ||
675 | sequencing needs to wait until previous checkpoint contexts are removed from | ||
676 | the committing list (i.e. they've completed). A simple wait variable and | ||
677 | broadcast wakeups (thundering herds) has been used to implement these two | ||
678 | serialisation queues. They use the same lock as the CIL, too. If we see too | ||
679 | much contention on the CIL lock, or too many context switches as a result of | ||
680 | the broadcast wakeups these operations can be put under a new spinlock and | ||
681 | given separate wait lists to reduce lock contention and the number of processes | ||
682 | woken by the wrong event. | ||
683 | |||
684 | |||
685 | Lifecycle Changes | ||
686 | |||
687 | The existing log item life cycle is as follows: | ||
688 | |||
689 | 1. Transaction allocate | ||
690 | 2. Transaction reserve | ||
691 | 3. Lock item | ||
692 | 4. Join item to transaction | ||
693 | If not already attached, | ||
694 | Allocate log item | ||
695 | Attach log item to owner item | ||
696 | Attach log item to transaction | ||
697 | 5. Modify item | ||
698 | Record modifications in log item | ||
699 | 6. Transaction commit | ||
700 | Pin item in memory | ||
701 | Format item into log buffer | ||
702 | Write commit LSN into transaction | ||
703 | Unlock item | ||
704 | Attach transaction to log buffer | ||
705 | |||
706 | <log buffer IO dispatched> | ||
707 | <log buffer IO completes> | ||
708 | |||
709 | 7. Transaction completion | ||
710 | Mark log item committed | ||
711 | Insert log item into AIL | ||
712 | Write commit LSN into log item | ||
713 | Unpin log item | ||
714 | 8. AIL traversal | ||
715 | Lock item | ||
716 | Mark log item clean | ||
717 | Flush item to disk | ||
718 | |||
719 | <item IO completion> | ||
720 | |||
721 | 9. Log item removed from AIL | ||
722 | Moves log tail | ||
723 | Item unlocked | ||
724 | |||
725 | Essentially, steps 1-6 operate independently from step 7, which is also | ||
726 | independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 | ||
727 | at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur | ||
728 | at the same time. If the log item is in the AIL or between steps 6 and 7 | ||
729 | and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 | ||
730 | are entered and completed is the object considered clean. | ||
731 | |||
732 | With delayed logging, there are new steps inserted into the life cycle: | ||
733 | |||
734 | 1. Transaction allocate | ||
735 | 2. Transaction reserve | ||
736 | 3. Lock item | ||
737 | 4. Join item to transaction | ||
738 | If not already attached, | ||
739 | Allocate log item | ||
740 | Attach log item to owner item | ||
741 | Attach log item to transaction | ||
742 | 5. Modify item | ||
743 | Record modifications in log item | ||
744 | 6. Transaction commit | ||
745 | Pin item in memory if not pinned in CIL | ||
746 | Format item into log vector + buffer | ||
747 | Attach log vector and buffer to log item | ||
748 | Insert log item into CIL | ||
749 | Write CIL context sequence into transaction | ||
750 | Unlock item | ||
751 | |||
752 | <next log force> | ||
753 | |||
754 | 7. CIL push | ||
755 | lock CIL flush | ||
756 | Chain log vectors and buffers together | ||
757 | Remove items from CIL | ||
758 | unlock CIL flush | ||
759 | write log vectors into log | ||
760 | sequence commit records | ||
761 | attach checkpoint context to log buffer | ||
762 | |||
763 | <log buffer IO dispatched> | ||
764 | <log buffer IO completes> | ||
765 | |||
766 | 8. Checkpoint completion | ||
767 | Mark log item committed | ||
768 | Insert item into AIL | ||
769 | Write commit LSN into log item | ||
770 | Unpin log item | ||
771 | 9. AIL traversal | ||
772 | Lock item | ||
773 | Mark log item clean | ||
774 | Flush item to disk | ||
775 | <item IO completion> | ||
776 | 10. Log item removed from AIL | ||
777 | Moves log tail | ||
778 | Item unlocked | ||
779 | |||
780 | From this, it can be seen that the only life cycle differences between the two | ||
781 | logging methods are in the middle of the life cycle - they still have the same | ||
782 | beginning and end and execution constraints. The only differences are in the | ||
783 | commiting of the log items to the log itself and the completion processing. | ||
784 | Hence delayed logging should not introduce any constraints on log item | ||
785 | behaviour, allocation or freeing that don't already exist. | ||
786 | |||
787 | As a result of this zero-impact "insertion" of delayed logging infrastructure | ||
788 | and the design of the internal structures to avoid on disk format changes, we | ||
789 | can basically switch between delayed logging and the existing mechanism with a | ||
790 | mount option. Fundamentally, there is no reason why the log manager would not | ||
791 | be able to swap methods automatically and transparently depending on load | ||
792 | characteristics, but this should not be necessary if delayed logging works as | ||
793 | designed. | ||
794 | |||
795 | Roadmap: | ||
796 | |||
797 | 2.6.35 Inclusion in mainline as an experimental mount option | ||
798 | => approximately 2-3 months to merge window | ||
799 | => needs to be in xfs-dev tree in 4-6 weeks | ||
800 | => code is nearing readiness for review | ||
801 | |||
802 | 2.6.37 Remove experimental tag from mount option | ||
803 | => should be roughly 6 months after initial merge | ||
804 | => enough time to: | ||
805 | => gain confidence and fix problems reported by early | ||
806 | adopters (a.k.a. guinea pigs) | ||
807 | => address worst performance regressions and undesired | ||
808 | behaviours | ||
809 | => start tuning/optimising code for parallelism | ||
810 | => start tuning/optimising algorithms consuming | ||
811 | excessive CPU time | ||
812 | |||
813 | 2.6.39 Switch default mount option to use delayed logging | ||
814 | => should be roughly 12 months after initial merge | ||
815 | => enough time to shake out remaining problems before next round of | ||
816 | enterprise distro kernel rebases | ||