Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--Documentation/filesystems/00-INDEX50
-rw-r--r--Documentation/filesystems/Exporting176
-rw-r--r--Documentation/filesystems/Locking515
-rw-r--r--Documentation/filesystems/adfs.txt57
-rw-r--r--Documentation/filesystems/affs.txt219
-rw-r--r--Documentation/filesystems/afs.txt155
-rw-r--r--Documentation/filesystems/automount-support.txt118
-rw-r--r--Documentation/filesystems/befs.txt117
-rw-r--r--Documentation/filesystems/bfs.txt57
-rw-r--r--Documentation/filesystems/cifs.txt51
-rw-r--r--Documentation/filesystems/coda.txt1673
-rw-r--r--Documentation/filesystems/cramfs.txt76
-rw-r--r--Documentation/filesystems/devfs/ChangeLog1977
-rw-r--r--Documentation/filesystems/devfs/README1964
-rw-r--r--Documentation/filesystems/devfs/ToDo40
-rw-r--r--Documentation/filesystems/devfs/boot-options65
-rw-r--r--Documentation/filesystems/directory-locking113
-rw-r--r--Documentation/filesystems/ext2.txt383
-rw-r--r--Documentation/filesystems/ext3.txt183
-rw-r--r--Documentation/filesystems/hfs.txt83
-rw-r--r--Documentation/filesystems/hpfs.txt296
-rw-r--r--Documentation/filesystems/isofs.txt38
-rw-r--r--Documentation/filesystems/jfs.txt35
-rw-r--r--Documentation/filesystems/ncpfs.txt12
-rw-r--r--Documentation/filesystems/ntfs.txt630
-rw-r--r--Documentation/filesystems/porting266
-rw-r--r--Documentation/filesystems/proc.txt1940
-rw-r--r--Documentation/filesystems/romfs.txt187
-rw-r--r--Documentation/filesystems/smbfs.txt8
-rw-r--r--Documentation/filesystems/sysfs-pci.txt88
-rw-r--r--Documentation/filesystems/sysfs.txt341
-rw-r--r--Documentation/filesystems/sysv-fs.txt38
-rw-r--r--Documentation/filesystems/tmpfs.txt100
-rw-r--r--Documentation/filesystems/udf.txt57
-rw-r--r--Documentation/filesystems/ufs.txt61
-rw-r--r--Documentation/filesystems/vfat.txt231
-rw-r--r--Documentation/filesystems/vfs.txt671
-rw-r--r--Documentation/filesystems/xfs.txt188
38 files changed, 13259 insertions, 0 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
new file mode 100644
index 000000000000..bcfbab899b37
--- /dev/null
+++ b/Documentation/filesystems/00-INDEX
@@ -0,0 +1,50 @@
100-INDEX
2 - this file (info on some of the filesystems supported by Linux).
3Locking
4 - info on locking rules as they pertain to Linux VFS.
5adfs.txt
6 - info and mount options for the Acorn Advanced Disc Filing System.
7affs.txt
8 - info and mount options for the Amiga Fast File System.
9bfs.txt
10 - info for the SCO UnixWare Boot Filesystem (BFS).
11cifs.txt
12 - description of the CIFS filesystem
13coda.txt
14 - description of the CODA filesystem.
15cramfs.txt
16 - info on the cram filesystem for small storage (ROMs etc)
17devfs/
18 - directory containing devfs documentation.
19ext2.txt
20 - info, mount options and specifications for the Ext2 filesystem.
21fat_cvf.txt
22 - info on the Compressed Volume Files extension to the FAT filesystem
23hpfs.txt
24 - info and mount options for the OS/2 HPFS.
25isofs.txt
26 - info and mount options for the ISO 9660 (CDROM) filesystem.
27jfs.txt
28 - info and mount options for the JFS filesystem.
29ncpfs.txt
30 - info on Novell Netware(tm) filesystem using NCP protocol.
31ntfs.txt
32 - info and mount options for the NTFS filesystem (Windows NT).
33proc.txt
34 - info on Linux's /proc filesystem.
35romfs.txt
36 - Description of the ROMFS filesystem.
37smbfs.txt
38 - info on using filesystems with the SMB protocol (Windows 3.11 and NT)
39sysv-fs.txt
40 - info on the SystemV/V7/Xenix/Coherent filesystem.
41udf.txt
42 - info and mount options for the UDF filesystem.
43ufs.txt
44 - info on the ufs filesystem.
45vfat.txt
46 - info on using the VFAT filesystem used in Windows NT and Windows 95
47vfs.txt
48 - Overview of the Virtual File System
49xfs.txt
50 - info and mount options for the XFS filesystem.
diff --git a/Documentation/filesystems/Exporting b/Documentation/filesystems/Exporting
new file mode 100644
index 000000000000..31047e0fe14b
--- /dev/null
+++ b/Documentation/filesystems/Exporting
@@ -0,0 +1,176 @@
1
2Making Filesystems Exportable
3=============================
4
5Most filesystem operations require a dentry (or two) as a starting
6point. Local applications have a reference-counted hold on suitable
7dentries via open file descriptors or cwd/root. However, remote
8applications that access a filesystem via a remote filesystem protocol
9such as NFS may not be able to hold such a reference, and so need a
10different way to refer to a particular dentry. As the alternative
11form of reference needs to be stable across renames, truncates, and
12server-reboot (among other things, though these tend to be the most
13problematic), there is no simple answer like 'filename'.
14
15The mechanism discussed here allows each filesystem implementation to
16specify how to generate an opaque (outside of the filesystem) byte
17string for any dentry, and how to find an appropriate dentry for any
18given opaque byte string.
19This byte string will be called a "filehandle fragment" as it
20corresponds to part of an NFS filehandle.
21
22A filesystem which supports the mapping between filehandle fragments
23and dentries will be termed "exportable".
24
25
26
27Dcache Issues
28-------------
29
30The dcache normally contains a proper prefix of any given filesystem
31tree. This means that if any filesystem object is in the dcache, then
32all of the ancestors of that filesystem object are also in the dcache.
33As normal access is by filename this prefix is created naturally and
34maintained easily (by each object maintaining a reference count on
35its parent).
36
37However when objects are included into the dcache by interpreting a
38filehandle fragment, there is no automatic creation of a path prefix
39for the object. This leads to two related but distinct features of
40the dcache that are not needed for normal filesystem access.
41
421/ The dcache must sometimes contain objects that are not part of the
43 proper prefix, i.e. that are not connected to the root.
442/ The dcache must be prepared for a newly found (via ->lookup) directory
45 to already have a (non-connected) dentry, and must be able to move
46 that dentry into place (based on the parent and name in the
47 ->lookup). This is particularly needed for directories as
48 it is a dcache invariant that directories only have one dentry.
49
50To implement these features, the dcache has:
51
52a/ A dentry flag DCACHE_DISCONNECTED which is set on
53 any dentry that might not be part of the proper prefix.
54 This is set when anonymous dentries are created, and cleared when a
55 dentry is noticed to be a child of a dentry which is in the proper
56 prefix.
57
58b/ A per-superblock list "s_anon" of dentries which are the roots of
59 subtrees that are not in the proper prefix. These dentries, as
60 well as the proper prefix, need to be released at unmount time. As
61 these dentries will not be hashed, they are linked together on the
62 d_hash list_head.
63
64c/ Helper routines to allocate anonymous dentries, and to help attach
65 loose directory dentries at lookup time. They are:
66 d_alloc_anon(inode) will return a dentry for the given inode.
67 If the inode already has a dentry, one of those is returned.
68 If it doesn't, a new anonymous (IS_ROOT and
69 DCACHE_DISCONNECTED) dentry is allocated and attached.
70 In the case of a directory, care is taken that only one dentry
71 can ever be attached.
72 d_splice_alias(inode, dentry) will make sure that there is a
73 dentry with the same name and parent as the given dentry, and
74 which refers to the given inode.
75 If the inode is a directory and already has a dentry, then that
76 dentry is d_moved over the given dentry.
77 If the passed dentry gets attached, care is taken that this is
78 mutually exclusive to a d_alloc_anon operation.
79 If the passed dentry is used, NULL is returned, else the used
80 dentry is returned. This corresponds to the calling pattern of
81 ->lookup.
82
83
84Filesystem Issues
85-----------------
86
87For a filesystem to be exportable it must:
88
89 1/ provide the filehandle fragment routines described below.
90 2/ make sure that d_splice_alias is used rather than d_add
91 when ->lookup finds an inode for a given parent and name.
92 Typically the ->lookup routine will end:
93 if (inode)
94 return d_splice_alias(inode, dentry);
95 d_add(dentry, inode);
96 return NULL;
97 }
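
The ending of ->lookup shown above can be modelled outside the kernel. The following is a minimal userspace sketch, not kernel code: every `toy_` name is hypothetical, and the structures only carry the few fields needed to show when the passed dentry is used (NULL is returned) and when an already existing directory dentry is returned instead.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; all names here are hypothetical. */
struct toy_dentry;

struct toy_inode {
    int is_dir;
    struct toy_dentry *alias;   /* existing dentry for this inode, if any */
};

struct toy_dentry {
    struct toy_inode *inode;    /* NULL models a negative dentry */
    int disconnected;           /* models DCACHE_DISCONNECTED */
    int hashed;
};

/* Models d_add(): bind the inode (possibly NULL) to the dentry and hash it. */
void toy_d_add(struct toy_dentry *dentry, struct toy_inode *inode)
{
    dentry->inode = inode;
    dentry->hashed = 1;
    if (inode && !inode->alias)
        inode->alias = dentry;
}

/*
 * Models d_splice_alias(): if a directory inode already has a dentry,
 * that dentry is reused (the real helper d_moves it into place) and
 * returned; otherwise the passed dentry is attached and NULL is returned.
 */
struct toy_dentry *toy_d_splice_alias(struct toy_inode *inode,
                                      struct toy_dentry *dentry)
{
    if (inode->is_dir && inode->alias) {
        inode->alias->disconnected = 0;
        inode->alias->hashed = 1;
        return inode->alias;
    }
    toy_d_add(dentry, inode);
    return NULL;
}

/* The tail of a ->lookup routine, following the pattern in the text. */
struct toy_dentry *toy_lookup_tail(struct toy_inode *inode,
                                   struct toy_dentry *dentry)
{
    if (inode)
        return toy_d_splice_alias(inode, dentry);
    toy_d_add(dentry, NULL);    /* negative dentry: name does not exist */
    return NULL;
}
```

A caller that gets a non-NULL result must continue with the returned dentry rather than the one it passed in, which is the same convention ->lookup itself follows.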
98
99
100
101 A file system implementation declares that instances of the filesystem
102are exportable by setting the s_export_op field in the struct
103super_block. This field must point to a "struct export_operations"
104struct which could potentially be full of NULLs, though normally at
105least get_parent will be set.
106
107 The primary operations are decode_fh and encode_fh.
108decode_fh takes a filehandle fragment and tries to find or create a
109dentry for the object referred to by the filehandle.
110encode_fh takes a dentry and creates a filehandle fragment which can
111later be used to find/create a dentry for the same object.
112
113decode_fh will probably make use of "find_exported_dentry".
114This function lives in the "exportfs" module which a filesystem does
115not need unless it is being exported. So rather than calling
116find_exported_dentry directly, each filesystem should call it through
117the find_exported_dentry pointer in its export_operations table.
118This field is set correctly by the exporting agent (e.g. nfsd) when a
119filesystem is exported, and before any export operations are called.
120
121find_exported_dentry needs three support functions from the
122filesystem:
123 get_name. When given a parent dentry and a child dentry, this
124 should find a name in the directory identified by the parent
125 dentry, which leads to the object identified by the child dentry.
126 If no get_name function is supplied, a default implementation is
127 provided which uses vfs_readdir to find potential names, and
128 matches inode numbers to find the correct match.
129
130 get_parent. When given a dentry for a directory, this should return
131 a dentry for the parent. Quite possibly the parent dentry will
132 have been allocated by d_alloc_anon.
133 The default get_parent function just returns an error so any
134 filehandle lookup that requires finding a parent will fail.
135 ->lookup("..") is *not* used as a default as it can leave ".."
136 entries in the dcache which are too messy to work with.
137
138 get_dentry. When given an opaque datum, this should find the
139 implied object and create a dentry for it (possibly with
140 d_alloc_anon).
141 The opaque datum is whatever is passed down by the decode_fh
142 function, and is often simply a fragment of the filehandle
143 fragment.
144 decode_fh passes two datums through find_exported_dentry. One that
145 should be used to identify the target object, and one that can be
146 used to identify the object's parent, should that be necessary.
147 The default get_dentry function assumes that the datum contains an
148 inode number and a generation number, and it attempts to get the
149 inode using "iget" and check its validity by matching the
150 generation number. A filesystem should only depend on the default
151 if iget can safely be used this way.
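
The generation check performed by the default get_dentry can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the inode table, `toy_iget` and `toy_get_by_datum` are hypothetical names, and the model always compares generations, which may be stricter than what the kernel default does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inode slot: inode numbers get reused, generations change. */
struct toy_inode {
    uint32_t ino;
    uint32_t generation;
    int in_use;
};

/* The two values the default get_dentry is described as expecting. */
struct toy_datum {
    uint32_t ino;
    uint32_t generation;
};

/* Stand-in for iget(): find the slot currently holding this inode number. */
struct toy_inode *toy_iget(struct toy_inode *table, size_t n, uint32_t ino)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].in_use && table[i].ino == ino)
            return &table[i];
    return NULL;
}

/*
 * Returns the inode the datum refers to, or NULL if the handle is stale:
 * either the inode is gone, or its number was reused for a new object
 * and therefore carries a different generation number.
 */
struct toy_inode *toy_get_by_datum(struct toy_inode *table, size_t n,
                                   const struct toy_datum *d)
{
    struct toy_inode *inode = toy_iget(table, n, d->ino);

    if (inode == NULL)
        return NULL;
    if (inode->generation != d->generation)
        return NULL;
    return inode;
}
```

The point of the second comparison is that an inode number alone is not stable across delete-and-recreate, so a filehandle that only carried the number could silently resolve to an unrelated file.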
152
153If decode_fh and/or encode_fh are left as NULL, then default
154implementations are used. These defaults are suitable for ext2 and
155extremely similar filesystems (like ext3).
156
157The default encode_fh creates a filehandle fragment from the inode
158number and generation number of the target together with the inode
159number and generation number of the parent (if the parent is
160required).
161
162The default decode_fh extracts the target and parent datums from the
163filehandle assuming the format used by the default encode_fh and
164passes them to find_exported_dentry.
165
166
167A filehandle fragment consists of an array of 1 or more 4-byte words,
168together with a one byte "type".
169The decode_fh routine should not depend on the stated size that is
170passed to it. This size may be larger than the original filehandle
171generated by encode_fh, in which case it will have been padded with
172nuls. Rather, the encode_fh routine should choose a "type" which
173indicates to decode_fh how much of the filehandle is valid, and how
174it should be interpreted.
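
The rule that decode_fh should trust the "type" rather than the stated size can be sketched as follows. This is a userspace illustration only: the type codes, `struct toy_id` and both functions are hypothetical, and the word layout (inode number and generation of the target, then of the parent) is modelled on the description of the default encode_fh above rather than copied from any kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type codes; a real filesystem picks its own values. */
enum { TOY_FH_INVALID = 0, TOY_FH_INO_GEN = 1, TOY_FH_INO_GEN_PARENT = 2 };

struct toy_id { uint32_t ino; uint32_t generation; };

/*
 * Encode the target (and optionally the parent) into 4-byte words.
 * On entry *len is the buffer size in words; on success it is set to
 * the number of words used and the chosen type is returned.
 */
int toy_encode_fh(uint32_t *fh, size_t *len, const struct toy_id *target,
                  const struct toy_id *parent)
{
    size_t need = parent ? 4 : 2;

    if (*len < need)
        return TOY_FH_INVALID;
    fh[0] = target->ino;
    fh[1] = target->generation;
    if (parent) {
        fh[2] = parent->ino;
        fh[3] = parent->generation;
    }
    *len = need;
    return parent ? TOY_FH_INO_GEN_PARENT : TOY_FH_INO_GEN;
}

/*
 * Decode using the type, not the stated length, to decide how many words
 * are meaningful; len may be larger than what was encoded because of
 * nul padding. Returns 0 on success, -1 on an unknown type or short buffer.
 */
int toy_decode_fh(const uint32_t *fh, size_t len, int type,
                  struct toy_id *target, struct toy_id *parent,
                  int *has_parent)
{
    size_t need;

    switch (type) {
    case TOY_FH_INO_GEN:        need = 2; break;
    case TOY_FH_INO_GEN_PARENT: need = 4; break;
    default:                    return -1;
    }
    if (len < need)
        return -1;
    target->ino = fh[0];
    target->generation = fh[1];
    *has_parent = (need == 4);
    if (*has_parent) {
        parent->ino = fh[2];
        parent->generation = fh[3];
    }
    return 0;
}
```

Because the decoder derives the valid length from the type, a fragment that was padded out to a larger size on the wire decodes to the same result as the unpadded one.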
175
176
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
new file mode 100644
index 000000000000..a934baeeb33a
--- /dev/null
+++ b/Documentation/filesystems/Locking
@@ -0,0 +1,515 @@
1 The text below describes the locking rules for VFS-related methods.
2It is (believed to be) up-to-date. *Please*, if you change anything in
3prototypes or locking protocols - update this file. And update the relevant
4instances in the tree, don't leave that to maintainers of filesystems/devices/
5etc. At the very least, put the list of dubious cases at the end of this file.
6Don't turn it into a log - maintainers of out-of-the-tree code are supposed to
7be able to use diff(1).
8 Things currently missing here: socket operations. Alexey?
9
10--------------------------- dentry_operations --------------------------
11prototypes:
12 int (*d_revalidate)(struct dentry *, int);
13 int (*d_hash) (struct dentry *, struct qstr *);
14 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
15 int (*d_delete)(struct dentry *);
16 void (*d_release)(struct dentry *);
17 void (*d_iput)(struct dentry *, struct inode *);
18
19locking rules:
20 none have BKL
21               dcache_lock  rename_lock  ->d_lock  may block
22d_revalidate:  no           no           no        yes
23d_hash:        no           no           no        yes
24d_compare:     no           yes          no        no
25d_delete:      yes          no           yes       no
26d_release:     no           no           no        yes
27d_iput:        no           no           no        yes
28
29--------------------------- inode_operations ---------------------------
30prototypes:
31 int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
32 struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
34 int (*link) (struct dentry *,struct inode *,struct dentry *);
35 int (*unlink) (struct inode *,struct dentry *);
36 int (*symlink) (struct inode *,struct dentry *,const char *);
37 int (*mkdir) (struct inode *,struct dentry *,int);
38 int (*rmdir) (struct inode *,struct dentry *);
39 int (*mknod) (struct inode *,struct dentry *,int,dev_t);
40 int (*rename) (struct inode *, struct dentry *,
41 struct inode *, struct dentry *);
42 int (*readlink) (struct dentry *, char __user *,int);
43 int (*follow_link) (struct dentry *, struct nameidata *);
44 void (*truncate) (struct inode *);
45 int (*permission) (struct inode *, int, struct nameidata *);
46 int (*setattr) (struct dentry *, struct iattr *);
47 int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *);
48 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
49 ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
50 ssize_t (*listxattr) (struct dentry *, char *, size_t);
51 int (*removexattr) (struct dentry *, const char *);
52
53locking rules:
54 all may block, none have BKL
55 i_sem(inode)
56lookup: yes
57create: yes
58link: yes (both)
59mknod: yes
60symlink: yes
61mkdir: yes
62unlink: yes (both)
63rmdir: yes (both) (see below)
64rename: yes (all) (see below)
65readlink: no
66follow_link: no
67truncate: yes (see below)
68setattr: yes
69permission: no
70getattr: no
71setxattr: yes
72getxattr: no
73listxattr: no
74removexattr: yes
75 Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on
76victim.
77 cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
78 ->truncate() is never called directly - it's a callback, not a
79method. It's called by vmtruncate() - library function normally used by
80->setattr(). Locking information above applies to that call (i.e. is
81inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
82passed).
83
84See Documentation/filesystems/directory-locking for more detailed discussion
85of the locking scheme for directory operations.
86
87--------------------------- super_operations ---------------------------
88prototypes:
89 struct inode *(*alloc_inode)(struct super_block *sb);
90 void (*destroy_inode)(struct inode *);
91 void (*read_inode) (struct inode *);
92 void (*dirty_inode) (struct inode *);
93 int (*write_inode) (struct inode *, int);
94 void (*put_inode) (struct inode *);
95 void (*drop_inode) (struct inode *);
96 void (*delete_inode) (struct inode *);
97 void (*put_super) (struct super_block *);
98 void (*write_super) (struct super_block *);
99 int (*sync_fs)(struct super_block *sb, int wait);
100 void (*write_super_lockfs) (struct super_block *);
101 void (*unlockfs) (struct super_block *);
102 int (*statfs) (struct super_block *, struct kstatfs *);
103 int (*remount_fs) (struct super_block *, int *, char *);
104 void (*clear_inode) (struct inode *);
105 void (*umount_begin) (struct super_block *);
106 int (*show_options)(struct seq_file *, struct vfsmount *);
107 ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
108 ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
109
110locking rules:
111 All may block.
112                     BKL  s_lock  s_umount
113alloc_inode:         no   no      no
114destroy_inode:       no
115read_inode:          no                     (see below)
116dirty_inode:         no                     (must not sleep)
117write_inode:         no
118put_inode:           no
119drop_inode:          no                     !!!inode_lock!!!
120delete_inode:        no
121put_super:           yes  yes     no
122write_super:         no   yes     read
123sync_fs:             no   no      read
124write_super_lockfs:  ?
125unlockfs:            ?
126statfs:              no   no      no
127remount_fs:          no   yes     maybe     (see below)
128clear_inode:         no
129umount_begin:        yes  no      no
130show_options:        no                     (vfsmount->sem)
131quota_read:          no   no      no        (see below)
132quota_write:         no   no      no        (see below)
133
134->read_inode() is not a method - it's a callback used in iget().
135->remount_fs() will have the s_umount lock if it's already mounted.
136When called from get_sb_single, it does NOT have the s_umount lock.
137->quota_read() and ->quota_write() functions are both guaranteed to
138be the only ones operating on the quota file by the quota code (via
139dqio_sem) (unless an admin really wants to screw up something and
140writes to quota files with quotas on). For other details about locking
141see also dquot_operations section.
142
143--------------------------- file_system_type ---------------------------
144prototypes:
145 struct super_block *(*get_sb) (struct file_system_type *, int,
146 const char *, void *);
147 void (*kill_sb) (struct super_block *);
148locking rules:
149 may block BKL
150get_sb yes yes
151kill_sb yes yes
152
153->get_sb() returns error or a locked superblock (exclusive on ->s_umount).
154->kill_sb() takes a write-locked superblock, does all shutdown work on it,
155unlocks and drops the reference.
156
157--------------------------- address_space_operations --------------------------
158prototypes:
159 int (*writepage)(struct page *page, struct writeback_control *wbc);
160 int (*readpage)(struct file *, struct page *);
161 int (*sync_page)(struct page *);
162 int (*writepages)(struct address_space *, struct writeback_control *);
163 int (*set_page_dirty)(struct page *page);
164 int (*readpages)(struct file *filp, struct address_space *mapping,
165 struct list_head *pages, unsigned nr_pages);
166 int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
167 int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
168 sector_t (*bmap)(struct address_space *, sector_t);
169 int (*invalidatepage) (struct page *, unsigned long);
170 int (*releasepage) (struct page *, int);
171 int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
172 loff_t offset, unsigned long nr_segs);
173
174locking rules:
175 All except set_page_dirty may block
176
177 BKL PageLocked(page)
178writepage: no yes, unlocks (see below)
179readpage: no yes, unlocks
180sync_page: no maybe
181writepages: no
182set_page_dirty no no
183readpages: no
184prepare_write: no yes
185commit_write: no yes
186bmap: yes
187invalidatepage: no yes
188releasepage: no yes
189direct_IO: no
190
191 ->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
192may be called from the request handler (/dev/loop).
193
194 ->readpage() unlocks the page, either synchronously or via I/O
195completion.
196
197 ->readpages() populates the pagecache with the passed pages and starts
198I/O against them. They come unlocked upon I/O completion.
199
200 ->writepage() is used for two purposes: for "memory cleansing" and for
201"sync". These are quite different operations and the behaviour may differ
202depending upon the mode.
203
204If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then
205it *must* start I/O against the page, even if that would involve
206blocking on in-progress I/O.
207
208If writepage is called for memory cleansing (sync_mode ==
209WBC_SYNC_NONE) then its role is to get as much writeout underway as
210possible. So writepage should try to avoid blocking against
211currently-in-progress I/O.
212
213If the filesystem is not called for "sync" and it determines that it
214would need to block against in-progress I/O to be able to start new I/O
215against the page the filesystem should redirty the page with
216redirty_page_for_writepage(), then unlock the page and return zero.
217This may also be done to avoid internal deadlocks, but rarely.
218
219If the filesystem is called for sync then it must wait on any
220in-progress I/O and then start new I/O.
221
222The filesystem should unlock the page synchronously, before returning
223to the caller.
224
225Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
226and return zero, writepage *must* run set_page_writeback() against the page,
227followed by unlocking it. Once set_page_writeback() has been run against the
228page, write I/O can be submitted and the write I/O completion handler must run
229end_page_writeback() once the I/O is complete. If no I/O is submitted, the
230filesystem must run end_page_writeback() against the page before returning from
231writepage.
232
233That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
234if the filesystem needs the page to be locked during writeout, that is ok, too,
235the page is allowed to be unlocked at any point in time between the calls to
236set_page_writeback() and end_page_writeback().
237
238Note, failure to run either redirty_page_for_writepage() or the combination of
239set_page_writeback()/end_page_writeback() on a page submitted to writepage
240will leave the page itself marked clean but it will be tagged as dirty in the
241radix tree. This incoherency can lead to all sorts of hard-to-debug problems
242in the filesystem like having dirty inodes at umount and losing written data.
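
The writepage rules above can be condensed into a small state model. The sketch below is a userspace illustration, not kernel code: `struct toy_page`, `toy_writepage` and the helpers are hypothetical, waiting for in-progress I/O is modelled as instantaneous, and the caller is assumed to have cleared the dirty flag before the call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the page state the text talks about. */
struct toy_page {
    bool locked;
    bool dirty;
    bool writeback;
    bool io_in_progress;   /* earlier I/O that new I/O would have to wait for */
    int  ios_submitted;
};

enum toy_sync_mode { TOY_SYNC_NONE, TOY_SYNC_ALL };

/* Models redirty_page_for_writepage(): keep the page dirty for a later pass. */
void toy_redirty(struct toy_page *p) { p->dirty = true; }

/* Models the write I/O completion handler, which must end writeback. */
void toy_end_io(struct toy_page *p) { p->writeback = false; }

/* Called with the page locked; always returns with the page unlocked. */
int toy_writepage(struct toy_page *p, enum toy_sync_mode mode)
{
    assert(p->locked);

    if (p->io_in_progress) {
        if (mode == TOY_SYNC_NONE) {
            /* Memory cleansing: do not block, leave it for later. */
            toy_redirty(p);
            p->locked = false;
            return 0;
        }
        /* Sync: must wait for the old I/O (instant in this model). */
        p->io_in_progress = false;
    }

    p->writeback = true;    /* stands in for set_page_writeback() */
    p->locked = false;      /* unlock synchronously, before returning */
    p->ios_submitted++;     /* submit I/O; completion calls toy_end_io() */
    return 0;
}
```

Every path through the model either redirties the page or marks it under writeback, which is the invariant whose violation the text warns leaves the page clean but still tagged dirty.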
243
244 ->sync_page() locking rules are not well-defined - usually it is called
245with lock on page, but that is not guaranteed. Considering the currently
246existing instances of this method ->sync_page() itself doesn't look
247well-defined...
248
249 ->writepages() is used for periodic writeback and for syscall-initiated
250sync operations. The address_space should start I/O against at least
251*nr_to_write pages. *nr_to_write must be decremented for each page which is
252written. The address_space implementation may write more (or less) pages
253than *nr_to_write asks for, but it should try to be reasonably close. If
254nr_to_write is NULL, all dirty pages must be written.
255
256writepages should _only_ write pages which are present on
257mapping->io_pages.
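
The nr_to_write accounting described for ->writepages() can be sketched as a simple loop. This is a userspace model under loud assumptions: the array stands in for mapping->io_pages, clearing the dirty flag stands in for starting I/O, and `toy_writepages` is a hypothetical name rather than any kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page { bool dirty; };

/*
 * Write dirty pages from an array standing in for mapping->io_pages.
 * If nr_to_write is NULL every dirty page is written; otherwise the
 * budget is decremented for each page written and the loop stops once
 * it reaches zero. Returns the number of pages written.
 */
size_t toy_writepages(struct toy_page *io_pages, size_t n, long *nr_to_write)
{
    size_t written = 0;

    for (size_t i = 0; i < n; i++) {
        if (nr_to_write && *nr_to_write <= 0)
            break;
        if (!io_pages[i].dirty)
            continue;
        io_pages[i].dirty = false;   /* "start I/O" in this model */
        written++;
        if (nr_to_write)
            (*nr_to_write)--;
    }
    return written;
}
```

This model stops exactly at the budget; the text allows an implementation to write somewhat more or fewer pages as long as it stays reasonably close.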
258
259 ->set_page_dirty() is called from various places in the kernel
260when the target page is marked as needing writeback. It may be called
261under spinlock (it cannot block) and is sometimes called with the page
262not locked.
263
264 ->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
265filesystems and by the swapper. The latter will eventually go away. None of
266the instances actually needs the BKL. Please, keep it that way and don't
267breed new callers.
268
269 ->invalidatepage() is called when the filesystem must attempt to drop
270some or all of the buffers from the page when it is being truncated. It
271returns zero on success. If ->invalidatepage is zero, the kernel uses
272block_invalidatepage() instead.
273
274 ->releasepage() is called when the kernel is about to try to drop the
275buffers from the page in preparation for freeing it. It returns zero to
276indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
277the kernel assumes that the fs has no private interest in the buffers.
278
279 Note: currently almost all instances of address_space methods are
280using BKL for internal serialization and that's one of the worst sources
281of contention. Normally they are calling library functions (in fs/buffer.c)
282and pass foo_get_block() as a callback (on local block-based filesystems,
283indeed). BKL is not needed for library stuff and is usually taken by
284foo_get_block(). It's an overkill, since block bitmaps can be protected by
285internal fs locking and real critical areas are much smaller than the areas
286filesystems protect now.
287
288----------------------- file_lock_operations ------------------------------
289prototypes:
290 void (*fl_insert)(struct file_lock *); /* lock insertion callback */
291 void (*fl_remove)(struct file_lock *); /* lock removal callback */
292 void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
293 void (*fl_release_private)(struct file_lock *);
294
295
296locking rules:
297 BKL may block
298fl_insert: yes no
299fl_remove: yes no
300fl_copy_lock: yes no
301fl_release_private: yes yes
302
303----------------------- lock_manager_operations ---------------------------
304prototypes:
305 int (*fl_compare_owner)(struct file_lock *, struct file_lock *);
306 void (*fl_notify)(struct file_lock *); /* unblock callback */
307 void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
308 void (*fl_release_private)(struct file_lock *);
309 void (*fl_break)(struct file_lock *); /* break_lease callback */
310
311locking rules:
312 BKL may block
313fl_compare_owner: yes no
314fl_notify: yes no
315fl_copy_lock: yes no
316fl_release_private: yes yes
317fl_break: yes no
318
319 Currently only NFSD and NLM provide instances of this class. None of
320them block. If you have out-of-tree instances - please, show up. Locking
321in that area will change.
322--------------------------- buffer_head -----------------------------------
323prototypes:
324 void (*b_end_io)(struct buffer_head *bh, int uptodate);
325
326locking rules:
327 called from interrupts. In other words, extreme care is needed here.
328bh is locked, but that is the only guarantee we have here. Currently only RAID1,
329highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices
330call this method upon the IO completion.
331
332--------------------------- block_device_operations -----------------------
333prototypes:
334 int (*open) (struct inode *, struct file *);
335 int (*release) (struct inode *, struct file *);
336 int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
337 int (*media_changed) (struct gendisk *);
338 int (*revalidate_disk) (struct gendisk *);
339
340locking rules:
341 BKL bd_sem
342open: yes yes
343release: yes yes
344ioctl: yes no
345media_changed: no no
346revalidate_disk: no no
347
348The last two are called only from check_disk_change().
349
350--------------------------- file_operations -------------------------------
351prototypes:
352 loff_t (*llseek) (struct file *, loff_t, int);
353 ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
354 ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
355 ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
356 ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t,
357 loff_t);
358 int (*readdir) (struct file *, void *, filldir_t);
359 unsigned int (*poll) (struct file *, struct poll_table_struct *);
360 int (*ioctl) (struct inode *, struct file *, unsigned int,
361 unsigned long);
362 long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
363 long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
364 int (*mmap) (struct file *, struct vm_area_struct *);
365 int (*open) (struct inode *, struct file *);
366 int (*flush) (struct file *);
367 int (*release) (struct inode *, struct file *);
368 int (*fsync) (struct file *, struct dentry *, int datasync);
369 int (*aio_fsync) (struct kiocb *, int datasync);
370 int (*fasync) (int, struct file *, int);
371 int (*lock) (struct file *, int, struct file_lock *);
372 ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
373 loff_t *);
374 ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
375 loff_t *);
376 ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
377 void __user *);
378 ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
379 loff_t *, int);
380 unsigned long (*get_unmapped_area)(struct file *, unsigned long,
381 unsigned long, unsigned long, unsigned long);
382 int (*check_flags)(int);
383 int (*dir_notify)(struct file *, unsigned long);
385
386locking rules:
387 All except ->poll() may block.
388 BKL
389llseek: no (see below)
390read: no
391aio_read: no
392write: no
393aio_write: no
394readdir: no
395poll: no
396ioctl: yes (see below)
397unlocked_ioctl: no (see below)
398compat_ioctl: no
399mmap: no
400open: maybe (see below)
401flush: no
402release: no
403fsync: no (see below)
404aio_fsync: no
405fasync: yes (see below)
406lock: yes
407readv: no
408writev: no
409sendfile: no
410sendpage: no
411get_unmapped_area: no
412check_flags: no
413dir_notify: no
414
415->llseek() locking has moved from llseek to the individual llseek
416implementations. If your fs is not using generic_file_llseek, you
417need to acquire and release the appropriate locks in your ->llseek().
418For many filesystems, it is probably safe to acquire the inode
419semaphore. Note some filesystems (i.e. remote ones) provide no
420protection for i_size so you will need to use the BKL.
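
A filesystem-private ->llseek() that does its own locking might follow the shape below. This is a userspace sketch only: the structures, the flag standing in for the inode semaphore, and `toy_llseek` are hypothetical, and overflow handling is left out for brevity.

```c
#include <assert.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

struct toy_inode {
    long long i_size;
    int sem_held;      /* stand-in for the inode semaphore */
};

struct toy_file {
    struct toy_inode *inode;
    long long f_pos;
};

/* Toy acquire/release; the asserts catch unbalanced use in this model. */
void toy_down(struct toy_inode *i) { assert(!i->sem_held); i->sem_held = 1; }
void toy_up(struct toy_inode *i)   { assert(i->sem_held);  i->sem_held = 0; }

/* Returns the new position, or -1 for an invalid request. */
long long toy_llseek(struct toy_file *f, long long offset, int whence)
{
    long long pos;

    toy_down(f->inode);              /* protects i_size and f_pos */
    switch (whence) {
    case SEEK_SET: pos = offset; break;
    case SEEK_CUR: pos = f->f_pos + offset; break;
    case SEEK_END: pos = f->inode->i_size + offset; break;
    default:       pos = -1; break;
    }
    if (pos >= 0)
        f->f_pos = pos;              /* only commit valid positions */
    else
        pos = -1;
    toy_up(f->inode);                /* released on every path */
    return pos;
}
```

The lock is taken before i_size is read and dropped only after f_pos is updated, so a concurrent size change cannot be observed halfway through a SEEK_END computation.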
421
422->open() locking is in-transit: big lock partially moved into the methods.
423The only exception is ->open() in the instances of file_operations that never
424end up in ->i_fop/->proc_fops, i.e. ones that belong to character devices
425(chrdev_open() takes lock before replacing ->f_op and calling the secondary
426method). As soon as we fix the handling of module reference counters all
427instances of ->open() will be called without the BKL.
428
429Note: ext2_release() was *the* source of contention on fs-intensive
430loads and dropping BKL on ->release() helps to get rid of that (we still
431grab BKL for cases when we close a file that had been opened r/w, but that
432can and should be done using the internal locking with smaller critical areas).
433Current worst offender is ext2_get_block()...
434
435->fasync() is a mess. This area needs a big cleanup and that will probably
436affect locking.
437
438->readdir() and ->ioctl() on directories must be changed. Ideally we would
439move ->readdir() to inode_operations and use a separate method for directory
440->ioctl() or kill the latter completely. One of the problems is that for
441anything that resembles union-mount we won't have a struct file for all
442components. And there are other reasons why the current interface is a mess...
443
444->ioctl() on regular files is superseded by the ->unlocked_ioctl() that
445doesn't take the BKL.
446
447->read on directories probably must go away - we should just enforce -EISDIR
448in sys_read() and friends.
449
450->fsync() has i_sem on inode.
451
452--------------------------- dquot_operations -------------------------------
453prototypes:
454 int (*initialize) (struct inode *, int);
455 int (*drop) (struct inode *);
456 int (*alloc_space) (struct inode *, qsize_t, int);
457 int (*alloc_inode) (const struct inode *, unsigned long);
458 int (*free_space) (struct inode *, qsize_t);
459 int (*free_inode) (const struct inode *, unsigned long);
460 int (*transfer) (struct inode *, struct iattr *);
461 int (*write_dquot) (struct dquot *);
462 int (*acquire_dquot) (struct dquot *);
463 int (*release_dquot) (struct dquot *);
464 int (*mark_dirty) (struct dquot *);
465 int (*write_info) (struct super_block *, int);
466
467These operations are intended to be more or less wrapping functions that ensure
468a proper locking wrt the filesystem and call the generic quota operations.
469
470What filesystem should expect from the generic quota functions:
471
472 FS recursion Held locks when called
473initialize: yes maybe dqonoff_sem
474drop: yes -
475alloc_space: ->mark_dirty() -
476alloc_inode: ->mark_dirty() -
477free_space: ->mark_dirty() -
478free_inode: ->mark_dirty() -
479transfer: yes -
480write_dquot: yes dqonoff_sem or dqptr_sem
481acquire_dquot: yes dqonoff_sem or dqptr_sem
482release_dquot: yes dqonoff_sem or dqptr_sem
483mark_dirty: no -
484write_info: yes dqonoff_sem
485
486FS recursion means calling ->quota_read() and ->quota_write() from superblock
487operations.
488
489->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called
490only directly by the filesystem and do not call any fs functions other than
491the ->mark_dirty() operation.
492
493More details about quota locking can be found in fs/dquot.c.
494
495--------------------------- vm_operations_struct -----------------------------
496prototypes:
497 void (*open)(struct vm_area_struct*);
498 void (*close)(struct vm_area_struct*);
499 struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *);
500
501locking rules:
502 BKL mmap_sem
503open: no yes
504close: no yes
505nopage: no yes
506
507================================================================================
508 Dubious stuff
509
510(if you break something or notice that it is broken and do not fix it yourself
511- at least put it here)
512
513ipc/shm.c::shm_delete() - may need BKL.
514->read() and ->write() in many drivers are (probably) missing BKL.
515drivers/sgi/char/graphics.c::sgi_graphics_nopage() - may need BKL.
diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.txt
new file mode 100644
index 000000000000..060abb0c7004
--- /dev/null
+++ b/Documentation/filesystems/adfs.txt
@@ -0,0 +1,57 @@
1Mount options for ADFS
2----------------------
3
4 uid=nnn All files in the partition will be owned by
5 user id nnn. Default 0 (root).
6 gid=nnn All files in the partition will be in group
7 nnn. Default 0 (root).
8 ownmask=nnn The permission mask for ADFS 'owner' permissions
9 will be nnn. Default 0700.
10 othmask=nnn The permission mask for ADFS 'other' permissions
11 will be nnn. Default 0077.
12
13Mapping of ADFS permissions to Linux permissions
14------------------------------------------------
15
16 ADFS permissions consist of the following:
17
18 Owner read
19 Owner write
20 Other read
21 Other write
22
23 (In older versions, an 'execute' permission did exist, but this
24 does not hold the same meaning as the Linux 'execute' permission
25 and is now obsolete).
26
27 The mapping is performed as follows:
28
29 Owner read -> -r--r--r--
30 Owner write -> --w--w--w-
31 Owner read and filetype UnixExec -> ---x--x--x
32 These are then masked by ownmask, eg 700 -> -rwx------
33 Possible owner mode permissions -> -rwx------
34
35 Other read -> -r--r--r--
36 Other write -> --w--w--w-
37 Other read and filetype UnixExec -> ---x--x--x
38 These are then masked by othmask, eg 077 -> ----rwxrwx
39 Possible other mode permissions -> ----rwxrwx
40
41 Hence, with the default masks, if a file is owner read/write, and
42 not a UnixExec filetype, then the permissions will be:
43
44 -rw-------
45
46 However, if the masks were ownmask=0770,othmask=0007, then this would
47 be modified to:
48 -rw-rw----
49
50 There is no restriction on what you can do with these masks. You may
51 wish that either read bits give read access to the file for all, but
52 keep the default write protection (ownmask=0755,othmask=0577):
53
54 -rw-r--r--
55
56 You can therefore tailor the permission translation to whatever you
57 desire the permissions should be under Linux.
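
The mapping described above can be sketched in a few lines of Python. This is
only an illustration of the documented rules, not the driver's code; the
function names and the boolean parameters are assumptions made for the example.

```python
# Sketch of the documented ADFS -> Linux permission mapping.
# adfs_mode() and mode_string() are hypothetical helper names.

def adfs_mode(owner_read, owner_write, other_read, other_write,
              unix_exec=False, ownmask=0o700, othmask=0o077):
    """Return the Linux permission bits for one ADFS file."""
    owner_bits = 0
    if owner_read:
        owner_bits |= 0o444            # Owner read  -> -r--r--r--
        if unix_exec:
            owner_bits |= 0o111        # read + UnixExec -> ---x--x--x
    if owner_write:
        owner_bits |= 0o222            # Owner write -> --w--w--w-

    other_bits = 0
    if other_read:
        other_bits |= 0o444
        if unix_exec:
            other_bits |= 0o111
    if other_write:
        other_bits |= 0o222

    # Each half is masked separately, then the results are combined.
    return (owner_bits & ownmask) | (other_bits & othmask)


def mode_string(mode):
    """Render permission bits in ls(1) style, e.g. 0o600 -> '-rw-------'."""
    flags = "rwxrwxrwx"
    return "-" + "".join(
        flag if mode & (1 << (8 - i)) else "-" for i, flag in enumerate(flags)
    )
```

With the default masks an owner read/write file comes out as -rw-------, and
with ownmask=0770,othmask=0007 it comes out as -rw-rw----, matching the
examples in the text.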
diff --git a/Documentation/filesystems/affs.txt b/Documentation/filesystems/affs.txt
new file mode 100644
index 000000000000..30c9738590f4
--- /dev/null
+++ b/Documentation/filesystems/affs.txt
@@ -0,0 +1,219 @@
1Overview of Amiga Filesystems
2=============================
3
4Not all varieties of the Amiga filesystems are supported for reading and
5writing. The Amiga currently knows six different filesystems:
6
7DOS\0 The old or original filesystem, not really suited for
8 hard disks and normally not used on them, either.
9 Supported read/write.
10
11DOS\1 The original Fast File System. Supported read/write.
12
13DOS\2 The old "international" filesystem. International means that
14 a bug has been fixed so that accented ("international") letters
15 in file names are case-insensitive, as they ought to be.
16 Supported read/write.
17
18DOS\3 The "international" Fast File System. Supported read/write.
19
20DOS\4 The original filesystem with directory cache. The directory
21 cache speeds up directory accesses on floppies considerably,
22 but slows down file creation/deletion. Doesn't make much
23 sense on hard disks. Supported read only.
24
25DOS\5 The Fast File System with directory cache. Supported read only.
26
27All of the above filesystems allow block sizes from 512 to 32K bytes.
28Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks
29speed up almost everything at the expense of wasted disk space. The speed
30gain above 4K seems not really worth the price, so you don't lose too
31much here, either.
32
33The muFS (multi user File System) equivalents of the above file systems
34are supported, too.
35
36Mount options for the AFFS
37==========================
38
39protect If this option is set, the protection bits cannot be altered.
40
41setuid[=uid] This sets the owner of all files and directories in the file
42 system to uid or the uid of the current user, respectively.
43
44setgid[=gid] Same as above, but for gid.
45
46mode=mode Sets the mode flags to the given (octal) value, regardless
47 of the original permissions. Directories will get an x
48 permission if the corresponding r bit is set.
49 This is useful since most of the plain AmigaOS files
50 will map to 600.
51
52reserved=num Sets the number of reserved blocks at the start of the
53 partition to num. You should never need this option.
54 Default is 2.
55
56root=block Sets the block number of the root block. This should never
57 be necessary.
58
59bs=blksize Sets the blocksize to blksize. Valid block sizes are 512,
60 1024, 2048 and 4096. Like the root option, this should
61 never be necessary, as the affs can figure it out itself.
62
63quiet The file system will not return an error for disallowed
64 mode changes.
65
66verbose The volume name, file system type and block size will
67 be written to the syslog when the filesystem is mounted.
68
69mufs The filesystem is really a muFS, although it doesn't
70 identify itself as one. This option is necessary if
71 the filesystem wasn't formatted as muFS, but is used
72 as one.
73
74prefix=path Path will be prefixed to every absolute path name of
75 symbolic links on an AFFS partition. Default = "/".
76 (See below.)
77
78volume=name When symbolic links with an absolute path are created
79 on an AFFS partition, name will be prepended as the
80 volume name. Default = "" (empty string).
81 (See below.)
82
83Handling of the Users/Groups and protection flags
84=================================================
85
86Amiga -> Linux:
87
88The Amiga protection flags RWEDRWEDHSPARWED are handled as follows:
89
90 - R maps to r for user, group and others. On directories, R implies x.
91
92 - If both W and D are allowed, w will be set.
93
94 - E maps to x.
95
96 - H and P are always retained and ignored under Linux.
97
98 - A is always reset when a file is written to.
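
As a rough illustration of the Amiga -> Linux rules listed above, the sketch
below maps the flags of each class (user, group, others) to rwx bits. It is
not the driver's code: the function names are made up for the example, and
each class is passed as a string of the flags that are allowed.

```python
# Sketch of the Amiga -> Linux protection mapping described above.
# affs_class_bits() and affs_to_linux_mode() are hypothetical names.

def affs_class_bits(flags, is_dir=False):
    """Map one class's allowed Amiga flags (e.g. "RWD") to rwx bits."""
    bits = 0
    if "R" in flags:
        bits |= 0o4                    # R maps to r
        if is_dir:
            bits |= 0o1                # on directories, R implies x
    if "W" in flags and "D" in flags:
        bits |= 0o2                    # w needs both W and D
    if "E" in flags:
        bits |= 0o1                    # E maps to x
    return bits


def affs_to_linux_mode(user, group, other, is_dir=False):
    """Combine the three classes into a Linux permission value."""
    return (affs_class_bits(user, is_dir) << 6
            | affs_class_bits(group, is_dir) << 3
            | affs_class_bits(other, is_dir))
```

A plain file whose owner has R, W and D and whose other classes have nothing
maps to 0600, which is the case the mode= option description refers to.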
99
100User id and group id will be used unless set[gu]id are given as mount
101options. Since most of the Amiga file systems are single user systems
102they will be owned by root. The root directory (the mount point) of the
103Amiga filesystem will be owned by the user who actually mounts the
104filesystem (the root directory doesn't have uid/gid fields).
105
106Linux -> Amiga:
107
108The Linux rwxrwxrwx file mode is handled as follows:
109
110 - r permission will set R for user, group and others.
111
112 - w permission will set W and D for user, group and others.
113
114 - x permission of the user will set E for plain files.
115
116 - All other flags (suid, sgid, ...) are ignored and will
117 not be retained.
118
119Newly created files and directories will get the user and group ID
120of the current user and a mode according to the umask.
121
122Symbolic links
123==============
124
125Although the Amiga and Linux file systems resemble each other, there
126are some, not always subtle, differences. One of them becomes apparent
127with symbolic links. While Linux has a file system with exactly one
128root directory, the Amiga has a separate root directory for each
129file system (for example, partition, floppy disk, ...). With the Amiga,
130these entities are called "volumes". They have symbolic names which
131can be used to access them. Thus, symbolic links can point to a
132different volume. AFFS turns the volume name into a directory name
133and prepends the prefix path (see prefix option) to it.
134
135Example:
136You mount all your Amiga partitions under /amiga/<volume> (where
137<volume> is the name of the volume), and you give the option
138"prefix=/amiga/" when mounting all your AFFS partitions. (They
139might be "User", "WB" and "Graphics", the mount points /amiga/User,
140/amiga/WB and /amiga/Graphics). A symbolic link referring to
141"User:sc/include/dos/dos.h" will be followed to
142"/amiga/User/sc/include/dos/dos.h".
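
The translation in this example can be sketched as follows. The helper name is
hypothetical, and normalising the prefix to end in exactly one slash is a
convenience of the sketch rather than documented driver behaviour.

```python
# Sketch of how an absolute Amiga link target ("Volume:path") is turned
# into a Linux path using the prefix mount option.

def affs_resolve_link(target, prefix="/"):
    """Translate "Volume:path" into prefix + Volume + "/" + path."""
    volume, sep, rest = target.partition(":")
    if not sep:
        # No volume part: treated here as a relative link, left unchanged.
        return target
    base = prefix.rstrip("/") + "/"    # assumption: tolerate a missing slash
    return base + volume + "/" + rest
```

For instance, affs_resolve_link("User:sc/include/dos/dos.h", "/amiga/") yields
"/amiga/User/sc/include/dos/dos.h", as in the example above.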
143
144Examples
145========
146
147Command line:
148 mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose
149 mount /dev/sda3 /Amiga -t affs
150
151/etc/fstab entry:
152 /dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0
153
154IMPORTANT NOTE
155==============
156
157If you boot Windows 95 (don't know about 3.x, 98 and NT) while you
158have an Amiga harddisk connected to your PC, it will overwrite
159the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating
160the Rigid Disk Block. Sheer luck has it that this is an unused
161area of the RDB, so only the checksum doesn't match anymore.
162Linux will ignore this garbage and recognize the RDB anyway, but
163before you connect that drive to your Amiga again, you must
164restore or repair your RDB. So please do make a backup copy of it
165before booting Windows!
166
167If the damage is already done, the following should fix the RDB
168(where <disk> is the device name).
169DO AT YOUR OWN RISK:
170
171 dd if=/dev/<disk> of=rdb.tmp count=1
172 cp rdb.tmp rdb.fixed
173 dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4
174 dd if=rdb.fixed of=/dev/<disk>
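
To make the offsets explicit: seek=220 is 0xdc, and count=4 covers the bytes up
to and including 0xdf. The sketch below performs the same zeroing on an
in-memory copy of block 0; the function name is made up for the example, and
it deliberately does not touch any device.

```python
# Sketch: clear bytes 0x00dc..0x00df in a copy of block 0.
# zero_rdb_garbage() is a hypothetical helper, not part of any tool.

def zero_rdb_garbage(block):
    """Return a copy of the block with bytes 0xdc..0xdf set to zero."""
    fixed = bytearray(block)
    fixed[0xDC:0xE0] = bytes(4)        # four zero bytes, same length as slice
    return bytes(fixed)
```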
175
176Bugs, Restrictions, Caveats
177===========================
178
179Quite a few things may not work as advertised. Not everything is
180tested, though several hundred MB have been read and written using
181this fs. For a most up-to-date list of bugs please consult
182fs/affs/Changes.
183
184Filenames are truncated to 30 characters without warning (this
185can be changed by setting the compile-time option AFFS_NO_TRUNCATE
186in include/linux/amigaffs.h).
187
188Case is ignored by the affs in filename matching, but Linux shells
189do care about the case. Example (with /wb being an affs mounted fs):
190 rm /wb/WRONGCASE
191will remove /wb/wrongcase, but
192 rm /wb/WR*
193will not since the names are matched by the shell.
194
195The block allocation is designed for hard disk partitions. If more
196than 1 process writes to a (small) diskette, the blocks are allocated
197in an ugly way (but the real AFFS doesn't do much better). This
198is also true when space gets tight.
199
200You cannot execute programs on an OFS (Old File System), since the
201program files cannot be memory mapped due to the 488 byte blocks.
202For the same reason you cannot mount an image on such a filesystem
203via the loopback device.
204
205The bitmap valid flag in the root block may not be accurate when the
206system crashes while an affs partition is mounted. There's currently
207no way to fix a garbled filesystem without an Amiga (disk validator)
208or manually (who would do this?). Maybe later.
209
210If you mount affs partitions on system startup, you may want to tell
211fsck that the fs should not be checked (place a '0' in the sixth field
212of /etc/fstab).
213
214It's not possible to read floppy disks with a normal PC or workstation
215due to an incompatibility with the Amiga floppy controller.
216
217If you are interested in an Amiga Emulator for Linux, look at
218
219http://www-users.informatik.rwth-aachen.de/~crux/uae.html
diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt
new file mode 100644
index 000000000000..2f4237dfb8c7
--- /dev/null
+++ b/Documentation/filesystems/afs.txt
@@ -0,0 +1,155 @@
1 kAFS: AFS FILESYSTEM
2 ====================
3
4ABOUT
5=====
6
7This filesystem provides a fairly simple AFS filesystem driver. It is under
8development and only provides very basic facilities. It does not yet support
9the following AFS features:
10
11 (*) Write support.
12 (*) Communications security.
13 (*) Local caching.
14 (*) pioctl() system call.
15 (*) Automatic mounting of embedded mountpoints.
16
17
18USAGE
19=====
20
21When inserting the driver modules the root cell must be specified along with a
22list of volume location server IP addresses:
23
24 insmod rxrpc.o
25 insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
26
27The first module is a driver for the RxRPC remote operation protocol, and the
28second is the actual filesystem driver for the AFS filesystem.
29
30Once the module has been loaded, more cells can be added by the following
31procedure:
32
33 echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
34
35Where the parameters to the "add" command are the name of a cell and a list of
36volume location servers within that cell.
37
38Filesystems can be mounted anywhere by commands similar to the following:
39
40 mount -t afs "%cambridge.redhat.com:root.afs." /afs
41 mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge
42 mount -t afs "#root.afs." /afs
43 mount -t afs "#root.cell." /afs/cambridge
44
45 NB: When using this on Linux 2.4, the mount command has to be different,
46 since the filesystem doesn't have access to the device name argument:
47
48 mount -t afs none /afs -ovol="#root.afs."
49
50Where the initial character is either a hash or a percent symbol depending on
51whether you definitely want a R/W volume (percent) or whether you'd prefer a R/O
52volume, but are willing to use a R/W volume instead (hash).
53
54The name of the volume can be suffixed with ".backup" or ".readonly" to
55specify connection to only volumes of those types.
56
57The name of the cell is optional, and if not given during a mount, then the
58named volume will be looked up in the cell specified during insmod.
59
60Additional cells can be added through /proc (see later section).
61
62
63MOUNTPOINTS
64===========
65
66AFS has a concept of mountpoints. These are specially formatted symbolic links
67(of the same form as the "device name" passed to mount). kAFS presents these
68to the user as directories that have special properties:
69
70 (*) They cannot be listed. Running a program like "ls" on them will incur an
71 EREMOTE error (Object is remote).
72
73 (*) Other objects can't be looked up inside of them. This also incurs an
74 EREMOTE error.
75
76 (*) They can be queried with the readlink() system call, which will return
77 the name of the mountpoint to which they point. The "readlink" program
78 will also work.
79
80 (*) They can be mounted on (which symbolic links can't).
81
82
83PROC FILESYSTEM
84===============
85
86The rxrpc module creates a number of files in various places in the /proc
87filesystem:
88
89 (*) Firstly, some information files are made available in a directory called
90 "/proc/net/rxrpc/". These list the extant transport endpoint, peer,
91 connection and call records.
92
93 (*) Secondly, some control files are made available in a directory called
94 "/proc/sys/rxrpc/". Currently, all these files can be used for is to
95 turn on various levels of tracing.
96
97The AFS module creates a "/proc/fs/afs/" directory and populates it:
98
99 (*) A "cells" file that lists cells currently known to the afs module.
100
101 (*) A directory per cell that contains files that list volume location
102 servers, volumes, and active servers known within that cell.
103
104
105THE CELL DATABASE
106=================
107
108The filesystem maintains an internal database of all the cells it knows and
109the IP addresses of the volume location servers for those cells. The cell to
110which the computer belongs is added to the database when insmod is performed
111by the "rootcell=" argument.
112
113Further cells can be added by commands similar to the following:
114
115 echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells
116 echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
117
118No other cell database operations are available at this time.
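
For scripts that add several cells, the command string can be built as in the
sketch below. The helper name is made up for the example; only the "add
CELLNAME VLADDR[:VLADDR]..." format comes from the text above.

```python
# Sketch: build the string that is written to /proc/fs/afs/cells.
# format_add_cell() is a hypothetical helper.

def format_add_cell(cell, vl_addrs):
    """Return "add CELLNAME VLADDR[:VLADDR]..." for the given cell."""
    if not vl_addrs:
        raise ValueError("at least one volume location server is required")
    return "add {} {}".format(cell, ":".join(vl_addrs))
```

The result would then be written to /proc/fs/afs/cells in the same way as the
echo commands above.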
119
120
121EXAMPLES
122========
123
124Here's what I use to test this. Some of the names and IP addresses are local
125to my internal DNS. My "root.afs" partition has a mount point within it for
126some public volumes.
127
128insmod -S /tmp/rxrpc.o
129insmod -S /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
130
131mount -t afs \%root.afs. /afs
132mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/
133
134echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells
135mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/
136mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive
137mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib
138mount -t afs "#grand.central.org:root.doc." /afs/grand.central.org/doc
139mount -t afs "#grand.central.org:root.project." /afs/grand.central.org/project
140mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service
141mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software
142mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user
143
144umount /afs/grand.central.org/user
145umount /afs/grand.central.org/software
146umount /afs/grand.central.org/service
147umount /afs/grand.central.org/project
148umount /afs/grand.central.org/doc
149umount /afs/grand.central.org/contrib
150umount /afs/grand.central.org/archive
151umount /afs/grand.central.org
152umount /afs/cambridge.redhat.com
153umount /afs
154rmmod kafs
155rmmod rxrpc
diff --git a/Documentation/filesystems/automount-support.txt b/Documentation/filesystems/automount-support.txt
new file mode 100644
index 000000000000..58c65a1713e5
--- /dev/null
+++ b/Documentation/filesystems/automount-support.txt
@@ -0,0 +1,118 @@
1Support is available for filesystems that wish to do automounting (such
2as kAFS which can be found in fs/afs/). This facility includes allowing
3in-kernel mounts to be performed and mountpoint degradation to be
4requested. The latter can also be requested by userspace.
5
6
7======================
8IN-KERNEL AUTOMOUNTING
9======================
10
11A filesystem can now mount another filesystem on one of its directories by the
12following procedure:
13
14 (1) Give the directory a follow_link() operation.
15
16 When the directory is accessed, the follow_link op will be called, and
17 it will be provided with the location of the mountpoint in the nameidata
18 structure (vfsmount and dentry).
19
20 (2) Have the follow_link() op do the following steps:
21
22 (a) Call do_kern_mount() to call the appropriate filesystem to set up a
23 superblock and gain a vfsmount structure representing it.
24
25 (b) Copy the nameidata provided as an argument and substitute the dentry
26 argument into the copy.
27
28 (c) Call do_add_mount() to install the new vfsmount into the namespace's
29 mountpoint tree, thus making it accessible to userspace. Use the
30 nameidata set up in (b) as the destination.
31
32 If the mountpoint will be automatically expired, then do_add_mount()
33 should also be given the location of an expiration list (see further
34 down).
35
36 (d) Release the path in the nameidata argument and substitute in the new
37 vfsmount and its root dentry. The ref counts on these will need
38 incrementing.
39
40Then from userspace, you can just do something like:
41
42 [root@andromeda root]# mount -t afs \#root.afs. /afs
43 [root@andromeda root]# ls /afs
44 asd cambridge cambridge.redhat.com grand.central.org
45 [root@andromeda root]# ls /afs/cambridge
46 afsdoc
47 [root@andromeda root]# ls /afs/cambridge/afsdoc/
48 ChangeLog html LICENSE pdf RELNOTES-1.2.2
49
50And then if you look in the mountpoint catalogue, you'll see something like:
51
52 [root@andromeda root]# cat /proc/mounts
53 ...
54 #root.afs. /afs afs rw 0 0
55 #root.cell. /afs/cambridge.redhat.com afs rw 0 0
56 #afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0
57
58
59===========================
60AUTOMATIC MOUNTPOINT EXPIRY
61===========================
62
63Automatic expiration of mountpoints is easy, provided you've mounted the
64mountpoint to be expired in the automounting procedure outlined above.
65
66To do expiration, you need to follow these steps:
67
68 (3) Create at least one list off which the vfsmounts to be expired can be
69 hung. Access to this list will be governed by the vfsmount_lock.
70
71 (4) In step (2c) above, the call to do_add_mount() should be provided with a
72 pointer to this list. It will hang the vfsmount off of it if it succeeds.
73
74 (5) When you want mountpoints to be expired, call mark_mounts_for_expiry()
75 with a pointer to this list. This will process the list, marking every
76 vfsmount thereon for potential expiry on the next call.
77
78 If a vfsmount was already flagged for expiry, and if its usage count is 1
79 (it's only referenced by its parent vfsmount), then it will be deleted
80 from the namespace and thrown away (effectively unmounted).
81
82 It may prove simplest to call this at regular intervals, using
83 some sort of timed event to drive it.
84
85The expiration flag is cleared by calls to mntput. This means that expiration
86will only happen on the second expiration request after the last time the
87mountpoint was accessed.
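
The two-pass behaviour described above can be modelled in a few lines. This is
a toy model in Python, not kernel code: the Mount class, its fields and the
access() method are assumptions made for the example, and only the name
mark_mounts_for_expiry is taken from the text.

```python
# Toy model of two-pass mountpoint expiry.

class Mount:
    def __init__(self, name):
        self.name = name
        self.usage = 1                 # referenced only by the parent vfsmount
        self.expiry_mark = False

    def access(self):
        """Model a use of the mountpoint: the final mntput clears the mark."""
        self.expiry_mark = False


def mark_mounts_for_expiry(expiry_list):
    """Run one expiry pass and return the mounts that were removed."""
    expired = []
    for mnt in list(expiry_list):      # iterate over a copy while removing
        if mnt.expiry_mark and mnt.usage == 1:
            expiry_list.remove(mnt)    # already marked and unused: unmount
            expired.append(mnt)
        else:
            mnt.expiry_mark = True     # candidate for the next pass
    return expired
```

A mount that is touched between two passes loses its mark and therefore
survives the second pass; it is only removed after two consecutive passes
without any access.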
88
89If a mountpoint is moved, it gets removed from the expiration list. If a bind
90mount is made on an expirable mount, the new vfsmount will not be on the
91expiration list and will not expire.
92
93If a namespace is copied, all mountpoints contained therein will be copied,
94and the copies of those that are on an expiration list will be added to the
95same expiration list.
96
97
98=======================
99USERSPACE DRIVEN EXPIRY
100=======================
101
102As an alternative, it is possible for userspace to request expiry of any
103mountpoint (though some will be rejected - the current process's idea of the
104rootfs for example). It does this by passing the MNT_EXPIRE flag to
105umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH.
106
107If the mountpoint in question is referenced by something other than
108umount() or its parent mountpoint, an EBUSY error will be returned and the
109mountpoint will not be marked for expiration or unmounted.
110
111If the mountpoint was not already marked for expiry at that time, an EAGAIN
112error will be given and it won't be unmounted.
113
114Otherwise if it was already marked and it wasn't referenced, unmounting will
115take place as usual.
116
117Again, the expiration flag is cleared every time anything other than umount()
118looks at a mountpoint.
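
The three outcomes described in this section can be summarised by the sketch
below. It is a model of the rules only: the dictionary keys and the function
name are hypothetical, and it assumes that a call which returns EAGAIN also
marks the mountpoint, which the text implies but does not state outright.

```python
# Toy model of umount() with MNT_EXPIRE.
import errno


def umount_expire(mnt):
    """Return 0 on unmount, or a negative errno value otherwise."""
    if mnt["extra_refs"] > 0:
        # Referenced by something other than umount() or the parent.
        return -errno.EBUSY
    if not mnt["marked"]:
        mnt["marked"] = True           # assumption: first request marks it
        return -errno.EAGAIN
    mnt["mounted"] = False             # marked and unreferenced: unmount
    return 0
```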
diff --git a/Documentation/filesystems/befs.txt b/Documentation/filesystems/befs.txt
new file mode 100644
index 000000000000..877a7b1d46ec
--- /dev/null
+++ b/Documentation/filesystems/befs.txt
@@ -0,0 +1,117 @@
1BeOS filesystem for Linux
2
3Document last updated: Dec 6, 2001
4
5WARNING
6=======
7Make sure you understand that this is alpha software. This means that the
8implementation is neither complete nor well-tested.
9
10I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE!
11
12LICENSE
13=====
14This software is covered by the GNU General Public License.
15See the file COPYING for the complete text of the license.
16Or the GNU website: <http://www.gnu.org/licenses/licenses.html>
17
18AUTHOR
19=====
20The largest part of the code was written by Will Dyson <will_dyson@pobox.com>.
21He has been working on the code since Aug 13, 2001. See the changelog for
22details.
23
24Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp>
25His original code can still be found at:
26<http://hp.vector.co.jp/authors/VA008030/bfs/>
27Does anyone know of a more current email address for Makoto? He doesn't
28respond to the address given above...
29
30Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru>
31
32WHAT IS THIS DRIVER?
33==================
34This module implements the native filesystem of BeOS <http://www.be.com/>
35for the linux 2.4.1 and later kernels. Currently it is a read-only
36implementation.
37
38Which is it, BFS or BEFS?
39================
40Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
41But Unixware Boot Filesystem is called bfs, too. And they are already in
42the kernel. Because of this naming conflict, on Linux the BeOS
43filesystem is called befs.
44
45HOW TO INSTALL
46==============
47step 1. Install the BeFS patch into the source code tree of linux.
48
49Apply the patchfile to your kernel source tree.
50Assuming that your kernel source is in /foo/bar/linux and the patchfile
51is called patch-befs-xxx, you would do the following:
52
53 cd /foo/bar/linux
54 patch -p1 < /path/to/patch-befs-xxx
55
56If the patching step fails (i.e. there are rejected hunks), you can try to
57figure it out yourself (it shouldn't be hard), or mail the maintainer
58(Will Dyson <will_dyson@pobox.com>) for help.
59
60step 2. Configuration & make kernel
61
62The linux kernel has many compile-time options. Most of them are beyond the
63scope of this document. I suggest the Kernel-HOWTO document as a good general
64reference on this topic. <http://www.linux.com/howto/Kernel-HOWTO.html>
65
66However, to use the BeFS module, you must enable it at configure time.
67
68 cd /foo/bar/linux
69 make menuconfig (or xconfig)
70
71The BeFS module is not a standard part of the linux kernel, so you must first
72enable support for experimental code under the "Code maturity level" menu.
73
74Then, under the "Filesystems" menu will be an option called "BeFS
75filesystem (experimental)", or something like that. Enable that option
76(it is fine to make it a module).
77
78Save your kernel configuration and then build your kernel.
79
80step 3. Install
81
82See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for
83instructions on this critical step.
84
85USING BFS
86=========
87To use the BeOS filesystem, use filesystem type 'befs'.
88
89ex)
90 mount -t befs /dev/fd0 /beos
91
92MOUNT OPTIONS
93=============
94uid=nnn All files in the partition will be owned by user id nnn.
95gid=nnn All files in the partition will be in group nnn.
96iocharset=xxx Use xxx as the name of the NLS translation table.
97debug The driver will output debugging information to the syslog.
98
99HOW TO GET LATEST VERSION
100=========================
101
102The latest version is currently available at:
103<http://befs-driver.sourceforge.net/>
104
105ANY KNOWN BUGS?
106===========
107As of Jan 20, 2002:
108
109 None
110
111SPECIAL THANKS
112==============
113Dominic Giampaolo ... Writing "Practical File System Design with the Be File System"
114Hiroyuki Yamada ... Testing LinuxPPC.
115
116
117
diff --git a/Documentation/filesystems/bfs.txt b/Documentation/filesystems/bfs.txt
new file mode 100644
index 000000000000..d2841e0bcf02
--- /dev/null
+++ b/Documentation/filesystems/bfs.txt
@@ -0,0 +1,57 @@
1BFS FILESYSTEM FOR LINUX
2========================
3
4The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which
5usually contains the kernel image and a few other files required for the
6boot process.
7
8In order to access /stand partition under Linux you obviously need to
9know the partition number and the kernel must support UnixWare disk slices
10(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not
11depend on having UnixWare disklabel support because one can also mount
12BFS filesystem via loopback:
13
14# losetup /dev/loop0 stand.img
15# mount -t bfs /dev/loop0 /mnt/stand
16
17where stand.img is a file containing the image of BFS filesystem.
18When you have finished using it and umounted you need to also deallocate
19/dev/loop0 device by:
20
21# losetup -d /dev/loop0
22
23You can simplify mounting by just typing:
24
25# mount -t bfs -o loop stand.img /mnt/stand
26
27this will allocate the first available loopback device (and load loop.o
28kernel module if necessary) automatically. If the loopback driver is not
29loaded automatically, make sure that your kernel is compiled with kmod
30support (CONFIG_KMOD) enabled. Beware that umount will not
31deallocate /dev/loopN device if /etc/mtab file on your system is a
32symbolic link to /proc/mounts. You will need to do it manually using
33"-d" switch of losetup(8). Read losetup(8) manpage for more info.
34
35To create the BFS image under UnixWare you need to find out first which
36slice contains it. The command prtvtoc(1M) is your friend:
37
38# prtvtoc /dev/rdsk/c0b0t0d0s0
39
40(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you
41look for the slice with tag "STAND", which is usually slice 10. With this
42information you can use dd(1) to create the BFS image:
43
44# umount /stand
45# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
46
47Just in case, you can verify that you have done the right thing by checking
48the magic number:
49
50# od -Ad -tx4 stand.img | more
51
52The first 4 bytes should be 0x1badface.
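
The same check can be done programmatically. The sketch below assumes the magic
number is stored in little-endian byte order, which is what the od output on
an x86 machine reflects; the function and constant names are made up for the
example.

```python
# Sketch: check whether the start of an image carries the BFS magic number.
import struct

BFS_MAGIC = 0x1BADFACE


def looks_like_bfs(image_start):
    """Return True if the first 4 bytes decode to the BFS magic number."""
    if len(image_start) < 4:
        return False
    (magic,) = struct.unpack("<I", image_start[:4])   # little-endian u32
    return magic == BFS_MAGIC
```

In practice the argument would be the first few bytes read from stand.img in
binary mode.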
53
54If you have any patches, questions or suggestions regarding this BFS
55implementation please contact the author:
56
57Tigran A. Aivazian <tigran@veritas.com>
diff --git a/Documentation/filesystems/cifs.txt b/Documentation/filesystems/cifs.txt
new file mode 100644
index 000000000000..49cc923a93e3
--- /dev/null
+++ b/Documentation/filesystems/cifs.txt
@@ -0,0 +1,51 @@
1 This is the client VFS module for the Common Internet File System
2 (CIFS) protocol which is the successor to the Server Message Block
3 (SMB) protocol, the native file sharing mechanism for most early
4 PC operating systems. CIFS is fully supported by current network
5 file servers such as Windows 2000, Windows 2003 (including
6 Windows XP) as well as by Samba (which provides excellent CIFS
7 server support for Linux and many other operating systems), so
8 this network filesystem client can mount to a wide variety of
9 servers. The smbfs module should be used instead of this cifs module
10 for mounting to older SMB servers such as OS/2. The smbfs and cifs
11 modules can coexist and do not conflict. The CIFS VFS filesystem
12 module is designed to work well with servers that implement the
13 newer versions (dialects) of the SMB/CIFS protocol such as Samba,
14 the program written by Andrew Tridgell that turns any Unix host
15 into an SMB/CIFS file server.
16
17 The intent of this module is to provide the most advanced network
18 file system function for CIFS compliant servers, including better
19 POSIX compliance, secure per-user session establishment, high
20 performance safe distributed caching (oplock), optional packet
21 signing, large files, Unicode support and other internationalization
22 improvements. Since both Samba server and this filesystem client support
23 the CIFS Unix extensions, the combination can provide a reasonable
24 alternative to NFSv4 for fileserving in some Linux to Linux environments,
25 not just in Linux to Windows environments.
26
27 This filesystem has an optional mount utility (mount.cifs) that can
28 be obtained from the project page and installed in the path in the same
29 directory with the other mount helpers (such as mount.smbfs).
30 Mounting using the cifs filesystem without installing the mount helper
31 requires specifying the server's ip address.
32
33 For Linux 2.4:
34 mount //anything/here /mnt_target -o
35 user=username,pass=password,unc=//ip_address_of_server/sharename
36
37 For Linux 2.5:
38 mount //ip_address_of_server/sharename /mnt_target -o user=username,pass=password
39
40
41 For more information on the module see the project page at
42
43 http://us1.samba.org/samba/Linux_CIFS_client.html
44
45 For more information on CIFS see:
46
47 http://www.snia.org/tech_activities/CIFS
48
49 or the Samba site:
50
51 http://www.samba.org
diff --git a/Documentation/filesystems/coda.txt b/Documentation/filesystems/coda.txt
new file mode 100644
index 000000000000..61311356025d
--- /dev/null
+++ b/Documentation/filesystems/coda.txt
@@ -0,0 +1,1673 @@
1NOTE:
2This is one of the technical documents describing a component of
3Coda -- this document describes the client kernel-Venus interface.
4
5For more information:
6 http://www.coda.cs.cmu.edu
7For user level software needed to run Coda:
8 ftp://ftp.coda.cs.cmu.edu
9
10To run Coda you need to get a user level cache manager for the client,
11named Venus, as well as tools to manipulate ACLs, to log in, etc. The
12client needs to have the Coda filesystem selected in the kernel
13configuration.
14
15The server needs a user level server and at present does not depend on
16kernel support.
17
18
19
20
21
22
23
24 The Venus kernel interface
25 Peter J. Braam
26 v1.0, Nov 9, 1997
27
28 This document describes the communication between Venus and kernel
29 level filesystem code needed for the operation of the Coda file
30 system. This document version is meant to describe the current interface
31 (version 1.0) as well as improvements we envisage.
32 ______________________________________________________________________
33
34 Table of Contents
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90 1. Introduction
91
92 2. Servicing Coda filesystem calls
93
94 3. The message layer
95
96 3.1 Implementation details
97
98 4. The interface at the call level
99
100 4.1 Data structures shared by the kernel and Venus
101 4.2 The pioctl interface
102 4.3 root
103 4.4 lookup
104 4.5 getattr
105 4.6 setattr
106 4.7 access
107 4.8 create
108 4.9 mkdir
109 4.10 link
110 4.11 symlink
111 4.12 remove
112 4.13 rmdir
113 4.14 readlink
114 4.15 open
115 4.16 close
116 4.17 ioctl
117 4.18 rename
118 4.19 readdir
119 4.20 vget
120 4.21 fsync
121 4.22 inactive
122 4.23 rdwr
123 4.24 odymount
124 4.25 ody_lookup
125 4.26 ody_expand
126 4.27 prefetch
127 4.28 signal
128
129 5. The minicache and downcalls
130
131 5.1 INVALIDATE
132 5.2 FLUSH
133 5.3 PURGEUSER
134 5.4 ZAPFILE
135 5.5 ZAPDIR
136 5.6 ZAPVNODE
137 5.7 PURGEFID
138 5.8 REPLACE
139
140 6. Initialization and cleanup
141
142 6.1 Requirements
143
144
145 ______________________________________________________________________
146
147
148 1. Introduction
149
150
151
152 A key component in the Coda Distributed File System is the cache
153 manager, Venus.
154
155
156 When processes on a Coda enabled system access files in the Coda
157 filesystem, requests are directed at the filesystem layer in the
158 operating system. The operating system will communicate with Venus to
159 service the request for the process. Venus manages a persistent
160 client cache and makes remote procedure calls to Coda file servers and
161 related servers (such as authentication servers) to service these
162 requests it receives from the operating system. When Venus has
163 serviced a request it replies to the operating system with appropriate
164 return codes, and other data related to the request. Optionally the
165 kernel support for Coda may maintain a minicache of recently processed
166 requests to limit the number of interactions with Venus. Venus
167 possesses the facility to inform the kernel when elements from its
168 minicache are no longer valid.
169
170 This document describes precisely this communication between the
171 kernel and Venus. The definitions of so called upcalls and downcalls
172 will be given with the format of the data they handle. We shall also
173 describe the semantic invariants resulting from the calls.
174
175 Historically Coda was implemented in a BSD file system in Mach 2.6.
176 The interface between the kernel and Venus is very similar to the BSD
177 VFS interface. Similar functionality is provided, and the format of
178 the parameters and returned data is very similar to the BSD VFS. This
179 leads to an almost natural environment for implementing a kernel-level
180 filesystem driver for Coda in a BSD system. However, other operating
181 systems such as Linux, Windows 95 and Windows NT have virtual filesystems
182 with different interfaces.
183
184 To implement Coda on these systems some reverse engineering of the
185 Venus/Kernel protocol is necessary. Also it came to light that other
186 systems could profit significantly from certain small optimizations
187 and modifications to the protocol. To facilitate this work as well as
188 to make future ports easier, communication between Venus and the
189 kernel should be documented in great detail. This is the aim of this
190 document.
191
192
193
194 2. Servicing Coda filesystem calls
195
196 The service of a request for a Coda file system service originates in
197 a process P which accesses a Coda file. It makes a system call which
198 traps to the OS kernel. Examples of such calls trapping to the kernel
199 are read, write, open, close, create, mkdir, rmdir, chmod in a Unix
200 context. Similar calls exist in the Win32 environment, such as
201 CreateFile.
202
203 Generally the operating system handles the request in a virtual
204 filesystem (VFS) layer, which is named I/O Manager in NT and IFS
205 manager in Windows 95. The VFS is responsible for partial processing
206 of the request and for locating the specific filesystem(s) which will
207 service parts of the request. Usually the information in the path
208 assists in locating the correct FS drivers. Sometimes after extensive
209 pre-processing, the VFS starts invoking exported routines in the FS
210 driver. This is the point where the FS specific processing of the
211 request starts, and here the Coda specific kernel code comes into
212 play.
213
214 The FS layer for Coda must expose and implement several interfaces.
215 First and foremost the VFS must be able to make all necessary calls to
216 the Coda FS layer, so the Coda FS driver must expose the VFS interface
217 as applicable in the operating system. These differ very significantly
218 among operating systems, but share features such as facilities to
219 read/write and create and remove objects. The Coda FS layer services
220 such VFS requests by invoking one or more well defined services
221 offered by the cache manager Venus. When the replies from Venus have
222 come back to the FS driver, servicing of the VFS call continues and
223 finishes with a reply to the kernel's VFS. Finally the VFS layer
224 returns to the process.
225
226 As a result of this design a basic interface exposed by the FS driver
227 must allow Venus to manage message traffic. In particular Venus must
228 be able to retrieve and place messages and to be notified of the
229 arrival of a new message. The notification must be through a mechanism
230 which does not block Venus since Venus must attend to other tasks even
231 when no messages are waiting or being processed.
232
233
234
235
236
237
238 Interfaces of the Coda FS Driver
239
240 Furthermore the FS layer provides for a special path of communication
241 between a user process and Venus, called the pioctl interface. The
242 pioctl interface is used for Coda specific services, such as
243 requesting detailed information about the persistent cache managed by
244 Venus. Here the involvement of the kernel is minimal. It identifies
245 the calling process and passes the information on to Venus. When
246 Venus replies the response is passed back to the caller in unmodified
247 form.
248
249 Finally Venus allows the kernel FS driver to cache the results from
250 certain services. This is done to avoid excessive context switches
251 and results in an efficient system. However, Venus may acquire
252 information, for example from the network, which implies that cached
253 information must be flushed or replaced. Venus then makes a downcall
254 to the Coda FS layer to request flushes or updates in the cache. The
255 kernel FS driver handles such requests synchronously.
256
257 Among these interfaces the VFS interface and the facility to place,
258 receive and be notified of messages are platform specific. We will
259 not go into the calls exported to the VFS layer but we will state the
260 requirements of the message exchange mechanism.
261
262
263
264 3. The message layer
265
266
267
268 At the lowest level the communication between Venus and the FS driver
269 proceeds through messages. The synchronization between processes
270 requesting Coda file service and Venus relies on blocking and waking
271 up processes. The Coda FS driver processes VFS- and pioctl-requests
272 on behalf of a process P, creates messages for Venus, awaits replies
273 and finally returns to the caller. The implementation of the exchange
274 of messages is platform specific, but the semantics have (so far)
275 appeared to be generally applicable. Data buffers are created by the
276 FS Driver in kernel memory on behalf of P and copied to user memory in
277 Venus.
278
279 The FS Driver while servicing P makes upcalls to Venus. Such an
280 upcall is dispatched to Venus by creating a message structure. The
281 structure contains the identification of P, the message sequence
282 number, the size of the request and a pointer to the data in kernel
283 memory for the request. Since the data buffer is re-used to hold the
284 reply from Venus, there is a field for the size of the reply. A flags
285 field is used in the message to precisely record the status of the
286 message. Additional platform dependent structures involve pointers to
287 determine the position of the message on queues and pointers to
288 synchronization objects. In the upcall routine the message structure
289 is filled in, flags are set to 0, and it is placed on the pending
290 queue. The routine calling upcall is responsible for allocating the
291 data buffer; its structure will be described in the next section.
292
293 A facility must exist to notify Venus that the message has been
294 created; it is implemented using available synchronization objects in
295 the OS. This notification is done in the upcall context of the process
296 P. When the message is on the pending queue, process P cannot proceed
297 in upcall. The (kernel mode) processing of P in the filesystem
298 request routine must be suspended until Venus has replied. Therefore
299 the calling thread in P is blocked in upcall. A pointer in the
300 message structure will locate the synchronization object on which P is
301 sleeping.
302
303 Venus detects the notification that a message has arrived, and the FS
304 driver allows Venus to retrieve the message with a getmsg_from_kernel
305 call. This action finishes in the kernel by putting the message on the
306 queue of processing messages and setting flags to READ. Venus is
307 passed the contents of the data buffer. The getmsg_from_kernel call
308 now returns and Venus processes the request.
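The bookkeeping described so far can be sketched in C. This is an illustration only: the structure layout, the numeric flag values and the helper names (upcall_post, getmsg_step) are assumptions made for the example, and the blocking of P is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Message status; names follow the text, numeric values are assumed. */
enum { MSG_QUEUED = 0, MSG_READ = 1, MSG_WRITTEN = 2 };

struct upcall_msg {
    int pid;                  /* identification of the calling process P */
    unsigned long seq;        /* message sequence number */
    size_t in_size;           /* size of the request */
    size_t out_size;          /* size of the reply, set when Venus answers */
    void *data;               /* one buffer, reused for request and reply */
    int flags;                /* status of the message */
    struct upcall_msg *next;  /* queue linkage (platform dependent) */
};

struct msg_queue {
    struct upcall_msg *head;
    struct upcall_msg *tail;
};

static void queue_append(struct msg_queue *q, struct upcall_msg *m)
{
    m->next = NULL;
    if (q->tail != NULL)
        q->tail->next = m;
    else
        q->head = m;
    q->tail = m;
}

static struct upcall_msg *queue_pop(struct msg_queue *q)
{
    struct upcall_msg *m = q->head;

    if (m != NULL) {
        q->head = m->next;
        if (q->head == NULL)
            q->tail = NULL;
        m->next = NULL;
    }
    return m;
}

/* upcall: fill in the message, clear the flags, queue it as pending. */
static void upcall_post(struct msg_queue *pending, struct upcall_msg *m,
                        int pid, unsigned long seq, void *data, size_t in_size)
{
    assert(pending != NULL && m != NULL);
    m->pid = pid;
    m->seq = seq;
    m->data = data;
    m->in_size = in_size;
    m->out_size = 0;
    m->flags = MSG_QUEUED;
    queue_append(pending, m);
}

/* getmsg_from_kernel: move the oldest pending message to the processing
 * queue and mark it READ. Returns NULL when nothing is pending. */
static struct upcall_msg *getmsg_step(struct msg_queue *pending,
                                      struct msg_queue *processing)
{
    struct upcall_msg *m = queue_pop(pending);

    if (m != NULL) {
        m->flags = MSG_READ;
        queue_append(processing, m);
    }
    return m;
}
```

A real driver would also hold a lock around the queue operations and put P to sleep after upcall_post; both are omitted to keep the sketch short.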
309
310 At some later point the FS driver receives a message from Venus,
311 namely when Venus calls sendmsg_to_kernel. At this moment the Coda FS
312 driver looks at the contents of the message and decides if:
313
314
315 o The message is a reply for a suspended thread P. If so it removes
316 the message from the processing queue and marks the message as
317 WRITTEN. Finally, the FS driver unblocks P (still in the kernel
318 mode context of Venus) and the sendmsg_to_kernel call returns to
319 Venus. The process P will be scheduled at some point and continues
320 processing its upcall with the data buffer replaced with the reply
321 from Venus.
322
323 o The message is a downcall. A downcall is a request from Venus to
324 the FS Driver. The FS driver processes the request immediately
325 (usually a cache eviction or replacement) and when it finishes
326 sendmsg_to_kernel returns.
327
328 Now P awakes and continues processing upcall. There are some
329 subtleties to take account of. First P will determine if it was woken
330 up in upcall by a signal from some other source (for example an
331 attempt to terminate P) or as is normally the case by Venus in its
332 sendmsg_to_kernel call. In the normal case, the upcall routine will
333 deallocate the message structure and return. The FS routine can proceed
334 with its processing.
335
336
337
338
339
340
341
342 Sleeping and IPC arrangements
343
344 In case P is woken up by a signal and not by Venus, it will first look
345 at the flags field. If the message is not yet READ, the process P can
346 handle its signal without notifying Venus. If Venus has READ, and
347 the request should not be processed, P can send Venus a signal message
348 to indicate that it should disregard the previous message. Such
349 signals are put in the queue at the head, and read first by Venus. If
350 the message is already marked as WRITTEN it is too late to stop the
351 processing. The VFS routine will now continue. (-- If a VFS request
352 involves more than one upcall, this can lead to complicated state, an
353 extra field "handle_signals" could be added in the message structure
354 to indicate points of no return have been passed.--)
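The three cases for a process woken by a signal can be written as a small decision function. This is a sketch under stated assumptions: the enum names and values are invented for the example, and only the decision itself is taken from the text.

```c
#include <assert.h>

/* Status of the message when P is woken; numeric values are assumed. */
enum msg_status { ST_PENDING = 0, ST_READ = 1, ST_WRITTEN = 2 };

/* What P should do when it was woken by a signal rather than by Venus. */
enum signal_action {
    HANDLE_WITHOUT_NOTIFY,  /* Venus never saw the request */
    SEND_SIGNAL_MESSAGE,    /* ask Venus to disregard the request */
    TOO_LATE                /* reply already written, continue normally */
};

static enum signal_action on_signal_wakeup(enum msg_status status)
{
    switch (status) {
    case ST_PENDING:
        return HANDLE_WITHOUT_NOTIFY;
    case ST_READ:
        return SEND_SIGNAL_MESSAGE;
    case ST_WRITTEN:
        return TOO_LATE;
    }
    assert(0 && "unknown message status");
    return TOO_LATE;
}
```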
355
356
357
358 3.1. Implementation details
359
360 The Unix implementation of this mechanism has been through the
361 implementation of a character device associated with Coda. Venus
362 retrieves messages by doing a read on the device, replies are sent
363 with a write and notification is through the select system call on the
364 file descriptor for the device. The process P is kept waiting on an
365 interruptible wait queue object.
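On the Venus side, the read/write/select pattern might look roughly like the following. This is a hedged sketch, not Venus code: serve_one, handler_fn and echo_handler are hypothetical names, the 2048 byte buffers are an arbitrary choice, and no device path is assumed. The function takes separate read and write descriptors so it can be exercised with pipes; for a character device both would be the same descriptor.

```c
#include <assert.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Builds a reply for a request; returns the reply length in bytes. */
typedef size_t (*handler_fn)(const char *req, size_t req_len,
                             char *reply, size_t reply_cap);

/* Example handler: send the request back unchanged. */
static size_t echo_handler(const char *req, size_t req_len,
                           char *reply, size_t reply_cap)
{
    size_t n = req_len < reply_cap ? req_len : reply_cap;

    memcpy(reply, req, n);
    return n;
}

/*
 * Wait up to timeout_ms for a message on rfd, read it, and write the
 * handler's reply to wfd. Returns 1 if a message was served, 0 if
 * nothing arrived in time, -1 on error.
 */
static int serve_one(int rfd, int wfd, int timeout_ms, handler_fn handle)
{
    fd_set rfds;
    struct timeval tv;
    char req[2048], reply[2048];
    ssize_t n;
    size_t out;
    int ready;

    assert(handle != NULL);
    FD_ZERO(&rfds);
    FD_SET(rfd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* select lets the caller attend to other work when nothing is queued. */
    ready = select(rfd + 1, &rfds, NULL, NULL, &tv);
    if (ready <= 0)
        return ready;

    n = read(rfd, req, sizeof req);
    if (n <= 0)
        return -1;

    out = handle(req, (size_t)n, reply, sizeof reply);
    if (write(wfd, reply, out) != (ssize_t)out)
        return -1;
    return 1;
}
```

Using select with a timeout keeps the loop from blocking indefinitely, which matches the requirement that notification must not block Venus.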
366
367 In Windows NT and the DPMI Windows 95 implementation a DeviceIoControl
368 call is used. The DeviceIoControl call is designed to copy buffers
369 from user memory to kernel memory with OPCODES. The sendmsg_to_kernel
370 is issued as a synchronous call, while the getmsg_from_kernel call is
371 asynchronous. Windows EventObjects are used for notification of
372 message arrival. The process P is kept waiting on a KernelEvent
373 object in NT and a semaphore in Windows 95.
374
375
376
377 4. The interface at the call level
378
379
380 This section describes the upcalls a Coda FS driver can make to Venus.
381 Each of these upcalls makes use of two structures: inputArgs and
382 outputArgs. In pseudo BNF form the structures take the following
383 form:
384
385
386 struct inputArgs {
387 u_long opcode;
388 u_long unique; /* Keep multiple outstanding msgs distinct */
389 u_short pid; /* Common to all */
390 u_short pgid; /* Common to all */
391 struct CodaCred cred; /* Common to all */
392
393 <union "in" of call dependent parts of inputArgs>
394 };
395
396 struct outputArgs {
397 u_long opcode;
398 u_long unique; /* Keep multiple outstanding msgs distinct */
399 u_long result;
400
401 <union "out" of call dependent parts of outputArgs>
402 };
403
404
405
406 Before going on let us elucidate the role of the various fields. The
407 inputArgs start with the opcode which defines the type of service
408 requested from Venus. There are approximately 30 upcalls at present
409 which we will discuss. The unique field labels the inputArg with a
410 unique number which will identify the message uniquely. A process and
411 process group id are passed. Finally the credentials of the caller
412 are included.
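A sketch of how the common header fields might be filled in and checked follows. The structures are trimmed stand-ins, the counter-based unique number is one possible scheme, and matching a reply on opcode plus unique is an assumption drawn from the "keep multiple outstanding msgs distinct" comment rather than a documented rule.

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed stand-in for CodaCred (assumed, for illustration only). */
struct cred_sketch {
    unsigned int cr_uid;
    unsigned int cr_gid;
};

/* Fixed part of inputArgs, without the call dependent union. */
struct in_header {
    unsigned long opcode;
    unsigned long unique;
    unsigned short pid;
    unsigned short pgid;
    struct cred_sketch cred;
};

/* Fixed part of outputArgs, without the call dependent union. */
struct out_header {
    unsigned long opcode;
    unsigned long unique;
    unsigned long result;
};

static unsigned long last_unique;   /* starts at 0; not thread safe */

static void fill_in_header(struct in_header *h, unsigned long opcode,
                           unsigned short pid, unsigned short pgid,
                           struct cred_sketch cred)
{
    assert(h != NULL);
    h->opcode = opcode;
    h->unique = ++last_unique;      /* distinct per outstanding message */
    h->pid = pid;
    h->pgid = pgid;
    h->cred = cred;
}

/* Assumed rule: a reply belongs to a request if opcode and unique agree. */
static int reply_matches(const struct in_header *in,
                         const struct out_header *out)
{
    return in->opcode == out->opcode && in->unique == out->unique;
}
```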
413
414 Before delving into the specific calls we need to discuss a variety of
415 data structures shared by the kernel and Venus.
416
417
418
419
420 4.1. Data structures shared by the kernel and Venus
421
422
423 The CodaCred structure defines a variety of user and group ids as
424 they are set for the calling process. The vuid_t and vgid_t are 32 bit
425 unsigned integers. It also defines group membership in an array. On
426 Unix the CodaCred has proven sufficient to implement good security
427 semantics for Coda but the structure may have to undergo modification
428 for the Windows environments when these mature.
429
430 struct CodaCred {
431 vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid*/
432 vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */
433 vgid_t cr_groups[NGROUPS]; /* Group membership for caller */
434 };
435
436
437
438 NOTE It is questionable if we need CodaCreds in Venus. After all, Venus
439 doesn't know about groups, although it does create files with the
440 default uid/gid. Perhaps the list of group membership is superfluous.
441
442
443 The next item is the fundamental identifier used to identify Coda
444 files, the ViceFid. A fid of a file uniquely defines a file or
445 directory in the Coda filesystem within a cell. (-- A cell is a
446 group of Coda servers acting under the aegis of a single system
447 control machine or SCM. See the Coda Administration manual for a
448 detailed description of the role of the SCM.--)
449
450
451 typedef struct ViceFid {
452 VolumeId Volume;
453 VnodeId Vnode;
454 Unique_t Unique;
455 } ViceFid;
456
457
458
459 Each of the constituent fields: VolumeId, VnodeId and Unique_t are
460 unsigned 32 bit integers. We envisage that a further field will need
461 to be prefixed to identify the Coda cell; this will probably take the
462 form of an IPv6-size IP address naming the Coda cell through DNS.
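Since a fid identifies an object within a cell, two fids name the same object exactly when all three fields agree. A minimal sketch, assuming the three field types are plain 32 bit unsigned integers as stated above (fid_equal is a hypothetical helper name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t VolumeId;
typedef uint32_t VnodeId;
typedef uint32_t Unique_t;

typedef struct ViceFid {
    VolumeId Volume;
    VnodeId  Vnode;
    Unique_t Unique;
} ViceFid;

/* Return 1 if both fids name the same object within a cell, else 0. */
static int fid_equal(const ViceFid *a, const ViceFid *b)
{
    assert(a != NULL && b != NULL);
    return a->Volume == b->Volume &&
           a->Vnode == b->Vnode &&
           a->Unique == b->Unique;
}
```

Comparing field by field avoids depending on padding bytes, which a memcmp over the whole structure would do.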
463
464 The next important structure shared between Venus and the kernel is
465 the attributes of the file. The following structure is used to
466 exchange information. It has room for future extensions such as
467 support for device files (currently not present in Coda).
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486 struct coda_vattr {
487 enum coda_vtype va_type; /* vnode type (for create) */
488 u_short va_mode; /* files access mode and type */
489 short va_nlink; /* number of references to file */
490 vuid_t va_uid; /* owner user id */
491 vgid_t va_gid; /* owner group id */
492 long va_fsid; /* file system id (dev for now) */
493 long va_fileid; /* file id */
494 u_quad_t va_size; /* file size in bytes */
495 long va_blocksize; /* blocksize preferred for i/o */
496 struct timespec va_atime; /* time of last access */
497 struct timespec va_mtime; /* time of last modification */
498 struct timespec va_ctime; /* time file changed */
499 u_long va_gen; /* generation number of file */
500 u_long va_flags; /* flags defined for file */
501 dev_t va_rdev; /* device special file represents */
502 u_quad_t va_bytes; /* bytes of disk space held by file */
503 u_quad_t va_filerev; /* file modification number */
504 u_int va_vaflags; /* operations flags, see below */
505 long va_spare; /* remain quad aligned */
506 };
507
508
509
510
511 4.2. The pioctl interface
512
513
514 Coda specific requests can be made by applications through the pioctl
515 interface. The pioctl is implemented as an ordinary ioctl on a
516 fictitious file /coda/.CONTROL. The pioctl call opens this file, gets
517 a file handle and makes the ioctl call. Finally it closes the file.
518
519 The kernel involvement in this is limited to providing the facility to
520 open and close and pass the ioctl message and to verify that a path in
521 the pioctl data buffers is a file in a Coda filesystem.
522
523 The kernel is handed a data packet of the form:
524
525 struct {
526 const char *path;
527 struct ViceIoctl vidata;
528 int follow;
529 } data;
530
531
532
533 where
534
535
536 struct ViceIoctl {
537 caddr_t in, out; /* Data to be transferred in, or out */
538 short in_size; /* Size of input buffer <= 2K */
539 short out_size; /* Maximum size of output buffer, <= 2K */
540 };
541
542
543
544 The path must be a Coda file, otherwise the ioctl upcall will not be
545 made.
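The size limits in the ViceIoctl comments suggest a check along the following lines. This is only a sketch: vice_ioctl_sizes_ok is a hypothetical helper, "2K" is taken to mean 2048 bytes, caddr_t is replaced by char * to keep the example portable, and the NULL pointer checks are an added precaution not stated in the text.

```c
#include <assert.h>
#include <stddef.h>

#define PIOCTL_MAX_BUF 2048   /* "2K" from the comments, assumed 2048 bytes */

struct ViceIoctl {
    char *in, *out;     /* caddr_t in the text; char * used here */
    short in_size;      /* size of input buffer */
    short out_size;     /* maximum size of output buffer */
};

/* Return 1 if both buffer sizes are within the documented limit. */
static int vice_ioctl_sizes_ok(const struct ViceIoctl *v)
{
    assert(v != NULL);
    if (v->in_size < 0 || v->in_size > PIOCTL_MAX_BUF)
        return 0;
    if (v->out_size < 0 || v->out_size > PIOCTL_MAX_BUF)
        return 0;
    /* Added precaution: a non-empty buffer needs a non-NULL pointer. */
    if (v->in_size > 0 && v->in == NULL)
        return 0;
    if (v->out_size > 0 && v->out == NULL)
        return 0;
    return 1;
}
```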
546
547 NOTE The data structures and code are a mess. We need to clean this
548 up.
549
550 We now proceed to document the individual calls:
551
552
553
554 4.3. root
555
556
557 Arguments
558
559 in empty
560
561 out
562
563 struct cfs_root_out {
564 ViceFid VFid;
565 } cfs_root;
566
567
568
569 Description This call is made to Venus during the initialization of
570 the Coda filesystem. If the result is zero, the cfs_root structure
571 contains the ViceFid of the root of the Coda filesystem. If a non-zero
572 result is generated, its value is a platform dependent error code
573 indicating the difficulty Venus encountered in locating the root of
574 the Coda filesystem.
575
576
577
578 4.4. lookup
579
580
581 Summary Find the ViceFid and type of an object in a directory if it
582 exists.
583
584 Arguments
585
586 in
587
588 struct cfs_lookup_in {
589 ViceFid VFid;
590 char *name; /* Place holder for data. */
591 } cfs_lookup;
592
593
594
595 out
596
597 struct cfs_lookup_out {
598 ViceFid VFid;
599 int vtype;
600 } cfs_lookup;
601
602
603
604 Description This call is made to determine the ViceFid and filetype of
605 a directory entry. The requested entry is identified by the name field,
606 and Venus will search the directory identified by cfs_lookup_in.VFid.
607 The result may indicate that the name does not exist, or that
608 difficulty was encountered in finding it (e.g. due to disconnection).
609 If the result is zero, the field cfs_lookup_out.VFid contains the
610 target's ViceFid and cfs_lookup_out.vtype the coda_vtype giving the
611 type of object the name designates.
612
613 The name of the object is an 8 bit character string of maximum length
614 CFS_MAXNAMLEN, currently set to 256 (including a 0 terminator.)
615
616 It is extremely important to realize that Venus bitwise ORs the field
617 cfs_lookup_out.vtype with CFS_NOCACHE to indicate that the object should
618 not be put in the kernel name cache.
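A kernel driver that honours this flag has to separate it from the object type before using either. The sketch below shows one way to do that. The numeric value of the flag is an assumption chosen for the example (the text does not give it), as are the helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: a single high bit, chosen only for this example. */
#define CFS_NOCACHE_ASSUMED UINT32_C(0x80000000)

/* Return 1 if the lookup result may be entered in the name cache. */
static int lookup_may_cache(uint32_t vtype)
{
    return (vtype & CFS_NOCACHE_ASSUMED) == 0;
}

/* Return the object type with the no-cache flag stripped. */
static uint32_t lookup_object_type(uint32_t vtype)
{
    /* The mask only works if the flag is a single bit. */
    assert((CFS_NOCACHE_ASSUMED & (CFS_NOCACHE_ASSUMED - 1)) == 0);
    return vtype & ~CFS_NOCACHE_ASSUMED;
}
```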
619
620 NOTE The type of the vtype is currently wrong. It should be
621 coda_vtype. Linux does not take note of CFS_NOCACHE. It should.
622
623
624
625 4.5. getattr
626
627
628 Summary Get the attributes of a file.
629
630 Arguments
631
632 in
633
634 struct cfs_getattr_in {
635 ViceFid VFid;
636 struct coda_vattr attr; /* XXXXX */
637 } cfs_getattr;
638
639
640
641 out
642
643 struct cfs_getattr_out {
644 struct coda_vattr attr;
645 } cfs_getattr;
646
647
648
649 Description This call returns the attributes of the file identified by
650 fid.
651
652 Errors Errors can occur if the object with fid does not exist, is
653 inaccessible or if the caller does not have permission to fetch
654 attributes.
655
656 Note Many kernel FS drivers (Linux, NT and Windows 95) need to acquire
657 the attributes as well as the Fid for the instantiation of an internal
658 "inode" or "FileHandle". A significant improvement in performance on
659 such systems could be made by combining the lookup and getattr calls
660 both at the Venus/kernel interaction level and at the RPC level.
661
662 The vattr structure included in the input arguments is superfluous and
663 should be removed.
664
665
666
667 4.6. setattr
668
669
670 Summary Set the attributes of a file.
671
672 Arguments
673
674 in
675
676 struct cfs_setattr_in {
677 ViceFid VFid;
678 struct coda_vattr attr;
679 } cfs_setattr;
680
681
682
683
684 out
685 empty
686
687 Description The structure attr is filled with attributes to be changed
688 in BSD style. Attributes not to be changed are set to -1, apart from
689 vtype which is set to VNON. Others are set to the value to be assigned.
690 The only attributes which the FS driver may request to change are the
691 mode, owner, groupid, atime, mtime and ctime. The return value
692 indicates success or failure.
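The "-1 means leave unchanged" convention can be illustrated with a cut-down attribute structure. Everything here is a sketch: vattr_sketch holds only a few of the coda_vattr fields, the value 0 for VNON is assumed, and the helper names are invented.

```c
#include <assert.h>
#include <stddef.h>

#define VNON_ASSUMED 0   /* assumed numeric value of VNON */

/* A few fields of coda_vattr, enough to show the convention. */
struct vattr_sketch {
    int va_type;
    unsigned short va_mode;
    unsigned int va_uid;
    unsigned int va_gid;
};

/* Mark every attribute as "do not change". */
static void vattr_clear(struct vattr_sketch *va)
{
    assert(va != NULL);
    va->va_type = VNON_ASSUMED;
    va->va_mode = (unsigned short)-1;
    va->va_uid = (unsigned int)-1;
    va->va_gid = (unsigned int)-1;
}

/* Return 1 if the caller asked for the mode to be changed. */
static int mode_requested(const struct vattr_sketch *va)
{
    return va->va_mode != (unsigned short)-1;
}

/* Return 1 if the caller asked for the owner to be changed. */
static int uid_requested(const struct vattr_sketch *va)
{
    return va->va_uid != (unsigned int)-1;
}
```

A caller would run vattr_clear first and then assign only the fields it wants changed, for example va.va_mode = 0644 for a chmod.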
693
694 Errors A variety of errors can occur. The object may not exist, may
695 be inaccessible, or permission may not be granted by Venus.
696
697
698
699 4.7. access
700
701
702 Summary
703
704 Arguments
705
706 in
707
708 struct cfs_access_in {
709 ViceFid VFid;
710 int flags;
711 } cfs_access;
712
713
714
715 out
716 empty
717
718 Description Verify if access to the object identified by VFid for
719 operations described by flags is permitted. The result indicates if
720 access will be granted. It is important to remember that Coda uses
721 ACLs to enforce protection and that ultimately the servers, not the
722 clients, enforce the security of the system. The result of this call
723 will depend on whether a token is held by the user.
724
725 Errors The object may not exist, or the ACL describing the protection
726 may not be accessible.
727
728
729
730 4.8. create
731
732
733 Summary Invoked to create a file
734
735 Arguments
736
737 in
738
739 struct cfs_create_in {
740 ViceFid VFid;
741 struct coda_vattr attr;
742 int excl;
743 int mode;
744 char *name; /* Place holder for data. */
745 } cfs_create;
746
747
748
749
750 out
751
752 struct cfs_create_out {
753 ViceFid VFid;
754 struct coda_vattr attr;
755 } cfs_create;
756
757
758
759 Description This upcall is invoked to request creation of a file.
760 The file will be created in the directory identified by VFid, its name
761 will be name, and the mode will be mode. If excl is set an error will
762 be returned if the file already exists. If the size field in attr is
763 set to zero the file will be truncated. The uid and gid of the file
764 are set by converting the CodaCred to a uid using a macro CRTOUID
765 (this macro is platform dependent). Upon success the VFid and
766 attributes of the file are returned. The Coda FS Driver will normally
767 instantiate a vnode, inode or file handle at kernel level for the new
768 object.
769
770
771 Errors A variety of errors can occur. Permissions may be insufficient.
772 If the object exists and is not a file the error EISDIR is returned
773 under Unix.
774
775 NOTE The packing of parameters is very inefficient and appears to
776 indicate confusion between the system call creat and the VFS operation
777 create. The VFS operation create is only called to create new objects.
778 This create call differs from the Unix one in that it is not invoked
779 to return a file descriptor. The truncate and exclusive options,
780 together with the mode, could simply be part of the mode as it is
781 under Unix. There should be no flags argument; this is used in
782 open(2) to return a file descriptor for READ or WRITE mode.
783
784 The attributes of the directory should be returned too, since the size
785 and mtime changed.
786
787
788
789 4.9. mkdir
790
791
792 Summary Create a new directory.
793
794 Arguments
795
796 in
797
798 struct cfs_mkdir_in {
799 ViceFid VFid;
800 struct coda_vattr attr;
801 char *name; /* Place holder for data. */
802 } cfs_mkdir;
803
804
805
806 out
807
808 struct cfs_mkdir_out {
809 ViceFid VFid;
810 struct coda_vattr attr;
811 } cfs_mkdir;
812
813
814
815
816 Description This call is similar to create but creates a directory.
817 Only the mode field in the input parameters is used for creation.
818 Upon successful creation, the attr returned contains the attributes of
819 the new directory.
820
821 Errors As for create.
822
823 NOTE The input parameter should be changed to mode instead of
824 attributes.
825
826 The attributes of the parent should be returned since the size and
827 mtime change.
828
829
830
831 4.10. link
832
833
834 Summary Create a link to an existing file.
835
836 Arguments
837
838 in
839
840 struct cfs_link_in {
841 ViceFid sourceFid; /* cnode to link *to* */
842 ViceFid destFid; /* Directory in which to place link */
843 char *tname; /* Place holder for data. */
844 } cfs_link;
845
846
847
848 out
849 empty
850
851 Description This call creates a link to the sourceFid in the directory
852 identified by destFid with name tname. The source must reside in the
853 target's parent, i.e. the source must have parent destFid; Coda does
854 not support cross directory hard links. Only the return value is
855 relevant. It indicates success or the type of failure.
856
857 Errors The usual errors can occur.
858
859 4.11. symlink
860
861
862 Summary Create a symbolic link.
863
864 Arguments
865
866 in
867
868 struct cfs_symlink_in {
869 ViceFid VFid; /* Directory to put symlink in */
870 char *srcname;
871 struct coda_vattr attr;
872 char *tname;
873 } cfs_symlink;
874
875
876
877 out
878 none
879
880 Description Create a symbolic link. The link is to be placed in the
881 directory identified by VFid and named tname. It should point to the
882 pathname srcname. The attributes of the newly created object are to
883 be set to attr.
884
885 Errors
886
887 NOTE The attributes of the target directory should be returned since
888 its size changed.
889
890
891
892 4.12. remove
893
894
895 Summary Remove a file
896
897 Arguments
898
899 in
900
901 struct cfs_remove_in {
902 ViceFid VFid;
903 char *name; /* Place holder for data. */
904 } cfs_remove;
905
906
907
908 out
909 none
910
911 Description Remove the file named cfs_remove_in.name in the directory
912 identified by VFid.
913
914 Errors
915
916 NOTE The attributes of the directory should be returned since its
917 mtime and size may change.
918
919
920
921 4.13. rmdir
922
923
924 Summary Remove a directory
925
926 Arguments
927
928 in
929
930 struct cfs_rmdir_in {
931 ViceFid VFid;
932 char *name; /* Place holder for data. */
933 } cfs_rmdir;
934
935
936
937 out
938 none
939
940 Description Remove the directory named by name from the directory
941 identified by VFid.
942
943 Errors
944
945 NOTE The attributes of the parent directory should be returned since
946 its mtime and size may change.
947
948
949
950 4.14. readlink
951
952
953 Summary Read the value of a symbolic link.
954
955 Arguments
956
957 in
958
959 struct cfs_readlink_in {
960 ViceFid VFid;
961 } cfs_readlink;
962
963
964
965 out
966
967 struct cfs_readlink_out {
968 int count;
969 caddr_t data; /* Place holder for data. */
970 } cfs_readlink;
971
972
973
974 Description This routine reads the contents of the symbolic link
975 identified by VFid into the buffer data. The buffer data must be able
976 to hold any name up to CFS_MAXNAMLEN (PATH or NAM??).
977
978 Errors No unusual errors.
979
981
982 4.15. open
983
984
985 Summary Open a file.
986
987 Arguments
988
989 in
990
991 struct cfs_open_in {
992 ViceFid VFid;
993 int flags;
994 } cfs_open;
995
996
997
998 out
999
1000 struct cfs_open_out {
1001 dev_t dev;
1002 ino_t inode;
1003 } cfs_open;
1004
1005
1006
1007 Description This request asks Venus to place the file identified by
1008 VFid in its cache and to note that the calling process wishes to open
1009 it with flags as in open(2). The return value to the kernel differs
1010 for Unix and Windows systems. For Unix systems the Coda FS Driver is
1011 informed of the device and inode number of the container file in the
1012 fields dev and inode. For Windows the path of the container file is
1013 returned to the kernel.
1014 Errors
1015
1016 NOTE Currently the cfs_open_out structure is not properly adapted to
1017 deal with the Windows case. It might be best to implement two
1018 upcalls, one to open aiming at a container file name, the other at a
1019 container file inode.
1020
1022
1023 4.16. close
1024
1025
1026 Summary Close a file, update it on the servers.
1027
1028 Arguments
1029
1030 in
1031
1032 struct cfs_close_in {
1033 ViceFid VFid;
1034 int flags;
1035 } cfs_close;
1036
1037
1038
1039 out
1040 none
1041
1042 Description Close the file identified by VFid.
1043
1044 Errors
1045
1046 NOTE The flags argument is bogus and not used. However, Venus' code
1047 has room to deal with an execp input field; this field should probably
1048 be used to inform Venus that the file was closed but is still memory
1049 mapped for execution. There are comments about fetching versus not
1050 fetching the data in Venus vproc_vfscalls. This seems silly. If a
1051 file is being closed, the data in the container file is to be the new
1052 data. Here again the execp flag might be in play to create confusion:
1053 currently Venus might think a file can be flushed from the cache when
1054 it is still memory mapped. This needs to be understood.
1055
1057
1058 4.17. ioctl
1059
1060
1061 Summary Do an ioctl on a file. This includes the pioctl interface.
1062
1063 Arguments
1064
1065 in
1066
1067 struct cfs_ioctl_in {
1068 ViceFid VFid;
1069 int cmd;
1070 int len;
1071 int rwflag;
1072 char *data; /* Place holder for data. */
1073 } cfs_ioctl;
1074
1075
1076
1077 out
1078
1079
1080 struct cfs_ioctl_out {
1081 int len;
1082 caddr_t data; /* Place holder for data. */
1083 } cfs_ioctl;
1084
1085
1086
1087 Description Do an ioctl operation on a file. The command, len and
1088 data arguments are filled as usual. flags is not used by Venus.
1089
1090 Errors
1091
1092 NOTE Another bogus parameter. flags is not used. What is the
1093 business about PREFETCHING in the Venus code?
1094
1095
1097
1098 4.18. rename
1099
1100
1101 Summary Rename a fid.
1102
1103 Arguments
1104
1105 in
1106
1107 struct cfs_rename_in {
1108 ViceFid sourceFid;
1109 char *srcname;
1110 ViceFid destFid;
1111 char *destname;
1112 } cfs_rename;
1113
1114
1115
1116 out
1117 none
1118
1119 Description Rename the object with name srcname in directory
1120 sourceFid to destname in destFid. It is important that the names
1121 srcname and destname are null-terminated strings. Strings in Unix
1122 kernels are not always null terminated.
1123
1124 Errors
1125
1127
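The null-termination requirement in the rename description can be
sketched as follows. This is a minimal illustration, not code from the
Coda sources: the helper name copy_name_terminated and its signature are
hypothetical, and it assumes the kernel hands the driver a name as a
pointer plus a length.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Coda): copy a length-delimited name,
 * which may lack a terminator, into dst and always terminate it.
 * Returns 0 on success, -1 if the name plus terminator does not fit.
 */
static int copy_name_terminated(char *dst, size_t dstsize,
                                const char *name, size_t namelen)
{
    if (dstsize == 0 || namelen >= dstsize)
        return -1;              /* no room for the name and the NUL */
    memcpy(dst, name, namelen);
    dst[namelen] = '\0';        /* the upcall needs a terminated string */
    return 0;
}
```

A driver would apply such a step to both srcname and destname before
placing them in the upcall message.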
1128 4.19. readdir
1129
1130
1131 Summary Read directory entries.
1132
1133 Arguments
1134
1135 in
1136
1137 struct cfs_readdir_in {
1138 ViceFid VFid;
1139 int count;
1140 int offset;
1141 } cfs_readdir;
1142
1143
1144
1145
1146 out
1147
1148 struct cfs_readdir_out {
1149 int size;
1150 caddr_t data; /* Place holder for data. */
1151 } cfs_readdir;
1152
1153
1154
1155 Description Read directory entries from VFid starting at offset and
1156 read at most count bytes. Returns the data in data and returns
1157 the size in size.
1158
1159 Errors
1160
1161 NOTE This call is not used. Readdir operations exploit container
1162 files. We will re-evaluate this during the directory revamp which is
1163 about to take place.
1164
1166
1167 4.20. vget
1168
1169
1170 Summary Instructs Venus to do an FSDB->Get.
1171
1172 Arguments
1173
1174 in
1175
1176 struct cfs_vget_in {
1177 ViceFid VFid;
1178 } cfs_vget;
1179
1180
1181
1182 out
1183
1184 struct cfs_vget_out {
1185 ViceFid VFid;
1186 int vtype;
1187 } cfs_vget;
1188
1189
1190
1191 Description This upcall asks Venus to do a get operation on an fsobj
1192 labelled by VFid.
1193
1194 Errors
1195
1196 NOTE This operation is not used. However, it is extremely useful
1197 since it can be used to deal with read/write memory mapped files.
1198 These can be "pinned" in the Venus cache using vget and released with
1199 inactive.
1200
1202
1203 4.21. fsync
1204
1205
1206 Summary Tell Venus to update the RVM attributes of a file.
1207
1208 Arguments
1209
1210 in
1211
1212 struct cfs_fsync_in {
1213 ViceFid VFid;
1214 } cfs_fsync;
1215
1216
1217
1218 out
1219 none
1220
1221 Description Ask Venus to update RVM attributes of object VFid. This
1222 should be called as part of kernel level fsync type calls. The
1223 result indicates if the syncing was successful.
1224
1225 Errors
1226
1227 NOTE Linux does not implement this call. It should.
1228
1230
1231 4.22. inactive
1232
1233
1234 Summary Tell Venus a vnode is no longer in use.
1235
1236 Arguments
1237
1238 in
1239
1240 struct cfs_inactive_in {
1241 ViceFid VFid;
1242 } cfs_inactive;
1243
1244
1245
1246 out
1247 none
1248
1249 Description This operation returns EOPNOTSUPP.
1250
1251 Errors
1252
1253 NOTE This should perhaps be removed.
1254
1256
1257 4.23. rdwr
1258
1259
1260 Summary Read or write from a file.
1261
1262 Arguments
1263
1264 in
1265
1266 struct cfs_rdwr_in {
1267 ViceFid VFid;
1268 int rwflag;
1269 int count;
1270 int offset;
1271 int ioflag;
1272 caddr_t data; /* Place holder for data. */
1273 } cfs_rdwr;
1274
1275
1276
1277
1278 out
1279
1280 struct cfs_rdwr_out {
1281 int rwflag;
1282 int count;
1283 caddr_t data; /* Place holder for data. */
1284 } cfs_rdwr;
1285
1286
1287
1288 Description This upcall asks Venus to read or write from a file.
1289
1290 Errors
1291
1292 NOTE It should be removed, since it violates the Coda philosophy that
1293 read/write operations never reach Venus. I have been told the
1294 operation does not work. It is not currently used.
1295
1296
1298
1299 4.24. odymount
1300
1301
1302 Summary Allows mounting multiple Coda "filesystems" on one Unix mount
1303 point.
1304
1305 Arguments
1306
1307 in
1308
1309 struct ody_mount_in {
1310 char *name; /* Place holder for data. */
1311 } ody_mount;
1312
1313
1314
1315 out
1316
1317 struct ody_mount_out {
1318 ViceFid VFid;
1319 } ody_mount;
1320
1321
1322
1323 Description Asks Venus to return the rootfid of a Coda system named
1324 name. The fid is returned in VFid.
1325
1326 Errors
1327
1328 NOTE This call was used by David for dynamic sets. It should be
1329 removed since it causes a jungle of pointers in the VFS mounting area.
1330 It is not used by Coda proper. The call is not implemented by Venus.
1331
1333
1334 4.25. ody_lookup
1335
1336
1337 Summary Looks up something.
1338
1339 Arguments
1340
1341 in irrelevant
1342
1343
1344 out
1345 irrelevant
1346
1347 Description
1348
1349 Errors
1350
1351 NOTE Gut it. Call is not implemented by Venus.
1352
1354
1355 4.26. ody_expand
1356
1357
1358 Summary Expands something in a dynamic set.
1359
1360 Arguments
1361
1362 in irrelevant
1363
1364 out
1365 irrelevant
1366
1367 Description
1368
1369 Errors
1370
1371 NOTE Gut it. Call is not implemented by Venus.
1372
1374
1375 4.27. prefetch
1376
1377
1378 Summary Prefetch a dynamic set.
1379
1380 Arguments
1381
1382 in Not documented.
1383
1384 out
1385 Not documented.
1386
1387 Description Venus worker.cc has support for this call, although it is
1388 noted that it doesn't work. Not surprising, since the kernel does not
1389 have support for it. (ODY_PREFETCH is not a defined operation).
1390
1391 Errors
1392
1393 NOTE Gut it. It isn't working and isn't used by Coda.
1394
1395
1397
1398 4.28. signal
1399
1400
1401 Summary Send Venus a signal about an upcall.
1402
1403 Arguments
1404
1405 in none
1406
1407 out
1408 not applicable.
1409
1410 Description This is an out-of-band upcall to Venus to inform Venus
1411 that the calling process received a signal after Venus read the
1412 message from the input queue. Venus is supposed to clean up the
1413 operation.
1414
1415 Errors No reply is given.
1416
1417 NOTE We need to better understand what Venus needs to clean up and if
1418 it is doing this correctly. We also need to correctly handle situations
1419 with multiple upcalls per system call. It would be important to know
1420 what state changes in Venus take place after an upcall for which the
1421 kernel is responsible for notifying Venus to clean up (e.g. open
1422 definitely is such a state change, but many others may not be).
1423
1425
1426 5. The minicache and downcalls
1427
1428
1429 The Coda FS Driver can cache results of lookup and access upcalls, to
1430 limit the frequency of upcalls. Upcalls carry a price since a process
1431 context switch needs to take place. The counterpart of caching the
1432 information is that Venus will notify the FS Driver that cached
1433 entries must be flushed or renamed.
1434
1435 The kernel code generally has to maintain a structure which links the
1436 internal file handles (called vnodes in BSD, inodes in Linux and
1437 FileHandles in Windows) with the ViceFid's which Venus maintains. The
1438 reason is that frequent translations back and forth are needed in
1439 order to make upcalls and use the results of upcalls. Such linking
1440 objects are called cnodes.
1441
1442 The current minicache implementations have cache entries which record
1443 the following:
1444
1445 1. the name of the file
1446
1447 2. the cnode of the directory containing the object
1448
1449 3. a list of CodaCred's for which the lookup is permitted.
1450
1451 4. the cnode of the object
1452
1453 The lookup call in the Coda FS Driver may request the cnode of the
1454 desired object from the cache, by passing its name, directory and the
1455 CodaCred's of the caller. The cache will return the cnode or indicate
1456 that it cannot be found. The Coda FS Driver must be careful to
1457 invalidate cache entries when it modifies or removes objects.
1458
1459 When Venus obtains information that indicates that cache entries are
1460 no longer valid, it will make a downcall to the kernel. Downcalls are
1461 intercepted by the Coda FS Driver and lead to cache invalidations of
1462 the kind described below. The Coda FS Driver does not return an error
1463 unless the downcall data could not be read into kernel memory.
1464
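The lookup path described above can be sketched as a small table keyed
by (directory cnode, name, credential). This is an illustrative sketch
only: the names nc_entry, nc_enter, nc_lookup and nc_flush are
hypothetical, struct cnode is a placeholder, and the list of CodaCred's
is simplified to a single numeric uid.

```c
#include <stddef.h>
#include <string.h>

#define NC_SIZE    64   /* illustrative table size */
#define NC_NAMELEN 32   /* illustrative name limit */

/* Placeholder for the driver's cnode; only its address matters here. */
struct cnode {
    int id;
};

struct nc_entry {
    int in_use;
    char name[NC_NAMELEN];      /* 1. the name of the file */
    const struct cnode *dir;    /* 2. cnode of the containing directory */
    unsigned int uid;           /* 3. simplified: one credential, not a list */
    const struct cnode *obj;    /* 4. cnode of the object */
};

static struct nc_entry nc_table[NC_SIZE];

/* Record a successful lookup. Returns 0 on success, -1 if it cannot be cached. */
static int nc_enter(const struct cnode *dir, const char *name,
                    unsigned int uid, const struct cnode *obj)
{
    size_t len = strlen(name);
    size_t i;

    if (len >= NC_NAMELEN)
        return -1;
    for (i = 0; i < NC_SIZE; i++) {
        if (!nc_table[i].in_use) {
            memcpy(nc_table[i].name, name, len + 1);
            nc_table[i].dir = dir;
            nc_table[i].uid = uid;
            nc_table[i].obj = obj;
            nc_table[i].in_use = 1;
            return 0;
        }
    }
    return -1;                  /* table full; caller falls back to an upcall */
}

/* Return the cached cnode, or NULL to indicate that an upcall is needed. */
static const struct cnode *nc_lookup(const struct cnode *dir,
                                     const char *name, unsigned int uid)
{
    size_t i;

    for (i = 0; i < NC_SIZE; i++) {
        const struct nc_entry *e = &nc_table[i];

        if (e->in_use && e->dir == dir && e->uid == uid &&
            strcmp(e->name, name) == 0)
            return e->obj;
    }
    return NULL;
}

/* Drop every entry, as a FLUSH downcall would require. */
static void nc_flush(void)
{
    memset(nc_table, 0, sizeof nc_table);
}
```

A linear scan keeps the sketch short; a real minicache would more likely
hash on the directory and name.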
1465
1466 5.1. INVALIDATE
1467
1468
1469 No information is available on this call.
1470
1471
1472 5.2. FLUSH
1473
1474
1475
1476 Arguments None
1477
1478 Summary Flush the name cache entirely.
1479
1480 Description Venus issues this call upon startup and when it dies. This
1481 is to prevent stale cache information being held. Some operating
1482 systems allow the kernel name cache to be switched off dynamically.
1483 When this is done, this downcall is made.
1484
1485
1486 5.3. PURGEUSER
1487
1488
1489 Arguments
1490
1491 struct cfs_purgeuser_out {/* CFS_PURGEUSER is a venus->kernel call */
1492 struct CodaCred cred;
1493 } cfs_purgeuser;
1494
1495
1496
1497 Description Remove all entries in the cache carrying the Cred. This
1498 call is issued when tokens for a user expire or are flushed.
1499
1500
1501 5.4. ZAPFILE
1502
1503
1504 Arguments
1505
1506 struct cfs_zapfile_out { /* CFS_ZAPFILE is a venus->kernel call */
1507 ViceFid CodaFid;
1508 } cfs_zapfile;
1509
1510
1511
1512 Description Remove all entries which have the (dir vnode, name) pair.
1513 This is issued as a result of an invalidation of cached attributes of
1514 a vnode.
1515
1516 NOTE Call is not named correctly in NetBSD and Mach. The minicache
1517 zapfile routine takes different arguments. Linux does not implement
1518 the invalidation of attributes correctly.
1519
1520
1521
1522 5.5. ZAPDIR
1523
1524
1525 Arguments
1526
1527 struct cfs_zapdir_out { /* CFS_ZAPDIR is a venus->kernel call */
1528 ViceFid CodaFid;
1529 } cfs_zapdir;
1530
1531
1532
1533 Description Remove all entries in the cache lying in a directory
1534 CodaFid, and all children of this directory. This call is issued when
1535 Venus receives a callback on the directory.
1536
1537
1538 5.6. ZAPVNODE
1539
1540
1541
1542 Arguments
1543
1544 struct cfs_zapvnode_out { /* CFS_ZAPVNODE is a venus->kernel call */
1545 struct CodaCred cred;
1546 ViceFid VFid;
1547 } cfs_zapvnode;
1548
1549
1550
1551 Description Remove all entries in the cache carrying the cred and VFid
1552 as in the arguments. This downcall is probably never issued.
1553
1554
1555 5.7. PURGEFID
1556
1557
1558 Summary
1559
1560 Arguments
1561
1562 struct cfs_purgefid_out { /* CFS_PURGEFID is a venus->kernel call */
1563 ViceFid CodaFid;
1564 } cfs_purgefid;
1565
1566
1567
1568 Description Flush the attribute for the file. If it is a dir (odd
1569 vnode), purge its children from the namecache and remove the file from the
1570 namecache.
1571
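The "odd vnode" test mentioned in the PURGEFID description can be
sketched as below. The structure layout is an assumption made for
illustration (the text does not define ViceFid's fields), and
vice_fid_sketch and fid_is_dir are hypothetical names.

```c
#include <stdint.h>

/* Assumed layout, for illustration only; the real ViceFid may differ. */
struct vice_fid_sketch {
    uint32_t volume;
    uint32_t vnode;
    uint32_t unique;
};

/* Per the description above, a directory is identified by an odd vnode number. */
static int fid_is_dir(const struct vice_fid_sketch *fid)
{
    return (fid->vnode & 1u) != 0;
}
```

A PURGEFID handler would use such a test to decide whether the children
of the fid must also be purged from the namecache.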
1572
1573
1574 5.8. REPLACE
1575
1576
1577 Summary Replace the Fid's for a collection of names.
1578
1579 Arguments
1580
1581 struct cfs_replace_out { /* cfs_replace is a venus->kernel call */
1582 ViceFid NewFid;
1583 ViceFid OldFid;
1584 } cfs_replace;
1585
1586
1587
1588 Description This routine replaces a ViceFid in the name cache with
1589 another. It was added so that, during reintegration, Venus can replace
1590 temporary fids allocated locally while disconnected with global fids,
1591 even when the reference counts on those fids are not zero.
1592
1594
1595 6. Initialization and cleanup
1596
1597
1598 This section gives brief hints as to desirable features for the Coda
1599 FS Driver at startup and upon shutdown or Venus failures. Before
1600 entering the discussion it is useful to repeat that the Coda FS Driver
1601 maintains the following data:
1602
1603
1604 1. message queues
1605
1606 2. cnodes
1607
1608 3. name cache entries
1609
1610 The name cache entries are entirely private to the driver, so they
1611 can easily be manipulated. The message queues will generally have
1612 clear points of initialization and destruction. The cnodes are
1613 much more delicate. User processes hold reference counts in Coda
1614 filesystems and it can be difficult to clean up the cnodes.
1615
1616 The driver can expect requests through:
1617
1618 1. the message subsystem
1619
1620 2. the VFS layer
1621
1622 3. pioctl interface
1623
1624 Currently the pioctl passes through the VFS for Coda so we can
1625 treat these similarly.
1626
1627
1628 6.1. Requirements
1629
1630
1631 The following requirements should be accommodated:
1632
1633 1. The message queues should have open and close routines. On Unix,
1634 opening and closing the character devices serve as such routines.
1635
1636 o Before opening, no messages can be placed.
1637
1638 o Opening will remove any old messages still pending.
1639
1640 o Close will notify any sleeping processes that their upcall cannot
1641 be completed.
1642
1643 o Close will free all memory allocated by the message queues.
1644
1645
1646 2. At open the namecache shall be initialized to empty state.
1647
1648 3. Before the message queues are open, all VFS operations will fail.
1649 Fortunately this can be achieved by making sure that mounting the
1650 Coda filesystem cannot succeed before opening.
1651
1652 4. After closing of the queues, no VFS operations can succeed. Here
1653 one needs to be careful, since a few operations (lookup,
1654 read/write, readdir) can proceed without upcalls. These must be
1655 explicitly blocked.
1656
1657 5. Upon closing the namecache shall be flushed and disabled.
1658
1659 6. All memory held by cnodes can be freed without relying on upcalls.
1660
1661 7. Unmounting the file system can be done without relying on upcalls.
1662
1663 8. Mounting the Coda filesystem should fail gracefully if Venus cannot
1664 get the rootfid or the attributes of the rootfid. The latter is
1665 best implemented by Venus fetching these objects before attempting
1666 to mount.
1667
1668 NOTE NetBSD in particular, but also Linux, have not implemented the
1669 above requirements fully. For smooth operation this needs to be
1670 corrected.
1671
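The message queue requirements in item 1 can be sketched as a small
state machine. This is a user-space illustration under stated
assumptions: the names msg_queue, mq_open, mq_post and mq_close are
hypothetical, memory comes from malloc rather than a kernel allocator,
and waking sleeping processes on close is only marked by a comment.

```c
#include <stddef.h>
#include <stdlib.h>

struct msg {
    struct msg *next;
    int opcode;
};

struct msg_queue {
    int is_open;
    struct msg *head;
    size_t pending;
};

/* Free every pending message. */
static void mq_drain(struct msg_queue *q)
{
    while (q->head != NULL) {
        struct msg *m = q->head;

        q->head = m->next;
        free(m);
    }
    q->pending = 0;
}

/* Opening removes any old messages still pending. */
static void mq_open(struct msg_queue *q)
{
    mq_drain(q);
    q->is_open = 1;
}

/* Before opening (and after closing), no messages can be placed. */
static int mq_post(struct msg_queue *q, int opcode)
{
    struct msg *m;

    if (!q->is_open)
        return -1;
    m = malloc(sizeof *m);
    if (m == NULL)
        return -1;
    m->opcode = opcode;
    m->next = q->head;
    q->head = m;
    q->pending++;
    return 0;
}

/* Close frees all queue memory; a driver would also wake sleepers here. */
static void mq_close(struct msg_queue *q)
{
    q->is_open = 0;
    mq_drain(q);
}
```

Rejecting posts while the queue is closed is also what makes VFS
operations fail before open and after close, as items 3 and 4 require.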
1672
1673
diff --git a/Documentation/filesystems/cramfs.txt b/Documentation/filesystems/cramfs.txt
new file mode 100644
index 000000000000..31f53f0ab957
--- /dev/null
+++ b/Documentation/filesystems/cramfs.txt
@@ -0,0 +1,76 @@
1
2 Cramfs - cram a filesystem onto a small ROM
3
4cramfs is designed to be simple and small, and to compress things well.
5
6It uses the zlib routines to compress a file one page at a time, and
7allows random page access. The meta-data is not compressed, but is
8expressed in a very terse representation to make it use much less
9disk space than traditional filesystems.
10
11You can't write to a cramfs filesystem (making it compressible and
12compact also makes it _very_ hard to update on-the-fly), so you have to
13create the disk image with the "mkcramfs" utility.
14
15
16Usage Notes
17-----------
18
19File sizes are limited to less than 16MB.
20
21Maximum filesystem size is a little over 256MB. (The last file on the
22filesystem is allowed to extend past 256MB.)
23
24Only the low 8 bits of gid are stored. The current version of
25mkcramfs simply truncates to 8 bits, which is a potential security
26issue.
27
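The gid truncation described above can be illustrated with a short
sketch. The function name stored_gid is hypothetical and this is not
code from mkcramfs; it only shows what keeping the low 8 bits does to a
larger gid.

```c
#include <stdint.h>

/* Keep only the low 8 bits of a gid, as the text says the image stores. */
static uint8_t stored_gid(uint32_t gid)
{
    return (uint8_t)(gid & 0xffu);
}
```

Under this truncation gid 256 is stored as 0, so a file owned by group
256 would appear to belong to group 0 once the image is mounted, which
is the kind of security issue the paragraph warns about.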
28Hard links are supported, but hard linked files
29will still have a link count of 1 in the cramfs image.
30
31Cramfs directories have no `.' or `..' entries. Directories (like
32every other file on cramfs) always have a link count of 1. (There's
33no need to use -noleaf in `find', btw.)
34
35No timestamps are stored in a cramfs, so these default to the epoch
36(1970 GMT). Recently-accessed files may have updated timestamps, but
37the update lasts only as long as the inode is cached in memory, after
38which the timestamp reverts to 1970, i.e. moves backwards in time.
39
40Currently, cramfs must be written and read with architectures of the
41same endianness, and can be read only by kernels with PAGE_CACHE_SIZE
42== 4096. At least the latter of these is a bug, but it hasn't been
43decided what the best fix is. For the moment if you have larger pages
44you can just change the #define in mkcramfs.c, so long as you don't
45mind the filesystem becoming unreadable to future kernels.
46
47
48For /usr/share/magic
49--------------------
50
510 ulelong 0x28cd3d45 Linux cramfs offset 0
52>4 ulelong x size %d
53>8 ulelong x flags 0x%x
54>12 ulelong x future 0x%x
55>16 string >\0 signature "%.16s"
56>32 ulelong x fsid.crc 0x%x
57>36 ulelong x fsid.edition %d
58>40 ulelong x fsid.blocks %d
59>44 ulelong x fsid.files %d
60>48 string >\0 name "%.16s"
61512 ulelong 0x28cd3d45 Linux cramfs offset 512
62>516 ulelong x size %d
63>520 ulelong x flags 0x%x
64>524 ulelong x future 0x%x
65>528 string >\0 signature "%.16s"
66>544 ulelong x fsid.crc 0x%x
67>548 ulelong x fsid.edition %d
68>552 ulelong x fsid.blocks %d
69>556 ulelong x fsid.files %d
70>560 string >\0 name "%.16s"
71
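The magic entries above test for the little-endian value 0x28cd3d45 at
offset 0 or offset 512. A minimal sketch of the same check on an
in-memory image follows; the names find_cramfs, read_le32 and
CRAMFS_MAGIC_SKETCH are hypothetical and are not taken from the kernel
sources.

```c
#include <stddef.h>
#include <stdint.h>

#define CRAMFS_MAGIC_SKETCH 0x28cd3d45u  /* value from the magic entries */

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Return the offset (0 or 512) where the magic is found, or -1 if absent. */
static long find_cramfs(const unsigned char *img, size_t len)
{
    static const size_t offsets[] = { 0, 512 };
    size_t i;

    for (i = 0; i < sizeof offsets / sizeof offsets[0]; i++) {
        size_t off = offsets[i];

        if (len >= off + 4 && read_le32(img + off) == CRAMFS_MAGIC_SKETCH)
            return (long)off;
    }
    return -1;
}
```

Like the magic entries, this only recognizes images written in
little-endian byte order.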
72
73Hacker Notes
74------------
75
76See fs/cramfs/README for filesystem layout and implementation notes.
diff --git a/Documentation/filesystems/devfs/ChangeLog b/Documentation/filesystems/devfs/ChangeLog
new file mode 100644
index 000000000000..e5aba5246d7c
--- /dev/null
+++ b/Documentation/filesystems/devfs/ChangeLog
@@ -0,0 +1,1977 @@
1/* -*- auto-fill -*- */
2===============================================================================
3Changes for patch v1
4
5- creation of devfs
6
7- modified miscellaneous character devices to support devfs
8===============================================================================
9Changes for patch v2
10
11- bug fix with manual inode creation
12===============================================================================
13Changes for patch v3
14
15- bugfixes
16
17- documentation improvements
18
19- created a couple of scripts (one to save&restore a devfs and the
20 other to set up compatibility symlinks)
21
22- devfs support for SCSI discs. New name format is: sd_hHcCiIlL
23===============================================================================
24Changes for patch v4
25
26- bugfix for the directory reading code
27
28- bugfix for compilation with kerneld
29
30- devfs support for generic hard discs
31
32- rationalisation of the various watchdog drivers
33===============================================================================
34Changes for patch v5
35
36- support for mounting directly from entries in the devfs (it doesn't
37 need to be mounted to do this), including the root filesystem.
38 Mounting of swap partitions also works. Hence, now if you set
39 CONFIG_DEVFS_ONLY to 'Y' then you won't be able to access your discs
40 via ordinary device nodes. Naturally, the default is 'N' so that you
41 can still use your old device nodes. If you want to mount from devfs
42 entries, make sure you use: append = "root=/dev/sd_..." in your
43 lilo.conf. It seems LILO looks for the device number (major&minor)
44 and writes that into the kernel image :-(
45
46- support for character memory devices (/dev/null, /dev/zero, /dev/full
47 and so on). Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
48===============================================================================
49Changes for patch v6
50
51- support for subdirectories
52
53- support for symbolic links (created by devfs_mk_symlink(), no
54 support yet for creation via symlink(2))
55
56- SCSI disc naming now cast in stone, with the format:
57 /dev/sd/c0b1t2u3 controller=0, bus=1, ID=2, LUN=3, whole disc
58 /dev/sd/c0b1t2u3p4 controller=0, bus=1, ID=2, LUN=3, 4th partition
59
60- loop devices now appear in devfs
61
62- tty devices, console, serial ports, etc. now appear in devfs
63 Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
64
65- bugs with mounting devfs-only devices now fixed
66===============================================================================
67Changes for patch v7
68
69- SCSI CD-ROMS, tapes and generic devices now appear in devfs
70===============================================================================
71Changes for patch v8
72
73- bugfix with no-rewind SCSI tapes
74
75- RAMDISCs now appear in devfs
76
77- better cleaning up of devfs entries created by various modules
78
79- interface change to <devfs_register>
80===============================================================================
81Changes for patch v9
82
83- the v8 patch was corrupted somehow, which would affect the patch for
84 linux/fs/filesystems.c
85 I've also fixed the v8 patch file on the WWW
86
87- MetaDevices (/dev/md*) should now appear in devfs
88===============================================================================
89Changes for patch v10
90
91- bugfix in meta device support for devfs
92
93- created this ChangeLog file
94
95- added devfs support to the floppy driver
96
97- added support for creating sockets in a devfs
98===============================================================================
99Changes for patch v11
100
101- added DEVFS_FL_HIDE_UNREG flag
102
103- incorporated better patch for ttyname() in libc 5.4.43 from H.J. Lu.
104
105- interface change to <devfs_mk_symlink>
106
107- support for creating symlinks with symlink(2)
108
109- parallel port printer (/dev/lp*) now appears in devfs
110===============================================================================
111Changes for patch v12
112
113- added inode check to <devfs_fill_file> function
114
115- improved devfs support when mounting from devfs
116
117- added call to <<release>> operation when removing swap areas on
118 devfs devices
119
120- increased NR_SUPER to 128 to support large numbers of devfs mounts
121 (for chroot(2) gaols)
122
123- fixed bug in SCSI disc support: was generating incorrect minors if
124 SCSI ID's did not start at 0 and increase by 1
125
126- support symlink traversal when mounting root
127===============================================================================
128Changes for patch v13
129
130- added devfs support to soundcard driver
131 Thanks to Eric Dumas <dumas@linux.eu.org> and
132 C. Scott Ananian <cananian@alumni.princeton.edu>
133
134- added devfs support to the joystick driver
135
136- loop driver now has its own subdirectory "/dev/loop/"
137
138- created <devfs_get_flags> and <devfs_set_flags> functions
139
140- fix problem with SCSI disc compatibility names (sd{a,b,c,d,e,f})
141 which assumes ID's start at 0 and increase by 1. Also only create
142 devfs entries for SCSI disc partitions which actually exist
143 Show new names in partition check
144 Thanks to Jakub Jelinek <jj@sunsite.ms.mff.cuni.cz>
145===============================================================================
146Changes for patch v14
147
148- bug fix in floppy driver: would not compile without
149 CONFIG_DEVFS_FS='Y'
150 Thanks to Jurgen Botz <jbotz@nova.botz.org>
151
152- bug fix in loop driver
153 Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
154
155- do not create devfs entries for printers not configured
156 Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
157
158- do not create devfs entries for serial ports not present
159 Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
160
161- ensure <tty_register_devfs> is exported from tty_io.c
162 Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
163
164- allow unregistering of devfs symlink entries
165
166- fixed bug in SCSI disc naming introduced in last patch version
167===============================================================================
168Changes for patch v15
169
170- ported to kernel 2.1.81
171===============================================================================
172Changes for patch v16
173
174- created <devfs_set_symlink_destination> function
175
176- moved DEVFS_SUPER_MAGIC into header file
177
178- added DEVFS_FL_HIDE flag
179
180- created <devfs_get_maj_min>
181
182- created <devfs_get_handle_from_inode>
183
184- fixed bugs in searching by major&minor
185
186- changed interface to <devfs_unregister>, <devfs_fill_file> and
187 <devfs_find_handle>
188
189- fixed inode times when symlink created with symlink(2)
190
191- change tty driver to do auto-creation of devfs entries
192 Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
193
194- fixed bug in genhd.c: whole disc (non-SCSI) was not registered to
195 devfs
196
197- updated libc 5.4.43 patch for ttyname()
198===============================================================================
199Changes for patch v17
200
201- added CONFIG_DEVFS_TTY_COMPAT
202 Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
203
204- bugfix in devfs support for drivers/char/lp.c
205 Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
206
207- clean up serial driver so that PCMCIA devices unregister correctly
208 Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
209
210- fixed bug in genhd.c: whole disc (non-SCSI) was not registered to
211 devfs [was missing in patch v16]
212
213- updated libc 5.4.43 patch for ttyname() [was missing in patch v16]
214
215- all SCSI devices now registered in /dev/sg
216
217- support removal of devfs entries via unlink(2)
218===============================================================================
219Changes for patch v18
220
221- added floppy/?u720 floppy entry
222
223- fixed kerneld support for entries in devfs subdirectories
224
225- incorporated latest patch for ttyname() in libc 5.4.43 from H.J. Lu.
226===============================================================================
227Changes for patch v19
228
229- bug fix when looking up unregistered entries: kerneld was not called
230
231- fixes for kernel 2.1.86 (now requires 2.1.86)
232===============================================================================
233Changes for patch v20
234
235- only create available floppy entries
236 Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
237
238- new IDE naming scheme following SCSI format (i.e. /dev/id/c0b0t0u0p1
239 instead of /dev/hda1)
240 Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
241
242- new XT disc naming scheme following SCSI format (i.e. /dev/xd/c0t0p1
243 instead of /dev/xda1)
244 Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
245
246- new non-standard CD-ROM names (i.e. /dev/sbp/c#t#)
247 Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
248
249- allow symlink traversal when mounting the root filesystem
250
251- Create entries for MD devices at MD init
252 Thanks to Christophe Leroy <christophe.leroy5@capway.com>
253===============================================================================
254Changes for patch v21
255
256- ported to kernel 2.1.91
257===============================================================================
258Changes for patch v22
259
260- SCSI host number patch ("scsihosts=" kernel option)
261 Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
262===============================================================================
263Changes for patch v23
264
265- Fixed persistence bug with device numbers for manually created
266 device files
267
268- Fixed problem with recreating symlinks with different content
269
270- Added CONFIG_DEVFS_MOUNT (mount devfs on /dev at boot time)
271===============================================================================
272Changes for patch v24
273
274- Switched from CONFIG_KERNELD to CONFIG_KMOD: module autoloading
275 should now work again
276
277- Hide entries which are manually unlinked
278
279- Always invalidate devfs dentry cache when registering entries
280
281- Support removal of devfs directories via rmdir(2)
282
283- Ensure directories created by <devfs_mk_dir> are visible
284
285- Default no access for "other" for floppy device
286===============================================================================
287Changes for patch v25
288
289- Updates to CREDITS file and minor IDE numbering change
290 Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
291
292- Invalidate devfs dentry cache when making directories
293
294- Invalidate devfs dentry cache when removing entries
295
296- More informative message if root FS mount fails when devfs
297 configured
298
299- Fixed persistence bug with fifos
300===============================================================================
301Changes for patch v26
302
303- ported to kernel 2.1.97
304
305- Changed serial directory from "/dev/serial" to "/dev/tts" and
306 "/dev/consoles" to "/dev/vc" to be more friendly to new procps
307===============================================================================
308Changes for patch v27
309
310- Added support for IDE4 and IDE5
311 Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
312
313- Documented "scsihosts=" boot parameter
314
315- Print process command when debugging kerneld/kmod
316
317- Added debugging for register/unregister/change operations
318
319- Added "devfs=" boot options
320
321- Hide unregistered entries by default
322===============================================================================
323Changes for patch v28
324
325- No longer lock/unlock superblock in <devfs_put_super> (cope with
326 recent VFS interface change)
327
328- Do not automatically change ownership/protection of /dev/tty
329
330- Drop negative dentries when they are released
331
332- Manage dcache more efficiently
333===============================================================================
334Changes for patch v29
335
336- Added DEVFS_FL_AUTO_DEVNUM flag
337===============================================================================
338Changes for patch v30
339
340- No longer set unnecessary methods
341
342- Ported to kernel 2.1.99-pre3
343===============================================================================
344Changes for patch v31
345
346- Added PID display to <call_kerneld> debugging message
347
348- Added "diread" and "diwrite" options
349
350- Ported to kernel 2.1.102
351
352- Fixed persistence problem with permissions
353===============================================================================
354Changes for patch v32
355
356- Fixed devfs support in drivers/block/md.c
357===============================================================================
358Changes for patch v33
359
360- Support legacy device nodes
361
362- Fixed bug where recreated inodes were hidden
363
364- New IDE naming scheme: everything is under /dev/ide
365===============================================================================
366Changes for patch v34
367
368- Improved debugging in <get_vfs_inode>
369
370- Prevent duplicate calls to <devfs_mk_dir> in SCSI layer
371
372- No longer free old dentries in <devfs_mk_dir>
373
374- Free all dentries for a given entry when deleting inodes
375===============================================================================
376Changes for patch v35
377
378- Ported to kernel 2.1.105 (sound driver changes)
379===============================================================================
380Changes for patch v36
381
382- Fixed sound driver port
383===============================================================================
384Changes for patch v37
385
386- Minor documentation tweaks
387===============================================================================
388Changes for patch v38
389
390- More documentation tweaks
391
392- Fix for sound driver port
393
394- Removed ttyname-patch (grab libc 5.4.44 instead)
395
396- Ported to kernel 2.1.107-pre2 (loop driver fix)
397===============================================================================
398Changes for patch v39
399
400- Ported to kernel 2.1.107 (hd.c hunk broke due to spelling "fixes"). Sigh
401
402- Removed many #ifdef's, replaced with trickery in include/devfs_fs.h
403===============================================================================
404Changes for patch v40
405
406- Fix for sound driver port
407
408- Limit auto-device numbering to majors 128 to 239
409===============================================================================
410Changes for patch v41
411
412- Fixed inode times persistence problem
413===============================================================================
414Changes for patch v42
415
416- Ported to kernel 2.1.108 (drivers/scsi/hosts.c hunk broke)
417===============================================================================
418Changes for patch v43
419
420- Fixed spelling in <devfs_readlink> debug
421
422- Fixed bug in <devfs_setup> parsing "dilookup"
423
424- More #ifdef's removed
425
426- Supported Sparc keyboard (/dev/kbd)
427
428- Supported DSP56001 digital signal processor (/dev/dsp56k)
429
430- Supported Apple Desktop Bus (/dev/adb)
431
432- Supported Coda network file system (/dev/cfs*)
433===============================================================================
434Changes for patch v44
435
436- Fixed devfs inode leak when manually recreating inodes
437
438- Fixed permission persistence problem when recreating inodes
439===============================================================================
440Changes for patch v45
441
442- Ported to kernel 2.1.110
443===============================================================================
444Changes for patch v46
445
446- Ported to kernel 2.1.112-pre1
447
448- Removed harmless "unused variable" compiler warning
449
450- Fixed modes for manually recreated device nodes
451===============================================================================
452Changes for patch v47
453
454- Added NULL devfs inode warning in <devfs_read_inode>
455
456- Force all inode nlink values to 1
457===============================================================================
458Changes for patch v48
459
460- Added "dimknod" option
461
462- Set inode nlink to 0 when freeing dentries
463
464- Added support for virtual console capture devices (/dev/vcs*)
465 Thanks to Dennis Hou <smilax@mindmeld.yi.org>
466
467- Fixed modes for manually recreated symlinks
468===============================================================================
469Changes for patch v49
470
471- Ported to kernel 2.1.113
472===============================================================================
473Changes for patch v50
474
475- Fixed bugs in recreated directories and symlinks
476===============================================================================
477Changes for patch v51
478
479- Improved robustness of rc.devfs script
480 Thanks to Roderich Schupp <rsch@experteam.de>
481
482- Fixed bugs in recreated device nodes
483
484- Fixed bug in currently unused <devfs_get_handle_from_inode>
485
486- Defined new <devfs_handle_t> type
487
488- Improved debugging when getting entries
489
490- Fixed bug where directories could be emptied
491
492- Ported to kernel 2.1.115
493===============================================================================
494Changes for patch v52
495
496- Replaced dummy .epoch inode with .devfsd character device
497
498- Modified rc.devfs to take account of above change
499
500- Removed spurious driver warning messages when CONFIG_DEVFS_FS=n
501
502- Implemented devfsd protocol revision 0
503===============================================================================
504Changes for patch v53
505
506- Ported to kernel 2.1.116 (kmod change broke hunk)
507
508- Updated Documentation/Configure.help
509
510- Test and tty pattern patch for rc.devfs script
511 Thanks to Roderich Schupp <rsch@experteam.de>
512
513- Added soothing message to warning in <devfs_d_iput>
514===============================================================================
515Changes for patch v54
516
517- Ported to kernel 2.1.117
518
519- Fixed default permissions in sound driver
520
521- Added support for frame buffer devices (/dev/fb*)
522===============================================================================
523Changes for patch v55
524
525- Ported to kernel 2.1.119
526
527- Use GCC extensions for structure initialisations
528
529- Implemented async open notification
530
531- Incremented devfsd protocol revision to 1
532===============================================================================
533Changes for patch v56
534
535- Ported to kernel 2.1.120-pre3
536
537- Moved async open notification to end of <devfs_open>
538===============================================================================
539Changes for patch v57
540
541- Ported to kernel 2.1.121
542
543- Prepended "/dev/" to module load request
544
545- Renamed <call_kerneld> to <call_kmod>
546
547- Created sample modules.conf file
548===============================================================================
549Changes for patch v58
550
551- Fixed typo "AYSNC" -> "ASYNC"
552===============================================================================
553Changes for patch v59
554
555- Added open flag for files
556===============================================================================
557Changes for patch v60
558
559- Ported to kernel 2.1.123-pre2
560===============================================================================
561Changes for patch v61
562
563- Set i_blocks=0 and i_blksize=1024 in <devfs_read_inode>
564===============================================================================
565Changes for patch v62
566
567- Ported to kernel 2.1.123
568===============================================================================
569Changes for patch v63
570
571- Ported to kernel 2.1.124-pre2
572===============================================================================
573Changes for patch v64
574
575- Fixed Unix98 pty support
576
577- Increased buffer size in <get_partition_list> to avoid crash and
578 burn
579===============================================================================
580Changes for patch v65
581
582- More Unix98 pty support fixes
583
584- Added test for empty <<name>> in <devfs_find_handle>
585
586- Renamed <generate_path> to <devfs_generate_path> and published
587
588- Created /dev/root symlink
589 Thanks to Roderich Schupp <rsch@ExperTeam.de>
590 with further modifications by me
591===============================================================================
592Changes for patch v66
593
594- Yet more Unix98 pty support fixes (now tested)
595
596- Created <devfs_get_fops>
597
598- Support media change checks when CONFIG_DEVFS_ONLY=y
599
600- Abolished Unix98-style PTY names for old PTY devices
601===============================================================================
602Changes for patch v67
603
604- Added inline declaration for dummy <devfs_generate_path>
605
606- Removed spurious "unable to register... in devfs" messages when
607 CONFIG_DEVFS_FS=n
608
609- Fixed misc. devices when CONFIG_DEVFS_FS=n
610
611- Limit auto-device numbering to majors 144 to 239
612===============================================================================
613Changes for patch v68
614
615- Hide unopened virtual consoles from directory listings
616
617- Added support for video capture devices
618
619- Ported to kernel 2.1.125
620===============================================================================
621Changes for patch v69
622
623- Fix for CONFIG_VT=n
624===============================================================================
625Changes for patch v70
626
627- Added support for non-OSS/Free sound cards
628===============================================================================
629Changes for patch v71
630
631- Ported to kernel 2.1.126-pre2
632===============================================================================
633Changes for patch v72
634
635- #ifdef's for CONFIG_DEVFS_DISABLE_OLD_NAMES removed
636===============================================================================
637Changes for patch v73
638
639- CONFIG_DEVFS_DISABLE_OLD_NAMES replaced with "nocompat" boot option
640
641- CONFIG_DEVFS_BOOT_OPTIONS removed: boot options always available
642===============================================================================
643Changes for patch v74
644
645- Removed CONFIG_DEVFS_MOUNT and "mount" boot option and replaced with
646 "nomount" boot option
647
648- Documentation updates
649
650- Updated sample modules.conf
651===============================================================================
652Changes for patch v75
653
654- Updated sample modules.conf
655
656- Remount devfs after initrd finishes
657
658- Ported to kernel 2.1.127
659
660- Added support for ISDN
661 Thanks to Christophe Leroy <christophe.leroy5@capway.com>
662===============================================================================
663Changes for patch v76
664
665- Updated an email address in ChangeLog
666
667- CONFIG_DEVFS_ONLY replaced with "only" boot option
668===============================================================================
669Changes for patch v77
670
671- Added DEVFS_FL_REMOVABLE flag
672
673- Check for disc change when listing directories with removable media
674 devices
675
676- Use DEVFS_FL_REMOVABLE in sd.c
677
678- Ported to kernel 2.1.128
679===============================================================================
680Changes for patch v78
681
682- Only call <scan_dir_for_removable> on first call to <devfs_readdir>
683
684- Ported to kernel 2.1.129-pre5
685
686- ISDN support improvements
687 Thanks to Christophe Leroy <christophe.leroy5@capway.com>
688===============================================================================
689Changes for patch v79
690
691- Ported to kernel 2.1.130
692
693- Renamed miscdevice "apm" to "apm_bios" to be consistent with
694 devices.txt
695===============================================================================
696Changes for patch v80
697
698- Ported to kernel 2.1.131
699
700- Updated <devfs_rmdir> for VFS change in 2.1.131
701===============================================================================
702Changes for patch v81
703
704- Fixed permissions on /dev/ptmx
705===============================================================================
706Changes for patch v82
707
708- Ported to kernel 2.1.132-pre4
709
710- Changed initial permissions on /dev/pts/*
711
712- Created <devfs_mk_compat>
713
714- Added "symlinks" boot option
715
716- Changed devfs_register_blkdev() back to register_blkdev() for IDE
717
718- Check for partitions on removable media in <devfs_lookup>
719===============================================================================
720Changes for patch v83
721
722- Fixed support for ramdisc when using string-based root FS name
723
724- Ported to kernel 2.2.0-pre1
725===============================================================================
726Changes for patch v84
727
728- Ported to kernel 2.2.0-pre7
729===============================================================================
730Changes for patch v85
731
732- Compile fixes for driver/sound/sound_common.c (non-module) and
733 drivers/isdn/isdn_common.c
734 Thanks to Christophe Leroy <christophe.leroy5@capway.com>
735
736- Added support for registering regular files
737
738- Created <devfs_set_file_size>
739
740- Added /dev/cpu/mtrr as an alternative interface to /proc/mtrr
741
742- Update devfs inodes from entries if not changed through FS
743===============================================================================
744Changes for patch v86
745
746- Ported to kernel 2.2.0-pre9
747===============================================================================
748Changes for patch v87
749
750- Fixed bug when mounting non-devfs devices in a devfs
751===============================================================================
752Changes for patch v88
753
754- Fixed <devfs_fill_file> to only initialise temporary inodes
755
756- Trap for NULL fops in <devfs_register>
757
758- Return -ENODEV in <devfs_fill_file> for non-driver inodes
759
760- Fixed bug when unswapping non-devfs devices in a devfs
761===============================================================================
762Changes for patch v89
763
764- Switched to C data types in include/linux/devfs_fs.h
765
766- Switched from PATH_MAX to DEVFS_PATHLEN
767
768- Updated Documentation/filesystems/devfs/modules.conf to take account
769 of reverse scanning (!) by modprobe
770
771- Ported to kernel 2.2.0
772===============================================================================
773Changes for patch v90
774
775- CONFIG_DEVFS_DISABLE_OLD_TTY_NAMES replaced with "nottycompat" boot
776 option
777
778- CONFIG_DEVFS_TTY_COMPAT removed: existing "symlinks" boot option now
779 controls this. This means you must have libc 5.4.44 or later, or a
780 recent version of libc 6 if you use the "symlinks" option
781===============================================================================
782Changes for patch v91
783
784- Switch from <devfs_mk_symlink> to <devfs_mk_compat> in
785 drivers/char/vc_screen.c to fix problems with Midnight Commander
786===============================================================================
787Changes for patch v92
788
789- Ported to kernel 2.2.2-pre5
790===============================================================================
791Changes for patch v93
792
793- Modified <sd_name> in drivers/scsi/sd.c to cope with devices that
794 don't exist (which happens with new RAID autostart code printk()s)
795===============================================================================
796Changes for patch v94
797
798- Fixed bug in joystick driver: only first joystick was registered
799===============================================================================
800Changes for patch v95
801
802- Fixed another bug in joystick driver
803
804- Fixed <devfsd_read> to not overrun event buffer
805===============================================================================
806Changes for patch v96
807
808- Ported to kernel 2.2.5-2
809
810- Created <devfs_auto_unregister>
811
812- Fixed bugs: compatibility entries were not unregistered for:
813 loop driver
814 floppy driver
815 RAMDISC driver
816 IDE tape driver
817 SCSI CD-ROM driver
818 SCSI HDD driver
819===============================================================================
820Changes for patch v97
821
822- Fixed bugs: compatibility entries were not unregistered for:
823 ALSA sound driver
824 partitions in generic disc driver
825
826- Don't return unregistered entries in <devfs_find_handle>
827
828- Panic in <devfs_unregister> if entry unregistered
829
830- Don't panic in <devfs_auto_unregister> for duplicates
831===============================================================================
832Changes for patch v98
833
834- Don't unregister already unregistered entries in <unregister>
835
836- Register entry in <sd_detect>
837
838- Unregister entry in <sd_detach>
839
840- Changed to <devfs_*register_chrdev> in drivers/char/tty_io.c
841
842- Ported to kernel 2.2.7
843===============================================================================
844Changes for patch v99
845
846- Ported to kernel 2.2.8
847
848- Fixed bug in drivers/scsi/sd.c when >16 SCSI discs
849
850- Disable warning messages when unable to read partition table for
851 removable media
852===============================================================================
853Changes for patch v100
854
855- Ported to kernel 2.3.1-pre5
856
857- Added "oops-on-panic" boot option
858
859- Improved debugging in <devfs_register> and <devfs_unregister>
860
861- Register entry in <sr_detect>
862
863- Unregister entry in <sr_detach>
864
865- Register entry in <sg_detect>
866
867- Unregister entry in <sg_detach>
868
869- Added support for ALSA drivers
870===============================================================================
871Changes for patch v101
872
873- Ported to kernel 2.3.2
874===============================================================================
875Changes for patch v102
876
877- Update serial driver to register PCMCIA entries
878 Thanks to Roch-Alexandre Nomine-Beguin <roch@samarkand.infini.fr>
879
880- Updated an email address in ChangeLog
881
882- Hide virtual console capture entries from directory listings when
883 corresponding console device is not open
884===============================================================================
885Changes for patch v103
886
887- Ported to kernel 2.3.3
888===============================================================================
889Changes for patch v104
890
891- Added documentation for some functions
892
893- Added "doc" target to fs/devfs/Makefile
894
895- Added "v4l" directory for video4linux devices
896
897- Replaced call to <devfs_unregister> in <sd_detach> with call to
898 <devfs_register_partitions>
899
900- Moved registration for sr and sg drivers from detect() to attach()
901 methods
902
903- Register entries in <st_attach> and unregister in <st_detach>
904
905- Work around IDE driver treating CD-ROM as gendisk
906
907- Use <sed> instead of <tr> in rc.devfs
908
909- Updated ToDo list
910
911- Removed "oops-on-panic" boot option: now always Oops
912===============================================================================
913Changes for patch v105
914
915- Unregister SCSI host from <scsi_host_no_list> in <scsi_unregister>
916 Thanks to Zoltán Böszörményi <zboszor@mail.externet.hu>
917
918- Don't save /dev/log in rc.devfs
919
920- Ported to kernel 2.3.4-pre1
921===============================================================================
922Changes for patch v106
923
924- Fixed silly typo in drivers/scsi/st.c
925
926- Improved debugging in <devfs_register>
927===============================================================================
928Changes for patch v107
929
930- Added "diunlink" and "nokmod" boot options
931
932- Removed superfluous warning message in <devfs_d_iput>
933===============================================================================
934Changes for patch v108
935
936- Remove entries when unloading sound module
937===============================================================================
938Changes for patch v109
939
940- Ported to kernel 2.3.6-pre2
941===============================================================================
942Changes for patch v110
943
944- Took account of change to <d_alloc_root>
945===============================================================================
946Changes for patch v111
947
948- Created separate event queue for each mounted devfs
949
950- Removed <devfs_invalidate_dcache>
951
952- Created new ioctl()s for devfsd
953
954- Incremented devfsd protocol revision to 3
955
956- Fixed bug when re-creating directories: contents were lost
957
958- Block access to inodes until devfsd updates permissions
959===============================================================================
960Changes for patch v112
961
962- Modified patch so it applies against 2.3.5 and 2.3.6
963
964- Updated an email address in ChangeLog
965
966- Do not automatically change ownership/protection of /dev/tty<n>
967
968- Updated sample modules.conf
969
970- Switched to sending process uid/gid to devfsd
971
972- Renamed <call_kmod> to <try_modload>
973
974- Added DEVFSD_NOTIFY_LOOKUP event
975
976- Added DEVFSD_NOTIFY_CHANGE event
977
978- Added DEVFSD_NOTIFY_CREATE event
979
980- Incremented devfsd protocol revision to 4
981
982- Moved kernel-specific stuff to include/linux/devfs_fs_kernel.h
983===============================================================================
984Changes for patch v113
985
986- Ported to kernel 2.3.9
987
988- Restricted permissions on some block devices
989===============================================================================
990Changes for patch v114
991
992- Added support for /dev/netlink
993 Thanks to Dennis Hou <smilax@mindmeld.yi.org>
994
995- Return EISDIR rather than EINVAL for read(2) on directories
996
997- Ported to kernel 2.3.10
998===============================================================================
999Changes for patch v115
1000
1001- Added support for all remaining character devices
1002 Thanks to Dennis Hou <smilax@mindmeld.yi.org>
1003
1004- Cleaned up netlink support
1005===============================================================================
1006Changes for patch v116
1007
1008- Added support for /dev/parport%d
1009 Thanks to Tim Waugh <tim@cyberelk.demon.co.uk>
1010
1011- Fixed parallel port ATAPI tape driver
1012
1013- Fixed Atari SLM laser printer driver
1014===============================================================================
1015Changes for patch v117
1016
1017- Added support for COSA card
1018 Thanks to Dennis Hou <smilax@mindmeld.yi.org>
1019
1020- Fixed drivers/char/ppdev.c: missing #include <linux/init.h>
1021
1022- Fixed drivers/char/ftape/zftape/zftape-init.c
1023 Thanks to Vladimir Popov <mashgrad@usa.net>
1024===============================================================================
1025Changes for patch v118
1026
1027- Ported to kernel 2.3.15-pre3
1028
1029- Fixed bug in loop driver
1030
1031- Unregister /dev/lp%d entries in drivers/char/lp.c
1032 Thanks to Maciej W. Rozycki <macro@ds2.pg.gda.pl>
1033===============================================================================
1034Changes for patch v119
1035
1036- Ported to kernel 2.3.16
1037===============================================================================
1038Changes for patch v120
1039
1040- Fixed bug in drivers/scsi/scsi.c
1041
1042- Added /dev/ppp
1043 Thanks to Dennis Hou <smilax@mindmeld.yi.org>
1044
1045- Ported to kernel 2.3.17
1046===============================================================================
1047Changes for patch v121
1048
1049- Fixed bug in drivers/block/loop.c
1050
1051- Ported to kernel 2.3.18
1052===============================================================================
1053Changes for patch v122
1054
1055- Ported to kernel 2.3.19
1056===============================================================================
1057Changes for patch v123
1058
1059- Ported to kernel 2.3.20
1060===============================================================================
1061Changes for patch v124
1062
1063- Ported to kernel 2.3.21
1064===============================================================================
1065Changes for patch v125
1066
1067- Created <devfs_get_info>, <devfs_set_info>,
1068 <devfs_get_first_child> and <devfs_get_next_sibling>
1069 Added <<dir>> parameter to <devfs_register>, <devfs_mk_compat>,
1070 <devfs_mk_dir> and <devfs_find_handle>
1071 Work sponsored by SGI
1072
1073- Fixed apparent bug in COSA driver
1074
1075- Re-instated "scsihosts=" boot option
1076===============================================================================
1077Changes for patch v126
1078
1079- Always create /dev/pts if CONFIG_UNIX98_PTYS=y
1080
1081- Fixed call to <devfs_mk_dir> in drivers/block/ide-disk.c
1082 Thanks to Dennis Hou <smilax@mindmeld.yi.org>
1083
1084- Allow multiple unregistrations
1085
1086- Created /dev/scsi hierarchy
1087 Work sponsored by SGI
1088===============================================================================
1089Changes for patch v127
1090
1091Work sponsored by SGI
1092
1093- No longer disable devpts if devfs enabled (caveat emptor)
1094
1095- Added flags array to struct gendisk and removed code from
1096 drivers/scsi/sd.c
1097
1098- Created /dev/discs hierarchy
1099===============================================================================
1100Changes for patch v128
1101
1102Work sponsored by SGI
1103
1104- Created /dev/cdroms hierarchy
1105===============================================================================
1106Changes for patch v129
1107
1108Work sponsored by SGI
1109
1110- Removed compatibility entries for sound devices
1111
1112- Removed compatibility entries for printer devices
1113
1114- Removed compatibility entries for video4linux devices
1115
1116- Removed compatibility entries for parallel port devices
1117
1118- Removed compatibility entries for frame buffer devices
1119===============================================================================
1120Changes for patch v130
1121
1122Work sponsored by SGI
1123
1124- Added major and minor number to devfsd protocol
1125
1126- Incremented devfsd protocol revision to 5
1127
1128- Removed compatibility entries for SoundBlaster CD-ROMs
1129
1130- Removed compatibility entries for netlink devices
1131
1132- Removed compatibility entries for SCSI generic devices
1133
1134- Removed compatibility entries for SCSI tape devices
1135===============================================================================
1136Changes for patch v131
1137
1138Work sponsored by SGI
1139
1140- Support info pointer for all devfs entry types
1141
1142- Added <<info>> parameter to <devfs_mk_dir> and <devfs_mk_symlink>
1143
1144- Removed /dev/st hierarchy
1145
1146- Removed /dev/sg hierarchy
1147
1148- Removed compatibility entries for loop devices
1149
1150- Removed compatibility entries for IDE tape devices
1151
1152- Removed compatibility entries for SCSI CD-ROMs
1153
1154- Removed /dev/sr hierarchy
1155===============================================================================
1156Changes for patch v132
1157
1158Work sponsored by SGI
1159
1160- Removed compatibility entries for floppy devices
1161
1162- Removed compatibility entries for RAMDISCs
1163
1164- Removed compatibility entries for meta-devices
1165
1166- Removed compatibility entries for SCSI discs
1167
1168- Created <devfs_make_root>
1169
1170- Removed /dev/sd hierarchy
1171
1172- Support "../" when searching devfs namespace
1173
1174- Created /dev/ide/host* hierarchy
1175
1176- Supported IDE hard discs in /dev/ide/host* hierarchy
1177
1178- Removed compatibility entries for IDE discs
1179
1180- Removed /dev/ide/hd hierarchy
1181
1182- Supported IDE CD-ROMs in /dev/ide/host* hierarchy
1183
1184- Removed compatibility entries for IDE CD-ROMs
1185
1186- Removed /dev/ide/cd hierarchy
1187===============================================================================
1188Changes for patch v133
1189
1190Work sponsored by SGI
1191
1192- Created <devfs_get_unregister_slave>
1193
1194- Fixed bug in fs/partitions/check.c when rescanning
1195===============================================================================
1196Changes for patch v134
1197
1198Work sponsored by SGI
1199
1200- Removed /dev/sd, /dev/sr, /dev/st and /dev/sg directories
1201
1202- Removed /dev/ide/hd directory
1203
1204- Exported <devfs_get_parent>
1205
1206- Created <devfs_register_tape> and /dev/tapes hierarchy
1207
1208- Removed /dev/ide/mt hierarchy
1209
1210- Removed /dev/ide/fd hierarchy
1211
1212- Ported to kernel 2.3.25
1213===============================================================================
1214Changes for patch v135
1215
1216Work sponsored by SGI
1217
1218- Removed compatibility entries for virtual console capture devices
1219
1220- Removed unused <devfs_set_symlink_destination>
1221
1222- Removed compatibility entries for serial devices
1223
1224- Removed compatibility entries for console devices
1225
1226- Do not hide entries from devfsd or children
1227
1228- Removed DEVFS_FL_TTY_COMPAT flag
1229
1230- Removed "nottycompat" boot option
1231
1232- Removed <devfs_mk_compat>
1233===============================================================================
1234Changes for patch v136
1235
1236Work sponsored by SGI
1237
1238- Moved BSD pty devices to /dev/pty
1239
1240- Added DEVFS_FL_WAIT flag
1241===============================================================================
1242Changes for patch v137
1243
1244Work sponsored by SGI
1245
1246- Really fixed bug in fs/partitions/check.c when rescanning
1247
1248- Support new "disc" naming scheme in <get_removable_partition>
1249
1250- Allow NULL fops in <devfs_register>
1251
1252- Removed redundant name functions in SCSI disc and IDE drivers
1253===============================================================================
1254Changes for patch v138
1255
1256Work sponsored by SGI
1257
1258- Fixed old bugs in drivers/block/paride/pt.c, drivers/char/tpqic02.c,
1259 drivers/net/wan/cosa.c and drivers/scsi/scsi.c
1260 Thanks to Sergey Kubushin <ksi@ksi-linux.com>
1261
1262- Fall back to major table if NULL fops given to <devfs_register>
1263===============================================================================
1264Changes for patch v139
1265
1266Work sponsored by SGI
1267
1268- Corrected and moved <get_blkfops> and <get_chrfops> declarations
1269 from arch/alpha/kernel/osf_sys.c to include/linux/fs.h
1270
1271- Removed name function from struct gendisk
1272
1273- Updated devfs FAQ
1274===============================================================================
1275Changes for patch v140
1276
1277Work sponsored by SGI
1278
1279- Ported to kernel 2.3.27
1280===============================================================================
1281Changes for patch v141
1282
1283Work sponsored by SGI
1284
1285- Bug fix in arch/m68k/atari/joystick.c
1286
1287- Moved ISDN and capi devices to /dev/isdn
1288===============================================================================
1289Changes for patch v142
1290
1291Work sponsored by SGI
1292
1293- Bug fix in drivers/block/ide-probe.c (patch confusion)
1294===============================================================================
1295Changes for patch v143
1296
1297Work sponsored by SGI
1298
1299- Bug fix in drivers/block/blkpg.c:partition_name()
1300===============================================================================
1301Changes for patch v144
1302
1303Work sponsored by SGI
1304
1305- Ported to kernel 2.3.29
1306
1307- Removed calls to <devfs_register> from cdu31a, cm206, mcd and mcdx
1308 CD-ROM drivers: generic driver handles this now
1309
1310- Moved joystick devices to /dev/joysticks
1311===============================================================================
1312Changes for patch v145
1313
1314Work sponsored by SGI
1315
1316- Ported to kernel 2.3.30-pre3
1317
1318- Register whole-disc entry even for invalid partition tables
1319
1320- Fixed bug in mounting root FS when initrd enabled
1321
1322- Fixed device entry leak with IDE CD-ROMs
1323
1324- Fixed compile problem with drivers/isdn/isdn_common.c
1325
1326- Moved COSA devices to /dev/cosa
1327
1328- Support fifos when unregistering
1329
1330- Created <devfs_register_series> and used in many drivers
1331
1332- Moved Coda devices to /dev/coda
1333
1334- Moved parallel port IDE tapes to /dev/pt
1335
1336- Moved parallel port IDE generic devices to /dev/pg
1337===============================================================================
1338Changes for patch v146
1339
1340Work sponsored by SGI
1341
1342- Removed obsolete DEVFS_FL_COMPAT and DEVFS_FL_TOLERANT flags
1343
1344- Fixed compile problem with fs/coda/psdev.c
1345
1346- Reinstate change to <devfs_register_blkdev> in
1347 drivers/block/ide-probe.c now that fs/isofs/inode.c is fixed
1348
1349- Switched to <devfs_register_blkdev> in drivers/block/floppy.c,
1350 drivers/scsi/sr.c and drivers/block/md.c
1351
1352- Moved DAC960 devices to /dev/dac960
1353===============================================================================
1354Changes for patch v147
1355
1356Work sponsored by SGI
1357
1358- Ported to kernel 2.3.32-pre4
1359===============================================================================
1360Changes for patch v148
1361
1362Work sponsored by SGI
1363
1364- Removed kmod support: use devfsd instead
1365
1366- Moved miscellaneous character devices to /dev/misc
1367===============================================================================
1368Changes for patch v149
1369
1370Work sponsored by SGI
1371
1372- Ensure include/linux/joystick.h is OK for user-space
1373
1374- Improved debugging in <get_vfs_inode>
1375
1376- Ensure dentries created by devfsd will be cleaned up
1377===============================================================================
1378Changes for patch v150
1379
1380Work sponsored by SGI
1381
1382- Ported to kernel 2.3.34
1383===============================================================================
1384Changes for patch v151
1385
1386Work sponsored by SGI
1387
1388- Ported to kernel 2.3.35-pre1
1389
1390- Created <devfs_get_name>
1391===============================================================================
1392Changes for patch v152
1393
1394Work sponsored by SGI
1395
1396- Updated sample modules.conf
1397
1398- Ported to kernel 2.3.36-pre1
1399===============================================================================
1400Changes for patch v153
1401
1402Work sponsored by SGI
1403
1404- Ported to kernel 2.3.42
1405
1406- Removed <devfs_fill_file>
1407===============================================================================
1408Changes for patch v154
1409
1410Work sponsored by SGI
1411
1412- Took account of device number changes for /dev/fb*
1413===============================================================================
1414Changes for patch v155
1415
1416Work sponsored by SGI
1417
1418- Ported to kernel 2.3.43-pre8
1419
1420- Moved /dev/tty0 to /dev/vc/0
1421
1422- Moved sequence number formatting from <_tty_make_name> to drivers
1423===============================================================================
1424Changes for patch v156
1425
1426Work sponsored by SGI
1427
1428- Fixed breakage in drivers/scsi/sd.c due to recent SCSI changes
1429===============================================================================
1430Changes for patch v157
1431
1432Work sponsored by SGI
1433
1434- Ported to kernel 2.3.45
1435===============================================================================
1436Changes for patch v158
1437
1438Work sponsored by SGI
1439
1440- Ported to kernel 2.3.46-pre2
1441===============================================================================
1442Changes for patch v159
1443
1444Work sponsored by SGI
1445
1446- Fixed drivers/block/md.c
1447 Thanks to Mike Galbraith <mikeg@weiden.de>
1448
1449- Documentation fixes
1450
1451- Moved device registration from <lp_init> to <lp_register>
1452 Thanks to Tim Waugh <twaugh@redhat.com>
1453===============================================================================
1454Changes for patch v160
1455
1456Work sponsored by SGI
1457
1458- Fixed drivers/char/joystick/joystick.c
1459 Thanks to Vojtech Pavlik <vojtech@suse.cz>
1460
1461- Documentation updates
1462
1463- Fixed arch/i386/kernel/mtrr.c if procfs and devfs not enabled
1464
1465- Fixed drivers/char/stallion.c
1466===============================================================================
1467Changes for patch v161
1468
1469Work sponsored by SGI
1470
1471- Remove /dev/ide when ide-mod is unloaded
1472
1473- Fixed bug in drivers/block/ide-probe.c when secondary but no primary
1474
1475- Added DEVFS_FL_NO_PERSISTENCE flag
1476
1477- Used new DEVFS_FL_NO_PERSISTENCE flag for Unix98 pty slaves
1478
1479- Removed unnecessary call to <update_devfs_inode_from_entry> in
1480 <devfs_readdir>
1481
1482- Only set auto-ownership for /dev/pty/s*
1483===============================================================================
1484Changes for patch v162
1485
1486Work sponsored by SGI
1487
1488- Set inode->i_size to correct size for symlinks
1489 Thanks to Jeremy Fitzhardinge <jeremy@goop.org>
1490
1491- Only give lookup() method to directories to comply with new VFS
1492 assumptions
1493
1494- Remove unnecessary tests in symlink methods
1495
1496- Don't kill existing block ops in <devfs_read_inode>
1497
1498- Restore auto-ownership for /dev/pty/m*
1499===============================================================================
1500Changes for patch v163
1501
1502Work sponsored by SGI
1503
1504- Don't create missing directories in <devfs_find_handle>
1505
1506- Removed Documentation/filesystems/devfs/mk-devlinks
1507
1508- Updated Documentation/filesystems/devfs/README
1509===============================================================================
1510Changes for patch v164
1511
1512Work sponsored by SGI
1513
1514- Fixed CONFIG_DEVFS breakage in drivers/char/serial.c introduced in
1515 linux-2.3.99-pre6-7
1516===============================================================================
1517Changes for patch v165
1518
1519Work sponsored by SGI
1520
1521- Ported to kernel 2.3.99-pre6
1522===============================================================================
1523Changes for patch v166
1524
1525Work sponsored by SGI
1526
1527- Added CONFIG_DEVFS_MOUNT
1528===============================================================================
1529Changes for patch v167
1530
1531Work sponsored by SGI
1532
1533- Updated Documentation/filesystems/devfs/README
1534
1535- Updated sample modules.conf
1536===============================================================================
1537Changes for patch v168
1538
1539Work sponsored by SGI
1540
1541- Disabled multi-mount capability (use VFS bindings instead)
1542
1543- Updated README from master HTML file
1544===============================================================================
1545Changes for patch v169
1546
1547Work sponsored by SGI
1548
1549- Removed multi-mount code
1550
1551- Removed compatibility macros: VFS has changed too much
1552===============================================================================
1553Changes for patch v170
1554
1555Work sponsored by SGI
1556
1557- Updated README from master HTML file
1558
1559- Merged devfs inode into devfs entry
1560===============================================================================
1561Changes for patch v171
1562
1563Work sponsored by SGI
1564
1565- Updated sample modules.conf
1566
1567- Removed dead code in <devfs_register> which used to call
1568 <free_dentries>
1569
1570- Ported to kernel 2.4.0-test2-pre3
1571===============================================================================
1572Changes for patch v172
1573
1574Work sponsored by SGI
1575
1576- Changed interface to <devfs_register>
1577
1578- Changed interface to <devfs_register_series>
1579===============================================================================
1580Changes for patch v173
1581
1582Work sponsored by SGI
1583
1584- Simplified interface to <devfs_mk_symlink>
1585
1586- Simplified interface to <devfs_mk_dir>
1587
1588- Simplified interface to <devfs_find_handle>
1589===============================================================================
1590Changes for patch v174
1591
1592Work sponsored by SGI
1593
1594- Updated README from master HTML file
1595===============================================================================
1596Changes for patch v175
1597
1598Work sponsored by SGI
1599
1600- DocBook update for fs/devfs/base.c
1601 Thanks to Tim Waugh <twaugh@redhat.com>
1602
1603- Removed stale fs/tunnel.c (was never used or completed)
1604===============================================================================
1605Changes for patch v176
1606
1607Work sponsored by SGI
1608
1609- Updated ToDo list
1610
1611- Removed sample modules.conf: now distributed with devfsd
1612
1613- Updated README from master HTML file
1614
1615- Ported to kernel 2.4.0-test3-pre4 (which had devfs-patch-v174)
1616===============================================================================
1617Changes for patch v177
1618
1619- Updated README from master HTML file
1620
1621- Documentation cleanups
1622
1623- Ensure <devfs_generate_path> terminates string for root entry
1624 Thanks to Tim Jansen <tim@tjansen.de>
1625
1626- Exported <devfs_get_name> to modules
1627
1628- Make <devfs_mk_symlink> send events to devfsd
1629
1630- Cleaned up option processing in <devfs_setup>
1631
1632- Fixed bugs in handling symlinks: could leak or cause Oops
1633
1634- Cleaned up directory handling by separating fops
1635 Thanks to Alexander Viro <viro@parcelfarce.linux.theplanet.co.uk>
1636===============================================================================
1637Changes for patch v178
1638
1639- Fixed handling of inverted options in <devfs_setup>
1640===============================================================================
1641Changes for patch v179
1642
1643- Adjusted <try_modload> to account for <devfs_generate_path> fix
1644===============================================================================
1645Changes for patch v180
1646
1647- Fixed !CONFIG_DEVFS_FS stub declaration of <devfs_get_info>
1648===============================================================================
1649Changes for patch v181
1650
1651- Answered question posed by Al Viro and removed his comments from <devfs_open>
1652
1653- Moved setting of registered flag after other fields are changed
1654
1655- Fixed race between <devfsd_close> and <devfsd_notify_one>
1656
1657- Global VFS changes added bogus BKL to devfsd_close(): removed
1658
1659- Widened locking in <devfs_readlink> and <devfs_follow_link>
1660
1661- Replaced <devfsd_read> stack usage with <devfsd_ioctl> kmalloc
1662
1663- Simplified locking in <devfsd_ioctl> and fixed memory leak
1664===============================================================================
1665Changes for patch v182
1666
1667- Created <devfs_*alloc_major> and <devfs_*alloc_devnum>
1668
1669- Removed broken devnum allocation and use <devfs_alloc_devnum>
1670
1671- Fixed old devnum leak by calling new <devfs_dealloc_devnum>
1672
1673- Created <devfs_*alloc_unique_number>
1674
1675- Fixed number leak for /dev/cdroms/cdrom%d
1676
1677- Fixed number leak for /dev/discs/disc%d
1678===============================================================================
1679Changes for patch v183
1680
1681- Fixed bug in <devfs_setup> which could hang boot process
1682===============================================================================
1683Changes for patch v184
1684
1685- Documentation typo fix for fs/devfs/util.c
1686
1687- Fixed drivers/char/stallion.c for devfs
1688
1689- Added DEVFSD_NOTIFY_DELETE event
1690
1691- Updated README from master HTML file
1692
1693- Removed #include <asm/segment.h> from fs/devfs/base.c
1694===============================================================================
1695Changes for patch v185
1696
1697- Made <block_semaphore> and <char_semaphore> in fs/devfs/util.c
1698 private
1699
1700- Fixed inode table races by removing it and using inode->u.generic_ip
1701 instead
1702
1703- Moved <devfs_read_inode> into <get_vfs_inode>
1704
1705- Moved <devfs_write_inode> into <devfs_notify_change>
1706===============================================================================
1707Changes for patch v186
1708
1709- Fixed race in <devfs_do_symlink> for uni-processor
1710
1711- Updated README from master HTML file
1712===============================================================================
1713Changes for patch v187
1714
1715- Fixed drivers/char/stallion.c for devfs
1716
1717- Fixed drivers/char/rocket.c for devfs
1718
1719- Fixed bug in <devfs_alloc_unique_number>: limited to 128 numbers
1720===============================================================================
1721Changes for patch v188
1722
1723- Updated major masks in fs/devfs/util.c up to Linus' "no new majors"
1724 proclamation. Block: were 126 now 122 free, char: were 26 now 19 free
1725
1726- Updated README from master HTML file
1727
1728- Removed remnant of multi-mount support in <devfs_mknod>
1729
1730- Removed unused DEVFS_FL_SHOW_UNREG flag
1731===============================================================================
1732Changes for patch v189
1733
1734- Removed nlink field from struct devfs_inode
1735
1736- Removed auto-ownership for /dev/pty/* (BSD ptys) and used
1737 DEVFS_FL_CURRENT_OWNER|DEVFS_FL_NO_PERSISTENCE for /dev/pty/s* (just
1738 like Unix98 pty slaves) and made /dev/pty/m* rw-rw-rw- access
1739===============================================================================
1740Changes for patch v190
1741
1742- Updated README from master HTML file
1743
1744- Replaced BKL with global rwsem to protect symlink data (quick and
1745 dirty hack)
1746===============================================================================
1747Changes for patch v191
1748
1749- Replaced global rwsem for symlink with per-link refcount
1750===============================================================================
1751Changes for patch v192
1752
1753- Removed unnecessary #ifdef CONFIG_DEVFS_FS from arch/i386/kernel/mtrr.c
1754
1755- Ported to kernel 2.4.10-pre11
1756
1757- Set inode->i_mapping->a_ops for block nodes in <get_vfs_inode>
1758===============================================================================
1759Changes for patch v193
1760
1761- Went back to global rwsem for symlinks (refcount scheme no good)
1762===============================================================================
1763Changes for patch v194
1764
1765- Fixed overrun in <devfs_link> by removing function (not needed)
1766
1767- Updated README from master HTML file
1768===============================================================================
1769Changes for patch v195
1770
1771- Fixed buffer underrun in <try_modload>
1772
1773- Moved down_read() from <search_for_entry_in_dir> to <find_entry>
1774===============================================================================
1775Changes for patch v196
1776
1777- Fixed race in <devfsd_ioctl> when setting event mask
1778 Thanks to Kari Hurtta <hurtta@leija.mh.fmi.fi>
1779
1780- Avoid deadlock in <devfs_follow_link> by using temporary buffer
1781===============================================================================
1782Changes for patch v197
1783
1784- First release of new locking code for devfs core (v1.0)
1785
1786- Fixed bug in drivers/cdrom/cdrom.c
1787===============================================================================
1788Changes for patch v198
1789
1790- Discard temporary buffer, now use "%s" for dentry names
1791
1792- Don't generate path in <try_modload>: use fake entry instead
1793
1794- Use "existing" directory in <_devfs_make_parent_for_leaf>
1795
1796- Use slab cache rather than fixed buffer for devfsd events
1797===============================================================================
1798Changes for patch v199
1799
1800- Removed obsolete usage of DEVFS_FL_NO_PERSISTENCE
1801
1802- Send DEVFSD_NOTIFY_REGISTERED events in <devfs_mk_dir>
1803
1804- Fixed locking bug in <devfs_d_revalidate_wait> due to typo
1805
1806- Do not send CREATE, CHANGE, ASYNC_OPEN or DELETE events from devfsd
1807 or children
1808===============================================================================
1809Changes for patch v200
1810
1811- Ported to kernel 2.5.1-pre2
1812===============================================================================
1813Changes for patch v201
1814
1815- Fixed bug in <devfsd_read>: was dereferencing freed pointer
1816===============================================================================
1817Changes for patch v202
1818
1819- Fixed bug in <devfsd_close>: was dereferencing freed pointer
1820
1821- Added process group check for devfsd privileges
1822===============================================================================
1823Changes for patch v203
1824
1825- Use SLAB_ATOMIC in <devfsd_notify_de> from <devfs_d_delete>
1826===============================================================================
1827Changes for patch v204
1828
1829- Removed long obsolete rc.devfs
1830
1831- Return old entry in <devfs_mk_dir> for 2.4.x kernels
1832
1833- Updated README from master HTML file
1834
1835- Increment refcount on module in <check_disc_changed>
1836
1837- Created <devfs_get_handle> and exported <devfs_put>
1838
1839- Increment refcount on module in <devfs_get_ops>
1840
1841- Created <devfs_put_ops> and used where needed to fix races
1842
1843- Added clarifying comments in response to preliminary EMC code review
1844
1845- Added poisoning to <devfs_put>
1846
1847- Improved debugging messages
1848
1849- Fixed unregister bugs in drivers/md/lvm-fs.c
1850===============================================================================
1851Changes for patch v205
1852
1853- Corrected (made useful) debugging message in <unregister>
1854
1855- Moved <kmem_cache_create> in <mount_devfs_fs> to <init_devfs_fs>
1856
1857- Fixed drivers/md/lvm-fs.c to create "lvm" entry
1858
1859- Added magic number to guard against scribbling drivers
1860
1861- Only return old entry in <devfs_mk_dir> if a directory
1862
1863- Defined macros for error and debug messages
1864
1865- Updated README from master HTML file
1866===============================================================================
1867Changes for patch v206
1868
1869- Added support for multiple Compaq cpqarray controllers
1870
1871- Fixed (rare, old) race in <devfs_lookup>
1872===============================================================================
1873Changes for patch v207
1874
1875- Fixed deadlock bug in <devfs_d_revalidate_wait>
1876
1877- Tag VFS deletable in <devfs_mk_symlink> if handle ignored
1878
1879- Updated README from master HTML file
1880===============================================================================
1881Changes for patch v208
1882
1883- Added KERN_* to remaining messages
1884
1885- Cleaned up declaration of <stat_read>
1886
1887- Updated README from master HTML file
1888===============================================================================
1889Changes for patch v209
1890
1891- Updated README from master HTML file
1892
1893- Removed silently introduced calls to lock_kernel() and
1894 unlock_kernel() due to recent VFS locking changes. BKL isn't
1895 required in devfs
1896
1897- Changed <devfs_rmdir> to allow later additions if not yet empty
1898
1899- Added calls to <devfs_register_partitions> in drivers/block/blkpg.c
1900 <add_partition> and <del_partition>
1901
1902- Fixed bug in <devfs_alloc_unique_number>: was clearing beyond
1903 bitfield
1904
1905- Fixed bitfield data type for <devfs_*alloc_devnum>
1906
1907- Made major bitfield type and initialiser 64 bit safe
1908===============================================================================
1909Changes for patch v210
1910
1911- Updated fs/devfs/util.c to fix shift warning on 64 bit machines
1912 Thanks to Anton Blanchard <anton@samba.org>
1913
1914- Updated README from master HTML file
1915===============================================================================
1916Changes for patch v211
1917
1918- Do not put miscellaneous character devices in /dev/misc if they
1919 specify their own directory (i.e. contain a '/' character)
1920
1921- Copied macro for error messages from fs/devfs/base.c to
1922 fs/devfs/util.c and made use of this macro
1923
1924- Removed 2.4.x compatibility code from fs/devfs/base.c
1925===============================================================================
1926Changes for patch v212
1927
1928- Added BKL to <devfs_open> because drivers still need it
1929===============================================================================
1930Changes for patch v213
1931
1932- Protected <scan_dir_for_removable> and <get_removable_partition>
1933 from changing directory contents
1934===============================================================================
1935Changes for patch v214
1936
1937- Switched to ISO C structure field initialisers
1938
1939- Switch to set_current_state() and move before add_wait_queue()
1940
1941- Updated README from master HTML file
1942
1943- Fixed devfs entry leak in <devfs_readdir> when *readdir fails
1944===============================================================================
1945Changes for patch v215
1946
1947- Created <devfs_find_and_unregister>
1948
1949- Switched many functions from <devfs_find_handle> to
1950 <devfs_find_and_unregister>
1951
1952- Switched many functions from <devfs_find_handle> to <devfs_get_handle>
1953===============================================================================
1954Changes for patch v216
1955
1956- Switched arch/ia64/sn/io/hcl.c from <devfs_find_handle> to
1957 <devfs_get_handle>
1958
1959- Removed deprecated <devfs_find_handle>
1960===============================================================================
1961Changes for patch v217
1962
1963- Exported <devfs_find_and_unregister> and <devfs_only> to modules
1964
1965- Updated README from master HTML file
1966
1967- Fixed module unload race in <devfs_open>
1968===============================================================================
1969Changes for patch v218
1970
1971- Removed DEVFS_FL_AUTO_OWNER flag
1972
1973- Switched lingering structure field initialiser to ISO C
1974
1975- Added locking when setting/clearing flags
1976
1977- Documentation fix in fs/devfs/util.c
diff --git a/Documentation/filesystems/devfs/README b/Documentation/filesystems/devfs/README
new file mode 100644
index 000000000000..54366ecc241f
--- /dev/null
+++ b/Documentation/filesystems/devfs/README
@@ -0,0 +1,1964 @@
1Devfs (Device File System) FAQ
2
3
4Linux Devfs (Device File System) FAQ
5Richard Gooch
620-AUG-2002
7
8
17-----------------------------------------------------------------------------
18
19NOTE: the master copy of this document is available online at:
20
21http://www.atnf.csiro.au/~rgooch/linux/docs/devfs.html
22and looks much better than the text version distributed with the
23kernel sources. A mirror site is available at:
24
25http://www.ras.ucalgary.ca/~rgooch/linux/docs/devfs.html
26
27There is also an optional daemon that may be used with devfs. You can
28find out more about it at:
29
30http://www.atnf.csiro.au/~rgooch/linux/
31
32A mailing list is available which you may subscribe to. Send email to
33majordomo@oss.sgi.com with the following line in the body of the
34message:
35subscribe devfs
36To unsubscribe, send the following line in the message body instead:
37unsubscribe devfs
38
39The list is archived at
40
41http://oss.sgi.com/projects/devfs/archive/.
42
43-----------------------------------------------------------------------------
44
45Contents
46
47
48What is it?
49
50Why do it?
51
52Who else does it?
53
54How it works
55
56Operational issues (essential reading)
57
58Instructions for the impatient
59Permissions persistence across reboots
60Dealing with drivers without devfs support
61All the way with Devfs
62Other Issues
63Kernel Naming Scheme
64Devfsd Naming Scheme
65Old Compatibility Names
66SCSI Host Probing Issues
67
68
69
70Device drivers currently ported
71
72Allocation of Device Numbers
73
74Questions and Answers
75
76Making things work
77Alternatives to devfs
78What I don't like about devfs
79How to report bugs
80Strange kernel messages
81Compilation problems with devfsd
82
83
84Other resources
85
86Translations of this document
87
88
89-----------------------------------------------------------------------------
90
91
92What is it?
93
94Devfs is an alternative to "real" character and block special devices
95on your root filesystem. Kernel device drivers can register devices by
96name rather than major and minor numbers. These devices will appear in
97devfs automatically, with whatever default ownership and
98protection the driver specified. A daemon (devfsd) can be used to
99override these defaults. Devfs has been in the kernel since 2.3.46.
100
101NOTE that devfs is entirely optional. If you prefer the old
102disc-based device nodes, then simply leave CONFIG_DEVFS_FS=n (the
103default). In this case, nothing will change. ALSO NOTE that if you do
104enable devfs, the defaults are such that full compatibility is
105maintained with the old device names.
106
107There are two aspects to devfs: one is the underlying device
108namespace, which is a namespace just like any mounted filesystem. The
109other aspect is the filesystem code which provides a view of the
110device namespace. The reason I make a distinction is because devfs
111can be mounted many times, with each mount showing the same device
112namespace. Changes made are global to all mounted devfs filesystems.
113Also, because the devfs namespace exists without any devfs mounts, you
114can easily mount the root filesystem by referring to an entry in the
115devfs namespace.
116
117
118The cost of devfs is a small increase in kernel code size and memory
119usage. About 7 pages of code (some of that in __init sections) and 72
120bytes for each entry in the namespace. A modest system has only a
121couple of hundred device entries, so this costs a few more
122pages. Compare this with the suggestion to put /dev on a
123ramdisc.
124
125On a typical machine, the cost is under 0.2 percent. On a modest
126system with 64 MBytes of RAM, the cost is under 0.1 percent. The
127accusations of "bloatware" levelled at devfs are not justified.
128
129-----------------------------------------------------------------------------
130
131
132Why do it?
133
134There are several problems that devfs addresses. Some of these
135problems are more serious than others (depending on your point of
136view), and some can be solved without devfs. However, the totality of
137these problems really calls out for devfs.
138
139The choice is between a patchwork of inefficient user-space solutions,
140which are complex and likely to be fragile, and a simple, efficient
141devfs which is robust.
142
143There have been many counter-proposals to devfs, all seeking to
144provide some of the benefits without actually implementing devfs. So
145far there has been an absence of code and no proposed alternative has
146been able to provide all the features that devfs does. Further,
147alternative proposals require far more complexity in user-space (and
148still deliver less functionality than devfs). Some people have the
149mantra of reducing "kernel bloat", but don't consider the effects on
150user-space.
151
152A good solution limits the total complexity of kernel-space and
153user-space.
154
155
156Major and minor allocation
157
158The existing scheme requires the allocation of major and minor device
159numbers for each and every device. This means that a central
160co-ordinating authority is required to issue these device numbers
161(unless you're developing a "private" device driver), in order to
162preserve uniqueness. Devfs shifts the burden to a namespace. This may
163not seem like a huge benefit, but actually it is. Since driver authors
164will naturally choose a device name which reflects the functionality
165of the device, there is far less potential for namespace conflict.
166Solving this requires a kernel change.
167
168/dev management
169
170Because you currently access devices through device nodes, these must
171be created by the system administrator. For standard devices you can
172usually find a MAKEDEV programme which creates all these (hundreds!)
173of nodes. This means that changes in the kernel must be reflected by
174changes in the MAKEDEV programme, or else the system administrator
175creates device nodes by hand.
176
177The basic problem is that there are two separate databases of
178major and minor numbers. One is in the kernel and one is in /dev (or
179in a MAKEDEV programme, if you want to look at it that way). This is
180duplication of information, which is not good practice.
181Solving this requires a kernel change.
182
183/dev growth
184
185A typical /dev has over 1200 nodes! Most of these devices simply don't
186exist because the hardware is not available. A huge /dev increases the
187time to access devices (I'm just referring to the dentry lookup times
188and the time taken to read inodes off disc: the next subsection shows
189some more horrors).
190
191An example of how big /dev can grow is if we consider SCSI devices:
192
193host 6 bits (say up to 64 hosts on a really big machine)
194channel 4 bits (say up to 16 SCSI buses per host)
195id 4 bits
196lun 3 bits
197partition 6 bits
198TOTAL 23 bits
199
200
201This requires 8 Mega (8*1024*1024, or 2^23) inodes if we want to store all
202possible device nodes. Even if we scrap everything but id and partition
203and assume a single host adapter with a single SCSI bus and only one
204logical unit per SCSI target (id), that's still 10 bits or 1024
205inodes. Each VFS inode takes around 256 bytes (kernel 2.1.78), so
206that's 256 kBytes of inode storage on disc (assuming real inodes take
207a similar amount of space as VFS inodes). This is actually not so bad,
208because disc is cheap these days. Embedded systems would care about
209256 kBytes of /dev inodes, but you could argue that embedded systems
210would have hand-tuned /dev directories. I've had to do just that on my
211embedded systems, but I would rather just leave it to devfs.
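
The arithmetic above can be checked with a short Python sketch. The field widths and the 256 byte inode size are taken from the text; the variable and function names are illustrative only.

```python
# Field widths (in bits) for a fully general SCSI device number, as listed above.
SCSI_FIELD_BITS = {"host": 6, "channel": 4, "id": 4, "lun": 3, "partition": 6}

INODE_BYTES = 256  # approximate VFS inode size quoted for kernel 2.1.78


def node_count(bits):
    """Number of distinct device nodes addressable with the given bit count."""
    return 1 << bits


total_bits = sum(SCSI_FIELD_BITS.values())
all_nodes = node_count(total_bits)

# Reduced case: one host, one bus, one LUN per target, so only id + partition remain.
reduced_bits = SCSI_FIELD_BITS["id"] + SCSI_FIELD_BITS["partition"]
reduced_nodes = node_count(reduced_bits)
reduced_kbytes = reduced_nodes * INODE_BYTES // 1024

print("all fields:", total_bits, "bits,", all_nodes // (1024 * 1024), "Mega inodes")
print("id+partition:", reduced_bits, "bits,", reduced_nodes, "inodes,", reduced_kbytes, "kBytes")
```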
212
213Another issue is the time taken to look up an inode when first
214referenced. This costs not only the time to scan through a list in
215memory, but also the seek times to read the inodes off disc.
216This could be solved in user-space using a clever programme which
217scanned the kernel logs and deleted /dev entries which are not
218available and created them when they were available. This programme
219would need to be run every time a new module was loaded, which would
220slow things down a lot.
221
222There is an existing programme called scsidev which will automatically
223create device nodes for SCSI devices. It can do this by scanning files
224in /proc/scsi. Unfortunately, to extend this idea to other device
225nodes would require significant modifications to existing drivers (so
226they too would provide information in /proc). This is a non-trivial
227change (I should know: devfs has had to do something similar). Once
228you go to this much effort, you may as well use devfs itself (which
229also provides this information). Furthermore, such a system would
230likely be implemented in an ad-hoc fashion, as different drivers will
231provide their information in different ways.
232
233Devfs is much cleaner, because it (naturally) has a uniform mechanism
234to provide this information: the device nodes themselves!
235
236
237Node to driver file_operations translation
238
239There is an important difference between the way disc-based character
240and block nodes and devfs entries make the connection between an entry
241in /dev and the actual device driver.
242
243With the current 8 bit major and minor numbers the connection between
244disc-based c&b nodes and per-major drivers is done through a
245fixed-length table of 128 entries. The various filesystem types set
246the inode operations for c&b nodes to {chr,blk}dev_inode_operations,
247so when a device is opened a few quick levels of indirection bring us
248to the driver file_operations.
249
250For miscellaneous character devices a second step is required: there
251is a scan for the driver entry with the same minor number as the file
252that was opened, and the appropriate minor open method is called. This
253scanning is done *every time* you open a device node. Potentially, you
254may be searching through dozens of misc. entries before you find your
255open method. While not an enormous performance overhead, this does
256seem pointless.
257
258Linux *must* move beyond the 8 bit major and minor barrier,
259somehow. If we simply increase each to 16 bits, then the indexing
260scheme used for major driver lookup becomes untenable, because the
261major tables (one each for character and block devices) would need to
262be 64 k entries long (512 kBytes on x86, 1 MByte for 64 bit
263systems). So we would have to use a scheme like that used for
264miscellaneous character devices, which means the search time goes up
265linearly with the average number of major device drivers on your
266system. Not all "devices" are hardware; some are higher-level drivers
267like KGI, so you can get more "devices" without adding hardware.
268You can improve this by creating an ordered (balanced:-)
269binary tree, in which case your search time becomes log(N).
270Alternatively, you can use hashing to speed up the search.
271But why do that search at all if you don't have to? Once again, it
272seems pointless.
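
The trade-off described above can be illustrated with a toy Python model. Nothing here is kernel code: the driver names and major numbers are made up, and the entry sizes of 8 and 16 bytes are an assumption chosen to reproduce the 512 kByte and 1 MByte figures quoted in the text.

```python
# Toy model of three ways to get from a major number to a driver.
# "Drivers" are just strings here; the majors used are arbitrary examples.

def table_bytes(major_bits, bytes_per_entry):
    """Size of a directly indexed table with one slot per possible major."""
    return (1 << major_bits) * bytes_per_entry


def linear_lookup(drivers, major):
    """Scan a list of (major, driver) pairs; return (driver, steps taken)."""
    for steps, (m, drv) in enumerate(drivers, start=1):
        if m == major:
            return drv, steps
    return None, len(drivers)


registered = [(m, "driver-%d" % m) for m in (4, 8, 10, 180, 1000, 40000)]
by_major = dict(registered)  # hashing: roughly one probe, whatever the list length

print(table_bytes(16, 8) // 1024, "kBytes with 8 byte entries")
print(table_bytes(16, 16) // 1024, "kBytes with 16 byte entries")
print(linear_lookup(registered, 40000))
print(by_major[40000])
```

The linear scan touches every registered driver in the worst case, the direct table pays for slots that are never used, and the hash avoids both at the cost of extra code. The devfs approach described next sidesteps the search altogether.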
273
274Note that devfs doesn't use the major&minor system. For devfs
275entries, the connection is done when you look up the /dev entry. When
276devfs_register() is called, an internal table is extended with
277the entry name and the file_operations. If the dentry cache doesn't
278have the /dev entry already, this internal table is scanned to get the
279file_operations, and an inode is created. If the dentry cache already
280has the entry, there is *no lookup time* (other than the dentry scan
281itself, but we can't avoid that anyway, and besides Linux dentries
282cream other OS's which don't have them:-). Furthermore, the number of
283node entries in a devfs is only the number of available device
284entries, not the number of *conceivable* entries. Even if you remove
285unnecessary entries in a disc-based /dev, the number of conceivable
286entries remains the same: you just limit yourself in order to save
287space.
288
289Devfs provides a fast connection between a VFS node and the device
290driver, in a scalable way.
291
292/dev as a system administration tool
293
294Right now /dev contains a list of conceivable devices, most of which I
295don't have. Devfs only shows those devices available on my
296system. This means that listing /dev is a handy way of checking what
297devices are available.
298
299Major&minor size
300
301Existing major and minor numbers are limited to 8 bits each. This is
302now a limiting factor for some drivers, particularly the SCSI disc
303driver, which consumes a single major number. Only 16 discs are
304supported, and each disc may have only 15 partitions. Maybe this isn't
305a problem for you, but some of us are building huge Linux systems with
306disc arrays. With devfs an arbitrary pointer can be associated with
307each device entry, which can be used to give an effective 32 bit
308device identifier (i.e. that's like having a 32 bit minor
309number). Since this is private to the kernel, there are no C library
310compatibility issues which you would have with increasing major and
311minor number sizes. See the section on "Allocation of Device Numbers"
312for details on maintaining compatibility with userspace.
313
314Solving this requires a kernel change.
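
As a rough sketch of where the limits of 16 discs and 15 partitions come from: assume the 8 bit minor number is split so that the low 4 bits select the partition and the high 4 bits select the disc. The split and the helper name sd_minor() are assumptions made for illustration; the text only states the resulting limits.

```python
MINOR_BITS = 8        # size of a minor number, per the text
PARTITION_BITS = 4    # assumed split: low 4 bits select the partition

minors_per_disc = 1 << PARTITION_BITS                 # includes the whole-disc node
discs_per_major = 1 << (MINOR_BITS - PARTITION_BITS)
partitions_per_disc = minors_per_disc - 1              # partition 0 stands for the disc itself


def sd_minor(disc, partition):
    """Minor number for a disc index and partition (0 = whole disc)."""
    if not 0 <= disc < discs_per_major:
        raise ValueError("disc index does not fit in one major")
    if not 0 <= partition < minors_per_disc:
        raise ValueError("partition number does not fit")
    return (disc << PARTITION_BITS) | partition


print(discs_per_major, "discs,", partitions_per_disc, "partitions each")
print(sd_minor(1, 1))
```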
315
316Since writing this, the kernel has been modified so that the SCSI disc
317driver has more major numbers allocated to it and now supports up to
318128 discs. Since these major numbers are non-contiguous (a result of
319unplanned expansion), the implementation is a little more cumbersome
320than originally.
321
322Just like the changes to IPv4 to fix impending limitations in the
323address space, people find ways around the limitations. In the long
324run, however, solutions like IPv6 or devfs can't be put off forever.
325
326Read-only root filesystem
327
328Having your device nodes on the root filesystem means that you can't
329operate properly with a read-only root filesystem. This is because you
330want to change ownerships and protections of tty devices. Existing
331practice prevents you using a CD-ROM as your root filesystem for a
332*real* system. Sure, you can boot off a CD-ROM, but you can't change
333tty ownerships, so it's only good for installing.
334
335Also, you can't use a shared NFS root filesystem for a cluster of
336discless Linux machines (having tty ownerships changed on a common
337/dev is not good). Nor can you embed your root filesystem in a
338ROM-FS.
339
340You can get around this by creating a RAMDISC at boot time, making
341an ext2 filesystem in it, mounting it somewhere and copying the
342contents of /dev into it, then unmounting it and mounting it over
343/dev.
344
345A devfs is a cleaner way of solving this.
346
347Non-Unix root filesystem
348
349Non-Unix filesystems (such as NTFS) can't be used for a root
350filesystem because they variously don't support character and block
351special files or symbolic links. You can't have a separate disc-based
352or RAMDISC-based filesystem mounted on /dev because you need device
353nodes before you can mount these. Devfs can be mounted without any
354device nodes. Devlinks won't work because symlinks aren't supported.
355An alternative solution is to use initrd to mount a RAMDISC initial
356root filesystem (which is populated with a minimal set of device
357nodes), and then construct a new /dev in another RAMDISC, and finally
358switch to your non-Unix root filesystem. This requires clever boot
359scripts and a fragile and conceptually complex boot procedure.
360
361Devfs solves this in a robust and conceptually simple way.
362
363PTY security
364
365Current pseudo-tty (pty) devices are owned by root and read-writable
366by everyone. The user of a pty-pair cannot change
367ownership/protections without being suid-root.
368
369This could be solved with a secure user-space daemon which runs as
370root and does the actual creation of pty-pairs. Such a daemon would
371require modification to *every* programme that wants to use this new
372mechanism. It also slows down creation of pty-pairs.
373
374An alternative is to create a new open_pty() syscall which does much
375the same thing as the user-space daemon. Once again, this requires
376modifications to pty-handling programmes.
377
378The devfs solution allows a device driver to "tag" certain device
379files so that when an unopened device is opened, the ownerships are
380changed to the current euid and egid of the opening process, and the
381protections are changed to the default registered by the driver. When
382the device is closed ownership is set back to root and protections are
383set back to read-write for everybody. No programme need be changed.
384The devpts filesystem provides this auto-ownership feature for Unix98
385ptys. It doesn't support old-style pty devices, nor does it have all
386the other features of devfs.
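
The auto-ownership behaviour can be modelled in a few lines of Python. This is a user-space toy, not the devfs implementation: the class name is hypothetical, and it assumes that "closed" means the last open file handle has gone away.

```python
class TaggedDevice:
    """Toy model of a device entry tagged for automatic ownership changes."""

    def __init__(self, default_mode):
        self.default_mode = default_mode   # protections registered by the driver
        self.uid, self.gid, self.mode = 0, 0, 0o666
        self.open_count = 0

    def open(self, euid, egid):
        # Only the first opener of an unopened device takes ownership.
        if self.open_count == 0:
            self.uid, self.gid, self.mode = euid, egid, self.default_mode
        self.open_count += 1

    def close(self):
        self.open_count -= 1
        # Assumed: ownership reverts once nobody has the device open.
        if self.open_count == 0:
            self.uid, self.gid, self.mode = 0, 0, 0o666


pty = TaggedDevice(default_mode=0o620)
pty.open(euid=500, egid=5)
print(pty.uid, pty.gid, oct(pty.mode))
pty.close()
print(pty.uid, pty.gid, oct(pty.mode))
```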
387
388Intelligent device management
389
390Devfs implements a simple yet powerful protocol for communication with
391a device management daemon (devfsd) which runs in user space. It is
392possible to send a message (either synchronously or asynchronously) to
393devfsd on any event, such as registration/unregistration of device
394entries, opening and closing devices, looking up inodes, scanning
395directories and more. This has many possibilities. Some of these are
396already implemented. See:
397
398
399http://www.atnf.csiro.au/~rgooch/linux/
400
401Device entry registration events can be used by devfsd to change
402permissions of newly-created device nodes. This is one mechanism to
403control device permissions.
404
405Device entry registration/unregistration events can be used to run
406programmes or scripts. This can be used to provide automatic mounting
407of filesystems when new block device media is inserted into the
408drive.
409
410Asynchronous device open and close events can be used to implement
411clever permissions management. For example, the default permissions on
412/dev/dsp do not allow everybody to read from the device. This is
413sensible, as you don't want some remote user recording what you say at
414your console. However, the console user is also prevented from
415recording. This behaviour is not desirable. With asynchronous device
416open and close events, you can have devfsd run a programme or script
417when console devices are opened to change the ownerships for *other*
418device nodes (such as /dev/dsp). On closure, you can run a different
419script to restore permissions. An advantage of this scheme over
420modifying the C library tty handling is that this works even if your
421programme crashes (how many times have you seen the utmp database with
422lingering entries for non-existent logins?).
423
424Synchronous device open events can be used to perform intelligent
425device access protections. Before the device driver open() method is
426called, the daemon must first validate the open attempt, by running an
427external programme or script. This is far more flexible than access
428control lists, as access can be determined on the basis of other
429system conditions instead of just the UID and GID.
430
431Inode lookup events can be used to authenticate module autoload
432requests. Instead of using kmod directly, the event is sent to
433devfsd which can implement an arbitrary authentication before loading
434the module itself.
435
436Inode lookup events can also be used to construct arbitrary
437namespaces, without having to resort to populating devfs with symlinks
438to devices that don't exist.
439
440Speculative Device Scanning
441
442Consider an application (like cdparanoia) that wants to find all
443CD-ROM devices on the system (SCSI, IDE and other types), whether or
444not their respective modules are loaded. The application must
445speculatively open certain device nodes (such as /dev/sr0 for the SCSI
446CD-ROMs) in order to make sure the module is loaded. This requires
447that all Linux distributions follow the standard device naming scheme
448(last time I looked RedHat did things differently). Devfs solves the
449naming problem.
450
451The same application also wants to see which devices are actually
452available on the system. With the existing system it needs to read the
453/dev directory and speculatively open each /dev/sr* device to
454determine if the device exists or not. With a large /dev this is an
455inefficient operation, especially if there are many /dev/sr* nodes. A
456solution like scsidev could reduce the number of /dev/sr* entries (but
457of course that also requires all that inefficient directory scanning).
458
459With devfs, the application can open the /dev/sr directory
460(which triggers the module autoloading if required), and proceed to
461read /dev/sr. Since only the available devices will have
462entries, there are no inefficiencies in directory scanning or device
463openings.
464
465-----------------------------------------------------------------------------
466
467Who else does it?
468
469FreeBSD has a devfs implementation. Solaris and AIX each have a
470pseudo-devfs (something akin to scsidev but for all devices, with some
471unspecified kernel support). BeOS, Plan9 and QNX also have it. SGI's
472IRIX 6.4 and above also have a device filesystem.
473
474While we shouldn't just automatically do something because others do
475it, we should not ignore the work of others either. FreeBSD has a lot
476of competent people working on it, so their opinion should not be
477blithely ignored.
478
479-----------------------------------------------------------------------------
480
481
482How it works
483
484Registering device entries
485
486For every entry (device node) in a devfs-based /dev a driver must call
487devfs_register(). This adds the name of the device entry, the
488file_operations structure pointer and a few other things to an
489internal table. Device entries may be added and removed at any
490time. When a device entry is registered, it automagically appears in
491any mounted devfs'.
492
493Inode lookup
494
495When a lookup operation is performed on an entry for which there is
496no driver information, devfs will attempt to call
497devfsd. If still no driver information can be found then a negative
498dentry is yielded and the next stage operation will be called by the
499VFS (such as create() or mknod() inode methods). If driver information
500can be found, an inode is created (if one does not exist already) and
501all is well.
502
503Manually creating device nodes
504
505The mknod() method allows you to create an ordinary named pipe in the
506devfs, or you can create a character or block special inode if one
507does not already exist. You may wish to create a character or block
508special inode so that you can set permissions and ownership. Later, if
509a device driver registers an entry with the same name, the
510permissions, ownership and times are retained. This is how you can set
511the protections on a device even before the driver is loaded. Once you
512create an inode it appears in the directory listing.
513
514Unregistering device entries
515
516A device driver calls devfs_unregister() to unregister an entry.
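
The registration, lookup and mknod() behaviour described above can be summarised in a small Python model. It is only a sketch of the described semantics: ToyDevfs, its methods and the fake_devfsd() callback are hypothetical names, strings stand in for file_operations pointers, and the model creates the inode at registration time for simplicity. The text does not say what happens to an inode on unregistration, so the model simply leaves it alone.

```python
class ToyDevfs:
    """Very small user-space model of the behaviour described above."""

    def __init__(self, ask_daemon=None):
        self.drivers = {}             # entry name -> file_operations stand-in
        self.inodes = {}              # entry name -> permissions and ownership
        self.ask_daemon = ask_daemon  # optional callback playing the role of devfsd

    def register(self, name, fops, mode=0o600, uid=0):
        self.drivers[name] = fops
        # Permissions set earlier through mknod() are retained;
        # otherwise the driver's defaults are used.
        self.inodes.setdefault(name, {"mode": mode, "uid": uid})

    def unregister(self, name):
        self.drivers.pop(name, None)

    def mknod(self, name, mode, uid):
        if name in self.inodes:
            raise FileExistsError(name)
        self.inodes[name] = {"mode": mode, "uid": uid}

    def lookup(self, name):
        if name not in self.drivers and self.ask_daemon is not None:
            self.ask_daemon(self, name)   # the daemon may cause a driver to register
        if name not in self.drivers:
            return None                   # stands in for a negative dentry
        return self.drivers[name], self.inodes[name]


def fake_devfsd(fs, name):
    # Pretend that looking up this one name loads a module which registers it.
    if name == "sound/dsp":
        fs.register(name, "dsp_fops")


fs = ToyDevfs(ask_daemon=fake_devfsd)
fs.mknod("sound/dsp", mode=0o660, uid=500)   # set permissions before the driver loads
print(fs.lookup("sound/dsp"))
print(fs.lookup("no/such/device"))
```

The first lookup finds no driver, calls the daemon stand-in, and then returns the driver together with the permissions set earlier by mknod(). The second lookup yields nothing, which is where the VFS would go on to call create() or mknod().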
517
518Chroot() gaols
519
5202.2.x kernels
521
522The semantics of inode creation are different when devfs is mounted
523with the "explicit" option. Now, when a device entry is registered, it
524will not appear until you use mknod() to create the device. It doesn't
525matter if you mknod() before or after the device is registered with
526devfs_register(). The purpose of this behaviour is to support
527chroot(2) gaols, where you want to mount a minimal devfs inside the
528gaol. Only the devices you specifically want to be available (through
529your mknod() setup) will be accessible.
530
5312.4.x kernels
532
533As of kernel 2.3.99, the VFS has had the ability to rebind parts of
534the global filesystem namespace into another part of the namespace.
535This now works even at the leaf-node level, which means that
536individual files and device nodes may be bound into other parts of the
537namespace. This is like making links, but better, because it works
538across filesystems (unlike hard links) and works through chroot()
539gaols (unlike symbolic links).
540
541Because of these improvements to the VFS, the multi-mount capability
542in devfs is no longer needed. The administrator may create a minimal
543device tree inside a chroot(2) gaol by using VFS bindings. As this
544provides most of the features of the devfs multi-mount capability, I
545removed the multi-mount support code (after issuing an RFC). This
546yielded code size reductions and simplifications.
547
548If you want to construct a minimal chroot() gaol, the following
549command should suffice:
550
551mount --bind /dev/null /gaol/dev/null
552
553
554Repeat for other device nodes you want to expose. Simple!
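
If several nodes are wanted, the repetition can be scripted. The following Python sketch only builds and prints the command lines, it does not run them; the gaol path and the list of nodes are examples, and each target file is assumed to exist already inside the gaol, since a bind mount needs an existing mount point.

```python
import shlex


def bind_commands(gaol_root, nodes):
    """Return one 'mount --bind' argument list per device node."""
    commands = []
    for node in nodes:
        target = gaol_root.rstrip("/") + node
        commands.append(["mount", "--bind", node, target])
    return commands


for argv in bind_commands("/gaol", ["/dev/null", "/dev/zero", "/dev/tty"]):
    print(" ".join(shlex.quote(arg) for arg in argv))
```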
555
556-----------------------------------------------------------------------------
557
558
559Operational issues
560
561
562Instructions for the impatient
563
564Nobody likes reading documentation. People just want to get in there
565and play. So this section tells you quickly the steps you need to take
566to run with devfs mounted over /dev. Skip these steps and you will end
567up with a nearly unbootable system. Subsequent sections describe the
568issues in more detail, and discuss non-essential configuration
569options.
570
571Devfsd
572OK, if you're reading this, I assume you want to play with
573devfs. First you should ensure that /usr/src/linux contains a
574recent kernel source tree. Then you need to compile devfsd, the device
575management daemon, available at
576
577http://www.atnf.csiro.au/~rgooch/linux/.
578Because the kernel has a naming scheme
579which is quite different from the old naming scheme, you need to
580install devfsd so that software and configuration files that use the
581old naming scheme will not break.
582
583Compile and install devfsd. You will be provided with a default
584configuration file /etc/devfsd.conf which will provide
585compatibility symlinks for the old naming scheme. Don't change this
586config file unless you know what you're doing. Even if you think you
587do know what you're doing, don't change it until you've followed all
588the steps below and booted a devfs-enabled system and verified that it
589works.
590
591Now edit your main system boot script so that devfsd is started at the
592very beginning (before any filesystem
593checks). /etc/rc.d/rc.sysinit is often the main boot script
594on systems with SysV-style boot scripts. On systems with BSD-style
595boot scripts it is often /etc/rc. Also check
596/sbin/rc.
597
598NOTE that the line you put into the boot
599script should be exactly:
600
601/sbin/devfsd /dev
602
603DO NOT use some special daemon-launching
604programme, otherwise the boot script may not wait for devfsd to finish
605initialising.
606
607System Libraries
608There may still be some problems because of broken software making
609assumptions about device names. In particular, some software does not
610handle devices which are symbolic links. If you are running a libc 5
611based system, install libc 5.4.44 (if you have libc 5.4.46, go back to
612libc 5.4.44, which is actually correct). If you are running a glibc
613based system, make sure you have glibc 2.1.3 or later.
614
615/etc/securetty
616PAM (Pluggable Authentication Modules) is supposed to be a flexible
617mechanism for providing better user authentication and access to
618services. Unfortunately, it's also fragile, complex and undocumented
619(check out RedHat 6.1, and probably other distributions as well). PAM
620has problems with symbolic links. Append the following lines to your
621/etc/securetty file:
622
623vc/1
624vc/2
625vc/3
626vc/4
627vc/5
628vc/6
629vc/7
630vc/8
631
632This will not weaken security. If you have a version of util-linux
633earlier than 2.10.h, please upgrade to 2.10.h or later. If you
634absolutely cannot upgrade, then also append the following lines to
635your /etc/securetty file:
636
6371
6382
6393
6404
6415
6426
6437
6448
645
646This may potentially weaken security by allowing root logins over the
647network (a password is still required, though). However, since there
648are problems with dealing with symlinks, I'm suspicious of the level
649of security offered in any case.
650
651XFree86
652While not essential, it's probably a good idea to upgrade to XFree86
6534.0, as patches went in to make it more devfs-friendly. If you don't,
654you'll probably need to apply the following patch to
655/etc/security/console.perms so that ordinary users can run
656startx. Note that not all distributions have this file (e.g. Debian),
657so if it's not present, don't worry about it.
658
659--- /etc/security/console.perms.orig Sat Apr 17 16:26:47 1999
660+++ /etc/security/console.perms Fri Feb 25 23:53:55 2000
661@@ -14,7 +14,7 @@
662 # man 5 console.perms
663
664 # file classes -- these are regular expressions
665-<console>=tty[0-9][0-9]* :[0-9]\.[0-9] :[0-9]
666+<console>=tty[0-9][0-9]* vc/[0-9][0-9]* :[0-9]\.[0-9] :[0-9]
667
668 # device classes -- these are shell-style globs
669 <floppy>=/dev/fd[0-1]*
670
671If the patch does not apply, then change the line:
672
673<console>=tty[0-9][0-9]* :[0-9]\.[0-9] :[0-9]
674
675to:
676
677<console>=tty[0-9][0-9]* vc/[0-9][0-9]* :[0-9]\.[0-9] :[0-9]
678
679
680Disable devpts
681I've had a report of devpts mounted on /dev/pts not working
682correctly. Since devfs will also manage /dev/pts, there is no
683need to mount devpts as well. You should either edit your
684/etc/fstab so devpts is not mounted, or disable devpts from
685your kernel configuration.
686
687Unsupported drivers
688Not all drivers have devfs support. If you depend on one of these
689drivers, you will need to create a script or tarfile that you can use
690at boot time to create device nodes as appropriate. There is a
691section which describes this. Another
692section lists the drivers which have
693devfs support.
694
695/dev/mouse
696
697Many distributions configure /dev/mouse to be the mouse device
698for XFree86 and GPM. I actually think this is a bad idea, because it
699adds another level of indirection. When looking at a config file, if
700you see /dev/mouse you're left wondering which mouse
701is being referred to. Hence I recommend putting the actual mouse
702device (for example /dev/psaux) into your
703/etc/X11/XF86Config file (and similarly for the GPM
704configuration file).
705
706Alternatively, use the same technique used for unsupported drivers
707described above.
708
709The Kernel
710Finally, you need to make sure devfs is compiled into your kernel. Set
711CONFIG_EXPERIMENTAL=y, CONFIG_DEVFS_FS=y and CONFIG_DEVFS_MOUNT=y by
712using your favourite configuration tool (e.g. make config or
713make xconfig) and then make clean and then recompile your kernel and
714modules. At boot, devfs will be mounted onto /dev.
715
716If you encounter problems booting (for example if you forgot a
717configuration step), you can pass devfs=nomount at the kernel
718boot command line. This will prevent the kernel from mounting devfs at
719boot time onto /dev.
720
721In general, a kernel built with CONFIG_DEVFS_FS=y but without mounting
722devfs onto /dev is completely safe, and requires no
723configuration changes. One exception to take note of is when
724LABEL= directives are used in /etc/fstab. In this
725case you will be unable to boot properly. This is because the
726mount(8) programme uses /proc/partitions as part of
727the volume label search process, and the device names it finds are not
728available, because setting CONFIG_DEVFS_FS=y changes the names in
729/proc/partitions, irrespective of whether devfs is mounted.
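
Before booting such a kernel it may be worth checking whether your /etc/fstab is affected. A minimal Python sketch of such a check follows; the function name and the sample file contents are made up for illustration, and on a real system you would pass in the contents of /etc/fstab instead.

```python
def label_entries(fstab_text):
    """Return the fstab lines whose first field starts with LABEL=."""
    hits = []
    for line in fstab_text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        if fields[0].startswith("LABEL="):
            hits.append(line.strip())
    return hits


SAMPLE_FSTAB = """\
# <device>   <mount point>  <type>  <options>  <dump> <pass>
LABEL=/      /              ext2    defaults   1 1
/dev/hda2    swap           swap    defaults   0 0
LABEL=/home  /home          ext2    defaults   1 2
"""

for entry in label_entries(SAMPLE_FSTAB):
    print("uses a volume label:", entry)
```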
730
731Now you've finished all the steps required. You're now ready to boot
732your shiny new kernel. Enjoy.
733
734Changing the configuration
735
736OK, you've now booted a devfs-enabled system, and everything works.
737Now you may feel like changing the configuration (common targets are
738/etc/fstab and /etc/devfsd.conf). Since you have a
739system that works, if you make any changes and it doesn't work, you
740now know that you only have to restore your configuration files to the
741default and it will work again.
742
743
744Permissions persistence across reboots
745
746If you don't use mknod(2) to create a device file, nor use chmod(2) or
747chown(2) to change the ownerships/permissions, the inode ctime will
748remain at 0 (the epoch, 12 am, 1-JAN-1970, GMT). Anything with a ctime
749later than this has had its ownership/permissions changed. Hence, a
750simple script or programme may be used to tar up all changed inodes,
751prior to shutdown. Although effective, many consider this approach a
752kludge.
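
A sketch of the "simple script" mentioned above, in Python. It only collects the names of entries whose ctime is later than the epoch; how the list is then archived is left open. The function name is hypothetical, and the lstat and walk parameters exist only so the logic can be tried without a mounted devfs. On an ordinary disc-based filesystem every inode has a non-zero ctime, so the function would simply return everything.

```python
import os


def changed_entries(root, lstat=os.lstat, walk=os.walk):
    """Return paths under root whose inode ctime is later than the epoch."""
    changed = []
    for dirpath, dirnames, filenames in walk(root):
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            # A ctime of 0 means nobody has used mknod, chmod or chown on it.
            if lstat(path).st_ctime > 0:
                changed.append(path)
    return changed


if __name__ == "__main__":
    for path in changed_entries("/dev"):
        print(path)
```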
753
754A much better approach is to use devfsd to save and restore
755permissions. It may be configured to record changes in permissions and
756will save them in a database (in fact a directory tree), and restore
757these upon boot. This is an efficient method and results in immediate
758saving of current permissions (unlike the tar approach, which saves
759permissions at some unspecified future time).
760
761The default configuration file supplied with devfsd has config entries
762which you may uncomment to enable persistence management.
763
764If you decide to use the tar approach anyway, be aware that tar will
765first unlink(2) an inode before creating a new device node. The
766unlink(2) has the effect of breaking the connection between a devfs
767entry and the device driver. If you use the "devfs=only" boot option,
768you lose access to the device driver, requiring you to reload the
769module. I consider this a bug in tar (there is no real need to
770unlink(2) the inode first).
771
772Alternatively, you can use devfsd to provide more sophisticated
773management of device permissions. You can use devfsd to store
774permissions for whole groups of devices with a single configuration
775entry, rather than the conventional single entry per device entry.
776
777Permissions database stored in mounted-over /dev
778
779If you wish to save and restore your device permissions into the
780disc-based /dev while still mounting devfs onto /dev
781you may do so. This requires a 2.4.x kernel (in fact, 2.3.99 or
782later), which has the VFS binding facility. You need to do the
783following to set this up:
784
785
786
787make sure the kernel does not mount devfs at boot time
788
789
790make sure you have a correct /dev/console entry in your
791root file-system (where your disc-based /dev lives)
792
793create the /dev-state directory
794
795
796add the following lines near the very beginning of your boot
797scripts:
798
799mount --bind /dev /dev-state
800mount -t devfs none /dev
801devfsd /dev
802
803
804
805
806add the following lines to your /etc/devfsd.conf file:
807
808REGISTER ^pt[sy] IGNORE
809CREATE ^pt[sy] IGNORE
810CHANGE ^pt[sy] IGNORE
811DELETE ^pt[sy] IGNORE
812REGISTER .* COPY /dev-state/$devname $devpath
813CREATE .* COPY $devpath /dev-state/$devname
814CHANGE .* COPY $devpath /dev-state/$devname
815DELETE .* CFUNCTION GLOBAL unlink /dev-state/$devname
816RESTORE /dev-state
817
818Note that the sample devfsd.conf file contains these lines,
819as well as other sample configurations you may find useful. See the
820devfsd distribution.
821
822
823reboot.
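
To see which entries the ^pt[sy] pattern in the configuration above singles out, the following Python sketch applies the same regular expression to a few entry names. It assumes that devfsd matches the pattern against the entry name relative to the devfs root, and it uses Python's re module rather than whatever regex library devfsd is built with; for a pattern this simple the two should agree. The entry names are examples only.

```python
import re

PTY_PATTERN = re.compile(r"^pt[sy]")   # pattern copied from the configuration above


def is_pty_entry(devname):
    """True if devname would be matched by the ^pt[sy] pattern."""
    return PTY_PATTERN.search(devname) is not None


for name in ["pts/0", "pty/m0", "tts/0", "vc/1"]:
    print(name, is_pty_entry(name))
```

Names beginning with pts or pty are matched by the IGNORE lines; everything else falls through to the .* lines that copy state to and from /dev-state.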
824
825
826
827
828Permissions database stored in normal directory
829
830If you are using an older kernel which doesn't support VFS binding,
831then you won't be able to have the permissions database in a
832mounted-over /dev. However, you can still use a regular
833directory to store the database. The sample /etc/devfsd.conf
834file above may still be used. You will need to create the
835/dev-state directory prior to installing devfsd. If you have
836old permissions in /dev, then just copy (or move) the device
837nodes over to the new directory.
838
839Which method is better?
840
841The best method is to have the permissions database stored in the
842mounted-over /dev. This is because you will not need to copy
843device nodes over to /dev-state, and because it allows you to
844switch between devfs and non-devfs kernels, without requiring you to
845copy permissions between /dev-state (for devfs) and
846/dev (for non-devfs).
847
848
849Dealing with drivers without devfs support
850
851Currently, not all device drivers in the kernel have been modified to
852use devfs. Device drivers which do not yet have devfs support will not
853automagically appear in devfs. The simplest way to create device nodes
854for these drivers is to unpack a tarfile containing the required
855device nodes. You can do this in your boot scripts. All your drivers
856will now work as before.
857
858Hopefully for most people devfs will have enough support so that they
859can mount devfs directly over /dev without losing most functionality
860(i.e. losing access to various devices). As of 22-JAN-1998 (devfs
861patch version 10) I am now running this way. All the devices I have
862are available in devfs, so I don't lose anything.
863
864WARNING: if your configuration requires the old-style device names
865(e.g. /dev/hda1 or /dev/sda1), you must install devfsd and configure
866it to maintain compatibility entries. It is almost certain that you
867will require this. Note that the kernel creates a compatibility entry
868for the root device, so you don't need initrd.
869
870Note that you no longer need to mount devpts if you use Unix98 PTYs,
871as devfs can manage /dev/pts itself. This saves you some RAM, as you
872don't need to compile and install devpts. Note that some versions of
873glibc have a bug with Unix98 pty handling on devfs systems. Contact
874the glibc maintainers for a fix. Glibc 2.1.3 has the fix.
875
876Note also that apart from editing /etc/fstab, other things will need
877to be changed if you *don't* install devfsd. Some software (like the X
878server) hard-wire device names in their source. It really is much
879easier to install devfsd so that compatibility entries are created.
880You can then slowly migrate your system to using the new device names
881(for example, by starting with /etc/fstab), and then limiting the
882compatibility entries that devfsd creates.
883
884IF YOU CONFIGURE TO MOUNT DEVFS AT BOOT, MAKE SURE YOU INSTALL DEVFSD
885BEFORE YOU BOOT A DEVFS-ENABLED KERNEL!
886
887Now that devfs has gone into the 2.3.46 kernel, I'm getting a lot of
888reports back. Many of these are because people are trying to run
889without devfsd, and hence some things break. Please just run devfsd if
890things break. I want to concentrate on real bugs rather than
891misconfiguration problems at the moment. If people are willing to fix
892bugs/false assumptions in other code (e.g. glibc, X server) and submit
893that to the respective maintainers, that would be great.
894
895
896All the way with Devfs
897
898The devfs kernel patch creates a rationalised device tree. As stated
899above, if you want to keep using the old /dev naming scheme,
900you just need to configure devfsd appropriately (see the man
901page). People who prefer the old names can ignore this section. For
902those of us who like the rationalised names and an uncluttered
903/dev, read on.
904
905If you don't run devfsd, or don't enable compatibility entry
906management, then you will have to configure your system to use the new
907names. For example, you will then need to edit your
908/etc/fstab to use the new disc naming scheme. If you want to
909be able to boot non-devfs kernels, you will need compatibility
910symlinks in the underlying disc-based /dev pointing back to
911the old-style names for when you boot a kernel without devfs.
912
913You can selectively decide which devices you want compatibility
914entries for. For example, you may only want compatibility entries for
915BSD pseudo-terminal devices (otherwise you'll have to patch your C
916library or use Unix98 ptys instead). It's just a matter of putting
917the correct regular expression into /etc/devfsd.conf.
918
919There are other choices of naming schemes that you may prefer. For
920example, I don't use the kernel-supplied
921names, because they are too verbose. A common misconception is
922that the kernel-supplied names are meant to be used directly in
923configuration files. This is not the case. They are designed to
924reflect the layout of the devices attached and to provide easy
925classification.
926
927If you like the kernel-supplied names, that's fine. If you don't then
928you should be using devfsd to construct a namespace more to your
929liking. Devfsd has built-in code to construct a
930namespace that is both logical and easy to
931manage. In essence, it creates a convenient abbreviation of the
932kernel-supplied namespace.
933
934You are of course free to build your own namespace. Devfsd has all the
935infrastructure required to make this easy for you. All you need do is
936write a script. You can even write some C code and devfsd can load the
937shared object as a callable extension.
938
939
940Other Issues
941
942The init programme
943Another thing to take note of is whether your init programme
944creates a Unix socket /dev/telinit. Some versions of init
945create /dev/telinit so that the telinit programme can
946communicate with the init process. If you have such a system you need
947to make sure that devfs is mounted over /dev *before* init
948starts. In other words, you can't leave the mounting of devfs to
949/etc/rc, since this is executed after init. Other
950versions of init require a named pipe /dev/initctl
951which must exist *before* init starts. Once again, you need to
952mount devfs and then create the named pipe *before* init
953starts.
954
955The default behaviour now is not to mount devfs onto /dev at
956boot time for 2.3.x and later kernels. You can correct this with the
957"devfs=mount" boot option. This solves any problems with init,
958and also prevents the dreaded:
959
960Cannot open initial console
961
962message. For 2.2.x kernels where you need to apply the devfs patch,
963the default is to mount.
964
965If you have automatic mounting of devfs onto /dev then you
966may need to create /dev/initctl in your boot scripts. The
967following lines should suffice:
968
969mknod /dev/initctl p
970kill -SIGUSR1 1 # tell init that /dev/initctl now exists
971
972Alternatively, if you don't want the kernel to mount devfs onto
973/dev then you could use the following procedure as a
974guideline for how to get around /dev/initctl problems:
975
976# cd /sbin
977# mv init init.real
978# cat > init
979#! /bin/sh
980mount -n -t devfs none /dev
981mknod /dev/initctl p
982exec /sbin/init.real "$@"
983[control-D]
984# chmod a+x init
985
986Note that newer versions of init create /dev/initctl
987automatically, so you don't have to worry about this.
988
989Module autoloading
990You will need to configure devfsd to enable module
991autoloading. The following lines should be placed in your
992/etc/devfsd.conf file:
993
994LOOKUP .* MODLOAD
995
996
997As of devfsd-v1.3.10, a generic /etc/modules.devfs
998configuration file is installed, which is used by the MODLOAD
999action. This should be sufficient for most configurations. If you
1000require further configuration, edit your /etc/modules.conf
1001file. The way module autoloading works with devfs is:
1002
1003
1004a process attempts to lookup a device node (e.g. /dev/fred)
1005
1006
1007if that device node does not exist, the full pathname is passed to
1008devfsd as a string
1009
1010
1011devfsd will pass the string to the modprobe programme (provided the
1012configuration line shown above is present), and specifies that
1013/etc/modules.devfs is the configuration file
1014
1015
1016/etc/modules.devfs includes /etc/modules.conf to
1017access local configurations
1018
1019modprobe will search its configuration files, looking for an alias
1020that translates the pathname into a module name
1021
1022
1023the translated pathname is then used to load the module.
1024
1025
1026If you wanted a lookup of /dev/fred to load the
1027mymod module, you would require the following configuration
1028line in /etc/modules.conf:
1029
1030alias /dev/fred mymod
1031
1032The /etc/modules.devfs configuration file provides many such
1033aliases for standard device names. If you look closely at this file,
1034you will note that some modules require multiple alias configuration
1035lines. This is required to support module autoloading for old and new
1036device names.
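
The alias translation step can be sketched in shell. This is only an
illustration of the lookup, not how modprobe is implemented: the
lookup_alias helper is hypothetical, it matches pathnames exactly, and it
ignores include lines and any wildcard handling that modprobe may do.

```shell
#!/bin/sh
# Illustrative only: lookup_alias is a hypothetical helper, not part of
# modprobe or devfsd. It scans a configuration file for a line of the form
# "alias <pathname> <module>" and prints the module name of the first match.
lookup_alias() {
    # $1: pathname passed on by devfsd, $2: configuration file to search
    awk -v path="$1" '$1 == "alias" && $2 == path { print $3; exit }' "$2"
}

# Example: resolve /dev/fred against a one-line configuration file.
conf=$(mktemp)
echo 'alias /dev/fred mymod' > "$conf"
lookup_alias /dev/fred "$conf"    # prints mymod
rm -f "$conf"
```

If no alias matches, the helper prints nothing, which corresponds to
modprobe finding no module to load for that pathname.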
1037
1038Mounting root off a devfs device
1039If you wish to mount root off a devfs device when you pass the
1040"devfs=only" boot option, then you need to pass in the
1041"root=<device>" option to the kernel when booting. If you use
1042LILO, then you must have this in lilo.conf:
1043
1044append = "root=<device>"
1045
1046Surprised? Yep, so was I. It turns out if you have (as most people
1047do):
1048
1049root = <device>
1050
1051
1052then LILO will determine the device number of <device> and will
1053write that device number into a special place in the kernel image
1054before starting the kernel, and the kernel will use that device number
1055to mount the root filesystem. So, using the "append" variety ensures
1056that LILO passes the root filesystem device as a string, which devfs
1057can then use.
1058
1059Note that this isn't an issue if you don't pass "devfs=only".
1060
1061TTY issues
1062The ttyname(3) function in some versions of the C library makes
1063false assumptions about device entries which are symbolic links. The
1064tty(1) programme is one that depends on this function. I've
1065written a patch to libc 5.4.43 which fixes this. This has been
1066included in libc 5.4.44 and a similar fix is in glibc 2.1.3.
1067
1068
1069Kernel Naming Scheme
1070
1071The kernel provides a default naming scheme. This scheme is designed
1072to make it easy to search for specific devices or device types, and to
1073view the available devices. Some device types (such as hard discs),
1074have a directory of entries, making it easy to see what devices of
1075that class are available. Often, the entries are symbolic links into a
1076directory tree that reflects the topology of available devices. The
1077topological tree is useful for finding how your devices are arranged.
1078
1079Below is a list of the naming schemes for the most common drivers. A
1080list of reserved device names is
1081available for reference. Please send email to
1082rgooch@atnf.csiro.au to obtain an allocation. Please be
1083patient (the maintainer is busy). An alternative name may be allocated
1084instead of the requested name, at the discretion of the maintainer.
1085
1086Disc Devices
1087
1088All discs, whether SCSI, IDE or whatever, are placed under the
1089/dev/discs hierarchy:
1090
1091 /dev/discs/disc0 first disc
1092 /dev/discs/disc1 second disc
1093
1094
1095Each of these entries is a symbolic link to the directory for that
1096device. The device directory contains:
1097
1098 disc for the whole disc
1099 part* for individual partitions
1100
1101
1102CD-ROM Devices
1103
1104All CD-ROMs, whether SCSI, IDE or whatever, are placed under the
1105/dev/cdroms hierarchy:
1106
1107 /dev/cdroms/cdrom0 first CD-ROM
1108 /dev/cdroms/cdrom1 second CD-ROM
1109
1110
1111Each of these entries is a symbolic link to the real device entry for
1112that device.
1113
1114Tape Devices
1115
1116All tapes, whether SCSI, IDE or whatever, are placed under the
1117/dev/tapes hierarchy:
1118
1119 /dev/tapes/tape0 first tape
1120 /dev/tapes/tape1 second tape
1121
1122
1123Each of these entries is a symbolic link to the directory for that
1124device. The device directory contains:
1125
1126 mt for mode 0
1127 mtl for mode 1
1128 mtm for mode 2
1129 mta for mode 3
1130 mtn for mode 0, no rewind
1131 mtln for mode 1, no rewind
1132 mtmn for mode 2, no rewind
1133 mtan for mode 3, no rewind
1134
1135
1136SCSI Devices
1137
1138To uniquely identify any SCSI device requires the following
1139information:
1140
1141 controller (host adapter)
1142 bus (SCSI channel)
1143 target (SCSI ID)
1144 unit (Logical Unit Number)
1145
1146
1147All SCSI devices are placed under /dev/scsi (assuming devfs
1148is mounted on /dev). Hence, a SCSI device with the following
1149parameters: c=1,b=2,t=3,u=4 would appear as:
1150
1151 /dev/scsi/host1/bus2/target3/lun4 device directory
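
As a small sketch of how the four parameters compose into the directory
name, the following shell function builds the path shown above. The
function name scsi_dir is made up for this example, and it assumes devfs
is mounted on /dev.

```shell
#!/bin/sh
# Illustrative only: scsi_dir is a hypothetical helper.
# Arguments: controller (host), bus, target, unit (LUN).
scsi_dir() {
    printf '/dev/scsi/host%s/bus%s/target%s/lun%s\n' "$1" "$2" "$3" "$4"
}

scsi_dir 1 2 3 4    # prints /dev/scsi/host1/bus2/target3/lun4
```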
1152
1153
1154Inside this directory, a number of device entries may be created,
1155depending on which SCSI device-type drivers were installed.
1156
1157See the section on the disc naming scheme to see what entries the SCSI
1158disc driver creates.
1159
1160See the section on the tape naming scheme to see what entries the SCSI
1161tape driver creates.
1162
1163The SCSI CD-ROM driver creates:
1164
1165 cd
1166
1167
1168The SCSI generic driver creates:
1169
1170 generic
1171
1172
1173IDE Devices
1174
1175To uniquely identify any IDE device requires the following
1176information:
1177
1178 controller
1179 bus (aka. primary/secondary)
1180 target (aka. master/slave)
1181 unit
1182
1183
1184All IDE devices are placed under /dev/ide, and use a similar
1185naming scheme to the SCSI subsystem.
1186
1187XT Hard Discs
1188
1189All XT discs are placed under /dev/xd. The first XT disc has
1190the directory /dev/xd/disc0.
1191
1192TTY devices
1193
1194The tty devices now appear as:
1195
1196 New name Old-name Device Type
1197 -------- -------- -----------
1198 /dev/tts/{0,1,...} /dev/ttyS{0,1,...} Serial ports
1199 /dev/cua/{0,1,...} /dev/cua{0,1,...} Call out devices
1200 /dev/vc/0 /dev/tty Current virtual console
1201 /dev/vc/{1,2,...} /dev/tty{1...63} Virtual consoles
1202 /dev/vcc/{0,1,...} /dev/vcs{1...63} Virtual consoles
1203 /dev/pty/m{0,1,...} /dev/ptyp?? PTY masters
1204 /dev/pty/s{0,1,...} /dev/ttyp?? PTY slaves
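
The table can be expressed as a shell lookup for scripts that need to
translate old names. This is a sketch only: tty_new_name is a hypothetical
helper, and it covers just the serial, call out and virtual console rows,
because the table does not spell out how the vcs and pty numbering maps
across.

```shell
#!/bin/sh
# Illustrative only: tty_new_name is a hypothetical helper that maps a few
# old-style tty names to the names listed in the table above.
tty_new_name() {
    case "$1" in
        /dev/ttyS[0-9]*) echo "/dev/tts/${1#/dev/ttyS}" ;;   # serial ports
        /dev/cua[0-9]*)  echo "/dev/cua/${1#/dev/cua}" ;;    # call out devices
        /dev/tty)        echo /dev/vc/0 ;;                   # current virtual console
        /dev/tty[0-9]*)  echo "/dev/vc/${1#/dev/tty}" ;;     # virtual consoles
        *)               return 1 ;;                         # not covered here
    esac
}

tty_new_name /dev/ttyS0    # prints /dev/tts/0
tty_new_name /dev/tty7     # prints /dev/vc/7
```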
1205
1206
1207RAMDISCS
1208
1209The RAMDISCS are placed in their own directory, and are named thus:
1210
1211 /dev/rd/{0,1,2,...}
1212
1213
1214Meta Devices
1215
1216The meta devices are placed in their own directory, and are named
1217thus:
1218
1219 /dev/md/{0,1,2,...}
1220
1221
1222Floppy discs
1223
1224Floppy discs are placed in the /dev/floppy directory.
1225
1226Loop devices
1227
1228Loop devices are placed in the /dev/loop directory.
1229
1230Sound devices
1231
1232Sound devices are placed in the /dev/sound directory
1233(audio, sequencer, ...).
1234
1235
1236Devfsd Naming Scheme
1237
1238Devfsd provides a naming scheme which is a convenient abbreviation of
1239the kernel-supplied namespace. In some
1240cases, the kernel-supplied naming scheme is quite convenient, so
1241devfsd does not provide another naming scheme. The convenience names
1242that devfsd creates are in fact the same names as the original devfs
1243kernel patch created (before Linus mandated the Big Name
1244Change). These are referred to as "new compatibility entries".
1245
1246In order to configure devfsd to create these convenience names, the
1247following lines should be placed in your /etc/devfsd.conf:
1248
1249REGISTER .* MKNEWCOMPAT
1250UNREGISTER .* RMNEWCOMPAT
1251
1252This will cause devfsd to create (and destroy) symbolic links which
1253point to the kernel-supplied names.
1254
1255SCSI Hard Discs
1256
1257All SCSI discs are placed under /dev/sd (assuming devfs is
1258mounted on /dev). Hence, a SCSI disc with the following
1259parameters: c=1,b=2,t=3,u=4 would appear as:
1260
1261 /dev/sd/c1b2t3u4 for the whole disc
1262 /dev/sd/c1b2t3u4p5 for the 5th partition
1263 /dev/sd/c1b2t3u4p5s6 for the 6th slice in the 5th partition
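
To show how these names abbreviate the kernel-supplied ones, here is a
rough shell sketch. The scsi_short_name function is hypothetical and is
not how devfsd does it internally. It assumes the kernel-supplied "disc"
entry corresponds to the whole-disc name and "partN" to the "pN" suffix,
and it does not handle slices.

```shell
#!/bin/sh
# Illustrative only: scsi_short_name is a hypothetical helper.
# $1: kernel-supplied name relative to /dev, for example
#     scsi/host1/bus2/target3/lun4/disc or .../part5
scsi_short_name() {
    echo "$1" | sed -n \
        -e 's|^scsi/host\([0-9][0-9]*\)/bus\([0-9][0-9]*\)/target\([0-9][0-9]*\)/lun\([0-9][0-9]*\)/disc$|/dev/sd/c\1b\2t\3u\4|p' \
        -e 's|^scsi/host\([0-9][0-9]*\)/bus\([0-9][0-9]*\)/target\([0-9][0-9]*\)/lun\([0-9][0-9]*\)/part\([0-9][0-9]*\)$|/dev/sd/c\1b\2t\3u\4p\5|p'
}

scsi_short_name scsi/host1/bus2/target3/lun4/disc     # prints /dev/sd/c1b2t3u4
scsi_short_name scsi/host1/bus2/target3/lun4/part5    # prints /dev/sd/c1b2t3u4p5
```

Names that do not match either pattern produce no output.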
1264
1265
1266SCSI Tapes
1267
1268All SCSI tapes are placed under /dev/st. A similar naming
1269scheme is used as for SCSI discs. A SCSI tape with the
1270parameters: c=1,b=2,t=3,u=4 would appear as:
1271
1272 /dev/st/c1b2t3u4m0 for mode 0
1273 /dev/st/c1b2t3u4m1 for mode 1
1274 /dev/st/c1b2t3u4m2 for mode 2
1275 /dev/st/c1b2t3u4m3 for mode 3
1276 /dev/st/c1b2t3u4m0n for mode 0, no rewind
1277 /dev/st/c1b2t3u4m1n for mode 1, no rewind
1278 /dev/st/c1b2t3u4m2n for mode 2, no rewind
1279 /dev/st/c1b2t3u4m3n for mode 3, no rewind
1280
1281
1282SCSI CD-ROMs
1283
1284All SCSI CD-ROMs are placed under /dev/sr. A similar naming
1285scheme is used as for SCSI discs. A SCSI CD-ROM with the
1286parameters: c=1,b=2,t=3,u=4 would appear as:
1287
1288 /dev/sr/c1b2t3u4
1289
1290
1291SCSI Generic Devices
1292
1293The generic (aka. raw) interfaces for all SCSI devices are placed under
1294/dev/sg. A similar naming scheme is used as for SCSI discs. A
1295SCSI generic device with the parameters: c=1,b=2,t=3,u=4 would appear
1296as:
1297
1298 /dev/sg/c1b2t3u4
1299
1300
1301IDE Hard Discs
1302
1303All IDE discs are placed under /dev/ide/hd, using a similar
1304convention to SCSI discs. The following mappings exist between the new
1305and the old names:
1306
1307 /dev/hda /dev/ide/hd/c0b0t0u0
1308 /dev/hdb /dev/ide/hd/c0b0t1u0
1309 /dev/hdc /dev/ide/hd/c0b1t0u0
1310 /dev/hdd /dev/ide/hd/c0b1t1u0
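
For scripts that need this translation, the four mappings above can be
written as a simple lookup. The function name ide_new_name is made up for
this example, and it deliberately covers only the four names listed here
rather than guessing at further controllers.

```shell
#!/bin/sh
# Illustrative only: ide_new_name is a hypothetical helper that encodes
# the four old-name to new-name mappings listed above.
ide_new_name() {
    case "$1" in
        /dev/hda) echo /dev/ide/hd/c0b0t0u0 ;;   # primary master
        /dev/hdb) echo /dev/ide/hd/c0b0t1u0 ;;   # primary slave
        /dev/hdc) echo /dev/ide/hd/c0b1t0u0 ;;   # secondary master
        /dev/hdd) echo /dev/ide/hd/c0b1t1u0 ;;   # secondary slave
        *)        return 1 ;;                    # not in the table
    esac
}

ide_new_name /dev/hdc    # prints /dev/ide/hd/c0b1t0u0
```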
1311
1312
1313IDE Tapes
1314
1315A similar naming scheme is used as for IDE discs. The entries will
1316appear in the /dev/ide/mt directory.
1317
1318IDE CD-ROM
1319
1320A similar naming scheme is used as for IDE discs. The entries will
1321appear in the /dev/ide/cd directory.
1322
1323IDE Floppies
1324
1325A similar naming scheme is used as for IDE discs. The entries will
1326appear in the /dev/ide/fd directory.
1327
1328XT Hard Discs
1329
1330All XT discs are placed under /dev/xd. The first XT disc
1331would appear as /dev/xd/c0t0.
1332
1333
1334Old Compatibility Names
1335
1336The old compatibility names are the legacy device names, such as
1337/dev/hda, /dev/sda, /dev/rtc and so on.
1338Devfsd can be configured to create compatibility symlinks so that you
1339may continue to use the old names in your configuration files and so
1340that old applications will continue to function correctly.
1341
1342In order to configure devfsd to create these legacy names, the
1343following lines should be placed in your /etc/devfsd.conf:
1344
1345REGISTER .* MKOLDCOMPAT
1346UNREGISTER .* RMOLDCOMPAT
1347
1348This will cause devfsd to create (and destroy) symbolic links which
1349point to the kernel-supplied names.
1350
1351
1352-----------------------------------------------------------------------------
1353
1354
1355Device drivers currently ported
1356
1357- All miscellaneous character devices support devfs (this is done
1358 transparently through misc_register())
1359
1360- SCSI discs and generic hard discs
1361
1362- Character memory devices (null, zero, full and so on)
1363 Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
1364
1365- Loop devices (/dev/loop?)
1366
1367- TTY devices (console, serial ports, terminals and pseudo-terminals)
1368 Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
1369
1370- SCSI tapes (/dev/scsi and /dev/tapes)
1371
1372- SCSI CD-ROMs (/dev/scsi and /dev/cdroms)
1373
1374- SCSI generic devices (/dev/scsi)
1375
1376- RAMDISCS (/dev/ram?)
1377
1378- Meta Devices (/dev/md*)
1379
1380- Floppy discs (/dev/floppy)
1381
1382- Parallel port printers (/dev/printers)
1383
1384- Sound devices (/dev/sound)
1385 Thanks to Eric Dumas <dumas@linux.eu.org> and
1386 C. Scott Ananian <cananian@alumni.princeton.edu>
1387
1388- Joysticks (/dev/joysticks)
1389
1390- Sparc keyboard (/dev/kbd)
1391
1392- DSP56001 digital signal processor (/dev/dsp56k)
1393
1394- Apple Desktop Bus (/dev/adb)
1395
1396- Coda network file system (/dev/cfs*)
1397
1398- Virtual console capture devices (/dev/vcc)
1399 Thanks to Dennis Hou <smilax@mindmeld.yi.org>
1400
1401- Frame buffer devices (/dev/fb)
1402
1403- Video capture devices (/dev/v4l)
1404
1405
1406-----------------------------------------------------------------------------
1407
1408
1409Allocation of Device Numbers
1410
1411Devfs allows you to write a driver which doesn't need to allocate a
1412device number (major&minor numbers) for the internal operation of the
1413kernel. However, there are a number of userspace programmes that use
1414the device number as a unique handle for a device. An example is the
1415find programme, which uses device numbers to determine whether
1416an inode is on a different filesystem than another inode. The device
1417number used is the one for the block device which a filesystem is
1418using. To preserve compatibility with userspace programmes, block
1419devices using devfs need to have unique device numbers allocated to
1420them. Furthermore, POSIX specifies device numbers, so some kind of
1421device number needs to be presented to userspace.
1422
1423The simplest option (especially when porting drivers to devfs) is to
1424keep using the old major and minor numbers. Devfs will take whatever
1425values are given for major&minor and pass them onto userspace.
1426
1427This device number is a 16 bit number, so this leaves plenty of space
1428for large numbers of discs and partitions. This scheme can also be
1429used for character devices, in particular the tty devices, which are
1430currently limited to 256 pseudo-ttys (this limits the total number of
1431simultaneous xterms and remote logins). Note that the device number
1432is limited to the range 36864-61439 (majors 144-239), in order to
1433avoid any possible conflicts with existing official allocations.
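
The range quoted above can be checked with a little arithmetic. This
sketch assumes the traditional 16 bit encoding, with the major number in
the high 8 bits and the minor number in the low 8 bits. The devnum16
helper is made up for this example.

```shell
#!/bin/sh
# Illustrative only: devnum16 is a hypothetical helper. It assumes a
# 16 bit device number made of an 8 bit major and an 8 bit minor.
devnum16() {
    # $1: major number, $2: minor number
    echo $(( $1 * 256 + $2 ))
}

devnum16 144 0      # prints 36864, the bottom of the range
devnum16 239 255    # prints 61439, the top of the range
```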
1434
1435Please note that using dynamically allocated block device numbers may
1436break the NFS daemons (both user and kernel mode), which expect dev_t
1437for a given device to be constant over the lifetime of remote mounts.
1438
1439A final note on this scheme: since it doesn't increase the size of
1440device numbers, there are no compatibility issues with userspace.
1441
1442-----------------------------------------------------------------------------
1443
1444
1445Questions and Answers
1446
1447
1448Making things work
1449Alternatives to devfs
1450What I don't like about devfs
1451How to report bugs
1452Strange kernel messages
1453Compilation problems with devfsd
1454
1455
1456
1457Making things work
1458
1459Here are some common questions and answers.
1460
1461
1462
1463Devfsd doesn't start
1464
1465Make sure you have compiled and installed devfsd
1466Make sure devfsd is being started from your boot
1467scripts
1468Make sure you have configured your kernel to enable devfs (see
1469below)
1470Make sure devfs is mounted (see below)
1471
1472
1473Devfsd is not managing all my permissions
1474
1475Make sure you are capturing the appropriate events. For example,
1476device entries created by the kernel generate REGISTER events,
1477but those created by devfsd generate CREATE events.
1478
1479
1480Devfsd is not capturing all REGISTER events
1481
1482See the previous entry: you may need to capture CREATE events.
1483
1484
1485X will not start
1486
1487Make sure you followed the steps
1488outlined above.
1489
1490
1491Why don't my network devices appear in devfs?
1492
1493This is not a bug. Network devices have their own, completely separate
1494namespace. They are accessed via socket(2) and
1495setsockopt(2) calls, and thus require no device nodes. I have
1496raised the possibility of moving network devices into the device
1497namespace, but have had no response.
1498
1499
1500How can I test if I have devfs compiled into my kernel?
1501
1502All filesystems built-in or currently loaded are listed in
1503/proc/filesystems. If you see a devfs entry, then
1504you know that devfs was compiled into your kernel. If you have
1505correctly configured and rebuilt your kernel, then devfs will be
1506built-in. If you think you've configured it in, but
1507/proc/filesystems doesn't show it, you've made a mistake.
1508Common mistakes include:
1509
1510Using a 2.2.x kernel without applying the devfs patch (if you
1511don't know how to patch your kernel, use 2.4.x instead, don't bother
1512asking me how to patch)
1513Forgetting to set CONFIG_EXPERIMENTAL=y
1514Forgetting to set CONFIG_DEVFS_FS=y
1515Forgetting to set CONFIG_DEVFS_MOUNT=y (if you want devfs
1516to be automatically mounted at boot)
1517Editing your .config manually, instead of using make
1518config or make xconfig
1519Forgetting to run make dep; make clean after changing the
1520configuration and before compiling
1521Forgetting to compile your kernel and modules
1522Forgetting to install your kernel
1523Forgetting to install your modules
1524
1525Please check twice that you've done all these steps before sending in
1526a bug report.
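
The /proc/filesystems check is easy to script. The following is a minimal
sketch: has_devfs is a hypothetical helper, and it assumes the filesystem
name is the last field on each line of /proc/filesystems.

```shell
#!/bin/sh
# Illustrative only: has_devfs is a hypothetical helper.
# $1: file to inspect (defaults to /proc/filesystems)
has_devfs() {
    awk '$NF == "devfs" { found = 1 } END { exit !found }' \
        "${1:-/proc/filesystems}"
}

if has_devfs; then
    echo "devfs is compiled into this kernel"
else
    echo "devfs is not listed in /proc/filesystems"
fi
```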
1527
1528
1529
1530How can I test if devfs is mounted on /dev?
1531
1532The device filesystem will always create an entry called
1533".devfsd", which is used to communicate with the daemon. Even
1534if the daemon is not running, this entry will exist. Testing for the
1535existence of this entry is the approved method of determining if devfs
1536is mounted or not. Note that the type of entry (i.e. regular file,
1537character device, named pipe, etc.) may change without notice. Only
1538the existence of the entry should be relied upon.
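
In a boot script, this test might look like the sketch below. The
devfs_mounted function is a made-up name for this example. It checks only
that the entry exists, not what type it is, for the reason given above.

```shell
#!/bin/sh
# Illustrative only: devfs_mounted is a hypothetical helper.
# $1: mount point to test (defaults to /dev)
devfs_mounted() {
    [ -e "${1:-/dev}/.devfsd" ]
}

if devfs_mounted /dev; then
    echo "devfs is mounted on /dev"
else
    echo "devfs is not mounted on /dev"
fi
```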
1539
1540
1541When I start devfsd, I see the error:
1542Error opening file: ".devfsd" No such file or directory?
1543
1544This means that devfs is not mounted. Make sure you have devfs mounted.
1545
1546
1547How do I mount devfs?
1548
1549First make sure you have devfs compiled into your kernel (see
1550above). Then you will either need to:
1551
1552set CONFIG_DEVFS_MOUNT=y in your kernel config
1553pass devfs=mount to your boot loader
1554mount devfs manually in your boot scripts with:
1555mount -t devfs none /dev
1556
1557
1558
1559Mount by volume LABEL=<label> doesn't work with
1560devfs
1561
1562Most probably you are not mounting devfs onto /dev. What
1563happens is that if your kernel config has CONFIG_DEVFS_FS=y
1564then the contents of /proc/partitions will have the devfs
1565names (such as scsi/host0/bus0/target0/lun0/part1). The
1566contents of /proc/partitions are used by mount(8) when
1567mounting by volume label. If devfs is not mounted on /dev,
1568then mount(8) will fail to find devices. The solution is to
1569make sure that devfs is mounted on /dev. See above for how to
1570do that.
1571
1572
1573I have extra or incorrect entries in /dev
1574
1575You may have stale entries in your dev-state area. Check for a
1576RESTORE configuration line in your devfsd configuration
1577(typically /etc/devfsd.conf). If you have this line, check
1578the contents of the specified directory for stale entries. Remove
1579any entries which are incorrect, then reboot.
1580
1581
1582I get "Unable to open initial console" messages at boot
1583
1584This usually happens when you don't have devfs automounted onto
1585/dev at boot time, and there is no valid
1586/dev/console entry on your root file-system. Create a valid
1587/dev/console device node.
1588
1589
1590
1591
1592
1593Alternatives to devfs
1594
1595I've attempted to collate all the anti-devfs proposals and explain
1596their limitations. Under construction.
1597
1598
1599Why not just pass device create/remove events to a daemon?
1600
1601Here the suggestion is to develop an API in the kernel so that devices
1602can register create and remove events, and a daemon listens for those
1603events. The daemon would then populate/depopulate /dev (which
1604resides on disc).
1605
1606This has several limitations:
1607
1608
1609it only works for modules loaded and unloaded (or devices inserted
1610and removed) after the kernel has finished booting. Without a database
1611of events, there is no way the daemon could fully populate
1612/dev
1613
1614
1615if you add a database to this scheme, the question is then how to
1616present that database to user-space. If you make it a list of strings
1617with embedded event codes which are passed through a pipe to the
1618daemon, then this is only of use to the daemon. I would argue that the
1619natural way to present this data is via a filesystem (since many of
1620the events will be of a hierarchical nature), such as devfs.
1621Presenting the data as a filesystem makes it easy for the user to see
1622what is available and also makes it easy to write scripts to scan the
1623"database"
1624
1625
1626the tight binding between device nodes and drivers is no longer
1627possible (requiring the otherwise perfectly avoidable
1628table lookups)
1629
1630
1631you cannot catch inode lookup events on /dev which means
1632that module autoloading requires device nodes to be created. This is a
1633problem, particularly for drivers where only a few inodes are created
1634from a potentially large set
1635
1636
1637this technique can't be used when the root FS is mounted
1638read-only
1639
1640
1641
1642
1643Just implement a better scsidev
1644
1645This suggestion involves taking the scsidev programme and
1646extending it to scan for all devices, not just SCSI devices. The
1647scsidev programme works by scanning /proc/scsi.
1648
1649Problems:
1650
1651
1652the kernel does not currently provide a list of all devices
1653available. Not all drivers register entries in /proc or
1654generate kernel messages
1655
1656
1657there is no uniform mechanism to register devices other than the
1658devfs API
1659
1660
1661implementing such an API is then the same as the
1662proposal above
1663
1664
1665
1666
1667Put /dev on a ramdisc
1668
1669This suggestion involves creating a ramdisc and populating it with
1670device nodes and then mounting it over /dev.
1671
1672Problems:
1673
1674
1675
1676this doesn't help when mounting the root filesystem, since you
1677still need a device node to do that
1678
1679
1680if you want to use this technique for the root device node as
1681well, you need to use initrd. This complicates the booting sequence
1682and makes it significantly harder to administer and configure. The
1683initrd is essentially opaque, robbing the system administrator of easy
1684configuration
1685
1686
1687insufficient information is available to correctly populate the
1688ramdisc. So we come back to the
1689proposal above to "solve" this
1690
1691
1692a ramdisc-based solution would take more kernel memory, since the
1693backing store would be (at best) normal VFS inodes and dentries, which
1694take 284 bytes and 112 bytes, respectively, for each entry. Compare
1695that to 72 bytes for devfs
1696
1697
1698
1699
1700Do nothing: there's no problem
1701
1702Sometimes people can be heard to claim that the existing scheme is
1703fine. This is what they're ignoring:
1704
1705
1706device number size (8 bits each for major and minor) is a real
1707limitation, and must be fixed somehow. Systems with large numbers of
1708SCSI devices, for example, will continue to consume the remaining
1709unallocated major numbers. USB will also need to push beyond the 8 bit
1710minor limitation
1711
1712
1713simply increasing the device number size is insufficient. Apart
1714from causing a lot of pain, it doesn't solve the management issues
1715of a /dev with thousands or more device nodes
1716
1717
1718ignoring the problem of a huge /dev will not make it go
1719away, and dismisses the legitimacy of a large number of people who
1720want a dynamic /dev
1721
1722
1723the standard response then becomes: "write a device management
1724daemon", which brings us back to the
1725proposal above
1726
1727
1728
1729
1730What I don't like about devfs
1731
1732Here are some common complaints about devfs, and some suggestions and
1733solutions that may make it more palatable for you. I can't please
1734everybody, but I do try :-)
1735
1736I hate the naming scheme
1737
1738First, remember that no naming scheme will please everybody. You hate
1739the scheme, others love it. Who's to say who's right and who's wrong?
1740Ultimately, the person who writes the code gets to choose, and what
1741exists now is a combination of the choices made by the
1742devfs author and the
1743kernel maintainer (Linus).
1744
1745However, not all is lost. If you want to create your own naming
1746scheme, it is a simple matter to write a standalone script, hack
1747devfsd, or write a script called by devfsd. You can create whatever
1748naming scheme you like.
1749
1750Further, if you want to remove all traces of the devfs naming scheme
1751from /dev, you can mount devfs elsewhere (say
1752/devfs) and populate /dev with links into
1753/devfs. This population can be automated using devfsd if you
1754wish.
1755
1756You can even use the VFS binding facility to make the links, rather
1757than using symbolic links. This way, you don't even have to see the
1758"destination" of these symbolic links.
1759
1760Devfs puts policy into the kernel
1761
1762There's already policy in the kernel. Device numbers are in fact
1763policy (why should the kernel dictate what device numbers I use?).
1764Face it, some policy has to be in the kernel. The real difference
1765between device names as policy and device numbers as policy is that
1766no one will use device numbers directly, because device
1767numbers are devoid of meaning to humans and are ugly. At least with
1768the devfs device names, (even though you can add your own naming
1769scheme) some people will use the devfs-supplied names directly. This
1770offends some people :-)
1771
1772Devfs is bloatware
1773
1774This is not even remotely true. As shown above,
1775both code and data size are quite modest.
1776
1777
1778How to report bugs
1779
1780If you have (or think you have) a bug with devfs, please follow the
1781steps below:
1782
1783
1784
1785make sure you have enabled debugging output when configuring your
1786kernel. You will need to set (at least) the following config options:
1787
1788CONFIG_DEVFS_DEBUG=y
1789CONFIG_DEBUG_KERNEL=y
1790CONFIG_DEBUG_SLAB=y
1791
1792
1793
1794please make sure you have the latest devfs patches applied. The
1795latest kernel version might not have the latest devfs patches applied
1796yet (Linus is very busy)
1797
1798
1799save a copy of your complete kernel logs (preferably by
1800using the dmesg programme) for later inclusion in your bug
1801report. You may need to use the -s switch to increase the
1802internal buffer size so you can capture all the boot messages.
1803Don't edit or trim the dmesg output
1804
1805
1806
1807
1808try booting with devfs=dall passed to the kernel boot
1809command line (read the documentation on your bootloader on how to do
1810this), and save the result to a file. This may be quite verbose, and
1811it may overflow the messages buffer, but try to get as much of it as
1812you can
1813
1814
1815if you get an Oops, run ksymoops to decode it so that the
1816names of the offending functions are provided. A non-decoded Oops is
1817pretty useless
1818
1819
1820send a copy of your devfsd configuration file(s)
1821
1822send the bug report to me first.
1823Don't expect that I will see it if you post it to the linux-kernel
1824mailing list. Include all the information listed above, plus
1825anything else that you think might be relevant. Put the string
1826devfs somewhere in the subject line, so my mail filters mark
1827it as urgent
1828
1829
1830
1831
1832Here is a general guide on how to ask questions in a way that greatly
1833improves your chances of getting a reply:
1834
1835http://www.tuxedo.org/~esr/faqs/smart-questions.html. If you have
1836a bug to report, you should also read
1837
1838http://www.chiark.greenend.org.uk/~sgtatham/bugs.html.
1839
1840
1841Strange kernel messages
1842
1843You may see devfs-related messages in your kernel logs. Below are some
1844messages and what they mean (and what you should do about them, if
1845anything).
1846
1847
1848
1849devfs_register(fred): could not append to parent, err: -17
1850
1851You need to check what the error code means, but usually 17 means
1852EEXIST. This means that a driver attempted to create an entry
1853fred in a directory, but there already was an entry with that
1854name. This is often caused by flawed boot scripts which untar a bunch
1855of inodes into /dev, as a way to restore permissions. This
1856message is harmless, as the device nodes will still
1857provide access to the driver (unless you use the devfs=only
1858boot option, which is only for dedicated souls:-). If you want to get
1859rid of these annoying messages, upgrade to devfsd-v1.3.20 and use the
1860recommended RESTORE directive to restore permissions.
1861
1862
1863devfs_mk_dir(bill): using old entry in dir: c1808724 ""
1864
1865This is similar to the message above, except that a driver attempted
1866to create a directory named bill, and the parent directory
1867has an entry with the same name. In this case, to ensure that drivers
1868continue to work properly, the old entry is re-used and given to the
1869driver. In 2.5 kernels, the driver is given a NULL entry, and thus,
1870under rare circumstances, may not create the required device nodes.
1871The solution is the same as above.
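
The error number in the first message above can be looked up rather than guessed. A minimal sketch in Python, assuming the errno table of the machine running the script matches the kernel that logged the message; the helper name decode_kernel_err is invented for this illustration and is not part of devfs or devfsd:

```python
import errno
import os


def decode_kernel_err(err):
    """Map a negative kernel-style error code to (symbolic name, description)."""
    number = abs(err)
    # errno.errorcode maps numeric codes to names such as "EEXIST".
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # The value from "could not append to parent, err: -17".
    print(decode_kernel_err(-17))
```

On Linux the name returned for -17 is EEXIST, which agrees with the explanation given for the first message.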
1872
1873
1874
1875
1876
1877Compilation problems with devfsd
1878
1879Usually, you can compile devfsd just by typing in
1880make in the source directory, followed by a make
1881install (as root). Sometimes, you may have problems, particularly
1882on broken configurations.
1883
1884
1885
1886error messages relating to DEVFSD_NOTIFY_DELETE
1887
1888This happened because you have an ancient set of kernel headers
1889installed in /usr/include/linux or /usr/src/linux.
1890Install kernel 2.4.10 or later. You may need to pass the
1891KERNEL_DIR variable to make (if you did not install
1892the new kernel sources as /usr/src/linux), or you may copy
1893the devfs_fs.h file in the kernel source tree into
1894/usr/include/linux.
1895
1896
1897
1898
1899-----------------------------------------------------------------------------
1900
1901
1902Other resources
1903
1904
1905
1906Douglas Gilbert has written a useful document at
1907
1908http://www.torque.net/sg/devfs_scsi.html which
1909explores the SCSI subsystem and how it interacts with devfs
1910
1911
1912Douglas Gilbert has written another useful document at
1913
1914http://www.torque.net/scsi/SCSI-2.4-HOWTO/ which
1915discusses the Linux SCSI subsystem in 2.4.
1916
1917
1918Johannes Erdfelt has started a discussion paper on Linux and
1919hot-swap devices, describing what the requirements are for a scalable
1920solution and how and why he's used devfs+devfsd. Note that this is an
1921early draft only, available in plain text form at:
1922
1923http://johannes.erdfelt.com/hotswap.txt.
1924Johannes has promised an HTML version will follow.
1925
1926
1927I presented an invited
1928paper
1929at the
1930
19312nd Annual Storage Management Workshop held in Miami, Florida,
1932U.S.A. in October 2000.
1933
1934
1935
1936
1937-----------------------------------------------------------------------------
1938
1939
1940Translations of this document
1941
1942This document has been translated into other languages.
1943
1944
1945
1946
1947The document master (in English) by rgooch@atnf.csiro.au is
1948available at
1949
1950http://www.atnf.csiro.au/~rgooch/linux/docs/devfs.html
1951
1952
1953
1954A Korean translation by viatoris@nownuri.net is available at
1955
1956http://your.destiny.pe.kr/devfs/devfs.html
1957
1958
1959
1960
1961-----------------------------------------------------------------------------
1962Most flags courtesy of ITA's
1963Flags of All Countries
1964used with permission.
diff --git a/Documentation/filesystems/devfs/ToDo b/Documentation/filesystems/devfs/ToDo
new file mode 100644
index 000000000000..afd5a8f2c19b
--- /dev/null
+++ b/Documentation/filesystems/devfs/ToDo
@@ -0,0 +1,40 @@
1 Device File System (devfs) ToDo List
2
3 Richard Gooch <rgooch@atnf.csiro.au>
4
5 3-JUL-2000
6
7This is a list of things to be done for better devfs support in the
8Linux kernel. If you'd like to contribute to the devfs, please have a
9look at this list for anything that is unallocated. Also, if there are
10items missing (surely), please contact me so I can add them to the
11list (preferably with your name attached to them:-).
12
13
14- >256 ptys
15 Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
16
17- Amiga floppy driver (drivers/block/amiflop.c)
18
19- Atari floppy driver (drivers/block/ataflop.c)
20
21- SWIM3 (Super Woz Integrated Machine 3) floppy driver (drivers/block/swim3.c)
22
23- Amiga ZorroII ramdisc driver (drivers/block/z2ram.c)
24
25- Parallel port ATAPI CD-ROM (drivers/block/paride/pcd.c)
26
27- Parallel port ATAPI floppy (drivers/block/paride/pf.c)
28
29- AP1000 block driver (drivers/ap1000/ap.c, drivers/ap1000/ddv.c)
30
31- Archimedes floppy (drivers/acorn/block/fd1772.c)
32
33- MFM hard drive (drivers/acorn/block/mfmhd.c)
34
35- I2O block device (drivers/message/i2o/i2o_block.c)
36
37- ST-RAM device (arch/m68k/atari/stram.c)
38
39- Raw devices
40
diff --git a/Documentation/filesystems/devfs/boot-options b/Documentation/filesystems/devfs/boot-options
new file mode 100644
index 000000000000..df3d33b03e0a
--- /dev/null
+++ b/Documentation/filesystems/devfs/boot-options
@@ -0,0 +1,65 @@
1/* -*- auto-fill -*- */
2
3 Device File System (devfs) Boot Options
4
5 Richard Gooch <rgooch@atnf.csiro.au>
6
7 18-AUG-2001
8
9
10When CONFIG_DEVFS_DEBUG is enabled, you can pass several boot options
11to the kernel to debug devfs. The boot options are prefixed by
12"devfs=", and are separated by commas. Spaces are not allowed. The
13syntax looks like this:
14
15devfs=<option1>,<option2>,<option3>
16
17and so on. For example, if you wanted to turn on debugging for module
18load requests and device registration, you would do:
19
20devfs=dmod,dreg
21
22You may prefix "no" to any option. This will invert the option.
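
The syntax described above is simple enough to model in a few lines. The following Python sketch is only an illustration of the rules as stated (comma-separated options, no spaces, optional "no" prefix); it is not the kernel's parser, and the function name parse_devfs_options is hypothetical:

```python
def parse_devfs_options(arg):
    """Parse a 'devfs=opt1,opt2' boot argument into {option: enabled}."""
    prefix = "devfs="
    if not arg.startswith(prefix):
        raise ValueError("expected an argument starting with 'devfs='")
    if " " in arg:
        raise ValueError("spaces are not allowed in the devfs= argument")
    settings = {}
    for token in arg[len(prefix):].split(","):
        if not token:
            continue  # assumption: empty fields are skipped
        if token.startswith("no") and len(token) > 2:
            # A leading "no" inverts the option that follows it.
            settings[token[2:]] = False
        else:
            settings[token] = True
    return settings


if __name__ == "__main__":
    print(parse_devfs_options("devfs=dmod,dreg"))
    print(parse_devfs_options("devfs=dall,nodilookup"))
```

The sketch does not check option names against the list of valid options, so a misspelt option is accepted silently here.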
23
24
25Debugging Options
26=================
27
28These require CONFIG_DEVFS_DEBUG to be enabled.
29Note that all debugging options have 'd' as the first character. By
30default all options are off. All debugging output is sent to the
31kernel logs. The debugging options do not take effect until the devfs
32version message appears (just prior to the root filesystem being
33mounted).
34
35These are the options:
36
37dmod print module load requests to <request_module>
38
39dreg print device register requests to <devfs_register>
40
41dunreg print device unregister requests to <devfs_unregister>
42
43dchange print device change requests to <devfs_set_flags>
44
45dilookup print inode lookup requests
46
47diget print VFS inode allocations
48
49diunlink print inode unlinks
50
51dichange print inode changes
52
53dimknod print calls to mknod(2)
54
55dall some debugging turned on
56
57
58Other Options
59=============
60
61These control the default behaviour of devfs. The options are:
62
63mount mount devfs onto /dev at boot time
64
65only disable non-devfs device nodes for devfs-capable drivers
diff --git a/Documentation/filesystems/directory-locking b/Documentation/filesystems/directory-locking
new file mode 100644
index 000000000000..34380d4fbce3
--- /dev/null
+++ b/Documentation/filesystems/directory-locking
@@ -0,0 +1,113 @@
1 Locking scheme used for directory operations is based on two
2kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem).
3
4 For our purposes all operations fall in 6 classes:
5
61) read access. Locking rules: caller locks directory we are accessing.
7
82) object creation. Locking rules: same as above.
9
103) object removal. Locking rules: caller locks parent, finds victim,
11locks victim and calls the method.
12
134) rename() that is _not_ cross-directory. Locking rules: caller locks
14the parent, finds source and target, if target already exists - locks it
15and then calls the method.
16
175) link creation. Locking rules:
18 * lock parent
19 * check that source is not a directory
20 * lock source
21 * call the method.
22
236) cross-directory rename. The trickiest in the whole bunch. Locking
24rules:
25 * lock the filesystem
26 * lock parents in "ancestors first" order.
27 * find source and target.
28 * if old parent is equal to or is a descendent of target
29 fail with -ENOTEMPTY
30 * if new parent is equal to or is a descendent of source
31 fail with -ELOOP
32 * if target exists - lock it.
33 * call the method.
34
35
36The rules above obviously guarantee that all directories that are going to be
37read, modified or removed by method will be locked by caller.
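
The checks and the "ancestors first" ordering for cross-directory rename can be sketched on a toy tree. This is a Python model of the rules as written above, not the VFS code; the Node class and all function names are invented for the illustration, and the choice to lock the old parent first when the two parents are unrelated is an assumption (the rules only constrain the order when one is an ancestor of the other):

```python
import errno


class Node:
    """A filesystem object in a toy tree; parent is None for the root."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def is_same_or_descendant(node, ancestor):
    """True if node is ancestor itself or lies somewhere below it."""
    while node is not None:
        if node is ancestor:
            return True
        node = node.parent
    return False


def parent_lock_order(old_parent, new_parent):
    """Return the two parents in 'ancestors first' order."""
    if is_same_or_descendant(old_parent, new_parent):
        return [new_parent, old_parent]
    # new_parent is below old_parent, or the two are unrelated.
    return [old_parent, new_parent]


def check_cross_directory_rename(old_parent, new_parent, source, target=None):
    """Apply the two descendant checks; return 0 or a negative errno."""
    if target is not None and is_same_or_descendant(old_parent, target):
        return -errno.ENOTEMPTY
    if is_same_or_descendant(new_parent, source):
        return -errno.ELOOP
    return 0
```

For example, with directories /a, /a/b and /c, moving /a into /a/b is rejected with -ELOOP, while moving /a/b into /c passes both checks.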
38
39
40If no directory is its own ancestor, the scheme above is deadlock-free.
41Proof:
42
43 First of all, at any moment we have a partial ordering of the
44objects - A < B iff A is an ancestor of B.
45
46 That ordering can change. However, the following is true:
47
48(1) if object removal or non-cross-directory rename holds lock on A and
49 attempts to acquire lock on B, A will remain the parent of B until we
50 acquire the lock on B. (Proof: only cross-directory rename can change
51 the parent of object and it would have to lock the parent).
52
53(2) if cross-directory rename holds the lock on filesystem, order will not
54 change until rename acquires all locks. (Proof: other cross-directory
55 renames will be blocked on filesystem lock and we don't start changing
56 the order until we had acquired all locks).
57
58(3) any operation holds at most one lock on non-directory object and
59 that lock is acquired after all other locks. (Proof: see descriptions
60 of operations).
61
62 Now consider the minimal deadlock. Each process is blocked on
63attempt to acquire some lock and already holds at least one lock. Let's
64consider the set of contended locks. First of all, filesystem lock is
65not contended, since any process blocked on it is not holding any locks.
66Thus all processes are blocked on ->i_sem.
67
68 Non-directory objects are not contended due to (3). Thus link
69creation can't be a part of deadlock - it can't be blocked on source
70and it means that it doesn't hold any locks.
71
72 Any contended object is either held by cross-directory rename or
73has a child that is also contended. Indeed, suppose that it is held by
74operation other than cross-directory rename. Then the lock this operation
75is blocked on belongs to child of that object due to (1).
76
77 It means that one of the operations is cross-directory rename.
78Otherwise the set of contended objects would be infinite - each of them
79would have a contended child and we had assumed that no object is its
80own descendent. Moreover, there is exactly one cross-directory rename
81(see above).
82
83 Consider the object blocking the cross-directory rename. One
84of its descendents is locked by cross-directory rename (otherwise we
85would again have an infinite set of of contended objects). But that
86means that cross-directory rename is taking locks out of order. Due
87to (2) the order hadn't changed since we had acquired filesystem lock.
88But locking rules for cross-directory rename guarantee that we do not
89try to acquire lock on descendent before the lock on ancestor.
90Contradiction. I.e. deadlock is impossible. Q.E.D.
91
92
93 These operations are guaranteed to avoid loop creation. Indeed,
94the only operation that could introduce loops is cross-directory rename.
95Since the only new (parent, child) pair added by rename() is (new parent,
96source), such loop would have to contain these objects and the rest of it
97would have to exist before rename(). I.e. at the moment of loop creation
98rename() responsible for that would be holding filesystem lock and new parent
99would have to be equal to or a descendent of source. But that means that
100new parent had been equal to or a descendent of source since the moment when
101we had acquired filesystem lock and rename() would fail with -ELOOP in that
102case.
103
104 While this locking scheme works for arbitrary DAGs, it relies on
105ability to check that directory is a descendent of another object. Current
106implementation assumes that directory graph is a tree. This assumption is
107also preserved by all operations (cross-directory rename on a tree that would
108not introduce a cycle will leave it a tree and link() fails for directories).
109
110 Notice that "directory" in the above == "anything that might have
111children", so if we are going to introduce hybrid objects we will need
112either to make sure that link(2) doesn't work for them or to make changes
113in is_subdir() that would make it work even in presence of such beasts.
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
new file mode 100644
index 000000000000..b5cb9110cc6b
--- /dev/null
+++ b/Documentation/filesystems/ext2.txt
@@ -0,0 +1,383 @@
1
2The Second Extended Filesystem
3==============================
4
5ext2 was originally released in January 1993. Written by Rémy Card,
6Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the
7Extended Filesystem. It is currently still (April 2001) the predominant
8filesystem in use by Linux. There are also implementations available
9for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS.
10
11Options
12=======
13
14Most defaults are determined by the filesystem superblock, and can be
15set using tune2fs(8). Kernel-determined defaults are indicated by (*).
16
17bsddf (*) Makes `df' act like BSD.
18minixdf Makes `df' act like Minix.
19
20check Check block and inode bitmaps at mount time
21 (requires CONFIG_EXT2_CHECK).
22check=none, nocheck (*) Don't do extra checking of bitmaps on mount
23 (check=normal and check=strict options removed)
24
25debug Extra debugging information is sent to the
26 kernel syslog. Useful for developers.
27
28errors=continue Keep going on a filesystem error.
29errors=remount-ro Remount the filesystem read-only on an error.
30errors=panic Panic and halt the machine if an error occurs.
31
32grpid, bsdgroups Give objects the same group ID as their parent.
33nogrpid, sysvgroups New objects have the group ID of their creator.
34
35nouid32 Use 16-bit UIDs and GIDs.
36
37oldalloc Enable the old block allocator. Orlov should
38 have better performance, we'd like to get some
39 feedback if it's the contrary for you.
40orlov (*) Use the Orlov block allocator.
41 (See http://lwn.net/Articles/14633/ and
42 http://lwn.net/Articles/14446/.)
43
44resuid=n The user ID which may use the reserved blocks.
45resgid=n The group ID which may use the reserved blocks.
46
47sb=n Use alternate superblock at this location.
48
49user_xattr Enable "user." POSIX Extended Attributes
50 (requires CONFIG_EXT2_FS_XATTR).
51 See also http://acl.bestbits.at
52nouser_xattr Don't support "user." extended attributes.
53
54acl Enable POSIX Access Control Lists support
55 (requires CONFIG_EXT2_FS_POSIX_ACL).
56 See also http://acl.bestbits.at
57noacl Don't support POSIX ACLs.
58
59nobh Do not attach buffer_heads to file pagecache.
60
61grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2.
62
63
64Specification
65=============
66
67ext2 shares many properties with traditional Unix filesystems. It has
68the concepts of blocks, inodes and directories. It has space in the
69specification for Access Control Lists (ACLs), fragments, undeletion and
70compression though these are not yet implemented (some are available as
71separate patches). There is also a versioning mechanism to allow new
72features (such as journalling) to be added in a maximally compatible
73manner.
74
75Blocks
76------
77
78The space in the device or file is split up into blocks. These are
79a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems),
80which is decided when the filesystem is created. Smaller blocks mean
81less wasted space per file, but require slightly more accounting overhead,
82and also impose other limits on the size of files and the filesystem.
83
84Block Groups
85------------
86
87Blocks are clustered into block groups in order to reduce fragmentation
88and minimise the amount of head seeking when reading a large amount
89of consecutive data. Information about each block group is kept in a
90descriptor table stored in the block(s) immediately after the superblock.
91Two blocks near the start of each group are reserved for the block usage
92bitmap and the inode usage bitmap which show which blocks and inodes
93are in use. Since each bitmap is limited to a single block, this means
94that the maximum number of blocks in a group is 8 times the block size in bytes.
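
The bitmap limit can be worked out directly: a one-block bitmap has 8 bits per byte, and each bit tracks one block. A small Python sketch of that arithmetic (the function name is made up for the example):

```python
def block_group_limits(block_size):
    """Return (max blocks per group, max group size in bytes)."""
    # One bitmap block holds block_size bytes, i.e. 8 * block_size bits,
    # and each bit tracks one block of the group.
    blocks_per_group = 8 * block_size
    return blocks_per_group, blocks_per_group * block_size


if __name__ == "__main__":
    for size in (1024, 2048, 4096):
        blocks, group_bytes = block_group_limits(size)
        print(size, blocks, group_bytes // 2**20, "MiB")
```

With 1024-byte blocks a group is therefore limited to 8192 blocks (8 MiB), and with 4096-byte blocks to 32768 blocks (128 MiB).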
95
96The block(s) following the bitmaps in each block group are designated
97as the inode table for that block group and the remainder are the data
98blocks. The block allocation algorithm attempts to allocate data blocks
99in the same block group as the inode which contains them.
100
101The Superblock
102--------------
103
104The superblock contains all the information about the configuration of
105the filing system. The primary copy of the superblock is stored at an
106offset of 1024 bytes from the start of the device, and it is essential
107to mounting the filesystem. Since it is so important, backup copies of
108the superblock are stored in block groups throughout the filesystem.
109The first version of ext2 (revision 0) stores a copy at the start of
110every block group, along with backups of the group descriptor block(s).
111Because this can consume a considerable amount of space for large
112filesystems, later revisions can optionally reduce the number of backup
113copies by only putting backups in specific groups (this is the sparse
114superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7.
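
To make the sparse superblock rule concrete, here is a Python sketch that lists the groups holding a superblock copy for a filesystem with a given number of block groups. The function names are hypothetical; only the rule itself (groups 0, 1 and powers of 3, 5 and 7) comes from the text above:

```python
def is_power_of(n, base):
    """True if n == base**k for some integer k >= 0."""
    if n < 1:
        return False
    while n % base == 0:
        n //= base
    return n == 1


def has_superblock_copy(group):
    """Apply the sparse superblock rule to one block group number."""
    if group in (0, 1):
        return True
    return any(is_power_of(group, base) for base in (3, 5, 7))


def superblock_groups(group_count):
    """List every group below group_count that holds a superblock copy."""
    return [g for g in range(group_count) if has_superblock_copy(g)]


if __name__ == "__main__":
    print(superblock_groups(100))
```

For a filesystem with 100 block groups this yields groups 0, 1, 3, 5, 7, 9, 25, 27, 49 and 81, i.e. ten copies instead of one hundred.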
115
116The information in the superblock contains fields such as the total
117number of inodes and blocks in the filesystem and how many are free,
118how many inodes and blocks are in each block group, when the filesystem
119was mounted (and if it was cleanly unmounted), when it was modified,
120what version of the filesystem it is (see the Revisions section below)
121and which OS created it.
122
123If the filesystem is revision 1 or higher, then there are extra fields,
124such as a volume name, a unique identification number, the inode size,
125and space for optional filesystem features to store configuration info.
126
127All fields in the superblock (as in all other ext2 structures) are stored
128on the disc in little endian format, so a filesystem is portable between
129machines without having to know what machine it was created on.
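
As an illustration of the fixed offset and the little-endian encoding, the following Python sketch decodes a few superblock fields from a raw image. The field offsets (inode count at 0, block count at 4, log2 block size at 24, magic at 56) and the magic value 0xEF53 are not stated in this document; they are taken from the ext2 on-disk layout as commonly described and should be checked against the kernel headers before being relied on:

```python
import struct

SUPERBLOCK_OFFSET = 1024  # from the text: 1024 bytes from the start of the device
EXT2_MAGIC = 0xEF53       # assumption: value of the magic field, per ext2 headers


def parse_superblock(image):
    """Decode a few little-endian superblock fields from a raw image."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < 58:
        raise ValueError("image too short to contain a superblock")
    # "<" selects little-endian regardless of the host byte order.
    inodes_count, blocks_count = struct.unpack_from("<II", sb, 0)
    (log_block_size,) = struct.unpack_from("<I", sb, 24)
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT2_MAGIC:
        raise ValueError("magic number does not match ext2")
    return {
        "inodes": inodes_count,
        "blocks": blocks_count,
        "block_size": 1024 << log_block_size,
    }
```

Because the format string fixes the byte order, the same code gives the same answer on big-endian and little-endian hosts, which is the portability property described above.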
130
131Inodes
132------
133
134The inode (index node) is a fundamental concept in the ext2 filesystem.
135Each object in the filesystem is represented by an inode. The inode
136structure contains pointers to the filesystem blocks which contain the
137data held in the object and all of the metadata about an object except
138its name. The metadata about an object includes the permissions, owner,
139group, flags, size, number of blocks used, access time, change time,
140modification time, deletion time, number of links, fragments, version
141(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).
142
143There are some reserved fields which are currently unused in the inode
144structure and several which are overloaded. One field is reserved for the
145directory ACL if the inode is a directory and alternately for the top 32
146bits of the file size if the inode is a regular file (allowing file sizes
147larger than 2GB). The translator field is unused under Linux, but is used
148by the HURD to reference the inode of a program which will be used to
149interpret this object. Most of the remaining reserved fields have been
150used up for both Linux and the HURD for larger owner and group fields.
151The HURD also has a larger mode field so it uses another of the remaining
152fields to store the extra mode bits.
153
154There are pointers to the first 12 blocks which contain the file's data
155in the inode. There is a pointer to an indirect block (which contains
156pointers to the next set of blocks), a pointer to a doubly-indirect
157block (which contains pointers to indirect blocks) and a pointer to a
158trebly-indirect block (which contains pointers to doubly-indirect blocks).
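
The pointer scheme above determines how many blocks a single file can address. A Python sketch of the arithmetic, assuming 4-byte (32-bit) block pointers, which this document does not state explicitly:

```python
POINTER_SIZE = 4      # assumption: block numbers are 32-bit
DIRECT_POINTERS = 12  # from the text: 12 direct block pointers


def max_addressable_blocks(block_size):
    """Blocks reachable via direct, indirect, doubly- and trebly-indirect pointers."""
    per_block = block_size // POINTER_SIZE  # pointers that fit in one block
    return DIRECT_POINTERS + per_block + per_block**2 + per_block**3


def max_addressable_bytes(block_size):
    """The same limit expressed in bytes."""
    return max_addressable_blocks(block_size) * block_size


if __name__ == "__main__":
    for size in (1024, 2048):
        print(size, max_addressable_bytes(size) // 2**30, "GiB")
```

For 1kB and 2kB blocks this gives about 16GB and 256GB, in line with the file size limits listed under Limitations. For 4kB and 8kB blocks the table there lists a lower figure than this formula would give, so another limit evidently applies first.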
159
160The flags field contains some ext2-specific flags which aren't catered
161for by the standard chmod flags. These flags can be listed with lsattr
162and changed with the chattr command, and allow specific filesystem
163behaviour on a per-file basis. There are flags for secure deletion,
164undeletable, compression, synchronous updates, immutability, append-only,
165dumpable, no-atime, indexed directories, and data-journaling. Not all
166of these are supported yet.
167
168Directories
169-----------
170
171A directory is a filesystem object and has an inode just like a file.
172It is a specially formatted file containing records which associate
173each name with an inode number. Later revisions of the filesystem also
174encode the type of the object (file, directory, symlink, device, fifo,
175socket) to avoid the need to check the inode itself for this information
176(support for taking advantage of this feature does not yet exist in
177Glibc 2.2).
178
179The inode allocation code tries to assign inodes which are in the same
180block group as the directory in which they are first created.
181
182The current implementation of ext2 uses a singly-linked list to store
183the filenames in the directory; a pending enhancement uses hashing of the
184filenames to allow lookup without the need to scan the entire directory.
185
186The current implementation never removes empty directory blocks once they
187have been allocated to hold more files.
188
189Special files
190-------------
191
192Symbolic links are also filesystem objects with inodes. They deserve
193special mention because the data for them is stored within the inode
194itself if the symlink is less than 60 bytes long. It uses the fields
195which would normally be used to store the pointers to data blocks.
196This is a worthwhile optimisation as it avoids allocating a full
197block for the symlink, and most symlinks are less than 60 characters long.
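
The 60-byte figure follows from the inode layout described earlier: 12 direct pointers plus the indirect, doubly-indirect and trebly-indirect pointers make 15 slots. A short Python sketch, again assuming 4-byte block pointers (the helper name is invented):

```python
BLOCK_POINTER_SLOTS = 15  # 12 direct + indirect + doubly- + trebly-indirect
POINTER_SIZE = 4          # assumption: block numbers are 32-bit
INLINE_CAPACITY = BLOCK_POINTER_SLOTS * POINTER_SIZE  # bytes available in the inode


def symlink_fits_in_inode(target):
    """True if the link target is short enough to be stored in the inode."""
    # The text says "less than 60 bytes", hence the strict comparison.
    return len(target.encode("utf-8")) < INLINE_CAPACITY
```

A target such as "../lib/libc.so.6" fits in the inode, while a 60-byte path needs a data block.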
198
199Character and block special devices never have data blocks assigned to
200them. Instead, their device number is stored in the inode, again reusing
201the fields which would be used to point to the data blocks.
202
203Reserved Space
204--------------
205
206In ext2, there is a mechanism for reserving a certain number of blocks
207for a particular user (normally the super-user). This is intended to
208allow for the system to continue functioning even if non-privileged users
209fill up all the space available to them (this is independent of filesystem
210quotas). It also keeps the filesystem from filling up entirely which
211helps combat fragmentation.
212
213Filesystem check
214----------------
215
216At boot time, most systems run a consistency check (e2fsck) on their
217filesystems. The superblock of the ext2 filesystem contains several
218fields which indicate whether fsck should actually run (since checking
219the filesystem at boot can take a long time if it is large). fsck will
220run if the filesystem was not cleanly unmounted, if the maximum mount
221count has been exceeded or if the maximum time between checks has been
222exceeded.
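
The three conditions can be written as a single predicate. This Python sketch only mirrors the description above; the parameter names are hypothetical and do not correspond to actual superblock field names, and the exact comparisons used by e2fsck may differ:

```python
def needs_fsck(cleanly_unmounted, mount_count, max_mount_count,
               last_check, check_interval, now):
    """Decide whether a boot-time check should run (times in seconds)."""
    if not cleanly_unmounted:
        return True  # the filesystem was not cleanly unmounted
    if mount_count > max_mount_count:
        return True  # maximum mount count exceeded
    if now - last_check > check_interval:
        return True  # maximum time between checks exceeded
    return False
```

A cleanly unmounted filesystem that is under both limits is skipped, which is what keeps boot times short on large filesystems.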
223
224Feature Compatibility
225---------------------
226
227The compatibility feature mechanism used in ext2 is sophisticated.
228It safely allows features to be added to the filesystem, without
229unnecessarily sacrificing compatibility with older versions of the
230filesystem code. The feature compatibility mechanism is not supported by
231the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in
232revision 1. There are three 32-bit fields, one for compatible features
233(COMPAT), one for read-only compatible (RO_COMPAT) features and one for
234incompatible (INCOMPAT) features.
235
236These feature flags have specific meanings for the kernel as follows:
237
238A COMPAT flag indicates that a feature is present in the filesystem,
239but the on-disk format is 100% compatible with older on-disk formats, so
240a kernel which didn't know anything about this feature could read/write
241the filesystem without any chance of corrupting the filesystem (or even
242making it inconsistent). This is essentially just a flag which says
243"this filesystem has a (hidden) feature" that the kernel or e2fsck may
244want to be aware of (more on e2fsck and feature flags later). The ext3
245HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
246a regular file with data blocks in it so the kernel does not need to
247take any special notice of it if it doesn't understand ext3 journaling.
248
249An RO_COMPAT flag indicates that the on-disk format is 100% compatible
250with older on-disk formats for reading (i.e. the feature does not change
251the visible on-disk format). However, an old kernel writing to such a
252filesystem would/could corrupt the filesystem, so this is prevented. The
253most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
254sparse groups allow file data blocks where superblock/group descriptor
255backups used to live, and ext2_free_blocks() refuses to free these blocks,
256which would lead to inconsistent bitmaps. An old kernel would also
257get an error if it tried to free a series of blocks which crossed a group
258boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.
259
260An INCOMPAT flag indicates the on-disk format has changed in some
261way that makes it unreadable by older kernels, or would otherwise
262cause a problem if an old kernel tried to mount it. FILETYPE is an
263INCOMPAT flag because older kernels would think a filename was longer
264than 256 characters, which would lead to corrupt directory listings.
265The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
266doesn't understand compression, you would just get garbage back from
267read() instead of it automatically decompressing your data. The ext3
268RECOVER flag is needed to prevent a kernel which does not understand the
269ext3 journal from mounting the filesystem without replaying the journal.
270
271e2fsck needs to be more strict with the handling of these
272flags than the kernel. If it doesn't understand ANY of the COMPAT,
273RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem,
274because it has no way of verifying whether a given feature is valid
275or not. Allowing e2fsck to succeed on a filesystem with an unknown
276feature would give the user a false sense of security. Refusing to check
277a filesystem with unknown features is a good incentive for the user to
278update to the latest e2fsck. This also means that anyone adding feature
279flags to ext2 also needs to update e2fsck to verify these features.
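
The different treatment of the three feature fields by the kernel and by e2fsck can be summarised in code. The Python sketch below is a model of the policy described in this section, not the actual kernel or e2fsck logic; the "supported" masks and their bit values are illustrative assumptions:

```python
# Feature bits this hypothetical implementation understands.
SUPPORTED_COMPAT = 0x0000
SUPPORTED_RO_COMPAT = 0x0001  # illustrative: e.g. SPARSE_SUPER
SUPPORTED_INCOMPAT = 0x0002   # illustrative: e.g. FILETYPE


def kernel_mount_mode(compat, ro_compat, incompat):
    """Return 'rw', 'ro' or 'refuse' for a filesystem's feature fields."""
    if incompat & ~SUPPORTED_INCOMPAT:
        return "refuse"  # unknown INCOMPAT feature: cannot mount at all
    if ro_compat & ~SUPPORTED_RO_COMPAT:
        return "ro"      # unknown RO_COMPAT feature: writing is prevented
    return "rw"          # unknown COMPAT features are simply ignored


def e2fsck_will_check(compat, ro_compat, incompat):
    """e2fsck refuses if ANY field has a bit it does not understand."""
    unknown = ((compat & ~SUPPORTED_COMPAT)
               | (ro_compat & ~SUPPORTED_RO_COMPAT)
               | (incompat & ~SUPPORTED_INCOMPAT))
    return unknown == 0
```

Note that the compat argument never influences kernel_mount_mode, while a single unknown bit in it is enough for e2fsck_will_check to return False.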
280
281Metadata
282--------
283
284It is frequently claimed that the ext2 implementation of writing
285asynchronous metadata is faster than the ffs synchronous metadata
286scheme but less reliable. Both methods are equally resolvable by their
287respective fsck programs.
288
289If you're exceptionally paranoid, there are 3 ways of making metadata
290writes synchronous on ext2:
291
292per-file if you have the program source: use the O_SYNC flag to open()
293per-file if you don't have the source: use "chattr +S" on the file
294per-filesystem: add the "sync" option to mount (or in /etc/fstab)
295
296The first and last are not ext2 specific but do force the metadata to
297be written synchronously. See also Journaling below.
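
For the first method, a program opens the file with O_SYNC. A minimal Python sketch for a Unix-like system (os.O_SYNC is not available on every platform; the function name is made up for the example):

```python
import os


def write_synchronously(path, data):
    """Write data to path through a descriptor opened with O_SYNC."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
```

Each write() on such a descriptor returns only after the data has been handed to the storage device, so this is noticeably slower than the default asynchronous behaviour.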
298
299Limitations
300-----------
301
302There are various limits imposed by the on-disk layout of ext2. Other
303limits are imposed by the current implementation of the kernel code.
304Many of the limits are determined at the time the filesystem is first
305created, and depend upon the block size chosen. The ratio of inodes to
306data blocks is fixed at filesystem creation time, so the only way to
307increase the number of inodes is to increase the size of the filesystem.
308No tools currently exist which can change the ratio of inodes to blocks.
309
310Most of these limits could be overcome with slight changes in the on-disk
311format and using a compatibility flag to signal the format change (at
312the expense of some compatibility).
313
314Filesystem block size:    1kB      2kB      4kB      8kB
315
316File size limit:          16GB     256GB    2048GB   2048GB
317Filesystem size limit:    2047GB   8192GB   16384GB  32768GB
318
319There is a 2.4 kernel limit of 2048GB for a single block device, so no
320filesystem larger than that can be created at this time. There is also
321an upper limit on the block size imposed by the page size of the kernel,
322so 8kB blocks are only allowed on Alpha systems (and other architectures
323which support larger pages).
324
325There is an upper limit of 32768 subdirectories in a single directory.
326
327There is a "soft" upper limit of about 10-15k files in a single directory
328with the current linear linked-list directory implementation. This limit
329stems from performance problems when creating and deleting (and also
330finding) files in such large directories. Using a hashed directory index
331(under development) allows 100k-1M+ files in a single directory without
332performance problems (although RAM size becomes an issue at this point).
333
334The (meaningless) absolute upper limit of files in a single directory
335(imposed by the file size, the realistic limit is obviously much less)
336is over 130 trillion files. It would be higher except there are not
337enough 4-character names to make up unique directory entries, so they
338have to be 8 character filenames, even then we are fairly close to
339running out of unique filenames.
340
341Journaling
342----------
343
344A journaling extension to the ext2 code has been developed by Stephen
345Tweedie. It avoids the risks of metadata corruption and the need to
346wait for e2fsck to complete after a crash, without requiring a change
347to the on-disk ext2 layout. In a nutshell, the journal is a regular
348file which stores whole metadata (and optionally data) blocks that have
349been modified, prior to writing them into the filesystem. This means
350it is possible to add a journal to an existing ext2 filesystem without
351the need for data conversion.
352
353When changes are made to the filesystem (e.g. a file is renamed) they are stored in
354a transaction in the journal and can either be complete or incomplete at
355the time of a crash. If a transaction is complete at the time of a crash
356(or in the normal case where the system does not crash), then any blocks
357in that transaction are guaranteed to represent a valid filesystem state,
358and are copied into the filesystem. If a transaction is incomplete at
359the time of the crash, then there is no guarantee of consistency for
360the blocks in that transaction so they are discarded (which means any
361filesystem changes they represent are also lost).
362Check Documentation/filesystems/ext3.txt if you want to read more about
363ext3 and journaling.
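
The recovery rule described above (apply complete transactions, discard incomplete ones) can be modelled in a few lines. This Python sketch uses plain dictionaries in place of disk blocks and is not the ext3 recovery code; the data layout is invented, and stopping at the first incomplete transaction is an assumption based on the journal being written in order:

```python
def replay_journal(filesystem, journal):
    """Apply complete transactions to the filesystem after a crash.

    filesystem: dict mapping block number -> block contents
    journal:    list of {"complete": bool, "blocks": {block number: contents}}
    """
    for transaction in journal:
        if not transaction["complete"]:
            # Incomplete at crash time: no consistency guarantee, so its
            # blocks are discarded (assumption: nothing after it is applied).
            break
        # Complete: the logged blocks represent a valid state, copy them in.
        filesystem.update(transaction["blocks"])
    return filesystem
```

Since whole blocks are copied, applying the same complete transaction more than once leaves the filesystem in the same state.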
364
365References
366==========
367
368The kernel source file:/usr/src/linux/fs/ext2/
369e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/
370Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html
371Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
372Hashed Directories http://kernelnewbies.org/~phillips/htree/
373Filesystem Resizing http://ext2resize.sourceforge.net/
374Compression (*) http://www.netspace.net.au/~reiter/e2compr/
375
376Implementations for:
377Windows 95/98/NT/2000 http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm
378Windows 95 (*) http://www.yipton.demon.co.uk/content.html#FSDEXT2
379DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
380OS/2 http://perso.wanadoo.fr/matthieu.willm/ext2-os2/
381RISC OS client ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/
382
383(*) no longer actively developed/supported (as of Apr 2001)
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
new file mode 100644
index 000000000000..9ab7f446f7ad
--- /dev/null
+++ b/Documentation/filesystems/ext3.txt
@@ -0,0 +1,183 @@
1
2Ext3 Filesystem
3===============
4
5ext3 was originally released in September 1999. It was written by Stephen Tweedie
6for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger,
7Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie.
8
9ext3 is the ext2 filesystem enhanced with journalling capabilities.
10
11Options
12=======
13
14When mounting an ext3 filesystem, the following options are accepted:
15(*) == default
16
17journal=update Update the ext3 file system's journal to the
18 current format.
19
20journal=inum When a journal already exists, this option is
21 ignored. Otherwise, it specifies the number of
22 the inode which will represent the ext3 file
23 system's journal file.
24
25noload Don't load the journal on mounting.
26
27data=journal All data are committed into the journal prior
28 to being written into the main file system.
29
30data=ordered (*) All data are forced directly out to the main file
31 system prior to its metadata being committed to
32 the journal.
33
34data=writeback Data ordering is not preserved, data may be
35 written into the main file system after its
36 metadata has been committed to the journal.
37
38commit=nrsec (*) Ext3 can be told to sync all its data and metadata
39 every 'nrsec' seconds. The default value is 5 seconds.
40 This means that if you lose your power, you will lose,
41 at most, the last 5 seconds of work (your filesystem
42 will not be damaged though, thanks to journaling). This
43 default value (or any low value) will hurt performance,
44 but it's good for data-safety. Setting it to 0 will
45 have the same effect as leaving the default 5 sec.
46 Setting it to very large values will improve
47 performance.
48
49barrier=1 This enables/disables barriers. barrier=0 disables it,
50 barrier=1 enables it.
51
52orlov (*) This enables the new Orlov block allocator. It's enabled
53 by default.
54
55oldalloc This disables the Orlov block allocator and enables the
56 old block allocator. Orlov should have better performance,
57 we'd like to get some feedback if it's the contrary for
58 you.
59
60user_xattr (*) Enables POSIX Extended Attributes. It's enabled by
61 default, however you need to configure its support
62 (CONFIG_EXT3_FS_XATTR). This is necessary if you want
63 to use POSIX Access Control Lists support. You can visit
64 http://acl.bestbits.at to know more about POSIX Extended
65 attributes.
66
67nouser_xattr Disables POSIX Extended Attributes.
68
69acl (*) Enables POSIX Access Control Lists support. This is
70 enabled by default, however you need to configure
71 its support (CONFIG_EXT3_FS_POSIX_ACL). If you want
72 to know more about ACLs visit http://acl.bestbits.at
73
74noacl This option disables POSIX Access Control List support.
75
76reservation
77
78noreservation
79
80resize=
81
82bsddf (*) Make 'df' act like BSD.
83minixdf Make 'df' act like Minix.
84
85check=none Don't do extra checking of bitmaps on mount.
86nocheck
87
88debug Extra debugging information is sent to syslog.
89
90errors=remount-ro(*) Remount the filesystem read-only on an error.
91errors=continue Keep going on a filesystem error.
92errors=panic Panic and halt the machine if an error occurs.
93
94grpid Give objects the same group ID as their parent directory.
95bsdgroups
96
97nogrpid (*) New objects have the group ID of their creator.
98sysvgroups
99
100resgid=n The group ID which may use the reserved blocks.
101
102resuid=n The user ID which may use the reserved blocks.
103
104sb=n Use alternate superblock at this location.
105
106quota Quota options are currently silently ignored.
107noquota (see fs/ext3/super.c, line 594)
108grpquota
109usrquota
110
111
112Specification
113=============
114ext3 shares its entire on-disk implementation with the ext2 filesystem, and adds
115transaction capabilities to ext2. Journaling is done by the
116Journaling Block Device layer.
117
118Journaling Block Device layer
119-----------------------------
120The Journaling Block Device layer (JBD) isn't ext3 specific. It was
121designed to add journaling capabilities to a block device. The ext3
122filesystem code informs the JBD of the modifications it is performing
123(called a transaction). The journal supports starting and stopping
124transactions, and in case of a crash, the journal can replay the transactions
125to quickly put the partition back into a consistent state.
126
127Handles represent a single atomic update to a filesystem. JBD can
128handle an external journal on a block device.
129
130Data Mode
131---------
132There are 3 different data modes:
133
134* writeback mode
135In data=writeback mode, ext3 does not journal data at all. This mode
136provides a similar level of journaling as XFS, JFS, and ReiserFS in its
137default mode - metadata journaling. A crash+recovery can cause
138incorrect data to appear in files which were written shortly before the
139crash. This mode will typically provide the best ext3 performance.
140
141* ordered mode
142In data=ordered mode, ext3 only officially journals metadata, but it
143logically groups metadata and data blocks into a single unit called a
144transaction. When it's time to write the new metadata out to disk, the
145associated data blocks are written first. In general, this mode
146performs slightly slower than writeback but significantly faster than
147journal mode.
148
149* journal mode
150data=journal mode provides full data and metadata journaling. All new
151data is written to the journal first, and then to its final location.
152In the event of a crash, the journal can be replayed, bringing both
153data and metadata into a consistent state. This mode is the slowest
154except when data needs to be read from and written to disk at the same
155time, where it outperforms all other modes.
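
The difference between the three modes can be summarised as a small table.
The following Python sketch is only an illustration of the descriptions
above; the field names and the helper function are made up for this example
and do not correspond to anything in the ext3 or JBD sources.

```python
# For each mode: is file data written to the journal, and is file data
# forced out to the main filesystem before its metadata is committed?
DATA_MODES = {
    "writeback": {"journals_data": False, "data_before_metadata": False},
    "ordered":   {"journals_data": False, "data_before_metadata": True},
    "journal":   {"journals_data": True,  "data_before_metadata": False},
}


def incorrect_data_possible(mode):
    """Can a crash+recovery leave incorrect data in recently written files?

    That can only happen when metadata may be committed while the data it
    describes is neither in the journal nor already in the main filesystem.
    """
    m = DATA_MODES[mode]
    return not (m["journals_data"] or m["data_before_metadata"])


risky = [mode for mode in DATA_MODES if incorrect_data_possible(mode)]
```

In this model only writeback mode ends up in the "risky" list, which matches
the warning given for data=writeback above.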
156
157Compatibility
158-------------
159
160Ext2 partitions can be easily converted to ext3, with `tune2fs -j <dev>`.
161Ext3 is fully compatible with Ext2. Ext3 partitions can easily be
162mounted as Ext2.
163
164External Tools
165==============
166See the manual pages for more information.
167
168tune2fs: create an ext3 journal on an ext2 partition with the -j flag
169mke2fs: create an ext3 partition with the -j flag
170debugfs: ext2 and ext3 file system debugger
171
172References
173==========
174
175kernel source: file:/usr/src/linux/fs/ext3
176 file:/usr/src/linux/fs/jbd
177
178programs: http://e2fsprogs.sourceforge.net
179
180useful links:
181 http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
182 http://www-106.ibm.com/developerworks/linux/library/l-fs7/
183 http://www-106.ibm.com/developerworks/linux/library/l-fs8/
diff --git a/Documentation/filesystems/hfs.txt b/Documentation/filesystems/hfs.txt
new file mode 100644
index 000000000000..bd0fa7704035
--- /dev/null
+++ b/Documentation/filesystems/hfs.txt
@@ -0,0 +1,83 @@
1
2Macintosh HFS Filesystem for Linux
3==================================
4
5HFS stands for ``Hierarchical File System'' and is the filesystem used
6by the Mac Plus and all later Macintosh models. Earlier Macintosh
7models used MFS (``Macintosh File System''), which is not supported.
8MacOS 8.1 and newer support a filesystem called HFS+ that's similar to
9HFS but is extended in various areas. Use the hfsplus filesystem driver
10to access such filesystems from Linux.
11
12
13Mount options
14=============
15
16When mounting an HFS filesystem, the following options are accepted:
17
18 creator=cccc, type=cccc
19 Specifies the creator/type values as shown by the MacOS finder
20 used for creating new files. Default values: '????'.
21
22 uid=n, gid=n
23 Specifies the user/group that owns all files on the filesystem.
24 Default: user/group id of the mounting process.
25
26 dir_umask=n, file_umask=n, umask=n
27 Specifies the umask used for all directories, all files or all
28 files and directories. Defaults to the umask of the mounting process.
29
30 session=n
31 Select the CDROM session to mount as HFS filesystem. Defaults to
32 leaving that decision to the CDROM driver. This option will fail
33 with anything but a CDROM as underlying device.
34
35 part=n
36 Select partition number n from the device. This only makes
37 sense for CDROMs because they can't be partitioned under Linux.
38 For disk devices the generic partition parsing code does this
39 for us. Defaults to not parsing the partition table at all.
40
41 quiet
42 Ignore invalid mount options instead of complaining.
43
44
45Writing to HFS Filesystems
46==========================
47
48HFS is not a UNIX filesystem, thus it does not have the usual features you'd
49expect:
50
51 o You can't modify the set-uid, set-gid, sticky or executable bits or the uid
52 and gid of files.
53 o You can't create hard- or symlinks, device files, sockets or FIFOs.
54
55HFS does on the other hand have the concept of multiple forks per file. These
56non-standard forks are represented as hidden additional files in the normal
57filesystem namespace, which is kind of a kludge and makes their semantics
58a little strange:
59
60 o You can't create, delete or rename resource forks of files or the
61 Finder's metadata.
62 o They are however created (with default values), deleted and renamed
63 along with the corresponding data fork or directory.
64 o Copying files to a different filesystem will lose those attributes
65 that are essential for MacOS to work.
66
67
68Creating HFS filesystems
69========================
70
71The hfsutils package from Robert Leslie contains a program called
72hformat that can be used to create an HFS filesystem. See
73<http://www.mars.org/home/rob/proj/hfs/> for details.
74
75
76Credits
77=======
78
79The HFS driver was written by Paul H. Hargrove (hargrove@sccm.Stanford.EDU)
80and is now maintained by Roman Zippel (roman@ardistech.com) at Ardis
81Technologies.
82Roman rewrote large parts of the code and brought in btree routines derived
83from Brad Boyer's hfsplus driver (also maintained by Roman now).
diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.txt
new file mode 100644
index 000000000000..33dc360c8e89
--- /dev/null
+++ b/Documentation/filesystems/hpfs.txt
@@ -0,0 +1,296 @@
1Read/Write HPFS 2.09
21998-2004, Mikulas Patocka
3
4email: mikulas@artax.karlin.mff.cuni.cz
5homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
6
7CREDITS:
8Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file
9 is taken from it
10Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993)
11Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion
12
13Mount options
14
15uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask)
16 Set owner/group/mode for files that do not have it specified in extended
17 attributes. Mode is inverted umask - for example umask 027 gives owner
18 all permission, group read permission and anybody else no access. Note
19 that for files mode is anded with 0666. If you want files to have 'x'
20 rights, you must use extended attributes.
21case=lower,asis (default asis)
22 File name lowercasing in readdir.
23conv=binary,text,auto (default binary)
24 CR/LF -> LF conversion, if auto, decision is made according to extension
25 - there is a list of text extensions (I think it's better to not convert
26 text file than to damage binary file). If you want to change that list,
27 change it in the source. Original readonly HPFS contained some strange
28 heuristic algorithm that I removed. I think it's dangerous to let the
29 computer decide whether file is text or binary. For example, DJGPP
30 binaries contain small text message at the beginning and they could be
31 misidentified and damaged under some circumstances.
32check=none,normal,strict (default normal)
33 Check level. Selecting none will cause only little speedup and big
34 danger. I tried to write it so that it won't crash if check=normal on
35 corrupted filesystems. check=strict means many superfluous checks -
36 used for debugging (for example it checks if file is allocated in
37 bitmaps when accessing it).
38errors=continue,remount-ro,panic (default remount-ro)
39 Behaviour when filesystem errors are found.
40chkdsk=no,errors,always (default errors)
41 When to mark filesystem dirty so that OS/2 checks it.
42eas=no,ro,rw (default rw)
43 What to do with extended attributes. 'no' - ignore them and use always
44 values specified in uid/gid/mode options. 'ro' - read extended
45 attributes but do not create them. 'rw' - create extended attributes
46 when you use chmod/chown/chgrp/mknod/ln -s on the filesystem.
47timeshift=(-)nnn (default 0)
48 Shifts the time by nnn seconds. For example, if you see under linux
49 one hour more than under os/2, use timeshift=-3600.
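
The "inverted umask" rule from the uid/gid/umask option can be checked with
a few lines of Python. This is a sketch of the arithmetic described above,
not code taken from the driver; the function name and its is_dir parameter
are hypothetical.

```python
def hpfs_default_mode(umask, is_dir):
    """Permission bits for an object that has no "MODE" extended attribute."""
    mode = 0o777 & ~umask      # mode is the inverted umask
    if not is_dir:
        mode &= 0o666          # for files, the mode is anded with 0666
    return mode


dir_mode = hpfs_default_mode(0o027, is_dir=True)
file_mode = hpfs_default_mode(0o027, is_dir=False)
print(oct(dir_mode), oct(file_mode))
```

With umask 027 this gives 0750 for directories and 0640 for files: the owner
gets full access, the group gets read access, and everybody else gets none.
Files never get 'x' bits this way, which is why extended attributes are
needed for executables.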
50
51
52File names
53
54As in OS/2, filenames are case insensitive. However, shell thinks that names
55are case sensitive, so for example when you create a file FOO, you can use
56'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you
57also won't be able to compile linux kernel (and maybe other things) on HPFS
58because kernel creates different files with names like bootsect.S and
59bootsect.s. When searching for a file whose name has characters >= 128, codepages
60are used - see below.
61OS/2 ignores dots and spaces at the end of file name, so this driver does as
62well. If you create 'a. ...', the file 'a' will be created, but you can still
63access it under names 'a.', 'a..', 'a . . . ' etc.
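
The matching rules above can be modelled as a normalisation step applied to
a name before comparison. The Python sketch below is an assumption-laden
illustration: the function name is made up, str.upper() stands in for the
on-disk codepage uppercasing tables, and the special entries '.' and '..'
are not handled.

```python
def hpfs_lookup_key(name):
    """Reduce a file name to the form used for comparison (toy model)."""
    # Dots and spaces at the end of the name are ignored, as in OS/2.
    stripped = name.rstrip(". ")
    # Names are case insensitive; upper() replaces the codepage table here.
    return stripped.upper()


same_file = {hpfs_lookup_key(n) for n in ("a", "a.", "a..", "a . . . ", "a. ...")}
```

All five spellings collapse to the single key "A", and hpfs_lookup_key("FOO")
equals hpfs_lookup_key("foo"). Shell globbing such as 'cat f*' still fails
for a file created as FOO because the shell, not the filesystem, matches the
pattern against the name returned by readdir.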
64
65
66Extended attributes
67
68On HPFS partitions, OS/2 can associate with each file special information called
69extended attributes. Extended attributes are pairs of (key,value) where key is
70an ascii string identifying that attribute and value is any string of bytes of
71variable length. OS/2 stores window and icon positions and file types there. So
72why not use it for unix-specific info like file owner or access rights? This
73driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended
74attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only
75those extended attributes whose values differ from the defaults specified in mount
76options are created. Once created, the extended attributes are never deleted,
77they're just changed. It means that when your default uid=0 and you type
78something like 'chown luser file; chown root file' the file will contain
79extended attribute UID=0. And when you umount the fs and mount it again with
80uid=luser_uid, the file will be still owned by root! If you chmod file to 444,
81extended attribute "MODE" will not be set, this special case is done by setting
82read-only flag. When you mknod a block or char device, besides "MODE", the
83special 4-byte extended attribute "DEV" will be created containing the device
84number. Currently this driver cannot resize extended attributes - it means
85that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV"
86attributes with different sizes, they won't be rewritten and changing these
87values doesn't work.
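
The chown example from the paragraph above can be replayed with a small
Python model. The class, its method names and the uid value 1000 are
hypothetical; only the rule itself (create an attribute when the value
differs from the mount default, never delete it afterwards) comes from the
text.

```python
class HpfsFile:
    """Toy model of a file and its extended attributes."""

    def __init__(self):
        self.eas = {}

    def chown(self, uid, mount_default_uid):
        # An attribute is created only if the value differs from the mount
        # default; once it exists it is changed in place, never deleted.
        if "UID" in self.eas or uid != mount_default_uid:
            self.eas["UID"] = uid

    def owner(self, mount_default_uid):
        # Without a "UID" attribute the mount option decides the owner.
        return self.eas.get("UID", mount_default_uid)


LUSER = 1000                           # hypothetical uid of 'luser'
f = HpfsFile()
f.chown(LUSER, mount_default_uid=0)    # differs from default: EA created
f.chown(0, mount_default_uid=0)        # EA is kept and now says UID=0
owner_after_remount = f.owner(mount_default_uid=LUSER)
```

After the two chown calls the file carries UID=0, so remounting with
uid=luser_uid still reports root as the owner.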
88
89
90Symlinks
91
92You can do symlinks on HPFS partition, symlinks are achieved by setting extended
93attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and
94chgrp symlinks but I don't know what it is good for. chmoding symlink results
95in chmoding file where symlink points. These symlinks are just for Linux use and
96incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are
97stored in very crazy way. They tried to do it so that link changes when file is
98moved ... sometimes it works. But the link is partly stored in directory
99extended attributes and partly in OS2SYS.INI. I don't want (and don't know how)
100to analyze or change OS2SYS.INI.
101
102
103Codepages
104
105HPFS can contain several uppercasing tables for several codepages and each
106file has a pointer to codepage its name is in. However OS/2 was created in
107America where people don't care much about codepages and so multiple codepages
108support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk.
109Once I booted English OS/2 working in cp 850 and I created a file on my 852
110partition. It marked file name codepage as 850 - good. But when I again booted
111Czech OS/2, the file was completely inaccessible under any name. It seems that
112OS/2 uppercases the search pattern with its system code page (852) and file
113name it's comparing to with its code page (850). These could never match. Is it
114really what IBM developers wanted? But problems continued. When I created in
115Czech OS/2 another file in that directory, that file was inaccessible too. OS/2
116probably uses different uppercasing method when searching where to place a file
117(note, that files in HPFS directory must be sorted) and when searching for
118a file. Finally when I opened this directory in PmShell, PmShell crashed (the
119funny thing was that, when rebooted, PmShell tried to reopen this directory
120again :-). chkdsk happily ignores these errors and only low-level disk
121modification saved me. Never mix different language versions of OS/2 on one
122system although HPFS was designed to allow that.
123OK, I could implement complex codepage support to this driver but I think it
124would cause more problems than benefit with such buggy implementation in OS/2.
125So this driver simply uses first codepage it finds for uppercasing and
126lowercasing no matter what's file codepage index. Usually all file names are in
127this codepage - if you don't try to do what I described above :-)
128
129
130Known bugs
131
132HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client
133should work. If you have OS/2 server, use only read-only mode. I don't know how
134to handle some HPFS386 structures like access control list or extended perm
135list, I don't know how to delete them when file is deleted and how to not
136overwrite them with extended attributes. Send me some info on these structures
137and I'll make it. However, this driver should detect presence of HPFS386
138structures, remount read-only and not destroy them (I hope).
139
140When there's not enough space for extended attributes, they will be truncated
141and no error is returned.
142
143OS/2 can't access files if the path is longer than about 256 chars but this
144driver allows you to do it. chkdsk ignores such errors.
145
146Sometimes you won't be able to delete some files on a very full filesystem
147(returning error ENOSPC). That's because file in non-leaf node in directory tree
148(one directory, if it's large, has dirents in tree on HPFS) must be replaced
149with another node when deleted. And that new file might have larger name than
150the old one so the new name doesn't fit in directory node (dnode). And that
151would result in directory tree splitting, that takes disk space. Workaround is
152to delete other files that are leaf (probability that the file is non-leaf is
153about 1/50) or to truncate file first to make some space.
154You encounter this problem only if you have many directories so that
155preallocated directory band is full i.e.
156 number_of_directories / size_of_filesystem_in_mb > 4.
157
158You can't delete open directories.
159
160You can't rename over directories (what is it good for?).
161
162Renaming files so that only case changes doesn't work. This driver supports it
163but vfs doesn't. Something like 'mv file FILE' won't work.
164
165All atimes and directory mtimes are not updated. That's because of performance
166reasons. If you extremely wish to update them, let me know, I'll write it (but
167it will be slow).
168
169When the system is out of memory and swap, it may slightly corrupt filesystem
170(lost files, unbalanced directories). (I guess all filesystem may do it).
171
172When compiled, you get warning: function declaration isn't a prototype. Does
173anybody know what it means?
174
175
176What does "unbalanced tree" message mean?
177
178Old versions of this driver created sometimes unbalanced dnode trees. OS/2
179chkdsk doesn't scream if the tree is unbalanced (and sometimes creates
180unbalanced trees too :-) but both HPFS and HPFS386 contain bug that it rarely
181crashes when the tree is not balanced. This driver handles unbalanced trees
182correctly and writes warning if it finds them. If you see this message, this is
183probably because of directories created with old version of this driver.
184Workaround is to move all files from that directory to another and then back
185again. Do it in Linux, not OS/2! If you see this message in directory that is
186wholly created by this driver, it is a BUG - let me know about it.
187
188
189Bugs in OS/2
190
191When you have two (or more) lost directories pointing each to other, chkdsk
192locks up when repairing filesystem.
193
194Sometimes (I think it's random) when you create a file with one-char name under
195OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs
196error corrected".
197
198File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and
199marks them as short (and writes "minor fs error corrected"). This bug is not in
200HPFS386.
201
202Codepage bugs described above.
203
204If you don't install fixpacks, there are many, many more...
205
206
207History
208
2090.90 First public release
2100.91 Fixed bug that caused shooting to memory when write_inode was called on
211 open inode (rarely happened)
2120.92 Fixed a little memory leak in freeing directory inodes
2130.93 Fixed bug that locked up the machine when there were too many filenames
214 with first 15 characters same
215 Fixed write_file to zero file when writing behind file end
2160.94 Fixed a little memory leak when trying to delete busy file or directory
2170.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files
2181.90 First version for 2.1.1xx kernels
2191.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
220 Fixed a race-condition when write_inode is called while deleting file
221 Fixed a bug that could possibly happen (with very low probability) when
222 using 0xff in filenames
223 Rewritten locking to avoid race-conditions
224 Mount option 'eas' now works
225 Fsync no longer returns error
226 Files beginning with '.' are marked hidden
227 Remount support added
228 Alloc is not so slow when filesystem becomes full
229 Atimes are no more updated because it slows down operation
230 Code cleanup (removed all commented debug prints)
2311.92 Corrected a bug when sync was called just before closing file
2321.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
233 works with previous versions
234 Fixed a possible problem with disks > 64G (but I don't have one, so I can't
235 test it)
236 Fixed a file overflow at 2G
237 Added new option 'timeshift'
238 Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
239 read-only mode
240 Fixed a bug that slowed down alloc and prevented allocating 100% space
241 (this bug was not destructive)
2421.94 Added workaround for one bug in Linux
243 Fixed one buffer leak
244 Fixed some incompatibilities with large extended attributes (but it's still
245 not 100% ok, I have no info on it and OS/2 doesn't want to create them)
246 Rewritten allocation
247 Fixed a bug with i_blocks (du sometimes didn't display correct values)
248 Directories have no longer archive attribute set (some programs don't like
249 it)
250 Fixed a bug that it set badly one flag in large anode tree (it was not
251 destructive)
2521.95 Fixed one buffer leak, that could happen on corrupted filesystem
253 Fixed one bug in allocation in 1.94
2541.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
255 error sometimes when opening directories in PMSHELL)
256 Fixed a possible bitmap race
257 Fixed possible problem on large disks
258 You can now delete open files
259 Fixed a nondestructive race in rename
2601.97 Support for HPFS v3 (on large partitions)
261 Fixed a bug that it didn't allow creation of files > 128M (it should be 2G)
2621.97.1 Changed names of global symbols
263 Fixed a bug when chmoding or chowning root directory
2641.98 Fixed a deadlock when using old_readdir
265 Better directory handling; workaround for "unbalanced tree" bug in OS/2
2661.99 Corrected a possible problem when there's not enough space while deleting
267 file
268 Now it tries to truncate the file if there's not enough space when deleting
269 Removed a lot of redundant code
2702.00 Fixed a bug in rename (it was there since 1.96)
271 Better anti-fragmentation strategy
2722.01 Fixed problem with directory listing over NFS
273 Directory lseek now checks for proper parameters
274 Fixed race-condition in buffer code - it is in all filesystems in Linux;
275 when reading device (cat /dev/hda) while creating files on it, files
276 could be damaged
2772.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
278 end of partition
2792.03 Char, block devices and pipes are correctly created
280 Fixed non-crashing race in unlink (Alexander Viro)
281 Now it works with Japanese version of OS/2
2822.04 Fixed error when ftruncate used to extend file
2832.05 Fixed crash when got mount parameters without =
284 Fixed crash when allocation of anode failed due to full disk
285 Fixed some crashes when block io or inode allocation failed
2862.06 Fixed some crash on corrupted disk structures
287 Better allocation strategy
288 Reschedule points added so that it doesn't lock CPU long time
289 It should work in read-only mode on Warp Server
2902.07 More fixes for Warp Server. Now it really works
2912.08 Creating new files is not so slow on large disks
292 An attempt to sync deleted file does not generate filesystem error
2932.09 Fixed error on extremely fragmented files
294
295
296 vim: set textwidth=80:
diff --git a/Documentation/filesystems/isofs.txt b/Documentation/filesystems/isofs.txt
new file mode 100644
index 000000000000..f64a10506689
--- /dev/null
+++ b/Documentation/filesystems/isofs.txt
@@ -0,0 +1,38 @@
1Mount options that are the same as for msdos and vfat partitions.
2
3 gid=nnn All files in the partition will be in group nnn.
4 uid=nnn All files in the partition will be owned by user id nnn.
5 umask=nnn The permission mask (see umask(1)) for the partition.
6
7Mount options that are the same as vfat partitions. These are only useful
8when using discs encoded using Microsoft's Joliet extensions.
9 iocharset=name Character set to use for converting from Unicode to
10 ASCII. Joliet filenames are stored in Unicode format, but
11 Unix for the most part doesn't know how to deal with Unicode.
12 There is also an option of doing UTF8 translations with the
13 utf8 option.
14 utf8 Encode Unicode names in UTF8 format. Default is no.
15
16Mount options unique to the isofs filesystem.
17 block=512 Set the block size for the disk to 512 bytes
18 block=1024 Set the block size for the disk to 1024 bytes
19 block=2048 Set the block size for the disk to 2048 bytes
20 check=relaxed Matches filenames with different cases
21 check=strict Matches only filenames with the exact same case
22 cruft Try to handle badly formatted CDs.
23 map=off Do not map non-Rock Ridge filenames to lower case
24 map=normal Map non-Rock Ridge filenames to lower case
25 map=acorn As map=normal but also apply Acorn extensions if present
26 mode=xxx Sets the permissions on files to xxx
27 nojoliet Ignore Joliet extensions if they are present.
28 norock Ignore Rock Ridge extensions if they are present.
29 unhide Show hidden files.
30 session=x Select number of session on multisession CD
31 sbsector=xxx Session begins from sector xxx
32
33Recommended documents about ISO 9660 standard are located at:
34http://www.y-adagio.com/public/standards/iso_cdromr/tocont.htm
35ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
36Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically
37identical with ISO 9660.", so it is a valid and gratis substitute of the
38official ISO specification.
diff --git a/Documentation/filesystems/jfs.txt b/Documentation/filesystems/jfs.txt
new file mode 100644
index 000000000000..3e992daf99ad
--- /dev/null
+++ b/Documentation/filesystems/jfs.txt
@@ -0,0 +1,35 @@
1IBM's Journaled File System (JFS) for Linux
2
3JFS Homepage: http://jfs.sourceforge.net/
4
5The following mount options are supported:
6
7iocharset=name Character set to use for converting from Unicode to
8 ASCII. The default is to do no conversion. Use
9 iocharset=utf8 for UTF8 translations. This requires
10 CONFIG_NLS_UTF8 to be set in the kernel .config file.
11 iocharset=none specifies the default behavior explicitly.
12
13resize=value Resize the volume to <value> blocks. JFS only supports
14 growing a volume, not shrinking it. This option is only
15 valid during a remount, when the volume is mounted
16 read-write. The resize keyword with no value will grow
17 the volume to the full size of the partition.
18
19nointegrity Do not write to the journal. The primary use of this option
20 is to allow for higher performance when restoring a volume
21 from backup media. The integrity of the volume is not
22 guaranteed if the system abnormally abends.
23
24integrity Default. Commit metadata changes to the journal. Use this
25 option to remount a volume where the nointegrity option was
26 previously specified in order to restore normal behavior.
27
28errors=continue Keep going on a filesystem error.
29errors=remount-ro Default. Remount the filesystem read-only on an error.
30errors=panic Panic and halt the machine if an error occurs.
31
32Please send bugs, comments, cards and letters to shaggy@austin.ibm.com.
33
34The JFS mailing list can be subscribed to by using the link labeled
35"Mail list Subscribe" at our web page http://jfs.sourceforge.net/
diff --git a/Documentation/filesystems/ncpfs.txt b/Documentation/filesystems/ncpfs.txt
new file mode 100644
index 000000000000..f12c30c93f2f
--- /dev/null
+++ b/Documentation/filesystems/ncpfs.txt
@@ -0,0 +1,12 @@
1The ncpfs filesystem understands the NCP protocol, designed by the
2Novell Corporation for their NetWare(tm) product. NCP is functionally
3similar to the NFS used in the TCP/IP community.
4To mount a NetWare filesystem, you need a special mount program, which
5can be found in the ncpfs package. The home site for ncpfs is
6ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors
7will have it as well.
8
9Related products are linware and mars_nwe, which will give Linux partial
10NetWare server functionality. Linware's home site is
11klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on
12ftp.gwdg.de/pub/linux/misc/ncpfs.
diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt
new file mode 100644
index 000000000000..f89b440fad1d
--- /dev/null
+++ b/Documentation/filesystems/ntfs.txt
@@ -0,0 +1,630 @@
1The Linux NTFS filesystem driver
2================================
3
4
5Table of contents
6=================
7
8- Overview
9- Web site
10- Features
11- Supported mount options
12- Known bugs and (mis-)features
13- Using NTFS volume and stripe sets
14 - The Device-Mapper driver
15 - The Software RAID / MD driver
16 - Limitiations when using the MD driver
17- ChangeLog
18
19
20Overview
21========
22
23Linux-NTFS comes with a number of user-space programs known as ntfsprogs.
24These include mkntfs, a full-featured ntfs file system format utility,
25ntfsundelete used for recovering files that were unintentionally deleted
26from an NTFS volume and ntfsresize which is used to resize an NTFS partition.
27See the web site for more information.
28
29To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file
30system type 'ntfs'. The driver currently supports read-only mode (with no
31fault-tolerance, encryption or journalling) and very limited, but safe, write
32support.
33
34For fault tolerance and raid support (i.e. volume and stripe sets), you can
35use the kernel's Device-Mapper driver or its Software RAID / MD driver. See
36section "Using NTFS volume and stripe sets" for details.
37
38
39Web site
40========
41
42There is plenty of additional information on the linux-ntfs web site
43at http://linux-ntfs.sourceforge.net/
44
45The web site has a lot of additional information, such as a comprehensive
46FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS
47userspace utilities, etc.
48
49
50Features
51========
52
53- This is a complete rewrite of the NTFS driver that used to be in the kernel.
54 This new driver implements NTFS read support and is functionally equivalent
55 to the old ntfs driver.
56- The new driver has full support for sparse files on NTFS 3.x volumes which
57 the old driver isn't happy with.
58- The new driver supports execution of binaries due to mmap() now being
59 supported.
60- The new driver supports loopback mounting of files on NTFS which is used by
61 some Linux distributions to enable the user to run Linux from an NTFS
62 partition by creating a large file while in Windows and then loopback
63 mounting the file while in Linux and creating a Linux filesystem on it that
64 is used to install Linux on it.
65- A comparison of the two drivers using:
66 time find . -type f -exec md5sum "{}" \;
67 run three times in sequence with each driver (after a reboot) on a 1.4GiB
68 NTFS partition, showed the new driver to be 20% faster in total time elapsed
69 (from 9:43 minutes on average down to 7:53). The time spent in user space
70 was unchanged but the time spent in the kernel was decreased by a factor of
71 2.5 (from 85 CPU seconds down to 33).
72- The driver does not support short file names in general. For backwards
73 compatibility, we implement access to files using their short file names if
74 they exist. The driver will not create short file names however, and a
75 rename will discard any existing short file name.
76- The new driver supports exporting of mounted NTFS volumes via NFS.
77- The new driver supports async io (aio).
78- The new driver supports fsync(2), fdatasync(2), and msync(2).
79- The new driver supports readv(2) and writev(2).
80- The new driver supports access time updates (including mtime and ctime).
81
82
83Supported mount options
84=======================
85
86In addition to the generic mount options described by the manual page for the
87mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
88following mount options:
89
90iocharset=name Deprecated option. Still supported but please use
91 nls=name in the future. See description for nls=name.
92
93nls=name Character set to use when returning file names.
94 Unlike VFAT, NTFS suppresses names that contain
95 unconvertible characters. Note that most character
96 sets contain insufficient characters to represent all
97 possible Unicode characters that can exist on NTFS.
98 To be sure you are not missing any files, you are
99 advised to use nls=utf8 which is capable of
100 representing all Unicode characters.
101
102utf8=<bool> Option no longer supported. Currently mapped to
103 nls=utf8 but please use nls=utf8 in the future and
104 make sure utf8 is compiled either as module or into
105 the kernel. See description for nls=name.
106
107uid=
108gid=
109umask= Provide default owner, group, and access mode mask.
110 These options work as documented in mount(8). By
111 default, the files/directories are owned by root and
112 he/she has read and write permissions, as well as
113 browse permission for directories. No one else has any
114 access permissions. I.e. the mode on all files is by
115 default rw------- and for directories rwx------, a
116 consequence of the default fmask=0177 and dmask=0077.
117 Using a umask of zero will grant all permissions to
118 everyone, i.e. all files and directories will have mode
119 rwxrwxrwx.
120
121fmask=
122dmask= Instead of specifying umask which applies both to
123 files and directories, fmask applies only to files and
124 dmask only to directories.
125
126sloppy=<BOOL> If sloppy is specified, ignore unknown mount options.
127 Otherwise the default behaviour is to abort mount if
128 any unknown options are found.
129
130show_sys_files=<BOOL> If show_sys_files is specified, show the system files
131 in directory listings. Otherwise the default behaviour
132 is to hide the system files.
133 Note that even when show_sys_files is specified, "$MFT"
134 will not be visible due to bugs/mis-features in glibc.
135 Further, note that irrespective of show_sys_files, all
136 files are accessible by name, i.e. you can always do
137 "ls -l \$UpCase" for example to specifically show the
138 system file containing the Unicode upcase table.
139
140case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as
141 case sensitive and create file names in the POSIX
142 namespace. Otherwise the default behaviour is to treat
143 file names as case insensitive and to create file names
144 in the WIN32/LONG name space. Note, the Linux NTFS
145 driver will never create short file names and will
146 remove them on rename/delete of the corresponding long
147 file name.
148 Note that files remain accessible via their short file
149 name, if it exists. If case_sensitive, you will need
150 to provide the correct case of the short file name.
151
152errors=opt What to do when critical file system errors are found.
153 Following values can be used for "opt":
154 continue: DEFAULT, try to clean-up as much as
155 possible, e.g. marking a corrupt inode as
156 bad so it is no longer accessed, and then
157 continue.
158 recover: At present only supported is recovery of
159 the boot sector from the backup copy.
160 If read-only mount, the recovery is done
161 in memory only and not written to disk.
162 Note that the options are additive, i.e. specifying:
163 errors=continue,errors=recover
164 means the driver will attempt to recover and if that
165 fails it will clean-up as much as possible and
166 continue.
167
168mft_zone_multiplier= Set the MFT zone multiplier for the volume (this
169 setting is not persistent across mounts and can be
170 changed from mount to mount but cannot be changed on
171 remount). Values of 1 to 4 are allowed, 1 being the
172 default. The MFT zone multiplier determines how much
173 space is reserved for the MFT on the volume. If all
174 other space is used up, then the MFT zone will be
175 shrunk dynamically, so this has no impact on the
176 amount of free space. However, it can have an impact
177 on performance by affecting fragmentation of the MFT.
178 In general use the default. If you have a lot of small
179 files then use a higher value. The values have the
180 following meaning:
181 Value MFT zone size (% of volume size)
182 1 12.5%
183 2 25%
184 3 37.5%
185 4 50%
186 Note this option is irrelevant for read-only mounts.
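
The default modes quoted under uid=, gid= and umask= above follow from clearing
the mask bits out of full permissions (0777). The following is a minimal shell
sketch of that arithmetic, not driver code; the function name mode_from_mask is
made up for this illustration.

```shell
# Hypothetical helper: print the permission bits left after applying a mask.
# Full permissions are 0777; every bit set in the mask is cleared.
mode_from_mask() {
    printf '%04o\n' $(( 0777 & ~$1 ))
}

mode_from_mask 0177   # default fmask, files:       0600 (rw-------)
mode_from_mask 0077   # default dmask, directories: 0700 (rwx------)
mode_from_mask 0      # umask of zero:              0777 (rwxrwxrwx)
```

The same calculation shows why fmask=0077 yields owner-executable files: it
leaves 0700 instead of 0600.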
187
188
189Known bugs and (mis-)features
190=============================
191
192- The link count on each directory inode entry is set to 1, due to Linux not
193 supporting directory hard links. This may well confuse some user space
194 applications, since the directory names will have the same inode numbers.
195 This also speeds up ntfs_read_inode() immensely. And we haven't found any
196 problems with this approach so far. If you find a problem with this, please
197 let us know.
198
199
200Please send bug reports/comments/feedback/abuse to the Linux-NTFS development
201list at sourceforge: linux-ntfs-dev@lists.sourceforge.net
202
203
204Using NTFS volume and stripe sets
205=================================
206
207For support of volume and stripe sets, you can either use the kernel's
208Device-Mapper driver or the kernel's Software RAID / MD driver. The former is
209the recommended one to use for linear raid. But the latter is required for
210raid level 5. For striping and mirroring, either driver should work fine.
211
212
213The Device-Mapper driver
214------------------------
215
216You will need to create a table of the components of the volume/stripe set and
217how they fit together and load this into the kernel using the dmsetup utility
218(see man 8 dmsetup).
219
220Linear volume sets, i.e. linear raid, have been tested and work fine. Even
221though untested, there is no reason why stripe sets, i.e. raid level 0, and
222mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e.
223raid level 5, unfortunately cannot work yet because the current version of the
224Device-Mapper driver does not support raid level 5. You may be able to use the
225Software RAID / MD driver for raid level 5, see the next section for details.
226
227To create the table describing your volume you will need to know each of its
228components and their sizes in sectors, i.e. multiples of 512-byte blocks.
229
230For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
231example if one of your partitions is /dev/hda2 you would do:
232
233$ fdisk -ul /dev/hda
234
235Disk /dev/hda: 81.9 GB, 81964302336 bytes
236255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
237Units = sectors of 1 * 512 = 512 bytes
238
239 Device Boot Start End Blocks Id System
240 /dev/hda1 * 63 4209029 2104483+ 83 Linux
241 /dev/hda2 4209030 37768814 16779892+ 86 NTFS
242 /dev/hda3 37768815 46170809 4200997+ 83 Linux
243
244And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
24533559785 sectors.
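
The size calculation above can be scripted. This is only a sketch; the function
name partition_sectors is hypothetical, and the arguments are the Start and End
columns as printed by fdisk -ul.

```shell
# Hypothetical helper: number of sectors in a partition, given the Start and
# End columns of "fdisk -ul". Both the start and the end sector belong to the
# partition, hence the "+ 1".
partition_sectors() {
    echo $(( $2 - $1 + 1 ))
}

partition_sectors 4209030 37768814    # /dev/hda2 from the listing above
```

For /dev/hda2 this prints 33559785, the value derived by hand above.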
246
247For Win2k and later dynamic disks, you can for example use the ldminfo utility
248which is part of the Linux LDM tools (the latest version at the time of
249writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
250 http://linux-ntfs.sourceforge.net/downloads.html
251Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
252into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
253will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
254able to compile this yourself easily so use the binary version!
255
256Then you would use ldminfo in dump mode to obtain the necessary information:
257
258$ ./ldminfo --dump /dev/hda
259
260This would dump the LDM database found on /dev/hda which describes all of your
261dynamic disks and all the volumes on them. At the bottom you will see the
262VOLUME DEFINITIONS section which is all you really need. You may need to look
263further above to determine which of the disks in the volume definitions is
264which device in Linux. Hint: Run ldminfo on each of your dynamic disks and
265look at the Disk Id close to the top of the output for each (the PRIVATE HEADER
266section). You can then find these Disk Ids in the VBLK DATABASE section in the
267<Disk> components where you will get the LDM Name for the disk that is found in
268the VOLUME DEFINITIONS section.
269
270Note you will also need to enable the LDM driver in the Linux kernel. If your
271distribution did not enable it, you will need to recompile the kernel with it
272enabled. This will create the LDM partitions on each device at boot time. You
273would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc)
274in the Device-Mapper table.
275
276You can also bypass using the LDM driver by using the main device (e.g.
277/dev/hda) and then using the offsets of the LDM partitions into this device as
278the "Start sector of device" when creating the table. Once again ldminfo would
279give you the correct information to do this.
280
281Assuming you know all your devices and their sizes things are easy.
282
283For a linear raid the table would look like this (note all values are in
284512-byte sectors):
285
286--- cut here ---
287# Offset into Size of this Raid type Device Start sector
288# volume device of device
2890 1028161 linear /dev/hda1 0
2901028161 3903762 linear /dev/hdb2 0
2914931923 2103211 linear /dev/hdc1 0
292--- cut here ---
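
In the linear table, the first column of each row is the sum of the sizes of
all preceding rows. The sketch below generates such rows; the input format (one
"device size" pair per line on standard input) and the function name
linear_table are assumptions made for this illustration, and the start sector
within each device is assumed to be 0 as in the example.

```shell
# Hypothetical helper: read "device size-in-sectors" pairs from stdin and
# print one linear table row per device. The offset into the volume is the
# running total of the sizes seen so far.
linear_table() {
    offset=0
    while read -r device size; do
        printf '%s %s linear %s 0\n' "$offset" "$size" "$device"
        offset=$(( offset + size ))
    done
}

printf '%s\n' \
    '/dev/hda1 1028161' \
    '/dev/hdb2 3903762' \
    '/dev/hdc1 2103211' | linear_table
```

With the three devices from the example this reproduces the offsets 0, 1028161
and 4931923 shown in the table.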
293
294For a striped volume, i.e. raid level 0, you will need to know the chunk size
295you used when creating the volume. Windows uses 64kiB as the default, so it
296will probably be this unless you changed the defaults when creating the array.
297
298For a raid level 0 the table would look like this (note all values are in
299512-byte sectors):
300
301--- cut here ---
302# Offset Size Raid Number Chunk 1st Start 2nd Start
303# into of the type of size Device in Device in
304# volume volume stripes device device
3050 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
306--- cut here ---
307
308If there are more than two devices, just add each of them to the end of the
309line.
310
311Finally, for a mirrored volume, i.e. raid level 1, the table would look like
312this (note all values are in 512-byte sectors):
313
314--- cut here ---
315# Ofs Size Raid Log Number Region Should Number Source Start Target Start
316# in of the type type of log size sync? of Device in Device in
317# vol volume params mirrors Device Device
3180 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
319--- cut here ---
320
321If you are mirroring to multiple devices you can specify further targets at the
322end of the line.
323
324Note the "Should sync?" parameter "nosync" means that the two mirrors are
325already in sync which will be the case on a clean shutdown of Windows. If the
326mirrors are not clean, you can specify the "sync" option instead of "nosync"
327and the Device-Mapper driver will then copy the entirety of the "Source Device"
328to the "Target Device" or if you specified multiple target devices to all of
329them.
330
331Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
332and hand it over to dmsetup to work with, like so:
333
334$ dmsetup create myvolume1 /etc/ntfsvolume1
335
336You can obviously replace "myvolume1" with whatever name you like.
337
338If it all worked, you will now have the device /dev/device-mapper/myvolume1
339which you can then just use as an argument to the mount command as usual to
340mount the ntfs volume. For example:
341
342$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
343
344(You need to create the directory /mnt/myvol1 first and of course you can use
345anything you like instead of /mnt/myvol1 as long as it is an existing
346directory.)
347
348It is advisable to do the mount read-only to see if the volume has been set up
349correctly to avoid the possibility of causing damage to the data on the ntfs
350volume.
351
352
353The Software RAID / MD driver
354-----------------------------
355
356An alternative to using the Device-Mapper driver is to use the kernel's
357Software RAID / MD driver. For this you need to set up your /etc/raidtab
358appropriately (see man 5 raidtab).
359
360Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level
3610, have been tested and work fine (though see section "Limitations when using
362the Software RAID / MD driver" especially if you want to use linear raid).
363Even though untested, there is no reason why mirrors, i.e. raid level 1, and
364stripes with parity, i.e. raid level 5, should not work, too.
365
366You have to use the "persistent-superblock 0" option for each raid-disk in the
367NTFS volume/stripe you are configuring in /etc/raidtab as the persistent
368superblock used by the MD driver would damage the NTFS volume.
369
370Windows by default uses a stripe chunk size of 64k, so you probably want the
371"chunk-size 64k" option for each raid-disk, too.
372
373For example, if you have a stripe set consisting of two partitions /dev/hda5
374and /dev/hdb1 your /etc/raidtab would look like this:
375
376raiddev /dev/md0
377 raid-level 0
378 nr-raid-disks 2
379 nr-spare-disks 0
380 persistent-superblock 0
381 chunk-size 64k
382 device /dev/hda5
383 raid-disk 0
384 device /dev/hdb1
385 raid-disk 1
386
387For linear raid, just change the raid-level above to "raid-level linear", for
388mirrors, change it to "raid-level 1", and for stripe sets with parity, change
389it to "raid-level 5".
390
391Note for stripe sets with parity you will also need to tell the MD driver
392which parity algorithm to use by specifying the option "parity-algorithm
393which", where you need to replace "which" with the name of the algorithm to
394use (see man 5 raidtab for available algorithms) and you will have to try the
395different available algorithms until you find one that works. Make sure you
396are working read-only when playing with this as you may damage your data
397otherwise. If you find which algorithm works please let us know (email the
398linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on
399IRC in channel #ntfs on the irc.freenode.net network) so we can update this
400documentation.
401
402Once the raidtab is set up, run for example raid0run -a to start all devices or
403raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
404
405Then just use the mount command as usual to mount the ntfs volume using for
406example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
407
408It is advisable to do the mount read-only to see if the md volume has been
409set up correctly to avoid the possibility of causing damage to the data on the
410ntfs volume.
411
412
413Limitations when using the Software RAID / MD driver
414----------------------------------------------------
415
416Using the md driver will not work properly if any of your NTFS partitions have
417an odd number of sectors. This is especially important for linear raid as all
418data after the first partition with an odd number of sectors will be offset by
419one or more sectors so if you mount such a partition with write support you
420will cause massive damage to the data on the volume which will only become
421apparent when you try to use the volume again under Windows.
422
423So when using linear raid, make sure that all your partitions have an even
424number of sectors BEFORE attempting to use it. You have been warned!
425
426Even better is to simply use the Device-Mapper for linear raid and then you do
427not have this problem with odd numbers of sectors.
428
429
430ChangeLog
431=========
432
433Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
434
4352.1.22:
436 - Improve handling of ntfs volumes with errors.
437 - Fix various bugs and race conditions.
4382.1.21:
439 - Fix several race conditions and various other bugs.
440 - Many internal cleanups, code reorganization, optimizations, and mft
441 and index record writing code rewritten to fit in with the changes.
442 - Update Documentation/filesystems/ntfs.txt with instructions on how to
443 use the Device-Mapper driver with NTFS ftdisk/LDM raid.
4442.1.20:
445 - Fix two stupid bugs introduced in 2.1.18 release.
4462.1.19:
447 - Minor bugfix in handling of the default upcase table.
448 - Many internal cleanups and improvements. Many thanks to Linus
449 Torvalds and Al Viro for the help and advice with the sparse
450 annotations and cleanups.
4512.1.18:
452 - Fix scheduling latencies at mount time. (Ingo Molnar)
453 - Fix endianness bug in a little traversed portion of the attribute
454 lookup code.
4552.1.17:
456 - Fix bugs in mount time error code paths.
4572.1.16:
458 - Implement access time updates (including mtime and ctime).
459 - Implement fsync(2), fdatasync(2), and msync(2) system calls.
460 - Enable the readv(2) and writev(2) system calls.
461 - Enable access via the asynchronous io (aio) API by adding support for
462 the aio_read(3) and aio_write(3) functions.
4632.1.15:
464 - Invalidate quotas when (re)mounting read-write.
465 NOTE: This now only leaves user space journalling on the side. (See
466 note for version 2.1.13, below.)
4672.1.14:
468 - Fix an NFSd caused deadlock reported by several users.
4692.1.13:
470 - Implement writing of inodes (access time updates are not implemented
471 yet so mounting with -o noatime,nodiratime is enforced).
472 - Enable writing out of resident files so you can now overwrite any
473 uncompressed, unencrypted, nonsparse file as long as you do not
474 change the file size.
475 - Add housekeeping of ntfs system files so that ntfsfix no longer needs
476 to be run after writing to an NTFS volume.
477 NOTE: This still leaves quota tracking and user space journalling on
478 the side but they should not cause data corruption. In the worst
479 case the charged quotas will be out of date ($Quota) and some
480 userspace applications might get confused due to the out of date
481 userspace journal ($UsnJrnl).
4822.1.12:
483 - Fix the second fix to the decompression engine from the 2.1.9 release
484 and some further internals cleanups.
4852.1.11:
486 - Driver internal cleanups.
4872.1.10:
488 - Force read-only (re)mounting of volumes with unsupported volume
489 flags and various cleanups.
4902.1.9:
491 - Fix two bugs in handling of corner cases in the decompression engine.
4922.1.8:
493 - Read the $MFT mirror and compare it to the $MFT and if the two do not
494 match, force a read-only mount and do not allow read-write remounts.
495 - Read and parse the $LogFile journal and if it indicates that the
496 volume was not shutdown cleanly, force a read-only mount and do not
497 allow read-write remounts. If the $LogFile indicates a clean
498 shutdown and a read-write (re)mount is requested, empty $LogFile to
499 ensure that Windows cannot cause data corruption by replaying a stale
500 journal after Linux has written to the volume.
501 - Improve time handling so that the NTFS time is fully preserved when
502 converted to kernel time and only up to 99 nano-seconds are lost when
503 kernel time is converted to NTFS time.
5042.1.7:
505 - Enable NFS exporting of mounted NTFS volumes.
5062.1.6:
507 - Fix minor bug in handling of compressed directories that fixes the
508 erroneous "du" and "stat" output people reported.
5092.1.5:
510 - Minor bug fix in attribute list attribute handling that fixes the
511 I/O errors on "ls" of certain fragmented files found by at least two
512 people running Windows XP.
5132.1.4:
514 - Minor update allowing compilation with all gcc versions (well, the
515 ones the kernel can be compiled with anyway).
5162.1.3:
517 - Major bug fixes for reading files and volumes in corner cases which
518 were being hit by Windows 2k/XP users.
5192.1.2:
520 - Major bug fixes alleviating the hangs in statfs experienced by some
521 users.
5222.1.1:
523 - Update handling of compressed files so people no longer get the
524 frequently reported warning messages about initialized_size !=
525 data_size.
5262.1.0:
527 - Add configuration option for developmental write support.
528 - Initial implementation of file overwriting. (Writes to resident files
529 are not written out to disk yet, so avoid writing to files smaller
530 than about 1kiB.)
531 - Intercept/abort changes in file size as they are not implemented yet.
5322.0.25:
533 - Minor bugfixes in error code paths and small cleanups.
5342.0.24:
535 - Small internal cleanups.
536 - Support for sendfile system call. (Christoph Hellwig)
5372.0.23:
538 - Massive internal locking changes to mft record locking. Fixes
539 various race conditions and deadlocks.
540 - Fix ntfs over loopback for compressed files by adding an
541 optimization barrier. (gcc was screwing up otherwise ?)
542 Thanks go to Christoph Hellwig for pointing these two out:
543 - Remove now unused function fs/ntfs/malloc.h::vmalloc_nofs().
544 - Fix ntfs_free() for ia64 and parisc.
5452.0.22:
546 - Small internal cleanups.
5472.0.21:
548 These only affect 32-bit architectures:
549 - Check for, and refuse to mount too large volumes (maximum is 2TiB).
550 - Check for, and refuse to open too large files and directories
551 (maximum is 16TiB).
5522.0.20:
553 - Support non-resident directory index bitmaps. This means we now cope
554 with huge directories without problems.
555 - Fix a page leak that manifested itself in some cases when reading
556 directory contents.
557 - Internal cleanups.
5582.0.19:
559 - Fix race condition and improvements in block i/o interface.
560 - Optimization when reading compressed files.
5612.0.18:
562 - Fix race condition in reading of compressed files.
5632.0.17:
564 - Cleanups and optimizations.
5652.0.16:
566 - Fix stupid bug introduced in 2.0.15 in new attribute inode API.
567 - Big internal cleanup replacing the mftbmp access hacks by using the
568 new attribute inode API instead.
5692.0.15:
570 - Bug fix in parsing of remount options.
571 - Internal changes implementing attribute (fake) inodes allowing all
572 attribute i/o to go via the page cache and to use all the normal
573 vfs/mm functionality.
5742.0.14:
575 - Internal changes improving run list merging code and minor locking
576 change to not rely on BKL in ntfs_statfs().
5772.0.13:
578 - Internal changes towards using iget5_locked() in preparation for
579 fake inodes and small cleanups to ntfs_volume structure.
5802.0.12:
581 - Internal cleanups in address space operations made possible by the
582 changes introduced in the previous release.
5832.0.11:
584 - Internal updates and cleanups introducing the first step towards
585 fake inode based attribute i/o.
5862.0.10:
587 - Microsoft says that the maximum number of inodes is 2^32 - 1. Update
588 the driver accordingly to only use 32-bits to store inode numbers on
589 32-bit architectures. This improves the speed of the driver a little.
5902.0.9:
591 - Change decompression engine to use a single buffer. This should not
592 affect performance except perhaps on the most heavy i/o on SMP
593 systems when accessing multiple compressed files from multiple
594 devices simultaneously.
595 - Minor updates and cleanups.
5962.0.8:
597 - Remove now obsolete show_inodes and posix mount option(s).
598 - Restore show_sys_files mount option.
599 - Add new mount option case_sensitive, to determine if the driver
600 treats file names as case sensitive or not.
601 - Mostly drop support for short file names (for backwards compatibility
602 we only support accessing files via their short file name if one
603 exists).
604 - Fix dcache aliasing issues wrt short/long file names.
605 - Cleanups and minor fixes.
6062.0.7:
607 - Just cleanups.
6082.0.6:
609 - Major bugfix to make compatible with other kernel changes. This fixes
610 the hangs/oopses on umount.
611 - Locking cleanup in directory operations (remove BKL usage).
6122.0.5:
613 - Major buffer overflow bug fix.
614 - Minor cleanups and updates for kernel 2.5.12.
6152.0.4:
616 - Cleanups and updates for kernel 2.5.11.
6172.0.3:
618 - Small bug fixes, cleanups, and performance improvements.
6192.0.2:
620 - Use default fmask of 0177 so that files are not executable by default.
621 If you want owner executable files, just use fmask=0077.
622 - Update for kernel 2.5.9 but preserve backwards compatibility with
623 kernel 2.5.7.
624 - Minor bug fixes, cleanups, and updates.
6252.0.1:
626 - Minor updates, primarily set the executable bit by default on files
627 so they can be executed.
6282.0.0:
629 - Started ChangeLog.
630
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
new file mode 100644
index 000000000000..2f388460cbe7
--- /dev/null
+++ b/Documentation/filesystems/porting
@@ -0,0 +1,266 @@
1Changes since 2.5.0:
2
3---
4[recommended]
5
6New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(),
7 sb_set_blocksize() and sb_min_blocksize().
8
9Use them.
10
11(sb_find_get_block() replaces 2.4's get_hash_table())
12
13---
14[recommended]
15
16New methods: ->alloc_inode() and ->destroy_inode().
17
18Remove inode->u.foo_inode_i
19Declare
20 struct foo_inode_info {
21 /* fs-private stuff */
22 struct inode vfs_inode;
23 };
24 static inline struct foo_inode_info *FOO_I(struct inode *inode)
25 {
26 return list_entry(inode, struct foo_inode_info, vfs_inode);
27 }
28
29Use FOO_I(inode) instead of &inode->u.foo_inode_i;
30
31Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate
32foo_inode_info and return the address of ->vfs_inode, the latter should free
33FOO_I(inode) (see in-tree filesystems for examples).
34
35Make them ->alloc_inode and ->destroy_inode in your super_operations.
36
37Keep in mind that now you need explicit initialization of private data -
38typically in ->read_inode() and after getting an inode from new_inode().
39
40At some point that will become mandatory.
41
42---
43[mandatory]
44
45Change of file_system_type method (->read_super to ->get_sb)
46
47->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.
48
49Turn your foo_read_super() into a function that would return 0 in case of
50success and negative number in case of error (-EINVAL unless you have more
51informative error value to report). Call it foo_fill_super(). Now declare
52
53struct super_block *foo_get_sb(struct file_system_type *fs_type,
54 int flags, const char *dev_name, void *data)
55{
56 return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super);
57}
58
59(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
60filesystem).
61
62Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as
63foo_get_sb.
64
65---
66[mandatory]
67
68Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames.
69Most likely there is no need to change anything, but if you relied on
70global exclusion between renames for some internal purpose - you need to
71change your internal locking. Otherwise exclusion warranties remain the
72same (i.e. parents and victim are locked, etc.).
73
74---
75[informational]
76
77Now we have the exclusion between ->lookup() and directory removal (by
78->rmdir() and ->rename()). If you used to need that exclusion and do
79it by internal locking (most of filesystems couldn't care less) - you
80can relax your locking.
81
82---
83[mandatory]
84
85->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(),
86->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename()
87and ->readdir() are called without BKL now. Grab it on entry, drop upon return
88- that will guarantee the same locking you used to have. If your method or its
89parts do not need BKL - better yet, now you can shift lock_kernel() and
90unlock_kernel() so that they would protect exactly what needs to be
91protected.
92
93---
94[mandatory]
95
96BKL is also moved from around sb operations. ->write_super() is now called
97without BKL held. BKL should have been shifted into individual fs sb_op
98functions. If you don't need it, remove it.
99
100---
101[informational]
102
103check for ->link() target not being a directory is done by callers. Feel
104free to drop it...
105
106---
107[informational]
108
109->link() callers hold ->i_sem on the object we are linking to. Some of your
110problems might be over...
111
112---
113[mandatory]
114
115new file_system_type method - kill_sb(superblock). If you are converting
116an existing filesystem, set it according to ->fs_flags:
117 FS_REQUIRES_DEV - kill_block_super
118 FS_LITTER - kill_litter_super
119 neither - kill_anon_super
120FS_LITTER is gone - just remove it from fs_flags.
121
122---
123[mandatory]
124
125 FS_SINGLE is gone (actually, that had happened back when ->get_sb()
126went in - and hadn't been documented ;-/). Just remove it from fs_flags
127(and see ->get_sb() entry for other actions).
128
129---
130[mandatory]
131
132->setattr() is called without BKL now. Caller _always_ holds ->i_sem, so
133watch for ->i_sem-grabbing code that might be used by your ->setattr().
134Callers of notify_change() need ->i_sem now.
135
136---
137[recommended]
138
139New super_block field "struct export_operations *s_export_op" for
140explicit support for exporting, e.g. via NFS. The structure is fully
141documented at its declaration in include/linux/fs.h, and in
142Documentation/filesystems/Exporting.
143
144Briefly it allows for the definition of decode_fh and encode_fh operations
145to encode and decode filehandles, and allows the filesystem to use
146a standard helper function for decode_fh, and provide file-system specific
147support for this helper, particularly get_parent.
148
149It is planned that this will be required for exporting once the code
150settles down a bit.
151
152[mandatory]
153
154s_export_op is now required for exporting a filesystem.
155isofs, ext2, ext3, reiserfs, fat
156can be used as examples of very different filesystems.
157
158---
159[mandatory]
160
161iget4() and the read_inode2 callback have been superseded by iget5_locked()
162which has the following prototype,
163
164 struct inode *iget5_locked(struct super_block *sb, unsigned long ino,
165 int (*test)(struct inode *, void *),
166 int (*set)(struct inode *, void *),
167 void *data);
168
169'test' is an additional function that can be used when the inode
170number is not sufficient to identify the actual file object. 'set'
171should be a non-blocking function that initializes those parts of a
172newly created inode to allow the test function to succeed. 'data' is
173passed as an opaque value to both test and set functions.
174
175When the inode has been created by iget5_locked(), it will be returned with
176the I_NEW flag set and will still be locked. read_inode has not been
177called so the file system still has to finalize the initialization. Once
178the inode is initialized it must be unlocked by calling unlock_new_inode().
179
180The filesystem is responsible for setting (and possibly testing) i_ino
181when appropriate. There is also a simpler iget_locked function that
182just takes the superblock and inode number as arguments and does the
183test and set for you.
184
185e.g.
186 inode = iget_locked(sb, ino);
187 if (inode->i_state & I_NEW) {
188 read_inode_from_disk(inode);
189 unlock_new_inode(inode);
190 }
191
192---
193[recommended]
194
195->getattr() finally getting used. See instances in nfs, minix, etc.
196
197---
198[mandatory]
199
200->revalidate() is gone. If your filesystem had it - provide ->getattr()
201and let it call whatever you had as ->revalidate() + (for symlinks that
202had ->revalidate()) add calls in ->follow_link()/->readlink().
203
204---
205[mandatory]
206
207->d_parent changes are not protected by BKL anymore. Read access is safe
208if at least one of the following is true:
209 * filesystem has no cross-directory rename()
210 * dcache_lock is held
211 * we know that parent had been locked (e.g. we are looking at
212->d_parent of ->lookup() argument).
213 * we are called from ->rename().
214 * the child's ->d_lock is held
215Audit your code and add locking if needed. Notice that any place that is
216not protected by the conditions above is risky even in the old tree - you
217had been relying on BKL and that's prone to screwups. Old tree had quite
218a few holes of that kind - unprotected access to ->d_parent leading to
219anything from oops to silent memory corruption.
220
221---
222[mandatory]
223
224 FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags
225(see rootfs for one kind of solution and bdev/socket/pipe for another).
226
227---
228[recommended]
229
230 Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter
231is still alive, but only because of the mess in drivers/s390/block/dasd.c.
232As soon as it gets fixed is_read_only() will die.
233
234---
235[mandatory]
236
237->permission() is called without BKL now. Grab it on entry, drop upon
238return - that will guarantee the same locking you used to have. If
239your method or its parts do not need BKL - better yet, now you can
240shift lock_kernel() and unlock_kernel() so that they would protect
241exactly what needs to be protected.
242
243---
244[mandatory]
245
246->statfs() is now called without BKL held. BKL should have been
247shifted into individual fs sb_op functions where it's not clear that
248it's safe to remove it. If you don't need it, remove it.
249
250---
251[mandatory]
252
253 is_read_only() is gone; use bdev_read_only() instead.
254
255---
256[mandatory]
257
258 destroy_buffers() is gone; use invalidate_bdev().
259
260---
261[mandatory]
262
263 fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is
264deliberate; as soon as struct block_device * is propagated in a reasonable
265way by that code fixing will become trivial; until then nothing can be
266done.
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
new file mode 100644
index 000000000000..cbe85c17176b
--- /dev/null
+++ b/Documentation/filesystems/proc.txt
@@ -0,0 +1,1940 @@
1------------------------------------------------------------------------------
2 T H E /proc F I L E S Y S T E M
3------------------------------------------------------------------------------
4/proc/sys Terrehon Bowden <terrehon@pacbell.net> October 7 1999
5 Bodo Bauer <bb@ricochet.net>
6
72.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000
8------------------------------------------------------------------------------
9Version 1.3 Kernel version 2.2.12
10 Kernel version 2.4.0-test11-pre4
11------------------------------------------------------------------------------
12
13Table of Contents
14-----------------
15
16 0 Preface
17 0.1 Introduction/Credits
18 0.2 Legal Stuff
19
20 1 Collecting System Information
21 1.1 Process-Specific Subdirectories
22 1.2 Kernel data
23 1.3 IDE devices in /proc/ide
24 1.4 Networking info in /proc/net
25 1.5 SCSI info
26 1.6 Parallel port info in /proc/parport
27 1.7 TTY info in /proc/tty
28 1.8 Miscellaneous kernel statistics in /proc/stat
29
30 2 Modifying System Parameters
31 2.1 /proc/sys/fs - File system data
32 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
33 2.3 /proc/sys/kernel - general kernel parameters
34 2.4 /proc/sys/vm - The virtual memory subsystem
35 2.5 /proc/sys/dev - Device specific parameters
36 2.6 /proc/sys/sunrpc - Remote procedure calls
37 2.7 /proc/sys/net - Networking stuff
38 2.8 /proc/sys/net/ipv4 - IPV4 settings
39 2.9 Appletalk
40 2.10 IPX
41 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
42
43------------------------------------------------------------------------------
44Preface
45------------------------------------------------------------------------------
46
470.1 Introduction/Credits
48------------------------
49
50This documentation is part of a soon (or so we hope) to be released book on
51the SuSE Linux distribution. As there is no complete documentation for the
52/proc file system and we've used many freely available sources to write these
53chapters, it seems only fair to give the work back to the Linux community.
54This work is based on the 2.2.* kernel version and the upcoming 2.4.*. I'm
55afraid it's still far from complete, but we hope it will be useful. As far as
56we know, it is the first 'all-in-one' document about the /proc file system. It
57is focused on the Intel x86 hardware, so if you are looking for PPC, ARM,
58SPARC, AXP, etc., features, you probably won't find what you are looking for.
59It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. But
60additions and patches are welcome and will be added to this document if you
61mail them to Bodo.
62
63We'd like to thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of
64other people for help compiling this documentation. We'd also like to extend a
65special thank you to Andi Kleen for documentation, which we relied on heavily
66to create this document, as well as the additional information he provided.
67Thanks to everybody else who contributed source or docs to the Linux kernel
68and helped create a great piece of software... :)
69
70If you have any comments, corrections or additions, please don't hesitate to
71contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this
72document.
73
74The latest version of this document is available online at
75http://skaro.nightcrawler.com/~bb/Docs/Proc as HTML version.
76
77If the above address does not work for you, you could try the kernel
78mailing list at linux-kernel@vger.kernel.org and/or try to reach me at
79comandante@zaralinux.com.
80
810.2 Legal Stuff
82---------------
83
84We don't guarantee the correctness of this document, and if you come to us
85complaining about how you screwed up your system because of incorrect
86documentation, we won't feel responsible...
87
88------------------------------------------------------------------------------
89CHAPTER 1: COLLECTING SYSTEM INFORMATION
90------------------------------------------------------------------------------
91
92------------------------------------------------------------------------------
93In This Chapter
94------------------------------------------------------------------------------
95* Investigating the properties of the pseudo file system /proc and its
96 ability to provide information on the running Linux system
97* Examining /proc's structure
98* Uncovering various information about the kernel and the processes running
99 on the system
100------------------------------------------------------------------------------
101
102
103The proc file system acts as an interface to internal data structures in the
104kernel. It can be used to obtain information about the system and to change
105certain kernel parameters at runtime (sysctl).
106
107First, we'll take a look at the read-only parts of /proc. In Chapter 2, we
108show you how you can use /proc/sys to change settings.
109
1101.1 Process-Specific Subdirectories
111-----------------------------------
112
113The directory /proc contains (among other things) one subdirectory for each
114process running on the system, which is named after the process ID (PID).
115
116The link self points to the process reading the file system. Each process
117subdirectory has the entries listed in Table 1-1.
118
119
120Table 1-1: Process specific entries in /proc
121..............................................................................
122 File Content
123 cmdline Command line arguments
124 cpu Current and last cpu in which it was executed (2.4)(smp)
125 cwd Link to the current working directory
126 environ Values of environment variables
127 exe Link to the executable of this process
128 fd Directory, which contains all file descriptors
129 maps Memory maps to executables and library files (2.4)
130 mem Memory held by this process
131 root Link to the root directory of this process
132 stat Process status
133 statm Process memory status information
134 status Process status in human readable form
135 wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan
136..............................................................................
137
138For example, to get the status information of a process, all you have to do is
139read the file /proc/PID/status:
140
141 >cat /proc/self/status
142 Name: cat
143 State: R (running)
144 Pid: 5452
145 PPid: 743
146 TracerPid: 0 (2.4)
147 Uid: 501 501 501 501
148 Gid: 100 100 100 100
149 Groups: 100 14 16
150 VmSize: 1112 kB
151 VmLck: 0 kB
152 VmRSS: 348 kB
153 VmData: 24 kB
154 VmStk: 12 kB
155 VmExe: 8 kB
156 VmLib: 1044 kB
157 SigPnd: 0000000000000000
158 SigBlk: 0000000000000000
159 SigIgn: 0000000000000000
160 SigCgt: 0000000000000000
161 CapInh: 00000000fffffeff
162 CapPrm: 0000000000000000
163 CapEff: 0000000000000000
164
165
166This shows you nearly the same information you would get if you viewed it with
167the ps command. In fact, ps uses the proc file system to obtain its
168information. The statm file contains more detailed information about the
169process memory usage. Its seven fields are explained in Table 1-2.
170
171
172Table 1-2: Contents of the statm files (as of 2.6.8-rc3)
173..............................................................................
174 Field Content
175 size total program size (pages) (same as VmSize in status)
176 resident size of memory portions (pages) (same as VmRSS in status)
177 shared number of pages that are shared (i.e. backed by a file)
178 trs number of pages that are 'code' (not including libs; broken,
179 includes data segment)
180 lrs number of pages of library (always 0 on 2.6)
181 drs number of pages of data/stack (including libs; broken,
182 includes library text)
183 dt number of dirty pages (always 0 on 2.6)
184..............................................................................
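
The fields of Table 1-2 can be read programmatically. The following is a
minimal sketch in Python, not part of the kernel: the function name
parse_statm, the 4 kB page size and the sample line are all assumptions made
for illustration (the sample was chosen so that size and resident agree with
the VmSize and VmRSS values in the status example above).

```python
# Field names follow Table 1-2; every value in statm is a page count.
STATM_FIELDS = ("size", "resident", "shared", "trs", "lrs", "drs", "dt")


def parse_statm(text, page_size_kb=4):
    """Convert one statm line into a {field: kB} dictionary.

    page_size_kb=4 is an assumption (typical for x86); check your system.
    """
    values = [int(v) for v in text.split()]
    if len(values) != len(STATM_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(STATM_FIELDS), len(values)))
    return {name: pages * page_size_kb
            for name, pages in zip(STATM_FIELDS, values)}


# Hypothetical sample line; on a live system read /proc/self/statm instead.
sample = "278 87 72 2 0 33 0"
info = parse_statm(sample)
print(info["size"], "kB total,", info["resident"], "kB resident")
```

On a running system the same function could be fed the contents of
/proc/PID/statm, with the page size taken from the system rather than assumed.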
185
1861.2 Kernel data
187---------------
188
189Similar to the process entries, the kernel data files give information about
190the running kernel. The files used to obtain this information are contained in
191/proc and are listed in Table 1-3. Not all of these will be present in your
192system. It depends on the kernel configuration and the loaded modules, which
193files are there, and which are missing.
194
195Table 1-3: Kernel info in /proc
196..............................................................................
197 File Content
198 apm Advanced power management info
199 buddyinfo Kernel memory allocator information (see text) (2.5)
200 bus Directory containing bus specific information
201 cmdline Kernel command line
202 cpuinfo Info about the CPU
203 devices Available devices (block and character)
204 dma Used DMA channels
205 filesystems Supported filesystems
206 driver Various drivers grouped here, currently rtc (2.4)
207 execdomains Execdomains, related to security (2.4)
208 fb Frame Buffer devices (2.4)
209 fs File system parameters, currently nfs/exports (2.4)
210 ide Directory containing info about the IDE subsystem
211 interrupts Interrupt usage
212 iomem Memory map (2.4)
213 ioports I/O port usage
214 irq Masks for irq to cpu affinity (2.4)(smp?)
215 isapnp ISA PnP (Plug&Play) Info (2.4)
216 kcore Kernel core image (can be ELF or A.OUT(deprecated in 2.4))
217 kmsg Kernel messages
218 ksyms Kernel symbol table
219 loadavg Load average of last 1, 5 & 15 minutes
220 locks Kernel locks
221 meminfo Memory info
222 misc Miscellaneous
223 modules List of loaded modules
224 mounts Mounted filesystems
225 net Networking info (see text)
226 partitions Table of partitions known to the system
227 pci Deprecated info of PCI bus (new way -> /proc/bus/pci/,
228 decoupled by lspci) (2.4)
229 rtc Real time clock
230 scsi SCSI info (see text)
231 slabinfo Slab pool info
232 stat Overall statistics
233 swaps Swap space utilization
234 sys See chapter 2
235 sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4)
236 tty Info of tty drivers
237 uptime System uptime
238 version Kernel version
239 video bttv info of video resources (2.4)
240..............................................................................
241
242You can, for example, check which interrupts are currently in use and what
243they are used for by looking in the file /proc/interrupts:
244
245 > cat /proc/interrupts
246 CPU0
247 0: 8728810 XT-PIC timer
248 1: 895 XT-PIC keyboard
249 2: 0 XT-PIC cascade
250 3: 531695 XT-PIC aha152x
251 4: 2014133 XT-PIC serial
252 5: 44401 XT-PIC pcnet_cs
253 8: 2 XT-PIC rtc
254 11: 8 XT-PIC i82365
255 12: 182918 XT-PIC PS/2 Mouse
256 13: 1 XT-PIC fpu
257 14: 1232265 XT-PIC ide0
258 15: 7 XT-PIC ide1
259 NMI: 0
260
261In 2.4.* a couple of lines were added to this file, LOC & ERR (this time it is
262the output of an SMP machine):
263
264 > cat /proc/interrupts
265
266 CPU0 CPU1
267 0: 1243498 1214548 IO-APIC-edge timer
268 1: 8949 8958 IO-APIC-edge keyboard
269 2: 0 0 XT-PIC cascade
270 5: 11286 10161 IO-APIC-edge soundblaster
271 8: 1 0 IO-APIC-edge rtc
272 9: 27422 27407 IO-APIC-edge 3c503
273 12: 113645 113873 IO-APIC-edge PS/2 Mouse
274 13: 0 0 XT-PIC fpu
275 14: 22491 24012 IO-APIC-edge ide0
276 15: 2183 2415 IO-APIC-edge ide1
277 17: 30564 30414 IO-APIC-level eth0
278 18: 177 164 IO-APIC-level bttv
279 NMI: 2457961 2457959
280 LOC: 2457882 2457881
281 ERR: 2155
282
283NMI is incremented in this case because every timer interrupt generates an NMI
284(Non Maskable Interrupt) which is used by the NMI Watchdog to detect lockups.
285
286LOC is the local interrupt counter of the internal APIC of every CPU.
287
288ERR is incremented in the case of errors in the IO-APIC bus (the bus that
289connects the CPUs in an SMP system). This means that an error has been detected;
290the IO-APIC automatically retries the transmission, so it should not be a big
291problem, but you should read the SMP-FAQ.
292
293In this context it could be interesting to note the new irq directory in 2.4.
294It can be used to set IRQ to CPU affinity. This means that you can "hook" an
295IRQ to only one CPU, or exclude a CPU from handling IRQs. The irq subdir
296contains one subdir for each IRQ, and one file: prof_cpu_mask.
297
298For example
299 > ls /proc/irq/
300 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask
301 1 11 13 15 17 19 3 5 7 9
302 > ls /proc/irq/0/
303 smp_affinity
304
305The contents of the prof_cpu_mask file and of each smp_affinity file for each IRQ
306are the same by default:
307
308 > cat /proc/irq/0/smp_affinity
309 ffffffff
310
311It's a bitmask in which you can specify which CPUs can handle the IRQ. You can
312set it by doing:
313
314 > echo 1 > /proc/irq/0/smp_affinity
315
316This means that only the first CPU will handle the IRQ, but you can also echo 5,
317which means that only the first and third CPU can handle the IRQ.
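
Because the mask is a plain hexadecimal bitmask with bit N standing for CPU N
(counting from 0), it is easy to convert in both directions. The helpers
below are a small Python sketch; the names cpus_in_mask and mask_for_cpus are
made up for this illustration and are not kernel interfaces.

```python
def cpus_in_mask(mask_hex):
    """Return the CPU numbers (0-based) whose bits are set in a hex mask."""
    mask = int(mask_hex, 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


def mask_for_cpus(cpus):
    """Build the hex string to write for a list of 0-based CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


print(cpus_in_mask("1"))      # bit 0 only: the first CPU
print(cpus_in_mask("5"))      # binary 101: bits 0 and 2
print(mask_for_cpus([0, 3]))  # first and fourth CPU
```

With 0-based numbering, a mask of 5 selects CPU0 and CPU2, and selecting the
first and fourth CPU requires a mask of 9.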
318
319The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
320between all the CPUs which are allowed to handle it. As usual the kernel has
321more info than you and does a better job than you, so the defaults are the
322best choice for almost everyone.
323
324There are three more important subdirectories in /proc: net, scsi, and sys.
325The general rule is that the contents, or even the existence of these
326directories, depend on your kernel configuration. If SCSI is not enabled, the
327directory scsi may not exist. The same is true with the net, which is there
328only when networking support is present in the running kernel.
329
330The slabinfo file gives information about memory usage at the slab level.
331Linux uses slab pools for memory management above page level in version 2.2.
332Commonly used objects have their own slab pool (such as network buffers,
333directory cache, and so on).
334
335..............................................................................
336
337> cat /proc/buddyinfo
338
339Node 0, zone DMA 0 4 5 4 4 3 ...
340Node 0, zone Normal 1 0 0 1 101 8 ...
341Node 0, zone HighMem 2 0 0 1 1 0 ...
342
343Memory fragmentation is a problem under some workloads, and buddyinfo is a
344useful tool for helping diagnose these problems. Buddyinfo will give you a
345clue as to how big an area you can safely allocate, or why a previous
346allocation failed.
347
348Each column represents the number of free chunks of a certain order which are
349available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in
350ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
351available in ZONE_NORMAL, etc...
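
The column arithmetic described above can be sketched in Python. This is an
illustration only: parse_buddyinfo_line and free_pages are hypothetical helper
names, and the sample line uses just the six columns visible in the truncated
output shown earlier.

```python
def parse_buddyinfo_line(line):
    """Split one buddyinfo line into (node, zone, per-order free counts)."""
    head, rest = line.split("zone", 1)
    fields = rest.split()
    node = int(head.replace("Node", "").replace(",", "").strip())
    zone = fields[0]
    counts = [int(n) for n in fields[1:]]
    return node, zone, counts


def free_pages(counts):
    """Total free pages: a chunk of order k covers 2^k pages."""
    return sum(count << order for order, count in enumerate(counts))


# Only the columns visible in the example; a real line has more.
node, zone, counts = parse_buddyinfo_line("Node 0, zone DMA 0 4 5 4 4 3")
print(node, zone, counts, free_pages(counts))
```

The highest order with a non-zero count gives the largest contiguous
allocation that can currently be satisfied from that zone.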
352
353..............................................................................
354
355meminfo:
356
357Provides information about distribution and utilization of memory. This
358varies by architecture and compile options. The following is from a
35916GB PIII, which has highmem enabled. You may not have all of these fields.
360
361> cat /proc/meminfo
362
363
364MemTotal: 16344972 kB
365MemFree: 13634064 kB
366Buffers: 3656 kB
367Cached: 1195708 kB
368SwapCached: 0 kB
369Active: 891636 kB
370Inactive: 1077224 kB
371HighTotal: 15597528 kB
372HighFree: 13629632 kB
373LowTotal: 747444 kB
374LowFree: 4432 kB
375SwapTotal: 0 kB
376SwapFree: 0 kB
377Dirty: 968 kB
378Writeback: 0 kB
379Mapped: 280372 kB
380Slab: 684068 kB
381CommitLimit: 7669796 kB
382Committed_AS: 100056 kB
383PageTables: 24448 kB
384VmallocTotal: 112216 kB
385VmallocUsed: 428 kB
386VmallocChunk: 111088 kB
387
388 MemTotal: Total usable ram (i.e. physical ram minus a few reserved
389 bits and the kernel binary code)
390 MemFree: The sum of LowFree+HighFree
391 Buffers: Relatively temporary storage for raw disk blocks;
392 shouldn't get tremendously large (20MB or so)
393 Cached: in-memory cache for files read from the disk (the
394 pagecache). Doesn't include SwapCached
395 SwapCached: Memory that once was swapped out, is swapped back in but
396 still also is in the swapfile (if memory is needed it
397 doesn't need to be swapped out AGAIN because it is already
398 in the swapfile. This saves I/O)
399 Active: Memory that has been used more recently and usually not
400 reclaimed unless absolutely necessary.
401 Inactive: Memory which has been less recently used. It is more
402 eligible to be reclaimed for other purposes
403 HighTotal:
404 HighFree: Highmem is all memory above ~860MB of physical memory.
405 Highmem areas are for use by userspace programs, or
406 for the pagecache. The kernel must use tricks to access
407 this memory, making it slower to access than lowmem.
408 LowTotal:
409 LowFree: Lowmem is memory which can be used for everything that
410 highmem can be used for, but it is also available for the
411 kernel's use for its own data structures. Among many
412 other things, it is where everything from the Slab is
413 allocated. Bad things happen when you're out of lowmem.
414 SwapTotal: total amount of swap space available
415 SwapFree: amount of swap space which is currently unused (not the
416 amount of memory that has been evicted from RAM to disk)
417 Dirty: Memory which is waiting to get written back to the disk
418 Writeback: Memory which is actively being written back to the disk
419 Mapped: files which have been mmaped, such as libraries
420 Slab: in-kernel data structures cache
421 CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'),
422 this is the total amount of memory currently available to
423 be allocated on the system. This limit is only adhered to
424 if strict overcommit accounting is enabled (mode 2 in
425 'vm.overcommit_memory').
426 The CommitLimit is calculated with the following formula:
427 CommitLimit = ('vm.overcommit_ratio' / 100 * Physical RAM) + Swap
428 For example, on a system with 1G of physical RAM and 7G
429 of swap with a `vm.overcommit_ratio` of 30 it would
430 yield a CommitLimit of 7.3G.
431 For more details, see the memory overcommit documentation
432 in vm/overcommit-accounting.
433Committed_AS: The amount of memory presently allocated on the system.
434 The committed memory is a sum of all of the memory which
435 has been allocated by processes, even if it has not been
436 "used" by them as of yet. A process which malloc()'s 1G
437 of memory, but only touches 300M of it will only show up
438 as using 300M of memory even if it has the address space
439 allocated for the entire 1G. This 1G is memory which has
440 been "committed" to by the VM and can be used at any time
441 by the allocating application. With strict overcommit
442 enabled on the system (mode 2 in 'vm.overcommit_memory'),
443 allocations which would exceed the CommitLimit (detailed
444 above) will not be permitted. This is useful if one needs
445 to guarantee that processes will not fail due to lack of
446 memory once that memory has been successfully allocated.
447 PageTables: amount of memory dedicated to the lowest level of page
448 tables.
449VmallocTotal: total size of vmalloc memory area
450 VmallocUsed: amount of vmalloc area which is used
451VmallocChunk: largest contiguous block of vmalloc area which is free
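
The CommitLimit arithmetic from the list above (overcommit ratio read as a
percentage of physical RAM, plus swap) can be checked with a few lines of
Python. This is only a sketch of the formula as stated in this document;
commit_limit_kb is a hypothetical helper, and the kernel's own calculation may
take further factors into account.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio):
    """CommitLimit in kB; overcommit_ratio is a percentage (e.g. 30)."""
    return ram_kb * overcommit_ratio // 100 + swap_kb


GB_IN_KB = 1024 * 1024

# The worked example from the text: 1G of RAM, 7G of swap, ratio of 30.
limit = commit_limit_kb(1 * GB_IN_KB, 7 * GB_IN_KB, 30)
print("CommitLimit: %.1fG" % (limit / GB_IN_KB))
```

Keep in mind that the limit is only enforced when 'vm.overcommit_memory' is
set to mode 2.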
452
453
4541.3 IDE devices in /proc/ide
455----------------------------
456
457The subdirectory /proc/ide contains information about all IDE devices of which
458the kernel is aware. There is one subdirectory for each IDE controller, the
459file drivers and a link for each IDE device, pointing to the device directory
460in the controller specific subtree.
461
462The file drivers contains general information about the drivers used for the
463IDE devices:
464
465 > cat /proc/ide/drivers
466 ide-cdrom version 4.53
467 ide-disk version 1.08
468
469More detailed information can be found in the controller specific
470subdirectories. These are named ide0, ide1 and so on. Each of these
471directories contains the files shown in table 1-4.
472
473
474Table 1-4: IDE controller info in /proc/ide/ide?
475..............................................................................
476 File Content
477 channel IDE channel (0 or 1)
478 config Configuration (only for PCI/IDE bridge)
479 mate Mate name
480 model Type/Chipset of IDE controller
481..............................................................................
482
483Each device connected to a controller has a separate subdirectory in the
484controllers directory. The files listed in table 1-5 are contained in these
485directories.
486
487
488Table 1-5: IDE device information
489..............................................................................
490 File Content
491 cache The cache
492 capacity Capacity of the medium (in 512-byte blocks)
493 driver driver and version
494 geometry physical and logical geometry
495 identify device identify block
496 media media type
497 model device identifier
498 settings device setup
499 smart_thresholds IDE disk management thresholds
500 smart_values IDE disk management values
501..............................................................................
502
503The most interesting file is settings. This file contains a nice overview of
504the drive parameters:
505
506 # cat /proc/ide/ide0/hda/settings
507 name value min max mode
508 ---- ----- --- --- ----
509 bios_cyl 526 0 65535 rw
510 bios_head 255 0 255 rw
511 bios_sect 63 0 63 rw
512 breada_readahead 4 0 127 rw
513 bswap 0 0 1 r
514 file_readahead 72 0 2097151 rw
515 io_32bit 0 0 3 rw
516 keepsettings 0 0 1 rw
517 max_kb_per_request 122 1 127 rw
518 multcount 0 0 8 rw
519 nice1 1 0 1 rw
520 nowerr 0 0 1 rw
521 pio_mode write-only 0 255 w
522 slow 0 0 1 rw
523 unmaskirq 0 0 1 rw
524 using_dma 0 0 1 rw
525
526
5271.4 Networking info in /proc/net
528--------------------------------
529
530The subdirectory /proc/net follows the usual pattern. Table 1-6 shows the
531additional values you get for IP version 6 if you configure the kernel to
532support this. Table 1-7 lists the files and their meaning.
533
534
535Table 1-6: IPv6 info in /proc/net
536..............................................................................
537 File Content
538 udp6 UDP sockets (IPv6)
539 tcp6 TCP sockets (IPv6)
540 raw6 Raw device statistics (IPv6)
541 igmp6 IP multicast addresses, which this host joined (IPv6)
542 if_inet6 List of IPv6 interface addresses
543 ipv6_route Kernel routing table for IPv6
544 rt6_stats Global IPv6 routing tables statistics
545 sockstat6 Socket statistics (IPv6)
546 snmp6 Snmp data (IPv6)
547..............................................................................
548
549
550Table 1-7: Network info in /proc/net
551..............................................................................
552 File Content
553 arp Kernel ARP table
554 dev network devices with statistics
555 dev_mcast the Layer2 multicast groups a device is listening to
556 (interface index, label, number of references, number of bound
557 addresses).
558 dev_stat network device status
559 ip_fwchains Firewall chain linkage
560 ip_fwnames Firewall chain names
561 ip_masq Directory containing the masquerading tables
562 ip_masquerade Major masquerading table
563 netstat Network statistics
564 raw raw device statistics
565 route Kernel routing table
566 rpc Directory containing rpc info
567 rt_cache Routing cache
568 snmp SNMP data
569 sockstat Socket statistics
570 tcp TCP sockets
571 tr_rif Token ring RIF routing table
572 udp UDP sockets
573 unix UNIX domain sockets
574 wireless Wireless interface data (Wavelan etc)
575 igmp IP multicast addresses, which this host joined
576 psched Global packet scheduler parameters.
577 netlink List of PF_NETLINK sockets
578 ip_mr_vifs List of multicast virtual interfaces
579 ip_mr_cache List of multicast routing cache
580..............................................................................
581
582You can use this information to see which network devices are available in
583your system and how much traffic was routed over those devices:
584
585 > cat /proc/net/dev
586 Inter-|Receive |[...
587 face |bytes packets errs drop fifo frame compressed multicast|[...
588 lo: 908188 5596 0 0 0 0 0 0 [...
589 ppp0:15475140 20721 410 0 0 410 0 0 [...
590 eth0: 614530 7085 0 0 0 0 0 1 [...
591
592 ...] Transmit
593 ...] bytes packets errs drop fifo colls carrier compressed
594 ...] 908188 5596 0 0 0 0 0 0
595 ...] 1375103 17405 0 0 0 0 0 0
596 ...] 1703981 5535 0 0 0 3 0 0
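
The listing above is wrapped for display. In the file itself each interface
occupies a single line: the name, a colon, then the eight receive counters
followed by the eight transmit counters. A minimal Python sketch for splitting
such a line follows. parse_dev_line is a hypothetical name, the column names
are taken from the header shown above, and the sample line is the eth0 row
with its two halves joined.

```python
RX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed")


def parse_dev_line(line):
    """Return (interface, receive dict, transmit dict) for one data line."""
    name, _, rest = line.partition(":")
    numbers = [int(n) for n in rest.split()]
    if len(numbers) != len(RX_FIELDS) + len(TX_FIELDS):
        raise ValueError("unexpected number of counters: %d" % len(numbers))
    rx = dict(zip(RX_FIELDS, numbers[:len(RX_FIELDS)]))
    tx = dict(zip(TX_FIELDS, numbers[len(RX_FIELDS):]))
    return name.strip(), rx, tx


# eth0 row from the example, receive and transmit halves joined.
line = "  eth0:  614530  7085 0 0 0 0 0 1 1703981 5535 0 0 0 3 0 0"
iface, rx, tx = parse_dev_line(line)
print(iface, rx["bytes"], tx["bytes"], tx["colls"])
```

Splitting on the colon rather than on whitespace matters because a large
counter can follow the colon with no space in between, as in the ppp0 row.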
597
598In addition, each Channel Bond interface has its own directory. For
599example, the bond0 device will have a directory called /proc/net/bond0/.
600It will contain information that is specific to that bond, such as the
601current slaves of the bond, the link status of the slaves, and how
602many times each slave's link has failed.
603
6041.5 SCSI info
605-------------
606
607If you have a SCSI host adapter in your system, you'll find a subdirectory
608named after the driver for this adapter in /proc/scsi. You'll also see a list
609of all recognized SCSI devices in /proc/scsi/scsi:
610
611 >cat /proc/scsi/scsi
612 Attached devices:
613 Host: scsi0 Channel: 00 Id: 00 Lun: 00
614 Vendor: IBM Model: DGHS09U Rev: 03E0
615 Type: Direct-Access ANSI SCSI revision: 03
616 Host: scsi0 Channel: 00 Id: 06 Lun: 00
617 Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04
618 Type: CD-ROM ANSI SCSI revision: 02
619
620
621The directory named after the driver has one file for each adapter found in
622the system. These files contain information about the controller, including
623the IRQ used and the IO address range. The amount of information shown is
624dependent on the adapter you use. The example shows the output for an Adaptec
625AHA-2940 SCSI adapter:
626
627 > cat /proc/scsi/aic7xxx/0
628
629 Adaptec AIC7xxx driver version: 5.1.19/3.2.4
630 Compile Options:
631 TCQ Enabled By Default : Disabled
632 AIC7XXX_PROC_STATS : Disabled
633 AIC7XXX_RESET_DELAY : 5
634 Adapter Configuration:
635 SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter
636 Ultra Wide Controller
637 PCI MMAPed I/O Base: 0xeb001000
638 Adapter SEEPROM Config: SEEPROM found and used.
639 Adaptec SCSI BIOS: Enabled
640 IRQ: 10
641 SCBs: Active 0, Max Active 2,
642 Allocated 15, HW 16, Page 255
643 Interrupts: 160328
644 BIOS Control Word: 0x18b6
645 Adapter Control Word: 0x005b
646 Extended Translation: Enabled
647 Disconnect Enable Flags: 0xffff
648 Ultra Enable Flags: 0x0001
649 Tag Queue Enable Flags: 0x0000
650 Ordered Queue Tag Flags: 0x0000
651 Default Tag Queue Depth: 8
652 Tagged Queue By Device array for aic7xxx host instance 0:
653 {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255}
654 Actual queue depth per device for aic7xxx host instance 0:
655 {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
656 Statistics:
657 (scsi0:0:0:0)
658 Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8
659 Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0)
660 Total transfers 160151 (74577 reads and 85574 writes)
661 (scsi0:0:6:0)
662 Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15
663 Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0)
664 Total transfers 0 (0 reads and 0 writes)
665
666
6671.6 Parallel port info in /proc/parport
668---------------------------------------
669
670The directory /proc/parport contains information about the parallel ports of
671your system. It has one subdirectory for each port, named after the port
672number (0,1,2,...).
673
674These directories contain the four files shown in Table 1-8.
675
676
Table 1-8: Files in /proc/parport
..............................................................................
 File      Content
 autoprobe Any IEEE-1284 device ID information that has been acquired.
 devices   List of the device drivers using that port. A + will appear by the
           name of the device currently using the port (it might not appear
           against any).
 hardware  Parallel port's base address, IRQ line and DMA channel.
 irq       IRQ that parport is using for that port. This is in a separate
           file to allow you to alter it by writing a new value in (IRQ
           number or none).
..............................................................................
689
6901.7 TTY info in /proc/tty
691-------------------------
692
Information about the available and actually used ttys can be found in the
directory /proc/tty. You'll find entries for drivers and line disciplines in
this directory, as shown in Table 1-9.
696
697
698Table 1-9: Files in /proc/tty
699..............................................................................
700 File Content
701 drivers list of drivers and their usage
702 ldiscs registered line disciplines
703 driver/serial usage statistic and status of single tty lines
704..............................................................................
705
706To see which tty's are currently in use, you can simply look into the file
707/proc/tty/drivers:
708
709 > cat /proc/tty/drivers
710 pty_slave /dev/pts 136 0-255 pty:slave
711 pty_master /dev/ptm 128 0-255 pty:master
712 pty_slave /dev/ttyp 3 0-255 pty:slave
713 pty_master /dev/pty 2 0-255 pty:master
714 serial /dev/cua 5 64-67 serial:callout
715 serial /dev/ttyS 4 64-67 serial
716 /dev/tty0 /dev/tty0 4 0 system:vtmaster
717 /dev/ptmx /dev/ptmx 5 2 system
718 /dev/console /dev/console 5 1 system:console
719 /dev/tty /dev/tty 5 0 system:/dev/tty
720 unknown /dev/tty 4 1-63 console
721
722
7231.8 Miscellaneous kernel statistics in /proc/stat
724-------------------------------------------------
725
726Various pieces of information about kernel activity are available in the
727/proc/stat file. All of the numbers reported in this file are aggregates
728since the system first booted. For a quick look, simply cat the file:
729
730 > cat /proc/stat
731 cpu 2255 34 2290 22625563 6290 127 456
732 cpu0 1132 34 1441 11311718 3675 127 438
733 cpu1 1123 0 849 11313845 2614 0 18
734 intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...]
735 ctxt 1990473
736 btime 1062191376
737 processes 2915
738 procs_running 1
739 procs_blocked 0
740
741The very first "cpu" line aggregates the numbers in all of the other "cpuN"
742lines. These numbers identify the amount of time the CPU has spent performing
743different kinds of work. Time units are in USER_HZ (typically hundredths of a
744second). The meanings of the columns are as follows, from left to right:
745
746- user: normal processes executing in user mode
747- nice: niced processes executing in user mode
748- system: processes executing in kernel mode
749- idle: twiddling thumbs
750- iowait: waiting for I/O to complete
751- irq: servicing interrupts
752- softirq: servicing softirqs
753
754The "intr" line gives counts of interrupts serviced since boot time, for each
755of the possible system interrupts. The first column is the total of all
756interrupts serviced; each subsequent column is the total for that particular
757interrupt.
758
759The "ctxt" line gives the total number of context switches across all CPUs.
760
761The "btime" line gives the time at which the system booted, in seconds since
762the Unix epoch.
763
764The "processes" line gives the number of processes and threads created, which
765includes (but is not limited to) those created by calls to the fork() and
766clone() system calls.
767
768The "procs_running" line gives the number of processes currently running on
769CPUs.
770
771The "procs_blocked" line gives the number of processes currently blocked,
772waiting for I/O to complete.
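
As a hedged illustration of the column layout described above, the following
Python sketch splits the "cpu" lines into named fields. The names
parse_cpu_lines and CPU_FIELDS are made up for this example; the sample text
is copied from the listing in this section.

```python
# Field names in the order the text lists them, left to right.
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_cpu_lines(stat_text):
    """Return {"cpu": {...}, "cpu0": {...}, ...} from /proc/stat content."""
    result = {}
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0].startswith("cpu"):
            values = [int(v) for v in parts[1:1 + len(CPU_FIELDS)]]
            result[parts[0]] = dict(zip(CPU_FIELDS, values))
    return result


SAMPLE = """\
cpu 2255 34 2290 22625563 6290 127 456
cpu0 1132 34 1441 11311718 3675 127 438
cpu1 1123 0 849 11313845 2614 0 18
ctxt 1990473
"""

cpus = parse_cpu_lines(SAMPLE)
# Everything except idle and iowait is counted as "busy" here (a choice
# made for this example, not something the file itself defines).
busy = sum(v for k, v in cpus["cpu"].items() if k not in ("idle", "iowait"))
total = sum(cpus["cpu"].values())
print("busy ticks:", busy, "of", total)
```

On a live system the same function could be given the contents of /proc/stat;
dividing a counter by USER_HZ (typically 100) converts it to seconds.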
773
774
775------------------------------------------------------------------------------
776Summary
777------------------------------------------------------------------------------
778The /proc file system serves information about the running system. It not only
779allows access to process data but also allows you to request the kernel status
780by reading files in the hierarchy.
781
782The directory structure of /proc reflects the types of information and makes
783it easy, if not obvious, where to look for specific data.
784------------------------------------------------------------------------------
785
786------------------------------------------------------------------------------
787CHAPTER 2: MODIFYING SYSTEM PARAMETERS
788------------------------------------------------------------------------------
789
790------------------------------------------------------------------------------
791In This Chapter
792------------------------------------------------------------------------------
793* Modifying kernel parameters by writing into files found in /proc/sys
794* Exploring the files which modify certain parameters
795* Review of the /proc/sys file tree
796------------------------------------------------------------------------------
797
798
799A very interesting part of /proc is the directory /proc/sys. This is not only
800a source of information, it also allows you to change parameters within the
801kernel. Be very careful when attempting this. You can optimize your system,
802but you can also cause it to crash. Never alter kernel parameters on a
803production system. Set up a development machine and test to make sure that
804everything works the way you want it to. You may have no alternative but to
805reboot the machine once an error has been made.
806
807To change a value, simply echo the new value into the file. An example is
808given below in the section on the file system data. You need to be root to do
809this. You can create your own boot script to perform this every time your
810system boots.
811
812The files in /proc/sys can be used to fine tune and monitor miscellaneous and
813general things in the operation of the Linux kernel. Since some of the files
814can inadvertently disrupt your system, it is advisable to read both
815documentation and source before actually making adjustments. In any case, be
816very careful when writing to any of these files. The entries in /proc may
817change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt
818review the kernel documentation in the directory /usr/src/linux/Documentation.
819This chapter is heavily based on the documentation included in the pre 2.2
820kernels, and became part of it in version 2.2.1 of the Linux kernel.
821
8222.1 /proc/sys/fs - File system data
823-----------------------------------
824
825This subdirectory contains specific file system, file handle, inode, dentry
826and quota information.
827
828Currently, these files are in /proc/sys/fs:
829
830dentry-state
831------------
832
Status of the directory cache. Since directory entries are dynamically
allocated and deallocated, this file indicates the current status. It holds
six values, of which the last two are not used and are always zero. The others
are listed in Table 2-1.
837
838
Table 2-1: Status files of the directory cache
..............................................................................
 File       Content
 nr_dentry  Almost always zero
 nr_unused  Number of unused cache entries
 age_limit  Age in seconds after which the entry may be reclaimed, when
            memory is short
 want_pages Used internally
..............................................................................
848
849dquot-nr and dquot-max
850----------------------
851
852The file dquot-max shows the maximum number of cached disk quota entries.
853
854The file dquot-nr shows the number of allocated disk quota entries and the
855number of free disk quota entries.
856
857If the number of available cached disk quotas is very low and you have a large
858number of simultaneous system users, you might want to raise the limit.
859
860file-nr and file-max
861--------------------
862
863The kernel allocates file handles dynamically, but doesn't free them again at
864this time.
865
866The value in file-max denotes the maximum number of file handles that the
867Linux kernel will allocate. When you get a lot of error messages about running
868out of file handles, you might want to raise this limit. The default value is
86910% of RAM in kilobytes. To change it, just write the new number into the
870file:
871
872 # cat /proc/sys/fs/file-max
873 4096
874 # echo 8192 > /proc/sys/fs/file-max
875 # cat /proc/sys/fs/file-max
876 8192
877
878
879This method of revision is useful for all customizable parameters of the
880kernel - simply echo the new value to the corresponding file.
881
882Historically, the three values in file-nr denoted the number of allocated file
883handles, the number of allocated but unused file handles, and the maximum
884number of file handles. Linux 2.6 always reports 0 as the number of free file
885handles -- this is not an error, it just means that the number of allocated
886file handles exactly matches the number of used file handles.
887
Attempts to allocate more file descriptors than file-max are reported with
printk; look for "VFS: file-max limit <number> reached".
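
The three fields of file-nr can be read with a few lines of Python. This is
only a sketch: the function name parse_file_nr and the sample numbers are
invented for illustration.

```python
def parse_file_nr(text):
    """Split the content of file-nr into its three documented fields."""
    allocated, unused, maximum = (int(x) for x in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


# Made-up sample content; a real system will show different numbers.
info = parse_file_nr("1024 0 8192\n")
in_use = info["allocated"] - info["unused"]
headroom = info["max"] - info["allocated"]
print("in use:", in_use, "headroom before file-max:", headroom)
```

A small headroom is the situation in which raising file-max, as described
above, becomes worth considering.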
890
891inode-state and inode-nr
892------------------------
893
894The file inode-nr contains the first two items from inode-state, so we'll skip
895to that file...
896
897inode-state contains two actual numbers and five dummy values. The numbers
898are nr_inodes and nr_free_inodes (in order of appearance).
899
900nr_inodes
901~~~~~~~~~
902
903Denotes the number of inodes the system has allocated. This number will
904grow and shrink dynamically.
905
nr_free_inodes
~~~~~~~~~~~~~~

Represents the number of free inodes, i.e. the number of in-use inodes is
(nr_inodes - nr_free_inodes).
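
The subtraction above can be sketched in Python as follows. The helper name
inodes_in_use and the sample line are assumptions made for this example; only
the first two of the seven values are interpreted, the rest are the dummy
values mentioned earlier.

```python
def inodes_in_use(inode_state_text):
    """Return nr_inodes - nr_free_inodes from inode-state content."""
    fields = [int(x) for x in inode_state_text.split()]
    nr_inodes, nr_free_inodes = fields[0], fields[1]
    return nr_inodes - nr_free_inodes


# Made-up sample: two real numbers followed by five dummy values.
sample = "13720 3500 0 0 0 0 0\n"
print(inodes_in_use(sample))
```

Because inode-nr holds the same first two items, the function works on the
content of either file.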
911
912super-nr and super-max
913----------------------
914
915Again, super block structures are allocated by the kernel, but not freed. The
916file super-max contains the maximum number of super block handlers, where
917super-nr shows the number of currently allocated ones.
918
919Every mounted file system needs a super block, so if you plan to mount lots of
920file systems, you may want to increase these numbers.
921
922aio-nr and aio-max-nr
923---------------------
924
925aio-nr is the running total of the number of events specified on the
926io_setup system call for all currently active aio contexts. If aio-nr
927reaches aio-max-nr then io_setup will fail with EAGAIN. Note that
928raising aio-max-nr does not result in the pre-allocation or re-sizing
929of any kernel data structures.
930
9312.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
932-----------------------------------------------------------
933
934Besides these files, there is the subdirectory /proc/sys/fs/binfmt_misc. This
935handles the kernel support for miscellaneous binary formats.
936
937Binfmt_misc provides the ability to register additional binary formats to the
938Kernel without compiling an additional module/kernel. Therefore, binfmt_misc
939needs to know magic numbers at the beginning or the filename extension of the
940binary.
941
942It works by maintaining a linked list of structs that contain a description of
943a binary format, including a magic with size (or the filename extension),
944offset and mask, and the interpreter name. On request it invokes the given
945interpreter with the original program as argument, as binfmt_java and
946binfmt_em86 and binfmt_mz do. Since binfmt_misc does not define any default
947binary-formats, you have to register an additional binary-format.
948
949There are two general files in binfmt_misc and one file per registered format.
950The two general files are register and status.
951
952Registering a new binary format
953-------------------------------
954
955To register a new binary format you have to issue the command
956
957 echo :name:type:offset:magic:mask:interpreter: > /proc/sys/fs/binfmt_misc/register
958
959
960
with appropriate name (the name for the /proc-dir entry), offset (defaults to
0, if omitted), magic, mask (which can be omitted, defaults to all 0xff) and
last but not least, the interpreter that is to be invoked (for example,
/bin/echo for testing). Type can be M for usual magic matching or E for
filename extension matching (give extension in place of magic).
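
To make the field order concrete, here is a small Python sketch that assembles
a registration string in the :name:type:offset:magic:mask:interpreter: format.
The helper name binfmt_entry is hypothetical and the sketch only builds the
string; it does not write to the register file.

```python
def binfmt_entry(name, type_, magic, interpreter, offset="", mask=""):
    """Build a string suitable for writing to binfmt_misc/register."""
    if type_ not in ("M", "E"):
        raise ValueError("type must be M (magic) or E (extension)")
    fields = (name, type_, str(offset), magic, mask, interpreter)
    for field in fields:
        # ':' is the separator in this format, so this sketch rejects it.
        if ":" in field:
            raise ValueError("fields must not contain ':'")
    return ":" + ":".join(fields) + ":"


java = binfmt_entry("Java", "M", r"\xca\xfe\xba\xbe",
                    "/usr/local/java/bin/javawrapper")
html = binfmt_entry("HTML", "E", "html",
                    "/usr/local/java/bin/appletviewer")
print(java)
print(html)
```

Leaving offset and mask empty relies on the defaults described above (offset
0, mask all 0xff).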
966
967Check or reset the status of the binary format handler
968------------------------------------------------------
969
970If you do a cat on the file /proc/sys/fs/binfmt_misc/status, you will get the
971current status (enabled/disabled) of binfmt_misc. Change the status by echoing
9720 (disables) or 1 (enables) or -1 (caution: this clears all previously
973registered binary formats) to status. For example echo 0 > status to disable
974binfmt_misc (temporarily).
975
976Status of a single handler
977--------------------------
978
Each registered handler has an entry in /proc/sys/fs/binfmt_misc. These files
perform the same function as status, but their scope is limited to the actual
binary format. If you cat such a file, you also receive all related
information about the interpreter/magic of the binfmt.
983
984Example usage of binfmt_misc (emulate binfmt_java)
985--------------------------------------------------
986
987 cd /proc/sys/fs/binfmt_misc
988 echo ':Java:M::\xca\xfe\xba\xbe::/usr/local/java/bin/javawrapper:' > register
989 echo ':HTML:E::html::/usr/local/java/bin/appletviewer:' > register
990 echo ':Applet:M::<!--applet::/usr/local/java/bin/appletviewer:' > register
991 echo ':DEXE:M::\x0eDEX::/usr/bin/dosexec:' > register
992
993
994These four lines add support for Java executables and Java applets (like
995binfmt_java, additionally recognizing the .html extension with no need to put
996<!--applet> to every applet file). You have to install the JDK and the
997shell-script /usr/local/java/bin/javawrapper too. It works around the
998brokenness of the Java filename handling. To add a Java binary, just create a
999link to the class-file somewhere in the path.
1000
10012.3 /proc/sys/kernel - general kernel parameters
1002------------------------------------------------
1003
1004This directory reflects general kernel behaviors. As I've said before, the
1005contents depend on your configuration. Here you'll find the most important
1006files, along with descriptions of what they mean and how to use them.
1007
1008acct
1009----
1010
The file contains three values: highwater, lowwater, and frequency.

It exists only when BSD-style process accounting is enabled. These values
control its behavior. If the free space on the file system where the log lives
goes below lowwater percentage, accounting suspends. If it goes above
highwater percentage, accounting resumes. Frequency determines how often the
kernel checks the amount of free space (value is in seconds). Default settings
are: 4, 2, and 30. That is, suspend accounting if there is less than 2 percent
free; resume it if we have a value of 4 or more percent; consider information
about the amount of free space valid for 30 seconds.
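
The suspend/resume rule is a simple hysteresis, which the following Python
sketch illustrates with the default highwater and lowwater values. The
function name accounting_active and the free-space readings are invented, and
the exact comparisons at the boundaries are an assumption.

```python
def accounting_active(active, free_percent, highwater=4, lowwater=2):
    """Return the new accounting state after one free-space check."""
    if active and free_percent < lowwater:
        return False          # too little space left: suspend
    if not active and free_percent >= highwater:
        return True           # enough space again: resume
    return active             # in between: keep the current state


state = True
history = []
for free in (10, 3, 1, 3, 4, 5):   # made-up free-space readings in percent
    state = accounting_active(state, free)
    history.append(state)
print(history)
```

Note that a reading of 3 percent leaves the state unchanged in both
directions; that gap between the two thresholds is what keeps accounting from
flapping on and off.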
1021
1022ctrl-alt-del
1023------------
1024
When the value in this file is 0, ctrl-alt-del is trapped and sent to the init
program to handle a graceful restart. However, when the value is greater than
zero, Linux's reaction to this key combination will be an immediate reboot,
without syncing its dirty buffers.
1029
1030[NOTE]
1031 When a program (like dosemu) has the keyboard in raw mode, the
1032 ctrl-alt-del is intercepted by the program before it ever reaches the
1033 kernel tty layer, and it is up to the program to decide what to do with
1034 it.
1035
1036domainname and hostname
1037-----------------------
1038
These files can be used to set the NIS domainname and hostname of your
box. For the classic darkstar.frop.org a simple:
1041
1042 # echo "darkstar" > /proc/sys/kernel/hostname
1043 # echo "frop.org" > /proc/sys/kernel/domainname
1044
1045
1046would suffice to set your hostname and NIS domainname.
1047
1048osrelease, ostype and version
1049-----------------------------
1050
1051The names make it pretty obvious what these fields contain:
1052
1053 > cat /proc/sys/kernel/osrelease
1054 2.2.12
1055
1056 > cat /proc/sys/kernel/ostype
1057 Linux
1058
1059 > cat /proc/sys/kernel/version
1060 #4 Fri Oct 1 12:41:14 PDT 1999
1061
1062
1063The files osrelease and ostype should be clear enough. Version needs a little
1064more clarification. The #4 means that this is the 4th kernel built from this
1065source base and the date after it indicates the time the kernel was built. The
1066only way to tune these values is to rebuild the kernel.
1067
1068panic
1069-----
1070
1071The value in this file represents the number of seconds the kernel waits
1072before rebooting on a panic. When you use the software watchdog, the
1073recommended setting is 60. If set to 0, the auto reboot after a kernel panic
1074is disabled, which is the default setting.
1075
1076printk
1077------
1078
1079The four values in printk denote
1080* console_loglevel,
1081* default_message_loglevel,
1082* minimum_console_loglevel and
1083* default_console_loglevel
1084respectively.
1085
1086These values influence printk() behavior when printing or logging error
1087messages, which come from inside the kernel. See syslog(2) for more
1088information on the different log levels.
1089
1090console_loglevel
1091----------------
1092
1093Messages with a higher priority than this will be printed to the console.
1094
default_message_loglevel
------------------------
1097
1098Messages without an explicit priority will be printed with this priority.
1099
1100minimum_console_loglevel
1101------------------------
1102
1103Minimum (highest) value to which the console_loglevel can be set.
1104
1105default_console_loglevel
1106------------------------
1107
1108Default value for console_loglevel.
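
The following Python sketch shows how the four values might be read and how
console_loglevel is applied. The function names and the sample line are made
up for this example. It relies on the convention from syslog(2) that a lower
number means a higher priority.

```python
PRINTK_FIELDS = (
    "console_loglevel",
    "default_message_loglevel",
    "minimum_console_loglevel",
    "default_console_loglevel",
)


def parse_printk(text):
    """Map the four numbers in the printk file to their names."""
    return dict(zip(PRINTK_FIELDS, (int(x) for x in text.split())))


def reaches_console(level, settings):
    """True if a message of this level is printed to the console."""
    # Higher priority means a numerically lower level.
    return level < settings["console_loglevel"]


settings = parse_printk("6 4 1 7\n")   # made-up sample content
print(reaches_console(3, settings))    # an error-level message
print(reaches_console(7, settings))    # a debug-level message
```

A message without an explicit priority would be checked with
settings["default_message_loglevel"] as its level.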
1109
1110sg-big-buff
1111-----------
1112
1113This file shows the size of the generic SCSI (sg) buffer. At this point, you
1114can't tune it yet, but you can change it at compile time by editing
1115include/scsi/sg.h and changing the value of SG_BIG_BUFF.
1116
1117If you use a scanner with SANE (Scanner Access Now Easy) you might want to set
1118this to a higher value. Refer to the SANE documentation on this issue.
1119
1120modprobe
1121--------
1122
1123The location where the modprobe binary is located. The kernel uses this
1124program to load modules on demand.
1125
1126unknown_nmi_panic
1127-----------------
1128
The value in this file affects the handling of NMIs. When the value is
non-zero, an unknown NMI is trapped and a panic occurs. At that time, kernel
debugging information is displayed on the console.

For example, the NMI switch that most IA32 servers have triggers an unknown
NMI. If a system hangs, try pressing the NMI switch.

[NOTE]
 This function and oprofile share an NMI callback. Therefore this function
 cannot be enabled when oprofile is activated.
 Also, the NMI watchdog will be disabled when the value in this file is set to
 non-zero.
1141
1142
11432.4 /proc/sys/vm - The virtual memory subsystem
1144-----------------------------------------------
1145
1146The files in this directory can be used to tune the operation of the virtual
1147memory (VM) subsystem of the Linux kernel.
1148
1149vfs_cache_pressure
1150------------------
1151
1152Controls the tendency of the kernel to reclaim the memory which is used for
1153caching of directory and inode objects.
1154
1155At the default value of vfs_cache_pressure=100 the kernel will attempt to
1156reclaim dentries and inodes at a "fair" rate with respect to pagecache and
1157swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
1158to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100
1159causes the kernel to prefer to reclaim dentries and inodes.
1160
1161dirty_background_ratio
1162----------------------
1163
1164Contains, as a percentage of total system memory, the number of pages at which
1165the pdflush background writeback daemon will start writing out dirty data.
1166
dirty_ratio
-----------
1169
1170Contains, as a percentage of total system memory, the number of pages at which
1171a process which is generating disk writes will itself start writing out dirty
1172data.
1173
1174dirty_writeback_centisecs
1175-------------------------
1176
The pdflush writeback daemons will periodically wake up and write "old" data
out to disk. This tunable expresses the interval between those wakeups, in
hundredths of a second.
1180
1181Setting this to zero disables periodic writeback altogether.
1182
1183dirty_expire_centisecs
1184----------------------
1185
This tunable is used to define when dirty data is old enough to be eligible
for writeout by the pdflush daemons. It is expressed in hundredths of a
second. Data which has been dirty in-memory for longer than this interval will
be written out next time a pdflush daemon wakes up.
1190
1191legacy_va_layout
1192----------------
1193
If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
will use the legacy (2.4) layout for all processes.
1196
1197lower_zone_protection
1198---------------------
1199
1200For some specialised workloads on highmem machines it is dangerous for
1201the kernel to allow process memory to be allocated from the "lowmem"
1202zone. This is because that memory could then be pinned via the mlock()
1203system call, or by unavailability of swapspace.
1204
1205And on large highmem machines this lack of reclaimable lowmem memory
1206can be fatal.
1207
1208So the Linux page allocator has a mechanism which prevents allocations
1209which _could_ use highmem from using too much lowmem. This means that
1210a certain amount of lowmem is defended from the possibility of being
1211captured into pinned user memory.
1212
1213(The same argument applies to the old 16 megabyte ISA DMA region. This
1214mechanism will also defend that region from allocations which could use
1215highmem or lowmem).
1216
1217The `lower_zone_protection' tunable determines how aggressive the kernel is
1218in defending these lower zones. The default value is zero - no
1219protection at all.
1220
1221If you have a machine which uses highmem or ISA DMA and your
1222applications are using mlock(), or if you are running with no swap then
1223you probably should increase the lower_zone_protection setting.
1224
The units of this tunable are fairly vague. It is approximately equal
to "megabytes". So setting lower_zone_protection=100 will protect around 100
megabytes of the lowmem zone from user allocations. It will also make
those 100 megabytes unavailable for use by applications and by
pagecache, so there is a cost.
1230
1231The effects of this tunable may be observed by monitoring
1232/proc/meminfo:LowFree. Write a single huge file and observe the point
1233at which LowFree ceases to fall.
1234
1235A reasonable value for lower_zone_protection is 100.
1236
1237page-cluster
1238------------
1239
page-cluster controls the number of pages which are written to swap in
a single attempt, i.e. the swap I/O size.
1242
1243It is a logarithmic value - setting it to zero means "1 page", setting
1244it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
1245
1246The default value is three (eight pages at a time). There may be some
1247small benefits in tuning this to a different value if your workload is
1248swap-intensive.
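
Since the value is a base-2 logarithm, the page count is a power of two. A
minimal Python sketch of the arithmetic follows; the function name is made up
and the 4096-byte page size is an assumption that does not hold on every
architecture.

```python
PAGE_SIZE = 4096  # bytes; assumed here, varies by architecture


def pages_per_swap_io(page_cluster):
    """Number of pages written to swap in one attempt."""
    return 1 << page_cluster


for value in range(4):
    pages = pages_per_swap_io(value)
    print(value, "->", pages, "pages,", pages * PAGE_SIZE, "bytes")
```

With the default of three this gives eight pages, matching the text above.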
1249
1250overcommit_memory
1251-----------------
1252
1253This file contains one value. The following algorithm is used to decide if
1254there's enough memory: if the value of overcommit_memory is positive, then
1255there's always enough memory. This is a useful feature, since programs often
1256malloc() huge amounts of memory 'just in case', while they only use a small
1257part of it. Leaving this value at 0 will lead to the failure of such a huge
1258malloc(), when in fact the system has enough memory for the program to run.
1259
1260On the other hand, enabling this feature can cause you to run out of memory
1261and thrash the system to death, so large and/or important servers will want to
1262set this value to 0.
1263
1264nr_hugepages and hugetlb_shm_group
1265----------------------------------
1266
nr_hugepages configures the number of hugetlb pages reserved for the system.

hugetlb_shm_group contains the group id that is allowed to create SysV shared
memory segments using hugetlb pages.
1271
1272laptop_mode
1273-----------
1274
1275laptop_mode is a knob that controls "laptop mode". All the things that are
1276controlled by this knob are discussed in Documentation/laptop-mode.txt.
1277
1278block_dump
1279----------
1280
1281block_dump enables block I/O debugging when set to a nonzero value. More
1282information on block I/O debugging is in Documentation/laptop-mode.txt.
1283
1284swap_token_timeout
1285------------------
1286
This file contains the valid hold time of the swap-out protection token. The
Linux VM has a token-based thrashing control mechanism and uses the token to
prevent unnecessary page faults in thrashing situations. The value is in
seconds and can be used to tune thrashing behavior.
1291
12922.5 /proc/sys/dev - Device specific parameters
1293----------------------------------------------
1294
1295Currently there is only support for CDROM drives, and for those, there is only
1296one read-only file containing information about the CD-ROM drives attached to
1297the system:
1298
1299 >cat /proc/sys/dev/cdrom/info
1300 CD-ROM information, Id: cdrom.c 2.55 1999/04/25
1301
1302 drive name: sr0 hdb
1303 drive speed: 32 40
1304 drive # of slots: 1 0
1305 Can close tray: 1 1
1306 Can open tray: 1 1
1307 Can lock tray: 1 1
1308 Can change speed: 1 1
1309 Can select disk: 0 1
1310 Can read multisession: 1 1
1311 Can read MCN: 1 1
1312 Reports media changed: 1 1
1313 Can play audio: 1 1
1314
1315
1316You see two drives, sr0 and hdb, along with a list of their features.
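
Because the file lists one column per drive, it is convenient to turn it into
one record per drive. The Python sketch below does that for an excerpt of the
output shown above; the function name parse_cdrom_info is invented for this
example.

```python
def parse_cdrom_info(text):
    """Return a list with one dict of features per drive."""
    drives = []
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        key = key.strip()
        values = rest.split()
        if not sep or not values:
            continue                      # blank or unrelated line
        if key == "drive name":
            drives = [{"name": v} for v in values]
        elif drives and len(values) == len(drives):
            for drive, value in zip(drives, values):
                drive[key] = int(value)
    return drives


SAMPLE = """\
CD-ROM information, Id: cdrom.c 2.55 1999/04/25

drive name: sr0 hdb
drive speed: 32 40
drive # of slots: 1 0
Can close tray: 1 1
Can select disk: 0 1
"""

drives = parse_cdrom_info(SAMPLE)
for drive in drives:
    print(drive["name"], "speed", drive["drive speed"])
```

Lines that appear before the "drive name" row, such as the version banner, are
ignored because no drive columns are known yet.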
1317
13182.6 /proc/sys/sunrpc - Remote procedure calls
1319---------------------------------------------
1320
This directory contains four files, which enable or disable debugging for the
RPC functions NFS, NFS-daemon, RPC and NLM. The default value of each is 0.
They can be set to one to turn debugging on.
1324
13252.7 /proc/sys/net - Networking stuff
1326------------------------------------
1327
1328The interface to the networking parts of the kernel is located in
1329/proc/sys/net. Table 2-3 shows all possible subdirectories. You may see only
1330some of them, depending on your kernel's configuration.
1331
1332
1333Table 2-3: Subdirectories in /proc/sys/net
1334..............................................................................
1335 Directory Content Directory Content
1336 core General parameter appletalk Appletalk protocol
1337 unix Unix domain sockets netrom NET/ROM
1338 802 E802 protocol ax25 AX25
1339 ethernet Ethernet protocol rose X.25 PLP layer
1340 ipv4 IP version 4 x25 X.25 protocol
1341 ipx IPX token-ring IBM token ring
1342 bridge Bridging decnet DEC net
1343 ipv6 IP version 6
1344..............................................................................
1345
We will concentrate on IP networking here. Since AX25, X.25, and DEC Net are
only minor players in the Linux world, we'll skip them in this chapter. You'll
find some short info on Appletalk and IPX further on in this chapter. Review
the online documentation and the kernel source to get a detailed view of the
parameters for those protocols. In this section we'll discuss the
subdirectories printed in bold letters in the table above. As default values
are suitable for most needs, there is no need to change these values.
1353
1354/proc/sys/net/core - Network core options
1355-----------------------------------------
1356
1357rmem_default
1358------------
1359
1360The default setting of the socket receive buffer in bytes.
1361
1362rmem_max
1363--------
1364
1365The maximum receive socket buffer size in bytes.
1366
1367wmem_default
1368------------
1369
1370The default setting (in bytes) of the socket send buffer.
1371
1372wmem_max
1373--------
1374
1375The maximum send socket buffer size in bytes.
1376
1377message_burst and message_cost
1378------------------------------
1379
These parameters are used to limit the warning messages written to the kernel
log from the networking code. They enforce a rate limit to make a
denial-of-service attack impossible. A higher message_cost factor results in
fewer messages being written. Message_burst controls when messages will be
dropped. The default settings limit warning messages to one every five
seconds.
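
The text does not spell out the algorithm behind these two parameters. One
common way to get this kind of behavior is a token bucket, and the Python
sketch below is a generic illustration under that assumption, not a copy of
the kernel code. Here cost is the number of seconds of credit one message
consumes and burst is the number of messages that may pass back to back; both
interpretations, and the class name RateLimiter, are assumptions.

```python
class RateLimiter:
    """Generic token bucket: one message per `cost` seconds on average."""

    def __init__(self, cost=5.0, burst=10):
        self.cost = cost
        self.capacity = cost * burst   # most credit that can accumulate
        self.tokens = self.capacity
        self.last = 0.0

    def allow(self, now):
        """Return True if a message at time `now` (seconds) may be logged."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed)
        self.last = now
        if self.tokens >= self.cost:
            self.tokens -= self.cost
            return True
        return False


limiter = RateLimiter(cost=5.0, burst=2)
results = [limiter.allow(t) for t in (0, 0, 0, 4, 5, 6)]
print(results)
```

In this model a higher cost means each message uses up more credit, so fewer
messages get through, while burst bounds how many can be logged before
dropping starts.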
1386
1387netdev_max_backlog
1388------------------
1389
Maximum number of packets queued on the INPUT side when the interface
receives packets faster than the kernel can process them.
1392
1393optmem_max
1394----------
1395
1396Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence
1397of struct cmsghdr structures with appended data.
1398
1399/proc/sys/net/unix - Parameters for Unix domain sockets
1400-------------------------------------------------------
1401
1402There are only two files in this subdirectory. They control the delays for
1403deleting and destroying socket descriptors.
1404
14052.8 /proc/sys/net/ipv4 - IPV4 settings
1406--------------------------------------
1407
1408IP version 4 is still the most used protocol in Unix networking. It will be
1409replaced by IP version 6 in the next couple of years, but for the moment it's
1410the de facto standard for the internet and is used in most networking
1411environments around the world. Because of the importance of this protocol,
1412we'll have a deeper look into the subtree controlling the behavior of the IPv4
1413subsystem of the Linux kernel.
1414
1415Let's start with the entries in /proc/sys/net/ipv4.
1416
1417ICMP settings
1418-------------
1419
1420icmp_echo_ignore_all and icmp_echo_ignore_broadcasts
1421----------------------------------------------------
1422
1423Turn on (1) or off (0), if the kernel should ignore all ICMP ECHO requests, or
1424just those to broadcast and multicast addresses.
1425
Please note that if you accept ICMP echo requests with a broadcast/multicast
destination address your network may be used as an exploder for denial of
service packet flooding attacks to other hosts.
1429
icmp_destunreach_rate, icmp_echoreply_rate, icmp_paramprob_rate and icmp_timeexceed_rate
----------------------------------------------------------------------------------------

Sets limits for sending ICMP packets to specific targets. A value of zero
disables all limiting. Any positive value sets the maximum packet rate in
hundredths of a second (on Intel systems).
1436
1437IP settings
1438-----------
1439
1440ip_autoconfig
1441-------------
1442
1443This file contains the number one if the host received its IP configuration by
1444RARP, BOOTP, DHCP or a similar mechanism. Otherwise it is zero.
1445
1446ip_default_ttl
1447--------------
1448
1449TTL (Time To Live) for IPv4 interfaces. This is simply the maximum number of
1450hops a packet may travel.
1451
1452ip_dynaddr
1453----------
1454
Enable dynamic socket address rewriting on interface address change. This is
useful for dialup interfaces with changing IP addresses.
1457
1458ip_forward
1459----------
1460
Enable or disable forwarding of IP packets between interfaces. Changing this
value resets all other parameters to their default values, which differ
depending on whether the kernel is configured as a host or a router.
1464
1465ip_local_port_range
1466-------------------
1467
1468Range of ports used by TCP and UDP to choose the local port. Contains two
1469numbers, the first number is the lowest port, the second number the highest
1470local port. Default is 1024-4999. Should be changed to 32768-61000 for
1471high-usage systems.
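
As a rough sketch of how a program might consume this file, the helper below
parses the two numbers from a line of text. The function name
parse_port_range is hypothetical, and the code assumes the two values are
separated by whitespace.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse the "lowest highest" pair as it might be read
 * from /proc/sys/net/ipv4/ip_local_port_range (assumed to be separated by
 * whitespace). Returns 0 on success, -1 if the text is malformed or the
 * range is not a valid port range.
 */
int parse_port_range(const char *text, int *low, int *high)
{
	int lo, hi;

	if (sscanf(text, "%d %d", &lo, &hi) != 2)
		return -1;
	if (lo < 1 || hi > 65535 || lo > hi)
		return -1;
	*low = lo;
	*high = hi;
	return 0;
}
```

Rejecting a range whose first number is larger than the second is a choice
made for this sketch only; it says nothing about what the kernel accepts.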
1472
1473ip_no_pmtu_disc
1474---------------
1475
1476Global switch to turn path MTU discovery off. It can also be set on a per
1477socket basis by the applications or on a per route basis.
1478
1479ip_masq_debug
1480-------------
1481
1482Enable/disable debugging of IP masquerading.
1483
1484IP fragmentation settings
1485-------------------------
1486
ipfrag_high_thresh and ipfrag_low_thresh
----------------------------------------
1489
1490Maximum memory used to reassemble IP fragments. When ipfrag_high_thresh bytes
1491of memory is allocated for this purpose, the fragment handler will toss
1492packets until ipfrag_low_thresh is reached.
1493
1494ipfrag_time
1495-----------
1496
1497Time in seconds to keep an IP fragment in memory.
1498
1499TCP settings
1500------------
1501
1502tcp_ecn
1503-------
1504
This file controls the use of the ECN bit in the IPv4 headers. ECN (Explicit
Congestion Notification) is a new feature, and some routers and firewalls
block traffic that has this bit set. If you want to talk to sites behind such
devices, it may be necessary to echo 0 to /proc/sys/net/ipv4/tcp_ecn. For more
information, see RFC2481.
1510
1511tcp_retrans_collapse
1512--------------------
1513
1514Bug-to-bug compatibility with some broken printers. On retransmit, try to send
1515larger packets to work around bugs in certain TCP stacks. Can be turned off by
1516setting it to zero.
1517
1518tcp_keepalive_probes
1519--------------------
1520
1521Number of keep alive probes TCP sends out, until it decides that the
1522connection is broken.
1523
1524tcp_keepalive_time
1525------------------
1526
1527How often TCP sends out keep alive messages, when keep alive is enabled. The
1528default is 2 hours.
1529
1530tcp_syn_retries
1531---------------
1532
1533Number of times initial SYNs for a TCP connection attempt will be
1534retransmitted. Should not be higher than 255. This is only the timeout for
1535outgoing connections, for incoming connections the number of retransmits is
1536defined by tcp_retries1.
1537
1538tcp_sack
1539--------
1540
Enable selective acknowledgments as defined in RFC2018.
1542
1543tcp_timestamps
1544--------------
1545
1546Enable timestamps as defined in RFC1323.
1547
1548tcp_stdurg
1549----------
1550
1551Enable the strict RFC793 interpretation of the TCP urgent pointer field. The
1552default is to use the BSD compatible interpretation of the urgent pointer
1553pointing to the first byte after the urgent data. The RFC793 interpretation is
to have it point to the last byte of urgent data. Enabling this option may
lead to interoperability problems. Disabled by default.
1556
1557tcp_syncookies
1558--------------
1559
1560Only valid when the kernel was compiled with CONFIG_SYNCOOKIES. Send out
1561syncookies when the syn backlog queue of a socket overflows. This is to ward
1562off the common 'syn flood attack'. Disabled by default.
1563
Note that the concept of a socket backlog is abandoned. This means the peer
may not receive reliable error messages from an overloaded server with
syncookies enabled.
1567
1568tcp_window_scaling
1569------------------
1570
1571Enable window scaling as defined in RFC1323.
1572
1573tcp_fin_timeout
1574---------------
1575
The length of time in seconds to wait for a final FIN before the socket is
forcibly closed. This is strictly a violation of the TCP specification, but
required to prevent denial-of-service attacks.
1579
1580tcp_max_ka_probes
1581-----------------
1582
1583Indicates how many keep alive probes are sent per slow timer run. Should not
1584be set too high to prevent bursts.
1585
1586tcp_max_syn_backlog
1587-------------------
1588
1589Length of the per socket backlog queue. Since Linux 2.2 the backlog specified
1590in listen(2) only specifies the length of the backlog queue of already
1591established sockets. When more connection requests arrive Linux starts to drop
1592packets. When syncookies are enabled the packets are still answered and the
1593maximum queue is effectively ignored.
1594
1595tcp_retries1
1596------------
1597
1598Defines how often an answer to a TCP connection request is retransmitted
1599before giving up.
1600
1601tcp_retries2
1602------------
1603
1604Defines how often a TCP packet is retransmitted before giving up.
1605
1606Interface specific settings
1607---------------------------
1608
In the directory /proc/sys/net/ipv4/conf you'll find one subdirectory for each
interface the system knows about and one directory called all. Changes in the
all subdirectory affect all interfaces, whereas changes in the other
subdirectories affect only one interface. All directories have the same
entries:
1614
1615accept_redirects
1616----------------
1617
1618This switch decides if the kernel accepts ICMP redirect messages or not. The
1619default is 'yes' if the kernel is configured for a regular host and 'no' for a
1620router configuration.
1621
1622accept_source_route
1623-------------------
1624
Whether source routed packets should be accepted or declined. The default
depends on the kernel configuration: it is 'yes' for routers and 'no' for
hosts.
1628
1629bootp_relay
-----------
1631
1632Accept packets with source address 0.b.c.d with destinations not to this host
1633as local ones. It is supposed that a BOOTP relay daemon will catch and forward
1634such packets.
1635
1636The default is 0, since this feature is not implemented yet (kernel version
16372.2.12).
1638
1639forwarding
1640----------
1641
1642Enable or disable IP forwarding on this interface.
1643
1644log_martians
1645------------
1646
1647Log packets with source addresses with no known route to kernel log.
1648
1649mc_forwarding
1650-------------
1651
1652Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE and a
1653multicast routing daemon is required.
1654
1655proxy_arp
1656---------
1657
1658Does (1) or does not (0) perform proxy ARP.
1659
1660rp_filter
1661---------
1662
This integer value determines whether source validation should be performed.
1 means yes, 0 means no. Disabled by default, but protection against
local/broadcast address spoofing is always on.
1666
1667If you set this to 1 on a router that is the only connection for a network to
1668the net, it will prevent spoofing attacks against your internal networks
1669(external addresses can still be spoofed), without the need for additional
1670firewall rules.
1671
1672secure_redirects
1673----------------
1674
Accept ICMP redirect messages only for gateways listed in the default gateway
list. Enabled by default.
1677
1678shared_media
1679------------
1680
1681If it is not set the kernel does not assume that different subnets on this
1682device can communicate directly. Default setting is 'yes'.
1683
1684send_redirects
1685--------------
1686
1687Determines whether to send ICMP redirects to other hosts.
1688
1689Routing settings
1690----------------
1691
The directory /proc/sys/net/ipv4/route contains several files to control
routing issues.
1694
1695error_burst and error_cost
1696--------------------------
1697
These parameters are used to limit how many ICMP destination unreachable
messages are sent from the host in question. ICMP destination unreachable
messages are sent when we cannot reach the next hop while trying to transmit
a packet. The kernel will also print some error messages to the kernel logs if
someone is ignoring our ICMP redirects. The higher the error_cost factor is,
the fewer destination unreachable and error messages will be let through.
error_burst controls when destination unreachable messages and error messages
will be dropped. The default settings limit warning messages to five every
second.
1706
1707flush
1708-----
1709
1710Writing to this file results in a flush of the routing cache.
1711
1712gc_elasticity, gc_interval, gc_min_interval_ms, gc_timeout, gc_thresh
1713---------------------------------------------------------------------
1714
1715Values to control the frequency and behavior of the garbage collection
1716algorithm for the routing cache. gc_min_interval is deprecated and replaced
1717by gc_min_interval_ms.
1718
1719
1720max_size
1721--------
1722
Maximum size of the routing cache. Old entries will be purged once the cache
has reached this size.
1725
1726max_delay, min_delay
1727--------------------
1728
1729Delays for flushing the routing cache.
1730
1731redirect_load, redirect_number
1732------------------------------
1733
Factors which determine if more ICMP redirects should be sent to a specific
host. No redirects will be sent once the load limit or the maximum number of
redirects has been reached.
1737
1738redirect_silence
1739----------------
1740
Timeout for redirects. After this period redirects will be sent again, even if
they were stopped because the load or number limit had been reached.
1743
1744Network Neighbor handling
1745-------------------------
1746
1747Settings about how to handle connections with direct neighbors (nodes attached
1748to the same link) can be found in the directory /proc/sys/net/ipv4/neigh.
1749
As we saw in the conf directory, there is a default subdirectory which
holds the default values, and one directory for each interface. The contents
of the directories are identical, with the single exception that the default
settings contain additional options to set garbage collection parameters.
1754
1755In the interface directories you'll find the following entries:
1756
1757base_reachable_time, base_reachable_time_ms
1758-------------------------------------------
1759
1760A base value used for computing the random reachable time value as specified
1761in RFC2461.
1762
1763Expression of base_reachable_time, which is deprecated, is in seconds.
1764Expression of base_reachable_time_ms is in milliseconds.
1765
1766retrans_time, retrans_time_ms
1767-----------------------------
1768
1769The time between retransmitted Neighbor Solicitation messages.
1770Used for address resolution and to determine if a neighbor is
1771unreachable.
1772
1773Expression of retrans_time, which is deprecated, is in 1/100 seconds (for
1774IPv4) or in jiffies (for IPv6).
1775Expression of retrans_time_ms is in milliseconds.
1776
1777unres_qlen
1778----------
1779
Maximum queue length for a pending ARP request: the number of packets which
are accepted from other layers while the ARP address is still being resolved.
1782
1783anycast_delay
1784-------------
1785
1786Maximum for random delay of answers to neighbor solicitation messages in
1787jiffies (1/100 sec). Not yet implemented (Linux does not have anycast support
1788yet).
1789
1790ucast_solicit
1791-------------
1792
1793Maximum number of retries for unicast solicitation.
1794
1795mcast_solicit
1796-------------
1797
1798Maximum number of retries for multicast solicitation.
1799
1800delay_first_probe_time
1801----------------------
1802
1803Delay for the first time probe if the neighbor is reachable. (see
1804gc_stale_time)
1805
1806locktime
1807--------
1808
1809An ARP/neighbor entry is only replaced with a new one if the old is at least
1810locktime old. This prevents ARP cache thrashing.
1811
1812proxy_delay
1813-----------
1814
Maximum time (the real time is random in [0..proxy_delay]) before answering an
ARP request for which we have a proxy ARP entry. In some cases, this is used
to prevent network flooding.
1818
1819proxy_qlen
1820----------
1821
1822Maximum queue length of the delayed proxy arp timer. (see proxy_delay).
1823
app_solicit
-----------
1826
1827Determines the number of requests to send to the user level ARP daemon. Use 0
1828to turn off.
1829
1830gc_stale_time
1831-------------
1832
Determines how often to check for stale ARP entries. After an ARP entry is
stale it will be resolved again (which is useful when an IP address migrates
to another machine). When ucast_solicit is greater than 0 it first tries to
send an ARP packet directly to the known host. When that fails and
mcast_solicit is greater than 0, an ARP request is broadcast.
1838
18392.9 Appletalk
1840-------------
1841
1842The /proc/sys/net/appletalk directory holds the Appletalk configuration data
1843when Appletalk is loaded. The configurable parameters are:
1844
1845aarp-expiry-time
1846----------------
1847
1848The amount of time we keep an ARP entry before expiring it. Used to age out
1849old hosts.
1850
1851aarp-resolve-time
1852-----------------
1853
1854The amount of time we will spend trying to resolve an Appletalk address.
1855
1856aarp-retransmit-limit
1857---------------------
1858
1859The number of times we will retransmit a query before giving up.
1860
1861aarp-tick-time
1862--------------
1863
Controls the rate at which expired entries are checked.
1865
1866The directory /proc/net/appletalk holds the list of active Appletalk sockets
1867on a machine.
1868
The fields indicate the DDP type, the local address (in network:node format),
the remote address, the size of the transmit pending queue, the size of the
received queue (bytes waiting for applications to read), the state and the uid
owning the socket.

/proc/net/atalk_iface lists all the interfaces configured for Appletalk. It
shows the name of the interface, its Appletalk address, the network range on
that address (or network number for phase 1 networks), and the status of the
interface.
1878
1879/proc/net/atalk_route lists each known network route. It lists the target
1880(network) that the route leads to, the router (may be directly connected), the
1881route flags, and the device the route is using.
1882
18832.10 IPX
1884--------
1885
The IPX protocol has no tunable values in /proc/sys/net.

The IPX protocol does, however, provide /proc/net/ipx. This lists each IPX
socket giving the local and remote addresses in Novell format (that is
network:node:port). In accordance with the strange Novell tradition,
everything but the port is in hex. Not_Connected is displayed for sockets that
are not tied to a specific remote address. The Tx and Rx queue sizes indicate
the number of bytes pending for transmission and reception. The state
indicates the state the socket is in and the uid is the owning uid of the
socket.
1896
1897The /proc/net/ipx_interface file lists all IPX interfaces. For each interface
1898it gives the network number, the node number, and indicates if the network is
1899the primary network. It also indicates which device it is bound to (or
1900Internal for internal networks) and the Frame Type if appropriate. Linux
1901supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for
1902IPX.
1903
1904The /proc/net/ipx_route table holds a list of IPX routes. For each route it
1905gives the destination network, the router node (or Directly) and the network
1906address of the router (or Connected) for internal networks.
1907
19082.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
1909----------------------------------------------------------
1910
1911The "mqueue" filesystem provides the necessary kernel features to enable the
1912creation of a user space library that implements the POSIX message queues
1913API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System
1914Interfaces specification.)
1915
1916The "mqueue" filesystem contains values for determining/setting the amount of
1917resources used by the file system.
1918
1919/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the
1920maximum number of message queues allowed on the system.
1921
/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the
maximum number of messages in a queue. In fact it is the limiting value for
another (user) limit which is set in the mq_open invocation. This attribute of
a queue must be less than or equal to msg_max.
1926
1927/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the
1928maximum message size value (it is every message queue's attribute set during
1929its creation).
1930
1931
1932------------------------------------------------------------------------------
1933Summary
1934------------------------------------------------------------------------------
Certain aspects of kernel behavior can be modified at runtime, without the
need to recompile the kernel, or even to reboot the system. The files in the
/proc/sys tree can not only be read, but also modified. You can use the echo
command to write values into these files, thereby changing the default
settings of the kernel.
1940------------------------------------------------------------------------------
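
The same kind of change can be made from a program. The following is a
minimal sketch of what "echo value > file" amounts to, plus a matching read.
The helper names sysctl_write and sysctl_read are hypothetical, and writing
to a real /proc/sys file normally requires root privileges.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write a short text value to a file such as
 * /proc/sys/net/ipv4/ip_forward. Returns 0 on success, -1 on failure.
 */
int sysctl_write(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(value, f) == EOF || fputc('\n', f) == EOF) {
		fclose(f);
		return -1;
	}
	/* A rejected value may only be reported when the buffer is flushed. */
	return fclose(f) == 0 ? 0 : -1;
}

/*
 * Hypothetical helper: read the first line of the file into buf and strip
 * the trailing newline. Returns 0 on success, -1 on failure.
 */
int sysctl_read(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");
	size_t n;

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	n = strlen(buf);
	if (n > 0 && buf[n - 1] == '\n')
		buf[n - 1] = '\0';
	return 0;
}
```

For example, sysctl_write("/proc/sys/net/ipv4/tcp_ecn", "0") would correspond
to echoing 0 into that file.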
diff --git a/Documentation/filesystems/romfs.txt b/Documentation/filesystems/romfs.txt
new file mode 100644
index 000000000000..2d2a7b2a16b9
--- /dev/null
+++ b/Documentation/filesystems/romfs.txt
@@ -0,0 +1,187 @@
1ROMFS - ROM FILE SYSTEM
2
3This is a quite dumb, read only filesystem, mainly for initial RAM
4disks of installation disks. It has grown up by the need of having
5modules linked at boot time. Using this filesystem, you get a very
6similar feature, and even the possibility of a small kernel, with a
7file system which doesn't take up useful memory from the router
8functions in the basement of your office.
9
10For comparison, both the older minix and xiafs (the latter is now
11defunct) filesystems, compiled as module need more than 20000 bytes,
12while romfs is less than a page, about 4000 bytes (assuming i586
13code). Under the same conditions, the msdos filesystem would need
14about 30K (and does not support device nodes or symlinks), while the
15nfs module with nfsroot is about 57K. Furthermore, as a bit unfair
16comparison, an actual rescue disk used up 3202 blocks with ext2, while
17with romfs, it needed 3079 blocks.
18
19To create such a file system, you'll need a user program named
20genromfs. It is available via anonymous ftp on sunsite.unc.edu and
21its mirrors, in the /pub/Linux/system/recovery/ directory.
22
23As the name suggests, romfs could be also used (space-efficiently) on
24various read-only media, like (E)EPROM disks if someone will have the
25motivation.. :)
26
27However, the main purpose of romfs is to have a very small kernel,
28which has only this filesystem linked in, and then can load any module
29later, with the current module utilities. It can also be used to run
30some program to decide if you need SCSI devices, and even IDE or
31floppy drives can be loaded later if you use the "initrd"--initial
32RAM disk--feature of the kernel. This would not be really news
33flash, but with romfs, you can even spare off your ext2 or minix or
34maybe even affs filesystem until you really know that you need it.
35
36For example, a distribution boot disk can contain only the cd disk
37drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem
38module. The kernel can be small enough, since it doesn't have other
39filesystems, like the quite large ext2fs module, which can then be
40loaded off the CD at a later stage of the installation. Another use
41would be for a recovery disk, when you are reinstalling a workstation
42from the network, and you will have all the tools/modules available
43from a nearby server, so you don't want to carry two disks for this
44purpose, just because it won't fit into ext2.
45
46romfs operates on block devices as you can expect, and the underlying
47structure is very simple. Every accessible structure begins on 16
48byte boundaries for fast access. The minimum space a file will take
49is 32 bytes (this is an empty file, with a less than 16 character
name). The maximum overhead for any non-empty file is the header, and
the 16 byte padding for the name and the contents, i.e. 16+14+15 = 45
bytes. This is quite rare however, since most file names are longer
than 3 bytes, and shorter than 15 bytes.
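
The space taken by a single file can be sketched as follows, assuming the
layout just described: a 16 byte header, the zero terminated name padded to
16 bytes, and the contents padded to 16 bytes. The function names are
hypothetical and are not taken from genromfs or the kernel.

```c
#include <stddef.h>
#include <string.h>

/* Round n up to the next multiple of 16. */
static size_t pad16(size_t n)
{
	return (n + 15) & ~(size_t)15;
}

/*
 * Hypothetical helper: bytes occupied by one file, i.e. the 16 byte
 * header, the name with its terminating zero padded to 16 bytes, and the
 * data padded to 16 bytes.
 */
size_t romfs_entry_size(const char *name, size_t data_len)
{
	return 16 + pad16(strlen(name) + 1) + pad16(data_len);
}
```

With this, an empty file whose name is shorter than 16 characters comes out
at the 32 byte minimum mentioned above.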
54
55The layout of the filesystem is the following:
56
offset      content

        +---+---+---+---+
  0     | - | r | o | m |  \
        +---+---+---+---+   The ASCII representation of those bytes
  4     | 1 | f | s | - |  /  (i.e. "-rom1fs-")
        +---+---+---+---+
  8     |   full size   |   The number of accessible bytes in this fs.
        +---+---+---+---+
 12     |    checksum   |   The checksum of the FIRST 512 BYTES.
        +---+---+---+---+
 16     |  volume name  |   The zero terminated name of the volume,
        :               :   padded to 16 byte boundary.
        +---+---+---+---+
 xx     |     file      |
        :    headers    :
73
74Every multi byte value (32 bit words, I'll use the longwords term from
75now on) must be in big endian order.
76
77The first eight bytes identify the filesystem, even for the casual
78inspector. After that, in the 3rd longword, it contains the number of
79bytes accessible from the start of this filesystem. The 4th longword
80is the checksum of the first 512 bytes (or the number of bytes
81accessible, whichever is smaller). The applied algorithm is the same
82as in the AFFS filesystem, namely a simple sum of the longwords
83(assuming bigendian quantities again). For details, please consult
84the source. This algorithm was chosen because although it's not quite
85reliable, it does not require any tables, and it is very simple.
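
A minimal sketch of this checksum follows. The function names are
hypothetical. It also relies on one assumption that is not stated in the text
above and comes from the check in the kernel source: the checksum longword at
offset 12 is chosen so that the checked range sums to zero.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sum the big endian longwords in the first len bytes of buf, wrapping
 * modulo 2^32. A trailing partial longword is ignored.
 */
uint32_t romfs_sum(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 4 <= len; i += 4)
		sum += ((uint32_t)buf[i] << 24) | ((uint32_t)buf[i + 1] << 16) |
		       ((uint32_t)buf[i + 2] << 8) | (uint32_t)buf[i + 3];
	return sum;
}

/*
 * Assumption: a valid image sums to zero over the first 512 bytes, or
 * over the whole image if it is smaller than that.
 */
int romfs_superblock_ok(const unsigned char *img, size_t fs_size)
{
	size_t len = fs_size < 512 ? fs_size : 512;

	return romfs_sum(img, len) == 0;
}

/*
 * Store the value at offset 12 that makes the checked range sum to zero.
 * The image must be at least 16 bytes long.
 */
void romfs_set_checksum(unsigned char *img, size_t fs_size)
{
	size_t len = fs_size < 512 ? fs_size : 512;
	uint32_t fix;

	img[12] = img[13] = img[14] = img[15] = 0;
	fix = 0u - romfs_sum(img, len);
	img[12] = (unsigned char)(fix >> 24);
	img[13] = (unsigned char)(fix >> 16);
	img[14] = (unsigned char)(fix >> 8);
	img[15] = (unsigned char)fix;
}
```

Assembling the longwords byte by byte keeps the sketch independent of the
byte order of the machine it runs on.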
86
87The following bytes are now part of the file system; each file header
88must begin on a 16 byte boundary.
89
offset      content

        +---+---+---+---+
  0     | next filehdr|X|   The offset of the next file header
        +---+---+---+---+     (zero if no more files)
  4     |   spec.info   |   Info for directories/hard links/devices
        +---+---+---+---+
  8     |     size      |   The size of this file in bytes
        +---+---+---+---+
 12     |   checksum    |   Covering the meta data, including the file
        +---+---+---+---+     name, and padding
 16     |   file name   |   The zero terminated name of the file,
        :               :   padded to 16 byte boundary
        +---+---+---+---+
 xx     |   file data   |
        :               :
106
Since the file headers always begin at a 16 byte boundary, the lowest
4 bits of the next filehdr pointer would always be zero. These four
bits are used for the mode information. Bits 0..2 specify the type of
the file, while bit 3 shows if the file is executable or not. The
permissions are assumed to be world readable if this bit is not set,
and world executable if it is; the exceptions are character and block
devices, which are never accessible to anyone other than the owner. The
owner of every file is user and group 0; this should never be a problem
for the intended use. The mapping of the 8 possible values to file types
is the following:
117
       mapping         spec.info means
   0   hard link       link destination [file header]
   1   directory       first file's header
   2   regular file    unused, must be zero [MBZ]
   3   symbolic link   unused, MBZ (file data is the link content)
   4   block device    16/16 bits major/minor number
   5   char device     16/16 bits major/minor number
   6   socket          unused, MBZ
   7   fifo            unused, MBZ
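
As an illustration, the first longword of a file header could be split into
its parts like this. The sketch assumes the value has already been converted
from big endian to host order, and takes the executable flag to be the
fourth of the low four bits (mask 8). The structure and function names are
hypothetical.

```c
#include <stdint.h>

/* Hypothetical decoded form of the "next filehdr" longword. */
struct romfs_next {
	uint32_t offset;   /* offset of the next file header, 0 if none */
	unsigned type;     /* 0..7, see the mapping table above */
	int executable;    /* nonzero if the executable bit is set */
};

/* Split the longword into the 16 byte aligned offset and the mode bits. */
struct romfs_next romfs_decode_next(uint32_t next)
{
	struct romfs_next n;

	n.offset = next & ~(uint32_t)0xf;   /* clear the four mode bits */
	n.type = next & 7;                  /* three type bits */
	n.executable = (next & 8) != 0;     /* executable flag */
	return n;
}
```

For instance, the value 0x4a decodes to offset 0x40, type 2 (regular file),
with the executable flag set.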
127
128Note that hard links are specifically marked in this filesystem, but
129they will behave as you can expect (i.e. share the inode number).
Note also that it is your responsibility not to create hard link
loops, and to create all the . and .. links for directories. This is
132normally done correctly by the genromfs program. Please refrain from
133using the executable bits for special purposes on the socket and fifo
134special files, they may have other uses in the future. Additionally,
135please remember that only regular files, and symlinks are supposed to
136have a nonzero size field; they contain the number of bytes available
137directly after the (padded) file name.
138
139Another thing to note is that romfs works on file headers and data
140aligned to 16 byte boundaries, but most hardware devices and the block
141device drivers are unable to cope with smaller than block-sized data.
To overcome this limitation, the whole size of the file system must be
padded to a 1024 byte boundary.
144
145If you have any problems or suggestions concerning this file system,
146please contact me. However, think twice before wanting me to add
147features and code, because the primary and most important advantage of
148this file system is the small code. On the other hand, don't be
149alarmed, I'm not getting that much romfs related mail. Now I can
150understand why Avery wrote poems in the ARCnet docs to get some more
151feedback. :)
152
153romfs has also a mailing list, and to date, it hasn't received any
154traffic, so you are welcome to join it to discuss your ideas. :)
155
156It's run by ezmlm, so you can subscribe to it by sending a message
157to romfs-subscribe@shadow.banki.hu, the content is irrelevant.
158
159Pending issues:
160
161- Permissions and owner information are pretty essential features of a
162Un*x like system, but romfs does not provide the full possibilities.
163I have never found this limiting, but others might.
164
165- The file system is read only, so it can be very small, but in case
166one would want to write _anything_ to a file system, he still needs
167a writable file system, thus negating the size advantages. Possible
168solutions: implement write access as a compile-time option, or a new,
169similarly small writable filesystem for RAM disks.
170
- Since the files are only required to be aligned on a 16 byte
boundary, reading or executing files from the filesystem may currently
be suboptimal. This might be resolved by reordering file data to
have most of it (i.e. except the start and the end) lying at "natural"
boundaries, so that a big portion of the file contents could be mapped
directly into the mm subsystem.
177
- Compression might be a useful feature, but memory is quite a
limiting factor in my eyes.

- Where is it used?

- Does it work on architectures other than Intel and Motorola?
184
185
186Have fun,
187Janos Farkas <chexum@shadow.banki.hu>
diff --git a/Documentation/filesystems/smbfs.txt b/Documentation/filesystems/smbfs.txt
new file mode 100644
index 000000000000..f673ef0de0f7
--- /dev/null
+++ b/Documentation/filesystems/smbfs.txt
@@ -0,0 +1,8 @@
1Smbfs is a filesystem that implements the SMB protocol, which is the
2protocol used by Windows for Workgroups, Windows 95 and Windows NT.
3Smbfs was inspired by Samba, the program written by Andrew Tridgell
4that turns any Unix host into a file server for DOS or Windows clients.
5
Smbfs is an SMB client, but uses parts of samba for its operation. For
more info on samba, including documentation, please go to
http://www.samba.org/ and then on to your nearest mirror.
diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt
new file mode 100644
index 000000000000..e97d024eae77
--- /dev/null
+++ b/Documentation/filesystems/sysfs-pci.txt
@@ -0,0 +1,88 @@
1Accessing PCI device resources through sysfs
2
3sysfs, usually mounted at /sys, provides access to PCI resources on platforms
4that support it. For example, a given bus might look like this:
5
     /sys/devices/pci0000:17
     |-- 0000:17:00.0
     |   |-- class
     |   |-- config
     |   |-- detach_state
     |   |-- device
     |   |-- irq
     |   |-- local_cpus
     |   |-- resource
     |   |-- resource0
     |   |-- resource1
     |   |-- resource2
     |   |-- rom
     |   |-- subsystem_device
     |   |-- subsystem_vendor
     |   `-- vendor
     `-- detach_state
23
24The topmost element describes the PCI domain and bus number. In this case,
25the domain number is 0000 and the bus number is 17 (both values are in hex).
26This bus contains a single function device in slot 0. The domain and bus
27numbers are reproduced for convenience. Under the device directory are several
28files, each with their own function.
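
As a small sketch, a device directory name such as 0000:17:00.0 can be split
into its parts as shown below. The structure and function names are
hypothetical. The code assumes the name has the form
domain:bus:slot.function, with domain, bus and slot in hex.

```c
#include <stdio.h>

/* Hypothetical container for the parts of a device directory name. */
struct pci_addr {
	unsigned domain;
	unsigned bus;
	unsigned slot;
	unsigned function;
};

/*
 * Parse a name such as "0000:17:00.0". Domain, bus and slot are read as
 * hex; the function is a single small number. Returns 0 on success and
 * -1 if the name does not match, including trailing characters.
 */
int parse_pci_addr(const char *name, struct pci_addr *out)
{
	struct pci_addr a;
	char trailing;

	if (sscanf(name, "%x:%x:%x.%u%c", &a.domain, &a.bus, &a.slot,
		   &a.function, &trailing) != 4)
		return -1;
	*out = a;
	return 0;
}
```

For the example above this yields domain 0, bus 0x17, slot 0 and function 0.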
29
       file               function
       ----               --------
       class              PCI class (ascii, ro)
       config             PCI config space (binary, rw)
       detach_state       connection status (bool, rw)
       device             PCI device (ascii, ro)
       irq                IRQ number (ascii, ro)
       local_cpus         nearby CPU mask (cpumask, ro)
       resource           PCI resource host addresses (ascii, ro)
       resource0..N       PCI resource N, if present (binary, mmap)
       rom                PCI ROM resource, if present (binary, ro)
       subsystem_device   PCI subsystem device (ascii, ro)
       subsystem_vendor   PCI subsystem vendor (ascii, ro)
       vendor             PCI vendor (ascii, ro)

  ro - read only file
  rw - file is readable and writable
  mmap - file is mmapable
  ascii - file contains ascii text
  binary - file contains binary data
  cpumask - file contains a cpumask type
51
The read only files are informational; writes to them will be ignored.
Writable files can be used to perform actions on the device (e.g. changing
config space, detaching a device). mmapable files are available via an
mmap of the file at offset 0 and can be used to do actual device programming
from userspace. Note that some platforms don't support mmapping of certain
resources, so be sure to check the return value from any attempted mmap.
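
A minimal sketch of such a mapping, with the return values checked, could
look like the following. The helper name map_pci_resource is hypothetical,
and the caller is assumed to know the length of the resource it wants to map.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map len bytes of a resource file (for example
 * .../0000:17:00.0/resource0) at offset 0. Returns the mapping, or NULL
 * if open or mmap fails, which can happen on platforms that do not
 * support mapping this resource.
 */
void *map_pci_resource(const char *path, size_t len, int writable)
{
	int fd = open(path, writable ? O_RDWR : O_RDONLY);
	int prot = writable ? (PROT_READ | PROT_WRITE) : PROT_READ;
	void *p;

	if (fd < 0)
		return NULL;
	p = mmap(NULL, len, prot, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the close */
	return p == MAP_FAILED ? NULL : p;
}
```

The caller would release the mapping with munmap() when done.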
58
59Accessing legacy resources through sysfs
60
Legacy I/O port and ISA memory resources are also provided in sysfs if the
underlying platform supports them. They're located in the PCI class hierarchy,
e.g.

     /sys/class/pci_bus/0000:17/
     |-- bridge -> ../../../devices/pci0000:17
     |-- cpuaffinity
     |-- legacy_io
     `-- legacy_mem
70
71The legacy_io file is a read/write file that can be used by applications to
72do legacy port I/O. The application should open the file, seek to the desired
73port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem
74file should be mmapped with an offset corresponding to the memory offset
75desired, e.g. 0xa0000 for the VGA frame buffer. The application can then
76simply dereference the returned pointer (after checking for errors of course)
77to access legacy memory space.
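
The port I/O procedure might be sketched as below for a one byte read. The
helper name legacy_io_read8 is hypothetical, and the path passed in is
assumed to be a legacy_io file such as /sys/class/pci_bus/0000:17/legacy_io.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one byte from an I/O port through a
 * legacy_io file by seeking to the port number and reading 1 byte.
 * Returns 0 on success, -1 on any failure.
 */
int legacy_io_read8(const char *path, off_t port, uint8_t *value)
{
	int fd = open(path, O_RDONLY);
	int ok;

	if (fd < 0)
		return -1;
	ok = lseek(fd, port, SEEK_SET) == port && read(fd, value, 1) == 1;
	close(fd);
	return ok ? 0 : -1;
}
```

A call such as legacy_io_read8(path, 0x3e8, &v) would correspond to the port
used in the example above. Reads of 2 or 4 bytes would follow the same
pattern with a larger buffer.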
78
79Supporting PCI access on new platforms
80
81In order to support PCI resource mapping as described above, Linux platform
82code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function.
83Platforms are free to only support subsets of the mmap functionality, but
84useful return codes should be provided.
85
86Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms
87wishing to support legacy functionality should define it and provide
88pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions. \ No newline at end of file
diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt
new file mode 100644
index 000000000000..60f6c2c4d477
--- /dev/null
+++ b/Documentation/filesystems/sysfs.txt
@@ -0,0 +1,341 @@
1
2sysfs - _The_ filesystem for exporting kernel objects.
3
4Patrick Mochel <mochel@osdl.org>
5
610 January 2003
7
8
9What it is:
10~~~~~~~~~~~
11
12sysfs is a ram-based filesystem initially based on ramfs. It provides
13a means to export kernel data structures, their attributes, and the
14linkages between them to userspace.
15
16sysfs is tied inherently to the kobject infrastructure. Please read
17Documentation/kobject.txt for more information concerning the kobject
18interface.
19
20
21Using sysfs
22~~~~~~~~~~~
23
24sysfs is always compiled in. You can access it by doing:
25
26 mount -t sysfs sysfs /sys
27
28
29Directory Creation
30~~~~~~~~~~~~~~~~~~
31
32For every kobject that is registered with the system, a directory is
33created for it in sysfs. That directory is created as a subdirectory
34of the kobject's parent, expressing internal object hierarchies to
35userspace. Top-level directories in sysfs represent the common
36ancestors of object hierarchies; i.e. the subsystems the objects
37belong to.
38
39Sysfs internally stores the kobject that owns the directory in the
40->d_fsdata pointer of the directory's dentry. This allows sysfs to do
41reference counting directly on the kobject when the file is opened and
42closed.
43
44
45Attributes
46~~~~~~~~~~
47
48Attributes can be exported for kobjects in the form of regular files in
49the filesystem. Sysfs forwards file I/O operations to methods defined
50for the attributes, providing a means to read and write kernel
51attributes.
52
53Attributes should be ASCII text files, preferably with only one value
54per file. It is noted that it may not be efficient to contain only one
55value per file, so it is socially acceptable to express an array of
56values of the same type.
57
58Mixing types, expressing multiple lines of data, and doing fancy
59formatting of data is heavily frowned upon. Doing these things may get
60you publicly humiliated and your code rewritten without notice.
61
62
63An attribute definition is simply:
64
65struct attribute {
66 char * name;
67 mode_t mode;
68};
69
70
71int sysfs_create_file(struct kobject * kobj, struct attribute * attr);
72void sysfs_remove_file(struct kobject * kobj, struct attribute * attr);
73
74
75A bare attribute contains no means to read or write the value of the
76attribute. Subsystems are encouraged to define their own attribute
77structure and wrapper functions for adding and removing attributes for
78a specific object type.
79
80For example, the driver model defines struct device_attribute like:
81
82struct device_attribute {
83 struct attribute attr;
84 ssize_t (*show)(struct device * dev, char * buf);
85 ssize_t (*store)(struct device * dev, const char * buf);
86};
87
88int device_create_file(struct device *, struct device_attribute *);
89void device_remove_file(struct device *, struct device_attribute *);
90
91It also defines this helper for defining device attributes:
92
93#define DEVICE_ATTR(_name,_mode,_show,_store) \
94struct device_attribute dev_attr_##_name = { \
95 .attr = {.name = __stringify(_name) , .mode = _mode }, \
96 .show = _show, \
97 .store = _store, \
98};
99
100For example, declaring
101
102static DEVICE_ATTR(foo,0644,show_foo,store_foo);
103
104is equivalent to doing:
105
106static struct device_attribute dev_attr_foo = {
107 .attr = {
108 .name = "foo",
109 .mode = 0644,
110 },
111 .show = show_foo,
112 .store = store_foo,
113};
114
115
116Subsystem-Specific Callbacks
117~~~~~~~~~~~~~~~~~~~~~~~~~~~~
118
119When a subsystem defines a new attribute type, it must implement a
120set of sysfs operations for forwarding read and write calls to the
121show and store methods of the attribute owners.
122
123struct sysfs_ops {
124 ssize_t (*show)(struct kobject *, struct attribute *,char *);
125 ssize_t (*store)(struct kobject *,struct attribute *,const char *);
126};
127
128[ Subsystems should have already defined a struct kobj_type as a
129descriptor for this type, which is where the sysfs_ops pointer is
130stored. See the kobject documentation for more information. ]
131
132When a file is read or written, sysfs calls the appropriate method
133for the type. The method then translates the generic struct kobject
134and struct attribute pointers to the appropriate pointer types, and
135calls the associated methods.
136
137
138To illustrate:
139
140#define to_dev_attr(_attr) container_of(_attr,struct device_attribute,attr)
141#define to_dev(d) container_of(d, struct device, kobj)
142
143static ssize_t
144dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf)
145{
146 struct device_attribute * dev_attr = to_dev_attr(attr);
147 struct device * dev = to_dev(kobj);
148 ssize_t ret = 0;
149
150 if (dev_attr->show)
151 ret = dev_attr->show(dev,buf);
152 return ret;
153}
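
The dispatch pattern above can be tried outside the kernel. The following is a minimal userspace sketch, not the driver-model code itself: the structures are cut-down mock-ups, BUF_SIZE stands in for PAGE_SIZE, and dev_attr_store() is an assumed counterpart to dev_attr_show() that this document does not show. A real store() method would return a negative errno value rather than -1.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define BUF_SIZE 4096	/* stand-in for PAGE_SIZE (assumption) */

/* Userspace mock-ups of the kernel types; only the fields used here. */
struct attribute { const char *name; mode_t mode; };
struct kobject { const char *name; };
struct device { char name[21]; struct kobject kobj; };

struct device_attribute {
	struct attribute attr;
	ssize_t (*show)(struct device *dev, char *buf);
	ssize_t (*store)(struct device *dev, const char *buf);
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
#define to_dev(d) container_of(d, struct device, kobj)

static ssize_t show_name(struct device *dev, char *buf)
{
	return snprintf(buf, BUF_SIZE, "%s\n", dev->name);
}

static ssize_t store_name(struct device *dev, const char *buf)
{
	if (sscanf(buf, "%20s", dev->name) != 1)
		return -1;	/* bad value: report an error */
	return (ssize_t)strlen(buf);
}

static struct device_attribute dev_attr_name = {
	.attr = { .name = "name", .mode = 0644 },
	.show = show_name,
	.store = store_name,
};

/* Generic entry points: translate the generic pointers, then dispatch. */
static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
			     char *buf)
{
	struct device_attribute *dev_attr = to_dev_attr(attr);
	struct device *dev = to_dev(kobj);

	return dev_attr->show ? dev_attr->show(dev, buf) : 0;
}

static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr,
			      const char *buf)
{
	struct device_attribute *dev_attr = to_dev_attr(attr);
	struct device *dev = to_dev(kobj);

	return dev_attr->store ? dev_attr->store(dev, buf) : 0;
}
```

The caller only ever holds the generic kobject and attribute pointers; container_of() recovers the typed structures, which is why the attribute must be embedded in struct device_attribute rather than pointed to.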
154
155
156
157Reading/Writing Attribute Data
158~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
159
160To read or write attributes, show() or store() methods must be
161specified when declaring the attribute. The method types should be as
162simple as those defined for device attributes:
163
164 ssize_t (*show)(struct device * dev, char * buf);
165 ssize_t (*store)(struct device * dev, const char * buf);
166
167IOW, they should take only an object and a buffer as parameters.
168
169
170sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
171method. Sysfs will call the method exactly once for each read or
172write. This forces the following behavior on the method
173implementations:
174
175- On read(2), the show() method should fill the entire buffer.
176 Recall that an attribute should only be exporting one value, or an
177 array of similar values, so this shouldn't be that expensive.
178
179 This allows userspace to do partial reads and seeks arbitrarily over
180 the entire file at will.
181
182- On write(2), sysfs expects the entire buffer to be passed during the
183 first write. Sysfs then passes the entire buffer to the store()
184 method.
185
186 When writing sysfs files, userspace processes should first read the
187 entire file, modify the values it wishes to change, then write the
188 entire buffer back.
189
190 Attribute method implementations should operate on an identical
191 buffer when reading and writing values.
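
From the userspace side, the rules above amount to one read(2) or one write(2) per access. A minimal sketch follows, assuming hypothetical helper names read_attr() and write_attr(); neither is part of any library, and error handling is reduced to returning -1.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Read a whole attribute file with a single read(2) and NUL-terminate it.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_attr(const char *path, char *buf, size_t len)
{
	ssize_t n;
	int fd;

	if (len == 0)
		return -1;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = read(fd, buf, len - 1);
	close(fd);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return n;
}

/*
 * Pass the whole new value in a single write(2).
 * Returns the number of bytes written, or -1 on error.
 */
static ssize_t write_attr(const char *path, const char *val, size_t len)
{
	ssize_t n;
	int fd;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	n = write(fd, val, len);
	close(fd);
	return n;
}
```

A caller would typically pass a buffer of at least PAGE_SIZE bytes to read_attr(), since that is the most a show() method can produce.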
192
193Other notes:
194
195- The buffer will always be PAGE_SIZE bytes in length. On i386, this
196 is 4096.
197
198- show() methods should return the number of bytes printed into the
199 buffer. This is the return value of snprintf().
200
201- show() should always use snprintf().
202
203- store() should return the number of bytes used from the buffer. This
204 can be done using strlen().
205
206- show() or store() can always return errors. If a bad value comes
207 through, be sure to return an error.
208
209- The object passed to the methods will be pinned in memory via sysfs
210 reference counting its embedded object. However, the physical
211 entity (e.g. device) the object represents may not be present. Be
212 sure to have a way to check this, if necessary.
213
214
215A very simple (and naive) implementation of a device attribute is:
216
217static ssize_t show_name(struct device * dev, char * buf)
218{
219 return snprintf(buf,PAGE_SIZE,"%s\n",dev->name);
220}
221
222static ssize_t store_name(struct device * dev, const char * buf)
223{
224 sscanf(buf,"%20s",dev->name);
225 return strlen(buf);
226}
227
228static DEVICE_ATTR(name,S_IRUGO,show_name,store_name);
229
230
231(Note that the real implementation doesn't allow userspace to set the
232name for a device.)
233
234
235Top Level Directory Layout
236~~~~~~~~~~~~~~~~~~~~~~~~~~
237
238The sysfs directory arrangement exposes the relationship of kernel
239data structures.
240
241The top level sysfs directory looks like:
242
243block/
244bus/
245class/
246devices/
247firmware/
248net/
249
250devices/ contains a filesystem representation of the device tree. It maps
251directly to the internal kernel device tree, which is a hierarchy of
252struct device.
253
254bus/ contains a flat directory layout of the various bus types in the
255kernel. Each bus's directory contains two subdirectories:
256
257 devices/
258 drivers/
259
260devices/ contains symlinks for each device discovered in the system
261that point to the device's directory under the top-level devices/ directory.
262
263drivers/ contains a directory for each device driver that is loaded
264for devices on that particular bus (this assumes that drivers do not
265span multiple bus types).
266
267
268More information on driver-model specific features can be found in
269Documentation/driver-model/.
270
271
272TODO: Finish this section.
273
274
275Current Interfaces
276~~~~~~~~~~~~~~~~~~
277
278The following interface layers currently exist in sysfs:
279
280
281- devices (include/linux/device.h)
282----------------------------------
283Structure:
284
285struct device_attribute {
286 struct attribute attr;
287 ssize_t (*show)(struct device * dev, char * buf);
288 ssize_t (*store)(struct device * dev, const char * buf);
289};
290
291Declaring:
292
293DEVICE_ATTR(_name,_mode,_show,_store);
294
295Creation/Removal:
296
297int device_create_file(struct device *device, struct device_attribute * attr);
298void device_remove_file(struct device * dev, struct device_attribute * attr);
299
300
301- bus drivers (include/linux/device.h)
302--------------------------------------
303Structure:
304
305struct bus_attribute {
306 struct attribute attr;
307 ssize_t (*show)(struct bus_type *, char * buf);
308 ssize_t (*store)(struct bus_type *, const char * buf);
309};
310
311Declaring:
312
313BUS_ATTR(_name,_mode,_show,_store)
314
315Creation/Removal:
316
317int bus_create_file(struct bus_type *, struct bus_attribute *);
318void bus_remove_file(struct bus_type *, struct bus_attribute *);
319
320
321- device drivers (include/linux/device.h)
322-----------------------------------------
323
324Structure:
325
326struct driver_attribute {
327 struct attribute attr;
328 ssize_t (*show)(struct device_driver *, char * buf);
329 ssize_t (*store)(struct device_driver *, const char * buf);
330};
331
332Declaring:
333
334DRIVER_ATTR(_name,_mode,_show,_store)
335
336Creation/Removal:
337
338int driver_create_file(struct device_driver *, struct driver_attribute *);
339void driver_remove_file(struct device_driver *, struct driver_attribute *);
340
341
diff --git a/Documentation/filesystems/sysv-fs.txt b/Documentation/filesystems/sysv-fs.txt
new file mode 100644
index 000000000000..d81722418010
--- /dev/null
+++ b/Documentation/filesystems/sysv-fs.txt
@@ -0,0 +1,38 @@
1This is the implementation of the SystemV/Coherent filesystem for Linux.
2It implements all of
3 - Xenix FS,
4 - SystemV/386 FS,
5 - Coherent FS.
6
7This is version beta 4.
8
9To install:
10* Answer the 'System V and Coherent filesystem support' question with 'y'
11 when configuring the kernel.
12* To mount a disk or a partition, use
13 mount [-r] -t sysv device mountpoint
14 The file system type names
15 -t sysv
16 -t xenix
17 -t coherent
18 may be used interchangeably, but the last two will eventually disappear.
19
20Bugs in the present implementation:
21- Coherent FS:
22 - The "free list interleave" n:m is currently ignored.
23 - Only file systems with no filesystem name and no pack name are recognized.
24 (See Coherent "man mkfs" for a description of these features.)
25- SystemV Release 2 FS:
26 The superblock is only searched in the blocks 9, 15, 18, which
27 correspond to the beginning of track 1 on floppy disks. No support
28 for this FS on hard disk yet.
29
30
31Please report any bugs and suggestions to
32 Bruno Haible <haible@ma2s2.mathematik.uni-karlsruhe.de>
33 Pascal Haible <haible@izfm.uni-stuttgart.de>
34 Krzysztof G. Baranowski <kgb@manjak.knm.org.pl>
35
36Bruno Haible
37<haible@ma2s2.mathematik.uni-karlsruhe.de>
38
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt
new file mode 100644
index 000000000000..417e3095fe39
--- /dev/null
+++ b/Documentation/filesystems/tmpfs.txt
@@ -0,0 +1,100 @@
1Tmpfs is a file system which keeps all files in virtual memory.
2
3
4Everything in tmpfs is temporary in the sense that no files will be
5created on your hard drive. If you unmount a tmpfs instance,
6everything stored therein is lost.
7
8tmpfs puts everything into the kernel internal caches and grows and
9shrinks to accommodate the files it contains and is able to swap
10unneeded pages out to swap space. It has maximum size limits which can
11be adjusted on the fly via 'mount -o remount ...'
12
13If you compare it to ramfs (which was the template to create tmpfs)
14you gain swapping and limit checking. Another similar thing is the RAM
15disk (/dev/ram*), which simulates a fixed size hard disk in physical
16RAM, where you have to create an ordinary filesystem on top. Ramdisks
17cannot swap and you do not have the possibility to resize them.
18
19Since tmpfs lives completely in the page cache and on swap, all tmpfs
20pages currently in memory will show up as cached. It will not show up
21as shared or something like that. Further on you can check the actual
22RAM+swap use of a tmpfs instance with df(1) and du(1).
23
24
25tmpfs has the following uses:
26
271) There is always a kernel internal mount which you will not see at
28 all. This is used for shared anonymous mappings and SYSV shared
29 memory.
30
31 This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
32 set, the user visible part of tmpfs is not built. But the internal
33 mechanisms are always present.
34
352) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
36 POSIX shared memory (shm_open, shm_unlink). Adding the following
37 line to /etc/fstab should take care of this:
38
39 tmpfs /dev/shm tmpfs defaults 0 0
40
41 Remember to create the directory that you intend to mount tmpfs on
42 if necessary (/dev/shm is automagically created if you use devfs).
43
44 This mount is _not_ needed for SYSV shared memory. The internal
45 mount is used for that. (In the 2.3 kernel versions it was
46 necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
47 shared memory)
48
493) Some people (including me) find it very convenient to mount it
50 e.g. on /tmp and /var/tmp and have a big swap partition. And now
51 loop mounts of tmpfs files do work, so mkinitrd shipped by most
52 distributions should succeed with a tmpfs /tmp.
53
544) And probably a lot more I do not know about :-)
55
56
57tmpfs has three mount options for sizing:
58
59size: The limit of allocated bytes for this tmpfs instance. The
60 default is half of your physical RAM without swap. If you
61 oversize your tmpfs instances the machine will deadlock
62 since the OOM handler will not be able to free that memory.
63nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE.
64nr_inodes: The maximum number of inodes for this instance. The default
65 is half of the number of your physical RAM pages, or (on a
66 machine with highmem) the number of lowmem RAM pages,
67 whichever is the lower.
68
69These parameters accept a suffix k, m or g for kilo, mega and giga and
70can be changed on remount. The size parameter also accepts a suffix %
71to limit this tmpfs instance to that percentage of your physical RAM:
72the default, when neither size nor nr_blocks is specified, is size=50%
73
74If both nr_blocks (or size) and nr_inodes are set to 0, neither blocks
75nor inodes will be limited in that instance. It is generally unwise to
76mount with such options, since it allows any user with write access to
77use up all the memory on the machine; but enhances the scalability of
78that instance in a system with many cpus making intensive use of it.
79
80
81To specify the initial root directory you can use the following mount
82options:
83
84mode: The permissions as an octal number
85uid: The user id
86gid: The group id
87
88These options do not have any effect on remount. You can change these
89parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
90
91
92So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
93will give you a tmpfs instance on /mytmpfs which can allocate 10GB
94RAM/SWAP in 10240 inodes and it is only accessible by root.
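
To make the suffix arithmetic concrete (10k is 10240, 10G is 10 * 2^30), here is a small userspace sketch. It is not the kernel's option parser: parse_size() is a hypothetical helper, it ignores the % suffix that size= also accepts, and it does not check for overflow.

```c
#include <stdlib.h>

/*
 * Parse a decimal number with an optional k, m or g suffix (either case).
 * Returns 0 on success and stores the result in *out; -1 on malformed input.
 */
static int parse_size(const char *s, unsigned long long *out)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 10);

	if (end == s)
		return -1;		/* no digits at all */
	switch (*end) {
	case 'g': case 'G':
		v <<= 10;		/* fall through */
	case 'm': case 'M':
		v <<= 10;		/* fall through */
	case 'k': case 'K':
		v <<= 10;
		end++;			/* consume the suffix */
		break;
	default:
		break;
	}
	if (*end != '\0')
		return -1;		/* trailing garbage */
	*out = v;
	return 0;
}
```

Each suffix multiplies by a further 1024, which the cascading shifts express directly.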
95
96
97Author:
98 Christoph Rohland <cr@sap.com>, 1.12.01
99Updated:
100 Hugh Dickins <hugh@veritas.com>, 01 September 2004
diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.txt
new file mode 100644
index 000000000000..e5213bc301f7
--- /dev/null
+++ b/Documentation/filesystems/udf.txt
@@ -0,0 +1,57 @@
1*
2* Documentation/filesystems/udf.txt
3*
4UDF Filesystem version 0.9.8.1
5
6If you encounter problems with reading UDF discs using this driver,
7please report them to linux_udf@hpesjro.fc.hp.com, which is the
8developer's list.
9
10Write support requires a block driver which supports writing. The current
11scsi and ide cdrom drivers do not support writing.
12
13-------------------------------------------------------------------------------
14The following mount options are supported:
15
16 gid= Set the default group.
17 umask= Set the default umask.
18 uid= Set the default user.
19 bs= Set the block size.
20 unhide Show otherwise hidden files.
21 undelete Show deleted files in lists.
22 adinicb Embed data in the inode (default)
23 noadinicb Don't embed data in the inode
24 shortad Use short ad's
25 longad Use long ad's (default)
26 nostrict Unset strict conformance
27 iocharset= Set the NLS character set
28
29The remaining are for debugging and disaster recovery:
30
31 novrs Skip volume sequence recognition
32
33The following expect an offset from 0.
34
35 session= Set the CDROM session (default= last session)
36 anchor= Override standard anchor location. (default= 256)
37 volume= Override the VolumeDesc location. (unused)
38 partition= Override the PartitionDesc location. (unused)
39 lastblock= Set the last block of the filesystem.
40
41The following expect an offset from the partition root.
42
43 fileset= Override the fileset block location. (unused)
44 rootdir= Override the root directory location. (unused)
45 WARNING: overriding the rootdir to a non-directory may
46 yield highly unpredictable results.
47-------------------------------------------------------------------------------
48
49
50For the latest version and toolset see:
51 http://linux-udf.sourceforge.net/
52
53Documentation on UDF and ECMA 167 is available FREE from:
54 http://www.osta.org/
55 http://www.ecma-international.org/
56
57Ben Fennema <bfennema@falcon.csc.calpoly.edu>
diff --git a/Documentation/filesystems/ufs.txt b/Documentation/filesystems/ufs.txt
new file mode 100644
index 000000000000..2b5a56a6a558
--- /dev/null
+++ b/Documentation/filesystems/ufs.txt
@@ -0,0 +1,61 @@
1USING UFS
2=========
3
4mount -t ufs -o ufstype=type_of_ufs device dir
5
6
7UFS OPTIONS
8===========
9
10ufstype=type_of_ufs
11 UFS is a file system widely used in different operating systems.
12 The problem is the differences among implementations. Features of
13 some implementations are undocumented, so it's hard to recognize
14 the type of ufs automatically. That's why the user must specify the
15 type of ufs manually with the mount option ufstype. Possible values are:
16
17 old old format of ufs
18 default value, supported as read-only
19
20 44bsd used in FreeBSD, NetBSD, OpenBSD
21 supported as read-write
22
23 ufs2 used in FreeBSD 5.x
24 supported as read-only
25
26 5xbsd synonym for ufs2
27
28 sun used in SunOS (Solaris)
29 supported as read-write
30
31 sunx86 used in SunOS for Intel (Solarisx86)
32 supported as read-write
33
34 hp used in HP-UX
35 supported as read-only
36
37 nextstep
38 used in NextStep
39 supported as read-only
40
41 nextstep-cd
42 used for NextStep CDROMs (block_size == 2048)
43 supported as read-only
44
45 openstep
46 used in OpenStep
47 supported as read-only
48
49
50POSSIBLE PROBLEMS
51=================
52
53There is still a bug in the reallocation of fragments, in file fs/ufs/balloc.c,
54line 364. But it seems to work with the current buffer cache configuration.
55
56
57BUG REPORTS
58===========
59
60Send any ufs bug reports to daniel.pirkl@email.cz (do not send
61partition table bug reports).
diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt
new file mode 100644
index 000000000000..5ead20c6c744
--- /dev/null
+++ b/Documentation/filesystems/vfat.txt
@@ -0,0 +1,231 @@
1USING VFAT
2----------------------------------------------------------------------
3To use the vfat filesystem, use the filesystem type 'vfat'. i.e.
4 mount -t vfat /dev/fd0 /mnt
5
6No special partition formatter is required. mkdosfs will work fine
7if you want to format from within Linux.
8
9VFAT MOUNT OPTIONS
10----------------------------------------------------------------------
11umask=### -- The permission mask (for files and directories, see umask(1)).
12 The default is the umask of current process.
13
14dmask=### -- The permission mask for the directory.
15 The default is the umask of current process.
16
17fmask=### -- The permission mask for files.
18 The default is the umask of current process.
19
20codepage=### -- Sets the codepage number for converting to shortname
21 characters on FAT filesystem.
22 By default, FAT_DEFAULT_CODEPAGE setting is used.
23
24iocharset=name -- Character set to use for converting between the
25 encoding used for user-visible filenames and 16-bit
26 Unicode characters. Long filenames are stored on disk
27 in Unicode format, but Unix for the most part doesn't
28 know how to deal with Unicode.
29 By default, FAT_DEFAULT_IOCHARSET setting is used.
30
31 There is also an option of doing UTF8 translations
32 with the utf8 option.
33
34 NOTE: "iocharset=utf8" is not recommended. If unsure,
35 you should consider the following option instead.
36
37utf8=<bool> -- UTF8 is the filesystem safe version of Unicode that
38 is used by the console. It can be enabled for the
39 filesystem with this option. If 'uni_xlate' gets set,
40 UTF8 gets disabled.
41
42uni_xlate=<bool> -- Translate unhandled Unicode characters to special
43 escaped sequences. This would let you backup and
44 restore filenames that are created with any Unicode
45 characters. Until Linux supports Unicode for real,
46 this gives you an alternative. Without this option,
47 a '?' is used when no translation is possible. The
48 escape character is ':' because it is otherwise
49 illegal on the vfat filesystem. The escape sequence
50 that gets used is ':' and the four digits of hexadecimal
51 unicode.
52
53nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will
54 end in '~1' or tilde followed by some number. If this
55 option is set, then if the filename is
56 "longfilename.txt" and "longfile.txt" does not
57 currently exist in the directory, 'longfile.txt' will
58 be the short alias instead of 'longfi~1.txt'.
59
60quiet -- Stops printing certain warning messages.
61
62check=s|r|n -- Case sensitivity checking setting.
63 s: strict, case sensitive
64 r: relaxed, case insensitive
65 n: normal, default setting, currently case insensitive
66
67shortname=lower|win95|winnt|mixed
68 -- Shortname display/create setting.
69 lower: convert to lowercase for display,
70 emulate the Windows 95 rule for create.
71 win95: emulate the Windows 95 rule for display/create.
72 winnt: emulate the Windows NT rule for display/create.
73 mixed: emulate the Windows NT rule for display,
74 emulate the Windows 95 rule for create.
75 Default setting is `lower'.
76
77<bool>: 0,1,yes,no,true,false
78
79TODO
80----------------------------------------------------------------------
81* Need to get rid of the raw scanning stuff. Instead, always use
82 a get next directory entry approach. The only thing left that uses
83 raw scanning is the directory renaming code.
84
85
86POSSIBLE PROBLEMS
87----------------------------------------------------------------------
88* vfat_valid_longname does not properly check reserved names.
89* When a volume name is the same as a directory name in the root
90 directory of the filesystem, the directory name sometimes shows
91 up as an empty file.
92* autoconv option does not work correctly.
93
94BUG REPORTS
95----------------------------------------------------------------------
96If you have trouble with the VFAT filesystem, mail bug reports to
97chaffee@bmrc.cs.berkeley.edu. Please specify the filename
98and the operation that gave you trouble.
99
100TEST SUITE
101----------------------------------------------------------------------
102If you plan to make any modifications to the vfat filesystem, please
103get the test suite that comes with the vfat distribution at
104
105 http://bmrc.berkeley.edu/people/chaffee/vfat.html
106
107This tests quite a few parts of the vfat filesystem and additional
108tests for new features or untested features would be appreciated.
109
110NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
111----------------------------------------------------------------------
112(This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu>
113 and lightly annotated by Gordon Chaffee).
114
115This document presents a very rough, technical overview of my
116knowledge of the extended FAT file system used in Windows NT 3.5 and
117Windows 95. I don't guarantee that any of the following is correct,
118but it appears to be so.
119
120The extended FAT file system is almost identical to the FAT
121file system used in DOS versions up to and including 6.223410239847
122:-). The significant change has been the addition of long file names.
123These names support up to 255 characters including spaces and lower
124case characters as opposed to the traditional 8.3 short names.
125
126Here is the description of the traditional FAT entry in the current
127Windows 95 filesystem:
128
129 struct directory { // Short 8.3 names
130 unsigned char name[8]; // file name
131 unsigned char ext[3]; // file extension
132 unsigned char attr; // attribute byte
133 unsigned char lcase; // Case for base and extension
134 unsigned char ctime_ms; // Creation time, milliseconds
135 unsigned char ctime[2]; // Creation time
136 unsigned char cdate[2]; // Creation date
137 unsigned char adate[2]; // Last access date
138 unsigned char reserved[2]; // reserved values (ignored)
139 unsigned char time[2]; // time stamp
140 unsigned char date[2]; // date stamp
141 unsigned char start[2]; // starting cluster number
142 unsigned char size[4]; // size of the file
143 };
144
145The lcase field specifies if the base and/or the extension of an 8.3
146name should be capitalized. This field does not seem to be used by
147Windows 95 but it is used by Windows NT. The case of filenames is not
148completely compatible from Windows NT to Windows 95. It is completely
149compatible in the reverse direction, however. Filenames that fit in
150the 8.3 namespace and are written on Windows NT to be lowercase will
151show up as uppercase on Windows 95.
152
153Note that the "start" and "size" values are actually little
154endian integer values. The descriptions of the fields in this
155structure are public knowledge and can be found elsewhere.
156
157With the extended FAT system, Microsoft has inserted extra
158directory entries for any files with extended names. (Any name which
159legally fits within the old 8.3 encoding scheme does not have extra
160entries.) I call these extra entries slots. Basically, a slot is a
161specially formatted directory entry which holds up to 13 characters of
162a file's extended name. Think of slots as additional labeling for the
163directory entry of the file to which they correspond. Microsoft
164prefers to refer to the 8.3 entry for a file as its alias and the
165extended slot directory entries as the file name.
166
167The C structure for a slot directory entry follows:
168
169 struct slot { // Up to 13 characters of a long name
170 unsigned char id; // sequence number for slot
171 unsigned char name0_4[10]; // first 5 characters in name
172 unsigned char attr; // attribute byte
173 unsigned char reserved; // always 0
174 unsigned char alias_checksum; // checksum for 8.3 alias
175 unsigned char name5_10[12]; // 6 more characters in name
176 unsigned char start[2]; // starting cluster number
177 unsigned char name11_12[4]; // last 2 characters in name
178 };
179
180If the layout of the slots looks a little odd, it's only
181because of Microsoft's efforts to maintain compatibility with old
182software. The slots must be disguised to prevent old software from
183panicking. To this end, a number of measures are taken:
184
185 1) The attribute byte for a slot directory entry is always set
186 to 0x0f. This corresponds to an old directory entry with
187 attributes of "hidden", "system", "read-only", and "volume
188 label". Most old software will ignore any directory
189 entries with the "volume label" bit set. Real volume label
190 entries don't have the other three bits set.
191
192 2) The starting cluster is always set to 0, an impossible
193 value for a DOS file.
194
195Because the extended FAT system is backward compatible, it is
196possible for old software to modify directory entries. Measures must
197be taken to ensure the validity of slots. An extended FAT system can
198verify that a slot does in fact belong to an 8.3 directory entry by
199the following:
200
201 1) Positioning. Slots for a file always immediately precede
202 their corresponding 8.3 directory entry. In addition, each
203 slot has an id which marks its order in the extended file
204 name. Here is a very abbreviated view of an 8.3 directory
205 entry and its corresponding long name slots for the file
206 "My Big File.Extension which is long":
207
208 <preceding files...>
209 <slot #3, id = 0x43, characters = "h is long">
210 <slot #2, id = 0x02, characters = "xtension whic">
211 <slot #1, id = 0x01, characters = "My Big File.E">
212 <directory entry, name = "MYBIGFIL.EXT">
213
214 Note that the slots are stored from last to first. Slots
215 are numbered from 1 to N. The Nth slot is or'ed with 0x40
216 to mark it as the last one.
217
218 2) Checksum. Each slot has an "alias_checksum" value. The
219 checksum is calculated from the 8.3 name using the
220 following algorithm:
221
222 for (sum = i = 0; i < 11; i++) {
223 sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i];
224 }
225
226 3) If there is free space in the final slot, a Unicode NULL (0x0000)
227 is stored after the final character. After that, all unused
228 characters in the final slot are set to Unicode 0xFFFF.
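
The checksum in item 2 and the slot id convention in item 1 can be written out as compilable C. This is a sketch under stated assumptions: the function names are hypothetical, and the 8.3 name is taken to be the 11 raw bytes of the name and ext fields laid end to end. The important detail is that the sum is an 8-bit quantity, so the rotate-and-add wraps modulo 256.

```c
/* Checksum over the 11-byte 8.3 name (8 name bytes, then 3 extension bytes). */
static unsigned char alias_checksum(const unsigned char name[11])
{
	unsigned char sum = 0;
	int i;

	/* Rotate the running sum right by one bit, then add the next byte. */
	for (i = 0; i < 11; i++)
		sum = (unsigned char)((((sum & 1) << 7) | ((sum & 0xfe) >> 1))
				      + name[i]);
	return sum;
}

/* Sequence number of a slot: the id with the "last slot" flag removed. */
static int slot_seq(unsigned char id)
{
	return id & ~0x40;
}

/* Non-zero if this is the last (Nth) slot of the long name. */
static int slot_is_last(unsigned char id)
{
	return (id & 0x40) != 0;
}
```

For the example directory listing above, id 0x43 decodes to sequence number 3 with the last-slot flag set, while 0x02 and 0x01 are plain sequence numbers.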
229
230Finally, note that the extended name is stored in Unicode. Each Unicode
231character takes two bytes.
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
new file mode 100644
index 000000000000..3f318dd44c77
--- /dev/null
+++ b/Documentation/filesystems/vfs.txt
@@ -0,0 +1,671 @@
1/* -*- auto-fill -*- */
2
3 Overview of the Virtual File System
4
5 Richard Gooch <rgooch@atnf.csiro.au>
6
7 5-JUL-1999
8
9
10Conventions used in this document <section>
11=================================
12
13Each section in this document will have the string "<section>" at the
14right-hand side of the section title. Each subsection will have
15"<subsection>" at the right-hand side. These strings are meant to make
16it easier to search through the document.
17
18NOTE that the master copy of this document is available online at:
19http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt
20
21
22What is it? <section>
23===========
24
25The Virtual File System (otherwise known as the Virtual Filesystem
26Switch) is the software layer in the kernel that provides the
27filesystem interface to userspace programs. It also provides an
28abstraction within the kernel which allows different filesystem
29implementations to co-exist.
30
31
32A Quick Look At How It Works <section>
33============================
34
35In this section I'll briefly describe how things work, before
36launching into the details. I'll start with describing what happens
37when user programs open and manipulate files, and then look from the
38other view which is how a filesystem is supported and subsequently
39mounted.
40
41Opening a File <subsection>
42--------------
43
44The VFS implements the open(2), stat(2), chmod(2) and similar system
45calls. The pathname argument is used by the VFS to search through the
46directory entry cache (dentry cache or "dcache"). This provides a very
47fast look-up mechanism to translate a pathname (filename) into a
48specific dentry.
49
50An individual dentry usually has a pointer to an inode. Inodes are the
51things that live on disc drives, and can be regular files (you know:
52those things that you write data into), directories, FIFOs and other
53beasts. Dentries live in RAM and are never saved to disc: they exist
54only for performance. Inodes live on disc and are copied into memory
55when required. Later any changes are written back to disc. The inode
56that lives in RAM is a VFS inode, and it is this which the dentry
57points to. A single inode can be pointed to by multiple dentries
58(think about hardlinks).
59
60The dcache is meant to be a view into your entire filespace. Unlike
61Linus, most of us losers can't fit enough dentries into RAM to cover
62all of our filespace, so the dcache has bits missing. In order to
63resolve your pathname into a dentry, the VFS may have to resort to
64creating dentries along the way, and then loading the inode. This is
65done by looking up the inode.
66
67Looking up an inode (usually read from disc) requires that the VFS
68call the lookup() method of the parent directory inode. This method
69is installed by the specific filesystem implementation that the inode
70lives in. There will be more on this later.
71
72Once the VFS has the required dentry (and hence the inode), we can do
73all those boring things like open(2) the file, or stat(2) it to peek
74at the inode data. The stat(2) operation is fairly simple: once the
75VFS has the dentry, it peeks at the inode data and passes some of it
76back to userspace.
77
78Opening a file requires another operation: allocation of a file
79structure (this is the kernel-side implementation of file
80descriptors). The freshly allocated file structure is initialised with
81a pointer to the dentry and a set of file operation member functions.
82These are taken from the inode data. The open() file method is then
83called so the specific filesystem implementation can do its work. You
84can see that this is another switch performed by the VFS.
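
The switch described above can be sketched in userspace C. This is only
a model under stated assumptions: the model_* and examplefs_* names are
hypothetical stand-ins, not kernel symbols, and the real structures
carry many more members.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct model_file;
struct model_inode;

struct model_file_ops {
	int (*open)(struct model_inode *, struct model_file *);
};

struct model_inode {
	const struct model_file_ops *default_file_ops;
};

struct model_dentry {
	struct model_inode *d_inode;	/* NULL for a negative dentry */
};

struct model_file {
	struct model_dentry *f_dentry;
	const struct model_file_ops *f_op;
	void *private_data;
};

/*
 * Initialise a file structure from a dentry, then hand control to the
 * open() method supplied by the filesystem that owns the inode.
 */
int model_open(struct model_dentry *dentry, struct model_file *filp)
{
	if (dentry == NULL || dentry->d_inode == NULL)
		return -ENOENT;		/* nothing to open */

	filp->f_dentry = dentry;
	filp->f_op = dentry->d_inode->default_file_ops;
	filp->private_data = NULL;

	if (filp->f_op != NULL && filp->f_op->open != NULL)
		return filp->f_op->open(dentry->d_inode, filp);
	return 0;
}

/* Example filesystem open() method: remembers the inode per open file. */
int examplefs_open(struct model_inode *inode, struct model_file *filp)
{
	filp->private_data = inode;
	return 0;
}

const struct model_file_ops examplefs_file_ops = {
	.open = examplefs_open,
};
```

The point of the sketch is that model_open() never needs to know which
filesystem it is dealing with: it only follows the pointers installed
in the inode.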
85
86The file structure is placed into the file descriptor table for the
87process.
88
89Reading, writing and closing files (and other assorted VFS operations)
90is done by using the userspace file descriptor to grab the appropriate
91file structure, and then calling the required file structure method
92function to do whatever is required.
93
94For as long as the file is open, it keeps the dentry "open" (in use),
95which in turn means that the VFS inode is still in use.
96
97All VFS system calls (i.e. open(2), stat(2), read(2), write(2),
98chmod(2) and so on) are called from a process context. You should
99assume that these calls are made without any kernel locks being
100held. This means that the processes may be executing the same piece of
101filesystem or driver code at the same time, on different
102processors. You should ensure that access to shared resources is
103protected by appropriate locks.
104
105Registering and Mounting a Filesystem <subsection>
106-------------------------------------
107
108If you want to support a new kind of filesystem in the kernel, all you
109need to do is call register_filesystem(). You pass a structure
110describing the filesystem implementation (struct file_system_type)
111which is then added to an internal table of supported filesystems. You
112can do:
113
114% cat /proc/filesystems
115
116to see what filesystems are currently available on your system.
117
118When a request is made to mount a block device onto a directory in
119your filespace the VFS will call the appropriate method for the
120specific filesystem. The dentry for the mount point will then be
121updated to point to the root inode for the new filesystem.
122
123It's now time to look at things in more detail.
124
125
126struct file_system_type <section>
127=======================
128
129This describes the filesystem. As of kernel 2.1.99, the following
130members are defined:
131
132struct file_system_type {
133 const char *name;
134 int fs_flags;
135 struct super_block *(*read_super) (struct super_block *, void *, int);
136 struct file_system_type * next;
137};
138
139 name: the name of the filesystem type, such as "ext2", "iso9660",
140 "msdos" and so on
141
142 fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE)
143
144 read_super: the method to call when a new instance of this
145 filesystem should be mounted
146
147 next: for internal VFS use: you should initialise this to NULL
148
149The read_super() method has the following arguments:
150
151 struct super_block *sb: the superblock structure. This is partially
152 initialised by the VFS and the rest must be initialised by the
153 read_super() method
154
155 void *data: arbitrary mount options, usually comes as an ASCII
156 string
157
158 int silent: whether or not to be silent on error
159
160The read_super() method must determine if the block device specified
161in the superblock contains a filesystem of the type the method
162supports. On success the method returns the superblock pointer, on
163failure it returns NULL.
164
165The most interesting member of the superblock structure that the
166read_super() method fills in is the "s_op" field. This is a pointer to
167a "struct super_operations" which describes the next level of the
168filesystem implementation.
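
As a rough illustration of the "internal table of supported
filesystems", here is a userspace sketch of a registration list chained
through the "next" member. It is an assumption about how such a table
could behave, not the kernel's implementation: the model_* names are
hypothetical, read_super is left out, and the error codes are chosen
for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Reduced copy of the structure shown above; read_super is omitted. */
struct fs_type_model {
	const char *name;
	int fs_flags;
	struct fs_type_model *next;	/* owned by the table: start as NULL */
};

static struct fs_type_model *fs_table;	/* head of the registered list */

/*
 * Append fs to the table. Fails if the structure is already linked
 * into a list or if another entry uses the same name.
 */
int model_register_filesystem(struct fs_type_model *fs)
{
	struct fs_type_model **p;

	if (fs == NULL || fs->name == NULL)
		return -EINVAL;
	if (fs->next != NULL)
		return -EBUSY;
	for (p = &fs_table; *p != NULL; p = &(*p)->next)
		if (strcmp((*p)->name, fs->name) == 0)
			return -EBUSY;
	*p = fs;
	return 0;
}

/* Find a registered type by name, as a mount request would need to. */
struct fs_type_model *model_get_fs_type(const char *name)
{
	struct fs_type_model *fs;

	for (fs = fs_table; fs != NULL; fs = fs->next)
		if (strcmp(fs->name, name) == 0)
			return fs;
	return NULL;
}
```

This also shows why "next" should be initialised to NULL: in this model
a non-NULL value is taken to mean the structure is already on a list.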
169
170
171struct super_operations <section>
172=======================
173
174This describes how the VFS can manipulate the superblock of your
175filesystem. As of kernel 2.1.99, the following members are defined:
176
177struct super_operations {
178 void (*read_inode) (struct inode *);
179 int (*write_inode) (struct inode *, int);
180 void (*put_inode) (struct inode *);
181 void (*drop_inode) (struct inode *);
182 void (*delete_inode) (struct inode *);
183 int (*notify_change) (struct dentry *, struct iattr *);
184 void (*put_super) (struct super_block *);
185 void (*write_super) (struct super_block *);
186 int (*statfs) (struct super_block *, struct statfs *, int);
187 int (*remount_fs) (struct super_block *, int *, char *);
188 void (*clear_inode) (struct inode *);
189};
190
191All methods are called without any locks being held, unless otherwise
192noted. This means that most methods can block safely. All methods are
193only called from a process context (i.e. not from an interrupt handler
194or bottom half).
195
196 read_inode: this method is called to read a specific inode from the
197 mounted filesystem. The "i_ino" member in the "struct inode"
198 will be initialised by the VFS to indicate which inode to
199 read. Other members are filled in by this method
200
201 write_inode: this method is called when the VFS needs to write an
202 inode to disc. The second parameter indicates whether the write
203 should be synchronous or not; not all filesystems check this flag
204
205 put_inode: called when the VFS inode is removed from the inode
206 cache. This method is optional
207
208 drop_inode: called when the last access to the inode is dropped,
209 with the inode_lock spinlock held.
210
211 This method should be either NULL (normal unix filesystem
212 semantics) or "generic_delete_inode" (for filesystems that do not
213 want to cache inodes - causing "delete_inode" to always be
214 called regardless of the value of i_nlink)
215
216 The "generic_delete_inode()" behaviour is equivalent to the
217 old practice of using "force_delete" in the put_inode() case,
218 but does not have the races that the "force_delete()" approach
219 had.
220
221 delete_inode: called when the VFS wants to delete an inode
222
223 notify_change: called when VFS inode attributes are changed. If this
224 is NULL the VFS falls back to the write_inode() method. This
225 is called with the kernel lock held
226
227 put_super: called when the VFS wishes to free the superblock
228 (i.e. unmount). This is called with the superblock lock held
229
230 write_super: called when the VFS superblock needs to be written to
231 disc. This method is optional
232
233 statfs: called when the VFS needs to get filesystem statistics. This
234 is called with the kernel lock held
235
236 remount_fs: called when the filesystem is remounted. This is called
237 with the kernel lock held
238
239 clear_inode: called when the VFS clears the inode. Optional
240
241The read_inode() method is responsible for filling in the "i_op"
242field. This is a pointer to a "struct inode_operations" which
243describes the methods that can be performed on individual inodes.
244
245
246struct inode_operations <section>
247=======================
248
249This describes how the VFS can manipulate an inode in your
250filesystem. As of kernel 2.1.99, the following members are defined:
251
252struct inode_operations {
253 struct file_operations * default_file_ops;
254 int (*create) (struct inode *,struct dentry *,int);
255 int (*lookup) (struct inode *,struct dentry *);
256 int (*link) (struct dentry *,struct inode *,struct dentry *);
257 int (*unlink) (struct inode *,struct dentry *);
258 int (*symlink) (struct inode *,struct dentry *,const char *);
259 int (*mkdir) (struct inode *,struct dentry *,int);
260 int (*rmdir) (struct inode *,struct dentry *);
261 int (*mknod) (struct inode *,struct dentry *,int,dev_t);
262 int (*rename) (struct inode *, struct dentry *,
263 struct inode *, struct dentry *);
264 int (*readlink) (struct dentry *, char *,int);
265 struct dentry * (*follow_link) (struct dentry *, struct dentry *);
266 int (*readpage) (struct file *, struct page *);
267 int (*writepage) (struct page *page, struct writeback_control *wbc);
268 int (*bmap) (struct inode *,int);
269 void (*truncate) (struct inode *);
270 int (*permission) (struct inode *, int);
271 int (*smap) (struct inode *,int);
272 int (*updatepage) (struct file *, struct page *, const char *,
273 unsigned long, unsigned int, int);
274 int (*revalidate) (struct dentry *);
275};
276
277Again, all methods are called without any locks being held, unless
278otherwise noted.
279
280 default_file_ops: this is a pointer to a "struct file_operations"
281 which describes how to open and then manipulate open files
282
283 create: called by the open(2) and creat(2) system calls. Only
284 required if you want to support regular files. The dentry you
285 get should not have an inode (i.e. it should be a negative
286 dentry). Here you will probably call d_instantiate() with the
287 dentry and the newly created inode
288
289 lookup: called when the VFS needs to look up an inode in a parent
290 directory. The name to look for is found in the dentry. This
291 method must call d_add() to insert the found inode into the
292 dentry. The "i_count" field in the inode structure should be
293 incremented. If the named inode does not exist a NULL inode
294 should be inserted into the dentry (this is called a negative
295 dentry). Returning an error code from this routine must only
296 be done on a real error, otherwise creating inodes with system
297 calls like creat(2), mknod(2), mkdir(2) and so on will fail.
298 If you wish to overload the dentry methods then you should
299 initialise the "d_op" field in the dentry; this is a pointer
300 to a struct "dentry_operations".
301 This method is called with the directory inode semaphore held
302
303 link: called by the link(2) system call. Only required if you want
304 to support hard links. You will probably need to call
305 d_instantiate() just as you would in the create() method
306
307 unlink: called by the unlink(2) system call. Only required if you
308 want to support deleting inodes
309
310 symlink: called by the symlink(2) system call. Only required if you
311 want to support symlinks. You will probably need to call
312 d_instantiate() just as you would in the create() method
313
314 mkdir: called by the mkdir(2) system call. Only required if you want
315 to support creating subdirectories. You will probably need to
316 call d_instantiate() just as you would in the create() method
317
318 rmdir: called by the rmdir(2) system call. Only required if you want
319 to support deleting subdirectories
320
321 mknod: called by the mknod(2) system call to create a device (char,
322 block) inode or a named pipe (FIFO) or socket. Only required
323 if you want to support creating these types of inodes. You
324 will probably need to call d_instantiate() just as you would
325 in the create() method
326
327 readlink: called by the readlink(2) system call. Only required if
328 you want to support reading symbolic links
329
330 follow_link: called by the VFS to follow a symbolic link to the
331 inode it points to. Only required if you want to support
332 symbolic links
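
The contract of the lookup() method (always attach something to the
dentry, and treat a missing name as success) can be modelled in
userspace. The lk_* names below are hypothetical and the directory is
just an array; only the control flow is meant to match the description
above.

```c
#include <stddef.h>
#include <string.h>

struct lk_inode {
	unsigned long i_ino;
	int i_count;
};

struct lk_dentry {
	const char *name;		/* the name to look for */
	struct lk_inode *d_inode;	/* NULL makes this a negative dentry */
	int hashed;
};

struct lk_dirent {
	const char *name;
	struct lk_inode *inode;
};

/* Stand-in for d_add(): hash the dentry and attach the inode (or NULL). */
void lk_d_add(struct lk_dentry *dentry, struct lk_inode *inode)
{
	dentry->hashed = 1;
	dentry->d_inode = inode;
}

/*
 * Model of a lookup() method for a directory holding n entries.
 * A name that does not exist is NOT an error: the dentry is made
 * negative and 0 is returned, so that a later create can succeed.
 */
int lk_lookup(const struct lk_dirent *dir, size_t n, struct lk_dentry *dentry)
{
	struct lk_inode *inode = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(dir[i].name, dentry->name) == 0) {
			inode = dir[i].inode;
			inode->i_count++;	/* the dentry now holds a reference */
			break;
		}
	}
	lk_d_add(dentry, inode);
	return 0;
}
```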
333
334
335struct file_operations <section>
336======================
337
338This describes how the VFS can manipulate an open file. As of kernel
3392.1.99, the following members are defined:
340
341struct file_operations {
342 loff_t (*llseek) (struct file *, loff_t, int);
343 ssize_t (*read) (struct file *, char *, size_t, loff_t *);
344 ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
345 int (*readdir) (struct file *, void *, filldir_t);
346 unsigned int (*poll) (struct file *, struct poll_table_struct *);
347 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
348 int (*mmap) (struct file *, struct vm_area_struct *);
349 int (*open) (struct inode *, struct file *);
350 int (*release) (struct inode *, struct file *);
351 int (*fsync) (struct file *, struct dentry *);
352 int (*fasync) (struct file *, int);
353 int (*check_media_change) (kdev_t dev);
354 int (*revalidate) (kdev_t dev);
355 int (*lock) (struct file *, int, struct file_lock *);
356};
357
358Again, all methods are called without any locks being held, unless
359otherwise noted.
360
361 llseek: called when the VFS needs to move the file position index
362
363 read: called by read(2) and related system calls
364
365 write: called by write(2) and related system calls
366
367 readdir: called when the VFS needs to read the directory contents
368
369 poll: called by the VFS when a process wants to check if there is
370 activity on this file and (optionally) go to sleep until there
371 is activity. Called by the select(2) and poll(2) system calls
372
373 ioctl: called by the ioctl(2) system call
374
375 mmap: called by the mmap(2) system call
376
377 open: called by the VFS when an inode should be opened. When the VFS
378 opens a file, it creates a new "struct file" and initialises
379 the "f_op" file operations member with the "default_file_ops"
380 field in the inode structure. It then calls the open method
381 for the newly allocated file structure. You might think that
382 the open method really belongs in "struct inode_operations",
383 and you may be right. I think it's done the way it is because
384 it makes filesystems simpler to implement. The open() method
385 is a good place to initialise the "private_data" member in the
386 file structure if you want to point to a device structure
387
388 release: called when the last reference to an open file is closed
389
390 fsync: called by the fsync(2) system call
391
392 fasync: called by the fcntl(2) system call when asynchronous
393 (non-blocking) mode is enabled for a file
394
395Note that the file operations are implemented by the specific
396filesystem in which the inode resides. When opening a device node
397(character or block special) most filesystems will call special
398support routines in the VFS which will locate the required device
399driver information. These support routines replace the filesystem file
400operations with those for the device driver, and then proceed to call
401the new open() method for the file. This is how opening a device file
402in the filesystem eventually ends up calling the device driver open()
403method. Note the devfs (the Device FileSystem) has a more direct path
404from device node to device driver (this is an unofficial kernel
405patch).
406
407
408Directory Entry Cache (dcache) <section>
409==============================
410
411struct dentry_operations
412------------------------
413
414This describes how a filesystem can overload the standard dentry
415operations. Dentries and the dcache are the domain of the VFS and the
416individual filesystem implementations. Device drivers have no business
417here. These methods may be set to NULL, as they are either optional or
418the VFS uses a default. As of kernel 2.1.99, the following members are
419defined:
420
421struct dentry_operations {
422 int (*d_revalidate)(struct dentry *);
423 int (*d_hash) (struct dentry *, struct qstr *);
424 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
425 void (*d_delete)(struct dentry *);
426 void (*d_release)(struct dentry *);
427 void (*d_iput)(struct dentry *, struct inode *);
428};
429
430 d_revalidate: called when the VFS needs to revalidate a dentry. This
431 is called whenever a name look-up finds a dentry in the
432 dcache. Most filesystems leave this as NULL, because all their
433 dentries in the dcache are valid
434
435 d_hash: called when the VFS adds a dentry to the hash table
436
437 d_compare: called when a dentry should be compared with another
438
439 d_delete: called when the last reference to a dentry is
440 deleted. This means no-one is using the dentry, however it is
441 still valid and in the dcache
442
443 d_release: called when a dentry is really deallocated
444
445 d_iput: called when a dentry loses its inode (just prior to its
446 being deallocated). The default when this is NULL is that the
447 VFS calls iput(). If you define this method, you must call
448 iput() yourself
449
450Each dentry has a pointer to its parent dentry, as well as a hash list
451of child dentries. Child dentries are basically like files in a
452directory.
453
454Directory Entry Cache APIs
455--------------------------
456
457There are a number of functions defined which permit a filesystem to
458manipulate dentries:
459
460 dget: open a new handle for an existing dentry (this just increments
461 the usage count)
462
463 dput: close a handle for a dentry (decrements the usage count). If
464 the usage count drops to 0, the "d_delete" method is called
465 and the dentry is placed on the unused list if the dentry is
466 still in its parent's hash list. Putting the dentry on the
467 unused list just means that if the system needs some RAM, it
468 goes through the unused list of dentries and deallocates them.
469 If the dentry has already been unhashed when the usage count
470 drops to 0, the dentry is deallocated after the "d_delete"
471 method is called
472
473 d_drop: this unhashes a dentry from its parent's hash list. A
474 subsequent call to dput() will deallocate the dentry if its
475 usage count drops to 0
476
477 d_delete: delete a dentry. If there are no other open references to
478 the dentry then the dentry is turned into a negative dentry
479 (the d_iput() method is called). If there are other
480 references, then d_drop() is called instead
481
482 d_add: adds a dentry to its parent's hash list and then calls
483 d_instantiate()
484
485 d_instantiate: add a dentry to the alias hash list for the inode and
486 updates the "d_inode" member. The "i_count" member in the
487 inode structure should be set/incremented. If the inode
488 pointer is NULL, the dentry is called a "negative
489 dentry". This function is commonly called when an inode is
490 created for an existing negative dentry
491
492 d_lookup: look up a dentry given its parent and path name component.
493 It looks up the child of that given name from the dcache
494 hash table. If it is found, the reference count is incremented
495 and the dentry is returned. The caller must use dput()
496 to free the dentry when it finishes using it.
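
The interplay of dget(), dput() and d_drop() described above can be
sketched as a small state model. This is a single-threaded userspace
illustration with hypothetical model_* names; it tracks only the usage
count and the hashed state and ignores locking and the inode.

```c
#include <stdbool.h>
#include <stddef.h>

enum dentry_state { DENTRY_IN_USE, DENTRY_UNUSED, DENTRY_FREED };

struct dentry_model {
	int d_count;			/* usage count */
	bool hashed;			/* still on the parent's hash list? */
	enum dentry_state state;
	int d_delete_calls;		/* how often the "d_delete" method ran */
};

/* Open a new handle: just bump the usage count. */
struct dentry_model *model_dget(struct dentry_model *d)
{
	d->d_count++;
	d->state = DENTRY_IN_USE;
	return d;
}

/* Unhash the dentry; it stays allocated while handles remain. */
void model_d_drop(struct dentry_model *d)
{
	d->hashed = false;
}

/*
 * Close a handle. When the count reaches 0 the "d_delete" method runs,
 * then a hashed dentry is parked on the unused list while an unhashed
 * one is deallocated.
 */
void model_dput(struct dentry_model *d)
{
	if (--d->d_count > 0)
		return;
	d->d_delete_calls++;
	d->state = d->hashed ? DENTRY_UNUSED : DENTRY_FREED;
}
```

In the model, a dentry on the unused list can be revived with
model_dget(), which matches the idea that unused dentries are only
reclaimed when the system needs the RAM.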
497
498
499RCU-based dcache locking model
500------------------------------
501
502On many workloads, the most common operation on dcache is
503to look up a dentry, given a parent dentry and the name
504of the child. Typically, for every open(), stat() etc.,
505the dentry corresponding to the pathname will be looked
506up by walking the tree starting with the first component
507of the pathname and using that dentry along with the next
508component to look up the next level and so on. Since it
509is a frequent operation for workloads like multiuser
510environments and webservers, it is important to optimize
511this path.
512
513Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
514in every component during path look-up. From 2.5.10 onwards, the
515fastwalk algorithm changed this by holding the dcache_lock
516at the beginning and walking as many cached path component
517dentries as possible. This significantly decreases the number
518of acquisitions of dcache_lock. However it also increases the
519lock hold time significantly and affects performance on large
520SMP machines. Since the 2.5.62 kernel, dcache has been using
521a new locking model that uses RCU to make dcache look-up
522lock-free.
523
524The current dcache locking model is not very different from the previous
525dcache locking model. Prior to the 2.5.62 kernel, dcache_lock
526protected the hash chain, d_child, d_alias, d_lru lists as well
527as d_inode and several other things like mount look-up. RCU-based
528changes affect only the way the hash chain is protected. For everything
529else the dcache_lock must be taken for both traversing as well as
530updating. Updates to the hash chain also take the dcache_lock.
531The significant change is the way d_lookup traverses the hash chain:
532it doesn't acquire the dcache_lock for this and relies on RCU to
533ensure that the dentry has not been *freed*.
534
535
536Dcache locking details
537----------------------
538For many multi-user workloads, open() and stat() on files are
539very frequently occurring operations. Both involve walking
540of path names to find the dentry corresponding to the
541concerned file. In the 2.4 kernel, dcache_lock was held
542during look-up of each path component. Contention and
543cacheline bouncing of this global lock caused significant
544scalability problems. With the introduction of RCU
545in the Linux kernel, this was worked around by making
546the look-up of path components during path walking lock-free.
547
548
549Safe lock-free look-up of dcache hash table
550===========================================
551
552Dcache is a complex data structure with the hash table entries
553also linked together in other lists. In the 2.4 kernel, dcache_lock
554protected all the lists. We applied RCU only on hash chain
555walking. The rest of the lists are still protected by dcache_lock.
556Some of the important changes are:
557
5581. Deletion from the hash chain is done using the hlist_del_rcu() macro,
559 which doesn't initialize the next pointer of the deleted dentry; this
560 allows us to walk safely lock-free while a deletion is happening.
561
5622. Insertion of a dentry into the hash table is done using
563 hlist_add_head_rcu() which takes care of ordering the writes -
564 the writes to the dentry must be visible before the dentry
565 is inserted. This works in conjunction with hlist_for_each_rcu()
566 while walking the hash chain. The only requirement is that
567 all initialization to the dentry must be done before hlist_add_head_rcu()
568 since we don't have dcache_lock protection while traversing
569 the hash chain. This isn't different from the existing code.
570
5713. A dentry looked up without holding dcache_lock cannot be
572 returned for walking if it is unhashed. It may then have a NULL
573 d_inode or other bogosity since RCU doesn't protect the other
574 fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
575 indicate unhashed dentries and use this in conjunction with a
576 per-dentry lock (d_lock). Once looked up without the dcache_lock,
577 we acquire the per-dentry lock (d_lock) and check if the
578 dentry is unhashed. If so, the look-up is failed. If not, the
579 reference count of the dentry is increased and the dentry is returned.
580
5814. Once a dentry is looked up, it must be ensured during the path
582 walk for that component it doesn't go away. In pre-2.5.10 code,
583 this was done holding a reference to the dentry. dcache_rcu does
584 the same. In some sense, dcache_rcu path walking looks like
585 the pre-2.5.10 version.
586
5875. All dentry hash chain updates must take the dcache_lock as well as
588 the per-dentry lock in that order. dput() does this to ensure
589 that a dentry that has just been looked up in another CPU
590 doesn't get deleted before dget() can be done on it.
591
5926. There are several ways to do reference counting of RCU protected
593 objects. One such example is in ipv4 route cache where
594 deferred freeing (using call_rcu()) is done as soon as
595 the reference count goes to zero. This cannot be done in
596 the case of dentries because tearing down of dentries
597 requires blocking (dentry_iput()) which isn't supported from
598 RCU callbacks. Instead, tearing down of dentries happens
599 synchronously in dput(), but actual freeing happens later
600 when the RCU grace period is over. This allows safe lock-free
601 walking of the hash chains, but a matched dentry may have
602 been partially torn down. The checking of DCACHE_UNHASHED
603 flag with d_lock held detects such dentries and prevents
604 them from being returned from look-up.
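
Points 3 and 6 above boil down to one rule: walk the chain without the
global lock, then re-check the candidate under its own lock before
taking a reference. The sketch below models that rule in userspace C.
It is not the kernel's __d_lookup(): the model_* names are
hypothetical, a C11 atomic_flag stands in for the d_lock spinlock, and
the RCU read-side section and memory barriers are only hinted at in
comments.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define MODEL_DCACHE_UNHASHED 0x1	/* stand-in for DCACHE_UNHASHED */

struct rcu_dentry_model {
	struct rcu_dentry_model *next;		/* hash chain, NULL-terminated */
	struct rcu_dentry_model *d_parent;
	const char *name;
	unsigned int d_flags;			/* protected by d_lock */
	int d_count;				/* protected by d_lock in this model */
	atomic_flag d_lock;
};

static void model_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set(l))
		;				/* spin until the flag is clear */
}

static void model_unlock(atomic_flag *l)
{
	atomic_flag_clear(l);
}

/* Return a referenced dentry for (parent, name), or NULL. */
struct rcu_dentry_model *model_lookup(struct rcu_dentry_model *chain,
				      struct rcu_dentry_model *parent,
				      const char *name)
{
	struct rcu_dentry_model *d;

	/* In the kernel this walk would sit inside an RCU read-side section. */
	for (d = chain; d != NULL; d = d->next) {
		if (d->d_parent != parent)
			continue;		/* cheap check without the lock */

		model_lock(&d->d_lock);
		/* Re-check under d_lock: parent and name are stable now. */
		if (d->d_parent == parent && strcmp(d->name, name) == 0) {
			struct rcu_dentry_model *found = NULL;

			if (!(d->d_flags & MODEL_DCACHE_UNHASHED)) {
				d->d_count++;	/* take the reference */
				found = d;
			}
			/* An unhashed match fails the look-up. */
			model_unlock(&d->d_lock);
			return found;
		}
		model_unlock(&d->d_lock);
	}
	return NULL;
}
```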
605
606
607Maintaining POSIX rename semantics
608==================================
609
610Since look-up of dentries is lock-free, it can race against
611a concurrent rename operation. For example, during rename
612of file A to B, look-up of either A or B must succeed.
613So, if look-up of B happens after A has been removed from the
614hash chain but not added to the new hash chain, it may fail.
615Also, a comparison while the name is being written concurrently
616by a rename may result in false positive matches violating
617rename semantics. Issues related to race with rename are
618handled as described below:
619
6201. Look-up can be done in two ways - d_lookup() which is safe
621 from simultaneous renames and __d_lookup() which is not.
622 If __d_lookup() fails, it must be followed up by a d_lookup()
623 to correctly determine whether a dentry is in the hash table
624 or not. d_lookup() protects look-ups using a sequence
625 lock (rename_lock).
626
6272. The name associated with a dentry (d_name) may be changed if
628 a rename is allowed to happen simultaneously. To prevent memcmp()
629 in __d_lookup() from going out of bounds due to a rename, and to
630 avoid false positive comparisons, the name comparison is done while
631 holding the per-dentry lock. This prevents concurrent renames during
632 this operation.
633
6343. Hash table walking during look-up may move to a different bucket as
635 the current dentry is moved to a different bucket due to rename.
636 But we use hlists in dcache hash table and they are null-terminated.
637 So, even if a dentry moves to a different bucket, hash chain
638 walk will terminate. [with a list_head list, it may not since
639 termination is when the list_head in the original bucket is reached].
640 Since we redo the d_parent check and compare name while holding
641 d_lock, lock-free look-up will not race against d_move().
642
6434. There can be a theoretical race when a dentry keeps coming back
644 to the original bucket due to double moves. Because of this, look-up may
645 consider that it has never moved and can end up in an infinite loop.
646 But this is not any worse than the theoretical livelocks we already
647 have in the kernel.
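
The retry loop from point 1 (a failed fast look-up is only trusted if
no rename ran in the meantime) can be sketched with a toy sequence
counter. Everything here is a single-threaded userspace model with
hypothetical names: a real sequence lock needs memory barriers and
makes readers wait while a writer is active, and racing_fast_lookup()
merely simulates renames that complete during the look-up.

```c
#include <stddef.h>

/* Toy sequence lock: the counter is bumped on write lock and unlock. */
struct seqlock_model {
	unsigned int sequence;
};

unsigned int model_read_seqbegin(const struct seqlock_model *sl)
{
	return sl->sequence;
}

int model_read_seqretry(const struct seqlock_model *sl, unsigned int start)
{
	return sl->sequence != start;	/* a writer ran: result is suspect */
}

void model_write_seqlock(struct seqlock_model *sl)   { sl->sequence++; }
void model_write_sequnlock(struct seqlock_model *sl) { sl->sequence++; }

typedef void *(*fast_lookup_fn)(void *ctx);

/*
 * Model of d_lookup(): a hit is returned at once, a miss is retried
 * for as long as a rename overlapped the attempt.
 */
void *model_d_lookup(struct seqlock_model *rename_lock,
		     fast_lookup_fn fast, void *ctx, int *attempts)
{
	void *dentry;
	unsigned int seq;

	do {
		seq = model_read_seqbegin(rename_lock);
		dentry = fast(ctx);
		(*attempts)++;
		if (dentry != NULL)
			break;
	} while (model_read_seqretry(rename_lock, seq));
	return dentry;
}

/* Simulated fast path that loses a race with the first few renames. */
struct race_ctx {
	struct seqlock_model *rename_lock;
	int renames_left;
	int target;			/* the object found once renames stop */
};

void *racing_fast_lookup(void *arg)
{
	struct race_ctx *ctx = arg;

	if (ctx->renames_left > 0) {
		ctx->renames_left--;
		model_write_seqlock(ctx->rename_lock);
		model_write_sequnlock(ctx->rename_lock);
		return NULL;		/* dentry was between hash chains */
	}
	return &ctx->target;
}
```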
648
649
650Important guidelines for filesystem developers related to dcache_rcu
651====================================================================
652
6531. Existing dcache interfaces (pre-2.5.62) exported to filesystems
654 don't change. Only the dcache internal implementation changes. However
655 filesystems *must not* delete from the dentry hash chains directly
656 using the list macros as was allowed earlier. They must use dcache
657 APIs like d_drop() or __d_drop() depending on the situation.
658
6592. d_flags is now protected by a per-dentry lock (d_lock). All
660 access to d_flags must be protected by it.
661
6623. For a hashed dentry, checking of d_count needs to be protected
663 by d_lock.
664
665
666Papers and other documentation on dcache locking
667================================================
668
6691. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
670
6712. http://lse.sourceforge.net/locking/dcache/dcache.html
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
new file mode 100644
index 000000000000..c7d5d0c7067d
--- /dev/null
+++ b/Documentation/filesystems/xfs.txt
@@ -0,0 +1,188 @@
1
2The SGI XFS Filesystem
3======================
4
5XFS is a high performance journaling filesystem which originated
6on the SGI IRIX platform. It is completely multi-threaded, can
7support large files and large filesystems, extended attributes,
8variable block sizes, is extent based, and makes extensive use of
9Btrees (directories, extents, free space) to aid both performance
10and scalability.
11
12Refer to the documentation at http://oss.sgi.com/projects/xfs/
13for further details. This implementation is on-disk compatible
14with the IRIX version of XFS.
15
16
17Mount Options
18=============
19
20When mounting an XFS filesystem, the following options are accepted.
21
22 biosize=size
23 Sets the preferred buffered I/O size (default size is 64K).
24 "size" must be expressed as the logarithm (base2) of the
25 desired I/O size.
26 Valid values for this option are 14 through 16, inclusive
27 (i.e. 16K, 32K, and 64K bytes). On machines with a 4K
28 pagesize, 13 (8K bytes) is also a valid size.
29 The preferred buffered I/O size can also be altered on an
30 individual file basis using the ioctl(2) system call.
31
32 ikeep/noikeep
33 When inode clusters are emptied of inodes, keep them around
34 on the disk (ikeep) - this is the traditional XFS behaviour
35 and is still the default for now. Using the noikeep option,
36 inode clusters are returned to the free space pool.
37
38 logbufs=value
39 Set the number of in-memory log buffers. Valid numbers range
40 from 2-8 inclusive.
41 The default value is 8 buffers for filesystems with a
42 blocksize of 64K, 4 buffers for filesystems with a blocksize
43 of 32K, 3 buffers for filesystems with a blocksize of 16K
44 and 2 buffers for all other configurations. Increasing the
45 number of buffers may increase performance on some workloads
46 at the cost of the memory used for the additional log buffers
47 and their associated control structures.
48
49 logbsize=value
50 Set the size of each in-memory log buffer.
51 Size may be specified in bytes, or in kilobytes with a "k" suffix.
52 Valid sizes for version 1 and version 2 logs are 16384 (16k) and
53 32768 (32k). Valid sizes for version 2 logs also include
54 65536 (64k), 131072 (128k) and 262144 (256k).
55 The default value for machines with more than 32MB of memory
56 is 32768, machines with less memory use 16384 by default.
57
58 logdev=device and rtdev=device
59 Use an external log (metadata journal) and/or real-time device.
60 An XFS filesystem has up to three parts: a data section, a log
61 section, and a real-time section. The real-time section is
62 optional, and the log section can be separate from the data
63 section or contained within it.
64
65 noalign
66 Data allocations will not be aligned at stripe unit boundaries.
67
68 noatime
69 Access timestamps are not updated when a file is read.
70
71 norecovery
72 The filesystem will be mounted without running log recovery.
73 If the filesystem was not cleanly unmounted, it is likely to
74 be inconsistent when mounted in "norecovery" mode.
75 Some files or directories may not be accessible because of this.
76 Filesystems mounted "norecovery" must be mounted read-only or
77 the mount will fail.
78
79 nouuid
80 Don't check for double-mounted filesystems using the filesystem UUID.
81 This is useful to mount LVM snapshot volumes.
82
83 osyncisosync
84 Make O_SYNC writes implement true O_SYNC. WITHOUT this option,
85 Linux XFS behaves as if an "osyncisdsync" option is used,
86 which will make writes to files opened with the O_SYNC flag set
87 behave as if the O_DSYNC flag had been used instead.
88 This can result in better performance without compromising
89 data safety.
90 However if this option is not in effect, timestamp updates from
91 O_SYNC writes can be lost if the system crashes.
92 If timestamp updates are critical, use the osyncisosync option.
93
94 quota/usrquota/uqnoenforce
95 User disk quota accounting enabled, and limits (optionally)
96 enforced.
97
98 grpquota/gqnoenforce
99 Group disk quota accounting enabled and limits (optionally)
100 enforced.
101
102 sunit=value and swidth=value
103 Used to specify the stripe unit and width for a RAID device or
104 a stripe volume. "value" must be specified in 512-byte block
105 units.
106 If this option is not specified and the filesystem was made on
107 a stripe volume or the stripe width or unit were specified for
108 the RAID device at mkfs time, then the mount system call will
109 restore the value from the superblock. For filesystems that
110 are made directly on RAID devices, these options can be used
111 to override the information in the superblock if the underlying
112 disk layout changes after the filesystem has been created.
113 The "swidth" option is required if the "sunit" option has been
114 specified, and must be a multiple of the "sunit" value.
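
The unit conversion above can be sketched in Python. This example
assumes a simple striped layout in which the stripe width is the stripe
unit multiplied by the number of data disks; that relationship, the
function name stripe_mount_options, and its parameters are assumptions
made for illustration. The text above only requires that swidth be a
multiple of sunit.

```python
SECTOR_BYTES = 512  # sunit and swidth are given in 512-byte units


def stripe_mount_options(stripe_unit_bytes, data_disks):
    """Build an "sunit=...,swidth=..." string (hypothetical helper).

    Assumes swidth = sunit * data_disks, which keeps swidth a
    multiple of sunit as the mount option requires.
    """
    if stripe_unit_bytes <= 0 or stripe_unit_bytes % SECTOR_BYTES != 0:
        raise ValueError("stripe unit must be a positive multiple of 512 bytes")
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    sunit = stripe_unit_bytes // SECTOR_BYTES
    swidth = sunit * data_disks
    return f"sunit={sunit},swidth={swidth}"
```

With a 64 KiB stripe unit across 4 data disks,
stripe_mount_options(64 * 1024, 4) yields "sunit=128,swidth=512".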
115
116sysctls
117=======
118
119The following sysctls are available for the XFS filesystem:
120
121 fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1)
122 Setting this to "1" clears accumulated XFS statistics
123 in /proc/fs/xfs/stat. It then immediately resets to "0".
124
125 fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
126 The interval at which the xfssyncd thread flushes metadata
127 out to disk. This thread will flush log activity out, and
128 do some processing on unlinked inodes.
129
130 fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000)
131 The interval at which xfsbufd scans the dirty metadata buffers list.
132
133 fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000)
134 The age at which xfsbufd flushes dirty metadata buffers to disk.
135
136 fs.xfs.error_level (Min: 0 Default: 3 Max: 11)
137 A volume knob for error reporting when internal errors occur.
138 This will generate detailed messages and backtraces for filesystem
139 shutdowns, for example. Current threshold values are:
140
141 XFS_ERRLEVEL_OFF: 0
142 XFS_ERRLEVEL_LOW: 1
143 XFS_ERRLEVEL_HIGH: 5
144
145 fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127)
146 Causes certain error conditions to call BUG(). Value is a bitmask;
147 OR together the tags which represent errors which should cause panics:
148
149 XFS_NO_PTAG 0
150 XFS_PTAG_IFLUSH 0x00000001
151 XFS_PTAG_LOGRES 0x00000002
152 XFS_PTAG_AILDELETE 0x00000004
153 XFS_PTAG_ERROR_REPORT 0x00000008
154 XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010
155 XFS_PTAG_SHUTDOWN_IOERROR 0x00000020
156 XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040
157
158 This option is intended for debugging only.
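
A minimal sketch of how a panic_mask value might be composed: each tag
is a single bit, so tags are combined with bitwise OR. The constants are
copied from the table above; the helper name build_panic_mask is
hypothetical and exists only for this example.

```python
# Tag values copied from the table above.
XFS_NO_PTAG = 0
XFS_PTAG_IFLUSH = 0x00000001
XFS_PTAG_LOGRES = 0x00000002
XFS_PTAG_AILDELETE = 0x00000004
XFS_PTAG_ERROR_REPORT = 0x00000008
XFS_PTAG_SHUTDOWN_CORRUPT = 0x00000010
XFS_PTAG_SHUTDOWN_IOERROR = 0x00000020
XFS_PTAG_SHUTDOWN_LOGERROR = 0x00000040

PANIC_MASK_MAX = 127  # documented maximum for fs.xfs.panic_mask


def build_panic_mask(*tags):
    """OR the given tags into one value (hypothetical helper)."""
    mask = XFS_NO_PTAG
    for tag in tags:
        mask |= tag
    if not 0 <= mask <= PANIC_MASK_MAX:
        raise ValueError(f"mask {mask} is outside 0..{PANIC_MASK_MAX}")
    return mask
```

For example, build_panic_mask(XFS_PTAG_SHUTDOWN_CORRUPT,
XFS_PTAG_SHUTDOWN_IOERROR) gives 48 (0x30), and OR-ing all seven tags
gives 127, the documented maximum.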
159
160 fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1)
161 Controls whether symlinks are created with mode 0777 (default)
162 or whether their mode is affected by the umask (irix mode).
163
164 fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1)
165 Controls the ISGID bit on files created in SGID directories.
166 If the group ID of the new file does not match the effective group
167 ID or one of the supplementary group IDs of the parent directory, the
168 ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
169 is set.
170
171 fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1)
172 Controls whether unprivileged users can use chown to "give away"
173 a file to another user.
174
175 fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1)
176 Setting this to "1" will cause the "sync" flag set
177 by the chattr(1) command on a directory to be
178 inherited by files in that directory.
179
180 fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1)
181 Setting this to "1" will cause the "nodump" flag set
182 by the chattr(1) command on a directory to be
183 inherited by files in that directory.
184
185 fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1)
186 Setting this to "1" will cause the "noatime" flag set
187 by the chattr(1) command on a directory to be
188 inherited by files in that directory.