author | Linus Torvalds <torvalds@ppc970.osdl.org> | 2005-04-16 18:20:36 -0400 |
---|---|---|
committer | Linus Torvalds <torvalds@ppc970.osdl.org> | 2005-04-16 18:20:36 -0400 |
commit | 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 (patch) | |
tree | 0bba044c4ce775e45a88a51686b5d9f90697ea9d /Documentation/filesystems |
Linux-2.6.12-rc2 (tag: v2.6.12-rc2)
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!
Diffstat (limited to 'Documentation/filesystems')
38 files changed, 13259 insertions, 0 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX new file mode 100644 index 000000000000..bcfbab899b37 --- /dev/null +++ b/Documentation/filesystems/00-INDEX | |||
@@ -0,0 +1,50 @@ | |||
1 | 00-INDEX | ||
2 | - this file (info on some of the filesystems supported by linux). | ||
3 | Locking | ||
4 | - info on locking rules as they pertain to Linux VFS. | ||
5 | adfs.txt | ||
6 | - info and mount options for the Acorn Advanced Disc Filing System. | ||
7 | affs.txt | ||
8 | - info and mount options for the Amiga Fast File System. | ||
9 | bfs.txt | ||
10 | - info for the SCO UnixWare Boot Filesystem (BFS). | ||
11 | cifs.txt | ||
12 | - description of the CIFS filesystem | ||
13 | coda.txt | ||
14 | - description of the CODA filesystem. | ||
15 | cramfs.txt | ||
16 | - info on the cram filesystem for small storage (ROMs etc) | ||
17 | devfs/ | ||
18 | - directory containing devfs documentation. | ||
19 | ext2.txt | ||
20 | - info, mount options and specifications for the Ext2 filesystem. | ||
21 | fat_cvf.txt | ||
22 | - info on the Compressed Volume Files extension to the FAT filesystem | ||
23 | hpfs.txt | ||
24 | - info and mount options for the OS/2 HPFS. | ||
25 | isofs.txt | ||
26 | - info and mount options for the ISO 9660 (CDROM) filesystem. | ||
27 | jfs.txt | ||
28 | - info and mount options for the JFS filesystem. | ||
29 | ncpfs.txt | ||
30 | - info on Novell Netware(tm) filesystem using NCP protocol. | ||
31 | ntfs.txt | ||
32 | - info and mount options for the NTFS filesystem (Windows NT). | ||
33 | proc.txt | ||
34 | - info on Linux's /proc filesystem. | ||
35 | romfs.txt | ||
36 | - Description of the ROMFS filesystem. | ||
37 | smbfs.txt | ||
38 | - info on using filesystems with the SMB protocol (Windows 3.11 and NT) | ||
39 | sysv-fs.txt | ||
40 | - info on the SystemV/V7/Xenix/Coherent filesystem. | ||
41 | udf.txt | ||
42 | - info and mount options for the UDF filesystem. | ||
43 | ufs.txt | ||
44 | - info on the ufs filesystem. | ||
45 | vfat.txt | ||
46 | - info on using the VFAT filesystem used in Windows NT and Windows 95 | ||
47 | vfs.txt | ||
48 | - Overview of the Virtual File System | ||
49 | xfs.txt | ||
50 | - info and mount options for the XFS filesystem. | ||
diff --git a/Documentation/filesystems/Exporting b/Documentation/filesystems/Exporting new file mode 100644 index 000000000000..31047e0fe14b --- /dev/null +++ b/Documentation/filesystems/Exporting | |||
@@ -0,0 +1,176 @@ | |||
1 | |||
2 | Making Filesystems Exportable | ||
3 | ============================= | ||
4 | |||
5 | Most filesystem operations require a dentry (or two) as a starting | ||
6 | point. Local applications have a reference-counted hold on suitable | ||
7 | dentries via open file descriptors or cwd/root. However, remote | ||
8 | applications that access a filesystem via a remote filesystem protocol | ||
9 | such as NFS may not be able to hold such a reference, and so need a | ||
10 | different way to refer to a particular dentry. As the alternative | ||
11 | form of reference needs to be stable across renames, truncates, and | ||
12 | server-reboot (among other things, though these tend to be the most | ||
13 | problematic), there is no simple answer like 'filename'. | ||
14 | |||
15 | The mechanism discussed here allows each filesystem implementation to | ||
16 | specify how to generate an opaque (outside of the filesystem) byte | ||
17 | string for any dentry, and how to find an appropriate dentry for any | ||
18 | given opaque byte string. | ||
19 | This byte string will be called a "filehandle fragment" as it | ||
20 | corresponds to part of an NFS filehandle. | ||
21 | |||
22 | A filesystem which supports the mapping between filehandle fragments | ||
23 | and dentries will be termed "exportable". | ||
24 | |||
25 | |||
26 | |||
27 | Dcache Issues | ||
28 | ------------- | ||
29 | |||
30 | The dcache normally contains a proper prefix of any given filesystem | ||
31 | tree. This means that if any filesystem object is in the dcache, then | ||
32 | all of the ancestors of that filesystem object are also in the dcache. | ||
33 | As normal access is by filename this prefix is created naturally and | ||
34 | maintained easily (by each object maintaining a reference count on | ||
35 | its parent). | ||
36 | |||
37 | However when objects are included into the dcache by interpreting a | ||
38 | filehandle fragment, there is no automatic creation of a path prefix | ||
39 | for the object. This leads to two related but distinct features of | ||
40 | the dcache that are not needed for normal filesystem access. | ||
41 | |||
42 | 1/ The dcache must sometimes contain objects that are not part of the | ||
43 | proper prefix, i.e. that are not connected to the root. | ||
44 | 2/ The dcache must be prepared for a newly found (via ->lookup) directory | ||
45 | to already have a (non-connected) dentry, and must be able to move | ||
46 | that dentry into place (based on the parent and name in the | ||
47 | ->lookup). This is particularly needed for directories as | ||
48 | it is a dcache invariant that directories only have one dentry. | ||
49 | |||
50 | To implement these features, the dcache has: | ||
51 | |||
52 | a/ A dentry flag DCACHE_DISCONNECTED which is set on | ||
53 | any dentry that might not be part of the proper prefix. | ||
54 | This is set when anonymous dentries are created, and cleared when a | ||
55 | dentry is noticed to be a child of a dentry which is in the proper | ||
56 | prefix. | ||
57 | |||
58 | b/ A per-superblock list "s_anon" of dentries which are the roots of | ||
59 | subtrees that are not in the proper prefix. These dentries, as | ||
60 | well as the proper prefix, need to be released at unmount time. As | ||
61 | these dentries will not be hashed, they are linked together on the | ||
62 | d_hash list_head. | ||
63 | |||
64 | c/ Helper routines to allocate anonymous dentries, and to help attach | ||
65 | loose directory dentries at lookup time. They are: | ||
66 | d_alloc_anon(inode) will return a dentry for the given inode. | ||
67 | If the inode already has a dentry, one of those is returned. | ||
68 | If it doesn't, a new anonymous (IS_ROOT and | ||
69 | DCACHE_DISCONNECTED) dentry is allocated and attached. | ||
70 | In the case of a directory, care is taken that only one dentry | ||
71 | can ever be attached. | ||
72 | d_splice_alias(inode, dentry) will make sure that there is a | ||
73 | dentry with the same name and parent as the given dentry, and | ||
74 | which refers to the given inode. | ||
75 | If the inode is a directory and already has a dentry, then that | ||
76 | dentry is d_moved over the given dentry. | ||
77 | If the passed dentry gets attached, care is taken that this is | ||
78 | mutually exclusive to a d_alloc_anon operation. | ||
79 | If the passed dentry is used, NULL is returned, else the used | ||
80 | dentry is returned. This corresponds to the calling pattern of | ||
81 | ->lookup. | ||
82 | |||
83 | |||
84 | Filesystem Issues | ||
85 | ----------------- | ||
86 | |||
87 | For a filesystem to be exportable it must: | ||
88 | |||
89 | 1/ provide the filehandle fragment routines described below. | ||
90 | 2/ make sure that d_splice_alias is used rather than d_add | ||
91 | when ->lookup finds an inode for a given parent and name. | ||
92 | Typically the ->lookup routine will end: | ||
93 | if (inode) | ||
94 | return d_splice_alias(inode, dentry); | ||
95 | d_add(dentry, inode); | ||
96 | return NULL; | ||
97 | } | ||
98 | |||
99 | |||
100 | |||
101 | A file system implementation declares that instances of the filesystem | ||
102 | are exportable by setting the s_export_op field in the struct | ||
103 | super_block. This field must point to a "struct export_operations" | ||
104 | struct which could potentially be full of NULLs, though normally at | ||
105 | least get_parent will be set. | ||
106 | |||
107 | The primary operations are decode_fh and encode_fh. | ||
108 | decode_fh takes a filehandle fragment and tries to find or create a | ||
109 | dentry for the object referred to by the filehandle. | ||
110 | encode_fh takes a dentry and creates a filehandle fragment which can | ||
111 | later be used to find/create a dentry for the same object. | ||
112 | |||
113 | decode_fh will probably make use of "find_exported_dentry". | ||
114 | This function lives in the "exportfs" module which a filesystem does | ||
115 | not need unless it is being exported. So rather than calling | ||
116 | find_exported_dentry directly, each filesystem should call it through | ||
117 | the find_exported_dentry pointer in its export_operations table. | ||
118 | This field is set correctly by the exporting agent (e.g. nfsd) when a | ||
119 | filesystem is exported, and before any export operations are called. | ||
120 | |||
121 | find_exported_dentry needs three support functions from the | ||
122 | filesystem: | ||
123 | get_name. When given a parent dentry and a child dentry, this | ||
124 | should find a name in the directory identified by the parent | ||
125 | dentry, which leads to the object identified by the child dentry. | ||
126 | If no get_name function is supplied, a default implementation is | ||
127 | provided which uses vfs_readdir to find potential names, and | ||
128 | matches inode numbers to find the correct match. | ||
129 | |||
130 | get_parent. When given a dentry for a directory, this should return | ||
131 | a dentry for the parent. Quite possibly the parent dentry will | ||
132 | have been allocated by d_alloc_anon. | ||
133 | The default get_parent function just returns an error so any | ||
134 | filehandle lookup that requires finding a parent will fail. | ||
135 | ->lookup("..") is *not* used as a default as it can leave ".." | ||
136 | entries in the dcache which are too messy to work with. | ||
137 | |||
138 | get_dentry. When given an opaque datum, this should find the | ||
139 | implied object and create a dentry for it (possibly with | ||
140 | d_alloc_anon). | ||
141 | The opaque datum is whatever is passed down by the decode_fh | ||
142 | function, and is often simply a fragment of the filehandle | ||
143 | fragment. | ||
144 | decode_fh passes two datums through find_exported_dentry. One that | ||
145 | should be used to identify the target object, and one that can be | ||
146 | used to identify the object's parent, should that be necessary. | ||
147 | The default get_dentry function assumes that the datum contains an | ||
148 | inode number and a generation number, and it attempts to get the | ||
149 | inode using "iget" and check its validity by matching the | ||
150 | generation number. A filesystem should only depend on the default | ||
151 | if iget can safely be used this way. | ||
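
The generation check described for the default get_dentry can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel implementation: mock_inode, mock_table, mock_iget and sketch_get_dentry are hypothetical names, and a small static table stands in for the filesystem's inodes. It assumes the datum is two 32-bit words holding an inode number and a generation number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an on-disk inode; not a kernel structure. */
struct mock_inode {
    uint32_t ino;        /* inode number */
    uint32_t generation; /* bumped each time the inode number is reused */
    int in_use;          /* 0 if the inode has been freed */
};

/* Hypothetical inode table used in place of a real filesystem. */
static struct mock_inode mock_table[] = {
    { 2, 1, 1 },
    { 11, 5, 1 },
    { 12, 3, 0 },
};

/* Stand-in for iget(): find an inode by number, or return NULL. */
static struct mock_inode *mock_iget(uint32_t ino)
{
    size_t i;

    for (i = 0; i < sizeof(mock_table) / sizeof(mock_table[0]); i++)
        if (mock_table[i].ino == ino)
            return &mock_table[i];
    return NULL;
}

/*
 * Resolve a datum of the assumed form { inode number, generation }.
 * Returns NULL when the inode is missing, freed, or when the generation
 * does not match, which means the handle refers to an older object that
 * used the same inode number.
 */
static struct mock_inode *sketch_get_dentry(const uint32_t datum[2])
{
    struct mock_inode *inode = mock_iget(datum[0]);

    if (inode == NULL || !inode->in_use)
        return NULL;
    if (inode->generation != datum[1])
        return NULL; /* stale handle */
    return inode;
}
```

A handle carrying { 11, 5 } resolves in this sketch, while { 11, 4 } is rejected as stale even though inode 11 exists.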
152 | |||
153 | If decode_fh and/or encode_fh are left as NULL, then default | ||
154 | implementations are used. These defaults are suitable for ext2 and | ||
155 | extremely similar filesystems (like ext3). | ||
156 | |||
157 | The default encode_fh creates a filehandle fragment from the inode | ||
158 | number and generation number of the target together with the inode | ||
159 | number and generation number of the parent (if the parent is | ||
160 | required). | ||
161 | |||
162 | The default decode_fh extracts the target and parent datums from the | ||
163 | filehandle assuming the format used by the default encode_fh and | ||
164 | passes them to find_exported_dentry. | ||
165 | |||
166 | |||
167 | A filehandle fragment consists of an array of 1 or more 4-byte words, | ||
168 | together with a one byte "type". | ||
169 | The decode_fh routine should not depend on the stated size that is | ||
170 | passed to it. This size may be larger than the original filehandle | ||
171 | generated by encode_fh, in which case it will have been padded with | ||
172 | NULs. Rather, the encode_fh routine should choose a "type" which | ||
173 | indicates to decode_fh how much of the filehandle is valid, and how | ||
174 | it should be interpreted. | ||
175 | |||
176 | |||
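
The rule that decode_fh should trust the "type" rather than the stated size can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel's default encode_fh/decode_fh: the type values FH_TYPE_INO_GEN and FH_TYPE_INO_GEN_PARENT, the struct fh_ids, and both function names are hypothetical. The layout follows the description above: inode number and generation of the target, optionally followed by the same pair for the parent.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fragment types; names and values are illustrative only. */
enum {
    FH_TYPE_INO_GEN = 1,        /* 2 words: target ino, target generation */
    FH_TYPE_INO_GEN_PARENT = 2, /* 4 words: target pair, then parent pair */
};

/* Hypothetical decoded form of a filehandle fragment. */
struct fh_ids {
    uint32_t ino, gen;               /* target object */
    uint32_t parent_ino, parent_gen; /* meaningful only if has_parent */
    int has_parent;
};

/*
 * Write the fragment into fh[], which has room for max_words 4-byte words.
 * Returns the chosen type, or -1 if the buffer is too small.
 * *words_used receives the number of words actually written.
 */
static int sketch_encode_fh(const struct fh_ids *ids, uint32_t *fh,
                            int max_words, int *words_used)
{
    int need = ids->has_parent ? 4 : 2;

    if (max_words < need)
        return -1;
    fh[0] = ids->ino;
    fh[1] = ids->gen;
    if (ids->has_parent) {
        fh[2] = ids->parent_ino;
        fh[3] = ids->parent_gen;
    }
    *words_used = need;
    return ids->has_parent ? FH_TYPE_INO_GEN_PARENT : FH_TYPE_INO_GEN;
}

/*
 * Decode a fragment. The type decides how many words are meaningful;
 * len_words may be larger because of padding, so it is used only as a
 * bounds check. Returns 0 on success, -1 on an unknown type or a
 * fragment that is too short for its type.
 */
static int sketch_decode_fh(const uint32_t *fh, int len_words, int type,
                            struct fh_ids *out)
{
    int need;

    switch (type) {
    case FH_TYPE_INO_GEN:
        need = 2;
        break;
    case FH_TYPE_INO_GEN_PARENT:
        need = 4;
        break;
    default:
        return -1;
    }
    if (len_words < need)
        return -1;

    memset(out, 0, sizeof(*out));
    out->ino = fh[0];
    out->gen = fh[1];
    if (type == FH_TYPE_INO_GEN_PARENT) {
        out->has_parent = 1;
        out->parent_ino = fh[2];
        out->parent_gen = fh[3];
    }
    return 0;
}
```

Decoding a two-word fragment from an eight-word zero-padded buffer yields the same result as decoding it from a two-word buffer, because only the type determines which words are read.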
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking new file mode 100644 index 000000000000..a934baeeb33a --- /dev/null +++ b/Documentation/filesystems/Locking | |||
@@ -0,0 +1,515 @@ | |||
1 | The text below describes the locking rules for VFS-related methods. | ||
2 | It is (believed to be) up-to-date. *Please*, if you change anything in | ||
3 | prototypes or locking protocols - update this file. And update the relevant | ||
4 | instances in the tree, don't leave that to maintainers of filesystems/devices/ | ||
5 | etc. At the very least, put the list of dubious cases in the end of this file. | ||
6 | Don't turn it into a log - maintainers of out-of-the-tree code are supposed to | ||
7 | be able to use diff(1). | ||
8 | Things currently missing here: socket operations. Alexey? | ||
9 | |||
10 | --------------------------- dentry_operations -------------------------- | ||
11 | prototypes: | ||
12 | int (*d_revalidate)(struct dentry *, int); | ||
13 | int (*d_hash) (struct dentry *, struct qstr *); | ||
14 | int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); | ||
15 | int (*d_delete)(struct dentry *); | ||
16 | void (*d_release)(struct dentry *); | ||
17 | void (*d_iput)(struct dentry *, struct inode *); | ||
18 | |||
19 | locking rules: | ||
20 | none have BKL | ||
21 | dcache_lock rename_lock ->d_lock may block | ||
22 | d_revalidate: no no no yes | ||
23 | d_hash no no no yes | ||
24 | d_compare: no yes no no | ||
25 | d_delete: yes no yes no | ||
26 | d_release: no no no yes | ||
27 | d_iput: no no no yes | ||
28 | |||
29 | --------------------------- inode_operations --------------------------- | ||
30 | prototypes: | ||
31 | int (*create) (struct inode *,struct dentry *,int, struct nameidata *); | ||
32 | struct dentry * (*lookup) (struct inode *,struct dentry *, | ||
33 | struct nameidata *); | ||
34 | int (*link) (struct dentry *,struct inode *,struct dentry *); | ||
35 | int (*unlink) (struct inode *,struct dentry *); | ||
36 | int (*symlink) (struct inode *,struct dentry *,const char *); | ||
37 | int (*mkdir) (struct inode *,struct dentry *,int); | ||
38 | int (*rmdir) (struct inode *,struct dentry *); | ||
39 | int (*mknod) (struct inode *,struct dentry *,int,dev_t); | ||
40 | int (*rename) (struct inode *, struct dentry *, | ||
41 | struct inode *, struct dentry *); | ||
42 | int (*readlink) (struct dentry *, char __user *,int); | ||
43 | int (*follow_link) (struct dentry *, struct nameidata *); | ||
44 | void (*truncate) (struct inode *); | ||
45 | int (*permission) (struct inode *, int, struct nameidata *); | ||
46 | int (*setattr) (struct dentry *, struct iattr *); | ||
47 | int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *); | ||
48 | int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); | ||
49 | ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); | ||
50 | ssize_t (*listxattr) (struct dentry *, char *, size_t); | ||
51 | int (*removexattr) (struct dentry *, const char *); | ||
52 | |||
53 | locking rules: | ||
54 | all may block, none have BKL | ||
55 | i_sem(inode) | ||
56 | lookup: yes | ||
57 | create: yes | ||
58 | link: yes (both) | ||
59 | mknod: yes | ||
60 | symlink: yes | ||
61 | mkdir: yes | ||
62 | unlink: yes (both) | ||
63 | rmdir: yes (both) (see below) | ||
64 | rename: yes (all) (see below) | ||
65 | readlink: no | ||
66 | follow_link: no | ||
67 | truncate: yes (see below) | ||
68 | setattr: yes | ||
69 | permission: no | ||
70 | getattr: no | ||
71 | setxattr: yes | ||
72 | getxattr: no | ||
73 | listxattr: no | ||
74 | removexattr: yes | ||
75 | Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on | ||
76 | victim. | ||
77 | cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem. | ||
78 | ->truncate() is never called directly - it's a callback, not a | ||
79 | method. It's called by vmtruncate() - library function normally used by | ||
80 | ->setattr(). Locking information above applies to that call (i.e. is | ||
81 | inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been | ||
82 | passed). | ||
83 | |||
84 | See Documentation/filesystems/directory-locking for more detailed discussion | ||
85 | of the locking scheme for directory operations. | ||
86 | |||
87 | --------------------------- super_operations --------------------------- | ||
88 | prototypes: | ||
89 | struct inode *(*alloc_inode)(struct super_block *sb); | ||
90 | void (*destroy_inode)(struct inode *); | ||
91 | void (*read_inode) (struct inode *); | ||
92 | void (*dirty_inode) (struct inode *); | ||
93 | int (*write_inode) (struct inode *, int); | ||
94 | void (*put_inode) (struct inode *); | ||
95 | void (*drop_inode) (struct inode *); | ||
96 | void (*delete_inode) (struct inode *); | ||
97 | void (*put_super) (struct super_block *); | ||
98 | void (*write_super) (struct super_block *); | ||
99 | int (*sync_fs)(struct super_block *sb, int wait); | ||
100 | void (*write_super_lockfs) (struct super_block *); | ||
101 | void (*unlockfs) (struct super_block *); | ||
102 | int (*statfs) (struct super_block *, struct kstatfs *); | ||
103 | int (*remount_fs) (struct super_block *, int *, char *); | ||
104 | void (*clear_inode) (struct inode *); | ||
105 | void (*umount_begin) (struct super_block *); | ||
106 | int (*show_options)(struct seq_file *, struct vfsmount *); | ||
107 | ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); | ||
108 | ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); | ||
109 | |||
110 | locking rules: | ||
111 | All may block. | ||
112 | BKL s_lock s_umount | ||
113 | alloc_inode: no no no | ||
114 | destroy_inode: no | ||
115 | read_inode: no (see below) | ||
116 | dirty_inode: no (must not sleep) | ||
117 | write_inode: no | ||
118 | put_inode: no | ||
119 | drop_inode: no !!!inode_lock!!! | ||
120 | delete_inode: no | ||
121 | put_super: yes yes no | ||
122 | write_super: no yes read | ||
123 | sync_fs: no no read | ||
124 | write_super_lockfs: ? | ||
125 | unlockfs: ? | ||
126 | statfs: no no no | ||
127 | remount_fs: no yes maybe (see below) | ||
128 | clear_inode: no | ||
129 | umount_begin: yes no no | ||
130 | show_options: no (vfsmount->sem) | ||
131 | quota_read: no no no (see below) | ||
132 | quota_write: no no no (see below) | ||
133 | |||
134 | ->read_inode() is not a method - it's a callback used in iget(). | ||
135 | ->remount_fs() will have the s_umount lock if it's already mounted. | ||
136 | When called from get_sb_single, it does NOT have the s_umount lock. | ||
137 | ->quota_read() and ->quota_write() functions are both guaranteed to | ||
138 | be the only ones operating on the quota file by the quota code (via | ||
139 | dqio_sem) (unless an admin really wants to screw up something and | ||
140 | writes to quota files with quotas on). For other details about locking | ||
141 | see also dquot_operations section. | ||
142 | |||
143 | --------------------------- file_system_type --------------------------- | ||
144 | prototypes: | ||
145 | struct super_block *(*get_sb) (struct file_system_type *, int, | ||
146 | const char *, void *); | ||
147 | void (*kill_sb) (struct super_block *); | ||
148 | locking rules: | ||
149 | may block BKL | ||
150 | get_sb yes yes | ||
151 | kill_sb yes yes | ||
152 | |||
153 | ->get_sb() returns error or a locked superblock (exclusive on ->s_umount). | ||
154 | ->kill_sb() takes a write-locked superblock, does all shutdown work on it, | ||
155 | unlocks and drops the reference. | ||
156 | |||
157 | --------------------------- address_space_operations -------------------------- | ||
158 | prototypes: | ||
159 | int (*writepage)(struct page *page, struct writeback_control *wbc); | ||
160 | int (*readpage)(struct file *, struct page *); | ||
161 | int (*sync_page)(struct page *); | ||
162 | int (*writepages)(struct address_space *, struct writeback_control *); | ||
163 | int (*set_page_dirty)(struct page *page); | ||
164 | int (*readpages)(struct file *filp, struct address_space *mapping, | ||
165 | struct list_head *pages, unsigned nr_pages); | ||
166 | int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); | ||
167 | int (*commit_write)(struct file *, struct page *, unsigned, unsigned); | ||
168 | sector_t (*bmap)(struct address_space *, sector_t); | ||
169 | int (*invalidatepage) (struct page *, unsigned long); | ||
170 | int (*releasepage) (struct page *, int); | ||
171 | int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, | ||
172 | loff_t offset, unsigned long nr_segs); | ||
173 | |||
174 | locking rules: | ||
175 | All except set_page_dirty may block | ||
176 | |||
177 | BKL PageLocked(page) | ||
178 | writepage: no yes, unlocks (see below) | ||
179 | readpage: no yes, unlocks | ||
180 | sync_page: no maybe | ||
181 | writepages: no | ||
182 | set_page_dirty no no | ||
183 | readpages: no | ||
184 | prepare_write: no yes | ||
185 | commit_write: no yes | ||
186 | bmap: yes | ||
187 | invalidatepage: no yes | ||
188 | releasepage: no yes | ||
189 | direct_IO: no | ||
190 | |||
191 | ->prepare_write(), ->commit_write(), ->sync_page() and ->readpage() | ||
192 | may be called from the request handler (/dev/loop). | ||
193 | |||
194 | ->readpage() unlocks the page, either synchronously or via I/O | ||
195 | completion. | ||
196 | |||
197 | ->readpages() populates the pagecache with the passed pages and starts | ||
198 | I/O against them. They come unlocked upon I/O completion. | ||
199 | |||
200 | ->writepage() is used for two purposes: for "memory cleansing" and for | ||
201 | "sync". These are quite different operations and the behaviour may differ | ||
202 | depending upon the mode. | ||
203 | |||
204 | If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then | ||
205 | it *must* start I/O against the page, even if that would involve | ||
206 | blocking on in-progress I/O. | ||
207 | |||
208 | If writepage is called for memory cleansing (sync_mode == | ||
209 | WBC_SYNC_NONE) then its role is to get as much writeout underway as | ||
210 | possible. So writepage should try to avoid blocking against | ||
211 | currently-in-progress I/O. | ||
212 | |||
213 | If the filesystem is not called for "sync" and it determines that it | ||
214 | would need to block against in-progress I/O to be able to start new I/O | ||
215 | against the page the filesystem should redirty the page with | ||
216 | redirty_page_for_writepage(), then unlock the page and return zero. | ||
217 | This may also be done to avoid internal deadlocks, but rarely. | ||
218 | |||
219 | If the filesystem is called for sync then it must wait on any | ||
220 | in-progress I/O and then start new I/O. | ||
221 | |||
222 | The filesystem should unlock the page synchronously, before returning | ||
223 | to the caller. | ||
224 | |||
225 | Unless the filesystem is going to redirty_page_for_writepage(), unlock the page | ||
226 | and return zero, writepage *must* run set_page_writeback() against the page, | ||
227 | followed by unlocking it. Once set_page_writeback() has been run against the | ||
228 | page, write I/O can be submitted and the write I/O completion handler must run | ||
229 | end_page_writeback() once the I/O is complete. If no I/O is submitted, the | ||
230 | filesystem must run end_page_writeback() against the page before returning from | ||
231 | writepage. | ||
232 | |||
233 | That is: after 2.5.12, pages which are under writeout are *not* locked. Note, | ||
234 | if the filesystem needs the page to be locked during writeout, that is ok, too, | ||
235 | the page is allowed to be unlocked at any point in time between the calls to | ||
236 | set_page_writeback() and end_page_writeback(). | ||
237 | |||
238 | Note, failure to run either redirty_page_for_writepage() or the combination of | ||
239 | set_page_writeback()/end_page_writeback() on a page submitted to writepage | ||
240 | will leave the page itself marked clean but it will be tagged as dirty in the | ||
241 | radix tree. This incoherency can lead to all sorts of hard-to-debug problems | ||
242 | in the filesystem like having dirty inodes at umount and losing written data. | ||
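
The ->writepage() rules above amount to a small state machine, which can be modelled in userspace C. This is a sketch under stated assumptions and does not use kernel APIs: struct sketch_page, the sketch_sync_mode enum, and both functions are hypothetical, and boolean fields stand in for the page flags. It assumes the caller passes a locked page whose dirty flag has already been cleared.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for wbc->sync_mode. */
enum sketch_sync_mode { SKETCH_SYNC_NONE, SKETCH_SYNC_ALL };

/* Hypothetical model of the page state that writepage cares about. */
struct sketch_page {
    bool locked;         /* page lock held */
    bool dirty;          /* page needs writing */
    bool writeback;      /* between set_page_writeback and end_page_writeback */
    bool io_in_progress; /* earlier I/O against this page is still in flight */
};

/* Model of ->writepage(). Always returns zero in this sketch. */
static int sketch_writepage(struct sketch_page *page,
                            enum sketch_sync_mode mode)
{
    if (page->io_in_progress) {
        if (mode == SKETCH_SYNC_NONE) {
            /*
             * Memory cleansing: do not block. Redirty the page
             * (the redirty_page_for_writepage() step), unlock it
             * and return zero.
             */
            page->dirty = true;
            page->locked = false;
            return 0;
        }
        /* Sync: this assignment stands in for waiting on the old I/O. */
        page->io_in_progress = false;
    }

    page->writeback = true;      /* the set_page_writeback() step */
    page->locked = false;        /* unlock before returning */
    page->io_in_progress = true; /* new write I/O submitted */
    return 0;
}

/* Model of the I/O completion handler: the end_page_writeback() step. */
static void sketch_end_io(struct sketch_page *page)
{
    page->io_in_progress = false;
    page->writeback = false;
}
```

In both paths the page leaves sketch_writepage() unlocked, and it is either dirty again or under writeback, which is the invariant the last paragraph above warns about breaking.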
243 | |||
244 | ->sync_page() locking rules are not well-defined - usually it is called | ||
245 | with lock on page, but that is not guaranteed. Considering the currently | ||
246 | existing instances of this method ->sync_page() itself doesn't look | ||
247 | well-defined... | ||
248 | |||
249 | ->writepages() is used for periodic writeback and for syscall-initiated | ||
250 | sync operations. The address_space should start I/O against at least | ||
251 | *nr_to_write pages. *nr_to_write must be decremented for each page which is | ||
252 | written. The address_space implementation may write more (or less) pages | ||
253 | than *nr_to_write asks for, but it should try to be reasonably close. If | ||
254 | nr_to_write is NULL, all dirty pages must be written. | ||
255 | |||
256 | writepages should _only_ write pages which are present on | ||
257 | mapping->io_pages. | ||
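
The nr_to_write accounting can be shown with a short userspace model. This is a sketch, not kernel code: sketch_writepages is a hypothetical function and an int array of dirty flags stands in for the pages the mapping has queued for I/O. It decrements the budget once per page written and treats a NULL budget as "write everything", as described above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Write out dirty pages until the budget is used up.
 * dirty[i] != 0 marks page i as dirty. nr_to_write may be NULL, in
 * which case every dirty page is written. Returns the number of pages
 * written.
 */
static int sketch_writepages(int *dirty, size_t npages, long *nr_to_write)
{
    int written = 0;
    size_t i;

    for (i = 0; i < npages; i++) {
        if (nr_to_write != NULL && *nr_to_write <= 0)
            break;          /* budget exhausted */
        if (!dirty[i])
            continue;       /* clean pages cost nothing */
        dirty[i] = 0;       /* stands in for starting I/O on the page */
        written++;
        if (nr_to_write != NULL)
            (*nr_to_write)--;
    }
    return written;
}
```

With pages {1,0,1,1,0,1} and a budget of 2, the first two dirty pages are written and the budget reaches zero; a later call with a NULL budget writes the remaining two.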
258 | |||
259 | ->set_page_dirty() is called from various places in the kernel | ||
260 | when the target page is marked as needing writeback. It may be called | ||
261 | under spinlock (it cannot block) and is sometimes called with the page | ||
262 | not locked. | ||
263 | |||
264 | ->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some | ||
265 | filesystems and by the swapper. The latter will eventually go away. No | ||
266 | instance actually needs the BKL. Please, keep it that way and don't | ||
267 | breed new callers. | ||
268 | |||
269 | ->invalidatepage() is called when the filesystem must attempt to drop | ||
270 | some or all of the buffers from the page when it is being truncated. It | ||
271 | returns zero on success. If ->invalidatepage is zero, the kernel uses | ||
272 | block_invalidatepage() instead. | ||
273 | |||
274 | ->releasepage() is called when the kernel is about to try to drop the | ||
275 | buffers from the page in preparation for freeing it. It returns zero to | ||
276 | indicate that the buffers are (or may be) freeable. If ->releasepage is zero, | ||
277 | the kernel assumes that the fs has no private interest in the buffers. | ||
278 | |||
279 | Note: currently almost all instances of address_space methods are | ||
280 | using BKL for internal serialization and that's one of the worst sources | ||
281 | of contention. Normally they are calling library functions (in fs/buffer.c) | ||
282 | and pass foo_get_block() as a callback (on local block-based filesystems, | ||
283 | indeed). BKL is not needed for library stuff and is usually taken by | ||
284 | foo_get_block(). That is overkill, since block bitmaps can be protected by | ||
285 | internal fs locking and real critical areas are much smaller than the areas | ||
286 | filesystems protect now. | ||
287 | |||
288 | ----------------------- file_lock_operations ------------------------------ | ||
289 | prototypes: | ||
290 | void (*fl_insert)(struct file_lock *); /* lock insertion callback */ | ||
291 | void (*fl_remove)(struct file_lock *); /* lock removal callback */ | ||
292 | void (*fl_copy_lock)(struct file_lock *, struct file_lock *); | ||
293 | void (*fl_release_private)(struct file_lock *); | ||
294 | |||
295 | |||
296 | locking rules: | ||
297 | BKL may block | ||
298 | fl_insert: yes no | ||
299 | fl_remove: yes no | ||
300 | fl_copy_lock: yes no | ||
301 | fl_release_private: yes yes | ||
302 | |||
303 | ----------------------- lock_manager_operations --------------------------- | ||
304 | prototypes: | ||
305 | int (*fl_compare_owner)(struct file_lock *, struct file_lock *); | ||
306 | void (*fl_notify)(struct file_lock *); /* unblock callback */ | ||
307 | void (*fl_copy_lock)(struct file_lock *, struct file_lock *); | ||
308 | void (*fl_release_private)(struct file_lock *); | ||
309 | void (*fl_break)(struct file_lock *); /* break_lease callback */ | ||
310 | |||
311 | locking rules: | ||
312 | BKL may block | ||
313 | fl_compare_owner: yes no | ||
314 | fl_notify: yes no | ||
315 | fl_copy_lock: yes no | ||
316 | fl_release_private: yes yes | ||
317 | fl_break: yes no | ||
318 | |||
319 | Currently only NFSD and NLM provide instances of this class. None of | ||
320 | them block. If you have out-of-tree instances - please, show up. Locking | ||
321 | in that area will change. | ||
322 | --------------------------- buffer_head ----------------------------------- | ||
323 | prototypes: | ||
324 | void (*b_end_io)(struct buffer_head *bh, int uptodate); | ||
325 | |||
326 | locking rules: | ||
327 | called from interrupts. In other words, extreme care is needed here. | ||
328 | bh is locked, but that is the only guarantee we have here. Currently only RAID1, | ||
329 | highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices | ||
330 | call this method upon the IO completion. | ||
331 | |||
332 | --------------------------- block_device_operations ----------------------- | ||
333 | prototypes: | ||
334 | int (*open) (struct inode *, struct file *); | ||
335 | int (*release) (struct inode *, struct file *); | ||
336 | int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long); | ||
337 | int (*media_changed) (struct gendisk *); | ||
338 | int (*revalidate_disk) (struct gendisk *); | ||
339 | |||
340 | locking rules: | ||
341 | BKL bd_sem | ||
342 | open: yes yes | ||
343 | release: yes yes | ||
344 | ioctl: yes no | ||
345 | media_changed: no no | ||
346 | revalidate_disk: no no | ||
347 | |||
348 | The last two are called only from check_disk_change(). | ||
349 | |||
350 | --------------------------- file_operations ------------------------------- | ||
351 | prototypes: | ||
352 | loff_t (*llseek) (struct file *, loff_t, int); | ||
353 | ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); | ||
354 | ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); | ||
355 | ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); | ||
356 | ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, | ||
357 | loff_t); | ||
358 | int (*readdir) (struct file *, void *, filldir_t); | ||
359 | unsigned int (*poll) (struct file *, struct poll_table_struct *); | ||
360 | int (*ioctl) (struct inode *, struct file *, unsigned int, | ||
361 | unsigned long); | ||
362 | long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); | ||
363 | long (*compat_ioctl) (struct file *, unsigned int, unsigned long); | ||
364 | int (*mmap) (struct file *, struct vm_area_struct *); | ||
365 | int (*open) (struct inode *, struct file *); | ||
366 | int (*flush) (struct file *); | ||
367 | int (*release) (struct inode *, struct file *); | ||
368 | int (*fsync) (struct file *, struct dentry *, int datasync); | ||
369 | int (*aio_fsync) (struct kiocb *, int datasync); | ||
370 | int (*fasync) (int, struct file *, int); | ||
371 | int (*lock) (struct file *, int, struct file_lock *); | ||
372 | ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, | ||
373 | loff_t *); | ||
374 | ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, | ||
375 | loff_t *); | ||
376 | ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, | ||
377 | void __user *); | ||
378 | ssize_t (*sendpage) (struct file *, struct page *, int, size_t, | ||
379 | loff_t *, int); | ||
380 | unsigned long (*get_unmapped_area)(struct file *, unsigned long, | ||
381 | unsigned long, unsigned long, unsigned long); | ||
382 | int (*check_flags)(int); | ||
383 | int (*dir_notify)(struct file *, unsigned long); | ||
384 | }; | ||
385 | |||
386 | locking rules: | ||
387 | All except ->poll() may block. | ||
388 | BKL | ||
389 | llseek: no (see below) | ||
390 | read: no | ||
391 | aio_read: no | ||
392 | write: no | ||
393 | aio_write: no | ||
394 | readdir: no | ||
395 | poll: no | ||
396 | ioctl: yes (see below) | ||
397 | unlocked_ioctl: no (see below) | ||
398 | compat_ioctl: no | ||
399 | mmap: no | ||
400 | open: maybe (see below) | ||
401 | flush: no | ||
402 | release: no | ||
403 | fsync: no (see below) | ||
404 | aio_fsync: no | ||
405 | fasync: yes (see below) | ||
406 | lock: yes | ||
407 | readv: no | ||
408 | writev: no | ||
409 | sendfile: no | ||
410 | sendpage: no | ||
411 | get_unmapped_area: no | ||
412 | check_flags: no | ||
413 | dir_notify: no | ||
414 | |||
415 | ->llseek() locking has moved from llseek to the individual llseek | ||
416 | implementations. If your fs is not using generic_file_llseek, you | ||
417 | need to acquire and release the appropriate locks in your ->llseek(). | ||
418 | For many filesystems, it is probably safe to acquire the inode | ||
419 | semaphore. Note some filesystems (i.e. remote ones) provide no | ||
420 | protection for i_size so you will need to use the BKL. | ||
421 | |||
422 | ->open() locking is in-transit: big lock partially moved into the methods. | ||
423 | The only exception is ->open() in the instances of file_operations that never | ||
424 | end up in ->i_fop/->proc_fops, i.e. ones that belong to character devices | ||
425 | (chrdev_open() takes lock before replacing ->f_op and calling the secondary | ||
426 | method). As soon as we fix the handling of module reference counters all | ||
427 | instances of ->open() will be called without the BKL. | ||
428 | |||
429 | Note: ext2_release() was *the* source of contention on fs-intensive | ||
430 | loads and dropping BKL on ->release() helps to get rid of that (we still | ||
431 | grab BKL for cases when we close a file that had been opened r/w, but that | ||
432 | can and should be done using the internal locking with smaller critical areas). | ||
433 | Current worst offender is ext2_get_block()... | ||
434 | |||
435 | ->fasync() is a mess. This area needs a big cleanup and that will probably | ||
436 | affect locking. | ||
437 | |||
438 | ->readdir() and ->ioctl() on directories must be changed. Ideally we would | ||
439 | move ->readdir() to inode_operations and use a separate method for directory | ||
440 | ->ioctl() or kill the latter completely. One of the problems is that for | ||
441 | anything that resembles union-mount we won't have a struct file for all | ||
442 | components. And there are other reasons why the current interface is a mess... | ||
443 | |||
444 | ->ioctl() on regular files is superseded by the ->unlocked_ioctl() that | ||
445 | doesn't take the BKL. | ||
446 | |||
447 | ->read on directories probably must go away - we should just enforce -EISDIR | ||
448 | in sys_read() and friends. | ||
449 | |||
450 | ->fsync() has i_sem on inode. | ||
451 | |||
452 | --------------------------- dquot_operations ------------------------------- | ||
453 | prototypes: | ||
454 | int (*initialize) (struct inode *, int); | ||
455 | int (*drop) (struct inode *); | ||
456 | int (*alloc_space) (struct inode *, qsize_t, int); | ||
457 | int (*alloc_inode) (const struct inode *, unsigned long); | ||
458 | int (*free_space) (struct inode *, qsize_t); | ||
459 | int (*free_inode) (const struct inode *, unsigned long); | ||
460 | int (*transfer) (struct inode *, struct iattr *); | ||
461 | int (*write_dquot) (struct dquot *); | ||
462 | int (*acquire_dquot) (struct dquot *); | ||
463 | int (*release_dquot) (struct dquot *); | ||
464 | int (*mark_dirty) (struct dquot *); | ||
465 | int (*write_info) (struct super_block *, int); | ||
466 | |||
467 | These operations are intended to be more or less wrapping functions that ensure | ||
468 | a proper locking wrt the filesystem and call the generic quota operations. | ||
469 | |||
470 | What filesystem should expect from the generic quota functions: | ||
471 | |||
472 | FS recursion Held locks when called | ||
473 | initialize: yes maybe dqonoff_sem | ||
474 | drop: yes - | ||
475 | alloc_space: ->mark_dirty() - | ||
476 | alloc_inode: ->mark_dirty() - | ||
477 | free_space: ->mark_dirty() - | ||
478 | free_inode: ->mark_dirty() - | ||
479 | transfer: yes - | ||
480 | write_dquot: yes dqonoff_sem or dqptr_sem | ||
481 | acquire_dquot: yes dqonoff_sem or dqptr_sem | ||
482 | release_dquot: yes dqonoff_sem or dqptr_sem | ||
483 | mark_dirty: no - | ||
484 | write_info: yes dqonoff_sem | ||
485 | |||
486 | FS recursion means calling ->quota_read() and ->quota_write() from superblock | ||
487 | operations. | ||
488 | |||
489 | ->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called | ||
490 | only directly by the filesystem and do not call any fs functions except | ||
491 | the ->mark_dirty() operation. | ||
492 | |||
493 | More details about quota locking can be found in fs/dquot.c. | ||
494 | |||
495 | --------------------------- vm_operations_struct ----------------------------- | ||
496 | prototypes: | ||
497 | void (*open)(struct vm_area_struct*); | ||
498 | void (*close)(struct vm_area_struct*); | ||
499 | struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *); | ||
500 | |||
501 | locking rules: | ||
502 | BKL mmap_sem | ||
503 | open: no yes | ||
504 | close: no yes | ||
505 | nopage: no yes | ||
506 | |||
507 | ================================================================================ | ||
508 | Dubious stuff | ||
509 | |||
510 | (if you break something or notice that it is broken and do not fix it yourself | ||
511 | - at least put it here) | ||
512 | |||
513 | ipc/shm.c::shm_delete() - may need BKL. | ||
514 | ->read() and ->write() in many drivers are (probably) missing BKL. | ||
515 | drivers/sgi/char/graphics.c::sgi_graphics_nopage() - may need BKL. | ||
diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.txt new file mode 100644 index 000000000000..060abb0c7004 --- /dev/null +++ b/Documentation/filesystems/adfs.txt | |||
@@ -0,0 +1,57 @@ | |||
1 | Mount options for ADFS | ||
2 | ---------------------- | ||
3 | |||
4 | uid=nnn All files in the partition will be owned by | ||
5 | user id nnn. Default 0 (root). | ||
6 | gid=nnn All files in the partition will be in group | ||
7 | nnn. Default 0 (root). | ||
8 | ownmask=nnn The permission mask for ADFS 'owner' permissions | ||
9 | will be nnn. Default 0700. | ||
10 | othmask=nnn The permission mask for ADFS 'other' permissions | ||
11 | will be nnn. Default 0077. | ||
12 | |||
13 | Mapping of ADFS permissions to Linux permissions | ||
14 | ------------------------------------------------ | ||
15 | |||
16 | ADFS permissions consist of the following: | ||
17 | |||
18 | Owner read | ||
19 | Owner write | ||
20 | Other read | ||
21 | Other write | ||
22 | |||
23 | (In older versions, an 'execute' permission did exist, but this | ||
24 | does not hold the same meaning as the Linux 'execute' permission | ||
25 | and is now obsolete). | ||
26 | |||
27 | The mapping is performed as follows: | ||
28 | |||
29 | Owner read -> -r--r--r-- | ||
30 | Owner write -> --w--w--w- | ||
31 | Owner read and filetype UnixExec -> ---x--x--x | ||
32 | These are then masked by ownmask, eg 700 -> -rwx------ | ||
33 | Possible owner mode permissions -> -rwx------ | ||
34 | |||
35 | Other read -> -r--r--r-- | ||
36 | Other write -> --w--w--w- | ||
37 | Other read and filetype UnixExec -> ---x--x--x | ||
38 | These are then masked by othmask, eg 077 -> ----rwxrwx | ||
39 | Possible other mode permissions -> ----rwxrwx | ||
40 | |||
41 | Hence, with the default masks, if a file is owner read/write, and | ||
42 | not a UnixExec filetype, then the permissions will be: | ||
43 | |||
44 | -rw------- | ||
45 | |||
46 | However, if the masks were ownmask=0770,othmask=0007, then this would | ||
47 | be modified to: | ||
48 | -rw-rw---- | ||
49 | |||
50 | There is no restriction on what you can do with these masks. You may | ||
51 | wish that either read bits give read access to the file for all, but | ||
52 | keep the default write protection (ownmask=0755,othmask=0577): | ||
53 | |||
54 | -rw-r--r-- | ||
55 | |||
56 | You can therefore tailor the permission translation to whatever you | ||
57 | desire the permissions should be under Linux. | ||
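
The mapping above can be sketched in C as follows. This is only an
illustration of the documented table, not the driver's code: the struct
adfs_perms type and the adfs_to_linux_mode() name are hypothetical, and the
ADFS attributes are assumed to be already decoded into booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, already-decoded view of an ADFS object's attributes. */
struct adfs_perms {
    bool owner_read;
    bool owner_write;
    bool other_read;
    bool other_write;
    bool unix_exec;     /* filetype is UnixExec */
};

/*
 * Translate ADFS permissions into Linux mode bits: each ADFS permission
 * is spread over user, group and other, then the owner-derived bits are
 * masked by ownmask and the other-derived bits by othmask.
 */
unsigned int adfs_to_linux_mode(struct adfs_perms p,
                                unsigned int ownmask, unsigned int othmask)
{
    unsigned int owner = 0, other = 0;

    /* The masks are expected to contain permission bits only. */
    assert((ownmask & ~0777u) == 0 && (othmask & ~0777u) == 0);

    if (p.owner_read)
        owner |= 0444;
    if (p.owner_write)
        owner |= 0222;
    if (p.owner_read && p.unix_exec)
        owner |= 0111;

    if (p.other_read)
        other |= 0444;
    if (p.other_write)
        other |= 0222;
    if (p.other_read && p.unix_exec)
        other |= 0111;

    return (owner & ownmask) | (other & othmask);
}
```

With the default masks (0700 and 0077), a file that is owner read/write and
not UnixExec comes out as 0600, matching the first example above.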
diff --git a/Documentation/filesystems/affs.txt b/Documentation/filesystems/affs.txt new file mode 100644 index 000000000000..30c9738590f4 --- /dev/null +++ b/Documentation/filesystems/affs.txt | |||
@@ -0,0 +1,219 @@ | |||
1 | Overview of Amiga Filesystems | ||
2 | ============================= | ||
3 | |||
4 | Not all varieties of the Amiga filesystems are supported for reading and | ||
5 | writing. The Amiga currently knows six different filesystems: | ||
6 | |||
7 | DOS\0 The old or original filesystem, not really suited for | ||
8 | hard disks and normally not used on them, either. | ||
9 | Supported read/write. | ||
10 | |||
11 | DOS\1 The original Fast File System. Supported read/write. | ||
12 | |||
13 | DOS\2 The old "international" filesystem. International means that | ||
14 | a bug has been fixed so that accented ("international") letters | ||
15 | in file names are case-insensitive, as they ought to be. | ||
16 | Supported read/write. | ||
17 | |||
18 | DOS\3 The "international" Fast File System. Supported read/write. | ||
19 | |||
20 | DOS\4 The original filesystem with directory cache. The directory | ||
21 | cache speeds up directory accesses on floppies considerably, | ||
22 | but slows down file creation/deletion. Doesn't make much | ||
23 | sense on hard disks. Supported read only. | ||
24 | |||
25 | DOS\5 The Fast File System with directory cache. Supported read only. | ||
26 | |||
27 | All of the above filesystems allow block sizes from 512 to 32K bytes. | ||
28 | Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks | ||
29 | speed up almost everything at the expense of wasted disk space. The speed | ||
30 | gain above 4K seems not really worth the price, so you don't lose too | ||
31 | much here, either. | ||
32 | |||
33 | The muFS (multi user File System) equivalents of the above file systems | ||
34 | are supported, too. | ||
35 | |||
36 | Mount options for the AFFS | ||
37 | ========================== | ||
38 | |||
39 | protect If this option is set, the protection bits cannot be altered. | ||
40 | |||
41 | setuid[=uid] This sets the owner of all files and directories in the file | ||
42 | system to uid or the uid of the current user, respectively. | ||
43 | |||
44 | setgid[=gid] Same as above, but for gid. | ||
45 | |||
46 | mode=mode Sets the mode flags to the given (octal) value, regardless | ||
47 | of the original permissions. Directories will get an x | ||
48 | permission if the corresponding r bit is set. | ||
49 | This is useful since most of the plain AmigaOS files | ||
50 | will map to 600. | ||
51 | |||
52 | reserved=num Sets the number of reserved blocks at the start of the | ||
53 | partition to num. You should never need this option. | ||
54 | Default is 2. | ||
55 | |||
56 | root=block Sets the block number of the root block. This should never | ||
57 | be necessary. | ||
58 | |||
59 | bs=blksize Sets the blocksize to blksize. Valid block sizes are 512, | ||
60 | 1024, 2048 and 4096. Like the root option, this should | ||
61 | never be necessary, as the affs can figure it out itself. | ||
62 | |||
63 | quiet The file system will not return an error for disallowed | ||
64 | mode changes. | ||
65 | |||
66 | verbose The volume name, file system type and block size will | ||
67 | be written to the syslog when the filesystem is mounted. | ||
68 | |||
69 | mufs The filesystem is really a muFS, although it doesn't | ||
70 | identify itself as one. This option is necessary if | ||
71 | the filesystem wasn't formatted as muFS, but is used | ||
72 | as one. | ||
73 | |||
74 | prefix=path Path will be prefixed to every absolute path name of | ||
75 | symbolic links on an AFFS partition. Default = "/". | ||
76 | (See below.) | ||
77 | |||
78 | volume=name When symbolic links with an absolute path are created | ||
79 | on an AFFS partition, name will be prepended as the | ||
80 | volume name. Default = "" (empty string). | ||
81 | (See below.) | ||
82 | |||
83 | Handling of the Users/Groups and protection flags | ||
84 | ================================================= | ||
85 | |||
86 | Amiga -> Linux: | ||
87 | |||
88 | The Amiga protection flags RWEDRWEDHSPARWED are handled as follows: | ||
89 | |||
90 | - R maps to r for user, group and others. On directories, R implies x. | ||
91 | |||
92 | - If both W and D are allowed, w will be set. | ||
93 | |||
94 | - E maps to x. | ||
95 | |||
96 | - H and P are always retained and ignored under Linux. | ||
97 | |||
98 | - A is always reset when a file is written to. | ||
99 | |||
100 | User id and group id will be used unless set[gu]id are given as mount | ||
101 | options. Since most of the Amiga file systems are single user systems | ||
102 | they will be owned by root. The root directory (the mount point) of the | ||
103 | Amiga filesystem will be owned by the user who actually mounts the | ||
104 | filesystem (the root directory doesn't have uid/gid fields). | ||
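
The Amiga -> Linux rules above can be sketched in C. This is a minimal
illustration only: struct amiga_rwed and amiga_to_linux_mode() are
hypothetical names, the flags are assumed to be already decoded so that
"true" means "allowed", and the H, S, P and A flags are left out because
they do not affect the Linux mode.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical decoded view of one R/W/E/D group; true means "allowed". */
struct amiga_rwed {
    bool r, w, e, d;
};

/* Compute the rwx bits (0..7) for one class from its R/W/E/D flags. */
static unsigned int class_bits(struct amiga_rwed p, bool is_dir)
{
    unsigned int bits = 0;

    if (p.r)
        bits |= 4;              /* R maps to r */
    if (p.r && is_dir)
        bits |= 1;              /* on directories, R implies x */
    if (p.w && p.d)
        bits |= 2;              /* w needs both W and D */
    if (p.e)
        bits |= 1;              /* E maps to x */
    return bits;
}

/* Combine the three classes into a Linux permission mode. */
unsigned int amiga_to_linux_mode(struct amiga_rwed user,
                                 struct amiga_rwed group,
                                 struct amiga_rwed other, bool is_dir)
{
    return (class_bits(user, is_dir) << 6) |
           (class_bits(group, is_dir) << 3) |
           class_bits(other, is_dir);
}
```

A plain file with R, W and D allowed for the user only maps to 600, which is
the common case mentioned under the mode= option.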
105 | |||
106 | Linux -> Amiga: | ||
107 | |||
108 | The Linux rwxrwxrwx file mode is handled as follows: | ||
109 | |||
110 | - r permission will set R for user, group and others. | ||
111 | |||
112 | - w permission will set W and D for user, group and others. | ||
113 | |||
114 | - x permission of the user will set E for plain files. | ||
115 | |||
116 | - All other flags (suid, sgid, ...) are ignored and will | ||
117 | not be retained. | ||
118 | |||
119 | Newly created files and directories will get the user and group ID | ||
120 | of the current user and a mode according to the umask. | ||
121 | |||
122 | Symbolic links | ||
123 | ============== | ||
124 | |||
125 | Although the Amiga and Linux file systems resemble each other, there | ||
126 | are some, not always subtle, differences. One of them becomes apparent | ||
127 | with symbolic links. While Linux has a file system with exactly one | ||
128 | root directory, the Amiga has a separate root directory for each | ||
129 | file system (for example, partition, floppy disk, ...). With the Amiga, | ||
130 | these entities are called "volumes". They have symbolic names which | ||
131 | can be used to access them. Thus, symbolic links can point to a | ||
132 | different volume. AFFS turns the volume name into a directory name | ||
133 | and prepends the prefix path (see prefix option) to it. | ||
134 | |||
135 | Example: | ||
136 | You mount all your Amiga partitions under /amiga/<volume> (where | ||
137 | <volume> is the name of the volume), and you give the option | ||
138 | "prefix=/amiga/" when mounting all your AFFS partitions. (They | ||
139 | might be "User", "WB" and "Graphics", the mount points /amiga/User, | ||
140 | /amiga/WB and /amiga/Graphics). A symbolic link referring to | ||
141 | "User:sc/include/dos/dos.h" will be followed to | ||
142 | "/amiga/User/sc/include/dos/dos.h". | ||
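
The translation in this example can be sketched in C as below. It is a
simplified illustration, not the AFFS implementation: the function name
affs_translate_link() is hypothetical, the prefix is assumed to end with a
slash, and only targets of the form "Volume:path" are handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Translate an absolute Amiga link target ("Volume:path") into a Linux
 * path by turning the volume name into a directory below `prefix`.
 * Returns 0 on success, -1 if the target has no volume part or the
 * result does not fit into `out`.
 */
int affs_translate_link(const char *prefix, const char *target,
                        char *out, size_t outlen)
{
    const char *colon;
    int n;

    assert(prefix != NULL && target != NULL && out != NULL);

    colon = strchr(target, ':');
    if (colon == NULL || colon == target)
        return -1;              /* no volume name in front of the path */

    /* prefix + volume name + "/" + path after the colon */
    n = snprintf(out, outlen, "%s%.*s/%s", prefix,
                 (int)(colon - target), target, colon + 1);
    if (n < 0 || (size_t)n >= outlen)
        return -1;              /* output buffer too small */
    return 0;
}
```

Called with the prefix "/amiga/" and the target "User:sc/include/dos/dos.h",
it produces "/amiga/User/sc/include/dos/dos.h".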
143 | |||
144 | Examples | ||
145 | ======== | ||
146 | |||
147 | Command line: | ||
148 | mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose | ||
149 | mount /dev/sda3 /Amiga -t affs | ||
150 | |||
151 | /etc/fstab entry: | ||
152 | /dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0 | ||
153 | |||
154 | IMPORTANT NOTE | ||
155 | ============== | ||
156 | |||
157 | If you boot Windows 95 (don't know about 3.x, 98 and NT) while you | ||
158 | have an Amiga harddisk connected to your PC, it will overwrite | ||
159 | the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating | ||
160 | the Rigid Disk Block. Sheer luck has it that this is an unused | ||
161 | area of the RDB, so only the checksum doesn't match anymore. | ||
162 | Linux will ignore this garbage and recognize the RDB anyway, but | ||
163 | before you connect that drive to your Amiga again, you must | ||
164 | restore or repair your RDB. So please do make a backup copy of it | ||
165 | before booting Windows! | ||
166 | |||
167 | If the damage is already done, the following should fix the RDB | ||
168 | (where <disk> is the device name). | ||
169 | DO AT YOUR OWN RISK: | ||
170 | |||
171 | dd if=/dev/<disk> of=rdb.tmp count=1 | ||
172 | cp rdb.tmp rdb.fixed | ||
173 | dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4 | ||
174 | dd if=rdb.fixed of=/dev/<disk> | ||
175 | |||
176 | Bugs, Restrictions, Caveats | ||
177 | =========================== | ||
178 | |||
179 | Quite a few things may not work as advertised. Not everything is | ||
180 | tested, though several hundred MB have been read and written using | ||
181 | this fs. For the most up-to-date list of bugs please consult | ||
182 | fs/affs/Changes. | ||
183 | |||
184 | Filenames are truncated to 30 characters without warning (this | ||
185 | can be changed by setting the compile-time option AFFS_NO_TRUNCATE | ||
186 | in include/linux/amigaffs.h). | ||
187 | |||
188 | Case is ignored by the affs in filename matching, but Linux shells | ||
189 | do care about the case. Example (with /wb being an affs mounted fs): | ||
190 | rm /wb/WRONGCASE | ||
191 | will remove /wb/wrongcase, but | ||
192 | rm /wb/WR* | ||
193 | will not since the names are matched by the shell. | ||
194 | |||
195 | The block allocation is designed for hard disk partitions. If more | ||
196 | than 1 process writes to a (small) diskette, the blocks are allocated | ||
197 | in an ugly way (but the real AFFS doesn't do much better). This | ||
198 | is also true when space gets tight. | ||
199 | |||
200 | You cannot execute programs on an OFS (Old File System), since the | ||
201 | program files cannot be memory mapped due to the 488 byte blocks. | ||
202 | For the same reason you cannot mount an image on such a filesystem | ||
203 | via the loopback device. | ||
204 | |||
205 | The bitmap valid flag in the root block may not be accurate when the | ||
206 | system crashes while an affs partition is mounted. There's currently | ||
207 | no way to fix a garbled filesystem without an Amiga (disk validator) | ||
208 | or manually (who would do this?). Maybe later. | ||
209 | |||
210 | If you mount affs partitions on system startup, you may want to tell | ||
211 | fsck that the fs should not be checked (place a '0' in the sixth field | ||
212 | of /etc/fstab). | ||
213 | |||
214 | It's not possible to read floppy disks with a normal PC or workstation | ||
215 | due to an incompatibility with the Amiga floppy controller. | ||
216 | |||
217 | If you are interested in an Amiga Emulator for Linux, look at | ||
218 | |||
219 | http://www-users.informatik.rwth-aachen.de/~crux/uae.html | ||
diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt new file mode 100644 index 000000000000..2f4237dfb8c7 --- /dev/null +++ b/Documentation/filesystems/afs.txt | |||
@@ -0,0 +1,155 @@ | |||
1 | kAFS: AFS FILESYSTEM | ||
2 | ==================== | ||
3 | |||
4 | ABOUT | ||
5 | ===== | ||
6 | |||
7 | This filesystem provides a fairly simple AFS filesystem driver. It is under | ||
8 | development and only provides very basic facilities. It does not yet support | ||
9 | the following AFS features: | ||
10 | |||
11 | (*) Write support. | ||
12 | (*) Communications security. | ||
13 | (*) Local caching. | ||
14 | (*) pioctl() system call. | ||
15 | (*) Automatic mounting of embedded mountpoints. | ||
16 | |||
17 | |||
18 | USAGE | ||
19 | ===== | ||
20 | |||
21 | When inserting the driver modules the root cell must be specified along with a | ||
22 | list of volume location server IP addresses: | ||
23 | |||
24 | insmod rxrpc.o | ||
25 | insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 | ||
26 | |||
27 | The first module is a driver for the RxRPC remote operation protocol, and the | ||
28 | second is the actual filesystem driver for the AFS filesystem. | ||
29 | |||
30 | Once the module has been loaded, more cells can be added by the following | ||
31 | procedure: | ||
32 | |||
33 | echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells | ||
34 | |||
35 | Where the parameters to the "add" command are the name of a cell and a list of | ||
36 | volume location servers within that cell. | ||
37 | |||
38 | Filesystems can be mounted anywhere by commands similar to the following: | ||
39 | |||
40 | mount -t afs "%cambridge.redhat.com:root.afs." /afs | ||
41 | mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge | ||
42 | mount -t afs "#root.afs." /afs | ||
43 | mount -t afs "#root.cell." /afs/cambridge | ||
44 | |||
45 | NB: When using this on Linux 2.4, the mount command has to be different, | ||
46 | since the filesystem doesn't have access to the device name argument: | ||
47 | |||
48 | mount -t afs none /afs -ovol="#root.afs." | ||
49 | |||
50 | Where the initial character is either a hash or a percent symbol depending on | ||
51 | whether you definitely want a R/W volume (percent) or whether you'd prefer a | ||
52 | R/O volume, but are willing to use a R/W volume instead (hash). | ||
53 | |||
54 | The name of the volume can be suffixed with ".backup" or ".readonly" to | ||
55 | specify connection to only volumes of those types. | ||
56 | |||
57 | The name of the cell is optional, and if not given during a mount, then the | ||
58 | named volume will be looked up in the cell specified during insmod. | ||
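
The "device name" syntax used in the mount commands above can be sketched
with a small parser. This is an illustration under assumptions, not kAFS
code: struct afs_devname and afs_parse_devname() are hypothetical names, the
buffer sizes are arbitrary, and the ".backup" / ".readonly" suffixes are not
interpreted.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result of splitting a name such as "%cell:volume.". */
struct afs_devname {
    char type;          /* leading '#' or '%', exactly as given */
    char cell[64];      /* empty string if no cell was named */
    char volume[64];    /* volume name without the trailing '.' */
};

/* Returns 0 on success, -1 if the name is malformed or too long. */
int afs_parse_devname(const char *name, struct afs_devname *out)
{
    const char *colon, *vol;
    size_t len;

    assert(name != NULL && out != NULL);

    if (name[0] != '#' && name[0] != '%')
        return -1;
    out->type = name[0];
    name++;

    /* Optional "cell:" part in front of the volume name. */
    colon = strchr(name, ':');
    if (colon != NULL) {
        len = (size_t)(colon - name);
        if (len == 0 || len >= sizeof out->cell)
            return -1;
        memcpy(out->cell, name, len);
        out->cell[len] = '\0';
        vol = colon + 1;
    } else {
        out->cell[0] = '\0';
        vol = name;
    }

    len = strlen(vol);
    if (len > 0 && vol[len - 1] == '.')
        len--;                  /* drop the trailing dot */
    if (len == 0 || len >= sizeof out->volume)
        return -1;
    memcpy(out->volume, vol, len);
    out->volume[len] = '\0';
    return 0;
}
```

An empty cell field corresponds to the case where the volume is looked up in
the cell specified during insmod.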
59 | |||
60 | Additional cells can be added through /proc (see later section). | ||
61 | |||
62 | |||
63 | MOUNTPOINTS | ||
64 | =========== | ||
65 | |||
66 | AFS has a concept of mountpoints. These are specially formatted symbolic links | ||
67 | (of the same form as the "device name" passed to mount). kAFS presents these | ||
68 | to the user as directories that have special properties: | ||
69 | |||
70 | (*) They cannot be listed. Running a program like "ls" on them will incur an | ||
71 | EREMOTE error (Object is remote). | ||
72 | |||
73 | (*) Other objects can't be looked up inside of them. This also incurs an | ||
74 | EREMOTE error. | ||
75 | |||
76 | (*) They can be queried with the readlink() system call, which will return | ||
77 | the name of the mountpoint to which they point. The "readlink" program | ||
78 | will also work. | ||
79 | |||
80 | (*) They can be mounted on (which symbolic links can't). | ||
81 | |||
82 | |||
83 | PROC FILESYSTEM | ||
84 | =============== | ||
85 | |||
86 | The rxrpc module creates a number of files in various places in the /proc | ||
87 | filesystem: | ||
88 | |||
89 | (*) Firstly, some information files are made available in a directory called | ||
90 | "/proc/net/rxrpc/". These list the extant transport endpoint, peer, | ||
91 | connection and call records. | ||
92 | |||
93 | (*) Secondly, some control files are made available in a directory called | ||
94 | "/proc/sys/rxrpc/". Currently, all these files can be used for is to | ||
95 | turn on various levels of tracing. | ||
96 | |||
97 | The AFS module creates a "/proc/fs/afs/" directory and populates it: | ||
98 | |||
99 | (*) A "cells" file that lists cells currently known to the afs module. | ||
100 | |||
101 | (*) A directory per cell that contains files that list volume location | ||
102 | servers, volumes, and active servers known within that cell. | ||
103 | |||
104 | |||
105 | THE CELL DATABASE | ||
106 | ================= | ||
107 | |||
108 | The filesystem maintains an internal database of all the cells it knows and | ||
109 | the IP addresses of the volume location servers for those cells. The cell to | ||
110 | which the computer belongs is added to the database when insmod is performed | ||
111 | by the "rootcell=" argument. | ||
112 | |||
113 | Further cells can be added by commands similar to the following: | ||
114 | |||
115 | echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells | ||
116 | echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells | ||
117 | |||
118 | No other cell database operations are available at this time. | ||
119 | |||
120 | |||
121 | EXAMPLES | ||
122 | ======== | ||
123 | |||
124 | Here's what I use to test this. Some of the names and IP addresses are local | ||
125 | to my internal DNS. My "root.afs" partition has a mount point within it for | ||
126 | some public volumes. | ||
127 | |||
128 | insmod -S /tmp/rxrpc.o | ||
129 | insmod -S /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 | ||
130 | |||
131 | mount -t afs \%root.afs. /afs | ||
132 | mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/ | ||
133 | |||
134 | echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells | ||
135 | mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/ | ||
136 | mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive | ||
137 | mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib | ||
138 | mount -t afs "#grand.central.org:root.doc." /afs/grand.central.org/doc | ||
139 | mount -t afs "#grand.central.org:root.project." /afs/grand.central.org/project | ||
140 | mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service | ||
141 | mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software | ||
142 | mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user | ||
143 | |||
144 | umount /afs/grand.central.org/user | ||
145 | umount /afs/grand.central.org/software | ||
146 | umount /afs/grand.central.org/service | ||
147 | umount /afs/grand.central.org/project | ||
148 | umount /afs/grand.central.org/doc | ||
149 | umount /afs/grand.central.org/contrib | ||
150 | umount /afs/grand.central.org/archive | ||
151 | umount /afs/grand.central.org | ||
152 | umount /afs/cambridge.redhat.com | ||
153 | umount /afs | ||
154 | rmmod kafs | ||
155 | rmmod rxrpc | ||
diff --git a/Documentation/filesystems/automount-support.txt b/Documentation/filesystems/automount-support.txt new file mode 100644 index 000000000000..58c65a1713e5 --- /dev/null +++ b/Documentation/filesystems/automount-support.txt | |||
@@ -0,0 +1,118 @@ | |||
1 | Support is available for filesystems that wish to support automounting (such | ||
2 | as kAFS which can be found in fs/afs/). This facility includes allowing | ||
3 | in-kernel mounts to be performed and mountpoint degradation to be | ||
4 | requested. The latter can also be requested by userspace. | ||
5 | |||
6 | |||
7 | ====================== | ||
8 | IN-KERNEL AUTOMOUNTING | ||
9 | ====================== | ||
10 | |||
11 | A filesystem can now mount another filesystem on one of its directories by the | ||
12 | following procedure: | ||
13 | |||
14 | (1) Give the directory a follow_link() operation. | ||
15 | |||
16 | When the directory is accessed, the follow_link op will be called, and | ||
17 | it will be provided with the location of the mountpoint in the nameidata | ||
18 | structure (vfsmount and dentry). | ||
19 | |||
20 | (2) Have the follow_link() op do the following steps: | ||
21 | |||
22 | (a) Call do_kern_mount() to call the appropriate filesystem to set up a | ||
23 | superblock and gain a vfsmount structure representing it. | ||
24 | |||
25 | (b) Copy the nameidata provided as an argument and substitute the dentry | ||
26 | argument into the copy. | ||
27 | |||
28 | (c) Call do_add_mount() to install the new vfsmount into the namespace's | ||
29 | mountpoint tree, thus making it accessible to userspace. Use the | ||
30 | nameidata set up in (b) as the destination. | ||
31 | |||
32 | If the mountpoint will be automatically expired, then do_add_mount() | ||
33 | should also be given the location of an expiration list (see further | ||
34 | down). | ||
35 | |||
36 | (d) Release the path in the nameidata argument and substitute in the new | ||
37 | vfsmount and its root dentry. The ref counts on these will need | ||
38 | incrementing. | ||
39 | |||
40 | Then from userspace, you can just do something like: | ||
41 | |||
42 | [root@andromeda root]# mount -t afs \#root.afs. /afs | ||
43 | [root@andromeda root]# ls /afs | ||
44 | asd cambridge cambridge.redhat.com grand.central.org | ||
45 | [root@andromeda root]# ls /afs/cambridge | ||
46 | afsdoc | ||
47 | [root@andromeda root]# ls /afs/cambridge/afsdoc/ | ||
48 | ChangeLog html LICENSE pdf RELNOTES-1.2.2 | ||
49 | |||
50 | And then if you look in the mountpoint catalogue, you'll see something like: | ||
51 | |||
52 | [root@andromeda root]# cat /proc/mounts | ||
53 | ... | ||
54 | #root.afs. /afs afs rw 0 0 | ||
55 | #root.cell. /afs/cambridge.redhat.com afs rw 0 0 | ||
56 | #afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0 | ||
57 | |||
58 | |||
59 | =========================== | ||
60 | AUTOMATIC MOUNTPOINT EXPIRY | ||
61 | =========================== | ||
62 | |||
63 | Automatic expiration of mountpoints is easy, provided you've mounted the | ||
64 | mountpoint to be expired in the automounting procedure outlined above. | ||
65 | |||
66 | To do expiration, you need to follow these steps: | ||
67 | |||
68 | (3) Create at least one list off which the vfsmounts to be expired can be | ||
69 | hung. Access to this list will be governed by the vfsmount_lock. | ||
70 | |||
71 | (4) In step (2c) above, the call to do_add_mount() should be provided with a | ||
72 | pointer to this list. It will hang the vfsmount off of it if it succeeds. | ||
73 | |||
74 | (5) When you want mountpoints to be expired, call mark_mounts_for_expiry() | ||
75 | with a pointer to this list. This will process the list, marking every | ||
76 | vfsmount thereon for potential expiry on the next call. | ||
77 | |||
78 | If a vfsmount was already flagged for expiry, and if its usage count is 1 | ||
79 | (it's only referenced by its parent vfsmount), then it will be deleted | ||
80 | from the namespace and thrown away (effectively unmounted). | ||
81 | |||
82 | It may prove simplest to call this at regular intervals, using | ||
83 | some sort of timed event to drive it. | ||
84 | |||
85 | The expiration flag is cleared by calls to mntput. This means that expiration | ||
86 | will only happen on the second expiration request after the last time the | ||
87 | mountpoint was accessed. | ||
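
The two-pass behaviour described above can be modelled in userspace C. This
is a toy simulation only: struct toy_mount, toy_access() and
toy_mark_mounts_for_expiry() are hypothetical names and do not mirror the
kernel's data structures or locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a vfsmount on an expiration list (not the kernel struct). */
struct toy_mount {
    int  usage;         /* 1 means only the parent mount references it */
    bool expiry_mark;   /* set by one pass, checked by the next */
    bool mounted;
};

/* Any access to the mountpoint clears the mark, as mntput does above. */
void toy_access(struct toy_mount *m)
{
    assert(m != NULL);
    m->expiry_mark = false;
}

/*
 * One expiry pass over the list. Mounts that were already marked and are
 * unused get unmounted; all others are marked for the next pass.
 * Returns the number of mounts unmounted by this pass.
 */
size_t toy_mark_mounts_for_expiry(struct toy_mount *list, size_t n)
{
    size_t i, expired = 0;

    for (i = 0; i < n; i++) {
        struct toy_mount *m = &list[i];

        if (!m->mounted)
            continue;
        if (m->expiry_mark && m->usage == 1) {
            m->mounted = false;
            expired++;
        } else {
            m->expiry_mark = true;
        }
    }
    return expired;
}
```

In this model a mount is only removed by the second pass after its last
access: the first pass sets the mark, and an access in between clears it
again.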
88 | |||
89 | If a mountpoint is moved, it gets removed from the expiration list. If a bind | ||
90 | mount is made on an expirable mount, the new vfsmount will not be on the | ||
91 | expiration list and will not expire. | ||
92 | |||
93 | If a namespace is copied, all mountpoints contained therein will be copied, | ||
94 | and the copies of those that are on an expiration list will be added to the | ||
95 | same expiration list. | ||
96 | |||
97 | |||
98 | ======================= | ||
99 | USERSPACE DRIVEN EXPIRY | ||
100 | ======================= | ||
101 | |||
102 | As an alternative, it is possible for userspace to request expiry of any | ||
103 | mountpoint (though some will be rejected - the current process's idea of the | ||
104 | rootfs for example). It does this by passing the MNT_EXPIRE flag to | ||
105 | umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH. | ||
106 | |||
107 | If the mountpoint in question is referenced by something other than | ||
108 | umount() or its parent mountpoint, an EBUSY error will be returned and the | ||
109 | mountpoint will not be marked for expiration or unmounted. | ||
110 | |||
111 | If the mountpoint was not already marked for expiry at that time, an EAGAIN | ||
112 | error will be given and it won't be unmounted. | ||
113 | |||
114 | Otherwise if it was already marked and it wasn't referenced, unmounting will | ||
115 | take place as usual. | ||
116 | |||
117 | Again, the expiration flag is cleared every time anything other than umount() | ||
118 | looks at a mountpoint. | ||
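As a rough userspace sketch of the rules above: glibc exposes the flag-taking form of umount() as umount2(), declared in <sys/mount.h> together with MNT_EXPIRE. The enum and the two helper functions below are hypothetical names introduced only for this illustration, and the caller needs the privilege to unmount.

```c
#include <errno.h>
#include <sys/mount.h>

/* Hypothetical result codes for this sketch. */
enum expire_result {
    EXPIRE_UNMOUNTED,   /* already marked and unreferenced: unmounted */
    EXPIRE_MARKED,      /* EAGAIN: now marked for expiry, not unmounted */
    EXPIRE_BUSY,        /* EBUSY: referenced, neither marked nor unmounted */
    EXPIRE_ERROR        /* any other failure, e.g. EPERM or ENOENT */
};

/* Map the return value and errno of umount2() onto the cases in the text. */
static enum expire_result classify_expire(int rc, int err)
{
    if (rc == 0)
        return EXPIRE_UNMOUNTED;
    if (err == EAGAIN)
        return EXPIRE_MARKED;
    if (err == EBUSY)
        return EXPIRE_BUSY;
    return EXPIRE_ERROR;
}

/* Issue one expiry request for the mountpoint at 'target'. */
static enum expire_result request_expiry(const char *target)
{
    int rc = umount2(target, MNT_EXPIRE);

    return classify_expire(rc, rc == 0 ? 0 : errno);
}
```

A caller would typically invoke request_expiry() twice with some delay in between: the first call is expected to report EXPIRE_MARKED, and the second unmounts only if nothing touched the mountpoint in the meantime.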
diff --git a/Documentation/filesystems/befs.txt b/Documentation/filesystems/befs.txt new file mode 100644 index 000000000000..877a7b1d46ec --- /dev/null +++ b/Documentation/filesystems/befs.txt | |||
@@ -0,0 +1,117 @@ | |||
1 | BeOS filesystem for Linux | ||
2 | |||
3 | Document last updated: Dec 6, 2001 | ||
4 | |||
5 | WARNING | ||
6 | ======= | ||
7 | Make sure you understand that this is alpha software. This means that the | ||
8 | implementation is neither complete nor well-tested. | ||
9 | |||
10 | I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE! | ||
11 | |||
12 | LICENSE | ||
13 | ===== | ||
14 | This software is covered by the GNU General Public License. | ||
15 | See the file COPYING for the complete text of the license. | ||
16 | Or the GNU website: <http://www.gnu.org/licenses/licenses.html> | ||
17 | |||
18 | AUTHOR | ||
19 | ===== | ||
20 | The largest part of the code was written by Will Dyson <will_dyson@pobox.com>. | ||
21 | He has been working on the code since Aug 13, 2001. See the changelog for | ||
22 | details. | ||
23 | |||
24 | Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp> | ||
25 | His original code can still be found at: | ||
26 | <http://hp.vector.co.jp/authors/VA008030/bfs/> | ||
27 | Does anyone know of a more current email address for Makoto? He doesn't | ||
28 | respond to the address given above... | ||
29 | |||
30 | Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru> | ||
31 | |||
32 | WHAT IS THIS DRIVER? | ||
33 | ================== | ||
34 | This module implements the native filesystem of BeOS <http://www.be.com/> | ||
35 | for the linux 2.4.1 and later kernels. Currently it is a read-only | ||
36 | implementation. | ||
37 | |||
38 | Which is it, BFS or BEFS? | ||
39 | ================ | ||
40 | Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS". | ||
41 | But the UnixWare Boot Filesystem is called bfs, too, and it is already in | ||
42 | the kernel. Because of this naming conflict, on Linux the BeOS | ||
43 | filesystem is called befs. | ||
44 | |||
45 | HOW TO INSTALL | ||
46 | ============== | ||
47 | step 1. Install the BeFS patch into the source code tree of linux. | ||
48 | |||
49 | Apply the patchfile to your kernel source tree. | ||
50 | Assuming that your kernel source is in /foo/bar/linux and the patchfile | ||
51 | is called patch-befs-xxx, you would do the following: | ||
52 | |||
53 | cd /foo/bar/linux | ||
54 | patch -p1 < /path/to/patch-befs-xxx | ||
55 | |||
56 | If the patching step fails (i.e. there are rejected hunks), you can try to | ||
57 | figure it out yourself (it shouldn't be hard), or mail the maintainer | ||
58 | (Will Dyson <will_dyson@pobox.com>) for help. | ||
59 | |||
60 | step 2. Configure and build the kernel | ||
61 | |||
62 | The linux kernel has many compile-time options. Most of them are beyond the | ||
63 | scope of this document. I suggest the Kernel-HOWTO document as a good general | ||
64 | reference on this topic. <http://www.linux.com/howto/Kernel-HOWTO.html> | ||
65 | |||
66 | However, to use the BeFS module, you must enable it at configure time. | ||
67 | |||
68 | cd /foo/bar/linux | ||
69 | make menuconfig (or xconfig) | ||
70 | |||
71 | The BeFS module is not a standard part of the linux kernel, so you must first | ||
72 | enable support for experimental code under the "Code maturity level" menu. | ||
73 | |||
74 | Then, under the "Filesystems" menu will be an option called "BeFS | ||
75 | filesystem (experimental)", or something like that. Enable that option | ||
76 | (it is fine to make it a module). | ||
77 | |||
78 | Save your kernel configuration and then build your kernel. | ||
79 | |||
80 | step 3. Install | ||
81 | |||
82 | See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for | ||
83 | instructions on this critical step. | ||
84 | |||
85 | USING BFS | ||
86 | ========= | ||
87 | To use the BeOS filesystem, use filesystem type 'befs'. | ||
88 | |||
89 | ex) | ||
90 | mount -t befs /dev/fd0 /beos | ||
91 | |||
92 | MOUNT OPTIONS | ||
93 | ============= | ||
94 | uid=nnn All files in the partition will be owned by user id nnn. | ||
95 | gid=nnn All files in the partition will be in group nnn. | ||
96 | iocharset=xxx Use xxx as the name of the NLS translation table. | ||
97 | debug The driver will output debugging information to the syslog. | ||
98 | |||
99 | HOW TO GET THE LATEST VERSION | ||
100 | ============================= | ||
101 | |||
102 | The latest version is currently available at: | ||
103 | <http://befs-driver.sourceforge.net/> | ||
104 | |||
105 | ANY KNOWN BUGS? | ||
106 | =========== | ||
107 | As of Jan 20, 2002: | ||
108 | |||
109 | None | ||
110 | |||
111 | SPECIAL THANKS | ||
112 | ============== | ||
113 | Dominic Giampaolo ... Writing "Practical File System Design with the Be File System" | ||
114 | Hiroyuki Yamada ... Testing LinuxPPC. | ||
115 | |||
116 | |||
117 | |||
diff --git a/Documentation/filesystems/bfs.txt b/Documentation/filesystems/bfs.txt new file mode 100644 index 000000000000..d2841e0bcf02 --- /dev/null +++ b/Documentation/filesystems/bfs.txt | |||
@@ -0,0 +1,57 @@ | |||
1 | BFS FILESYSTEM FOR LINUX | ||
2 | ======================== | ||
3 | |||
4 | The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which | ||
5 | usually contains the kernel image and a few other files required for the | ||
6 | boot process. | ||
7 | |||
8 | In order to access the /stand partition under Linux you need to | ||
9 | know the partition number and the kernel must support UnixWare disk slices | ||
10 | (CONFIG_UNIXWARE_DISKLABEL config option). However, BFS support does not | ||
11 | depend on having UnixWare disklabel support because one can also mount | ||
12 | a BFS filesystem via loopback: | ||
13 | |||
14 | # losetup /dev/loop0 stand.img | ||
15 | # mount -t bfs /dev/loop0 /mnt/stand | ||
16 | |||
17 | where stand.img is a file containing the image of the BFS filesystem. | ||
18 | When you have finished using it and unmounted it, you also need to deallocate | ||
19 | the /dev/loop0 device with: | ||
20 | |||
21 | # losetup -d /dev/loop0 | ||
22 | |||
23 | You can simplify mounting by just typing: | ||
24 | |||
25 | # mount -t bfs -o loop stand.img /mnt/stand | ||
26 | |||
27 | This will allocate the first available loopback device (and load the loop.o | ||
28 | kernel module if necessary) automatically. If the loopback driver is not | ||
29 | loaded automatically, make sure that your kernel is compiled with kmod | ||
30 | support (CONFIG_KMOD) enabled. Beware that umount will not | ||
31 | deallocate the /dev/loopN device if the /etc/mtab file on your system is a | ||
32 | symbolic link to /proc/mounts. You will need to do it manually using the | ||
33 | "-d" switch of losetup(8). Read the losetup(8) manpage for more info. | ||
34 | |||
35 | To create the BFS image under UnixWare you need to find out first which | ||
36 | slice contains it. The command prtvtoc(1M) is your friend: | ||
37 | |||
38 | # prtvtoc /dev/rdsk/c0b0t0d0s0 | ||
39 | |||
40 | (assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you | ||
41 | look for the slice with tag "STAND", which is usually slice 10. With this | ||
42 | information you can use dd(1) to create the BFS image: | ||
43 | |||
44 | # umount /stand | ||
45 | # dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512 | ||
46 | |||
47 | Just in case, you can verify that you have done the right thing by checking | ||
48 | the magic number: | ||
49 | |||
50 | # od -Ad -tx4 stand.img | more | ||
51 | |||
52 | The first 4 bytes should be 0x1badface. | ||
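The same check can be done programmatically. The sketch below assumes the magic is stored as a little-endian 32-bit word, which is what the od -tx4 output above corresponds to on an x86 host; looks_like_bfs() and BFS_MAGIC_SKETCH are names made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Magic number quoted in the text. */
#define BFS_MAGIC_SKETCH 0x1badfaceU

/*
 * Return 1 if the first four bytes of 'buf' hold the BFS magic,
 * assuming little-endian on-disk byte order (an assumption of this sketch).
 */
static int looks_like_bfs(const unsigned char *buf, size_t len)
{
    uint32_t magic;

    if (len < 4)
        return 0;
    magic = (uint32_t)buf[0]
          | ((uint32_t)buf[1] << 8)
          | ((uint32_t)buf[2] << 16)
          | ((uint32_t)buf[3] << 24);
    return magic == BFS_MAGIC_SKETCH;
}
```

A caller would read the first four bytes of stand.img into a buffer and pass them to looks_like_bfs() before attempting to mount the image.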
53 | |||
54 | If you have any patches, questions or suggestions regarding this BFS | ||
55 | implementation please contact the author: | ||
56 | |||
57 | Tigran A. Aivazian <tigran@veritas.com> | ||
diff --git a/Documentation/filesystems/cifs.txt b/Documentation/filesystems/cifs.txt new file mode 100644 index 000000000000..49cc923a93e3 --- /dev/null +++ b/Documentation/filesystems/cifs.txt | |||
@@ -0,0 +1,51 @@ | |||
1 | This is the client VFS module for the Common Internet File System | ||
2 | (CIFS) protocol which is the successor to the Server Message Block | ||
3 | (SMB) protocol, the native file sharing mechanism for most early | ||
4 | PC operating systems. CIFS is fully supported by current network | ||
5 | file servers such as Windows 2000, Windows 2003 (including | ||
6 | Windows XP) as well as by Samba (which provides excellent CIFS | ||
7 | server support for Linux and many other operating systems), so | ||
8 | this network filesystem client can mount to a wide variety of | ||
9 | servers. The smbfs module should be used instead of this cifs module | ||
10 | for mounting to older SMB servers such as OS/2. The smbfs and cifs | ||
11 | modules can coexist and do not conflict. The CIFS VFS filesystem | ||
12 | module is designed to work well with servers that implement the | ||
13 | newer versions (dialects) of the SMB/CIFS protocol such as Samba, | ||
14 | the program written by Andrew Tridgell that turns any Unix host | ||
15 | into a SMB/CIFS file server. | ||
16 | |||
17 | The intent of this module is to provide the most advanced network | ||
18 | file system function for CIFS compliant servers, including better | ||
19 | POSIX compliance, secure per-user session establishment, high | ||
20 | performance safe distributed caching (oplock), optional packet | ||
21 | signing, large files, Unicode support and other internationalization | ||
22 | improvements. Since both Samba server and this filesystem client support | ||
23 | the CIFS Unix extensions, the combination can provide a reasonable | ||
24 | alternative to NFSv4 for fileserving in some Linux to Linux environments, | ||
25 | not just in Linux to Windows environments. | ||
26 | |||
27 | This filesystem has an optional mount utility (mount.cifs) that can | ||
28 | be obtained from the project page and installed in the path in the same | ||
29 | directory with the other mount helpers (such as mount.smbfs). | ||
30 | Mounting using the cifs filesystem without installing the mount helper | ||
31 | requires specifying the server's ip address. | ||
32 | |||
33 | For Linux 2.4: | ||
34 | mount //anything/here /mnt_target -o | ||
35 | user=username,pass=password,unc=//ip_address_of_server/sharename | ||
36 | |||
37 | For Linux 2.5: | ||
38 | mount //ip_address_of_server/sharename /mnt_target -o user=username,pass=password | ||
39 | |||
40 | |||
41 | For more information on the module see the project page at | ||
42 | |||
43 | http://us1.samba.org/samba/Linux_CIFS_client.html | ||
44 | |||
45 | For more information on CIFS see: | ||
46 | |||
47 | http://www.snia.org/tech_activities/CIFS | ||
48 | |||
49 | or the Samba site: | ||
50 | |||
51 | http://www.samba.org | ||
diff --git a/Documentation/filesystems/coda.txt b/Documentation/filesystems/coda.txt new file mode 100644 index 000000000000..61311356025d --- /dev/null +++ b/Documentation/filesystems/coda.txt | |||
@@ -0,0 +1,1673 @@ | |||
1 | NOTE: | ||
2 | This is one of the technical documents describing a component of | ||
3 | Coda -- this document describes the client kernel-Venus interface. | ||
4 | |||
5 | For more information: | ||
6 | http://www.coda.cs.cmu.edu | ||
7 | For user level software needed to run Coda: | ||
8 | ftp://ftp.coda.cs.cmu.edu | ||
9 | |||
10 | To run Coda you need to get a user level cache manager for the client, | ||
11 | named Venus, as well as tools to manipulate ACLs, to log in, etc. The | ||
12 | client needs to have the Coda filesystem selected in the kernel | ||
13 | configuration. | ||
14 | |||
15 | The server needs a user level server and at present does not depend on | ||
16 | kernel support. | ||
17 | |||
18 | |||
19 | |||
20 | |||
21 | |||
22 | |||
23 | |||
24 | The Venus kernel interface | ||
25 | Peter J. Braam | ||
26 | v1.0, Nov 9, 1997 | ||
27 | |||
28 | This document describes the communication between Venus and kernel | ||
29 | level filesystem code needed for the operation of the Coda file system. | ||
30 | This document version is meant to describe the current interface | ||
31 | (version 1.0) as well as improvements we envisage. | ||
32 | ______________________________________________________________________ | ||
33 | |||
34 | Table of Contents | ||
35 | |||
36 | |||
90 | 1. Introduction | ||
91 | |||
92 | 2. Servicing Coda filesystem calls | ||
93 | |||
94 | 3. The message layer | ||
95 | |||
96 | 3.1 Implementation details | ||
97 | |||
98 | 4. The interface at the call level | ||
99 | |||
100 | 4.1 Data structures shared by the kernel and Venus | ||
101 | 4.2 The pioctl interface | ||
102 | 4.3 root | ||
103 | 4.4 lookup | ||
104 | 4.5 getattr | ||
105 | 4.6 setattr | ||
106 | 4.7 access | ||
107 | 4.8 create | ||
108 | 4.9 mkdir | ||
109 | 4.10 link | ||
110 | 4.11 symlink | ||
111 | 4.12 remove | ||
112 | 4.13 rmdir | ||
113 | 4.14 readlink | ||
114 | 4.15 open | ||
115 | 4.16 close | ||
116 | 4.17 ioctl | ||
117 | 4.18 rename | ||
118 | 4.19 readdir | ||
119 | 4.20 vget | ||
120 | 4.21 fsync | ||
121 | 4.22 inactive | ||
122 | 4.23 rdwr | ||
123 | 4.24 odymount | ||
124 | 4.25 ody_lookup | ||
125 | 4.26 ody_expand | ||
126 | 4.27 prefetch | ||
127 | 4.28 signal | ||
128 | |||
129 | 5. The minicache and downcalls | ||
130 | |||
131 | 5.1 INVALIDATE | ||
132 | 5.2 FLUSH | ||
133 | 5.3 PURGEUSER | ||
134 | 5.4 ZAPFILE | ||
135 | 5.5 ZAPDIR | ||
136 | 5.6 ZAPVNODE | ||
137 | 5.7 PURGEFID | ||
138 | 5.8 REPLACE | ||
139 | |||
140 | 6. Initialization and cleanup | ||
141 | |||
142 | 6.1 Requirements | ||
143 | |||
144 | |||
145 | ______________________________________________________________________ | ||
147 | |||
148 | 1. Introduction | ||
149 | |||
150 | |||
151 | |||
152 | A key component in the Coda Distributed File System is the cache | ||
153 | manager, Venus. | ||
154 | |||
155 | |||
156 | When processes on a Coda enabled system access files in the Coda | ||
157 | filesystem, requests are directed at the filesystem layer in the | ||
158 | operating system. The operating system will communicate with Venus to | ||
159 | service the request for the process. Venus manages a persistent | ||
160 | client cache and makes remote procedure calls to Coda file servers and | ||
161 | related servers (such as authentication servers) to service these | ||
162 | requests it receives from the operating system. When Venus has | ||
163 | serviced a request it replies to the operating system with appropriate | ||
164 | return codes, and other data related to the request. Optionally the | ||
165 | kernel support for Coda may maintain a minicache of recently processed | ||
166 | requests to limit the number of interactions with Venus. Venus | ||
167 | possesses the facility to inform the kernel when elements from its | ||
168 | minicache are no longer valid. | ||
169 | |||
170 | This document describes precisely this communication between the | ||
171 | kernel and Venus. The definitions of so called upcalls and downcalls | ||
172 | will be given with the format of the data they handle. We shall also | ||
173 | describe the semantic invariants resulting from the calls. | ||
174 | |||
175 | Historically Coda was implemented in a BSD file system in Mach 2.6. | ||
176 | The interface between the kernel and Venus is very similar to the BSD | ||
177 | VFS interface. Similar functionality is provided, and the format of | ||
178 | the parameters and returned data is very similar to the BSD VFS. This | ||
179 | leads to an almost natural environment for implementing a kernel-level | ||
180 | filesystem driver for Coda in a BSD system. However, other operating | ||
181 | systems such as Linux, Windows 95 and Windows NT have virtual filesystems | ||
182 | with different interfaces. | ||
183 | |||
184 | To implement Coda on these systems some reverse engineering of the | ||
185 | Venus/Kernel protocol is necessary. Also it came to light that other | ||
186 | systems could profit significantly from certain small optimizations | ||
187 | and modifications to the protocol. To facilitate this work as well as | ||
188 | to make future ports easier, communication between Venus and the | ||
189 | kernel should be documented in great detail. This is the aim of this | ||
190 | document. | ||
191 | |||
193 | |||
194 | 2. Servicing Coda filesystem calls | ||
195 | |||
196 | The service of a request for a Coda file system service originates in | ||
197 | a process P which is accessing a Coda file. It makes a system call which | ||
198 | traps to the OS kernel. Examples of such calls trapping to the kernel | ||
199 | are read, write, open, close, create, mkdir, rmdir, chmod in a Unix | ||
200 | context. Similar calls exist in the Win32 environment, and are named | ||
201 | CreateFile, . | ||
202 | |||
203 | Generally the operating system handles the request in a virtual | ||
204 | filesystem (VFS) layer, which is named I/O Manager in NT and IFS | ||
205 | manager in Windows 95. The VFS is responsible for partial processing | ||
206 | of the request and for locating the specific filesystem(s) which will | ||
207 | service parts of the request. Usually the information in the path | ||
208 | assists in locating the correct FS drivers. Sometimes after extensive | ||
209 | pre-processing, the VFS starts invoking exported routines in the FS | ||
210 | driver. This is the point where the FS specific processing of the | ||
211 | request starts, and here the Coda specific kernel code comes into | ||
212 | play. | ||
213 | |||
214 | The FS layer for Coda must expose and implement several interfaces. | ||
215 | First and foremost the VFS must be able to make all necessary calls to | ||
216 | the Coda FS layer, so the Coda FS driver must expose the VFS interface | ||
217 | as applicable in the operating system. These differ very significantly | ||
218 | among operating systems, but share features such as facilities to | ||
219 | read/write and create and remove objects. The Coda FS layer services | ||
220 | such VFS requests by invoking one or more well defined services | ||
221 | offered by the cache manager Venus. When the replies from Venus have | ||
222 | come back to the FS driver, servicing of the VFS call continues and | ||
223 | finishes with a reply to the kernel's VFS. Finally the VFS layer | ||
224 | returns to the process. | ||
225 | |||
226 | As a result of this design a basic interface exposed by the FS driver | ||
227 | must allow Venus to manage message traffic. In particular Venus must | ||
228 | be able to retrieve and place messages and to be notified of the | ||
229 | arrival of a new message. The notification must be through a mechanism | ||
230 | which does not block Venus since Venus must attend to other tasks even | ||
231 | when no messages are waiting or being processed. | ||
232 | |||
238 | [Figure: Interfaces of the Coda FS Driver] | ||
239 | |||
240 | Furthermore the FS layer provides for a special path of communication | ||
241 | between a user process and Venus, called the pioctl interface. The | ||
242 | pioctl interface is used for Coda specific services, such as | ||
243 | requesting detailed information about the persistent cache managed by | ||
244 | Venus. Here the involvement of the kernel is minimal. It identifies | ||
245 | the calling process and passes the information on to Venus. When | ||
246 | Venus replies the response is passed back to the caller in unmodified | ||
247 | form. | ||
248 | |||
249 | Finally Venus allows the kernel FS driver to cache the results from | ||
250 | certain services. This is done to avoid excessive context switches | ||
251 | and results in an efficient system. However, Venus may acquire | ||
252 | information, for example from the network, which implies that cached | ||
253 | information must be flushed or replaced. Venus then makes a downcall | ||
254 | to the Coda FS layer to request flushes or updates in the cache. The | ||
255 | kernel FS driver handles such requests synchronously. | ||
256 | |||
257 | Among these interfaces the VFS interface and the facility to place, | ||
258 | receive and be notified of messages are platform specific. We will | ||
259 | not go into the calls exported to the VFS layer but we will state the | ||
260 | requirements of the message exchange mechanism. | ||
261 | |||
263 | |||
264 | 3. The message layer | ||
265 | |||
266 | |||
267 | |||
268 | At the lowest level the communication between Venus and the FS driver | ||
269 | proceeds through messages. The synchronization between processes | ||
270 | requesting Coda file service and Venus relies on blocking and waking | ||
271 | up processes. The Coda FS driver processes VFS- and pioctl-requests | ||
272 | on behalf of a process P, creates messages for Venus, awaits replies | ||
273 | and finally returns to the caller. The implementation of the exchange | ||
274 | of messages is platform specific, but the semantics have (so far) | ||
275 | appeared to be generally applicable. Data buffers are created by the | ||
276 | FS Driver in kernel memory on behalf of P and copied to user memory in | ||
277 | Venus. | ||
278 | |||
279 | The FS Driver while servicing P makes upcalls to Venus. Such an | ||
280 | upcall is dispatched to Venus by creating a message structure. The | ||
281 | structure contains the identification of P, the message sequence | ||
282 | number, the size of the request and a pointer to the data in kernel | ||
283 | memory for the request. Since the data buffer is re-used to hold the | ||
284 | reply from Venus, there is a field for the size of the reply. A flags | ||
285 | field is used in the message to precisely record the status of the | ||
286 | message. Additional platform dependent structures involve pointers to | ||
287 | determine the position of the message on queues and pointers to | ||
288 | synchronization objects. In the upcall routine the message structure | ||
289 | is filled in, flags are set to 0, and it is placed on the pending | ||
290 | queue. The routine calling upcall is responsible for allocating the | ||
291 | data buffer; its structure will be described in the next section. | ||
292 | |||
293 | A facility must exist to notify Venus that the message has been | ||
294 | created; it is implemented using available synchronization objects in | ||
295 | the OS. This notification is done in the upcall context of the process | ||
296 | P. When the message is on the pending queue, process P cannot proceed | ||
297 | in upcall. The (kernel mode) processing of P in the filesystem | ||
298 | request routine must be suspended until Venus has replied. Therefore | ||
299 | the calling thread in P is blocked in upcall. A pointer in the | ||
300 | message structure will locate the synchronization object on which P is | ||
301 | sleeping. | ||
302 | |||
303 | Venus detects the notification that a message has arrived, and the FS | ||
304 | driver allows Venus to retrieve the message with a getmsg_from_kernel | ||
305 | call. This action finishes in the kernel by putting the message on the | ||
306 | queue of processing messages and setting flags to READ. Venus is | ||
307 | passed the contents of the data buffer. The getmsg_from_kernel call | ||
308 | now returns and Venus processes the request. | ||
309 | |||
310 | At some later point the FS driver receives a message from Venus, | ||
311 | namely when Venus calls sendmsg_to_kernel. At this moment the Coda FS | ||
312 | driver looks at the contents of the message and decides if: | ||
313 | |||
314 | |||
315 | o The message is a reply for a suspended thread P. If so it removes | ||
316 | the message from the processing queue and marks the message as | ||
317 | WRITTEN. Finally, the FS driver unblocks P (still in the kernel | ||
318 | mode context of Venus) and the sendmsg_to_kernel call returns to | ||
319 | Venus. The process P will be scheduled at some point and continues | ||
320 | processing its upcall with the data buffer replaced with the reply | ||
321 | from Venus. | ||
322 | |||
323 | o The message is a downcall. A downcall is a request from Venus to | ||
324 | the FS Driver. The FS driver processes the request immediately | ||
325 | (usually a cache eviction or replacement) and when it finishes | ||
326 | sendmsg_to_kernel returns. | ||
327 | |||
328 | Now P awakes and continues processing upcall. There are some | ||
329 | subtleties to take account of. First P will determine if it was woken | ||
330 | up in upcall by a signal from some other source (for example an | ||
331 | attempt to terminate P) or as is normally the case by Venus in its | ||
332 | sendmsg_to_kernel call. In the normal case, the upcall routine will | ||
333 | deallocate the message structure and return. The FS routine can proceed | ||
334 | with its processing. | ||
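The life cycle of a single message described above (created in upcall with flags 0, marked READ when Venus fetches it, marked WRITTEN when the reply arrives) can be sketched as a small state machine. Everything here is illustrative: struct sketch_msg, the SKETCH_* values and the three functions are hypothetical names, and queues and sleeping are left out entirely.

```c
#include <stddef.h>

/* Hypothetical encoding of the flags field; the text only names READ and WRITTEN. */
enum sketch_state { SKETCH_PENDING = 0, SKETCH_READ = 1, SKETCH_WRITTEN = 2 };

/* Hypothetical message structure with the fields listed in the text. */
struct sketch_msg {
    int               pid;         /* identification of P */
    unsigned long     seq;         /* message sequence number */
    size_t            req_size;    /* size of the request */
    size_t            reply_size;  /* size of the reply, filled in later */
    void             *data;        /* buffer, reused to hold the reply */
    enum sketch_state flags;
};

/* upcall: fill in the message and leave it pending (flags == 0). */
static void sketch_upcall(struct sketch_msg *m, int pid, unsigned long seq,
                          void *data, size_t req_size)
{
    m->pid = pid;
    m->seq = seq;
    m->data = data;
    m->req_size = req_size;
    m->reply_size = 0;
    m->flags = SKETCH_PENDING;
}

/* getmsg_from_kernel: Venus fetches a pending message, which becomes READ. */
static int sketch_getmsg(struct sketch_msg *m)
{
    if (m->flags != SKETCH_PENDING)
        return -1;
    m->flags = SKETCH_READ;
    return 0;
}

/* sendmsg_to_kernel: the reply arrives for a READ message, which becomes WRITTEN. */
static int sketch_sendmsg(struct sketch_msg *m, size_t reply_size)
{
    if (m->flags != SKETCH_READ)
        return -1;
    m->reply_size = reply_size;
    m->flags = SKETCH_WRITTEN;
    return 0;
}
```

In a real driver the transitions would happen under a lock and would be paired with moving the message between the pending and processing queues and with waking P.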
335 | |||
336 | |||
342 | [Figure: Sleeping and IPC arrangements] | ||
343 | |||
344 | In case P is woken up by a signal and not by Venus, it will first look | ||
345 | at the flags field. If the message is not yet READ, the process P can | ||
346 | handle its signal without notifying Venus. If Venus has READ, and | ||
347 | the request should not be processed, P can send Venus a signal message | ||
348 | to indicate that it should disregard the previous message. Such | ||
349 | signals are put in the queue at the head, and read first by Venus. If | ||
350 | the message is already marked as WRITTEN it is too late to stop the | ||
351 | processing. The VFS routine will now continue. (-- If a VFS request | ||
352 | involves more than one upcall, this can lead to complicated state; an | ||
353 | extra field "handle_signals" could be added in the message structure | ||
354 | to indicate points of no return have been passed.--) | ||
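The decision P makes when it is woken by a signal rather than by Venus depends only on the flags field, so it can be written as a single switch. This is a sketch of the rule stated above; the enum names and on_interrupted_upcall() are hypothetical.

```c
/* Hypothetical encoding of the message flags. */
enum toy_flags { TOY_PENDING, TOY_READ, TOY_WRITTEN };

/* What P does after being woken by a signal instead of by Venus. */
enum toy_action {
    ACT_HANDLE_LOCALLY,   /* Venus never saw the message: no notification needed */
    ACT_SEND_SIGNAL_MSG,  /* Venus has READ it: queue a signal message at the head */
    ACT_TOO_LATE          /* reply already WRITTEN: processing cannot be stopped */
};

static enum toy_action on_interrupted_upcall(enum toy_flags flags)
{
    switch (flags) {
    case TOY_PENDING:
        return ACT_HANDLE_LOCALLY;
    case TOY_READ:
        return ACT_SEND_SIGNAL_MSG;
    case TOY_WRITTEN:
        return ACT_TOO_LATE;
    }
    return ACT_TOO_LATE;  /* unknown state: treat as past the point of no return */
}
```

Note that in the READ case the text says P "can" send the signal message only if the request should not be processed; the sketch returns the action without modelling that extra condition.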
355 | |||
356 | |||
357 | |||
358 | 3.1. Implementation details | ||
359 | |||
360 | The Unix implementation of this mechanism has been through the | ||
361 | implementation of a character device associated with Coda. Venus | ||
362 | retrieves messages by doing a read on the device, replies are sent | ||
363 | with a write and notification is through the select system call on the | ||
364 | file descriptor for the device. The process P is kept waiting on an | ||
365 | interruptible wait queue object. | ||
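On the Venus side, the Unix arrangement amounts to a select() followed by a read() on the device's file descriptor. The helper below is a minimal sketch of that pattern using only POSIX calls; wait_and_read() is a hypothetical name, and the descriptor is assumed to have been opened on the Coda character device by the caller (the device path is deliberately not assumed here).

```c
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for 'fd' to become readable, then read one message.
 * Returns the number of bytes read, 0 on timeout (or end of file),
 * and -1 on error.
 */
static ssize_t wait_and_read(int fd, void *buf, size_t len, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int ready;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready <= 0)
        return ready;            /* 0: timeout, -1: error */
    return read(fd, buf, len);   /* message arrival was signalled */
}
```

Using a timeout keeps the caller free to attend to other work when no message is waiting, which is the non-blocking requirement stated earlier for the notification mechanism.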
366 | |||
367 | In Windows NT and the DPMI Windows 95 implementation a DeviceIoControl | ||
368 | call is used. The DeviceIoControl call is designed to copy buffers | ||
369 | from user memory to kernel memory with OPCODES. The sendmsg_to_kernel | ||
370 | is issued as a synchronous call, while the getmsg_from_kernel call is | ||
371 | asynchronous. Windows EventObjects are used for notification of | ||
372 | message arrival. The process P is kept waiting on a KernelEvent | ||
373 | object in NT and a semaphore in Windows 95. | ||
374 | |||
376 | |||
377 | 4. The interface at the call level | ||
378 | |||
379 | |||
380 | This section describes the upcalls a Coda FS driver can make to Venus. | ||
381 | Each of these upcalls makes use of two structures: inputArgs and | ||
382 | outputArgs. In pseudo BNF form the structures take the following | ||
383 | form: | ||
384 | |||
385 | |||
386 | struct inputArgs { | ||
387 | u_long opcode; | ||
388 | u_long unique; /* Keep multiple outstanding msgs distinct */ | ||
389 | u_short pid; /* Common to all */ | ||
390 | u_short pgid; /* Common to all */ | ||
391 | struct CodaCred cred; /* Common to all */ | ||
392 | |||
393 | <union "in" of call dependent parts of inputArgs> | ||
394 | }; | ||
395 | |||
396 | struct outputArgs { | ||
397 | u_long opcode; | ||
398 | u_long unique; /* Keep multiple outstanding msgs distinct */ | ||
399 | u_long result; | ||
400 | |||
401 | <union "out" of call dependent parts of outputArgs> | ||
402 | }; | ||
403 | |||
404 | |||
405 | |||
406 | Before going on let us elucidate the role of the various fields. The | ||
407 | inputArgs start with the opcode which defines the type of service | ||
408 | requested from Venus. There are approximately 30 upcalls at present | ||
409 | which we will discuss. The unique field labels the inputArg with a | ||
410 | unique number which will identify the message uniquely. A process and | ||
411 | process group id are passed. Finally the credentials of the caller | ||
412 | are included. | ||
413 | |||
414 | Before delving into the specific calls we need to discuss a variety of | ||
415 | data structures shared by the kernel and Venus. | ||
416 | |||
417 | |||
418 | |||
419 | |||
420 | 4.1. Data structures shared by the kernel and Venus | ||
421 | |||
422 | |||
423 | The CodaCred structure defines a variety of user and group ids as | ||
424 | they are set for the calling process. The vuid_t and vgid_t are 32 bit | ||
425 | unsigned integers. It also defines group membership in an array. On | ||
426 | Unix the CodaCred has proven sufficient to implement good security | ||
427 | semantics for Coda but the structure may have to undergo modification | ||
428 | for the Windows environment when these mature. | ||
429 | |||
430 | struct CodaCred { | ||
431 | vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid*/ | ||
432 | vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */ | ||
433 | vgid_t cr_groups[NGROUPS]; /* Group membership for caller */ | ||
434 | }; | ||
435 | |||
436 | |||
437 | |||
438 | NOTE It is questionable if we need CodaCreds in Venus. Finally Venus | ||
439 | doesn't know about groups, although it does create files with the | ||
440 | default uid/gid. Perhaps the list of group membership is superfluous. | ||
441 | |||
442 | |||
443 | The next item is the fundamental identifier used to identify Coda | ||
444 | files, the ViceFid. A fid of a file uniquely defines a file or | ||
445 | directory in the Coda filesystem within a cell. (-- A cell is a | ||
446 | group of Coda servers acting under the aegis of a single system | ||
447 | control machine or SCM. See the Coda Administration manual for a | ||
448 | detailed description of the role of the SCM.--) | ||
449 | |||
450 | |||
451 | typedef struct ViceFid { | ||
452 | VolumeId Volume; | ||
453 | VnodeId Vnode; | ||
454 | Unique_t Unique; | ||
455 | } ViceFid; | ||
456 | |||
457 | |||
458 | |||
459 | Each of the constituent fields: VolumeId, VnodeId and Unique_t are | ||
460 | unsigned 32 bit integers. We envisage that a further field will need | ||
461 | to be prefixed to identify the Coda cell; this will probably take the | ||
462 | form of an IPv6 size IP address naming the Coda cell through DNS. | ||
463 | |||
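Since a fid is just three unsigned 32 bit integers, comparing two fids is a field-by-field test. The sketch below maps the three typedefs onto uint32_t, which is an assumption consistent with the description above; fid_equal() is a hypothetical helper name.

```c
#include <stdint.h>

/* Assumed mapping of the constituent types onto 32 bit unsigned integers. */
typedef uint32_t VolumeId;
typedef uint32_t VnodeId;
typedef uint32_t Unique_t;

typedef struct ViceFid {
    VolumeId Volume;
    VnodeId  Vnode;
    Unique_t Unique;
} ViceFid;

/* Two fids name the same object within a cell only if all three fields match. */
static int fid_equal(const ViceFid *a, const ViceFid *b)
{
    return a->Volume == b->Volume &&
           a->Vnode  == b->Vnode  &&
           a->Unique == b->Unique;
}
```

If a cell prefix is added to the fid later, as envisaged above, the comparison would need to include it as well.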
464 | The next important structure shared between Venus and the kernel is | ||
465 | the attributes of the file. The following structure is used to | ||
466 | exchange information. It has room for future extensions such as | ||
467 | support for device files (currently not present in Coda). | ||
468 | |||
469 | |||
486 | struct coda_vattr { | ||
487 | enum coda_vtype va_type; /* vnode type (for create) */ | ||
488 | u_short va_mode; /* files access mode and type */ | ||
489 | short va_nlink; /* number of references to file */ | ||
490 | vuid_t va_uid; /* owner user id */ | ||
491 | vgid_t va_gid; /* owner group id */ | ||
492 | long va_fsid; /* file system id (dev for now) */ | ||
493 | long va_fileid; /* file id */ | ||
494 | u_quad_t va_size; /* file size in bytes */ | ||
495 | long va_blocksize; /* blocksize preferred for i/o */ | ||
496 | struct timespec va_atime; /* time of last access */ | ||
497 | struct timespec va_mtime; /* time of last modification */ | ||
498 | struct timespec va_ctime; /* time file changed */ | ||
499 | u_long va_gen; /* generation number of file */ | ||
500 | u_long va_flags; /* flags defined for file */ | ||
501 | dev_t va_rdev; /* device special file represents */ | ||
502 | u_quad_t va_bytes; /* bytes of disk space held by file */ | ||
503 | u_quad_t va_filerev; /* file modification number */ | ||
504 | u_int va_vaflags; /* operations flags, see below */ | ||
505 | long va_spare; /* remain quad aligned */ | ||
506 | }; | ||
507 | |||
508 | |||
509 | |||
510 | |||
511 | 4.2. The pioctl interface | ||
512 | |||
513 | |||
514 | Coda specific requests can be made by applications through the pioctl | ||
515 | interface. The pioctl is implemented as an ordinary ioctl on a | ||
516 | fictitious file /coda/.CONTROL. The pioctl call opens this file, gets | ||
517 | a file handle and makes the ioctl call. Finally it closes the file. | ||
518 | |||
519 | The kernel involvement in this is limited to providing the facility to | ||
520 | open and close and pass the ioctl message and to verify that a path in | ||
521 | the pioctl data buffers is a file in a Coda filesystem. | ||
522 | |||
523 | The kernel is handed a data packet of the form: | ||
524 | |||
525 | struct { | ||
526 | const char *path; | ||
527 | struct ViceIoctl vidata; | ||
528 | int follow; | ||
529 | } data; | ||
530 | |||
531 | |||
532 | |||
533 | where | ||
534 | |||
535 | |||
536 | struct ViceIoctl { | ||
537 | caddr_t in, out; /* Data to be transferred in, or out */ | ||
538 | short in_size; /* Size of input buffer <= 2K */ | ||
539 | short out_size; /* Maximum size of output buffer, <= 2K */ | ||
540 | }; | ||
541 | |||
542 | |||
543 | |||
544 | The path must be a Coda file, otherwise the ioctl upcall will not be | ||
545 | made. | ||
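
The open/ioctl/close sequence described above might look roughly like
the sketch below. It is not the Coda library's pioctl: the function
name pioctl_sketch, the struct tag pioctl_data and the caller-supplied
request number cmd are assumptions made for illustration, and caddr_t
is spelled char * to keep the example portable.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Mirrors the structures in the text; caddr_t is written as char *. */
struct ViceIoctl {
    char *in, *out;     /* Data to be transferred in, or out */
    short in_size;      /* Size of input buffer <= 2K */
    short out_size;     /* Maximum size of output buffer, <= 2K */
};

struct pioctl_data {    /* hypothetical tag; the text leaves it anonymous */
    const char *path;
    struct ViceIoctl vidata;
    int follow;
};

/*
 * Open the control file, issue the ioctl, close the file again.
 * control_file would normally be "/coda/.CONTROL"; cmd is an ioctl
 * request number chosen by the caller (an assumption of this sketch).
 * Returns the ioctl result, or -1 with errno set on failure.
 */
static int pioctl_sketch(const char *control_file, unsigned long cmd,
                         const char *path, const struct ViceIoctl *vi,
                         int follow)
{
    struct pioctl_data data;
    int fd, rc, saved_errno;

    fd = open(control_file, O_RDONLY);
    if (fd < 0)
        return -1;

    data.path = path;
    data.vidata = *vi;
    data.follow = follow;

    rc = ioctl(fd, cmd, &data);
    saved_errno = errno;        /* close() must not clobber the error */
    close(fd);
    errno = saved_errno;
    return rc;
}
```

Passing the control file as a parameter is a choice made for this
sketch so that it can be exercised on a machine without Coda mounted.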
546 | |||
547 | NOTE The data structures and code are a mess. We need to clean this | ||
548 | up. | ||
549 | |||
550 | We now proceed to document the individual calls: | ||
551 | |||
553 | |||
554 | 4.3. root | ||
555 | |||
556 | |||
557 | Arguments | ||
558 | |||
559 | in empty | ||
560 | |||
561 | out | ||
562 | |||
563 | struct cfs_root_out { | ||
564 | ViceFid VFid; | ||
565 | } cfs_root; | ||
566 | |||
567 | |||
568 | |||
569 | Description This call is made to Venus during the initialization of | ||
570 | the Coda filesystem. If the result is zero, the cfs_root structure | ||
571 | contains the ViceFid of the root of the Coda filesystem. If a non-zero | ||
572 | result is generated, its value is a platform dependent error code | ||
573 | indicating the difficulty Venus encountered in locating the root of | ||
574 | the Coda filesystem. | ||
575 | |||
577 | |||
578 | 4.4. lookup | ||
579 | |||
580 | |||
581 | Summary Find the ViceFid and type of an object in a directory if it | ||
582 | exists. | ||
583 | |||
584 | Arguments | ||
585 | |||
586 | in | ||
587 | |||
588 | struct cfs_lookup_in { | ||
589 | ViceFid VFid; | ||
590 | char *name; /* Place holder for data. */ | ||
591 | } cfs_lookup; | ||
592 | |||
593 | |||
594 | |||
595 | out | ||
596 | |||
597 | struct cfs_lookup_out { | ||
598 | ViceFid VFid; | ||
599 | int vtype; | ||
600 | } cfs_lookup; | ||
601 | |||
602 | |||
603 | |||
604 | Description This call is made to determine the ViceFid and filetype of | ||
605 | a directory entry. The directory entry requested carries the name name, | ||
606 | and Venus will search the directory identified by cfs_lookup_in.VFid. | ||
607 | The result may indicate that the name does not exist, or that | ||
608 | difficulty was encountered in finding it (e.g. due to disconnection). | ||
609 | If the result is zero, the field cfs_lookup_out.VFid contains the | ||
610 | target's ViceFid and cfs_lookup_out.vtype the coda_vtype giving the | ||
611 | type of object the name designates. | ||
612 | |||
613 | The name of the object is an 8 bit character string of maximum length | ||
614 | CFS_MAXNAMLEN, currently set to 256 (including a 0 terminator). | ||
615 | |||
616 | It is extremely important to realize that Venus bitwise ors the field | ||
617 | cfs_lookup_out.vtype with CFS_NOCACHE to indicate that the object should | ||
618 | not be put in the kernel name cache. | ||
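
A driver that honours this flag has to separate it from the type
before using either. The sketch below shows one way to do so. The
numeric value given to CFS_NOCACHE here is an assumption for
illustration (the text does not state it), and both helper names are
hypothetical.

```c
#include <stdbool.h>

/* Assumed value: a single high bit that no vnode type uses. */
#define CFS_NOCACHE 0x80000000u

/* Strip the flag and return the plain vnode type. */
static int vtype_without_flag(unsigned int raw_vtype)
{
    return (int)(raw_vtype & ~CFS_NOCACHE);
}

/* True if Venus did not ask for the entry to be kept out of the
 * kernel name cache. */
static bool vtype_may_cache(unsigned int raw_vtype)
{
    return (raw_vtype & CFS_NOCACHE) == 0;
}
```

The helpers take an unsigned value so that the bit operations are well
defined even though the vtype field is currently declared as int.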
619 | |||
620 | NOTE The type of vtype is currently wrong. It should be | ||
621 | coda_vtype. Linux does not take note of CFS_NOCACHE. It should. | ||
622 | |||
624 | |||
625 | 4.5. getattr | ||
626 | |||
627 | |||
628 | Summary Get the attributes of a file. | ||
629 | |||
630 | Arguments | ||
631 | |||
632 | in | ||
633 | |||
634 | struct cfs_getattr_in { | ||
635 | ViceFid VFid; | ||
636 | struct coda_vattr attr; /* XXXXX */ | ||
637 | } cfs_getattr; | ||
638 | |||
639 | |||
640 | |||
641 | out | ||
642 | |||
643 | struct cfs_getattr_out { | ||
644 | struct coda_vattr attr; | ||
645 | } cfs_getattr; | ||
646 | |||
647 | |||
648 | |||
649 | Description This call returns the attributes of the file identified by | ||
650 | VFid. | ||
651 | |||
652 | Errors Errors can occur if the object identified by VFid does not exist, | ||
653 | is inaccessible, or if the caller does not have permission to fetch | ||
654 | attributes. | ||
655 | |||
656 | NOTE Many kernel FS drivers (Linux, NT and Windows 95) need to acquire | ||
657 | the attributes as well as the Fid for the instantiation of an internal | ||
658 | "inode" or "FileHandle". A significant improvement in performance on | ||
659 | such systems could be made by combining the lookup and getattr calls | ||
660 | both at the Venus/kernel interaction level and at the RPC level. | ||
661 | |||
662 | The vattr structure included in the input arguments is superfluous and | ||
663 | should be removed. | ||
664 | |||
666 | |||
667 | 4.6. setattr | ||
668 | |||
669 | |||
670 | Summary Set the attributes of a file. | ||
671 | |||
672 | Arguments | ||
673 | |||
674 | in | ||
675 | |||
676 | struct cfs_setattr_in { | ||
677 | ViceFid VFid; | ||
678 | struct coda_vattr attr; | ||
679 | } cfs_setattr; | ||
680 | |||
681 | |||
682 | |||
683 | |||
684 | out | ||
685 | empty | ||
686 | |||
687 | Description The structure attr is filled with attributes to be changed | ||
688 | in BSD style. Attributes not to be changed are set to -1, apart from | ||
689 | vtype which is set to VNON. Others are set to the value to be assigned. | ||
690 | The only attributes which the FS driver may request to change are the | ||
691 | mode, owner, groupid, atime, mtime and ctime. The return value | ||
692 | indicates success or failure. | ||
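
The BSD-style convention just described can be sketched as follows.
The struct is a reduced stand-in for struct coda_vattr, the enum lists
only a few types, and the names mini_vattr, vattr_null and
vattr_set_mode are hypothetical; the sketch only shows how "leave
unchanged" markers might be set up before a single field is filled in.

```c
#include <stdint.h>
#include <string.h>

/* Reduced stand-ins for the definitions in the text (assumptions). */
enum coda_vtype { VNON, VREG, VDIR };
typedef uint32_t vuid_t;
typedef uint32_t vgid_t;

struct mini_vattr {                 /* subset of struct coda_vattr */
    enum coda_vtype va_type;
    unsigned short  va_mode;
    vuid_t          va_uid;
    vgid_t          va_gid;
    uint64_t        va_size;
};

/* Mark every attribute as "do not change": all bits set (that is, -1)
 * for the integer fields, and VNON for the type. */
static void vattr_null(struct mini_vattr *va)
{
    memset(va, 0xff, sizeof(*va));
    va->va_type = VNON;
}

/* Prepare a setattr request that changes only the mode. */
static void vattr_set_mode(struct mini_vattr *va, unsigned short mode)
{
    vattr_null(va);
    va->va_mode = mode;
}
```

A receiver would then treat any field still equal to -1 (cast to the
field's type) as untouched.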
693 | |||
694 | Errors A variety of errors can occur. The object may not exist, may | ||
695 | be inaccessible, or permission may not be granted by Venus. | ||
696 | |||
698 | |||
699 | 4.7. access | ||
700 | |||
701 | |||
702 | Summary Check whether access to an object is permitted. | ||
703 | |||
704 | Arguments | ||
705 | |||
706 | in | ||
707 | |||
708 | struct cfs_access_in { | ||
709 | ViceFid VFid; | ||
710 | int flags; | ||
711 | } cfs_access; | ||
712 | |||
713 | |||
714 | |||
715 | out | ||
716 | empty | ||
717 | |||
718 | Description Verify if access to the object identified by VFid for | ||
719 | operations described by flags is permitted. The result indicates if | ||
720 | access will be granted. It is important to remember that Coda uses | ||
721 | ACLs to enforce protection and that ultimately the servers, not the | ||
722 | clients, enforce the security of the system. The result of this call | ||
723 | will depend on whether a token is held by the user. | ||
724 | |||
725 | Errors The object may not exist, or the ACL describing the protection | ||
726 | may not be accessible. | ||
727 | |||
729 | |||
730 | 4.8. create | ||
731 | |||
732 | |||
733 | Summary Invoked to create a file. | ||
734 | |||
735 | Arguments | ||
736 | |||
737 | in | ||
738 | |||
739 | struct cfs_create_in { | ||
740 | ViceFid VFid; | ||
741 | struct coda_vattr attr; | ||
742 | int excl; | ||
743 | int mode; | ||
744 | char *name; /* Place holder for data. */ | ||
745 | } cfs_create; | ||
746 | |||
747 | |||
748 | |||
749 | |||
750 | out | ||
751 | |||
752 | struct cfs_create_out { | ||
753 | ViceFid VFid; | ||
754 | struct coda_vattr attr; | ||
755 | } cfs_create; | ||
756 | |||
757 | |||
758 | |||
759 | Description This upcall is invoked to request creation of a file. | ||
760 | The file will be created in the directory identified by VFid, its name | ||
761 | will be name, and the mode will be mode. If excl is set an error will | ||
762 | be returned if the file already exists. If the size field in attr is | ||
763 | set to zero the file will be truncated. The uid and gid of the file | ||
764 | are set by converting the CodaCred to a uid using a macro CRTOUID | ||
765 | (this macro is platform dependent). Upon success the VFid and | ||
766 | attributes of the file are returned. The Coda FS Driver will normally | ||
767 | instantiate a vnode, inode or file handle at kernel level for the new | ||
768 | object. | ||
769 | |||
770 | |||
771 | Errors A variety of errors can occur. Permissions may be insufficient. | ||
772 | If the object exists and is not a file the error EISDIR is returned | ||
773 | under Unix. | ||
774 | |||
775 | NOTE The packing of parameters is very inefficient and appears to | ||
776 | indicate confusion between the system call creat and the VFS operation | ||
777 | create. The VFS operation create is only called to create new objects. | ||
778 | This create call differs from the Unix one in that it is not invoked | ||
779 | to return a file descriptor. The truncate and exclusive options, | ||
780 | together with the mode, could simply be part of the mode as it is | ||
781 | under Unix. There should be no flags argument; this is used in | ||
782 | open(2) to return a file descriptor for READ or WRITE mode. | ||
783 | |||
784 | The attributes of the directory should be returned too, since the size | ||
785 | and mtime changed. | ||
786 | |||
788 | |||
789 | 4.9. mkdir | ||
790 | |||
791 | |||
792 | Summary Create a new directory. | ||
793 | |||
794 | Arguments | ||
795 | |||
796 | in | ||
797 | |||
798 | struct cfs_mkdir_in { | ||
799 | ViceFid VFid; | ||
800 | struct coda_vattr attr; | ||
801 | char *name; /* Place holder for data. */ | ||
802 | } cfs_mkdir; | ||
803 | |||
804 | |||
805 | |||
806 | out | ||
807 | |||
808 | struct cfs_mkdir_out { | ||
809 | ViceFid VFid; | ||
810 | struct coda_vattr attr; | ||
811 | } cfs_mkdir; | ||
812 | |||
813 | |||
814 | |||
815 | |||
816 | Description This call is similar to create but creates a directory. | ||
817 | Only the mode field in the input parameters is used for creation. | ||
818 | Upon successful creation, the attr returned contains the attributes of | ||
819 | the new directory. | ||
820 | |||
821 | Errors As for create. | ||
822 | |||
823 | NOTE The input parameter should be changed to mode instead of | ||
824 | attributes. | ||
825 | |||
826 | The attributes of the parent should be returned since its size and | ||
827 | mtime change. | ||
828 | |||
830 | |||
831 | 4.10. link | ||
832 | |||
833 | |||
834 | Summary Create a link to an existing file. | ||
835 | |||
836 | Arguments | ||
837 | |||
838 | in | ||
839 | |||
840 | struct cfs_link_in { | ||
841 | ViceFid sourceFid; /* cnode to link *to* */ | ||
842 | ViceFid destFid; /* Directory in which to place link */ | ||
843 | char *tname; /* Place holder for data. */ | ||
844 | } cfs_link; | ||
845 | |||
846 | |||
847 | |||
848 | out | ||
849 | empty | ||
850 | |||
851 | Description This call creates a link to the sourceFid in the directory | ||
852 | identified by destFid with name tname. The source must reside in the | ||
853 | target's parent, i.e. the source must have parent destFid; Coda | ||
854 | does not support cross directory hard links. Only the return value is | ||
855 | relevant. It indicates success or the type of failure. | ||
856 | |||
857 | Errors The usual errors can occur. | ||
858 | |||
859 | 4.11. symlink | ||
860 | |||
861 | |||
862 | Summary Create a symbolic link. | ||
863 | |||
864 | Arguments | ||
865 | |||
866 | in | ||
867 | |||
868 | struct cfs_symlink_in { | ||
869 | ViceFid VFid; /* Directory to put symlink in */ | ||
870 | char *srcname; | ||
871 | struct coda_vattr attr; | ||
872 | char *tname; | ||
873 | } cfs_symlink; | ||
874 | |||
875 | |||
876 | |||
877 | out | ||
878 | none | ||
879 | |||
880 | Description Create a symbolic link. The link is to be placed in the | ||
881 | directory identified by VFid and named tname. It should point to the | ||
882 | pathname srcname. The attributes of the newly created object are to | ||
883 | be set to attr. | ||
884 | |||
885 | Errors | ||
886 | |||
887 | NOTE The attributes of the target directory should be returned since | ||
888 | its size changed. | ||
889 | |||
891 | |||
892 | 4.12. remove | ||
893 | |||
894 | |||
895 | Summary Remove a file. | ||
896 | |||
897 | Arguments | ||
898 | |||
899 | in | ||
900 | |||
901 | struct cfs_remove_in { | ||
902 | ViceFid VFid; | ||
903 | char *name; /* Place holder for data. */ | ||
904 | } cfs_remove; | ||
905 | |||
906 | |||
907 | |||
908 | out | ||
909 | none | ||
910 | |||
911 | Description Remove the file named cfs_remove_in.name in the directory | ||
912 | identified by VFid. | ||
913 | |||
914 | Errors | ||
915 | |||
916 | NOTE The attributes of the directory should be returned since its | ||
917 | mtime and size may change. | ||
918 | |||
920 | |||
921 | 4.13. rmdir | ||
922 | |||
923 | |||
924 | Summary Remove a directory. | ||
925 | |||
926 | Arguments | ||
927 | |||
928 | in | ||
929 | |||
930 | struct cfs_rmdir_in { | ||
931 | ViceFid VFid; | ||
932 | char *name; /* Place holder for data. */ | ||
933 | } cfs_rmdir; | ||
934 | |||
935 | |||
936 | |||
937 | out | ||
938 | none | ||
939 | |||
940 | Description Remove the directory with name name from the directory | ||
941 | identified by VFid. | ||
942 | |||
943 | Errors | ||
944 | |||
945 | NOTE The attributes of the parent directory should be returned since | ||
946 | its mtime and size may change. | ||
947 | |||
949 | |||
950 | 4.14. readlink | ||
951 | |||
952 | |||
953 | Summary Read the value of a symbolic link. | ||
954 | |||
955 | Arguments | ||
956 | |||
957 | in | ||
958 | |||
959 | struct cfs_readlink_in { | ||
960 | ViceFid VFid; | ||
961 | } cfs_readlink; | ||
962 | |||
963 | |||
964 | |||
965 | out | ||
966 | |||
967 | struct cfs_readlink_out { | ||
968 | int count; | ||
969 | caddr_t data; /* Place holder for data. */ | ||
970 | } cfs_readlink; | ||
971 | |||
972 | |||
973 | |||
974 | Description This routine reads the contents of the symbolic link | ||
975 | identified by VFid into the buffer data. The buffer data must be able | ||
976 | to hold any name up to CFS_MAXNAMLEN (PATH or NAM??). | ||
977 | |||
978 | Errors No unusual errors. | ||
979 | |||
981 | |||
982 | 4.15. open | ||
983 | |||
984 | |||
985 | Summary Open a file. | ||
986 | |||
987 | Arguments | ||
988 | |||
989 | in | ||
990 | |||
991 | struct cfs_open_in { | ||
992 | ViceFid VFid; | ||
993 | int flags; | ||
994 | } cfs_open; | ||
995 | |||
996 | |||
997 | |||
998 | out | ||
999 | |||
1000 | struct cfs_open_out { | ||
1001 | dev_t dev; | ||
1002 | ino_t inode; | ||
1003 | } cfs_open; | ||
1004 | |||
1005 | |||
1006 | |||
1007 | Description This request asks Venus to place the file identified by | ||
1008 | VFid in its cache and to note that the calling process wishes to open | ||
1009 | it with flags as in open(2). The return value to the kernel differs | ||
1010 | for Unix and Windows systems. For Unix systems the Coda FS Driver is | ||
1011 | informed of the device and inode number of the container file in the | ||
1012 | fields dev and inode. For Windows the path of the container file is | ||
1013 | returned to the kernel. | ||
1014 | Errors | ||
1015 | |||
1016 | NOTE Currently the cfs_open_out structure is not properly adapted to | ||
1017 | deal with the Windows case. It might be best to implement two | ||
1018 | upcalls, one to open aiming at a container file name, the other at a | ||
1019 | container file inode. | ||
1020 | |||
1022 | |||
1023 | 4.16. close | ||
1024 | |||
1025 | |||
1026 | Summary Close a file, update it on the servers. | ||
1027 | |||
1028 | Arguments | ||
1029 | |||
1030 | in | ||
1031 | |||
1032 | struct cfs_close_in { | ||
1033 | ViceFid VFid; | ||
1034 | int flags; | ||
1035 | } cfs_close; | ||
1036 | |||
1037 | |||
1038 | |||
1039 | out | ||
1040 | none | ||
1041 | |||
1042 | Description Close the file identified by VFid. | ||
1043 | |||
1044 | Errors | ||
1045 | |||
1046 | NOTE The flags argument is bogus and not used. However, Venus' code | ||
1047 | has room to deal with an execp input field; probably this field should | ||
1048 | be used to inform Venus that the file was closed but is still memory | ||
1049 | mapped for execution. There are comments about fetching versus not | ||
1050 | fetching the data in Venus vproc_vfscalls. This seems silly. If a | ||
1051 | file is being closed, the data in the container file is to be the new | ||
1052 | data. Here again the execp flag might be in play to create confusion: | ||
1053 | currently Venus might think a file can be flushed from the cache when | ||
1054 | it is still memory mapped. This needs to be understood. | ||
1055 | |||
1057 | |||
1058 | 4.17. ioctl | ||
1059 | |||
1060 | |||
1061 | Summary Do an ioctl on a file. This includes the pioctl interface. | ||
1062 | |||
1063 | Arguments | ||
1064 | |||
1065 | in | ||
1066 | |||
1067 | struct cfs_ioctl_in { | ||
1068 | ViceFid VFid; | ||
1069 | int cmd; | ||
1070 | int len; | ||
1071 | int rwflag; | ||
1072 | char *data; /* Place holder for data. */ | ||
1073 | } cfs_ioctl; | ||
1074 | |||
1075 | |||
1076 | |||
1077 | out | ||
1078 | |||
1079 | |||
1080 | struct cfs_ioctl_out { | ||
1081 | int len; | ||
1082 | caddr_t data; /* Place holder for data. */ | ||
1083 | } cfs_ioctl; | ||
1084 | |||
1085 | |||
1086 | |||
1087 | Description Do an ioctl operation on a file. The cmd, len and | ||
1088 | data arguments are filled as usual. rwflag is not used by Venus. | ||
1089 | |||
1090 | Errors | ||
1091 | |||
1092 | NOTE Another bogus parameter. rwflag is not used. What is the | ||
1093 | business about PREFETCHING in the Venus code? | ||
1094 | |||
1095 | |||
1097 | |||
1098 | 4.18. rename | ||
1099 | |||
1100 | |||
1101 | Summary Rename a fid. | ||
1102 | |||
1103 | Arguments | ||
1104 | |||
1105 | in | ||
1106 | |||
1107 | struct cfs_rename_in { | ||
1108 | ViceFid sourceFid; | ||
1109 | char *srcname; | ||
1110 | ViceFid destFid; | ||
1111 | char *destname; | ||
1112 | } cfs_rename; | ||
1113 | |||
1114 | |||
1115 | |||
1116 | out | ||
1117 | none | ||
1118 | |||
1119 | Description Rename the object with name srcname in directory | ||
1120 | sourceFid to destname in destFid. It is important that the names | ||
1121 | srcname and destname are 0 terminated strings. Strings in Unix | ||
1122 | kernels are not always null terminated. | ||
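
Because kernel names often arrive as a pointer plus a length, a driver
would typically copy each name into its own buffer and terminate it
before building the upcall. A minimal sketch, assuming the 256 byte
limit quoted for CFS_MAXNAMLEN in the lookup section; copy_name is a
hypothetical helper, not a function from the Coda sources.

```c
#include <stddef.h>
#include <string.h>

#define CFS_MAXNAMLEN 256   /* includes the 0 terminator */

/*
 * Copy a name of length len, which need not be 0 terminated, into
 * dst (at least CFS_MAXNAMLEN bytes) and terminate it.
 * Returns 0 on success, -1 if the name does not fit.
 */
static int copy_name(char *dst, const char *src, size_t len)
{
    if (len >= CFS_MAXNAMLEN)
        return -1;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}
```

For rename the helper would be applied twice, once for srcname and
once for destname.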
1123 | |||
1124 | Errors | ||
1125 | |||
1127 | |||
1128 | 4.19. readdir | ||
1129 | |||
1130 | |||
1131 | Summary Read directory entries. | ||
1132 | |||
1133 | Arguments | ||
1134 | |||
1135 | in | ||
1136 | |||
1137 | struct cfs_readdir_in { | ||
1138 | ViceFid VFid; | ||
1139 | int count; | ||
1140 | int offset; | ||
1141 | } cfs_readdir; | ||
1142 | |||
1143 | |||
1144 | |||
1145 | |||
1146 | out | ||
1147 | |||
1148 | struct cfs_readdir_out { | ||
1149 | int size; | ||
1150 | caddr_t data; /* Place holder for data. */ | ||
1151 | } cfs_readdir; | ||
1152 | |||
1153 | |||
1154 | |||
1155 | Description Read at most count bytes of directory entries from VFid, | ||
1156 | starting at offset. The entries are returned in data and their total | ||
1157 | size in size. | ||
1158 | |||
1159 | Errors | ||
1160 | |||
1161 | NOTE This call is not used. Readdir operations exploit container | ||
1162 | files. We will re-evaluate this during the directory revamp which is | ||
1163 | about to take place. | ||
1164 | |||
1166 | |||
1167 | 4.20. vget | ||
1168 | |||
1169 | |||
1170 | Summary Instructs Venus to do an FSDB->Get. | ||
1171 | |||
1172 | Arguments | ||
1173 | |||
1174 | in | ||
1175 | |||
1176 | struct cfs_vget_in { | ||
1177 | ViceFid VFid; | ||
1178 | } cfs_vget; | ||
1179 | |||
1180 | |||
1181 | |||
1182 | out | ||
1183 | |||
1184 | struct cfs_vget_out { | ||
1185 | ViceFid VFid; | ||
1186 | int vtype; | ||
1187 | } cfs_vget; | ||
1188 | |||
1189 | |||
1190 | |||
1191 | Description This upcall asks Venus to do a get operation on an fsobj | ||
1192 | labelled by VFid. | ||
1193 | |||
1194 | Errors | ||
1195 | |||
1196 | NOTE This operation is not used. However, it is extremely useful | ||
1197 | since it can be used to deal with read/write memory mapped files. | ||
1198 | These can be "pinned" in the Venus cache using vget and released with | ||
1199 | inactive. | ||
1200 | |||
1202 | |||
1203 | 4.21. fsync | ||
1204 | |||
1205 | |||
1206 | Summary Tell Venus to update the RVM attributes of a file. | ||
1207 | |||
1208 | Arguments | ||
1209 | |||
1210 | in | ||
1211 | |||
1212 | struct cfs_fsync_in { | ||
1213 | ViceFid VFid; | ||
1214 | } cfs_fsync; | ||
1215 | |||
1216 | |||
1217 | |||
1218 | out | ||
1219 | none | ||
1220 | |||
1221 | Description Ask Venus to update RVM attributes of object VFid. This | ||
1222 | should be called as part of kernel level fsync type calls. The | ||
1223 | result indicates if the syncing was successful. | ||
1224 | |||
1225 | Errors | ||
1226 | |||
1227 | NOTE Linux does not implement this call. It should. | ||
1228 | |||
1230 | |||
1231 | 4.22. inactive | ||
1232 | |||
1233 | |||
1234 | Summary Tell Venus a vnode is no longer in use. | ||
1235 | |||
1236 | Arguments | ||
1237 | |||
1238 | in | ||
1239 | |||
1240 | struct cfs_inactive_in { | ||
1241 | ViceFid VFid; | ||
1242 | } cfs_inactive; | ||
1243 | |||
1244 | |||
1245 | |||
1246 | out | ||
1247 | none | ||
1248 | |||
1249 | Description This operation returns EOPNOTSUPP. | ||
1250 | |||
1251 | Errors | ||
1252 | |||
1253 | NOTE This should perhaps be removed. | ||
1254 | |||
1256 | |||
1257 | 4.23. rdwr | ||
1258 | |||
1259 | |||
1260 | Summary Read from or write to a file. | ||
1261 | |||
1262 | Arguments | ||
1263 | |||
1264 | in | ||
1265 | |||
1266 | struct cfs_rdwr_in { | ||
1267 | ViceFid VFid; | ||
1268 | int rwflag; | ||
1269 | int count; | ||
1270 | int offset; | ||
1271 | int ioflag; | ||
1272 | caddr_t data; /* Place holder for data. */ | ||
1273 | } cfs_rdwr; | ||
1274 | |||
1275 | |||
1276 | |||
1277 | |||
1278 | out | ||
1279 | |||
1280 | struct cfs_rdwr_out { | ||
1281 | int rwflag; | ||
1282 | int count; | ||
1283 | caddr_t data; /* Place holder for data. */ | ||
1284 | } cfs_rdwr; | ||
1285 | |||
1286 | |||
1287 | |||
1288 | Description This upcall asks Venus to read from or write to a file. | ||
1289 | |||
1290 | Errors | ||
1291 | |||
1292 | NOTE It should be removed since it goes against the Coda philosophy | ||
1293 | that read/write operations never reach Venus. I have been told the | ||
1294 | operation does not work. It is not currently used. | ||
1295 | |||
1296 | |||
1298 | |||
1299 | 4.24. odymount | ||
1300 | |||
1301 | |||
1302 | Summary Allows mounting multiple Coda "filesystems" on one Unix mount | ||
1303 | point. | ||
1304 | |||
1305 | Arguments | ||
1306 | |||
1307 | in | ||
1308 | |||
1309 | struct ody_mount_in { | ||
1310 | char *name; /* Place holder for data. */ | ||
1311 | } ody_mount; | ||
1312 | |||
1313 | |||
1314 | |||
1315 | out | ||
1316 | |||
1317 | struct ody_mount_out { | ||
1318 | ViceFid VFid; | ||
1319 | } ody_mount; | ||
1320 | |||
1321 | |||
1322 | |||
1323 | Description Asks Venus to return the rootfid of a Coda system named | ||
1324 | name. The fid is returned in VFid. | ||
1325 | |||
1326 | Errors | ||
1327 | |||
1328 | NOTE This call was used by David for dynamic sets. It should be | ||
1329 | removed since it causes a jungle of pointers in the VFS mounting area. | ||
1330 | It is not used by Coda proper. Call is not implemented by Venus. | ||
1331 | |||
1333 | |||
1334 | 4.25. ody_lookup | ||
1335 | |||
1336 | |||
1337 | Summary Looks up something. | ||
1338 | |||
1339 | Arguments | ||
1340 | |||
1341 | in irrelevant | ||
1342 | |||
1343 | |||
1344 | out | ||
1345 | irrelevant | ||
1346 | |||
1347 | Description | ||
1348 | |||
1349 | Errors | ||
1350 | |||
1351 | NOTE Gut it. Call is not implemented by Venus. | ||
1352 | |||
1354 | |||
1355 | 4.26. ody_expand | ||
1356 | |||
1357 | |||
1358 | Summary Expands something in a dynamic set. | ||
1359 | |||
1360 | Arguments | ||
1361 | |||
1362 | in irrelevant | ||
1363 | |||
1364 | out | ||
1365 | irrelevant | ||
1366 | |||
1367 | Description | ||
1368 | |||
1369 | Errors | ||
1370 | |||
1371 | NOTE Gut it. Call is not implemented by Venus. | ||
1372 | |||
1374 | |||
1375 | 4.27. prefetch | ||
1376 | |||
1377 | |||
1378 | Summary Prefetch a dynamic set. | ||
1379 | |||
1380 | Arguments | ||
1381 | |||
1382 | in Not documented. | ||
1383 | |||
1384 | out | ||
1385 | Not documented. | ||
1386 | |||
1387 | Description Venus worker.cc has support for this call, although it is | ||
1388 | noted that it doesn't work. Not surprising, since the kernel does not | ||
1389 | have support for it (ODY_PREFETCH is not a defined operation). | ||
1390 | |||
1391 | Errors | ||
1392 | |||
1393 | NOTE Gut it. It isn't working and isn't used by Coda. | ||
1394 | |||
1395 | |||
1397 | |||
1398 | 4.28. signal | ||
1399 | |||
1400 | |||
1401 | Summary Send Venus a signal about an upcall. | ||
1402 | |||
1403 | Arguments | ||
1404 | |||
1405 | in none | ||
1406 | |||
1407 | out | ||
1408 | not applicable. | ||
1409 | |||
1410 | Description This is an out-of-band upcall to Venus to inform Venus | ||
1411 | that the calling process received a signal after Venus read the | ||
1412 | message from the input queue. Venus is supposed to clean up the | ||
1413 | operation. | ||
1414 | |||
1415 | Errors No reply is given. | ||
1416 | |||
1417 | NOTE We need to better understand what Venus needs to clean up and if | ||
1418 | it is doing this correctly. Also we need to correctly handle situations | ||
1419 | with multiple upcalls per system call. It would be important to know | ||
1420 | what state changes in Venus take place after an upcall for which the | ||
1421 | kernel is responsible for notifying Venus to clean up (e.g. open | ||
1422 | definitely is such a state change, but many others are maybe not). | ||
1423 | |||
1425 | |||
1426 | 5. The minicache and downcalls | ||
1427 | |||
1428 | |||
1429 | The Coda FS Driver can cache results of lookup and access upcalls, to | ||
1430 | limit the frequency of upcalls. Upcalls carry a price since a process | ||
1431 | context switch needs to take place. The counterpart of caching the | ||
1432 | information is that Venus will notify the FS Driver that cached | ||
1433 | entries must be flushed or renamed. | ||
1434 | |||
1435 | The kernel code generally has to maintain a structure which links the | ||
1436 | internal file handles (called vnodes in BSD, inodes in Linux and | ||
1437 | FileHandles in Windows) with the ViceFid's which Venus maintains. The | ||
1438 | reason is that frequent translations back and forth are needed in | ||
1439 | order to make upcalls and use the results of upcalls. Such linking | ||
1440 | objects are called cnodes. | ||
1441 | |||
1442 | The current minicache implementations have cache entries which record | ||
1443 | the following: | ||
1444 | |||
1445 | 1. the name of the file | ||
1446 | |||
1447 | 2. the cnode of the directory containing the object | ||
1448 | |||
1449 | 3. a list of CodaCred's for which the lookup is permitted. | ||
1450 | |||
1451 | 4. the cnode of the object | ||
1452 | |||
1453 | The lookup call in the Coda FS Driver may request the cnode of the | ||
1454 | desired object from the cache, by passing its name, directory and the | ||
1455 | CodaCred's of the caller. The cache will return the cnode or indicate | ||
1456 | that it cannot be found. The Coda FS Driver must be careful to | ||
1457 | invalidate cache entries when it modifies or removes objects. | ||
1458 | |||
1459 | When Venus obtains information that indicates that cache entries are | ||
1460 | no longer valid, it will make a downcall to the kernel. Downcalls are | ||
1461 | intercepted by the Coda FS Driver and lead to cache invalidations of | ||
1462 | the kind described below. The Coda FS Driver does not return an error | ||
1463 | unless the downcall data could not be read into kernel memory. | ||
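
To make the cache entry layout and the effect of downcalls concrete,
here is a deliberately small user-space sketch. It is not the minicache
code from any of the kernels: the fixed-size table, the single uid
standing in for the list of CodaCreds, the stand-in struct cnode and
all mc_* names are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MC_NAMELEN 256   /* CFS_MAXNAMLEN in the text */
#define MC_SLOTS   16    /* arbitrary table size for the sketch */

struct cnode { int id; };           /* stand-in for the real cnode */

struct mc_entry {
    int           used;
    char          name[MC_NAMELEN]; /* 1. name of the file */
    struct cnode *dir;              /* 2. cnode of the containing directory */
    uint32_t      uid;              /* 3. simplified: one uid, not a cred list */
    struct cnode *obj;              /* 4. cnode of the object */
};

static struct mc_entry mc_table[MC_SLOTS];

/* Record a successful lookup; returns 0, or -1 if the table is full
 * or the name is too long. */
static int mc_enter(const char *name, struct cnode *dir, uint32_t uid,
                    struct cnode *obj)
{
    if (strlen(name) >= MC_NAMELEN)
        return -1;
    for (int i = 0; i < MC_SLOTS; i++) {
        if (!mc_table[i].used) {
            strcpy(mc_table[i].name, name);
            mc_table[i].dir = dir;
            mc_table[i].uid = uid;
            mc_table[i].obj = obj;
            mc_table[i].used = 1;
            return 0;
        }
    }
    return -1;
}

/* Return the cached cnode, or NULL if an upcall is needed. */
static struct cnode *mc_lookup(const char *name, const struct cnode *dir,
                               uint32_t uid)
{
    for (int i = 0; i < MC_SLOTS; i++) {
        const struct mc_entry *e = &mc_table[i];
        if (e->used && e->dir == dir && e->uid == uid &&
            strcmp(e->name, name) == 0)
            return e->obj;
    }
    return NULL;
}

/* Invalidation in the style of a FLUSH downcall: drop everything. */
static void mc_flush(void)
{
    memset(mc_table, 0, sizeof(mc_table));
}

/* Invalidation in the style of a PURGEUSER downcall: drop one user's
 * entries. */
static void mc_purgeuser(uint32_t uid)
{
    for (int i = 0; i < MC_SLOTS; i++)
        if (mc_table[i].used && mc_table[i].uid == uid)
            mc_table[i].used = 0;
}
```

A real implementation would hash on (directory, name) rather than scan
linearly; the linear table is used here only to keep the sketch short.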
1464 | |||
1465 | |||
1466 | 5.1. INVALIDATE | ||
1467 | |||
1468 | |||
1469 | No information is available on this call. | ||
1470 | |||
1471 | |||
1472 | 5.2. FLUSH | ||
1473 | |||
1474 | |||
1475 | |||
1476 | Arguments None | ||
1477 | |||
1478 | Summary Flush the name cache entirely. | ||
1479 | |||
1480 | Description Venus issues this call upon startup and when it dies. This | ||
1481 | is to prevent stale cache information being held. Some operating | ||
1482 | systems allow the kernel name cache to be switched off dynamically. | ||
1483 | When this is done, this downcall is made. | ||
1484 | |||
1485 | |||
1486 | 5.3. PURGEUSER | ||
1487 | |||
1488 | |||
1489 | Arguments | ||
1490 | |||
1491 | struct cfs_purgeuser_out {/* CFS_PURGEUSER is a venus->kernel call */ | ||
1492 | struct CodaCred cred; | ||
1493 | } cfs_purgeuser; | ||
1494 | |||
1495 | |||
1496 | |||
1497 | Description Remove all entries in the cache carrying the Cred. This | ||
1498 | call is issued when tokens for a user expire or are flushed. | ||
1499 | |||
1500 | |||
1501 | 5.4. ZAPFILE | ||
1502 | |||
1503 | |||
1504 | Arguments | ||
1505 | |||
1506 | struct cfs_zapfile_out { /* CFS_ZAPFILE is a venus->kernel call */ | ||
1507 | ViceFid CodaFid; | ||
1508 | } cfs_zapfile; | ||
1509 | |||
1510 | |||
1511 | |||
1512 | Description Remove all entries which have the (dir vnode, name) pair. | ||
1513 | This is issued as a result of an invalidation of cached attributes of | ||
1514 | a vnode. | ||
1515 | |||
1516 | NOTE Call is not named correctly in NetBSD and Mach. The minicache | ||
1517 | zapfile routine takes different arguments. Linux does not implement | ||
1518 | the invalidation of attributes correctly. | ||
1519 | |||
1520 | |||
1521 | |||
1522 | 5.5. ZAPDIR | ||
1523 | |||
1524 | |||
1525 | Arguments | ||
1526 | |||
1527 | struct cfs_zapdir_out { /* CFS_ZAPDIR is a venus->kernel call */ | ||
1528 | ViceFid CodaFid; | ||
1529 | } cfs_zapdir; | ||
1530 | |||
1531 | |||
1532 | |||
1533 | Description Remove all entries in the cache lying in a directory | ||
1534 | CodaFid, and all children of this directory. This call is issued when | ||
1535 | Venus receives a callback on the directory. | ||
1536 | |||
1537 | |||
1538 | 5.6. ZAPVNODE | ||
1539 | |||
1540 | |||
1541 | |||
1542 | Arguments | ||
1543 | |||
1544 | struct cfs_zapvnode_out { /* CFS_ZAPVNODE is a venus->kernel call */ | ||
1545 | struct CodaCred cred; | ||
1546 | ViceFid VFid; | ||
1547 | } cfs_zapvnode; | ||
1548 | |||
1549 | |||
1550 | |||
1551 | Description Remove all entries in the cache carrying the cred and VFid | ||
1552 | as in the arguments. This downcall is probably never issued. | ||
1553 | |||
1554 | |||
1555 | 5.7. PURGEFID | ||
1556 | |||
1557 | |||
1558 | Summary | ||
1559 | |||
1560 | Arguments | ||
1561 | |||
1562 | struct cfs_purgefid_out { /* CFS_PURGEFID is a venus->kernel call */ | ||
1563 | ViceFid CodaFid; | ||
1564 | } cfs_purgefid; | ||
1565 | |||
1566 | |||
1567 | |||
1568 | Description Flush the attribute for the file. If it is a dir (odd | ||
1569 | vnode), purge its children from the namecache and remove the file from the | ||
1570 | namecache. | ||
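
A rough sketch of the PURGEFID behaviour follows. It assumes one reading
of the description above: the entry for the fid itself is always dropped
(which also discards any cached attributes), and if the fid names a
directory its children are dropped as well. The types sk_fid and sk_entry
and the function sk_purgefid() are hypothetical, and the cache is again
modelled as a flat array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for ViceFid. */
struct sk_fid { unsigned long volume, vnode, unique; };

struct sk_entry {
    int           in_use;
    struct sk_fid dir;   /* fid of the containing directory */
    struct sk_fid obj;   /* fid of the cached object        */
};

int sk_fid_eq(const struct sk_fid *a, const struct sk_fid *b)
{
    return a->volume == b->volume && a->vnode == b->vnode &&
           a->unique == b->unique;
}

/* The text identifies directories by an odd vnode number. */
int sk_fid_is_dir(const struct sk_fid *f)
{
    return (f->vnode & 1UL) != 0;
}

/* Purge fid, and its children if it is a directory.
 * Returns the number of entries removed. */
size_t sk_purgefid(struct sk_entry *tab, size_t n, const struct sk_fid *fid)
{
    size_t removed = 0;
    int is_dir;

    assert(fid != NULL);
    is_dir = sk_fid_is_dir(fid);
    for (size_t i = 0; i < n; i++) {
        if (!tab[i].in_use)
            continue;
        if (sk_fid_eq(&tab[i].obj, fid) ||
            (is_dir && sk_fid_eq(&tab[i].dir, fid))) {
            tab[i].in_use = 0;
            removed++;
        }
    }
    return removed;
}
```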
1571 | |||
1572 | |||
1573 | |||
1574 | 5.8. REPLACE | ||
1575 | |||
1576 | |||
1577 | Summary Replace the Fid's for a collection of names. | ||
1578 | |||
1579 | Arguments | ||
1580 | |||
1581 | struct cfs_replace_out { /* cfs_replace is a venus->kernel call */ | ||
1582 | ViceFid NewFid; | ||
1583 | ViceFid OldFid; | ||
1584 | } cfs_replace; | ||
1585 | |||
1586 | |||
1587 | |||
1588 | Description This routine replaces a ViceFid in the name cache with | ||
1589 | another. It is added to allow Venus during reintegration to replace | ||
1590 | locally allocated temp fids while disconnected with global fids even | ||
1591 | when the reference counts on those fids are not zero. | ||
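
The replacement can be sketched as an in-place rewrite of every cached
occurrence of OldFid. The sketch below is an assumption about how such a
routine might look, not the driver's code: sk_fid, sk_entry and
sk_replace() are hypothetical names, and the refcount field exists only
to show that the rewrite does not wait for it to drop to zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for ViceFid. */
struct sk_fid { unsigned long volume, vnode, unique; };

struct sk_entry {
    struct sk_fid dir;      /* fid of the containing directory      */
    struct sk_fid obj;      /* fid of the cached object             */
    int           refcount; /* deliberately ignored by sk_replace() */
};

int sk_fid_eq(const struct sk_fid *a, const struct sk_fid *b)
{
    return a->volume == b->volume && a->vnode == b->vnode &&
           a->unique == b->unique;
}

/* Rewrite every occurrence of oldfid to newfid, whether it appears as an
 * object or as a parent directory. The argument order mirrors the struct
 * above (NewFid first). Returns the number of fids rewritten. */
size_t sk_replace(struct sk_entry *tab, size_t n,
                  const struct sk_fid *newfid, const struct sk_fid *oldfid)
{
    size_t hits = 0;

    assert(newfid != NULL && oldfid != NULL);
    for (size_t i = 0; i < n; i++) {
        if (sk_fid_eq(&tab[i].obj, oldfid)) {
            tab[i].obj = *newfid;
            hits++;
        }
        if (sk_fid_eq(&tab[i].dir, oldfid)) {
            tab[i].dir = *newfid;
            hits++;
        }
    }
    return hits;
}
```

Rewriting in place, rather than purging and re-fetching, is what lets
entries that are still referenced keep working across reintegration.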
1592 | |||
1593 | |||
1594 | |||
1595 | 6. Initialization and cleanup | ||
1596 | |||
1597 | |||
1598 | This section gives brief hints as to desirable features for the Coda | ||
1599 | FS Driver at startup and upon shutdown or Venus failures. Before | ||
1600 | entering the discussion it is useful to repeat that the Coda FS Driver | ||
1601 | maintains the following data: | ||
1602 | |||
1603 | |||
1604 | 1. message queues | ||
1605 | |||
1606 | 2. cnodes | ||
1607 | |||
1608 | 3. name cache entries | ||
1609 | |||
1610 | The name cache entries are entirely private to the driver, so they | ||
1611 | can easily be manipulated. The message queues will generally have | ||
1612 | clear points of initialization and destruction. The cnodes are | ||
1613 | much more delicate. User processes hold reference counts in Coda | ||
1614 | filesystems and it can be difficult to clean up the cnodes. | ||
1615 | |||
1616 | The driver can expect requests through: | ||
1617 | |||
1618 | 1. the message subsystem | ||
1619 | |||
1620 | 2. the VFS layer | ||
1621 | |||
1622 | 3. pioctl interface | ||
1623 | |||
1624 | Currently the pioctl passes through the VFS for Coda so we can | ||
1625 | treat these similarly. | ||
1626 | |||
1627 | |||
1628 | 6.1. Requirements | ||
1629 | |||
1630 | |||
1631 | The following requirements should be accommodated: | ||
1632 | |||
1633 | 1. The message queues should have open and close routines. On Unix | ||
1634 | the open and close of the character devices are such routines. | ||
1635 | |||
1636 | o Before opening, no messages can be placed. | ||
1637 | |||
1638 | o Opening will remove any old messages still pending. | ||
1639 | |||
1640 | o Close will notify any sleeping processes that their upcall cannot | ||
1641 | be completed. | ||
1642 | |||
1643 | o Close will free all memory allocated by the message queues. | ||
1644 | |||
1645 | |||
1646 | 2. At open the namecache shall be initialized to empty state. | ||
1647 | |||
1648 | 3. Before the message queues are open, all VFS operations will fail. | ||
1649 | Fortunately this can be achieved by making sure that mounting the | ||
1650 | Coda filesystem cannot succeed before opening. | ||
1651 | |||
1652 | 4. After closing of the queues, no VFS operations can succeed. Here | ||
1653 | one needs to be careful, since a few operations (lookup, | ||
1654 | read/write, readdir) can proceed without upcalls. These must be | ||
1655 | explicitly blocked. | ||
1656 | |||
1657 | 5. Upon closing the namecache shall be flushed and disabled. | ||
1658 | |||
1659 | 6. All memory held by cnodes can be freed without relying on upcalls. | ||
1660 | |||
1661 | 7. Unmounting the file system can be done without relying on upcalls. | ||
1662 | |||
1663 | 8. Mounting the Coda filesystem should fail gracefully if Venus cannot | ||
1664 | get the rootfid or the attributes of the rootfid. The latter is | ||
1665 | best implemented by Venus fetching these objects before attempting | ||
1666 | to mount. | ||
1667 | |||
1668 | NOTE NetBSD in particular but also Linux have not implemented the | ||
1669 | above requirements fully. For smooth operation this needs to be | ||
1670 | corrected. | ||
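
The message queue requirements (items 1, 3 and 4 above) can be sketched
as a small state machine. Everything here is hypothetical: sk_queue,
sk_open(), sk_enqueue(), sk_close() and sk_vfs_op_allowed() are names
invented for this sketch, and a sleeping process is modelled simply as an
int status word that close sets to SK_ABORTED instead of a real wakeup.

```c
#include <assert.h>
#include <stdlib.h>

#define SK_PENDING   0
#define SK_ABORTED (-1)

struct sk_node {
    struct sk_node *next;
    int            *status;   /* owned by the waiting caller */
};

struct sk_queue {
    int             open;
    struct sk_node *head;
};

/* Free every queued node; optionally tell the waiter it was aborted. */
static void sk_drain(struct sk_queue *q, int notify)
{
    while (q->head != NULL) {
        struct sk_node *n = q->head;

        q->head = n->next;
        if (notify)
            *n->status = SK_ABORTED;
        free(n);
    }
}

/* Opening removes any old messages still pending. */
void sk_open(struct sk_queue *q)
{
    sk_drain(q, 0);
    q->open = 1;
}

/* Before opening (or after closing) no message can be placed. */
int sk_enqueue(struct sk_queue *q, int *status)
{
    struct sk_node *n;

    assert(status != NULL);
    if (!q->open)
        return -1;
    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    *status = SK_PENDING;
    n->status = status;
    n->next = q->head;
    q->head = n;
    return 0;
}

/* Close notifies the waiters and frees all memory held by the queue. */
void sk_close(struct sk_queue *q)
{
    q->open = 0;
    sk_drain(q, 1);
}

/* VFS operations, including those that need no upcall, check this first. */
int sk_vfs_op_allowed(const struct sk_queue *q)
{
    return q->open;
}
```

The explicit sk_vfs_op_allowed() check stands in for the blocking of
lookup, read/write and readdir after close, since those operations would
otherwise proceed without ever touching the queue.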
1671 | |||
1672 | |||
1673 | |||
diff --git a/Documentation/filesystems/cramfs.txt b/Documentation/filesystems/cramfs.txt new file mode 100644 index 000000000000..31f53f0ab957 --- /dev/null +++ b/Documentation/filesystems/cramfs.txt | |||
@@ -0,0 +1,76 @@ | |||
1 | |||
2 | Cramfs - cram a filesystem onto a small ROM | ||
3 | |||
4 | cramfs is designed to be simple and small, and to compress things well. | ||
5 | |||
6 | It uses the zlib routines to compress a file one page at a time, and | ||
7 | allows random page access. The meta-data is not compressed, but is | ||
8 | expressed in a very terse representation to make it use much less | ||
9 | disk space than traditional filesystems. | ||
10 | |||
11 | You can't write to a cramfs filesystem (making it compressible and | ||
12 | compact also makes it _very_ hard to update on-the-fly), so you have to | ||
13 | create the disk image with the "mkcramfs" utility. | ||
14 | |||
15 | |||
16 | Usage Notes | ||
17 | ----------- | ||
18 | |||
19 | File sizes are limited to less than 16MB. | ||
20 | |||
21 | Maximum filesystem size is a little over 256MB. (The last file on the | ||
22 | filesystem is allowed to extend past 256MB.) | ||
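
The two limits above follow from the width of the on-disk fields. The
sketch below assumes a 24-bit file size field and a 26-bit data offset
field counted in 4-byte units; those widths are an assumption of this
example and should be checked against fs/cramfs/README. The macro and
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field widths; confirm against fs/cramfs/README. */
#define SK_SIZE_BITS   24   /* file size, in bytes            */
#define SK_OFFSET_BITS 26   /* data offset, in 4-byte units   */

/* Largest file size that fits the size field: just under 16MB. */
uint32_t sk_max_file_size(void)
{
    return (UINT32_C(1) << SK_SIZE_BITS) - 1;
}

/* Largest byte offset at which a file's data may start: just under 256MB. */
uint32_t sk_max_start_offset(void)
{
    uint32_t units = (UINT32_C(1) << SK_OFFSET_BITS) - 1;

    assert(units <= UINT32_MAX / 4);   /* the multiplication cannot overflow */
    return units * 4;
}
```

Only the start of a file is bounded by the offset field, which is why the
last file may extend past 256MB.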
23 | |||
24 | Only the low 8 bits of gid are stored. The current version of | ||
25 | mkcramfs simply truncates to 8 bits, which is a potential security | ||
26 | issue. | ||
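
The effect of the truncation can be illustrated with a few lines of C.
This is a sketch of the arithmetic only, not the mkcramfs source;
sk_stored_gid() and sk_gid_changes() are hypothetical helpers.

```c
#include <assert.h>   /* assert() is handy for the checks in the note below */
#include <stdint.h>

/* Keep only the low 8 bits of a gid, as the text says mkcramfs does. */
uint8_t sk_stored_gid(uint32_t gid)
{
    return (uint8_t)(gid & 0xffu);
}

/* Non-zero if the gid read back from the image differs from the original. */
int sk_gid_changes(uint32_t gid)
{
    return sk_stored_gid(gid) != gid;
}
```

With this arithmetic gid 100 is stored unchanged, gid 1000 comes back as
232, and gid 256 comes back as 0, so files owned by group 256 would
appear to belong to group 0. That last case is the kind of silent change
the security warning refers to.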
27 | |||
28 | Hard links are supported, but hard linked files | ||
29 | will still have a link count of 1 in the cramfs image. | ||
30 | |||
31 | Cramfs directories have no `.' or `..' entries. Directories (like | ||
32 | every other file on cramfs) always have a link count of 1. (There's | ||
33 | no need to use -noleaf in `find', btw.) | ||
34 | |||
35 | No timestamps are stored in a cramfs, so these default to the epoch | ||
36 | (1970 GMT). Recently-accessed files may have updated timestamps, but | ||
37 | the update lasts only as long as the inode is cached in memory, after | ||
38 | which the timestamp reverts to 1970, i.e. moves backwards in time. | ||
39 | |||
40 | Currently, cramfs must be written and read with architectures of the | ||
41 | same endianness, and can be read only by kernels with PAGE_CACHE_SIZE | ||
42 | == 4096. At least the latter of these is a bug, but it hasn't been | ||
43 | decided what the best fix is. For the moment if you have larger pages | ||
44 | you can just change the #define in mkcramfs.c, so long as you don't | ||
45 | mind the filesystem becoming unreadable to future kernels. | ||
46 | |||
47 | |||
48 | For /usr/share/magic | ||
49 | -------------------- | ||
50 | |||
51 | 0 ulelong 0x28cd3d45 Linux cramfs offset 0 | ||
52 | >4 ulelong x size %d | ||
53 | >8 ulelong x flags 0x%x | ||
54 | >12 ulelong x future 0x%x | ||
55 | >16 string >\0 signature "%.16s" | ||
56 | >32 ulelong x fsid.crc 0x%x | ||
57 | >36 ulelong x fsid.edition %d | ||
58 | >40 ulelong x fsid.blocks %d | ||
59 | >44 ulelong x fsid.files %d | ||
60 | >48 string >\0 name "%.16s" | ||
61 | 512 ulelong 0x28cd3d45 Linux cramfs offset 512 | ||
62 | >516 ulelong x size %d | ||
63 | >520 ulelong x flags 0x%x | ||
64 | >524 ulelong x future 0x%x | ||
65 | >528 string >\0 signature "%.16s" | ||
66 | >544 ulelong x fsid.crc 0x%x | ||
67 | >548 ulelong x fsid.edition %d | ||
68 | >552 ulelong x fsid.blocks %d | ||
69 | >556 ulelong x fsid.files %d | ||
70 | >560 string >\0 name "%.16s" | ||
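
The magic entries above translate into a simple check: look for the
little-endian value 0x28cd3d45 at byte offset 0 or 512, then read the
size from the following word. The sketch below does this on an in-memory
buffer; sk_le32(), sk_cramfs_find_super() and sk_cramfs_size() are
hypothetical helper names, and the 64-byte minimum is taken from the
last field listed above (a 16-byte name at offset 48).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_CRAMFS_MAGIC 0x28cd3d45u
#define SK_SUPER_BYTES  64   /* fields above end at offset 48 + 16 */

/* Read a little-endian 32-bit value regardless of host byte order. */
uint32_t sk_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return 0 or 512 if the magic is found at that offset, else -1. */
long sk_cramfs_find_super(const unsigned char *img, size_t len)
{
    static const size_t offsets[] = { 0, 512 };

    assert(img != NULL);
    for (size_t i = 0; i < sizeof offsets / sizeof offsets[0]; i++) {
        size_t off = offsets[i];

        if (len >= off + SK_SUPER_BYTES &&
            sk_le32(img + off) == SK_CRAMFS_MAGIC)
            return (long)off;
    }
    return -1;
}

/* The "size" field is the 32-bit word right after the magic. */
uint32_t sk_cramfs_size(const unsigned char *img, long off)
{
    return sk_le32(img + off + 4);
}
```

Reading the fields byte by byte keeps the check independent of the host
byte order, although it only recognises images written in little-endian
form, which matches the ulelong entries above.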
71 | |||
72 | |||
73 | Hacker Notes | ||
74 | ------------ | ||
75 | |||
76 | See fs/cramfs/README for filesystem layout and implementation notes. | ||
diff --git a/Documentation/filesystems/devfs/ChangeLog b/Documentation/filesystems/devfs/ChangeLog new file mode 100644 index 000000000000..e5aba5246d7c --- /dev/null +++ b/Documentation/filesystems/devfs/ChangeLog | |||
@@ -0,0 +1,1977 @@ | |||
1 | /* -*- auto-fill -*- */ | ||
2 | =============================================================================== | ||
3 | Changes for patch v1 | ||
4 | |||
5 | - creation of devfs | ||
6 | |||
7 | - modified miscellaneous character devices to support devfs | ||
8 | =============================================================================== | ||
9 | Changes for patch v2 | ||
10 | |||
11 | - bug fix with manual inode creation | ||
12 | =============================================================================== | ||
13 | Changes for patch v3 | ||
14 | |||
15 | - bugfixes | ||
16 | |||
17 | - documentation improvements | ||
18 | |||
19 | - created a couple of scripts (one to save&restore a devfs and the | ||
20 | other to set up compatibility symlinks) | ||
21 | |||
22 | - devfs support for SCSI discs. New name format is: sd_hHcCiIlL | ||
23 | =============================================================================== | ||
24 | Changes for patch v4 | ||
25 | |||
26 | - bugfix for the directory reading code | ||
27 | |||
28 | - bugfix for compilation with kerneld | ||
29 | |||
30 | - devfs support for generic hard discs | ||
31 | |||
32 | - rationalisation of the various watchdog drivers | ||
33 | =============================================================================== | ||
34 | Changes for patch v5 | ||
35 | |||
36 | - support for mounting directly from entries in the devfs (it doesn't | ||
37 | need to be mounted to do this), including the root filesystem. | ||
38 | Mounting of swap partitions also works. Hence, now if you set | ||
39 | CONFIG_DEVFS_ONLY to 'Y' then you won't be able to access your discs | ||
40 | via ordinary device nodes. Naturally, the default is 'N' so that you | ||
41 | can still use your old device nodes. If you want to mount from devfs | ||
42 | entries, make sure you use: append = "root=/dev/sd_..." in your | ||
43 | lilo.conf. It seems LILO looks for the device number (major&minor) | ||
44 | and writes that into the kernel image :-( | ||
45 | |||
46 | - support for character memory devices (/dev/null, /dev/zero, /dev/full | ||
47 | and so on). Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> | ||
48 | =============================================================================== | ||
49 | Changes for patch v6 | ||
50 | |||
51 | - support for subdirectories | ||
52 | |||
53 | - support for symbolic links (created by devfs_mk_symlink(), no | ||
54 | support yet for creation via symlink(2)) | ||
55 | |||
56 | - SCSI disc naming now cast in stone, with the format: | ||
57 | /dev/sd/c0b1t2u3 controller=0, bus=1, ID=2, LUN=3, whole disc | ||
58 | /dev/sd/c0b1t2u3p4 controller=0, bus=1, ID=2, LUN=3, 4th partition | ||
59 | |||
60 | - loop devices now appear in devfs | ||
61 | |||
62 | - tty devices, console, serial ports, etc. now appear in devfs | ||
63 | Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> | ||
64 | |||
65 | - bugs with mounting devfs-only devices now fixed | ||
66 | =============================================================================== | ||
67 | Changes for patch v7 | ||
68 | |||
69 | - SCSI CD-ROMS, tapes and generic devices now appear in devfs | ||
70 | =============================================================================== | ||
71 | Changes for patch v8 | ||
72 | |||
73 | - bugfix with no-rewind SCSI tapes | ||
74 | |||
75 | - RAMDISCs now appear in devfs | ||
76 | |||
77 | - better cleaning up of devfs entries created by various modules | ||
78 | |||
79 | - interface change to <devfs_register> | ||
80 | =============================================================================== | ||
81 | Changes for patch v9 | ||
82 | |||
83 | - the v8 patch was corrupted somehow, which would affect the patch for | ||
84 | linux/fs/filesystems.c | ||
85 | I've also fixed the v8 patch file on the WWW | ||
86 | |||
87 | - MetaDevices (/dev/md*) should now appear in devfs | ||
88 | =============================================================================== | ||
89 | Changes for patch v10 | ||
90 | |||
91 | - bugfix in meta device support for devfs | ||
92 | |||
93 | - created this ChangeLog file | ||
94 | |||
95 | - added devfs support to the floppy driver | ||
96 | |||
97 | - added support for creating sockets in a devfs | ||
98 | =============================================================================== | ||
99 | Changes for patch v11 | ||
100 | |||
101 | - added DEVFS_FL_HIDE_UNREG flag | ||
102 | |||
103 | - incorporated better patch for ttyname() in libc 5.4.43 from H.J. Lu. | ||
104 | |||
105 | - interface change to <devfs_mk_symlink> | ||
106 | |||
107 | - support for creating symlinks with symlink(2) | ||
108 | |||
109 | - parallel port printer (/dev/lp*) now appears in devfs | ||
110 | =============================================================================== | ||
111 | Changes for patch v12 | ||
112 | |||
113 | - added inode check to <devfs_fill_file> function | ||
114 | |||
115 | - improved devfs support when mounting from devfs | ||
116 | |||
117 | - added call to <<release>> operation when removing swap areas on | ||
118 | devfs devices | ||
119 | |||
120 | - increased NR_SUPER to 128 to support large numbers of devfs mounts | ||
121 | (for chroot(2) gaols) | ||
122 | |||
123 | - fixed bug in SCSI disc support: was generating incorrect minors if | ||
124 | SCSI ID's did not start at 0 and increase by 1 | ||
125 | |||
126 | - support symlink traversal when mounting root | ||
127 | =============================================================================== | ||
128 | Changes for patch v13 | ||
129 | |||
130 | - added devfs support to soundcard driver | ||
131 | Thanks to Eric Dumas <dumas@linux.eu.org> and | ||
132 | C. Scott Ananian <cananian@alumni.princeton.edu> | ||
133 | |||
134 | - added devfs support to the joystick driver | ||
135 | |||
136 | - loop driver now has its own subdirectory "/dev/loop/" | ||
137 | |||
138 | - created <devfs_get_flags> and <devfs_set_flags> functions | ||
139 | |||
140 | - fix problem with SCSI disc compatibility names (sd{a,b,c,d,e,f}) | ||
141 | which assumes ID's start at 0 and increase by 1. Also only create | ||
142 | devfs entries for SCSI disc partitions which actually exist | ||
143 | Show new names in partition check | ||
144 | Thanks to Jakub Jelinek <jj@sunsite.ms.mff.cuni.cz> | ||
145 | =============================================================================== | ||
146 | Changes for patch v14 | ||
147 | |||
148 | - bug fix in floppy driver: would not compile without | ||
149 | CONFIG_DEVFS_FS='Y' | ||
150 | Thanks to Jurgen Botz <jbotz@nova.botz.org> | ||
151 | |||
152 | - bug fix in loop driver | ||
153 | Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> | ||
154 | |||
155 | - do not create devfs entries for printers not configured | ||
156 | Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> | ||
157 | |||
158 | - do not create devfs entries for serial ports not present | ||
159 | Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> | ||
160 | |||
161 | - ensure <tty_register_devfs> is exported from tty_io.c | ||
162 | Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> | ||
163 | |||
164 | - allow unregistering of devfs symlink entries | ||
165 | |||
166 | - fixed bug in SCSI disc naming introduced in last patch version | ||
167 | =============================================================================== | ||
168 | Changes for patch v15 | ||
169 | |||
170 | - ported to kernel 2.1.81 | ||
171 | =============================================================================== | ||
172 | Changes for patch v16 | ||
173 | |||
174 | - created <devfs_set_symlink_destination> function | ||
175 | |||
176 | - moved DEVFS_SUPER_MAGIC into header file | ||
177 | |||
178 | - added DEVFS_FL_HIDE flag | ||
179 | |||
180 | - created <devfs_get_maj_min> | ||
181 | |||
182 | - created <devfs_get_handle_from_inode> | ||
183 | |||
184 | - fixed bugs in searching by major&minor | ||
185 | |||
186 | - changed interface to <devfs_unregister>, <devfs_fill_file> and | ||
187 | <devfs_find_handle> | ||
188 | |||
189 | - fixed inode times when symlink created with symlink(2) | ||
190 | |||
191 | - change tty driver to do auto-creation of devfs entries | ||
192 | Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> | ||
193 | |||
194 | - fixed bug in genhd.c: whole disc (non-SCSI) was not registered to | ||
195 | devfs | ||
196 | |||
197 | - updated libc 5.4.43 patch for ttyname() | ||
198 | =============================================================================== | ||
199 | Changes for patch v17 | ||
200 | |||
201 | - added CONFIG_DEVFS_TTY_COMPAT | ||
202 | Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> | ||
203 | |||
204 | - bugfix in devfs support for drivers/char/lp.c | ||
205 | Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> | ||
206 | |||
207 | - clean up serial driver so that PCMCIA devices unregister correctly | ||
208 | Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> | ||
209 | |||
210 | - fixed bug in genhd.c: whole disc (non-SCSI) was not registered to | ||
211 | devfs [was missing in patch v16] | ||
212 | |||
213 | - updated libc 5.4.43 patch for ttyname() [was missing in patch v16] | ||
214 | |||
215 | - all SCSI devices now registered in /dev/sg | ||
216 | |||
217 | - support removal of devfs entries via unlink(2) | ||
218 | =============================================================================== | ||
219 | Changes for patch v18 | ||
220 | |||
221 | - added floppy/?u720 floppy entry | ||
222 | |||
223 | - fixed kerneld support for entries in devfs subdirectories | ||
224 | |||
225 | - incorporated latest patch for ttyname() in libc 5.4.43 from H.J. Lu. | ||
226 | =============================================================================== | ||
227 | Changes for patch v19 | ||
228 | |||
229 | - bug fix when looking up unregistered entries: kerneld was not called | ||
230 | |||
231 | - fixes for kernel 2.1.86 (now requires 2.1.86) | ||
232 | =============================================================================== | ||
233 | Changes for patch v20 | ||
234 | |||
235 | - only create available floppy entries | ||
236 | Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> | ||
237 | |||
238 | - new IDE naming scheme following SCSI format (i.e. /dev/id/c0b0t0u0p1 | ||
239 | instead of /dev/hda1) | ||
240 | Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> | ||
241 | |||
242 | - new XT disc naming scheme following SCSI format (i.e. /dev/xd/c0t0p1 | ||
243 | instead of /dev/xda1) | ||
244 | Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> | ||
245 | |||
246 | - new non-standard CD-ROM names (i.e. /dev/sbp/c#t#) | ||
247 | Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> | ||
248 | |||
249 | - allow symlink traversal when mounting the root filesystem | ||
250 | |||
251 | - Create entries for MD devices at MD init | ||
252 | Thanks to Christophe Leroy <christophe.leroy5@capway.com> | ||
253 | =============================================================================== | ||
254 | Changes for patch v21 | ||
255 | |||
256 | - ported to kernel 2.1.91 | ||
257 | =============================================================================== | ||
258 | Changes for patch v22 | ||
259 | |||
260 | - SCSI host number patch ("scsihosts=" kernel option) | ||
261 | Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> | ||
262 | =============================================================================== | ||
263 | Changes for patch v23 | ||
264 | |||
265 | - Fixed persistence bug with device numbers for manually created | ||
266 | device files | ||
267 | |||
268 | - Fixed problem with recreating symlinks with different content | ||
269 | |||
270 | - Added CONFIG_DEVFS_MOUNT (mount devfs on /dev at boot time) | ||
271 | =============================================================================== | ||
272 | Changes for patch v24 | ||
273 | |||
274 | - Switched from CONFIG_KERNELD to CONFIG_KMOD: module autoloading | ||
275 | should now work again | ||
276 | |||
277 | - Hide entries which are manually unlinked | ||
278 | |||
279 | - Always invalidate devfs dentry cache when registering entries | ||
280 | |||
281 | - Support removal of devfs directories via rmdir(2) | ||
282 | |||
283 | - Ensure directories created by <devfs_mk_dir> are visible | ||
284 | |||
285 | - Default no access for "other" for floppy device | ||
286 | =============================================================================== | ||
287 | Changes for patch v25 | ||
288 | |||
289 | - Updates to CREDITS file and minor IDE numbering change | ||
290 | Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> | ||
291 | |||
292 | - Invalidate devfs dentry cache when making directories | ||
293 | |||
294 | - Invalidate devfs dentry cache when removing entries | ||
295 | |||
296 | - More informative message if root FS mount fails when devfs | ||
297 | configured | ||
298 | |||
299 | - Fixed persistence bug with fifos | ||
300 | =============================================================================== | ||
301 | Changes for patch v26 | ||
302 | |||
303 | - ported to kernel 2.1.97 | ||
304 | |||
305 | - Changed serial directory from "/dev/serial" to "/dev/tts" and | ||
306 | "/dev/consoles" to "/dev/vc" to be more friendly to new procps | ||
307 | =============================================================================== | ||
308 | Changes for patch v27 | ||
309 | |||
310 | - Added support for IDE4 and IDE5 | ||
311 | Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> | ||
312 | |||
313 | - Documented "scsihosts=" boot parameter | ||
314 | |||
315 | - Print process command when debugging kerneld/kmod | ||
316 | |||
317 | - Added debugging for register/unregister/change operations | ||
318 | |||
319 | - Added "devfs=" boot options | ||
320 | |||
321 | - Hide unregistered entries by default | ||
322 | =============================================================================== | ||
323 | Changes for patch v28 | ||
324 | |||
325 | - No longer lock/unlock superblock in <devfs_put_super> (cope with | ||
326 | recent VFS interface change) | ||
327 | |||
328 | - Do not automatically change ownership/protection of /dev/tty | ||
329 | |||
330 | - Drop negative dentries when they are released | ||
331 | |||
332 | - Manage dcache more efficiently | ||
333 | =============================================================================== | ||
334 | Changes for patch v29 | ||
335 | |||
336 | - Added DEVFS_FL_AUTO_DEVNUM flag | ||
337 | =============================================================================== | ||
338 | Changes for patch v30 | ||
339 | |||
340 | - No longer set unnecessary methods | ||
341 | |||
342 | - Ported to kernel 2.1.99-pre3 | ||
343 | =============================================================================== | ||
344 | Changes for patch v31 | ||
345 | |||
346 | - Added PID display to <call_kerneld> debugging message | ||
347 | |||
348 | - Added "diread" and "diwrite" options | ||
349 | |||
350 | - Ported to kernel 2.1.102 | ||
351 | |||
352 | - Fixed persistence problem with permissions | ||
353 | =============================================================================== | ||
354 | Changes for patch v32 | ||
355 | |||
356 | - Fixed devfs support in drivers/block/md.c | ||
357 | =============================================================================== | ||
358 | Changes for patch v33 | ||
359 | |||
360 | - Support legacy device nodes | ||
361 | |||
362 | - Fixed bug where recreated inodes were hidden | ||
363 | |||
364 | - New IDE naming scheme: everything is under /dev/ide | ||
365 | =============================================================================== | ||
366 | Changes for patch v34 | ||
367 | |||
368 | - Improved debugging in <get_vfs_inode> | ||
369 | |||
370 | - Prevent duplicate calls to <devfs_mk_dir> in SCSI layer | ||
371 | |||
372 | - No longer free old dentries in <devfs_mk_dir> | ||
373 | |||
374 | - Free all dentries for a given entry when deleting inodes | ||
375 | =============================================================================== | ||
376 | Changes for patch v35 | ||
377 | |||
378 | - Ported to kernel 2.1.105 (sound driver changes) | ||
379 | =============================================================================== | ||
380 | Changes for patch v36 | ||
381 | |||
382 | - Fixed sound driver port | ||
383 | =============================================================================== | ||
384 | Changes for patch v37 | ||
385 | |||
386 | - Minor documentation tweaks | ||
387 | =============================================================================== | ||
388 | Changes for patch v38 | ||
389 | |||
390 | - More documentation tweaks | ||
391 | |||
392 | - Fix for sound driver port | ||
393 | |||
394 | - Removed ttyname-patch (grab libc 5.4.44 instead) | ||
395 | |||
396 | - Ported to kernel 2.1.107-pre2 (loop driver fix) | ||
397 | =============================================================================== | ||
398 | Changes for patch v39 | ||
399 | |||
400 | - Ported to kernel 2.1.107 (hd.c hunk broke due to spelling "fixes"). Sigh | ||
401 | |||
402 | - Removed many #ifdef's, replaced with trickery in include/devfs_fs.h | ||
403 | =============================================================================== | ||
404 | Changes for patch v40 | ||
405 | |||
406 | - Fix for sound driver port | ||
407 | |||
408 | - Limit auto-device numbering to majors 128 to 239 | ||
409 | =============================================================================== | ||
410 | Changes for patch v41 | ||
411 | |||
412 | - Fixed inode times persistence problem | ||
413 | =============================================================================== | ||
414 | Changes for patch v42 | ||
415 | |||
416 | - Ported to kernel 2.1.108 (drivers/scsi/hosts.c hunk broke) | ||
417 | =============================================================================== | ||
418 | Changes for patch v43 | ||
419 | |||
420 | - Fixed spelling in <devfs_readlink> debug | ||
421 | |||
422 | - Fixed bug in <devfs_setup> parsing "dilookup" | ||
423 | |||
424 | - More #ifdef's removed | ||
425 | |||
426 | - Supported Sparc keyboard (/dev/kbd) | ||
427 | |||
428 | - Supported DSP56001 digital signal processor (/dev/dsp56k) | ||
429 | |||
430 | - Supported Apple Desktop Bus (/dev/adb) | ||
431 | |||
432 | - Supported Coda network file system (/dev/cfs*) | ||
433 | =============================================================================== | ||
434 | Changes for patch v44 | ||
435 | |||
436 | - Fixed devfs inode leak when manually recreating inodes | ||
437 | |||
438 | - Fixed permission persistence problem when recreating inodes | ||
439 | =============================================================================== | ||
440 | Changes for patch v45 | ||
441 | |||
442 | - Ported to kernel 2.1.110 | ||
443 | =============================================================================== | ||
444 | Changes for patch v46 | ||
445 | |||
446 | - Ported to kernel 2.1.112-pre1 | ||
447 | |||
448 | - Removed harmless "unused variable" compiler warning | ||
449 | |||
450 | - Fixed modes for manually recreated device nodes | ||
451 | =============================================================================== | ||
452 | Changes for patch v47 | ||
453 | |||
454 | - Added NULL devfs inode warning in <devfs_read_inode> | ||
455 | |||
456 | - Force all inode nlink values to 1 | ||
457 | =============================================================================== | ||
458 | Changes for patch v48 | ||
459 | |||
460 | - Added "dimknod" option | ||
461 | |||
462 | - Set inode nlink to 0 when freeing dentries | ||
463 | |||
464 | - Added support for virtual console capture devices (/dev/vcs*) | ||
465 | Thanks to Dennis Hou <smilax@mindmeld.yi.org> | ||
466 | |||
467 | - Fixed modes for manually recreated symlinks | ||
468 | =============================================================================== | ||
469 | Changes for patch v49 | ||
470 | |||
471 | - Ported to kernel 2.1.113 | ||
472 | =============================================================================== | ||
473 | Changes for patch v50 | ||
474 | |||
475 | - Fixed bugs in recreated directories and symlinks | ||
476 | =============================================================================== | ||
477 | Changes for patch v51 | ||
478 | |||
479 | - Improved robustness of rc.devfs script | ||
480 | Thanks to Roderich Schupp <rsch@experteam.de> | ||
481 | |||
482 | - Fixed bugs in recreated device nodes | ||
483 | |||
484 | - Fixed bug in currently unused <devfs_get_handle_from_inode> | ||
485 | |||
486 | - Defined new <devfs_handle_t> type | ||
487 | |||
488 | - Improved debugging when getting entries | ||
489 | |||
490 | - Fixed bug where directories could be emptied | ||
491 | |||
492 | - Ported to kernel 2.1.115 | ||
493 | =============================================================================== | ||
494 | Changes for patch v52 | ||
495 | |||
496 | - Replaced dummy .epoch inode with .devfsd character device | ||
497 | |||
498 | - Modified rc.devfs to take account of above change | ||
499 | |||
500 | - Removed spurious driver warning messages when CONFIG_DEVFS_FS=n | ||
501 | |||
502 | - Implemented devfsd protocol revision 0 | ||
503 | =============================================================================== | ||
504 | Changes for patch v53 | ||
505 | |||
506 | - Ported to kernel 2.1.116 (kmod change broke hunk) | ||
507 | |||
508 | - Updated Documentation/Configure.help | ||
509 | |||
510 | - Test and tty pattern patch for rc.devfs script | ||
511 | Thanks to Roderich Schupp <rsch@experteam.de> | ||
512 | |||
513 | - Added soothing message to warning in <devfs_d_iput> | ||
514 | =============================================================================== | ||
515 | Changes for patch v54 | ||
516 | |||
517 | - Ported to kernel 2.1.117 | ||
518 | |||
519 | - Fixed default permissions in sound driver | ||
520 | |||
521 | - Added support for frame buffer devices (/dev/fb*) | ||
522 | =============================================================================== | ||
523 | Changes for patch v55 | ||
524 | |||
525 | - Ported to kernel 2.1.119 | ||
526 | |||
527 | - Use GCC extensions for structure initialisations | ||
528 | |||
529 | - Implemented async open notification | ||
530 | |||
531 | - Incremented devfsd protocol revision to 1 | ||
532 | =============================================================================== | ||
533 | Changes for patch v56 | ||
534 | |||
535 | - Ported to kernel 2.1.120-pre3 | ||
536 | |||
537 | - Moved async open notification to end of <devfs_open> | ||
538 | =============================================================================== | ||
539 | Changes for patch v57 | ||
540 | |||
541 | - Ported to kernel 2.1.121 | ||
542 | |||
543 | - Prepended "/dev/" to module load request | ||
544 | |||
545 | - Renamed <call_kerneld> to <call_kmod> | ||
546 | |||
547 | - Created sample modules.conf file | ||
548 | =============================================================================== | ||
549 | Changes for patch v58 | ||
550 | |||
551 | - Fixed typo "AYSNC" -> "ASYNC" | ||
552 | =============================================================================== | ||
553 | Changes for patch v59 | ||
554 | |||
555 | - Added open flag for files | ||
556 | =============================================================================== | ||
557 | Changes for patch v60 | ||
558 | |||
559 | - Ported to kernel 2.1.123-pre2 | ||
560 | =============================================================================== | ||
561 | Changes for patch v61 | ||
562 | |||
563 | - Set i_blocks=0 and i_blksize=1024 in <devfs_read_inode> | ||
564 | =============================================================================== | ||
565 | Changes for patch v62 | ||
566 | |||
567 | - Ported to kernel 2.1.123 | ||
568 | =============================================================================== | ||
569 | Changes for patch v63 | ||
570 | |||
571 | - Ported to kernel 2.1.124-pre2 | ||
572 | =============================================================================== | ||
573 | Changes for patch v64 | ||
574 | |||
575 | - Fixed Unix98 pty support | ||
576 | |||
577 | - Increased buffer size in <get_partition_list> to avoid crash and | ||
578 | burn | ||
579 | =============================================================================== | ||
580 | Changes for patch v65 | ||
581 | |||
582 | - More Unix98 pty support fixes | ||
583 | |||
584 | - Added test for empty <<name>> in <devfs_find_handle> | ||
585 | |||
586 | - Renamed <generate_path> to <devfs_generate_path> and published | ||
587 | |||
588 | - Created /dev/root symlink | ||
589 | Thanks to Roderich Schupp <rsch@ExperTeam.de> | ||
590 | with further modifications by me | ||
591 | =============================================================================== | ||
592 | Changes for patch v66 | ||
593 | |||
594 | - Yet more Unix98 pty support fixes (now tested) | ||
595 | |||
596 | - Created <devfs_get_fops> | ||
597 | |||
598 | - Support media change checks when CONFIG_DEVFS_ONLY=y | ||
599 | |||
600 | - Abolished Unix98-style PTY names for old PTY devices | ||
601 | =============================================================================== | ||
602 | Changes for patch v67 | ||
603 | |||
604 | - Added inline declaration for dummy <devfs_generate_path> | ||
605 | |||
606 | - Removed spurious "unable to register... in devfs" messages when | ||
607 | CONFIG_DEVFS_FS=n | ||
608 | |||
609 | - Fixed misc. devices when CONFIG_DEVFS_FS=n | ||
610 | |||
611 | - Limit auto-device numbering to majors 144 to 239 | ||
612 | =============================================================================== | ||
613 | Changes for patch v68 | ||
614 | |||
615 | - Hide unopened virtual consoles from directory listings | ||
616 | |||
617 | - Added support for video capture devices | ||
618 | |||
619 | - Ported to kernel 2.1.125 | ||
620 | =============================================================================== | ||
621 | Changes for patch v69 | ||
622 | |||
623 | - Fix for CONFIG_VT=n | ||
624 | =============================================================================== | ||
625 | Changes for patch v70 | ||
626 | |||
627 | - Added support for non-OSS/Free sound cards | ||
628 | =============================================================================== | ||
629 | Changes for patch v71 | ||
630 | |||
631 | - Ported to kernel 2.1.126-pre2 | ||
632 | =============================================================================== | ||
633 | Changes for patch v72 | ||
634 | |||
635 | - #ifdef's for CONFIG_DEVFS_DISABLE_OLD_NAMES removed | ||
636 | =============================================================================== | ||
637 | Changes for patch v73 | ||
638 | |||
639 | - CONFIG_DEVFS_DISABLE_OLD_NAMES replaced with "nocompat" boot option | ||
640 | |||
641 | - CONFIG_DEVFS_BOOT_OPTIONS removed: boot options always available | ||
642 | =============================================================================== | ||
643 | Changes for patch v74 | ||
644 | |||
645 | - Removed CONFIG_DEVFS_MOUNT and "mount" boot option and replaced with | ||
646 | "nomount" boot option | ||
647 | |||
648 | - Documentation updates | ||
649 | |||
650 | - Updated sample modules.conf | ||
651 | =============================================================================== | ||
652 | Changes for patch v75 | ||
653 | |||
654 | - Updated sample modules.conf | ||
655 | |||
656 | - Remount devfs after initrd finishes | ||
657 | |||
658 | - Ported to kernel 2.1.127 | ||
659 | |||
660 | - Added support for ISDN | ||
661 | Thanks to Christophe Leroy <christophe.leroy5@capway.com> | ||
662 | =============================================================================== | ||
663 | Changes for patch v76 | ||
664 | |||
665 | - Updated an email address in ChangeLog | ||
666 | |||
667 | - CONFIG_DEVFS_ONLY replaced with "only" boot option | ||
668 | =============================================================================== | ||
669 | Changes for patch v77 | ||
670 | |||
671 | - Added DEVFS_FL_REMOVABLE flag | ||
672 | |||
673 | - Check for disc change when listing directories with removable media | ||
674 | devices | ||
675 | |||
676 | - Use DEVFS_FL_REMOVABLE in sd.c | ||
677 | |||
678 | - Ported to kernel 2.1.128 | ||
679 | =============================================================================== | ||
680 | Changes for patch v78 | ||
681 | |||
682 | - Only call <scan_dir_for_removable> on first call to <devfs_readdir> | ||
683 | |||
684 | - Ported to kernel 2.1.129-pre5 | ||
685 | |||
686 | - ISDN support improvements | ||
687 | Thanks to Christophe Leroy <christophe.leroy5@capway.com> | ||
688 | =============================================================================== | ||
689 | Changes for patch v79 | ||
690 | |||
691 | - Ported to kernel 2.1.130 | ||
692 | |||
693 | - Renamed miscdevice "apm" to "apm_bios" to be consistent with | ||
694 | devices.txt | ||
695 | =============================================================================== | ||
696 | Changes for patch v80 | ||
697 | |||
698 | - Ported to kernel 2.1.131 | ||
699 | |||
700 | - Updated <devfs_rmdir> for VFS change in 2.1.131 | ||
701 | =============================================================================== | ||
702 | Changes for patch v81 | ||
703 | |||
704 | - Fixed permissions on /dev/ptmx | ||
705 | =============================================================================== | ||
706 | Changes for patch v82 | ||
707 | |||
708 | - Ported to kernel 2.1.132-pre4 | ||
709 | |||
710 | - Changed initial permissions on /dev/pts/* | ||
711 | |||
712 | - Created <devfs_mk_compat> | ||
713 | |||
714 | - Added "symlinks" boot option | ||
715 | |||
716 | - Changed devfs_register_blkdev() back to register_blkdev() for IDE | ||
717 | |||
718 | - Check for partitions on removable media in <devfs_lookup> | ||
719 | =============================================================================== | ||
720 | Changes for patch v83 | ||
721 | |||
722 | - Fixed support for ramdisc when using string-based root FS name | ||
723 | |||
724 | - Ported to kernel 2.2.0-pre1 | ||
725 | =============================================================================== | ||
726 | Changes for patch v84 | ||
727 | |||
728 | - Ported to kernel 2.2.0-pre7 | ||
729 | =============================================================================== | ||
730 | Changes for patch v85 | ||
731 | |||
732 | - Compile fixes for drivers/sound/sound_common.c (non-module) and | ||
733 | drivers/isdn/isdn_common.c | ||
734 | Thanks to Christophe Leroy <christophe.leroy5@capway.com> | ||
735 | |||
736 | - Added support for registering regular files | ||
737 | |||
738 | - Created <devfs_set_file_size> | ||
739 | |||
740 | - Added /dev/cpu/mtrr as an alternative interface to /proc/mtrr | ||
741 | |||
742 | - Update devfs inodes from entries if not changed through FS | ||
743 | =============================================================================== | ||
744 | Changes for patch v86 | ||
745 | |||
746 | - Ported to kernel 2.2.0-pre9 | ||
747 | =============================================================================== | ||
748 | Changes for patch v87 | ||
749 | |||
750 | - Fixed bug when mounting non-devfs devices in a devfs | ||
751 | =============================================================================== | ||
752 | Changes for patch v88 | ||
753 | |||
754 | - Fixed <devfs_fill_file> to only initialise temporary inodes | ||
755 | |||
756 | - Trap for NULL fops in <devfs_register> | ||
757 | |||
758 | - Return -ENODEV in <devfs_fill_file> for non-driver inodes | ||
759 | |||
760 | - Fixed bug when unswapping non-devfs devices in a devfs | ||
761 | =============================================================================== | ||
762 | Changes for patch v89 | ||
763 | |||
764 | - Switched to C data types in include/linux/devfs_fs.h | ||
765 | |||
766 | - Switched from PATH_MAX to DEVFS_PATHLEN | ||
767 | |||
768 | - Updated Documentation/filesystems/devfs/modules.conf to take account | ||
769 | of reverse scanning (!) by modprobe | ||
770 | |||
771 | - Ported to kernel 2.2.0 | ||
772 | =============================================================================== | ||
773 | Changes for patch v90 | ||
774 | |||
775 | - CONFIG_DEVFS_DISABLE_OLD_TTY_NAMES replaced with "nottycompat" boot | ||
776 | option | ||
777 | |||
778 | - CONFIG_DEVFS_TTY_COMPAT removed: existing "symlinks" boot option now | ||
779 | controls this. This means you must have libc 5.4.44 or later, or a | ||
780 | recent version of libc 6 if you use the "symlinks" option | ||
781 | =============================================================================== | ||
782 | Changes for patch v91 | ||
783 | |||
784 | - Switch from <devfs_mk_symlink> to <devfs_mk_compat> in | ||
785 | drivers/char/vc_screen.c to fix problems with Midnight Commander | ||
786 | =============================================================================== | ||
787 | Changes for patch v92 | ||
788 | |||
789 | - Ported to kernel 2.2.2-pre5 | ||
790 | =============================================================================== | ||
791 | Changes for patch v93 | ||
792 | |||
793 | - Modified <sd_name> in drivers/scsi/sd.c to cope with devices that | ||
794 | don't exist (which happens with new RAID autostart code printk()s) | ||
795 | =============================================================================== | ||
796 | Changes for patch v94 | ||
797 | |||
798 | - Fixed bug in joystick driver: only first joystick was registered | ||
799 | =============================================================================== | ||
800 | Changes for patch v95 | ||
801 | |||
802 | - Fixed another bug in joystick driver | ||
803 | |||
804 | - Fixed <devfsd_read> to not overrun event buffer | ||
805 | =============================================================================== | ||
806 | Changes for patch v96 | ||
807 | |||
808 | - Ported to kernel 2.2.5-2 | ||
809 | |||
810 | - Created <devfs_auto_unregister> | ||
811 | |||
812 | - Fixed bugs: compatibility entries were not unregistered for: | ||
813 | loop driver | ||
814 | floppy driver | ||
815 | RAMDISC driver | ||
816 | IDE tape driver | ||
817 | SCSI CD-ROM driver | ||
818 | SCSI HDD driver | ||
819 | =============================================================================== | ||
820 | Changes for patch v97 | ||
821 | |||
822 | - Fixed bugs: compatibility entries were not unregistered for: | ||
823 | ALSA sound driver | ||
824 | partitions in generic disc driver | ||
825 | |||
826 | - Don't return unregistered entries in <devfs_find_handle> | ||
827 | |||
828 | - Panic in <devfs_unregister> if entry already unregistered | ||
829 | |||
830 | - Don't panic in <devfs_auto_unregister> for duplicates | ||
831 | =============================================================================== | ||
832 | Changes for patch v98 | ||
833 | |||
834 | - Don't unregister already unregistered entries in <unregister> | ||
835 | |||
836 | - Register entry in <sd_detect> | ||
837 | |||
838 | - Unregister entry in <sd_detach> | ||
839 | |||
840 | - Changed to <devfs_*register_chrdev> in drivers/char/tty_io.c | ||
841 | |||
842 | - Ported to kernel 2.2.7 | ||
843 | =============================================================================== | ||
844 | Changes for patch v99 | ||
845 | |||
846 | - Ported to kernel 2.2.8 | ||
847 | |||
848 | - Fixed bug in drivers/scsi/sd.c when >16 SCSI discs | ||
849 | |||
850 | - Disable warning messages when unable to read partition table for | ||
851 | removable media | ||
852 | =============================================================================== | ||
853 | Changes for patch v100 | ||
854 | |||
855 | - Ported to kernel 2.3.1-pre5 | ||
856 | |||
857 | - Added "oops-on-panic" boot option | ||
858 | |||
859 | - Improved debugging in <devfs_register> and <devfs_unregister> | ||
860 | |||
861 | - Register entry in <sr_detect> | ||
862 | |||
863 | - Unregister entry in <sr_detach> | ||
864 | |||
865 | - Register entry in <sg_detect> | ||
866 | |||
867 | - Unregister entry in <sg_detach> | ||
868 | |||
869 | - Added support for ALSA drivers | ||
870 | =============================================================================== | ||
871 | Changes for patch v101 | ||
872 | |||
873 | - Ported to kernel 2.3.2 | ||
874 | =============================================================================== | ||
875 | Changes for patch v102 | ||
876 | |||
877 | - Update serial driver to register PCMCIA entries | ||
878 | Thanks to Roch-Alexandre Nomine-Beguin <roch@samarkand.infini.fr> | ||
879 | |||
880 | - Updated an email address in ChangeLog | ||
881 | |||
882 | - Hide virtual console capture entries from directory listings when | ||
883 | corresponding console device is not open | ||
884 | =============================================================================== | ||
885 | Changes for patch v103 | ||
886 | |||
887 | - Ported to kernel 2.3.3 | ||
888 | =============================================================================== | ||
889 | Changes for patch v104 | ||
890 | |||
891 | - Added documentation for some functions | ||
892 | |||
893 | - Added "doc" target to fs/devfs/Makefile | ||
894 | |||
895 | - Added "v4l" directory for video4linux devices | ||
896 | |||
897 | - Replaced call to <devfs_unregister> in <sd_detach> with call to | ||
898 | <devfs_register_partitions> | ||
899 | |||
900 | - Moved registration for sr and sg drivers from detect() to attach() | ||
901 | methods | ||
902 | |||
903 | - Register entries in <st_attach> and unregister in <st_detach> | ||
904 | |||
905 | - Work around IDE driver treating CD-ROM as gendisk | ||
906 | |||
907 | - Use <sed> instead of <tr> in rc.devfs | ||
908 | |||
909 | - Updated ToDo list | ||
910 | |||
911 | - Removed "oops-on-panic" boot option: now always Oops | ||
912 | =============================================================================== | ||
913 | Changes for patch v105 | ||
914 | |||
915 | - Unregister SCSI host from <scsi_host_no_list> in <scsi_unregister> | ||
916 | Thanks to Zoltán Böszörményi <zboszor@mail.externet.hu> | ||
917 | |||
918 | - Don't save /dev/log in rc.devfs | ||
919 | |||
920 | - Ported to kernel 2.3.4-pre1 | ||
921 | =============================================================================== | ||
922 | Changes for patch v106 | ||
923 | |||
924 | - Fixed silly typo in drivers/scsi/st.c | ||
925 | |||
926 | - Improved debugging in <devfs_register> | ||
927 | =============================================================================== | ||
928 | Changes for patch v107 | ||
929 | |||
930 | - Added "diunlink" and "nokmod" boot options | ||
931 | |||
932 | - Removed superfluous warning message in <devfs_d_iput> | ||
933 | =============================================================================== | ||
934 | Changes for patch v108 | ||
935 | |||
936 | - Remove entries when unloading sound module | ||
937 | =============================================================================== | ||
938 | Changes for patch v109 | ||
939 | |||
940 | - Ported to kernel 2.3.6-pre2 | ||
941 | =============================================================================== | ||
942 | Changes for patch v110 | ||
943 | |||
944 | - Took account of change to <d_alloc_root> | ||
945 | =============================================================================== | ||
946 | Changes for patch v111 | ||
947 | |||
948 | - Created separate event queue for each mounted devfs | ||
949 | |||
950 | - Removed <devfs_invalidate_dcache> | ||
951 | |||
952 | - Created new ioctl()s for devfsd | ||
953 | |||
954 | - Incremented devfsd protocol revision to 3 | ||
955 | |||
956 | - Fixed bug when re-creating directories: contents were lost | ||
957 | |||
958 | - Block access to inodes until devfsd updates permissions | ||
959 | =============================================================================== | ||
960 | Changes for patch v112 | ||
961 | |||
962 | - Modified patch so it applies against 2.3.5 and 2.3.6 | ||
963 | |||
964 | - Updated an email address in ChangeLog | ||
965 | |||
966 | - Do not automatically change ownership/protection of /dev/tty<n> | ||
967 | |||
968 | - Updated sample modules.conf | ||
969 | |||
970 | - Switched to sending process uid/gid to devfsd | ||
971 | |||
972 | - Renamed <call_kmod> to <try_modload> | ||
973 | |||
974 | - Added DEVFSD_NOTIFY_LOOKUP event | ||
975 | |||
976 | - Added DEVFSD_NOTIFY_CHANGE event | ||
977 | |||
978 | - Added DEVFSD_NOTIFY_CREATE event | ||
979 | |||
980 | - Incremented devfsd protocol revision to 4 | ||
981 | |||
982 | - Moved kernel-specific stuff to include/linux/devfs_fs_kernel.h | ||
983 | =============================================================================== | ||
984 | Changes for patch v113 | ||
985 | |||
986 | - Ported to kernel 2.3.9 | ||
987 | |||
988 | - Restricted permissions on some block devices | ||
989 | =============================================================================== | ||
990 | Changes for patch v114 | ||
991 | |||
992 | - Added support for /dev/netlink | ||
993 | Thanks to Dennis Hou <smilax@mindmeld.yi.org> | ||
994 | |||
995 | - Return EISDIR rather than EINVAL for read(2) on directories | ||
996 | |||
997 | - Ported to kernel 2.3.10 | ||
998 | =============================================================================== | ||
999 | Changes for patch v115 | ||
1000 | |||
1001 | - Added support for all remaining character devices | ||
1002 | Thanks to Dennis Hou <smilax@mindmeld.yi.org> | ||
1003 | |||
1004 | - Cleaned up netlink support | ||
1005 | =============================================================================== | ||
1006 | Changes for patch v116 | ||
1007 | |||
1008 | - Added support for /dev/parport%d | ||
1009 | Thanks to Tim Waugh <tim@cyberelk.demon.co.uk> | ||
1010 | |||
1011 | - Fixed parallel port ATAPI tape driver | ||
1012 | |||
1013 | - Fixed Atari SLM laser printer driver | ||
1014 | =============================================================================== | ||
1015 | Changes for patch v117 | ||
1016 | |||
1017 | - Added support for COSA card | ||
1018 | Thanks to Dennis Hou <smilax@mindmeld.yi.org> | ||
1019 | |||
1020 | - Fixed drivers/char/ppdev.c: missing #include <linux/init.h> | ||
1021 | |||
1022 | - Fixed drivers/char/ftape/zftape/zftape-init.c | ||
1023 | Thanks to Vladimir Popov <mashgrad@usa.net> | ||
1024 | =============================================================================== | ||
1025 | Changes for patch v118 | ||
1026 | |||
1027 | - Ported to kernel 2.3.15-pre3 | ||
1028 | |||
1029 | - Fixed bug in loop driver | ||
1030 | |||
1031 | - Unregister /dev/lp%d entries in drivers/char/lp.c | ||
1032 | Thanks to Maciej W. Rozycki <macro@ds2.pg.gda.pl> | ||
1033 | =============================================================================== | ||
1034 | Changes for patch v119 | ||
1035 | |||
1036 | - Ported to kernel 2.3.16 | ||
1037 | =============================================================================== | ||
1038 | Changes for patch v120 | ||
1039 | |||
1040 | - Fixed bug in drivers/scsi/scsi.c | ||
1041 | |||
1042 | - Added /dev/ppp | ||
1043 | Thanks to Dennis Hou <smilax@mindmeld.yi.org> | ||
1044 | |||
1045 | - Ported to kernel 2.3.17 | ||
1046 | =============================================================================== | ||
1047 | Changes for patch v121 | ||
1048 | |||
1049 | - Fixed bug in drivers/block/loop.c | ||
1050 | |||
1051 | - Ported to kernel 2.3.18 | ||
1052 | =============================================================================== | ||
1053 | Changes for patch v122 | ||
1054 | |||
1055 | - Ported to kernel 2.3.19 | ||
1056 | =============================================================================== | ||
1057 | Changes for patch v123 | ||
1058 | |||
1059 | - Ported to kernel 2.3.20 | ||
1060 | =============================================================================== | ||
1061 | Changes for patch v124 | ||
1062 | |||
1063 | - Ported to kernel 2.3.21 | ||
1064 | =============================================================================== | ||
1065 | Changes for patch v125 | ||
1066 | |||
1067 | - Created <devfs_get_info>, <devfs_set_info>, | ||
1068 | <devfs_get_first_child> and <devfs_get_next_sibling> | ||
1069 | Added <<dir>> parameter to <devfs_register>, <devfs_mk_compat>, | ||
1070 | <devfs_mk_dir> and <devfs_find_handle> | ||
1071 | Work sponsored by SGI | ||
1072 | |||
1073 | - Fixed apparent bug in COSA driver | ||
1074 | |||
1075 | - Re-instated "scsihosts=" boot option | ||
1076 | =============================================================================== | ||
1077 | Changes for patch v126 | ||
1078 | |||
1079 | - Always create /dev/pts if CONFIG_UNIX98_PTYS=y | ||
1080 | |||
1081 | - Fixed call to <devfs_mk_dir> in drivers/block/ide-disk.c | ||
1082 | Thanks to Dennis Hou <smilax@mindmeld.yi.org> | ||
1083 | |||
1084 | - Allow multiple unregistrations | ||
1085 | |||
1086 | - Created /dev/scsi hierarchy | ||
1087 | Work sponsored by SGI | ||
1088 | =============================================================================== | ||
1089 | Changes for patch v127 | ||
1090 | |||
1091 | Work sponsored by SGI | ||
1092 | |||
1093 | - No longer disable devpts if devfs enabled (caveat emptor) | ||
1094 | |||
1095 | - Added flags array to struct gendisk and removed code from | ||
1096 | drivers/scsi/sd.c | ||
1097 | |||
1098 | - Created /dev/discs hierarchy | ||
1099 | =============================================================================== | ||
1100 | Changes for patch v128 | ||
1101 | |||
1102 | Work sponsored by SGI | ||
1103 | |||
1104 | - Created /dev/cdroms hierarchy | ||
1105 | =============================================================================== | ||
1106 | Changes for patch v129 | ||
1107 | |||
1108 | Work sponsored by SGI | ||
1109 | |||
1110 | - Removed compatibility entries for sound devices | ||
1111 | |||
1112 | - Removed compatibility entries for printer devices | ||
1113 | |||
1114 | - Removed compatibility entries for video4linux devices | ||
1115 | |||
1116 | - Removed compatibility entries for parallel port devices | ||
1117 | |||
1118 | - Removed compatibility entries for frame buffer devices | ||
1119 | =============================================================================== | ||
1120 | Changes for patch v130 | ||
1121 | |||
1122 | Work sponsored by SGI | ||
1123 | |||
1124 | - Added major and minor number to devfsd protocol | ||
1125 | |||
1126 | - Incremented devfsd protocol revision to 5 | ||
1127 | |||
1128 | - Removed compatibility entries for SoundBlaster CD-ROMs | ||
1129 | |||
1130 | - Removed compatibility entries for netlink devices | ||
1131 | |||
1132 | - Removed compatibility entries for SCSI generic devices | ||
1133 | |||
1134 | - Removed compatibility entries for SCSI tape devices | ||
1135 | =============================================================================== | ||
1136 | Changes for patch v131 | ||
1137 | |||
1138 | Work sponsored by SGI | ||
1139 | |||
1140 | - Support info pointer for all devfs entry types | ||
1141 | |||
1142 | - Added <<info>> parameter to <devfs_mk_dir> and <devfs_mk_symlink> | ||
1143 | |||
1144 | - Removed /dev/st hierarchy | ||
1145 | |||
1146 | - Removed /dev/sg hierarchy | ||
1147 | |||
1148 | - Removed compatibility entries for loop devices | ||
1149 | |||
1150 | - Removed compatibility entries for IDE tape devices | ||
1151 | |||
1152 | - Removed compatibility entries for SCSI CD-ROMs | ||
1153 | |||
1154 | - Removed /dev/sr hierarchy | ||
1155 | =============================================================================== | ||
1156 | Changes for patch v132 | ||
1157 | |||
1158 | Work sponsored by SGI | ||
1159 | |||
1160 | - Removed compatibility entries for floppy devices | ||
1161 | |||
1162 | - Removed compatibility entries for RAMDISCs | ||
1163 | |||
1164 | - Removed compatibility entries for meta-devices | ||
1165 | |||
1166 | - Removed compatibility entries for SCSI discs | ||
1167 | |||
1168 | - Created <devfs_make_root> | ||
1169 | |||
1170 | - Removed /dev/sd hierarchy | ||
1171 | |||
1172 | - Support "../" when searching devfs namespace | ||
1173 | |||
1174 | - Created /dev/ide/host* hierarchy | ||
1175 | |||
1176 | - Supported IDE hard discs in /dev/ide/host* hierarchy | ||
1177 | |||
1178 | - Removed compatibility entries for IDE discs | ||
1179 | |||
1180 | - Removed /dev/ide/hd hierarchy | ||
1181 | |||
1182 | - Supported IDE CD-ROMs in /dev/ide/host* hierarchy | ||
1183 | |||
1184 | - Removed compatibility entries for IDE CD-ROMs | ||
1185 | |||
1186 | - Removed /dev/ide/cd hierarchy | ||
1187 | =============================================================================== | ||
1188 | Changes for patch v133 | ||
1189 | |||
1190 | Work sponsored by SGI | ||
1191 | |||
1192 | - Created <devfs_get_unregister_slave> | ||
1193 | |||
1194 | - Fixed bug in fs/partitions/check.c when rescanning | ||
1195 | =============================================================================== | ||
1196 | Changes for patch v134 | ||
1197 | |||
1198 | Work sponsored by SGI | ||
1199 | |||
1200 | - Removed /dev/sd, /dev/sr, /dev/st and /dev/sg directories | ||
1201 | |||
1202 | - Removed /dev/ide/hd directory | ||
1203 | |||
1204 | - Exported <devfs_get_parent> | ||
1205 | |||
1206 | - Created <devfs_register_tape> and /dev/tapes hierarchy | ||
1207 | |||
1208 | - Removed /dev/ide/mt hierarchy | ||
1209 | |||
1210 | - Removed /dev/ide/fd hierarchy | ||
1211 | |||
1212 | - Ported to kernel 2.3.25 | ||
1213 | =============================================================================== | ||
1214 | Changes for patch v135 | ||
1215 | |||
1216 | Work sponsored by SGI | ||
1217 | |||
1218 | - Removed compatibility entries for virtual console capture devices | ||
1219 | |||
1220 | - Removed unused <devfs_set_symlink_destination> | ||
1221 | |||
1222 | - Removed compatibility entries for serial devices | ||
1223 | |||
1224 | - Removed compatibility entries for console devices | ||
1225 | |||
1226 | - Do not hide entries from devfsd or children | ||
1227 | |||
1228 | - Removed DEVFS_FL_TTY_COMPAT flag | ||
1229 | |||
1230 | - Removed "nottycompat" boot option | ||
1231 | |||
1232 | - Removed <devfs_mk_compat> | ||
1233 | =============================================================================== | ||
1234 | Changes for patch v136 | ||
1235 | |||
1236 | Work sponsored by SGI | ||
1237 | |||
1238 | - Moved BSD pty devices to /dev/pty | ||
1239 | |||
1240 | - Added DEVFS_FL_WAIT flag | ||
1241 | =============================================================================== | ||
1242 | Changes for patch v137 | ||
1243 | |||
1244 | Work sponsored by SGI | ||
1245 | |||
1246 | - Really fixed bug in fs/partitions/check.c when rescanning | ||
1247 | |||
1248 | - Support new "disc" naming scheme in <get_removable_partition> | ||
1249 | |||
1250 | - Allow NULL fops in <devfs_register> | ||
1251 | |||
1252 | - Removed redundant name functions in SCSI disc and IDE drivers | ||
1253 | =============================================================================== | ||
1254 | Changes for patch v138 | ||
1255 | |||
1256 | Work sponsored by SGI | ||
1257 | |||
1258 | - Fixed old bugs in drivers/block/paride/pt.c, drivers/char/tpqic02.c, | ||
1259 | drivers/net/wan/cosa.c and drivers/scsi/scsi.c | ||
1260 | Thanks to Sergey Kubushin <ksi@ksi-linux.com> | ||
1261 | |||
1262 | - Fall back to major table if NULL fops given to <devfs_register> | ||
1263 | =============================================================================== | ||
1264 | Changes for patch v139 | ||
1265 | |||
1266 | Work sponsored by SGI | ||
1267 | |||
1268 | - Corrected and moved <get_blkfops> and <get_chrfops> declarations | ||
1269 | from arch/alpha/kernel/osf_sys.c to include/linux/fs.h | ||
1270 | |||
1271 | - Removed name function from struct gendisk | ||
1272 | |||
1273 | - Updated devfs FAQ | ||
1274 | =============================================================================== | ||
1275 | Changes for patch v140 | ||
1276 | |||
1277 | Work sponsored by SGI | ||
1278 | |||
1279 | - Ported to kernel 2.3.27 | ||
1280 | =============================================================================== | ||
1281 | Changes for patch v141 | ||
1282 | |||
1283 | Work sponsored by SGI | ||
1284 | |||
1285 | - Bug fix in arch/m68k/atari/joystick.c | ||
1286 | |||
1287 | - Moved ISDN and capi devices to /dev/isdn | ||
1288 | =============================================================================== | ||
1289 | Changes for patch v142 | ||
1290 | |||
1291 | Work sponsored by SGI | ||
1292 | |||
1293 | - Bug fix in drivers/block/ide-probe.c (patch confusion) | ||
1294 | =============================================================================== | ||
1295 | Changes for patch v143 | ||
1296 | |||
1297 | Work sponsored by SGI | ||
1298 | |||
1299 | - Bug fix in drivers/block/blkpg.c:partition_name() | ||
1300 | =============================================================================== | ||
1301 | Changes for patch v144 | ||
1302 | |||
1303 | Work sponsored by SGI | ||
1304 | |||
1305 | - Ported to kernel 2.3.29 | ||
1306 | |||
1307 | - Removed calls to <devfs_register> from cdu31a, cm206, mcd and mcdx | ||
1308 | CD-ROM drivers: generic driver handles this now | ||
1309 | |||
1310 | - Moved joystick devices to /dev/joysticks | ||
1311 | =============================================================================== | ||
1312 | Changes for patch v145 | ||
1313 | |||
1314 | Work sponsored by SGI | ||
1315 | |||
1316 | - Ported to kernel 2.3.30-pre3 | ||
1317 | |||
1318 | - Register whole-disc entry even for invalid partition tables | ||
1319 | |||
1320 | - Fixed bug in mounting root FS when initrd enabled | ||
1321 | |||
1322 | - Fixed device entry leak with IDE CD-ROMs | ||
1323 | |||
1324 | - Fixed compile problem with drivers/isdn/isdn_common.c | ||
1325 | |||
1326 | - Moved COSA devices to /dev/cosa | ||
1327 | |||
1328 | - Support fifos when unregistering | ||
1329 | |||
1330 | - Created <devfs_register_series> and used in many drivers | ||
1331 | |||
1332 | - Moved Coda devices to /dev/coda | ||
1333 | |||
1334 | - Moved parallel port IDE tapes to /dev/pt | ||
1335 | |||
1336 | - Moved parallel port IDE generic devices to /dev/pg | ||
1337 | =============================================================================== | ||
1338 | Changes for patch v146 | ||
1339 | |||
1340 | Work sponsored by SGI | ||
1341 | |||
1342 | - Removed obsolete DEVFS_FL_COMPAT and DEVFS_FL_TOLERANT flags | ||
1343 | |||
1344 | - Fixed compile problem with fs/coda/psdev.c | ||
1345 | |||
1346 | - Reinstate change to <devfs_register_blkdev> in | ||
1347 | drivers/block/ide-probe.c now that fs/isofs/inode.c is fixed | ||
1348 | |||
1349 | - Switched to <devfs_register_blkdev> in drivers/block/floppy.c, | ||
1350 | drivers/scsi/sr.c and drivers/block/md.c | ||
1351 | |||
1352 | - Moved DAC960 devices to /dev/dac960 | ||
1353 | =============================================================================== | ||
1354 | Changes for patch v147 | ||
1355 | |||
1356 | Work sponsored by SGI | ||
1357 | |||
1358 | - Ported to kernel 2.3.32-pre4 | ||
1359 | =============================================================================== | ||
1360 | Changes for patch v148 | ||
1361 | |||
1362 | Work sponsored by SGI | ||
1363 | |||
1364 | - Removed kmod support: use devfsd instead | ||
1365 | |||
1366 | - Moved miscellaneous character devices to /dev/misc | ||
1367 | =============================================================================== | ||
1368 | Changes for patch v149 | ||
1369 | |||
1370 | Work sponsored by SGI | ||
1371 | |||
1372 | - Ensure include/linux/joystick.h is OK for user-space | ||
1373 | |||
1374 | - Improved debugging in <get_vfs_inode> | ||
1375 | |||
1376 | - Ensure dentries created by devfsd will be cleaned up | ||
1377 | =============================================================================== | ||
1378 | Changes for patch v150 | ||
1379 | |||
1380 | Work sponsored by SGI | ||
1381 | |||
1382 | - Ported to kernel 2.3.34 | ||
1383 | =============================================================================== | ||
1384 | Changes for patch v151 | ||
1385 | |||
1386 | Work sponsored by SGI | ||
1387 | |||
1388 | - Ported to kernel 2.3.35-pre1 | ||
1389 | |||
1390 | - Created <devfs_get_name> | ||
1391 | =============================================================================== | ||
1392 | Changes for patch v152 | ||
1393 | |||
1394 | Work sponsored by SGI | ||
1395 | |||
1396 | - Updated sample modules.conf | ||
1397 | |||
1398 | - Ported to kernel 2.3.36-pre1 | ||
1399 | =============================================================================== | ||
1400 | Changes for patch v153 | ||
1401 | |||
1402 | Work sponsored by SGI | ||
1403 | |||
1404 | - Ported to kernel 2.3.42 | ||
1405 | |||
1406 | - Removed <devfs_fill_file> | ||
1407 | =============================================================================== | ||
1408 | Changes for patch v154 | ||
1409 | |||
1410 | Work sponsored by SGI | ||
1411 | |||
1412 | - Took account of device number changes for /dev/fb* | ||
1413 | =============================================================================== | ||
1414 | Changes for patch v155 | ||
1415 | |||
1416 | Work sponsored by SGI | ||
1417 | |||
1418 | - Ported to kernel 2.3.43-pre8 | ||
1419 | |||
1420 | - Moved /dev/tty0 to /dev/vc/0 | ||
1421 | |||
1422 | - Moved sequence number formatting from <_tty_make_name> to drivers | ||
1423 | =============================================================================== | ||
1424 | Changes for patch v156 | ||
1425 | |||
1426 | Work sponsored by SGI | ||
1427 | |||
1428 | - Fixed breakage in drivers/scsi/sd.c due to recent SCSI changes | ||
1429 | =============================================================================== | ||
1430 | Changes for patch v157 | ||
1431 | |||
1432 | Work sponsored by SGI | ||
1433 | |||
1434 | - Ported to kernel 2.3.45 | ||
1435 | =============================================================================== | ||
1436 | Changes for patch v158 | ||
1437 | |||
1438 | Work sponsored by SGI | ||
1439 | |||
1440 | - Ported to kernel 2.3.46-pre2 | ||
1441 | =============================================================================== | ||
1442 | Changes for patch v159 | ||
1443 | |||
1444 | Work sponsored by SGI | ||
1445 | |||
1446 | - Fixed drivers/block/md.c | ||
1447 | Thanks to Mike Galbraith <mikeg@weiden.de> | ||
1448 | |||
1449 | - Documentation fixes | ||
1450 | |||
1451 | - Moved device registration from <lp_init> to <lp_register> | ||
1452 | Thanks to Tim Waugh <twaugh@redhat.com> | ||
1453 | =============================================================================== | ||
1454 | Changes for patch v160 | ||
1455 | |||
1456 | Work sponsored by SGI | ||
1457 | |||
1458 | - Fixed drivers/char/joystick/joystick.c | ||
1459 | Thanks to Vojtech Pavlik <vojtech@suse.cz> | ||
1460 | |||
1461 | - Documentation updates | ||
1462 | |||
1463 | - Fixed arch/i386/kernel/mtrr.c if procfs and devfs not enabled | ||
1464 | |||
1465 | - Fixed drivers/char/stallion.c | ||
1466 | =============================================================================== | ||
1467 | Changes for patch v161 | ||
1468 | |||
1469 | Work sponsored by SGI | ||
1470 | |||
1471 | - Remove /dev/ide when ide-mod is unloaded | ||
1472 | |||
1473 | - Fixed bug in drivers/block/ide-probe.c when secondary but no primary | ||
1474 | |||
1475 | - Added DEVFS_FL_NO_PERSISTENCE flag | ||
1476 | |||
1477 | - Used new DEVFS_FL_NO_PERSISTENCE flag for Unix98 pty slaves | ||
1478 | |||
1479 | - Removed unnecessary call to <update_devfs_inode_from_entry> in | ||
1480 | <devfs_readdir> | ||
1481 | |||
1482 | - Only set auto-ownership for /dev/pty/s* | ||
1483 | =============================================================================== | ||
1484 | Changes for patch v162 | ||
1485 | |||
1486 | Work sponsored by SGI | ||
1487 | |||
1488 | - Set inode->i_size to correct size for symlinks | ||
1489 | Thanks to Jeremy Fitzhardinge <jeremy@goop.org> | ||
1490 | |||
1491 | - Only give lookup() method to directories to comply with new VFS | ||
1492 | assumptions | ||
1493 | |||
1494 | - Remove unnecessary tests in symlink methods | ||
1495 | |||
1496 | - Don't kill existing block ops in <devfs_read_inode> | ||
1497 | |||
1498 | - Restore auto-ownership for /dev/pty/m* | ||
1499 | =============================================================================== | ||
1500 | Changes for patch v163 | ||
1501 | |||
1502 | Work sponsored by SGI | ||
1503 | |||
1504 | - Don't create missing directories in <devfs_find_handle> | ||
1505 | |||
1506 | - Removed Documentation/filesystems/devfs/mk-devlinks | ||
1507 | |||
1508 | - Updated Documentation/filesystems/devfs/README | ||
1509 | =============================================================================== | ||
1510 | Changes for patch v164 | ||
1511 | |||
1512 | Work sponsored by SGI | ||
1513 | |||
1514 | - Fixed CONFIG_DEVFS breakage in drivers/char/serial.c introduced in | ||
1515 | linux-2.3.99-pre6-7 | ||
1516 | =============================================================================== | ||
1517 | Changes for patch v165 | ||
1518 | |||
1519 | Work sponsored by SGI | ||
1520 | |||
1521 | - Ported to kernel 2.3.99-pre6 | ||
1522 | =============================================================================== | ||
1523 | Changes for patch v166 | ||
1524 | |||
1525 | Work sponsored by SGI | ||
1526 | |||
1527 | - Added CONFIG_DEVFS_MOUNT | ||
1528 | =============================================================================== | ||
1529 | Changes for patch v167 | ||
1530 | |||
1531 | Work sponsored by SGI | ||
1532 | |||
1533 | - Updated Documentation/filesystems/devfs/README | ||
1534 | |||
1535 | - Updated sample modules.conf | ||
1536 | =============================================================================== | ||
1537 | Changes for patch v168 | ||
1538 | |||
1539 | Work sponsored by SGI | ||
1540 | |||
1541 | - Disabled multi-mount capability (use VFS bindings instead) | ||
1542 | |||
1543 | - Updated README from master HTML file | ||
1544 | =============================================================================== | ||
1545 | Changes for patch v169 | ||
1546 | |||
1547 | Work sponsored by SGI | ||
1548 | |||
1549 | - Removed multi-mount code | ||
1550 | |||
1551 | - Removed compatibility macros: VFS has changed too much | ||
1552 | =============================================================================== | ||
1553 | Changes for patch v170 | ||
1554 | |||
1555 | Work sponsored by SGI | ||
1556 | |||
1557 | - Updated README from master HTML file | ||
1558 | |||
1559 | - Merged devfs inode into devfs entry | ||
1560 | =============================================================================== | ||
1561 | Changes for patch v171 | ||
1562 | |||
1563 | Work sponsored by SGI | ||
1564 | |||
1565 | - Updated sample modules.conf | ||
1566 | |||
1567 | - Removed dead code in <devfs_register> which used to call | ||
1568 | <free_dentries> | ||
1569 | |||
1570 | - Ported to kernel 2.4.0-test2-pre3 | ||
1571 | =============================================================================== | ||
1572 | Changes for patch v172 | ||
1573 | |||
1574 | Work sponsored by SGI | ||
1575 | |||
1576 | - Changed interface to <devfs_register> | ||
1577 | |||
1578 | - Changed interface to <devfs_register_series> | ||
1579 | =============================================================================== | ||
1580 | Changes for patch v173 | ||
1581 | |||
1582 | Work sponsored by SGI | ||
1583 | |||
1584 | - Simplified interface to <devfs_mk_symlink> | ||
1585 | |||
1586 | - Simplified interface to <devfs_mk_dir> | ||
1587 | |||
1588 | - Simplified interface to <devfs_find_handle> | ||
1589 | =============================================================================== | ||
1590 | Changes for patch v174 | ||
1591 | |||
1592 | Work sponsored by SGI | ||
1593 | |||
1594 | - Updated README from master HTML file | ||
1595 | =============================================================================== | ||
1596 | Changes for patch v175 | ||
1597 | |||
1598 | Work sponsored by SGI | ||
1599 | |||
1600 | - DocBook update for fs/devfs/base.c | ||
1601 | Thanks to Tim Waugh <twaugh@redhat.com> | ||
1602 | |||
1603 | - Removed stale fs/tunnel.c (was never used or completed) | ||
1604 | =============================================================================== | ||
1605 | Changes for patch v176 | ||
1606 | |||
1607 | Work sponsored by SGI | ||
1608 | |||
1609 | - Updated ToDo list | ||
1610 | |||
1611 | - Removed sample modules.conf: now distributed with devfsd | ||
1612 | |||
1613 | - Updated README from master HTML file | ||
1614 | |||
1615 | - Ported to kernel 2.4.0-test3-pre4 (which had devfs-patch-v174) | ||
1616 | =============================================================================== | ||
1617 | Changes for patch v177 | ||
1618 | |||
1619 | - Updated README from master HTML file | ||
1620 | |||
1621 | - Documentation cleanups | ||
1622 | |||
1623 | - Ensure <devfs_generate_path> terminates string for root entry | ||
1624 | Thanks to Tim Jansen <tim@tjansen.de> | ||
1625 | |||
1626 | - Exported <devfs_get_name> to modules | ||
1627 | |||
1628 | - Make <devfs_mk_symlink> send events to devfsd | ||
1629 | |||
1630 | - Cleaned up option processing in <devfs_setup> | ||
1631 | |||
1632 | - Fixed bugs in handling symlinks: could leak or cause Oops | ||
1633 | |||
1634 | - Cleaned up directory handling by separating fops | ||
1635 | Thanks to Alexander Viro <viro@parcelfarce.linux.theplanet.co.uk> | ||
1636 | =============================================================================== | ||
1637 | Changes for patch v178 | ||
1638 | |||
1639 | - Fixed handling of inverted options in <devfs_setup> | ||
1640 | =============================================================================== | ||
1641 | Changes for patch v179 | ||
1642 | |||
1643 | - Adjusted <try_modload> to account for <devfs_generate_path> fix | ||
1644 | =============================================================================== | ||
1645 | Changes for patch v180 | ||
1646 | |||
1647 | - Fixed !CONFIG_DEVFS_FS stub declaration of <devfs_get_info> | ||
1648 | =============================================================================== | ||
1649 | Changes for patch v181 | ||
1650 | |||
1651 | - Answered question posed by Al Viro and removed his comments from <devfs_open> | ||
1652 | |||
1653 | - Moved setting of registered flag after other fields are changed | ||
1654 | |||
1655 | - Fixed race between <devfsd_close> and <devfsd_notify_one> | ||
1656 | |||
1657 | - Global VFS changes added bogus BKL to devfsd_close(): removed | ||
1658 | |||
1659 | - Widened locking in <devfs_readlink> and <devfs_follow_link> | ||
1660 | |||
1661 | - Replaced <devfsd_read> stack usage with <devfsd_ioctl> kmalloc | ||
1662 | |||
1663 | - Simplified locking in <devfsd_ioctl> and fixed memory leak | ||
1664 | =============================================================================== | ||
1665 | Changes for patch v182 | ||
1666 | |||
1667 | - Created <devfs_*alloc_major> and <devfs_*alloc_devnum> | ||
1668 | |||
1669 | - Removed broken devnum allocation and use <devfs_alloc_devnum> | ||
1670 | |||
1671 | - Fixed old devnum leak by calling new <devfs_dealloc_devnum> | ||
1672 | |||
1673 | - Created <devfs_*alloc_unique_number> | ||
1674 | |||
1675 | - Fixed number leak for /dev/cdroms/cdrom%d | ||
1676 | |||
1677 | - Fixed number leak for /dev/discs/disc%d | ||
1678 | =============================================================================== | ||
1679 | Changes for patch v183 | ||
1680 | |||
1681 | - Fixed bug in <devfs_setup> which could hang boot process | ||
1682 | =============================================================================== | ||
1683 | Changes for patch v184 | ||
1684 | |||
1685 | - Documentation typo fix for fs/devfs/util.c | ||
1686 | |||
1687 | - Fixed drivers/char/stallion.c for devfs | ||
1688 | |||
1689 | - Added DEVFSD_NOTIFY_DELETE event | ||
1690 | |||
1691 | - Updated README from master HTML file | ||
1692 | |||
1693 | - Removed #include <asm/segment.h> from fs/devfs/base.c | ||
1694 | =============================================================================== | ||
1695 | Changes for patch v185 | ||
1696 | |||
1697 | - Made <block_semaphore> and <char_semaphore> in fs/devfs/util.c | ||
1698 | private | ||
1699 | |||
1700 | - Fixed inode table races by removing it and using inode->u.generic_ip | ||
1701 | instead | ||
1702 | |||
1703 | - Moved <devfs_read_inode> into <get_vfs_inode> | ||
1704 | |||
1705 | - Moved <devfs_write_inode> into <devfs_notify_change> | ||
1706 | =============================================================================== | ||
1707 | Changes for patch v186 | ||
1708 | |||
1709 | - Fixed race in <devfs_do_symlink> for uni-processor | ||
1710 | |||
1711 | - Updated README from master HTML file | ||
1712 | =============================================================================== | ||
1713 | Changes for patch v187 | ||
1714 | |||
1715 | - Fixed drivers/char/stallion.c for devfs | ||
1716 | |||
1717 | - Fixed drivers/char/rocket.c for devfs | ||
1718 | |||
1719 | - Fixed bug in <devfs_alloc_unique_number>: limited to 128 numbers | ||
1720 | =============================================================================== | ||
1721 | Changes for patch v188 | ||
1722 | |||
1723 | - Updated major masks in fs/devfs/util.c up to Linus' "no new majors" | ||
1724 | proclamation. Block: were 126 now 122 free, char: were 26 now 19 free | ||
1725 | |||
1726 | - Updated README from master HTML file | ||
1727 | |||
1728 | - Removed remnant of multi-mount support in <devfs_mknod> | ||
1729 | |||
1730 | - Removed unused DEVFS_FL_SHOW_UNREG flag | ||
1731 | =============================================================================== | ||
1732 | Changes for patch v189 | ||
1733 | |||
1734 | - Removed nlink field from struct devfs_inode | ||
1735 | |||
1736 | - Removed auto-ownership for /dev/pty/* (BSD ptys) and used | ||
1737 | DEVFS_FL_CURRENT_OWNER|DEVFS_FL_NO_PERSISTENCE for /dev/pty/s* (just | ||
1738 | like Unix98 pty slaves) and made /dev/pty/m* rw-rw-rw- access | ||
1739 | =============================================================================== | ||
1740 | Changes for patch v190 | ||
1741 | |||
1742 | - Updated README from master HTML file | ||
1743 | |||
1744 | - Replaced BKL with global rwsem to protect symlink data (quick and | ||
1745 | dirty hack) | ||
1746 | =============================================================================== | ||
1747 | Changes for patch v191 | ||
1748 | |||
1749 | - Replaced global rwsem for symlink with per-link refcount | ||
1750 | =============================================================================== | ||
1751 | Changes for patch v192 | ||
1752 | |||
1753 | - Removed unnecessary #ifdef CONFIG_DEVFS_FS from arch/i386/kernel/mtrr.c | ||
1754 | |||
1755 | - Ported to kernel 2.4.10-pre11 | ||
1756 | |||
1757 | - Set inode->i_mapping->a_ops for block nodes in <get_vfs_inode> | ||
1758 | =============================================================================== | ||
1759 | Changes for patch v193 | ||
1760 | |||
1761 | - Went back to global rwsem for symlinks (refcount scheme no good) | ||
1762 | =============================================================================== | ||
1763 | Changes for patch v194 | ||
1764 | |||
1765 | - Fixed overrun in <devfs_link> by removing function (not needed) | ||
1766 | |||
1767 | - Updated README from master HTML file | ||
1768 | =============================================================================== | ||
1769 | Changes for patch v195 | ||
1770 | |||
1771 | - Fixed buffer underrun in <try_modload> | ||
1772 | |||
1773 | - Moved down_read() from <search_for_entry_in_dir> to <find_entry> | ||
1774 | =============================================================================== | ||
1775 | Changes for patch v196 | ||
1776 | |||
1777 | - Fixed race in <devfsd_ioctl> when setting event mask | ||
1778 | Thanks to Kari Hurtta <hurtta@leija.mh.fmi.fi> | ||
1779 | |||
1780 | - Avoid deadlock in <devfs_follow_link> by using temporary buffer | ||
1781 | =============================================================================== | ||
1782 | Changes for patch v197 | ||
1783 | |||
1784 | - First release of new locking code for devfs core (v1.0) | ||
1785 | |||
1786 | - Fixed bug in drivers/cdrom/cdrom.c | ||
1787 | =============================================================================== | ||
1788 | Changes for patch v198 | ||
1789 | |||
1790 | - Discard temporary buffer, now use "%s" for dentry names | ||
1791 | |||
1792 | - Don't generate path in <try_modload>: use fake entry instead | ||
1793 | |||
1794 | - Use "existing" directory in <_devfs_make_parent_for_leaf> | ||
1795 | |||
1796 | - Use slab cache rather than fixed buffer for devfsd events | ||
1797 | =============================================================================== | ||
1798 | Changes for patch v199 | ||
1799 | |||
1800 | - Removed obsolete usage of DEVFS_FL_NO_PERSISTENCE | ||
1801 | |||
1802 | - Send DEVFSD_NOTIFY_REGISTERED events in <devfs_mk_dir> | ||
1803 | |||
1804 | - Fixed locking bug in <devfs_d_revalidate_wait> due to typo | ||
1805 | |||
1806 | - Do not send CREATE, CHANGE, ASYNC_OPEN or DELETE events from devfsd | ||
1807 | or children | ||
1808 | =============================================================================== | ||
1809 | Changes for patch v200 | ||
1810 | |||
1811 | - Ported to kernel 2.5.1-pre2 | ||
1812 | =============================================================================== | ||
1813 | Changes for patch v201 | ||
1814 | |||
1815 | - Fixed bug in <devfsd_read>: was dereferencing freed pointer | ||
1816 | =============================================================================== | ||
1817 | Changes for patch v202 | ||
1818 | |||
1819 | - Fixed bug in <devfsd_close>: was dereferencing freed pointer | ||
1820 | |||
1821 | - Added process group check for devfsd privileges | ||
1822 | =============================================================================== | ||
1823 | Changes for patch v203 | ||
1824 | |||
1825 | - Use SLAB_ATOMIC in <devfsd_notify_de> from <devfs_d_delete> | ||
1826 | =============================================================================== | ||
1827 | Changes for patch v204 | ||
1828 | |||
1829 | - Removed long obsolete rc.devfs | ||
1830 | |||
1831 | - Return old entry in <devfs_mk_dir> for 2.4.x kernels | ||
1832 | |||
1833 | - Updated README from master HTML file | ||
1834 | |||
1835 | - Increment refcount on module in <check_disc_changed> | ||
1836 | |||
1837 | - Created <devfs_get_handle> and exported <devfs_put> | ||
1838 | |||
1839 | - Increment refcount on module in <devfs_get_ops> | ||
1840 | |||
1841 | - Created <devfs_put_ops> and used where needed to fix races | ||
1842 | |||
1843 | - Added clarifying comments in response to preliminary EMC code review | ||
1844 | |||
1845 | - Added poisoning to <devfs_put> | ||
1846 | |||
1847 | - Improved debugging messages | ||
1848 | |||
1849 | - Fixed unregister bugs in drivers/md/lvm-fs.c | ||
1850 | =============================================================================== | ||
1851 | Changes for patch v205 | ||
1852 | |||
1853 | - Corrected (made useful) debugging message in <unregister> | ||
1854 | |||
1855 | - Moved <kmem_cache_create> in <mount_devfs_fs> to <init_devfs_fs> | ||
1856 | |||
1857 | - Fixed drivers/md/lvm-fs.c to create "lvm" entry | ||
1858 | |||
1859 | - Added magic number to guard against scribbling drivers | ||
1860 | |||
1861 | - Only return old entry in <devfs_mk_dir> if a directory | ||
1862 | |||
1863 | - Defined macros for error and debug messages | ||
1864 | |||
1865 | - Updated README from master HTML file | ||
1866 | =============================================================================== | ||
1867 | Changes for patch v206 | ||
1868 | |||
1869 | - Added support for multiple Compaq cpqarray controllers | ||
1870 | |||
1871 | - Fixed (rare, old) race in <devfs_lookup> | ||
1872 | =============================================================================== | ||
1873 | Changes for patch v207 | ||
1874 | |||
1875 | - Fixed deadlock bug in <devfs_d_revalidate_wait> | ||
1876 | |||
1877 | - Tag VFS deletable in <devfs_mk_symlink> if handle ignored | ||
1878 | |||
1879 | - Updated README from master HTML file | ||
1880 | =============================================================================== | ||
1881 | Changes for patch v208 | ||
1882 | |||
1883 | - Added KERN_* to remaining messages | ||
1884 | |||
1885 | - Cleaned up declaration of <stat_read> | ||
1886 | |||
1887 | - Updated README from master HTML file | ||
1888 | =============================================================================== | ||
1889 | Changes for patch v209 | ||
1890 | |||
1891 | - Updated README from master HTML file | ||
1892 | |||
1893 | - Removed silently introduced calls to lock_kernel() and | ||
1894 | unlock_kernel() due to recent VFS locking changes. BKL isn't | ||
1895 | required in devfs | ||
1896 | |||
1897 | - Changed <devfs_rmdir> to allow later additions if not yet empty | ||
1898 | |||
1899 | - Added calls to <devfs_register_partitions> in drivers/block/blkpg.c | ||
1900 | <add_partition> and <del_partition> | ||
1901 | |||
1902 | - Fixed bug in <devfs_alloc_unique_number>: was clearing beyond | ||
1903 | bitfield | ||
1904 | |||
1905 | - Fixed bitfield data type for <devfs_*alloc_devnum> | ||
1906 | |||
1907 | - Made major bitfield type and initialiser 64 bit safe | ||
1908 | =============================================================================== | ||
1909 | Changes for patch v210 | ||
1910 | |||
1911 | - Updated fs/devfs/util.c to fix shift warning on 64 bit machines | ||
1912 | Thanks to Anton Blanchard <anton@samba.org> | ||
1913 | |||
1914 | - Updated README from master HTML file | ||
1915 | =============================================================================== | ||
1916 | Changes for patch v211 | ||
1917 | |||
1918 | - Do not put miscellaneous character devices in /dev/misc if they | ||
1919 | specify their own directory (i.e. contain a '/' character) | ||
1920 | |||
1921 | - Copied macro for error messages from fs/devfs/base.c to | ||
1922 | fs/devfs/util.c and made use of this macro | ||
1923 | |||
1924 | - Removed 2.4.x compatibility code from fs/devfs/base.c | ||
1925 | =============================================================================== | ||
1926 | Changes for patch v212 | ||
1927 | |||
1928 | - Added BKL to <devfs_open> because drivers still need it | ||
1929 | =============================================================================== | ||
1930 | Changes for patch v213 | ||
1931 | |||
1932 | - Protected <scan_dir_for_removable> and <get_removable_partition> | ||
1933 | from changing directory contents | ||
1934 | =============================================================================== | ||
1935 | Changes for patch v214 | ||
1936 | |||
1937 | - Switched to ISO C structure field initialisers | ||
1938 | |||
1939 | - Switch to set_current_state() and move before add_wait_queue() | ||
1940 | |||
1941 | - Updated README from master HTML file | ||
1942 | |||
1943 | - Fixed devfs entry leak in <devfs_readdir> when *readdir fails | ||
1944 | =============================================================================== | ||
1945 | Changes for patch v215 | ||
1946 | |||
1947 | - Created <devfs_find_and_unregister> | ||
1948 | |||
1949 | - Switched many functions from <devfs_find_handle> to | ||
1950 | <devfs_find_and_unregister> | ||
1951 | |||
1952 | - Switched many functions from <devfs_find_handle> to <devfs_get_handle> | ||
1953 | =============================================================================== | ||
1954 | Changes for patch v216 | ||
1955 | |||
1956 | - Switched arch/ia64/sn/io/hcl.c from <devfs_find_handle> to | ||
1957 | <devfs_get_handle> | ||
1958 | |||
1959 | - Removed deprecated <devfs_find_handle> | ||
1960 | =============================================================================== | ||
1961 | Changes for patch v217 | ||
1962 | |||
1963 | - Exported <devfs_find_and_unregister> and <devfs_only> to modules | ||
1964 | |||
1965 | - Updated README from master HTML file | ||
1966 | |||
1967 | - Fixed module unload race in <devfs_open> | ||
1968 | =============================================================================== | ||
1969 | Changes for patch v218 | ||
1970 | |||
1971 | - Removed DEVFS_FL_AUTO_OWNER flag | ||
1972 | |||
1973 | - Switched lingering structure field initialiser to ISO C | ||
1974 | |||
1975 | - Added locking when setting/clearing flags | ||
1976 | |||
1977 | - Documentation fix in fs/devfs/util.c | ||
diff --git a/Documentation/filesystems/devfs/README b/Documentation/filesystems/devfs/README new file mode 100644 index 000000000000..54366ecc241f --- /dev/null +++ b/Documentation/filesystems/devfs/README | |||
@@ -0,0 +1,1964 @@ | |||
1 | Devfs (Device File System) FAQ | ||
2 | |||
3 | |||
4 | Linux Devfs (Device File System) FAQ | ||
5 | Richard Gooch | ||
6 | 20-AUG-2002 | ||
7 | |||
8 | |||
16 | |||
17 | ----------------------------------------------------------------------------- | ||
18 | |||
19 | NOTE: the master copy of this document is available online at: | ||
20 | |||
21 | http://www.atnf.csiro.au/~rgooch/linux/docs/devfs.html | ||
22 | and looks much better than the text version distributed with the | ||
23 | kernel sources. A mirror site is available at: | ||
24 | |||
25 | http://www.ras.ucalgary.ca/~rgooch/linux/docs/devfs.html | ||
26 | |||
27 | There is also an optional daemon that may be used with devfs. You can | ||
28 | find out more about it at: | ||
29 | |||
30 | http://www.atnf.csiro.au/~rgooch/linux/ | ||
31 | |||
32 | A mailing list is available which you may subscribe to. Send | ||
33 | email | ||
34 | to majordomo@oss.sgi.com with the following line in the | ||
35 | body of the message: | ||
36 | subscribe devfs | ||
37 | To unsubscribe, send the message body: | ||
38 | unsubscribe devfs | ||
39 | instead. The list is archived at | ||
40 | |||
41 | http://oss.sgi.com/projects/devfs/archive/. | ||
42 | |||
43 | ----------------------------------------------------------------------------- | ||
44 | |||
45 | Contents | ||
46 | |||
47 | |||
48 | What is it? | ||
49 | |||
50 | Why do it? | ||
51 | |||
52 | Who else does it? | ||
53 | |||
54 | How it works | ||
55 | |||
56 | Operational issues (essential reading) | ||
57 | |||
58 | Instructions for the impatient | ||
59 | Permissions persistence across reboots | ||
60 | Dealing with drivers without devfs support | ||
61 | All the way with Devfs | ||
62 | Other Issues | ||
63 | Kernel Naming Scheme | ||
64 | Devfsd Naming Scheme | ||
65 | Old Compatibility Names | ||
66 | SCSI Host Probing Issues | ||
67 | |||
68 | |||
69 | |||
70 | Device drivers currently ported | ||
71 | |||
72 | Allocation of Device Numbers | ||
73 | |||
74 | Questions and Answers | ||
75 | |||
76 | Making things work | ||
77 | Alternatives to devfs | ||
78 | What I don't like about devfs | ||
79 | How to report bugs | ||
80 | Strange kernel messages | ||
81 | Compilation problems with devfsd | ||
82 | |||
83 | |||
84 | Other resources | ||
85 | |||
86 | Translations of this document | ||
87 | |||
88 | |||
89 | ----------------------------------------------------------------------------- | ||
90 | |||
91 | |||
92 | What is it? | ||
93 | |||
94 | Devfs is an alternative to "real" character and block special devices | ||
95 | on your root filesystem. Kernel device drivers can register devices by | ||
96 | name rather than major and minor numbers. These devices will appear in | ||
97 | devfs automatically, with whatever default ownership and | ||
98 | protection the driver specified. A daemon (devfsd) can be used to | ||
99 | override these defaults. Devfs has been in the kernel since 2.3.46. | ||
100 | |||
101 | NOTE that devfs is entirely optional. If you prefer the old | ||
102 | disc-based device nodes, then simply leave CONFIG_DEVFS_FS=n (the | ||
103 | default). In this case, nothing will change. ALSO NOTE that if you do | ||
104 | enable devfs, the defaults are such that full compatibility is | ||
105 | maintained with the old device names. | ||
106 | |||
107 | There are two aspects to devfs: one is the underlying device | ||
108 | namespace, which is a namespace just like any mounted filesystem. The | ||
109 | other aspect is the filesystem code which provides a view of the | ||
110 | device namespace. The reason I make a distinction is because devfs | ||
111 | can be mounted many times, with each mount showing the same device | ||
112 | namespace. Changes made are global to all mounted devfs filesystems. | ||
113 | Also, because the devfs namespace exists without any devfs mounts, you | ||
114 | can easily mount the root filesystem by referring to an entry in the | ||
115 | devfs namespace. | ||
116 | |||
117 | |||
118 | The cost of devfs is a small increase in kernel code size and memory | ||
119 | usage: about 7 pages of code (some of that in __init sections) and 72 | ||
120 | bytes for each entry in the namespace. A modest system has only a | ||
121 | couple of hundred device entries, so this costs a few more | ||
122 | pages. Compare this with the suggestion to put /dev on a | ||
123 | ramdisc. | ||
124 | |||
125 | On a typical machine, the cost is under 0.2 percent. On a modest | ||
126 | system with 64 MBytes of RAM, the cost is under 0.1 percent. The | ||
127 | accusations of "bloatware" levelled at devfs are not justified. | ||
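
As a rough check of the figures above, here is a small Python sketch that adds up the quoted costs. The 7 pages, 72 bytes per entry and 64 MBytes come from the text. The 4096-byte page size and the count of 200 entries are my assumptions for illustration (the text only says "a couple of hundred").

```python
PAGE_SIZE = 4096          # assumption: 4 KiB pages, as on x86
CODE_PAGES = 7            # from the text
BYTES_PER_ENTRY = 72      # from the text
NUM_ENTRIES = 200         # assumption: "a couple of hundred" entries
RAM_BYTES = 64 * 1024 * 1024


def devfs_cost_bytes(num_entries=NUM_ENTRIES):
    """Return the estimated devfs memory cost in bytes."""
    return CODE_PAGES * PAGE_SIZE + num_entries * BYTES_PER_ENTRY


def cost_percent(ram_bytes=RAM_BYTES, num_entries=NUM_ENTRIES):
    """Return the estimated cost as a percentage of total RAM."""
    return 100.0 * devfs_cost_bytes(num_entries) / ram_bytes


if __name__ == "__main__":
    print(devfs_cost_bytes(), "bytes,", cost_percent(), "percent of RAM")
```

Under these assumptions the estimate stays below the 0.1 percent quoted for a 64 MByte system.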
128 | |||
129 | ----------------------------------------------------------------------------- | ||
130 | |||
131 | |||
132 | Why do it? | ||
133 | |||
134 | There are several problems that devfs addresses. Some of these | ||
135 | problems are more serious than others (depending on your point of | ||
136 | view), and some can be solved without devfs. However, the totality of | ||
137 | these problems really calls out for devfs. | ||
138 | |||
139 | The choice is between a patchwork of inefficient user space solutions, | ||
140 | which are complex and likely to be fragile, and a simple and efficient | ||
141 | devfs which is robust. | ||
142 | |||
143 | There have been many counter-proposals to devfs, all seeking to | ||
144 | provide some of the benefits without actually implementing devfs. So | ||
145 | far there has been an absence of code and no proposed alternative has | ||
146 | been able to provide all the features that devfs does. Further, | ||
147 | alternative proposals require far more complexity in user-space (and | ||
148 | still deliver less functionality than devfs). Some people have the | ||
149 | mantra of reducing "kernel bloat", but don't consider the effects on | ||
150 | user-space. | ||
151 | |||
152 | A good solution limits the total complexity of kernel-space and | ||
153 | user-space. | ||
154 | |||
155 | |||
156 | Major&minor allocation | ||
157 | |||
158 | The existing scheme requires the allocation of major and minor device | ||
159 | numbers for each and every device. This means that a central | ||
160 | co-ordinating authority is required to issue these device numbers | ||
161 | (unless you're developing a "private" device driver), in order to | ||
162 | preserve uniqueness. Devfs shifts the burden to a namespace. This may | ||
163 | not seem like a huge benefit, but actually it is. Since driver authors | ||
164 | will naturally choose a device name which reflects the functionality | ||
165 | of the device, there is far less potential for namespace conflict. | ||
166 | Solving this requires a kernel change. | ||
167 | |||
168 | /dev management | ||
169 | |||
170 | Because you currently access devices through device nodes, these must | ||
171 | be created by the system administrator. For standard devices you can | ||
172 | usually find a MAKEDEV programme which creates all these (hundreds!) | ||
173 | of nodes. This means that changes in the kernel must be reflected by | ||
174 | changes in the MAKEDEV programme, or else the system administrator | ||
175 | creates device nodes by hand. | ||
176 | |||
177 | The basic problem is that there are two separate databases of | ||
178 | major and minor numbers. One is in the kernel and one is in /dev (or | ||
179 | in a MAKEDEV programme, if you want to look at it that way). This is | ||
180 | duplication of information, which is not good practice. | ||
181 | Solving this requires a kernel change. | ||
182 | |||
183 | /dev growth | ||
184 | |||
185 | A typical /dev has over 1200 nodes! Most of these devices simply don't | ||
186 | exist because the hardware is not available. A huge /dev increases the | ||
187 | time to access devices (I'm just referring to the dentry lookup times | ||
188 | and the time taken to read inodes off disc: the next subsection shows | ||
189 | some more horrors). | ||
190 | |||
191 | An example of how big /dev can grow is if we consider SCSI devices: | ||
192 | |||
193 | host 6 bits (say up to 64 hosts on a really big machine) | ||
194 | channel 4 bits (say up to 16 SCSI buses per host) | ||
195 | id 4 bits | ||
196 | lun 3 bits | ||
197 | partition 6 bits | ||
198 | TOTAL 23 bits | ||
199 | |||
200 | |||
201 | This requires 8 Mega (8 * 1024 * 1024) inodes if we want to store all | ||
202 | possible device nodes. Even if we scrap everything but id and partition | ||
203 | and assume a single host adapter with a single SCSI bus and only one | ||
204 | logical unit per SCSI target (id), that's still 10 bits or 1024 | ||
205 | inodes. Each VFS inode takes around 256 bytes (kernel 2.1.78), so | ||
206 | that's 256 kBytes of inode storage on disc (assuming real inodes take | ||
207 | a similar amount of space as VFS inodes). This is actually not so bad, | ||
208 | because disc is cheap these days. Embedded systems would care about | ||
209 | 256 kBytes of /dev inodes, but you could argue that embedded systems | ||
210 | would have hand-tuned /dev directories. I've had to do just that on my | ||
211 | embedded systems, but I would rather just leave it to devfs. | ||
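
The arithmetic in the table and paragraph above can be reproduced with a short Python sketch. The bit widths and the 256-byte inode size are taken from the text; the function and variable names are mine and purely illustrative.

```python
# Bit widths from the table above.
FIELD_BITS = {"host": 6, "channel": 4, "id": 4, "lun": 3, "partition": 6}
INODE_BYTES = 256  # approximate VFS inode size quoted in the text


def node_count(fields):
    """Number of distinct device nodes addressable by the given fields."""
    bits = sum(FIELD_BITS[name] for name in fields)
    return 2 ** bits


all_nodes = node_count(FIELD_BITS)                # every field: 23 bits
reduced_nodes = node_count(["id", "partition"])   # one host, one bus, one lun
reduced_bytes = reduced_nodes * INODE_BYTES       # inode storage for those

if __name__ == "__main__":
    print(all_nodes, reduced_nodes, reduced_bytes)
```

Here all_nodes comes to 8 * 1024 * 1024, and the reduced case to 1024 nodes or 256 kBytes of inodes, matching the figures in the text.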
212 | |||
213 | Another issue is the time taken to look up an inode when first | ||
214 | referenced. Not only does this take time in scanning through a list in | ||
215 | memory, but also the seek times to read the inodes off disc. | ||
216 | This could be solved in user-space using a clever programme which | ||
217 | scanned the kernel logs and deleted /dev entries which are not | ||
218 | available and created them when they were available. This programme | ||
219 | would need to be run every time a new module was loaded, which would | ||
220 | slow things down a lot. | ||
221 | |||
222 | There is an existing programme called scsidev which will automatically | ||
223 | create device nodes for SCSI devices. It can do this by scanning files | ||
224 | in /proc/scsi. Unfortunately, to extend this idea to other device | ||
225 | nodes would require significant modifications to existing drivers (so | ||
226 | they too would provide information in /proc). This is a non-trivial | ||
227 | change (I should know: devfs has had to do something similar). Once | ||
228 | you go to this much effort, you may as well use devfs itself (which | ||
229 | also provides this information). Furthermore, such a system would | ||
230 | likely be implemented in an ad-hoc fashion, as different drivers will | ||
231 | provide their information in different ways. | ||
232 | |||
233 | Devfs is much cleaner, because it (naturally) has a uniform mechanism | ||
234 | to provide this information: the device nodes themselves! | ||
235 | |||
236 | |||
237 | Node to driver file_operations translation | ||
238 | |||
239 | There is an important difference between the way disc-based character | ||
240 | and block nodes and devfs entries make the connection between an entry | ||
241 | in /dev and the actual device driver. | ||
242 | |||
243 | With the current 8 bit major and minor numbers the connection between | ||
244 | disc-based c&b nodes and per-major drivers is done through a | ||
245 | fixed-length table of 128 entries. The various filesystem types set | ||
246 | the inode operations for c&b nodes to {chr,blk}dev_inode_operations, | ||
247 | so when a device is opened a few quick levels of indirection bring us | ||
248 | to the driver file_operations. | ||
249 | |||
250 | For miscellaneous character devices a second step is required: there | ||
251 | is a scan for the driver entry with the same minor number as the file | ||
252 | that was opened, and the appropriate minor open method is called. This | ||
253 | scanning is done *every time* you open a device node. Potentially, you | ||
254 | may be searching through dozens of misc. entries before you find your | ||
255 | open method. While not an enormous performance overhead, this does | ||
256 | seem pointless. | ||
257 | |||
258 | Linux *must* move beyond the 8 bit major and minor barrier, | ||
259 | somehow. If we simply increase each to 16 bits, then the indexing | ||
260 | scheme used for major driver lookup becomes untenable, because the | ||
261 | major tables (one each for character and block devices) would need to | ||
262 | be 64 k entries long (512 kBytes on x86, 1 MByte for 64 bit | ||
263 | systems). So we would have to use a scheme like that used for | ||
264 | miscellaneous character devices, which means the search time goes up | ||
265 | linearly with the average number of major device drivers on your | ||
266 | system. Not all "devices" are hardware, some are higher-level drivers | ||
267 | like KGI, so you can get more "devices" without adding hardware. | ||
268 | You can improve this by creating an ordered (balanced:-) | ||
269 | binary tree, in which case your search time becomes log(N). | ||
270 | Alternatively, you can use hashing to speed up the search. | ||
271 | But why do that search at all if you don't have to? Once again, it | ||
272 | seems pointless. | ||
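
The table sizes quoted above can be checked with a few lines of Python. I am assuming here that each table slot holds exactly one pointer (4 bytes on x86, 8 bytes on 64 bit systems); that assumption reproduces the figures in the text, but the real per-slot layout is not stated there.

```python
ENTRIES = 2 ** 16      # one slot per 16 bit major number
TABLES = 2             # one table each for character and block devices


def table_bytes(pointer_size):
    """Total size of both major tables, assuming one pointer per slot."""
    return TABLES * ENTRIES * pointer_size


x86_bytes = table_bytes(4)       # 32 bit pointers
wide_bytes = table_bytes(8)      # 64 bit pointers

if __name__ == "__main__":
    print(x86_bytes // 1024, "kBytes,", wide_bytes // 1024, "kBytes")
```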
273 | |||
274 | Note that devfs doesn't use the major&minor system. For devfs | ||
275 | entries, the connection is done when you look up the /dev entry. When | ||
276 | devfs_register() is called, an internal table is appended which has | ||
277 | the entry name and the file_operations. If the dentry cache doesn't | ||
278 | have the /dev entry already, this internal table is scanned to get the | ||
279 | file_operations, and an inode is created. If the dentry cache already | ||
280 | has the entry, there is *no lookup time* (other than the dentry scan | ||
281 | itself, but we can't avoid that anyway, and besides Linux dentries | ||
282 | cream other OS's which don't have them:-). Furthermore, the number of | ||
283 | node entries in a devfs is only the number of available device | ||
284 | entries, not the number of *conceivable* entries. Even if you remove | ||
285 | unnecessary entries in a disc-based /dev, the number of conceivable | ||
286 | entries remains the same: you just limit yourself in order to save | ||
287 | space. | ||
288 | |||
289 | Devfs provides a fast connection between a VFS node and the device | ||
290 | driver, in a scalable way. | ||
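
To make the lookup path described above concrete, here is a toy model in Python. It is not kernel code and does not mirror the real data structures: the class ToyDevfs and all of its names are hypothetical, the "ops" values stand in for file_operations pointers, and the dictionary stands in for the dentry cache. It only shows the claim that the internal table is scanned on the first lookup of a name and not again afterwards.

```python
class ToyDevfs:
    """Toy model of the lookup path described above; not kernel code."""

    def __init__(self):
        self.table = []          # (name, ops) pairs appended on registration
        self.dentry_cache = {}   # name -> ops, filled on first lookup
        self.table_scans = 0     # how often the internal table was scanned

    def register(self, name, ops):
        """Stand-in for devfs_register(): append name and operations."""
        self.table.append((name, ops))

    def lookup(self, name):
        """Return the ops for name, or None if no driver registered it."""
        if name in self.dentry_cache:
            return self.dentry_cache[name]      # cached: no table scan
        self.table_scans += 1
        for entry_name, ops in self.table:
            if entry_name == name:
                self.dentry_cache[name] = ops   # remember for next time
                return ops
        return None


if __name__ == "__main__":
    fs = ToyDevfs()
    fs.register("null", "null_fops")
    fs.lookup("null")
    fs.lookup("null")
    print("table scans:", fs.table_scans)
```

Note that only registered names ever enter the table, so its length follows the number of available devices rather than the number of conceivable ones.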
291 | |||
292 | /dev as a system administration tool | ||
293 | |||
294 | Right now /dev contains a list of conceivable devices, most of which I | ||
295 | don't have. Devfs only shows those devices available on my | ||
296 | system. This means that listing /dev is a handy way of checking what | ||
297 | devices are available. | ||
298 | |||
299 | Major&minor size | ||
300 | |||
301 | Existing major and minor numbers are limited to 8 bits each. This is | ||
302 | now a limiting factor for some drivers, particularly the SCSI disc | ||
303 | driver, which consumes a single major number. Only 16 discs are | ||
304 | supported, and each disc may have only 15 partitions. Maybe this isn't | ||
305 | a problem for you, but some of us are building huge Linux systems with | ||
306 | disc arrays. With devfs an arbitrary pointer can be associated with | ||
307 | each device entry, which can be used to give an effective 32 bit | ||
308 | device identifier (i.e. that's like having a 32 bit minor | ||
309 | number). Since this is private to the kernel, there are no C library | ||
310 | compatibility issues which you would have with increasing major and | ||
311 | minor number sizes. See the section on "Allocation of Device Numbers" | ||
312 | for details on maintaining compatibility with userspace. | ||
313 | |||
314 | Solving this requires a kernel change. | ||
315 | |||
316 | Since writing this, the kernel has been modified so that the SCSI disc | ||
317 | driver has more major numbers allocated to it and now supports up to | ||
318 | 128 discs. Since these major numbers are non-contiguous (a result of | ||
319 | unplanned expansion), the implementation is a little more cumbersome | ||
320 | than originally. | ||
321 | |||
322 | Just like the changes to IPv4 to fix impending limitations in the | ||
323 | address space, people find ways around the limitations. In the long | ||
324 | run, however, solutions like IPv6 or devfs can't be put off forever. | ||
325 | |||
326 | Read-only root filesystem | ||
327 | |||
328 | Having your device nodes on the root filesystem means that you can't | ||
329 | operate properly with a read-only root filesystem. This is because you | ||
330 | want to change ownerships and protections of tty devices. Existing | ||
331 | practice prevents you using a CD-ROM as your root filesystem for a | ||
332 | *real* system. Sure, you can boot off a CD-ROM, but you can't change | ||
333 | tty ownerships, so it's only good for installing. | ||
334 | |||
335 | Also, you can't use a shared NFS root filesystem for a cluster of | ||
336 | discless Linux machines (having tty ownerships changed on a common | ||
337 | /dev is not good). Nor can you embed your root filesystem in a | ||
338 | ROM-FS. | ||
339 | |||
340 | You can get around this by creating a RAMDISC at boot time, making | ||
341 | an ext2 filesystem in it, mounting it somewhere and copying the | ||
342 | contents of /dev into it, then unmounting it and mounting it over | ||
343 | /dev. | ||
344 | |||
345 | A devfs is a cleaner way of solving this. | ||
346 | |||
347 | Non-Unix root filesystem | ||
348 | |||
349 | Non-Unix filesystems (such as NTFS) can't be used for a root | ||
350 | filesystem because they variously don't support character and block | ||
351 | special files or symbolic links. You can't have a separate disc-based | ||
352 | or RAMDISC-based filesystem mounted on /dev because you need device | ||
353 | nodes before you can mount these. Devfs can be mounted without any | ||
354 | device nodes. Devlinks won't work because symlinks aren't supported. | ||
355 | An alternative solution is to use initrd to mount a RAMDISC initial | ||
356 | root filesystem (which is populated with a minimal set of device | ||
357 | nodes), and then construct a new /dev in another RAMDISC, and finally | ||
358 | switch to your non-Unix root filesystem. This requires clever boot | ||
359 | scripts and a fragile and conceptually complex boot procedure. | ||
360 | |||
361 | Devfs solves this in a robust and conceptually simple way. | ||
362 | |||
363 | PTY security | ||
364 | |||
365 | Current pseudo-tty (pty) devices are owned by root and read-writable | ||
366 | by everyone. The user of a pty-pair cannot change | ||
367 | ownership/protections without being suid-root. | ||
368 | |||
369 | This could be solved with a secure user-space daemon which runs as | ||
370 | root and does the actual creation of pty-pairs. Such a daemon would | ||
371 | require modification to *every* programme that wants to use this new | ||
372 | mechanism. It also slows down creation of pty-pairs. | ||
373 | |||
374 | An alternative is to create a new open_pty() syscall which does much | ||
375 | the same thing as the user-space daemon. Once again, this requires | ||
376 | modifications to pty-handling programmes. | ||
377 | |||
378 | The devfs solution allows a device driver to "tag" certain device | ||
379 | files so that when an unopened device is opened, the ownerships are | ||
380 | changed to the current euid and egid of the opening process, and the | ||
381 | protections are changed to the default registered by the driver. When | ||
382 | the device is closed ownership is set back to root and protections are | ||
383 | set back to read-write for everybody. No programme need be changed. | ||
384 | The devpts filesystem provides this auto-ownership feature for Unix98 | ||
385 | ptys. It doesn't support old-style pty devices, nor does it have all | ||
386 | the other features of devfs. | ||
387 | |||
388 | Intelligent device management | ||
389 | |||
390 | Devfs implements a simple yet powerful protocol for communication with | ||
391 | a device management daemon (devfsd) which runs in user space. It is | ||
392 | possible to send a message (either synchronously or asynchronously) to | ||
393 | devfsd on any event, such as registration/unregistration of device | ||
394 | entries, opening and closing devices, looking up inodes, scanning | ||
395 | directories and more. This has many possibilities. Some of these are | ||
396 | already implemented. See: | ||
397 | |||
398 | |||
399 | http://www.atnf.csiro.au/~rgooch/linux/ | ||
400 | |||
401 | Device entry registration events can be used by devfsd to change | ||
402 | permissions of newly-created device nodes. This is one mechanism to | ||
403 | control device permissions. | ||
404 | |||
405 | Device entry registration/unregistration events can be used to run | ||
406 | programmes or scripts. This can be used to provide automatic mounting | ||
407 | of filesystems when a new block device media is inserted into the | ||
408 | drive. | ||
409 | |||
410 | Asynchronous device open and close events can be used to implement | ||
411 | clever permissions management. For example, the default permissions on | ||
412 | /dev/dsp do not allow everybody to read from the device. This is | ||
413 | sensible, as you don't want some remote user recording what you say at | ||
414 | your console. However, the console user is also prevented from | ||
415 | recording. This behaviour is not desirable. With asynchronous device | ||
416 | open and close events, you can have devfsd run a programme or script | ||
417 | when console devices are opened to change the ownerships for *other* | ||
418 | device nodes (such as /dev/dsp). On closure, you can run a different | ||
419 | script to restore permissions. An advantage of this scheme over | ||
420 | modifying the C library tty handling is that this works even if your | ||
421 | programme crashes (how many times have you seen the utmp database with | ||
422 | lingering entries for non-existent logins?). | ||
423 | |||
424 | Synchronous device open events can be used to perform intelligent | ||
425 | device access protections. Before the device driver open() method is | ||
426 | called, the daemon must first validate the open attempt, by running an | ||
427 | external programme or script. This is far more flexible than access | ||
428 | control lists, as access can be determined on the basis of other | ||
429 | system conditions instead of just the UID and GID. | ||
430 | |||
431 | Inode lookup events can be used to authenticate module autoload | ||
432 | requests. Instead of using kmod directly, the event is sent to | ||
433 | devfsd which can implement an arbitrary authentication before loading | ||
434 | the module itself. | ||
435 | |||
436 | Inode lookup events can also be used to construct arbitrary | ||
437 | namespaces, without having to resort to populating devfs with symlinks | ||
438 | to devices that don't exist. | ||
439 | |||
440 | Speculative Device Scanning | ||
441 | |||
442 | Consider an application (like cdparanoia) that wants to find all | ||
443 | CD-ROM devices on the system (SCSI, IDE and other types), whether or | ||
444 | not their respective modules are loaded. The application must | ||
445 | speculatively open certain device nodes (such as /dev/sr0 for the SCSI | ||
446 | CD-ROMs) in order to make sure the module is loaded. This requires | ||
447 | that all Linux distributions follow the standard device naming scheme | ||
448 | (last time I looked RedHat did things differently). Devfs solves the | ||
449 | naming problem. | ||
450 | |||
451 | The same application also wants to see which devices are actually | ||
452 | available on the system. With the existing system it needs to read the | ||
453 | /dev directory and speculatively open each /dev/sr* device to | ||
454 | determine if the device exists or not. With a large /dev this is an | ||
455 | inefficient operation, especially if there are many /dev/sr* nodes. A | ||
456 | solution like scsidev could reduce the number of /dev/sr* entries (but | ||
457 | of course that also requires all that inefficient directory scanning). | ||
458 | |||
459 | With devfs, the application can open the /dev/sr directory | ||
460 | (which triggers the module autoloading if required), and proceed to | ||
461 | read /dev/sr. Since only the available devices will have | ||
462 | entries, there are no inefficiencies in directory scanning or device | ||
463 | openings. | ||
464 | |||
465 | ----------------------------------------------------------------------------- | ||
466 | |||
467 | Who else does it? | ||
468 | |||
469 | FreeBSD has a devfs implementation. Solaris and AIX each have a | ||
470 | pseudo-devfs (something akin to scsidev but for all devices, with some | ||
471 | unspecified kernel support). BeOS, Plan9 and QNX also have it. SGI's | ||
472 | IRIX 6.4 and above also have a device filesystem. | ||
473 | |||
474 | While we shouldn't just automatically do something because others do | ||
475 | it, we should not ignore the work of others either. FreeBSD has a lot | ||
476 | of competent people working on it, so their opinion should not be | ||
477 | blithely ignored. | ||
478 | |||
479 | ----------------------------------------------------------------------------- | ||
480 | |||
481 | |||
482 | How it works | ||
483 | |||
484 | Registering device entries | ||
485 | |||
486 | For every entry (device node) in a devfs-based /dev a driver must call | ||
487 | devfs_register(). This adds the name of the device entry, the | ||
488 | file_operations structure pointer and a few other things to an | ||
489 | internal table. Device entries may be added and removed at any | ||
490 | time. When a device entry is registered, it automagically appears in | ||
491 | any mounted devfs. | ||
492 | |||
493 | Inode lookup | ||
494 | |||
495 | When a lookup operation on an entry is performed and there is no | ||
496 | driver information for that entry, devfs will attempt to call | ||
497 | devfsd. If still no driver information can be found then a negative | ||
498 | dentry is yielded and the next stage operation will be called by the | ||
499 | VFS (such as create() or mknod() inode methods). If driver information | ||
500 | can be found, an inode is created (if one does not exist already) and | ||
501 | all is well. | ||
502 | |||
503 | Manually creating device nodes | ||
504 | |||
505 | The mknod() method allows you to create an ordinary named pipe in the | ||
506 | devfs, or you can create a character or block special inode if one | ||
507 | does not already exist. You may wish to create a character or block | ||
508 | special inode so that you can set permissions and ownership. Later, if | ||
509 | a device driver registers an entry with the same name, the | ||
510 | permissions, ownership and times are retained. This is how you can set | ||
511 | the protections on a device even before the driver is loaded. Once you | ||
512 | create an inode it appears in the directory listing. | ||
513 | |||
514 | Unregistering device entries | ||
515 | |||
516 | A device driver calls devfs_unregister() to unregister an entry. | ||
517 | |||
518 | Chroot() gaols | ||
519 | |||
520 | 2.2.x kernels | ||
521 | |||
522 | The semantics of inode creation are different when devfs is mounted | ||
523 | with the "explicit" option. Now, when a device entry is registered, it | ||
524 | will not appear until you use mknod() to create the device. It doesn't | ||
525 | matter if you mknod() before or after the device is registered with | ||
526 | devfs_register(). The purpose of this behaviour is to support | ||
527 | chroot(2) gaols, where you want to mount a minimal devfs inside the | ||
528 | gaol. Only the devices you specifically want to be available (through | ||
529 | your mknod() setup) will be accessible. | ||
530 | |||
531 | 2.4.x kernels | ||
532 | |||
533 | As of kernel 2.3.99, the VFS has had the ability to rebind parts of | ||
534 | the global filesystem namespace into another part of the namespace. | ||
535 | This now works even at the leaf-node level, which means that | ||
536 | individual files and device nodes may be bound into other parts of the | ||
537 | namespace. This is like making links, but better, because it works | ||
538 | across filesystems (unlike hard links) and works through chroot() | ||
539 | gaols (unlike symbolic links). | ||
540 | |||
541 | Because of these improvements to the VFS, the multi-mount capability | ||
542 | in devfs is no longer needed. The administrator may create a minimal | ||
543 | device tree inside a chroot(2) gaol by using VFS bindings. As this | ||
544 | provides most of the features of the devfs multi-mount capability, I | ||
545 | removed the multi-mount support code (after issuing an RFC). This | ||
546 | yielded code size reductions and simplifications. | ||
547 | |||
548 | If you want to construct a minimal chroot() gaol, the following | ||
549 | command should suffice: | ||
550 | |||
551 | mount --bind /dev/null /gaol/dev/null | ||
552 | |||
553 | |||
554 | Repeat for other device nodes you want to expose. Simple! | ||
555 | |||
556 | ----------------------------------------------------------------------------- | ||
557 | |||
558 | |||
559 | Operational issues | ||
560 | |||
561 | |||
562 | Instructions for the impatient | ||
563 | |||
564 | Nobody likes reading documentation. People just want to get in there | ||
565 | and play. So this section tells you quickly the steps you need to take | ||
566 | to run with devfs mounted over /dev. Skip these steps and you will end | ||
567 | up with a nearly unbootable system. Subsequent sections describe the | ||
568 | issues in more detail, and discuss non-essential configuration | ||
569 | options. | ||
570 | |||
571 | Devfsd | ||
572 | OK, if you're reading this, I assume you want to play with | ||
573 | devfs. First you should ensure that /usr/src/linux contains a | ||
574 | recent kernel source tree. Then you need to compile devfsd, the device | ||
575 | management daemon, available at | ||
576 | |||
577 | http://www.atnf.csiro.au/~rgooch/linux/. | ||
578 | Because the kernel has a naming scheme | ||
579 | which is quite different from the old naming scheme, you need to | ||
580 | install devfsd so that software and configuration files that use the | ||
581 | old naming scheme will not break. | ||
582 | |||
583 | Compile and install devfsd. You will be provided with a default | ||
584 | configuration file /etc/devfsd.conf which will provide | ||
585 | compatibility symlinks for the old naming scheme. Don't change this | ||
586 | config file unless you know what you're doing. Even if you think you | ||
587 | do know what you're doing, don't change it until you've followed all | ||
588 | the steps below and booted a devfs-enabled system and verified that it | ||
589 | works. | ||
590 | |||
591 | Now edit your main system boot script so that devfsd is started at the | ||
592 | very beginning (before any filesystem | ||
593 | checks). /etc/rc.d/rc.sysinit is often the main boot script | ||
594 | on systems with SysV-style boot scripts. On systems with BSD-style | ||
595 | boot scripts it is often /etc/rc. Also check | ||
596 | /sbin/rc. | ||
597 | |||
598 | NOTE that the line you put into the boot | ||
599 | script should be exactly: | ||
600 | |||
601 | /sbin/devfsd /dev | ||
602 | |||
603 | DO NOT use some special daemon-launching | ||
604 | programme, otherwise the boot script may not wait for devfsd to finish | ||
605 | initialising. | ||
606 | |||
607 | System Libraries | ||
608 | There may still be some problems because of broken software making | ||
609 | assumptions about device names. In particular, some software does not | ||
610 | handle devices which are symbolic links. If you are running a libc 5 | ||
611 | based system, install libc 5.4.44 (if you have libc 5.4.46, go back to | ||
612 | libc 5.4.44, which is actually correct). If you are running a glibc | ||
613 | based system, make sure you have glibc 2.1.3 or later. | ||
614 | |||
615 | /etc/securetty | ||
616 | PAM (Pluggable Authentication Modules) is supposed to be a flexible | ||
617 | mechanism for providing better user authentication and access to | ||
618 | services. Unfortunately, it's also fragile, complex and undocumented | ||
619 | (check out RedHat 6.1, and probably other distributions as well). PAM | ||
620 | has problems with symbolic links. Append the following lines to your | ||
621 | /etc/securetty file: | ||
622 | |||
623 | vc/1 | ||
624 | vc/2 | ||
625 | vc/3 | ||
626 | vc/4 | ||
627 | vc/5 | ||
628 | vc/6 | ||
629 | vc/7 | ||
630 | vc/8 | ||
631 | |||
632 | This will not weaken security. If you have a version of util-linux | ||
633 | earlier than 2.10.h, please upgrade to 2.10.h or later. If you | ||
634 | absolutely cannot upgrade, then also append the following lines to | ||
635 | your /etc/securetty file: | ||
636 | |||
637 | 1 | ||
638 | 2 | ||
639 | 3 | ||
640 | 4 | ||
641 | 5 | ||
642 | 6 | ||
643 | 7 | ||
644 | 8 | ||
645 | |||
646 | This may potentially weaken security by allowing root logins over the | ||
647 | network (a password is still required, though). However, since there | ||
648 | are problems with dealing with symlinks, I'm suspicious of the level | ||
649 | of security offered in any case. | ||
650 | |||
651 | XFree86 | ||
652 | While not essential, it's probably a good idea to upgrade to XFree86 | ||
653 | 4.0, as patches went in to make it more devfs-friendly. If you don't, | ||
654 | you'll probably need to apply the following patch to | ||
655 | /etc/security/console.perms so that ordinary users can run | ||
656 | startx. Note that not all distributions have this file (e.g. Debian), | ||
657 | so if it's not present, don't worry about it. | ||
658 | |||
659 | --- /etc/security/console.perms.orig Sat Apr 17 16:26:47 1999 | ||
660 | +++ /etc/security/console.perms Fri Feb 25 23:53:55 2000 | ||
661 | @@ -14,7 +14,7 @@ | ||
662 | # man 5 console.perms | ||
663 | |||
664 | # file classes -- these are regular expressions | ||
665 | -<console>=tty[0-9][0-9]* :[0-9]\.[0-9] :[0-9] | ||
666 | +<console>=tty[0-9][0-9]* vc/[0-9][0-9]* :[0-9]\.[0-9] :[0-9] | ||
667 | |||
668 | # device classes -- these are shell-style globs | ||
669 | <floppy>=/dev/fd[0-1]* | ||
670 | |||
671 | If the patch does not apply, then replace the line: | ||
672 | |||
673 | <console>=tty[0-9][0-9]* :[0-9]\.[0-9] :[0-9] | ||
674 | |||
675 | with: | ||
676 | |||
677 | <console>=tty[0-9][0-9]* vc/[0-9][0-9]* :[0-9]\.[0-9] :[0-9] | ||
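
If you want to convince yourself what the changed line does, the Python sketch below tries the old and new pattern lists against a few console names. The patterns are copied from the lines above. Matching the whole name with re.fullmatch() is my assumption; I have not checked how the software that reads console.perms anchors its patterns, so treat this as an illustration of the regular expressions only.

```python
import re

# File class patterns before and after the change, copied from the text.
OLD_CONSOLE = ["tty[0-9][0-9]*", r":[0-9]\.[0-9]", ":[0-9]"]
NEW_CONSOLE = ["tty[0-9][0-9]*", "vc/[0-9][0-9]*", r":[0-9]\.[0-9]", ":[0-9]"]


def in_class(name, patterns):
    """True if name fully matches any pattern of the file class."""
    return any(re.fullmatch(pattern, name) for pattern in patterns)


if __name__ == "__main__":
    for name in ["tty1", "vc/1", ":0", ":0.0"]:
        print(name, in_class(name, OLD_CONSOLE), in_class(name, NEW_CONSOLE))
```

The point of the change is that a devfs-style name such as vc/1 is accepted only by the new list, while the old-style names keep matching in both.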
678 | |||
679 | |||
680 | Disable devpts | ||
681 | I've had a report of devpts mounted on /dev/pts not working | ||
682 | correctly. Since devfs will also manage /dev/pts, there is no | ||
683 | need to mount devpts as well. You should either edit your | ||
684 | /etc/fstab so devpts is not mounted, or disable devpts from | ||
685 | your kernel configuration. | ||
686 | |||
687 | Unsupported drivers | ||
688 | Not all drivers have devfs support. If you depend on one of these | ||
689 | drivers, you will need to create a script or tarfile that you can use | ||
690 | at boot time to create device nodes as appropriate. The section | ||
691 | "Dealing with drivers without devfs support" describes this. The | ||
692 | section "Device drivers currently ported" lists the drivers which have | ||
693 | devfs support. | ||
694 | |||
695 | /dev/mouse | ||
696 | |||
697 | Many distributions configure /dev/mouse to be the mouse device | ||
698 | for XFree86 and GPM. I actually think this is a bad idea, because it | ||
699 | adds another level of indirection. When looking at a config file, if | ||
700 | you see /dev/mouse you're left wondering which mouse | ||
701 | is being referred to. Hence I recommend putting the actual mouse | ||
702 | device (for example /dev/psaux) into your | ||
703 | /etc/X11/XF86Config file (and similarly for the GPM | ||
704 | configuration file). | ||
705 | |||
706 | Alternatively, use the same technique used for unsupported drivers | ||
707 | described above. | ||
708 | |||
709 | The Kernel | ||
710 | Finally, you need to make sure devfs is compiled into your kernel. Set | ||
711 | CONFIG_EXPERIMENTAL=y, CONFIG_DEVFS_FS=y and CONFIG_DEVFS_MOUNT=y by | ||
712 | using your favourite configuration tool (e.g. make config or | ||
713 | make xconfig), then run make clean and recompile your kernel and | ||
714 | modules. At boot, devfs will be mounted onto /dev. | ||
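Before rebuilding, it may help to confirm that the three options really are set. The following is a minimal sketch; the function name check_devfs_config is an assumption made for illustration, and the path of the kernel .config file is whatever your source tree uses.

```shell
# Hypothetical helper: report each required option that is not set to
# "y" in the given kernel .config file. Returns non-zero if any of the
# three options is missing.
check_devfs_config() {
    cdc_file=$1
    cdc_missing=0
    for cdc_opt in CONFIG_EXPERIMENTAL CONFIG_DEVFS_FS CONFIG_DEVFS_MOUNT; do
        if ! grep -q "^${cdc_opt}=y\$" "$cdc_file"; then
            echo "$cdc_opt is not set to y"
            cdc_missing=1
        fi
    done
    return "$cdc_missing"
}
```

Call it with the path to your .config as the only argument. It prints nothing and returns success when all three options are enabled.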
715 | |||
716 | If you encounter problems booting (for example if you forgot a | ||
717 | configuration step), you can pass devfs=nomount at the kernel | ||
718 | boot command line. This will prevent the kernel from mounting devfs at | ||
719 | boot time onto /dev. | ||
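To see which devfs boot option (if any) a kernel command line carries, a filter along these lines may be used. It is only a sketch: the function name devfs_boot_option is made up, and the command line is passed in as a string (on a running system it can be read from /proc/cmdline).

```shell
# Hypothetical helper: print the value of every devfs= option found in
# the kernel command line given as the first argument.
devfs_boot_option() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^devfs=//p'
}
```

For a command line containing devfs=nomount this prints nomount. It prints nothing when no devfs option was passed.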
720 | |||
721 | In general, a kernel built with CONFIG_DEVFS_FS=y but without mounting | ||
722 | devfs onto /dev is completely safe, and requires no | ||
723 | configuration changes. One exception to take note of is when | ||
724 | LABEL= directives are used in /etc/fstab. In this | ||
725 | case you will be unable to boot properly. This is because the | ||
726 | mount(8) programme uses /proc/partitions as part of | ||
727 | the volume label search process, and the device names it finds are not | ||
728 | available, because setting CONFIG_DEVFS_FS=y changes the names in | ||
729 | /proc/partitions, irrespective of whether devfs is mounted. | ||
730 | |||
731 | Now you've finished all the steps required. You're now ready to boot | ||
732 | your shiny new kernel. Enjoy. | ||
733 | |||
734 | Changing the configuration | ||
735 | |||
736 | OK, you've now booted a devfs-enabled system, and everything works. | ||
737 | Now you may feel like changing the configuration (common targets are | ||
738 | /etc/fstab and /etc/devfsd.conf). Since you have a | ||
739 | system that works, if you make any changes and it doesn't work, you | ||
740 | now know that you only have to restore your configuration files to the | ||
741 | default and it will work again. | ||
742 | |||
743 | |||
744 | Permissions persistence across reboots | ||
745 | |||
746 | If you don't use mknod(2) to create a device file, nor use chmod(2) or | ||
747 | chown(2) to change the ownerships/permissions, the inode ctime will | ||
748 | remain at 0 (the epoch, 12 am, 1-JAN-1970, GMT). Anything with a ctime | ||
749 | later than this has had its ownership/permissions changed. Hence, a | ||
750 | simple script or programme may be used to tar up all changed inodes, | ||
751 | prior to shutdown. Although effective, many consider this approach a | ||
752 | kludge. | ||
753 | |||
754 | A much better approach is to use devfsd to save and restore | ||
755 | permissions. It may be configured to record changes in permissions and | ||
756 | will save them in a database (in fact a directory tree), and restore | ||
757 | these upon boot. This is an efficient method and results in immediate | ||
758 | saving of current permissions (unlike the tar approach, which saves | ||
759 | permissions at some unspecified future time). | ||
760 | |||
761 | The default configuration file supplied with devfsd has config entries | ||
762 | which you may uncomment to enable persistence management. | ||
763 | |||
764 | If you decide to use the tar approach anyway, be aware that tar will | ||
765 | first unlink(2) an inode before creating a new device node. The | ||
766 | unlink(2) has the effect of breaking the connection between a devfs | ||
767 | entry and the device driver. If you use the "devfs=only" boot option, | ||
768 | you lose access to the device driver, requiring you to reload the | ||
769 | module. I consider this a bug in tar (there is no real need to | ||
770 | unlink(2) the inode first). | ||
771 | |||
772 | Alternatively, you can use devfsd to provide more sophisticated | ||
773 | management of device permissions. You can use devfsd to store | ||
774 | permissions for whole groups of devices with a single configuration | ||
775 | entry, rather than the conventional single entry per device entry. | ||
776 | |||
777 | Permissions database stored in mounted-over /dev | ||
778 | |||
779 | If you wish to save and restore your device permissions into the | ||
780 | disc-based /dev while still mounting devfs onto /dev | ||
781 | you may do so. This requires a 2.4.x kernel (in fact, 2.3.99 or | ||
782 | later), which has the VFS binding facility. You need to do the | ||
783 | following to set this up: | ||
784 | |||
785 | |||
786 | |||
787 | make sure the kernel does not mount devfs at boot time | ||
788 | |||
789 | |||
790 | make sure you have a correct /dev/console entry in your | ||
791 | root file-system (where your disc-based /dev lives) | ||
792 | |||
793 | create the /dev-state directory | ||
794 | |||
795 | |||
796 | add the following lines near the very beginning of your boot | ||
797 | scripts: | ||
798 | |||
799 | mount --bind /dev /dev-state | ||
800 | mount -t devfs none /dev | ||
801 | devfsd /dev | ||
802 | |||
803 | |||
804 | |||
805 | |||
806 | add the following lines to your /etc/devfsd.conf file: | ||
807 | |||
808 | REGISTER ^pt[sy] IGNORE | ||
809 | CREATE ^pt[sy] IGNORE | ||
810 | CHANGE ^pt[sy] IGNORE | ||
811 | DELETE ^pt[sy] IGNORE | ||
812 | REGISTER .* COPY /dev-state/$devname $devpath | ||
813 | CREATE .* COPY $devpath /dev-state/$devname | ||
814 | CHANGE .* COPY $devpath /dev-state/$devname | ||
815 | DELETE .* CFUNCTION GLOBAL unlink /dev-state/$devname | ||
816 | RESTORE /dev-state | ||
817 | |||
818 | Note that the sample devfsd.conf file contains these lines, | ||
819 | as well as other sample configurations you may find useful. See the | ||
820 | devfsd distribution. | ||
821 | |||
822 | |||
823 | reboot. | ||
824 | |||
825 | |||
826 | |||
827 | |||
828 | Permissions database stored in normal directory | ||
829 | |||
830 | If you are using an older kernel which doesn't support VFS binding, | ||
831 | then you won't be able to have the permissions database in a | ||
832 | mounted-over /dev. However, you can still use a regular | ||
833 | directory to store the database. The sample /etc/devfsd.conf | ||
834 | file above may still be used. You will need to create the | ||
835 | /dev-state directory prior to installing devfsd. If you have | ||
836 | old permissions in /dev, then just copy (or move) the device | ||
837 | nodes over to the new directory. | ||
838 | |||
839 | Which method is better? | ||
840 | |||
841 | The best method is to have the permissions database stored in the | ||
842 | mounted-over /dev. This is because you will not need to copy | ||
843 | device nodes over to /dev-state, and because it allows you to | ||
844 | switch between devfs and non-devfs kernels, without requiring you to | ||
845 | copy permissions between /dev-state (for devfs) and | ||
846 | /dev (for non-devfs). | ||
847 | |||
848 | |||
849 | Dealing with drivers without devfs support | ||
850 | |||
851 | Currently, not all device drivers in the kernel have been modified to | ||
852 | use devfs. Device drivers which do not yet have devfs support will not | ||
853 | automagically appear in devfs. The simplest way to create device nodes | ||
854 | for these drivers is to unpack a tarfile containing the required | ||
855 | device nodes. You can do this in your boot scripts. All your drivers | ||
856 | will now work as before. | ||
857 | |||
858 | Hopefully for most people devfs will have enough support so that they | ||
859 | can mount devfs directly over /dev without losing most functionality | ||
860 | (i.e. losing access to various devices). As of 22-JAN-1998 (devfs | ||
861 | patch version 10) I am now running this way. All the devices I have | ||
862 | are available in devfs, so I don't lose anything. | ||
863 | |||
864 | WARNING: if your configuration requires the old-style device names | ||
865 | (e.g. /dev/hda1 or /dev/sda1), you must install devfsd and configure | ||
866 | it to maintain compatibility entries. It is almost certain that you | ||
867 | will require this. Note that the kernel creates a compatibility entry | ||
868 | for the root device, so you don't need initrd. | ||
869 | |||
870 | Note that you no longer need to mount devpts if you use Unix98 PTYs, | ||
871 | as devfs can manage /dev/pts itself. This saves you some RAM, as you | ||
872 | don't need to compile and install devpts. Note that some versions of | ||
873 | glibc have a bug with Unix98 pty handling on devfs systems. Contact | ||
874 | the glibc maintainers for a fix. Glibc 2.1.3 has the fix. | ||
875 | |||
876 | Note also that apart from editing /etc/fstab, other things will need | ||
877 | to be changed if you *don't* install devfsd. Some software (like the X | ||
878 | server) hard-wires device names in its source. It really is much | ||
879 | easier to install devfsd so that compatibility entries are created. | ||
880 | You can then slowly migrate your system to using the new device names | ||
881 | (for example, by starting with /etc/fstab), and then limiting the | ||
882 | compatibility entries that devfsd creates. | ||
883 | |||
884 | IF YOU CONFIGURE TO MOUNT DEVFS AT BOOT, MAKE SURE YOU INSTALL DEVFSD | ||
885 | BEFORE YOU BOOT A DEVFS-ENABLED KERNEL! | ||
886 | |||
887 | Now that devfs has gone into the 2.3.46 kernel, I'm getting a lot of | ||
888 | reports back. Many of these are because people are trying to run | ||
889 | without devfsd, and hence some things break. Please just run devfsd if | ||
890 | things break. I want to concentrate on real bugs rather than | ||
891 | misconfiguration problems at the moment. If people are willing to fix | ||
892 | bugs/false assumptions in other code (e.g. glibc, X server) and submit | ||
893 | that to the respective maintainers, that would be great. | ||
894 | |||
895 | |||
896 | All the way with Devfs | ||
897 | |||
898 | The devfs kernel patch creates a rationalised device tree. As stated | ||
899 | above, if you want to keep using the old /dev naming scheme, | ||
900 | you just need to configure devfsd appropriately (see the man | ||
901 | page). People who prefer the old names can ignore this section. For | ||
902 | those of us who like the rationalised names and an uncluttered | ||
903 | /dev, read on. | ||
904 | |||
905 | If you don't run devfsd, or don't enable compatibility entry | ||
906 | management, then you will have to configure your system to use the new | ||
907 | names. For example, you will then need to edit your | ||
908 | /etc/fstab to use the new disc naming scheme. If you want to | ||
909 | be able to boot non-devfs kernels, you will need compatibility | ||
910 | symlinks in the underlying disc-based /dev pointing back to | ||
911 | the old-style names for when you boot a kernel without devfs. | ||
912 | |||
913 | You can selectively decide which devices you want compatibility | ||
914 | entries for. For example, you may only want compatibility entries for | ||
915 | BSD pseudo-terminal devices (otherwise you'll have to patch your C | ||
916 | library or use Unix98 ptys instead). It's just a matter of putting in | ||
917 | the correct regular expression into /etc/devfsd.conf. | ||
918 | |||
919 | There are other choices of naming schemes that you may prefer. For | ||
920 | example, I don't use the kernel-supplied | ||
921 | names, because they are too verbose. A common misconception is | ||
922 | that the kernel-supplied names are meant to be used directly in | ||
923 | configuration files. This is not the case. They are designed to | ||
924 | reflect the layout of the devices attached and to provide easy | ||
925 | classification. | ||
926 | |||
927 | If you like the kernel-supplied names, that's fine. If you don't then | ||
928 | you should be using devfsd to construct a namespace more to your | ||
929 | liking. Devfsd has built-in code to construct a | ||
930 | namespace that is both logical and easy to | ||
931 | manage. In essence, it creates a convenient abbreviation of the | ||
932 | kernel-supplied namespace. | ||
933 | |||
934 | You are of course free to build your own namespace. Devfsd has all the | ||
935 | infrastructure required to make this easy for you. All you need do is | ||
936 | write a script. You can even write some C code and devfsd can load the | ||
937 | shared object as a callable extension. | ||
938 | |||
939 | |||
940 | Other Issues | ||
941 | |||
942 | The init programme | ||
943 | Another thing to take note of is whether your init programme | ||
944 | creates a Unix socket /dev/telinit. Some versions of init | ||
945 | create /dev/telinit so that the telinit programme can | ||
946 | communicate with the init process. If you have such a system you need | ||
947 | to make sure that devfs is mounted over /dev *before* init | ||
948 | starts. In other words, you can't leave the mounting of devfs to | ||
949 | /etc/rc, since this is executed after init. Other | ||
950 | versions of init require a named pipe /dev/initctl | ||
951 | which must exist *before* init starts. Once again, you need to | ||
952 | mount devfs and then create the named pipe *before* init | ||
953 | starts. | ||
954 | |||
955 | The default behaviour now is not to mount devfs onto /dev at | ||
956 | boot time for 2.3.x and later kernels. You can correct this with the | ||
957 | "devfs=mount" boot option. This solves any problems with init, | ||
958 | and also prevents the dreaded: | ||
959 | |||
960 | Cannot open initial console | ||
961 | |||
962 | message. For 2.2.x kernels where you need to apply the devfs patch, | ||
963 | the default is to mount. | ||
964 | |||
965 | If you have automatic mounting of devfs onto /dev then you | ||
966 | may need to create /dev/initctl in your boot scripts. The | ||
967 | following lines should suffice: | ||
968 | |||
969 | mknod /dev/initctl p | ||
970 | kill -SIGUSR1 1 # tell init that /dev/initctl now exists | ||
971 | |||
972 | Alternatively, if you don't want the kernel to mount devfs onto | ||
973 | /dev then the following procedure is a | ||
974 | guideline for how to get around /dev/initctl problems: | ||
975 | |||
976 | # cd /sbin | ||
977 | # mv init init.real | ||
978 | # cat > init | ||
979 | #! /bin/sh | ||
980 | mount -n -t devfs none /dev | ||
981 | mknod /dev/initctl p | ||
982 | exec /sbin/init.real "$@" | ||
983 | [control-D] | ||
984 | # chmod a+x init | ||
985 | |||
986 | Note that newer versions of init create /dev/initctl | ||
987 | automatically, so you don't have to worry about this. | ||
988 | |||
989 | Module autoloading | ||
990 | You will need to configure devfsd to enable module | ||
991 | autoloading. The following lines should be placed in your | ||
992 | /etc/devfsd.conf file: | ||
993 | |||
994 | LOOKUP .* MODLOAD | ||
995 | |||
996 | |||
997 | As of devfsd-v1.3.10, a generic /etc/modules.devfs | ||
998 | configuration file is installed, which is used by the MODLOAD | ||
999 | action. This should be sufficient for most configurations. If you | ||
1000 | require further configuration, edit your /etc/modules.conf | ||
1001 | file. The way module autoloading works with devfs is: | ||
1002 | |||
1003 | |||
1004 | a process attempts to lookup a device node (e.g. /dev/fred) | ||
1005 | |||
1006 | |||
1007 | if that device node does not exist, the full pathname is passed to | ||
1008 | devfsd as a string | ||
1009 | |||
1010 | |||
1011 | devfsd will pass the string to the modprobe programme (provided the | ||
1012 | configuration line shown above is present), and specifies that | ||
1013 | /etc/modules.devfs is the configuration file | ||
1014 | |||
1015 | |||
1016 | /etc/modules.devfs includes /etc/modules.conf to | ||
1017 | access local configurations | ||
1018 | |||
1019 | modprobe will search its configuration files, looking for an alias | ||
1020 | that translates the pathname into a module name | ||
1021 | |||
1022 | |||
1023 | the translated pathname is then used to load the module. | ||
1024 | |||
1025 | |||
1026 | If you wanted a lookup of /dev/fred to load the | ||
1027 | mymod module, you would require the following configuration | ||
1028 | line in /etc/modules.conf: | ||
1029 | |||
1030 | alias /dev/fred mymod | ||
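The alias translation step can be illustrated with a small lookup over a modules.conf-style file. This is a deliberately simplified sketch and not how modprobe itself is implemented: it only understands plain "alias <path> <module>" lines, and the function name lookup_module_alias is made up for illustration.

```shell
# Hypothetical helper: print the module aliased to a device path.
#   $1 = configuration file, $2 = device path (e.g. /dev/fred)
# Returns non-zero if no alias line matches.
lookup_module_alias() {
    awk -v path="$2" '
        $1 == "alias" && $2 == path { print $3; found = 1; exit }
        END { exit !found }
    ' "$1"
}
```

With the configuration line above in place, looking up /dev/fred prints mymod. A path with no alias yields no output and a failure status.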
1031 | |||
1032 | The /etc/modules.devfs configuration file provides many such | ||
1033 | aliases for standard device names. If you look closely at this file, | ||
1034 | you will note that some modules require multiple alias configuration | ||
1035 | lines. This is required to support module autoloading for old and new | ||
1036 | device names. | ||
1037 | |||
1038 | Mounting root off a devfs device | ||
1039 | If you wish to mount root off a devfs device when you pass the | ||
1040 | "devfs=only" boot option, then you need to pass in the | ||
1041 | "root=<device>" option to the kernel when booting. If you use | ||
1042 | LILO, then you must have this in lilo.conf: | ||
1043 | |||
1044 | append = "root=<device>" | ||
1045 | |||
1046 | Surprised? Yep, so was I. It turns out if you have (as most people | ||
1047 | do): | ||
1048 | |||
1049 | root = <device> | ||
1050 | |||
1051 | |||
1052 | then LILO will determine the device number of <device> and will | ||
1053 | write that device number into a special place in the kernel image | ||
1054 | before starting the kernel, and the kernel will use that device number | ||
1055 | to mount the root filesystem. So, using the "append" variety ensures | ||
1056 | that LILO passes the root filesystem device as a string, which devfs | ||
1057 | can then use. | ||
1058 | |||
1059 | Note that this isn't an issue if you don't pass "devfs=only". | ||
1060 | |||
1061 | TTY issues | ||
1062 | The ttyname(3) function in some versions of the C library makes | ||
1063 | false assumptions about device entries which are symbolic links. The | ||
1064 | tty(1) programme is one that depends on this function. I've | ||
1065 | written a patch to libc 5.4.43 which fixes this. This has been | ||
1066 | included in libc 5.4.44 and a similar fix is in glibc 2.1.3. | ||
1067 | |||
1068 | |||
1069 | Kernel Naming Scheme | ||
1070 | |||
1071 | The kernel provides a default naming scheme. This scheme is designed | ||
1072 | to make it easy to search for specific devices or device types, and to | ||
1073 | view the available devices. Some device types (such as hard discs) | ||
1074 | have a directory of entries, making it easy to see what devices of | ||
1075 | that class are available. Often, the entries are symbolic links into a | ||
1076 | directory tree that reflects the topology of available devices. The | ||
1077 | topological tree is useful for finding how your devices are arranged. | ||
1078 | |||
1079 | Below is a list of the naming schemes for the most common drivers. A | ||
1080 | list of reserved device names is | ||
1081 | available for reference. Please send email to | ||
1082 | rgooch@atnf.csiro.au to obtain an allocation. Please be | ||
1083 | patient (the maintainer is busy). An alternative name may be allocated | ||
1084 | instead of the requested name, at the discretion of the maintainer. | ||
1085 | |||
1086 | Disc Devices | ||
1087 | |||
1088 | All discs, whether SCSI, IDE or whatever, are placed under the | ||
1089 | /dev/discs hierarchy: | ||
1090 | |||
1091 | /dev/discs/disc0 first disc | ||
1092 | /dev/discs/disc1 second disc | ||
1093 | |||
1094 | |||
1095 | Each of these entries is a symbolic link to the directory for that | ||
1096 | device. The device directory contains: | ||
1097 | |||
1098 | disc for the whole disc | ||
1099 | part* for individual partitions | ||
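As an illustration of the layout just described, the path of a whole disc or of one of its partitions can be composed as follows. The function name disc_entry is an assumption made for this sketch.

```shell
# Hypothetical helper: build a path under /dev/discs.
#   $1 = disc index (0 for the first disc)
#   $2 = optional partition number; omitted means the whole disc
disc_entry() {
    if [ -n "${2:-}" ]; then
        echo "/dev/discs/disc$1/part$2"
    else
        echo "/dev/discs/disc$1/disc"
    fi
}
```

For example, the first partition of the second disc is disc_entry 1 1.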
1100 | |||
1101 | |||
1102 | CD-ROM Devices | ||
1103 | |||
1104 | All CD-ROMs, whether SCSI, IDE or whatever, are placed under the | ||
1105 | /dev/cdroms hierarchy: | ||
1106 | |||
1107 | /dev/cdroms/cdrom0 first CD-ROM | ||
1108 | /dev/cdroms/cdrom1 second CD-ROM | ||
1109 | |||
1110 | |||
1111 | Each of these entries is a symbolic link to the real device entry for | ||
1112 | that device. | ||
1113 | |||
1114 | Tape Devices | ||
1115 | |||
1116 | All tapes, whether SCSI, IDE or whatever, are placed under the | ||
1117 | /dev/tapes hierarchy: | ||
1118 | |||
1119 | /dev/tapes/tape0 first tape | ||
1120 | /dev/tapes/tape1 second tape | ||
1121 | |||
1122 | |||
1123 | Each of these entries is a symbolic link to the directory for that | ||
1124 | device. The device directory contains: | ||
1125 | |||
1126 | mt for mode 0 | ||
1127 | mtl for mode 1 | ||
1128 | mtm for mode 2 | ||
1129 | mta for mode 3 | ||
1130 | mtn for mode 0, no rewind | ||
1131 | mtln for mode 1, no rewind | ||
1132 | mtmn for mode 2, no rewind | ||
1133 | mtan for mode 3, no rewind | ||
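The eight entry names follow a simple pattern: a mode suffix, then an optional "n" for no rewind. A sketch of that rule, with a made-up function name tape_entry:

```shell
# Hypothetical helper: print the tape entry name for a mode.
#   $1 = mode (0 to 3)
#   $2 = "norewind" to select the no-rewind entry (optional)
tape_entry() {
    case $1 in
        0) te_name=mt ;;
        1) te_name=mtl ;;
        2) te_name=mtm ;;
        3) te_name=mta ;;
        *) return 1 ;;
    esac
    if [ "${2:-}" = norewind ]; then
        te_name=${te_name}n
    fi
    echo "$te_name"
}
```

The result is relative to the device directory, so the no-rewind mode 0 entry of the first tape would be /dev/tapes/tape0/ followed by the output of tape_entry 0 norewind.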
1134 | |||
1135 | |||
1136 | SCSI Devices | ||
1137 | |||
1138 | To uniquely identify any SCSI device requires the following | ||
1139 | information: | ||
1140 | |||
1141 | controller (host adapter) | ||
1142 | bus (SCSI channel) | ||
1143 | target (SCSI ID) | ||
1144 | unit (Logical Unit Number) | ||
1145 | |||
1146 | |||
1147 | All SCSI devices are placed under /dev/scsi (assuming devfs | ||
1148 | is mounted on /dev). Hence, a SCSI device with the following | ||
1149 | parameters: c=1,b=2,t=3,u=4 would appear as: | ||
1150 | |||
1151 | /dev/scsi/host1/bus2/target3/lun4 device directory | ||
1152 | |||
1153 | |||
1154 | Inside this directory, a number of device entries may be created, | ||
1155 | depending on which SCSI device-type drivers were installed. | ||
1156 | |||
1157 | See the section on the disc naming scheme to see what entries the SCSI | ||
1158 | disc driver creates. | ||
1159 | |||
1160 | See the section on the tape naming scheme to see what entries the SCSI | ||
1161 | tape driver creates. | ||
1162 | |||
1163 | The SCSI CD-ROM driver creates: | ||
1164 | |||
1165 | cd | ||
1166 | |||
1167 | |||
1168 | The SCSI generic driver creates: | ||
1169 | |||
1170 | generic | ||
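The mapping from the four SCSI parameters to the device directory described above can be written down directly. This is a sketch with a made-up function name (scsi_device_dir), assuming devfs is mounted on /dev.

```shell
# Hypothetical helper: print the devfs directory for a SCSI device.
#   $1 = controller (host adapter), $2 = bus (channel),
#   $3 = target (SCSI ID),          $4 = unit (LUN)
scsi_device_dir() {
    echo "/dev/scsi/host$1/bus$2/target$3/lun$4"
}
```

The entries created by the individual drivers (for example cd or generic) are then found inside that directory.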
1171 | |||
1172 | |||
1173 | IDE Devices | ||
1174 | |||
1175 | To uniquely identify any IDE device requires the following | ||
1176 | information: | ||
1177 | |||
1178 | controller | ||
1179 | bus (aka. primary/secondary) | ||
1180 | target (aka. master/slave) | ||
1181 | unit | ||
1182 | |||
1183 | |||
1184 | All IDE devices are placed under /dev/ide, and use a similar | ||
1185 | naming scheme to the SCSI subsystem. | ||
1186 | |||
1187 | XT Hard Discs | ||
1188 | |||
1189 | All XT discs are placed under /dev/xd. The first XT disc has | ||
1190 | the directory /dev/xd/disc0. | ||
1191 | |||
1192 | TTY devices | ||
1193 | |||
1194 | The tty devices now appear as: | ||
1195 | |||
1196 | New name               Old name              Device type | ||
1197 | --------               --------              ----------- | ||
1198 | /dev/tts/{0,1,...}     /dev/ttyS{0,1,...}    Serial ports | ||
1199 | /dev/cua/{0,1,...}     /dev/cua{0,1,...}     Call out devices | ||
1200 | /dev/vc/0              /dev/tty              Current virtual console | ||
1201 | /dev/vc/{1,2,...}      /dev/tty{1...63}      Virtual consoles | ||
1202 | /dev/vcc/{0,1,...}     /dev/vcs{1...63}      Virtual console capture devices | ||
1203 | /dev/pty/m{0,1,...}    /dev/ptyp??           PTY masters | ||
1204 | /dev/pty/s{0,1,...}    /dev/ttyp??           PTY slaves | ||
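When migrating configuration files by hand, the table can be turned into a small translation helper. The sketch below uses a made-up function name (old_tty_to_devfs) and takes names without the /dev/ prefix. It assumes that the number in the old name carries over unchanged (for example vcs3 becomes vcc/3), and it leaves out the PTY entries because the table does not spell out how their numbering maps.

```shell
# Hypothetical helper: translate an old-style tty name to its devfs name.
# Fails for names the table does not cover (including PTYs).
old_tty_to_devfs() {
    case $1 in
        ttyS[0-9]*) echo "tts/${1#ttyS}" ;;
        cua[0-9]*)  echo "cua/${1#cua}" ;;
        tty)        echo "vc/0" ;;
        tty[0-9]*)  echo "vc/${1#tty}" ;;
        vcs[0-9]*)  echo "vcc/${1#vcs}" ;;
        *)          return 1 ;;
    esac
}
```

The ttyS pattern is tested before the plain tty pattern, although the digit requirement already keeps serial ports and PTY slaves from being mistaken for virtual consoles.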
1205 | |||
1206 | |||
1207 | RAMDISCS | ||
1208 | |||
1209 | The RAMDISCS are placed in their own directory, and are named thus: | ||
1210 | |||
1211 | /dev/rd/{0,1,2,...} | ||
1212 | |||
1213 | |||
1214 | Meta Devices | ||
1215 | |||
1216 | The meta devices are placed in their own directory, and are named | ||
1217 | thus: | ||
1218 | |||
1219 | /dev/md/{0,1,2,...} | ||
1220 | |||
1221 | |||
1222 | Floppy discs | ||
1223 | |||
1224 | Floppy discs are placed in the /dev/floppy directory. | ||
1225 | |||
1226 | Loop devices | ||
1227 | |||
1228 | Loop devices are placed in the /dev/loop directory. | ||
1229 | |||
1230 | Sound devices | ||
1231 | |||
1232 | Sound devices are placed in the /dev/sound directory | ||
1233 | (audio, sequencer, ...). | ||
1234 | |||
1235 | |||
1236 | Devfsd Naming Scheme | ||
1237 | |||
1238 | Devfsd provides a naming scheme which is a convenient abbreviation of | ||
1239 | the kernel-supplied namespace. In some | ||
1240 | cases, the kernel-supplied naming scheme is quite convenient, so | ||
1241 | devfsd does not provide another naming scheme. The convenience names | ||
1242 | that devfsd creates are in fact the same names as the original devfs | ||
1243 | kernel patch created (before Linus mandated the Big Name | ||
1244 | Change). These are referred to as "new compatibility entries". | ||
1245 | |||
1246 | In order to configure devfsd to create these convenience names, the | ||
1247 | following lines should be placed in your /etc/devfsd.conf: | ||
1248 | |||
1249 | REGISTER .* MKNEWCOMPAT | ||
1250 | UNREGISTER .* RMNEWCOMPAT | ||
1251 | |||
1252 | This will cause devfsd to create (and destroy) symbolic links which | ||
1253 | point to the kernel-supplied names. | ||
1254 | |||
1255 | SCSI Hard Discs | ||
1256 | |||
1257 | All SCSI discs are placed under /dev/sd (assuming devfs is | ||
1258 | mounted on /dev). Hence, a SCSI disc with the following | ||
1259 | parameters: c=1,b=2,t=3,u=4 would appear as: | ||
1260 | |||
1261 | /dev/sd/c1b2t3u4 for the whole disc | ||
1262 | /dev/sd/c1b2t3u4p5 for the 5th partition | ||
1263 | /dev/sd/c1b2t3u4p5s6 for the 6th slice in the 5th partition | ||
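The abbreviated devfsd names are built by concatenating the same four parameters, optionally followed by a partition and a slice. A sketch, with a made-up function name (devfsd_sd_name):

```shell
# Hypothetical helper: print the devfsd convenience name of a SCSI disc.
#   $1..$4 = controller, bus, target, unit
#   $5     = optional partition number
#   $6     = optional slice number within that partition
devfsd_sd_name() {
    dsn_name="/dev/sd/c${1}b${2}t${3}u${4}"
    if [ -n "${5:-}" ]; then
        dsn_name="${dsn_name}p${5}"
        if [ -n "${6:-}" ]; then
            dsn_name="${dsn_name}s${6}"
        fi
    fi
    echo "$dsn_name"
}
```

Calling it with 1 2 3 4 5 6 reproduces the third example above.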
1264 | |||
1265 | |||
1266 | SCSI Tapes | ||
1267 | |||
1268 | All SCSI tapes are placed under /dev/st. A similar naming | ||
1269 | scheme is used as for SCSI discs. A SCSI tape with the | ||
1270 | parameters: c=1,b=2,t=3,u=4 would appear as: | ||
1271 | |||
1272 | /dev/st/c1b2t3u4m0 for mode 0 | ||
1273 | /dev/st/c1b2t3u4m1 for mode 1 | ||
1274 | /dev/st/c1b2t3u4m2 for mode 2 | ||
1275 | /dev/st/c1b2t3u4m3 for mode 3 | ||
1276 | /dev/st/c1b2t3u4m0n for mode 0, no rewind | ||
1277 | /dev/st/c1b2t3u4m1n for mode 1, no rewind | ||
1278 | /dev/st/c1b2t3u4m2n for mode 2, no rewind | ||
1279 | /dev/st/c1b2t3u4m3n for mode 3, no rewind | ||
1280 | |||
1281 | |||
1282 | SCSI CD-ROMs | ||
1283 | |||
1284 | All SCSI CD-ROMs are placed under /dev/sr. A similar naming | ||
1285 | scheme is used as for SCSI discs. A SCSI CD-ROM with the | ||
1286 | parameters: c=1,b=2,t=3,u=4 would appear as: | ||
1287 | |||
1288 | /dev/sr/c1b2t3u4 | ||
1289 | |||
1290 | |||
1291 | SCSI Generic Devices | ||
1292 | |||
1293 | The generic (aka. raw) interfaces for all SCSI devices are placed under | ||
1294 | /dev/sg. A similar naming scheme is used as for SCSI discs. A | ||
1295 | SCSI generic device with the parameters: c=1,b=2,t=3,u=4 would appear | ||
1296 | as: | ||
1297 | |||
1298 | /dev/sg/c1b2t3u4 | ||
1299 | |||
1300 | |||
1301 | IDE Hard Discs | ||
1302 | |||
1303 | All IDE discs are placed under /dev/ide/hd, using a similar | ||
1304 | convention to SCSI discs. The following mappings exist between the new | ||
1305 | and the old names: | ||
1306 | |||
1307 | /dev/hda /dev/ide/hd/c0b0t0u0 | ||
1308 | /dev/hdb /dev/ide/hd/c0b0t1u0 | ||
1309 | /dev/hdc /dev/ide/hd/c0b1t0u0 | ||
1310 | /dev/hdd /dev/ide/hd/c0b1t1u0 | ||
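The four mappings can be captured in a lookup, which may be handy when rewriting /etc/fstab entries. This sketch covers only the four names listed above; the function name ide_new_name is made up for illustration.

```shell
# Hypothetical helper: translate an old IDE disc name (with or without
# the /dev/ prefix) to the devfsd convenience name. Only hda to hdd are
# covered, as in the list above.
ide_new_name() {
    case ${1#/dev/} in
        hda) echo /dev/ide/hd/c0b0t0u0 ;;
        hdb) echo /dev/ide/hd/c0b0t1u0 ;;
        hdc) echo /dev/ide/hd/c0b1t0u0 ;;
        hdd) echo /dev/ide/hd/c0b1t1u0 ;;
        *)   return 1 ;;
    esac
}
```

The bus digit distinguishes primary from secondary, and the target digit distinguishes master from slave.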
1311 | |||
1312 | |||
1313 | IDE Tapes | ||
1314 | |||
1315 | A similar naming scheme is used as for IDE discs. The entries will | ||
1316 | appear in the /dev/ide/mt directory. | ||
1317 | |||
1318 | IDE CD-ROM | ||
1319 | |||
1320 | A similar naming scheme is used as for IDE discs. The entries will | ||
1321 | appear in the /dev/ide/cd directory. | ||
1322 | |||
1323 | IDE Floppies | ||
1324 | |||
1325 | A similar naming scheme is used as for IDE discs. The entries will | ||
1326 | appear in the /dev/ide/fd directory. | ||
1327 | |||
1328 | XT Hard Discs | ||
1329 | |||
1330 | All XT discs are placed under /dev/xd. The first XT disc | ||
1331 | would appear as /dev/xd/c0t0. | ||
1332 | |||
1333 | |||
1334 | Old Compatibility Names | ||
1335 | |||
1336 | The old compatibility names are the legacy device names, such as | ||
1337 | /dev/hda, /dev/sda, /dev/rtc and so on. | ||
1338 | Devfsd can be configured to create compatibility symlinks so that you | ||
1339 | may continue to use the old names in your configuration files and so | ||
1340 | that old applications will continue to function correctly. | ||
1341 | |||
1342 | In order to configure devfsd to create these legacy names, the | ||
1343 | following lines should be placed in your /etc/devfsd.conf: | ||
1344 | |||
1345 | REGISTER .* MKOLDCOMPAT | ||
1346 | UNREGISTER .* RMOLDCOMPAT | ||
1347 | |||
1348 | This will cause devfsd to create (and destroy) symbolic links which | ||
1349 | point to the kernel-supplied names. | ||
1350 | |||
1351 | |||
1352 | ----------------------------------------------------------------------------- | ||
1353 | |||
1354 | |||
1355 | Device drivers currently ported | ||
1356 | |||
1357 | - All miscellaneous character devices support devfs (this is done | ||
1358 | transparently through misc_register()) | ||
1359 | |||
1360 | - SCSI discs and generic hard discs | ||
1361 | |||
1362 | - Character memory devices (null, zero, full and so on) | ||
1363 | Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> | ||
1364 | |||
1365 | - Loop devices (/dev/loop?) | ||
1366 | |||
1367 | - TTY devices (console, serial ports, terminals and pseudo-terminals) | ||
1368 | Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> | ||
1369 | |||
1370 | - SCSI tapes (/dev/scsi and /dev/tapes) | ||
1371 | |||
1372 | - SCSI CD-ROMs (/dev/scsi and /dev/cdroms) | ||
1373 | |||
1374 | - SCSI generic devices (/dev/scsi) | ||
1375 | |||
1376 | - RAMDISCS (/dev/ram?) | ||
1377 | |||
1378 | - Meta Devices (/dev/md*) | ||
1379 | |||
1380 | - Floppy discs (/dev/floppy) | ||
1381 | |||
1382 | - Parallel port printers (/dev/printers) | ||
1383 | |||
1384 | - Sound devices (/dev/sound) | ||
1385 | Thanks to Eric Dumas <dumas@linux.eu.org> and | ||
1386 | C. Scott Ananian <cananian@alumni.princeton.edu> | ||
1387 | |||
1388 | - Joysticks (/dev/joysticks) | ||
1389 | |||
1390 | - Sparc keyboard (/dev/kbd) | ||
1391 | |||
1392 | - DSP56001 digital signal processor (/dev/dsp56k) | ||
1393 | |||
1394 | - Apple Desktop Bus (/dev/adb) | ||
1395 | |||
1396 | - Coda network file system (/dev/cfs*) | ||
1397 | |||
1398 | - Virtual console capture devices (/dev/vcc) | ||
1399 | Thanks to Dennis Hou <smilax@mindmeld.yi.org> | ||
1400 | |||
1401 | - Frame buffer devices (/dev/fb) | ||
1402 | |||
1403 | - Video capture devices (/dev/v4l) | ||
1404 | |||
1405 | |||
1406 | ----------------------------------------------------------------------------- | ||
1407 | |||
1408 | |||
1409 | Allocation of Device Numbers | ||
1410 | |||
1411 | Devfs allows you to write a driver which doesn't need to allocate a | ||
1412 | device number (major&minor numbers) for the internal operation of the | ||
1413 | kernel. However, there are a number of userspace programmes that use | ||
1414 | the device number as a unique handle for a device. An example is the | ||
1415 | find programme, which uses device numbers to determine whether | ||
1416 | an inode is on a different filesystem than another inode. The device | ||
1417 | number used is the one for the block device which a filesystem is | ||
1418 | using. To preserve compatibility with userspace programmes, block | ||
1419 | devices using devfs need to have unique device numbers allocated to | ||
1420 | them. Furthermore, POSIX specifies device numbers, so some kind of | ||
1421 | device number needs to be presented to userspace. | ||
1422 | |||
1423 | The simplest option (especially when porting drivers to devfs) is to | ||
1424 | keep using the old major and minor numbers. Devfs will take whatever | ||
1425 | values are given for major&minor and pass them on to userspace. | ||
1426 | |||
1427 | This device number is a 16 bit number, so this leaves plenty of space | ||
1428 | for large numbers of discs and partitions. This scheme can also be | ||
1429 | used for character devices, in particular the tty devices, which are | ||
1430 | currently limited to 256 pseudo-ttys (this limits the total number of | ||
1431 | simultaneous xterms and remote logins). Note that the device number | ||
1432 | is limited to the range 36864-61439 (majors 144-239), in order to | ||
1433 | avoid any possible conflicts with existing official allocations. | ||
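The arithmetic behind that range can be checked with a short sketch. It assumes the traditional 16 bit layout, with the major number in the high 8 bits and the minor number in the low 8 bits. The helper names (make_dev, split_dev, in_unreserved_range) are made up for this illustration and are not part of the devfs API.

```python
# Sketch only: assumes a 16 bit device number with 8 bits each for
# major (high byte) and minor (low byte). Helper names are hypothetical.
MINOR_BITS = 8
MINOR_MASK = (1 << MINOR_BITS) - 1


def make_dev(major, minor):
    """Pack an 8 bit major and an 8 bit minor into one 16 bit number."""
    if not (0 <= major <= 255 and 0 <= minor <= 255):
        raise ValueError("major and minor must each fit in 8 bits")
    return (major << MINOR_BITS) | minor


def split_dev(dev):
    """Return (major, minor) for a 16 bit device number."""
    return dev >> MINOR_BITS, dev & MINOR_MASK


def in_unreserved_range(dev):
    """True if dev lies in 36864-61439, i.e. majors 144-239."""
    return make_dev(144, 0) <= dev <= make_dev(239, 255)


print(make_dev(144, 0), make_dev(239, 255))  # prints 36864 61439
```

Majors 144-239 with all 256 minors each therefore give 96 * 256 = 24576 distinct device numbers.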
1434 | |||
1435 | Please note that using dynamically allocated block device numbers may | ||
1436 | break the NFS daemons (both user and kernel mode), which expect dev_t | ||
1437 | for a given device to be constant over the lifetime of remote mounts. | ||
1438 | |||
1439 | A final note on this scheme: since it doesn't increase the size of | ||
1440 | device numbers, there are no compatibility issues with userspace. | ||
1441 | |||
1442 | ----------------------------------------------------------------------------- | ||
1443 | |||
1444 | |||
1445 | Questions and Answers | ||
1446 | |||
1447 | |||
1448 | Making things work | ||
1449 | Alternatives to devfs | ||
1450 | What I don't like about devfs | ||
1451 | How to report bugs | ||
1452 | Strange kernel messages | ||
1453 | Compilation problems with devfsd | ||
1454 | |||
1455 | |||
1456 | |||
1457 | Making things work | ||
1458 | |||
1459 | Here are some common questions and answers. | ||
1460 | |||
1461 | |||
1462 | |||
1463 | Devfsd doesn't start | ||
1464 | |||
1465 | - Make sure you have compiled and installed devfsd | ||
1466 | - Make sure devfsd is being started from your boot | ||
1467 |   scripts | ||
1468 | - Make sure you have configured your kernel to enable devfs (see | ||
1469 |   below) | ||
1470 | - Make sure devfs is mounted (see below) | ||
1471 | |||
1472 | |||
1473 | Devfsd is not managing all my permissions | ||
1474 | |||
1475 | Make sure you are capturing the appropriate events. For example, | ||
1476 | device entries created by the kernel generate REGISTER events, | ||
1477 | but those created by devfsd generate CREATE events. | ||
1478 | |||
1479 | |||
1480 | Devfsd is not capturing all REGISTER events | ||
1481 | |||
1482 | See the previous entry: you may need to capture CREATE events. | ||
1483 | |||
1484 | |||
1485 | X will not start | ||
1486 | |||
1487 | Make sure you followed the steps | ||
1488 | outlined above. | ||
1489 | |||
1490 | |||
1491 | Why don't my network devices appear in devfs? | ||
1492 | |||
1493 | This is not a bug. Network devices have their own, completely separate | ||
1494 | namespace. They are accessed via socket(2) and | ||
1495 | setsockopt(2) calls, and thus require no device nodes. I have | ||
1496 | raised the possibility of moving network devices into the device | ||
1497 | namespace, but have had no response. | ||
1498 | |||
1499 | |||
1500 | How can I test if I have devfs compiled into my kernel? | ||
1501 | |||
1502 | All filesystems built-in or currently loaded are listed in | ||
1503 | /proc/filesystems. If you see a devfs entry, then | ||
1504 | you know that devfs was compiled into your kernel. If you have | ||
1505 | correctly configured and rebuilt your kernel, then devfs will be | ||
1506 | built-in. If you think you've configured it in, but | ||
1507 | /proc/filesystems doesn't show it, you've made a mistake. | ||
1508 | Common mistakes include: | ||
1509 | |||
1510 | - Using a 2.2.x kernel without applying the devfs patch (if you | ||
1511 |   don't know how to patch your kernel, use 2.4.x instead, don't bother | ||
1512 |   asking me how to patch) | ||
1513 | - Forgetting to set CONFIG_EXPERIMENTAL=y | ||
1514 | - Forgetting to set CONFIG_DEVFS_FS=y | ||
1515 | - Forgetting to set CONFIG_DEVFS_MOUNT=y (if you want devfs | ||
1516 |   to be automatically mounted at boot) | ||
1517 | - Editing your .config manually, instead of using make | ||
1518 |   config or make xconfig | ||
1519 | - Forgetting to run make dep; make clean after changing the | ||
1520 |   configuration and before compiling | ||
1521 | - Forgetting to compile your kernel and modules | ||
1522 | - Forgetting to install your kernel | ||
1523 | - Forgetting to install your modules | ||
1524 | |||
1525 | Please check twice that you've done all these steps before sending in | ||
1526 | a bug report. | ||
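The /proc/filesystems check described above can be scripted. The following is a minimal sketch; has_filesystem and kernel_has_devfs are hypothetical helper names, and the parser assumes the usual layout in which each line is either "<fstype>" or "nodev <fstype>".

```python
# Sketch only: has_filesystem() and kernel_has_devfs() are hypothetical
# helpers, not part of devfs or devfsd.
def has_filesystem(listing, name="devfs"):
    """Return True if `name` appears as a filesystem type in `listing`."""
    for line in listing.splitlines():
        fields = line.split()
        # Each line is either "<fstype>" or "nodev <fstype>".
        if fields and fields[-1] == name:
            return True
    return False


def kernel_has_devfs(path="/proc/filesystems"):
    """Read the live list from `path`; False if it cannot be read."""
    try:
        with open(path) as f:
            return has_filesystem(f.read())
    except OSError:
        return False
```

Keeping the parsing separate from the file access makes it possible to try the check on saved output from another machine.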
1527 | |||
1528 | |||
1529 | |||
1530 | How can I test if devfs is mounted on /dev? | ||
1531 | |||
1532 | The device filesystem will always create an entry called | ||
1533 | ".devfsd", which is used to communicate with the daemon. Even | ||
1534 | if the daemon is not running, this entry will exist. Testing for the | ||
1535 | existence of this entry is the approved method of determining if devfs | ||
1536 | is mounted or not. Note that the type of entry (i.e. regular file, | ||
1537 | character device, named pipe, etc.) may change without notice. Only | ||
1538 | the existence of the entry should be relied upon. | ||
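A minimal sketch of that test follows. The function name devfs_mounted is made up for this example. It checks only for existence and deliberately ignores the entry type, as recommended above.

```python
import os


def devfs_mounted(dev_dir="/dev"):
    """Return True if the ".devfsd" entry exists in dev_dir."""
    # lexists() reports existence without following symbolic links and
    # without caring whether the entry is a file, device, pipe, etc.
    return os.path.lexists(os.path.join(dev_dir, ".devfsd"))
```

The dev_dir parameter is there so the same check works when devfs is mounted somewhere other than /dev.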
1539 | |||
1540 | |||
1541 | When I start devfsd, I see the error: | ||
1542 | Error opening file: ".devfsd" No such file or directory? | ||
1543 | |||
1544 | This means that devfs is not mounted. Make sure you have devfs mounted. | ||
1545 | |||
1546 | |||
1547 | How do I mount devfs? | ||
1548 | |||
1549 | First make sure you have devfs compiled into your kernel (see | ||
1550 | above). Then you will need to do one of the following: | ||
1551 | |||
1552 | - set CONFIG_DEVFS_MOUNT=y in your kernel config | ||
1553 | - pass devfs=mount to your boot loader | ||
1554 | - mount devfs manually in your boot scripts with: | ||
1555 |   mount -t devfs none /dev | ||
1556 | |||
1557 | |||
1558 | |||
1559 | Mount by volume LABEL=<label> doesn't work with | ||
1560 | devfs | ||
1561 | |||
1562 | Most probably you are not mounting devfs onto /dev. What | ||
1563 | happens is that if your kernel config has CONFIG_DEVFS_FS=y | ||
1564 | then the contents of /proc/partitions will have the devfs | ||
1565 | names (such as scsi/host0/bus0/target0/lun0/part1). The | ||
1566 | contents of /proc/partitions are used by mount(8) when | ||
1567 | mounting by volume label. If devfs is not mounted on /dev, | ||
1568 | then mount(8) will fail to find devices. The solution is to | ||
1569 | make sure that devfs is mounted on /dev. See above for how to | ||
1570 | do that. | ||
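To see whether this is the cause, compare the names in /proc/partitions with what actually exists under /dev. The sketch below assumes that data rows begin with numeric major and minor fields and carry the device name in the fourth column. The helper names are hypothetical.

```python
import os


def partition_names(listing):
    """Extract device names from text in /proc/partitions format."""
    names = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with numeric major and minor; this skips the
        # header line and blank lines.
        if len(fields) >= 4 and fields[0].isdigit() and fields[1].isdigit():
            names.append(fields[3])
    return names


def missing_nodes(listing, dev_dir="/dev"):
    """Return names listed by the kernel that have no entry under dev_dir."""
    return [name for name in partition_names(listing)
            if not os.path.exists(os.path.join(dev_dir, name))]
```

If missing_nodes() returns devfs-style names such as scsi/host0/bus0/target0/lun0/part1, then devfs is most likely not mounted on /dev.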
1571 | |||
1572 | |||
1573 | I have extra or incorrect entries in /dev | ||
1574 | |||
1575 | You may have stale entries in your dev-state area. Check for a | ||
1576 | RESTORE configuration line in your devfsd configuration | ||
1577 | (typically /etc/devfsd.conf). If you have this line, check | ||
1578 | the contents of the specified directory for stale entries. Remove | ||
1579 | any entries which are incorrect, then reboot. | ||
1580 | |||
1581 | |||
1582 | I get "Unable to open initial console" messages at boot | ||
1583 | |||
1584 | This usually happens when you don't have devfs automounted onto | ||
1585 | /dev at boot time, and there is no valid | ||
1586 | /dev/console entry on your root file-system. Create a valid | ||
1587 | /dev/console device node. | ||
1588 | |||
1589 | |||
1590 | |||
1591 | |||
1592 | |||
1593 | Alternatives to devfs | ||
1594 | |||
1595 | I've attempted to collate all the anti-devfs proposals and explain | ||
1596 | their limitations. Under construction. | ||
1597 | |||
1598 | |||
1599 | Why not just pass device create/remove events to a daemon? | ||
1600 | |||
1601 | Here the suggestion is to develop an API in the kernel so that devices | ||
1602 | can register create and remove events, and a daemon listens for those | ||
1603 | events. The daemon would then populate/depopulate /dev (which | ||
1604 | resides on disc). | ||
1605 | |||
1606 | This has several limitations: | ||
1607 | |||
1608 | |||
1609 | it only works for modules loaded and unloaded (or devices inserted | ||
1610 | and removed) after the kernel has finished booting. Without a database | ||
1611 | of events, there is no way the daemon could fully populate | ||
1612 | /dev | ||
1613 | |||
1614 | |||
1615 | if you add a database to this scheme, the question is then how to | ||
1616 | present that database to user-space. If you make it a list of strings | ||
1617 | with embedded event codes which are passed through a pipe to the | ||
1618 | daemon, then this is only of use to the daemon. I would argue that the | ||
1619 | natural way to present this data is via a filesystem (since many of | ||
1620 | the events will be of a hierarchical nature), such as devfs. | ||
1621 | Presenting the data as a filesystem makes it easy for the user to see | ||
1622 | what is available and also makes it easy to write scripts to scan the | ||
1623 | "database" | ||
1624 | |||
1625 | |||
1626 | the tight binding between device nodes and drivers is no longer | ||
1627 | possible (requiring the otherwise perfectly avoidable | ||
1628 | table lookups) | ||
1629 | |||
1630 | |||
1631 | you cannot catch inode lookup events on /dev which means | ||
1632 | that module autoloading requires device nodes to be created. This is a | ||
1633 | problem, particularly for drivers where only a few inodes are created | ||
1634 | from a potentially large set | ||
1635 | |||
1636 | |||
1637 | this technique can't be used when the root FS is mounted | ||
1638 | read-only | ||
1639 | |||
1640 | |||
1641 | |||
1642 | |||
1643 | Just implement a better scsidev | ||
1644 | |||
1645 | This suggestion involves taking the scsidev programme and | ||
1646 | extending it to scan for all devices, not just SCSI devices. The | ||
1647 | scsidev programme works by scanning /proc/scsi | ||
1648 | |||
1649 | Problems: | ||
1650 | |||
1651 | |||
1652 | the kernel does not currently provide a list of all devices | ||
1653 | available. Not all drivers register entries in /proc or | ||
1654 | generate kernel messages | ||
1655 | |||
1656 | |||
1657 | there is no uniform mechanism to register devices other than the | ||
1658 | devfs API | ||
1659 | |||
1660 | |||
1661 | implementing such an API is then the same as the | ||
1662 | proposal above | ||
1663 | |||
1664 | |||
1665 | |||
1666 | |||
1667 | Put /dev on a ramdisc | ||
1668 | |||
1669 | This suggestion involves creating a ramdisc and populating it with | ||
1670 | device nodes and then mounting it over /dev. | ||
1671 | |||
1672 | Problems: | ||
1673 | |||
1674 | |||
1675 | |||
1676 | this doesn't help when mounting the root filesystem, since you | ||
1677 | still need a device node to do that | ||
1678 | |||
1679 | |||
1680 | if you want to use this technique for the root device node as | ||
1681 | well, you need to use initrd. This complicates the booting sequence | ||
1682 | and makes it significantly harder to administer and configure. The | ||
1683 | initrd is essentially opaque, robbing the system administrator of easy | ||
1684 | configuration | ||
1685 | |||
1686 | |||
1687 | insufficient information is available to correctly populate the | ||
1688 | ramdisc. So we come back to the | ||
1689 | proposal above to "solve" this | ||
1690 | |||
1691 | |||
1692 | a ramdisc-based solution would take more kernel memory, since the | ||
1693 | backing store would be (at best) normal VFS inodes and dentries, which | ||
1694 | take 284 bytes and 112 bytes, respectively, for each entry. Compare | ||
1695 | that to 72 bytes for devfs | ||
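As a rough worked example of that memory comparison, the sketch below uses only the per-entry figures quoted above. The entry count of 1000 is arbitrary and chosen purely for illustration.

```python
# Per-entry sizes quoted above, in bytes.
VFS_INODE = 284
VFS_DENTRY = 112
DEVFS_ENTRY = 72


def ramdisc_bytes(entries):
    """Memory for `entries` nodes backed by VFS inodes and dentries."""
    return entries * (VFS_INODE + VFS_DENTRY)


def devfs_bytes(entries):
    """Memory for `entries` nodes held as devfs entries."""
    return entries * DEVFS_ENTRY


entries = 1000  # arbitrary example size
print(ramdisc_bytes(entries), devfs_bytes(entries))  # prints 396000 72000
```

On these figures the ramdisc approach costs 396 bytes per entry against 72, a factor of 5.5.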
1696 | |||
1697 | |||
1698 | |||
1699 | |||
1700 | Do nothing: there's no problem | ||
1701 | |||
1702 | Sometimes people can be heard to claim that the existing scheme is | ||
1703 | fine. This is what they're ignoring: | ||
1704 | |||
1705 | |||
1706 | device number size (8 bits each for major and minor) is a real | ||
1707 | limitation, and must be fixed somehow. Systems with large numbers of | ||
1708 | SCSI devices, for example, will continue to consume the remaining | ||
1709 | unallocated major numbers. USB will also need to push beyond the 8 bit | ||
1710 | minor limitation | ||
1711 | |||
1712 | |||
1713 | simply increasing the device number size is insufficient. Apart | ||
1714 | from causing a lot of pain, it doesn't solve the management issues | ||
1715 | of a /dev with thousands or more device nodes | ||
1716 | |||
1717 | |||
1718 | ignoring the problem of a huge /dev will not make it go | ||
1719 | away, and dismisses the legitimacy of a large number of people who | ||
1720 | want a dynamic /dev | ||
1721 | |||
1722 | |||
1723 | the standard response then becomes: "write a device management | ||
1724 | daemon", which brings us back to the | ||
1725 | proposal above | ||
1726 | |||
1727 | |||
1728 | |||
1729 | |||
1730 | What I don't like about devfs | ||
1731 | |||
1732 | Here are some common complaints about devfs, and some suggestions and | ||
1733 | solutions that may make it more palatable for you. I can't please | ||
1734 | everybody, but I do try :-) | ||
1735 | |||
1736 | I hate the naming scheme | ||
1737 | |||
1738 | First, remember that no naming scheme will please everybody. You hate | ||
1739 | the scheme, others love it. Who's to say who's right and who's wrong? | ||
1740 | Ultimately, the person who writes the code gets to choose, and what | ||
1741 | exists now is a combination of the choices made by the | ||
1742 | devfs author and the | ||
1743 | kernel maintainer (Linus). | ||
1744 | |||
1745 | However, not all is lost. If you want to create your own naming | ||
1746 | scheme, it is a simple matter to write a standalone script, hack | ||
1747 | devfsd, or write a script called by devfsd. You can create whatever | ||
1748 | naming scheme you like. | ||
1749 | |||
1750 | Further, if you want to remove all traces of the devfs naming scheme | ||
1751 | from /dev, you can mount devfs elsewhere (say | ||
1752 | /devfs) and populate /dev with links into | ||
1753 | /devfs. This population can be automated using devfsd if you | ||
1754 | wish. | ||
1755 | |||
1756 | You can even use the VFS binding facility to make the links, rather | ||
1757 | than using symbolic links. This way, you don't even have to see the | ||
1758 | "destination" of these symbolic links. | ||
1759 | |||
1760 | Devfs puts policy into the kernel | ||
1761 | |||
1762 | There's already policy in the kernel. Device numbers are in fact | ||
1763 | policy (why should the kernel dictate what device numbers I use?). | ||
1764 | Face it, some policy has to be in the kernel. The real difference | ||
1765 | between device names as policy and device numbers as policy is that | ||
1766 | no one will use device numbers directly, because device | ||
1767 | numbers are devoid of meaning to humans and are ugly. At least with | ||
1768 | the devfs device names, (even though you can add your own naming | ||
1769 | scheme) some people will use the devfs-supplied names directly. This | ||
1770 | offends some people :-) | ||
1771 | |||
1772 | Devfs is bloatware | ||
1773 | |||
1774 | This is not even remotely true. As shown above, | ||
1775 | both code and data size are quite modest. | ||
1776 | |||
1777 | |||
1778 | How to report bugs | ||
1779 | |||
1780 | If you have (or think you have) a bug with devfs, please follow the | ||
1781 | steps below: | ||
1782 | |||
1783 | |||
1784 | |||
1785 | make sure you have enabled debugging output when configuring your | ||
1786 | kernel. You will need to set (at least) the following config options: | ||
1787 | |||
1788 | CONFIG_DEVFS_DEBUG=y | ||
1789 | CONFIG_DEBUG_KERNEL=y | ||
1790 | CONFIG_DEBUG_SLAB=y | ||
1791 | |||
1792 | |||
1793 | |||
1794 | please make sure you have the latest devfs patches applied. The | ||
1795 | latest kernel version might not have the latest devfs patches applied | ||
1796 | yet (Linus is very busy) | ||
1797 | |||
1798 | |||
1799 | save a copy of your complete kernel logs (preferably by | ||
1800 | using the dmesg programme) for later inclusion in your bug | ||
1801 | report. You may need to use the -s switch to increase the | ||
1802 | internal buffer size so you can capture all the boot messages. | ||
1803 | Don't edit or trim the dmesg output | ||
1804 | |||
1805 | |||
1806 | |||
1807 | |||
1808 | try booting with devfs=dall passed to the kernel boot | ||
1809 | command line (read the documentation on your bootloader on how to do | ||
1810 | this), and save the result to a file. This may be quite verbose, and | ||
1811 | it may overflow the messages buffer, but try to get as much of it as | ||
1812 | you can | ||
1813 | |||
1814 | |||
1815 | if you get an Oops, run ksymoops to decode it so that the | ||
1816 | names of the offending functions are provided. A non-decoded Oops is | ||
1817 | pretty useless | ||
1818 | |||
1819 | |||
1820 | send a copy of your devfsd configuration file(s) | ||
1821 | |||
1822 | send the bug report to me first. | ||
1823 | Don't expect that I will see it if you post it to the linux-kernel | ||
1824 | mailing list. Include all the information listed above, plus | ||
1825 | anything else that you think might be relevant. Put the string | ||
1826 | devfs somewhere in the subject line, so my mail filters mark | ||
1827 | it as urgent | ||
1828 | |||
1829 | |||
1830 | |||
1831 | |||
1832 | Here is a general guide on how to ask questions in a way that greatly | ||
1833 | improves your chances of getting a reply: | ||
1834 | |||
1835 | http://www.tuxedo.org/~esr/faqs/smart-questions.html. If you have | ||
1836 | a bug to report, you should also read | ||
1837 | |||
1838 | http://www.chiark.greenend.org.uk/~sgtatham/bugs.html. | ||
1839 | |||
1840 | |||
1841 | Strange kernel messages | ||
1842 | |||
1843 | You may see devfs-related messages in your kernel logs. Below are some | ||
1844 | messages and what they mean (and what you should do about them, if | ||
1845 | anything). | ||
1846 | |||
1847 | |||
1848 | |||
1849 | devfs_register(fred): could not append to parent, err: -17 | ||
1850 | |||
1851 | You need to check what the error code means, but usually 17 means | ||
1852 | EEXIST. This means that a driver attempted to create an entry | ||
1853 | fred in a directory, but there already was an entry with that | ||
1854 | name. This is often caused by flawed boot scripts which untar a bunch | ||
1855 | of inodes into /dev, as a way to restore permissions. This | ||
1856 | message is harmless, as the device nodes will still | ||
1857 | provide access to the driver (unless you use the devfs=only | ||
1858 | boot option, which is only for dedicated souls:-). If you want to get | ||
1859 | rid of these annoying messages, upgrade to devfsd-v1.3.20 and use the | ||
1860 | recommended RESTORE directive to restore permissions. | ||
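To look up what an error code means, it can be decoded from userspace. This sketch assumes the errno table on the machine running the script matches the kernel that logged the message. The function name is hypothetical.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel-style return code to (name, message)."""
    number = abs(code)
    name = errno.errorcode.get(number, "unknown")
    return name, os.strerror(number)


name, message = describe_kernel_error(-17)
print(name, message)
```

On Linux the name reported for -17 is EEXIST.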
1861 | |||
1862 | |||
1863 | devfs_mk_dir(bill): using old entry in dir: c1808724 "" | ||
1864 | |||
1865 | This is similar to the message above, except that a driver attempted | ||
1866 | to create a directory named bill, and the parent directory | ||
1867 | has an entry with the same name. In this case, to ensure that drivers | ||
1868 | continue to work properly, the old entry is re-used and given to the | ||
1869 | driver. In 2.5 kernels, the driver is given a NULL entry, and thus, | ||
1870 | under rare circumstances, may not create the required device nodes. | ||
1871 | The solution is the same as above. | ||
1872 | |||
1873 | |||
1874 | |||
1875 | |||
1876 | |||
1877 | Compilation problems with devfsd | ||
1878 | |||
1879 | Usually, you can compile devfsd just by typing in | ||
1880 | make in the source directory, followed by a make | ||
1881 | install (as root). Sometimes, you may have problems, particularly | ||
1882 | on broken configurations. | ||
1883 | |||
1884 | |||
1885 | |||
1886 | error messages relating to DEVFSD_NOTIFY_DELETE | ||
1887 | |||
1888 | This happens because you have an ancient set of kernel headers | ||
1889 | installed in /usr/include/linux or /usr/src/linux. | ||
1890 | Install kernel 2.4.10 or later. You may need to pass the | ||
1891 | KERNEL_DIR variable to make (if you did not install | ||
1892 | the new kernel sources as /usr/src/linux), or you may copy | ||
1893 | the devfs_fs.h file in the kernel source tree into | ||
1894 | /usr/include/linux. | ||
1895 | |||
1896 | |||
1897 | |||
1898 | |||
1899 | ----------------------------------------------------------------------------- | ||
1900 | |||
1901 | |||
1902 | Other resources | ||
1903 | |||
1904 | |||
1905 | |||
1906 | Douglas Gilbert has written a useful document at | ||
1907 | |||
1908 | http://www.torque.net/sg/devfs_scsi.html which | ||
1909 | explores the SCSI subsystem and how it interacts with devfs | ||
1910 | |||
1911 | |||
1912 | Douglas Gilbert has written another useful document at | ||
1913 | |||
1914 | http://www.torque.net/scsi/SCSI-2.4-HOWTO/ which | ||
1915 | discusses the Linux SCSI subsystem in 2.4. | ||
1916 | |||
1917 | |||
1918 | Johannes Erdfelt has started a discussion paper on Linux and | ||
1919 | hot-swap devices, describing what the requirements are for a scalable | ||
1920 | solution and how and why he's used devfs+devfsd. Note that this is an | ||
1921 | early draft only, available in plain text form at: | ||
1922 | |||
1923 | http://johannes.erdfelt.com/hotswap.txt. | ||
1924 | Johannes has promised an HTML version will follow. | ||
1925 | |||
1926 | |||
1927 | I presented an invited | ||
1928 | paper | ||
1929 | at the | ||
1930 | |||
1931 | 2nd Annual Storage Management Workshop held in Miami, Florida, | ||
1932 | U.S.A. in October 2000. | ||
1933 | |||
1934 | |||
1935 | |||
1936 | |||
1937 | ----------------------------------------------------------------------------- | ||
1938 | |||
1939 | |||
1940 | Translations of this document | ||
1941 | |||
1942 | This document has been translated into other languages. | ||
1943 | |||
1944 | |||
1945 | |||
1946 | |||
1947 | The document master (in English) by rgooch@atnf.csiro.au is | ||
1948 | available at | ||
1949 | |||
1950 | http://www.atnf.csiro.au/~rgooch/linux/docs/devfs.html | ||
1951 | |||
1952 | |||
1953 | |||
1954 | A Korean translation by viatoris@nownuri.net is available at | ||
1955 | |||
1956 | http://your.destiny.pe.kr/devfs/devfs.html | ||
1957 | |||
1958 | |||
1959 | |||
1960 | |||
1961 | ----------------------------------------------------------------------------- | ||
1962 | Most flags courtesy of ITA's | ||
1963 | Flags of All Countries | ||
1964 | used with permission. | ||
diff --git a/Documentation/filesystems/devfs/ToDo b/Documentation/filesystems/devfs/ToDo new file mode 100644 index 000000000000..afd5a8f2c19b --- /dev/null +++ b/Documentation/filesystems/devfs/ToDo | |||
@@ -0,0 +1,40 @@ | |||
1 | Device File System (devfs) ToDo List | ||
2 | |||
3 | Richard Gooch <rgooch@atnf.csiro.au> | ||
4 | |||
5 | 3-JUL-2000 | ||
6 | |||
7 | This is a list of things to be done for better devfs support in the | ||
8 | Linux kernel. If you'd like to contribute to the devfs, please have a | ||
9 | look at this list for anything that is unallocated. Also, if there are | ||
10 | items missing (surely), please contact me so I can add them to the | ||
11 | list (preferably with your name attached to them:-). | ||
12 | |||
13 | |||
14 | - >256 ptys | ||
15 | Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> | ||
16 | |||
17 | - Amiga floppy driver (drivers/block/amiflop.c) | ||
18 | |||
19 | - Atari floppy driver (drivers/block/ataflop.c) | ||
20 | |||
21 | - SWIM3 (Super Woz Integrated Machine 3) floppy driver (drivers/block/swim3.c) | ||
22 | |||
23 | - Amiga ZorroII ramdisc driver (drivers/block/z2ram.c) | ||
24 | |||
25 | - Parallel port ATAPI CD-ROM (drivers/block/paride/pcd.c) | ||
26 | |||
27 | - Parallel port ATAPI floppy (drivers/block/paride/pf.c) | ||
28 | |||
29 | - AP1000 block driver (drivers/ap1000/ap.c, drivers/ap1000/ddv.c) | ||
30 | |||
31 | - Archimedes floppy (drivers/acorn/block/fd1772.c) | ||
32 | |||
33 | - MFM hard drive (drivers/acorn/block/mfmhd.c) | ||
34 | |||
35 | - I2O block device (drivers/message/i2o/i2o_block.c) | ||
36 | |||
37 | - ST-RAM device (arch/m68k/atari/stram.c) | ||
38 | |||
39 | - Raw devices | ||
40 | |||
diff --git a/Documentation/filesystems/devfs/boot-options b/Documentation/filesystems/devfs/boot-options new file mode 100644 index 000000000000..df3d33b03e0a --- /dev/null +++ b/Documentation/filesystems/devfs/boot-options | |||
@@ -0,0 +1,65 @@ | |||
1 | /* -*- auto-fill -*- */ | ||
2 | |||
3 | Device File System (devfs) Boot Options | ||
4 | |||
5 | Richard Gooch <rgooch@atnf.csiro.au> | ||
6 | |||
7 | 18-AUG-2001 | ||
8 | |||
9 | |||
10 | When CONFIG_DEVFS_DEBUG is enabled, you can pass several boot options | ||
11 | to the kernel to debug devfs. The boot options are prefixed by | ||
12 | "devfs=", and are separated by commas. Spaces are not allowed. The | ||
13 | syntax looks like this: | ||
14 | |||
15 | devfs=<option1>,<option2>,<option3> | ||
16 | |||
17 | and so on. For example, if you wanted to turn on debugging for module | ||
18 | load requests and device registration, you would do: | ||
19 | |||
20 | devfs=dmod,dreg | ||
21 | |||
22 | You may prefix "no" to any option. This will invert the option. | ||
23 | |||
24 | |||
25 | Debugging Options | ||
26 | ================= | ||
27 | |||
28 | These require CONFIG_DEVFS_DEBUG to be enabled. | ||
29 | Note that all debugging options have 'd' as the first character. By | ||
30 | default all options are off. All debugging output is sent to the | ||
31 | kernel logs. The debugging options do not take effect until the devfs | ||
32 | version message appears (just prior to the root filesystem being | ||
33 | mounted). | ||
34 | |||
35 | These are the options: | ||
36 | |||
37 | dmod print module load requests to <request_module> | ||
38 | |||
39 | dreg print device register requests to <devfs_register> | ||
40 | |||
41 | dunreg print device unregister requests to <devfs_unregister> | ||
42 | |||
43 | dchange print device change requests to <devfs_set_flags> | ||
44 | |||
45 | dilookup print inode lookup requests | ||
46 | |||
47 | diget print VFS inode allocations | ||
48 | |||
49 | diunlink print inode unlinks | ||
50 | |||
51 | dichange print inode changes | ||
52 | |||
53 | dimknod print calls to mknod(2) | ||
54 | |||
55 | dall some debugging turned on | ||
56 | |||
57 | |||
58 | Other Options | ||
59 | ============= | ||
60 | |||
61 | These control the default behaviour of devfs. The options are: | ||
62 | |||
63 | mount mount devfs onto /dev at boot time | ||
64 | |||
65 | only disable non-devfs device nodes for devfs-capable drivers | ||
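The option syntax described in this file can be illustrated with a small parser. This is a sketch of the documented syntax only and is not the kernel's own parsing code. The option names are taken from the lists above, and parse_devfs_options is a hypothetical name.

```python
# Option names taken from the lists above; the parser itself is a sketch.
DEBUG_OPTIONS = {"dmod", "dreg", "dunreg", "dchange", "dilookup",
                 "diget", "diunlink", "dichange", "dimknod", "dall"}
OTHER_OPTIONS = {"mount", "only"}
KNOWN = DEBUG_OPTIONS | OTHER_OPTIONS


def parse_devfs_options(arg):
    """Parse "devfs=opt1,opt2" into a dict of option name -> bool."""
    prefix = "devfs="
    if not arg.startswith(prefix):
        raise ValueError("expected an argument starting with 'devfs='")
    if " " in arg:
        raise ValueError("spaces are not allowed")
    result = {}
    for token in arg[len(prefix):].split(","):
        if token in KNOWN:
            result[token] = True
        elif token.startswith("no") and token[2:] in KNOWN:
            # A "no" prefix inverts the option.
            result[token[2:]] = False
        else:
            raise ValueError("unknown option: %r" % token)
    return result


print(parse_devfs_options("devfs=dmod,dreg"))
```

With this sketch, "devfs=nomount,only" yields mount set to False and only set to True.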
diff --git a/Documentation/filesystems/directory-locking b/Documentation/filesystems/directory-locking new file mode 100644 index 000000000000..34380d4fbce3 --- /dev/null +++ b/Documentation/filesystems/directory-locking | |||
@@ -0,0 +1,113 @@ | |||
1 | Locking scheme used for directory operations is based on two | ||
2 | kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem). | ||
3 | |||
4 | For our purposes all operations fall in 6 classes: | ||
5 | |||
6 | 1) read access. Locking rules: caller locks directory we are accessing. | ||
7 | |||
8 | 2) object creation. Locking rules: same as above. | ||
9 | |||
10 | 3) object removal. Locking rules: caller locks parent, finds victim, | ||
11 | locks victim and calls the method. | ||
12 | |||
13 | 4) rename() that is _not_ cross-directory. Locking rules: caller locks | ||
14 | the parent, finds source and target, if target already exists - locks it | ||
15 | and then calls the method. | ||
16 | |||
17 | 5) link creation. Locking rules: | ||
18 | * lock parent | ||
19 | * check that source is not a directory | ||
20 | * lock source | ||
21 | * call the method. | ||
22 | |||
23 | 6) cross-directory rename. The trickiest in the whole bunch. Locking | ||
24 | rules: | ||
25 | * lock the filesystem | ||
26 | * lock parents in "ancestors first" order. | ||
27 | * find source and target. | ||
28 | * if old parent is equal to or is a descendent of target | ||
29 | fail with -ENOTEMPTY | ||
30 | * if new parent is equal to or is a descendent of source | ||
31 | fail with -ELOOP | ||
32 | * if target exists - lock it. | ||
33 | * call the method. | ||
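The ordering and ancestry checks for cross-directory rename can be sketched on a toy tree. This is an illustration under stated assumptions, not the kernel implementation: the Dir class and all function names are hypothetical, every object is modelled as a node with a single parent pointer, no actual locks are taken, and the error codes simply follow the rules listed above.

```python
import errno


class Dir:
    """Toy directory node; `parent` is None for the root."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def is_same_or_descendant(node, ancestor):
    """True if node is ancestor itself or lies somewhere below it."""
    while node is not None:
        if node is ancestor:
            return True
        node = node.parent
    return False


def parent_lock_order(old_parent, new_parent):
    """Return the two (distinct) parents in "ancestors first" order."""
    if is_same_or_descendant(old_parent, new_parent):
        return [new_parent, old_parent]
    # Either old_parent is the ancestor, or the two are unrelated and
    # any fixed order is safe while the filesystem lock is held.
    return [old_parent, new_parent]


def check_cross_dir_rename(old_parent, source, new_parent, target=None):
    """Return 0 if the rename may proceed, else a negative errno."""
    if target is not None and is_same_or_descendant(old_parent, target):
        return -errno.ENOTEMPTY
    if is_same_or_descendant(new_parent, source):
        return -errno.ELOOP
    return 0
```

For example, with /a/b and /c on one filesystem, moving a into b is refused with -ELOOP because the new parent b is a descendent of the source a, while moving b into c passes both checks.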
34 | |||
35 | |||
36 | The rules above obviously guarantee that all directories that are going to be | ||
37 | read, modified or removed by method will be locked by caller. | ||
38 | |||
39 | |||
40 | If no directory is its own ancestor, the scheme above is deadlock-free. | ||
41 | Proof: | ||
42 | |||
43 | First of all, at any moment we have a partial ordering of the | ||
44 | objects - A < B iff A is an ancestor of B. | ||
45 | |||
46 | That ordering can change. However, the following is true: | ||
47 | |||
48 | (1) if object removal or non-cross-directory rename holds lock on A and | ||
49 | attempts to acquire lock on B, A will remain the parent of B until we | ||
50 | acquire the lock on B. (Proof: only cross-directory rename can change | ||
51 | the parent of object and it would have to lock the parent). | ||
52 | |||
53 | (2) if cross-directory rename holds the lock on filesystem, order will not | ||
54 | change until rename acquires all locks. (Proof: other cross-directory | ||
55 | renames will be blocked on filesystem lock and we don't start changing | ||
56 | the order until we had acquired all locks). | ||
57 | |||
58 | (3) any operation holds at most one lock on non-directory object and | ||
59 | that lock is acquired after all other locks. (Proof: see descriptions | ||
60 | of operations). | ||
61 | |||
62 | Now consider the minimal deadlock. Each process is blocked on | ||
63 | attempt to acquire some lock and already holds at least one lock. Let's | ||
64 | consider the set of contended locks. First of all, filesystem lock is | ||
65 | not contended, since any process blocked on it is not holding any locks. | ||
66 | Thus all processes are blocked on ->i_sem. | ||
67 | |||
68 | Non-directory objects are not contended due to (3). Thus link | ||
69 | creation can't be a part of deadlock - it can't be blocked on source | ||
70 | and it means that it doesn't hold any locks. | ||
71 | |||
72 | Any contended object is either held by cross-directory rename or | ||
73 | has a child that is also contended. Indeed, suppose that it is held by | ||
74 | operation other than cross-directory rename. Then the lock this operation | ||
75 | is blocked on belongs to child of that object due to (1). | ||
76 | |||
77 | It means that one of the operations is cross-directory rename. | ||
78 | Otherwise the set of contended objects would be infinite - each of them | ||
79 | would have a contended child and we had assumed that no object is its | ||
80 | own descendent. Moreover, there is exactly one cross-directory rename | ||
81 | (see above). | ||
82 | |||
83 | Consider the object blocking the cross-directory rename. One | ||
84 | of its descendents is locked by cross-directory rename (otherwise we | ||
85 | would again have an infinite set of contended objects). But that | ||
86 | means that cross-directory rename is taking locks out of order. Due | ||
87 | to (2) the order hadn't changed since we had acquired filesystem lock. | ||
88 | But locking rules for cross-directory rename guarantee that we do not | ||
89 | try to acquire lock on descendent before the lock on ancestor. | ||
90 | Contradiction. I.e. deadlock is impossible. Q.E.D. | ||
91 | |||
92 | |||
93 | These operations are guaranteed to avoid loop creation. Indeed, | ||
94 | the only operation that could introduce loops is cross-directory rename. | ||
95 | Since the only new (parent, child) pair added by rename() is (new parent, | ||
96 | source), such loop would have to contain these objects and the rest of it | ||
97 | would have to exist before rename(). I.e. at the moment of loop creation | ||
98 | rename() responsible for that would be holding filesystem lock and new parent | ||
99 | would have to be equal to or a descendent of source. But that means that | ||
100 | new parent had been equal to or a descendent of source since the moment when | ||
101 | we had acquired filesystem lock and rename() would fail with -ELOOP in that | ||
102 | case. | ||
103 | |||
104 | While this locking scheme works for arbitrary DAGs, it relies on | ||
105 | ability to check that directory is a descendent of another object. Current | ||
106 | implementation assumes that directory graph is a tree. This assumption is | ||
107 | also preserved by all operations (cross-directory rename on a tree that would | ||
108 | not introduce a cycle will leave it a tree and link() fails for directories). | ||
109 | |||
110 | Notice that "directory" in the above == "anything that might have | ||
111 | children", so if we are going to introduce hybrid objects we will need | ||
112 | either to make sure that link(2) doesn't work for them or to make changes | ||
113 | in is_subdir() that would make it work even in presence of such beasts. | ||
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt new file mode 100644 index 000000000000..b5cb9110cc6b --- /dev/null +++ b/Documentation/filesystems/ext2.txt | |||
@@ -0,0 +1,383 @@ | |||
1 | |||
2 | The Second Extended Filesystem | ||
3 | ============================== | ||
4 | |||
5 | ext2 was originally released in January 1993. Written by Rémy Card, | ||
6 | Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the | ||
7 | Extended Filesystem. It is currently still (April 2001) the predominant | ||
8 | filesystem in use by Linux. There are also implementations available | ||
9 | for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS. | ||
10 | |||
11 | Options | ||
12 | ======= | ||
13 | |||
14 | Most defaults are determined by the filesystem superblock, and can be | ||
15 | set using tune2fs(8). Kernel-determined defaults are indicated by (*). | ||
16 | |||
17 | bsddf (*) Makes `df' act like BSD. | ||
18 | minixdf Makes `df' act like Minix. | ||
19 | |||
20 | check Check block and inode bitmaps at mount time | ||
21 | (requires CONFIG_EXT2_CHECK). | ||
22 | check=none, nocheck (*) Don't do extra checking of bitmaps on mount | ||
23 | (check=normal and check=strict options removed) | ||
24 | |||
25 | debug Extra debugging information is sent to the | ||
26 | kernel syslog. Useful for developers. | ||
27 | |||
28 | errors=continue Keep going on a filesystem error. | ||
29 | errors=remount-ro Remount the filesystem read-only on an error. | ||
30 | errors=panic Panic and halt the machine if an error occurs. | ||
31 | |||
32 | grpid, bsdgroups Give objects the same group ID as their parent. | ||
33 | nogrpid, sysvgroups New objects have the group ID of their creator. | ||
34 | |||
35 | nouid32 Use 16-bit UIDs and GIDs. | ||
36 | |||
37 | oldalloc Enable the old block allocator. Orlov should | ||
38 | have better performance; we'd like to get some | ||
39 | feedback if it's the contrary for you. | ||
40 | orlov (*) Use the Orlov block allocator. | ||
41 | (See http://lwn.net/Articles/14633/ and | ||
42 | http://lwn.net/Articles/14446/.) | ||
43 | |||
44 | resuid=n The user ID which may use the reserved blocks. | ||
45 | resgid=n The group ID which may use the reserved blocks. | ||
46 | |||
47 | sb=n Use alternate superblock at this location. | ||
48 | |||
49 | user_xattr Enable "user." POSIX Extended Attributes | ||
50 | (requires CONFIG_EXT2_FS_XATTR). | ||
51 | See also http://acl.bestbits.at | ||
52 | nouser_xattr Don't support "user." extended attributes. | ||
53 | |||
54 | acl Enable POSIX Access Control Lists support | ||
55 | (requires CONFIG_EXT2_FS_POSIX_ACL). | ||
56 | See also http://acl.bestbits.at | ||
57 | noacl Don't support POSIX ACLs. | ||
58 | |||
59 | nobh Do not attach buffer_heads to file pagecache. | ||
60 | |||
61 | grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2. | ||
62 | |||
63 | |||
64 | Specification | ||
65 | ============= | ||
66 | |||
67 | ext2 shares many properties with traditional Unix filesystems. It has | ||
68 | the concepts of blocks, inodes and directories. It has space in the | ||
69 | specification for Access Control Lists (ACLs), fragments, undeletion and | ||
70 | compression though these are not yet implemented (some are available as | ||
71 | separate patches). There is also a versioning mechanism to allow new | ||
72 | features (such as journalling) to be added in a maximally compatible | ||
73 | manner. | ||
74 | |||
75 | Blocks | ||
76 | ------ | ||
77 | |||
78 | The space in the device or file is split up into blocks. These are | ||
79 | a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems), | ||
80 | which is decided when the filesystem is created. Smaller blocks mean | ||
81 | less wasted space per file, but require slightly more accounting overhead, | ||
82 | and also impose other limits on the size of files and the filesystem. | ||
83 | |||
84 | Block Groups | ||
85 | ------------ | ||
86 | |||
87 | Blocks are clustered into block groups in order to reduce fragmentation | ||
88 | and minimise the amount of head seeking when reading a large amount | ||
89 | of consecutive data. Information about each block group is kept in a | ||
90 | descriptor table stored in the block(s) immediately after the superblock. | ||
91 | Two blocks near the start of each group are reserved for the block usage | ||
92 | bitmap and the inode usage bitmap which show which blocks and inodes | ||
93 | are in use. Since each bitmap is limited to a single block, this means | ||
94 | that a block group can hold at most 8 times as many blocks as a block has bytes. | ||
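
The bitmap limit works out as follows: one bitmap block holds 8 bits per
byte, and each bit tracks one block. A small sketch of the arithmetic (the
function name is made up for illustration):

```python
def block_group_limits(block_size):
    """Return (max blocks per group, max group size in bytes).

    A one-block bitmap has 8 * block_size bits, one bit per block.
    """
    blocks_per_group = 8 * block_size
    return blocks_per_group, blocks_per_group * block_size


for bs in (1024, 2048, 4096):
    blocks, nbytes = block_group_limits(bs)
    print(bs, blocks, nbytes // (1024 * 1024), "MiB")
```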
95 | |||
96 | The block(s) following the bitmaps in each block group are designated | ||
97 | as the inode table for that block group and the remainder are the data | ||
98 | blocks. The block allocation algorithm attempts to allocate data blocks | ||
99 | in the same block group as the inode which contains them. | ||
100 | |||
101 | The Superblock | ||
102 | -------------- | ||
103 | |||
104 | The superblock contains all the information about the configuration of | ||
105 | the filing system. The primary copy of the superblock is stored at an | ||
106 | offset of 1024 bytes from the start of the device, and it is essential | ||
107 | to mounting the filesystem. Since it is so important, backup copies of | ||
108 | the superblock are stored in block groups throughout the filesystem. | ||
109 | The first version of ext2 (revision 0) stores a copy at the start of | ||
110 | every block group, along with backups of the group descriptor block(s). | ||
111 | Because this can consume a considerable amount of space for large | ||
112 | filesystems, later revisions can optionally reduce the number of backup | ||
113 | copies by only putting backups in specific groups (this is the sparse | ||
114 | superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7. | ||
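
As an illustration of the sparse superblock rule stated above (groups 0,
1 and powers of 3, 5 and 7), the following sketch lists which groups would
carry a copy. The helper name is hypothetical and this is not the code used
by mke2fs or the kernel.

```python
def has_superblock_backup(group):
    """True if the group is 0, 1, or a power of 3, 5 or 7."""
    if group in (0, 1):
        return True
    for base in (3, 5, 7):
        n = base
        while n < group:
            n *= base
        if n == group:
            return True
    return False


# Groups below 100 that hold a superblock copy under this rule.
backup_groups = [g for g in range(100) if has_superblock_backup(g)]
```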
115 | |||
116 | The information in the superblock contains fields such as the total | ||
117 | number of inodes and blocks in the filesystem and how many are free, | ||
118 | how many inodes and blocks are in each block group, when the filesystem | ||
119 | was mounted (and if it was cleanly unmounted), when it was modified, | ||
120 | what version of the filesystem it is (see the Revisions section below) | ||
121 | and which OS created it. | ||
122 | |||
123 | If the filesystem is revision 1 or higher, then there are extra fields, | ||
124 | such as a volume name, a unique identification number, the inode size, | ||
125 | and space for optional filesystem features to store configuration info. | ||
126 | |||
127 | All fields in the superblock (as in all other ext2 structures) are stored | ||
128 | on the disc in little endian format, so a filesystem is portable between | ||
129 | machines without having to know what machine it was created on. | ||
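
A minimal sketch of reading a few superblock fields from an image, using
the 1024-byte offset and little-endian encoding described above. The field
offsets and the 0xEF53 magic are quoted from memory of struct
ext2_super_block and should be checked against the kernel headers before
being relied on; read_superblock is a made-up name.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # primary superblock, from the text above
EXT2_MAGIC = 0xEF53        # assumed value of the magic field


def read_superblock(image):
    """Decode selected little-endian fields from an ext2 image (bytes)."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    # First seven 32-bit fields (assumed layout): inode count, block count,
    # reserved blocks, free blocks, free inodes, first data block,
    # log2(block size) - 10.
    (inodes, blocks, reserved, free_blocks, free_inodes,
     first_data_block, log_block_size) = struct.unpack_from("<7I", sb, 0)
    (blocks_per_group,) = struct.unpack_from("<I", sb, 32)
    (inodes_per_group,) = struct.unpack_from("<I", sb, 40)
    (magic,) = struct.unpack_from("<H", sb, 56)
    (rev_level,) = struct.unpack_from("<I", sb, 76)
    if magic != EXT2_MAGIC:
        raise ValueError("not an ext2 superblock")
    return {
        "inodes": inodes,
        "blocks": blocks,
        "free_blocks": free_blocks,
        "free_inodes": free_inodes,
        "block_size": 1024 << log_block_size,
        "blocks_per_group": blocks_per_group,
        "inodes_per_group": inodes_per_group,
        "rev_level": rev_level,
    }
```

Because every field is decoded with an explicit "<" (little-endian) format,
the result does not depend on the byte order of the machine running it.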
130 | |||
131 | Inodes | ||
132 | ------ | ||
133 | |||
134 | The inode (index node) is a fundamental concept in the ext2 filesystem. | ||
135 | Each object in the filesystem is represented by an inode. The inode | ||
136 | structure contains pointers to the filesystem blocks which contain the | ||
137 | data held in the object and all of the metadata about an object except | ||
138 | its name. The metadata about an object includes the permissions, owner, | ||
139 | group, flags, size, number of blocks used, access time, change time, | ||
140 | modification time, deletion time, number of links, fragments, version | ||
141 | (for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs). | ||
142 | |||
143 | There are some reserved fields which are currently unused in the inode | ||
144 | structure and several which are overloaded. One field is reserved for the | ||
145 | directory ACL if the inode is a directory and alternately for the top 32 | ||
146 | bits of the file size if the inode is a regular file (allowing file sizes | ||
147 | larger than 2GB). The translator field is unused under Linux, but is used | ||
148 | by the HURD to reference the inode of a program which will be used to | ||
149 | interpret this object. Most of the remaining reserved fields have been | ||
150 | used up for both Linux and the HURD for larger owner and group fields. | ||
151 | The HURD also has a larger mode field so it uses another of the remaining | ||
152 | fields to store the extra mode bits. | ||
153 | |||
154 | There are pointers to the first 12 blocks which contain the file's data | ||
155 | in the inode. There is a pointer to an indirect block (which contains | ||
156 | pointers to the next set of blocks), a pointer to a doubly-indirect | ||
157 | block (which contains pointers to indirect blocks) and a pointer to a | ||
158 | trebly-indirect block (which contains pointers to doubly-indirect blocks). | ||
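
The number of blocks reachable through this pointer scheme can be worked
out directly. The sketch below assumes 4-byte block pointers, which is not
stated in the text above; the function name is illustrative.

```python
def max_mapped_blocks(block_size, pointer_size=4):
    """Blocks addressable via 12 direct pointers plus one indirect,
    one doubly-indirect and one trebly-indirect block."""
    n = block_size // pointer_size   # pointers held by one block
    return 12 + n + n ** 2 + n ** 3


for bs in (1024, 2048):
    limit_bytes = max_mapped_blocks(bs) * bs
    print(bs, limit_bytes // 2 ** 30, "GiB")
```

For 1kB and 2kB blocks this bound comes to roughly 16GB and 256GB. For
larger block sizes the bound is higher than the 2048GB figure usually
quoted for ext2, so some other limit applies there.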
159 | |||
160 | The flags field contains some ext2-specific flags which aren't catered | ||
161 | for by the standard chmod flags. These flags can be listed with lsattr | ||
162 | and changed with the chattr command, and allow specific filesystem | ||
163 | behaviour on a per-file basis. There are flags for secure deletion, | ||
164 | undeletable, compression, synchronous updates, immutability, append-only, | ||
165 | dumpable, no-atime, indexed directories, and data-journaling. Not all | ||
166 | of these are supported yet. | ||
167 | |||
168 | Directories | ||
169 | ----------- | ||
170 | |||
171 | A directory is a filesystem object and has an inode just like a file. | ||
172 | It is a specially formatted file containing records which associate | ||
173 | each name with an inode number. Later revisions of the filesystem also | ||
174 | encode the type of the object (file, directory, symlink, device, fifo, | ||
175 | socket) to avoid the need to check the inode itself for this information | ||
176 | (support for taking advantage of this feature does not yet exist in | ||
177 | Glibc 2.2). | ||
178 | |||
179 | The inode allocation code tries to assign inodes which are in the same | ||
180 | block group as the directory in which they are first created. | ||
181 | |||
182 | The current implementation of ext2 uses a singly-linked list to store | ||
183 | the filenames in the directory; a pending enhancement uses hashing of the | ||
184 | filenames to allow lookup without the need to scan the entire directory. | ||
185 | |||
186 | The current implementation never removes empty directory blocks once they | ||
187 | have been allocated to hold more files. | ||
188 | |||
189 | Special files | ||
190 | ------------- | ||
191 | |||
192 | Symbolic links are also filesystem objects with inodes. They deserve | ||
193 | special mention because the data for them is stored within the inode | ||
194 | itself if the symlink is less than 60 bytes long. It uses the fields | ||
195 | which would normally be used to store the pointers to data blocks. | ||
196 | This is a worthwhile optimisation as it avoids allocating a full | ||
197 | block for the symlink, and most symlinks are less than 60 characters long. | ||
198 | |||
199 | Character and block special devices never have data blocks assigned to | ||
200 | them. Instead, their device number is stored in the inode, again reusing | ||
201 | the fields which would be used to point to the data blocks. | ||
202 | |||
203 | Reserved Space | ||
204 | -------------- | ||
205 | |||
206 | In ext2, there is a mechanism for reserving a certain number of blocks | ||
207 | for a particular user (normally the super-user). This is intended to | ||
208 | allow for the system to continue functioning even if non-privileged users | ||
209 | fill up all the space available to them (this is independent of filesystem | ||
210 | quotas). It also keeps the filesystem from filling up entirely which | ||
211 | helps combat fragmentation. | ||
212 | |||
213 | Filesystem check | ||
214 | ---------------- | ||
215 | |||
216 | At boot time, most systems run a consistency check (e2fsck) on their | ||
217 | filesystems. The superblock of the ext2 filesystem contains several | ||
218 | fields which indicate whether fsck should actually run (since checking | ||
219 | the filesystem at boot can take a long time if it is large). fsck will | ||
220 | run if the filesystem was not cleanly unmounted, if the maximum mount | ||
221 | count has been exceeded or if the maximum time between checks has been | ||
222 | exceeded. | ||
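
The three conditions above can be written down as a simple predicate.
This is a sketch only: the parameter names are hypothetical, the exact
comparisons used by e2fsck may differ, and treating a non-positive maximum
or interval as "disabled" is an assumption.

```python
def needs_fsck(cleanly_unmounted, mount_count, max_mount_count,
               last_check, check_interval, now):
    """Decide whether a full check should run at boot.

    Times are in seconds since the epoch; check_interval is in seconds.
    """
    if not cleanly_unmounted:
        return True
    # Assumption: a maximum of zero or less disables the mount-count check.
    if max_mount_count > 0 and mount_count > max_mount_count:
        return True
    # Assumption: an interval of zero disables the time-based check.
    if check_interval > 0 and now - last_check > check_interval:
        return True
    return False
```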
223 | |||
224 | Feature Compatibility | ||
225 | --------------------- | ||
226 | |||
227 | The compatibility feature mechanism used in ext2 is sophisticated. | ||
228 | It safely allows features to be added to the filesystem, without | ||
229 | unnecessarily sacrificing compatibility with older versions of the | ||
230 | filesystem code. The feature compatibility mechanism is not supported by | ||
231 | the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in | ||
232 | revision 1. There are three 32-bit fields, one for compatible features | ||
233 | (COMPAT), one for read-only compatible (RO_COMPAT) features and one for | ||
234 | incompatible (INCOMPAT) features. | ||
235 | |||
236 | These feature flags have specific meanings for the kernel as follows: | ||
237 | |||
238 | A COMPAT flag indicates that a feature is present in the filesystem, | ||
239 | but the on-disk format is 100% compatible with older on-disk formats, so | ||
240 | a kernel which didn't know anything about this feature could read/write | ||
241 | the filesystem without any chance of corrupting the filesystem (or even | ||
242 | making it inconsistent). This is essentially just a flag which says | ||
243 | "this filesystem has a (hidden) feature" that the kernel or e2fsck may | ||
244 | want to be aware of (more on e2fsck and feature flags later). The ext3 | ||
245 | HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply | ||
246 | a regular file with data blocks in it so the kernel does not need to | ||
247 | take any special notice of it if it doesn't understand ext3 journaling. | ||
248 | |||
249 | An RO_COMPAT flag indicates that the on-disk format is 100% compatible | ||
250 | with older on-disk formats for reading (i.e. the feature does not change | ||
251 | the visible on-disk format). However, an old kernel writing to such a | ||
252 | filesystem would/could corrupt the filesystem, so this is prevented. The | ||
253 | most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because | ||
254 | sparse groups allow file data blocks where superblock/group descriptor | ||
255 | backups used to live, and ext2_free_blocks() refuses to free these blocks, | ||
256 | which would lead to inconsistent bitmaps. An old kernel would also | ||
257 | get an error if it tried to free a series of blocks which crossed a group | ||
258 | boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem. | ||
259 | |||
260 | An INCOMPAT flag indicates the on-disk format has changed in some | ||
261 | way that makes it unreadable by older kernels, or would otherwise | ||
262 | cause a problem if an old kernel tried to mount it. FILETYPE is an | ||
263 | INCOMPAT flag because older kernels would think a filename was longer | ||
264 | than 256 characters, which would lead to corrupt directory listings. | ||
265 | The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel | ||
266 | doesn't understand compression, you would just get garbage back from | ||
267 | read() instead of it automatically decompressing your data. The ext3 | ||
268 | RECOVER flag is needed to prevent a kernel which does not understand the | ||
269 | ext3 journal from mounting the filesystem without replaying the journal. | ||
270 | |||
271 | e2fsck needs to be stricter in its handling of these | ||
272 | flags than the kernel. If it doesn't understand ANY of the COMPAT, | ||
273 | RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem, | ||
274 | because it has no way of verifying whether a given feature is valid | ||
275 | or not. Allowing e2fsck to succeed on a filesystem with an unknown | ||
276 | feature would give the user a false sense of security. Refusing to check | ||
277 | a filesystem with unknown features is a good incentive for the user to | ||
278 | update to the latest e2fsck. This also means that anyone adding feature | ||
279 | flags to ext2 also needs to update e2fsck to verify these features. | ||
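
The rules for the kernel and for e2fsck can be summarised as bitmask
tests. The sketch below is illustrative only: the function names are made
up and the bit values are placeholders chosen for the example, not taken
from the text above.

```python
# Placeholder bit values for a few of the flags named above.
HAS_JOURNAL = 0x4    # COMPAT
SPARSE_SUPER = 0x1   # RO_COMPAT
FILETYPE = 0x2       # INCOMPAT


def kernel_mount_mode(compat, ro_compat, incompat,
                      known_ro_compat, known_incompat):
    """How a kernel treats a filesystem given the flags it understands.

    Unknown COMPAT bits are ignored on purpose: they never block a mount.
    """
    if incompat & ~known_incompat:
        return "refuse"        # unknown INCOMPAT feature
    if ro_compat & ~known_ro_compat:
        return "read-only"     # unknown RO_COMPAT feature: no writes
    return "read-write"


def e2fsck_can_check(compat, ro_compat, incompat,
                     known_compat, known_ro_compat, known_incompat):
    """e2fsck is stricter: any unknown bit in any field means refusal."""
    unknown = (compat & ~known_compat
               or ro_compat & ~known_ro_compat
               or incompat & ~known_incompat)
    return not unknown
```

A kernel with no feature support at all would still mount a filesystem
that only sets HAS_JOURNAL read-write, while an equally old e2fsck would
refuse to check it.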
280 | |||
281 | Metadata | ||
282 | -------- | ||
283 | |||
284 | It is frequently claimed that the ext2 implementation of writing | ||
285 | asynchronous metadata is faster than the ffs synchronous metadata | ||
286 | scheme but less reliable. Both methods are equally resolvable by their | ||
287 | respective fsck programs. | ||
288 | |||
289 | If you're exceptionally paranoid, there are 3 ways of making metadata | ||
290 | writes synchronous on ext2: | ||
291 | |||
292 | per-file if you have the program source: use the O_SYNC flag to open() | ||
293 | per-file if you don't have the source: use "chattr +S" on the file | ||
294 | per-filesystem: add the "sync" option to mount (or in /etc/fstab) | ||
295 | |||
296 | The first and last are not ext2 specific but do force the metadata to | ||
297 | be written synchronously. See also Journaling below. | ||
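
For the first method, a program opens the file with O_SYNC. A minimal
sketch from user space follows; the helper name is hypothetical and the
example assumes a Unix-like system where os.O_SYNC is available.

```python
import os


def write_synchronously(path, data):
    """Write data (bytes) to path through a descriptor opened with O_SYNC,
    so each write() returns only after the data has reached the device."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        written = 0
        while written < len(data):
            # os.write may write fewer bytes than requested.
            written += os.write(fd, data[written:])
    finally:
        os.close(fd)
```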
298 | |||
299 | Limitations | ||
300 | ----------- | ||
301 | |||
302 | There are various limits imposed by the on-disk layout of ext2. Other | ||
303 | limits are imposed by the current implementation of the kernel code. | ||
304 | Many of the limits are determined at the time the filesystem is first | ||
305 | created, and depend upon the block size chosen. The ratio of inodes to | ||
306 | data blocks is fixed at filesystem creation time, so the only way to | ||
307 | increase the number of inodes is to increase the size of the filesystem. | ||
308 | No tools currently exist which can change the ratio of inodes to blocks. | ||
309 | |||
310 | Most of these limits could be overcome with slight changes in the on-disk | ||
311 | format and using a compatibility flag to signal the format change (at | ||
312 | the expense of some compatibility). | ||
313 | |||
314 | Filesystem block size: 1kB 2kB 4kB 8kB | ||
315 | |||
316 | File size limit: 16GB 256GB 2048GB 2048GB | ||
317 | Filesystem size limit: 2047GB 8192GB 16384GB 32768GB | ||
318 | |||
319 | There is a 2.4 kernel limit of 2048GB for a single block device, so no | ||
320 | filesystem larger than that can be created at this time. There is also | ||
321 | an upper limit on the block size imposed by the page size of the kernel, | ||
322 | so 8kB blocks are only allowed on Alpha systems (and other architectures | ||
323 | which support larger pages). | ||
324 | |||
325 | There is an upper limit of 32768 subdirectories in a single directory. | ||
326 | |||
327 | There is a "soft" upper limit of about 10-15k files in a single directory | ||
328 | with the current linear linked-list directory implementation. This limit | ||
329 | stems from performance problems when creating and deleting (and also | ||
330 | finding) files in such large directories. Using a hashed directory index | ||
331 | (under development) allows 100k-1M+ files in a single directory without | ||
332 | performance problems (although RAM size becomes an issue at this point). | ||
333 | |||
334 | The (meaningless) absolute upper limit of files in a single directory | ||
335 | (imposed by the file size, the realistic limit is obviously much less) | ||
336 | is over 130 trillion files. It would be higher except there are not | ||
337 | enough 4-character names to make up unique directory entries, so they | ||
338 | have to be 8-character filenames; even then we are fairly close to | ||
339 | running out of unique filenames. | ||
340 | |||
341 | Journaling | ||
342 | ---------- | ||
343 | |||
344 | A journaling extension to the ext2 code has been developed by Stephen | ||
345 | Tweedie. It avoids the risks of metadata corruption and the need to | ||
346 | wait for e2fsck to complete after a crash, without requiring a change | ||
347 | to the on-disk ext2 layout. In a nutshell, the journal is a regular | ||
348 | file which stores whole metadata (and optionally data) blocks that have | ||
349 | been modified, prior to writing them into the filesystem. This means | ||
350 | it is possible to add a journal to an existing ext2 filesystem without | ||
351 | the need for data conversion. | ||
352 | |||
353 | When changes are made to the filesystem (e.g. a file is renamed), they are stored in | ||
354 | a transaction in the journal and can either be complete or incomplete at | ||
355 | the time of a crash. If a transaction is complete at the time of a crash | ||
356 | (or in the normal case where the system does not crash), then any blocks | ||
357 | in that transaction are guaranteed to represent a valid filesystem state, | ||
358 | and are copied into the filesystem. If a transaction is incomplete at | ||
359 | the time of the crash, then there is no guarantee of consistency for | ||
360 | the blocks in that transaction so they are discarded (which means any | ||
361 | filesystem changes they represent are also lost). | ||
362 | Check Documentation/filesystems/ext3.txt if you want to read more about | ||
363 | ext3 and journaling. | ||
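
The recovery rule described above (apply complete transactions, discard
incomplete ones) can be modelled with a toy example. This is not how JBD
stores or replays its journal; the dictionary layout and names are invented
for illustration.

```python
def replay(filesystem, journal):
    """Apply every complete transaction in order; drop incomplete ones.

    filesystem maps block number -> contents. Each transaction is a dict
    with a "complete" flag and a "blocks" mapping of modified blocks.
    """
    for txn in journal:
        if txn["complete"]:
            filesystem.update(txn["blocks"])
        # Incomplete transactions are discarded: their changes are lost,
        # but the filesystem stays in the last consistent state.
    return filesystem


fs = {10: "old dir entry", 11: "old inode"}
journal = [
    {"complete": True, "blocks": {10: "renamed dir entry"}},
    {"complete": False, "blocks": {11: "half-written inode"}},
]
replay(fs, journal)
```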
364 | |||
365 | References | ||
366 | ========== | ||
367 | |||
368 | The kernel source file:/usr/src/linux/fs/ext2/ | ||
369 | e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/ | ||
370 | Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html | ||
371 | Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/ | ||
372 | Hashed Directories http://kernelnewbies.org/~phillips/htree/ | ||
373 | Filesystem Resizing http://ext2resize.sourceforge.net/ | ||
374 | Compression (*) http://www.netspace.net.au/~reiter/e2compr/ | ||
375 | |||
376 | Implementations for: | ||
377 | Windows 95/98/NT/2000 http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm | ||
378 | Windows 95 (*) http://www.yipton.demon.co.uk/content.html#FSDEXT2 | ||
379 | DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ | ||
380 | OS/2 http://perso.wanadoo.fr/matthieu.willm/ext2-os2/ | ||
381 | RISC OS client ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/ | ||
382 | |||
383 | (*) no longer actively developed/supported (as of Apr 2001) | ||
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt new file mode 100644 index 000000000000..9ab7f446f7ad --- /dev/null +++ b/Documentation/filesystems/ext3.txt | |||
@@ -0,0 +1,183 @@ | |||
1 | |||
2 | Ext3 Filesystem | ||
3 | =============== | ||
4 | |||
5 | ext3 was originally released in September 1999. Written by Stephen Tweedie | ||
6 | for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger, | ||
7 | Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie. | ||
8 | |||
9 | ext3 is the ext2 filesystem enhanced with journalling capabilities. | ||
10 | |||
11 | Options | ||
12 | ======= | ||
13 | |||
14 | When mounting an ext3 filesystem, the following options are accepted: | ||
15 | (*) == default | ||
16 | |||
17 | journal=update Update the ext3 file system's journal to the | ||
18 | current format. | ||
19 | |||
20 | journal=inum When a journal already exists, this option is | ||
21 | ignored. Otherwise, it specifies the number of | ||
22 | the inode which will represent the ext3 file | ||
23 | system's journal file. | ||
24 | |||
25 | noload Don't load the journal on mounting. | ||
26 | |||
27 | data=journal All data are committed into the journal prior | ||
28 | to being written into the main file system. | ||
29 | |||
30 | data=ordered (*) All data are forced directly out to the main file | ||
31 | system prior to its metadata being committed to | ||
32 | the journal. | ||
33 | |||
34 | data=writeback Data ordering is not preserved, data may be | ||
35 | written into the main file system after its | ||
36 | metadata has been committed to the journal. | ||
37 | |||
38 | commit=nrsec (*) Ext3 can be told to sync all its data and metadata | ||
39 | every 'nrsec' seconds. The default value is 5 seconds. | ||
40 | This means that if you lose your power, you will lose | ||
41 | at most the latest 5 seconds of work (your filesystem | ||
42 | will not be damaged though, thanks to journaling). This | ||
43 | default value (or any low value) will hurt performance, | ||
44 | but it's good for data-safety. Setting it to 0 will | ||
45 | have the same effect as leaving the default 5 sec. | ||
46 | Setting it to very large values will improve | ||
47 | performance. | ||
48 | |||
49 | barrier=1 This enables/disables barriers. barrier=0 disables them, | ||
50 | barrier=1 enables them. | ||
51 | |||
52 | orlov (*) This enables the new Orlov block allocator. It's enabled | ||
53 | by default. | ||
54 | |||
55 | oldalloc This disables the Orlov block allocator and enables the | ||
56 | old block allocator. Orlov should have better performance, | ||
57 | we'd like to get some feedback if it's the contrary for | ||
58 | you. | ||
59 | |||
60 | user_xattr (*) Enables POSIX Extended Attributes. It's enabled by | ||
61 | default, however you need to configure its support | ||
62 | (CONFIG_EXT3_FS_XATTR). This is necessary if you want | ||
63 | to use POSIX Access Control Lists support. You can visit | ||
64 | http://acl.bestbits.at to know more about POSIX Extended | ||
65 | attributes. | ||
66 | |||
67 | nouser_xattr Disables POSIX Extended Attributes. | ||
68 | |||
69 | acl (*) Enables POSIX Access Control Lists support. This is | ||
70 | enabled by default, however you need to configure | ||
71 | its support (CONFIG_EXT3_FS_POSIX_ACL). If you want | ||
72 | to know more about ACLs visit http://acl.bestbits.at | ||
73 | |||
74 | noacl This option disables POSIX Access Control List support. | ||
75 | |||
76 | reservation | ||
77 | |||
78 | noreservation | ||
79 | |||
80 | resize= | ||
81 | |||
82 | bsddf (*) Make 'df' act like BSD. | ||
83 | minixdf Make 'df' act like Minix. | ||
84 | |||
85 | check=none Don't do extra checking of bitmaps on mount. | ||
86 | nocheck | ||
87 | |||
88 | debug Extra debugging information is sent to syslog. | ||
89 | |||
90 | errors=remount-ro(*) Remount the filesystem read-only on an error. | ||
91 | errors=continue Keep going on a filesystem error. | ||
92 | errors=panic Panic and halt the machine if an error occurs. | ||
93 | |||
94 | grpid Give objects the same group ID as their parent. | ||
95 | bsdgroups | ||
96 | |||
97 | nogrpid (*) New objects have the group ID of their creator. | ||
98 | sysvgroups | ||
99 | |||
100 | resgid=n The group ID which may use the reserved blocks. | ||
101 | |||
102 | resuid=n The user ID which may use the reserved blocks. | ||
103 | |||
104 | sb=n Use alternate superblock at this location. | ||
105 | |||
106 | quota Quota options are currently silently ignored. | ||
107 | noquota (see fs/ext3/super.c, line 594) | ||
108 | grpquota | ||
109 | usrquota | ||
110 | |||
111 | |||
112 | Specification | ||
113 | ============= | ||
114 | ext3 shares its entire on-disk implementation with the ext2 filesystem, and adds | ||
115 | transaction capabilities to ext2. Journaling is done by the | ||
116 | Journaling Block Device layer. | ||
117 | |||
118 | Journaling Block Device layer | ||
119 | ----------------------------- | ||
120 | The Journaling Block Device layer (JBD) isn't ext3 specific. It was | ||
121 | designed to add journaling capabilities to a block device. The ext3 | ||
122 | filesystem code will inform the JBD of modifications it is performing | ||
123 | (called a transaction). The journal supports the transaction start and | ||
124 | stop, and in case of a crash, the journal can replay the transactions | ||
125 | to quickly put the partition back into a consistent state. | ||
126 | |||
127 | Handles represent a single atomic update to a filesystem. JBD can | ||
128 | handle an external journal on a block device. | ||
129 | |||
130 | Data Mode | ||
131 | --------- | ||
132 | There are 3 different data modes: | ||
133 | |||
134 | * writeback mode | ||
135 | In data=writeback mode, ext3 does not journal data at all. This mode | ||
136 | provides a similar level of journaling as XFS, JFS, and ReiserFS in its | ||
137 | default mode - metadata journaling. A crash+recovery can cause | ||
138 | incorrect data to appear in files which were written shortly before the | ||
139 | crash. This mode will typically provide the best ext3 performance. | ||
140 | |||
141 | * ordered mode | ||
142 | In data=ordered mode, ext3 only officially journals metadata, but it | ||
143 | logically groups metadata and data blocks into a single unit called a | ||
144 | transaction. When it's time to write the new metadata out to disk, the | ||
145 | associated data blocks are written first. In general, this mode | ||
146 | performs slightly slower than writeback but significantly faster than | ||
147 | journal mode. | ||
148 | |||
149 | * journal mode | ||
150 | data=journal mode provides full data and metadata journaling. All new | ||
151 | data is written to the journal first, and then to its final location. | ||
152 | In the event of a crash, the journal can be replayed, bringing both | ||
153 | data and metadata into a consistent state. This mode is the slowest | ||
154 | except when data needs to be read from and written to disk at the same | ||
155 | time, where it outperforms all other modes. | ||
156 | |||
157 | Compatibility | ||
158 | ------------- | ||
159 | |||
160 | Ext2 partitions can be easily converted to ext3, with `tune2fs -j <dev>`. | ||
161 | Ext3 is fully compatible with Ext2. Ext3 partitions can easily be | ||
162 | mounted as Ext2. | ||
163 | |||
164 | External Tools | ||
165 | ============== | ||
166 | See manual pages to know more. | ||
167 | |||
168 | tune2fs: create an ext3 journal on an ext2 partition with the -j flag | ||
169 | mke2fs: create an ext3 partition with the -j flag | ||
170 | debugfs: ext2 and ext3 file system debugger | ||
171 | |||
172 | References | ||
173 | ========== | ||
174 | |||
175 | kernel source: file:/usr/src/linux/fs/ext3 | ||
176 | file:/usr/src/linux/fs/jbd | ||
177 | |||
178 | programs: http://e2fsprogs.sourceforge.net | ||
179 | |||
180 | useful links: | ||
181 | http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html | ||
182 | http://www-106.ibm.com/developerworks/linux/library/l-fs7/ | ||
183 | http://www-106.ibm.com/developerworks/linux/library/l-fs8/ | ||
diff --git a/Documentation/filesystems/hfs.txt b/Documentation/filesystems/hfs.txt new file mode 100644 index 000000000000..bd0fa7704035 --- /dev/null +++ b/Documentation/filesystems/hfs.txt | |||
@@ -0,0 +1,83 @@ | |||
1 | |||
2 | Macintosh HFS Filesystem for Linux | ||
3 | ================================== | ||
4 | |||
5 | HFS stands for ``Hierarchical File System'' and is the filesystem used | ||
6 | by the Mac Plus and all later Macintosh models. Earlier Macintosh | ||
7 | models used MFS (``Macintosh File System''), which is not supported. | ||
8 | MacOS 8.1 and newer support a filesystem called HFS+ that's similar to | ||
9 | HFS but is extended in various areas. Use the hfsplus filesystem driver | ||
10 | to access such filesystems from Linux. | ||
11 | |||
12 | |||
13 | Mount options | ||
14 | ============= | ||
15 | |||
16 | When mounting an HFS filesystem, the following options are accepted: | ||
17 | |||
18 | creator=cccc, type=cccc | ||
19 | Specifies the creator/type values (as shown by the MacOS Finder) | ||
20 | used for creating new files. Default values: '????'. | ||
21 | |||
22 | uid=n, gid=n | ||
23 | Specifies the user/group that owns all files on the filesystem. | ||
24 | Default: user/group id of the mounting process. | ||
25 | |||
26 | dir_umask=n, file_umask=n, umask=n | ||
27 | Specifies the umask used for all directories, all files, or all | ||
28 | files and directories. Defaults to the umask of the mounting process. | ||
29 | |||
30 | session=n | ||
31 | Selects the CDROM session to mount as an HFS filesystem. Defaults to | ||
32 | leaving that decision to the CDROM driver. This option will fail | ||
33 | with anything but a CDROM as the underlying device. | ||
34 | |||
35 | part=n | ||
36 | Selects partition number n from the device. This only makes | ||
37 | sense for CDROMs because they can't be partitioned under Linux. | ||
38 | For disk devices the generic partition parsing code does this | ||
39 | for us. Defaults to not parsing the partition table at all. | ||
40 | |||
41 | quiet | ||
42 | Ignore invalid mount options instead of complaining. | ||
43 | |||
44 | |||
45 | Writing to HFS Filesystems | ||
46 | ========================== | ||
47 | |||
48 | HFS is not a UNIX filesystem, thus it does not have the usual features you'd | ||
49 | expect: | ||
50 | |||
51 | o You can't modify the set-uid, set-gid, sticky or executable bits or the uid | ||
52 | and gid of files. | ||
53 | o You can't create hard- or symlinks, device files, sockets or FIFOs. | ||
54 | |||
55 | HFS does, on the other hand, have the concept of multiple forks per file. These | ||
56 | non-standard forks are represented as hidden additional files in the normal | ||
57 | filesystem namespace, which is kind of a kludge and makes the semantics for | ||
58 | them a little strange: | ||
59 | |||
60 | o You can't create, delete or rename resource forks of files or the | ||
61 | Finder's metadata. | ||
62 | o They are however created (with default values), deleted and renamed | ||
63 | along with the corresponding data fork or directory. | ||
64 | o Copying files to a different filesystem will lose those attributes | ||
65 | that are essential for MacOS to work. | ||
66 | |||
67 | |||
68 | Creating HFS filesystems | ||
69 | ======================== | ||
70 | |||
71 | The hfsutils package from Robert Leslie contains a program called | ||
72 | hformat that can be used to create an HFS filesystem. See | ||
73 | <http://www.mars.org/home/rob/proj/hfs/> for details. | ||
74 | |||
75 | |||
76 | Credits | ||
77 | ======= | ||
78 | |||
79 | The HFS driver was written by Paul H. Hargrove (hargrove@sccm.Stanford.EDU) | ||
80 | and is now maintained by Roman Zippel (roman@ardistech.com) at Ardis | ||
81 | Technologies. | ||
82 | Roman rewrote large parts of the code and brought in btree routines derived | ||
83 | from Brad Boyer's hfsplus driver (also maintained by Roman now). | ||
diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.txt new file mode 100644 index 000000000000..33dc360c8e89 --- /dev/null +++ b/Documentation/filesystems/hpfs.txt | |||
@@ -0,0 +1,296 @@ | |||
1 | Read/Write HPFS 2.09 | ||
2 | 1998-2004, Mikulas Patocka | ||
3 | |||
4 | email: mikulas@artax.karlin.mff.cuni.cz | ||
5 | homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi | ||
6 | |||
7 | CREDITS: | ||
8 | Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file | ||
9 | is taken from it | ||
10 | Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993) | ||
11 | Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion | ||
12 | |||
13 | Mount options | ||
14 | |||
15 | uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask) | ||
16 | Set owner/group/mode for files that do not have it specified in extended | ||
17 | attributes. Mode is the inverted umask - for example umask 027 gives the owner | ||
18 | all permissions, the group read permission and everybody else no access. Note | ||
19 | that for files the mode is ANDed with 0666. If you want files to have 'x' | ||
20 | rights, you must use extended attributes. | ||
21 | case=lower,asis (default asis) | ||
22 | File name lowercasing in readdir. | ||
23 | conv=binary,text,auto (default binary) | ||
24 | CR/LF -> LF conversion. If auto, the decision is made according to the extension | ||
25 | - there is a list of text extensions (I think it's better to not convert a | ||
26 | text file than to damage a binary file). If you want to change that list, | ||
27 | change it in the source. The original read-only HPFS contained a strange | ||
28 | heuristic algorithm that I removed. I think it's dangerous to let the | ||
29 | computer decide whether a file is text or binary. For example, DJGPP | ||
30 | binaries contain a small text message at the beginning and they could be | ||
31 | misidentified and damaged under some circumstances. | ||
32 | check=none,normal,strict (default normal) | ||
33 | Check level. Selecting none gives only a small speedup and is very | ||
34 | dangerous. I tried to write it so that it won't crash with check=normal on | ||
35 | corrupted filesystems. check=strict means many superfluous checks - | ||
36 | used for debugging (for example it checks if file is allocated in | ||
37 | bitmaps when accessing it). | ||
38 | errors=continue,remount-ro,panic (default remount-ro) | ||
39 | Behaviour when filesystem errors are found. | ||
40 | chkdsk=no,errors,always (default errors) | ||
41 | When to mark filesystem dirty so that OS/2 checks it. | ||
42 | eas=no,ro,rw (default rw) | ||
43 | What to do with extended attributes. 'no' - ignore them and always use | ||
44 | values specified in uid/gid/mode options. 'ro' - read extended | ||
45 | attributes but do not create them. 'rw' - create extended attributes | ||
46 | when you use chmod/chown/chgrp/mknod/ln -s on the filesystem. | ||
47 | timeshift=(-)nnn (default 0) | ||
48 | Shifts the time by nnn seconds. For example, if times under Linux appear | ||
49 | one hour later than under OS/2, use timeshift=-3600. | ||
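
The umask rule described under uid/gid/umask above can be sketched in a few
lines of Python. This is only an illustration of the arithmetic stated in the
text (mode is the inverted umask, and for files it is ANDed with 0666); the
function name default_mode is made up for this example and is not part of the
driver, and leaving directories unmasked is an assumption.

```python
def default_mode(umask: int, is_dir: bool) -> int:
    """Hypothetical helper: permission bits implied by the umask mount option."""
    mode = ~umask & 0o777      # the mode is the inverted umask
    if not is_dir:
        mode &= 0o666          # files never get 'x' bits from the umask
    return mode


if __name__ == "__main__":
    print(oct(default_mode(0o027, is_dir=True)))   # directory case
    print(oct(default_mode(0o027, is_dir=False)))  # regular file case
```

With umask=027 the sketch yields 0750 for directories and 0640 for files,
which matches "owner all, group read, others nothing" for files.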
50 | |||
51 | |||
52 | File names | ||
53 | |||
54 | As in OS/2, filenames are case insensitive. However, the shell thinks that names | ||
55 | are case sensitive, so for example when you create a file FOO, you can use | ||
56 | 'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note that you | ||
57 | also won't be able to compile the Linux kernel (and maybe other things) on HPFS | ||
58 | because the kernel build creates different files with names like bootsect.S and | ||
59 | bootsect.s. When searching for a file whose name has characters >= 128, codepages | ||
60 | are used - see below. | ||
61 | OS/2 ignores dots and spaces at the end of a file name, so this driver does as | ||
62 | well. If you create 'a. ...', the file 'a' will be created, but you can still | ||
63 | access it under names 'a.', 'a..', 'a . . . ' etc. | ||
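
The naming rules above can be modelled with a small Python sketch. It is a toy
model, not the driver's code: lookup_key is a hypothetical helper, and it
folds case for ASCII only (the real driver uses the codepage tables for
characters >= 128). The fnmatch calls imitate what the shell does when it
expands a glob itself against the names returned by readdir.

```python
import fnmatch


def lookup_key(name: str) -> str:
    """Hypothetical normalisation: drop trailing dots/spaces, fold ASCII case."""
    return name.rstrip(". ").upper()


if __name__ == "__main__":
    # All of these spellings refer to the same file 'a'.
    for spelling in ("a", "a.", "a..", "a . . . ", "A"):
        print(repr(spelling), "->", lookup_key(spelling))

    # The shell matches globs case-sensitively against the stored name,
    # so 'F*' finds FOO but 'f*' does not.
    print(fnmatch.fnmatchcase("FOO", "F*"))
    print(fnmatch.fnmatchcase("FOO", "f*"))
```

A direct open of 'foo' succeeds because the driver compares names case
insensitively, while 'f*' never reaches the driver as a name at all.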
64 | |||
65 | |||
66 | Extended attributes | ||
67 | |||
68 | On HPFS partitions, OS/2 can associate special information with each file, called | ||
69 | extended attributes. Extended attributes are pairs of (key,value) where key is | ||
70 | an ascii string identifying that attribute and value is any string of bytes of | ||
71 | variable length. OS/2 stores window and icon positions and file types there. So | ||
72 | why not use it for unix-specific info like file owner or access rights? This | ||
73 | driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended | ||
74 | attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only | ||
75 | those extended attributes whose values differ from the defaults specified in mount | ||
76 | options are created. Once created, the extended attributes are never deleted, | ||
77 | they're just changed. It means that when your default uid=0 and you type | ||
78 | something like 'chown luser file; chown root file' the file will contain | ||
79 | extended attribute UID=0. And when you umount the fs and mount it again with | ||
80 | uid=luser_uid, the file will be still owned by root! If you chmod file to 444, | ||
81 | extended attribute "MODE" will not be set, this special case is done by setting | ||
82 | read-only flag. When you mknod a block or char device, besides "MODE", the | ||
83 | special 4-byte extended attribute "DEV" will be created containing the device | ||
84 | number. Currently this driver cannot resize extended attributes - it means | ||
85 | that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV" | ||
86 | attributes with different sizes, they won't be rewritten and changing these | ||
87 | values doesn't work. | ||
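
The "created only when different, never deleted" behaviour described above is
easy to get wrong in one's head, so here is a toy Python model of it. The
class ToyHpfs and its methods are invented for this sketch and say nothing
about the driver's real data structures; only the UID attribute is modelled.

```python
class ToyHpfs:
    """Toy model of the "UID" extended attribute (EA) rules, for illustration."""

    def __init__(self):
        self.ea = {}  # file name -> {EA key: value}

    def chown(self, name, uid, default_uid):
        attrs = self.ea.setdefault(name, {})
        # An EA is created only if the value differs from the mount default,
        # but once it exists it is only ever changed, never deleted.
        if "UID" in attrs or uid != default_uid:
            attrs["UID"] = uid

    def owner(self, name, default_uid):
        # Files without a "UID" EA fall back to the uid= mount option.
        return self.ea.get(name, {}).get("UID", default_uid)


if __name__ == "__main__":
    fs = ToyHpfs()
    fs.chown("file", 1000, default_uid=0)   # chown luser file
    fs.chown("file", 0, default_uid=0)      # chown root file
    # Remount with uid=1000: the stored UID=0 still wins.
    print(fs.owner("file", default_uid=1000))
    print(fs.owner("untouched", default_uid=1000))
```

In this model the file that was chowned twice stays owned by uid 0 after the
remount, while a file that was never chowned follows the new default.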
88 | |||
89 | |||
90 | Symlinks | ||
91 | |||
92 | You can create symlinks on an HPFS partition; they are implemented by setting an | ||
93 | extended attribute named "SYMLINK" with the symlink value. Like on ext2, you can chown and | ||
94 | chgrp symlinks but I don't know what it is good for. Chmoding a symlink results | ||
95 | in chmoding the file the symlink points to. These symlinks are just for Linux use and | ||
96 | incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are | ||
97 | stored in a very crazy way. They tried to do it so that the link changes when the file is | ||
98 | moved ... sometimes it works. But the link is partly stored in directory | ||
99 | extended attributes and partly in OS2SYS.INI. I don't want (and don't know how) | ||
100 | to analyze or change OS2SYS.INI. | ||
101 | |||
102 | |||
103 | Codepages | ||
104 | |||
105 | HPFS can contain several uppercasing tables for several codepages and each | ||
106 | file has a pointer to the codepage its name is in. However OS/2 was created in | ||
107 | America where people don't care much about codepages and so multiple codepages | ||
108 | support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk. | ||
109 | Once I booted English OS/2 working in cp 850 and I created a file on my 852 | ||
110 | partition. It marked file name codepage as 850 - good. But when I again booted | ||
111 | Czech OS/2, the file was completely inaccessible under any name. It seems that | ||
112 | OS/2 uppercases the search pattern with its system code page (852) and file | ||
113 | name it's comparing to with its code page (850). These could never match. Is it | ||
114 | really what IBM developers wanted? But problems continued. When I created in | ||
115 | Czech OS/2 another file in that directory, that file was inaccessible too. OS/2 | ||
116 | probably uses different uppercasing method when searching where to place a file | ||
117 | (note, that files in HPFS directory must be sorted) and when searching for | ||
118 | a file. Finally when I opened this directory in PmShell, PmShell crashed (the | ||
119 | funny thing was that, when rebooted, PmShell tried to reopen this directory | ||
120 | again :-). chkdsk happily ignores these errors and only low-level disk | ||
121 | modification saved me. Never mix different language versions of OS/2 on one | ||
122 | system although HPFS was designed to allow that. | ||
123 | OK, I could implement complex codepage support to this driver but I think it | ||
124 | would cause more problems than benefit with such buggy implementation in OS/2. | ||
125 | So this driver simply uses first codepage it finds for uppercasing and | ||
126 | lowercasing no matter what the file's codepage index is. Usually all file names are in | ||
127 | this codepage - if you don't try to do what I described above :-) | ||
128 | |||
129 | |||
130 | Known bugs | ||
131 | |||
132 | HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client | ||
133 | should work. If you have OS/2 server, use only read-only mode. I don't know how | ||
134 | to handle some HPFS386 structures like access control list or extended perm | ||
135 | list, I don't know how to delete them when file is deleted and how to not | ||
136 | overwrite them with extended attributes. Send me some info on these structures | ||
137 | and I'll make it. However, this driver should detect presence of HPFS386 | ||
138 | structures, remount read-only and not destroy them (I hope). | ||
139 | |||
140 | When there's not enough space for extended attributes, they will be truncated | ||
141 | and no error is returned. | ||
142 | |||
143 | OS/2 can't access files if the path is longer than about 256 chars but this | ||
144 | driver allows you to do it. chkdsk ignores such errors. | ||
145 | |||
146 | Sometimes you won't be able to delete some files on a very full filesystem | ||
147 | (returning error ENOSPC). That's because file in non-leaf node in directory tree | ||
148 | (one directory, if it's large, has dirents in tree on HPFS) must be replaced | ||
149 | with another node when deleted. And that new file might have larger name than | ||
150 | the old one so the new name doesn't fit in directory node (dnode). And that | ||
151 | would result in directory tree splitting, that takes disk space. Workaround is | ||
152 | to delete other files that are leaf (probability that the file is non-leaf is | ||
153 | about 1/50) or to truncate file first to make some space. | ||
154 | You encounter this problem only if you have many directories so that | ||
155 | preallocated directory band is full i.e. | ||
156 | number_of_directories / size_of_filesystem_in_mb > 4. | ||
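
As a quick way to apply the rule of thumb above, the following Python sketch
evaluates the stated ratio. The function name is hypothetical, and the
threshold of 4 directories per megabyte is taken directly from the text, not
measured.

```python
def dir_band_may_be_full(number_of_directories: int,
                         size_of_filesystem_in_mb: float) -> bool:
    """Hypothetical check: True when the preallocated directory band is likely full."""
    return number_of_directories / size_of_filesystem_in_mb > 4


if __name__ == "__main__":
    print(dir_band_may_be_full(5000, 1000))  # 5 directories per MB
    print(dir_band_may_be_full(3000, 1000))  # 3 directories per MB
```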
157 | |||
158 | You can't delete open directories. | ||
159 | |||
160 | You can't rename over directories (what is it good for?). | ||
161 | |||
162 | Renaming files so that only case changes doesn't work. This driver supports it | ||
163 | but vfs doesn't. Something like 'mv file FILE' won't work. | ||
164 | |||
165 | Atimes and directory mtimes are not updated, for performance reasons. | ||
166 | If you really want them updated, let me know and I'll write it (but | ||
167 | it will be slow). | ||
168 | |||
169 | When the system is out of memory and swap, it may slightly corrupt filesystem | ||
170 | (lost files, unbalanced directories). (I guess all filesystems may do it). | ||
171 | |||
172 | When compiled, you get the warning: function declaration isn't a prototype. Does | ||
173 | anybody know what it means? | ||
174 | |||
175 | |||
176 | What does "unbalanced tree" message mean? | ||
177 | |||
178 | Old versions of this driver sometimes created unbalanced dnode trees. OS/2 | ||
179 | chkdsk doesn't scream if the tree is unbalanced (and sometimes creates | ||
180 | unbalanced trees too :-) but both HPFS and HPFS386 contain a bug that, in rare | ||
181 | cases, crashes them when the tree is not balanced. This driver handles unbalanced trees | ||
182 | correctly and writes a warning if it finds them. If you see this message, this is | ||
183 | probably because of directories created with an old version of this driver. | ||
184 | The workaround is to move all files from that directory to another and then back | ||
185 | again. Do it in Linux, not OS/2! If you see this message in a directory that was | ||
186 | wholly created by this driver, it is a BUG - let me know about it. | ||
187 | |||
188 | |||
189 | Bugs in OS/2 | ||
190 | |||
191 | When you have two (or more) lost directories pointing to each other, chkdsk | ||
192 | locks up when repairing filesystem. | ||
193 | |||
194 | Sometimes (I think it's random) when you create a file with one-char name under | ||
195 | OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs | ||
196 | error corrected". | ||
197 | |||
198 | File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and | ||
199 | marks them as short (and writes "minor fs error corrected"). This bug is not in | ||
200 | HPFS386. | ||
201 | |||
202 | Codepage bugs described above. | ||
203 | |||
204 | If you don't install fixpacks, there are many, many more... | ||
205 | |||
206 | |||
207 | History | ||
208 | |||
209 | 0.90 First public release | ||
210 | 0.91 Fixed bug that caused shooting to memory when write_inode was called on | ||
211 | open inode (rarely happened) | ||
212 | 0.92 Fixed a little memory leak in freeing directory inodes | ||
213 | 0.93 Fixed bug that locked up the machine when there were too many filenames | ||
214 | with first 15 characters same | ||
215 | Fixed write_file to zero file when writing behind file end | ||
216 | 0.94 Fixed a little memory leak when trying to delete busy file or directory | ||
217 | 0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files | ||
218 | 1.90 First version for 2.1.1xx kernels | ||
219 | 1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk | ||
220 | Fixed a race-condition when write_inode is called while deleting file | ||
221 | Fixed a bug that could possibly happen (with very low probability) when | ||
222 | using 0xff in filenames | ||
223 | Rewritten locking to avoid race-conditions | ||
224 | Mount option 'eas' now works | ||
225 | Fsync no longer returns error | ||
226 | Files beginning with '.' are marked hidden | ||
227 | Remount support added | ||
228 | Alloc is not so slow when filesystem becomes full | ||
229 | Atimes are no more updated because it slows down operation | ||
230 | Code cleanup (removed all commented debug prints) | ||
231 | 1.92 Corrected a bug when sync was called just before closing file | ||
232 | 1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it | ||
233 | works with previous versions | ||
234 | Fixed a possible problem with disks > 64G (but I don't have one, so I can't | ||
235 | test it) | ||
236 | Fixed a file overflow at 2G | ||
237 | Added new option 'timeshift' | ||
238 | Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in | ||
239 | read-only mode | ||
240 | Fixed a bug that slowed down alloc and prevented allocating 100% space | ||
241 | (this bug was not destructive) | ||
242 | 1.94 Added workaround for one bug in Linux | ||
243 | Fixed one buffer leak | ||
244 | Fixed some incompatibilities with large extended attributes (but it's still | ||
245 | not 100% ok, I have no info on it and OS/2 doesn't want to create them) | ||
246 | Rewritten allocation | ||
247 | Fixed a bug with i_blocks (du sometimes didn't display correct values) | ||
248 | Directories have no longer archive attribute set (some programs don't like | ||
249 | it) | ||
250 | Fixed a bug that it set badly one flag in large anode tree (it was not | ||
251 | destructive) | ||
252 | 1.95 Fixed one buffer leak, that could happen on corrupted filesystem | ||
253 | Fixed one bug in allocation in 1.94 | ||
254 | 1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported | ||
255 | error sometimes when opening directories in PMSHELL) | ||
256 | Fixed a possible bitmap race | ||
257 | Fixed possible problem on large disks | ||
258 | You can now delete open files | ||
259 | Fixed a nondestructive race in rename | ||
260 | 1.97 Support for HPFS v3 (on large partitions) | ||
261 | Fixed a bug that it didn't allow creation of files > 128M (it should be 2G) | ||
262 | 1.97.1 Changed names of global symbols | ||
263 | Fixed a bug when chmoding or chowning root directory | ||
264 | 1.98 Fixed a deadlock when using old_readdir | ||
265 | Better directory handling; workaround for "unbalanced tree" bug in OS/2 | ||
266 | 1.99 Corrected a possible problem when there's not enough space while deleting | ||
267 | file | ||
268 | Now it tries to truncate the file if there's not enough space when deleting | ||
269 | Removed a lot of redundant code | ||
270 | 2.00 Fixed a bug in rename (it was there since 1.96) | ||
271 | Better anti-fragmentation strategy | ||
272 | 2.01 Fixed problem with directory listing over NFS | ||
273 | Directory lseek now checks for proper parameters | ||
274 | Fixed race-condition in buffer code - it is in all filesystems in Linux; | ||
275 | when reading device (cat /dev/hda) while creating files on it, files | ||
276 | could be damaged | ||
277 | 2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond | ||
278 | end of partition | ||
279 | 2.03 Char, block devices and pipes are correctly created | ||
280 | Fixed non-crashing race in unlink (Alexander Viro) | ||
281 | Now it works with Japanese version of OS/2 | ||
282 | 2.04 Fixed error when ftruncate used to extend file | ||
283 | 2.05 Fixed crash when got mount parameters without = | ||
284 | Fixed crash when allocation of anode failed due to full disk | ||
285 | Fixed some crashes when block io or inode allocation failed | ||
286 | 2.06 Fixed some crash on corrupted disk structures | ||
287 | Better allocation strategy | ||
288 | Reschedule points added so that it doesn't lock CPU long time | ||
289 | It should work in read-only mode on Warp Server | ||
290 | 2.07 More fixes for Warp Server. Now it really works | ||
291 | 2.08 Creating new files is not so slow on large disks | ||
292 | An attempt to sync deleted file does not generate filesystem error | ||
293 | 2.09 Fixed error on extremely fragmented files | ||
294 | |||
295 | |||
296 | vim: set textwidth=80: | ||
diff --git a/Documentation/filesystems/isofs.txt b/Documentation/filesystems/isofs.txt new file mode 100644 index 000000000000..f64a10506689 --- /dev/null +++ b/Documentation/filesystems/isofs.txt | |||
@@ -0,0 +1,38 @@ | |||
1 | Mount options that are the same as for msdos and vfat partitions. | ||
2 | |||
3 | gid=nnn All files in the partition will be in group nnn. | ||
4 | uid=nnn All files in the partition will be owned by user id nnn. | ||
5 | umask=nnn The permission mask (see umask(1)) for the partition. | ||
6 | |||
7 | Mount options that are the same as vfat partitions. These are only useful | ||
8 | when using discs encoded using Microsoft's Joliet extensions. | ||
9 | iocharset=name Character set to use for converting from Unicode to | ||
10 | ASCII. Joliet filenames are stored in Unicode format, but | ||
11 | Unix for the most part doesn't know how to deal with Unicode. | ||
12 | There is also an option of doing UTF8 translations with the | ||
13 | utf8 option. | ||
14 | utf8 Encode Unicode names in UTF8 format. Default is no. | ||
15 | |||
16 | Mount options unique to the isofs filesystem. | ||
17 | block=512 Set the block size for the disk to 512 bytes | ||
18 | block=1024 Set the block size for the disk to 1024 bytes | ||
19 | block=2048 Set the block size for the disk to 2048 bytes | ||
20 | check=relaxed Matches filenames with different cases | ||
21 | check=strict Matches only filenames with the exact same case | ||
22 | cruft Try to handle badly formatted CDs. | ||
23 | map=off Do not map non-Rock Ridge filenames to lower case | ||
24 | map=normal Map non-Rock Ridge filenames to lower case | ||
25 | map=acorn As map=normal but also apply Acorn extensions if present | ||
26 | mode=xxx Sets the permissions on files to xxx | ||
27 | nojoliet Ignore Joliet extensions if they are present. | ||
28 | norock Ignore Rock Ridge extensions if they are present. | ||
29 | unhide Show hidden files. | ||
30 | session=x Select number of session on multisession CD | ||
31 | sbsector=xxx Session begins from sector xxx | ||
32 | |||
33 | Recommended documents about ISO 9660 standard are located at: | ||
34 | http://www.y-adagio.com/public/standards/iso_cdromr/tocont.htm | ||
35 | ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf | ||
36 | Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically | ||
37 | identical with ISO 9660.", so it is a valid and gratis substitute of the | ||
38 | official ISO specification. | ||
diff --git a/Documentation/filesystems/jfs.txt b/Documentation/filesystems/jfs.txt new file mode 100644 index 000000000000..3e992daf99ad --- /dev/null +++ b/Documentation/filesystems/jfs.txt | |||
@@ -0,0 +1,35 @@ | |||
1 | IBM's Journaled File System (JFS) for Linux | ||
2 | |||
3 | JFS Homepage: http://jfs.sourceforge.net/ | ||
4 | |||
5 | The following mount options are supported: | ||
6 | |||
7 | iocharset=name Character set to use for converting from Unicode to | ||
8 | ASCII. The default is to do no conversion. Use | ||
9 | iocharset=utf8 for UTF8 translations. This requires | ||
10 | CONFIG_NLS_UTF8 to be set in the kernel .config file. | ||
11 | iocharset=none specifies the default behavior explicitly. | ||
12 | |||
13 | resize=value Resize the volume to <value> blocks. JFS only supports | ||
14 | growing a volume, not shrinking it. This option is only | ||
15 | valid during a remount, when the volume is mounted | ||
16 | read-write. The resize keyword with no value will grow | ||
17 | the volume to the full size of the partition. | ||
18 | |||
19 | nointegrity Do not write to the journal. The primary use of this option | ||
20 | is to allow for higher performance when restoring a volume | ||
21 | from backup media. The integrity of the volume is not | ||
22 | guaranteed if the system abends. | ||
23 | |||
24 | integrity Default. Commit metadata changes to the journal. Use this | ||
25 | option to remount a volume where the nointegrity option was | ||
26 | previously specified in order to restore normal behavior. | ||
27 | |||
28 | errors=continue Keep going on a filesystem error. | ||
29 | errors=remount-ro Default. Remount the filesystem read-only on an error. | ||
30 | errors=panic Panic and halt the machine if an error occurs. | ||
31 | |||
32 | Please send bugs, comments, cards and letters to shaggy@austin.ibm.com. | ||
33 | |||
34 | The JFS mailing list can be subscribed to by using the link labeled | ||
35 | "Mail list Subscribe" at our web page http://jfs.sourceforge.net/ | ||
diff --git a/Documentation/filesystems/ncpfs.txt b/Documentation/filesystems/ncpfs.txt new file mode 100644 index 000000000000..f12c30c93f2f --- /dev/null +++ b/Documentation/filesystems/ncpfs.txt | |||
@@ -0,0 +1,12 @@ | |||
1 | The ncpfs filesystem understands the NCP protocol, designed by the | ||
2 | Novell Corporation for their NetWare(tm) product. NCP is functionally | ||
3 | similar to the NFS used in the TCP/IP community. | ||
4 | To mount a NetWare filesystem, you need a special mount program, which | ||
5 | can be found in the ncpfs package. The home site for ncpfs is | ||
6 | ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors | ||
7 | will have it as well. | ||
8 | |||
9 | Related products are linware and mars_nwe, which will give Linux partial | ||
10 | NetWare server functionality. Linware's home site is | ||
11 | klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on | ||
12 | ftp.gwdg.de/pub/linux/misc/ncpfs. | ||
diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt new file mode 100644 index 000000000000..f89b440fad1d --- /dev/null +++ b/Documentation/filesystems/ntfs.txt | |||
@@ -0,0 +1,630 @@ | |||
1 | The Linux NTFS filesystem driver | ||
2 | ================================ | ||
3 | |||
4 | |||
5 | Table of contents | ||
6 | ================= | ||
7 | |||
8 | - Overview | ||
9 | - Web site | ||
10 | - Features | ||
11 | - Supported mount options | ||
12 | - Known bugs and (mis-)features | ||
13 | - Using NTFS volume and stripe sets | ||
14 | - The Device-Mapper driver | ||
15 | - The Software RAID / MD driver | ||
16 | - Limitations when using the MD driver | ||
17 | - ChangeLog | ||
18 | |||
19 | |||
20 | Overview | ||
21 | ======== | ||
22 | |||
23 | Linux-NTFS comes with a number of user-space programs known as ntfsprogs. | ||
24 | These include mkntfs, a full-featured ntfs file system format utility, | ||
25 | ntfsundelete used for recovering files that were unintentionally deleted | ||
26 | from an NTFS volume and ntfsresize which is used to resize an NTFS partition. | ||
27 | See the web site for more information. | ||
28 | |||
29 | To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file | ||
30 | system type 'ntfs'. The driver currently supports read-only mode (with no | ||
31 | fault-tolerance, encryption or journalling) and very limited, but safe, write | ||
32 | support. | ||
33 | |||
34 | For fault tolerance and raid support (i.e. volume and stripe sets), you can | ||
35 | use the kernel's Software RAID / MD driver. See section "Using NTFS volume and | ||
36 | stripe sets" for details. | ||
37 | |||
38 | |||
39 | Web site | ||
40 | ======== | ||
41 | |||
42 | There is plenty of additional information on the linux-ntfs web site | ||
43 | at http://linux-ntfs.sourceforge.net/ | ||
44 | |||
45 | The web site has a lot of additional information, such as a comprehensive | ||
46 | FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS | ||
47 | userspace utilities, etc. | ||
48 | |||
49 | |||
50 | Features | ||
51 | ======== | ||
52 | |||
53 | - This is a complete rewrite of the NTFS driver that used to be in the kernel. | ||
54 | This new driver implements NTFS read support and is functionally equivalent | ||
55 | to the old ntfs driver. | ||
56 | - The new driver has full support for sparse files on NTFS 3.x volumes which | ||
57 | the old driver isn't happy with. | ||
58 | - The new driver supports execution of binaries due to mmap() now being | ||
59 | supported. | ||
60 | - The new driver supports loopback mounting of files on NTFS which is used by | ||
61 | some Linux distributions to enable the user to run Linux from an NTFS | ||
62 | partition by creating a large file while in Windows and then loopback | ||
63 | mounting the file while in Linux and creating a Linux filesystem on it that | ||
64 | is used to install Linux on it. | ||
65 | - A comparison of the two drivers using: | ||
66 | time find . -type f -exec md5sum "{}" \; | ||
67 | run three times in sequence with each driver (after a reboot) on a 1.4GiB | ||
68 | NTFS partition, showed the new driver to be 20% faster in total time elapsed | ||
69 | (from 9:43 minutes on average down to 7:53). The time spent in user space | ||
70 | was unchanged but the time spent in the kernel was decreased by a factor of | ||
71 | 2.5 (from 85 CPU seconds down to 33). | ||
72 | - The driver does not support short file names in general. For backwards | ||
73 | compatibility, we implement access to files using their short file names if | ||
74 | they exist. The driver will not create short file names however, and a | ||
75 | rename will discard any existing short file name. | ||
76 | - The new driver supports exporting of mounted NTFS volumes via NFS. | ||
77 | - The new driver supports async io (aio). | ||
78 | - The new driver supports fsync(2), fdatasync(2), and msync(2). | ||
79 | - The new driver supports readv(2) and writev(2). | ||
80 | - The new driver supports access time updates (including mtime and ctime). | ||
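
The timing figures quoted in the driver comparison above can be sanity-checked
with a little arithmetic. The Python sketch below only re-derives ratios from
the numbers given in the text (9:43 and 7:53 minutes, 85 and 33 CPU seconds);
the helper name to_seconds is made up for this example.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a m:ss wall-clock figure to seconds."""
    return minutes * 60 + seconds


old_elapsed = to_seconds(9, 43)   # old driver, average elapsed time
new_elapsed = to_seconds(7, 53)   # new driver, average elapsed time

speedup = old_elapsed / new_elapsed                       # throughput ratio
time_saved = (old_elapsed - new_elapsed) / old_elapsed    # fraction of time saved
kernel_factor = 85 / 33                                   # kernel CPU time ratio

if __name__ == "__main__":
    print(f"speedup: {speedup:.2f}x")
    print(f"elapsed time saved: {time_saved:.1%}")
    print(f"kernel time factor: {kernel_factor:.2f}")
```

The elapsed time drops by roughly 19% (a speedup of about 1.23x), which is
consistent with the "20% faster" figure, and the kernel time ratio comes out
close to the quoted factor of 2.5.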
81 | |||
82 | |||
83 | Supported mount options | ||
84 | ======================= | ||
85 | |||
86 | In addition to the generic mount options described by the manual page for the | ||
87 | mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the | ||
88 | following mount options: | ||
89 | |||
90 | iocharset=name Deprecated option. Still supported but please use | ||
91 | nls=name in the future. See description for nls=name. | ||
92 | |||
93 | nls=name Character set to use when returning file names. | ||
94 | Unlike VFAT, NTFS suppresses names that contain | ||
95 | unconvertible characters. Note that most character | ||
96 | sets contain insufficient characters to represent all | ||
97 | possible Unicode characters that can exist on NTFS. | ||
98 | To be sure you are not missing any files, you are | ||
99 | advised to use nls=utf8 which is capable of | ||
100 | representing all Unicode characters. | ||
101 | |||
102 | utf8=<bool> Option no longer supported. Currently mapped to | ||
103 | nls=utf8 but please use nls=utf8 in the future and | ||
104 | make sure utf8 is compiled either as module or into | ||
105 | the kernel. See description for nls=name. | ||
106 | |||
107 | uid= | ||
108 | gid= | ||
109 | umask= Provide default owner, group, and access mode mask. | ||
110 | These options work as documented in mount(8). By | ||
111 | default, the files/directories are owned by root and | ||
112 | he/she has read and write permissions, as well as | ||
113 | browse permission for directories. No one else has any | ||
114 | access permissions. I.e. the mode on all files is by | ||
115 | default rw------- and for directories rwx------, a | ||
116 | consequence of the default fmask=0177 and dmask=0077. | ||
117 | Using a umask of zero will grant all permissions to | ||
118 | everyone, i.e. all files and directories will have mode | ||
119 | rwxrwxrwx. | ||
120 | |||
121 | fmask= | ||
122 | dmask= Instead of specifying umask which applies both to | ||
123 | files and directories, fmask applies only to files and | ||
124 | dmask only to directories. | ||
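As a rough illustration of how the masks above translate into the modes shown by ls, the following sketch clears the mask bits from a full rwxrwxrwx mode. The helper name effective_mode is hypothetical, and the assumption that the mode is simply 0777 with the mask bits removed is inferred from the defaults quoted above rather than taken from the driver source.

```python
import stat


def effective_mode(mask: int, is_dir: bool = False) -> str:
    """Return the ls-style mode string left after applying a mask to 0777."""
    # Assumption: the mask removes bits from a full rwxrwxrwx mode.
    perms = 0o777 & ~mask
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(file_type | perms)


if __name__ == "__main__":
    print(effective_mode(0o177))               # default fmask, regular file
    print(effective_mode(0o077, is_dir=True))  # default dmask, directory
    print(effective_mode(0o000))               # umask of zero
```

With the default fmask=0177 and dmask=0077 this reproduces the rw------- and rwx------ modes described above.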
125 | |||
126 | sloppy=<BOOL> If sloppy is specified, ignore unknown mount options. | ||
127 | Otherwise the default behaviour is to abort mount if | ||
128 | any unknown options are found. | ||
129 | |||
130 | show_sys_files=<BOOL> If show_sys_files is specified, show the system files | ||
131 | in directory listings. Otherwise the default behaviour | ||
132 | is to hide the system files. | ||
133 | Note that even when show_sys_files is specified, "$MFT" | ||
134 | will not be visible due to bugs/mis-features in glibc. | ||
135 | Further, note that irrespective of show_sys_files, all | ||
136 | files are accessible by name, i.e. you can always do | ||
137 | "ls -l \$UpCase" for example to specifically show the | ||
138 | system file containing the Unicode upcase table. | ||
139 | |||
140 | case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as | ||
141 | case sensitive and create file names in the POSIX | ||
142 | namespace. Otherwise the default behaviour is to treat | ||
143 | file names as case insensitive and to create file names | ||
144 | in the WIN32/LONG name space. Note, the Linux NTFS | ||
145 | driver will never create short file names and will | ||
146 | remove them on rename/delete of the corresponding long | ||
147 | file name. | ||
148 | Note that files remain accessible via their short file | ||
149 | name, if it exists. If case_sensitive, you will need | ||
150 | to provide the correct case of the short file name. | ||
151 | |||
152 | errors=opt What to do when critical file system errors are found. | ||
153 | Following values can be used for "opt": | ||
154 | continue: DEFAULT, try to clean-up as much as | ||
155 | possible, e.g. marking a corrupt inode as | ||
156 | bad so it is no longer accessed, and then | ||
157 | continue. | ||
158 | recover: At present, only recovery of the boot | ||
159 | sector from the backup copy is supported. | ||
160 | If read-only mount, the recovery is done | ||
161 | in memory only and not written to disk. | ||
162 | Note that the options are additive, i.e. specifying: | ||
163 | errors=continue,errors=recover | ||
164 | means the driver will attempt to recover and if that | ||
165 | fails it will clean-up as much as possible and | ||
166 | continue. | ||
167 | |||
168 | mft_zone_multiplier= Set the MFT zone multiplier for the volume (this | ||
169 | setting is not persistent across mounts and can be | ||
170 | changed from mount to mount but cannot be changed on | ||
171 | remount). Values of 1 to 4 are allowed, 1 being the | ||
172 | default. The MFT zone multiplier determines how much | ||
173 | space is reserved for the MFT on the volume. If all | ||
174 | other space is used up, then the MFT zone will be | ||
175 | shrunk dynamically, so this has no impact on the | ||
176 | amount of free space. However, it can have an impact | ||
177 | on performance by affecting fragmentation of the MFT. | ||
178 | In general use the default. If you have a lot of small | ||
179 | files then use a higher value. The values have the | ||
180 | following meaning: | ||
181 | Value MFT zone size (% of volume size) | ||
182 | 1 12.5% | ||
183 | 2 25% | ||
184 | 3 37.5% | ||
185 | 4 50% | ||
186 | Note this option is irrelevant for read-only mounts. | ||
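The table above amounts to one eighth of the volume per multiplier step. A minimal sketch of that arithmetic follows; the function name mft_zone_sectors is hypothetical, and the driver itself may round or align the zone differently.

```python
def mft_zone_sectors(volume_sectors: int, multiplier: int = 1) -> int:
    """Approximate MFT zone size in sectors for a given multiplier."""
    if multiplier not in (1, 2, 3, 4):
        raise ValueError("mft_zone_multiplier must be between 1 and 4")
    # Each step reserves a further 12.5% (one eighth) of the volume.
    return volume_sectors * multiplier // 8


if __name__ == "__main__":
    for value in (1, 2, 3, 4):
        print(value, mft_zone_sectors(1000, value))
```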
187 | |||
188 | |||
189 | Known bugs and (mis-)features | ||
190 | ============================= | ||
191 | |||
192 | - The link count on each directory inode entry is set to 1, due to Linux not | ||
193 | supporting directory hard links. This may well confuse some user space | ||
194 | applications, since the directory names will have the same inode numbers. | ||
195 | This also speeds up ntfs_read_inode() immensely. And we haven't found any | ||
196 | problems with this approach so far. If you find a problem with this, please | ||
197 | let us know. | ||
198 | |||
199 | |||
200 | Please send bug reports/comments/feedback/abuse to the Linux-NTFS development | ||
201 | list at sourceforge: linux-ntfs-dev@lists.sourceforge.net | ||
202 | |||
203 | |||
204 | Using NTFS volume and stripe sets | ||
205 | ================================= | ||
206 | |||
207 | For support of volume and stripe sets, you can either use the kernel's | ||
208 | Device-Mapper driver or the kernel's Software RAID / MD driver. The former is | ||
209 | the recommended one to use for linear raid. But the latter is required for | ||
210 | raid level 5. For striping and mirroring, either driver should work fine. | ||
211 | |||
212 | |||
213 | The Device-Mapper driver | ||
214 | ------------------------ | ||
215 | |||
216 | You will need to create a table of the components of the volume/stripe set and | ||
217 | how they fit together and load this into the kernel using the dmsetup utility | ||
218 | (see man 8 dmsetup). | ||
219 | |||
220 | Linear volume sets, i.e. linear raid, have been tested and work fine. Even | ||
221 | though untested, there is no reason why stripe sets, i.e. raid level 0, and | ||
222 | mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e. | ||
223 | raid level 5, unfortunately cannot work yet because the current version of the | ||
224 | Device-Mapper driver does not support raid level 5. You may be able to use the | ||
225 | Software RAID / MD driver for raid level 5, see the next section for details. | ||
226 | |||
227 | To create the table describing your volume you will need to know each of its | ||
228 | components and their sizes in sectors, i.e. multiples of 512-byte blocks. | ||
229 | |||
230 | For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for | ||
231 | example if one of your partitions is /dev/hda2 you would do: | ||
232 | |||
233 | $ fdisk -ul /dev/hda | ||
234 | |||
235 | Disk /dev/hda: 81.9 GB, 81964302336 bytes | ||
236 | 255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors | ||
237 | Units = sectors of 1 * 512 = 512 bytes | ||
238 | |||
239 | Device Boot Start End Blocks Id System | ||
240 | /dev/hda1 * 63 4209029 2104483+ 83 Linux | ||
241 | /dev/hda2 4209030 37768814 16779892+ 86 NTFS | ||
242 | /dev/hda3 37768815 46170809 4200997+ 83 Linux | ||
243 | |||
244 | And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 = | ||
245 | 33559785 sectors. | ||
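The same calculation can be written as a small helper. This is only a sketch; the function name partition_sectors is hypothetical. It takes the Start and End columns of the "fdisk -ul" output, both of which are inclusive, hence the "+ 1".

```python
def partition_sectors(start: int, end: int) -> int:
    """Size of a partition in 512-byte sectors from inclusive start/end."""
    if end < start:
        raise ValueError("end sector must not be smaller than start sector")
    return end - start + 1


if __name__ == "__main__":
    # Start and End of /dev/hda2 from the fdisk listing above.
    print(partition_sectors(4209030, 37768814))
```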
246 | |||
247 | For Win2k and later dynamic disks, you can for example use the ldminfo utility | ||
248 | which is part of the Linux LDM tools (the latest version at the time of | ||
249 | writing is linux-ldm-0.0.8.tar.bz2). You can download it from: | ||
250 | http://linux-ntfs.sourceforge.net/downloads.html | ||
251 | Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go | ||
252 | into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You | ||
253 | will find the precompiled (i386) ldminfo utility there. NOTE: You will not be | ||
254 | able to compile this yourself easily so use the binary version! | ||
255 | |||
256 | Then you would use ldminfo in dump mode to obtain the necessary information: | ||
257 | |||
258 | $ ./ldminfo --dump /dev/hda | ||
259 | |||
260 | This would dump the LDM database found on /dev/hda which describes all of your | ||
261 | dynamic disks and all the volumes on them. At the bottom you will see the | ||
262 | VOLUME DEFINITIONS section which is all you really need. You may need to look | ||
263 | further above to determine which of the disks in the volume definitions is | ||
264 | which device in Linux. Hint: Run ldminfo on each of your dynamic disks and | ||
265 | look at the Disk Id close to the top of the output for each (the PRIVATE HEADER | ||
266 | section). You can then find these Disk Ids in the VBLK DATABASE section in the | ||
267 | <Disk> components where you will get the LDM Name for the disk that is found in | ||
268 | the VOLUME DEFINITIONS section. | ||
269 | |||
270 | Note you will also need to enable the LDM driver in the Linux kernel. If your | ||
271 | distribution did not enable it, you will need to recompile the kernel with it | ||
272 | enabled. This will create the LDM partitions on each device at boot time. You | ||
273 | would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc) | ||
274 | in the Device-Mapper table. | ||
275 | |||
276 | You can also bypass using the LDM driver by using the main device (e.g. | ||
277 | /dev/hda) and then using the offsets of the LDM partitions into this device as | ||
278 | the "Start sector of device" when creating the table. Once again ldminfo would | ||
279 | give you the correct information to do this. | ||
280 | |||
281 | Assuming you know all your devices and their sizes things are easy. | ||
282 | |||
283 | For a linear raid the table would look like this (note all values are in | ||
284 | 512-byte sectors): | ||
285 | |||
286 | --- cut here --- | ||
287 | # Offset into Size of this Raid type Device Start sector | ||
288 | # volume device of device | ||
289 | 0 1028161 linear /dev/hda1 0 | ||
290 | 1028161 3903762 linear /dev/hdb2 0 | ||
291 | 4931923 2103211 linear /dev/hdc1 0 | ||
292 | --- cut here --- | ||
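Writing such a table by hand means keeping a running total of the component sizes for the first column. A short sketch of how it could be generated is shown here; the function name linear_table and the (device, size, start) tuple layout are assumptions made for the example, not part of dmsetup.

```python
def linear_table(components):
    """Build Device-Mapper 'linear' table lines.

    components: iterable of (device, size_in_sectors, start_sector) tuples,
    listed in the order in which they make up the volume.
    """
    lines = []
    offset = 0  # running offset into the volume, in 512-byte sectors
    for device, size, start in components:
        lines.append(f"{offset} {size} linear {device} {start}")
        offset += size
    return "\n".join(lines)


if __name__ == "__main__":
    print(linear_table([
        ("/dev/hda1", 1028161, 0),
        ("/dev/hdb2", 3903762, 0),
        ("/dev/hdc1", 2103211, 0),
    ]))
```

Each offset is simply the sum of the sizes of all preceding components, which is how the values 1028161 and 4931923 in the table above arise.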
293 | |||
294 | For a striped volume, i.e. raid level 0, you will need to know the chunk size | ||
295 | you used when creating the volume. Windows uses 64kiB as the default, so it | ||
296 | will probably be this unless you changed the defaults when creating the array. | ||
297 | |||
298 | For a raid level 0 the table would look like this (note all values are in | ||
299 | 512-byte sectors): | ||
300 | |||
301 | --- cut here --- | ||
302 | # Offset Size Raid Number Chunk 1st Start 2nd Start | ||
303 | # into of the type of size Device in Device in | ||
304 | # volume volume stripes device device | ||
305 | 0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0 | ||
306 | --- cut here --- | ||
307 | |||
308 | If there are more than two devices, just add each of them to the end of the | ||
309 | line. | ||
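The chunk size in the table is given in 512-byte sectors, so the Windows default of 64kiB becomes 128. The sketch below builds the striped line from a chunk size in kiB; the function name striped_line and its argument layout are hypothetical and only meant to illustrate the unit conversion and the ordering of the fields.

```python
SECTOR_BYTES = 512


def striped_line(volume_sectors, chunk_kib, devices):
    """Build a Device-Mapper 'striped' table line.

    devices: list of (device, start_sector) tuples, one per stripe.
    """
    # Convert the chunk size from kiB to 512-byte sectors (64kiB -> 128).
    chunk_sectors = chunk_kib * 1024 // SECTOR_BYTES
    members = " ".join(f"{dev} {start}" for dev, start in devices)
    return f"0 {volume_sectors} striped {len(devices)} {chunk_sectors} {members}"


if __name__ == "__main__":
    print(striped_line(2056320, 64, [("/dev/hda1", 0), ("/dev/hdb1", 0)]))
```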
310 | |||
311 | Finally, for a mirrored volume, i.e. raid level 1, the table would look like | ||
312 | this (note all values are in 512-byte sectors): | ||
313 | |||
314 | --- cut here --- | ||
315 | # Ofs Size Raid Log Number Region Should Number Source Start Target Start | ||
316 | # in of the type type of log size sync? of Device in Device in | ||
317 | # vol volume params mirrors Device Device | ||
318 | 0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0 | ||
319 | --- cut here --- | ||
320 | |||
321 | If you are mirroring to multiple devices you can specify further targets at the | ||
322 | end of the line. | ||
323 | |||
324 | Note the "Should sync?" parameter "nosync" means that the two mirrors are | ||
325 | already in sync which will be the case on a clean shutdown of Windows. If the | ||
326 | mirrors are not clean, you can specify the "sync" option instead of "nosync" | ||
327 | and the Device-Mapper driver will then copy the entirety of the "Source Device" | ||
328 | to the "Target Device" or if you specified multiple target devices to all of | ||
329 | them. | ||
330 | |||
331 | Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1), | ||
332 | and hand it over to dmsetup to work with, like so: | ||
333 | |||
334 | $ dmsetup create myvolume1 /etc/ntfsvolume1 | ||
335 | |||
336 | You can obviously replace "myvolume1" with whatever name you like. | ||
337 | |||
338 | If it all worked, you will now have the device /dev/device-mapper/myvolume1 | ||
339 | which you can then just use as an argument to the mount command as usual to | ||
340 | mount the ntfs volume. For example: | ||
341 | |||
342 | $ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1 | ||
343 | |||
344 | (You need to create the directory /mnt/myvol1 first and of course you can use | ||
345 | anything you like instead of /mnt/myvol1 as long as it is an existing | ||
346 | directory.) | ||
347 | |||
348 | It is advisable to do the mount read-only to see if the volume has been set up | ||
349 | correctly to avoid the possibility of causing damage to the data on the ntfs | ||
350 | volume. | ||
351 | |||
352 | |||
353 | The Software RAID / MD driver | ||
354 | ----------------------------- | ||
355 | |||
356 | An alternative to using the Device-Mapper driver is to use the kernel's | ||
357 | Software RAID / MD driver, for which you need to set up your /etc/raidtab | ||
358 | appropriately (see man 5 raidtab). | ||
359 | |||
360 | Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level | ||
361 | 0, have been tested and work fine (though see section "Limitations when using | ||
362 | the Software RAID / MD driver" especially if you want to use linear raid). | ||
363 | Even though untested, there is no reason why mirrors, i.e. raid level 1, and | ||
364 | stripes with parity, i.e. raid level 5, should not work, too. | ||
365 | |||
366 | You have to use the "persistent-superblock 0" option for each raid-disk in the | ||
367 | NTFS volume/stripe you are configuring in /etc/raidtab as the persistent | ||
368 | superblock used by the MD driver would damage the NTFS volume. | ||
369 | |||
370 | Windows by default uses a stripe chunk size of 64k, so you probably want the | ||
371 | "chunk-size 64k" option for each raid-disk, too. | ||
372 | |||
373 | For example, if you have a stripe set consisting of two partitions /dev/hda5 | ||
374 | and /dev/hdb1 your /etc/raidtab would look like this: | ||
375 | |||
376 | raiddev /dev/md0 | ||
377 | raid-level 0 | ||
378 | nr-raid-disks 2 | ||
379 | nr-spare-disks 0 | ||
380 | persistent-superblock 0 | ||
381 | chunk-size 64k | ||
382 | device /dev/hda5 | ||
383 | raid-disk 0 | ||
384 | device /dev/hdb1 | ||
385 | raid-disk 1 | ||
386 | |||
387 | For linear raid, just change the raid-level above to "raid-level linear", for | ||
388 | mirrors, change it to "raid-level 1", and for stripe sets with parity, change | ||
389 | it to "raid-level 5". | ||
390 | |||
391 | Note for stripe sets with parity you will also need to tell the MD driver | ||
392 | which parity algorithm to use by specifying the option "parity-algorithm | ||
393 | which", where you need to replace "which" with the name of the algorithm to | ||
394 | use (see man 5 raidtab for available algorithms) and you will have to try the | ||
395 | different available algorithms until you find one that works. Make sure you | ||
396 | are working read-only when playing with this as you may damage your data | ||
397 | otherwise. If you find which algorithm works please let us know (email the | ||
398 | linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on | ||
399 | IRC in channel #ntfs on the irc.freenode.net network) so we can update this | ||
400 | documentation. | ||
401 | |||
402 | Once the raidtab is set up, run for example raid0run -a to start all devices or | ||
403 | raid0run /dev/md0 to start a particular md device, in this case /dev/md0. | ||
404 | |||
405 | Then just use the mount command as usual to mount the ntfs volume using for | ||
406 | example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume | ||
407 | |||
408 | It is advisable to do the mount read-only to see if the md volume has been | ||
409 | set up correctly to avoid the possibility of causing damage to the data on the | ||
410 | ntfs volume. | ||
411 | |||
412 | |||
413 | Limitations when using the Software RAID / MD driver | ||
414 | ---------------------------------------------------- | ||
415 | |||
416 | Using the md driver will not work properly if any of your NTFS partitions have | ||
417 | an odd number of sectors. This is especially important for linear raid as all | ||
418 | data after the first partition with an odd number of sectors will be offset by | ||
419 | one or more sectors so if you mount such a partition with write support you | ||
420 | will cause massive damage to the data on the volume which will only become | ||
421 | apparent when you try to use the volume again under Windows. | ||
422 | |||
423 | So when using linear raid, make sure that all your partitions have an even | ||
424 | number of sectors BEFORE attempting to use it. You have been warned! | ||
425 | |||
426 | Even better is to simply use the Device-Mapper for linear raid and then you do | ||
427 | not have this problem with odd numbers of sectors. | ||
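Before trying the MD driver it may be worth checking the sector counts of all components. The following is a minimal sketch of such a check; the function name odd_sized_partitions and the (device, size) tuple layout are hypothetical.

```python
def odd_sized_partitions(partitions):
    """Return the devices whose size in 512-byte sectors is odd.

    partitions: iterable of (device, size_in_sectors) tuples.
    """
    return [device for device, size in partitions if size % 2 != 0]


if __name__ == "__main__":
    # Sizes taken from the linear Device-Mapper table example above.
    components = [
        ("/dev/hda1", 1028161),
        ("/dev/hdb2", 3903762),
        ("/dev/hdc1", 2103211),
    ]
    for device in odd_sized_partitions(components):
        print(f"{device} has an odd number of sectors")
```

If the returned list is not empty, the warning above applies and the Device-Mapper driver is the safer choice for that volume.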
428 | |||
429 | |||
430 | ChangeLog | ||
431 | ========= | ||
432 | |||
433 | Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog. | ||
434 | |||
435 | 2.1.22: | ||
436 | - Improve handling of ntfs volumes with errors. | ||
437 | - Fix various bugs and race conditions. | ||
438 | 2.1.21: | ||
439 | - Fix several race conditions and various other bugs. | ||
440 | - Many internal cleanups, code reorganization, optimizations, and mft | ||
441 | and index record writing code rewritten to fit in with the changes. | ||
442 | - Update Documentation/filesystems/ntfs.txt with instructions on how to | ||
443 | use the Device-Mapper driver with NTFS ftdisk/LDM raid. | ||
444 | 2.1.20: | ||
445 | - Fix two stupid bugs introduced in 2.1.18 release. | ||
446 | 2.1.19: | ||
447 | - Minor bugfix in handling of the default upcase table. | ||
448 | - Many internal cleanups and improvements. Many thanks to Linus | ||
449 | Torvalds and Al Viro for the help and advice with the sparse | ||
450 | annotations and cleanups. | ||
451 | 2.1.18: | ||
452 | - Fix scheduling latencies at mount time. (Ingo Molnar) | ||
453 | - Fix endianness bug in a little traversed portion of the attribute | ||
454 | lookup code. | ||
455 | 2.1.17: | ||
456 | - Fix bugs in mount time error code paths. | ||
457 | 2.1.16: | ||
458 | - Implement access time updates (including mtime and ctime). | ||
459 | - Implement fsync(2), fdatasync(2), and msync(2) system calls. | ||
460 | - Enable the readv(2) and writev(2) system calls. | ||
461 | - Enable access via the asynchronous io (aio) API by adding support for | ||
462 | the aio_read(3) and aio_write(3) functions. | ||
463 | 2.1.15: | ||
464 | - Invalidate quotas when (re)mounting read-write. | ||
465 | NOTE: This now only leaves user space journalling on the side. (See | ||
466 | note for version 2.1.13, below.) | ||
467 | 2.1.14: | ||
468 | - Fix an NFSd caused deadlock reported by several users. | ||
469 | 2.1.13: | ||
470 | - Implement writing of inodes (access time updates are not implemented | ||
471 | yet so mounting with -o noatime,nodiratime is enforced). | ||
472 | - Enable writing out of resident files so you can now overwrite any | ||
473 | uncompressed, unencrypted, nonsparse file as long as you do not | ||
474 | change the file size. | ||
475 | - Add housekeeping of ntfs system files so that ntfsfix no longer needs | ||
476 | to be run after writing to an NTFS volume. | ||
477 | NOTE: This still leaves quota tracking and user space journalling on | ||
478 | the side but they should not cause data corruption. In the worst | ||
479 | case the charged quotas will be out of date ($Quota) and some | ||
480 | userspace applications might get confused due to the out of date | ||
481 | userspace journal ($UsnJrnl). | ||
482 | 2.1.12: | ||
483 | - Fix the second fix to the decompression engine from the 2.1.9 release | ||
484 | and some further internals cleanups. | ||
485 | 2.1.11: | ||
486 | - Driver internal cleanups. | ||
487 | 2.1.10: | ||
488 | - Force read-only (re)mounting of volumes with unsupported volume | ||
489 | flags and various cleanups. | ||
490 | 2.1.9: | ||
491 | - Fix two bugs in handling of corner cases in the decompression engine. | ||
492 | 2.1.8: | ||
493 | - Read the $MFT mirror and compare it to the $MFT and if the two do not | ||
494 | match, force a read-only mount and do not allow read-write remounts. | ||
495 | - Read and parse the $LogFile journal and if it indicates that the | ||
496 | volume was not shutdown cleanly, force a read-only mount and do not | ||
497 | allow read-write remounts. If the $LogFile indicates a clean | ||
498 | shutdown and a read-write (re)mount is requested, empty $LogFile to | ||
499 | ensure that Windows cannot cause data corruption by replaying a stale | ||
500 | journal after Linux has written to the volume. | ||
501 | - Improve time handling so that the NTFS time is fully preserved when | ||
502 | converted to kernel time and only up to 99 nano-seconds are lost when | ||
503 | kernel time is converted to NTFS time. | ||
504 | 2.1.7: | ||
505 | - Enable NFS exporting of mounted NTFS volumes. | ||
506 | 2.1.6: | ||
507 | - Fix minor bug in handling of compressed directories that fixes the | ||
508 | erroneous "du" and "stat" output people reported. | ||
509 | 2.1.5: | ||
510 | - Minor bug fix in attribute list attribute handling that fixes the | ||
511 | I/O errors on "ls" of certain fragmented files found by at least two | ||
512 | people running Windows XP. | ||
513 | 2.1.4: | ||
514 | - Minor update allowing compilation with all gcc versions (well, the | ||
515 | ones the kernel can be compiled with anyway). | ||
516 | 2.1.3: | ||
517 | - Major bug fixes for reading files and volumes in corner cases which | ||
518 | were being hit by Windows 2k/XP users. | ||
519 | 2.1.2: | ||
520 | - Major bug fixes alleviating the hangs in statfs experienced by some | ||
521 | users. | ||
522 | 2.1.1: | ||
523 | - Update handling of compressed files so people no longer get the | ||
524 | frequently reported warning messages about initialized_size != | ||
525 | data_size. | ||
526 | 2.1.0: | ||
527 | - Add configuration option for developmental write support. | ||
528 | - Initial implementation of file overwriting. (Writes to resident files | ||
529 | are not written out to disk yet, so avoid writing to files smaller | ||
530 | than about 1kiB.) | ||
531 | - Intercept/abort changes in file size as they are not implemented yet. | ||
532 | 2.0.25: | ||
533 | - Minor bugfixes in error code paths and small cleanups. | ||
534 | 2.0.24: | ||
535 | - Small internal cleanups. | ||
536 | - Support for sendfile system call. (Christoph Hellwig) | ||
537 | 2.0.23: | ||
538 | - Massive internal locking changes to mft record locking. Fixes | ||
539 | various race conditions and deadlocks. | ||
540 | - Fix ntfs over loopback for compressed files by adding an | ||
541 | optimization barrier. (gcc was screwing up otherwise ?) | ||
542 | Thanks go to Christoph Hellwig for pointing these two out: | ||
543 | - Remove now unused function fs/ntfs/malloc.h::vmalloc_nofs(). | ||
544 | - Fix ntfs_free() for ia64 and parisc. | ||
545 | 2.0.22: | ||
546 | - Small internal cleanups. | ||
547 | 2.0.21: | ||
548 | These only affect 32-bit architectures: | ||
549 | - Check for, and refuse to mount too large volumes (maximum is 2TiB). | ||
550 | - Check for, and refuse to open too large files and directories | ||
551 | (maximum is 16TiB). | ||
552 | 2.0.20: | ||
553 | - Support non-resident directory index bitmaps. This means we now cope | ||
554 | with huge directories without problems. | ||
555 | - Fix a page leak that manifested itself in some cases when reading | ||
556 | directory contents. | ||
557 | - Internal cleanups. | ||
558 | 2.0.19: | ||
559 | - Fix race condition and improvements in block i/o interface. | ||
560 | - Optimization when reading compressed files. | ||
561 | 2.0.18: | ||
562 | - Fix race condition in reading of compressed files. | ||
563 | 2.0.17: | ||
564 | - Cleanups and optimizations. | ||
565 | 2.0.16: | ||
566 | - Fix stupid bug introduced in 2.0.15 in new attribute inode API. | ||
567 | - Big internal cleanup replacing the mftbmp access hacks by using the | ||
568 | new attribute inode API instead. | ||
569 | 2.0.15: | ||
570 | - Bug fix in parsing of remount options. | ||
571 | - Internal changes implementing attribute (fake) inodes allowing all | ||
572 | attribute i/o to go via the page cache and to use all the normal | ||
573 | vfs/mm functionality. | ||
574 | 2.0.14: | ||
575 | - Internal changes improving run list merging code and minor locking | ||
576 | change to not rely on BKL in ntfs_statfs(). | ||
577 | 2.0.13: | ||
578 | - Internal changes towards using iget5_locked() in preparation for | ||
579 | fake inodes and small cleanups to ntfs_volume structure. | ||
580 | 2.0.12: | ||
581 | - Internal cleanups in address space operations made possible by the | ||
582 | changes introduced in the previous release. | ||
583 | 2.0.11: | ||
584 | - Internal updates and cleanups introducing the first step towards | ||
585 | fake inode based attribute i/o. | ||
586 | 2.0.10: | ||
587 | - Microsoft says that the maximum number of inodes is 2^32 - 1. Update | ||
588 | the driver accordingly to only use 32-bits to store inode numbers on | ||
589 | 32-bit architectures. This improves the speed of the driver a little. | ||
590 | 2.0.9: | ||
591 | - Change decompression engine to use a single buffer. This should not | ||
592 | affect performance except perhaps on the most heavy i/o on SMP | ||
593 | systems when accessing multiple compressed files from multiple | ||
594 | devices simultaneously. | ||
595 | - Minor updates and cleanups. | ||
596 | 2.0.8: | ||
597 | - Remove now obsolete show_inodes and posix mount option(s). | ||
598 | - Restore show_sys_files mount option. | ||
599 | - Add new mount option case_sensitive, to determine if the driver | ||
600 | treats file names as case sensitive or not. | ||
601 | - Mostly drop support for short file names (for backwards compatibility | ||
602 | we only support accessing files via their short file name if one | ||
603 | exists). | ||
604 | - Fix dcache aliasing issues wrt short/long file names. | ||
605 | - Cleanups and minor fixes. | ||
606 | 2.0.7: | ||
607 | - Just cleanups. | ||
608 | 2.0.6: | ||
609 | - Major bugfix to make compatible with other kernel changes. This fixes | ||
610 | the hangs/oopses on umount. | ||
611 | - Locking cleanup in directory operations (remove BKL usage). | ||
612 | 2.0.5: | ||
613 | - Major buffer overflow bug fix. | ||
614 | - Minor cleanups and updates for kernel 2.5.12. | ||
615 | 2.0.4: | ||
616 | - Cleanups and updates for kernel 2.5.11. | ||
617 | 2.0.3: | ||
618 | - Small bug fixes, cleanups, and performance improvements. | ||
619 | 2.0.2: | ||
620 | - Use default fmask of 0177 so that files are not executable by default. | ||
621 | If you want owner executable files, just use fmask=0077. | ||
622 | - Update for kernel 2.5.9 but preserve backwards compatibility with | ||
623 | kernel 2.5.7. | ||
624 | - Minor bug fixes, cleanups, and updates. | ||
625 | 2.0.1: | ||
626 | - Minor updates, primarily set the executable bit by default on files | ||
627 | so they can be executed. | ||
628 | 2.0.0: | ||
629 | - Started ChangeLog. | ||
630 | |||
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting new file mode 100644 index 000000000000..2f388460cbe7 --- /dev/null +++ b/Documentation/filesystems/porting | |||
@@ -0,0 +1,266 @@ | |||
1 | Changes since 2.5.0: | ||
2 | |||
3 | --- | ||
4 | [recommended] | ||
5 | |||
6 | New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(), | ||
7 | sb_set_blocksize() and sb_min_blocksize(). | ||
8 | |||
9 | Use them. | ||
10 | |||
11 | (sb_find_get_block() replaces 2.4's get_hash_table()) | ||
12 | |||
13 | --- | ||
14 | [recommended] | ||
15 | |||
16 | New methods: ->alloc_inode() and ->destroy_inode(). | ||
17 | |||
18 | Remove inode->u.foo_inode_i | ||
19 | Declare | ||
20 | struct foo_inode_info { | ||
21 | /* fs-private stuff */ | ||
22 | struct inode vfs_inode; | ||
23 | }; | ||
24 | static inline struct foo_inode_info *FOO_I(struct inode *inode) | ||
25 | { | ||
26 | return list_entry(inode, struct foo_inode_info, vfs_inode); | ||
27 | } | ||
28 | |||
29 | Use FOO_I(inode) instead of &inode->u.foo_inode_i; | ||
30 | |||
31 | Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate | ||
32 | foo_inode_info and return the address of ->vfs_inode, the latter should free | ||
33 | FOO_I(inode) (see in-tree filesystems for examples). | ||
34 | |||
35 | Make them ->alloc_inode and ->destroy_inode in your super_operations. | ||
36 | |||
37 | Keep in mind that now you need explicit initialization of private data - | ||
38 | typically in ->read_inode() and after getting an inode from new_inode(). | ||
39 | |||
40 | At some point that will become mandatory. | ||
41 | |||
42 | --- | ||
43 | [mandatory] | ||
44 | |||
45 | Change of file_system_type method (->read_super to ->get_sb) | ||
46 | |||
47 | ->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV. | ||
48 | |||
49 | Turn your foo_read_super() into a function that would return 0 in case of | ||
50 | success and negative number in case of error (-EINVAL unless you have more | ||
51 | informative error value to report). Call it foo_fill_super(). Now declare | ||
52 | |||
53 | struct super_block *foo_get_sb(struct file_system_type *fs_type, | ||
54 | int flags, const char *dev_name, void *data) | ||
55 | { | ||
56 | return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super); | ||
57 | } | ||
58 | |||
59 | (or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of | ||
60 | filesystem). | ||
61 | |||
62 | Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as | ||
63 | foo_get_sb. | ||
64 | |||
65 | --- | ||
66 | [mandatory] | ||
67 | |||
68 | Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames. | ||
69 | Most likely there is no need to change anything, but if you relied on | ||
70 | global exclusion between renames for some internal purpose - you need to | ||
71 | change your internal locking. Otherwise exclusion warranties remain the | ||
72 | same (i.e. parents and victim are locked, etc.). | ||
73 | |||
74 | --- | ||
75 | [informational] | ||
76 | |||
77 | Now we have the exclusion between ->lookup() and directory removal (by | ||
78 | ->rmdir() and ->rename()). If you used to need that exclusion and do | ||
79 | it by internal locking (most of filesystems couldn't care less) - you | ||
80 | can relax your locking. | ||
81 | |||
82 | --- | ||
83 | [mandatory] | ||
84 | |||
85 | ->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(), | ||
86 | ->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename() | ||
87 | and ->readdir() are called without BKL now. Grab it on entry, drop upon return | ||
88 | - that will guarantee the same locking you used to have. If your method or its | ||
89 | parts do not need BKL - better yet, now you can shift lock_kernel() and | ||
90 | unlock_kernel() so that they would protect exactly what needs to be | ||
91 | protected. | ||
92 | |||
93 | --- | ||
94 | [mandatory] | ||
95 | |||
96 | BKL is also moved from around sb operations. ->write_super() is now called | ||
97 | without BKL held. BKL should have been shifted into individual fs sb_op | ||
98 | functions. If you don't need it, remove it. | ||
99 | |||
100 | --- | ||
101 | [informational] | ||
102 | |||
103 | check for ->link() target not being a directory is done by callers. Feel | ||
104 | free to drop it... | ||
105 | |||
106 | --- | ||
107 | [informational] | ||
108 | |||
109 | ->link() callers hold ->i_sem on the object we are linking to. Some of your | ||
110 | problems might be over... | ||
111 | |||
112 | --- | ||
113 | [mandatory] | ||
114 | |||
115 | new file_system_type method - kill_sb(superblock). If you are converting | ||
116 | an existing filesystem, set it according to ->fs_flags: | ||
117 | FS_REQUIRES_DEV - kill_block_super | ||
118 | FS_LITTER - kill_litter_super | ||
119 | neither - kill_anon_super | ||
120 | FS_LITTER is gone - just remove it from fs_flags. | ||
121 | |||
122 | --- | ||
123 | [mandatory] | ||
124 | |||
125 | FS_SINGLE is gone (actually, that had happened back when ->get_sb() | ||
126 | went in - and hadn't been documented ;-/). Just remove it from fs_flags | ||
127 | (and see ->get_sb() entry for other actions). | ||
128 | |||
129 | --- | ||
130 | [mandatory] | ||
131 | |||
132 | ->setattr() is called without BKL now. Caller _always_ holds ->i_sem, so | ||
133 | watch for ->i_sem-grabbing code that might be used by your ->setattr(). | ||
134 | Callers of notify_change() need ->i_sem now. | ||
135 | |||
136 | --- | ||
137 | [recommended] | ||
138 | |||
139 | New super_block field "struct export_operations *s_export_op" for | ||
140 | explicit support for exporting, e.g. via NFS. The structure is fully | ||
141 | documented at its declaration in include/linux/fs.h, and in | ||
142 | Documentation/filesystems/Exporting. | ||
143 | |||
144 | Briefly it allows for the definition of decode_fh and encode_fh operations | ||
145 | to encode and decode filehandles, and allows the filesystem to use | ||
146 | a standard helper function for decode_fh, and provide file-system specific | ||
147 | support for this helper, particularly get_parent. | ||
148 | |||
149 | It is planned that this will be required for exporting once the code | ||
150 | settles down a bit. | ||
151 | |||
152 | [mandatory] | ||
153 | |||
154 | s_export_op is now required for exporting a filesystem. | ||
155 | isofs, ext2, ext3, reiserfs, fat | ||
156 | can be used as examples of very different filesystems. | ||
157 | |||
158 | --- | ||
159 | [mandatory] | ||
160 | |||
161 | iget4() and the read_inode2 callback have been superseded by iget5_locked() | ||
162 | which has the following prototype, | ||
163 | |||
164 | struct inode *iget5_locked(struct super_block *sb, unsigned long ino, | ||
165 |                            int (*test)(struct inode *, void *), | ||
166 |                            int (*set)(struct inode *, void *), | ||
167 |                            void *data); | ||
168 | |||
169 | 'test' is an additional function that can be used when the inode | ||
170 | number is not sufficient to identify the actual file object. 'set' | ||
171 | should be a non-blocking function that initializes those parts of a | ||
172 | newly created inode to allow the test function to succeed. 'data' is | ||
173 | passed as an opaque value to both test and set functions. | ||
174 | |||
175 | When the inode has been created by iget5_locked(), it will be returned with | ||
176 | the I_NEW flag set and will still be locked. read_inode has not been | ||
177 | called so the file system still has to finalize the initialization. Once | ||
178 | the inode is initialized it must be unlocked by calling unlock_new_inode(). | ||
179 | |||
180 | The filesystem is responsible for setting (and possibly testing) i_ino | ||
181 | when appropriate. There is also a simpler iget_locked function that | ||
182 | just takes the superblock and inode number as arguments and does the | ||
183 | test and set for you. | ||
184 | |||
185 | e.g. | ||
186 |         inode = iget_locked(sb, ino); | ||
187 |         if (inode->i_state & I_NEW) { | ||
188 |                 read_inode_from_disk(inode); | ||
189 |                 unlock_new_inode(inode); | ||
190 |         } | ||
191 | |||
192 | --- | ||
193 | [recommended] | ||
194 | |||
195 | ->getattr() finally getting used. See instances in nfs, minix, etc. | ||
196 | |||
197 | --- | ||
198 | [mandatory] | ||
199 | |||
200 | ->revalidate() is gone. If your filesystem had it - provide ->getattr() | ||
201 | and let it call whatever you had as ->revalidate() + (for symlinks that | ||
202 | had ->revalidate()) add calls in ->follow_link()/->readlink(). | ||
203 | |||
204 | --- | ||
205 | [mandatory] | ||
206 | |||
207 | ->d_parent changes are not protected by BKL anymore. Read access is safe | ||
208 | if at least one of the following is true: | ||
209 | * filesystem has no cross-directory rename() | ||
210 | * dcache_lock is held | ||
211 | * we know that parent had been locked (e.g. we are looking at | ||
212 | ->d_parent of ->lookup() argument). | ||
213 | * we are called from ->rename(). | ||
214 | * the child's ->d_lock is held | ||
215 | Audit your code and add locking if needed. Notice that any place that is | ||
216 | not protected by the conditions above is risky even in the old tree - you | ||
217 | had been relying on BKL and that's prone to screwups. Old tree had quite | ||
218 | a few holes of that kind - unprotected access to ->d_parent leading to | ||
219 | anything from oops to silent memory corruption. | ||
220 | |||
221 | --- | ||
222 | [mandatory] | ||
223 | |||
224 | FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags | ||
225 | (see rootfs for one kind of solution and bdev/socket/pipe for another). | ||
226 | |||
227 | --- | ||
228 | [recommended] | ||
229 | |||
230 | Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter | ||
231 | is still alive, but only because of the mess in drivers/s390/block/dasd.c. | ||
232 | As soon as it gets fixed is_read_only() will die. | ||
233 | |||
234 | --- | ||
235 | [mandatory] | ||
236 | |||
237 | ->permission() is called without BKL now. Grab it on entry, drop upon | ||
238 | return - that will guarantee the same locking you used to have. If | ||
239 | your method or its parts do not need BKL - better yet, now you can | ||
240 | shift lock_kernel() and unlock_kernel() so that they would protect | ||
241 | exactly what needs to be protected. | ||
242 | |||
243 | --- | ||
244 | [mandatory] | ||
245 | |||
246 | ->statfs() is now called without BKL held. BKL should have been | ||
247 | shifted into individual fs sb_op functions where it's not clear that | ||
248 | it's safe to remove it. If you don't need it, remove it. | ||
249 | |||
250 | --- | ||
251 | [mandatory] | ||
252 | |||
253 | is_read_only() is gone; use bdev_read_only() instead. | ||
254 | |||
255 | --- | ||
256 | [mandatory] | ||
257 | |||
258 | destroy_buffers() is gone; use invalidate_bdev(). | ||
259 | |||
260 | --- | ||
261 | [mandatory] | ||
262 | |||
263 | fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is | ||
264 | deliberate; as soon as struct block_device * is propagated in a reasonable | ||
265 | way by that code fixing will become trivial; until then nothing can be | ||
266 | done. | ||
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt new file mode 100644 index 000000000000..cbe85c17176b --- /dev/null +++ b/Documentation/filesystems/proc.txt | |||
@@ -0,0 +1,1940 @@ | |||
1 | ------------------------------------------------------------------------------ | ||
2 | T H E /proc F I L E S Y S T E M | ||
3 | ------------------------------------------------------------------------------ | ||
4 | /proc/sys Terrehon Bowden <terrehon@pacbell.net> October 7 1999 | ||
5 | Bodo Bauer <bb@ricochet.net> | ||
6 | |||
7 | 2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000 | ||
8 | ------------------------------------------------------------------------------ | ||
9 | Version 1.3 Kernel version 2.2.12 | ||
10 | Kernel version 2.4.0-test11-pre4 | ||
11 | ------------------------------------------------------------------------------ | ||
12 | |||
13 | Table of Contents | ||
14 | ----------------- | ||
15 | |||
16 | 0 Preface | ||
17 | 0.1 Introduction/Credits | ||
18 | 0.2 Legal Stuff | ||
19 | |||
20 | 1 Collecting System Information | ||
21 | 1.1 Process-Specific Subdirectories | ||
22 | 1.2 Kernel data | ||
23 | 1.3 IDE devices in /proc/ide | ||
24 | 1.4 Networking info in /proc/net | ||
25 | 1.5 SCSI info | ||
26 | 1.6 Parallel port info in /proc/parport | ||
27 | 1.7 TTY info in /proc/tty | ||
28 | 1.8 Miscellaneous kernel statistics in /proc/stat | ||
29 | |||
30 | 2 Modifying System Parameters | ||
31 | 2.1 /proc/sys/fs - File system data | ||
32 | 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats | ||
33 | 2.3 /proc/sys/kernel - general kernel parameters | ||
34 | 2.4 /proc/sys/vm - The virtual memory subsystem | ||
35 | 2.5 /proc/sys/dev - Device specific parameters | ||
36 | 2.6 /proc/sys/sunrpc - Remote procedure calls | ||
37 | 2.7 /proc/sys/net - Networking stuff | ||
38 | 2.8 /proc/sys/net/ipv4 - IPV4 settings | ||
39 | 2.9 Appletalk | ||
40 | 2.10 IPX | ||
41 | 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem | ||
42 | |||
43 | ------------------------------------------------------------------------------ | ||
44 | Preface | ||
45 | ------------------------------------------------------------------------------ | ||
46 | |||
47 | 0.1 Introduction/Credits | ||
48 | ------------------------ | ||
49 | |||
50 | This documentation is part of a soon (or so we hope) to be released book on | ||
51 | the SuSE Linux distribution. As there is no complete documentation for the | ||
52 | /proc file system and we've used many freely available sources to write these | ||
53 | chapters, it seems only fair to give the work back to the Linux community. | ||
54 | This work is based on the 2.2.* kernel version and the upcoming 2.4.*. I'm | ||
55 | afraid it's still far from complete, but we hope it will be useful. As far as | ||
56 | we know, it is the first 'all-in-one' document about the /proc file system. It | ||
57 | is focused on the Intel x86 hardware, so if you are looking for PPC, ARM, | ||
58 | SPARC, AXP, etc., features, you probably won't find what you are looking for. | ||
59 | It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. But | ||
60 | additions and patches are welcome and will be added to this document if you | ||
61 | mail them to Bodo. | ||
62 | |||
63 | We'd like to thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of | ||
64 | other people for help compiling this documentation. We'd also like to extend a | ||
65 | special thank you to Andi Kleen for documentation, which we relied on heavily | ||
66 | to create this document, as well as the additional information he provided. | ||
67 | Thanks to everybody else who contributed source or docs to the Linux kernel | ||
68 | and helped create a great piece of software... :) | ||
69 | |||
70 | If you have any comments, corrections or additions, please don't hesitate to | ||
71 | contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this | ||
72 | document. | ||
73 | |||
74 | The latest version of this document is available online at | ||
75 | http://skaro.nightcrawler.com/~bb/Docs/Proc as HTML version. | ||
76 | |||
77 | If the above address does not work for you, you could try the kernel | ||
78 | mailing list at linux-kernel@vger.kernel.org and/or try to reach me at | ||
79 | comandante@zaralinux.com. | ||
80 | |||
81 | 0.2 Legal Stuff | ||
82 | --------------- | ||
83 | |||
84 | We don't guarantee the correctness of this document, and if you come to us | ||
85 | complaining about how you screwed up your system because of incorrect | ||
86 | documentation, we won't feel responsible... | ||
87 | |||
88 | ------------------------------------------------------------------------------ | ||
89 | CHAPTER 1: COLLECTING SYSTEM INFORMATION | ||
90 | ------------------------------------------------------------------------------ | ||
91 | |||
92 | ------------------------------------------------------------------------------ | ||
93 | In This Chapter | ||
94 | ------------------------------------------------------------------------------ | ||
95 | * Investigating the properties of the pseudo file system /proc and its | ||
96 | ability to provide information on the running Linux system | ||
97 | * Examining /proc's structure | ||
98 | * Uncovering various information about the kernel and the processes running | ||
99 | on the system | ||
100 | ------------------------------------------------------------------------------ | ||
101 | |||
102 | |||
103 | The proc file system acts as an interface to internal data structures in the | ||
104 | kernel. It can be used to obtain information about the system and to change | ||
105 | certain kernel parameters at runtime (sysctl). | ||
106 | |||
107 | First, we'll take a look at the read-only parts of /proc. In Chapter 2, we | ||
108 | show you how you can use /proc/sys to change settings. | ||
109 | |||
110 | 1.1 Process-Specific Subdirectories | ||
111 | ----------------------------------- | ||
112 | |||
113 | The directory /proc contains (among other things) one subdirectory for each | ||
114 | process running on the system, which is named after the process ID (PID). | ||
115 | |||
116 | The link self points to the process reading the file system. Each process | ||
117 | subdirectory has the entries listed in Table 1-1. | ||
118 | |||
119 | |||
120 | Table 1-1: Process specific entries in /proc | ||
121 | .............................................................................. | ||
122 | File     Content | ||
123 | cmdline  Command line arguments | ||
124 | cpu      Current and last cpu in which it was executed (2.4)(smp) | ||
125 | cwd      Link to the current working directory | ||
126 | environ  Values of environment variables | ||
127 | exe      Link to the executable of this process | ||
128 | fd       Directory, which contains all file descriptors | ||
129 | maps     Memory maps to executables and library files (2.4) | ||
130 | mem      Memory held by this process | ||
131 | root     Link to the root directory of this process | ||
132 | stat     Process status | ||
133 | statm    Process memory status information | ||
134 | status   Process status in human readable form | ||
135 | wchan    If CONFIG_KALLSYMS is set, a pre-decoded wchan | ||
136 | .............................................................................. | ||
137 | |||
138 | For example, to get the status information of a process, all you have to do is | ||
139 | read the file /proc/PID/status: | ||
140 | |||
141 | >cat /proc/self/status | ||
142 | Name: cat | ||
143 | State: R (running) | ||
144 | Pid: 5452 | ||
145 | PPid: 743 | ||
146 | TracerPid: 0 (2.4) | ||
147 | Uid: 501 501 501 501 | ||
148 | Gid: 100 100 100 100 | ||
149 | Groups: 100 14 16 | ||
150 | VmSize: 1112 kB | ||
151 | VmLck: 0 kB | ||
152 | VmRSS: 348 kB | ||
153 | VmData: 24 kB | ||
154 | VmStk: 12 kB | ||
155 | VmExe: 8 kB | ||
156 | VmLib: 1044 kB | ||
157 | SigPnd: 0000000000000000 | ||
158 | SigBlk: 0000000000000000 | ||
159 | SigIgn: 0000000000000000 | ||
160 | SigCgt: 0000000000000000 | ||
161 | CapInh: 00000000fffffeff | ||
162 | CapPrm: 0000000000000000 | ||
163 | CapEff: 0000000000000000 | ||
164 | |||
165 | |||
166 | This shows you nearly the same information you would get if you viewed it with | ||
167 | the ps command. In fact, ps uses the proc file system to obtain its | ||
168 | information. The statm file contains more detailed information about the | ||
169 | process memory usage. Its seven fields are explained in Table 1-2. | ||
170 | |||
171 | |||
172 | Table 1-2: Contents of the statm files (as of 2.6.8-rc3) | ||
173 | .............................................................................. | ||
174 | Field     Content | ||
175 | size      total program size (pages) (same as VmSize in status) | ||
176 | resident  size of memory portions (pages) (same as VmRSS in status) | ||
177 | shared    number of pages that are shared (i.e. backed by a file) | ||
178 | trs       number of pages that are 'code' (not including libs; broken, | ||
179 |           includes data segment) | ||
180 | lrs       number of pages of library (always 0 on 2.6) | ||
181 | drs       number of pages of data/stack (including libs; broken, | ||
182 |           includes library text) | ||
183 | dt        number of dirty pages (always 0 on 2.6) | ||
184 | .............................................................................. | ||
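
As a rough illustration of Table 1-2, the following Python sketch converts the
page counts of a statm line into kB. The name parse_statm, the 4 kB default
page size and the sample line are assumptions made for this example; the sample
is made up so that size and resident agree with the VmSize and VmRSS values of
the status listing above. None of this is part of the kernel or of any tool.

```python
# Field names in the order given in Table 1-2.
STATM_FIELDS = ("size", "resident", "shared", "trs", "lrs", "drs", "dt")


def parse_statm(text, page_size_kb=4):
    """Map each statm field to its size in kB.

    page_size_kb is an assumption (4 kB pages, as on x86); pass the
    real page size of the machine if it differs.
    """
    pages = [int(value) for value in text.split()]
    if len(pages) != len(STATM_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(STATM_FIELDS), len(pages)))
    return {name: count * page_size_kb
            for name, count in zip(STATM_FIELDS, pages)}


# Made-up sample line, not taken from a real system.
sample = "278 87 70 2 0 26 0"
usage = parse_statm(sample)
print("size: %d kB, resident: %d kB" % (usage["size"], usage["resident"]))
```

On a live system the text would come from reading /proc/PID/statm instead of
the sample string.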
185 | |||
186 | 1.2 Kernel data | ||
187 | --------------- | ||
188 | |||
189 | Similar to the process entries, the kernel data files give information about | ||
190 | the running kernel. The files used to obtain this information are contained in | ||
191 | /proc and are listed in Table 1-3. Not all of these will be present in your | ||
192 | system. Which files are there and which are missing depends on the kernel | ||
193 | configuration and the loaded modules. | ||
194 | |||
195 | Table 1-3: Kernel info in /proc | ||
196 | .............................................................................. | ||
197 | File         Content | ||
198 | apm          Advanced power management info | ||
199 | buddyinfo    Kernel memory allocator information (see text) (2.5) | ||
200 | bus          Directory containing bus specific information | ||
201 | cmdline      Kernel command line | ||
202 | cpuinfo      Info about the CPU | ||
203 | devices      Available devices (block and character) | ||
204 | dma          Used DMA channels | ||
205 | filesystems  Supported filesystems | ||
206 | driver       Various drivers grouped here, currently rtc (2.4) | ||
207 | execdomains  Execdomains, related to security (2.4) | ||
208 | fb           Frame Buffer devices (2.4) | ||
209 | fs           File system parameters, currently nfs/exports (2.4) | ||
210 | ide          Directory containing info about the IDE subsystem | ||
211 | interrupts   Interrupt usage | ||
212 | iomem        Memory map (2.4) | ||
213 | ioports      I/O port usage | ||
214 | irq          Masks for irq to cpu affinity (2.4)(smp?) | ||
215 | isapnp       ISA PnP (Plug&Play) Info (2.4) | ||
216 | kcore        Kernel core image (can be ELF or A.OUT(deprecated in 2.4)) | ||
217 | kmsg         Kernel messages | ||
218 | ksyms        Kernel symbol table | ||
219 | loadavg      Load average of last 1, 5 & 15 minutes | ||
220 | locks        Kernel locks | ||
221 | meminfo      Memory info | ||
222 | misc         Miscellaneous | ||
223 | modules      List of loaded modules | ||
224 | mounts       Mounted filesystems | ||
225 | net          Networking info (see text) | ||
226 | partitions   Table of partitions known to the system | ||
227 | pci          Deprecated info of PCI bus (new way -> /proc/bus/pci/, | ||
228 |              decoupled by lspci) (2.4) | ||
229 | rtc          Real time clock | ||
230 | scsi         SCSI info (see text) | ||
231 | slabinfo     Slab pool info | ||
232 | stat         Overall statistics | ||
233 | swaps        Swap space utilization | ||
234 | sys          See chapter 2 | ||
235 | sysvipc      Info of SysVIPC Resources (msg, sem, shm) (2.4) | ||
236 | tty          Info of tty drivers | ||
237 | uptime       System uptime | ||
238 | version      Kernel version | ||
239 | video        bttv info of video resources (2.4) | ||
240 | .............................................................................. | ||
241 | |||
242 | You can, for example, check which interrupts are currently in use and what | ||
243 | they are used for by looking in the file /proc/interrupts: | ||
244 | |||
245 | > cat /proc/interrupts | ||
246 | CPU0 | ||
247 | 0: 8728810 XT-PIC timer | ||
248 | 1: 895 XT-PIC keyboard | ||
249 | 2: 0 XT-PIC cascade | ||
250 | 3: 531695 XT-PIC aha152x | ||
251 | 4: 2014133 XT-PIC serial | ||
252 | 5: 44401 XT-PIC pcnet_cs | ||
253 | 8: 2 XT-PIC rtc | ||
254 | 11: 8 XT-PIC i82365 | ||
255 | 12: 182918 XT-PIC PS/2 Mouse | ||
256 | 13: 1 XT-PIC fpu | ||
257 | 14: 1232265 XT-PIC ide0 | ||
258 | 15: 7 XT-PIC ide1 | ||
259 | NMI: 0 | ||
260 | |||
261 | In 2.4.* a couple of lines were added to this file, LOC & ERR (this time it is | ||
262 | the output of an SMP machine): | ||
263 | |||
264 | > cat /proc/interrupts | ||
265 | |||
266 | CPU0 CPU1 | ||
267 | 0: 1243498 1214548 IO-APIC-edge timer | ||
268 | 1: 8949 8958 IO-APIC-edge keyboard | ||
269 | 2: 0 0 XT-PIC cascade | ||
270 | 5: 11286 10161 IO-APIC-edge soundblaster | ||
271 | 8: 1 0 IO-APIC-edge rtc | ||
272 | 9: 27422 27407 IO-APIC-edge 3c503 | ||
273 | 12: 113645 113873 IO-APIC-edge PS/2 Mouse | ||
274 | 13: 0 0 XT-PIC fpu | ||
275 | 14: 22491 24012 IO-APIC-edge ide0 | ||
276 | 15: 2183 2415 IO-APIC-edge ide1 | ||
277 | 17: 30564 30414 IO-APIC-level eth0 | ||
278 | 18: 177 164 IO-APIC-level bttv | ||
279 | NMI: 2457961 2457959 | ||
280 | LOC: 2457882 2457881 | ||
281 | ERR: 2155 | ||
282 | |||
283 | NMI is incremented in this case because every timer interrupt generates an NMI | ||
284 | (Non Maskable Interrupt) which is used by the NMI Watchdog to detect lockups. | ||
285 | |||
286 | LOC is the local interrupt counter of the internal APIC of every CPU. | ||
287 | |||
288 | ERR is incremented in the case of errors in the IO-APIC bus (the bus that | ||
289 | connects the CPUs in an SMP system). This means that an error has been detected; | ||
290 | the IO-APIC automatically retries the transmission, so it should not be a big | ||
291 | problem, but you should read the SMP-FAQ. | ||
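
The layout described above (a header row naming the CPUs, then one row per
interrupt with one counter per CPU) can be read by a program roughly as
follows. This is only a sketch: parse_interrupts is a name invented for this
example, and the sample rows are copied from the SMP listing above.

```python
def parse_interrupts(text):
    """Return {label: (per_cpu_counts, description)} for /proc/interrupts text."""
    lines = [line for line in text.splitlines() if line.strip()]
    cpu_count = len(lines[0].split())      # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Rows such as ERR carry fewer counters than there are CPUs,
        # so stop at the first field that is not a number.
        for field in fields[:cpu_count]:
            if not field.isdigit():
                break
            counts.append(int(field))
        table[label.strip()] = (counts, " ".join(fields[len(counts):]))
    return table


sample = """\
           CPU0       CPU1
  0:    1243498    1214548    IO-APIC-edge  timer
 17:      30564      30414   IO-APIC-level  eth0
NMI:    2457961    2457959
ERR:       2155
"""
for label, (counts, description) in parse_interrupts(sample).items():
    print(label, sum(counts), description)
```

Summing the per-CPU counters gives the total number of times each interrupt
has fired since boot.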
292 | |||
293 | In this context it is worth noting the new irq directory in 2.4. | ||
294 | It can be used to set IRQ to CPU affinity. This means that you can "hook" an | ||
295 | IRQ to only one CPU, or exclude a CPU from handling IRQs. The irq subdir | ||
296 | contains one subdir for each IRQ, and one file: prof_cpu_mask. | ||
297 | |||
298 | For example | ||
299 | > ls /proc/irq/ | ||
300 | 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask | ||
301 | 1 11 13 15 17 19 3 5 7 9 | ||
302 | > ls /proc/irq/0/ | ||
303 | smp_affinity | ||
304 | |||
305 | The contents of the prof_cpu_mask file and of each IRQ's smp_affinity file | ||
306 | are the same by default: | ||
307 | |||
308 | > cat /proc/irq/0/smp_affinity | ||
309 | ffffffff | ||
310 | |||
311 | It's a bitmask, in which you can specify which CPUs can handle the IRQ. You can | ||
312 | set it by doing: | ||
313 | |||
314 | > echo 1 > /proc/irq/prof_cpu_mask | ||
315 | |||
316 | This means that only the first CPU will handle the IRQ, but you can also echo 5, | ||
317 | which means that only the first and third CPU can handle the IRQ. | ||
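
To make the bitmask concrete: bit n of the hexadecimal mask stands for CPU n,
counting from 0. The following Python sketch converts between masks and CPU
lists; the function names are invented for this example and it does not touch
/proc at all.

```python
def cpus_in_mask(mask_text):
    """Return the CPU numbers (counted from 0) whose bit is set in a hex mask."""
    mask = int(mask_text.strip(), 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


def mask_for_cpus(cpus):
    """Build the hex mask string that selects exactly the given CPUs."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return "%x" % mask


print(cpus_in_mask("1"))        # only CPU0
print(cpus_in_mask("5"))        # binary 101: CPU0 and CPU2
print(mask_for_cpus([0, 1]))    # mask to write for the first two CPUs
```

So the default value ffffffff simply has the bits of CPU0 through CPU31 set.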
318 | |||
319 | The way IRQs are routed is handled by the IO-APIC, and it's Round Robin | ||
320 | between all the CPUs which are allowed to handle it. As usual the kernel has | ||
321 | more info than you and does a better job than you, so the defaults are the | ||
322 | best choice for almost everyone. | ||
323 | |||
324 | There are three more important subdirectories in /proc: net, scsi, and sys. | ||
325 | The general rule is that the contents, or even the existence of these | ||
326 | directories, depend on your kernel configuration. If SCSI is not enabled, the | ||
327 | directory scsi may not exist. The same is true of the net directory, which is there | ||
328 | only when networking support is present in the running kernel. | ||
329 | |||
330 | The slabinfo file gives information about memory usage at the slab level. | ||
331 | Linux uses slab pools for memory management above page level in version 2.2. | ||
332 | Commonly used objects have their own slab pool (such as network buffers, | ||
333 | directory cache, and so on). | ||
334 | |||
335 | .............................................................................. | ||
336 | |||
337 | > cat /proc/buddyinfo | ||
338 | |||
339 | Node 0, zone DMA 0 4 5 4 4 3 ... | ||
340 | Node 0, zone Normal 1 0 0 1 101 8 ... | ||
341 | Node 0, zone HighMem 2 0 0 1 1 0 ... | ||
342 | |||
343 | Memory fragmentation is a problem under some workloads, and buddyinfo is a | ||
344 | useful tool for helping diagnose these problems. Buddyinfo will give you a | ||
345 | clue as to how big an area you can safely allocate, or why a previous | ||
346 | allocation failed. | ||
347 | |||
348 | Each column represents the number of pages of a certain order which are | ||
349 | available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in | ||
350 | ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE | ||
351 | available in ZONE_NORMAL, etc... | ||
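
The arithmetic behind those columns can be sketched in Python as below. The
function names are invented for this example, and the sample lines contain only
the six columns visible in the truncated listing above, so the totals are for
illustration only.

```python
def parse_buddyinfo_line(line):
    """Split one buddyinfo line into (node, zone, counts per order)."""
    fields = line.replace(",", " ").split()
    # fields looks like: ['Node', '0', 'zone', 'DMA', '0', '4', ...]
    node = int(fields[1])
    zone = fields[3]
    counts = [int(count) for count in fields[4:]]
    return node, zone, counts


def free_pages(counts):
    """Total free pages: each chunk of order n covers 2^n pages."""
    return sum(count << order for order, count in enumerate(counts))


def largest_free_order(counts):
    """Highest order with at least one free chunk, or None if all are zero."""
    orders = [order for order, count in enumerate(counts) if count > 0]
    return max(orders) if orders else None


for line in ("Node 0, zone DMA 0 4 5 4 4 3",
             "Node 0, zone Normal 1 0 0 1 101 8"):
    node, zone, counts = parse_buddyinfo_line(line)
    print(node, zone, free_pages(counts), largest_free_order(counts))
```

An allocation of order n can only succeed directly if some column at position
n or higher is non-zero, which is why the largest free order is a useful hint
when diagnosing fragmentation.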
352 | |||
353 | .............................................................................. | ||
354 | |||
355 | meminfo: | ||
356 | |||
357 | Provides information about distribution and utilization of memory. This | ||
358 | varies by architecture and compile options. The following is from a | ||
359 | 16GB PIII, which has highmem enabled. You may not have all of these fields. | ||
360 | |||
361 | > cat /proc/meminfo | ||
362 | |||
363 | |||
364 | MemTotal: 16344972 kB | ||
365 | MemFree: 13634064 kB | ||
366 | Buffers: 3656 kB | ||
367 | Cached: 1195708 kB | ||
368 | SwapCached: 0 kB | ||
369 | Active: 891636 kB | ||
370 | Inactive: 1077224 kB | ||
371 | HighTotal: 15597528 kB | ||
372 | HighFree: 13629632 kB | ||
373 | LowTotal: 747444 kB | ||
374 | LowFree: 4432 kB | ||
375 | SwapTotal: 0 kB | ||
376 | SwapFree: 0 kB | ||
377 | Dirty: 968 kB | ||
378 | Writeback: 0 kB | ||
379 | Mapped: 280372 kB | ||
380 | Slab: 684068 kB | ||
381 | CommitLimit: 7669796 kB | ||
382 | Committed_AS: 100056 kB | ||
383 | PageTables: 24448 kB | ||
384 | VmallocTotal: 112216 kB | ||
385 | VmallocUsed: 428 kB | ||
386 | VmallocChunk: 111088 kB | ||
387 | |||
388 | MemTotal: Total usable ram (i.e. physical ram minus a few reserved | ||
389 | bits and the kernel binary code) | ||
390 | MemFree: The sum of LowFree+HighFree | ||
391 | Buffers: Relatively temporary storage for raw disk blocks; | ||
392 | shouldn't get tremendously large (20MB or so) | ||
393 | Cached: in-memory cache for files read from the disk (the | ||
394 | pagecache). Doesn't include SwapCached | ||
395 | SwapCached: Memory that once was swapped out, is swapped back in but | ||
396 | still also is in the swapfile (if memory is needed it | ||
397 | doesn't need to be swapped out AGAIN because it is already | ||
398 | in the swapfile. This saves I/O) | ||
399 | Active: Memory that has been used more recently and usually not | ||
400 | reclaimed unless absolutely necessary. | ||
401 | Inactive: Memory which has been less recently used. It is more | ||
402 | eligible to be reclaimed for other purposes | ||
403 | HighTotal: | ||
404 | HighFree: Highmem is all memory above ~860MB of physical memory | ||
405 | Highmem areas are for use by userspace programs, or | ||
406 | for the pagecache. The kernel must use tricks to access | ||
407 | this memory, making it slower to access than lowmem. | ||
408 | LowTotal: | ||
409 | LowFree: Lowmem is memory which can be used for everything that | ||
410 | highmem can be used for, but it is also available for the | ||
411 | kernel's use for its own data structures. Among many | ||
412 | other things, it is where everything from the Slab is | ||
413 | allocated. Bad things happen when you're out of lowmem. | ||
414 | SwapTotal: total amount of swap space available | ||
415 | SwapFree: amount of swap space which is currently unused (i.e. not | ||
416 | holding memory which has been evicted from RAM) | ||
417 | Dirty: Memory which is waiting to get written back to the disk | ||
418 | Writeback: Memory which is actively being written back to the disk | ||
419 | Mapped: files which have been mmaped, such as libraries | ||
420 | Slab: in-kernel data structures cache | ||
421 | CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), | ||
422 | this is the total amount of memory currently available to | ||
423 | be allocated on the system. This limit is only adhered to | ||
424 | if strict overcommit accounting is enabled (mode 2 in | ||
425 | 'vm.overcommit_memory'). | ||
426 | The CommitLimit is calculated with the following formula: | ||
427 | CommitLimit = ('vm.overcommit_ratio' / 100 * Physical RAM) + Swap | ||
428 | For example, on a system with 1G of physical RAM and 7G | ||
429 | of swap with a 'vm.overcommit_ratio' of 30 it would | ||
430 | yield a CommitLimit of 7.3G. | ||
431 | For more details, see the memory overcommit documentation | ||
432 | in vm/overcommit-accounting. | ||
433 | Committed_AS: The amount of memory presently allocated on the system. | ||
434 | The committed memory is a sum of all of the memory which | ||
435 | has been allocated by processes, even if it has not been | ||
436 | "used" by them as of yet. A process which malloc()'s 1G | ||
437 | of memory, but only touches 300M of it will only show up | ||
438 | as using 300M of memory even if it has the address space | ||
439 | allocated for the entire 1G. This 1G is memory which has | ||
440 | been "committed" to by the VM and can be used at any time | ||
441 | by the allocating application. With strict overcommit | ||
442 | enabled on the system (mode 2 in 'vm.overcommit_memory'), | ||
443 | allocations which would exceed the CommitLimit (detailed | ||
444 | above) will not be permitted. This is useful if one needs | ||
445 | to guarantee that processes will not fail due to lack of | ||
446 | memory once that memory has been successfully allocated. | ||
447 | PageTables: amount of memory dedicated to the lowest level of page | ||
448 | tables. | ||
449 | VmallocTotal: total size of vmalloc memory area | ||
450 | VmallocUsed: amount of vmalloc area which is used | ||
451 | VmallocChunk: largest contiguous block of vmalloc area which is free | ||
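
The CommitLimit example above can be checked with a few lines of Python. This
is a sketch under stated assumptions: parse_meminfo and commit_limit_kb are
names invented here, the ratio is treated as a percentage, and the formula is
the simplified one given in the text, so the result need not match the
CommitLimit a real kernel reports.

```python
def parse_meminfo(text):
    """Return {field name: value in kB} for 'Name: value kB' lines."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[name.strip()] = int(fields[0])
    return info


def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio):
    """Simplified formula from the text; overcommit_ratio is a percentage."""
    return ram_kb * overcommit_ratio // 100 + swap_kb


GB = 1024 * 1024    # kB per GB
limit = commit_limit_kb(1 * GB, 7 * GB, 30)
print("CommitLimit: %.1fG" % (limit / GB))
```

With parse_meminfo, the MemTotal and SwapTotal fields of a real /proc/meminfo
could be fed into the same function as stand-ins for physical RAM and swap.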
452 | |||
453 | |||
454 | 1.3 IDE devices in /proc/ide | ||
455 | ---------------------------- | ||
456 | |||
457 | The subdirectory /proc/ide contains information about all IDE devices of which | ||
458 | the kernel is aware. There is one subdirectory for each IDE controller, the | ||
459 | file drivers and a link for each IDE device, pointing to the device directory | ||
460 | in the controller specific subtree. | ||
461 | |||
462 | The file drivers contains general information about the drivers used for the | ||
463 | IDE devices: | ||
464 | |||
465 | > cat /proc/ide/drivers | ||
466 | ide-cdrom version 4.53 | ||
467 | ide-disk version 1.08 | ||
468 | |||
469 | More detailed information can be found in the controller specific | ||
470 | subdirectories. These are named ide0, ide1 and so on. Each of these | ||
471 | directories contains the files shown in Table 1-4. | ||
472 | |||
473 | |||
474 | Table 1-4: IDE controller info in /proc/ide/ide? | ||
475 | .............................................................................. | ||
476 | File Content | ||
477 | channel IDE channel (0 or 1) | ||
478 | config Configuration (only for PCI/IDE bridge) | ||
479 | mate Mate name | ||
480 | model Type/Chipset of IDE controller | ||
481 | .............................................................................. | ||
482 | |||
483 | Each device connected to a controller has a separate subdirectory in the | ||
484 | controller's directory. The files listed in Table 1-5 are contained in these | ||
485 | directories. | ||
486 | |||
487 | |||
488 | Table 1-5: IDE device information | ||
489 | .............................................................................. | ||
490 | File Content | ||
491 | cache The cache | ||
492 | capacity Capacity of the medium (in 512Byte blocks) | ||
493 | driver driver and version | ||
494 | geometry physical and logical geometry | ||
495 | identify device identify block | ||
496 | media media type | ||
497 | model device identifier | ||
498 | settings device setup | ||
499 | smart_thresholds IDE disk management thresholds | ||
500 | smart_values IDE disk management values | ||
501 | .............................................................................. | ||
502 | |||
503 | The most interesting file is settings. This file contains a nice overview of | ||
504 | the drive parameters: | ||
505 | |||
506 | # cat /proc/ide/ide0/hda/settings | ||
507 | name value min max mode | ||
508 | ---- ----- --- --- ---- | ||
509 | bios_cyl 526 0 65535 rw | ||
510 | bios_head 255 0 255 rw | ||
511 | bios_sect 63 0 63 rw | ||
512 | breada_readahead 4 0 127 rw | ||
513 | bswap 0 0 1 r | ||
514 | file_readahead 72 0 2097151 rw | ||
515 | io_32bit 0 0 3 rw | ||
516 | keepsettings 0 0 1 rw | ||
517 | max_kb_per_request 122 1 127 rw | ||
518 | multcount 0 0 8 rw | ||
519 | nice1 1 0 1 rw | ||
520 | nowerr 0 0 1 rw | ||
521 | pio_mode write-only 0 255 w | ||
522 | slow 0 0 1 rw | ||
523 | unmaskirq 0 0 1 rw | ||
524 | using_dma 0 0 1 rw | ||
525 | |||
526 | |||
527 | 1.4 Networking info in /proc/net | ||
528 | -------------------------------- | ||
529 | |||
530 | The subdirectory /proc/net follows the usual pattern. Table 1-6 shows the | ||
531 | additional values you get for IP version 6 if you configure the kernel to | ||
532 | support this. Table 1-7 lists the files and their meaning. | ||
533 | |||
534 | |||
535 | Table 1-6: IPv6 info in /proc/net | ||
536 | .............................................................................. | ||
537 | File Content | ||
538 | udp6 UDP sockets (IPv6) | ||
539 | tcp6 TCP sockets (IPv6) | ||
540 | raw6 Raw device statistics (IPv6) | ||
541 | igmp6 IP multicast addresses, which this host joined (IPv6) | ||
542 | if_inet6 List of IPv6 interface addresses | ||
543 | ipv6_route Kernel routing table for IPv6 | ||
544 | rt6_stats Global IPv6 routing tables statistics | ||
545 | sockstat6 Socket statistics (IPv6) | ||
546 | snmp6 Snmp data (IPv6) | ||
547 | .............................................................................. | ||
548 | |||
549 | |||
550 | Table 1-7: Network info in /proc/net | ||
551 | .............................................................................. | ||
552 | File Content | ||
553 | arp Kernel ARP table | ||
554 | dev network devices with statistics | ||
555 | dev_mcast the Layer2 multicast groups a device is listening to | ||
556 | (interface index, label, number of references, number of bound | ||
557 | addresses). | ||
558 | dev_stat network device status | ||
559 | ip_fwchains Firewall chain linkage | ||
560 | ip_fwnames Firewall chain names | ||
561 | ip_masq Directory containing the masquerading tables | ||
562 | ip_masquerade Major masquerading table | ||
563 | netstat Network statistics | ||
564 | raw raw device statistics | ||
565 | route Kernel routing table | ||
566 | rpc Directory containing rpc info | ||
567 | rt_cache Routing cache | ||
568 | snmp SNMP data | ||
569 | sockstat Socket statistics | ||
570 | tcp TCP sockets | ||
571 | tr_rif Token ring RIF routing table | ||
572 | udp UDP sockets | ||
573 | unix UNIX domain sockets | ||
574 | wireless Wireless interface data (Wavelan etc) | ||
575 | igmp IP multicast addresses, which this host joined | ||
576 | psched Global packet scheduler parameters. | ||
577 | netlink List of PF_NETLINK sockets | ||
578 | ip_mr_vifs List of multicast virtual interfaces | ||
579 | ip_mr_cache List of multicast routing cache | ||
580 | .............................................................................. | ||
581 | |||
582 | You can use this information to see which network devices are available in | ||
583 | your system and how much traffic was routed over those devices: | ||
584 | |||
585 | > cat /proc/net/dev | ||
586 | Inter-|Receive |[... | ||
587 | face |bytes packets errs drop fifo frame compressed multicast|[... | ||
588 | lo: 908188 5596 0 0 0 0 0 0 [... | ||
589 | ppp0:15475140 20721 410 0 0 410 0 0 [... | ||
590 | eth0: 614530 7085 0 0 0 0 0 1 [... | ||
591 | |||
592 | ...] Transmit | ||
593 | ...] bytes packets errs drop fifo colls carrier compressed | ||
594 | ...] 908188 5596 0 0 0 0 0 0 | ||
595 | ...] 1375103 17405 0 0 0 0 0 0 | ||
596 | ...] 1703981 5535 0 0 0 3 0 0 | ||
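
The per-interface counters above can be read programmatically. The sketch
below is illustrative Python, not kernel code. It assumes the unfolded layout
in which each interface line carries eight receive counters followed by eight
transmit counters, as in the two halves of the listing above. The function
name parse_net_dev and the field-name lists are assumptions made for this
example.

```python
RX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed"]

# The two halves of the listing above, joined back into single lines.
SAMPLE = """\
Inter-|   Receive                        |  Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
    lo:  908188  5596   0 0 0   0 0 0   908188  5596 0 0 0 0 0 0
  ppp0:15475140 20721 410 0 0 410 0 0  1375103 17405 0 0 0 0 0 0
  eth0:  614530  7085   0 0 0   0 0 1  1703981  5535 0 0 0 3 0 0
"""


def parse_net_dev(text):
    """Return {interface: {"rx": {...}, "tx": {...}}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        # The name may be glued to the first counter ("ppp0:15475140"),
        # so split on the colon rather than on whitespace.
        name, colon, rest = line.partition(":")
        if not colon:
            continue
        numbers = [int(n) for n in rest.split()]
        if len(numbers) != len(RX_FIELDS) + len(TX_FIELDS):
            continue  # unexpected layout, skip rather than guess
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, numbers[:8])),
            "tx": dict(zip(TX_FIELDS, numbers[8:])),
        }
    return stats
```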
597 | |||
598 | In addition, each Channel Bond interface has its own directory. For | ||
599 | example, the bond0 device will have a directory called /proc/net/bond0/. | ||
600 | It will contain information that is specific to that bond, such as the | ||
601 | current slaves of the bond, the link status of the slaves, and how | ||
602 | many times each slave's link has failed. | ||
603 | |||
604 | 1.5 SCSI info | ||
605 | ------------- | ||
606 | |||
607 | If you have a SCSI host adapter in your system, you'll find a subdirectory | ||
608 | named after the driver for this adapter in /proc/scsi. You'll also see a list | ||
609 | of all recognized SCSI devices in /proc/scsi: | ||
610 | |||
611 | > cat /proc/scsi/scsi | ||
612 | Attached devices: | ||
613 | Host: scsi0 Channel: 00 Id: 00 Lun: 00 | ||
614 | Vendor: IBM Model: DGHS09U Rev: 03E0 | ||
615 | Type: Direct-Access ANSI SCSI revision: 03 | ||
616 | Host: scsi0 Channel: 00 Id: 06 Lun: 00 | ||
617 | Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04 | ||
618 | Type: CD-ROM ANSI SCSI revision: 02 | ||
619 | |||
620 | |||
621 | The directory named after the driver has one file for each adapter found in | ||
622 | the system. These files contain information about the controller, including | ||
623 | the used IRQ and the IO address range. The amount of information shown is | ||
624 | dependent on the adapter you use. The example shows the output for an Adaptec | ||
625 | AHA-2940 SCSI adapter: | ||
626 | |||
627 | > cat /proc/scsi/aic7xxx/0 | ||
628 | |||
629 | Adaptec AIC7xxx driver version: 5.1.19/3.2.4 | ||
630 | Compile Options: | ||
631 | TCQ Enabled By Default : Disabled | ||
632 | AIC7XXX_PROC_STATS : Disabled | ||
633 | AIC7XXX_RESET_DELAY : 5 | ||
634 | Adapter Configuration: | ||
635 | SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter | ||
636 | Ultra Wide Controller | ||
637 | PCI MMAPed I/O Base: 0xeb001000 | ||
638 | Adapter SEEPROM Config: SEEPROM found and used. | ||
639 | Adaptec SCSI BIOS: Enabled | ||
640 | IRQ: 10 | ||
641 | SCBs: Active 0, Max Active 2, | ||
642 | Allocated 15, HW 16, Page 255 | ||
643 | Interrupts: 160328 | ||
644 | BIOS Control Word: 0x18b6 | ||
645 | Adapter Control Word: 0x005b | ||
646 | Extended Translation: Enabled | ||
647 | Disconnect Enable Flags: 0xffff | ||
648 | Ultra Enable Flags: 0x0001 | ||
649 | Tag Queue Enable Flags: 0x0000 | ||
650 | Ordered Queue Tag Flags: 0x0000 | ||
651 | Default Tag Queue Depth: 8 | ||
652 | Tagged Queue By Device array for aic7xxx host instance 0: | ||
653 | {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255} | ||
654 | Actual queue depth per device for aic7xxx host instance 0: | ||
655 | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} | ||
656 | Statistics: | ||
657 | (scsi0:0:0:0) | ||
658 | Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8 | ||
659 | Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0) | ||
660 | Total transfers 160151 (74577 reads and 85574 writes) | ||
661 | (scsi0:0:6:0) | ||
662 | Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15 | ||
663 | Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0) | ||
664 | Total transfers 0 (0 reads and 0 writes) | ||
665 | |||
666 | |||
667 | 1.6 Parallel port info in /proc/parport | ||
668 | --------------------------------------- | ||
669 | |||
670 | The directory /proc/parport contains information about the parallel ports of | ||
671 | your system. It has one subdirectory for each port, named after the port | ||
672 | number (0,1,2,...). | ||
673 | |||
674 | These directories contain the four files shown in Table 1-8. | ||
675 | |||
676 | |||
677 | Table 1-8: Files in /proc/parport | ||
678 | .............................................................................. | ||
679 | File Content | ||
680 | autoprobe Any IEEE-1284 device ID information that has been acquired. | ||
681 | devices list of the device drivers using that port. A + will appear by the | ||
682 | name of the device currently using the port (it might not appear | ||
683 | against any). | ||
684 | hardware Parallel port's base address, IRQ line and DMA channel. | ||
685 | irq IRQ that parport is using for that port. This is in a separate | ||
686 | file to allow you to alter it by writing a new value in (IRQ | ||
687 | number or none). | ||
688 | .............................................................................. | ||
689 | |||
690 | 1.7 TTY info in /proc/tty | ||
691 | ------------------------- | ||
692 | |||
693 | Information about the available and actually used tty's can be found in the | ||
694 | directory /proc/tty. You'll find entries for drivers and line disciplines in | ||
695 | this directory, as shown in Table 1-9. | ||
696 | |||
697 | |||
698 | Table 1-9: Files in /proc/tty | ||
699 | .............................................................................. | ||
700 | File Content | ||
701 | drivers list of drivers and their usage | ||
702 | ldiscs registered line disciplines | ||
703 | driver/serial usage statistic and status of single tty lines | ||
704 | .............................................................................. | ||
705 | |||
706 | To see which tty's are currently in use, you can simply look into the file | ||
707 | /proc/tty/drivers: | ||
708 | |||
709 | > cat /proc/tty/drivers | ||
710 | pty_slave /dev/pts 136 0-255 pty:slave | ||
711 | pty_master /dev/ptm 128 0-255 pty:master | ||
712 | pty_slave /dev/ttyp 3 0-255 pty:slave | ||
713 | pty_master /dev/pty 2 0-255 pty:master | ||
714 | serial /dev/cua 5 64-67 serial:callout | ||
715 | serial /dev/ttyS 4 64-67 serial | ||
716 | /dev/tty0 /dev/tty0 4 0 system:vtmaster | ||
717 | /dev/ptmx /dev/ptmx 5 2 system | ||
718 | /dev/console /dev/console 5 1 system:console | ||
719 | /dev/tty /dev/tty 5 0 system:/dev/tty | ||
720 | unknown /dev/tty 4 1-63 console | ||
721 | |||
722 | |||
723 | 1.8 Miscellaneous kernel statistics in /proc/stat | ||
724 | ------------------------------------------------- | ||
725 | |||
726 | Various pieces of information about kernel activity are available in the | ||
727 | /proc/stat file. All of the numbers reported in this file are aggregates | ||
728 | since the system first booted. For a quick look, simply cat the file: | ||
729 | |||
730 | > cat /proc/stat | ||
731 | cpu 2255 34 2290 22625563 6290 127 456 | ||
732 | cpu0 1132 34 1441 11311718 3675 127 438 | ||
733 | cpu1 1123 0 849 11313845 2614 0 18 | ||
734 | intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...] | ||
735 | ctxt 1990473 | ||
736 | btime 1062191376 | ||
737 | processes 2915 | ||
738 | procs_running 1 | ||
739 | procs_blocked 0 | ||
740 | |||
741 | The very first "cpu" line aggregates the numbers in all of the other "cpuN" | ||
742 | lines. These numbers identify the amount of time the CPU has spent performing | ||
743 | different kinds of work. Time units are in USER_HZ (typically hundredths of a | ||
744 | second). The meanings of the columns are as follows, from left to right: | ||
745 | |||
746 | - user: normal processes executing in user mode | ||
747 | - nice: niced processes executing in user mode | ||
748 | - system: processes executing in kernel mode | ||
749 | - idle: twiddling thumbs | ||
750 | - iowait: waiting for I/O to complete | ||
751 | - irq: servicing interrupts | ||
752 | - softirq: servicing softirqs | ||
753 | |||
754 | The "intr" line gives counts of interrupts serviced since boot time, for each | ||
755 | of the possible system interrupts. The first column is the total of all | ||
756 | interrupts serviced; each subsequent column is the total for that particular | ||
757 | interrupt. | ||
758 | |||
759 | The "ctxt" line gives the total number of context switches across all CPUs. | ||
760 | |||
761 | The "btime" line gives the time at which the system booted, in seconds since | ||
762 | the Unix epoch. | ||
763 | |||
764 | The "processes" line gives the number of processes and threads created, which | ||
765 | includes (but is not limited to) those created by calls to the fork() and | ||
766 | clone() system calls. | ||
767 | |||
768 | The "procs_running" line gives the number of processes currently running on | ||
769 | CPUs. | ||
770 | |||
771 | The "procs_blocked" line gives the number of processes currently blocked, | ||
772 | waiting for I/O to complete. | ||
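
The column meanings listed above can be applied in a script. The following is
a minimal Python sketch, not kernel code. The names parse_stat and
busy_fraction are assumptions made for this example, and counting iowait as
non-busy time is a choice made here rather than something the kernel defines.
The sample repeats the output above without the truncated "intr" line.

```python
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]

SAMPLE = """\
cpu  2255 34 2290 22625563 6290 127 456
cpu0 1132 34 1441 11311718 3675 127 438
cpu1 1123 0 849 11313845 2614 0 18
ctxt 1990473
btime 1062191376
processes 2915
procs_running 1
procs_blocked 0
"""


def parse_stat(text):
    """Split /proc/stat text into per-CPU time dicts and single-value counters."""
    cpus, counters = {}, {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key, values = fields[0], [int(v) for v in fields[1:]]
        if key.startswith("cpu"):
            cpus[key] = dict(zip(CPU_FIELDS, values))
        elif len(values) == 1:
            counters[key] = values[0]
        # Multi-value lines such as "intr" are ignored in this sketch.
    return cpus, counters


def busy_fraction(cpu):
    """Fraction of time since boot not spent idle or waiting for I/O."""
    total = sum(cpu.values())
    if total == 0:
        return 0.0
    return (total - cpu["idle"] - cpu["iowait"]) / total
```

Because all values are aggregates since boot, a monitoring tool would
normally take two readings and work with the difference between them.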
773 | |||
774 | |||
775 | ------------------------------------------------------------------------------ | ||
776 | Summary | ||
777 | ------------------------------------------------------------------------------ | ||
778 | The /proc file system serves information about the running system. It not only | ||
779 | allows access to process data but also allows you to request the kernel status | ||
780 | by reading files in the hierarchy. | ||
781 | |||
782 | The directory structure of /proc reflects the types of information and makes | ||
783 | it easy, if not obvious, to see where to look for specific data. | ||
784 | ------------------------------------------------------------------------------ | ||
785 | |||
786 | ------------------------------------------------------------------------------ | ||
787 | CHAPTER 2: MODIFYING SYSTEM PARAMETERS | ||
788 | ------------------------------------------------------------------------------ | ||
789 | |||
790 | ------------------------------------------------------------------------------ | ||
791 | In This Chapter | ||
792 | ------------------------------------------------------------------------------ | ||
793 | * Modifying kernel parameters by writing into files found in /proc/sys | ||
794 | * Exploring the files which modify certain parameters | ||
795 | * Review of the /proc/sys file tree | ||
796 | ------------------------------------------------------------------------------ | ||
797 | |||
798 | |||
799 | A very interesting part of /proc is the directory /proc/sys. This is not only | ||
800 | a source of information, it also allows you to change parameters within the | ||
801 | kernel. Be very careful when attempting this. You can optimize your system, | ||
802 | but you can also cause it to crash. Never alter kernel parameters on a | ||
803 | production system. Set up a development machine and test to make sure that | ||
804 | everything works the way you want it to. You may have no alternative but to | ||
805 | reboot the machine once an error has been made. | ||
806 | |||
807 | To change a value, simply echo the new value into the file. An example is | ||
808 | given below in the section on the file system data. You need to be root to do | ||
809 | this. You can create your own boot script to perform this every time your | ||
810 | system boots. | ||
811 | |||
812 | The files in /proc/sys can be used to fine tune and monitor miscellaneous and | ||
813 | general things in the operation of the Linux kernel. Since some of the files | ||
814 | can inadvertently disrupt your system, it is advisable to read both | ||
815 | documentation and source before actually making adjustments. In any case, be | ||
816 | very careful when writing to any of these files. The entries in /proc may | ||
817 | change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt | ||
818 | review the kernel documentation in the directory /usr/src/linux/Documentation. | ||
819 | This chapter is heavily based on the documentation included in the pre 2.2 | ||
820 | kernels, and became part of it in version 2.2.1 of the Linux kernel. | ||
821 | |||
822 | 2.1 /proc/sys/fs - File system data | ||
823 | ----------------------------------- | ||
824 | |||
825 | This subdirectory contains specific file system, file handle, inode, dentry | ||
826 | and quota information. | ||
827 | |||
828 | Currently, these files are in /proc/sys/fs: | ||
829 | |||
830 | dentry-state | ||
831 | ------------ | ||
832 | |||
833 | Status of the directory cache. Since directory entries are dynamically | ||
834 | allocated and deallocated, this file indicates the current status. It holds | ||
835 | six values, in which the last two are not used and are always zero. The others | ||
836 | are listed in table 2-1. | ||
837 | |||
838 | |||
839 | Table 2-1: Status files of the directory cache | ||
840 | .............................................................................. | ||
841 | File Content | ||
842 | nr_dentry Almost always zero | ||
843 | nr_unused Number of unused cache entries | ||
844 | age_limit Age in seconds after which the entry may be reclaimed, when memory | ||
845 | is short | ||
846 | want_pages Used internally | ||
847 | .............................................................................. | ||
848 | |||
849 | dquot-nr and dquot-max | ||
850 | ---------------------- | ||
851 | |||
852 | The file dquot-max shows the maximum number of cached disk quota entries. | ||
853 | |||
854 | The file dquot-nr shows the number of allocated disk quota entries and the | ||
855 | number of free disk quota entries. | ||
856 | |||
857 | If the number of available cached disk quotas is very low and you have a large | ||
858 | number of simultaneous system users, you might want to raise the limit. | ||
859 | |||
860 | file-nr and file-max | ||
861 | -------------------- | ||
862 | |||
863 | The kernel allocates file handles dynamically, but doesn't free them again at | ||
864 | this time. | ||
865 | |||
866 | The value in file-max denotes the maximum number of file handles that the | ||
867 | Linux kernel will allocate. When you get a lot of error messages about running | ||
868 | out of file handles, you might want to raise this limit. The default value is | ||
869 | 10% of RAM in kilobytes. To change it, just write the new number into the | ||
870 | file: | ||
871 | |||
872 | # cat /proc/sys/fs/file-max | ||
873 | 4096 | ||
874 | # echo 8192 > /proc/sys/fs/file-max | ||
875 | # cat /proc/sys/fs/file-max | ||
876 | 8192 | ||
877 | |||
878 | |||
879 | This method of revision is useful for all customizable parameters of the | ||
880 | kernel - simply echo the new value to the corresponding file. | ||
881 | |||
882 | Historically, the three values in file-nr denoted the number of allocated file | ||
883 | handles, the number of allocated but unused file handles, and the maximum | ||
884 | number of file handles. Linux 2.6 always reports 0 as the number of free file | ||
885 | handles -- this is not an error, it just means that the number of allocated | ||
886 | file handles exactly matches the number of used file handles. | ||
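
The three values in file-nr can be split and compared as follows. This is an
illustrative Python sketch. The function name parse_file_nr and the input
values used in the usage note are hypothetical and not taken from a real
system.

```python
def parse_file_nr(text):
    """Parse the three numbers from file-nr text into a dict."""
    allocated, unused, maximum = (int(value) for value in text.split())
    return {
        "allocated": allocated,
        "unused": unused,  # always 0 on Linux 2.6, as noted above
        "max": maximum,
        "in_use": allocated - unused,
    }
```

For example, parse_file_nr("1024 0 8192") reports 1024 handles in use out of
a maximum of 8192.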
887 | |||
888 | Attempts to allocate more file descriptors than file-max are reported with | ||
889 | printk; look for "VFS: file-max limit <number> reached". | ||
890 | |||
891 | inode-state and inode-nr | ||
892 | ------------------------ | ||
893 | |||
894 | The file inode-nr contains the first two items from inode-state, so we'll skip | ||
895 | to that file... | ||
896 | |||
897 | inode-state contains two actual numbers and five dummy values. The numbers | ||
898 | are nr_inodes and nr_free_inodes (in order of appearance). | ||
899 | |||
900 | nr_inodes | ||
901 | ~~~~~~~~~ | ||
902 | |||
903 | Denotes the number of inodes the system has allocated. This number will | ||
904 | grow and shrink dynamically. | ||
905 | |||
906 | nr_free_inodes | ||
907 | ~~~~~~~~~~~~~~ | ||
908 | |||
909 | Represents the number of free inodes. That is, the number of in-use inodes is | ||
910 | (nr_inodes - nr_free_inodes). | ||
911 | |||
912 | super-nr and super-max | ||
913 | ---------------------- | ||
914 | |||
915 | Again, super block structures are allocated by the kernel, but not freed. The | ||
916 | file super-max contains the maximum number of super block handlers, where | ||
917 | super-nr shows the number of currently allocated ones. | ||
918 | |||
919 | Every mounted file system needs a super block, so if you plan to mount lots of | ||
920 | file systems, you may want to increase these numbers. | ||
921 | |||
922 | aio-nr and aio-max-nr | ||
923 | --------------------- | ||
924 | |||
925 | aio-nr is the running total of the number of events specified on the | ||
926 | io_setup system call for all currently active aio contexts. If aio-nr | ||
927 | reaches aio-max-nr then io_setup will fail with EAGAIN. Note that | ||
928 | raising aio-max-nr does not result in the pre-allocation or re-sizing | ||
929 | of any kernel data structures. | ||
930 | |||
931 | 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats | ||
932 | ----------------------------------------------------------- | ||
933 | |||
934 | Besides these files, there is the subdirectory /proc/sys/fs/binfmt_misc. This | ||
935 | handles the kernel support for miscellaneous binary formats. | ||
936 | |||
937 | Binfmt_misc provides the ability to register additional binary formats to the | ||
938 | Kernel without compiling an additional module/kernel. Therefore, binfmt_misc | ||
939 | needs to know magic numbers at the beginning or the filename extension of the | ||
940 | binary. | ||
941 | |||
942 | It works by maintaining a linked list of structs that contain a description of | ||
943 | a binary format, including a magic with size (or the filename extension), | ||
944 | offset and mask, and the interpreter name. On request it invokes the given | ||
945 | interpreter with the original program as argument, as binfmt_java and | ||
946 | binfmt_em86 and binfmt_mz do. Since binfmt_misc does not define any default | ||
947 | binary-formats, you have to register an additional binary-format. | ||
948 | |||
949 | There are two general files in binfmt_misc and one file per registered format. | ||
950 | The two general files are register and status. | ||
951 | |||
952 | Registering a new binary format | ||
953 | ------------------------------- | ||
954 | |||
955 | To register a new binary format you have to issue the command | ||
956 | |||
957 | echo :name:type:offset:magic:mask:interpreter: > /proc/sys/fs/binfmt_misc/register | ||
958 | |||
959 | |||
960 | |||
961 | with appropriate name (the name for the /proc-dir entry), offset (defaults to | ||
962 | 0, if omitted), magic, mask (which can be omitted, defaults to all 0xff) and | ||
963 | last but not least, the interpreter that is to be invoked (for example, | ||
964 | /bin/echo for testing). Type can be M for usual magic matching or E for filename | ||
965 | extension matching (give extension in place of magic). | ||
966 | |||
967 | Check or reset the status of the binary format handler | ||
968 | ------------------------------------------------------ | ||
969 | |||
970 | If you do a cat on the file /proc/sys/fs/binfmt_misc/status, you will get the | ||
971 | current status (enabled/disabled) of binfmt_misc. Change the status by echoing | ||
972 | 0 (disables) or 1 (enables) or -1 (caution: this clears all previously | ||
973 | registered binary formats) to status. For example echo 0 > status to disable | ||
974 | binfmt_misc (temporarily). | ||
975 | |||
976 | Status of a single handler | ||
977 | -------------------------- | ||
978 | |||
979 | Each registered handler has an entry in /proc/sys/fs/binfmt_misc. These files | ||
980 | perform the same function as status, but their scope is limited to the actual | ||
981 | binary format. Running cat on such a file also shows all related information | ||
982 | about the interpreter/magic of the binfmt. | ||
983 | |||
984 | Example usage of binfmt_misc (emulate binfmt_java) | ||
985 | -------------------------------------------------- | ||
986 | |||
987 | cd /proc/sys/fs/binfmt_misc | ||
988 | echo ':Java:M::\xca\xfe\xba\xbe::/usr/local/java/bin/javawrapper:' > register | ||
989 | echo ':HTML:E::html::/usr/local/java/bin/appletviewer:' > register | ||
990 | echo ':Applet:M::<!--applet::/usr/local/java/bin/appletviewer:' > register | ||
991 | echo ':DEXE:M::\x0eDEX::/usr/bin/dosexec:' > register | ||
992 | |||
993 | |||
994 | These four lines add support for Java executables and Java applets (like | ||
995 | binfmt_java, additionally recognizing the .html extension with no need to put | ||
996 | <!--applet> into every applet file). You have to install the JDK and the | ||
997 | shell-script /usr/local/java/bin/javawrapper too. It works around the | ||
998 | brokenness of the Java filename handling. To add a Java binary, just create a | ||
999 | link to the class-file somewhere in the path. | ||
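
The registration string has a fixed field order, so it can be assembled by a
small helper. The following Python sketch is illustrative only. The function
name binfmt_register_line is an assumption, and rejecting colons inside
fields is a simplification made here because the examples use ':' as the
separator.

```python
def binfmt_register_line(name, type_, magic, interpreter, offset="", mask=""):
    """Build a ':name:type:offset:magic:mask:interpreter:' registration string."""
    if type_ not in ("M", "E"):
        raise ValueError("type must be 'M' (magic) or 'E' (extension)")
    fields = [name, type_, str(offset), magic, mask, interpreter]
    # Simplification for this sketch: a colon inside a field would be
    # mistaken for a separator, so refuse it.
    if any(":" in field for field in fields):
        raise ValueError("fields must not contain ':'")
    return ":" + ":".join(fields) + ":"
```

Called with the values from the Java example above (name "Java", type "M",
the magic written as the literal text \xca\xfe\xba\xbe, and the javawrapper
path), it produces the same string that is echoed into register.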
1000 | |||
1001 | 2.3 /proc/sys/kernel - general kernel parameters | ||
1002 | ------------------------------------------------ | ||
1003 | |||
1004 | This directory reflects general kernel behaviors. As I've said before, the | ||
1005 | contents depend on your configuration. Here you'll find the most important | ||
1006 | files, along with descriptions of what they mean and how to use them. | ||
1007 | |||
1008 | acct | ||
1009 | ---- | ||
1010 | |||
1011 | The file contains three values: highwater, lowwater, and frequency. | ||
1012 | |||
1013 | It exists only when BSD-style process accounting is enabled. These values | ||
1014 | control its behavior. If the free space on the file system where the log lives | ||
1015 | goes below lowwater percentage, accounting suspends. If it goes above | ||
1016 | highwater percentage, accounting resumes. Frequency determines how often you | ||
1017 | check the amount of free space (value is in seconds). Default settings are: 4, | ||
1018 | 2, and 30. That is, suspend accounting if there is less than 2 percent free; | ||
1019 | resume it if we have a value of 4 or more percent; consider information about | ||
1020 | the amount of free space valid for 30 seconds. | ||
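
The suspend and resume thresholds form a simple hysteresis. The sketch below
is illustrative Python, not the kernel implementation. The function name
accounting_active is an assumption, and so is the exact handling of values
that are equal to a threshold. The defaults are the highwater and lowwater
values given above.

```python
def accounting_active(active, free_percent, highwater=4, lowwater=2):
    """Return whether accounting should be active after a free-space check."""
    if active and free_percent < lowwater:
        return False  # too little free space: suspend
    if not active and free_percent >= highwater:
        return True   # enough free space again: resume
    return active     # between the thresholds: keep the current state
```

With the defaults, a file system that drops to 1 percent free suspends
accounting, and accounting stays suspended at 3 percent until free space
reaches the highwater value.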
1021 | |||
1022 | ctrl-alt-del | ||
1023 | ------------ | ||
1024 | |||
1025 | When the value in this file is 0, ctrl-alt-del is trapped and sent to the init | ||
1026 | program to handle a graceful restart. However, when the value is greater than | ||
1027 | zero, Linux's reaction to this key combination will be an immediate reboot, | ||
1028 | without syncing its dirty buffers. | ||
1029 | |||
1030 | [NOTE] | ||
1031 | When a program (like dosemu) has the keyboard in raw mode, the | ||
1032 | ctrl-alt-del is intercepted by the program before it ever reaches the | ||
1033 | kernel tty layer, and it is up to the program to decide what to do with | ||
1034 | it. | ||
1035 | |||
1036 | domainname and hostname | ||
1037 | ----------------------- | ||
1038 | |||
1039 | These files can be used to set the NIS domainname and hostname of your | ||
1040 | box. For the classic darkstar.frop.org a simple: | ||
1041 | |||
1042 | # echo "darkstar" > /proc/sys/kernel/hostname | ||
1043 | # echo "frop.org" > /proc/sys/kernel/domainname | ||
1044 | |||
1045 | |||
1046 | would suffice to set your hostname and NIS domainname. | ||
1047 | |||
1048 | osrelease, ostype and version | ||
1049 | ----------------------------- | ||
1050 | |||
1051 | The names make it pretty obvious what these fields contain: | ||
1052 | |||
1053 | > cat /proc/sys/kernel/osrelease | ||
1054 | 2.2.12 | ||
1055 | |||
1056 | > cat /proc/sys/kernel/ostype | ||
1057 | Linux | ||
1058 | |||
1059 | > cat /proc/sys/kernel/version | ||
1060 | #4 Fri Oct 1 12:41:14 PDT 1999 | ||
1061 | |||
1062 | |||
1063 | The files osrelease and ostype should be clear enough. Version needs a little | ||
1064 | more clarification. The #4 means that this is the 4th kernel built from this | ||
1065 | source base and the date after it indicates the time the kernel was built. The | ||
1066 | only way to tune these values is to rebuild the kernel. | ||
1067 | |||
1068 | panic | ||
1069 | ----- | ||
1070 | |||
1071 | The value in this file represents the number of seconds the kernel waits | ||
1072 | before rebooting on a panic. When you use the software watchdog, the | ||
1073 | recommended setting is 60. If set to 0, the auto reboot after a kernel panic | ||
1074 | is disabled, which is the default setting. | ||
1075 | |||
1076 | printk | ||
1077 | ------ | ||
1078 | |||
1079 | The four values in printk denote | ||
1080 | * console_loglevel, | ||
1081 | * default_message_loglevel, | ||
1082 | * minimum_console_loglevel and | ||
1083 | * default_console_loglevel | ||
1084 | respectively. | ||
1085 | |||
1086 | These values influence printk() behavior when printing or logging error | ||
1087 | messages, which come from inside the kernel. See syslog(2) for more | ||
1088 | information on the different log levels. | ||
1089 | |||
1090 | console_loglevel | ||
1091 | ---------------- | ||
1092 | |||
1093 | Messages with a higher priority than this will be printed to the console. | ||
1094 | |||
1095 | default_message_loglevel | ||
1096 | ------------------------ | ||
1097 | |||
1098 | Messages without an explicit priority will be printed with this priority. | ||
1099 | |||
1100 | minimum_console_loglevel | ||
1101 | ------------------------ | ||
1102 | |||
1103 | Minimum (highest) value to which the console_loglevel can be set. | ||
1104 | |||
1105 | default_console_loglevel | ||
1106 | ------------------------ | ||
1107 | |||
1108 | Default value for console_loglevel. | ||
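
The four values can be read and applied as follows. This is an illustrative
Python sketch. The function names parse_printk and reaches_console are
assumptions, and the values in the usage note are hypothetical. Lower numbers
denote higher priority, as described in syslog(2).

```python
PRINTK_FIELDS = (
    "console_loglevel",
    "default_message_loglevel",
    "minimum_console_loglevel",
    "default_console_loglevel",
)


def parse_printk(text):
    """Map the four numbers from the printk file to their names."""
    return dict(zip(PRINTK_FIELDS, (int(value) for value in text.split())))


def reaches_console(message_level, levels):
    """True if a message of this level has higher priority than console_loglevel."""
    # Lower numeric level means higher priority.
    return message_level < levels["console_loglevel"]
```

With the hypothetical content "6 4 1 7", a message at level 3 reaches the
console and a message at level 6 does not.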
1109 | |||
1110 | sg-big-buff | ||
1111 | ----------- | ||
1112 | |||
1113 | This file shows the size of the generic SCSI (sg) buffer. At this point, you | ||
1114 | can't tune it yet, but you can change it at compile time by editing | ||
1115 | include/scsi/sg.h and changing the value of SG_BIG_BUFF. | ||
1116 | |||
1117 | If you use a scanner with SANE (Scanner Access Now Easy) you might want to set | ||
1118 | this to a higher value. Refer to the SANE documentation on this issue. | ||
1119 | |||
1120 | modprobe | ||
1121 | -------- | ||
1122 | |||
1123 | The location where the modprobe binary is located. The kernel uses this | ||
1124 | program to load modules on demand. | ||
1125 | |||
1126 | unknown_nmi_panic | ||
1127 | ----------------- | ||
1128 | |||
1129 | The value in this file affects how the kernel handles NMIs. When the value is | ||
1130 | non-zero, an unknown NMI is trapped and a panic occurs. At that time, kernel | ||
1131 | debugging information is displayed on the console. | ||
1132 | |||
1133 | For example, the NMI switch that most IA32 servers have raises an unknown NMI. | ||
1134 | If a system hangs, try pressing the NMI switch. | ||
1135 | |||
1136 | [NOTE] | ||
1137 | This function and oprofile share an NMI callback. Therefore this function | ||
1138 | cannot be enabled when oprofile is activated. | ||
1139 | The NMI watchdog will also be disabled when the value in this file is set to | ||
1140 | non-zero. | ||
1141 | |||
1142 | |||
1143 | 2.4 /proc/sys/vm - The virtual memory subsystem | ||
1144 | ----------------------------------------------- | ||
1145 | |||
1146 | The files in this directory can be used to tune the operation of the virtual | ||
1147 | memory (VM) subsystem of the Linux kernel. | ||
1148 | |||
1149 | vfs_cache_pressure | ||
1150 | ------------------ | ||
1151 | |||
1152 | Controls the tendency of the kernel to reclaim the memory which is used for | ||
1153 | caching of directory and inode objects. | ||
1154 | |||
1155 | At the default value of vfs_cache_pressure=100 the kernel will attempt to | ||
1156 | reclaim dentries and inodes at a "fair" rate with respect to pagecache and | ||
1157 | swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer | ||
1158 | to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 | ||
1159 | causes the kernel to prefer to reclaim dentries and inodes. | ||
1160 | |||
1161 | dirty_background_ratio | ||
1162 | ---------------------- | ||
1163 | |||
1164 | Contains, as a percentage of total system memory, the number of pages at which | ||
1165 | the pdflush background writeback daemon will start writing out dirty data. | ||
1166 | |||
1167 | dirty_ratio | ||
1168 | ----------- | ||
1169 | |||
1170 | Contains, as a percentage of total system memory, the number of pages at which | ||
1171 | a process which is generating disk writes will itself start writing out dirty | ||
1172 | data. | ||
1173 | |||
1174 | dirty_writeback_centisecs | ||
1175 | ------------------------- | ||
1176 | |||
1177 | The pdflush writeback daemons will periodically wake up and write `old' data | ||
1178 | out to disk. This tunable expresses the interval between those wakeups, in | ||
1179 | hundredths of a second. | ||
1180 | |||
1181 | Setting this to zero disables periodic writeback altogether. | ||
1182 | |||
1183 | dirty_expire_centisecs | ||
1184 | ---------------------- | ||
1185 | |||
1186 | This tunable is used to define when dirty data is old enough to be eligible | ||
1187 | for writeout by the pdflush daemons. It is expressed in hundredths of a second. | ||
1188 | Data which has been dirty in-memory for longer than this interval will be | ||
1189 | written out next time a pdflush daemon wakes up. | ||
1190 | |||
1191 | legacy_va_layout | ||
1192 | ---------------- | ||
1193 | |||
1194 | If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel | ||
1195 | will use the legacy (2.4) layout for all processes. | ||
1196 | |||
1197 | lower_zone_protection | ||
1198 | --------------------- | ||
1199 | |||
1200 | For some specialised workloads on highmem machines it is dangerous for | ||
1201 | the kernel to allow process memory to be allocated from the "lowmem" | ||
1202 | zone. This is because that memory could then be pinned via the mlock() | ||
1203 | system call, or by unavailability of swapspace. | ||
1204 | |||
1205 | And on large highmem machines this lack of reclaimable lowmem memory | ||
1206 | can be fatal. | ||
1207 | |||
1208 | So the Linux page allocator has a mechanism which prevents allocations | ||
1209 | which _could_ use highmem from using too much lowmem. This means that | ||
1210 | a certain amount of lowmem is defended from the possibility of being | ||
1211 | captured into pinned user memory. | ||
1212 | |||
1213 | (The same argument applies to the old 16 megabyte ISA DMA region. This | ||
1214 | mechanism will also defend that region from allocations which could use | ||
1215 | highmem or lowmem). | ||
1216 | |||
1217 | The `lower_zone_protection' tunable determines how aggressive the kernel is | ||
1218 | in defending these lower zones. The default value is zero - no | ||
1219 | protection at all. | ||
1220 | |||
1221 | If you have a machine which uses highmem or ISA DMA and your | ||
1222 | applications are using mlock(), or if you are running with no swap then | ||
1223 | you probably should increase the lower_zone_protection setting. | ||
1224 | |||
1225 | The units of this tunable are fairly vague. It is approximately equal | ||
1226 | to "megabytes". So setting lower_zone_protection=100 will protect around 100 | ||
1227 | megabytes of the lowmem zone from user allocations. It will also make | ||
1228 | those 100 megabytes unavailable for use by applications and by | ||
1229 | pagecache, so there is a cost. | ||
1230 | |||
1231 | The effects of this tunable may be observed by monitoring | ||
1232 | /proc/meminfo:LowFree. Write a single huge file and observe the point | ||
1233 | at which LowFree ceases to fall. | ||
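
The monitoring step just described can be scripted. The following is a
minimal sketch that pulls the LowFree field out of /proc/meminfo text; the
sample text is invented for the example, and the field is only expected to
be present on highmem kernels.

```python
def low_free_kb(meminfo_text):
    """Return the LowFree value in kB, or None if the field is absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("LowFree:"):
            # Line looks like "LowFree:   104820 kB"; the number is field 1.
            return int(line.split()[1])
    return None


# Invented sample; on a real system pass open("/proc/meminfo").read() instead.
sample = "MemTotal:  1034444 kB\nLowTotal:   903372 kB\nLowFree:    104820 kB\n"
print(low_free_kb(sample))
```

Calling this in a loop while writing a large file would show the point at
which LowFree stops falling.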
1234 | |||
1235 | A reasonable value for lower_zone_protection is 100. | ||
1236 | |||
1237 | page-cluster | ||
1238 | ------------ | ||
1239 | |||
1240 | page-cluster controls the number of pages which are written to swap in | ||
1241 | a single attempt, that is, the swap I/O size. | ||
1242 | |||
1243 | It is a logarithmic value - setting it to zero means "1 page", setting | ||
1244 | it to 1 means "2 pages", setting it to 2 means "4 pages", etc. | ||
1245 | |||
1246 | The default value is three (eight pages at a time). There may be some | ||
1247 | small benefits in tuning this to a different value if your workload is | ||
1248 | swap-intensive. | ||
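
Since page-cluster is a base-2 logarithm, the number of pages per swap
attempt is simply two raised to that value. A small sketch (the function
name is chosen for this example only):

```python
def pages_per_swap_io(page_cluster):
    """Translate the logarithmic page-cluster value into a page count."""
    if page_cluster < 0:
        raise ValueError("page-cluster must not be negative")
    return 1 << page_cluster


for value in range(4):
    print(value, pages_per_swap_io(value))
```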
1249 | |||
1250 | overcommit_memory | ||
1251 | ----------------- | ||
1252 | |||
1253 | This file contains one value. The following algorithm is used to decide if | ||
1254 | there's enough memory: if the value of overcommit_memory is positive, then | ||
1255 | there's always enough memory. This is a useful feature, since programs often | ||
1256 | malloc() huge amounts of memory 'just in case', while they only use a small | ||
1257 | part of it. Leaving this value at 0 will lead to the failure of such a huge | ||
1258 | malloc(), when in fact the system has enough memory for the program to run. | ||
1259 | |||
1260 | On the other hand, enabling this feature can cause you to run out of memory | ||
1261 | and thrash the system to death, so large and/or important servers will want to | ||
1262 | set this value to 0. | ||
1263 | |||
1264 | nr_hugepages and hugetlb_shm_group | ||
1265 | ---------------------------------- | ||
1266 | |||
1267 | nr_hugepages configures the number of hugetlb pages reserved for the system. | ||
1268 | |||
1269 | hugetlb_shm_group contains the group id that is allowed to create SysV shared | ||
1270 | memory segments using hugetlb pages. | ||
1271 | |||
1272 | laptop_mode | ||
1273 | ----------- | ||
1274 | |||
1275 | laptop_mode is a knob that controls "laptop mode". All the things that are | ||
1276 | controlled by this knob are discussed in Documentation/laptop-mode.txt. | ||
1277 | |||
1278 | block_dump | ||
1279 | ---------- | ||
1280 | |||
1281 | block_dump enables block I/O debugging when set to a nonzero value. More | ||
1282 | information on block I/O debugging is in Documentation/laptop-mode.txt. | ||
1283 | |||
1284 | swap_token_timeout | ||
1285 | ------------------ | ||
1286 | |||
1287 | This file contains the valid hold time of the swap out protection token. The | ||
1288 | Linux VM has a token based thrashing control mechanism and uses the token to | ||
1289 | prevent unnecessary page faults in thrashing situations. The value is in | ||
1290 | seconds. It can be used to tune thrashing behavior. | ||
1291 | |||
1292 | 2.5 /proc/sys/dev - Device specific parameters | ||
1293 | ---------------------------------------------- | ||
1294 | |||
1295 | Currently there is only support for CDROM drives, and for those, there is only | ||
1296 | one read-only file containing information about the CD-ROM drives attached to | ||
1297 | the system: | ||
1298 | |||
1299 | >cat /proc/sys/dev/cdrom/info | ||
1300 | CD-ROM information, Id: cdrom.c 2.55 1999/04/25 | ||
1301 | |||
1302 | drive name:             sr0   hdb | ||
1303 | drive speed:            32    40 | ||
1304 | drive # of slots:       1     0 | ||
1305 | Can close tray:         1     1 | ||
1306 | Can open tray:          1     1 | ||
1307 | Can lock tray:          1     1 | ||
1308 | Can change speed:       1     1 | ||
1309 | Can select disk:        0     1 | ||
1310 | Can read multisession:  1     1 | ||
1311 | Can read MCN:           1     1 | ||
1312 | Reports media changed:  1     1 | ||
1313 | Can play audio:         1     1 | ||
1314 | |||
1315 | |||
1316 | You see two drives, sr0 and hdb, along with a list of their features. | ||
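
If you want to process this file from a script, one possible approach is to
split each line at the first colon and treat the rest as one column per
drive. The parser below is only a sketch written against the sample output
shown above; it assumes every feature value is an integer.

```python
def parse_cdrom_info(text):
    """Parse /proc/sys/dev/cdrom/info style text into one dict per drive."""
    drives = []
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # blank line or line without a "key: values" shape
        key = key.strip()
        values = rest.split()
        if key == "drive name":
            drives = [{"name": value} for value in values]
        elif drives and len(values) == len(drives):
            # One column per drive, in the same order as the names.
            for drive, value in zip(drives, values):
                drive[key] = int(value)
    return drives


SAMPLE = """CD-ROM information, Id: cdrom.c 2.55 1999/04/25

drive name: sr0 hdb
drive speed: 32 40
drive # of slots: 1 0
Can select disk: 0 1
"""

for drive in parse_cdrom_info(SAMPLE):
    print(drive["name"], drive["drive speed"], drive["Can select disk"])
```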
1317 | |||
1318 | 2.6 /proc/sys/sunrpc - Remote procedure calls | ||
1319 | --------------------------------------------- | ||
1320 | |||
1321 | This directory contains four files, which enable or disable debugging for the | ||
1322 | RPC functions NFS, NFS-daemon, RPC and NLM. The default value of each is 0. They | ||
1323 | can be set to one to turn debugging on. | ||
1324 | |||
1325 | 2.7 /proc/sys/net - Networking stuff | ||
1326 | ------------------------------------ | ||
1327 | |||
1328 | The interface to the networking parts of the kernel is located in | ||
1329 | /proc/sys/net. Table 2-3 shows all possible subdirectories. You may see only | ||
1330 | some of them, depending on your kernel's configuration. | ||
1331 | |||
1332 | |||
1333 | Table 2-3: Subdirectories in /proc/sys/net | ||
1334 | .............................................................................. | ||
1335 | Directory Content             Directory  Content | ||
1336 | core      General parameter   appletalk  Appletalk protocol | ||
1337 | unix      Unix domain sockets netrom     NET/ROM | ||
1338 | 802       E802 protocol       ax25       AX25 | ||
1339 | ethernet  Ethernet protocol   rose       X.25 PLP layer | ||
1340 | ipv4      IP version 4        x25        X.25 protocol | ||
1341 | ipx       IPX                 token-ring IBM token ring | ||
1342 | bridge    Bridging            decnet     DEC net | ||
1343 | ipv6      IP version 6 | ||
1344 | .............................................................................. | ||
1345 | |||
1346 | We will concentrate on IP networking here. Since AX25, X.25, and DEC Net are | ||
1347 | only minor players in the Linux world, we'll skip them in this chapter. You'll | ||
1348 | find some short info on Appletalk and IPX further on in this chapter. Review | ||
1349 | the online documentation and the kernel source to get a detailed view of the | ||
1350 | parameters for those protocols. In this section we'll discuss the | ||
1351 | subdirectories printed in bold letters in the table above. As default values | ||
1352 | are suitable for most needs, there is no need to change these values. | ||
1353 | |||
1354 | /proc/sys/net/core - Network core options | ||
1355 | ----------------------------------------- | ||
1356 | |||
1357 | rmem_default | ||
1358 | ------------ | ||
1359 | |||
1360 | The default setting of the socket receive buffer in bytes. | ||
1361 | |||
1362 | rmem_max | ||
1363 | -------- | ||
1364 | |||
1365 | The maximum receive socket buffer size in bytes. | ||
1366 | |||
1367 | wmem_default | ||
1368 | ------------ | ||
1369 | |||
1370 | The default setting (in bytes) of the socket send buffer. | ||
1371 | |||
1372 | wmem_max | ||
1373 | -------- | ||
1374 | |||
1375 | The maximum send socket buffer size in bytes. | ||
1376 | |||
1377 | message_burst and message_cost | ||
1378 | ------------------------------ | ||
1379 | |||
1380 | These parameters are used to limit the warning messages written to the kernel | ||
1381 | log from the networking code. They enforce a rate limit so that the log cannot | ||
1382 | be flooded as a denial-of-service attack. A higher message_cost factor results | ||
1383 | in fewer messages being written. message_burst controls when messages will | ||
1384 | be dropped. The default settings limit warning messages to one every five | ||
1385 | seconds. | ||
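
The interplay of a cost and a burst value is easiest to see in a generic
token bucket. The class below is not the kernel's code; it is a sketch that
assumes both values are expressed in seconds of accumulated credit, and the
class and method names are invented for this example.

```python
class RateLimiter:
    """Generic token bucket: credit accrues with time, each message costs some."""

    def __init__(self, cost, burst):
        self.cost = cost      # credit one message consumes
        self.burst = burst    # maximum credit that can accumulate
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # time of the previous call

    def allow(self, now):
        """Return True if a message may be logged at time `now` (seconds)."""
        self.tokens = min(self.burst, self.tokens + (now - self.last))
        self.last = now
        if self.tokens >= self.cost:
            self.tokens -= self.cost
            return True
        return False


# Hypothetical settings giving one message every five seconds.
limiter = RateLimiter(cost=5.0, burst=5.0)
decisions = [limiter.allow(t) for t in (0.0, 1.0, 5.0, 6.0)]
print(decisions)
```

Raising the cost makes each message more expensive and so lets fewer
through; raising the burst lets more messages pass back to back before
dropping starts.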
1386 | |||
1387 | netdev_max_backlog | ||
1388 | ------------------ | ||
1389 | |||
1390 | Maximum number of packets queued on the INPUT side when the interface | ||
1391 | receives packets faster than the kernel can process them. | ||
1392 | |||
1393 | optmem_max | ||
1394 | ---------- | ||
1395 | |||
1396 | Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence | ||
1397 | of struct cmsghdr structures with appended data. | ||
1398 | |||
1399 | /proc/sys/net/unix - Parameters for Unix domain sockets | ||
1400 | ------------------------------------------------------- | ||
1401 | |||
1402 | There are only two files in this subdirectory. They control the delays for | ||
1403 | deleting and destroying socket descriptors. | ||
1404 | |||
1405 | 2.8 /proc/sys/net/ipv4 - IPV4 settings | ||
1406 | -------------------------------------- | ||
1407 | |||
1408 | IP version 4 is still the most used protocol in Unix networking. It will be | ||
1409 | replaced by IP version 6 in the next couple of years, but for the moment it's | ||
1410 | the de facto standard for the internet and is used in most networking | ||
1411 | environments around the world. Because of the importance of this protocol, | ||
1412 | we'll have a deeper look into the subtree controlling the behavior of the IPv4 | ||
1413 | subsystem of the Linux kernel. | ||
1414 | |||
1415 | Let's start with the entries in /proc/sys/net/ipv4. | ||
1416 | |||
1417 | ICMP settings | ||
1418 | ------------- | ||
1419 | |||
1420 | icmp_echo_ignore_all and icmp_echo_ignore_broadcasts | ||
1421 | ---------------------------------------------------- | ||
1422 | |||
1423 | Set to 1 to make the kernel ignore all ICMP ECHO requests, or just those sent to | ||
1424 | broadcast and multicast addresses, respectively. Set to 0 to answer them. | ||
1425 | |||
1426 | Please note that if you accept ICMP echo requests with a broadcast/multicast | ||
1427 | destination address your network may be used as an exploder for denial of | ||
1428 | service packet flooding attacks to other hosts. | ||
1429 | |||
1430 | icmp_destunreach_rate, icmp_echoreply_rate, icmp_paramprob_rate and icmp_timeexceed_rate | ||
1431 | ---------------------------------------------------------------------------------------- | ||
1432 | |||
1433 | Sets limits for sending ICMP packets to specific targets. A value of zero | ||
1434 | disables all limiting. Any positive value sets the maximum packet rate in | ||
1435 | hundredths of a second (on Intel systems). | ||
1436 | |||
1437 | IP settings | ||
1438 | ----------- | ||
1439 | |||
1440 | ip_autoconfig | ||
1441 | ------------- | ||
1442 | |||
1443 | This file contains the number one if the host received its IP configuration by | ||
1444 | RARP, BOOTP, DHCP or a similar mechanism. Otherwise it is zero. | ||
1445 | |||
1446 | ip_default_ttl | ||
1447 | -------------- | ||
1448 | |||
1449 | TTL (Time To Live) for IPv4 interfaces. This is simply the maximum number of | ||
1450 | hops a packet may travel. | ||
1451 | |||
1452 | ip_dynaddr | ||
1453 | ---------- | ||
1454 | |||
1455 | Enable dynamic socket address rewriting on interface address change. This is | ||
1456 | useful for dialup interfaces with changing IP addresses. | ||
1457 | |||
1458 | ip_forward | ||
1459 | ---------- | ||
1460 | |||
1461 | Enable or disable forwarding of IP packets between interfaces. Changing this | ||
1462 | value resets all other parameters to their default values. The defaults differ | ||
1463 | depending on whether the kernel is configured as a host or a router. | ||
1464 | |||
1465 | ip_local_port_range | ||
1466 | ------------------- | ||
1467 | |||
1468 | Range of ports used by TCP and UDP to choose the local port. Contains two | ||
1469 | numbers: the first is the lowest local port, the second the highest. | ||
1470 | The default is 1024-4999. It should be changed to 32768-61000 for | ||
1471 | high-usage systems. | ||
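
To see what the two ranges mentioned above mean in practice, the sketch
below parses the two numbers and counts the available local ports. It
assumes the file content is two whitespace-separated integers; the function
names are made up for this example.

```python
def parse_port_range(text):
    """Parse 'low high' as found in ip_local_port_range."""
    low, high = (int(field) for field in text.split())
    if not 0 < low <= high <= 65535:
        raise ValueError("invalid port range: %r" % text)
    return low, high


def port_count(low, high):
    """Number of local ports in the inclusive range."""
    return high - low + 1


print(port_count(*parse_port_range("1024 4999")))
print(port_count(*parse_port_range("32768 61000")))
```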
1472 | |||
1473 | ip_no_pmtu_disc | ||
1474 | --------------- | ||
1475 | |||
1476 | Global switch to turn path MTU discovery off. It can also be set on a per | ||
1477 | socket basis by the applications or on a per route basis. | ||
1478 | |||
1479 | ip_masq_debug | ||
1480 | ------------- | ||
1481 | |||
1482 | Enable/disable debugging of IP masquerading. | ||
1483 | |||
1484 | IP fragmentation settings | ||
1485 | ------------------------- | ||
1486 | |||
1487 | ipfrag_high_thresh and ipfrag_low_thresh | ||
1488 | ---------------------------------------- | ||
1489 | |||
1490 | Maximum memory used to reassemble IP fragments. When ipfrag_high_thresh bytes | ||
1491 | of memory are allocated for this purpose, the fragment handler will toss | ||
1492 | packets until ipfrag_low_thresh is reached. | ||
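
The two thresholds form a simple hysteresis: nothing is dropped below the
high mark, and once it is hit, dropping continues down to the low mark. The
sketch below models that with a list of queue sizes. Which fragments the
kernel tosses first is not stated here; oldest-first is an assumption of
this example, as are the function name and the numbers.

```python
def evict_fragments(queue_sizes, high_thresh, low_thresh):
    """Return the queue sizes that survive one eviction pass.

    queue_sizes lists the bytes held per reassembly queue, oldest first.
    """
    used = sum(queue_sizes)
    if used < high_thresh:
        return list(queue_sizes)  # below the high mark: nothing to do
    survivors = list(queue_sizes)
    while survivors and used > low_thresh:
        used -= survivors.pop(0)  # assumption: drop the oldest queue first
    return survivors


print(evict_fragments([100, 100, 100, 100], high_thresh=400, low_thresh=200))
```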
1493 | |||
1494 | ipfrag_time | ||
1495 | ----------- | ||
1496 | |||
1497 | Time in seconds to keep an IP fragment in memory. | ||
1498 | |||
1499 | TCP settings | ||
1500 | ------------ | ||
1501 | |||
1502 | tcp_ecn | ||
1503 | ------- | ||
1504 | |||
1505 | This file controls the use of the ECN bit in the IPv4 headers. ECN (Explicit | ||
1506 | Congestion Notification) is a new feature, but some routers and firewalls | ||
1507 | block traffic that has this bit set, so it may be necessary to echo 0 to | ||
1508 | /proc/sys/net/ipv4/tcp_ecn if you want to talk to these sites. For more info | ||
1509 | see RFC2481. | ||
1510 | |||
1511 | tcp_retrans_collapse | ||
1512 | -------------------- | ||
1513 | |||
1514 | Bug-to-bug compatibility with some broken printers. On retransmit, try to send | ||
1515 | larger packets to work around bugs in certain TCP stacks. Can be turned off by | ||
1516 | setting it to zero. | ||
1517 | |||
1518 | tcp_keepalive_probes | ||
1519 | -------------------- | ||
1520 | |||
1521 | Number of keep alive probes TCP sends out, until it decides that the | ||
1522 | connection is broken. | ||
1523 | |||
1524 | tcp_keepalive_time | ||
1525 | ------------------ | ||
1526 | |||
1527 | How often TCP sends out keep alive messages, when keep alive is enabled. The | ||
1528 | default is 2 hours. | ||
1529 | |||
1530 | tcp_syn_retries | ||
1531 | --------------- | ||
1532 | |||
1533 | Number of times initial SYNs for a TCP connection attempt will be | ||
1534 | retransmitted. Should not be higher than 255. This applies only to | ||
1535 | outgoing connections; for incoming connections the number of retransmits is | ||
1536 | defined by tcp_retries1. | ||
1537 | |||
1538 | tcp_sack | ||
1539 | -------- | ||
1540 | |||
1541 | Enable selective acknowledgments as defined in RFC2018. | ||
1542 | |||
1543 | tcp_timestamps | ||
1544 | -------------- | ||
1545 | |||
1546 | Enable timestamps as defined in RFC1323. | ||
1547 | |||
1548 | tcp_stdurg | ||
1549 | ---------- | ||
1550 | |||
1551 | Enable the strict RFC793 interpretation of the TCP urgent pointer field. The | ||
1552 | default is to use the BSD compatible interpretation of the urgent pointer | ||
1553 | pointing to the first byte after the urgent data. The RFC793 interpretation is | ||
1554 | to have it point to the last byte of urgent data. Enabling this option may | ||
1555 | lead to interoperability problems. Disabled by default. | ||
1556 | |||
1557 | tcp_syncookies | ||
1558 | -------------- | ||
1559 | |||
1560 | Only valid when the kernel was compiled with CONFIG_SYNCOOKIES. Send out | ||
1561 | syncookies when the syn backlog queue of a socket overflows. This is to ward | ||
1562 | off the common 'syn flood attack'. Disabled by default. | ||
1563 | |||
1564 | Note that the concept of a socket backlog is abandoned. This means the peer | ||
1565 | may not receive reliable error messages from an overloaded server with | ||
1566 | syncookies enabled. | ||
1567 | |||
1568 | tcp_window_scaling | ||
1569 | ------------------ | ||
1570 | |||
1571 | Enable window scaling as defined in RFC1323. | ||
1572 | |||
1573 | tcp_fin_timeout | ||
1574 | --------------- | ||
1575 | |||
1576 | The length of time in seconds to wait for a final FIN before the | ||
1577 | socket is forcibly closed. This is strictly a violation of the TCP | ||
1578 | specification, but required to prevent denial-of-service attacks. | ||
1579 | |||
1580 | tcp_max_ka_probes | ||
1581 | ----------------- | ||
1582 | |||
1583 | Indicates how many keep alive probes are sent per slow timer run. Should not | ||
1584 | be set too high to prevent bursts. | ||
1585 | |||
1586 | tcp_max_syn_backlog | ||
1587 | ------------------- | ||
1588 | |||
1589 | Length of the per socket backlog queue. Since Linux 2.2 the backlog specified | ||
1590 | in listen(2) only specifies the length of the backlog queue of already | ||
1591 | established sockets. When more connection requests arrive Linux starts to drop | ||
1592 | packets. When syncookies are enabled the packets are still answered and the | ||
1593 | maximum queue is effectively ignored. | ||
1594 | |||
1595 | tcp_retries1 | ||
1596 | ------------ | ||
1597 | |||
1598 | Defines how often an answer to a TCP connection request is retransmitted | ||
1599 | before giving up. | ||
1600 | |||
1601 | tcp_retries2 | ||
1602 | ------------ | ||
1603 | |||
1604 | Defines how often a TCP packet is retransmitted before giving up. | ||
1605 | |||
1606 | Interface specific settings | ||
1607 | --------------------------- | ||
1608 | |||
1609 | In the directory /proc/sys/net/ipv4/conf you'll find one subdirectory for each | ||
1610 | interface the system knows about and one directory called all. Changes in the | ||
1611 | all subdirectory affect all interfaces, whereas changes in the other | ||
1612 | subdirectories affect only one interface. All directories have the same | ||
1613 | entries: | ||
1614 | |||
1615 | accept_redirects | ||
1616 | ---------------- | ||
1617 | |||
1618 | This switch decides if the kernel accepts ICMP redirect messages or not. The | ||
1619 | default is 'yes' if the kernel is configured for a regular host and 'no' for a | ||
1620 | router configuration. | ||
1621 | |||
1622 | accept_source_route | ||
1623 | ------------------- | ||
1624 | |||
1625 | Determines whether source routed packets are accepted or declined. The default | ||
1626 | depends on the kernel configuration. It's 'yes' for routers and 'no' for | ||
1627 | hosts. | ||
1628 | |||
1629 | bootp_relay | ||
1630 | ----------- | ||
1631 | |||
1632 | Accept packets with source address 0.b.c.d with destinations not to this host | ||
1633 | as local ones. It is supposed that a BOOTP relay daemon will catch and forward | ||
1634 | such packets. | ||
1635 | |||
1636 | The default is 0, since this feature is not implemented yet (kernel version | ||
1637 | 2.2.12). | ||
1638 | |||
1639 | forwarding | ||
1640 | ---------- | ||
1641 | |||
1642 | Enable or disable IP forwarding on this interface. | ||
1643 | |||
1644 | log_martians | ||
1645 | ------------ | ||
1646 | |||
1647 | Log packets with source addresses with no known route to the kernel log. | ||
1648 | |||
1649 | mc_forwarding | ||
1650 | ------------- | ||
1651 | |||
1652 | Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE and a | ||
1653 | multicast routing daemon is required. | ||
1654 | |||
1655 | proxy_arp | ||
1656 | --------- | ||
1657 | |||
1658 | Enables (1) or disables (0) proxy ARP. | ||
1659 | |||
1660 | rp_filter | ||
1661 | --------- | ||
1662 | |||
1663 | Integer value that determines whether source validation is performed. 1 means | ||
1664 | yes, 0 means no. Disabled by default, but protection against local/broadcast | ||
1665 | address spoofing is always on. | ||
1666 | |||
1667 | If you set this to 1 on a router that is the only connection for a network to | ||
1668 | the net, it will prevent spoofing attacks against your internal networks | ||
1669 | (external addresses can still be spoofed), without the need for additional | ||
1670 | firewall rules. | ||
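
The text above only speaks of "source validation". A common way to do this
is a reverse path check: a packet is accepted only if a reply to its source
address would leave through the interface it arrived on. The sketch below
illustrates that idea with a toy routing table; the table, the interface
names and the function are all hypothetical and not taken from the kernel.

```python
import ipaddress


def reverse_path_ok(source, arrival_iface, routes):
    """Return True if `source` would be routed back out of `arrival_iface`.

    routes is a list of (network_in_cidr_notation, interface_name) pairs;
    the most specific matching network wins.
    """
    addr = ipaddress.ip_address(source)
    best = None
    for cidr, iface in routes:
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    return best is not None and best[1] == arrival_iface


# Toy router: internal network on eth1, everything else via eth0.
ROUTES = [("192.168.1.0/24", "eth1"), ("0.0.0.0/0", "eth0")]

# A packet from outside claiming an internal source address is rejected.
print(reverse_path_ok("192.168.1.5", "eth0", ROUTES))
print(reverse_path_ok("192.168.1.5", "eth1", ROUTES))
```

This also shows why external addresses can still be spoofed: any source
covered by the default route passes the check on the outside interface.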
1671 | |||
1672 | secure_redirects | ||
1673 | ---------------- | ||
1674 | |||
1675 | Accept ICMP redirect messages only for gateways listed in the default gateway | ||
1676 | list. Enabled by default. | ||
1677 | |||
1678 | shared_media | ||
1679 | ------------ | ||
1680 | |||
1681 | If it is not set the kernel does not assume that different subnets on this | ||
1682 | device can communicate directly. Default setting is 'yes'. | ||
1683 | |||
1684 | send_redirects | ||
1685 | -------------- | ||
1686 | |||
1687 | Determines whether to send ICMP redirects to other hosts. | ||
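
Because every interface directory under /proc/sys/net/ipv4/conf has the same
entries, the path of any setting can be built from the interface name and
the entry name. A minimal sketch, with helper names chosen for this example
and eth0 standing in for whatever interface your system has:

```python
import posixpath

CONF_ROOT = "/proc/sys/net/ipv4/conf"


def conf_path(interface, entry, root=CONF_ROOT):
    """Build the path of one per-interface setting."""
    return posixpath.join(root, interface, entry)


def read_conf(interface, entry, root=CONF_ROOT):
    """Read one setting as a stripped string (needs a mounted /proc)."""
    with open(conf_path(interface, entry, root)) as handle:
        return handle.read().strip()


print(conf_path("all", "log_martians"))
print(conf_path("eth0", "rp_filter"))
```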
1688 | |||
1689 | Routing settings | ||
1690 | ---------------- | ||
1691 | |||
1692 | The directory /proc/sys/net/ipv4/route contains several files to control | ||
1693 | routing issues. | ||
1694 | |||
1695 | error_burst and error_cost | ||
1696 | -------------------------- | ||
1697 | |||
1698 | These parameters limit how many ICMP destination unreachable messages the | ||
1699 | host in question sends. ICMP destination unreachable messages are | ||
1700 | sent when we cannot reach the next hop while trying to transmit a packet. | ||
1701 | The kernel will also print some error messages to its logs if someone is ignoring | ||
1702 | our ICMP redirects. The higher the error_cost factor is, the fewer | ||
1703 | destination unreachable and error messages will be let through. Error_burst | ||
1704 | controls when destination unreachable messages and error messages will be | ||
1705 | dropped. The default settings limit warning messages to five every second. | ||
1706 | |||
1707 | flush | ||
1708 | ----- | ||
1709 | |||
1710 | Writing to this file results in a flush of the routing cache. | ||
1711 | |||
1712 | gc_elasticity, gc_interval, gc_min_interval_ms, gc_timeout, gc_thresh | ||
1713 | --------------------------------------------------------------------- | ||
1714 | |||
1715 | Values to control the frequency and behavior of the garbage collection | ||
1716 | algorithm for the routing cache. gc_min_interval is deprecated and replaced | ||
1717 | by gc_min_interval_ms. | ||
1718 | |||
1719 | |||
1720 | max_size | ||
1721 | -------- | ||
1722 | |||
1723 | Maximum size of the routing cache. Old entries will be purged once the cache | ||
1724 | has reached this size. | ||
1725 | |||
1726 | max_delay, min_delay | ||
1727 | -------------------- | ||
1728 | |||
1729 | Delays for flushing the routing cache. | ||
1730 | |||
1731 | redirect_load, redirect_number | ||
1732 | ------------------------------ | ||
1733 | |||
1734 | Factors which determine if more ICMP redirects should be sent to a specific | ||
1735 | host. No redirects will be sent once the load limit or the maximum number of | ||
1736 | redirects has been reached. | ||
1737 | |||
1738 | redirect_silence | ||
1739 | ---------------- | ||
1740 | |||
1741 | Timeout for redirects. After this period redirects will be sent again, even if | ||
1742 | sending was stopped because the load or number limit had been reached. | ||
1743 | |||
1744 | Network Neighbor handling | ||
1745 | ------------------------- | ||
1746 | |||
1747 | Settings about how to handle connections with direct neighbors (nodes attached | ||
1748 | to the same link) can be found in the directory /proc/sys/net/ipv4/neigh. | ||
1749 | |||
1750 | As we saw in the conf directory, there is a default subdirectory which | ||
1751 | holds the default values, and one directory for each interface. The contents | ||
1752 | of the directories are identical, with the single exception that the default | ||
1753 | settings contain additional options to set garbage collection parameters. | ||
1754 | |||
1755 | In the interface directories you'll find the following entries: | ||
1756 | |||
1757 | base_reachable_time, base_reachable_time_ms | ||
1758 | ------------------------------------------- | ||
1759 | |||
1760 | A base value used for computing the random reachable time value as specified | ||
1761 | in RFC2461. | ||
1762 | |||
1763 | base_reachable_time, which is deprecated, is expressed in seconds. | ||
1764 | base_reachable_time_ms is expressed in milliseconds. | ||
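
RFC2461 derives the random reachable time by picking a uniformly distributed
value between MIN_RANDOM_FACTOR (0.5) and MAX_RANDOM_FACTOR (1.5) times the
base value. The sketch below shows that computation in milliseconds; the
function name and the base value used are only for illustration.

```python
import random

MIN_RANDOM_FACTOR = 0.5  # protocol constants from RFC2461
MAX_RANDOM_FACTOR = 1.5


def random_reachable_time_ms(base_reachable_time_ms, rng=random):
    """Pick a reachable time uniformly between 0.5x and 1.5x the base."""
    return rng.uniform(MIN_RANDOM_FACTOR * base_reachable_time_ms,
                       MAX_RANDOM_FACTOR * base_reachable_time_ms)


# Illustrative base value of 30 seconds, expressed in milliseconds.
print(random_reachable_time_ms(30000))
```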
1765 | |||
1766 | retrans_time, retrans_time_ms | ||
1767 | ----------------------------- | ||
1768 | |||
1769 | The time between retransmitted Neighbor Solicitation messages. | ||
1770 | Used for address resolution and to determine if a neighbor is | ||
1771 | unreachable. | ||
1772 | |||
1773 | retrans_time, which is deprecated, is expressed in 1/100 seconds (for | ||
1774 | IPv4) or in jiffies (for IPv6). | ||
1775 | retrans_time_ms is expressed in milliseconds. | ||
1776 | |||
1777 | unres_qlen | ||
1778 | ---------- | ||
1779 | |||
1780 | Maximum queue length for a pending arp request - the number of packets which | ||
1781 | are accepted from other layers while the ARP address is still being resolved. | ||
1782 | |||
1783 | anycast_delay | ||
1784 | ------------- | ||
1785 | |||
1786 | Maximum for random delay of answers to neighbor solicitation messages in | ||
1787 | jiffies (1/100 sec). Not yet implemented (Linux does not have anycast support | ||
1788 | yet). | ||
1789 | |||
1790 | ucast_solicit | ||
1791 | ------------- | ||
1792 | |||
1793 | Maximum number of retries for unicast solicitation. | ||
1794 | |||
1795 | mcast_solicit | ||
1796 | ------------- | ||
1797 | |||
1798 | Maximum number of retries for multicast solicitation. | ||
1799 | |||
1800 | delay_first_probe_time | ||
1801 | ---------------------- | ||
1802 | |||
1803 | Delay for the first time probe if the neighbor is reachable. (see | ||
1804 | gc_stale_time) | ||
1805 | |||
1806 | locktime | ||
1807 | -------- | ||
1808 | |||
1809 | An ARP/neighbor entry is only replaced with a new one if the old is at least | ||
1810 | locktime old. This prevents ARP cache thrashing. | ||
1811 | |||
1812 | proxy_delay | ||
1813 | ----------- | ||
1814 | |||
1815 | Maximum time (real time is random [0..proxytime]) before answering an ARP | ||
1816 | request for which we have a proxy ARP entry. In some cases, this is used to | ||
1817 | prevent network flooding. | ||
1818 | |||
1819 | proxy_qlen | ||
1820 | ---------- | ||
1821 | |||
1822 | Maximum queue length of the delayed proxy arp timer. (see proxy_delay). | ||
1823 | |||
1824 | app_solicit | ||
1825 | ----------- | ||
1826 | |||
1827 | Determines the number of requests to send to the user level ARP daemon. Use 0 | ||
1828 | to turn off. | ||
1829 | |||
1830 | gc_stale_time | ||
1831 | ------------- | ||
1832 | |||
1833 | Determines how often to check for stale ARP entries. After an ARP entry is | ||
1834 | stale it will be resolved again (which is useful when an IP address migrates | ||
1835 | to another machine). When ucast_solicit is greater than 0 it first tries to | ||
1836 | send an ARP packet directly to the known host. When that fails and | ||
1837 | mcast_solicit is greater than 0, an ARP request is broadcast. | ||
1838 | |||
1839 | 2.9 Appletalk | ||
1840 | ------------- | ||
1841 | |||
1842 | The /proc/sys/net/appletalk directory holds the Appletalk configuration data | ||
1843 | when Appletalk is loaded. The configurable parameters are: | ||
1844 | |||
1845 | aarp-expiry-time | ||
1846 | ---------------- | ||
1847 | |||
1848 | The amount of time we keep an ARP entry before expiring it. Used to age out | ||
1849 | old hosts. | ||
1850 | |||
1851 | aarp-resolve-time | ||
1852 | ----------------- | ||
1853 | |||
1854 | The amount of time we will spend trying to resolve an Appletalk address. | ||
1855 | |||
1856 | aarp-retransmit-limit | ||
1857 | --------------------- | ||
1858 | |||
1859 | The number of times we will retransmit a query before giving up. | ||
1860 | |||
1861 | aarp-tick-time | ||
1862 | -------------- | ||
1863 | |||
1864 | Controls the rate at which expiries are checked. | ||
1865 | |||
1866 | The directory /proc/net/appletalk holds the list of active Appletalk sockets | ||
1867 | on a machine. | ||
1868 | |||
1869 | The fields indicate the DDP type, the local address (in network:node format), | ||
1870 | the remote address, the size of the transmit pending queue, the size of the | ||
1871 | received queue (bytes waiting for applications to read), the state and the uid | ||
1872 | owning the socket. | ||
1873 | |||
1874 | /proc/net/atalk_iface lists all the interfaces configured for appletalk. It | ||
1875 | shows the name of the interface, its Appletalk address, the network range on | ||
1876 | that address (or network number for phase 1 networks), and the status of the | ||
1877 | interface. | ||
1878 | |||
1879 | /proc/net/atalk_route lists each known network route. It lists the target | ||
1880 | (network) that the route leads to, the router (may be directly connected), the | ||
1881 | route flags, and the device the route is using. | ||
1882 | |||
1883 | 2.10 IPX | ||
1884 | -------- | ||
1885 | |||
1886 | The IPX protocol has no tunable values in /proc/sys/net. | ||
1887 | |||
1888 | The IPX protocol does, however, provide /proc/net/ipx. This lists each IPX | ||
1889 | socket giving the local and remote addresses in Novell format (that is | ||
1890 | network:node:port). In accordance with the strange Novell tradition, | ||
1891 | everything but the port is in hex. Not_Connected is displayed for sockets that | ||
1892 | are not tied to a specific remote address. The Tx and Rx queue sizes indicate | ||
1893 | the number of bytes pending for transmission and reception. The state | ||
1894 | indicates the state the socket is in and the uid is the owning uid of the | ||
1895 | socket. | ||
1896 | |||
1897 | The /proc/net/ipx_interface file lists all IPX interfaces. For each interface | ||
1898 | it gives the network number, the node number, and indicates if the network is | ||
1899 | the primary network. It also indicates which device it is bound to (or | ||
1900 | Internal for internal networks) and the Frame Type if appropriate. Linux | ||
1901 | supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for | ||
1902 | IPX. | ||
1903 | |||
1904 | The /proc/net/ipx_route table holds a list of IPX routes. For each route it | ||
1905 | gives the destination network, the router node (or Directly) and the network | ||
1906 | address of the router (or Connected) for internal networks. | ||
1907 | |||
1908 | 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem | ||
1909 | ---------------------------------------------------------- | ||
1910 | |||
1911 | The "mqueue" filesystem provides the necessary kernel features to enable the | ||
1912 | creation of a user space library that implements the POSIX message queues | ||
1913 | API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System | ||
1914 | Interfaces specification.) | ||
1915 | |||
1916 | The "mqueue" filesystem contains values for determining/setting the amount of | ||
1917 | resources used by the file system. | ||
1918 | |||
1919 | /proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the | ||
1920 | maximum number of message queues allowed on the system. | ||
1921 | |||
1922 | /proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the | ||
1923 | maximum number of messages in a queue. In fact it is the upper bound for | ||
1924 | the per-queue limit that the user sets in the mq_open invocation: that | ||
1925 | attribute of a queue must be less than or equal to msg_max. | ||
1926 | |||
1927 | /proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the | ||
1928 | maximum message size value (it is every message queue's attribute set during | ||
1929 | its creation). | ||
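
A program that wants to respect these limits can read them before calling
mq_open. The following is only a sketch: the helper name read_sysctl_long is
hypothetical, and it assumes each of the files above holds a single decimal
number on its first line.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read one decimal value from a /proc/sys style file.
 * Returns 0 on success and stores the value in *out, -1 on any failure.
 * (Hypothetical helper; assumes the file holds one number on its first line.)
 */
int read_sysctl_long(const char *path, long *out)
{
    char line[64];
    char *end;
    FILE *f;

    assert(path != NULL && out != NULL);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(line, sizeof(line), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    errno = 0;
    *out = strtol(line, &end, 10);
    if (errno != 0 || end == line)
        return -1;          /* out of range, or no digits at all */
    return 0;
}
```

A caller would pass "/proc/sys/fs/mqueue/msg_max" and keep the queue's own
message limit at or below the value returned.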
1930 | |||
1931 | |||
1932 | ------------------------------------------------------------------------------ | ||
1933 | Summary | ||
1934 | ------------------------------------------------------------------------------ | ||
1935 | Certain aspects of kernel behavior can be modified at runtime, without the | ||
1936 | need to recompile the kernel, or even to reboot the system. The files in the | ||
1937 | /proc/sys tree can not only be read, but also modified. You can use the echo | ||
1938 | command to write values into these files, thereby changing the default settings | ||
1939 | of the kernel. | ||
1940 | ------------------------------------------------------------------------------ | ||
diff --git a/Documentation/filesystems/romfs.txt b/Documentation/filesystems/romfs.txt new file mode 100644 index 000000000000..2d2a7b2a16b9 --- /dev/null +++ b/Documentation/filesystems/romfs.txt | |||
@@ -0,0 +1,187 @@ | |||
1 | ROMFS - ROM FILE SYSTEM | ||
2 | |||
3 | This is a quite dumb, read only filesystem, mainly for initial RAM | ||
4 | disks of installation disks. It grew out of the need to have | ||
5 | modules linked at boot time. Using this filesystem, you get a very | ||
6 | similar feature, and even the possibility of a small kernel, with a | ||
7 | file system which doesn't take up useful memory from the router | ||
8 | functions in the basement of your office. | ||
9 | |||
10 | For comparison, both the older minix and xiafs (the latter is now | ||
11 | defunct) filesystems, compiled as modules, need more than 20000 bytes, | ||
12 | while romfs is less than a page, about 4000 bytes (assuming i586 | ||
13 | code). Under the same conditions, the msdos filesystem would need | ||
14 | about 30K (and does not support device nodes or symlinks), while the | ||
15 | nfs module with nfsroot is about 57K. Furthermore, as a bit unfair | ||
16 | comparison, an actual rescue disk used up 3202 blocks with ext2, while | ||
17 | with romfs, it needed 3079 blocks. | ||
18 | |||
19 | To create such a file system, you'll need a user program named | ||
20 | genromfs. It is available via anonymous ftp on sunsite.unc.edu and | ||
21 | its mirrors, in the /pub/Linux/system/recovery/ directory. | ||
22 | |||
23 | As the name suggests, romfs could also be used (space-efficiently) on | ||
24 | various read-only media, like (E)EPROM disks if someone will have the | ||
25 | motivation.. :) | ||
26 | |||
27 | However, the main purpose of romfs is to have a very small kernel, | ||
28 | which has only this filesystem linked in, and then can load any module | ||
29 | later, with the current module utilities. It can also be used to run | ||
30 | some program to decide if you need SCSI devices, and even IDE or | ||
31 | floppy drives can be loaded later if you use the "initrd"--initial | ||
32 | RAM disk--feature of the kernel. This would not be really news | ||
33 | flash, but with romfs, you can even spare off your ext2 or minix or | ||
34 | maybe even affs filesystem until you really know that you need it. | ||
35 | |||
36 | For example, a distribution boot disk can contain only the cd disk | ||
37 | drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem | ||
38 | module. The kernel can be small enough, since it doesn't have other | ||
39 | filesystems, like the quite large ext2fs module, which can then be | ||
40 | loaded off the CD at a later stage of the installation. Another use | ||
41 | would be for a recovery disk, when you are reinstalling a workstation | ||
42 | from the network, and you will have all the tools/modules available | ||
43 | from a nearby server, so you don't want to carry two disks for this | ||
44 | purpose, just because it won't fit into ext2. | ||
45 | |||
46 | romfs operates on block devices as you can expect, and the underlying | ||
47 | structure is very simple. Every accessible structure begins on 16 | ||
48 | byte boundaries for fast access. The minimum space a file will take | ||
49 | is 32 bytes (this is an empty file, with a less than 16 character | ||
50 | name). The maximum overhead for any non-empty file is the header, and | ||
51 | the 16 byte padding for the name and the contents, i.e. 16+14+15 = 45 | ||
52 | bytes. This is quite rare however, since most file names are longer | ||
53 | than 3 bytes, and shorter than 15 bytes. | ||
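
The alignment rule can be turned into a small size calculation. This is a
sketch only; romfs_pad16 and romfs_entry_size are hypothetical names, and the
16 byte header is the four longwords of the file header described further
down.

```c
#include <assert.h>
#include <stddef.h>

#define ROMFS_ALIGN 16u   /* every structure starts on a 16 byte boundary */

/* Round n up to the next multiple of 16. */
size_t romfs_pad16(size_t n)
{
    return (n + ROMFS_ALIGN - 1) & ~(size_t)(ROMFS_ALIGN - 1);
}

/*
 * Space taken by one file: a 16 byte header, the zero terminated name
 * padded to 16 bytes, and the contents padded to 16 bytes.
 */
size_t romfs_entry_size(size_t name_len, size_t data_size)
{
    assert(name_len > 0);   /* an entry needs at least a one character name */
    return 16 + romfs_pad16(name_len + 1) + romfs_pad16(data_size);
}
```

With this, an empty file whose name is shorter than 16 characters comes out
at 32 bytes, matching the minimum quoted above.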
54 | |||
55 | The layout of the filesystem is the following: | ||
56 | |||
57 | offset content | ||
58 | |||
59 | +---+---+---+---+ | ||
60 | 0 | - | r | o | m | \ | ||
61 | +---+---+---+---+ The ASCII representation of those bytes | ||
62 | 4 | 1 | f | s | - | / (i.e. "-rom1fs-") | ||
63 | +---+---+---+---+ | ||
64 | 8 | full size | The number of accessible bytes in this fs. | ||
65 | +---+---+---+---+ | ||
66 | 12 | checksum | The checksum of the FIRST 512 BYTES. | ||
67 | +---+---+---+---+ | ||
68 | 16 | volume name | The zero terminated name of the volume, | ||
69 | : : padded to 16 byte boundary. | ||
70 | +---+---+---+---+ | ||
71 | xx | file | | ||
72 | : headers : | ||
73 | |||
74 | Every multi byte value (32 bit words, I'll use the longwords term from | ||
75 | now on) must be in big endian order. | ||
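
As an illustration of the layout, the first longwords can be decoded like
this. It is a sketch, not the kernel's code: romfs_be32 and romfs_read_super
are hypothetical names, and the image is assumed to be already in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assemble a big endian longword from four bytes, on any host. */
uint32_t romfs_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Check the "-rom1fs-" signature at offset 0 and read the full size
 * stored at offset 8.  Returns 0 on success, -1 if this is not romfs.
 */
int romfs_read_super(const unsigned char *img, size_t len, uint32_t *full_size)
{
    assert(img != NULL && full_size != NULL);

    if (len < 16 || memcmp(img, "-rom1fs-", 8) != 0)
        return -1;
    *full_size = romfs_be32(img + 8);
    return 0;
}
```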
76 | |||
77 | The first eight bytes identify the filesystem, even for the casual | ||
78 | inspector. After that, in the 3rd longword, it contains the number of | ||
79 | bytes accessible from the start of this filesystem. The 4th longword | ||
80 | is the checksum of the first 512 bytes (or the number of bytes | ||
81 | accessible, whichever is smaller). The applied algorithm is the same | ||
82 | as in the AFFS filesystem, namely a simple sum of the longwords | ||
83 | (assuming bigendian quantities again). For details, please consult | ||
84 | the source. This algorithm was chosen because although it's not quite | ||
85 | reliable, it does not require any tables, and it is very simple. | ||
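
The sum itself can be sketched as below. The function name romfs_sum is
hypothetical. The text above does not spell out how the stored checksum is
compared; my understanding is that an image is accepted when the sum over the
covered bytes, checksum longword included, comes out as zero, but treat that
as an assumption and check it against the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sum the big endian longwords in the first 512 bytes of buf, or in all
 * len bytes if the image is smaller.  Overflow wraps modulo 2^32.
 * Trailing bytes that do not form a whole longword are ignored.
 */
uint32_t romfs_sum(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(buf != NULL);

    if (len > 512)
        len = 512;
    for (i = 0; i + 4 <= len; i += 4)
        sum += ((uint32_t)buf[i] << 24) | ((uint32_t)buf[i + 1] << 16) |
               ((uint32_t)buf[i + 2] << 8) | (uint32_t)buf[i + 3];
    return sum;
}
```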
86 | |||
87 | The following bytes are now part of the file system; each file header | ||
88 | must begin on a 16 byte boundary. | ||
89 | |||
90 | offset content | ||
91 | |||
92 | +---+---+---+---+ | ||
93 | 0 | next filehdr|X| The offset of the next file header | ||
94 | +---+---+---+---+ (zero if no more files) | ||
95 | 4 | spec.info | Info for directories/hard links/devices | ||
96 | +---+---+---+---+ | ||
97 | 8 | size | The size of this file in bytes | ||
98 | +---+---+---+---+ | ||
99 | 12 | checksum | Covering the meta data, including the file | ||
100 | +---+---+---+---+ name, and padding | ||
101 | 16 | file name | The zero terminated name of the file, | ||
102 | : : padded to 16 byte boundary | ||
103 | +---+---+---+---+ | ||
104 | xx | file data | | ||
105 | : : | ||
106 | |||
107 | Since the file headers always begin at a 16 byte boundary, the lowest | ||
108 | 4 bits would always be zero in the next filehdr pointer. These four | ||
109 | bits are used for the mode information. Bits 0..2 specify the type of | ||
110 | the file, while bit 3 shows if the file is executable or not. The | ||
111 | permissions are assumed to be world readable if this bit is not set, | ||
112 | and world executable if it is; the exceptions are character and block | ||
113 | devices, which are never accessible to anyone other than the owner. The | ||
114 | owner of every file is user and group 0; this should never be a problem | ||
115 | for the intended use. The mapping of the 8 possible values to file | ||
116 | types is the following: | ||
117 | |||
118 | mapping spec.info means | ||
119 | 0 hard link link destination [file header] | ||
120 | 1 directory first file's header | ||
121 | 2 regular file unused, must be zero [MBZ] | ||
122 | 3 symbolic link unused, MBZ (file data is the link content) | ||
123 | 4 block device 16/16 bits major/minor number | ||
124 | 5 char device - " - | ||
125 | 6 socket unused, MBZ | ||
126 | 7 fifo unused, MBZ | ||
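
Splitting the first longword of a file header into its parts could look like
the sketch below. The enum, struct and function names are hypothetical; the
masks assume the type sits in the low three bits (mask 7) and the executable
flag in the remaining one of the four low bits (mask 8).

```c
#include <assert.h>
#include <stdint.h>

/* File types, numbered as in the mapping table above. */
enum romfs_type {
    ROMFS_HARDLINK = 0, ROMFS_DIRECTORY = 1, ROMFS_REGULAR = 2,
    ROMFS_SYMLINK  = 3, ROMFS_BLOCKDEV  = 4, ROMFS_CHARDEV = 5,
    ROMFS_SOCKET   = 6, ROMFS_FIFO      = 7
};

struct romfs_next {
    uint32_t offset;        /* offset of the next file header, 0 if none */
    enum romfs_type type;   /* type of this file */
    int executable;         /* nonzero if the executable flag is set */
};

/* Decode the "next filehdr" longword (already converted from big endian). */
struct romfs_next romfs_decode_next(uint32_t word)
{
    struct romfs_next n;

    n.offset = word & ~(uint32_t)0xF;          /* clear the four mode bits */
    n.type = (enum romfs_type)(word & 0x7);    /* low three bits */
    n.executable = (word & 0x8) != 0;          /* remaining low bit */
    return n;
}
```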
127 | |||
128 | Note that hard links are specifically marked in this filesystem, but | ||
129 | they will behave as you can expect (i.e. share the inode number). | ||
130 | Note also that it is your responsibility not to create hard link | ||
131 | loops, and to create all the . and .. links for directories. This is | ||
132 | normally done correctly by the genromfs program. Please refrain from | ||
133 | using the executable bits for special purposes on the socket and fifo | ||
134 | special files, they may have other uses in the future. Additionally, | ||
135 | please remember that only regular files, and symlinks are supposed to | ||
136 | have a nonzero size field; they contain the number of bytes available | ||
137 | directly after the (padded) file name. | ||
138 | |||
139 | Another thing to note is that romfs works on file headers and data | ||
140 | aligned to 16 byte boundaries, but most hardware devices and the block | ||
141 | device drivers are unable to cope with smaller than block-sized data. | ||
142 | To overcome this limitation, the whole size of the file system must be | ||
143 | padded to a 1024 byte boundary. | ||
144 | |||
145 | If you have any problems or suggestions concerning this file system, | ||
146 | please contact me. However, think twice before wanting me to add | ||
147 | features and code, because the primary and most important advantage of | ||
148 | this file system is the small code. On the other hand, don't be | ||
149 | alarmed, I'm not getting that much romfs related mail. Now I can | ||
150 | understand why Avery wrote poems in the ARCnet docs to get some more | ||
151 | feedback. :) | ||
152 | |||
153 | romfs has also a mailing list, and to date, it hasn't received any | ||
154 | traffic, so you are welcome to join it to discuss your ideas. :) | ||
155 | |||
156 | It's run by ezmlm, so you can subscribe to it by sending a message | ||
157 | to romfs-subscribe@shadow.banki.hu, the content is irrelevant. | ||
158 | |||
159 | Pending issues: | ||
160 | |||
161 | - Permissions and owner information are pretty essential features of a | ||
162 | Un*x like system, but romfs does not provide the full possibilities. | ||
163 | I have never found this limiting, but others might. | ||
164 | |||
165 | - The file system is read only, so it can be very small, but in case | ||
166 | one would want to write _anything_ to a file system, he still needs | ||
167 | a writable file system, thus negating the size advantages. Possible | ||
168 | solutions: implement write access as a compile-time option, or a new, | ||
169 | similarly small writable filesystem for RAM disks. | ||
170 | |||
171 | - Since the files are only required to have alignment on a 16 byte | ||
172 | boundary, it is currently possibly suboptimal to read or execute files | ||
173 | from the filesystem. It might be resolved by reordering file data to | ||
174 | have most of it (i.e. except the start and the end) lying at "natural" | ||
175 | boundaries, thus it would be possible to directly map a big portion of | ||
176 | the file contents to the mm subsystem. | ||
177 | |||
178 | - Compression might be a useful feature, but memory is quite a | ||
179 | limiting factor in my eyes. | ||
180 | |||
181 | - Where is it used? | ||
182 | |||
183 | - Does it work on other architectures than intel and motorola? | ||
184 | |||
185 | |||
186 | Have fun, | ||
187 | Janos Farkas <chexum@shadow.banki.hu> | ||
diff --git a/Documentation/filesystems/smbfs.txt b/Documentation/filesystems/smbfs.txt new file mode 100644 index 000000000000..f673ef0de0f7 --- /dev/null +++ b/Documentation/filesystems/smbfs.txt | |||
@@ -0,0 +1,8 @@ | |||
1 | Smbfs is a filesystem that implements the SMB protocol, which is the | ||
2 | protocol used by Windows for Workgroups, Windows 95 and Windows NT. | ||
3 | Smbfs was inspired by Samba, the program written by Andrew Tridgell | ||
4 | that turns any Unix host into a file server for DOS or Windows clients. | ||
5 | |||
6 | Smbfs is an SMB client, but uses parts of samba for its operation. For | ||
7 | more info on samba, including documentation, please go to | ||
8 | http://www.samba.org/ and then on to your nearest mirror. | ||
diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt new file mode 100644 index 000000000000..e97d024eae77 --- /dev/null +++ b/Documentation/filesystems/sysfs-pci.txt | |||
@@ -0,0 +1,88 @@ | |||
1 | Accessing PCI device resources through sysfs | ||
2 | |||
3 | sysfs, usually mounted at /sys, provides access to PCI resources on platforms | ||
4 | that support it. For example, a given bus might look like this: | ||
5 | |||
6 | /sys/devices/pci0000:17 | ||
7 | |-- 0000:17:00.0 | ||
8 | | |-- class | ||
9 | | |-- config | ||
10 | | |-- detach_state | ||
11 | | |-- device | ||
12 | | |-- irq | ||
13 | | |-- local_cpus | ||
14 | | |-- resource | ||
15 | | |-- resource0 | ||
16 | | |-- resource1 | ||
17 | | |-- resource2 | ||
18 | | |-- rom | ||
19 | | |-- subsystem_device | ||
20 | | |-- subsystem_vendor | ||
21 | | `-- vendor | ||
22 | `-- detach_state | ||
23 | |||
24 | The topmost element describes the PCI domain and bus number. In this case, | ||
25 | the domain number is 0000 and the bus number is 17 (both values are in hex). | ||
26 | This bus contains a single function device in slot 0. The domain and bus | ||
27 | numbers are reproduced for convenience. Under the device directory are several | ||
28 | files, each with their own function. | ||
29 | |||
30 | file function | ||
31 | ---- -------- | ||
32 | class PCI class (ascii, ro) | ||
33 | config PCI config space (binary, rw) | ||
34 | detach_state connection status (bool, rw) | ||
35 | device PCI device (ascii, ro) | ||
36 | irq IRQ number (ascii, ro) | ||
37 | local_cpus nearby CPU mask (cpumask, ro) | ||
38 | resource PCI resource host addresses (ascii, ro) | ||
39 | resource0..N PCI resource N, if present (binary, mmap) | ||
40 | rom PCI ROM resource, if present (binary, ro) | ||
41 | subsystem_device PCI subsystem device (ascii, ro) | ||
42 | subsystem_vendor PCI subsystem vendor (ascii, ro) | ||
43 | vendor PCI vendor (ascii, ro) | ||
44 | |||
45 | ro - read only file | ||
46 | rw - file is readable and writable | ||
47 | mmap - file is mmapable | ||
48 | ascii - file contains ascii text | ||
49 | binary - file contains binary data | ||
50 | cpumask - file contains a cpumask type | ||
51 | |||
52 | The read only files are informational; writes to them will be ignored. | ||
53 | Writable files can be used to perform actions on the device (e.g. changing | ||
54 | config space, detaching a device). mmapable files are available via an | ||
55 | mmap of the file at offset 0 and can be used to do actual device programming | ||
56 | from userspace. Note that some platforms don't support mmapping of certain | ||
57 | resources, so be sure to check the return value from any attempted mmap. | ||
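
A userspace mapping along those lines might look like the following sketch.
The helper name map_pci_resource is hypothetical, and the caller is assumed
to know the length of the resource it wants to map.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map len bytes of a sysfs resourceN file, starting at offset 0.
 * Returns the mapping, or NULL if the open or the mmap fails (some
 * platforms do not support mmapping certain resources).
 */
void *map_pci_resource(const char *path, size_t len)
{
    void *p;
    int fd;

    assert(path != NULL && len > 0);

    fd = open(path, O_RDWR);
    if (fd < 0) {
        perror("open");
        return NULL;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (p == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }
    return p;
}
```

For the bus shown earlier the path would be something like
/sys/devices/pci0000:17/0000:17:00.0/resource0; the mapping is released
with munmap when no longer needed.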
58 | |||
59 | Accessing legacy resources through sysfs | ||
60 | |||
61 | Legacy I/O port and ISA memory resources are also provided in sysfs if the | ||
62 | underlying platform supports them. They're located in the PCI class hierarchy, | ||
63 | e.g. | ||
64 | |||
65 | /sys/class/pci_bus/0000:17/ | ||
66 | |-- bridge -> ../../../devices/pci0000:17 | ||
67 | |-- cpuaffinity | ||
68 | |-- legacy_io | ||
69 | `-- legacy_mem | ||
70 | |||
71 | The legacy_io file is a read/write file that can be used by applications to | ||
72 | do legacy port I/O. The application should open the file, seek to the desired | ||
73 | port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem | ||
74 | file should be mmapped with an offset corresponding to the memory offset | ||
75 | desired, e.g. 0xa0000 for the VGA frame buffer. The application can then | ||
76 | simply dereference the returned pointer (after checking for errors of course) | ||
77 | to access legacy memory space. | ||
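
The open/seek/read sequence for legacy_io can be sketched as follows. The
function name legacy_io_read8 is hypothetical, and only the one byte case is
shown.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read one byte from a legacy I/O port through a legacy_io file:
 * open the file, seek to the port number, read a single byte.
 * Returns 0 on success, -1 on any failure.
 */
int legacy_io_read8(const char *path, off_t port, unsigned char *val)
{
    int fd, ok;

    assert(path != NULL && val != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ok = lseek(fd, port, SEEK_SET) == port && read(fd, val, 1) == 1;
    close(fd);
    return ok ? 0 : -1;
}
```

With the example bus above, the call would be
legacy_io_read8("/sys/class/pci_bus/0000:17/legacy_io", 0x3e8, &val).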
78 | |||
79 | Supporting PCI access on new platforms | ||
80 | |||
81 | In order to support PCI resource mapping as described above, Linux platform | ||
82 | code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function. | ||
83 | Platforms are free to only support subsets of the mmap functionality, but | ||
84 | useful return codes should be provided. | ||
85 | |||
86 | Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms | ||
87 | wishing to support legacy functionality should define it and provide | ||
88 | pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions. \ No newline at end of file | ||
diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt new file mode 100644 index 000000000000..60f6c2c4d477 --- /dev/null +++ b/Documentation/filesystems/sysfs.txt | |||
@@ -0,0 +1,341 @@ | |||
1 | |||
2 | sysfs - _The_ filesystem for exporting kernel objects. | ||
3 | |||
4 | Patrick Mochel <mochel@osdl.org> | ||
5 | |||
6 | 10 January 2003 | ||
7 | |||
8 | |||
9 | What it is: | ||
10 | ~~~~~~~~~~~ | ||
11 | |||
12 | sysfs is a ram-based filesystem initially based on ramfs. It provides | ||
13 | a means to export kernel data structures, their attributes, and the | ||
14 | linkages between them to userspace. | ||
15 | |||
16 | sysfs is tied inherently to the kobject infrastructure. Please read | ||
17 | Documentation/kobject.txt for more information concerning the kobject | ||
18 | interface. | ||
19 | |||
20 | |||
21 | Using sysfs | ||
22 | ~~~~~~~~~~~ | ||
23 | |||
24 | sysfs is always compiled in. You can access it by doing: | ||
25 | |||
26 | mount -t sysfs sysfs /sys | ||
27 | |||
28 | |||
29 | Directory Creation | ||
30 | ~~~~~~~~~~~~~~~~~~ | ||
31 | |||
32 | For every kobject that is registered with the system, a directory is | ||
33 | created for it in sysfs. That directory is created as a subdirectory | ||
34 | of the kobject's parent, expressing internal object hierarchies to | ||
35 | userspace. Top-level directories in sysfs represent the common | ||
36 | ancestors of object hierarchies; i.e. the subsystems the objects | ||
37 | belong to. | ||
38 | |||
39 | Sysfs internally stores the kobject that owns the directory in the | ||
40 | ->d_fsdata pointer of the directory's dentry. This allows sysfs to do | ||
41 | reference counting directly on the kobject when the file is opened and | ||
42 | closed. | ||
43 | |||
44 | |||
45 | Attributes | ||
46 | ~~~~~~~~~~ | ||
47 | |||
48 | Attributes can be exported for kobjects in the form of regular files in | ||
49 | the filesystem. Sysfs forwards file I/O operations to methods defined | ||
50 | for the attributes, providing a means to read and write kernel | ||
51 | attributes. | ||
52 | |||
53 | Attributes should be ASCII text files, preferably with only one value | ||
54 | per file. It is noted that it may not be efficient to contain only one | ||
55 | value per file, so it is socially acceptable to express an array of | ||
56 | values of the same type. | ||
57 | |||
58 | Mixing types, expressing multiple lines of data, and doing fancy | ||
59 | formatting of data is heavily frowned upon. Doing these things may get | ||
60 | you publicly humiliated and your code rewritten without notice. | ||
61 | |||
62 | |||
63 | An attribute definition is simply: | ||
64 | |||
65 | struct attribute { | ||
66 | char * name; | ||
67 | mode_t mode; | ||
68 | }; | ||
69 | |||
70 | |||
71 | int sysfs_create_file(struct kobject * kobj, struct attribute * attr); | ||
72 | void sysfs_remove_file(struct kobject * kobj, struct attribute * attr); | ||
73 | |||
74 | |||
75 | A bare attribute contains no means to read or write the value of the | ||
76 | attribute. Subsystems are encouraged to define their own attribute | ||
77 | structure and wrapper functions for adding and removing attributes for | ||
78 | a specific object type. | ||
79 | |||
80 | For example, the driver model defines struct device_attribute like: | ||
81 | |||
82 | struct device_attribute { | ||
83 | struct attribute attr; | ||
84 | ssize_t (*show)(struct device * dev, char * buf); | ||
85 | ssize_t (*store)(struct device * dev, const char * buf); | ||
86 | }; | ||
87 | |||
88 | int device_create_file(struct device *, struct device_attribute *); | ||
89 | void device_remove_file(struct device *, struct device_attribute *); | ||
90 | |||
91 | It also defines this helper for defining device attributes: | ||
92 | |||
93 | #define DEVICE_ATTR(_name,_mode,_show,_store) \ | ||
94 | struct device_attribute dev_attr_##_name = { \ | ||
95 | .attr = {.name = __stringify(_name) , .mode = _mode }, \ | ||
96 | .show = _show, \ | ||
97 | .store = _store, \ | ||
98 | }; | ||
99 | |||
100 | For example, declaring | ||
101 | |||
102 | static DEVICE_ATTR(foo,0644,show_foo,store_foo); | ||
103 | |||
104 | is equivalent to doing: | ||
105 | |||
106 | static struct device_attribute dev_attr_foo = { | ||
107 | .attr = { | ||
108 | .name = "foo", | ||
109 | .mode = 0644, | ||
110 | }, | ||
111 | .show = show_foo, | ||
112 | .store = store_foo, | ||
113 | }; | ||
114 | |||
115 | |||
116 | Subsystem-Specific Callbacks | ||
117 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
118 | |||
119 | When a subsystem defines a new attribute type, it must implement a | ||
120 | set of sysfs operations for forwarding read and write calls to the | ||
121 | show and store methods of the attribute owners. | ||
122 | |||
123 | struct sysfs_ops { | ||
124 | ssize_t (*show)(struct kobject *, struct attribute *,char *); | ||
125 | ssize_t (*store)(struct kobject *,struct attribute *,const char *); | ||
126 | }; | ||
127 | |||
128 | [ Subsystems should have already defined a struct kobj_type as a | ||
129 | descriptor for this type, which is where the sysfs_ops pointer is | ||
130 | stored. See the kobject documentation for more information. ] | ||
131 | |||
132 | When a file is read or written, sysfs calls the appropriate method | ||
133 | for the type. The method then translates the generic struct kobject | ||
134 | and struct attribute pointers to the appropriate pointer types, and | ||
135 | calls the associated methods. | ||
136 | |||
137 | |||
138 | To illustrate: | ||
139 | |||
140 | #define to_dev_attr(_attr) container_of(_attr,struct device_attribute,attr) | ||
141 | #define to_dev(d) container_of(d, struct device, kobj) | ||
142 | |||
143 | static ssize_t | ||
144 | dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf) | ||
145 | { | ||
146 | struct device_attribute * dev_attr = to_dev_attr(attr); | ||
147 | struct device * dev = to_dev(kobj); | ||
148 | ssize_t ret = 0; | ||
149 | |||
150 | if (dev_attr->show) | ||
151 | ret = dev_attr->show(dev,buf); | ||
152 | return ret; | ||
153 | } | ||
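
The same translation can be tried outside the kernel. The sketch below is a
userspace mock, not kernel code: the structures are cut-down stand-ins with
made-up fields, container_of is rebuilt from offsetof, and show_id is a
hypothetical show method.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>   /* ssize_t */

/* Userspace rebuild of container_of: step back from a member to its owner. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Cut-down stand-ins for the kernel structures (fields are made up). */
struct kobject   { const char *name; };
struct attribute { const char *name; };

struct device {
    int id;
    struct kobject kobj;            /* embedded generic object */
};

struct device_attribute {
    struct attribute attr;          /* embedded generic attribute */
    ssize_t (*show)(struct device *dev, char *buf);
};

#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
#define to_dev(d) container_of(d, struct device, kobj)

/* Hypothetical show method: print the device id into buf. */
ssize_t show_id(struct device *dev, char *buf)
{
    return sprintf(buf, "%d\n", dev->id);
}

/* Generic entry point: recover the specific types, then dispatch. */
ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, char *buf)
{
    struct device_attribute *dev_attr = to_dev_attr(attr);
    struct device *dev = to_dev(kobj);

    assert(buf != NULL);
    return dev_attr->show ? dev_attr->show(dev, buf) : 0;
}
```

The generic layer only ever sees the embedded kobject and attribute; the
pointer arithmetic recovers the enclosing device and device_attribute.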
154 | |||
155 | |||
156 | |||
157 | Reading/Writing Attribute Data | ||
158 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
159 | |||
160 | To read or write attributes, show() or store() methods must be | ||
161 | specified when declaring the attribute. The method types should be as | ||
162 | simple as those defined for device attributes: | ||
163 | |||
164 | ssize_t (*show)(struct device * dev, char * buf); | ||
165 | ssize_t (*store)(struct device * dev, const char * buf); | ||
166 | |||
167 | IOW, they should take only an object and a buffer as parameters. | ||
168 | |||
169 | |||
170 | sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the | ||
171 | method. Sysfs will call the method exactly once for each read or | ||
172 | write. This forces the following behavior on the method | ||
173 | implementations: | ||
174 | |||
175 | - On read(2), the show() method should fill the entire buffer. | ||
176 | Recall that an attribute should only be exporting one value, or an | ||
177 | array of similar values, so this shouldn't be that expensive. | ||
178 | |||
179 | This allows userspace to do partial reads and seeks arbitrarily over | ||
180 | the entire file at will. | ||
181 | |||
182 | - On write(2), sysfs expects the entire buffer to be passed during the | ||
183 | first write. Sysfs then passes the entire buffer to the store() | ||
184 | method. | ||
185 | |||
186 | When writing sysfs files, userspace processes should first read the | ||
187 | entire file, modify the values it wishes to change, then write the | ||
188 | entire buffer back. | ||
189 | |||
190 | Attribute method implementations should operate on an identical | ||
191 | buffer when reading and writing values. | ||
192 | |||
193 | Other notes: | ||
194 | |||
195 | - The buffer will always be PAGE_SIZE bytes in length. On i386, this | ||
196 | is 4096. | ||
197 | |||
198 | - show() methods should return the number of bytes printed into the | ||
199 | buffer. This is the return value of snprintf(). | ||
200 | |||
201 | - show() should always use snprintf(). | ||
202 | |||
203 | - store() should return the number of bytes used from the buffer. This | ||
204 | can be done using strlen(). | ||
205 | |||
206 | - show() or store() can always return errors. If a bad value comes | ||
207 | through, be sure to return an error. | ||
208 | |||
209 | - The object passed to the methods will be pinned in memory via sysfs | ||
210 | reference counting its embedded object. However, the physical | ||
211 | entity (e.g. device) the object represents may not be present. Be | ||
212 | sure to have a way to check this, if necessary. | ||
213 | |||
214 | |||
215 | A very simple (and naive) implementation of a device attribute is: | ||
216 | |||
217 | static ssize_t show_name(struct device * dev, char * buf) | ||
218 | { | ||
219 | return snprintf(buf,PAGE_SIZE,"%s\n",dev->name); | ||
220 | } | ||
221 | |||
222 | static ssize_t store_name(struct device * dev, const char * buf) | ||
223 | { | ||
224 | sscanf(buf,"%20s",dev->name); | ||
225 | return strlen(buf); | ||
226 | } | ||
227 | |||
228 | static DEVICE_ATTR(name,S_IRUGO,show_name,store_name); | ||
229 | |||
230 | |||
231 | (Note that the real implementation doesn't allow userspace to set the | ||
232 | name for a device.) | ||
233 | |||
234 | |||
235 | Top Level Directory Layout | ||
236 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
237 | |||
238 | The sysfs directory arrangement exposes the relationship of kernel | ||
239 | data structures. | ||
240 | |||
241 | The top level sysfs directory looks like: | ||
242 | |||
243 | block/ | ||
244 | bus/ | ||
245 | class/ | ||
246 | devices/ | ||
247 | firmware/ | ||
248 | net/ | ||
249 | |||
250 | devices/ contains a filesystem representation of the device tree. It maps | ||
251 | directly to the internal kernel device tree, which is a hierarchy of | ||
252 | struct device. | ||
253 | |||
254 | bus/ contains flat directory layout of the various bus types in the | ||
255 | kernel. Each bus's directory contains two subdirectories: | ||
256 | |||
257 | devices/ | ||
258 | drivers/ | ||
259 | |||
260 | devices/ contains symlinks for each device discovered in the system | ||
261 | that point to the device's directory under devices/. | ||
262 | |||
263 | drivers/ contains a directory for each device driver that is loaded | ||
264 | for devices on that particular bus (this assumes that drivers do not | ||
265 | span multiple bus types). | ||
266 | |||
267 | |||
268 | More information on driver-model specific features can be found in | ||
269 | Documentation/driver-model/. | ||
270 | |||
271 | |||
272 | TODO: Finish this section. | ||
273 | |||
274 | |||
275 | Current Interfaces | ||
276 | ~~~~~~~~~~~~~~~~~~ | ||
277 | |||
278 | The following interface layers currently exist in sysfs: | ||
279 | |||
280 | |||
281 | - devices (include/linux/device.h) | ||
282 | ---------------------------------- | ||
283 | Structure: | ||
284 | |||
285 | struct device_attribute { | ||
286 | struct attribute attr; | ||
287 | ssize_t (*show)(struct device * dev, char * buf); | ||
288 | ssize_t (*store)(struct device * dev, const char * buf); | ||
289 | }; | ||
290 | |||
291 | Declaring: | ||
292 | |||
293 | DEVICE_ATTR(_name,_mode,_show,_store); | ||
294 | |||
295 | Creation/Removal: | ||
296 | |||
297 | int device_create_file(struct device *device, struct device_attribute * attr); | ||
298 | void device_remove_file(struct device * dev, struct device_attribute * attr); | ||
299 | |||
300 | |||
301 | - bus drivers (include/linux/device.h) | ||
302 | -------------------------------------- | ||
303 | Structure: | ||
304 | |||
305 | struct bus_attribute { | ||
306 | struct attribute attr; | ||
307 | ssize_t (*show)(struct bus_type *, char * buf); | ||
308 | ssize_t (*store)(struct bus_type *, const char * buf); | ||
309 | }; | ||
310 | |||
311 | Declaring: | ||
312 | |||
313 | BUS_ATTR(_name,_mode,_show,_store) | ||
314 | |||
315 | Creation/Removal: | ||
316 | |||
317 | int bus_create_file(struct bus_type *, struct bus_attribute *); | ||
318 | void bus_remove_file(struct bus_type *, struct bus_attribute *); | ||
319 | |||
320 | |||
321 | - device drivers (include/linux/device.h) | ||
322 | ----------------------------------------- | ||
323 | |||
324 | Structure: | ||
325 | |||
326 | struct driver_attribute { | ||
327 | struct attribute attr; | ||
328 | ssize_t (*show)(struct device_driver *, char * buf); | ||
329 | ssize_t (*store)(struct device_driver *, const char * buf); | ||
330 | }; | ||
331 | |||
332 | Declaring: | ||
333 | |||
334 | DRIVER_ATTR(_name,_mode,_show,_store) | ||
335 | |||
336 | Creation/Removal: | ||
337 | |||
338 | int driver_create_file(struct device_driver *, struct driver_attribute *); | ||
339 | void driver_remove_file(struct device_driver *, struct driver_attribute *); | ||
340 | |||
341 | |||
diff --git a/Documentation/filesystems/sysv-fs.txt b/Documentation/filesystems/sysv-fs.txt new file mode 100644 index 000000000000..d81722418010 --- /dev/null +++ b/Documentation/filesystems/sysv-fs.txt | |||
@@ -0,0 +1,38 @@ | |||
1 | This is the implementation of the SystemV/Coherent filesystem for Linux. | ||
2 | It implements all of | ||
3 | - Xenix FS, | ||
4 | - SystemV/386 FS, | ||
5 | - Coherent FS. | ||
6 | |||
7 | This is version beta 4. | ||
8 | |||
9 | To install: | ||
10 | * Answer the 'System V and Coherent filesystem support' question with 'y' | ||
11 | when configuring the kernel. | ||
12 | * To mount a disk or a partition, use | ||
13 | mount [-r] -t sysv device mountpoint | ||
14 | The file system type names | ||
15 | -t sysv | ||
16 | -t xenix | ||
17 | -t coherent | ||
18 | may be used interchangeably, but the last two will eventually disappear. | ||
19 | |||
20 | Bugs in the present implementation: | ||
21 | - Coherent FS: | ||
22 | - The "free list interleave" n:m is currently ignored. | ||
23 | - Only file systems with no filesystem name and no pack name are recognized. | ||
24 | (See Coherent "man mkfs" for a description of these features.) | ||
25 | - SystemV Release 2 FS: | ||
26 | The superblock is only searched in the blocks 9, 15, 18, which | ||
27 | correspond to the beginning of track 1 on floppy disks. No support | ||
28 | for this FS on hard disk yet. | ||
29 | |||
30 | |||
31 | Please report any bugs and suggestions to | ||
32 | Bruno Haible <haible@ma2s2.mathematik.uni-karlsruhe.de> | ||
33 | Pascal Haible <haible@izfm.uni-stuttgart.de> | ||
34 | Krzysztof G. Baranowski <kgb@manjak.knm.org.pl> | ||
35 | |||
36 | Bruno Haible | ||
37 | <haible@ma2s2.mathematik.uni-karlsruhe.de> | ||
38 | |||
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt new file mode 100644 index 000000000000..417e3095fe39 --- /dev/null +++ b/Documentation/filesystems/tmpfs.txt | |||
@@ -0,0 +1,100 @@ | |||
1 | Tmpfs is a file system which keeps all files in virtual memory. | ||
2 | |||
3 | |||
4 | Everything in tmpfs is temporary in the sense that no files will be | ||
5 | created on your hard drive. If you unmount a tmpfs instance, | ||
6 | everything stored therein is lost. | ||
7 | |||
8 | tmpfs puts everything into the kernel internal caches and grows and | ||
9 | shrinks to accommodate the files it contains and is able to swap | ||
10 | unneeded pages out to swap space. It has maximum size limits which can | ||
11 | be adjusted on the fly via 'mount -o remount ...' | ||
12 | |||
13 | If you compare it to ramfs (which was the template to create tmpfs) | ||
14 | you gain swapping and limit checking. Another similar thing is the RAM | ||
15 | disk (/dev/ram*), which simulates a fixed size hard disk in physical | ||
16 | RAM, where you have to create an ordinary filesystem on top. Ramdisks | ||
17 | cannot swap and you do not have the possibility to resize them. | ||
18 | |||
19 | Since tmpfs lives completely in the page cache and on swap, all tmpfs | ||
20 | pages currently in memory will show up as cached. It will not show up | ||
21 | as shared or something like that. Further on you can check the actual | ||
22 | RAM+swap use of a tmpfs instance with df(1) and du(1). | ||
23 | |||
24 | |||
25 | tmpfs has the following uses: | ||
26 | |||
27 | 1) There is always a kernel internal mount which you will not see at | ||
28 | all. This is used for shared anonymous mappings and SYSV shared | ||
29 | memory. | ||
30 | |||
31 | This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not | ||
32 | set, the user visible part of tmpfs is not built. But the internal | ||
33 | mechanisms are always present. | ||
34 | |||
35 | 2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for | ||
36 | POSIX shared memory (shm_open, shm_unlink). Adding the following | ||
37 | line to /etc/fstab should take care of this: | ||
38 | |||
39 | tmpfs /dev/shm tmpfs defaults 0 0 | ||
40 | |||
41 | Remember to create the directory that you intend to mount tmpfs on | ||
42 | if necessary (/dev/shm is automagically created if you use devfs). | ||
43 | |||
44 | This mount is _not_ needed for SYSV shared memory. The internal | ||
45 | mount is used for that. (In the 2.3 kernel versions it was | ||
46 | necessary to mount the predecessor of tmpfs (shm fs) to use SYSV | ||
47 | shared memory) | ||
48 | |||
49 | 3) Some people (including me) find it very convenient to mount it | ||
50 | e.g. on /tmp and /var/tmp and have a big swap partition. And now | ||
51 | loop mounts of tmpfs files do work, so mkinitrd shipped by most | ||
52 | distributions should succeed with a tmpfs /tmp. | ||
53 | |||
54 | 4) And probably a lot more I do not know about :-) | ||
55 | |||
56 | |||
57 | tmpfs has three mount options for sizing: | ||
58 | |||
59 | size: The limit of allocated bytes for this tmpfs instance. The | ||
60 | default is half of your physical RAM without swap. If you | ||
61 | oversize your tmpfs instances the machine will deadlock | ||
62 | since the OOM handler will not be able to free that memory. | ||
63 | nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE. | ||
64 | nr_inodes: The maximum number of inodes for this instance. The default | ||
65 | is half of the number of your physical RAM pages, or (on a | ||
66 | machine with highmem) the number of lowmem RAM pages, | ||
67 | whichever is the lower. | ||
68 | |||
69 | These parameters accept a suffix k, m or g for kilo, mega and giga and | ||
70 | can be changed on remount. The size parameter also accepts a suffix % | ||
71 | to limit this tmpfs instance to that percentage of your physical RAM: | ||
72 | the default, when neither size nor nr_blocks is specified, is size=50% | ||
73 | |||
74 | If both nr_blocks (or size) and nr_inodes are set to 0, neither blocks | ||
75 | nor inodes will be limited in that instance. It is generally unwise to | ||
76 | mount with such options, since it allows any user with write access to | ||
77 | use up all the memory on the machine; but enhances the scalability of | ||
78 | that instance in a system with many cpus making intensive use of it. | ||
79 | |||
80 | |||
81 | To specify the initial root directory you can use the following mount | ||
82 | options: | ||
83 | |||
84 | mode: The permissions as an octal number | ||
85 | uid: The user id | ||
86 | gid: The group id | ||
87 | |||
88 | These options do not have any effect on remount. You can change these | ||
89 | parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem. | ||
90 | |||
91 | |||
92 | So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs' | ||
93 | will give you a tmpfs instance on /mytmpfs which can allocate 10GB | ||
94 | RAM/SWAP in 10240 inodes and it is only accessible by root. | ||
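The suffix handling described above can be sketched in C. This is only a userspace illustration, not the kernel's parser: the name parse_size_suffix is hypothetical, and it assumes that k, m and g are powers of 1024, which is consistent with nr_inodes=10k giving 10240 inodes in the example. The % suffix accepted by the size option is not handled here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert a string such as "10k" or "10G" into a
 * plain number, treating k, m and g as powers of 1024 (an assumption).
 * Returns 0 on malformed input.
 */
static unsigned long long parse_size_suffix(const char *s)
{
    char *end;
    unsigned long long value;

    assert(s != NULL);
    value = strtoull(s, &end, 10);
    if (end == s)
        return 0;               /* no digits at all */

    switch (tolower((unsigned char)*end)) {
    case 'g':
        value <<= 10;
        /* fall through */
    case 'm':
        value <<= 10;
        /* fall through */
    case 'k':
        value <<= 10;
        end++;                  /* consume the suffix character */
        break;
    case '\0':
        break;                  /* plain number, no suffix */
    default:
        return 0;               /* unknown suffix */
    }
    /* Reject trailing garbage such as "10kb". */
    return *end == '\0' ? value : 0;
}
```

With this sketch, "10k" yields 10240 and "10G" yields 10 * 2^30 bytes, matching the figures quoted in the mount example.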
95 | |||
96 | |||
97 | Author: | ||
98 | Christoph Rohland <cr@sap.com>, 1.12.01 | ||
99 | Updated: | ||
100 | Hugh Dickins <hugh@veritas.com>, 01 September 2004 | ||
diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.txt new file mode 100644 index 000000000000..e5213bc301f7 --- /dev/null +++ b/Documentation/filesystems/udf.txt | |||
@@ -0,0 +1,57 @@ | |||
1 | * | ||
2 | * Documentation/filesystems/udf.txt | ||
3 | * | ||
4 | UDF Filesystem version 0.9.8.1 | ||
5 | |||
6 | If you encounter problems with reading UDF discs using this driver, | ||
7 | please report them to linux_udf@hpesjro.fc.hp.com, which is the | ||
8 | developer's list. | ||
9 | |||
10 | Write support requires a block driver which supports writing. The current | ||
11 | scsi and ide cdrom drivers do not support writing. | ||
12 | |||
13 | ------------------------------------------------------------------------------- | ||
14 | The following mount options are supported: | ||
15 | |||
16 | gid= Set the default group. | ||
17 | umask= Set the default umask. | ||
18 | uid= Set the default user. | ||
19 | bs= Set the block size. | ||
20 | unhide Show otherwise hidden files. | ||
21 | undelete Show deleted files in lists. | ||
22 | adinicb Embed data in the inode (default) | ||
23 | noadinicb Don't embed data in the inode | ||
24 | shortad Use short ad's | ||
25 | longad Use long ad's (default) | ||
26 | nostrict Unset strict conformance | ||
27 | iocharset= Set the NLS character set | ||
28 | |||
29 | The remaining are for debugging and disaster recovery: | ||
30 | |||
31 | novrs Skip volume sequence recognition | ||
32 | |||
33 | The following expect an offset from 0. | ||
34 | |||
35 | session= Set the CDROM session (default= last session) | ||
36 | anchor= Override standard anchor location. (default= 256) | ||
37 | volume= Override the VolumeDesc location. (unused) | ||
38 | partition= Override the PartitionDesc location. (unused) | ||
39 | lastblock= Set the last block of the filesystem. | ||
40 | |||
41 | The following expect an offset from the partition root. | ||
42 | |||
43 | fileset= Override the fileset block location. (unused) | ||
44 | rootdir= Override the root directory location. (unused) | ||
45 | WARNING: overriding the rootdir to a non-directory may | ||
46 | yield highly unpredictable results. | ||
47 | ------------------------------------------------------------------------------- | ||
48 | |||
49 | |||
50 | For the latest version and toolset see: | ||
51 | http://linux-udf.sourceforge.net/ | ||
52 | |||
53 | Documentation on UDF and ECMA 167 is available FREE from: | ||
54 | http://www.osta.org/ | ||
55 | http://www.ecma-international.org/ | ||
56 | |||
57 | Ben Fennema <bfennema@falcon.csc.calpoly.edu> | ||
diff --git a/Documentation/filesystems/ufs.txt b/Documentation/filesystems/ufs.txt new file mode 100644 index 000000000000..2b5a56a6a558 --- /dev/null +++ b/Documentation/filesystems/ufs.txt | |||
@@ -0,0 +1,61 @@ | |||
1 | USING UFS | ||
2 | ========= | ||
3 | |||
4 | mount -t ufs -o ufstype=type_of_ufs device dir | ||
5 | |||
6 | |||
7 | UFS OPTIONS | ||
8 | =========== | ||
9 | |||
10 | ufstype=type_of_ufs | ||
11 | UFS is a file system widely used in different operating systems. | ||
12 | The problem is the differences among implementations. Features of | ||
13 | some implementations are undocumented, so it is hard to recognize | ||
14 | the type of ufs automatically. That is why the user must specify the | ||
15 | type of ufs manually with the mount option ufstype. Possible values are: | ||
16 | |||
17 | old old format of ufs | ||
18 | default value, supported as read-only | ||
19 | |||
20 | 44bsd used in FreeBSD, NetBSD, OpenBSD | ||
21 | supported as read-write | ||
22 | |||
23 | ufs2 used in FreeBSD 5.x | ||
24 | supported as read-only | ||
25 | |||
26 | 5xbsd synonym for ufs2 | ||
27 | |||
28 | sun used in SunOS (Solaris) | ||
29 | supported as read-write | ||
30 | |||
31 | sunx86 used in SunOS for Intel (Solarisx86) | ||
32 | supported as read-write | ||
33 | |||
34 | hp used in HP-UX | ||
35 | supported as read-only | ||
36 | |||
37 | nextstep | ||
38 | used in NextStep | ||
39 | supported as read-only | ||
40 | |||
41 | nextstep-cd | ||
42 | used for NextStep CDROMs (block_size == 2048) | ||
43 | supported as read-only | ||
44 | |||
45 | openstep | ||
46 | used in OpenStep | ||
47 | supported as read-only | ||
48 | |||
49 | |||
50 | POSSIBLE PROBLEMS | ||
51 | ================= | ||
52 | |||
53 | There is still a bug in the reallocation of fragments, in file fs/ufs/balloc.c, | ||
54 | line 364, but it seems to work with the current buffer cache configuration. | ||
55 | |||
56 | |||
57 | BUG REPORTS | ||
58 | =========== | ||
59 | |||
60 | Send any ufs bug reports to daniel.pirkl@email.cz (do not send | ||
61 | partition table bug reports). | ||
diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt new file mode 100644 index 000000000000..5ead20c6c744 --- /dev/null +++ b/Documentation/filesystems/vfat.txt | |||
@@ -0,0 +1,231 @@ | |||
1 | USING VFAT | ||
2 | ---------------------------------------------------------------------- | ||
3 | To use the vfat filesystem, use the filesystem type 'vfat', e.g. | ||
4 | mount -t vfat /dev/fd0 /mnt | ||
5 | |||
6 | No special partition formatter is required. mkdosfs will work fine | ||
7 | if you want to format from within Linux. | ||
8 | |||
9 | VFAT MOUNT OPTIONS | ||
10 | ---------------------------------------------------------------------- | ||
11 | umask=### -- The permission mask (for files and directories, see umask(1)). | ||
12 | The default is the umask of current process. | ||
13 | |||
14 | dmask=### -- The permission mask for the directory. | ||
15 | The default is the umask of current process. | ||
16 | |||
17 | fmask=### -- The permission mask for files. | ||
18 | The default is the umask of current process. | ||
19 | |||
20 | codepage=### -- Sets the codepage number for converting to shortname | ||
21 | characters on FAT filesystem. | ||
22 | By default, FAT_DEFAULT_CODEPAGE setting is used. | ||
23 | |||
24 | iocharset=name -- Character set to use for converting between the | ||
25 | encoding used for user-visible filenames and 16-bit | ||
26 | Unicode characters. Long filenames are stored on disk | ||
27 | in Unicode format, but Unix for the most part doesn't | ||
28 | know how to deal with Unicode. | ||
29 | By default, FAT_DEFAULT_IOCHARSET setting is used. | ||
30 | |||
31 | There is also an option of doing UTF8 translations | ||
32 | with the utf8 option. | ||
33 | |||
34 | NOTE: "iocharset=utf8" is not recommended. If unsure, | ||
35 | you should consider the following option instead. | ||
36 | |||
37 | utf8=<bool> -- UTF8 is the filesystem safe version of Unicode that | ||
38 | is used by the console. It can be enabled for the | ||
39 | filesystem with this option. If 'uni_xlate' gets set, | ||
40 | UTF8 gets disabled. | ||
41 | |||
42 | uni_xlate=<bool> -- Translate unhandled Unicode characters to special | ||
43 | escaped sequences. This would let you backup and | ||
44 | restore filenames that are created with any Unicode | ||
45 | characters. Until Linux supports Unicode for real, | ||
46 | this gives you an alternative. Without this option, | ||
47 | a '?' is used when no translation is possible. The | ||
48 | escape character is ':' because it is otherwise | ||
49 | illegal on the vfat filesystem. The escape sequence | ||
50 | that gets used is ':' and the four digits of hexadecimal | ||
51 | unicode. | ||
52 | |||
53 | nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will | ||
54 | end in '~1' or tilde followed by some number. If this | ||
55 | option is set, then if the filename is | ||
56 | "longfilename.txt" and "longfile.txt" does not | ||
57 | currently exist in the directory, 'longfile.txt' will | ||
58 | be the short alias instead of 'longfi~1.txt'. | ||
59 | |||
60 | quiet -- Stops printing certain warning messages. | ||
61 | |||
62 | check=s|r|n -- Case sensitivity checking setting. | ||
63 | s: strict, case sensitive | ||
64 | r: relaxed, case insensitive | ||
65 | n: normal, default setting, currently case insensitive | ||
66 | |||
67 | shortname=lower|win95|winnt|mixed | ||
68 | -- Shortname display/create setting. | ||
69 | lower: convert to lowercase for display, | ||
70 | emulate the Windows 95 rule for create. | ||
71 | win95: emulate the Windows 95 rule for display/create. | ||
72 | winnt: emulate the Windows NT rule for display/create. | ||
73 | mixed: emulate the Windows NT rule for display, | ||
74 | emulate the Windows 95 rule for create. | ||
75 | Default setting is `lower'. | ||
76 | |||
77 | <bool>: 0,1,yes,no,true,false | ||
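The uni_xlate escape described above (':' followed by the four hexadecimal digits of the character) can be illustrated with a short C sketch. The function name is hypothetical, and the use of lowercase hex digits is an assumption, since the text does not specify the case.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write the escape sequence for one 16-bit Unicode
 * character into buf, which must hold at least 6 bytes:
 * ':' + four hex digits + terminating NUL.
 * Lowercase digits are an assumption of this sketch.
 */
static void uni_xlate_escape(unsigned int ch, char buf[6])
{
    assert(buf != NULL);
    /* Mask to 16 bits so the output is always exactly four digits. */
    snprintf(buf, 6, ":%04x", ch & 0xffffu);
}
```

A filename character with no translation in the selected character set would then appear as a five-character sequence rather than as '?'.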
78 | |||
79 | TODO | ||
80 | ---------------------------------------------------------------------- | ||
81 | * Need to get rid of the raw scanning stuff. Instead, always use | ||
82 | a get next directory entry approach. The only thing left that uses | ||
83 | raw scanning is the directory renaming code. | ||
84 | |||
85 | |||
86 | POSSIBLE PROBLEMS | ||
87 | ---------------------------------------------------------------------- | ||
88 | * vfat_valid_longname does not properly check reserved names. | ||
89 | * When a volume name is the same as a directory name in the root | ||
90 | directory of the filesystem, the directory name sometimes shows | ||
91 | up as an empty file. | ||
92 | * autoconv option does not work correctly. | ||
93 | |||
94 | BUG REPORTS | ||
95 | ---------------------------------------------------------------------- | ||
96 | If you have trouble with the VFAT filesystem, mail bug reports to | ||
97 | chaffee@bmrc.cs.berkeley.edu. Please specify the filename | ||
98 | and the operation that gave you trouble. | ||
99 | |||
100 | TEST SUITE | ||
101 | ---------------------------------------------------------------------- | ||
102 | If you plan to make any modifications to the vfat filesystem, please | ||
103 | get the test suite that comes with the vfat distribution at | ||
104 | |||
105 | http://bmrc.berkeley.edu/people/chaffee/vfat.html | ||
106 | |||
107 | This tests quite a few parts of the vfat filesystem and additional | ||
108 | tests for new features or untested features would be appreciated. | ||
109 | |||
110 | NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM | ||
111 | ---------------------------------------------------------------------- | ||
112 | (This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu> | ||
113 | and lightly annotated by Gordon Chaffee). | ||
114 | |||
115 | This document presents a very rough, technical overview of my | ||
116 | knowledge of the extended FAT file system used in Windows NT 3.5 and | ||
117 | Windows 95. I don't guarantee that any of the following is correct, | ||
118 | but it appears to be so. | ||
119 | |||
120 | The extended FAT file system is almost identical to the FAT | ||
121 | file system used in DOS versions up to and including 6.223410239847 | ||
122 | :-). The significant change has been the addition of long file names. | ||
123 | These names support up to 255 characters including spaces and lower | ||
124 | case characters as opposed to the traditional 8.3 short names. | ||
125 | |||
126 | Here is the description of the traditional FAT entry in the current | ||
127 | Windows 95 filesystem: | ||
128 | |||
129 | struct directory { // Short 8.3 names | ||
130 | unsigned char name[8]; // file name | ||
131 | unsigned char ext[3]; // file extension | ||
132 | unsigned char attr; // attribute byte | ||
133 | unsigned char lcase; // Case for base and extension | ||
134 | unsigned char ctime_ms; // Creation time, milliseconds | ||
135 | unsigned char ctime[2]; // Creation time | ||
136 | unsigned char cdate[2]; // Creation date | ||
137 | unsigned char adate[2]; // Last access date | ||
138 | unsigned char reserved[2]; // reserved values (ignored) | ||
139 | unsigned char time[2]; // time stamp | ||
140 | unsigned char date[2]; // date stamp | ||
141 | unsigned char start[2]; // starting cluster number | ||
142 | unsigned char size[4]; // size of the file | ||
143 | }; | ||
144 | |||
145 | The lcase field specifies if the base and/or the extension of an 8.3 | ||
146 | name should be capitalized. This field does not seem to be used by | ||
147 | Windows 95 but it is used by Windows NT. The case of filenames is not | ||
148 | completely compatible from Windows NT to Windows 95. It is not completely | ||
149 | compatible in the reverse direction, however. Filenames that fit in | ||
150 | the 8.3 namespace and are written on Windows NT to be lowercase will | ||
151 | show up as uppercase on Windows 95. | ||
152 | |||
153 | Note that the "start" and "size" values are actually little | ||
154 | endian integer values. The descriptions of the fields in this | ||
155 | structure are public knowledge and can be found elsewhere. | ||
156 | |||
157 | With the extended FAT system, Microsoft has inserted extra | ||
158 | directory entries for any files with extended names. (Any name which | ||
159 | legally fits within the old 8.3 encoding scheme does not have extra | ||
160 | entries.) I call these extra entries slots. Basically, a slot is a | ||
161 | specially formatted directory entry which holds up to 13 characters of | ||
162 | a file's extended name. Think of slots as additional labeling for the | ||
163 | directory entry of the file to which they correspond. Microsoft | ||
164 | prefers to refer to the 8.3 entry for a file as its alias and the | ||
165 | extended slot directory entries as the file name. | ||
166 | |||
167 | The C structure for a slot directory entry follows: | ||
168 | |||
169 | struct slot { // Up to 13 characters of a long name | ||
170 | unsigned char id; // sequence number for slot | ||
171 | unsigned char name0_4[10]; // first 5 characters in name | ||
172 | unsigned char attr; // attribute byte | ||
173 | unsigned char reserved; // always 0 | ||
174 | unsigned char alias_checksum; // checksum for 8.3 alias | ||
175 | unsigned char name5_10[12]; // 6 more characters in name | ||
176 | unsigned char start[2]; // starting cluster number | ||
177 | unsigned char name11_12[4]; // last 2 characters in name | ||
178 | }; | ||
179 | |||
180 | If the layout of the slots looks a little odd, it's only | ||
181 | because of Microsoft's efforts to maintain compatibility with old | ||
182 | software. The slots must be disguised to prevent old software from | ||
183 | panicking. To this end, a number of measures are taken: | ||
184 | |||
185 | 1) The attribute byte for a slot directory entry is always set | ||
186 | to 0x0f. This corresponds to an old directory entry with | ||
187 | attributes of "hidden", "system", "read-only", and "volume | ||
188 | label". Most old software will ignore any directory | ||
189 | entries with the "volume label" bit set. Real volume label | ||
190 | entries don't have the other three bits set. | ||
191 | |||
192 | 2) The starting cluster is always set to 0, an impossible | ||
193 | value for a DOS file. | ||
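The two disguise rules above can be tested in code. The following is a minimal sketch that reuses the struct slot layout given earlier; the helper name looks_like_slot and the ATTR_* macro names are chosen here for illustration and should not be read as the driver's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Slot layout as given in the text: 32 bytes, all unsigned char. */
struct slot {
    unsigned char id;
    unsigned char name0_4[10];
    unsigned char attr;
    unsigned char reserved;
    unsigned char alias_checksum;
    unsigned char name5_10[12];
    unsigned char start[2];
    unsigned char name11_12[4];
};

/* Classic FAT attribute bits (macro names are illustrative). */
#define ATTR_RO      0x01   /* read-only */
#define ATTR_HIDDEN  0x02   /* hidden */
#define ATTR_SYS     0x04   /* system */
#define ATTR_VOLUME  0x08   /* volume label */
#define ATTR_SLOT    (ATTR_RO | ATTR_HIDDEN | ATTR_SYS | ATTR_VOLUME)

/* Returns nonzero if the entry has attr 0x0f and starting cluster 0. */
static int looks_like_slot(const struct slot *s)
{
    assert(s != NULL);
    return s->attr == ATTR_SLOT && s->start[0] == 0 && s->start[1] == 0;
}
```

A real volume label has only the 0x08 bit set, so it fails the attribute comparison and is not mistaken for a slot.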
194 | |||
195 | Because the extended FAT system is backward compatible, it is | ||
196 | possible for old software to modify directory entries. Measures must | ||
197 | be taken to ensure the validity of slots. An extended FAT system can | ||
198 | verify that a slot does in fact belong to an 8.3 directory entry by | ||
199 | the following: | ||
200 | |||
201 | 1) Positioning. Slots for a file always immediately precede | ||
202 | their corresponding 8.3 directory entry. In addition, each | ||
203 | slot has an id which marks its order in the extended file | ||
204 | name. Here is a very abbreviated view of an 8.3 directory | ||
205 | entry and its corresponding long name slots for the file | ||
206 | "My Big File.Extension which is long": | ||
207 | |||
208 | <preceding files...> | ||
209 | <slot #3, id = 0x43, characters = "h is long"> | ||
210 | <slot #2, id = 0x02, characters = "xtension whic"> | ||
211 | <slot #1, id = 0x01, characters = "My Big File.E"> | ||
212 | <directory entry, name = "MYBIGFIL.EXT"> | ||
213 | |||
214 | Note that the slots are stored from last to first. Slots | ||
215 | are numbered from 1 to N. The Nth slot is or'ed with 0x40 | ||
216 | to mark it as the last one. | ||
217 | |||
218 | 2) Checksum. Each slot has an "alias_checksum" value. The | ||
219 | checksum is calculated from the 8.3 name using the | ||
220 | following algorithm: | ||
221 | |||
222 | for (sum = i = 0; i < 11; i++) { | ||
223 | sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i]; | ||
224 | } | ||
225 | |||
226 | 3) If there is free space in the final slot, a Unicode NULL (0x0000) | ||
227 | is stored after the final character. After that, all unused | ||
228 | characters in the final slot are set to Unicode 0xFFFF. | ||
229 | |||
230 | Finally, note that the extended name is stored in Unicode. Each Unicode | ||
231 | character takes two bytes. | ||
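The checksum and slot numbering rules above can be written out as compilable C. This is a sketch under stated assumptions: the function names are hypothetical, and sum is taken to be an 8-bit value, which the rotate-by-one expression in the algorithm implies but the text does not state outright.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Checksum over the 11 bytes of the 8.3 name (8 name + 3 extension):
 * rotate the running 8-bit sum right by one bit, then add the next byte.
 */
static unsigned char alias_checksum(const unsigned char name[11])
{
    unsigned char sum = 0;
    int i;

    assert(name != NULL);
    for (i = 0; i < 11; i++)
        sum = (unsigned char)((((sum & 1) << 7) | ((sum & 0xfe) >> 1))
                              + name[i]);
    return sum;
}

/* Number of 13-character slots needed for a long name of len characters. */
static unsigned int slots_needed(size_t len)
{
    return (unsigned int)((len + 12) / 13);
}

/* id byte of slot n (numbered from 1) out of total slots. */
static unsigned char slot_id(unsigned int n, unsigned int total)
{
    assert(n >= 1 && n <= total);
    /* The last slot has 0x40 or'ed into its sequence number. */
    return (unsigned char)(n == total ? (n | 0x40) : n);
}
```

For the 35-character name "My Big File.Extension which is long", slots_needed() gives 3 and slot_id(3, 3) gives 0x43, which agrees with the directory listing shown in the positioning rule.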
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt new file mode 100644 index 000000000000..3f318dd44c77 --- /dev/null +++ b/Documentation/filesystems/vfs.txt | |||
@@ -0,0 +1,671 @@ | |||
1 | /* -*- auto-fill -*- */ | ||
2 | |||
3 | Overview of the Virtual File System | ||
4 | |||
5 | Richard Gooch <rgooch@atnf.csiro.au> | ||
6 | |||
7 | 5-JUL-1999 | ||
8 | |||
9 | |||
10 | Conventions used in this document <section> | ||
11 | ================================= | ||
12 | |||
13 | Each section in this document will have the string "<section>" at the | ||
14 | right-hand side of the section title. Each subsection will have | ||
15 | "<subsection>" at the right-hand side. These strings are meant to make | ||
16 | it easier to search through the document. | ||
17 | |||
18 | NOTE that the master copy of this document is available online at: | ||
19 | http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt | ||
20 | |||
21 | |||
22 | What is it? <section> | ||
23 | =========== | ||
24 | |||
25 | The Virtual File System (otherwise known as the Virtual Filesystem | ||
26 | Switch) is the software layer in the kernel that provides the | ||
27 | filesystem interface to userspace programs. It also provides an | ||
28 | abstraction within the kernel which allows different filesystem | ||
29 | implementations to co-exist. | ||
30 | |||
31 | |||
32 | A Quick Look At How It Works <section> | ||
33 | ============================ | ||
34 | |||
35 | In this section I'll briefly describe how things work, before | ||
36 | launching into the details. I'll start with describing what happens | ||
37 | when user programs open and manipulate files, and then look from the | ||
38 | other view which is how a filesystem is supported and subsequently | ||
39 | mounted. | ||
40 | |||
41 | Opening a File <subsection> | ||
42 | -------------- | ||
43 | |||
44 | The VFS implements the open(2), stat(2), chmod(2) and similar system | ||
45 | calls. The pathname argument is used by the VFS to search through the | ||
46 | directory entry cache (dentry cache or "dcache"). This provides a very | ||
47 | fast look-up mechanism to translate a pathname (filename) into a | ||
48 | specific dentry. | ||
49 | |||
50 | An individual dentry usually has a pointer to an inode. Inodes are the | ||
51 | things that live on disc drives, and can be regular files (you know: | ||
52 | those things that you write data into), directories, FIFOs and other | ||
53 | beasts. Dentries live in RAM and are never saved to disc: they exist | ||
54 | only for performance. Inodes live on disc and are copied into memory | ||
55 | when required. Later any changes are written back to disc. The inode | ||
56 | that lives in RAM is a VFS inode, and it is this which the dentry | ||
57 | points to. A single inode can be pointed to by multiple dentries | ||
58 | (think about hardlinks). | ||
59 | |||
60 | The dcache is meant to be a view into your entire filespace. Unlike | ||
61 | Linus, most of us losers can't fit enough dentries into RAM to cover | ||
62 | all of our filespace, so the dcache has bits missing. In order to | ||
63 | resolve your pathname into a dentry, the VFS may have to resort to | ||
64 | creating dentries along the way, and then loading the inode. This is | ||
65 | done by looking up the inode. | ||
66 | |||
67 | To look up an inode (usually read from disc) requires that the VFS | ||
68 | calls the lookup() method of the parent directory inode. This method | ||
69 | is installed by the specific filesystem implementation that the inode | ||
70 | lives in. There will be more on this later. | ||
71 | |||
72 | Once the VFS has the required dentry (and hence the inode), we can do | ||
73 | all those boring things like open(2) the file, or stat(2) it to peek | ||
74 | at the inode data. The stat(2) operation is fairly simple: once the | ||
75 | VFS has the dentry, it peeks at the inode data and passes some of it | ||
76 | back to userspace. | ||
77 | |||
78 | Opening a file requires another operation: allocation of a file | ||
79 | structure (this is the kernel-side implementation of file | ||
80 | descriptors). The freshly allocated file structure is initialised with | ||
81 | a pointer to the dentry and a set of file operation member functions. | ||
82 | These are taken from the inode data. The open() file method is then | ||
83 | called so the specific filesystem implementation can do its work. You | ||
84 | can see that this is another switch performed by the VFS. | ||
85 | |||
86 | The file structure is placed into the file descriptor table for the | ||
87 | process. | ||
88 | |||
89 | Reading, writing and closing files (and other assorted VFS operations) | ||
90 | is done by using the userspace file descriptor to grab the appropriate | ||
91 | file structure, and then calling the required file structure method | ||
92 | function to do whatever is required. | ||
93 | |||
94 | For as long as the file is open, it keeps the dentry "open" (in use), | ||
95 | which in turn means that the VFS inode is still in use. | ||
96 | |||
97 | All VFS system calls (i.e. open(2), stat(2), read(2), write(2), | ||
98 | chmod(2) and so on) are called from a process context. You should | ||
99 | assume that these calls are made without any kernel locks being | ||
100 | held. This means that the processes may be executing the same piece of | ||
101 | filesystem or driver code at the same time, on different | ||
102 | processors. You should ensure that access to shared resources is | ||
103 | protected by appropriate locks. | ||
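The "switch" performed on open can be pictured with a small userspace model. Everything below is hypothetical: the model_ and toy_ names, the structure members and the two-method operations table are invented for illustration and are far simpler than the kernel's real structures. The sketch only shows the pattern of copying a method table from the inode into a freshly initialised file structure and then calling through it.

```c
#include <assert.h>
#include <stddef.h>

struct model_file;

/* Per-filesystem methods; the VFS layer calls through these pointers. */
struct model_file_ops {
    int (*open)(struct model_file *);
    long (*read)(struct model_file *, char *, size_t);
};

struct model_inode {
    const struct model_file_ops *default_fops;  /* set by the filesystem */
};

struct model_dentry {
    struct model_inode *inode;
};

struct model_file {
    struct model_dentry *dentry;
    const struct model_file_ops *f_op;
    int opened;
};

/* Initialise the file structure from the dentry, then call open(). */
static int model_open(struct model_file *f, struct model_dentry *d)
{
    assert(f != NULL && d != NULL && d->inode != NULL);
    f->dentry = d;
    f->f_op = d->inode->default_fops;   /* methods come from the inode */
    f->opened = 0;
    return f->f_op->open ? f->f_op->open(f) : 0;
}

/* A toy filesystem implementation supplying its own methods. */
static int toy_open(struct model_file *f)
{
    f->opened = 1;
    return 0;
}

static long toy_read(struct model_file *f, char *buf, size_t n)
{
    (void)f;
    if (n == 0)
        return 0;
    buf[0] = 'x';                       /* pretend file content */
    return 1;
}

static const struct model_file_ops toy_fops = { toy_open, toy_read };
```

Later operations on the open file go through f_op in the same way, so the generic layer never needs to know which filesystem it is talking to.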
104 | |||
105 | Registering and Mounting a Filesystem <subsection> | ||
106 | ------------------------------------- | ||
107 | |||
108 | If you want to support a new kind of filesystem in the kernel, all you | ||
109 | need to do is call register_filesystem(). You pass a structure | ||
110 | describing the filesystem implementation (struct file_system_type) | ||
111 | which is then added to an internal table of supported filesystems. You | ||
112 | can do: | ||
113 | |||
114 | % cat /proc/filesystems | ||
115 | |||
116 | to see what filesystems are currently available on your system. | ||
117 | |||
118 | When a request is made to mount a block device onto a directory in | ||
119 | your filespace the VFS will call the appropriate method for the | ||
120 | specific filesystem. The dentry for the mount point will then be | ||
121 | updated to point to the root inode for the new filesystem. | ||
122 | |||
123 | It's now time to look at things in more detail. | ||
124 | |||
125 | |||
126 | struct file_system_type <section> | ||
127 | ======================= | ||
128 | |||
129 | This describes the filesystem. As of kernel 2.1.99, the following | ||
130 | members are defined: | ||
131 | |||
132 | struct file_system_type { | ||
133 | const char *name; | ||
134 | int fs_flags; | ||
135 | struct super_block *(*read_super) (struct super_block *, void *, int); | ||
136 | struct file_system_type * next; | ||
137 | }; | ||
138 | |||
139 | name: the name of the filesystem type, such as "ext2", "iso9660", | ||
140 | "msdos" and so on | ||
141 | |||
142 | fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE) | ||
143 | |||
144 | read_super: the method to call when a new instance of this | ||
145 | filesystem should be mounted | ||
146 | |||
147 | next: for internal VFS use: you should initialise this to NULL | ||
148 | |||
149 | The read_super() method has the following arguments: | ||
150 | |||
151 | struct super_block *sb: the superblock structure. This is partially | ||
152 | initialised by the VFS and the rest must be initialised by the | ||
153 | read_super() method | ||
154 | |||
155 | void *data: arbitrary mount options, usually comes as an ASCII | ||
156 | string | ||
157 | |||
158 | int silent: whether or not to be silent on error | ||
159 | |||
160 | The read_super() method must determine if the block device specified | ||
161 | in the superblock contains a filesystem of the type the method | ||
162 | supports. On success the method returns the superblock pointer, on | ||
163 | failure it returns NULL. | ||
164 | |||
165 | The most interesting member of the superblock structure that the | ||
166 | read_super() method fills in is the "s_op" field. This is a pointer to | ||
167 | a "struct super_operations" which describes the next level of the | ||
168 | filesystem implementation. | ||
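The idea of an internal table of filesystem types linked through the "next" member can be modelled in userspace. This is a hypothetical sketch, not the kernel's register_filesystem(): the model_ names are invented, the read_super member is left out, and rejecting duplicate names is a design choice of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cut-down stand-in for struct file_system_type (no read_super here). */
struct model_fs_type {
    const char *name;
    int fs_flags;
    struct model_fs_type *next;     /* internal use: initialise to NULL */
};

static struct model_fs_type *model_fs_list;    /* head of the table */

/* Append fs to the table; returns 0 on success, -1 if the name exists. */
static int model_register_filesystem(struct model_fs_type *fs)
{
    struct model_fs_type **p;

    assert(fs != NULL);
    for (p = &model_fs_list; *p; p = &(*p)->next)
        if (strcmp((*p)->name, fs->name) == 0)
            return -1;
    *p = fs;                        /* link in at the tail */
    return 0;
}

/* Look a type up by name, as a mount request would need to. */
static struct model_fs_type *model_find_filesystem(const char *name)
{
    struct model_fs_type *p;

    for (p = model_fs_list; p; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}
```

Because the list is threaded through the caller's own structure, the caller has to leave "next" as NULL and must keep the structure alive for as long as it is registered.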
169 | |||
170 | |||
171 | struct super_operations <section> | ||
172 | ======================= | ||
173 | |||
174 | This describes how the VFS can manipulate the superblock of your | ||
175 | filesystem. As of kernel 2.1.99, the following members are defined: | ||
176 | |||
177 | struct super_operations { | ||
178 | void (*read_inode) (struct inode *); | ||
179 | int (*write_inode) (struct inode *, int); | ||
180 | void (*put_inode) (struct inode *); | ||
181 | void (*drop_inode) (struct inode *); | ||
182 | void (*delete_inode) (struct inode *); | ||
183 | int (*notify_change) (struct dentry *, struct iattr *); | ||
184 | void (*put_super) (struct super_block *); | ||
185 | void (*write_super) (struct super_block *); | ||
186 | int (*statfs) (struct super_block *, struct statfs *, int); | ||
187 | int (*remount_fs) (struct super_block *, int *, char *); | ||
188 | void (*clear_inode) (struct inode *); | ||
189 | }; | ||
190 | |||
191 | All methods are called without any locks being held, unless otherwise | ||
192 | noted. This means that most methods can block safely. All methods are | ||
193 | only called from a process context (i.e. not from an interrupt handler | ||
194 | or bottom half). | ||
195 | |||
196 | read_inode: this method is called to read a specific inode from the | ||
197 | mounted filesystem. The "i_ino" member in the "struct inode" | ||
198 | will be initialised by the VFS to indicate which inode to | ||
199 | read. Other members are filled in by this method | ||
200 | |||
201 | write_inode: this method is called when the VFS needs to write an | ||
202 | inode to disc. The second parameter indicates whether the write | ||
203 | should be synchronous or not; not all filesystems check this flag. | ||
204 | |||
205 | put_inode: called when the VFS inode is removed from the inode | ||
206 | cache. This method is optional | ||
207 | |||
208 | drop_inode: called when the last access to the inode is dropped, | ||
209 | with the inode_lock spinlock held. | ||
210 | |||
211 | This method should be either NULL (normal unix filesystem | ||
212 | semantics) or "generic_delete_inode" (for filesystems that do not | ||
213 | want to cache inodes - causing "delete_inode" to always be | ||
214 | called regardless of the value of i_nlink) | ||
215 | |||
216 | The "generic_delete_inode()" behaviour is equivalent to the | ||
217 | old practice of using "force_delete" in the put_inode() case, | ||
218 | but does not have the races that the "force_delete()" approach | ||
219 | had. | ||
220 | |||
221 | delete_inode: called when the VFS wants to delete an inode | ||
222 | |||
223 | notify_change: called when VFS inode attributes are changed. If this | ||
224 | is NULL the VFS falls back to the write_inode() method. This | ||
225 | is called with the kernel lock held | ||
226 | |||
227 | put_super: called when the VFS wishes to free the superblock | ||
228 | (i.e. unmount). This is called with the superblock lock held | ||
229 | |||
230 | write_super: called when the VFS superblock needs to be written to | ||
231 | disc. This method is optional | ||
232 | |||
233 | statfs: called when the VFS needs to get filesystem statistics. This | ||
234 | is called with the kernel lock held | ||
235 | |||
236 | remount_fs: called when the filesystem is remounted. This is called | ||
237 | with the kernel lock held | ||
238 | |||
239 | clear_inode: called when the VFS clears the inode. Optional | ||
240 | |||
241 | The read_inode() method is responsible for filling in the "i_op" | ||
242 | field. This is a pointer to a "struct inode_operations" which | ||
243 | describes the methods that can be performed on individual inodes. | ||
244 | |||
245 | |||
246 | struct inode_operations <section> | ||
247 | ======================= | ||
248 | |||
249 | This describes how the VFS can manipulate an inode in your | ||
250 | filesystem. As of kernel 2.1.99, the following members are defined: | ||
251 | |||
252 | struct inode_operations { | ||
253 | struct file_operations * default_file_ops; | ||
254 | int (*create) (struct inode *,struct dentry *,int); | ||
255 | int (*lookup) (struct inode *,struct dentry *); | ||
256 | int (*link) (struct dentry *,struct inode *,struct dentry *); | ||
257 | int (*unlink) (struct inode *,struct dentry *); | ||
258 | int (*symlink) (struct inode *,struct dentry *,const char *); | ||
259 | int (*mkdir) (struct inode *,struct dentry *,int); | ||
260 | int (*rmdir) (struct inode *,struct dentry *); | ||
261 | int (*mknod) (struct inode *,struct dentry *,int,dev_t); | ||
262 | int (*rename) (struct inode *, struct dentry *, | ||
263 | struct inode *, struct dentry *); | ||
264 | int (*readlink) (struct dentry *, char *,int); | ||
265 | struct dentry * (*follow_link) (struct dentry *, struct dentry *); | ||
266 | int (*readpage) (struct file *, struct page *); | ||
267 | int (*writepage) (struct page *page, struct writeback_control *wbc); | ||
268 | int (*bmap) (struct inode *,int); | ||
269 | void (*truncate) (struct inode *); | ||
270 | int (*permission) (struct inode *, int); | ||
271 | int (*smap) (struct inode *,int); | ||
272 | int (*updatepage) (struct file *, struct page *, const char *, | ||
273 | unsigned long, unsigned int, int); | ||
274 | int (*revalidate) (struct dentry *); | ||
275 | }; | ||
276 | |||
277 | Again, all methods are called without any locks being held, unless | ||
278 | otherwise noted. | ||
279 | |||
280 | default_file_ops: this is a pointer to a "struct file_operations" | ||
281 | which describes how to open and then manipulate open files | ||
282 | |||
283 | create: called by the open(2) and creat(2) system calls. Only | ||
284 | required if you want to support regular files. The dentry you | ||
285 | get should not have an inode (i.e. it should be a negative | ||
286 | dentry). Here you will probably call d_instantiate() with the | ||
287 | dentry and the newly created inode | ||
288 | |||
289 | lookup: called when the VFS needs to look up an inode in a parent | ||
290 | directory. The name to look for is found in the dentry. This | ||
291 | method must call d_add() to insert the found inode into the | ||
292 | dentry. The "i_count" field in the inode structure should be | ||
293 | incremented. If the named inode does not exist a NULL inode | ||
294 | should be inserted into the dentry (this is called a negative | ||
295 | dentry). Returning an error code from this routine must only | ||
296 | be done on a real error, otherwise creating inodes with system | ||
297 | calls like creat(2), mknod(2), mkdir(2) and so on will fail. | ||
298 | If you wish to overload the dentry methods then you should | ||
299 | initialise the "d_op" field in the dentry; this is a pointer | ||
300 | to a struct "dentry_operations". | ||
301 | This method is called with the directory inode semaphore held | ||
302 | |||
303 | link: called by the link(2) system call. Only required if you want | ||
304 | to support hard links. You will probably need to call | ||
305 | d_instantiate() just as you would in the create() method | ||
306 | |||
307 | unlink: called by the unlink(2) system call. Only required if you | ||
308 | want to support deleting inodes | ||
309 | |||
310 | symlink: called by the symlink(2) system call. Only required if you | ||
311 | want to support symlinks. You will probably need to call | ||
312 | d_instantiate() just as you would in the create() method | ||
313 | |||
314 | mkdir: called by the mkdir(2) system call. Only required if you want | ||
315 | to support creating subdirectories. You will probably need to | ||
316 | call d_instantiate() just as you would in the create() method | ||
317 | |||
318 | rmdir: called by the rmdir(2) system call. Only required if you want | ||
319 | to support deleting subdirectories | ||
320 | |||
321 | mknod: called by the mknod(2) system call to create a device (char, | ||
322 | block) inode or a named pipe (FIFO) or socket. Only required | ||
323 | if you want to support creating these types of inodes. You | ||
324 | will probably need to call d_instantiate() just as you would | ||
325 | in the create() method | ||
326 | |||
327 | readlink: called by the readlink(2) system call. Only required if | ||
328 | you want to support reading symbolic links | ||
329 | |||
330 | follow_link: called by the VFS to follow a symbolic link to the | ||
331 | inode it points to. Only required if you want to support | ||
332 | symbolic links | ||
333 | |||
334 | |||
335 | struct file_operations <section> | ||
336 | ====================== | ||
337 | |||
338 | This describes how the VFS can manipulate an open file. As of kernel | ||
339 | 2.1.99, the following members are defined: | ||
340 | |||
341 | struct file_operations { | ||
342 | loff_t (*llseek) (struct file *, loff_t, int); | ||
343 | ssize_t (*read) (struct file *, char *, size_t, loff_t *); | ||
344 | ssize_t (*write) (struct file *, const char *, size_t, loff_t *); | ||
345 | int (*readdir) (struct file *, void *, filldir_t); | ||
346 | unsigned int (*poll) (struct file *, struct poll_table_struct *); | ||
347 | int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); | ||
348 | int (*mmap) (struct file *, struct vm_area_struct *); | ||
349 | int (*open) (struct inode *, struct file *); | ||
350 | int (*release) (struct inode *, struct file *); | ||
351 | int (*fsync) (struct file *, struct dentry *); | ||
352 | int (*fasync) (struct file *, int); | ||
353 | int (*check_media_change) (kdev_t dev); | ||
354 | int (*revalidate) (kdev_t dev); | ||
355 | int (*lock) (struct file *, int, struct file_lock *); | ||
356 | }; | ||
357 | |||
358 | Again, all methods are called without any locks being held, unless | ||
359 | otherwise noted. | ||
360 | |||
361 | llseek: called when the VFS needs to move the file position index | ||
362 | |||
363 | read: called by read(2) and related system calls | ||
364 | |||
365 | write: called by write(2) and related system calls | ||
366 | |||
367 | readdir: called when the VFS needs to read the directory contents | ||
368 | |||
369 | poll: called by the VFS when a process wants to check if there is | ||
370 | activity on this file and (optionally) go to sleep until there | ||
371 | is activity. Called by the select(2) and poll(2) system calls | ||
372 | |||
373 | ioctl: called by the ioctl(2) system call | ||
374 | |||
375 | mmap: called by the mmap(2) system call | ||
376 | |||
377 | open: called by the VFS when an inode should be opened. When the VFS | ||
378 | opens a file, it creates a new "struct file" and initialises | ||
379 | the "f_op" file operations member with the "default_file_ops" | ||
380 | field in the inode structure. It then calls the open method | ||
381 | for the newly allocated file structure. You might think that | ||
382 | the open method really belongs in "struct inode_operations", | ||
383 | and you may be right. I think it's done the way it is because | ||
384 | it makes filesystems simpler to implement. The open() method | ||
385 | is a good place to initialise the "private_data" member in the | ||
386 | file structure if you want to point to a device structure | ||
387 | |||
388 | release: called when the last reference to an open file is closed | ||
389 | |||
390 | fsync: called by the fsync(2) system call | ||
391 | |||
392 | fasync: called by the fcntl(2) system call when asynchronous | ||
393 | (non-blocking) mode is enabled for a file | ||
394 | |||
395 | Note that the file operations are implemented by the specific | ||
396 | filesystem in which the inode resides. When opening a device node | ||
397 | (character or block special) most filesystems will call special | ||
398 | support routines in the VFS which will locate the required device | ||
399 | driver information. These support routines replace the filesystem file | ||
400 | operations with those for the device driver, and then proceed to call | ||
401 | the new open() method for the file. This is how opening a device file | ||
402 | in the filesystem eventually ends up calling the device driver open() | ||
403 | method. Note the devfs (the Device FileSystem) has a more direct path | ||
404 | from device node to device driver (this is an unofficial kernel | ||
405 | patch). | ||
406 | |||
407 | |||
408 | Directory Entry Cache (dcache) <section> | ||
409 | ------------------------------ | ||
410 | |||
411 | struct dentry_operations | ||
412 | ======================== | ||
413 | |||
414 | This describes how a filesystem can overload the standard dentry | ||
415 | operations. Dentries and the dcache are the domain of the VFS and the | ||
416 | individual filesystem implementations. Device drivers have no business | ||
417 | here. These methods may be set to NULL, as they are either optional or | ||
418 | the VFS uses a default. As of kernel 2.1.99, the following members are | ||
419 | defined: | ||
420 | |||
421 | struct dentry_operations { | ||
422 | int (*d_revalidate)(struct dentry *); | ||
423 | int (*d_hash) (struct dentry *, struct qstr *); | ||
424 | int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); | ||
425 | void (*d_delete)(struct dentry *); | ||
426 | void (*d_release)(struct dentry *); | ||
427 | void (*d_iput)(struct dentry *, struct inode *); | ||
428 | }; | ||
429 | |||
430 | d_revalidate: called when the VFS needs to revalidate a dentry. This | ||
431 | is called whenever a name look-up finds a dentry in the | ||
432 | dcache. Most filesystems leave this as NULL, because all their | ||
433 | dentries in the dcache are valid | ||
434 | |||
435 | d_hash: called when the VFS adds a dentry to the hash table | ||
436 | |||
437 | d_compare: called when a dentry should be compared with another | ||
438 | |||
439 | d_delete: called when the last reference to a dentry is | ||
440 | deleted. This means no-one is using the dentry, however it is | ||
441 | still valid and in the dcache | ||
442 | |||
443 | d_release: called when a dentry is really deallocated | ||
444 | |||
445 | d_iput: called when a dentry loses its inode (just prior to its | ||
446 | being deallocated). The default when this is NULL is that the | ||
447 | VFS calls iput(). If you define this method, you must call | ||
448 | iput() yourself | ||
449 | |||
450 | Each dentry has a pointer to its parent dentry, as well as a hash list | ||
451 | of child dentries. Child dentries are basically like files in a | ||
452 | directory. | ||
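
The NULL-means-default rule for d_iput can be illustrated with a small
userspace model. This is only a sketch: every name starting with toy_
is hypothetical and stands in for the kernel structure or function of
similar name, and the reference count is a plain int rather than the
kernel's atomic counter.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model; every toy_ name is hypothetical, not kernel API. */
struct toy_inode { int refs; };
struct toy_dentry;

struct toy_dentry_ops {
	/* Optional: may be NULL, in which case the default is used. */
	void (*d_iput)(struct toy_dentry *, struct toy_inode *);
};

struct toy_dentry {
	struct toy_inode *inode;
	const struct toy_dentry_ops *ops;
};

static int toy_custom_calls;	/* counts calls to the custom method */

/* Stand-in for iput(): drop one reference on the inode. */
static void toy_iput(struct toy_inode *inode)
{
	assert(inode->refs > 0);
	inode->refs--;
}

/* A filesystem-supplied d_iput must call iput() itself. */
static void toy_custom_d_iput(struct toy_dentry *d, struct toy_inode *inode)
{
	(void)d;
	toy_custom_calls++;
	toy_iput(inode);
}

/* Detach the inode from the dentry, using the method if one is set. */
static void toy_dentry_iput(struct toy_dentry *d)
{
	struct toy_inode *inode = d->inode;

	if (inode == NULL)
		return;		/* negative dentry: nothing to release */
	d->inode = NULL;
	if (d->ops != NULL && d->ops->d_iput != NULL)
		d->ops->d_iput(d, inode);
	else
		toy_iput(inode);	/* default when the method is NULL */
}
```

In both paths the inode loses exactly one reference; a custom method
that forgot to call toy_iput() would leak it, which is the reason for
the "you must call iput() yourself" rule.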
453 | |||
454 | Directory Entry Cache APIs | ||
455 | -------------------------- | ||
456 | |||
457 | There are a number of functions defined which permit a filesystem to | ||
458 | manipulate dentries: | ||
459 | |||
460 | dget: open a new handle for an existing dentry (this just increments | ||
461 | the usage count) | ||
462 | |||
463 | dput: close a handle for a dentry (decrements the usage count). If | ||
464 | the usage count drops to 0, the "d_delete" method is called | ||
465 | and the dentry is placed on the unused list if the dentry is | ||
466 | still in its parent's hash list. Putting the dentry on the | ||
467 | unused list just means that if the system needs some RAM, it | ||
468 | goes through the unused list of dentries and deallocates them. | ||
469 | If the dentry has already been unhashed when the usage count | ||
470 | drops to 0, the dentry is instead deallocated after the | ||
471 | "d_delete" method is called | ||
472 | |||
473 | d_drop: this unhashes a dentry from its parent's hash list. A | ||
474 | subsequent call to dput() will deallocate the dentry if its | ||
475 | usage count drops to 0 | ||
476 | |||
477 | d_delete: delete a dentry. If there are no other open references to | ||
478 | the dentry then the dentry is turned into a negative dentry | ||
479 | (the d_iput() method is called). If there are other | ||
480 | references, then d_drop() is called instead | ||
481 | |||
482 | d_add: adds a dentry to its parent's hash list and then calls | ||
483 | d_instantiate() | ||
484 | |||
485 | d_instantiate: adds a dentry to the alias hash list for the inode and | ||
486 | updates the "d_inode" member. The "i_count" member in the | ||
487 | inode structure should be set/incremented. If the inode | ||
488 | pointer is NULL, the dentry is called a "negative | ||
489 | dentry". This function is commonly called when an inode is | ||
490 | created for an existing negative dentry | ||
491 | |||
492 | d_lookup: look up a dentry given its parent and path name component. | ||
493 | It looks up the child of that given name from the dcache | ||
494 | hash table. If it is found, the reference count is incremented | ||
495 | and the dentry is returned. The caller must use dput() | ||
496 | to free the dentry when it finishes using it. | ||
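
The usage-count rules for dget(), dput() and d_drop() can be sketched
as a toy state machine. This is a userspace illustration under stated
assumptions, not the kernel implementation: the toy_ names are
hypothetical, there is no locking, and "freed" is only a state flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; every toy_ name is hypothetical, not kernel API. */
enum toy_state { TOY_IN_USE, TOY_UNUSED_LIST, TOY_FREED };

struct toy_dentry {
	int count;		/* usage count */
	bool hashed;		/* still in the parent's hash list? */
	int d_delete_calls;	/* how often the d_delete method ran */
	enum toy_state state;
};

/* A freshly looked-up dentry: hashed, with one reference held. */
static void toy_dentry_init(struct toy_dentry *d)
{
	d->count = 1;
	d->hashed = true;
	d->d_delete_calls = 0;
	d->state = TOY_IN_USE;
}

/* dget: open a new handle, which just increments the usage count. */
static struct toy_dentry *toy_dget(struct toy_dentry *d)
{
	assert(d->state != TOY_FREED);
	d->count++;
	d->state = TOY_IN_USE;
	return d;
}

/* d_drop: unhash the dentry from its parent's hash list. */
static void toy_d_drop(struct toy_dentry *d)
{
	d->hashed = false;
}

/* dput: close a handle; act only when the last one goes away. */
static void toy_dput(struct toy_dentry *d)
{
	assert(d->count > 0);
	if (--d->count > 0)
		return;
	d->d_delete_calls++;	/* the d_delete method is called */
	/* Hashed dentries are kept for reuse, unhashed ones are freed. */
	d->state = d->hashed ? TOY_UNUSED_LIST : TOY_FREED;
}
```

A dentry that is still hashed when its last handle is closed ends up
on the unused list, while one that was d_drop()ed first is freed.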
497 | |||
498 | |||
499 | RCU-based dcache locking model | ||
500 | ------------------------------ | ||
501 | |||
502 | On many workloads, the most common operation on dcache is | ||
503 | to look up a dentry, given a parent dentry and the name | ||
504 | of the child. Typically, for every open(), stat() etc., | ||
505 | the dentry corresponding to the pathname will be looked | ||
506 | up by walking the tree starting with the first component | ||
507 | of the pathname and using that dentry along with the next | ||
508 | component to look up the next level and so on. Since it | ||
509 | is a frequent operation for workloads like multiuser | ||
510 | environments and webservers, it is important to optimize | ||
511 | this path. | ||
512 | |||
513 | Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus | ||
514 | in every component during path look-up. From 2.5.10 onwards, the | ||
515 | fastwalk algorithm changed this by holding the dcache_lock | ||
516 | at the beginning and walking as many cached path component | ||
517 | dentries as possible. This significantly decreases the number | ||
518 | of acquisitions of dcache_lock. However it also increases the | ||
519 | lock hold time significantly and affects performance on large | ||
520 | SMP machines. Since the 2.5.62 kernel, dcache has been using | ||
521 | a new locking model that uses RCU to make dcache look-up | ||
522 | lock-free. | ||
523 | |||
524 | The current dcache locking model is not very different from the previous | ||
525 | dcache locking model. Prior to the 2.5.62 kernel, dcache_lock | ||
526 | protected the hash chain, d_child, d_alias, d_lru lists as well | ||
527 | as d_inode and several other things like mount look-up. RCU-based | ||
528 | changes affect only the way the hash chain is protected. For everything | ||
529 | else the dcache_lock must be taken for both traversing as well as | ||
530 | updating. Hash chain updates also take the dcache_lock. | ||
531 | The significant change is the way d_lookup traverses the hash chain: | ||
532 | it doesn't acquire the dcache_lock for this and relies on RCU to | ||
533 | ensure that the dentry has not been *freed*. | ||
534 | |||
535 | |||
536 | Dcache locking details | ||
537 | ---------------------- | ||
538 | For many multi-user workloads, open() and stat() on files are | ||
539 | very frequently occurring operations. Both involve walking | ||
540 | of path names to find the dentry corresponding to the | ||
541 | concerned file. In 2.4 kernel, dcache_lock was held | ||
542 | during look-up of each path component. Contention and | ||
543 | cacheline bouncing of this global lock caused significant | ||
544 | scalability problems. With the introduction of RCU | ||
545 | in linux kernel, this was worked around by making | ||
546 | the look-up of path components during path walking lock-free. | ||
547 | |||
548 | |||
549 | Safe lock-free look-up of dcache hash table | ||
550 | =========================================== | ||
551 | |||
552 | Dcache is a complex data structure with the hash table entries | ||
553 | also linked together in other lists. In 2.4 kernel, dcache_lock | ||
554 | protected all the lists. We applied RCU only on hash chain | ||
555 | walking. The rest of the lists are still protected by dcache_lock. | ||
556 | Some of the important changes are : | ||
557 | |||
558 | 1. The deletion from hash chain is done using hlist_del_rcu() macro which | ||
559 | doesn't initialize next pointer of the deleted dentry and this | ||
560 | allows us to walk safely lock-free while a deletion is happening. | ||
561 | |||
562 | 2. Insertion of a dentry into the hash table is done using | ||
563 | hlist_add_head_rcu() which takes care of ordering the writes - | ||
564 | the writes to the dentry must be visible before the dentry | ||
565 | is inserted. This works in conjunction with hlist_for_each_rcu() | ||
566 | while walking the hash chain. The only requirement is that | ||
567 | all initialization to the dentry must be done before hlist_add_head_rcu() | ||
568 | since we don't have dcache_lock protection while traversing | ||
569 | the hash chain. This isn't different from the existing code. | ||
570 | |||
571 | 3. A dentry looked up without holding dcache_lock cannot be | ||
572 | returned for walking if it is unhashed. It then may have a NULL | ||
573 | d_inode or other bogosity since RCU doesn't protect the other | ||
574 | fields in the dentry. We therefore use a flag DCACHE_UNHASHED to | ||
575 | indicate unhashed dentries and use this in conjunction with a | ||
576 | per-dentry lock (d_lock). Once looked up without the dcache_lock, | ||
577 | we acquire the per-dentry lock (d_lock) and check if the | ||
578 | dentry is unhashed. If so, the look-up fails. If not, the | ||
579 | reference count of the dentry is increased and the dentry is returned. | ||
580 | |||
581 | 4. Once a dentry is looked up, it must be ensured during the path | ||
582 | walk for that component it doesn't go away. In pre-2.5.10 code, | ||
583 | this was done holding a reference to the dentry. dcache_rcu does | ||
584 | the same. In some sense, dcache_rcu path walking looks like | ||
585 | the pre-2.5.10 version. | ||
586 | |||
587 | 5. All dentry hash chain updates must take the dcache_lock as well as | ||
588 | the per-dentry lock in that order. dput() does this to ensure | ||
589 | that a dentry that has just been looked up on another CPU | ||
590 | doesn't get deleted before dget() can be done on it. | ||
591 | |||
592 | 6. There are several ways to do reference counting of RCU protected | ||
593 | objects. One such example is in ipv4 route cache where | ||
594 | deferred freeing (using call_rcu()) is done as soon as | ||
595 | the reference count goes to zero. This cannot be done in | ||
596 | the case of dentries because tearing down of dentries | ||
597 | requires blocking (dentry_iput()) which isn't supported from | ||
598 | RCU callbacks. Instead, tearing down of dentries happens | ||
599 | synchronously in dput(), but actual freeing happens later | ||
600 | when the RCU grace period is over. This allows safe lock-free | ||
601 | walking of the hash chains, but a matched dentry may have | ||
602 | been partially torn down. The checking of DCACHE_UNHASHED | ||
603 | flag with d_lock held detects such dentries and prevents | ||
604 | them from being returned from look-up. | ||
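
The check described in points 3 and 6 (walk the chain without the
global lock, then take the per-dentry lock and reject unhashed
entries) can be sketched in userspace C. The sketch makes several
assumptions: the toy_ names are hypothetical, a C11 atomic_flag stands
in for d_lock, TOY_UNHASHED stands in for DCACHE_UNHASHED, and the RCU
read-side protection of the chain walk itself is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define TOY_UNHASHED 0x1	/* stands in for DCACHE_UNHASHED */

/* Toy userspace model; every toy_ name is hypothetical, not kernel API. */
struct toy_dentry {
	atomic_flag lock;	/* stands in for the per-dentry d_lock */
	unsigned int flags;
	int count;		/* reference count */
	const char *name;
	struct toy_dentry *next;	/* NULL-terminated hash chain */
};

static void toy_lock(struct toy_dentry *d)
{
	while (atomic_flag_test_and_set_explicit(&d->lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

static void toy_unlock(struct toy_dentry *d)
{
	atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

/* Walk the chain without a global lock; validate under the dentry lock. */
static struct toy_dentry *toy_lookup(struct toy_dentry *chain,
				     const char *name)
{
	struct toy_dentry *d, *found = NULL;

	assert(name != NULL);
	for (d = chain; d != NULL; d = d->next) {
		toy_lock(d);
		if (strcmp(d->name, name) == 0) {
			/* A matching but unhashed dentry fails the look-up. */
			if (!(d->flags & TOY_UNHASHED)) {
				d->count++;
				found = d;
			}
			toy_unlock(d);
			break;
		}
		toy_unlock(d);
	}
	return found;
}
```

The reference is taken while the per-dentry lock is still held, so a
concurrent teardown that also takes that lock cannot slip in between
the unhashed check and the count increment.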
605 | |||
606 | |||
607 | Maintaining POSIX rename semantics | ||
608 | ================================== | ||
609 | |||
610 | Since look-up of dentries is lock-free, it can race against | ||
611 | a concurrent rename operation. For example, during rename | ||
612 | of file A to B, look-up of either A or B must succeed. | ||
613 | So, if look-up of B happens after A has been removed from the | ||
614 | hash chain but not added to the new hash chain, it may fail. | ||
615 | Also, a comparison while the name is being written concurrently | ||
616 | by a rename may result in false positive matches violating | ||
617 | rename semantics. Issues related to race with rename are | ||
618 | handled as described below : | ||
619 | |||
620 | 1. Look-up can be done in two ways - d_lookup() which is safe | ||
621 | from simultaneous renames and __d_lookup() which is not. | ||
622 | If __d_lookup() fails, it must be followed up by a d_lookup() | ||
623 | to correctly determine whether a dentry is in the hash table | ||
624 | or not. d_lookup() protects look-ups using a sequence | ||
625 | lock (rename_lock). | ||
626 | |||
627 | 2. The name associated with a dentry (d_name) may be changed if | ||
628 | a rename is allowed to happen simultaneously. To prevent memcmp() | ||
629 | in __d_lookup() from going out of bounds due to a rename, and to avoid a false | ||
630 | positive comparison, the name comparison is done while holding the | ||
631 | per-dentry lock. This prevents concurrent renames during this | ||
632 | operation. | ||
633 | |||
634 | 3. Hash table walking during look-up may move to a different bucket as | ||
635 | the current dentry is moved to a different bucket due to rename. | ||
636 | But we use hlists in dcache hash table and they are null-terminated. | ||
637 | So, even if a dentry moves to a different bucket, hash chain | ||
638 | walk will terminate. [with a list_head list, it may not since | ||
639 | termination is when the list_head in the original bucket is reached]. | ||
640 | Since we redo the d_parent check and compare name while holding | ||
641 | d_lock, lock-free look-up will not race against d_move(). | ||
642 | |||
643 | 4. There can be a theoretical race when a dentry keeps coming back | ||
644 | to its original bucket due to double moves. Because of this, look-up may | ||
645 | consider that it has never moved and can end up in an infinite loop. | ||
646 | But this is not any worse than the theoretical livelocks we already | ||
647 | have in the kernel. | ||
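
The retry described in point 1, where a failed lock-free look-up is
repeated if a rename ran in the meantime, follows the usual sequence
lock reader pattern. Below is a single-threaded userspace sketch of
that pattern. The toy_ names are hypothetical, memory barriers are
omitted, and toy_racy_lookup() merely simulates a look-up that loses a
race against a rename.

```c
#include <assert.h>
#include <stddef.h>

/* Toy sequence counter standing in for rename_lock. */
struct toy_seq { unsigned int sequence; };

static unsigned int toy_read_begin(const struct toy_seq *s)
{
	return s->sequence;
}

/* Retry if a writer was active at the start or has run since. */
static int toy_read_retry(const struct toy_seq *s, unsigned int start)
{
	return (start & 1) || s->sequence != start;
}

static void toy_write_begin(struct toy_seq *s) { s->sequence++; }
static void toy_write_end(struct toy_seq *s) { s->sequence++; }

typedef void *(*toy_lookup_fn)(void *ctx);

/* Rename-safe wrapper around a look-up that is not rename-safe. */
static void *toy_d_lookup(struct toy_seq *rename_lock, toy_lookup_fn fast,
			  void *ctx, int *attempts)
{
	void *dentry;
	unsigned int seq;

	assert(fast != NULL && attempts != NULL);
	do {
		seq = toy_read_begin(rename_lock);
		(*attempts)++;
		dentry = fast(ctx);
		if (dentry != NULL)
			break;	/* a hit needs no retry */
	} while (toy_read_retry(rename_lock, seq));
	return dentry;
}

/* Simulated racing look-up, for demonstration only. */
struct toy_race {
	struct toy_seq *rename_lock;
	int renames_left;	/* look-ups that will lose to a rename */
	void *dentry;		/* what an undisturbed look-up finds */
};

static void *toy_racy_lookup(void *ctx)
{
	struct toy_race *r = ctx;

	if (r->renames_left > 0) {
		r->renames_left--;
		toy_write_begin(r->rename_lock);	/* rename starts */
		toy_write_end(r->rename_lock);		/* rename ends */
		return NULL;	/* dentry was between hash chains */
	}
	return r->dentry;
}
```

Only a miss is re-checked against the sequence counter: a hit has
already been validated under the per-dentry lock, so it can be
returned without consulting the counter.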
648 | |||
649 | |||
650 | Important guidelines for filesystem developers related to dcache_rcu | ||
651 | ==================================================================== | ||
652 | |||
653 | 1. Existing dcache interfaces (pre-2.5.62) exported to filesystem | ||
654 | don't change. Only dcache internal implementation changes. However | ||
655 | filesystems *must not* delete from the dentry hash chains directly | ||
656 | using the list macros as was allowed earlier. They must use dcache | ||
657 | APIs like d_drop() or __d_drop() depending on the situation. | ||
658 | |||
659 | 2. d_flags is now protected by a per-dentry lock (d_lock). All | ||
660 | access to d_flags must be protected by it. | ||
661 | |||
662 | 3. For a hashed dentry, checking of d_count needs to be protected | ||
663 | by d_lock. | ||
664 | |||
665 | |||
666 | Papers and other documentation on dcache locking | ||
667 | ================================================ | ||
668 | |||
669 | 1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). | ||
670 | |||
671 | 2. http://lse.sourceforge.net/locking/dcache/dcache.html | ||
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt new file mode 100644 index 000000000000..c7d5d0c7067d --- /dev/null +++ b/Documentation/filesystems/xfs.txt | |||
@@ -0,0 +1,188 @@ | |||
1 | |||
2 | The SGI XFS Filesystem | ||
3 | ====================== | ||
4 | |||
5 | XFS is a high performance journaling filesystem which originated | ||
6 | on the SGI IRIX platform. It is completely multi-threaded, can | ||
7 | support large files and large filesystems, extended attributes, | ||
8 | variable block sizes, is extent based, and makes extensive use of | ||
9 | Btrees (directories, extents, free space) to aid both performance | ||
10 | and scalability. | ||
11 | |||
12 | Refer to the documentation at http://oss.sgi.com/projects/xfs/ | ||
13 | for further details. This implementation is on-disk compatible | ||
14 | with the IRIX version of XFS. | ||
15 | |||
16 | |||
17 | Mount Options | ||
18 | ============= | ||
19 | |||
20 | When mounting an XFS filesystem, the following options are accepted. | ||
21 | |||
22 | biosize=size | ||
23 | Sets the preferred buffered I/O size (default size is 64K). | ||
24 | "size" must be expressed as the logarithm (base2) of the | ||
25 | desired I/O size. | ||
26 | Valid values for this option are 14 through 16, inclusive | ||
27 | (i.e. 16K, 32K, and 64K bytes). On machines with a 4K | ||
28 | pagesize, 13 (8K bytes) is also a valid size. | ||
29 | The preferred buffered I/O size can also be altered on an | ||
30 | individual file basis using the ioctl(2) system call. | ||
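
Because biosize takes a base-2 logarithm rather than a byte count, a
small helper can do the conversion. The following is a minimal sketch;
toy_biosize_value() is a hypothetical name, and the range check simply
encodes the limits stated above (14 through 16, plus 13 when the page
size is 4K).

```c
#include <assert.h>

/*
 * Hypothetical helper: turn a desired buffered I/O size in bytes into
 * the value to pass as biosize=. Returns -1 if the size is not a power
 * of two or falls outside the range documented above.
 */
static int toy_biosize_value(unsigned long io_bytes, unsigned long page_bytes)
{
	int min_shift = (page_bytes == 4096) ? 13 : 14;
	int shift = 0;

	assert(page_bytes != 0);
	if (io_bytes == 0 || (io_bytes & (io_bytes - 1)) != 0)
		return -1;	/* not a power of two */
	while ((1UL << shift) < io_bytes)
		shift++;	/* shift ends up as log2(io_bytes) */
	if (shift < min_shift || shift > 16)
		return -1;	/* outside the accepted range */
	return shift;
}
```

For example, a 64K preferred I/O size corresponds to biosize=16.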
31 | |||
32 | ikeep/noikeep | ||
33 | When inode clusters are emptied of inodes, keep them around | ||
34 | on the disk (ikeep) - this is the traditional XFS behaviour | ||
35 | and is still the default for now. Using the noikeep option, | ||
36 | inode clusters are returned to the free space pool. | ||
37 | |||
38 | logbufs=value | ||
39 | Set the number of in-memory log buffers. Valid numbers range | ||
40 | from 2-8 inclusive. | ||
41 | The default value is 8 buffers for filesystems with a | ||
42 | blocksize of 64K, 4 buffers for filesystems with a blocksize | ||
43 | of 32K, 3 buffers for filesystems with a blocksize of 16K | ||
44 | and 2 buffers for all other configurations. Increasing the | ||
45 | number of buffers may increase performance on some workloads | ||
46 | at the cost of the memory used for the additional log buffers | ||
47 | and their associated control structures. | ||
48 | |||
49 | logbsize=value | ||
50 | Set the size of each in-memory log buffer. | ||
51 | Size may be specified in bytes, or in kilobytes with a "k" suffix. | ||
52 | Valid sizes for version 1 and version 2 logs are 16384 (16k) and | ||
53 | 32768 (32k). Valid sizes for version 2 logs also include | ||
54 | 65536 (64k), 131072 (128k) and 262144 (256k). | ||
55 | The default value for machines with more than 32MB of memory | ||
56 | is 32768; machines with less memory use 16384 by default. | ||
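
The logbsize rules can be summarised as a small validation routine.
This is a userspace sketch of the rules as stated above, not the code
the XFS mount path uses; toy_parse_logbsize() is a hypothetical name,
and only the lowercase "k" suffix mentioned above is accepted.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a logbsize= argument given in bytes or
 * with a "k" suffix, and check it against the sizes listed above for
 * the given log version (1 or 2). Returns the size in bytes, or -1.
 */
static long toy_parse_logbsize(const char *arg, int log_version)
{
	char *end;
	unsigned long v;

	assert(arg != NULL);
	v = strtoul(arg, &end, 10);
	if (end == arg)
		return -1;		/* no digits at all */
	if (*end == 'k' && end[1] == '\0')
		v *= 1024;		/* kilobyte suffix */
	else if (*end != '\0')
		return -1;		/* trailing junk */

	switch (v) {
	case 16384:
	case 32768:
		return (long)v;		/* valid for both log versions */
	case 65536:
	case 131072:
	case 262144:
		return log_version == 2 ? (long)v : -1;
	default:
		return -1;
	}
}
```

With this helper, "32k" is accepted for either log version, while
"65536" is accepted only for version 2 logs.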
57 | |||
58 | logdev=device and rtdev=device | ||
59 | Use an external log (metadata journal) and/or real-time device. | ||
60 | An XFS filesystem has up to three parts: a data section, a log | ||
61 | section, and a real-time section. The real-time section is | ||
62 | optional, and the log section can be separate from the data | ||
63 | section or contained within it. | ||
64 | |||
65 | noalign | ||
66 | Data allocations will not be aligned at stripe unit boundaries. | ||
67 | |||
68 | noatime | ||
69 | Access timestamps are not updated when a file is read. | ||
70 | |||
71 | norecovery | ||
72 | The filesystem will be mounted without running log recovery. | ||
73 | If the filesystem was not cleanly unmounted, it is likely to | ||
74 | be inconsistent when mounted in "norecovery" mode. | ||
75 | Some files or directories may not be accessible because of this. | ||
76 | Filesystems mounted "norecovery" must be mounted read-only or | ||
77 | the mount will fail. | ||
78 | |||
79 | nouuid | ||
80 | Don't check for double mounted file systems using the file system uuid. | ||
81 | This is useful to mount LVM snapshot volumes. | ||
82 | |||
83 | osyncisosync | ||
84 | Make O_SYNC writes implement true O_SYNC. WITHOUT this option, | ||
85 | Linux XFS behaves as if an "osyncisdsync" option is used, | ||
86 | which will make writes to files opened with the O_SYNC flag set | ||
87 | behave as if the O_DSYNC flag had been used instead. | ||
88 | This can result in better performance without compromising | ||
89 | data safety. | ||
90 | However if this option is not in effect, timestamp updates from | ||
91 | O_SYNC writes can be lost if the system crashes. | ||
92 | If timestamp updates are critical, use the osyncisosync option. | ||
93 | |||
94 | quota/usrquota/uqnoenforce | ||
95 | User disk quota accounting enabled, and limits (optionally) | ||
96 | enforced. | ||
97 | |||
98 | grpquota/gqnoenforce | ||
99 | Group disk quota accounting enabled and limits (optionally) | ||
100 | enforced. | ||
101 | |||
102 | sunit=value and swidth=value | ||
103 | Used to specify the stripe unit and width for a RAID device or | ||
104 | a stripe volume. "value" must be specified in 512-byte block | ||
105 | units. | ||
106 | If this option is not specified and the filesystem was made on | ||
107 | a stripe volume or the stripe width or unit were specified for | ||
108 | the RAID device at mkfs time, then the mount system call will | ||
109 | restore the value from the superblock. For filesystems that | ||
110 | are made directly on RAID devices, these options can be used | ||
111 | to override the information in the superblock if the underlying | ||
112 | disk layout changes after the filesystem has been created. | ||
113 | The "swidth" option is required if the "sunit" option has been | ||
114 | specified, and must be a multiple of the "sunit" value. | ||
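
Since sunit and swidth are given in 512-byte units, values usually
have to be converted from the byte sizes that RAID tools report. The
sketch below shows that arithmetic. The helper names are hypothetical,
and deriving swidth as sunit times the number of data disks is an
assumption about a typical striped layout, not something this
document states.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_UNIT_BYTES 512UL	/* sunit/swidth are in 512-byte units */

/*
 * Hypothetical helper: derive sunit/swidth option values from a stripe
 * unit in bytes and a number of data disks. Returns 0 on success, -1
 * if the stripe unit is not a whole number of 512-byte units.
 */
static int toy_stripe_from_bytes(unsigned long su_bytes,
				 unsigned int data_disks,
				 unsigned long *sunit, unsigned long *swidth)
{
	assert(sunit != NULL && swidth != NULL);
	if (su_bytes == 0 || su_bytes % TOY_UNIT_BYTES != 0 || data_disks == 0)
		return -1;
	*sunit = su_bytes / TOY_UNIT_BYTES;
	*swidth = *sunit * data_disks;	/* assumption: one unit per disk */
	return 0;
}

/*
 * Check the rule stated above: if sunit is given (non-zero here),
 * swidth is required and must be a multiple of sunit.
 */
static int toy_stripe_opts_ok(unsigned long sunit, unsigned long swidth)
{
	if (sunit == 0)
		return 1;	/* sunit not given: the rule does not apply */
	return swidth != 0 && swidth % sunit == 0;
}
```

For a 64K stripe unit across 4 data disks this gives sunit=128 and
swidth=512.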
115 | |||
116 | sysctls | ||
117 | ======= | ||
118 | |||
119 | The following sysctls are available for the XFS filesystem: | ||
120 | |||
121 | fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1) | ||
122 | Setting this to "1" clears accumulated XFS statistics | ||
123 | in /proc/fs/xfs/stat. It then immediately resets to "0". | ||
124 | |||
125 | fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000) | ||
126 | The interval at which the xfssyncd thread flushes metadata | ||
127 | out to disk. This thread will flush log activity out, and | ||
128 | do some processing on unlinked inodes. | ||
129 | |||
130 | fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000) | ||
131 | The interval at which xfsbufd scans the dirty metadata buffers list. | ||
132 | |||
133 | fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000) | ||
134 | The age at which xfsbufd flushes dirty metadata buffers to disk. | ||
135 | |||
136 | fs.xfs.error_level (Min: 0 Default: 3 Max: 11) | ||
137 | A volume knob for error reporting when internal errors occur. | ||
138 | This will generate detailed messages & backtraces for filesystem | ||
139 | shutdowns, for example. Current threshold values are: | ||
140 | |||
141 | XFS_ERRLEVEL_OFF: 0 | ||
142 | XFS_ERRLEVEL_LOW: 1 | ||
143 | XFS_ERRLEVEL_HIGH: 5 | ||
144 | |||
145 | fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127) | ||
146 | Causes certain error conditions to call BUG(). Value is a bitmask; | ||
147 | OR together the tags which represent errors which should cause panics: | ||
148 | |||
149 | XFS_NO_PTAG 0 | ||
150 | XFS_PTAG_IFLUSH 0x00000001 | ||
151 | XFS_PTAG_LOGRES 0x00000002 | ||
152 | XFS_PTAG_AILDELETE 0x00000004 | ||
153 | XFS_PTAG_ERROR_REPORT 0x00000008 | ||
154 | XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010 | ||
155 | XFS_PTAG_SHUTDOWN_IOERROR 0x00000020 | ||
156 | XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040 | ||
157 | |||
158 | This option is intended for debugging only. | ||
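
Because panic_mask is a bitmask, a value is built by combining the selected tags with bitwise OR. The sketch below uses the tag values listed above; the helper name is hypothetical and the code only computes the number to write, it does not touch the sysctl.

```python
# Tag values copied from the list above.
XFS_PTAGS = {
    "XFS_PTAG_IFLUSH": 0x01,
    "XFS_PTAG_LOGRES": 0x02,
    "XFS_PTAG_AILDELETE": 0x04,
    "XFS_PTAG_ERROR_REPORT": 0x08,
    "XFS_PTAG_SHUTDOWN_CORRUPT": 0x10,
    "XFS_PTAG_SHUTDOWN_IOERROR": 0x20,
    "XFS_PTAG_SHUTDOWN_LOGERROR": 0x40,
}


def panic_mask(*tag_names):
    """OR the named tags together; no tags gives 0 (XFS_NO_PTAG)."""
    mask = 0
    for name in tag_names:
        mask |= XFS_PTAGS[name]
    return mask


# Panic on corruption and I/O error shutdowns only.
mask = panic_mask("XFS_PTAG_SHUTDOWN_CORRUPT", "XFS_PTAG_SHUTDOWN_IOERROR")
print(mask, hex(mask))  # prints 48 0x30

# All seven tags together give the documented maximum of 127.
print(panic_mask(*XFS_PTAGS))  # prints 127
```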
159 | |||
160 | fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1) | ||
161 | Controls whether symlinks are created with mode 0777 (default) | ||
162 | or whether their mode is affected by the umask (irix mode). | ||
163 | |||
164 | fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1) | ||
165 | Controls the ISGID bit on files created in SGID directories. | ||
166 | If the group ID of the new file does not match the effective group | ||
167 | ID or one of the supplementary group IDs of the parent dir, the | ||
168 | ISGID bit is cleared when the irix_sgid_inherit compatibility | ||
169 | sysctl is set. | ||
170 | |||
171 | fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1) | ||
172 | Controls whether unprivileged users can use chown to "give away" | ||
173 | a file to another user. | ||
174 | |||
175 | fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1) | ||
176 | Setting this to "1" will cause the "sync" flag set | ||
177 | by the chattr(1) command on a directory to be | ||
178 | inherited by files in that directory. | ||
179 | |||
180 | fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1) | ||
181 | Setting this to "1" will cause the "nodump" flag set | ||
182 | by the chattr(1) command on a directory to be | ||
183 | inherited by files in that directory. | ||
184 | |||
185 | fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1) | ||
186 | Setting this to "1" will cause the "noatime" flag set | ||
187 | by the chattr(1) command on a directory to be | ||
188 | inherited by files in that directory. | ||