aboutsummaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAge
* AFS: implement basic file write supportDavid Howells2007-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement support for writing to regular AFS files, including: (1) write (2) truncate (3) fsync, fdatasync (4) chmod, chown, chgrp, utime. AFS writeback attempts to batch writes into as chunks as large as it can manage up to the point that it writes back 65535 pages in one chunk or it meets a locked page. Furthermore, if a page has been written to using a particular key, then should another write to that page use some other key, the first write will be flushed before the second is allowed to take place. If the first write fails due to a security error, then the page will be scrapped and reread before the second write takes place. If a page is dirty and the callback on it is broken by the server, then the dirty data is not discarded (same behaviour as NFS). Shared-writable mappings are not supported by this patch. [akpm@linux-foundation.org: fix a bunch of warnings] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* AFS: AFS fixupsDavid Howells2007-05-09
| | | | | | | | | | | | | | | | | | | | | Make some miscellaneous changes to the AFS filesystem: (1) Assert RCU barriers on module exit to make sure RCU has finished with callbacks in this module. (2) Correctly handle the AFS server returning a zero-length read. (3) Split out data zapping calls into one function (afs_zap_data). (4) Rename some afs_file_*() functions to afs_*() where they apply to non-regular files too. (5) Be consistent about the presentation of volume ID:vnode ID in debugging output. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: use path_walk in do_path_lookupJosef 'Jeff' Sipek2007-05-09
| | | | | | | | | | Since path_walk sets the total_link_count to 0 and calls link_path_walk, we can just call path_walk directly. Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: fix indentation in do_path_lookupJosef 'Jeff' Sipek2007-05-09
| | | | | | | Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* use simple_read_from_buffer() in fs/Akinobu Mita2007-05-09
| | | | | | | | | | Cleanup using simple_read_from_buffer() in binfmt_misc, configfs, and sysfs. Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Joel Becker <joel.becker@oracle.com> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* smaps: only define clear_refs for CONFIG_MMUDavid Rientjes2007-05-08
| | | | | | | | | | /proc/pid/clear_refs is only defined in the CONFIG_MMU case, so make sure we don't have any references to clear_refs_smap() in generic procfs code. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Remove suid/sgid bits on [f]truncate()Linus Torvalds2007-05-08
| | | | | | | | .. to match what we do on write(). This way, people who write to files by using [f]truncate + writable mmap have the same semantics as if they were using the write() family of system calls. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://oss.sgi.com:8090/xfs/xfs-2.6Linus Torvalds2007-05-08
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://oss.sgi.com:8090/xfs/xfs-2.6: [XFS] Add lockdep support for XFS [XFS] Fix race in xfs_write() b/w dmapi callout and direct I/O checks. [XFS] Get rid of redundant "required" in msg. [XFS] Export via a function xfs_buftarg_list for use by kdb/xfsidbg. [XFS] Remove unused ilen variable and references. [XFS] Fix to prevent the notorious 'NULL files' problem after a crash. [XFS] Fix race condition in xfs_write(). [XFS] Fix uquota and oquota enforcement problems. [XFS] propogate return codes from flush routines [XFS] Fix quotaon syscall failures for group enforcement requests. [XFS] Invalidate quotacheck when mounting without a quota type. [XFS] reducing the number of random number functions. [XFS] remove more misc. unused args [XFS] the "aendp" arg to xfs_dir2_data_freescan is always NULL, remove it. [XFS] The last argument "lsn" of xfs_trans_commit() is always called with
| * [XFS] Add lockdep support for XFSLachlan McIlroy2007-05-07
| | | | | | | | | | | | | | | | | | SGI-PV: 963965 SGI-Modid: xfs-linux-melb:xfs-kern:28485a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
| * [XFS] Fix race in xfs_write() b/w dmapi callout and direct I/O checks.Lachlan McIlroy2007-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In xfs_write() the iolock is dropped and reacquired in XFS_SEND_DATA() which means that the file could change from not-cached to cached and we need to redo the direct I/O checks. We should also redo the direct I/O checks when the file size changes regardless if O_APPEND is set or not. SGI-PV: 963483 SGI-Modid: xfs-linux-melb:xfs-kern:28440a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
| * [XFS] Get rid of redundant "required" in msg.Utako Kusaka2007-05-07
| | | | | | | | | | | | | | | | | | SGI-PV: 963466 SGI-Modid: xfs-linux-melb:xfs-kern:28416a Signed-off-by: Utako Kusaka <utako@tnes.nec.co.jp> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>
| * [XFS] Export via a function xfs_buftarg_list for use by kdb/xfsidbg.Tim Shimmin2007-05-07
| | | | | | | | | | | | | | | | SGI-PV: 963465 SGI-Modid: xfs-linux-melb:xfs-kern:28414a Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
| * [XFS] Remove unused ilen variable and references.Tim Shimmin2007-05-07
| | | | | | | | | | | | | | | | | | SGI-PV: 907752 SGI-Modid: xfs-linux-melb:xfs-kern:28344a Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
| * [XFS] Fix to prevent the notorious 'NULL files' problem after a crash.Lachlan McIlroy2007-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problem that has been addressed is that of synchronising updates of the file size with writes that extend a file. Without the fix the update of a file's size, as a result of a write beyond eof, is independent of when the cached data is flushed to disk. Often the file size update would be written to the filesystem log before the data is flushed to disk. When a system crashes between these two events and the filesystem log is replayed on mount the file's size will be set but since the contents never made it to disk the file is full of holes. If some of the cached data was flushed to disk then it may just be a section of the file at the end that has holes. There are existing fixes to help alleviate this problem, particularly in the case where a file has been truncated, that force cached data to be flushed to disk when the file is closed. If the system crashes while the file(s) are still open then this flushing will never occur. The fix that we have implemented is to introduce a second file size, called the in-memory file size, that represents the current file size as viewed by the user. The existing file size, called the on-disk file size, is the one that get's written to the filesystem log and we only update it when it is safe to do so. When we write to a file beyond eof we only update the in- memory file size in the write operation. Later when the I/O operation, that flushes the cached data to disk completes, an I/O completion routine will update the on-disk file size. The on-disk file size will be updated to the maximum offset of the I/O or to the value of the in-memory file size if the I/O includes eof. SGI-PV: 958522 SGI-Modid: xfs-linux-melb:xfs-kern:28322a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
| * [XFS] Fix race condition in xfs_write().Lachlan McIlroy2007-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change addresses a race in xfs_write() where, for direct I/O, the flags need_i_mutex and need_flush are setup before the iolock is acquired. The logic used to setup the flags may change between setting the flags and acquiring the iolock resulting in these flags having incorrect values. For example, if a file is not currently cached then need_i_mutex is set to zero and then if the file is cached before the iolock is acquired we will fail to do the flushinval before the direct write. The flush (and also the call to xfs_zero_eof()) need to be done with the iolock held exclusive so we need to acquire the iolock before checking for cached data (or if the write begins after eof) to prevent this state from changing. For direct I/O I've chosen to always acquire the iolock in shared mode initially and if there is a need to promote it then drop it and reacquire it. There's also some other tidy-ups including removing the O_APPEND offset adjustment since that work is done in generic_write_checks() (and we don't use offset as an input parameter anywhere). SGI-PV: 962170 SGI-Modid: xfs-linux-melb:xfs-kern:28319a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
| * [XFS] Fix uquota and oquota enforcement problems.Kouta Ooizumi2007-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When uquota and oquota (gquota/pquota) are enabled for accounting both are enforced if ether has enforcement active. Conditions: - Both XFS_UQUOTA_ACCT and XFS_GQUOTA_ACCT are enabled. - Either XFS_UQUOTA_ENFD or XFS_OQUOTA_ENFD is enabled. - The usage without enforce is reached at the soft limit. Problems: 1. "repquota" shows all grace time even if no enforcement. 2. we cannot make a file over a hard limits even if no enforcement. SGI-PV: 962291 SGI-Modid: xfs-linux-melb:xfs-kern:28272a Signed-off-by: Kouta Ooizumi <k-ooizumi@tnes.nec.co.jp> Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
| * [XFS] propogate return codes from flush routinesLachlan McIlroy2007-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch handles error return values in fs_flush_pages and fs_flushinval_pages. It changes the prototype of fs_flushinval_pages so we can propogate the errors and handle them at higher layers. I also modified xfs_itruncate_start so that it could propogate the error further. SGI-PV: 961990 SGI-Modid: xfs-linux-melb:xfs-kern:28231a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Stewart Smith <stewart@flamingspork.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
| * [XFS] Fix quotaon syscall failures for group enforcement requests.Donald Douwsma2007-05-07
| | | | | | | | | | | | | | | | | | | | | | xfs_qm_scall_quotaon was incorrectly failing requests to enable group quota enforcement. Fixes logic error in OQUOTA handling. SGI-PV: 961964 SGI-Modid: xfs-linux-melb:xfs-kern:28227a Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
| * [XFS] Invalidate quotacheck when mounting without a quota type.Donald Douwsma2007-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When quotas are mounted or remounted without a particular quota type the quota accounting for that type becomes invalid. Previously we were ignoring this leading to accounting errors. SGI-PV: 961964 SGI-Modid: xfs-linux-melb:xfs-kern:28225a Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Utako Kusaka <utako@tnes.nec.co.jp> Signed-off-by: Vlad Apostolov <vapo@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
| * [XFS] reducing the number of random number functions.Joe Perches2007-05-07
| | | | | | | | | | | | | | | | | | | | | | Patch provided by Joe Perches SGI-PV: 961696 SGI-Modid: xfs-linux-melb:xfs-kern:28209a Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
| * [XFS] remove more misc. unused argsEric Sandeen2007-05-07
| | | | | | | | | | | | | | | | | | | | | | Patch provided by Eric Sandeen. SGI-PV: 961695 SGI-Modid: xfs-linux-melb:xfs-kern:28205a Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
| * [XFS] the "aendp" arg to xfs_dir2_data_freescan is always NULL, remove it.Eric Sandeen2007-05-07
| | | | | | | | | | | | | | | | | | | | | | Patch provided by Eric Sandeen. SGI-PV: 961694 SGI-Modid: xfs-linux-melb:xfs-kern:28204a Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
| * [XFS] The last argument "lsn" of xfs_trans_commit() is always called withEric Sandeen2007-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | NULL. Patch provided by Eric Sandeen. SGI-PV: 961693 SGI-Modid: xfs-linux-melb:xfs-kern:28199a Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
* | Merge branch 'for-2.6.22' of git://git.kernel.dk/data/git/linux-2.6-blockLinus Torvalds2007-05-08
|\ \ | | | | | | | | | | | | | | | | | | * 'for-2.6.22' of git://git.kernel.dk/data/git/linux-2.6-block: [PATCH] ll_rw_blk: fix missing bounce in blk_rq_map_kern() [PATCH] splice: always call into page_cache_readahead() [PATCH] splice(): fix interaction with readahead
| * | [PATCH] splice: always call into page_cache_readahead()Jens Axboe2007-05-08
| | | | | | | | | | | | | | | | | | | | | Don't try to guess what the read-ahead logic will do, allow it to make its own decisions. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
| * | [PATCH] splice(): fix interaction with readaheadFengguang Wu2007-05-08
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Eric Dumazet, thank you for disclosing this bug. Readahead logic somehow fails to populate the page range with data. It can be because 1) the readahead routine is not always called in the following lines of fs/splice.c: if (!loff || nr_pages > 1) page_cache_readahead(mapping, &in->f_ra, in, index, nr_pages); 2) even called, page_cache_readahead() wont guarantee the pages are there. It wont submit readahead I/O for pages already in the radix tree, or when (ra_pages == 0), or after 256 cache hits. In your case, it should be because of the retried reads, which lead to excessive cache hits, and disables readahead at some time. And that _one_ failure of readahead blocks the whole read process. The application receives EAGAIN and retries the read, but __generic_file_splice_read() refuse to make progress: - in the previous invocation, it has allocated a blank page and inserted it into the radix tree, but never has the chance to start I/O for it: the test of SPLICE_F_NONBLOCK goes before that. - in the retried invocation, the readahead code will neither get out of the cache hit mode, nor will it submit I/O for an already existing page. Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2007-05-08
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6: JFS: Fix race waking up jfsIO kernel thread JFS: use __set_current_state() Copy i_flags to jfs inode flags on write JFS: document uid, gid, and umask mount options in jfs.txt
| * | JFS: Fix race waking up jfsIO kernel threadDave Kleikamp2007-05-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's possible for a journal I/O request to be added to the log_redrive queue and the jfsIO thread to be awakened after the thread releases log_redrive_lock but before it sets its state to TASK_INTERRUPTIBLE. The jfsIO thread should set the state before giving up the spinlock, so the waking thread will really wake it. Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
| * | JFS: use __set_current_state()Milind Arun Choudhary2007-04-26
| | | | | | | | | | | | | | | | | | | | | | | | use __set_current_state(TASK_*) instead of current->state = TASK_* Signed-off-by: Milind Arun Choudhary <milindchoudhary@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
| * | Copy i_flags to jfs inode flags on writeDave Kleikamp2007-04-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | This mirrors Jan Kara's patches for ext3. This patch makes sure that changes made to inode->i_flags are reflected on disk for jfs. It also moves a call of jfs_set_inode_flags() to be more consistent with where jfs_get_inode_flags() is called. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
* | | udf: possible null pointer dereference while load_partitionDmitriy Monakhov2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | sb_read may return NULL, let's explicitly check it. Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | udf: support files larger than 1GJan Kara2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make UDF work correctly for files larger than 1GB. As no extent can be longer than (1<<30)-blocksize bytes, we have to create several extents if a big hole is being created. As a side-effect, we now don't discard preallocated blocks when creating a hole. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | udf: add assertionsJan Kara2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a few assertions into udf_discard_prealloc() to check that the file is sane (mostly helps debugging further patches ;). Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | udf: use get_bh()Jan Kara2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make UDF use get_bh() instead of directly accessing b_count and use brelse() instead of udf_release_data() which does just brelse()... Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | UDF: introduce struct extent_positionJan Kara2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a structure extent_position to store a position of an extent and the corresponding buffer_head in one place. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | udf: use sector_t and loff_t for file offsetsJan Kara2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use sector_t and loff_t for file offsets in UDF filesystem. Otherwise an overflow may occur for long files. Also make inode_bmap() return offset in the extent in number of blocks instead of number of bytes - for most callers this is more convenient. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | nfs: fix congestion control: use atomic_longsPeter Zijlstra2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change the atomic_t in struct nfs_server to atomic_long_t in anticipation of machines that can handle 8+TB of (4K) pages under writeback. However I suspect other things in NFS will start going *bang* by then. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | utimensat implementationUlrich Drepper2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement utimensat(2) which is an extension to futimesat(2) in that it a) supports nano-second resolution for the timestamps b) allows to selectively ignore the atime/mtime value c) allows to selectively use the current time for either atime or mtime d) supports changing the atime/mtime of a symlink itself along the lines of the BSD lutimes(3) functions For this change the internally used do_utimes() functions was changed to accept a timespec time value and an additional flags parameter. Additionally the sys_utime function was changed to match compat_sys_utime which already use do_utimes instead of duplicating the work. Also, the completely missing futimensat() functionality is added. We have such a function in glibc but we have to resort to using /proc/self/fd/* which not everybody likes (chroot etc). Test application (the syscall number will need per-arch editing): #include <errno.h> #include <fcntl.h> #include <time.h> #include <sys/time.h> #include <stddef.h> #include <syscall.h> #define __NR_utimensat 280 #define UTIME_NOW ((1l << 30) - 1l) #define UTIME_OMIT ((1l << 30) - 2l) int main(void) { int status = 0; int fd = open("ttt", O_RDWR|O_CREAT|O_EXCL, 0666); if (fd == -1) error (1, errno, "failed to create test file \"ttt\""); struct stat64 st1; if (fstat64 (fd, &st1) != 0) error (1, errno, "fstat failed"); struct timespec t[2]; t[0].tv_sec = 0; t[0].tv_nsec = 0; t[1].tv_sec = 0; t[1].tv_nsec = 0; if (syscall(__NR_utimensat, AT_FDCWD, "ttt", t, 0) != 0) error (1, errno, "utimensat failed"); struct stat64 st2; if (fstat64 (fd, &st2) != 0) error (1, errno, "fstat failed"); if (st2.st_atim.tv_sec != 0 || st2.st_atim.tv_nsec != 0) { puts ("atim not reset to zero"); status = 1; } if (st2.st_mtim.tv_sec != 0 || st2.st_mtim.tv_nsec != 0) { puts ("mtim not reset to zero"); status = 1; } if (status != 0) goto out; t[0] = st1.st_atim; t[1].tv_sec = 0; t[1].tv_nsec = UTIME_OMIT; if (syscall(__NR_utimensat, AT_FDCWD, "ttt", t, 0) != 0) error (1, errno, "utimensat failed"); if (fstat64 (fd, &st2) != 0) error (1, errno, "fstat failed"); if (st2.st_atim.tv_sec != st1.st_atim.tv_sec || st2.st_atim.tv_nsec != st1.st_atim.tv_nsec) { puts ("atim not set"); status = 1; } if (st2.st_mtim.tv_sec != 0 || st2.st_mtim.tv_nsec != 0) { puts ("mtim changed from zero"); status = 1; } if (status != 0) goto out; t[0].tv_sec = 0; t[0].tv_nsec = UTIME_OMIT; t[1] = st1.st_mtim; if (syscall(__NR_utimensat, AT_FDCWD, "ttt", t, 0) != 0) error (1, errno, "utimensat failed"); if (fstat64 (fd, &st2) != 0) error (1, errno, "fstat failed"); if (st2.st_atim.tv_sec != st1.st_atim.tv_sec || st2.st_atim.tv_nsec != st1.st_atim.tv_nsec) { puts ("mtim changed from original time"); status = 1; } if (st2.st_mtim.tv_sec != st1.st_mtim.tv_sec || st2.st_mtim.tv_nsec != st1.st_mtim.tv_nsec) { puts ("mtim not set"); status = 1; } if (status != 0) goto out; sleep (2); t[0].tv_sec = 0; t[0].tv_nsec = UTIME_NOW; t[1].tv_sec = 0; t[1].tv_nsec = UTIME_NOW; if (syscall(__NR_utimensat, AT_FDCWD, "ttt", t, 0) != 0) error (1, errno, "utimensat failed"); if (fstat64 (fd, &st2) != 0) error (1, errno, "fstat failed"); struct timeval tv; gettimeofday(&tv,NULL); if (st2.st_atim.tv_sec <= st1.st_atim.tv_sec || st2.st_atim.tv_sec > tv.tv_sec) { puts ("atim not set to NOW"); status = 1; } if (st2.st_mtim.tv_sec <= st1.st_mtim.tv_sec || st2.st_mtim.tv_sec > tv.tv_sec) { puts ("mtim not set to NOW"); status = 1; } if (symlink ("ttt", "tttsym") != 0) error (1, errno, "cannot create symlink"); t[0].tv_sec = 0; t[0].tv_nsec = 0; t[1].tv_sec = 0; t[1].tv_nsec = 0; if (syscall(__NR_utimensat, AT_FDCWD, "tttsym", t, AT_SYMLINK_NOFOLLOW) != 0) error (1, errno, "utimensat failed"); if (lstat64 ("tttsym", &st2) != 0) error (1, errno, "lstat failed"); if (st2.st_atim.tv_sec != 0 || st2.st_atim.tv_nsec != 0) { puts ("symlink atim not reset to zero"); status = 1; } if (st2.st_mtim.tv_sec != 0 || st2.st_mtim.tv_nsec != 0) { puts ("symlink mtim not reset to zero"); status = 1; } if (status != 0) goto out; t[0].tv_sec = 1; t[0].tv_nsec = 0; t[1].tv_sec = 1; t[1].tv_nsec = 0; if (syscall(__NR_utimensat, fd, NULL, t, 0) != 0) error (1, errno, "utimensat failed"); if (fstat64 (fd, &st2) != 0) error (1, errno, "fstat failed"); if (st2.st_atim.tv_sec != 1 || st2.st_atim.tv_nsec != 0) { puts ("atim not reset to one"); status = 1; } if (st2.st_mtim.tv_sec != 1 || st2.st_mtim.tv_nsec != 0) { puts ("mtim not reset to one"); status = 1; } if (status == 0) puts ("all OK"); out: close (fd); unlink ("ttt"); unlink ("tttsym"); return status; } [akpm@linux-foundation.org: add missing i386 syscall table entry] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Cc: Alexey Dobriyan <adobriyan@openvz.org> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | inode numbering: change libfs sb creation routines to avoid collisions with ↵Jeff Layton2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | their root inodes This patch makes it so that simple_fill_super and get_sb_pseudo assign their root inodes to be number 1. It also fixes up a couple of callers of simple_fill_super that were passing in files arrays that had an index at number 1, and adds a warning for any caller that sends in such an array. It would have been nice to have made it so that it wasn't possible to make such a collision, but some callers need to be able to control what inode number their entries get, so I think this is the best that can be done. Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | inode numbering: make static counters in new_inode and iunique be 32 bitsJeff Layton2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problems are: - on filesystems w/o permanent inode numbers, i_ino values can be larger than 32 bits, which can cause problems for some 32 bit userspace programs on a 64 bit kernel. We can't do anything for filesystems that have actual >32-bit inode numbers, but on filesystems that generate i_ino values on the fly, we should try to have them fit in 32 bits. We could trivially fix this by making the static counters in new_inode and iunique 32 bits, but... - many filesystems call new_inode and assume that the i_ino values they are given are unique. They are not guaranteed to be so, since the static counter can wrap. This problem is exacerbated by the fix for #1. - after allocating a new inode, some filesystems call iunique to try to get a unique i_ino value, but they don't actually add their inodes to the hashtable, and so they're still not guaranteed to be unique if that counter wraps. This patch set takes the simpler approach of simply using iunique and hashing the inodes afterward. Christoph H. previously mentioned that he thought that this approach may slow down lookups for filesystems that currently hash their inodes. The questions are: 1) how much would this slow down lookups for these filesystems? 2) is it enough to justify adding more infrastructure to avoid it? What might be best is to start with this approach and then only move to using IDR or some other scheme if these extra inodes in the hashtable prove to be problematic. I've done some cursory testing with this patch and the overhead of hashing and unhashing the inodes with pipefs is pretty low -- just a few seconds of system time added on to the creation and destruction of 10 million pipes (very similar to the overhead that the IDR approach would add). The hard thing to measure is what effect this has on other filesystems. I'm open to ways to try and gauge this. Again, I've only converted pipefs as an example. If this approach is acceptable then I'll start work on patches to convert other filesystems. With a pretty-much-worst-case microbenchmark provided by Eric Dumazet <dada1@cosmosbay.com>: hashing patch (pipebench): sys 1m15.329s sys 1m16.249s sys 1m17.169s unpatched (pipebench): sys 1m9.836s sys 1m12.541s sys 1m14.153s Which works out to 1.05642174294555027017. So ~5-6% slowdown. This patch: When a 32-bit program that was not compiled with large file offsets does a stat and gets a st_ino value back that won't fit in the 32 bit field, glibc (correctly) generates an EOVERFLOW error. We can't do anything about fs's with larger permanent inode numbers, but when we generate them on the fly, we ought to try and have them fit within a 32 bit field. This patch takes the first step toward this by making the static counters in these two functions be 32 bits. [jlayton@redhat.com: mention that it's only the case for 32bit, non-LFS stat] Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Invalid return value of execve() resulting in oopsesAlexey Kuznetsov2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When elf loader fails to map executable (due to memory shortage or because binary is malformed), it can return 0. Normally, this is invisible because process is killed with SIGKILL and it never returns to user space. But if exec() is called from kernel thread (hotplug, whatever) consequences are more interesting and vary depending on architecture. i386. Nothing especially interesting, execve() just returns with "success" :-) x86_64. Fake zero frame is used on way to caller, RSP/RIP are loaded with zeros, ergo... double fault. ia64. Similar to i386, but r32...r95 are corrupted. Sometimes it oopses due to return to zero PC, sometimes it sees NaT in rXX and oopses due to NaT consumption. Signed-off-by: Alexey Kuznetsov <alexey@openvz.org> Signed-off-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | procfs: use simple_read_from_buffer()Akinobu Mita2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | Cleanup using simple_read_from_buffer() in procfs. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Fix error handling in HDIO_GETGEO compat wrapperAndreas Schwab2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | Don't clobber error from sys_ioctl in HDIO_GETGEO compat wrapper. Signed-off-by: Andreas Schwab <schwab@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | udf: decrement correct link count in udf_rmdirStephen Mollett2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It appears that a minor thinko occurred in udf_rmdir and the (already-cleared) link count on the directory that is being removed was being decremented instead of the link count on its parent directory. This gives rise to lots of kernel messages similar to: UDF-fs warning (device loop1): udf_rmdir: empty directory has nlink != 2 (8) when removing directory trees. No other ill effects have been observed but I guess it could theoretically result in the link count overflowing on a very long-lived, much modified directory. Signed-off-by: Stephen Mollett <molletts@yahoo.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Jan Kara <jack@ucw.cz> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | fat: fix VFAT compat ioctls on 64-bit systemsOGAWA Hirofumi2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If you compile and run the below test case in an msdos or vfat directory on an x86-64 system with -m32 you'll get garbage in the kernel_dirent struct followed by a SIGSEGV. The patch fixes this. Reported and initial fix by Bart Oldeman #include <sys/types.h> #include <sys/ioctl.h> #include <dirent.h> #include <stdio.h> #include <unistd.h> #include <fcntl.h> struct kernel_dirent { long d_ino; long d_off; unsigned short d_reclen; char d_name[256]; /* We must not include limits.h! */ }; #define VFAT_IOCTL_READDIR_BOTH _IOR('r', 1, struct kernel_dirent [2]) #define VFAT_IOCTL_READDIR_SHORT _IOR('r', 2, struct kernel_dirent [2]) int main(void) { int fd = open(".", O_RDONLY); struct kernel_dirent de[2]; while (1) { int i = ioctl(fd, VFAT_IOCTL_READDIR_BOTH, (long)de); if (i == -1) break; if (de[0].d_reclen == 0) break; printf("SFN: reclen=%2d off=%d ino=%d, %-12s", de[0].d_reclen, de[0].d_off, de[0].d_ino, de[0].d_name); if (de[1].d_reclen) printf("\tLFN: reclen=%2d off=%d ino=%d, %s", de[1].d_reclen, de[1].d_off, de[1].d_ino, de[1].d_name); printf("\n"); } return 0; } Signed-off-by: Bart Oldeman <bartoldeman@users.sourceforge.net> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ext3: copy i_flags to inode flags on writeJan Kara2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into ext2-specific i_flags. Hence, when someone sets these flags via a different interface than ioctl, they are stored correctly. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | fat: don't use free_clusters for fat32OGAWA Hirofumi2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It seems that the recent Windows changed specification, and it's undocumented. Windows doesn't update ->free_clusters correctly. This patch doesn't use ->free_clusters by default. (instead, add "usefree" for forcing to use it) Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Juergen Beisert <juergen127@kreuzholzen.de> Cc: Andreas Schwab <schwab@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | reiserfs: use __set_current_state()Milind Arun Choudhary2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | use __set_current_state(TASK_*) instead of current->state = TASK_*, in fs/reiserfs Signed-off-by: Milind Arun Choudhary <milindchoudhary@gmail.com> Cc: <reiserfs-dev@namesys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | jbd: check for error returned by kthread_create on creating journal threadPavel Emelianov2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the thread failed to create the subsequent wait_event will hang forever. This is likely to happen if kernel hits max_threads limit. Will be critical for virtualization systems that limit the number of tasks and kernel memory usage within the container. (akpm: JBD should be converted fully to the kthread API: kthread_should_stop() and kthread_stop()). Cc: <linux-ext4@vger.kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | check privileges before setting mount propagationMiklos Szeredi2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a missing check for CAP_SYS_ADMIN in do_change_type(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>