aboutsummaryrefslogtreecommitdiffstats
path: root/fs/ext4
Commit message (Collapse)AuthorAge
* ext4: convert to new aopsNick Piggin2007-10-16
| | | | | | | | | | | Convert ext4 to use write_begin()/write_end() methods. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Dmitriy Monakhov <dmonakhov@sw.ru> Cc: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* readahead: combine file_ra_state.prev_index/prev_offset into prev_posFengguang Wu2007-10-16
| | | | | | | | | | | | | | | | | | | Combine the file_ra_state members unsigned long prev_index unsigned int prev_offset into loff_t prev_pos It is more consistent and better supports huge files. Thanks to Peter for the nice proposal! [akpm@linux-foundation.org: fix shift overflow] Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext34: ensure do_split leaves enough free space in both blocksEric Sandeen2007-09-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The do_split() function for htree dir blocks is intended to split a leaf block to make room for a new entry. It sorts the entries in the original block by hash value, then moves the last half of the entries to the new block - without accounting for how much space this actually moves. (IOW, it moves half of the entry *count* not half of the entry *space*). If by chance we have both large & small entries, and we move only the smallest entries, and we have a large new entry to insert, we may not have created enough space for it. The patch below stores each record size when calculating the dx_map, and then walks the hash-sorted dx_map, calculating how many entries must be moved to more evenly split the existing entries between the old block and the new block, guaranteeing enough space for the new entry. The dx_map "offs" member is reduced to u16 so that the overall map size does not change - it is temporarily stored at the end of the new block, and if it grows too large it may be overwritten. By making offs and size both u16, we won't grow the map size. Also add a few comments to the functions involved. This fixes the testcase reported by hooanon05@yahoo.co.jp on the linux-ext4 list, "ext3 dir_index causes an error" Thanks to Andreas Dilger for discussing the problem & solution with me. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Tested-by: Junjiro Okajima <hooanon05@yahoo.co.jp> Cc: Theodore Ts'o <tytso@mit.edu> Cc: <linux-ext4@vger.kernel.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* dir_index: error out instead of BUG on corrupt dx dirsEric Sandeen2007-09-19
| | | | | | | | | | | | | Convert asserts (BUGs) in dx_probe from bad on-disk data to recoverable errors with helpful warnings. With help catching other asserts from Duane Griffin <duaneg@dghda.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Duane Griffin <duaneg@dghda.com> Acked-by: Theodore Ts'o <tytso@mit.edu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* quota: fix infinite loopJan Kara2007-09-11
| | | | | | | | | | | If we fail to start a transaction when releasing dquot, we have to call dquot_release() anyway to mark dquot structure as inactive. Otherwise we end in an infinite loop inside dqput(). Signed-off-by: Jan Kara <jack@suse.cz> Cc: xb <xavier.bru@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* "ext4_ext_put_in_cache" uses __u32 to receive physical block numberMingming Cao2007-07-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | Yan Zheng wrote: > I think I found a bug in ext4/extents.c, "ext4_ext_put_in_cache" uses > "__u32" to receive physical block number. "ext4_ext_put_in_cache" is > used in "ext4_ext_get_blocks", it sets ext4 inode's extent cache > according most recently tree lookup (higher 16 bits of saved physical > block number are always zero). when serving a mapping request, > "ext4_ext_get_blocks" first check whether the logical block is in > inode's extent cache. if the logical block is in the cache and the > cached region isn't a gap, "ext4_ext_get_blocks" gets physical block > number by using cached region's physical block number and offset in > the cached region. as described above, "ext4_ext_get_blocks" may > return wrong result when there are physical block numbers bigger than > 0xffffffff. > You are right. Thanks for reporting this! Signed-off-by: Mingming Cao <cmm@us.ibm.com> Cc: Yan Zheng <yanzheng@21cn.com> Cc: <stable@kernel.org> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fix inode_table test in ext234_check_descriptorsEric Sandeen2007-07-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | ext[234]_check_descriptors sanity checks block group descriptor geometry at mount time, testing whether the block bitmap, inode bitmap, and inode table reside wholly within the blockgroup. However, the inode table test is off by one so that if the last block in the inode table resides on the last block of the block group, the test incorrectly fails. This is because it tests the last block as (start + length) rather than (start + length - 1). This can be seen by trying to mount a filesystem made such as: mkfs.ext2 -F -b 1024 -m 0 -g 256 -N 3744 fsfile 1024 which yields: EXT2-fs error (device loop0): ext2_check_descriptors: Inode table for group 0 not in group (block 101)! EXT2-fs: group descriptors corrupted! There is a similar bug in e2fsprogs, patch already sent for that. (I wonder if inside(), outside(), and/or in_range() should someday be used in this and other tests throughout the ext filesystems...) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: Remove slab destructors from kmem_cache_create().Paul Mundt2007-07-19
| | | | | | | | | | | | | | Slab destructors were no longer supported after Christoph's c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>
* fix ext4/JBD2 build warningsMingming Cao2007-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Looking at the current linus-git tree jbd_debug() define in include/linux/jbd2.h extern u8 journal_enable_debug; #define jbd_debug(n, f, a...) \ do { \ if ((n) <= journal_enable_debug) { \ printk (KERN_DEBUG "(%s, %d): %s: ", \ __FILE__, __LINE__, __FUNCTION__); \ printk (f, ## a); \ } \ } while (0) > fs/ext4/inode.c: In function ‘ext4_write_inode’: > fs/ext4/inode.c:2906: warning: comparison is always true due to limited > range of data type > > fs/jbd2/recovery.c: In function ‘jbd2_journal_recover’: > fs/jbd2/recovery.c:254: warning: comparison is always true due to > limited range of data type > fs/jbd2/recovery.c:257: warning: comparison is always true due to > limited range of data type > > fs/jbd2/recovery.c: In function ‘jbd2_journal_skip_recovery’: > fs/jbd2/recovery.c:301: warning: comparison is always true due to > limited range of data type > Noticed all warnings are occurs when the debug level is 0. Then found the "jbd2: Move jbd2-debug file to debugfs" patch http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0f49d5d019afa4e94253bfc92f0daca3badb990b changed the jbd2_journal_enable_debug from int type to u8, makes the jbd_debug comparision is always true when the debugging level is 0. Thus the compile warning occurs. Thought about changing the jbd2_journal_enable_debug data type back to int, but can't, because the jbd2-debug is moved to debug fs, where calling debugfs_create_u8() to create the debugfs entry needs the value to be u8 type. Even if we changed the data type back to int, the code is still buggy, kernel should not print jbd2 debug message if the jbd2_journal_enable_debug is set to 0. But this is not the case. The fix is change the level of debugging to 1. The same should fixed in ext3/JBD, but currently ext3 jbd-debug via /proc fs is broken, so we probably should fix it all together. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Cc: Jeff Garzik <jeff@garzik.org> Cc: Theodore Tso <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* readahead: split ondemand readahead interface into two functionsRusty Russell2007-07-19
| | | | | | | | | | | | | Split ondemand readahead interface into two functions. I think this makes it a little clearer for non-readahead experts (like Rusty). Internally they both call ondemand_readahead(), but the page argument is changed to an obvious boolean flag. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* readahead: convert ext3/ext4 invocationsFengguang Wu2007-07-19
| | | | | | | | | | | | | | | | | Convert ext3/ext4 dir reads to use on-demand readahead. Readahead for dirs operates _not_ on file level, but on blockdev level. This makes a difference when the data blocks are not continuous. And the read routine is somehow opaque: there's no handy info about the status of current page. So a simplified call scheme is employed: to call into readahead whenever the current page falls out of readahead windows. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext4: extent macros cleanupDmitry Monakhov2007-07-18
| | | | | | | | | | | | | | | Use the EXT_LAST_INDEX macro; that's what it's there for. Clean up ext4_ext_ext_grow_indepth() so the correct EXT_FIRST_INDEX or EXT_FIRST_MACRO is used as necessary. The two macros are equivalent, so the C will collapse the if statement out, but it makes the code much more readable. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Acked-by: Alex Tomas <alex@clusterfs.com> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Singed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* Fix compilation with EXT_DEBUG, also fix leXX_to_cpu conversions.Dmitry Monakhov2007-07-18
| | | | | | | Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Acked-by: Alex Tomas <alex@clusterfs.com> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: remove extra IS_RDONLY() checkDave Hansen2007-07-18
| | | | | | | | | | | | ext4_change_inode_journal_flag() is only called from one location: ext4_ioctl(EXT3_IOC_SETFLAGS). That ioctl case already has a IS_RDONLY() call in it so this one is superfluous. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Use is_power_of_2()Vignesh Babu2007-07-18
| | | | | | | | | | | Replace (n & (n-1)) in the context of power of 2 checks with is_power_of_2() Signed-off-by: Vignesh Babu <vignesh.babu@wipro.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* Use zero_user_page() in ext4 where possibleEric Sandeen2007-07-18
| | | | | | Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Remove 65000 subdirectory limitAndreas Dilger2007-07-18
| | | | | | | | | | | | | | | | | | | | | | | This patch adds support to ext4 for allowing more than 65000 subdirectories. Currently the maximum number of subdirectories is capped at 32000. If we exceed 65000 subdirectories in an htree directory it sets the inode link count to 1 and no longer counts subdirectories. The directory link count is not actually used when determining if a directory is empty, as that only counts subdirectories and not regular files that might be in there. A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if the subdir count for any directory crosses 65000. A later fsck will clear EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any directory with >65000 subdirs. Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Kalpak Shah <kalpak@clusterfs.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Expand extra_inodes space per the s_{want,min}_extra_isize fields Kalpak Shah2007-07-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to make sure that existing ext3 filesystems can also avail the new fields that have been added to the ext4 inode. We use s_want_extra_isize and s_min_extra_isize to decide by how much we should expand the inode. If EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature is set then we expand the inode by max(s_want_extra_isize, s_min_extra_isize , sizeof(ext4_inode) - EXT4_GOOD_OLD_INODE_SIZE) bytes. Actually it is still an open question about whether users should be able to set s_*_extra_isize smaller than the known fields or not. This patch also adds the functionality to expand inodes to include the newly added fields. We start by trying to expand by s_want_extra_isize bytes and if its fails we try to expand by s_min_extra_isize bytes. This is done by changing the i_extra_isize if enough space is available in the inode and no EAs are present. If EAs are present and there is enough space in the inode then the EAs in the inode are shifted to make space. If enough space is not available in the inode due to the EAs then 1 or more EAs are shifted to the external EA block. In the worst case when even the external EA block does not have enough space we inform the user that some EA would need to be deleted or s_min_extra_isize would have to be reduced. Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Kalpak Shah <kalpak@clusterfs.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Add nanosecond timestampsKalpak Shah2007-07-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds nanosecond timestamps for ext4. This involves adding *time_extra fields to the ext4_inode to extend the timestamps to 64-bits. Creation time is also added by this patch. These extended fields will fit into an inode if the filesystem was formatted with large inodes (-I 256 or larger) and there are currently no EAs consuming all of the available space. For new inodes we always reserve enough space for the kernel's known extended fields, but for inodes created with an old kernel this might not have been the case. So this patch also adds the EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature flag(ro-compat so that older kernels can't create inodes with a smaller extra_isize). which indicates if the fields fitting inside s_min_extra_isize are available or not. If the expansion of inodes if unsuccessful then this feature will be disabled. This feature is only enabled if requested by the sysadmin. None of the extended inode fields is critical for correct filesystem operation. Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Kalpak Shah <kalpak@clusterfs.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* jbd2: Fix CONFIG_JBD_DEBUG ifdef to be CONFIG_JBD2_DEBUGJose R. Santos2007-07-18
| | | | | | | | | | When the JBD code was forked to create the new JBD2 code base, the references to CONFIG_JBD_DEBUG where never changed to CONFIG_JBD2_DEBUG. This patch fixes that. Signed-off-by: Jose R. Santos <jrs@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Set the journal JBD2_FEATURE_INCOMPAT_64BIT on large devicesJose R. Santos2007-07-18
| | | | | | | | | | | | Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more than 32bit block sizes during mount time. This ensure proper record lenth when writing to the journal. Signed-off-by: Jose R. Santos <jrs@us.ibm.com> Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Make extents code sanely handle on-disk corruptionAlex Tomas2007-07-18
| | | | | | | | | | | Add more run-time checking of extent header fields and remove BUG_ON checks so we don't panic the kernel just because the on-disk filesystem is corrupted. Signed-off-by: Alex Tomas <alex@clusterfs.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: copy i_flags to inode flags on writeJan Kara2007-07-18
| | | | | | | | | | | | | | Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into ext4-specific i_flags. Quota code changes these flags on quota files (to make it harder for sysadmin to screw himself) and these changes were not correctly propagated into the filesystem. (This is a forward port patch from ext3) Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Enable extents by defaultMingming Cao2007-07-18
| | | | | | | | | | Turn on extents feature by default in ext4 filesystem, to get wider testing of extents feature in ext4dev. This can be disabled using -o noextents. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* Change on-disk format to support 2^15 uninitialized extentsAmit Arora2007-07-18
| | | | | | | | | | | | This change was suggested by Andreas Dilger. This patch changes the EXT_MAX_LEN value and extent code which marks/checks uninitialized extents. With this change it will be possible to have initialized extents with 2^15 blocks (earlier the max blocks we could have was 2^15 - 1). This way we can have better extent-to-block alignment. Now, maximum number of blocks we can have in an initialized extent is 2^15 and in an uninitialized extent is 2^15 - 1. Signed-off-by: Amit Arora <aarora@in.ibm.com>
* write support for preallocated blocksAmit Arora2007-07-17
| | | | | | | | | This patch adds write support to the uninitialized extents that get created when a preallocation is done using fallocate(). It takes care of splitting the extents into multiple (upto three) extents and merging the new split extents with neighbouring ones, if possible. Signed-off-by: Amit Arora <aarora@in.ibm.com>
* fallocate support in ext4Amit Arora2007-07-17
| | | | | | | | | | | | This patch implements ->fallocate() inode operation in ext4. With this patch users of ext4 file systems will be able to use fallocate() system call for persistent preallocation. Current implementation only supports preallocation for regular files (directories not supported as of date) with extent maps. This patch does not support block-mapped files currently. Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of now. Signed-off-by: Amit Arora <aarora@in.ibm.com>
* Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid checkSatyam Sharma2007-07-17
| | | | | | | | | | | | | | | | | | | | Introduce is_owner_or_cap() macro in fs.h, and convert over relevant users to it. This is done because we want to avoid bugs in the future where we check for only effective fsuid of the current task against a file's owning uid, without simultaneously checking for CAP_FOWNER as well, thus violating its semantics. [ XFS uses special macros and structures, and in general looked ... untouchable, so we leave it alone -- but it has been looked over. ] The (current->fsuid != inode->i_uid) check in generic_permission() and exec_permission_lite() is left alone, because those operations are covered by CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH. Similarly operations falling under the purview of CAP_CHOWN and CAP_LEASE are also left alone. Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in> Cc: Al Viro <viro@ftp.linux.org.uk> Acked-by: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* knfsd: exportfs: add exportfs.h headerChristoph Hellwig2007-07-17
| | | | | | | | | | | | | currently the export_operation structure and helpers related to it are in fs.h. fs.h is already far too large and there are very few places needing the export bits, so split them off into a separate header. [akpm@linux-foundation.org: fix cifs build] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Neil Brown <neilb@suse.de> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext4: statfs speed upBadari Pulavarty2007-07-16
| | | | | | | | | | | | | | | | | | This is a patch that speeds up statfs. It is very simple - the "overhead" calculation, which takes a huge amount of time for large filesystems, never changes unless the size of the filesystem itself changes. That means we can store it in memory and only recalculate if the filesystem has been resized (almost never). It also fixes a minor problem that we never update the on-disk superblock free blocks/inodes counts until the filesystem is unmounted. While not fatal, we may as well update that on disk when we have the information, and it makes things like debugfs and dumpe2fs report a bit more accurate info. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext4: fix error handling in ext4_create_journalBorislav Petkov2007-07-16
| | | | | | | | | Fix error handling in ext4_create_journal according to kernel conventions. Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mistaken ext4_inode_bitmap for ext4_block_bitmapToshiyuki Okajima2007-07-16
| | | | | | | | | | | In ext4_new_blocks(), one of two ext4_block_bitmap() calls should be ext4_inode_bitmap() call. It is not harmful in normal processing, but it should be fixed. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext4: fix deadlock in ext4_remount() and orphan list handlingJan Kara2007-07-16
| | | | | | | | | | | | | | | | ext4_orphan_add() and ext4_orphan_del() functions lock sb->s_lock with a transaction started with ext4_mark_recovery_complete() waits for a transaction holding sb->s_lock, thus leading to a possible deadlock. At the moment we call ext4_mark_recovery_complete() from ext4_remount() we have done all the work needed for remounting and thus we are safe to drop sb->s_lock before we wait for transactions to commit. Note that at this moment we are still guarded by s_umount lock against other remounts/umounts. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Eric Sandeen <sandeen@sandeen.net> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext3/ext4: orphan list corruption due bad inodeVasily Averin2007-07-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After ext3 orphan list check has been added into ext3_destroy_inode() (please see my previous patch) the following situation has been detected: EXT3-fs warning (device sda6): ext3_unlink: Deleting nonexistent file (37901290), 0 Inode 00000101a15b7840: orphan list check failed! 00000773 6f665f00 74616d72 00000573 65725f00 06737270 66000000 616d726f ... Call Trace: [<ffffffff80211ea9>] ext3_destroy_inode+0x79/0x90 [<ffffffff801a2b16>] sys_unlink+0x126/0x1a0 [<ffffffff80111479>] error_exit+0x0/0x81 [<ffffffff80110aba>] system_call+0x7e/0x83 First messages said that unlinked inode has i_nlink=0, then ext3_unlink() adds this inode into orphan list. Second message means that this inode has not been removed from orphan list. Inode dump has showed that i_fop = &bad_file_ops and it can be set in make_bad_inode() only. Then I've found that ext3_read_inode() can call make_bad_inode() without any error/warning messages, for example in the following case: ... if (inode->i_nlink == 0) { if (inode->i_mode == 0 || !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) { /* this inode is deleted */ brelse (bh); goto bad_inode; ... Bad inode can live some time, ext3_unlink can add it to orphan list, but ext3_delete_inode() do not deleted this inode from orphan list. As result we can have orphan list corruption detected in ext3_destroy_inode(). However it is not clear for me how to fix this issue correctly. As far as i see is_bad_inode() is called after iget() in all places excluding ext3_lookup() and ext3_get_parent(). I believe it makes sense to add bad inode check to these functions too and call iput if bad inode detected. Signed-off-by: Vasily Averin <vvs@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext3/ext4: orphan list check on destroy_inodeVasily Averin2007-07-16
| | | | | | | | | | | | Customers claims to ext3-related errors, investigation showed that ext3 orphan list has been corrupted and have the reference to non-ext3 inode. The following debug helps to understand the reasons of this issue. [akpm@linux-foundation.org: update for print_hex_dump() changes] Signed-off-by: Vasily Averin <vvs@sw.ru> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sendfile: remove .sendfile from filesystems that use generic_file_sendfile()Jens Axboe2007-07-10
| | | | | | | They can use generic_file_splice_read() instead. Since sys_sendfile() now prefers that, there should be no change in behaviour. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* ext4: lost brelse in ext4_read_inode()Kirill Korotaev2007-06-24
| | | | | | | | | | One of error path in ext4_read_inode() leaks bh since brelse is forgoten. Signed-off-by: Kirill Korotaev <dev@openvz.org> Acked-by: Vasily Averin <vvs@sw.ru> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* When ext4_ext_insert_extent() fails to insert new blocksAlex Tomas2007-05-31
| | | | | | | | we should free just the allocated blocks. Signed-off-by: Alex Tomas <alex@clusterfs.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Extent overlap bugfixAmit Arora2007-05-31
| | | | | | | | This patch adds a check for overlap of extents and cuts short the new extent to be inserted, if there is a chance of overlap. Signed-off-by: Amit Arora <aarora@in.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* Remove unnecessary exported symbols.Mingming Cao2007-05-31
| | | | | Signed-Off-By: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* EXT4: Fix whitespaceDave Kleikamp2007-05-31
| | | | | | | Replace a lot of spaces with tabs Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* Remove SLAB_CTOR_CONSTRUCTORChristoph Lameter2007-05-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: David Howells <dhowells@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@ucw.cz> Cc: David Chinner <dgc@sgi.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* header cleaning: don't include smp_lock.h when not usedRandy Dunlap2007-05-08
| | | | | | | | | | | | Remove includes of <linux/smp_lock.h> where it is not used/needed. Suggested by Al Viro. Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc, sparc64, and arm (all 59 defconfigs). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext3: dirindex error pointer issuesDmitriy Monakhov2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | - ext3_dx_find_entry() exit with out setting proper error pointer - do_split() exit with out setting proper error pointer it is realy painful because many callers contain folowing code: de = do_split(handle,dir, &bh, frame, &hinfo, &retval); if (!(de)) return retval; <<< WOW retval wasn't changed by do_split(), so caller failed <<< but return SUCCESS :) - Rearrange do_split() error path. Current error path is realy ugly, all this up and down jump stuff doesn't make code easy to understand. [dmonakhov@sw.ru: fix annoying fake error messages] Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org> Cc: Andreas Dilger <adilger@clusterfs.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext2/3/4: fix file date underflow on ext2 3 filesystems on 64 bit systemsMarkus Rechberger2007-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Taken from http://bugzilla.kernel.org/show_bug.cgi?id=5079 signed long ranges from -2.147.483.648 to 2.147.483.647 on x86 32bit 10000011110110100100111110111101 .. -2,082,844,739 10000011110110100100111110111101 .. 2,212,122,557 <- this currently gets stored on the disk but when converting it to a 64bit signed long value it loses its sign and becomes positive. Cc: Andreas Dilger <adilger@dilger.ca> Cc: <linux-ext4@vger.kernel.org> Andreas says: This patch is now treating timestamps with the high bit set as negative times (before Jan 1, 1970). This means we lose 1/2 of the possible range of timestamps (lopping off 68 years before unix timestamp overflow - now only 30 years away :-) to handle the extremely rare case of setting timestamps into the distant past. If we are only interested in fixing the underflow case, we could just limit the values to 0 instead of storing negative values. At worst this will skew the timestamp by a few hours for timezones in the far east (files would still show Jan 1, 1970 in "ls -l" output). That said, it seems 32-bit systems (mine at least) allow files to be set into the past (01/01/1907 works fine) so it seems this patch is bringing the x86_64 behaviour into sync with other kernels. On the plus side, we have a patch that is ready to add nanosecond timestamps to ext3 and as an added bonus adds 2 high bits to the on-disk timestamp so this extends the maximum date to 2242. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* slab allocators: Remove SLAB_DEBUG_INITIAL flagChristoph Lameter2007-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by SLAB. I think its purpose was to have a callback after an object has been freed to verify that the state is the constructor state again? The callback is performed before each freeing of an object. I would think that it is much easier to check the object state manually before the free. That also places the check near the code object manipulation of the object. Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was compiled with SLAB debugging on. If there would be code in a constructor handling SLAB_DEBUG_INITIAL then it would have to be conditional on SLAB_DEBUG otherwise it would just be dead code. But there is no such code in the kernel. I think SLUB_DEBUG_INITIAL is too problematic to make real use of, difficult to understand and there are easier ways to accomplish the same effect (i.e. add debug code before kfree). There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be clear in fs inode caches. Remove the pointless checks (they would even be pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors. This is the last slab flag that SLUB did not support. Remove the check for unimplemented flags from SLUB. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove destroy_dirty_buffers from invalidate_bdev()Peter Zijlstra2007-05-07
| | | | | | | | | | | | | | | Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't been used in 6 years (so akpm says). find * -name \*.[ch] | xargs grep -l invalidate_bdev | while read file; do quilt add $file; sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file; done Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] revert "retries in ext4_prepare_write() violate ordering requirements"Andrew Morton2007-04-02
| | | | | | | | | | | | | | Revert b46be05004abb419e303e66e143eed9f8a6e9f3f. Same reasoning as for ext3. Cc: Kirill Korotaev <dev@openvz.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ken Chen <kenneth.w.chen@intel.com> Cc: Andrey Savochkin <saw@sw.ru> Cc: <linux-ext4@vger.kernel.org> Cc: Dmitriy Monakhov <dmonakhov@openvz.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] ext[34]: EA block reference count racing fixMingming Cao2007-03-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are race issues around ext[34] xattr block release code. ext[34]_xattr_release_block() checks the reference count of xattr block (h_refcount) and frees that xattr block if it is the last one reference it. Unlike ext2, the check of this counter is unprotected by any lock. ext[34]_xattr_release_block() will free the mb_cache entry before freeing that xattr block. There is a small window between the check for the re h_refcount ==1 and the call to mb_cache_entry_free(). During this small window another inode might find this xattr block from the mbcache and reuse it, racing a refcount updates. The xattr block will later be freed by the first inode without notice other inode is still use it. Later if that block is reallocated as a datablock for other file, then more serious problem might happen. We need put a lock around places checking the refount as well to avoid racing issue. Another place need this kind of protection is in ext3_xattr_block_set(), where it will modify the xattr block content in- the-fly if the refcount is 1 (means it's the only inode reference it). This will also fix another issue: the xattr block may not get freed at all if no lock is to protect the refcount check at the release time. It is possible that the last two inodes could release the shared xattr block at the same time. But both of them think they are not the last one so only decreased the h_refcount without freeing xattr block at all. We need to call lock_buffer() after ext3_journal_get_write_access() to avoid deadlock (because the later will call lock_buffer()/unlock_buffer () as well). Signed-off-by: Mingming Cao <cmm@us.ibm.com> Cc: Andreas Gruenbacher <agruen@suse.de> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] ext[234]: update documentationAneesh Kumar K.V2007-02-20
| | | | | | Signed-off-by: "Aneesh Kumar K.V" <aneesh.kumar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>