aboutsummaryrefslogtreecommitdiffstats
path: root/fs/ext4/file.c
Commit message (Collapse)AuthorAge
* ext4: serialize unaligned asynchronous DIOEric Sandeen2011-02-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4 has a data corruption case when doing non-block-aligned asynchronous direct IO into a sparse file, as demonstrated by xfstest 240. The root cause is that while ext4 preallocates space in the hole, mappings of that space still look "new" and dio_zero_block() will zero out the unwritten portions. When more than one AIO thread is going, they both find this "new" block and race to zero out their portion; this is uncoordinated and causes data corruption. Dave Chinner fixed this for xfs by simply serializing all unaligned asynchronous direct IO. I've done the same here. The difference is that we only wait on conversions, not all IO. This is a very big hammer, and I'm not very pleased with stuffing this into ext4_file_write(). But since ext4 is DIO_LOCKING, we need to serialize it at this high level. I tried to move this into ext4_ext_direct_IO, but by then we have the i_mutex already, and we will wait on the work queue to do conversions - which must also take the i_mutex. So that won't work. This was originally exposed by qemu-kvm installing to a raw disk image with a normal sector-63 alignment. I've tested a backport of this patch with qemu, and it does avoid the corruption. It is also quite a lot slower (14 min for package installs, vs. 8 min for well-aligned) but I'll take slow correctness over fast corruption any day. Mingming suggested that we can track outstanding conversions, and wait on those so that non-sparse files won't be affected, and I've implemented that here; unaligned AIO to nonsparse files won't take a perf hit. [tytso@mit.edu: Keep the mutex as a hashed array instead of bloating the ext4 inode] [tytso@mit.edu: Fix up namespace issues so that global variables are protected with an "ext4_" prefix.] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* fallocate should be a file operationChristoph Hellwig2011-01-17
| | | | | | | | | | | | | | | | | | | Currently all filesystems except XFS implement fallocate asynchronously, while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC I/O we really want our allocation on disk, especially for the !KEEP_SIZE case where we actually grow the file with user-visible zeroes. On the other hand always commiting the transaction is a bad idea for fast-path uses of fallocate like for example in recent Samba versions. Given that block allocation is a data plane operation anyway change it from an inode operation to a file operation so that we have the file structure available that lets us check for O_SYNC. This also includes moving the code around for a few of the filesystems, and remove the already unnedded S_ISDIR checks given that we only wire up fallocate for regular files. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessaryTheodore Ts'o2011-01-10
| | | | | | | | | | Replace the jbd2_inode structure (which is 48 bytes) with a pointer and only allocate the jbd2_inode when it is needed --- that is, when the file system has a journal present and the inode has been opened for writing. This allows us to further slim down the ext4_inode_info structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: improve llseek error handling for overly large seek offsetsToshiyuki Okajima2010-10-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The llseek system call should return EINVAL if passed a seek offset which results in a write error. What this maximum offset should be depends on whether or not the huge_file file system feature is set, and whether or not the file is extent based or not. If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be written (write systemcall) is different from the maximum size which can be sought (lseek systemcall). For example, the following 2 cases demonstrates the differences between the maximum size which can be written, versus the seek offset allowed by the llseek system call: #1: mkfs.ext3 <dev>; mount -t ext4 <dev> #2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev> Table. the max file size which we can write or seek at each filesystem feature tuning and file flag setting +============+===============================+===============================+ | \ File flag| | | | \ | !EXT4_EXTENTS_FL | EXT4_EXTETNS_FL | |case \| | | +------------+-------------------------------+-------------------------------+ | #1 | write: 2194719883264 | write: -------------- | | | seek: 2199023251456 | seek: -------------- | +------------+-------------------------------+-------------------------------+ | #2 | write: 4402345721856 | write: 17592186044415 | | | seek: 17592186044415 | seek: 17592186044415 | +------------+-------------------------------+-------------------------------+ The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes (= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped maxbytes). Although generic_file_llseek uses only extent-mapped maxbytes. (llseek of ext4_file_operations is generic_file_llseek which uses sb->s_maxbytes.) Therefore we create ext4 llseek function which uses 2 maxbytes. The new own function originates from generic_file_llseek(). If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca>
* ext4: fix EFBIG edge case when writing to large non-extent fileToshiyuki Okajima2010-07-27
| | | | | | | | | | | | | | | | | | | By running the following reproducer, we can confirm that the write system call returns with 0 when it should return the error EFBIG. #!/bin/sh /bin/dd if=/dev/zero of=./img bs=1k count=1 seek=1024k > /dev/null 2>&1 /sbin/mkfs.ext3 -Fq ./img /bin/mount -o loop -t ext4 ./img /mnt /bin/touch /mnt/file strace /bin/dd if=/dev/zero of=/mnt/file conv=notrunc bs=1k count=1 seek=$((2194719883264/1024)) 2>&1 | /bin/egrep "write.* 1024\) = " /bin/umount /mnt exit Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Eric Sandeen <sandeen@redhat.com>
* ext4: Clean up s_dirt handlingTheodore Ts'o2010-06-11
| | | | | | | | | | | | | | | | | We don't need to set s_dirt in most of the ext4 code when journaling is enabled. In ext3/4 some of the summary statistics for # of free inodes, blocks, and directories are calculated from the per-block group statistics when the file system is mounted or unmounted. As a result the superblock doesn't have to be updated, either via the journal or by setting s_dirt. There are a few exceptions, most notably when resizing the file system, where the superblock needs to be modified --- and in that case it should be done as a journalled operation if possible, and s_dirt set only in no-journal mode. This patch will optimize out some unneeded disk writes when using ext4 with a journal. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Use bitops to read/modify i_flags in struct ext4_inode_infoDmitry Monakhov2010-05-16
| | | | | | | | | | | | At several places we modify EXT4_I(inode)->i_flags without holding i_mutex (ext4_do_update_inode, ...). These modifications are racy and we can lose updates to i_flags. So convert handling of i_flags to use bitops which are atomic. https://bugzilla.kernel.org/show_bug.cgi?id=15792 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* Merge branch 'for_linus' of ↵Linus Torvalds2010-03-05
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits) quota: stop using QUOTA_OK / NO_QUOTA dquot: cleanup dquot initialize routine dquot: move dquot initialization responsibility into the filesystem dquot: cleanup dquot drop routine dquot: move dquot drop responsibility into the filesystem dquot: cleanup dquot transfer routine dquot: move dquot transfer responsibility into the filesystem dquot: cleanup inode allocation / freeing routines dquot: cleanup space allocation / freeing routines ext3: add writepage sanity checks ext3: Truncate allocated blocks if direct IO write fails to update i_size quota: Properly invalidate caches even for filesystems with blocksize < pagesize quota: generalize quota transfer interface quota: sb_quota state flags cleanup jbd: Delay discarding buffers in journal_unmap_buffer ext3: quota_write cross block boundary behaviour quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota quota: split out compat_sys_quotactl support from quota.c quota: split out netlink notification support from quota.c quota: remove invalid optimization from quota_sync_all ... Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c
| * dquot: cleanup dquot initialize routineChristoph Hellwig2010-03-04
| | | | | | | | | | | | | | | | | | | | | | | | Get rid of the initialize dquot operation - it is now always called from the filesystem and if a filesystem really needs it's own (which none currently does) it can just call into it's own routine directly. Rename the now static low-level dquot_initialize helper to __dquot_initialize and vfs_dq_init to dquot_initialize to have a consistent namespace. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
| * dquot: move dquot initialization responsibility into the filesystemChristoph Hellwig2010-03-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently various places in the VFS call vfs_dq_init directly. This means we tie the quota code into the VFS. Get rid of that and make the filesystem responsible for the initialization. For most metadata operations this is a straight forward move into the methods, but for truncate and open it's a bit more complicated. For truncate we currently only call vfs_dq_init for the sys_truncate case because open already takes care of it for ftruncate and open(O_TRUNC) - the new code causes an additional vfs_dq_init for those which is harmless. For open the initialization is moved from do_filp_open into the open method, which means it happens slightly earlier now, and only for regular files. The latter is fine because we don't need to initialize it for operations on special files, and we already do it as part of the namespace operations for directories. Add a dquot_file_open helper that filesystems that support generic quotas can use to fill in ->open. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
* | Merge branch 'for_linus' of ↵Linus Torvalds2010-03-05
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (36 commits) ext4: fix up rb_root initializations to use RB_ROOT ext4: Code cleanup for EXT4_IOC_MOVE_EXT ioctl ext4: Fix the NULL reference in double_down_write_data_sem() ext4: Fix insertion point of extent in mext_insert_across_blocks() ext4: consolidate in_range() definitions ext4: cleanup to use ext4_grp_offs_to_block() ext4: cleanup to use ext4_group_first_block_no() ext4: Release page references acquired in ext4_da_block_invalidatepages ext4: Fix ext4_quota_write cross block boundary behaviour ext4: Convert BUG_ON checks to use ext4_error() instead ext4: Use direct_IO_no_locking in ext4 dio read ext4: use ext4_get_block_write in buffer write ext4: mechanical rename some of the direct I/O get_block's identifiers ext4: make "offset" consistent in ext4_check_dir_entry() ext4: Handle non empty on-disk orphan link ext4: explicitly remove inode from orphan list after failed direct io ext4: fix error handling in migrate ext4: deprecate obsoleted mount options ext4: Fix fencepost error in chosing choosing group vs file preallocation. jbd2: clean up an assertion in jbd2_journal_commit_transaction() ...
| * | ext4: Use bitops to read/modify EXT4_I(inode)->i_stateTheodore Ts'o2010-01-24
| |/ | | | | | | | | | | | | | | | | | | | | At several places we modify EXT4_I(inode)->i_state without holding i_mutex (ext4_release_file, ext4_bmap, ext4_journalled_writepage, ext4_do_update_inode, ...). These modifications are racy and we can lose updates to i_state. So convert handling of i_state to use bitops which are atomic. Cc: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* / Get rid of mnt_mountpoint abuses in ext4Al Viro2010-03-03
|/ | | | | | | | | path to mnt/mnt->mnt_root is no worse than that to mnt->mnt_parent/mnt->mnt_mountpoint *and* needs no pinning the sucker down (mnt is not going away and mnt->mnt_root won't change) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* const: mark struct vm_struct_operationsAlexey Dobriyan2009-09-27
| | | | | | | | | | | * mark struct vm_area_struct::vm_ops as const * mark vm_ops in AGP code But leave TTM code alone, something is fishy there with global vm_ops being used. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext4: Remove syncing logic from ext4_file_writeJan Kara2009-09-14
| | | | | | | | | The syncing is now properly handled by generic_file_aio_write() so no special ext4 code is needed. CC: linux-ext4@vger.kernel.org CC: tytso@mit.edu Signed-off-by: Jan Kara <jack@suse.cz>
* ext[234]: move over to 'check_acl' permission modelLinus Torvalds2009-09-08
| | | | | | | | | | Don't implement per-filesystem 'extX_permission()' functions that have to be called for every path component operation, and instead just expose the actual ACL checking so that the VFS layer can now do it for us. Reviewed-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext4: update the s_last_mounted field in the superblockTheodore Ts'o2009-06-13
| | | | | | | | | | | This field can be very helpful when a system administrator is trying to sort through large numbers of block devices or filesystem images. What is stored in this field can be ambiguous if multiple filesystem namespaces are in play; what we store in practice is the mountpoint interpreted by the process's namespace which first opens a file in the filesystem. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Fix discard of inode prealloc space with delayed allocation.Aneesh Kumar K.V2009-03-27
| | | | | | | | | | | With delayed allocation we should not/cannot discard inode prealloc space during file close. We would still have dirty pages for which we haven't allocated blocks yet. With this fix after each get_blocks request we check whether we have zero reserved blocks and if yes and we don't have any writers on the file we discard inode prealloc space. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Automatically allocate delay allocated blocks on closeTheodore Ts'o2009-02-24
| | | | | | | | | | | When closing a file that had been previously truncated, force any delay allocated blocks that to be allocated so that if the filesystem is mounted with data=ordered, the data blocks will be pushed out to disk along with the journal commit. Many application programs expect this, so we do this to avoid zero length files if the system crashes unexpectedly. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: sparse fixesAneesh Kumar K.V2008-11-22
| | | | | | | | | | * Change EXT4_HAS_*_FEATURE to return a boolean * Add a function prototype for ext4_fiemap() in ext4.h * Make ext4_ext_fiemap_cb() and ext4_xattr_fiemap() be static functions * Add lock annotations to mb_free_blocks() Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Rename ext4dev to ext4Theodore Ts'o2008-10-10
| | | | | | | | The ext4 filesystem is getting stable enough that it's time to drop the "dev" prefix. Also remove the requirement for the TEST_FILESYS flag. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* Hook ext4 to the vfs fiemap interface.Eric Sandeen2008-10-07
| | | | | | | | | | ext4_ext_walk_space() was reinstated to be used for iterating over file extents with a callback; it is used by the ext4 fiemap implementation. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org
* ext4: Remove old legacy block allocatorTheodore Ts'o2008-10-10
| | | | Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Fix whitespace checkpatch warnings/errorsTheodore Ts'o2008-09-08
| | | | | Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: delayed allocation i_blocks fix for statMingming Cao2008-07-11
| | | | | | | | | | | | | | | | | | | Right now i_blocks is not getting updated until the blocks are actually allocaed on disk. This means with delayed allocation, right after files are copied, "ls -sF" shoes the file as taking 0 blocks on disk. "du" also shows the files taking zero space, which is highly confusing to the user. Since delayed allocation already keeps track of per-inode total number of blocks that are subject to delayed allocation, this patch fix this by using that to adjust the value returned by stat(2). When real block allocation is done, the i_blocks will get updated. Since the reserved blocks for delayed allocation will be decreased, this will be keep value returned by stat(2) consistent. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Use page_mkwrite vma_operations to get mmap write notification.Aneesh Kumar K.V2008-07-11
| | | | | | | | | | | | | | We would like to get notified when we are doing a write on mmap section. This is needed with respect to preallocated area. We split the preallocated area into initialzed extent and uninitialzed extent in the call back. This let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and that would result in data loss. The changes are also needed to handle ENOSPC when writing to an mmap section of files with holes. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: move headers out of include/linuxChristoph Hellwig2008-04-29
| | | | | | | | | | Move ext4 headers out of include/linux. This is just the trivial move, there's some more thing that could be done later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* Convert ext4 to use unlocked_ioctlAndi Kleen2008-04-29
| | | | | | | | | I checked ext4_ioctl and it looked largely safe to not be used without BKL. So convert it over to unlocked_ioctl. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: Convert truncate_mutex to read write semaphore.Aneesh Kumar K.V2008-01-28
| | | | | | | | We are currently taking the truncate_mutex for every read. This would have performance impact on large CPU configuration. Convert the lock to read write semaphore and take read lock when we are trying to read the file. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
* ext4: store maxbytes for bitmapped files and return EFBIG as appropriateEric Sandeen2008-01-28
| | | | | | | | Calculate & store the max offset for bitmapped files, and catch too-large seeks, truncates, and writes in ext4, shortening or rejecting as appropriate. Signed-off-by: Eric Sandeen <sandeen@redhat.com>
* fallocate support in ext4Amit Arora2007-07-17
| | | | | | | | | | | | This patch implements ->fallocate() inode operation in ext4. With this patch users of ext4 file systems will be able to use fallocate() system call for persistent preallocation. Current implementation only supports preallocation for regular files (directories not supported as of date) with extent maps. This patch does not support block-mapped files currently. Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of now. Signed-off-by: Amit Arora <aarora@in.ibm.com>
* sendfile: remove .sendfile from filesystems that use generic_file_sendfile()Jens Axboe2007-07-10
| | | | | | | They can use generic_file_splice_read() instead. Since sys_sendfile() now prefers that, there should be no change in behaviour. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* [PATCH] mark struct inode_operations const 1Arjan van de Ven2007-02-12
| | | | | | | | | | | Many struct inode_operations in the kernel can be "const". Marking them const moves these to the .rodata section, which avoids false sharing with potential dirty data. In addition it'll catch accidental writes at compile time to these shared resources. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] ext4: change uses of f_{dentry, vfsmnt} to use f_pathJosef "Jeff" Sipek2006-12-08
| | | | | | | | | Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in the ext4 filesystem. Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] jbd2: enable building of jbd2 and have ext4 use it rather than jbdMingming Cao2006-10-11
| | | | | | | | | | Reworked from a patch by Mingming Cao and Randy Dunlap Signed-off-By: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] ext4: rename ext4 symbols to avoid duplication of ext3 symbolsMingming Cao2006-10-11
| | | | | | | | | | Mingming Cao originally did this work, and Shaggy reproduced it using some scripts from her. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] ext4: initial copy of files from ext3Dave Kleikamp2006-10-11
Start of the ext4 patch series. See Documentation/filesystems/ext4.txt for details. This is a simple copy of the files in fs/ext3 to fs/ext4 and /usr/incude/linux/ext3* to /usr/include/ex4* Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>