aboutsummaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAge
...
| | * | | | | | | | | Take dirtying the inode to callers of ext2_free_blocks()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | ext2: switch to dquot_free_block_nodirty()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | brute-force conversion Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | switch minix to ->evict_inode(), fix write_inode/delete_inode raceAl Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to wait for completion of possible writeback in progress before we clear on-disk inode during deletion. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | switch sysfs to ->evict_inode()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | switch procfs to ->evict_inode()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | simplify get_cramfs_inode()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | simply don't hash the inodes that don't have real inumber instead of skipping them during iget5_locked(); as the result, simple iget_locked() would do and we can get rid of cramfs ->drop_inode() as well. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | new helper: end_writeback()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Essentially, the minimal variant of ->evict_inode(). It's a trimmed-down clear_inode(), sans any fs callbacks. Once it returns we know that no async writeback will be happening; every ->evict_inode() instance should do that once and do that before doing anything ->write_inode() could interfere with (e.g. freeing the on-disk inode). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | Take ->i_bdev/->i_cdev handling out of clear_inode()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All call chains to clear_inode() pass through evict_inode() and clear_inode() should be called by evict_inode() exactly once. So we can pull i_bdev/i_cdev detaching up to evict_inode() itself. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | generic_detach_inode() can be static nowAl Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | switch hugetlbfs to ->evict_inode()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The first spoils - hugetlb can use default ->drop_inode() now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | New method - evict_inode()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hybrid of ->clear_inode() and ->delete_inode(); if present, does all fs work to be done when in-core inode is about to be gone, for whatever reason. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | unify fs/inode.c callers of clear_inode()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For now, just a straightforward merge Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | simplify checks for I_CLEAR/I_FREEINGAl Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | add I_CLEAR instead of replacing I_FREEING with it. I_CLEAR is equivalent to I_FREEING for almost all code looking at either; it's there to keep track of having called clear_inode() exactly once per inode lifetime, at some point after having set I_FREEING. I_CLEAR and I_FREEING never get set at the same time with the current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR instead of I_CLEAR without loss of information. As the result of such change, checks become simpler and the amount of code that needs to know about I_CLEAR shrinks a lot. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | get rid of file_fsync()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Copy and simplify in the only two users remaining. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | xfs: new truncate sequenceChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert XFS to the new truncate sequence. We still can have errors after updating the file size in xfs_setattr, but these are real I/O errors and lead to a transaction abort and filesystem shutdown, so they are not an issue. Errors from ->write_begin and write_end can now be handled correctly because we can actually get rid of the delalloc extents while previous the buffer state was stipped in block_invalidatepage. There is still no error handling for ->direct_IO, because doing so will need some major restructuring given that we only have the iolock shared and do not hold i_mutex at all. Fortunately leaving the normally allocated blocks behind there is not a major issue and this will get cleaned up by xfs_free_eofblock later. Note: the patch is against Al's vfs.git tree as that contains the nessecary preparations. I'd prefer to get it applied there so that we can get some testing in linux-next. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | exofs: New truncate sequenceBoaz Harrosh2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These changes are crafted based on the similar conversion done to ext2 by Nick Piggin. * Remove the deprecated ->truncate vector. Let exofs_setattr take care of on-disk size updates. * Call truncate_pagecache on the unused pages if write_begin/end fails. * Cleanup exofs_delete_inode that did stupid inode writes and updates on an inode that will be removed. * And finally get rid of exofs_get_block. We never had any blocks it was all for calling nobh_truncate_page. nobh_truncate_page is not actually needed in exofs since the last page is complete and gone, just like all the other pages. There is no partial blocks in exofs. I've tested with this patch, and there are no apparent failures, so far. CC: Nick Piggin <npiggin@suse.de> CC: Christoph Hellwig <hch@lst.de> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | jffs2: don't open-code iget_failed()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | check ATTR_SIZE contraints in inode_change_okChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure we check the truncate constraints early on in ->setattr by adding those checks to inode_change_ok. Also clean up and document inode_change_ok to make this obvious. As a fallout we don't have to call inode_newsize_ok from simple_setsize and simplify it down to a truncate_setsize which doesn't return an error. This simplifies a lot of setattr implementations and means we use truncate_setsize almost everywhere. Get rid of fat_setsize now that it's trivial and mark ext2_setsize static to make the calling convention obvious. Keep the inode_newsize_ok in vmtruncate for now as all callers need an audit for its removal anyway. Note: setattr code in ecryptfs doesn't call inode_change_ok at all and needs a deeper audit, but that is left for later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | always call inode_change_ok early in ->setattrChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure we call inode_change_ok before doing any changes in ->setattr, and make sure to call it even if our fs wants to ignore normal UNIX permissions, but use the ATTR_FORCE to skip those. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | remove inode_setattrChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | default to simple_setattrChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the new truncate sequence every filesystem that wants to support file size changes on disk needs to implement its own ->setattr. So instead of calling inode_setattr which supports size changes call into a simple method that doesn't support this. simple_setattr is almost what we want except that it does not mark the inode dirty after changes. Given that marking the inode dirty is a no-op for the simple in-memory filesystems that use simple_setattr currently just add the mark_inode_dirty call. Also add a WARN_ON for the presence of a truncate method to simple_setattr to catch new instances of it during the transition period. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | rename generic_setattrChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Despite its name it's now a generic implementation of ->setattr, but rather a helper to copy attributes from a struct iattr to the inode. Rename it to setattr_copy to reflect this fact. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | add missing setattr methodsChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the new truncate sequence every filesystem that wants to truncate on-disk state needs a seattr method. Convert the remaining filesystems that implement the truncate inode operation to have its own setattr method. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | get rid of block_write_begin_newtruncChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the call to vmtruncate to get rid of accessive blocks to the callers in preparation of the new truncate sequence and rename the non-truncating version to block_write_begin. While we're at it also remove several unused arguments to block_write_begin. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | introduce __block_write_beginChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split up the block_write_begin implementation - __block_write_begin is a new trivial wrapper for block_prepare_write that always takes an already allocated page and can be either called from block_write_begin or filesystem code that already has a page allocated. Remove the handling of already allocated pages from block_write_begin after switching all callers that do it to __block_write_begin. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | clean up write_begin usage for directories in pagecacheChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For filesystem that implement directories in pagecache we call block_write_begin with an already allocated page for this code, while the normal regular file write path uses the default block_write_begin behaviour. Get rid of the __foofs_write_begin helper and opencode the normal write_begin call in foofs_write_begin, while adding a new foofs_prepare_chunk helper for the directory code. The added benefit is that foofs_prepare_chunk has a much saner calling convention. Note that the interruptible flag passed into block_write_begin is always ignored if we already pass in a page (see next patch for details), and we never were doing truncations of exessive blocks for this case either so we can switch directly to block_write_begin_newtrunc. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | get rid of cont_write_begin_newtruncChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the call to vmtruncate to get rid of accessive blocks to the callers in preparation of the new truncate sequence and rename the non-truncating version to cont_write_begin. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | get rid of nobh_write_begin_newtruncChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the call to vmtruncate to get rid of accessive blocks to the only remaining caller and rename the non-truncating version to nobh_write_begin. Get rid of the superflous file argument to it while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | sort out blockdev_direct_IO variantsChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the call to vmtruncate to get rid of accessive blocks to the callers in prepearation of the new truncate calling sequence. This was only done for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant was not needed anyway. Get rid of blockdev_direct_IO_no_locking and its _newtrunc variant while at it as just opencoding the two additional paramters is shorted than the name suffix. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | fix leak in __logfs_create()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | if kmalloc fails, we still need to drop the inode, as we do on other failure exits. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | Fix reiserfs_file_release()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | a) count file openers correctly; i_count use was completely wrong b) use new mutex for exclusion between final close/open/truncate, to protect tailpacking logics. i_mutex use was wrong and resulted in deadlocks. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | missing include in hppfsAl Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | | | | | | | Deal with missing exports for hostfsAl Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | | | | | | | vfs: fix warning: 'dirent' is used uninitialized in this functionKevin Winchester2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using: gcc (GCC) 4.5.0 20100610 (prerelease) The following warnings appear: fs/readdir.c: In function `filldir64': fs/readdir.c:240:15: warning: `dirent' is used uninitialized in this function fs/readdir.c: In function `filldir': fs/readdir.c:155:15: warning: `dirent' is used uninitialized in this function fs/compat.c: In function `compat_filldir64': fs/compat.c:1071:11: warning: `dirent' is used uninitialized in this function fs/compat.c: In function `compat_filldir': fs/compat.c:984:15: warning: `dirent' is used uninitialized in this function The warnings are related to the use of the NAME_OFFSET() macro. Luckily, it appears as though the standard offsetof() macro is what is being implemented by NAME_OFFSET(), thus we can fix the warning and use a more standard code construct at the same time. Signed-off-by: Kevin Winchester <kjwinchester@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | | | | | mm: avoid resetting wb_start after each writeback roundJan Kara2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | WB_SYNC_NONE writeback is done in rounds of 1024 pages so that we don't write out some huge inode for too long while starving writeout of other inodes. To avoid livelocks, we record time we started writeback in wbc->wb_start and do not write out inodes which were dirtied after this time. But currently, writeback_inodes_wb() resets wb_start each time it is called thus effectively invalidating this logic and making any WB_SYNC_NONE writeback prone to livelocks. This patch makes sure wb_start is set only once when we start writeback. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Christoph Hellwig <hch@lst.de> Acked-by: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | | | | | oom: deprecate oom_adj tunableDavid Rientjes2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | /proc/pid/oom_adj is now deprecated so that that it may eventually be removed. The target date for removal is August 2012. A warning will be printed to the kernel log if a task attempts to use this interface. Future warning will be suppressed until the kernel is rebooted to prevent spamming the kernel log. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | | | | | oom: badness heuristic rewriteDavid Rientjes2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of "allowable" memory. "Allowable," in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of "allowable" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | | | | | oom: move badness() declaration into oom.hAndrew Morton2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | | | | | oom: /proc/<pid>/oom_score treat kernel thread honestlyKOSAKI Motohiro2010-08-09
| | |_|_|_|_|/ / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a kernel thread is using use_mm(), badness() returns a positive value. This is not a big issue because caller take care of it correctly. But there is one exception, /proc/<pid>/oom_score calls badness() directly and doesn't care that the task is a regular process. Another example, /proc/1/oom_score return !0 value. But it's unkillable. This incorrectness makes administration a little confusing. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | | | | Merge branch 'bkl/core' of ↵Linus Torvalds2010-08-07
| |\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing * 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: do_coredump: Do not take BKL init: Remove the BKL from startup code
| | * | | | | | | | | do_coredump: Do not take BKLArnd Bergmann2010-07-09
| | | |_|/ / / / / / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | core_pattern is not actually protected and hasn't been ever since we introduced procfs support for sysctl -- a _long_ time. Don't take it here either. Also nothing inside do_coredump appears to require bkl protection. Signed-off-by: Arnd Bergmann <arnd@arndb.de> [ remove smp_lock.h headers ] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
| * | | | | | | | | Merge branch 'for-2.6.36' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2010-08-07
| |\ \ \ \ \ \ \ \ \ | | | |_|_|_|_|_|_|/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.36' of git://linux-nfs.org/~bfields/linux: (34 commits) nfsd4: fix file open accounting for RDWR opens nfsd: don't allow setting maxblksize after svc created nfsd: initialize nfsd versions before creating svc net: sunrpc: removed duplicated #include nfsd41: Fix a crash when a callback is retried nfsd: fix startup/shutdown order bug nfsd: minor nfsd read api cleanup gcc-4.6: nfsd: fix initialized but not read warnings nfsd4: share file descriptors between stateid's nfsd4: fix openmode checking on IO using lock stateid nfsd4: miscellaneous process_open2 cleanup nfsd4: don't pretend to support write delegations nfsd: bypass readahead cache when have struct file nfsd: minor nfsd_svc() cleanup nfsd: move more into nfsd_startup() nfsd: just keep single lockd reference for nfsd nfsd: clean up nfsd_create_serv error handling nfsd: fix error handling in __write_ports_addxprt nfsd: fix error handling when starting nfsd with rpcbind down nfsd4: fix v4 state shutdown error paths ...
| | * | | | | | | | nfsd4: fix file open accounting for RDWR opensJ. Bruce Fields2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit f9d7562fdb9dc0ada3a7aba5dbbe9d965e2a105d "nfsd4: share file descriptors between stateid's" didn't correctly account for O_RDWR opens. Symptoms include leaked files, resulting in failures to unmount and/or warnings about orphaned inodes on reboot. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | | | nfsd: don't allow setting maxblksize after svc createdJ. Bruce Fields2010-08-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's harmless to set this after the server is created, but also ineffective, since the value is only used at the time of svc_create_pooled(). So fail the attempt, in keeping with the pattern set by write_versions, write_{lease,grace}time and write_recoverydir. (This could break userspace that tried to write to nfsd/max_block_size between setting up sockets and starting the server. However, such code wouldn't have worked anyway, and I don't know of any examples--rpc.nfsd in nfs-utils, probably the only user of the interface, doesn't do that.) Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | | | nfsd: initialize nfsd versions before creating svcJ. Bruce Fields2010-08-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 59db4a0c102e0de226a3395dbf25ea51bf845937 "nfsd: move more into nfsd_startup()" inadvertently moved nfsd_versions after nfsd_create_svc(). On older distributions using an rpc.nfsd that does not explicitly set the list of nfsd versions, this results in svc-create_pooled() being called with an empty versions array. The resulting incomplete initialization leads to a NULL dereference in svc_process_common() the first time a client accesses the server. Move nfsd_reset_versions() back before the svc_create_pooled(); this time, put it closer to the svc_create_pooled() call, to make this mistake more difficult in the future. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | | | nfsd41: Fix a crash when a callback is retriedBoaz Harrosh2010-08-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a callback is retried at nfsd4_cb_recall_done() due to some error, the returned rpc reply crashes here: @@ -514,6 +514,7 @@ decode_cb_sequence(struct xdr_stream *xdr, struct nfsd4_cb_sequence *res, u32 dummy; __be32 *p; + BUG_ON(!res); if (res->cbs_minorversion == 0) return 0; [BUG_ON added for demonstration] This is because the nfsd4_cb_done_sequence() has NULLed out the task->tk_msg.rpc_resp pointer. Also eventually the rpc would use the new slot without making sure it is free by calling nfsd41_cb_setup_sequence(). This problem was introduced by a 4.1 protocol addition patch: [0421b5c5] nfsd41: Backchannel: Implement cb_recall over NFSv4.1 Which was overlooking the possibility of an RPC callback retries. For not-4.1 case redoing the _prepare is harmless. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | | | nfsd: fix startup/shutdown order bugJ. Bruce Fields2010-08-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We must create the server before we can call init_socks or check the number of threads. Symptoms were a NULL pointer dereference in nfsd_svc(). Problem identified by Jeff Layton. Also fix a minor cleanup-on-error case in nfsd_startup(). Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | | | nfsd: minor nfsd read api cleanupJ. Bruce Fields2010-07-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Christoph points that the NFSv2/v3 callers know which case they want here, so we may as well just call the file=NULL case directly instead of making this conditional. Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | | | gcc-4.6: nfsd: fix initialized but not read warningsAndi Kleen2010-07-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes at least one real minor bug: the nfs4 recovery dir sysctl would not return its status properly. Also I finished Al's 1e41568d7378d ("Take ima_path_check() in nfsd past dentry_open() in nfsd_open()") commit, it moved the IMA code, but left the old path initializer in there. The rest is just dead code removed I think, although I was not fully sure about the "is_borc" stuff. Some more review would be still good. Found by gcc 4.6's new warnings. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | | | nfsd4: share file descriptors between stateid'sJ. Bruce Fields2010-07-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vfs doesn't really allow us to "upgrade" a file descriptor from read-only to read-write, and our attempt to do so in nfs4_upgrade_open is ugly and incomplete. Move to a different scheme where we keep multiple opens, shared between open stateid's, in the nfs4_file struct. Each file will be opened at most 3 times (for read, write, and read-write), and those opens will be shared between all clients and openers. On upgrade we will do another open if necessary instead of attempting to upgrade an existing open. We keep count of the number of readers and writers so we know when to close the shared files. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>