| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
| |
The /proc/fs/jbd2/<dev>/history was maintained manually; by using
tracepoints, we can get all of the existing functionality of the /proc
file plus extra capabilities thanks to the ftrace infrastructure. We
save memory as a bonus.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The /proc/fs/ext4/<dev>/mb_history was maintained manually, and had a
number of problems: it required a largish amount of memory to be
allocated for each ext4 filesystem, and the s_mb_history_lock
introduced a CPU contention problem.
By ripping out the mb_history code and replacing it with ftrace
tracepoints, and we get more functionality: timestamps, event
filtering, the ability to correlate mballoc history with other ext4
tracepoints, etc.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
| |
There are a number of kernel printk's which are printed when an ext4
filesystem is mounted and unmounted. Disable them to economize space
in the system logs. In addition, disabling the mballoc stats by
default saves a number of unneeded atomic operations for every block
allocation or deallocation.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
|
|
|
|
|
|
| |
This patch fixes a problem with handling nested calls to
ext4_journal_start/ext4_journal_stop, when there is no journal present.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch a problem that ext4_dirty_inode() was not calling
ext4_mark_inode_dirty() if the current_handle is not valid, which it
is the case in no journal mode.
It also removes a test for non-matching transaction which can never
happen.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a cleanup of commit 91ac6f4. Since ext4_mark_inode_dirty()
has already called ext4_mark_iloc_dirty(), which in turn calls
ext4_do_update_inode(), it's not necessary to have ext4_write_inode()
call ext4_do_update_inode() in no journal mode. Indeed, it would be
duplicated work.
Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
|
|
|
|
|
|
| |
Move the check to make sure the original and donor inodes are
different earlier, to avoid a potential deadlock by trying to lock the
same inode twice.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For async direct IO that covers holes or fallocate, the end_io
callback function now queued the convertion work on workqueue but
don't flush the work rightaway as it might take too long to afford.
But when fsync is called after all the data is completed, user expects
the metadata also being updated before fsync returns.
Thus we need to flush the conversion work when fsync() is called.
This patch keep track of a listed of completed async direct io that
has a work queued on workqueue. When fsync() is called, it will go
through the list and do the conversion.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently the DIO VFS code passes create = 0 when writing to the
middle of file. It does this to avoid block allocation for holes, so
as not to expose stale data out when there is a parallel buffered read
(which does not hold the i_mutex lock). Direct I/O writes into holes
falls back to buffered IO for this reason.
Since preallocated extents are treated as holes when doing a
get_block() look up (buffer is not mapped), direct IO over fallocate
also falls back to buffered IO. Thus ext4 actually silently falls
back to buffered IO in above two cases, which is undesirable.
To fix this, this patch creates unitialized extents when a direct I/O
write into holes in sparse files, and registering an end_io callback which
converts the uninitialized extent to an initialized extent after the
I/O is completed.
Singed-Off-By: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When writing into an unitialized extent via direct I/O, and the direct
I/O doesn't exactly cover the unitialized extent, split the extent
into uninitialized and initialized extents before submitting the I/O.
This avoids needing to deal with an ENOSPC error in the end_io
callback that gets used for direct I/O.
When the IO is complete, the written extent will be marked as initialized.
Singed-Off-By: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
| |
ext4_da_reserve_space() can reserve quota blocks multiple times if
ext4_claim_free_blocks() fail and we retry the allocation. We should
release the quota reservation before restarting.
Bug found by Jan Kara.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Work around problems in the writeback code to force out writebacks in
larger chunks than just 4mb, which is just too small. This also works
around limitations in the ext4 block allocator, which can't allocate
more than 2048 blocks at a time. So we need to defeat the round-robin
characteristics of the writeback code and try to write out as many
blocks in one inode before allowing the writeback code to move on to
another inode. We add a a new per-filesystem tunable,
max_writeback_mb_bump, which caps this to a default of 128mb per
inode.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
|
|
|
|
|
|
|
| |
The hueristic was designed to avoid using locality group preallocation
when writing the last segment of a closed file. Fix it by move
setting size to the maximum of size and isize until after we check
whether size == isize.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
|
|
|
|
|
|
| |
This allows the user to see what filesystem was involved with a
particular ext4_da_writepage() error. Also, use KERN_CRIT which is
more appropriate than KERN_EMERG.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
* 'writeback' of git://git.kernel.dk/linux-2.6-block:
writeback: writeback_inodes_sb() should use bdi_start_writeback()
writeback: don't delay inodes redirtied by a fast dirtier
writeback: make the super_block pinning more efficient
writeback: don't resort for a single super_block in move_expired_inodes()
writeback: move inodes from one super_block together
writeback: get rid to incorrect references to pdflush in comments
writeback: improve readability of the wb_writeback() continue/break logic
writeback: cleanup writeback_single_inode()
writeback: kupdate writeback shall not stop when more io is possible
writeback: stop background writeback when below background threshold
writeback: balance_dirty_pages() shall write more than dirtied pages
fs: Fix busyloop in wb_writeback()
|
| |
| |
| |
| |
| |
| |
| | |
Pointless to iterate other devices looking for a super, when
we have a bdi mapping.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Debug traces show that in per-bdi writeback, the inode under writeback
almost always get redirtied by a busy dirtier. We used to call
redirty_tail() in this case, which could delay inode for up to 30s.
This is unacceptable because it now happens so frequently for plain cp/dd,
that the accumulated delays could make writeback of big files very slow.
So let's distinguish between data redirty and metadata only redirty.
The first one is caused by a busy dirtier, while the latter one could
happen in XFS, NFS, etc. when they are doing delalloc or updating isize.
The inode being busy dirtied will now be requeued for next io, while
the inode being redirtied by fs will continue to be delayed to avoid
repeated IO.
CC: Jan Kara <jack@suse.cz>
CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently we pin the inode->i_sb for every single inode. This
increases cache traffic on sb->s_umount sem. Lets instead
cache the inode sb pin state and keep the super_block pinned
for as long as keep writing out inodes from the same
super_block.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
| |
| |
| |
| |
| |
| |
| | |
If we only moved inodes from a single super_block to the temporary
list, there's no point in doing a resort for multiple super_blocks.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
__mark_inode_dirty adds inode to wb dirty list in random order. If a disk has
several partitions, writeback might keep spindle moving between partitions.
To reduce the move, better write big chunk of one partition and then move to
another. Inodes from one fs usually are in one partion, so idealy move indoes
from one fs together should reduce spindle move. This patch tries to address
this. Before per-bdi writeback is added, the behavior is write indoes
from one fs first and then another, so the patch restores previous behavior.
The loop in the patch is a bit ugly, should we add a dirty list for each
superblock in bdi_writeback?
Test in a two partition disk with attached fio script shows about 3% ~ 6%
improvement.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
| |
| |
| |
| | |
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
| |
| |
| |
| |
| |
| |
| | |
And throw some comments in there, too.
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Make the if-else straight in writeback_single_inode().
No behavior change.
Cc: Jan Kara <jack@suse.cz>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Fix the kupdate case, which disregards wbc.more_io and stop writeback
prematurely even when there are more inodes to be synced.
wbc.more_io should always be respected.
Also remove the pages_skipped check. It will set when some page(s) of some
inode(s) cannot be written for now. Such inodes will be delayed for a while.
This variable has nothing to do with whether there are other writeable inodes.
CC: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Treat bdi_start_writeback(0) as a special request to do background write,
and stop such work when we are below the background dirty threshold.
Also simplify the (nr_pages <= 0) checks. Since we already pass in
nr_pages=LONG_MAX for WB_SYNC_ALL and background writes, we don't
need to worry about it being decreased to zero.
Reported-by: Richard Kennedy <richard@rsk.demon.co.uk>
CC: Jan Kara <jack@suse.cz>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If all inodes are under writeback (e.g. in case when there's only one inode
with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
write anything.
Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
It needs walk_page_range().
Reported-by: Michal Simek <monstr@monstr.eu>
Tested-by: Michal Simek <monstr@monstr.eu>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: David Howells <dhowells@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
eCryptfs: Prevent lower dentry from going negative during unlink
eCryptfs: Propagate vfs_read and vfs_write return codes
eCryptfs: Validate global auth tok keys
eCryptfs: Filename encryption only supports password auth tokens
eCryptfs: Check for O_RDONLY lower inodes when opening lower files
eCryptfs: Handle unrecognized tag 3 cipher codes
ecryptfs: improved dependency checking and reporting
eCryptfs: Fix lockdep-reported AB-BA mutex issue
ecryptfs: Remove unneeded locking that triggers lockdep false positives
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When calling vfs_unlink() on the lower dentry, d_delete() turns the
dentry into a negative dentry when the d_count is 1. This eventually
caused a NULL pointer deref when a read() or write() was done and the
negative dentry's d_inode was dereferenced in
ecryptfs_read_update_atime() or ecryptfs_getxattr().
Placing mutt's tmpdir in an eCryptfs mount is what initially triggered
the oops and I was able to reproduce it with the following sequence:
open("/tmp/upper/foo", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW, 0600) = 3
link("/tmp/upper/foo", "/tmp/upper/bar") = 0
unlink("/tmp/upper/foo") = 0
open("/tmp/upper/bar", O_RDWR|O_CREAT|O_NOFOLLOW, 0600) = 4
unlink("/tmp/upper/bar") = 0
write(4, "eCryptfs test\n"..., 14 <unfinished ...>
+++ killed by SIGKILL +++
https://bugs.launchpad.net/ecryptfs/+bug/387073
Reported-by: Loïc Minier <loic.minier@canonical.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Errors returned from vfs_read() and vfs_write() calls to the lower
filesystem were being masked as -EINVAL. This caused some confusion to
users who saw EINVAL instead of ENOSPC when the disk was full, for
instance.
Also, the actual bytes read or written were not accessible by callers to
ecryptfs_read_lower() and ecryptfs_write_lower(), which may be useful in
some cases. This patch updates the error handling logic where those
functions are called in order to accept positive return codes indicating
success.
Cc: Eric Sandeen <esandeen@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When searching through the global authentication tokens for a given key
signature, verify that a matching key has not been revoked and has not
expired. This allows the `keyctl revoke` command to be properly used on
keys in use by eCryptfs.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Returns -ENOTSUPP when attempting to use filename encryption with
something other than a password authentication token, such as a private
token from openssl. Using filename encryption with a userspace eCryptfs
key module is a future goal. Until then, this patch handles the
situation a little better than simply using a BUG_ON().
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If the lower inode is read-only, don't attempt to open the lower file
read/write and don't hand off the open request to the privileged
eCryptfs kthread for opening it read/write. Instead, only try an
unprivileged, read-only open of the file and give up if that fails.
This patch fixes an oops when eCryptfs is mounted on top of a read-only
mount.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Eric Sandeen <esandeen@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Returns an error when an unrecognized cipher code is present in a tag 3
packet or an ecryptfs_crypt_stat cannot be initialized. Also sets an
crypt_stat->tfm error pointer to NULL to ensure that it will not be
incorrectly freed in ecryptfs_destroy_crypt_stat().
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
So, I compiled a 2.6.31-rc5 kernel with ecryptfs and loaded its module.
When it came time to mount my filesystem, I got this in dmesg, and it
refused to mount:
[93577.776637] Unable to allocate crypto cipher with name [aes]; rc = [-2]
[93577.783280] Error attempting to initialize key TFM cipher with name = [aes]; rc = [-2]
[93577.791183] Error attempting to initialize cipher with name = [aes] and key size = [32]; rc = [-2]
[93577.800113] Error parsing options; rc = [-22]
I figured from the error message that I'd either forgotten to load "aes"
or that my key size was bogus. Neither one of those was the case. In
fact, I was missing the CRYPTO_ECB config option and the 'ecb' module.
Unfortunately, there's no trace of 'ecb' in that error message.
I've done two things to fix this. First, I've modified ecryptfs's
Kconfig entry to select CRYPTO_ECB and CRYPTO_CBC. I also took CRYPTO
out of the dependencies since the 'select' will take care of it for us.
I've also modified the error messages to print a string that should
contain both 'ecb' and 'aes' in my error case. That will give any
future users a chance of finding the right modules and Kconfig options.
I also wonder if we should:
select CRYPTO_AES if !EMBEDDED
since I think most ecryptfs users are using AES like me.
Cc: ecryptfs-devel@lists.launchpad.net
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Dustin Kirkland <kirkland@canonical.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
[tyhicks@linux.vnet.ibm.com: Removed extra newline, 80-char violation]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Lockdep reports the following valid-looking possible AB-BA deadlock with
global_auth_tok_list_mutex and keysig_list_mutex:
ecryptfs_new_file_context() ->
ecryptfs_copy_mount_wide_sigs_to_inode_sigs() ->
mutex_lock(&mount_crypt_stat->global_auth_tok_list_mutex);
-> ecryptfs_add_keysig() ->
mutex_lock(&crypt_stat->keysig_list_mutex);
vs
ecryptfs_generate_key_packet_set() ->
mutex_lock(&crypt_stat->keysig_list_mutex);
-> ecryptfs_find_global_auth_tok_for_sig() ->
mutex_lock(&mount_crypt_stat->global_auth_tok_list_mutex);
ie the two mutexes are taken in opposite orders in the two different
code paths. I'm not sure if this is a real bug where two threads could
actually hit the two paths in parallel and deadlock, but it at least
makes lockdep impossible to use with ecryptfs since this report triggers
every time and disables future lockdep reporting.
Since ecryptfs_add_keysig() is called only from the single callsite in
ecryptfs_copy_mount_wide_sigs_to_inode_sigs(), the simplest fix seems to
be to move the lock of keysig_list_mutex back up outside of the where
global_auth_tok_list_mutex is taken. This patch does that, and fixes
the lockdep report on my system (and ecryptfs still works OK).
The full output of lockdep fixed by this patch is:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.31-2-generic #14~rbd2
-------------------------------------------------------
gdm/2640 is trying to acquire lock:
(&mount_crypt_stat->global_auth_tok_list_mutex){+.+.+.}, at: [<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
but task is already holding lock:
(&crypt_stat->keysig_list_mutex){+.+.+.}, at: [<ffffffff81217728>] ecryptfs_generate_key_packet_set+0x58/0x2b0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&crypt_stat->keysig_list_mutex){+.+.+.}:
[<ffffffff8108c897>] check_prev_add+0x2a7/0x370
[<ffffffff8108cfc1>] validate_chain+0x661/0x750
[<ffffffff8108d2e7>] __lock_acquire+0x237/0x430
[<ffffffff8108d585>] lock_acquire+0xa5/0x150
[<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0
[<ffffffff81552b56>] mutex_lock_nested+0x46/0x60
[<ffffffff8121526a>] ecryptfs_add_keysig+0x5a/0xb0
[<ffffffff81213299>] ecryptfs_copy_mount_wide_sigs_to_inode_sigs+0x59/0xb0
[<ffffffff81214b06>] ecryptfs_new_file_context+0xa6/0x1a0
[<ffffffff8120e42a>] ecryptfs_initialize_file+0x4a/0x140
[<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60
[<ffffffff8113a7d4>] vfs_create+0xb4/0xe0
[<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110
[<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0
[<ffffffff8112d8d9>] do_sys_open+0x69/0x140
[<ffffffff8112d9f0>] sys_open+0x20/0x30
[<ffffffff81013132>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
-> #0 (&mount_crypt_stat->global_auth_tok_list_mutex){+.+.+.}:
[<ffffffff8108c675>] check_prev_add+0x85/0x370
[<ffffffff8108cfc1>] validate_chain+0x661/0x750
[<ffffffff8108d2e7>] __lock_acquire+0x237/0x430
[<ffffffff8108d585>] lock_acquire+0xa5/0x150
[<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0
[<ffffffff81552b56>] mutex_lock_nested+0x46/0x60
[<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
[<ffffffff812177d5>] ecryptfs_generate_key_packet_set+0x105/0x2b0
[<ffffffff81212f49>] ecryptfs_write_headers_virt+0xc9/0x120
[<ffffffff8121306d>] ecryptfs_write_metadata+0xcd/0x200
[<ffffffff8120e44b>] ecryptfs_initialize_file+0x6b/0x140
[<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60
[<ffffffff8113a7d4>] vfs_create+0xb4/0xe0
[<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110
[<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0
[<ffffffff8112d8d9>] do_sys_open+0x69/0x140
[<ffffffff8112d9f0>] sys_open+0x20/0x30
[<ffffffff81013132>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
other info that might help us debug this:
2 locks held by gdm/2640:
#0: (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8113cb8b>] do_filp_open+0x3cb/0xae0
#1: (&crypt_stat->keysig_list_mutex){+.+.+.}, at: [<ffffffff81217728>] ecryptfs_generate_key_packet_set+0x58/0x2b0
stack backtrace:
Pid: 2640, comm: gdm Tainted: G C 2.6.31-2-generic #14~rbd2
Call Trace:
[<ffffffff8108b988>] print_circular_bug_tail+0xa8/0xf0
[<ffffffff8108c675>] check_prev_add+0x85/0x370
[<ffffffff81094912>] ? __module_text_address+0x12/0x60
[<ffffffff8108cfc1>] validate_chain+0x661/0x750
[<ffffffff81017275>] ? print_context_stack+0x85/0x140
[<ffffffff81089c68>] ? find_usage_backwards+0x38/0x160
[<ffffffff8108d2e7>] __lock_acquire+0x237/0x430
[<ffffffff8108d585>] lock_acquire+0xa5/0x150
[<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
[<ffffffff8108b0b0>] ? check_usage_backwards+0x0/0xb0
[<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0
[<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
[<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
[<ffffffff8108c02c>] ? mark_held_locks+0x6c/0xa0
[<ffffffff81125b0d>] ? kmem_cache_alloc+0xfd/0x1a0
[<ffffffff8108c34d>] ? trace_hardirqs_on_caller+0x14d/0x190
[<ffffffff81552b56>] mutex_lock_nested+0x46/0x60
[<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
[<ffffffff812177d5>] ecryptfs_generate_key_packet_set+0x105/0x2b0
[<ffffffff81212f49>] ecryptfs_write_headers_virt+0xc9/0x120
[<ffffffff8121306d>] ecryptfs_write_metadata+0xcd/0x200
[<ffffffff81210240>] ? ecryptfs_init_persistent_file+0x60/0xe0
[<ffffffff8120e44b>] ecryptfs_initialize_file+0x6b/0x140
[<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60
[<ffffffff8113a7d4>] vfs_create+0xb4/0xe0
[<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110
[<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0
[<ffffffff8129a93e>] ? _raw_spin_unlock+0x5e/0xb0
[<ffffffff8155410b>] ? _spin_unlock+0x2b/0x40
[<ffffffff81139e9b>] ? getname+0x3b/0x240
[<ffffffff81148a5a>] ? alloc_fd+0xfa/0x140
[<ffffffff8112d8d9>] do_sys_open+0x69/0x140
[<ffffffff81553b8f>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff8112d9f0>] sys_open+0x20/0x30
[<ffffffff81013132>] system_call_fastpath+0x16/0x1b
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In ecryptfs_destroy_inode(), inode_info->lower_file_mutex is locked,
and just after the mutex is unlocked, the code does:
kmem_cache_free(ecryptfs_inode_info_cache, inode_info);
This means that if another context could possibly try to take the same
mutex as ecryptfs_destroy_inode(), then it could end up getting the
mutex just before the data structure containing the mutex is freed.
So any such use would be an obvious use-after-free bug (catchable with
slab poisoning or mutex debugging), and therefore the locking in
ecryptfs_destroy_inode() is not needed and can be dropped.
Similarly, in ecryptfs_destroy_crypt_stat(), crypt_stat->keysig_list_mutex
is locked, and then the mutex is unlocked just before the code does:
memset(crypt_stat, 0, sizeof(struct ecryptfs_crypt_stat));
Therefore taking this mutex is similarly not necessary.
Removing this locking fixes false-positive lockdep reports such as the
following (and they are false-positives for exactly the same reason
that the locking is not needed):
=================================
[ INFO: inconsistent lock state ]
2.6.31-2-generic #14~rbd3
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/323 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&inode_info->lower_file_mutex){+.+.?.}, at: [<ffffffff81210d34>] ecryptfs_destroy_inode+0x34/0x100
{RECLAIM_FS-ON-W} state was registered at:
[<ffffffff8108c02c>] mark_held_locks+0x6c/0xa0
[<ffffffff8108c10f>] lockdep_trace_alloc+0xaf/0xe0
[<ffffffff81125a51>] kmem_cache_alloc+0x41/0x1a0
[<ffffffff8113117a>] get_empty_filp+0x7a/0x1a0
[<ffffffff8112dd46>] dentry_open+0x36/0xc0
[<ffffffff8121a36c>] ecryptfs_privileged_open+0x5c/0x2e0
[<ffffffff81210283>] ecryptfs_init_persistent_file+0xa3/0xe0
[<ffffffff8120e838>] ecryptfs_lookup_and_interpose_lower+0x278/0x380
[<ffffffff8120f97a>] ecryptfs_lookup+0x12a/0x250
[<ffffffff8113930a>] real_lookup+0xea/0x160
[<ffffffff8113afc8>] do_lookup+0xb8/0xf0
[<ffffffff8113b518>] __link_path_walk+0x518/0x870
[<ffffffff8113bd9c>] path_walk+0x5c/0xc0
[<ffffffff8113be5b>] do_path_lookup+0x5b/0xa0
[<ffffffff8113bfe7>] user_path_at+0x57/0xa0
[<ffffffff811340dc>] vfs_fstatat+0x3c/0x80
[<ffffffff8113424b>] vfs_stat+0x1b/0x20
[<ffffffff81134274>] sys_newstat+0x24/0x50
[<ffffffff81013132>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
irq event stamp: 7811
hardirqs last enabled at (7811): [<ffffffff810c037f>] call_rcu+0x5f/0x90
hardirqs last disabled at (7810): [<ffffffff810c0353>] call_rcu+0x33/0x90
softirqs last enabled at (3764): [<ffffffff810631da>] __do_softirq+0x14a/0x220
softirqs last disabled at (3751): [<ffffffff8101440c>] call_softirq+0x1c/0x30
other info that might help us debug this:
2 locks held by kswapd0/323:
#0: (shrinker_rwsem){++++..}, at: [<ffffffff810f67ed>] shrink_slab+0x3d/0x190
#1: (&type->s_umount_key#35){.+.+..}, at: [<ffffffff811429a1>] prune_dcache+0xd1/0x1b0
stack backtrace:
Pid: 323, comm: kswapd0 Tainted: G C 2.6.31-2-generic #14~rbd3
Call Trace:
[<ffffffff8108ad6c>] print_usage_bug+0x18c/0x1a0
[<ffffffff8108aff0>] ? check_usage_forwards+0x0/0xc0
[<ffffffff8108bac2>] mark_lock_irq+0xf2/0x280
[<ffffffff8108bd87>] mark_lock+0x137/0x1d0
[<ffffffff81164710>] ? fsnotify_clear_marks_by_inode+0x30/0xf0
[<ffffffff8108bee6>] mark_irqflags+0xc6/0x1a0
[<ffffffff8108d337>] __lock_acquire+0x287/0x430
[<ffffffff8108d585>] lock_acquire+0xa5/0x150
[<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100
[<ffffffff8108d2e7>] ? __lock_acquire+0x237/0x430
[<ffffffff815526ad>] __mutex_lock_common+0x4d/0x3d0
[<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100
[<ffffffff81164710>] ? fsnotify_clear_marks_by_inode+0x30/0xf0
[<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100
[<ffffffff8129a91e>] ? _raw_spin_unlock+0x5e/0xb0
[<ffffffff81552b36>] mutex_lock_nested+0x46/0x60
[<ffffffff81210d34>] ecryptfs_destroy_inode+0x34/0x100
[<ffffffff81145d27>] destroy_inode+0x87/0xd0
[<ffffffff81146b4c>] generic_delete_inode+0x12c/0x1a0
[<ffffffff81145832>] iput+0x62/0x70
[<ffffffff811423c8>] dentry_iput+0x98/0x110
[<ffffffff81142550>] d_kill+0x50/0x80
[<ffffffff81142623>] prune_one_dentry+0xa3/0xc0
[<ffffffff811428b1>] __shrink_dcache_sb+0x271/0x290
[<ffffffff811429d9>] prune_dcache+0x109/0x1b0
[<ffffffff81142abf>] shrink_dcache_memory+0x3f/0x50
[<ffffffff810f68dd>] shrink_slab+0x12d/0x190
[<ffffffff810f9377>] balance_pgdat+0x4d7/0x640
[<ffffffff8104c4c0>] ? finish_task_switch+0x40/0x150
[<ffffffff810f63c0>] ? isolate_pages_global+0x0/0x60
[<ffffffff810f95f7>] kswapd+0x117/0x170
[<ffffffff810777a0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff810f94e0>] ? kswapd+0x0/0x170
[<ffffffff810773be>] kthread+0x9e/0xb0
[<ffffffff8101430a>] child_rip+0xa/0x20
[<ffffffff81013c90>] ? restore_args+0x0/0x30
[<ffffffff81077320>] ? kthread+0x0/0xb0
[<ffffffff81014300>] ? child_rip+0x0/0x20
Signed-off-by: Roland Dreier <roland@digitalvampire.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We forget to set nfs_server.protocol in tcp case when old-style binary
options are passed to mount. The thing remains zero and never validated
afterwards. As the result, we hit BUG in fs/nfs/client.c:588.
Breakage has been introduced in NFS: Add nfs_alloc_parsed_mount_data
merged yesterday...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|\ \
| | |
| | |
| | |
| | | |
* 'cputime' of git://git390.marist.edu/pub/scm/linux-2.6:
[PATCH] Fix idle time field in /proc/uptime
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Git commit 79741dd changes idle cputime accounting, but unfortunately
the /proc/uptime file hasn't caught up. Here the idle time calculation
from /proc/stat is copied over.
Signed-off-by: Michael Abbott <michael.abbott@diamond.ac.uk>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (42 commits)
Btrfs: hash the btree inode during fill_super
Btrfs: relocate file extents in clusters
Btrfs: don't rename file into dummy directory
Btrfs: check size of inode backref before adding hardlink
Btrfs: fix releasepage to avoid unlocking extents we haven't locked
Btrfs: Fix test_range_bit for whole file extents
Btrfs: fix errors handling cached state in set/clear_extent_bit
Btrfs: fix early enospc during balancing
Btrfs: deal with NULL space info
Btrfs: account for space used by the super mirrors
Btrfs: fix extent entry threshold calculation
Btrfs: remove dead code
Btrfs: fix bitmap size tracking
Btrfs: don't keep retrying a block group if we fail to allocate a cluster
Btrfs: make balance code choose more wisely when relocating
Btrfs: fix arithmetic error in clone ioctl
Btrfs: add snapshot/subvolume destroy ioctl
Btrfs: change how subvolumes are organized
Btrfs: do not reuse objectid of deleted snapshot/subvol
Btrfs: speed up snapshot dropping
...
|
| |\ \ \
| | |/ /
| |/| |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus
Conflicts:
fs/btrfs/super.c
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The snapshot deletion patches dropped this line, but the inode
needs to be hashed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The extent relocation code copy file extents one by one when
relocating data block group. This is inefficient if file
extents are small. This patch makes the relocation code copy
file extents in clusters. So we can can make better use of
read-ahead.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
A recent change enforces only one access point to each subvolume. The first
directory entry (the one added when the subvolume/snapshot was created) is
treated as valid access point, all other subvolume links are linked to dummy
empty directories. The dummy directories are temporary inodes that only in
memory, so we can not rename file into them.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
For every hardlink in btrfs, there is a corresponding inode back
reference. All inode back references for hardlinks in a given
directory are stored in single b-tree item. The size of b-tree item
is limited by the size of b-tree leaf, so we can only create limited
number of hardlinks to a given file in a directory.
The original code lacks of the check, it oops if the number of
hardlinks goes over the limit. This patch fixes the issue by adding
check to btrfs_link and btrfs_rename.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
During releasepage, we try to drop any extent_state structs for the
bye offsets of the page we're releaseing. But the code was incorrectly
telling clear_extent_bit to delete the state struct unconditionallly.
Normally this would be fine because we have the page locked, but other
parts of btrfs will lock down an entire extent, the most common place
being IO completion.
releasepage was deleting the extent state without first locking the extent,
which may result in removing a state struct that another process had
locked down. The fix here is to leave the NODATASUM and EXTENT_LOCKED
bits alone in releasepage.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If test_range_bit finds an extent that goes all the way to (u64)-1, it
can incorrectly wrap the u64 instead of treaing it like the end of
the address space.
This just adds a check for the highest possible offset so we don't wrap.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Both set and clear_extent_bit allow passing a cached
state struct to reduce rbtree search times. clear_extent_bit
was improperly bypassing some of the checks around making sure
the extent state fields were correct for a given operation.
The fix used here (from Yan Zheng) is to use the hit_next
goto target instead of jumping all the way down to start clearing
bits without making sure the cached state was exactly correct
for the operation we were doing.
This also fixes up the setting of the start variable for both
ops in the case where we find an overlapping extent that
begins before the range we want to change. In both cases
we were incorrectly going backwards from the original
requested change.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We now do extra checks before a balance to make sure
there is room for the balance to take place. One of
the checks was testing to see if we were trying to
balance away the last block group of a given type.
If there is no space available for new chunks, we
should not try and balance away the last block group
of a give type. But, the code wasn't checking for
available chunk space, and so it was exiting too soon.
The fix here is to combine some of the checks and make
sure we try to allocate new chunks when we're balancing
the last block group.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
|