aboutsummaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAge
...
| | | * | | | | | Btrfs: try not to sleep as much when doing slow cachingJosef Bacik2011-05-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the fs is super full and we unmount the fs, we could get stuck in this thing where unmount is waiting for the caching kthread to make progress and the caching kthread keeps scheduling because we're in the middle of a commit. So instead just let the caching kthread keep going and only yeild if need_resched(). This makes my horrible umount case go from taking up to 10 minutes to taking less than 20 seconds. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| | | * | | | | | Btrfs: kill BTRFS_I(inode)->block_groupJosef Bacik2011-05-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Originally this was going to be used as a way to give hints to the allocator, but frankly we can get much better hints elsewhere and it's not even used at all for anything usefull. In addition to be completely useless, when we initialize an inode we try and find a freeish block group to set as the inodes block group, and with a completely full 40gb fs this takes _forever_, so I imagine with say 1tb fs this is just unbearable. So just axe the thing altoghether, we don't need it and it saves us 8 bytes in the inode and saves us 500 microseconds per inode lookup in my testcase. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| | | * | | | | | Btrfs: don't look at the extent buffer level 3 times in a rowJosef Bacik2011-05-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have a bit of debugging in btrfs_search_slot to make sure the level of the cow block is the same as the original block we were cow'ing. I don't think I've ever seen this tripped, so kill it. This saves us 2 kmap's per level in our search. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| | | * | | | | | Btrfs: map the node block when looking for readahead targetsJosef Bacik2011-05-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we have particularly full nodes, we could call btrfs_node_blockptr up to 32 times, which is 32 pairs of kmap/kunmap, which _sucks_. So go ahead and map the extent buffer while we look for readahead targets. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| | | * | | | | | Btrfs: set range_start to the right start in count_range_bitsJosef Bacik2011-05-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In count_range_bits we are adjusting total_bytes based on the range we are searching for, but we don't adjust the range start according to the range we are searching for, which makes for weird results. For example, if the range [0-8192] is set DELALLOC, but I search for 4096-8192, I will get back 4096 for the number of bytes found, but the range_start will be 0, which makes it look like the range is [0-4096]. So instead set range_start = max(cur_start, state->start). This makes everything come out right. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| | | * | | | | | Btrfs: fix how we do space reservation for truncateJosef Bacik2011-05-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ceph guys keep running into problems where we have space reserved in our orphan block rsv when freeing it up. This is because they tend to do snapshots alot, so their truncates tend to use a bunch of space, so when we go to do things like update the inode we have to steal reservation space in order to make the reservation happen. This happens because truncate can use as much space as it freaking feels like, but we still have to hold space for removing the orphan item and updating the inode, which will definitely always happen. So in order to fix this we need to split all of the reservation stuf up. So with this patch we have 1) The orphan block reserve which only holds the space for deleting our orphan item when everything is over. 2) The truncate block reserve which gets allocated and used specifically for the space that the truncate will use on a per truncate basis. 3) The transaction will always have 1 item's worth of data reserved so we can update the inode normally. Hopefully this will make the ceph problem go away. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| | | * | | | | | Btrfs: kill trans_mutexJosef Bacik2011-05-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We use trans_mutex for lots of things, here's a basic list 1) To serialize trans_handles joining the currently running transaction 2) To make sure that no new trans handles are started while we are committing 3) To protect the dead_roots list and the transaction lists Really the serializing trans_handles joining is not too hard, and can really get bogged down in acquiring a reference to the transaction. So replace the trans_mutex with a trans_lock spinlock and use it to do the following 1) Protect fs_info->running_transaction. All trans handles have to do is check this, and then take a reference of the transaction and keep on going. 2) Protect the fs_info->trans_list. This doesn't get used too much, basically it just holds the current transactions, which will usually just be the currently committing transaction and the currently running transaction at most. 3) Protect the dead roots list. This is only ever processed by splicing the list so this is relatively simple. 4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using the trans_mutex before, so this is a pretty straightforward change. 5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over the entirety of the commit we need to have a way to block new people from creating a new transaction while we're doing our work. So we set no_trans_join and in join_transaction we test to see if that is set, and if it is we do a wait_on_commit. 6) Make the transaction use count atomic so we don't need to take locks to modify it when we're dropping references. 7) Add a commit_lock to the transaction to make sure multiple people trying to commit the same transaction don't race and commit at the same time. 8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl trans. I have tested this with xfstests, but obviously it is a pretty hairy change so lots of testing is greatly appreciated. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| | | * | | | | | Btrfs: if we've already started a trans handle, use that oneJosef Bacik2011-05-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently track trans handles in current->journal_info, but we don't actually use it. This patch fixes it. This will cover the case where we have multiple people starting transactions down the call chain. This keeps us from having to allocate a new handle and all of that, we just increase the use count of the current handle, save the old block_rsv, and return. I tested this with xfstests and it worked out fine. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| | | * | | | | | Btrfs: take away the num_items argument from btrfs_join_transactionJosef Bacik2011-05-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I keep forgetting that btrfs_join_transaction() just ignores the num_items argument, which leads me to sending pointless patches and looking stupid :). So just kill the num_items argument from btrfs_join_transaction and btrfs_start_ioctl_transaction, since neither of them use it. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| | | * | | | | | Btrfs: make sure to use the delalloc reserve when filling delallocJosef Bacik2011-05-23
| | | | |_|_|/ / | | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the prealloc filling code and compressed code we don't set trans->block_rsv to the delalloc block reserve properly, which is going to make us use metadata from the wrong pool, this patch fixes that. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| * | | | | | | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2011-06-03
| |\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://git.kernel.dk/linux-block: block: Use hlist_entry() for io_context.cic_list.first cfq-iosched: Remove bogus check in queue_fail path xen/blkback: potential null dereference in error handling xen/blkback: don't call vbd_size() if bd_disk is NULL block: blkdev_get() should access ->bd_disk only after success CFQ: Fix typo and remove unnecessary semicolon block: remove unwanted semicolons Revert "block: Remove extra discard_alignment from hd_struct." nbd: adjust 'max_part' according to part_shift nbd: limit module parameters to a sane value nbd: pass MSG_* flags to kernel_recvmsg() block: improve the bio_add_page() and bio_add_pc_page() descriptions
| | * | | | | | | block: blkdev_get() should access ->bd_disk only after successTejun Heo2011-06-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | d4dc210f69 (block: don't block events on excl write for non-optical devices) added dereferencing of bdev->bd_disk to test GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; however, bdev->bd_disk can be %NULL if open failed which can lead to an oops. Test the flag after testing open was successful, not before. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: David Miller <davem@davemloft.net> Tested-by: David Miller <davem@davemloft.net> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * | | | | | | Revert "block: Remove extra discard_alignment from hd_struct."Jens Axboe2011-05-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It was not a good idea to start dereferencing disk->queue from the fs sysfs strategy for displaying discard alignment. We ran into first a NULL pointer deref, and after fixing that we sometimes see unvalid disk->queue pointer values. Since discard is the only one of the bunch actually looking into the queue, just revert the change. This reverts commit 23ceb5b7719e9276d4fa72a3ecf94dd396755276. Conflicts: fs/partitions/check.c
| | * | | | | | | block: improve the bio_add_page() and bio_add_pc_page() descriptionsAndreas Gruenbacher2011-05-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The descriptions of bio_add_page() and bio_add_pc_page() are slightly inconsistent; improve them. Signed-off-by: Andreas Gruenbacher <agruen@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * | | | | | | | Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6Linus Torvalds2011-06-03
| |\ \ \ \ \ \ \ \ | | |_|_|_|/ / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'linux-next' of git://git.infradead.org/ubifs-2.6: UBIFS: fix-up free space earlier UBIFS: intialize LPT earlier UBIFS: assert no fixup when writing a node UBIFS: fix clean znode counter corruption in error cases UBIFS: fix memory leak on error path UBIFS: fix shrinker object count reports UBIFS: fix recovery broken by the previous recovery fix UBIFS: amend ubifs_recover_leb interface UBIFS: introduce a "grouped" journal head flag UBIFS: supress false error messages
| | * | | | | | | UBIFS: fix-up free space earlierBen Gardiner2011-06-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The free space fixup is currently initiated during mount after the call to ubifs_write_master() which results in a write to PEBs; this has been observed with the patch 'assert no fixup when writing a node' applied: Move the free space fixup on mount to before the calls to ubifs_recover_inl_heads() and ubifs_write_master(). This results in no assertions with the previously mentioned patch applied. Artem: tweaked the patch a bit Signed-off-by: Ben Gardiner <bengardiner@nanometrics> Reviewed-by: Matthew L. Creech <mlcreech@gmail.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| | * | | | | | | UBIFS: intialize LPT earlierBen Gardiner2011-06-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current 'mount_ubifs()' implementation does not initialize the LPT until the the master node is marked dirty. Move the LPT initialization to before marking the master node dirty. This is a preparation for the next patch which will move the free-space-fixup check to before marking the master node dirty, because we have to fix-up the free space before doing any writes. Artem: massaged the patch and commit message. Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca> Reviewed-by: Matthew L. Creech <mlcreech@gmail.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| | * | | | | | | UBIFS: assert no fixup when writing a nodeBen Gardiner2011-06-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current free space fixup can result in some writing to the UBI volume when the space_fixup flag is set. To catch instances where UBIFS is writing to the NAND while the space_fixup flag is set, add an assert to ubifs_write_node(). Artem: tweaked the patch, added similar assertion to the write buffer write path. Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca> Reviewed-by: Matthew L. Creech <mlcreech@gmail.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| | * | | | | | | UBIFS: fix clean znode counter corruption in error casesArtem Bityutskiy2011-06-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UBIFS maintains per-filesystem and global clean znode counters ('c->clean_zn_cnt' and 'ubifs_clean_zn_cnt'). It is important to maintain correct values there since the shrinker relies on 'ubifs_clean_zn_cnt'. However, in case of failures during commit the counters were corrupted. E.g., if a failure happens in the middle of 'write_index()', then some nodes in the commit list ('c->cnext') are marked as clean, and some are marked as dirty. And the 'ubifs_destroy_tnc_subtree()' frees does not retrun correct count, and we end up with non-zero 'c->clean_zn_cnt' when unmounting. This means that if we have 2 file-sytem and one of them fails, and we unmount it, 'ubifs_clean_zn_cnt' stays incorrect and confuses the shrinker. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| | * | | | | | | UBIFS: fix memory leak on error pathArtem Bityutskiy2011-06-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UBIFS leaks memory on error path in 'ubifs_jnl_update()' in case of write failure because it forgets to free the 'struct ubifs_dent_node *dent' object. Although the object is small, the alignment can make it large - e.g., 2KiB if the min. I/O unit is 2KiB. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Cc: stable@kernel.org
| | * | | | | | | UBIFS: fix shrinker object count reportsArtem Bityutskiy2011-06-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sometimes VM asks the shrinker to return amount of objects it can shrink, and we return the ubifs_clean_zn_cnt in that case. However, it is possible that this counter is negative for a short period of time, due to the way UBIFS TNC code updates it. And I can observe the following warnings sometimes: shrink_slab: ubifs_shrinker+0x0/0x2b7 [ubifs] negative objects to delete nr=-8541616642706119788 This patch makes sure UBIFS never returns negative count of objects. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Cc: stable@kernel.org
| | * | | | | | | UBIFS: fix recovery broken by the previous recovery fixArtem Bityutskiy2011-06-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unfortunately, the recovery fix d1606a59b6be4ea392eabd40d1250aa1eeb19efb (UBIFS: fix extremely rare mount failure) broke recovery. This commit make UBIFS drop the last min. I/O unit in all journal heads, but this is needed only for the GC head. And this does not work for non-GC heads. For example, if suppose we have min. I/O units A and B, and A contains a valid node X, which was fsynced, and then a group of nodes Y which spans the rest of A and B. In this case we'll drop not only Y, but also X, which is obviously incorrect. This patch fixes the issue and additionally makes recovery to drop last min. I/O unit only for the GC head, and leave things as they have been for ages for the other heads - this is safer. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| | * | | | | | | UBIFS: amend ubifs_recover_leb interfaceArtem Bityutskiy2011-06-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of passing "grouped" parameter to 'ubifs_recover_leb()' which tells whether the nodes are grouped in the LEB to recover, pass the journal head number and let 'ubifs_recover_leb()' look at the journal head's 'grouped' flag. This patch is a preparation to a further fix where we'll need to know the journal head number for other purposes. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| | * | | | | | | UBIFS: introduce a "grouped" journal head flagArtem Bityutskiy2011-06-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Journal heads are different in a way how UBIFS writes nodes there. All normal journal heads receive grouped nodes, while the GC journal heads receives ungrouped nodes. This patch adds a 'grouped' flag to 'struct ubifs_jhead' which describes this property. This patch is a preparation to a further recovery fix. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| | * | | | | | | UBIFS: supress false error messagesArtem Bityutskiy2011-06-01
| | | |_|_|_|/ / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit ab51afe05273741f72383529ef488aa1ea598ec6 was a good clean-up, but it introduced a regression - now UBIFS prints scary error messages during recovery on all corrupted nodes, even though the corruptions are expected (due to a power cut). This patch fixes the issue. Additionally fix a typo in a commentary introduced by the same commit. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| * | | | | | | autofs4: bogus dentry_unhash() added in ->unlink()Al Viro2011-05-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | | | | vfs: shrink_dcache_parent before rmdir, dir renameSage Weil2011-05-30
| |/ / / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The dentry_unhash push-down series missed that shink_dcache_parent needs to be called prior to rmdir or dir rename to clear DCACHE_REFERENCED and allow efficient dentry reclaim. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | | | Merge branch 'for-linus' of ↵Linus Torvalds2011-05-29
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6: eCryptfs: Remove ecryptfs_header_cache_2 eCryptfs: Cleanup and optimize ecryptfs_lookup_interpose() eCryptfs: Return useful code from contains_ecryptfs_marker eCryptfs: Fix new inode race condition eCryptfs: Cleanup inode initialization code eCryptfs: Consolidate inode functions into inode.c
| | * | | | | | eCryptfs: Remove ecryptfs_header_cache_2Tyler Hicks2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that ecryptfs_lookup_interpose() is no longer using ecryptfs_header_cache_2 to read in metadata, the kmem_cache can be removed and the ecryptfs_header_cache_1 kmem_cache can be renamed to ecryptfs_header_cache. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
| | * | | | | | eCryptfs: Cleanup and optimize ecryptfs_lookup_interpose()Tyler Hicks2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ecryptfs_lookup_interpose() has turned into spaghetti code over the years. This is an effort to clean it up. - Shorten overly descriptive variable names such as ecryptfs_dentry - Simplify gotos and error paths - Create helper function for reading plaintext i_size from metadata It also includes an optimization when reading i_size from the metadata. A complete page-sized kmem_cache_alloc() was being done to read in 16 bytes of metadata. The buffer for that is now statically declared. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
| | * | | | | | eCryptfs: Return useful code from contains_ecryptfs_markerTyler Hicks2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of having the calling functions translate the true/false return code to either 0 or -EINVAL, have contains_ecryptfs_marker() return 0 or -EINVAL so that the calling functions can just reuse the return code. Also, rename the function to ecryptfs_validate_marker() to avoid callers mistakenly thinking that it returns true/false codes. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
| | * | | | | | eCryptfs: Fix new inode race conditionTyler Hicks2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only unlock and d_add() new inodes after the plaintext inode size has been read from the lower filesystem. This fixes a race condition that was sometimes seen during a multi-job kernel build in an eCryptfs mount. https://bugzilla.kernel.org/show_bug.cgi?id=36002 Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Reported-by: David <david@unsolicited.net> Tested-by: David <david@unsolicited.net>
| | * | | | | | eCryptfs: Cleanup inode initialization codeTyler Hicks2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The eCryptfs inode get, initialization, and dentry interposition code has two separate paths. One is for when dentry interposition is needed after doing things like a mkdir in the lower filesystem and the other is needed after a lookup. Unlocking new inodes and doing a d_add() needs to happen at different times, depending on which type of dentry interposing is being done. This patch cleans up the inode get and initialization code paths and splits them up so that the locking and d_add() differences mentioned above can be handled appropriately in a later patch. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Tested-by: David <david@unsolicited.net>
| | * | | | | | eCryptfs: Consolidate inode functions into inode.cTyler Hicks2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These functions should live in inode.c since their focus is on inodes and they're primarily used by functions in inode.c. Also does a simple cleanup of ecryptfs_inode_test() and rolls ecryptfs_init_inode() into ecryptfs_inode_set(). Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Tested-by: David <david@unsolicited.net>
| * | | | | | | Merge branch 'pnfs-submit' of git://git.open-osd.org/linux-open-osdLinus Torvalds2011-05-29
| |\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'pnfs-submit' of git://git.open-osd.org/linux-open-osd: (32 commits) pnfs-obj: pg_test check for max_io_size NFSv4.1: define nfs_generic_pg_test NFSv4.1: use pnfs_generic_pg_test directly by layout driver NFSv4.1: change pg_test return type to bool NFSv4.1: unify pnfs_pageio_init functions pnfs-obj: objlayout_encode_layoutcommit implementation pnfs: encode_layoutcommit pnfs-obj: report errors and .encode_layoutreturn Implementation. pnfs: encode_layoutreturn pnfs: layoutret_on_setattr pnfs: layoutreturn pnfs-obj: osd raid engine read/write implementation pnfs: support for non-rpc layout drivers pnfs-obj: define per-inode private structure pnfs: alloc and free layout_hdr layoutdriver methods pnfs-obj: objio_osd device information retrieval and caching pnfs-obj: decode layout, alloc/free lseg pnfs-obj: pnfs_osd XDR client implementation pnfs-obj: pnfs_osd XDR definitions pnfs-obj: objlayoutdriver module skeleton ...
| | * | | | | | | pnfs-obj: pg_test check for max_io_sizeBoaz Harrosh2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement pg_test vector to test for max IO sizes. We calculate a max_io_size member only once, and cache it in lseg so to not do so on every page insert. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> [simplify logic] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | NFSv4.1: define nfs_generic_pg_testBoaz Harrosh2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By default, unless pnfs is used coalesce pages until pg_bsize (rsize or wsize) is reached. pnfs layout drivers define their own pg_test methods that use pnfs_generic_pg_test and need to define their own I/O size limits (e.g. based on the file stripe size). [Move a check from nfs_pageio_do_add_request to nfs_generic_pg_test] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | NFSv4.1: use pnfs_generic_pg_test directly by layout driverBenny Halevy2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | NFSv4.1: change pg_test return type to boolBenny Halevy2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | NFSv4.1: unify pnfs_pageio_init functionsBenny Halevy2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use common code for pnfs_pageio_init_{read,write} and use a common generic pg_test function. Note that this function always assumes the the layout driver's pg_test method is implemented. [Fix BUG] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | pnfs-obj: objlayout_encode_layoutcommit implementationBoaz Harrosh2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Define API for io-engines to report delta_space_used in IOs * Encode the osd-layout specific information of the layoutcommit XDR buffer. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | pnfs: encode_layoutcommitBenny Halevy2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a layout driver method to encode the layout type specific opaque part of layout commit in-line in the xdr stream. Currently, the pnfs-objects layout driver uses it to encode metadata hints to the MDS and the blocks layout driver to commit provisionally allocated extents to the file. Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | pnfs-obj: report errors and .encode_layoutreturn Implementation.Boaz Harrosh2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An io_state pre-allocates an error information structure for each possible osd-device that might error during IO. When IO is done if all was well the io_state is freed. (as today). If the I/O has ended with an error, the io_state is queued on a per-layout err_list. When eventually encode_layoutreturn() is called, each error is properly encoded on the XDR buffer and only then the io_state is removed from err_list and de-allocated. It is up to the io_engine to fill in the segment that fault and the type of osd_error that occurred. By calling objlayout_io_set_result() for each failing device. In objio_osd: * Allocate io-error descriptors space as part of io_state * Use generic objlayout error reporting at end of io. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | pnfs: encode_layoutreturnAndy Adamson2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a layout driver method to encode the layout type specific opaque part of layout return in-line in the xdr stream. Currently the pnfs-objects layout driver uses it to encode i/o error information on LAYOUTRETURN. Signed-off-by: Andy Adamson <andros@netapp.com> [fixup layout header pointer for encode_layoutreturn] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | pnfs: layoutret_on_setattrBenny Halevy2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the objects layout security model, we have object capabilities that are associated with the layout and we anticipate that the server will issue a cb_layoutrecall for any setattr that changes security related attributes (user/group/mode/acl) or truncates the file. Therefore, the layout is returned before issuing the setattr to avoid the anticipated cb_layoutrecall. Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | pnfs: layoutreturnBenny Halevy2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NFSv4.1 LAYOUTRETURN implementation Currently, does not support layout-type payload encoding. Signed-off-by: Alexandros Batsakis <batsakis@netapp.com> Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn> [call pnfs_return_layout right before pnfs_destroy_layout] [remove assert_spin_locked from pnfs_clear_lseg_list] [remove wait parameter from the layoutreturn path.] [remove return_type field from nfs4_layoutreturn_args] [remove range from nfs4_layoutreturn_args] [no need to send layoutcommit from _pnfs_return_layout] [don't wait on sync layoutreturn] [fix layout stateid in layoutreturn args] [fixed NULL deref in _pnfs_return_layout] [removed recaim member of nfs4_layoutreturn_args] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | pnfs-obj: osd raid engine read/write implementationBoaz Harrosh2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the use of the in-kernel osd library. Implement read/write of data from/to osd-objects according to information specified in the objects-layout. Support for stripping over mirrors with a received stripe_unit. There are however a few constrains which are not supported: 1. Stripe Unit must be a multiple of PAGE_SIZE 2. stripe length (stripe_unit * number_of_stripes) can not be bigger then 32bit. Also support raid-groups and partial-layout. Partial-layout is when not all the groups are received on the line, addressing only a partial range of the file. TODO: Only raid0! raid 4/5/6 support will come at later stage A none supported layout will send IO through the MDS [Important fallout from the last rebase] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> [gfp_flags] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | pnfs: support for non-rpc layout driversBenny Halevy2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Non-rpc layout driver such as for objects and blocks implement their own I/O path and error handling logic. Therefore bypass NFS-based error handling for these layout drivers. [fix lseg ref-count bugs, and null de-refs] [Fall out from: non-rpc layout drivers] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> [get rid of PNFS_USE_RPC_CODE] [get rid of __nfs4_write_done_cb] [revert useless change in nfs4_write_done_cb] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | pnfs-obj: define per-inode private structureBenny Halevy2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | allocate and deallocate per-inode private pnfs_layout_hdr in preparation for I/O implementation. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
| | * | | | | | | pnfs: alloc and free layout_hdr layoutdriver methodsBenny Halevy2011-05-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [gfp_flags] Signed-off-by: Benny Halevy <bhalevy@panasas.com>