aboutsummaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAge
* Btrfs: change how we indicate we're adding csumsJosef Bacik2012-07-23
| | | | | | | | | | | | | | | | | | | | | | | There is weird logic I had to put in place to make sure that when we were adding csums that we'd used the delalloc block rsv instead of the global block rsv. Part of this meant that we had to free up our transaction reservation before we ran the delayed refs since csum deletion happens during the delayed ref work. The problem with this is that when we release a reservation we will add it to the global reserve if it is not full in order to keep us going along longer before we have to force a transaction commit. By releasing our reservation before we run delayed refs we don't get the opportunity to drain down the global reserve for the work we did, so we won't refill it as often. This isn't a problem per-se, it just results in us possibly committing transactions more and more often, and in rare cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran out of space in our block rsv. This also helps us by holding onto space while the delayed refs run so we don't end up with as many people trying to do things at the same time, which again will help us not force commits or hit the use_block_rsv warnings. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* Btrfs: return error of btrfs_update_inode() to callerTsutomu Itoh2012-07-23
| | | | | | | | | We didn't check error of btrfs_update_inode(), but that error looks easy to bubble back up. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* Btrfs: fix error handling in __add_reloc_root()Dan Carpenter2012-07-23
| | | | | | | | We dereferenced "node" in the error message after freeing it. Also btrfs_panic() can return so we should return an error code instead of continuing. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
* Btrfs: do not ignore errors from btrfs_cleanup_fs_roots() when mountingIlya Dryomov2012-07-23
| | | | | | | | There used to be a BUG_ON(ret) there before EH patch (79787eaa) went in. Bail out with EINVAL. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* Btrfs: do not return EINVAL instead of ENOMEM from open_ctree()Ilya Dryomov2012-07-23
| | | | | | When bailing from open_ctree() err is returned, not ret. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* Btrfs: add DEVICE_READY ioctlJosef Bacik2012-07-23
| | | | | | | | | | This will be used in conjunction with btrfs device ready <dev>. This is needed for initrd's to have a nice and lightweight way to tell if all of the devices needed for a file system are in the cache currently. This keeps them from having to do mount+sleep loops waiting for devices to show up. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* Btrfs: flush delayed inodes if we're short on spaceJosef Bacik2012-07-23
| | | | | | | | | | | | | | Those crazy gentoo guys have been complaining about ENOSPC errors on their portage volumes. This is because doing things like untar tends to create lots of new files which will soak up all the reservation space in the delayed inodes. Usually this gets papered over by the fact that we will try and commit the transaction, however if this happens in the wrong spot or we choose not to commit the transaction you will be screwed. So add the ability to expclitly flush delayed inodes to free up space. Please test this out guys to make sure it works since as usual I cannot reproduce. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* btrfs: join DEV_STATS ioctls to oneDavid Sterba2012-07-23
| | | | | | | | | | | | | | | | Commit c11d2c236cc260b36 (Btrfs: add ioctl to get and reset the device stats) introduced two ioctls doing almost the same thing distinguished by just the ioctl number which encodes "do reset after read". I have suggested http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg16604.html to implement it via the ioctl args. This hasn't happen, and I think we should use a more clean way to pass flags and should not waste ioctl numbers. CC: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: David Sterba <dsterba@suse.cz>
* btrfs: ignore unfragmented file checks in defrag when compression enabled - ↵Andrew Mahone2012-07-23
| | | | | | | | | | | | | rebased Rebased on btrfs-next and retested. Inform should_defrag_range if BTRFS_DEFRAG_RANGE_COMPRESS is set. If so, skip checks for adjacent extents and extent size when deciding whether to defrag, as these can prevent an uncompressed and unfragmented file from being compressed as requested. Signed-off-by: Andrew Mahone <andrew.mahone@gmail.com>
* Btrfs: small naming cleanup in join_transaction()Dan Carpenter2012-07-23
| | | | | | | | "root->fs_info" and "fs_info" are the same, but "fs_info" is prefered because it is shorter and that's what is used in the rest of the function. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
* Btrfs: don't update atime on RO subvolumesAlexander Block2012-07-23
| | | | | | | | | | | | Before the update_time inode operation was indroduced, it was not possible to prevent updates of atime on RO subvolumes. VFS was only able to check for RO on the mount, but did not know anything about btrfs subvolumes. btrfs_update_time does now check if the root is RO and skip updating of times. Signed-off-by: Alexander Block <ablock84@googlemail.com>
* Btrfs: allow mount -o remount,compress=noArnd Hannemann2012-07-23
| | | | | | | | | | | | Btrfs allows to turn on compression on a mounted and used filesystem by issuing mount -o remount,compress=lzo. This patch allows to turn compression off again while the filesystem is mounted. As suggested by David Sterba if the compress-force option was set, it is implicitly cleared if compression is turned off. Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
* Btrfs: remove ->dirty_inodeJosef Bacik2012-07-23
| | | | | | | | We do all of our inode updating when we change it, and now that we do ->update_time we don't need ->dirty_inode for atime updates anymore, so just remove it. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
* Btrfs: reduce calls to wake_up on uncontended locksChris Mason2012-07-23
| | | | | | | | The btrfs locks were unconditionally calling wake_up as the locks were released. This lead to extra thrashing on the waitqueue, especially for locks that were dominated by readers. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: don't wait around for new log writers on an SSDChris Mason2012-07-23
| | | | | | | Waiting on spindles improves performance, but ssds want all the IO as quickly as we can push it down. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osdLinus Torvalds2012-07-20
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull pnfs/ore fixes from Boaz Harrosh: "These are catastrophic fixes to the pnfs objects-layout that were just discovered. They are also destined for @stable. I have found these and worked on them at around RC1 time but unfortunately went to the hospital for kidney stones and had a very slow recovery. I refrained from sending them as is, before proper testing, and surly I have found a bug just yesterday. So now they are all well tested, and have my sign-off. Other then fixing the problem at hand, and assuming there are no bugs at the new code, there is low risk to any surrounding code. And in anyway they affect only these paths that are now broken. That is RAID5 in pnfs objects-layout code. It does also affect exofs (which was not broken) but I have tested exofs and it is lower priority then objects-layout because no one is using exofs, but objects-layout has lots of users." * 'for-linus' of git://git.open-osd.org/linux-open-osd: pnfs-obj: Fix __r4w_get_page when offset is beyond i_size pnfs-obj: don't leak objio_state if ore_write/read fails ore: Unlock r4w pages in exact reverse order of locking ore: Remove support of partial IO request (NFS crash) ore: Fix NFS crash by supporting any unaligned RAID IO
| * pnfs-obj: Fix __r4w_get_page when offset is beyond i_sizeBoaz Harrosh2012-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is very common for the end of the file to be unaligned on stripe size. But since we know it's beyond file's end then the XOR should be preformed with all zeros. Old code used to just read zeros out of the OSD devices, which is a great waist. But what scares me more about this situation is that, we now have pages attached to the file's mapping that are beyond i_size. I don't like the kind of bugs this calls for. Fix both birds, by returning a global zero_page, if offset is beyond i_size. TODO: Change the API to ->__r4w_get_page() so a NULL can be returned without being considered as error, since XOR API treats NULL entries as zero_pages. [Bug since 3.2. Should apply the same way to all Kernels since] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
| * pnfs-obj: don't leak objio_state if ore_write/read failsBoaz Harrosh2012-07-20
| | | | | | | | | | | | [Bug since 3.2 Kernel] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
| * ore: Unlock r4w pages in exact reverse order of lockingBoaz Harrosh2012-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | The read-4-write pages are locked in address ascending order. But where unlocked in a way easiest for coding. Fix that, locks should be released in opposite order of locking, .i.e descending address order. I have not hit this dead-lock. It was found by inspecting the dbug print-outs. I suspect there is an higher lock at caller that protects us, but fix it regardless. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
| * ore: Remove support of partial IO request (NFS crash)Boaz Harrosh2012-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Do to OOM situations the ore might fail to allocate all resources needed for IO of the full request. If some progress was possible it would proceed with a partial/short request, for the sake of forward progress. Since this crashes NFS-core and exofs is just fine without it just remove this contraption, and fail. TODO: Support real forward progress with some reserved allocations of resources, such as mem pools and/or bio_sets [Bug since 3.2 Kernel] CC: Stable Tree <stable@kernel.org> CC: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
| * ore: Fix NFS crash by supporting any unaligned RAID IOBoaz Harrosh2012-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In RAID_5/6 We used to not permit an IO that it's end byte is not stripe_size aligned and spans more than one stripe. .i.e the caller must check if after submission the actual transferred bytes is shorter, and would need to resubmit a new IO with the remainder. Exofs supports this, and NFS was supposed to support this as well with it's short write mechanism. But late testing has exposed a CRASH when this is used with none-RPC layout-drivers. The change at NFS is deep and risky, in it's place the fix at ORE to lift the limitation is actually clean and simple. So here it is below. The principal here is that in the case of unaligned IO on both ends, beginning and end, we will send two read requests one like old code, before the calculation of the first stripe, and also a new site, before the calculation of the last stripe. If any "boundary" is aligned or the complete IO is within a single stripe. we do a single read like before. The code is clean and simple by splitting the old _read_4_write into 3 even parts: 1._read_4_write_first_stripe 2. _read_4_write_last_stripe 3. _read_4_write_execute And calling 1+3 at the same place as before. 2+3 before last stripe, and in the case of all in a single stripe then 1+2+3 is preformed additively. Why did I not think of it before. Well I had a strike of genius because I have stared at this code for 2 years, and did not find this simple solution, til today. Not that I did not try. This solution is much better for NFS than the previous supposedly solution because the short write was dealt with out-of-band after IO_done, which would cause for a seeky IO pattern where as in here we execute in order. At both solutions we do 2 separate reads, only here we do it within a single IO request. (And actually combine two writes into a single submission) NFS/exofs code need not change since the ORE API communicates the new shorter length on return, what will happen is that this case would not occur anymore. hurray!! [Stable this is an NFS bug since 3.2 Kernel should apply cleanly] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* | Merge tag 'upstream-3.5-rc8' of git://git.infradead.org/linux-ubifsLinus Torvalds2012-07-20
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull UBIFS free space fix-up bugfix from Artem Bityutskiy: "It's been reported already twice recently: http://lists.infradead.org/pipermail/linux-mtd/2012-May/041408.html http://lists.infradead.org/pipermail/linux-mtd/2012-June/042422.html and we finally have the fix. I am quite confident the fix is correct because I could reproduce the problem with nandsim and verify the fix. It was also verified by Iwo (the reporter). I am also confident that this is OK to merge the fix so late because this patch affects only the fixup functionality, which is not used by most users." * tag 'upstream-3.5-rc8' of git://git.infradead.org/linux-ubifs: UBIFS: fix a bug in empty space fix-up
| * | UBIFS: fix a bug in empty space fix-upArtem Bityutskiy2012-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UBIFS has a feature called "empty space fix-up" which is a quirk to work-around limitations of dumb flasher programs. Namely, of those flashers that are unable to skip NAND pages full of 0xFFs while flashing, resulting in empty space at the end of half-filled eraseblocks to be unusable for UBIFS. This feature is relatively new (introduced in v3.0). The fix-up routine (fixup_free_space()) is executed only once at the very first mount if the superblock has the 'space_fixup' flag set (can be done with -F option of mkfs.ubifs). It basically reads all the UBIFS data and metadata and writes it back to the same LEB. The routine assumes the image is pristine and does not have anything in the journal. There was a bug in 'fixup_free_space()' where it fixed up the log incorrectly. All but one LEB of the log of a pristine file-system are empty. And one contains just a commit start node. And 'fixup_free_space()' just unmapped this LEB, which resulted in wiping the commit start node. As a result, some users were unable to mount the file-system next time with the following symptom: UBIFS error (pid 1): replay_log_leb: first log node at LEB 3:0 is not CS node UBIFS error (pid 1): replay_log_leb: log error detected while replaying the log at LEB 3:0 The root-cause of this bug was that 'fixup_free_space()' wrongly assumed that the beginning of empty space in the log head (c->lhead_offs) was known on mount. However, it is not the case - it was always 0. UBIFS does not store in it the master node and finds out by scanning the log on every mount. The fix is simple - just pass commit start node size instead of 0 to 'fixup_leb()'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@linux.intel.com> Cc: stable@vger.kernel.org [v3.0+] Reported-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com> Tested-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com> Reported-by: James Nute <newten82@gmail.com>
* | | Merge git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2012-07-18
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull CIFS fixes from Steve French. * git://git.samba.org/sfrench/cifs-2.6: cifs: always update the inode cache with the results from a FIND_* cifs: when CONFIG_HIGHMEM is set, serialize the read/write kmaps cifs: on CONFIG_HIGHMEM machines, limit the rsize/wsize to the kmap space Initialise mid_q_entry before putting it on the pending queue
| * | | cifs: always update the inode cache with the results from a FIND_*Jeff Layton2012-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we get back a FIND_FIRST/NEXT result, we have some info about the dentry that we use to instantiate a new inode. We were ignoring and discarding that info when we had an existing dentry in the cache. Fix this by updating the inode in place when we find an existing dentry and the uniqueid is the same. Cc: <stable@vger.kernel.org> # .31.x Reported-and-Tested-by: Andrew Bartlett <abartlet@samba.org> Reported-by: Bill Robertson <bill_robertson@debortoli.com.au> Reported-by: Dion Edwards <dion_edwards@debortoli.com.au> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
| * | | cifs: when CONFIG_HIGHMEM is set, serialize the read/write kmapsJeff Layton2012-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jian found that when he ran fsx on a 32 bit arch with a large wsize the process and one of the bdi writeback kthreads would sometimes deadlock with a stack trace like this: crash> bt PID: 2789 TASK: f02edaa0 CPU: 3 COMMAND: "fsx" #0 [eed63cbc] schedule at c083c5b3 #1 [eed63d80] kmap_high at c0500ec8 #2 [eed63db0] cifs_async_writev at f7fabcd7 [cifs] #3 [eed63df0] cifs_writepages at f7fb7f5c [cifs] #4 [eed63e50] do_writepages at c04f3e32 #5 [eed63e54] __filemap_fdatawrite_range at c04e152a #6 [eed63ea4] filemap_fdatawrite at c04e1b3e #7 [eed63eb4] cifs_file_aio_write at f7fa111a [cifs] #8 [eed63ecc] do_sync_write at c052d202 #9 [eed63f74] vfs_write at c052d4ee #10 [eed63f94] sys_write at c052df4c #11 [eed63fb0] ia32_sysenter_target at c0409a98 EAX: 00000004 EBX: 00000003 ECX: abd73b73 EDX: 012a65c6 DS: 007b ESI: 012a65c6 ES: 007b EDI: 00000000 SS: 007b ESP: bf8db178 EBP: bf8db1f8 GS: 0033 CS: 0073 EIP: 40000424 ERR: 00000004 EFLAGS: 00000246 Each task would kmap part of its address array before getting stuck, but not enough to actually issue the write. This patch fixes this by serializing the marshal_iov operations for async reads and writes. The idea here is to ensure that cifs aggressively tries to populate a request before attempting to fulfill another one. As soon as all of the pages are kmapped for a request, then we can unlock and allow another one to proceed. There's no need to do this serialization on non-CONFIG_HIGHMEM arches however, so optimize all of this out when CONFIG_HIGHMEM isn't set. Cc: <stable@vger.kernel.org> Reported-by: Jian Li <jiali@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
| * | | cifs: on CONFIG_HIGHMEM machines, limit the rsize/wsize to the kmap spaceJeff Layton2012-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently rely on being able to kmap all of the pages in an async read or write request. If you're on a machine that has CONFIG_HIGHMEM set then that kmap space is limited, sometimes to as low as 512 slots. With 512 slots, we can only support up to a 2M r/wsize, and that's assuming that we can get our greedy little hands on all of them. There are other users however, so it's possible we'll end up stuck with a size that large. Since we can't handle a rsize or wsize larger than that currently, cap those options at the number of kmap slots we have. We could consider capping it even lower, but we currently default to a max of 1M. Might as well allow those luddites on 32 bit arches enough rope to hang themselves. A more robust fix would be to teach the send and receive routines how to contend with an array of pages so we don't need to marshal up a kvec array at all. That's a fairly significant overhaul though, so we'll need this limit in place until that's ready. Cc: <stable@vger.kernel.org> Reported-by: Jian Li <jiali@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
| * | | Initialise mid_q_entry before putting it on the pending queueSachin Prabhu2012-07-17
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A user reported a crash in cifs_demultiplex_thread() caused by an incorrectly set mid_q_entry->callback() function. It appears that the callback assignment made in cifs_call_async() was not flushed back to memory suggesting that a memory barrier was required here. Changing the code to make sure that the mid_q_entry structure was completely initialised before it was added to the pending queue fixes the problem. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Steve French <smfrench@gmail.com>
* | | ext4: fix duplicated mnt_drop_write call in EXT4_IOC_MOVE_EXTAl Viro2012-07-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Caused, AFAICS, by mismerge in commit ff9cb1c4eead ("Merge branch 'for_linus' into for_linus_merged") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org # 3.3+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge tag 'pm-post-3.5-rc7' of ↵Linus Torvalds2012-07-17
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull a last-minute PM update from Rafael J. Wysocki: "This renames CAP_EPOLLWAKEUP to CAP_BLOCK_SUSPEND to encourage future reuse of the capability in question in related cases." * tag 'pm-post-3.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: Rename CAP_EPOLLWAKEUP to CAP_BLOCK_SUSPEND
| * | | PM: Rename CAP_EPOLLWAKEUP to CAP_BLOCK_SUSPENDMichael Kerrisk2012-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As discussed in http://thread.gmane.org/gmane.linux.kernel/1249726/focus=1288990, the capability introduced in 4d7e30d98939a0340022ccd49325a3d70f7e0238 to govern EPOLLWAKEUP seems misnamed: this capability is about governing the ability to suspend the system, not using a particular API flag (EPOLLWAKEUP). We should make the name of the capability more general to encourage reuse in related cases. (Whether or not this capability should also be used to govern the use of /sys/power/wake_lock is a question that needs to be separately resolved.) This patch renames the capability to CAP_BLOCK_SUSPEND. In order to ensure that the old capability name doesn't make it out into the wild, could you please apply and push up the tree to ensure that it is incorporated for the 3.5 release. Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
* | | | Merge tag 'for-linus-v3.5-rc7' of git://oss.sgi.com/xfs/xfsLinus Torvalds2012-07-16
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs regression fixes from Ben Myers: - Really fix a cursor leak in xfs_alloc_ag_vextent_near - Fix a performance regression related to doing allocation in workqueues - Prevent recursion in xfs_buf_iorequest which is causing stack overflows - Don't call xfs_bdstrat_cb in xfs_buf_iodone callbacks * tag 'for-linus-v3.5-rc7' of git://oss.sgi.com/xfs/xfs: xfs: do not call xfs_bdstrat_cb in xfs_buf_iodone_callbacks xfs: prevent recursion in xfs_buf_iorequest xfs: don't defer metadata allocation to the workqueue xfs: really fix the cursor leak in xfs_alloc_ag_vextent_near
| * | | | xfs: do not call xfs_bdstrat_cb in xfs_buf_iodone_callbacksChristoph Hellwig2012-07-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_bdstrat_cb only adds a check for a shutdown filesystem over xfs_buf_iorequest, but xfs_buf_iodone_callbacks just checked for a shut down filesystem a little earlier. In addition the shutdown handling in xfs_bdstrat_cb is not very suitable for this caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | xfs: prevent recursion in xfs_buf_iorequestChristoph Hellwig2012-07-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the b_iodone handler is run in calling context in xfs_buf_iorequest we can run into a recursion where xfs_buf_iodone_callbacks keeps calling back into xfs_buf_iorequest because an I/O error happened, which keeps calling back into xfs_buf_iorequest. This chain will usually not take long because the filesystem gets shut down because of log I/O errors, but even over a short time it can cause stack overflows if run on the same context. As a short term workaround make sure we always call the iodone handler in workqueue context. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | xfs: don't defer metadata allocation to the workqueueDave Chinner2012-07-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Almost all metadata allocations come from shallow stack usage situations. Avoid the overhead of switching the allocation to a workqueue as we are not in danger of running out of stack when making these allocations. Metadata allocations are already marked through the args that are passed down, so this is trivial to do. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reported-by: Mel Gorman <mgorman@suse.de> Tested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | xfs: really fix the cursor leak in xfs_alloc_ag_vextent_nearDave Chinner2012-07-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current cursor is reallocated when retrying the allocation, so the existing cursor needs to be destroyed in both the restart and the failure cases. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
* | | | | fifo: Do not restart open() if it already found a partnerAnders Kaseorg2012-07-16
| |/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a parent and child process open the two ends of a fifo, and the child immediately exits, the parent may receive a SIGCHLD before its open() returns. In that case, we need to make sure that open() will return successfully after the SIGCHLD handler returns, instead of throwing EINTR or being restarted. Otherwise, the restarted open() would incorrectly wait for a second partner on the other end. The following test demonstrates the EINTR that was wrongly thrown from the parent’s open(). Change .sa_flags = 0 to .sa_flags = SA_RESTART to see a deadlock instead, in which the restarted open() waits for a second reader that will never come. (On my systems, this happens pretty reliably within about 5 to 500 iterations. Others report that it manages to loop ~forever sometimes; YMMV.) #include <sys/stat.h> #include <sys/types.h> #include <sys/wait.h> #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #define CHECK(x) do if ((x) == -1) {perror(#x); abort();} while(0) void handler(int signum) {} int main() { struct sigaction act = {.sa_handler = handler, .sa_flags = 0}; CHECK(sigaction(SIGCHLD, &act, NULL)); CHECK(mknod("fifo", S_IFIFO | S_IRWXU, 0)); for (;;) { int fd; pid_t pid; putc('.', stderr); CHECK(pid = fork()); if (pid == 0) { CHECK(fd = open("fifo", O_RDONLY)); _exit(0); } CHECK(fd = open("fifo", O_WRONLY)); CHECK(close(fd)); CHECK(waitpid(pid, NULL, 0)); } } This is what I suspect was causing the Git test suite to fail in t9010-svn-fe.sh: http://bugs.debian.org/678852 Signed-off-by: Anders Kaseorg <andersk@mit.edu> Reviewed-by: Jonathan Nieder <jrnieder@gmail.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge tag 'nfs-for-3.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2012-07-13
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfixes from Trond Myklebust: - Fix an NFSv4 mount regression - Fix O_DIRECT list manipulation snafus * tag 'nfs-for-3.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: Fix an NFSv4 mount regression NFS: Fix list manipulation snafus in fs/nfs/direct.c
| * | | | NFSv4: Fix an NFSv4 mount regressionTrond Myklebust2012-07-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The helper nfs_fs_mount() will always call nfs4_try_mount with the mount_info->fill_super argument pointing to nfs_fill_super, which is NFSv2/v3 only. Fix is to have nfs4_try_mount replace it with nfs4_fill_super. The regression was introduced by commit c40f8d1d (NFS: Create a common fs_mount() function) Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
| * | | | NFS: Fix list manipulation snafus in fs/nfs/direct.cTrond Myklebust2012-07-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix 2 bugs in nfs_direct_write_reschedule: - The request needs to be removed from the 'reqs' list before it can be added to 'failed'. - Fix an infinite loop if the 'failed' list is non-empty. Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* | | | | Remove easily user-triggerable BUG from generic_setleaseDave Jones2012-07-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This can be trivially triggered from userspace by passing in something unexpected. kernel BUG at fs/locks.c:1468! invalid opcode: 0000 [#1] SMP RIP: 0010:generic_setlease+0xc2/0x100 Call Trace: __vfs_setlease+0x35/0x40 fcntl_setlease+0x76/0x150 sys_fcntl+0x1c6/0x810 system_call_fastpath+0x1a/0x1f Signed-off-by: Dave Jones <davej@redhat.com> Cc: stable@kernel.org # 3.2+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | block: fix infinite loop in __getblk_slowJeff Moyer2012-07-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 080399aaaf35 ("block: don't mark buffers beyond end of disk as mapped") exposed a bug in __getblk_slow that causes mount to hang as it loops infinitely waiting for a buffer that lies beyond the end of the disk to become uptodate. The problem was initially reported by Torsten Hilbrich here: https://lkml.org/lkml/2012/6/18/54 and also reported independently here: http://www.sysresccd.org/forums/viewtopic.php?f=13&t=4511 and then Richard W.M. Jones and Marcos Mello noted a few separate bugzillas also associated with the same issue. This patch has been confirmed to fix: https://bugzilla.redhat.com/show_bug.cgi?id=835019 The main problem is here, in __getblk_slow: for (;;) { struct buffer_head * bh; int ret; bh = __find_get_block(bdev, block, size); if (bh) return bh; ret = grow_buffers(bdev, block, size); if (ret < 0) return NULL; if (ret == 0) free_more_memory(); } __find_get_block does not find the block, since it will not be marked as mapped, and so grow_buffers is called to fill in the buffers for the associated page. I believe the for (;;) loop is there primarily to retry in the case of memory pressure keeping grow_buffers from succeeding. However, we also continue to loop for other cases, like the block lying beond the end of the disk. So, the fix I came up with is to only loop when grow_buffers fails due to memory allocation issues (return value of 0). The attached patch was tested by myself, Torsten, and Rich, and was found to resolve the problem in call cases. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Reported-and-Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com> Tested-by: Richard W.M. Jones <rjones@redhat.com> Reviewed-by: Josh Boyer <jwboyer@redhat.com> Cc: Stable <stable@vger.kernel.org> # 3.0+ [ Jens is on vacation, taking this directly - Linus ] -- Stable Notes: this patch requires backport to 3.0, 3.2 and 3.3. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | fat: fix non-atomic NFS i_pos readSteven J. Magnani2012-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fat_encode_fh() can fetch an invalid i_pos value on systems where 64-bit accesses are not atomic. Make it use the same accessor as the rest of the FAT code. Signed-off-by: Steven J. Magnani <steve@digidescorp.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | fs: ramfs: file-nommu: add SetPageUptodate()Bob Liu2012-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a bug in the below scenario for !CONFIG_MMU: 1. create a new file 2. mmap the file and write to it 3. read the file can't get the correct value Because sys_read() -> generic_file_aio_read() -> simple_readpage() -> clear_page() which causes the page to be zeroed. Add SetPageUptodate() to ramfs_nommu_expand_for_mapping() so that generic_file_aio_read() do not call simple_readpage(). Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | ocfs2: fix NULL pointer dereference in __ocfs2_change_file_space()Luis Henriques2012-07-11
| |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As ocfs2_fallocate() will invoke __ocfs2_change_file_space() with a NULL as the first parameter (file), it may trigger a NULL pointer dereferrence due to a missing check. Addresses http://bugs.launchpad.net/bugs/1006012 Signed-off-by: Luis Henriques <luis.henriques@canonical.com> Reported-by: Bret Towe <magnade@gmail.com> Tested-by: Bret Towe <magnade@gmail.com> Cc: Sunil Mushran <sunil.mushran@oracle.com> Acked-by: Joel Becker <jlbec@evilplan.org> Acked-by: Mark Fasheh <mfasheh@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | vfs: make O_PATH file descriptors usable for 'fchdir()'Linus Torvalds2012-07-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already use them for openat() and friends, but fchdir() also wants to be able to use O_PATH file descriptors. This should make it comparable to the O_SEARCH of Solaris. In particular, O_PATH allows you to access (not-quite-open) a directory you don't have read persmission to, only execute permission. Noticed during development of multithread support for ksh93. Reported-by: ольга крыжановская <olga.kryzhanovska@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@kernel.org # O_PATH introduced in 3.0+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge tag 'ecryptfs-3.5-rc6-fixes' of ↵Linus Torvalds2012-07-06
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs Pull eCryptfs fixes from Tyler Hicks: "Fixes an incorrect access mode check when preparing to open a file in the lower filesystem. This isn't an urgent fix, but it is simple and the check was obviously incorrect. Also fixes a couple important bugs in the eCryptfs miscdev interface. These changes are low risk due to the small number of users that use the miscdev interface. I was able to keep the changes minimal and I have some cleaner, more complete changes queued up for the next merge window that will build on these patches." * tag 'ecryptfs-3.5-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs: eCryptfs: Gracefully refuse miscdev file ops on inherited/passed files eCryptfs: Fix lockdep warning in miscdev operations eCryptfs: Properly check for O_RDONLY flag before doing privileged open
| * | | | eCryptfs: Gracefully refuse miscdev file ops on inherited/passed filesTyler Hicks2012-07-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | File operations on /dev/ecryptfs would BUG() when the operations were performed by processes other than the process that originally opened the file. This could happen with open files inherited after fork() or file descriptors passed through IPC mechanisms. Rather than calling BUG(), an error code can be safely returned in most situations. In ecryptfs_miscdev_release(), eCryptfs still needs to handle the release even if the last file reference is being held by a process that didn't originally open the file. ecryptfs_find_daemon_by_euid() will not be successful, so a pointer to the daemon is stored in the file's private_data. The private_data pointer is initialized when the miscdev file is opened and only used when the file is released. https://launchpad.net/bugs/994247 Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Reported-by: Sasha Levin <levinsasha928@gmail.com> Tested-by: Sasha Levin <levinsasha928@gmail.com>
| * | | | eCryptfs: Fix lockdep warning in miscdev operationsTyler Hicks2012-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't grab the daemon mutex while holding the message context mutex. Addresses this lockdep warning: ecryptfsd/2141 is trying to acquire lock: (&ecryptfs_msg_ctx_arr[i].mux){+.+.+.}, at: [<ffffffffa029c213>] ecryptfs_miscdev_read+0x143/0x470 [ecryptfs] but task is already holding lock: (&(*daemon)->mux){+.+...}, at: [<ffffffffa029c2ec>] ecryptfs_miscdev_read+0x21c/0x470 [ecryptfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&(*daemon)->mux){+.+...}: [<ffffffff810a3b8d>] lock_acquire+0x9d/0x220 [<ffffffff8151c6da>] __mutex_lock_common+0x5a/0x4b0 [<ffffffff8151cc64>] mutex_lock_nested+0x44/0x50 [<ffffffffa029c5d7>] ecryptfs_send_miscdev+0x97/0x120 [ecryptfs] [<ffffffffa029b744>] ecryptfs_send_message+0x134/0x1e0 [ecryptfs] [<ffffffffa029a24e>] ecryptfs_generate_key_packet_set+0x2fe/0xa80 [ecryptfs] [<ffffffffa02960f8>] ecryptfs_write_metadata+0x108/0x250 [ecryptfs] [<ffffffffa0290f80>] ecryptfs_create+0x130/0x250 [ecryptfs] [<ffffffff811963a4>] vfs_create+0xb4/0x120 [<ffffffff81197865>] do_last+0x8c5/0xa10 [<ffffffff811998f9>] path_openat+0xd9/0x460 [<ffffffff81199da2>] do_filp_open+0x42/0xa0 [<ffffffff81187998>] do_sys_open+0xf8/0x1d0 [<ffffffff81187a91>] sys_open+0x21/0x30 [<ffffffff81527d69>] system_call_fastpath+0x16/0x1b -> #0 (&ecryptfs_msg_ctx_arr[i].mux){+.+.+.}: [<ffffffff810a3418>] __lock_acquire+0x1bf8/0x1c50 [<ffffffff810a3b8d>] lock_acquire+0x9d/0x220 [<ffffffff8151c6da>] __mutex_lock_common+0x5a/0x4b0 [<ffffffff8151cc64>] mutex_lock_nested+0x44/0x50 [<ffffffffa029c213>] ecryptfs_miscdev_read+0x143/0x470 [ecryptfs] [<ffffffff811887d3>] vfs_read+0xb3/0x180 [<ffffffff811888ed>] sys_read+0x4d/0x90 [<ffffffff81527d69>] system_call_fastpath+0x16/0x1b Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
| * | | | eCryptfs: Properly check for O_RDONLY flag before doing privileged openTyler Hicks2012-07-03
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the first attempt at opening the lower file read/write fails, eCryptfs will retry using a privileged kthread. However, the privileged retry should not happen if the lower file's inode is read-only because a read/write open will still be unsuccessful. The check for determining if the open should be retried was intended to be based on the access mode of the lower file's open flags being O_RDONLY, but the check was incorrectly performed. This would cause the open to be retried by the privileged kthread, resulting in a second failed open of the lower file. This patch corrects the check to determine if the open request should be handled by the privileged kthread. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Dan Carpenter <dan.carpenter@oracle.com>