aboutsummaryrefslogtreecommitdiffstats
path: root/fs
diff options
context:
space:
mode:
authorNick Piggin <npiggin@suse.de>2009-04-30 18:08:16 -0400
committerLinus Torvalds <torvalds@linux-foundation.org>2009-05-02 18:36:09 -0400
commitb827e496c893de0c0f142abfaeb8730a2fd6b37f (patch)
treea86aecd5d811f9306b9662ceb5a5a45091b62b97 /fs
parenta5fc1abe438b87a9d128beebc377f78e2681a76d (diff)
mm: close page_mkwrite races
Change page_mkwrite to allow implementations to return with the page locked, and also change it's callers (in page fault paths) to hold the lock until the page is marked dirty. This allows the filesystem to have full control of page dirtying events coming from the VM. Rather than simply hold the page locked over the page_mkwrite call, we call page_mkwrite with the page unlocked and allow callers to return with it locked, so filesystems can avoid LOR conditions with page lock. The problem with the current scheme is this: a filesystem that wants to associate some metadata with a page as long as the page is dirty, will perform this manipulation in its ->page_mkwrite. It currently then must return with the page unlocked and may not hold any other locks (according to existing page_mkwrite convention). In this window, the VM could write out the page, clearing page-dirty. The filesystem has no good way to detect that a dirty pte is about to be attached, so it will happily write out the page, at which point, the filesystem may manipulate the metadata to reflect that the page is no longer dirty. It is not always possible to perform the required metadata manipulation in ->set_page_dirty, because that function cannot block or fail. The filesystem may need to allocate some data structure, for example. And the VM cannot mark the pte dirty before page_mkwrite, because page_mkwrite is allowed to fail, so we must not allow any window where the page could be written to if page_mkwrite does fail. This solution of holding the page locked over the 3 critical operations (page_mkwrite, setting the pte dirty, and finally setting the page dirty) closes out races nicely, preventing page cleaning for writeout being initiated in that window. This provides the filesystem with a strong synchronisation against the VM here. - Sage needs this race closed for ceph filesystem. - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913). - I need it for fsblock. - I suspect other filesystems may need it too (eg. btrfs). - I have converted buffer.c to the new locking. Even simple block allocation under dirty pages might be susceptible to i_size changing under partial page at the end of file (we also have a buffer.c-side problem here, but it cannot be fixed properly without this patch). - Other filesystems (eg. NFS, maybe btrfs) will need to change their page_mkwrite functions themselves. [ This also moves page_mkwrite another step closer to fault, which should eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a filesystem calldown and page lock/unlock cycle in __do_fault. ] [akpm@linux-foundation.org: fix derefs of NULL ->mapping] Cc: Sage Weil <sage@newdream.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'fs')
-rw-r--r--fs/buffer.c10
1 files changed, 6 insertions, 4 deletions
diff --git a/fs/buffer.c b/fs/buffer.c
index b3e5be7514f5..aed297739eb0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2397,7 +2397,8 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
2397 if ((page->mapping != inode->i_mapping) || 2397 if ((page->mapping != inode->i_mapping) ||
2398 (page_offset(page) > size)) { 2398 (page_offset(page) > size)) {
2399 /* page got truncated out from underneath us */ 2399 /* page got truncated out from underneath us */
2400 goto out_unlock; 2400 unlock_page(page);
2401 goto out;
2401 } 2402 }
2402 2403
2403 /* page is wholly or partially inside EOF */ 2404 /* page is wholly or partially inside EOF */
@@ -2411,14 +2412,15 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
2411 ret = block_commit_write(page, 0, end); 2412 ret = block_commit_write(page, 0, end);
2412 2413
2413 if (unlikely(ret)) { 2414 if (unlikely(ret)) {
2415 unlock_page(page);
2414 if (ret == -ENOMEM) 2416 if (ret == -ENOMEM)
2415 ret = VM_FAULT_OOM; 2417 ret = VM_FAULT_OOM;
2416 else /* -ENOSPC, -EIO, etc */ 2418 else /* -ENOSPC, -EIO, etc */
2417 ret = VM_FAULT_SIGBUS; 2419 ret = VM_FAULT_SIGBUS;
2418 } 2420 } else
2421 ret = VM_FAULT_LOCKED;
2419 2422
2420out_unlock: 2423out:
2421 unlock_page(page);
2422 return ret; 2424 return ret;
2423} 2425}
2424 2426