aboutsummaryrefslogtreecommitdiffstats
path: root/fs/notify
diff options
context:
space:
mode:
authorJerry Hoemann <jerry.hoemann@hp.com>2014-10-29 17:50:22 -0400
committerLinus Torvalds <torvalds@linux-foundation.org>2014-10-29 19:33:14 -0400
commit6424babfd68dd8a83d9c60a5242d27038856599f (patch)
treea891bd1afc06d2e3f0a93ec5869ea2b9766ecfa5 /fs/notify
parent6ea41c0c0aa37d87ef5dd0d14535d2e1e195cd83 (diff)
fsnotify: next_i is freed during fsnotify_unmount_inodes.
During file system stress testing on 3.10 and 3.12 based kernels, the umount command occasionally hung in fsnotify_unmount_inodes in the section of code: spin_lock(&inode->i_lock); if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) { spin_unlock(&inode->i_lock); continue; } As this section of code holds the global inode_sb_list_lock, eventually the system hangs trying to acquire the lock. Multiple crash dumps showed: The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point back at itself. As this is not the value of list upon entry to the function, the kernel never exits the loop. To help narrow down problem, the call to list_del_init in inode_sb_list_del was changed to list_del. This poisons the pointers in the i_sb_list and causes a kernel to panic if it transverse a freed inode. Subsequent stress testing paniced in fsnotify_unmount_inodes at the bottom of the list_for_each_entry_safe loop showing next_i had become free. We believe the root cause of the problem is that next_i is being freed during the window of time that the list_for_each_entry_safe loop temporarily releases inode_sb_list_lock to call fsnotify and fsnotify_inode_delete. The code in fsnotify_unmount_inodes attempts to prevent the freeing of inode and next_i by calling __iget. However, the code doesn't do the __iget call on next_i if i_count == 0 or if i_state & (I_FREEING | I_WILL_FREE) The patch addresses this issue by advancing next_i in the above two cases until we either find a next_i which we can __iget or we reach the end of the list. This makes the handling of next_i more closely match the handling of the variable "inode." The time to reproduce the hang is highly variable (from hours to days.) We ran the stress test on a 3.10 kernel with the proposed patch for a week without failure. During list_for_each_entry_safe, next_i is becoming free causing the loop to never terminate. Advance next_i in those cases where __iget is not done. Signed-off-by: Jerry Hoemann <jerry.hoemann@hp.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ken Helias <kenhelias@firemail.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'fs/notify')
-rw-r--r--fs/notify/inode_mark.c17
1 files changed, 11 insertions, 6 deletions
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 9ce062218de9..e8497144b323 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -288,20 +288,25 @@ void fsnotify_unmount_inodes(struct list_head *list)
288 spin_unlock(&inode->i_lock); 288 spin_unlock(&inode->i_lock);
289 289
290 /* In case the dropping of a reference would nuke next_i. */ 290 /* In case the dropping of a reference would nuke next_i. */
291 if ((&next_i->i_sb_list != list) && 291 while (&next_i->i_sb_list != list) {
292 atomic_read(&next_i->i_count)) {
293 spin_lock(&next_i->i_lock); 292 spin_lock(&next_i->i_lock);
294 if (!(next_i->i_state & (I_FREEING | I_WILL_FREE))) { 293 if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
294 atomic_read(&next_i->i_count)) {
295 __iget(next_i); 295 __iget(next_i);
296 need_iput = next_i; 296 need_iput = next_i;
297 spin_unlock(&next_i->i_lock);
298 break;
297 } 299 }
298 spin_unlock(&next_i->i_lock); 300 spin_unlock(&next_i->i_lock);
301 next_i = list_entry(next_i->i_sb_list.next,
302 struct inode, i_sb_list);
299 } 303 }
300 304
301 /* 305 /*
302 * We can safely drop inode_sb_list_lock here because we hold 306 * We can safely drop inode_sb_list_lock here because either
303 * references on both inode and next_i. Also no new inodes 307 * we actually hold references on both inode and next_i or
304 * will be added since the umount has begun. 308 * end of list. Also no new inodes will be added since the
309 * umount has begun.
305 */ 310 */
306 spin_unlock(&inode_sb_list_lock); 311 spin_unlock(&inode_sb_list_lock);
307 312