aboutsummaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAge
...
| * | | | Btrfs: turbo charge fsyncJosef Bacik2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | Btrfs: fix possible corruption when fsyncing written prealloced extentsJosef Bacik2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While working on my fsync patch my fsync tester kept hitting mismatching md5sums when I would randomly write to a prealloc'ed region, syncfs() and then write to the prealloced region some more and then fsync() and then immediately reboot. This is because the tree logging code will skip writing csums for file extents who's generation is less than the current running transaction. When we mark extents as written we haven't been updating their generation so they were always being skipped. This wouldn't happen if you were to preallocate and then write in the same transaction, but if you for example prealloced a VM you could definitely run into this problem. This patch makes my fsync tester happy again. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | Btrfs: do not allocate chunks as agressivelyJosef Bacik2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Swinging this pendulum back the other way. We've been allocating chunks up to 2% of the disk no matter how much we actually have allocated. So instead fix this calculation to only allocate chunks if we have more than 80% of the space available allocated. Please test this as it will likely cause all sorts of ENOSPC problems to pop up suddenly. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | Btrfs: update last trans if we don't update the inodeJosef Bacik2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a completely impossible situation to hit where you can preallocate a file, fsync it, write into the preallocated region, have the transaction commit twice and then fsync and then immediately lose power and lose all of the contents of the write. This patch fixes this just so I feel better about the situation and because it is lightweight, we just update the last_trans when we finish an ordered IO and we don't update the inode itself. This way we are completely safe and I feel better. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | Btrfs: fix gcc warnings for 32bit compilesJan Schmidt2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
| * | | | Btrfs: fix btrfs send for inline items and compressionChris Mason2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The btrfs send code was assuming the offset of the file item into the extent translated to bytes on disk. If we're compressed, this isn't true, and so it was off into extents owned by other files. It was also improperly handling inline extents. This solves a crash where we may have gone past the end of the file extent item by not testing early enough for an inline extent. It also solves problems where we have a whole between the end of the inline item and the start of the full extent. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
| * | | | Btrfs: don't treat top/root directory inode as deleted/reusedAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can't do the deleted/reused logic for top/root inodes as it would create a stream that tries to delete and recreate the root dir. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: ignore non-FS inodes for send/receiveAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have to ignore inode/space cache objects in send/receive. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: pass root instead of parent_root to iterate_inode_refAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to pass the root that we determined earlier to iterate_inode_ref. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: use <= instead of < in is_extent_unchangedAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Used the wrong compare operator here. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: fix check for changed extent in is_extent_unchangedAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous check was working fine, but this check should be easier to read. Also, we could theoritically have some exotic bugs with the previous checks. Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: free nce and nce_head on error in name_cache_insertAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both were leaked in case of error. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: remove unused tmp_path from iterate_dir_itemAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A leftover from older code and unused now. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: code cleanups for send/receiveAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Doing some code cleanups as suggested by Arne. Changes do not change any logic. Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: add/fix comments/documentation for send/receiveAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As the subject already said, add/fix comments. Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: update send_progress at correct placesAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Updating send_progress in process_recorded_refs was not correct. It got updated too early in the cur_inode_new_gen case. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: make aux field of ulist 64 bitAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs send/receive uses the aux field to store inode numbers. On 32 bit machines this may become a problem. Also fix all users of ulist_add and ulist_add_merged. Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: fix use of radix_tree for name_cache in send/receiveAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can't easily use the index of the radix tree for inums as the radix tree uses 32bit indexes on 32bit kernels. For 32bit kernels, we now use the lower 32bit of the inum as index and an additional list to store multiple entries per radix tree entry. Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: fix memory leak for name_cache in send/receiveAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When everything is done, name_cache_free is called which however forgot to call kfree on the cache entries. Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: don't break in the final loop of find_extent_cloneAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we break, we may miss the clone from send_root which we prefer over all other clones. Commit is a result of Arne's review. Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: use normal return path for root == send_root caseAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't have a seperate return path for the mentioned case. Now we do the same "take lowest inode/offset" logic for all found clones. Commit is a result of Arne's review. Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: use kmalloc instead of stack for backref_ctxAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure to never get in trouble due to the backref_ctx which was on the stack before. Commit is a result of Arne's review. Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: rename backref_ctx::found_in_send_root to found_itselfAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new name should be easier to understand/read. Commit is a result of Arne's review. Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: remove unused use_list from send/receive codeAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | use_list is a leftover and unused. Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: add correct parent to check_dirs when dir got movedAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We only added the parent for the new position of a moved dir. We also need to add the old parent of the moved dir. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: remove unused code with #if 0Alexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fs_path_remove is not used at the moment due to a previous patch. Remove it for now (with #if 0) to avoid compile warnings. Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: add missing check for dir != tmp_dir to is_first_refAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We missed that check which resultet in all refs with the same name being reported as first_ref. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: fix cur_ino < parent_ino case for send/receiveAlexander Block2012-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the current inodes inum is smaller then the inum of the parent directory strange things were happending due to wrong path resolution and other bugs. Fix this with a new approach for the problem. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>
| * | | | Btrfs: add rdev to get_inode_info in send/receiveAlexander Block2012-10-01
| | |_|/ | |/| | | | | | | | | | | | | | | | | | We need rdev in the next commit. Signed-off-by: Alexander Block <ablock84@googlemail.com>
* | | | Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2012-10-09
|\ \ \ \ | |_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | Pull CIFS fixes from Steve French. * 'for-linus' of git://git.samba.org/sfrench/cifs-2.6: cifs: reinstate the forcegid option Convert properly UTF-8 to UTF-16 [CIFS] WARN_ON_ONCE if kernel_sendmsg() returns -ENOSPC
| * | | cifs: reinstate the forcegid optionJeff Layton2012-10-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Apparently this was lost when we converted to the standard option parser in 8830d7e07a5e38bc47650a7554b7c1cfd49902bf Cc: Sachin Prabhu <sprabhu@redhat.com> Cc: stable@vger.kernel.org # v3.4+ Reported-by: Gregory Lee Bartholomew <gregory.lee.bartholomew@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
| * | | Convert properly UTF-8 to UTF-16Frediano Ziglio2012-10-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wchar_t is currently 16bit so converting a utf8 encoded characters not in plane 0 (>= 0x10000) to wchar_t (that is calling char2uni) lead to a -EINVAL return. This patch detect utf8 in cifs_strtoUTF16 and add special code calling utf8s_to_utf16s. Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
| * | | [CIFS] WARN_ON_ONCE if kernel_sendmsg() returns -ENOSPCSteve French2012-10-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kernel_sendmsg() is less likely to return -ENOSPC and it might be a bug to do so. However, in the past there might have been cases where a -ENOSPC was returned from a low level driver. Add a WARN_ON_ONCE() to ensure that it is safe to assume that -ENOSPC is no longer returned. This -ENOSPC specific handling will be removed once we are sure it is no longer returned. Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.com> Signed-off-by: Steve French <smfrench@gmail.com>
* | | | Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds2012-10-09
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge patches from Andrew Morton: "A few misc things and very nearly all of the MM tree. A tremendous amount of stuff (again), including a significant rbtree library rework." * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (160 commits) sparc64: Support transparent huge pages. mm: thp: Use more portable PMD clearing sequenece in zap_huge_pmd(). mm: Add and use update_mmu_cache_pmd() in transparent huge page code. sparc64: Document PGD and PMD layout. sparc64: Eliminate PTE table memory wastage. sparc64: Halve the size of PTE tables sparc64: Only support 4MB huge pages and 8KB base pages. memory-hotplug: suppress "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning mm: memcg: clean up mm_match_cgroup() signature mm: document PageHuge somewhat mm: use %pK for /proc/vmallocinfo mm, thp: fix mlock statistics mm, thp: fix mapped pages avoiding unevictable list on mlock memory-hotplug: update memory block's state and notify userspace memory-hotplug: preparation to notify memory block's state at memory hot remove mm: avoid section mismatch warning for memblock_type_name make GFP_NOTRACK definition unconditional cma: decrease cc.nr_migratepages after reclaiming pagelist CMA: migrate mlocked pages kpageflags: fix wrong KPF_THP on non-huge compound pages ...
| * | | | kpageflags: fix wrong KPF_THP on non-huge compound pagesNaoya Horiguchi2012-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KPF_THP can be set on non-huge compound pages (like slab pages or pages allocated by drivers with __GFP_COMP) because PageTransCompound only checks PG_head and PG_tail. Obviously this is a bug and breaks user space applications which look for thp via /proc/kpageflags. This patch rules out setting KPF_THP wrongly by additionally checking PageLRU on the head pages. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | fs/fs-writeback.c: remove unneccesary parameter of __writeback_single_inode()Yan Hong2012-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The parameter 'wb' is never used in this function. Signed-off-by: Yan Hong <clouds.yan@gmail.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm: avoid taking rmap locks in move_ptes()Michel Lespinasse2012-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During mremap(), the destination VMA is generally placed after the original vma in rmap traversal order: in move_vma(), we always have new_pgoff >= vma->vm_pgoff, and as a result new_vma->vm_pgoff >= vma->vm_pgoff unless vma_merge() merged the new vma with an adjacent one. When the destination VMA is placed after the original in rmap traversal order, we can avoid taking the rmap locks in move_ptes(). Essentially, this reintroduces the optimization that had been disabled in "mm anon rmap: remove anon_vma_moveto_tail". The difference is that we don't try to impose the rmap traversal order; instead we just rely on things being in the desired order in the common case and fall back to taking locks in the uncommon case. Also we skip the i_mmap_mutex in addition to the anon_vma lock: in both cases, the vmas are traversed in increasing vm_pgoff order with ties resolved in tree insertion order. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm: replace vma prio_tree with an interval treeMichel Lespinasse2012-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement an interval tree as a replacement for the VMA prio_tree. The algorithms are similar to lib/interval_tree.c; however that code can't be directly reused as the interval endpoints are not explicitly stored in the VMA. So instead, the common algorithm is moved into a template and the details (node type, how to get interval endpoints from the node, etc) are filled in using the C preprocessor. Once the interval tree functions are available, using them as a replacement to the VMA prio tree is a relatively simple, mechanical job. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | rbtree: move some implementation details from rbtree.h to rbtree.cMichel Lespinasse2012-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rbtree users must use the documented APIs to manipulate the tree structure. Low-level helpers to manipulate node colors and parenthood are not part of that API, so move them to lib/rbtree.c [dwmw2@infradead.org: fix jffs2 build issue due to renamed __rb_parent_color field] Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | rbtree: fix incorrect rbtree node insertion in fs/proc/proc_sysctl.cMichel Lespinasse2012-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recently added code to use rbtrees in sysctl did not follow the proper rbtree interface on insertion - it was calling rb_link_node() which inserts a new node into the binary tree, but missed the call to rb_insert_color() which properly balances the rbtree and establishes all expected rbtree invariants. I found out about this only because faulty commit also used rb_init_node(), which I am removing within this patchset. But I think it's an easy mistake to make, and it makes me wonder if we should change the rbtree API so that insertions would be done with a single rb_insert() call (even if its implementation could still inline the rb_link_node() part and call a private __rb_insert_color function to do the rebalancing). Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | rbtree: empty nodes have no colorMichel Lespinasse2012-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Empty nodes have no color. We can make use of this property to simplify the code emitted by the RB_EMPTY_NODE and RB_CLEAR_NODE macros. Also, we can get rid of the rb_init_node function which had been introduced by commit 88d19cf37952 ("timers: Add rb_init_node() to allow for stack allocated rb nodes") to avoid some issue with the empty node's color not being initialized. I'm not sure what the RB_EMPTY_NODE checks in rb_prev() / rb_next() are doing there, though. axboe introduced them in commit 10fd48f2376d ("rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev"). The way I see it, the 'empty node' abstraction is only used by rbtree users to flag nodes that they haven't inserted in any rbtree, so asking the predecessor or successor of such nodes doesn't make any sense. One final rb_init_node() caller was recently added in sysctl code to implement faster sysctl name lookups. This code doesn't make use of RB_EMPTY_NODE at all, and from what I could see it only called rb_init_node() under the mistaken assumption that such initialization was required before node insertion. [sfr@canb.auug.org.au: fix net/ceph/osd_client.c build] Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | oom: remove deprecated oom_adjDavidlohr Bueso2012-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The deprecated /proc/<pid>/oom_adj is scheduled for removal this month. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Acked-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm: kill vma flag VM_RESERVED and mm->reserved_vm counterKonstantin Khlebnikov2012-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA, currently it lost original meaning but still has some effects: | effect | alternative flags -+------------------------+--------------------------------------------- 1| account as reserved_vm | VM_IO 2| skip in core dump | VM_IO, VM_DONTDUMP 3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP 4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP This patch removes reserved_vm counter from mm_struct. Seems like nobody cares about it, it does not exported into userspace directly, it only reduces total_vm showed in proc. Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP. remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP. remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP. [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup] Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm: prepare VM_DONTDUMP for using in driversKonstantin Khlebnikov2012-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename VM_NODUMP into VM_DONTDUMP: this name matches other negative flags: VM_DONTEXPAND, VM_DONTCOPY. Currently this flag used only for sys_madvise. The next patch will use it for replacing the outdated flag VM_RESERVED. Also forbid madvise(MADV_DODUMP) for special kernel mappings VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP) Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm: kill vma flag VM_CAN_NONLINEARKonstantin Khlebnikov2012-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move actual pte filling for non-linear file mappings into the new special vma operation: ->remap_pages(). Filesystems must implement this method to get non-linear mapping support, if it uses filemap_fault() then generic_file_remap_pages() can be used. Now device drivers can implement this method and obtain nonlinear vma support. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | Merge branch 'linux-next' of git://git.open-osd.org/linux-open-osdLinus Torvalds2012-10-09
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull exofs update from Boaz Harrosh: "Just three one liners" * 'linux-next' of git://git.open-osd.org/linux-open-osd: pnfs_osd_xdr: Remove unused #include from pnfs_osd_xdr.h ore: signedness bug in _sp2d_min_pg() exofs: check for allocation failure in uri_store()
| * | | | | ore: signedness bug in _sp2d_min_pg()Dan Carpenter2012-10-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This for loop doesn't work correctly when "p" is unsigned. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
| * | | | | exofs: check for allocation failure in uri_store()Alexey Khoroshilov2012-08-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no memory allocation failure check in uri_store(). That can lead to NULL pointer dereference. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* | | | | | Fix F_DUPFD_CLOEXEC breakageAl Viro2012-10-09
| |/ / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a braino in F_DUPFD_CLOEXEC; f_dupfd() expects flags for alloc_fd(), get_unused_fd() etc and there clone-on-exec if O_CLOEXEC, not FD_CLOEXEC. Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | exec: make de_thread() killableOleg Nesterov2012-10-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change de_thread() to use KILLABLE rather than UNINTERRUPTIBLE while waiting for other threads. The only complication is that we should clear ->group_exit_task and ->notify_count before we return, and we should do this under tasklist_lock. -EAGAIN is used to match the initial signal_group_exit() check/return, it doesn't really matter. This fixes the (unlikely) race with coredump. de_thread() checks signal_group_exit() before it starts to kill the subthreads, but this can't help if another CLONE_VM (but non CLONE_THREAD) task starts the coredumping after de_thread() unlocks ->siglock. In this case the killed sub-thread can block in exit_mm() waiting for coredump_finish(), execing thread waits for that sub-thead, and the coredumping thread waits for execing thread. Deadlock. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>