aboutsummaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAge
* Fix typo s/contenious/continuous in commentNikanth Karthikesan2010-08-18
| | | | | | | Fix typo s/contenious/continuous in comment. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* Merge branch 'master' into for-nextJiri Kosina2010-08-11
|\ | | | | | | | | Conflicts: fs/exofs/inode.c
| * Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2010-08-10
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits) block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n xen-blkfront: fix missing out label blkdev: fix blkdev_issue_zeroout return value block: update request stacking methods to support discards block: fix missing export of blk_types.h writeback: fix bad _bh spinlock nesting drbd: revert "delay probes", feature is being re-implemented differently drbd: Initialize all members of sync_conf to their defaults [Bugz 315] drbd: Disable delay probes for the upcomming release writeback: cleanup bdi_register writeback: add new tracepoints writeback: remove unnecessary init_timer call writeback: optimize periodic bdi thread wakeups writeback: prevent unnecessary bdi threads wakeups writeback: move bdi threads exiting logic to the forker thread writeback: restructure bdi forker loop a little writeback: move last_active to bdi writeback: do not remove bdi from bdi_list writeback: simplify bdi code a little writeback: do not lose wake-ups in bdi threads ... Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and drivers/scsi/scsi_error.c as per Jens.
| | * writeback: fix bad _bh spinlock nestingJens Axboe2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a bug where a lock is _bh nested within another _bh lock, but forgets to use the _bh variant for unlock. Further more, it's not necessary to test _bh locks, the inner lock can just use spin_lock(). So fix up the bug by making that change. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: cleanup bdi_registerArtem Bityutskiy2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch makes sure we first initialize everything and set the BDI_registered flag, and only after this we add the bdi to 'bdi_list'. Current code adds the bdi to the list too early, and as a result I the WARN(!test_bit(BDI_registered, &bdi->state) in bdi forker is triggered. Also, it is in general good practice to make things visible only when they are fully initialized. Also, this patch does few micro clean-ups: 1. Removes the 'exit' label which does not do anything, just returns. This allows to get rid of few braces and 'ret' variable and make the code smaller. 2. If 'kthread_run()' fails, remove the error code it returns, not hard-coded '-ENOMEM'. Theoretically, some day 'kthread_run()' can return something else. Also, in case of failure it is not necessary to set 'bdi->wb.task' to NULL. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: add new tracepointsArtem Bityutskiy2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add 2 new trace points to the periodic write-back wake up case, just like we do in the 'bdi_queue_work()' function. Namely, introduce: 1. trace_writeback_wake_thread(bdi) 2. trace_writeback_wake_forker_thread(bdi) The first event is triggered every time we wake up a bdi thread to start periodic background write-out. The second event is triggered only when the bdi thread does not exist and should be created by the forker thread. This patch was suggested by Dave Chinner and Christoph Hellwig. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: remove unnecessary init_timer callArtem Bityutskiy2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'setup_timer()' function also calls 'init_timer()', so the extra 'init_timer()' call is not needed. Indeed, 'setup_timer()' is basically 'init_timer()' plus callback function and data pointers initialization. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: optimize periodic bdi thread wakeupsArtem Bityutskiy2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whe the first inode for a bdi is marked dirty, we wake up the bdi thread which should take care of the periodic background write-out. However, the write-out will actually start only 'dirty_writeback_interval' centisecs later, so we can delay the wake-up. This change was requested by Nick Piggin who pointed out that if we delay the wake-up, we weed out 2 unnecessary contex switches, which matters because '__mark_inode_dirty()' is a hot-path function. This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which sets up a timer to wake-up the bdi thread and returns. So the wake-up is delayed. We also delete the timer in bdi threads just before writing-back. And synchronously delete it when unregistering bdi. At the unregister point the bdi does not have any users, so no one can arm it again. Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes this change as well. This patch also moves the 'bdi_wb_init()' function down in the file to avoid forward-declaration of 'bdi_wakeup_thread_delayed()'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: prevent unnecessary bdi threads wakeupsArtem Bityutskiy2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very bad for battery-driven devices. There are two types of activities bdi threads do: 1. process bdi works from the 'bdi->work_list' 2. periodic write-back So there are 2 sources of wake-up events for bdi threads: 1. 'bdi_queue_work()' - submits bdi works 2. '__mark_inode_dirty()' - adds dirty I/O to bdi's The former already has bdi wake-up code. The latter does not, and this patch adds it. '__mark_inode_dirty()' is hot-path function, but this patch adds another 'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when the bdi has no dirty inodes. So adding this spinlock should be fine and should not affect performance. This patch makes sure bdi threads and the forker thread do not wake-up if there is nothing to do. The forker thread will nevertheless wake up at least every 5 min. to check whether it has to kill a bdi thread. This can also be optimized, but is not worth it. This patch also tidies up the warning about unregistered bid, and turns it from an ugly crocodile to a simple 'WARN()' statement. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: move bdi threads exiting logic to the forker threadArtem Bityutskiy2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, bdi threads can decide to exit if there were no useful activities for 5 minutes. However, this causes nasty races: we can easily oops in the 'bdi_queue_work()' if the bdi thread decides to exit while we are waking it up. And even if we do not oops, but the bdi tread exits immediately after we wake it up, we'd lose the wake-up event and have an unnecessary delay (up to 5 secs) in the bdi work processing. This patch makes the forker thread to be the central place which not only creates bdi threads, but also kills them if they were inactive long enough. This better design-wise. Another reason why this change was done is to prepare for the further changes which will prevent the bdi threads from waking up every 5 sec and wasting power. Indeed, when the task does not wake up periodically anymore, it won't be able to exit either. This patch also moves the the 'wake_up_bit()' call from the bdi thread to the forker thread as well. So now the forker thread sets the BDI_pending bit, then forks the task or kills it, then clears the bit and wakes up the waiting process. The only process which may wain on the bit is 'bdi_wb_shutdown()'. This function was changed as well - now it first removes the bdi from the 'bdi_list', then waits on the 'BDI_pending' bit. Once it wakes up, it is guaranteed that the forker thread won't race with it, because the bdi is not visible. Note, the forker thread sets the 'BDI_pending' bit under the 'bdi->wb_lock' which is essential for proper serialization. And additionally, when we change 'bdi->wb.task', we now take the 'bdi->work_lock', to make sure that we do not lose wake-ups which we otherwise would when raced with, say, 'bdi_queue_work()'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: restructure bdi forker loop a littleArtem Bityutskiy2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch re-structures the bdi forker a little: 1. Add 'bdi_cap_flush_forker(bdi)' condition check to the bdi loop. The reason for this is that the forker thread can start _before_ the 'BDI_registered' flag is set (see 'bdi_register()'), so the WARN() statement will fire for the default bdi. I observed this warning at boot-up. 2. Introduce an enum 'action' and use "switch" statement in the outer loop. This is a preparation to the further patch which will teach the forker thread killing bdi threads, so we'll have another case in the "switch" statement. This change was suggested by Christoph Hellwig. This patch is just a small step towards the coming change where the forker thread will kill the bdi threads. It should simplify reviewing the following changes, which would otherwise be larger. This patch also amends comments a little. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: do not remove bdi from bdi_listArtem Bityutskiy2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The forker thread removes bdis from 'bdi_list' before forking the bdi thread. But this is wrong for at least 2 reasons. Reason #1: if we temporary remove a bdi from the list, we may miss works which would otherwise be given to us. Reason #2: this is racy; indeed, 'bdi_wb_shutdown()' expects that bdis are always in the 'bdi_list' (see 'bdi_remove_from_list()'), and when it races with the forker thread, it can shut down the bdi thread at the same time as the forker creates it. This patch makes sure the forker thread never removes bdis from 'bdi_list' (which was suggested by Christoph Hellwig). In order to make sure that we do not race with 'bdi_wb_shutdown()', we have to hold the 'bdi_lock' while walking the 'bdi_list' and setting the 'BDI_pending' flag. NOTE! The error path is interesting. Currently, when we fail to create a bdi thread, we move the bdi to the tail of 'bdi_list'. But if we never remove the bdi from the list, we cannot move it to the tail either, because then we can mess up the RCU readers which walk the list. And also, we'll have the race described above in "Reason #2". But I not think that adding to the tail is any important so I just do not do that. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: simplify bdi code a littleArtem Bityutskiy2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch simplifies bdi code a little by removing the 'pending_list' which is redundant. Indeed, currently the forker thread ('bdi_forker_thread()') is working like this: 1. In a loop, fetch all bdi's which have works but have no writeback thread and move them to the 'pending_list'. 2. If the list is empty, sleep for 5 sec. 3. Otherwise, take one bdi from the list, fork the writeback thread for this bdi, and repeat the loop. IOW, it first moves everything to the 'pending_list', then process only one element, and so on. This patch simplifies the algorithm, which is now as follows. 1. Find the first bdi which has a work and remove it from the global list of bdi's (bdi_list). 2. If there was not such bdi, sleep 5 sec. 3. Fork the writeback thread for this bdi and repeat the loop. IOW, now we find the first bdi to process, process it, and so on. This is simpler and involves less lists. The bonus now is that we can get rid of a couple of functions, as well as remove complications which involve 'rcu_call()' and 'bdi->rcu_head'. This patch also makes sure we use 'list_add_tail_rcu()', instead of plain 'list_add_tail()', but this piece of code is going to be removed in the next patch anyway. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: do not lose wake-ups in the forker thread - 2Artem Bityutskiy2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, if someone submits jobs for the default bdi, we can lose wake-up events. E.g., this can happen if 'bdi_queue_work()' is called when 'bdi_forker_thread()' is executing code after 'wb_do_writeback(me, 0)', but before 'set_current_state(TASK_INTERRUPTIBLE)'. This situation is unlikely, and the result is not very severe - we'll just delay the execution of the work, but this is still not very nice. This patch fixes the issue by checking whether the default bdi has works before the forker thread goes sleep. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: do not lose wake-ups in the forker thread - 1Artem Bityutskiy2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the forker thread can lose wake-ups which may lead to unnecessary delays in processing bdi works. E.g., consider the following scenario. 1. 'bdi_forker_thread()' walks the 'bdi_list', finds out there is nothing to do, and is about to finish the loop. 2. A bdi thread decides to exit because it was inactive for long time. 3. 'bdi_queue_work()' adds a work to the bdi which just exited, so it wakes up the forker thread. 4. but 'bdi_forker_thread()' executes 'set_current_state(TASK_INTERRUPTIBLE)' and goes sleep. We lose a wake-up. Losing the wake-up is not fatal, but this means that the bdi work processing will be delayed by up to 5 sec. This race is theoretical, I never hit it, but it is worth fixing. The fix is to execute 'set_current_state(TASK_INTERRUPTIBLE)' _before_ walking 'bdi_list', not after. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: fix possible race when creating bdi threadsArtem Bityutskiy2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a very unlikely race condition on the bdi forker thread error path: when bdi thread creation fails, 'bdi->wb.task' may contain the error code for a short period of time. If at the same time someone submits a work to this bdi, we can end up with an oops 'bdi_queue_work()' while executing 'wake_up_process(wb->task)'. This patch fixes the issue by introducing a temporary variable 'task' and storing the possible error code there, so that 'wb->task' would never take erroneous values. Note, this race is very unlikely and I never hit it, so it is theoretical, but nevertheless worth fixing. This patch also merges 2 comments which were previously separate. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: harmonize writeback threads namingArtem Bityutskiy2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The write-back code mixes words "thread" and "task" for the same things. This is not a big deal, but still an inconsistency. hch: a convention I tend to use and I've seen in various places is to always use _task for the storage of the task_struct pointer, and thread everywhere else. This especially helps with having foo_thread for the actual thread and foo_task for a global variable keeping the task_struct pointer This patch renames: * 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()' * 'bdi_forker_task()' -> 'bdi_forker_thread()' because bdi threads are 'bdi_writeback_thread()', so these names are more consistent. This patch also amends commentaries and makes them refer the forker and bdi threads as "thread", not "task". Also, while on it, make 'bdi_add_default_flusher_thread()' declaration use 'static void' instead of 'void static' and make checkpatch.pl happy. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: Add tracing to write_cache_pagesDave Chinner2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a trace event to the ->writepage loop in write_cache_pages to give visibility into how the ->writepage call is changing variables within the writeback control structure. Of most interest is how wbc->nr_to_write changes from call to call, especially with filesystems that write multiple pages in ->writepage. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: Add tracing to balance_dirty_pagesDave Chinner2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | Tracing high level background writeback events is good, but it doesn't give the entire picture. Add visibility into write throttling to catch IO dispatched by foreground throttling of processing dirtying lots of pages. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: Initial tracing supportDave Chinner2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Trace queue/sched/exec parts of the writeback loop. This provides insight into when and why flusher threads are scheduled to run. e.g a sync invocation leaves traces like: sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0 flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0 This also lays the foundation for adding more writeback tracing to provide deeper insight into the whole writeback path. The original tracing code is from Jens Axboe, though this version is a rewrite as a result of the code being traced changing significantly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: merge bdi_writeback_task and bdi_start_fnChristoph Hellwig2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | Move all code for the writeback thread into fs/fs-writeback.c instead of splitting it over two functions in two files. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * writeback: remove wb_listChristoph Hellwig2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The wb_list member of struct backing_device_info always has exactly one element. Just use the direct bdi->wb pointer instead and simplify some code. Also remove bdi_task_init which is now trivial to prepare for the next patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * block: unify flags for struct bio and struct requestChristoph Hellwig2010-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the current bio flags and reuse the request flags for the bio, too. This allows to more easily trace the type of I/O from the filesystem down to the block driver. There were two flags in the bio that were missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've renamed two request flags that had a superflous RW in them. Note that the flags are in bio.h despite having the REQ_ name - as blkdev.h includes bio.h that is the only way to go for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * | Merge branch 'kmemleak' of ↵Linus Torvalds2010-08-10
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm * 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm: kmemleak: Fix typo in the comment lib/scatterlist: Hook sg_kmalloc into kmemleak (v2) kmemleak: Add DocBook style comments to kmemleak.c kmemleak: Introduce a default off mode for kmemleak kmemleak: Show more information for objects found by alias
| | * | kmemleak: Fix typo in the commentHolger Hans Peter Freyther2010-08-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix typo in comment. Signed-off-by: Holger Hans Peter Freyther <zecke@selfish.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
| | * | kmemleak: Add DocBook style comments to kmemleak.cCatalin Marinas2010-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The description and parameters of the kmemleak API weren't obvious. This patch adds comments clarifying the API usage. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
| | * | kmemleak: Introduce a default off mode for kmemleakJason Baron2010-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a new DEBUG_KMEMLEAK_DEFAULT_OFF config parameter that allows kmemleak to be disabled by default, but enabled on the command line via: kmemleak=on. Although a reboot is required to turn it on, its still useful to not require a re-compile. Signed-off-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
| | * | kmemleak: Show more information for objects found by aliasCatalin Marinas2010-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There may be situations when an object is freed using a pointer inside the memory block. Kmemleak should show more information to help with debugging. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
| * | | Merge branch 'for-linus' of ↵Linus Torvalds2010-08-10
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits) no need for list_for_each_entry_safe()/resetting with superblock list Fix sget() race with failing mount vfs: don't hold s_umount over close_bdev_exclusive() call sysv: do not mark superblock dirty on remount sysv: do not mark superblock dirty on mount btrfs: remove junk sb_dirt change BFS: clean up the superblock usage AFFS: wait for sb synchronization when needed AFFS: clean up dirty flag usage cifs: truncate fallout mbcache: fix shrinker function return value mbcache: Remove unused features add f_flags to struct statfs(64) pass a struct path to vfs_statfs update VFS documentation for method changes. All filesystems that need invalidate_inode_buffers() are doing that explicitly convert remaining ->clear_inode() to ->evict_inode() Make ->drop_inode() just return whether inode needs to be dropped fs/inode.c:clear_inode() is gone fs/inode.c:evict() doesn't care about delete vs. non-delete paths now ... Fix up trivial conflicts in fs/nilfs2/super.c
| | * | | switch shmem.c to ->evice_inode()Al Viro2010-08-09
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | check ATTR_SIZE contraints in inode_change_okChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure we check the truncate constraints early on in ->setattr by adding those checks to inode_change_ok. Also clean up and document inode_change_ok to make this obvious. As a fallout we don't have to call inode_newsize_ok from simple_setsize and simplify it down to a truncate_setsize which doesn't return an error. This simplifies a lot of setattr implementations and means we use truncate_setsize almost everywhere. Get rid of fat_setsize now that it's trivial and mark ext2_setsize static to make the calling convention obvious. Keep the inode_newsize_ok in vmtruncate for now as all callers need an audit for its removal anyway. Note: setattr code in ecryptfs doesn't call inode_change_ok at all and needs a deeper audit, but that is left for later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | always call inode_change_ok early in ->setattrChristoph Hellwig2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure we call inode_change_ok before doing any changes in ->setattr, and make sure to call it even if our fs wants to ignore normal UNIX permissions, but use the ATTR_FORCE to skip those. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | | rename generic_setattrChristoph Hellwig2010-08-09
| | | |/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Despite its name it's now a generic implementation of ->setattr, but rather a helper to copy attributes from a struct iattr to the inode. Rename it to setattr_copy to reflect this fact. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | / | fix a typo on comments in mm/percpu.cNamhyung Kim2010-08-11
|/ / / | | | | | | | | | | | | | | | | | | 'eqaul' should be 'equal'. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | Merge branch 'merge' of ↵Linus Torvalds2010-08-10
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: powerpc: fix build with make 3.82 Revert "Input: appletouch - fix integer overflow issue" memblock: Fix memblock_is_region_reserved() to return a boolean powerpc: Trim defconfigs powerpc: fix i8042 module build error sound/soc: mpc5200_psc_ac97: Use gpio pins for cold reset powerpc/5200: add mpc5200_psc_ac97_gpio_reset
| * | | memblock: Fix memblock_is_region_reserved() to return a booleanBenjamin Herrenschmidt2010-08-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All callers expect a boolean result which is true if the region overlaps a reserved region. However, the implementation actually returns -1 if there is no overlap, and a region index (0 based) if there is. Make it behave as callers (and common sense) expect. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | | | hibernation: freeze swap at hibernationKAMEZAWA Hiroyuki2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When taking a memory snapshot in hibernate_snapshot(), all (directly called) memory allocations use GFP_ATOMIC. Hence swap misusage during hibernation never occurs. But from a pessimistic point of view, there is no guarantee that no page allcation has __GFP_WAIT. It is better to have a global indication "we enter hibernation, don't use swap!". This patch tries to freeze new-swap-allocation during hibernation. (All user processes are frozenm so swapin is not a concern). This way, no updates will happen to swap_map[] between hibernate_snapshot() and save_image(). Swap is thawed when swsusp_free() is called. We can be assured that swap corruption will not occur. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ondrej Zary <linux@rainbow-software.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: fix corruption of hibernation caused by reusing swap during image savingKAMEZAWA Hiroyuki2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since 2.6.31, swap_map[]'s refcounting was changed to show that a used swap entry is just for swap-cache, can be reused. Then, while scanning free entry in swap_map[], a swap entry may be able to be reclaimed and reused. It was caused by commit c9e444103b5e7a5 ("mm: reuse unused swap entry if necessary"). But this caused deta corruption at resume. The scenario is - Assume a clean-swap cache, but mapped. - at hibernation_snapshot[], clean-swap-cache is saved as clean-swap-cache and swap_map[] is marked as SWAP_HAS_CACHE. - then, save_image() is called. And reuse SWAP_HAS_CACHE entry to save image, and break the contents. After resume: - the memory reclaim runs and finds clean-not-referenced-swap-cache and discards it because it's marked as clean. But here, the contents on disk and swap-cache is inconsistent. Hance memory is corrupted. This patch avoids the bug by not reclaiming swap-entry during hibernation. This is a quick fix for backporting. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Reported-by: Ondreg Zary <linux@rainbow-software.org> Tested-by: Ondreg Zary <linux@rainbow-software.org> Tested-by: Andrea Gelmini <andrea.gelmini@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | ksm: cleanup for mm_slots_hashLai Jiangshan2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use compile-allocated memory instead of dynamic allocated memory for mm_slots_hash. Use hash_ptr() instead divisions for bucket calculation. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Avi Kivity <avi@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | vmscan: raise the bar to PAGEOUT_IO_SYNC stallsWu Fengguang2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix "system goes unresponsive under memory pressure and lots of dirty/writeback pages" bug. http://lkml.org/lkml/2010/4/4/86 In the above thread, Andreas Mohr described that Invoking any command locked up for minutes (note that I'm talking about attempted additional I/O to the _other_, _unaffected_ main system HDD - such as loading some shell binaries -, NOT the external SSD18M!!). This happens when the two conditions are both meet: - under memory pressure - writing heavily to a slow device OOM also happens in Andreas' system. The OOM trace shows that 3 processes are stuck in wait_on_page_writeback() in the direct reclaim path. One in do_fork() and the other two in unix_stream_sendmsg(). They are blocked on this condition: (sc->order && priority < DEF_PRIORITY - 2) which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too permissive. In Andreas' case, 512MB/1024 = 512KB. If the direct reclaim for the order-1 fork() allocation runs into a range of 512KB hard-to-reclaim LRU pages, it will be stalled. It's a severe problem in three ways. Firstly, it can easily happen in daily desktop usage. vmscan priority can easily go below (DEF_PRIORITY - 2) on _local_ memory pressure. Even if the system has 50% globally reclaimable pages, it still has good opportunity to have 0.1% sized hard-to-reclaim ranges. For example, a simple dd can easily create a big range (up to 20%) of dirty pages in the LRU lists. And order-1 to order-3 allocations are more than common with SLUB. Try "grep -v '1 :' /proc/slabinfo" to get the list of high order slab caches. For example, the order-1 radix_tree_node slab cache may stall applications at swap-in time; the order-3 inode cache on most filesystems may stall applications when trying to read some file; the order-2 proc_inode_cache may stall applications when trying to open a /proc file. Secondly, once triggered, it will stall unrelated processes (not doing IO at all) in the system. This "one slow USB device stalls the whole system" avalanching effect is very bad. Thirdly, once stalled, the stall time could be intolerable long for the users. When there are 20MB queued writeback pages and USB 1.1 is writing them in 1MB/s, wait_on_page_writeback() will stuck for up to 20 seconds. Not to mention it may be called multiple times. So raise the bar to only enable PAGEOUT_IO_SYNC when priority goes below DEF_PRIORITY/3, or 6.25% LRU size. As the default dirty throttle ratio is 20%, it will hardly be triggered by pure dirty pages. We'd better treat PAGEOUT_IO_SYNC as some last resort workaround -- its stall time is so uncomfortably long (easily goes beyond 1s). The bar is only raised for (order < PAGE_ALLOC_COSTLY_ORDER) allocations, which are easy to satisfy in 1TB memory boxes. So, although 6.25% of memory could be an awful lot of pages to scan on a system with 1TB of memory, it won't really have to busy scan that much. Andreas tested an older version of this patch and reported that it mostly fixed his problem. Mel Gorman helped improve it and KOSAKI Motohiro will fix it further in the next patch. Reported-by: Andreas Mohr <andi@lisas.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm/vmalloc.c: check kmalloc() return valueKulikov Vasiliy2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kmalloc() may fail, if so return -ENOMEM. Signed-off-by: Kulikov Vasiliy <segooon@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | memcg: add mm_vmscan_memcg_isolate tracepointKOSAKI Motohiro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Memcg also need to trace page isolation information as global reclaim. This patch does it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | memcg, vmscan: add memcg reclaim tracepointKOSAKI Motohiro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Memcg also need to trace reclaim progress as direct reclaim. This patch add it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | vmscan: shrink_slab() requires the number of lru_pages, not the page orderKOSAKI Motohiro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Presently shrink_slab() has the following scanning equation. lru_scanned max_pass basic_scan_objects = 4 x ------------- x ----------------------------- lru_pages shrinker->seeks (default:2) scan_objects = min(basic_scan_objects, max_pass * 2) If we pass very small value as lru_pages instead real number of lru pages, shrink_slab() drop much objects rather than necessary. And now, __zone_reclaim() pass 'order' as lru_pages by mistake. That produces a bad result. For example, if we receive very low memory pressure (scan = 32, order = 0), shrink_slab() via zone_reclaim() always drop _all_ icache/dcache objects. (see above equation, very small lru_pages make very big scan_objects result). This patch fixes it. [akpm@linux-foundation.org: fix layout, typos] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mmu-notifiers: remove mmu notifier calls in apply_to_page_range()Jeremy Fitzhardinge2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is not appropriate for apply_to_page_range() to directly call any mmu notifiers, because it is a general purpose function whose effect depends on what context it is called in and what the callback function does. In particular, if it is being used as part of an mmu notifier implementation, the recursive calls can be particularly problematic. It is up to apply_to_page_range's caller to do any notifier calls if necessary. It does not affect any in-tree users because they all operate on init_mm, and mmu notifiers only pertain to usermode mappings. [stefano.stabellini@eu.citrix.com: remove unused local `start'] Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Avi Kivity <avi@qumranet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | vmscan: protect reading of reclaim_stat with lru_lockKOSAKI Motohiro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rik van Riel pointed out reading reclaim_stat should be protected lru_lock, otherwise vmscan might sweep 2x much pages. This fault was introduced by commit 4f98a2fee8acdb4ac84545df98cccecfd130f8db Author: Rik van Riel <riel@redhat.com> Date: Sat Oct 18 20:26:32 2008 -0700 vmscan: split LRU lists into anon & file sets Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | vmscan: avoid subtraction of unsigned typesKOSAKI Motohiro2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'slab_reclaimable' and 'nr_pages' are unsigned. Subtraction is unsafe because negative results would be misinterpreted. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: set VM_FAULT_WRITE in do_swap_page()Andrea Arcangeli2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Set the flag if do_swap_page is decowing the page the same way do_wp_page would too. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | rmap: add exclusive page to private anon_vma on swapinRik van Riel2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On swapin it is fairly common for a page to be owned exclusively by one process. In that case we want to add the page to the anon_vma of that process's VMA, instead of to the root anon_vma. This will reduce the amount of rmap searching that the swapout code needs to do. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | oom: badness heuristic rewriteDavid Rientjes2010-08-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of "allowable" memory. "Allowable," in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of "allowable" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>