aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorMichal Hocko <mhocko@suse.cz>2015-02-11 18:26:12 -0500
committerLinus Torvalds <torvalds@linux-foundation.org>2015-02-11 20:06:03 -0500
commit49550b605587924b3336386caae53200c68969d3 (patch)
tree1da2875ec456c2cd5a1066ff4a294dad263bc185
parent1dfab5abcdd404fd04597c063d9f61a5b3247552 (diff)
oom: add helpers for setting and clearing TIF_MEMDIE
This patchset addresses a race which was described in the changelog for 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-rw-r--r--drivers/staging/android/lowmemorykiller.c7
-rw-r--r--include/linux/oom.h4
-rw-r--r--kernel/exit.c2
-rw-r--r--mm/memcontrol.c2
-rw-r--r--mm/oom_kill.c23
5 files changed, 32 insertions, 6 deletions
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index b545d3d1da3e..feafa172b155 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -160,7 +160,12 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
160 selected->pid, selected->comm, 160 selected->pid, selected->comm,
161 selected_oom_score_adj, selected_tasksize); 161 selected_oom_score_adj, selected_tasksize);
162 lowmem_deathpending_timeout = jiffies + HZ; 162 lowmem_deathpending_timeout = jiffies + HZ;
163 set_tsk_thread_flag(selected, TIF_MEMDIE); 163 /*
164 * FIXME: lowmemorykiller shouldn't abuse global OOM killer
165 * infrastructure. There is no real reason why the selected
166 * task should have access to the memory reserves.
167 */
168 mark_tsk_oom_victim(selected);
164 send_sig(SIGKILL, selected, 0); 169 send_sig(SIGKILL, selected, 0);
165 rem += selected_tasksize; 170 rem += selected_tasksize;
166 } 171 }
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76200984d1e2..b42b80f88c3a 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -47,6 +47,10 @@ static inline bool oom_task_origin(const struct task_struct *p)
47 return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN); 47 return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
48} 48}
49 49
50extern void mark_tsk_oom_victim(struct task_struct *tsk);
51
52extern void unmark_oom_victim(void);
53
50extern unsigned long oom_badness(struct task_struct *p, 54extern unsigned long oom_badness(struct task_struct *p,
51 struct mem_cgroup *memcg, const nodemask_t *nodemask, 55 struct mem_cgroup *memcg, const nodemask_t *nodemask,
52 unsigned long totalpages); 56 unsigned long totalpages);
diff --git a/kernel/exit.c b/kernel/exit.c
index 6806c55475ee..02b3d1ab2ec0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -435,7 +435,7 @@ static void exit_mm(struct task_struct *tsk)
435 task_unlock(tsk); 435 task_unlock(tsk);
436 mm_update_next_owner(mm); 436 mm_update_next_owner(mm);
437 mmput(mm); 437 mmput(mm);
438 clear_thread_flag(TIF_MEMDIE); 438 unmark_oom_victim();
439} 439}
440 440
441static struct task_struct *find_alive_thread(struct task_struct *p) 441static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 11c9e6a1dad5..fe4d258ef32b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1556,7 +1556,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
1556 * quickly exit and free its memory. 1556 * quickly exit and free its memory.
1557 */ 1557 */
1558 if (fatal_signal_pending(current) || task_will_free_mem(current)) { 1558 if (fatal_signal_pending(current) || task_will_free_mem(current)) {
1559 set_thread_flag(TIF_MEMDIE); 1559 mark_tsk_oom_victim(current);
1560 return; 1560 return;
1561 } 1561 }
1562 1562
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 294493a7ae4b..80b34e285f96 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -416,6 +416,23 @@ void note_oom_kill(void)
416 atomic_inc(&oom_kills); 416 atomic_inc(&oom_kills);
417} 417}
418 418
419/**
420 * mark_tsk_oom_victim - marks the given taks as OOM victim.
421 * @tsk: task to mark
422 */
423void mark_tsk_oom_victim(struct task_struct *tsk)
424{
425 set_tsk_thread_flag(tsk, TIF_MEMDIE);
426}
427
428/**
429 * unmark_oom_victim - unmarks the current task as OOM victim.
430 */
431void unmark_oom_victim(void)
432{
433 clear_thread_flag(TIF_MEMDIE);
434}
435
419#define K(x) ((x) << (PAGE_SHIFT-10)) 436#define K(x) ((x) << (PAGE_SHIFT-10))
420/* 437/*
421 * Must be called while holding a reference to p, which will be released upon 438 * Must be called while holding a reference to p, which will be released upon
@@ -440,7 +457,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
440 */ 457 */
441 task_lock(p); 458 task_lock(p);
442 if (p->mm && task_will_free_mem(p)) { 459 if (p->mm && task_will_free_mem(p)) {
443 set_tsk_thread_flag(p, TIF_MEMDIE); 460 mark_tsk_oom_victim(p);
444 task_unlock(p); 461 task_unlock(p);
445 put_task_struct(p); 462 put_task_struct(p);
446 return; 463 return;
@@ -495,7 +512,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
495 512
496 /* mm cannot safely be dereferenced after task_unlock(victim) */ 513 /* mm cannot safely be dereferenced after task_unlock(victim) */
497 mm = victim->mm; 514 mm = victim->mm;
498 set_tsk_thread_flag(victim, TIF_MEMDIE); 515 mark_tsk_oom_victim(victim);
499 pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n", 516 pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
500 task_pid_nr(victim), victim->comm, K(victim->mm->total_vm), 517 task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
501 K(get_mm_counter(victim->mm, MM_ANONPAGES)), 518 K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -652,7 +669,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
652 */ 669 */
653 if (current->mm && 670 if (current->mm &&
654 (fatal_signal_pending(current) || task_will_free_mem(current))) { 671 (fatal_signal_pending(current) || task_will_free_mem(current))) {
655 set_thread_flag(TIF_MEMDIE); 672 mark_tsk_oom_victim(current);
656 return; 673 return;
657 } 674 }
658 675