author Michal Hocko <mhocko@suse.com> 2015-12-11 16:40:32 -0500
committer Linus Torvalds <torvalds@linux-foundation.org> 2015-12-12 13:15:34 -0500
commit 373ccbe5927034b55bdc80b0f8b54d6e13fe8d12
tree c58101acd453c59917f1fe41d9a795e4a9f58b45 /mm/backing-dev.c
parent 475a2f905d5a41d5fc569ef21841be67d0a7f788
mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress

Tetsuo Handa has reported that the system might basically livelock in OOM condition without triggering the OOM killer.

The issue is caused by internal dependency of the direct reclaim on vmstat counter updates (via zone_reclaimable) which are performed from the workqueue context. If all the current workers get assigned to an allocation request, though, they will be looping inside the allocator trying to reclaim memory but zone_reclaimable can see stalled numbers so it will consider a zone reclaimable even though it has been scanned way too much.

WQ concurrency logic will not consider this situation as a congested workqueue because it relies that worker would have to sleep in such a situation. This also means that it doesn't try to spawn new workers or invoke the rescuer thread if the one is assigned to the queue.

In order to fix this issue we need to do two things.

First we have to let wq concurrency code know that we are in trouble so we have to do a short sleep. In order to prevent from issues handled by 0e093d99763e ("writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone") we limit the sleep only to worker threads which are the ones of the interest anyway.

The second thing to do is to create a dedicated workqueue for vmstat and mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to have a spare worker thread for it.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Cristopher Lameter <clameter@sgi.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Diffstat (limited to 'mm/backing-dev.c')
-rw-r--r-- mm/backing-dev.c 19
1 files changed, 16 insertions, 3 deletions
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8ed2ffd963c5..7340353f8aea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,8 +957,9 @@ EXPORT_SYMBOL(congestion_wait);
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the absence of zone congestion, a short sleep or a cond_resched is
+ * performed to yield the processor and to allow other subsystems to make
+ * a forward progress.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
@@ -978,7 +979,19 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	 */
 	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
 	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
-		cond_resched();
+
+		/*
+		 * Memory allocation/reclaim might be called from a WQ
+		 * context and the current implementation of the WQ
+		 * concurrency control doesn't recognize that a particular
+		 * WQ is congested if the worker thread is looping without
+		 * ever sleeping. Therefore we have to do a short sleep
+		 * here rather than calling cond_resched().
+		 */
+		if (current->flags & PF_WQ_WORKER)
+			schedule_timeout(1);
+		else
+			cond_resched();
 
 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
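
The branch added by the second hunk can be modeled in userspace to make the decision easier to follow. This is only a rough sketch and not kernel code: wait_action, wait_state, choose_wait_action() and remaining_jiffies() are hypothetical names invented for illustration, and the three fields stand in for the kernel expressions named in the comments. The WAIT_ON_CONGESTION case is assumed from the function comment ("a BDI to exit congestion ... or a write to complete"); the patch itself does not show that path.

```c
#include <stdbool.h>

/*
 * Userspace model only: these types and names are hypothetical stand-ins,
 * not kernel APIs.
 */
enum wait_action {
	WAIT_ON_CONGESTION,	/* assumed slow path: wait for congestion to clear */
	SHORT_SLEEP,		/* models schedule_timeout(1) for WQ workers */
	YIELD_ONLY		/* models cond_resched() for all other tasks */
};

struct wait_state {
	int  nr_congested;	/* models atomic_read(&nr_wb_congested[sync]) */
	bool zone_congested;	/* models test_bit(ZONE_CONGESTED, &zone->flags) */
	bool is_wq_worker;	/* models current->flags & PF_WQ_WORKER */
};

/* Pick the action the patched fast path would take for the given state. */
static enum wait_action choose_wait_action(const struct wait_state *s)
{
	/* No congested BDIs, or this zone is not congested: do not wait long. */
	if (s->nr_congested == 0 || !s->zone_congested)
		return s->is_wq_worker ? SHORT_SLEEP : YIELD_ONLY;

	return WAIT_ON_CONGESTION;
}

/* Mirrors "ret = timeout - (jiffies - start)" from the hunk above. */
static long remaining_jiffies(long timeout, unsigned long start,
			      unsigned long now)
{
	return timeout - (long)(now - start);
}
```

In this model, a workqueue worker that finds no congestion gets SHORT_SLEEP, while any other task in the same state gets YIELD_ONLY. That matches the commit message's intent of limiting the sleep to worker threads.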