aboutsummaryrefslogtreecommitdiffstats
path: root/mm
diff options
context:
space:
mode:
authorJohannes Weiner <hannes@cmpxchg.org>2013-12-12 20:12:34 -0500
committerLinus Torvalds <torvalds@linux-foundation.org>2013-12-12 21:19:26 -0500
commit96f1c58d853497a757463e0b57fed140d6858f3a (patch)
tree9d2c6fca20e2a6de39cb4aff25d015b75615ea64 /mm
parent3592806cfa08b7cca968f793c33f8e9460bab395 (diff)
mm: memcg: fix race condition between memcg teardown and swapin
There is a race condition between a memcg being torn down and a swapin triggered from a different memcg of a page that was recorded to belong to the exiting memcg on swapout (with CONFIG_MEMCG_SWAP extension). The result is unreclaimable pages pointing to dead memcgs, which can lead to anything from endless loops in later memcg teardown (the page is charged to all hierarchical parents but is not on any LRU list) or crashes from following the dangling memcg pointer. Memcgs with tasks in them can not be torn down and usually charges don't show up in memcgs without tasks. Swapin with the CONFIG_MEMCG_SWAP extension is the notable exception because it charges the cgroup that was recorded as owner during swapout, which may be empty and in the process of being torn down when a task in another memcg triggers the swapin: teardown: swapin: lookup_swap_cgroup_id() rcu_read_lock() mem_cgroup_lookup() css_tryget() rcu_read_unlock() disable css_tryget() call_rcu() offline_css() reparent_charges() res_counter_charge() (hierarchical!) css_put() css_free() pc->mem_cgroup = dead memcg add page to dead lru Add a final reparenting step into css_free() to make sure any such raced charges are moved out of the memcg before it's finally freed. In the longer term it would be cleaner to have the css_tryget() and the res_counter charge under the same RCU lock section so that the charge reparenting is deferred until the last charge whose tryget succeeded is visible. But this will require more invasive changes that will be harder to evaluate and backport into stable, so better defer them to a separate change set. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'mm')
-rw-r--r--mm/memcontrol.c36
1 files changed, 36 insertions, 0 deletions
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e3aff0175d4c..f6a63f5b3827 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6355,6 +6355,42 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
6355static void mem_cgroup_css_free(struct cgroup_subsys_state *css) 6355static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
6356{ 6356{
6357 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 6357 struct mem_cgroup *memcg = mem_cgroup_from_css(css);
6358 /*
6359 * XXX: css_offline() would be where we should reparent all
6360 * memory to prepare the cgroup for destruction. However,
6361 * memcg does not do css_tryget() and res_counter charging
6362 * under the same RCU lock region, which means that charging
6363 * could race with offlining. Offlining only happens to
6364 * cgroups with no tasks in them but charges can show up
6365 * without any tasks from the swapin path when the target
6366 * memcg is looked up from the swapout record and not from the
6367 * current task as it usually is. A race like this can leak
6368 * charges and put pages with stale cgroup pointers into
6369 * circulation:
6370 *
6371 * #0 #1
6372 * lookup_swap_cgroup_id()
6373 * rcu_read_lock()
6374 * mem_cgroup_lookup()
6375 * css_tryget()
6376 * rcu_read_unlock()
6377 * disable css_tryget()
6378 * call_rcu()
6379 * offline_css()
6380 * reparent_charges()
6381 * res_counter_charge()
6382 * css_put()
6383 * css_free()
6384 * pc->mem_cgroup = dead memcg
6385 * add page to lru
6386 *
6387 * The bulk of the charges are still moved in offline_css() to
6388 * avoid pinning a lot of pages in case a long-term reference
6389 * like a swapout record is deferring the css_free() to long
6390 * after offlining. But this makes sure we catch any charges
6391 * made after offlining:
6392 */
6393 mem_cgroup_reparent_charges(memcg);
6358 6394
6359 memcg_destroy_kmem(memcg); 6395 memcg_destroy_kmem(memcg);
6360 __mem_cgroup_free(memcg); 6396 __mem_cgroup_free(memcg);