author     Vladimir Davydov <vdavydov@parallels.com>    2014-06-04 19:07:37 -0400
committer  Linus Torvalds <torvalds@linux-foundation.org>    2014-06-04 19:54:01 -0400
commit     1e32e77f95d60b121b6072e3e3a650a7f93068f9 (patch)
tree       1428f555cc548a8df199cd2a261c579705e88b29 /mm/memcontrol.c
parent     d8dc595ce3909fbc131bdf5ab8c9808fe624b18d (diff)
memcg, slab: do not schedule cache destruction when last page goes away
This patchset is a part of preparations for kmemcg re-parenting. It aims at simplifying kmemcg work-flows and synchronization.

First, it removes async per-memcg cache destruction (see patches 1, 2). Now caches are only destroyed on memcg offline. That means the caches that are not empty on memcg offline will be leaked. However, they are already leaked, because memcg_cache_params::nr_pages normally never drops to 0, so the destruction work is never scheduled unless kmem_cache_shrink is called explicitly. In the future I'm planning to reap such dead caches on vmpressure or periodically.

Second, it substitutes the per-memcg slab_caches_mutex'es with the global memcg_slab_mutex, which should be taken during the whole per-memcg cache creation/destruction path before the slab_mutex (see patch 3). This greatly simplifies synchronization among the various per-memcg cache creation/destruction paths.

I'm still not quite sure about the end picture, in particular I don't know whether we should reap dead memcgs' kmem caches periodically or try to merge them with their parents (see https://lkml.org/lkml/2014/4/20/38 for more details), but whichever way we choose, this set looks like a reasonable change to me, because it greatly simplifies kmemcg work-flows and eases further development.

This patch (of 3):

After a memcg is offlined, we mark its kmem caches that cannot be deleted right now due to pending objects as dead by setting the memcg_cache_params::dead flag, so that memcg_release_pages will schedule cache destruction (memcg_cache_params::destroy) as soon as the last slab of the cache is freed (memcg_cache_params::nr_pages drops to zero).

I guess the idea was to destroy the caches as soon as possible, i.e. immediately after freeing the last object. However, it just doesn't work that way, because kmem caches always preserve some pages for the sake of performance, so nr_pages never gets to zero unless the cache is shrunk explicitly using kmem_cache_shrink. Of course, we could account the total number of objects in the cache, or check on kmem_cache_free whether all the slabs allocated for the cache are empty and schedule destruction if so, but that would be too costly.

Thus we have a piece of code that works only when we explicitly call kmem_cache_shrink, yet complicates the whole picture a lot. Moreover, it is in fact racy: kmem_cache_shrink may free the last slab, and thus schedule cache destruction, before it finishes checking that the cache is empty, which can lead to a use-after-free.

So I propose to remove this async cache destruction from memcg_release_pages and instead check whether the cache is empty explicitly after calling kmem_cache_shrink. This simplifies things a lot without introducing any functional changes.

And regarding dead memcg caches (i.e. those that are left hanging around after memcg offline because they still have objects), I suppose we should reap them either periodically or on vmpressure, as Glauber suggested initially. I'm going to implement this later.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
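For orientation, here is a minimal sketch (not the verbatim kernel source) of what the destruction worker looks like once the async path is removed, reconstructed from the first hunk below. The container_of() preamble is assumed from the surrounding kernel code, which the hunk does not show; the rest mirrors the added lines in the diff.

/*
 * Sketch of the simplified destruction worker after this patch:
 * instead of rescheduling itself while pages remain, it always
 * shrinks the cache and destroys it only if the shrink emptied it.
 */
static void kmem_cache_destroy_work_func(struct work_struct *w)
{
	/* Assumed preamble: recover the per-memcg params from the work item. */
	struct memcg_cache_params *p =
		container_of(w, struct memcg_cache_params, destroy);
	struct kmem_cache *cachep = memcg_params_to_cache(p);

	/* Release empty slabs back to the page allocator. */
	kmem_cache_shrink(cachep);

	/* Only destroy the cache if shrinking actually emptied it. */
	if (atomic_read(&cachep->memcg_params->nr_pages) == 0)
		kmem_cache_destroy(cachep);
}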
Diffstat (limited to 'mm/memcontrol.c')
-rw-r--r--  mm/memcontrol.c | 63
1 file changed, 2 insertions(+), 61 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9f4ff49c6add..6b1c45ced733 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3277,60 +3277,11 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
 
 	cachep = memcg_params_to_cache(p);
 
-	/*
-	 * If we get down to 0 after shrink, we could delete right away.
-	 * However, memcg_release_pages() already puts us back in the workqueue
-	 * in that case. If we proceed deleting, we'll get a dangling
-	 * reference, and removing the object from the workqueue in that case
-	 * is unnecessary complication. We are not a fast path.
-	 *
-	 * Note that this case is fundamentally different from racing with
-	 * shrink_slab(): if memcg_cgroup_destroy_cache() is called in
-	 * kmem_cache_shrink, not only we would be reinserting a dead cache
-	 * into the queue, but doing so from inside the worker racing to
-	 * destroy it.
-	 *
-	 * So if we aren't down to zero, we'll just schedule a worker and try
-	 * again
-	 */
-	if (atomic_read(&cachep->memcg_params->nr_pages) != 0)
-		kmem_cache_shrink(cachep);
-	else
-		kmem_cache_destroy(cachep);
-}
-
-void mem_cgroup_destroy_cache(struct kmem_cache *cachep)
-{
-	if (!cachep->memcg_params->dead)
-		return;
-
-	/*
-	 * There are many ways in which we can get here.
-	 *
-	 * We can get to a memory-pressure situation while the delayed work is
-	 * still pending to run. The vmscan shrinkers can then release all
-	 * cache memory and get us to destruction. If this is the case, we'll
-	 * be executed twice, which is a bug (the second time will execute over
-	 * bogus data). In this case, cancelling the work should be fine.
-	 *
-	 * But we can also get here from the worker itself, if
-	 * kmem_cache_shrink is enough to shake all the remaining objects and
-	 * get the page count to 0. In this case, we'll deadlock if we try to
-	 * cancel the work (the worker runs with an internal lock held, which
-	 * is the same lock we would hold for cancel_work_sync().)
-	 *
-	 * Since we can't possibly know who got us here, just refrain from
-	 * running if there is already work pending
-	 */
-	if (work_pending(&cachep->memcg_params->destroy))
-		return;
-	/*
-	 * We have to defer the actual destroying to a workqueue, because
-	 * we might currently be in a context that cannot sleep.
-	 */
-	schedule_work(&cachep->memcg_params->destroy);
-}
-
+	kmem_cache_shrink(cachep);
+	if (atomic_read(&cachep->memcg_params->nr_pages) == 0)
+		kmem_cache_destroy(cachep);
+}
+
 int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
 	struct kmem_cache *c;
@@ -3356,16 +3307,7 @@ int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 	 * We will now manually delete the caches, so to avoid races
 	 * we need to cancel all pending destruction workers and
 	 * proceed with destruction ourselves.
-	 *
-	 * kmem_cache_destroy() will call kmem_cache_shrink internally,
-	 * and that could spawn the workers again: it is likely that
-	 * the cache still have active pages until this very moment.
-	 * This would lead us back to mem_cgroup_destroy_cache.
-	 *
-	 * But that will not execute at all if the "dead" flag is not
-	 * set, so flip it down to guarantee we are in control.
 	 */
-	c->memcg_params->dead = false;
 	cancel_work_sync(&c->memcg_params->destroy);
 	kmem_cache_destroy(c);
 
@@ -3387,7 +3329,6 @@ static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 	mutex_lock(&memcg->slab_caches_mutex);
 	list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
 		cachep = memcg_params_to_cache(params);
-		cachep->memcg_params->dead = true;
 		schedule_work(&cachep->memcg_params->destroy);
 	}
 	mutex_unlock(&memcg->slab_caches_mutex);