author: Anton Vorontsov <anton.vorontsov@linaro.org> 2013-04-29 18:08:31 -0400
committer: Linus Torvalds <torvalds@linux-foundation.org> 2013-04-29 18:54:38 -0400
commit: 70ddf637eebe47e61fb2be08a59315581b6d2f38
tree: 7fdb9e04da11c191daa225cad2314e440effc176 /Documentation
parent: 84d96d897671cfb386e722acbefdb3a79e115a8a

memcg: add memory.pressure_level events

With this patch, userland applications that want to maintain
interactivity or keep memory allocation cost under control can use the
pressure level notifications. The levels are defined like this:

The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (e.g. shut
down unimportant services early).

The "medium" level means that the system is experiencing medium memory
pressure; the system might be swapping, paging out active file caches,
etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing, it is
about to run out of memory (OOM) or even the in-kernel OOM killer is on
its way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult vmstat or any other statistics,
so it's advisable to take immediate action.

The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: for example you
have three cgroups: A->B->C. Now you set up an event listener on cgroups
A, B and C, and suppose group C experiences some pressure. In this
situation, only group C will receive the notification, i.e. groups A and
B will not receive it. This is done to avoid excessive "broadcasting" of
messages, which disturbs the system and which is especially bad if we are
low on memory or thrashing. So, organize the cgroups wisely, or propagate
the events manually (or, ask us to implement the pass-through events,
explaining why you would need them.)

Performance-wise, the memory pressure notifications feature itself is
lightweight and does not require much bookkeeping, in contrast to the
rest of the memcg features. Unfortunately, in the current memcg
implementation, page accounting is an inseparable part and cannot be
turned off. The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for the CONFIG_MEMCG=n case (e.g. embedded) is also a viable
option, so it will not require any changes on the userland side.

[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROUPS=n warnings]
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Diffstat (limited to 'Documentation')
 Documentation/cgroups/memory.txt | 70
 1 file changed, 69 insertions, 1 deletion
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 8b8c28b9864c..f336ede58e62 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -40,6 +40,7 @@ Features:
  - soft limit
  - moving (recharging) account at moving a task is selectable.
  - usage threshold notifier
+ - memory pressure notifier
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.
 
@@ -65,6 +66,7 @@ Brief summary of control files.
  memory.stat                     # show various statistics
  memory.use_hierarchy            # set/show hierarchical account enabled
  memory.force_empty              # trigger forced move charge to parent
+ memory.pressure_level           # set memory pressure notifications
  memory.swappiness               # set/show swappiness parameter of vmscan
                                  (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate # set/show controls of moving charges
@@ -762,7 +764,73 @@ At reading, current status of OOM is shown.
         under_oom        0 or 1 (if 1, the memory cgroup is under OOM, tasks may
                          be stopped.)
 
-11. TODO
+11. Memory Pressure
+
+The pressure level notifications can be used to monitor the memory
+allocation cost; based on the pressure, applications can implement
+different strategies of managing their memory resources. The pressure
+levels are defined as follows:
+
+The "low" level means that the system is reclaiming memory for new
+allocations. Monitoring this reclaiming activity might be useful for
+maintaining cache level. Upon notification, the program (typically
+"Activity Manager") might analyze vmstat and act in advance (e.g.
+shut down unimportant services early).
+
+The "medium" level means that the system is experiencing medium memory
+pressure; the system might be swapping, paging out active file caches,
+etc. Upon this event applications may decide to further analyze
+vmstat/zoneinfo/memcg or internal memory usage statistics and free any
+resources that can be easily reconstructed or re-read from a disk.
+
+The "critical" level means that the system is actively thrashing, it is
+about to run out of memory (OOM) or even the in-kernel OOM killer is on
+its way to trigger. Applications should do whatever they can to help the
+system. It might be too late to consult vmstat or any other
+statistics, so it's advisable to take immediate action.
+
+The events are propagated upward until the event is handled, i.e. the
+events are not pass-through. Here is what this means: for example you have
+three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
+and C, and suppose group C experiences some pressure. In this situation,
+only group C will receive the notification, i.e. groups A and B will not
+receive it. This is done to avoid excessive "broadcasting" of messages,
+which disturbs the system and which is especially bad if we are low on
+memory or thrashing. So, organize the cgroups wisely, or propagate the
+events manually (or, ask us to implement the pass-through events,
+explaining why you would need them.)
+
+The file memory.pressure_level is only used to set up an eventfd. To
+register a notification, an application must:
+
+- create an eventfd using eventfd(2);
+- open memory.pressure_level;
+- write a string like "<event_fd> <fd of memory.pressure_level> <level>"
+  to cgroup.event_control.
+
+The application will be notified through the eventfd when memory pressure
+is at the specified level (or higher). Read/write operations on
+memory.pressure_level are not implemented.
+
+Test:
+
+   Here is a small script example that makes a new cgroup, sets up a
+   memory limit, sets up a notification in the cgroup and then makes the
+   new cgroup experience critical pressure:
+
+   # cd /sys/fs/cgroup/memory/
+   # mkdir foo
+   # cd foo
+   # cgroup_event_listener memory.pressure_level low &
+   # echo 8000000 > memory.limit_in_bytes
+   # echo 8000000 > memory.memsw.limit_in_bytes
+   # echo $$ > tasks
+   # dd if=/dev/zero | read x
+
+   (Expect a bunch of notifications, and eventually, the oom-killer will
+   trigger.)
+
+12. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first