From e21a05cb408bb9f244f11a0813d4b355dad0822e Mon Sep 17 00:00:00 2001 From: GeunSik Lim Date: Wed, 24 Feb 2010 11:06:39 +0100 Subject: doc: cpuset: Update the cpuset flag file This patch is for modifying with correct cuset flag file. We need to update current manual for cpuset. For example, before) cpus, cpu_exclusive, mems after ) cpuset.cpus, cpuset.cpu_exclusive, cpuset.mems Signed-off-by: Geunsik Lim Acked-by: Paul Menage Signed-off-by: Jiri Kosina --- Documentation/cgroups/cpusets.txt | 127 +++++++++++++++++++------------------- 1 file changed, 65 insertions(+), 62 deletions(-) (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt index 1d7e9784439a..4160df82b3f5 100644 --- a/Documentation/cgroups/cpusets.txt +++ b/Documentation/cgroups/cpusets.txt @@ -168,20 +168,20 @@ Each cpuset is represented by a directory in the cgroup file system containing (on top of the standard cgroup files) the following files describing that cpuset: - - cpus: list of CPUs in that cpuset - - mems: list of Memory Nodes in that cpuset - - memory_migrate flag: if set, move pages to cpusets nodes - - cpu_exclusive flag: is cpu placement exclusive? - - mem_exclusive flag: is memory placement exclusive? - - mem_hardwall flag: is memory allocation hardwalled - - memory_pressure: measure of how much paging pressure in cpuset - - memory_spread_page flag: if set, spread page cache evenly on allowed nodes - - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes - - sched_load_balance flag: if set, load balance within CPUs on that cpuset - - sched_relax_domain_level: the searching range when migrating tasks + - cpuset.cpus: list of CPUs in that cpuset + - cpuset.mems: list of Memory Nodes in that cpuset + - cpuset.memory_migrate flag: if set, move pages to cpusets nodes + - cpuset.cpu_exclusive flag: is cpu placement exclusive? + - cpuset.mem_exclusive flag: is memory placement exclusive? + - cpuset.mem_hardwall flag: is memory allocation hardwalled + - cpuset.memory_pressure: measure of how much paging pressure in cpuset + - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes + - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes + - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset + - cpuset.sched_relax_domain_level: the searching range when migrating tasks In addition, the root cpuset only has the following file: - - memory_pressure_enabled flag: compute memory_pressure? + - cpuset.memory_pressure_enabled flag: compute memory_pressure? New cpusets are created using the mkdir system call or shell command. The properties of a cpuset, such as its flags, allowed @@ -229,7 +229,7 @@ If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct ancestor or descendant, may share any of the same CPUs or Memory Nodes. -A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled", +A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled", i.e. it restricts kernel allocations for page, buffer and other data commonly shared by the kernel across multiple users. All cpusets, whether hardwalled or not, restrict allocations of memory for user @@ -304,15 +304,15 @@ times 1000. --------------------------- There are two boolean flag files per cpuset that control where the kernel allocates pages for the file system buffers and related in -kernel data structures. They are called 'memory_spread_page' and -'memory_spread_slab'. +kernel data structures. They are called 'cpuset.memory_spread_page' and +'cpuset.memory_spread_slab'. -If the per-cpuset boolean flag file 'memory_spread_page' is set, then +If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then the kernel will spread the file system buffers (page cache) evenly over all the nodes that the faulting task is allowed to use, instead of preferring to put those pages on the node where the task is running. -If the per-cpuset boolean flag file 'memory_spread_slab' is set, +If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set, then the kernel will spread some file system related slab caches, such as for inodes and dentries evenly over all the nodes that the faulting task is allowed to use, instead of preferring to put those @@ -337,21 +337,21 @@ their containing tasks memory spread settings. If memory spreading is turned off, then the currently specified NUMA mempolicy once again applies to memory page allocations. -Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag +Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag files. By default they contain "0", meaning that the feature is off for that cpuset. If a "1" is written to that file, then that turns the named feature on. The implementation is simple. -Setting the flag 'memory_spread_page' turns on a per-process flag +Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag PF_SPREAD_PAGE for each task that is in that cpuset or subsequently joins that cpuset. The page allocation calls for the page cache is modified to perform an inline check for this PF_SPREAD_PAGE task flag, and if set, a call to a new routine cpuset_mem_spread_node() returns the node to prefer for the allocation. -Similarly, setting 'memory_spread_slab' turns on the flag +Similarly, setting 'cpuset.memory_spread_slab' turns on the flag PF_SPREAD_SLAB, and appropriately marked slab caches will allocate pages from the node returned by cpuset_mem_spread_node(). @@ -404,24 +404,24 @@ the following two situations: system overhead on those CPUs, including avoiding task load balancing if that is not needed. -When the per-cpuset flag "sched_load_balance" is enabled (the default -setting), it requests that all the CPUs in that cpusets allowed 'cpus' +When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default +setting), it requests that all the CPUs in that cpusets allowed 'cpuset.cpus' be contained in a single sched domain, ensuring that load balancing can move a task (not otherwised pinned, as by sched_setaffinity) from any CPU in that cpuset to any other. -When the per-cpuset flag "sched_load_balance" is disabled, then the +When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the scheduler will avoid load balancing across the CPUs in that cpuset, --except-- in so far as is necessary because some overlapping cpuset has "sched_load_balance" enabled. -So, for example, if the top cpuset has the flag "sched_load_balance" +So, for example, if the top cpuset has the flag "cpuset.sched_load_balance" enabled, then the scheduler will have one sched domain covering all -CPUs, and the setting of the "sched_load_balance" flag in any other +CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other cpusets won't matter, as we're already fully load balancing. Therefore in the above two situations, the top cpuset flag -"sched_load_balance" should be disabled, and only some of the smaller, +"cpuset.sched_load_balance" should be disabled, and only some of the smaller, child cpusets have this flag enabled. When doing this, you don't usually want to leave any unpinned tasks in @@ -433,7 +433,7 @@ scheduler might not consider the possibility of load balancing that task to that underused CPU. Of course, tasks pinned to a particular CPU can be left in a cpuset -that disables "sched_load_balance" as those tasks aren't going anywhere +that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere else anyway. There is an impedance mismatch here, between cpusets and sched domains. @@ -443,19 +443,19 @@ overlap and each CPU is in at most one sched domain. It is necessary for sched domains to be flat because load balancing across partially overlapping sets of CPUs would risk unstable dynamics that would be beyond our understanding. So if each of two partially -overlapping cpusets enables the flag 'sched_load_balance', then we +overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we form a single sched domain that is a superset of both. We won't move a task to a CPU outside it cpuset, but the scheduler load balancing code might waste some compute cycles considering that possibility. This mismatch is why there is not a simple one-to-one relation -between which cpusets have the flag "sched_load_balance" enabled, +between which cpusets have the flag "cpuset.sched_load_balance" enabled, and the sched domain configuration. If a cpuset enables the flag, it will get balancing across all its CPUs, but if it disables the flag, it will only be assured of no load balancing if no other overlapping cpuset enables the flag. -If two cpusets have partially overlapping 'cpus' allowed, and only +If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only one of them has this flag enabled, then the other may find its tasks only partially load balanced, just on the overlapping CPUs. This is just the general case of the top_cpuset example given a few @@ -468,23 +468,23 @@ load balancing to the other CPUs. 1.7.1 sched_load_balance implementation details. ------------------------------------------------ -The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary +The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary to most cpuset flags.) When enabled for a cpuset, the kernel will ensure that it can load balance across all the CPUs in that cpuset (makes sure that all the CPUs in the cpus_allowed of that cpuset are in the same sched domain.) -If two overlapping cpusets both have 'sched_load_balance' enabled, +If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled, then they will be (must be) both in the same sched domain. -If, as is the default, the top cpuset has 'sched_load_balance' enabled, +If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled, then by the above that means there is a single sched domain covering the whole system, regardless of any other cpuset settings. The kernel commits to user space that it will avoid load balancing where it can. It will pick as fine a granularity partition of sched domains as it can while still providing load balancing for any set -of CPUs allowed to a cpuset having 'sched_load_balance' enabled. +of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled. The internal kernel cpuset to scheduler interface passes from the cpuset code to the scheduler code a partition of the load balanced @@ -495,9 +495,9 @@ all the CPUs that must be load balanced. The cpuset code builds a new such partition and passes it to the scheduler sched domain setup code, to have the sched domains rebuilt as necessary, whenever: - - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes, + - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes, - or CPUs come or go from a cpuset with this flag enabled, - - or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs + - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs and with this flag enabled changes, - or a cpuset with non-empty CPUs and with this flag enabled is removed, - or a cpu is offlined/onlined. @@ -542,7 +542,7 @@ As the result, task B on CPU X need to wait task A or wait load balance on the next tick. For some applications in special situation, waiting 1 tick may be too long. -The 'sched_relax_domain_level' file allows you to request changing +The 'cpuset.sched_relax_domain_level' file allows you to request changing this searching range as you like. This file takes int value which indicates size of searching range in levels ideally as follows, otherwise initial value -1 that indicates the cpuset has no request. @@ -559,8 +559,8 @@ The system default is architecture dependent. The system default can be changed using the relax_domain_level= boot parameter. This file is per-cpuset and affect the sched domain where the cpuset -belongs to. Therefore if the flag 'sched_load_balance' of a cpuset -is disabled, then 'sched_relax_domain_level' have no effect since +belongs to. Therefore if the flag 'cpuset.sched_load_balance' of a cpuset +is disabled, then 'cpuset.sched_relax_domain_level' have no effect since there is no sched domain belonging the cpuset. If multiple cpusets are overlapping and hence they form a single sched @@ -607,9 +607,9 @@ from one cpuset to another, then the kernel will adjust the tasks memory placement, as above, the next time that the kernel attempts to allocate a page of memory for that task. -If a cpuset has its 'cpus' modified, then each task in that cpuset +If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset will have its allowed CPU placement changed immediately. Similarly, -if a tasks pid is written to another cpusets 'tasks' file, then its +if a tasks pid is written to another cpusets 'cpuset.tasks' file, then its allowed CPU placement is changed immediately. If such a task had been bound to some subset of its cpuset using the sched_setaffinity() call, the task will be allowed to run on any CPU allowed in its new cpuset, @@ -622,8 +622,8 @@ and the processor placement is updated immediately. Normally, once a page is allocated (given a physical page of main memory) then that page stays on whatever node it was allocated, so long as it remains allocated, even if the -cpusets memory placement policy 'mems' subsequently changes. -If the cpuset flag file 'memory_migrate' is set true, then when +cpusets memory placement policy 'cpuset.mems' subsequently changes. +If the cpuset flag file 'cpuset.memory_migrate' is set true, then when tasks are attached to that cpuset, any pages that task had allocated to it on nodes in its previous cpuset are migrated to the tasks new cpuset. The relative placement of the page within @@ -631,12 +631,12 @@ the cpuset is preserved during these migration operations if possible. For example if the page was on the second valid node of the prior cpuset then the page will be placed on the second valid node of the new cpuset. -Also if 'memory_migrate' is set true, then if that cpusets -'mems' file is modified, pages allocated to tasks in that -cpuset, that were on nodes in the previous setting of 'mems', +Also if 'cpuset.memory_migrate' is set true, then if that cpusets +'cpuset.mems' file is modified, pages allocated to tasks in that +cpuset, that were on nodes in the previous setting of 'cpuset.mems', will be moved to nodes in the new setting of 'mems.' Pages that were not in the tasks prior cpuset, or in the cpusets -prior 'mems' setting, will not be moved. +prior 'cpuset.mems' setting, will not be moved. There is an exception to the above. If hotplug functionality is used to remove all the CPUs that are currently assigned to a cpuset, @@ -678,8 +678,8 @@ and then start a subshell 'sh' in that cpuset: cd /dev/cpuset mkdir Charlie cd Charlie - /bin/echo 2-3 > cpus - /bin/echo 1 > mems + /bin/echo 2-3 > cpuset.cpus + /bin/echo 1 > cpuset.mems /bin/echo $$ > tasks sh # The subshell 'sh' is now running in cpuset Charlie @@ -725,10 +725,13 @@ Now you want to do something with this cpuset. In this directory you can find several files: # ls -cpu_exclusive memory_migrate mems tasks -cpus memory_pressure notify_on_release -mem_exclusive memory_spread_page sched_load_balance -mem_hardwall memory_spread_slab sched_relax_domain_level +cpuset.cpu_exclusive cpuset.memory_spread_slab +cpuset.cpus cpuset.mems +cpuset.mem_exclusive cpuset.sched_load_balance +cpuset.mem_hardwall cpuset.sched_relax_domain_level +cpuset.memory_migrate notify_on_release +cpuset.memory_pressure tasks +cpuset.memory_spread_page Reading them will give you information about the state of this cpuset: the CPUs and Memory Nodes it can use, the processes that are using @@ -736,13 +739,13 @@ it, its properties. By writing to these files you can manipulate the cpuset. Set some flags: -# /bin/echo 1 > cpu_exclusive +# /bin/echo 1 > cpuset.cpu_exclusive Add some cpus: -# /bin/echo 0-7 > cpus +# /bin/echo 0-7 > cpuset.cpus Add some mems: -# /bin/echo 0-7 > mems +# /bin/echo 0-7 > cpuset.mems Now attach your shell to this cpuset: # /bin/echo $$ > tasks @@ -774,28 +777,28 @@ echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent This is the syntax to use when writing in the cpus or mems files in cpuset directories: -# /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 -# /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 +# /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 +# /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 To add a CPU to a cpuset, write the new list of CPUs including the CPU to be added. To add 6 to the above cpuset: -# /bin/echo 1-4,6 > cpus -> set cpus list to cpus 1,2,3,4,6 +# /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6 Similarly to remove a CPU from a cpuset, write the new list of CPUs without the CPU to be removed. To remove all the CPUs: -# /bin/echo "" > cpus -> clear cpus list +# /bin/echo "" > cpuset.cpus -> clear cpus list 2.3 Setting flags ----------------- The syntax is very simple: -# /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' -# /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' +# /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive' +# /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive' 2.4 Attaching processes ----------------------- -- cgit v1.2.2 From 2468c7234b366eeb799ee0648cb58f9cba394a54 Mon Sep 17 00:00:00 2001 From: Daisuke Nishimura Date: Wed, 10 Mar 2010 15:22:03 -0800 Subject: cgroup: introduce cancel_attach() Add cancel_attach() operation to struct cgroup_subsys. cancel_attach() can be used when can_attach() operation prepares something for the subsys, but we should rollback what can_attach() operation has prepared if attach task fails after we've succeeded in can_attach(). Signed-off-by: Daisuke Nishimura Acked-by: Li Zefan Reviewed-by: Paul Menage Cc: Balbir Singh Acked-by: KAMEZAWA Hiroyuki Cc: Daisuke Nishimura Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/cgroups.txt | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index 0b33bfe7dde9..d45082653e3d 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -536,10 +536,21 @@ returns an error, this will abort the attach operation. If a NULL task is passed, then a successful result indicates that *any* unspecified task can be moved into the cgroup. Note that this isn't called on a fork. If this method returns 0 (success) then this should -remain valid while the caller holds cgroup_mutex. If threadgroup is +remain valid while the caller holds cgroup_mutex and it is ensured that either +attach() or cancel_attach() will be called in future. If threadgroup is true, then a successful result indicates that all threads in the given thread's threadgroup can be moved together. +void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, + struct task_struct *task, bool threadgroup) +(cgroup_mutex held by caller) + +Called when a task attach operation has failed after can_attach() has succeeded. +A subsystem whose can_attach() has some side-effects should provide this +function, so that the subsytem can implement a rollback. If not, not necessary. +This will be called only about subsystems whose can_attach() operation have +succeeded. + void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *task, bool threadgroup) -- cgit v1.2.2 From e6a1105ba08b265023dd71a4174fb4a29ebc7083 Mon Sep 17 00:00:00 2001 From: Ben Blum Date: Wed, 10 Mar 2010 15:22:09 -0800 Subject: cgroups: subsystem module loading interface Add interface between cgroups subsystem management and module loading This patch implements rudimentary module-loading support for cgroups - namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a module initcall, and a struct module pointer in struct cgroup_subsys. Several functions that might be wanted by modules have had EXPORT_SYMBOL added to them, but it's unclear exactly which functions want it and which won't. Signed-off-by: Ben Blum Acked-by: Li Zefan Cc: Paul Menage Cc: "David S. Miller" Cc: KAMEZAWA Hiroyuki Cc: Lai Jiangshan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/cgroups.txt | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index d45082653e3d..ae8a037a761e 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -488,6 +488,10 @@ Each subsystem should: - add an entry in linux/cgroup_subsys.h - define a cgroup_subsys object called _subsys +If a subsystem can be compiled as a module, it should also have in its +module initcall a call to cgroup_load_subsys(&its_subsys_struct). It +should also set its_subsys.module = THIS_MODULE in its .c file. + Each subsystem may export the following methods. The only mandatory methods are create/destroy. Any others that are null are presumed to be successful no-ops. -- cgit v1.2.2 From cf5d5941fda647fe3d2f2d00cf9e0245236a5f08 Mon Sep 17 00:00:00 2001 From: Ben Blum Date: Wed, 10 Mar 2010 15:22:09 -0800 Subject: cgroups: subsystem module unloading Provides support for unloading modular subsystems. This patch adds a new function cgroup_unload_subsys which is to be used for removing a loaded subsystem during module deletion. Reference counting of the subsystems' modules is moved from once (at load time) to once per attached hierarchy (in parse_cgroupfs_options and rebind_subsystems) (i.e., 0 or 1). Signed-off-by: Ben Blum Acked-by: Li Zefan Cc: Paul Menage Cc: "David S. Miller" Cc: KAMEZAWA Hiroyuki Cc: Lai Jiangshan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/cgroups.txt | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index ae8a037a761e..764007b63921 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -489,8 +489,9 @@ Each subsystem should: - define a cgroup_subsys object called _subsys If a subsystem can be compiled as a module, it should also have in its -module initcall a call to cgroup_load_subsys(&its_subsys_struct). It -should also set its_subsys.module = THIS_MODULE in its .c file. +module initcall a call to cgroup_load_subsys(), and in its exitcall a +call to cgroup_unload_subsys(). It should also set its_subsys.module = +THIS_MODULE in its .c file. Each subsystem may export the following methods. The only mandatory methods are create/destroy. Any others that are null are presumed to -- cgit v1.2.2 From 8ca712ea84728531d36841ca8f98f9e8680bcf4e Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Wed, 10 Mar 2010 15:22:10 -0800 Subject: cgroups: fix CONTENTS in cgroups documentation Add a forgotten item into CONTENTS. Signed-off-by: Kirill A. Shutemov Acked-by: Paul Menage Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/cgroups.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index 764007b63921..c0358c30c64f 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -22,6 +22,7 @@ CONTENTS: 2. Usage Examples and Syntax 2.1 Basic Usage 2.2 Attaching processes + 2.3 Mounting hierarchies by name 3. Kernel API 3.1 Overview 3.2 Synchronization -- cgit v1.2.2 From 7dc74be032bfcaa2f9d9e4296ff5bbddfa9e2f19 Mon Sep 17 00:00:00 2001 From: Daisuke Nishimura Date: Wed, 10 Mar 2010 15:22:13 -0800 Subject: memcg: add interface to move charge at task migration In current memcg, charges associated with a task aren't moved to the new cgroup at task migration. Some users feel this behavior to be strange. These patches are for this feature, that is, for charging to the new cgroup and, of course, uncharging from the old cgroup at task migration. This patch adds "memory.move_charge_at_immigrate" file, which is a flag file to determine whether charges should be moved to the new cgroup at task migration or not and what type of charges should be moved. This patch also adds read and write handlers of the file. This patch also adds no-op handlers for this feature. These handlers will be implemented in later patches. And you cannot write any values other than 0 to move_charge_at_immigrate yet. Signed-off-by: Daisuke Nishimura Cc: Balbir Singh Acked-by: KAMEZAWA Hiroyuki Cc: Li Zefan Cc: Paul Menage Cc: Daisuke Nishimura Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 56 ++++++++++++++++++++++++++++++++++++++-- 1 file changed, 54 insertions(+), 2 deletions(-) (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index b871f2552b45..e726fb0df719 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -262,10 +262,12 @@ some of the pages cached in the cgroup (page cache pages). 4.2 Task migration When a task migrates from one cgroup to another, it's charge is not -carried forward. The pages allocated from the original cgroup still +carried forward by default. The pages allocated from the original cgroup still remain charged to it, the charge is dropped when the page is freed or reclaimed. +Note: You can move charges of a task along with task migration. See 8. + 4.3 Removing a cgroup A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a @@ -414,7 +416,57 @@ NOTE1: Soft limits take effect over a long period of time, since they involve NOTE2: It is recommended to set the soft limit always below the hard limit, otherwise the hard limit will take precedence. -8. TODO +8. Move charges at task migration + +Users can move charges associated with a task along with task migration, that +is, uncharge task's pages from the old cgroup and charge them to the new cgroup. + +8.1 Interface + +This feature is disabled by default. It can be enabled(and disabled again) by +writing to memory.move_charge_at_immigrate of the destination cgroup. + +If you want to enable it: + +# echo (some positive value) > memory.move_charge_at_immigrate + +Note: Each bits of move_charge_at_immigrate has its own meaning about what type + of charges should be moved. See 8.2 for details. +Note: Charges are moved only when you move mm->owner, IOW, a leader of a thread + group. +Note: If we cannot find enough space for the task in the destination cgroup, we + try to make space by reclaiming memory. Task migration may fail if we + cannot make enough space. +Note: It can take several seconds if you move charges in giga bytes order. + +And if you want disable it again: + +# echo 0 > memory.move_charge_at_immigrate + +8.2 Type of charges which can be move + +Each bits of move_charge_at_immigrate has its own meaning about what type of +charges should be moved. + + bit | what type of charges would be moved ? + -----+------------------------------------------------------------------------ + 0 | A charge of an anonymous page(or swap of it) used by the target task. + | Those pages and swaps must be used only by the target task. You must + | enable Swap Extension(see 2.4) to enable move of swap charges. + +Note: Those pages and swaps must be charged to the old cgroup. +Note: More type of pages(e.g. file cache, shmem,) will be supported by other + bits in future. + +8.3 TODO + +- Add support for other types of pages(e.g. file cache, shmem, etc.). +- Implement madvise(2) to let users decide the vma to be moved or not to be + moved. +- All of moving charge operations are done under cgroup_mutex. It's not good + behavior to hold the mutex too long, so we may need some trick. + +9. TODO 1. Add support for accounting huge pages (as a separate controller) 2. Make per-cgroup scanner reclaim not-shared pages first -- cgit v1.2.2 From 024914477e15ef8b17f271ec47f1bb8a589f0806 Mon Sep 17 00:00:00 2001 From: Daisuke Nishimura Date: Wed, 10 Mar 2010 15:22:17 -0800 Subject: memcg: move charges of anonymous swap This patch is another core part of this move-charge-at-task-migration feature. It enables moving charges of anonymous swaps. To move the charge of swap, we need to exchange swap_cgroup's record. In current implementation, swap_cgroup's record is protected by: - page lock: if the entry is on swap cache. - swap_lock: if the entry is not on swap cache. This works well in usual swap-in/out activity. But this behavior make the feature of moving swap charge check many conditions to exchange swap_cgroup's record safely. So I changed modification of swap_cgroup's recored(swap_cgroup_record()) to use xchg, and define a new function to cmpxchg swap_cgroup's record. This patch also enables moving charge of non pte_present but not uncharged swap caches, which can be exist on swap-out path, by getting the target pages via find_get_page() as do_mincore() does. [kosaki.motohiro@jp.fujitsu.com: fix ia64 build] [akpm@linux-foundation.org: fix typos] Signed-off-by: Daisuke Nishimura Cc: Balbir Singh Acked-by: KAMEZAWA Hiroyuki Cc: Li Zefan Cc: Paul Menage Cc: Daisuke Nishimura Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 2 ++ 1 file changed, 2 insertions(+) (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index e726fb0df719..1f59a1a38bd9 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -420,6 +420,8 @@ NOTE2: It is recommended to set the soft limit always below the hard limit, Users can move charges associated with a task along with task migration, that is, uncharge task's pages from the old cgroup and charge them to the new cgroup. +This feature is not supported in !CONFIG_MMU environments because of lack of +page tables. 8.1 Interface -- cgit v1.2.2 From 0dea116876eefc9c7ca9c5d74fe665481e499fa3 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Wed, 10 Mar 2010 15:22:20 -0800 Subject: cgroup: implement eventfd-based generic API for notifications This patchset introduces eventfd-based API for notifications in cgroups and implements memory notifications on top of it. It uses statistics in memory controler to track memory usage. Output of time(1) on building kernel on tmpfs: Root cgroup before changes: make -j2 506.37 user 60.93s system 193% cpu 4:52.77 total Non-root cgroup before changes: make -j2 507.14 user 62.66s system 193% cpu 4:54.74 total Root cgroup after changes (0 thresholds): make -j2 507.13 user 62.20s system 193% cpu 4:53.55 total Non-root cgroup after changes (0 thresholds): make -j2 507.70 user 64.20s system 193% cpu 4:55.70 total Root cgroup after changes (1 thresholds, never crossed): make -j2 506.97 user 62.20s system 193% cpu 4:53.90 total Non-root cgroup after changes (1 thresholds, never crossed): make -j2 507.55 user 64.08s system 193% cpu 4:55.63 total This patch: Introduce the write-only file "cgroup.event_control" in every cgroup. To register new notification handler you need: - create an eventfd; - open a control file to be monitored. Callbacks register_event() and unregister_event() must be defined for the control file; - write " " to cgroup.event_control. Interpretation of args is defined by control file implementation; eventfd will be woken up by control file implementation or when the cgroup is removed. To unregister notification handler just close eventfd. If you need notification functionality for a control file you have to implement callbacks register_event() and unregister_event() in the struct cftype. [kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix] Signed-off-by: Kirill A. Shutemov Reviewed-by: KAMEZAWA Hiroyuki Paul Menage Cc: Li Zefan Cc: Balbir Singh Cc: Pavel Emelyanov Cc: Dan Malek Cc: Vladislav Buzov Cc: Daisuke Nishimura Cc: Alexander Shishkin Cc: Davide Libenzi Signed-off-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/cgroups.txt | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index c0358c30c64f..fd588ff0e296 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -23,6 +23,7 @@ CONTENTS: 2.1 Basic Usage 2.2 Attaching processes 2.3 Mounting hierarchies by name + 2.4 Notification API 3. Kernel API 3.1 Overview 3.2 Synchronization @@ -435,6 +436,25 @@ you give a subsystem a name. The name of the subsystem appears as part of the hierarchy description in /proc/mounts and /proc//cgroups. +2.4 Notification API +-------------------- + +There is mechanism which allows to get notifications about changing +status of a cgroup. + +To register new notification handler you need: + - create a file descriptor for event notification using eventfd(2); + - open a control file to be monitored (e.g. memory.usage_in_bytes); + - write " " to cgroup.event_control. + Interpretation of args is defined by control file implementation; + +eventfd will be woken up by control file implementation or when the +cgroup is removed. + +To unregister notification handler just close eventfd. + +NOTE: Support of notifications should be implemented for the control +file. See documentation for the subsystem. 3. Kernel API ============= -- cgit v1.2.2 From 2e72b6347c9459e6cff5634ddc815485bae6985f Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Wed, 10 Mar 2010 15:22:24 -0800 Subject: memcg: implement memory thresholds It allows to register multiple memory and memsw thresholds and gets notifications when it crosses. To register a threshold application need: - create an eventfd; - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; - write string like " " to cgroup.event_control. Application will be notified through eventfd when memory usage crosses threshold in any direction. It's applicable for root and non-root cgroup. It uses stats to track memory usage, simmilar to soft limits. It checks if we need to send event to userspace on every 100 page in/out. I guess it's good compromise between performance and accuracy of thresholds. [akpm@linux-foundation.org: coding-style fixes] [nishimura@mxp.nes.nec.co.jp: fix documentation merge issue] Signed-off-by: Kirill A. Shutemov Cc: Li Zefan Cc: KAMEZAWA Hiroyuki Cc: Balbir Singh Cc: Pavel Emelyanov Cc: Dan Malek Cc: Vladislav Buzov Cc: Daisuke Nishimura Cc: Alexander Shishkin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 1f59a1a38bd9..268ab08222dd 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -468,7 +468,24 @@ Note: More type of pages(e.g. file cache, shmem,) will be supported by other - All of moving charge operations are done under cgroup_mutex. It's not good behavior to hold the mutex too long, so we may need some trick. -9. TODO +9. Memory thresholds + +Memory controler implements memory thresholds using cgroups notification +API (see cgroups.txt). It allows to register multiple memory and memsw +thresholds and gets notifications when it crosses. + +To register a threshold application need: + - create an eventfd using eventfd(2); + - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; + - write string like " " to + cgroup.event_control. + +Application will be notified through eventfd when memory usage crosses +threshold in any direction. + +It's applicable for root and non-root cgroup. + +10. TODO 1. Add support for accounting huge pages (as a separate controller) 2. Make per-cgroup scanner reclaim not-shared pages first -- cgit v1.2.2 From 1080d7a30304d03b1d9fd530aacd8aece2d702a2 Mon Sep 17 00:00:00 2001 From: Daisuke Nishimura Date: Wed, 10 Mar 2010 15:22:31 -0800 Subject: memcg: update memcg_test.txt Update memcg_test.txt to describe how to test the move-charge feature. Signed-off-by: Daisuke Nishimura Acked-by: KAMEZAWA Hiroyuki Acked-by: Balbir Singh Cc: "Kirill A. Shutemov" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memcg_test.txt | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-) (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt index 72db89ed0609..e01148886b5b 100644 --- a/Documentation/cgroups/memcg_test.txt +++ b/Documentation/cgroups/memcg_test.txt @@ -1,6 +1,6 @@ Memory Resource Controller(Memcg) Implementation Memo. -Last Updated: 2009/1/20 -Base Kernel Version: based on 2.6.29-rc2. +Last Updated: 2010/2 +Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34). Because VM is getting complex (one of reasons is memcg...), memcg's behavior is complex. This is a document for memcg's internal behavior. @@ -378,3 +378,21 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. #echo 50M > memory.limit_in_bytes #echo 50M > memory.memsw.limit_in_bytes run 51M of malloc + + 9.9 Move charges at task migration + Charges associated with a task can be moved along with task migration. + + (Shell-A) + #mkdir /cgroup/A + #echo $$ >/cgroup/A/tasks + run some programs which uses some amount of memory in /cgroup/A. + + (Shell-B) + #mkdir /cgroup/B + #echo 1 >/cgroup/B/memory.move_charge_at_immigrate + #echo "pid of the program running in group A" >/cgroup/B/tasks + + You can see charges have been moved by reading *.usage_in_bytes or + memory.stat of both A and B. + See 8.2 of Documentation/cgroups/memory.txt to see what value should be + written to move_charge_at_immigrate. -- cgit v1.2.2 From daaf1e68874c078a15ae6ae827751839c4d81739 Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Wed, 10 Mar 2010 15:22:32 -0800 Subject: memcg: handle panic_on_oom=always case Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: KAMEZAWA Hiroyuki Cc: Balbir Singh Cc: David Rientjes Cc: Nick Piggin Reviewed-by: Daisuke Nishimura Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 268ab08222dd..f8bc802d70b9 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -182,6 +182,8 @@ list. NOTE: Reclaim does not work for the root cgroup, since we cannot set any limits on the root cgroup. +Note2: When panic_on_oom is set to "2", the whole system will panic. + 2. Locking The memory controller uses the following hierarchy @@ -379,7 +381,8 @@ The feature can be disabled by NOTE1: Enabling/disabling will fail if the cgroup already has other cgroups created below it. -NOTE2: This feature can be enabled/disabled per subtree. +NOTE2: When panic_on_oom is set to "2", the whole system will panic in +case of an oom event in any cgroup. 7. Soft limits -- cgit v1.2.2 From 1d8fd973a41942bc760dd21581e24117bc1dd063 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Wed, 10 Mar 2010 15:22:35 -0800 Subject: cgroups: add simple listener of cgroup events to documentation An example of cgroup notification API usage. Signed-off-by: Kirill A. Shutemov Reviewed-by: KAMEZAWA Hiroyuki Cc: Paul Menage Cc: Li Zefan Cc: Balbir Singh Cc: Pavel Emelyanov Cc: Dan Malek Cc: Daisuke Nishimura Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/cgroup_event_listener.c | 110 ++++++++++++++++++++++++++ 1 file changed, 110 insertions(+) create mode 100644 Documentation/cgroups/cgroup_event_listener.c (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/cgroup_event_listener.c b/Documentation/cgroups/cgroup_event_listener.c new file mode 100644 index 000000000000..8c2bfc4a6358 --- /dev/null +++ b/Documentation/cgroups/cgroup_event_listener.c @@ -0,0 +1,110 @@ +/* + * cgroup_event_listener.c - Simple listener of cgroup events + * + * Copyright (C) Kirill A. Shutemov + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#define USAGE_STR "Usage: cgroup_event_listener \n" + +int main(int argc, char **argv) +{ + int efd = -1; + int cfd = -1; + int event_control = -1; + char event_control_path[PATH_MAX]; + char line[LINE_MAX]; + int ret; + + if (argc != 3) { + fputs(USAGE_STR, stderr); + return 1; + } + + cfd = open(argv[1], O_RDONLY); + if (cfd == -1) { + fprintf(stderr, "Cannot open %s: %s\n", argv[1], + strerror(errno)); + goto out; + } + + ret = snprintf(event_control_path, PATH_MAX, "%s/cgroup.event_control", + dirname(argv[1])); + if (ret >= PATH_MAX) { + fputs("Path to cgroup.event_control is too long\n", stderr); + goto out; + } + + event_control = open(event_control_path, O_WRONLY); + if (event_control == -1) { + fprintf(stderr, "Cannot open %s: %s\n", event_control_path, + strerror(errno)); + goto out; + } + + efd = eventfd(0, 0); + if (efd == -1) { + perror("eventfd() failed"); + goto out; + } + + ret = snprintf(line, LINE_MAX, "%d %d %s", efd, cfd, argv[2]); + if (ret >= LINE_MAX) { + fputs("Arguments string is too long\n", stderr); + goto out; + } + + ret = write(event_control, line, strlen(line) + 1); + if (ret == -1) { + perror("Cannot write to cgroup.event_control"); + goto out; + } + + while (1) { + uint64_t result; + + ret = read(efd, &result, sizeof(result)); + if (ret == -1) { + if (errno == EINTR) + continue; + perror("Cannot read from eventfd"); + break; + } + assert(ret == sizeof(result)); + + ret = access(event_control_path, W_OK); + if ((ret == -1) && (errno == ENOENT)) { + puts("The cgroup seems to have removed."); + ret = 0; + break; + } + + if (ret == -1) { + perror("cgroup.event_control " + "is not accessable any more"); + break; + } + + printf("%s %s: crossed\n", argv[1], argv[2]); + } + +out: + if (efd >= 0) + close(efd); + if (event_control >= 0) + close(event_control); + if (cfd >= 0) + close(cfd); + + return (ret != 0); +} -- cgit v1.2.2 From 1e111452d457ceedec8a57fc3d45b1312e736fba Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Wed, 10 Mar 2010 15:22:36 -0800 Subject: memcg: update memcg_test.txt to describe memory thresholds Decription of sanity check for memory thresholds. Signed-off-by: Kirill A. Shutemov Acked-by: KAMEZAWA Hiroyuki Cc: Paul Menage Cc: Li Zefan Cc: Balbir Singh Cc: Pavel Emelyanov Cc: Dan Malek Cc: Daisuke Nishimura Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memcg_test.txt | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt index e01148886b5b..4d32e0ed1917 100644 --- a/Documentation/cgroups/memcg_test.txt +++ b/Documentation/cgroups/memcg_test.txt @@ -396,3 +396,24 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. memory.stat of both A and B. See 8.2 of Documentation/cgroups/memory.txt to see what value should be written to move_charge_at_immigrate. + + 9.10 Memory thresholds + Memory controler implements memory thresholds using cgroups notification + API. You can use Documentation/cgroups/cgroup_event_listener.c to test + it. + + (Shell-A) Create cgroup and run event listener + # mkdir /cgroup/A + # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M + + (Shell-B) Add task to cgroup and try to allocate and free memory + # echo $$ >/cgroup/A/tasks + # a="$(dd if=/dev/zero bs=1M count=10)" + # a= + + You will see message from cgroup_event_listener every time you cross + the thresholds. + + Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds. + + It's good idea to test root cgroup as well. -- cgit v1.2.2 From 0263c12c12ccc90edc9d856fa839f8936183e6d1 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Wed, 10 Mar 2010 15:22:37 -0800 Subject: memcg: fix typos in memcg_test.txt Signed-off-by: Kirill A. Shutemov Reviewed-by: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memcg_test.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation/cgroups') diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt index 4d32e0ed1917..f7f68b2ac199 100644 --- a/Documentation/cgroups/memcg_test.txt +++ b/Documentation/cgroups/memcg_test.txt @@ -337,7 +337,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. race and lock dependency with other cgroup subsystems. example) - # mount -t cgroup none /cgroup -t cpuset,memory,cpu,devices + # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices and do task move, mkdir, rmdir etc...under this. @@ -348,7 +348,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. For example, test like following is good. (Shell-A) - # mount -t cgroup none /cgroup -t memory + # mount -t cgroup none /cgroup -o memory # mkdir /cgroup/test # echo 40M > /cgroup/test/memory.limit_in_bytes # echo 0 > /cgroup/test/tasks -- cgit v1.2.2