Merge branch 'wip-2.6.34' into old-private-master

author: Andrea Bastoni <bastoni@cs.unc.edu> 2010-05-30 19:16:45 -0400
committer: Andrea Bastoni <bastoni@cs.unc.edu> 2010-05-30 19:16:45 -0400
commit: ada47b5fe13d89735805b566185f4885f5a3f750
tree: 644b88f8a71896307d71438e9b3af49126ffb22b /Documentation/cgroups
parent: 43e98717ad40a4ae64545b5ba047c7b86aa44f4f
parent: 3280f21d43ee541f97f8cda5792150d2dbec20d5
6 files changed, 470 insertions, 73 deletions
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
new file mode 100644
index 000000000000..630879cd9a42
--- /dev/null
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -0,0 +1,135 @@
+                                Block IO Controller
+                                ===================
+Overview
+========
+cgroup subsys "blkio" implements the block io controller. There seems to be
+a need of various kinds of IO control policies (like proportional BW, max BW)
+both at leaf nodes as well as at intermediate nodes in a storage hierarchy.
+Plan is to use the same cgroup based management interface for blkio controller
+and based on user options switch IO policies in the background.
+In the first phase, this patchset implements proportional weight time based
+division of disk policy. It is implemented in CFQ. Hence this policy takes
+effect only on leaf nodes when CFQ is being used.
+HOWTO
+=====
+You can do a very simple test by running two dd threads in two different
+cgroups. Here is what you can do.
+- Enable group scheduling in CFQ
+        CONFIG_CFQ_GROUP_IOSCHED=y
+- Compile and boot into kernel and mount IO controller (blkio).
+        mount -t cgroup -o blkio none /cgroup
+- Create two cgroups
+        mkdir -p /cgroup/test1/ /cgroup/test2
+- Set weights of group test1 and test2
+        echo 1000 > /cgroup/test1/blkio.weight
+        echo 500 > /cgroup/test2/blkio.weight
+- Create two files of the same size (say 512MB each) on the same disk
+  (zerofile1, zerofile2) and launch two dd threads in different cgroups to
+  read those files.
+        sync
+        echo 3 > /proc/sys/vm/drop_caches
+        dd if=/mnt/sdb/zerofile1 of=/dev/null &
+        echo $! > /cgroup/test1/tasks
+        cat /cgroup/test1/tasks
+        dd if=/mnt/sdb/zerofile2 of=/dev/null &
+        echo $! > /cgroup/test2/tasks
+        cat /cgroup/test2/tasks
+- At macro level, first dd should finish first. To get more precise data, keep
+  looking (with the help of a script) at the blkio.time and blkio.sectors
+  files of both test1 and test2 groups. This will tell how much disk time (in
+  milliseconds) each group got and how many sectors each group dispatched to
+  the disk. We provide fairness in terms of disk time, so ideally blkio.time
+  of the cgroups should be in proportion to the weight.
+Various user visible config options
+===================================
+CONFIG_CFQ_GROUP_IOSCHED
+        - Enables group scheduling in CFQ. Currently only 1 level of group
+          creation is allowed.
+CONFIG_DEBUG_CFQ_IOSCHED
+        - Enables some debugging messages in blktrace. Also creates extra
+          cgroup file blkio.dequeue.
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configuration.
+CONFIG_BLK_CGROUP
+        - Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED.
+CONFIG_DEBUG_BLK_CGROUP
+        - Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED.
+Details of cgroup files
+=======================
+- blkio.weight
+        - Specifies per cgroup weight.
+          Currently allowed range of weights is from 100 to 1000.
+- blkio.time
+        - disk time allocated to cgroup per device in milliseconds. First
+          two fields specify the major and minor number of the device and
+          third field specifies the disk time allocated to group in
+          milliseconds.
+- blkio.sectors
+        - number of sectors transferred to/from disk by the group. First
+          two fields specify the major and minor number of the device and
+          third field specifies the number of sectors transferred by the
+          group to/from the device.
+- blkio.dequeue
+        - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
+          gives the statistics about how many times a group was dequeued
+          from service tree of the device. First two fields specify the major
+          and minor number of the device and third field specifies the number
+          of times a group was dequeued from a particular device.
+CFQ sysfs tunable
+=================
+/sys/block/<disk>/queue/iosched/group_isolation
+If group_isolation=1, it provides stronger isolation between groups at the
+expense of throughput. By default group_isolation is 0. In general that
+means that if group_isolation=0, expect fairness for sequential workload
+only. Set group_isolation=1 to see fairness for random IO workload also.
+Generally CFQ will put random seeky workload in sync-noidle category. CFQ
+will disable idling on these queues and it does a collective idling on group
+of such queues. Generally these are slow moving queues and if there is a
+sync-noidle service tree in each group, that group gets exclusive access to
+disk for certain period. That means it will bring the throughput down if
+group does not have enough IO to drive deeper queue depths and utilize disk
+capacity to the fullest in the slice allocated to it. But the flip side is
+that even a random reader should get better latencies and overall throughput
+if there are lots of sequential readers/sync-idle workload running in the
+system.
+If group_isolation=0, then CFQ automatically moves all the random seeky queues
+into the root group. That means there will be no service differentiation for
+that kind of workload. This leads to better throughput as we do collective
+idling on root sync-noidle tree.
+By default one should run with group_isolation=0. If that is not sufficient
+and one wants stronger isolation between groups, then set group_isolation=1
+but this will come at the cost of reduced throughput.
+What works
+==========
+- Currently only sync IO queues are supported. All the buffered writes are
+  still system wide and not per group. Hence we will not see service
+  differentiation for buffered writes between groups.
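As a rough check on the HOWTO in blkio-controller.txt above, the following shell sketch computes the share of disk time each group would ideally receive from its blkio.weight. The function name expected_share_pct is hypothetical and not part of the kernel tree. It uses integer arithmetic, so results are truncated, and it assumes both groups stay continuously backlogged, which a real run will only approximate.

```shell
#!/bin/sh
# Hypothetical helper, not part of the kernel tree: ideal disk-time share
# (integer percent, truncated) of a group with weight $1 competing against
# one other group with weight $2.
expected_share_pct() {
        echo $(( $1 * 100 / ($1 + $2) ))
}

# Weights from the HOWTO: test1=1000, test2=500.
expected_share_pct 1000 500     # share for test1
expected_share_pct 500 1000     # share for test2
```

Comparing these figures with the third field of blkio.time for each group gives an idea of how closely the observed division follows the configured weights.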
diff --git a/Documentation/cgroups/cgroup_event_listener.c b/Documentation/cgroups/cgroup_event_listener.c
new file mode 100644
index 000000000000..8c2bfc4a6358
--- /dev/null
+++ b/Documentation/cgroups/cgroup_event_listener.c
@@ -0,0 +1,110 @@
+/*
+ * cgroup_event_listener.c - Simple listener of cgroup events
+ *
+ * Copyright (C) Kirill A. Shutemov <kirill@shutemov.name>
+ */
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <libgen.h>
+#include <limits.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/eventfd.h>
+#define USAGE_STR "Usage: cgroup_event_listener <path-to-control-file> <args>\n"
+int main(int argc, char **argv)
+{
+        int efd = -1;
+        int cfd = -1;
+        int event_control = -1;
+        char event_control_path[PATH_MAX];
+        char line[LINE_MAX];
+        int ret = -1;
+        if (argc != 3) {
+                fputs(USAGE_STR, stderr);
+                return 1;
+        }
+        cfd = open(argv[1], O_RDONLY);
+        if (cfd == -1) {
+                fprintf(stderr, "Cannot open %s: %s\n", argv[1],
+                                strerror(errno));
+                goto out;
+        }
+        ret = snprintf(event_control_path, PATH_MAX, "%s/cgroup.event_control",
+                        dirname(argv[1]));
+        if (ret >= PATH_MAX) {
+                fputs("Path to cgroup.event_control is too long\n", stderr);
+                goto out;
+        }
+        event_control = open(event_control_path, O_WRONLY);
+        if (event_control == -1) {
+                fprintf(stderr, "Cannot open %s: %s\n", event_control_path,
+                                strerror(errno));
+                goto out;
+        }
+        efd = eventfd(0, 0);
+        if (efd == -1) {
+                perror("eventfd() failed");
+                goto out;
+        }
+        ret = snprintf(line, LINE_MAX, "%d %d %s", efd, cfd, argv[2]);
+        if (ret >= LINE_MAX) {
+                fputs("Arguments string is too long\n", stderr);
+                goto out;
+        }
+        ret = write(event_control, line, strlen(line) + 1);
+        if (ret == -1) {
+                perror("Cannot write to cgroup.event_control");
+                goto out;
+        }
+        while (1) {
+                uint64_t result;
+                ret = read(efd, &result, sizeof(result));
+                if (ret == -1) {
+                        if (errno == EINTR)
+                                continue;
+                        perror("Cannot read from eventfd");
+                        break;
+                }
+                assert(ret == sizeof(result));
+                ret = access(event_control_path, W_OK);
+                if ((ret == -1) && (errno == ENOENT)) {
+                        puts("The cgroup seems to have been removed.");
+                        ret = 0;
+                        break;
+                }
+                if (ret == -1) {
+                        perror("cgroup.event_control "
+                                        "is not accessible any more");
+                        break;
+                }
+                printf("%s %s: crossed\n", argv[1], argv[2]);
+        }
+out:
+        if (efd >= 0)
+                close(efd);
+        if (event_control >= 0)
+                close(event_control);
+        if (cfd >= 0)
+                close(cfd);
+        return (ret != 0);
+}
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 0b33bfe7dde9..a1ca5924faff 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -22,6 +22,8 @@ CONTENTS:
 2. Usage Examples and Syntax
  2.1 Basic Usage
  2.2 Attaching processes
+  2.3 Mounting hierarchies by name
+  2.4 Notification API
 3. Kernel API
  3.1 Overview
  3.2 Synchronization
@@ -233,8 +235,7 @@ containing the following files describing that cgroup:
 - cgroup.procs: list of tgids in the cgroup.  This list is not
   guaranteed to be sorted or free of duplicate tgids, and userspace
   should sort/uniquify the list if this property is required.
-   Writing a tgid into this file moves all threads with that tgid into
+   This is a read-only file, for now.
-   this cgroup.
 - notify_on_release flag: run the release agent on exit?
 - release_agent: the path to use for release notifications (this file
   exists in the top cgroup only)
@@ -434,6 +435,25 @@ you give a subsystem a name.
 The name of the subsystem appears as part of the hierarchy description
 in /proc/mounts and /proc/<pid>/cgroups.
+2.4 Notification API
+--------------------
+There is a mechanism which allows getting notifications about the changing
+status of a cgroup.
+To register a new notification handler you need to:
+ - create a file descriptor for event notification using eventfd(2);
+ - open a control file to be monitored (e.g. memory.usage_in_bytes);
+ - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
+   Interpretation of args is defined by the control file implementation.
+The eventfd will be woken up by the control file implementation or when the
+cgroup is removed.
+To unregister a notification handler, just close the eventfd.
+NOTE: Support of notifications should be implemented for the control
+file. See documentation for the subsystem.
 3. Kernel API
 =============
@@ -488,6 +508,11 @@ Each subsystem should:
 - add an entry in linux/cgroup_subsys.h
 - define a cgroup_subsys object called <name>_subsys
+If a subsystem can be compiled as a module, it should also have in its
+module initcall a call to cgroup_load_subsys(), and in its exitcall a
+call to cgroup_unload_subsys(). It should also set <name>_subsys.module =
+THIS_MODULE in its .c file.
 Each subsystem may export the following methods. The only mandatory
 methods are create/destroy. Any others that are null are presumed to
 be successful no-ops.
@@ -536,10 +561,21 @@ returns an error, this will abort the attach operation.  If a NULL
 task is passed, then a successful result indicates that *any*
 unspecified task can be moved into the cgroup. Note that this isn't
 called on a fork. If this method returns 0 (success) then this should
-remain valid while the caller holds cgroup_mutex. If threadgroup is
+remain valid while the caller holds cgroup_mutex and it is ensured that either
+attach() or cancel_attach() will be called in future. If threadgroup is
 true, then a successful result indicates that all threads in the given
 thread's threadgroup can be moved together.
+void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+               struct task_struct *task, bool threadgroup)
+(cgroup_mutex held by caller)
+Called when a task attach operation has failed after can_attach() has succeeded.
+A subsystem whose can_attach() has some side-effects should provide this
+function, so that the subsystem can implement a rollback. Otherwise it is not
+necessary. This will be called only for subsystems whose can_attach() operation
+has succeeded.
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
            struct cgroup *old_cgrp, struct task_struct *task,
            bool threadgroup)
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 1d7e9784439a..4160df82b3f5 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -168,20 +168,20 @@ Each cpuset is represented by a directory in the cgroup file system
 containing (on top of the standard cgroup files) the following
 files describing that cpuset:
- - cpus: list of CPUs in that cpuset
+ - cpuset.cpus: list of CPUs in that cpuset
- - mems: list of Memory Nodes in that cpuset
+ - cpuset.mems: list of Memory Nodes in that cpuset
- - memory_migrate flag: if set, move pages to cpusets nodes
+ - cpuset.memory_migrate flag: if set, move pages to cpusets nodes
- - cpu_exclusive flag: is cpu placement exclusive?
+ - cpuset.cpu_exclusive flag: is cpu placement exclusive?
- - mem_exclusive flag: is memory placement exclusive?
+ - cpuset.mem_exclusive flag: is memory placement exclusive?
- - mem_hardwall flag:  is memory allocation hardwalled
+ - cpuset.mem_hardwall flag:  is memory allocation hardwalled
- - memory_pressure: measure of how much paging pressure in cpuset
+ - cpuset.memory_pressure: measure of how much paging pressure in cpuset
- - memory_spread_page flag: if set, spread page cache evenly on allowed nodes
+ - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
- - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
+ - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
- - sched_load_balance flag: if set, load balance within CPUs on that cpuset
+ - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
- - sched_relax_domain_level: the searching range when migrating tasks
+ - cpuset.sched_relax_domain_level: the searching range when migrating tasks
 In addition, the root cpuset only has the following file:
- - memory_pressure_enabled flag: compute memory_pressure?
+ - cpuset.memory_pressure_enabled flag: compute memory_pressure?
 New cpusets are created using the mkdir system call or shell
 command.  The properties of a cpuset, such as its flags, allowed
@@ -229,7 +229,7 @@ If a cpuset is cpu or mem exclusive, no other cpuset, other than
 a direct ancestor or descendant, may share any of the same CPUs or
 Memory Nodes.
-A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled",
+A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
 i.e. it restricts kernel allocations for page, buffer and other data
 commonly shared by the kernel across multiple users.  All cpusets,
 whether hardwalled or not, restrict allocations of memory for user
@@ -304,15 +304,15 @@ times 1000.
 ---------------------------
 There are two boolean flag files per cpuset that control where the
 kernel allocates pages for the file system buffers and related in
-kernel data structures.  They are called 'memory_spread_page' and
+kernel data structures.  They are called 'cpuset.memory_spread_page' and
-'memory_spread_slab'.
+'cpuset.memory_spread_slab'.
-If the per-cpuset boolean flag file 'memory_spread_page' is set, then
+If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
 the kernel will spread the file system buffers (page cache) evenly
 over all the nodes that the faulting task is allowed to use, instead
 of preferring to put those pages on the node where the task is running.
-If the per-cpuset boolean flag file 'memory_spread_slab' is set,
+If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
 then the kernel will spread some file system related slab caches,
 such as for inodes and dentries evenly over all the nodes that the
 faulting task is allowed to use, instead of preferring to put those
@@ -337,21 +337,21 @@ their containing tasks memory spread settings.  If memory spreading
 is turned off, then the currently specified NUMA mempolicy once again
 applies to memory page allocations.
-Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
+Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
 files.  By default they contain "0", meaning that the feature is off
 for that cpuset.  If a "1" is written to that file, then that turns
 the named feature on.
 The implementation is simple.
-Setting the flag 'memory_spread_page' turns on a per-process flag
+Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
 PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
 joins that cpuset.  The page allocation calls for the page cache
 is modified to perform an inline check for this PF_SPREAD_PAGE task
 flag, and if set, a call to a new routine cpuset_mem_spread_node()
 returns the node to prefer for the allocation.
-Similarly, setting 'memory_spread_slab' turns on the flag
+Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
 PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
 pages from the node returned by cpuset_mem_spread_node().
@@ -404,24 +404,24 @@ the following two situations:
    system overhead on those CPUs, including avoiding task load
    balancing if that is not needed.
-When the per-cpuset flag "sched_load_balance" is enabled (the default
+When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
-setting), it requests that all the CPUs in that cpusets allowed 'cpus'
+setting), it requests that all the CPUs in that cpusets allowed 'cpuset.cpus'
 be contained in a single sched domain, ensuring that load balancing
 can move a task (not otherwised pinned, as by sched_setaffinity)
 from any CPU in that cpuset to any other.
-When the per-cpuset flag "sched_load_balance" is disabled, then the
+When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
 scheduler will avoid load balancing across the CPUs in that cpuset,
 --except-- in so far as is necessary because some overlapping cpuset
 has "sched_load_balance" enabled.
-So, for example, if the top cpuset has the flag "sched_load_balance"
+So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
 enabled, then the scheduler will have one sched domain covering all
-CPUs, and the setting of the "sched_load_balance" flag in any other
+CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
 cpusets won't matter, as we're already fully load balancing.
 Therefore in the above two situations, the top cpuset flag
-"sched_load_balance" should be disabled, and only some of the smaller,
+"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
 child cpusets have this flag enabled.
 When doing this, you don't usually want to leave any unpinned tasks in
@@ -433,7 +433,7 @@ scheduler might not consider the possibility of load balancing that
 task to that underused CPU.
 Of course, tasks pinned to a particular CPU can be left in a cpuset
-that disables "sched_load_balance" as those tasks aren't going anywhere
+that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
 else anyway.
 There is an impedance mismatch here, between cpusets and sched domains.
@@ -443,19 +443,19 @@ overlap and each CPU is in at most one sched domain.
 It is necessary for sched domains to be flat because load balancing
 across partially overlapping sets of CPUs would risk unstable dynamics
 that would be beyond our understanding.  So if each of two partially
-overlapping cpusets enables the flag 'sched_load_balance', then we
+overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
 form a single sched domain that is a superset of both.  We won't move
 a task to a CPU outside it cpuset, but the scheduler load balancing
 code might waste some compute cycles considering that possibility.
 This mismatch is why there is not a simple one-to-one relation
-between which cpusets have the flag "sched_load_balance" enabled,
+between which cpusets have the flag "cpuset.sched_load_balance" enabled,
 and the sched domain configuration.  If a cpuset enables the flag, it
 will get balancing across all its CPUs, but if it disables the flag,
 it will only be assured of no load balancing if no other overlapping
 cpuset enables the flag.
-If two cpusets have partially overlapping 'cpus' allowed, and only
+If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
 one of them has this flag enabled, then the other may find its
 tasks only partially load balanced, just on the overlapping CPUs.
 This is just the general case of the top_cpuset example given a few
@@ -468,23 +468,23 @@ load balancing to the other CPUs.
 1.7.1 sched_load_balance implementation details.
 ------------------------------------------------
-The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
+The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
 to most cpuset flags.)  When enabled for a cpuset, the kernel will
 ensure that it can load balance across all the CPUs in that cpuset
 (makes sure that all the CPUs in the cpus_allowed of that cpuset are
 in the same sched domain.)
-If two overlapping cpusets both have 'sched_load_balance' enabled,
+If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
 then they will be (must be) both in the same sched domain.
-If, as is the default, the top cpuset has 'sched_load_balance' enabled,
+If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
 then by the above that means there is a single sched domain covering
 the whole system, regardless of any other cpuset settings.
 The kernel commits to user space that it will avoid load balancing
 where it can.  It will pick as fine a granularity partition of sched
 domains as it can while still providing load balancing for any set
-of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
+of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.
 The internal kernel cpuset to scheduler interface passes from the
 cpuset code to the scheduler code a partition of the load balanced
@@ -495,9 +495,9 @@ all the CPUs that must be load balanced.
 The cpuset code builds a new such partition and passes it to the
 scheduler sched domain setup code, to have the sched domains rebuilt
 as necessary, whenever:
- - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes,
+ - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
 - or CPUs come or go from a cpuset with this flag enabled,
- - or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs
+ - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
   and with this flag enabled changes,
 - or a cpuset with non-empty CPUs and with this flag enabled is removed,
 - or a cpu is offlined/onlined.
@@ -542,7 +542,7 @@ As the result, task B on CPU X need to wait task A or wait load balance
 on the next tick.  For some applications in special situation, waiting
 1 tick may be too long.
-The 'sched_relax_domain_level' file allows you to request changing
+The 'cpuset.sched_relax_domain_level' file allows you to request changing
 this searching range as you like.  This file takes int value which
 indicates size of searching range in levels ideally as follows,
 otherwise initial value -1 that indicates the cpuset has no request.
@@ -559,8 +559,8 @@ The system default is architecture dependent.  The system default
 can be changed using the relax_domain_level= boot parameter.
 This file is per-cpuset and affect the sched domain where the cpuset
-belongs to.  Therefore if the flag 'sched_load_balance' of a cpuset
+belongs to.  Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
-is disabled, then 'sched_relax_domain_level' have no effect since
+is disabled, then 'cpuset.sched_relax_domain_level' have no effect since
 there is no sched domain belonging the cpuset.
 If multiple cpusets are overlapping and hence they form a single sched
@@ -607,9 +607,9 @@ from one cpuset to another, then the kernel will adjust the tasks
 memory placement, as above, the next time that the kernel attempts
 to allocate a page of memory for that task.
-If a cpuset has its 'cpus' modified, then each task in that cpuset
+If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
 will have its allowed CPU placement changed immediately.  Similarly,
-if a tasks pid is written to another cpusets 'tasks' file, then its
+if a task's pid is written to another cpuset's 'tasks' file, then its
 allowed CPU placement is changed immediately.  If such a task had been
 bound to some subset of its cpuset using the sched_setaffinity() call,
 the task will be allowed to run on any CPU allowed in its new cpuset,
@@ -622,8 +622,8 @@ and the processor placement is updated immediately.
 Normally, once a page is allocated (given a physical page
 of main memory) then that page stays on whatever node it
 was allocated, so long as it remains allocated, even if the
-cpusets memory placement policy 'mems' subsequently changes.
+cpusets memory placement policy 'cpuset.mems' subsequently changes.
-If the cpuset flag file 'memory_migrate' is set true, then when
+If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
 tasks are attached to that cpuset, any pages that task had
 allocated to it on nodes in its previous cpuset are migrated
 to the tasks new cpuset. The relative placement of the page within
@@ -631,12 +631,12 @@ the cpuset is preserved during these migration operations if possible.
 For example if the page was on the second valid node of the prior cpuset
 then the page will be placed on the second valid node of the new cpuset.
-Also if 'memory_migrate' is set true, then if that cpusets
+Also if 'cpuset.memory_migrate' is set true, then if that cpusets
-'mems' file is modified, pages allocated to tasks in that
+'cpuset.mems' file is modified, pages allocated to tasks in that
-cpuset, that were on nodes in the previous setting of 'mems',
+cpuset, that were on nodes in the previous setting of 'cpuset.mems',
 will be moved to nodes in the new setting of 'mems.'
 Pages that were not in the tasks prior cpuset, or in the cpusets
-prior 'mems' setting, will not be moved.
+prior 'cpuset.mems' setting, will not be moved.
 There is an exception to the above.  If hotplug functionality is used
 to remove all the CPUs that are currently assigned to a cpuset,
@@ -678,8 +678,8 @@ and then start a subshell 'sh' in that cpuset:
  cd /dev/cpuset
  mkdir Charlie
  cd Charlie
-  /bin/echo 2-3 > cpus
+  /bin/echo 2-3 > cpuset.cpus
-  /bin/echo 1 > mems
+  /bin/echo 1 > cpuset.mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
@@ -725,10 +725,13 @@ Now you want to do something with this cpuset.
 In this directory you can find several files:
 # ls
-cpu_exclusive  memory_migrate      mems                      tasks
+cpuset.cpu_exclusive       cpuset.memory_spread_slab
-cpus           memory_pressure     notify_on_release
+cpuset.cpus                cpuset.mems
-mem_exclusive  memory_spread_page  sched_load_balance
+cpuset.mem_exclusive       cpuset.sched_load_balance
-mem_hardwall   memory_spread_slab  sched_relax_domain_level
+cpuset.mem_hardwall        cpuset.sched_relax_domain_level
+cpuset.memory_migrate      notify_on_release
+cpuset.memory_pressure     tasks
+cpuset.memory_spread_page
 Reading them will give you information about the state of this cpuset:
 the CPUs and Memory Nodes it can use, the processes that are using
@@ -736,13 +739,13 @@ it, its properties.  By writing to these files you can manipulate
 the cpuset.
 Set some flags:
-# /bin/echo 1 > cpu_exclusive
+# /bin/echo 1 > cpuset.cpu_exclusive
 Add some cpus:
-# /bin/echo 0-7 > cpus
+# /bin/echo 0-7 > cpuset.cpus
 Add some mems:
-# /bin/echo 0-7 > mems
+# /bin/echo 0-7 > cpuset.mems
 Now attach your shell to this cpuset:
 # /bin/echo $$ > tasks
@@ -774,28 +777,28 @@ echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
 This is the syntax to use when writing in the cpus or mems files
 in cpuset directories:
-# /bin/echo 1-4 > cpus          -> set cpus list to cpus 1,2,3,4
+# /bin/echo 1-4 > cpuset.cpus           -> set cpus list to cpus 1,2,3,4
-# /bin/echo 1,2,3,4 > cpus      -> set cpus list to cpus 1,2,3,4
+# /bin/echo 1,2,3,4 > cpuset.cpus       -> set cpus list to cpus 1,2,3,4
 To add a CPU to a cpuset, write the new list of CPUs including the
 CPU to be added. To add 6 to the above cpuset:
-# /bin/echo 1-4,6 > cpus        -> set cpus list to cpus 1,2,3,4,6
+# /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6
 Similarly to remove a CPU from a cpuset, write the new list of CPUs
 without the CPU to be removed.
 To remove all the CPUs:
-# /bin/echo "" > cpus           -> clear cpus list
+# /bin/echo "" > cpuset.cpus            -> clear cpus list
 2.3 Setting flags
 -----------------
 The syntax is very simple:
-# /bin/echo 1 > cpu_exclusive   -> set flag 'cpu_exclusive'
+# /bin/echo 1 > cpuset.cpu_exclusive    -> set flag 'cpuset.cpu_exclusive'
-# /bin/echo 0 > cpu_exclusive   -> unset flag 'cpu_exclusive'
+# /bin/echo 0 > cpuset.cpu_exclusive    -> unset flag 'cpuset.cpu_exclusive'
 2.4 Attaching processes
 -----------------------
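The list syntax accepted by cpuset.cpus and cpuset.mems, shown in the hunks above (for example "1-4,6"), mixes ranges and single numbers. The sketch below expands such a string into individual numbers, which can help when checking a list before writing it. The function name expand_list is hypothetical. It assumes a POSIX shell and well-formed input, and it does no validation against the CPUs actually present.

```shell
#!/bin/sh
# Hypothetical helper: expand a cpuset list string such as "1-4,6" into
# the individual CPU (or memory node) numbers, space separated.
expand_list() {
        out=""
        old_ifs=$IFS
        IFS=,                           # split the argument on commas
        for part in $1; do
                case $part in
                *-*)                    # a range such as 1-4
                        i=${part%-*}
                        end=${part#*-}
                        while [ "$i" -le "$end" ]; do
                                out="$out $i"
                                i=$((i + 1))
                        done
                        ;;
                *)                      # a single number
                        out="$out $part"
                        ;;
                esac
        done
        IFS=$old_ifs
        echo "${out# }"                 # drop the leading space
}

expand_list 1-4,6
expand_list 1,2,3,4
```

An empty argument yields an empty line, matching the "clear cpus list" case above.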
diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index 72db89ed0609..f7f68b2ac199 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -1,6 +1,6 @@
 Memory Resource Controller(Memcg)  Implementation Memo.
-Last Updated: 2009/1/20
+Last Updated: 2010/2
-Base Kernel Version: based on 2.6.29-rc2.
+Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).
 Because VM is getting complex (one of reasons is memcg...), memcg's behavior
 is complex. This is a document for memcg's internal behavior.
@@ -337,7 +337,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
        race and lock dependency with other cgroup subsystems.
        example)
-        # mount -t cgroup none /cgroup -t cpuset,memory,cpu,devices
+        # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
        and do task move, mkdir, rmdir etc...under this.
@@ -348,7 +348,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
        For example, test like following is good.
        (Shell-A)
-        # mount -t cgroup none /cgroup -t memory
+        # mount -t cgroup none /cgroup -o memory
        # mkdir /cgroup/test
        # echo 40M > /cgroup/test/memory.limit_in_bytes
        # echo 0 > /cgroup/test/tasks
@@ -378,3 +378,42 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
        #echo 50M > memory.limit_in_bytes
        #echo 50M > memory.memsw.limit_in_bytes
        run 51M of malloc
+ 9.9 Move charges at task migration
+        Charges associated with a task can be moved along with task migration.
+        (Shell-A)
+        #mkdir /cgroup/A
+        #echo $$ >/cgroup/A/tasks
+        run some programs which use some amount of memory in /cgroup/A.
+        (Shell-B)
+        #mkdir /cgroup/B
+        #echo 1 >/cgroup/B/memory.move_charge_at_immigrate
+        #echo "pid of the program running in group A" >/cgroup/B/tasks
+        You can see charges have been moved by reading *.usage_in_bytes or
+        memory.stat of both A and B.
+        See 8.2 of Documentation/cgroups/memory.txt to see what value should be
+        written to move_charge_at_immigrate.
+ 9.10 Memory thresholds
+        The memory controller implements memory thresholds using the cgroups
+        notification API. You can test it with
+        Documentation/cgroups/cgroup_event_listener.c.
+        (Shell-A) Create cgroup and run event listener
+        # mkdir /cgroup/A
+        # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M
+        (Shell-B) Add task to cgroup and try to allocate and free memory
+        # echo $$ >/cgroup/A/tasks
+        # a="$(dd if=/dev/zero bs=1M count=10)"
+        # a=
+        You will see a message from cgroup_event_listener every time you cross
+        the threshold.
+        Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.
+        It's a good idea to test the root cgroup as well.
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f2552b45..3a6aecd078ba 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -182,6 +182,8 @@ list.
 NOTE: Reclaim does not work for the root cgroup, since we cannot set any
 limits on the root cgroup.
+Note2: When panic_on_oom is set to "2", the whole system will panic.
 2. Locking
 The memory controller uses the following hierarchy
@@ -262,10 +264,12 @@ some of the pages cached in the cgroup (page cache pages).
 4.2 Task migration
 When a task migrates from one cgroup to another, its charge is not
-carried forward. The pages allocated from the original cgroup still
+carried forward by default. The pages allocated from the original cgroup still
 remain charged to it, the charge is dropped when the page is freed or
 reclaimed.
+Note: You can move charges of a task along with task migration. See 8.
 4.3 Removing a cgroup
 A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
@@ -336,7 +340,7 @@ Note:
 5.3 swappiness
  Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
-  Following cgroups' swapiness can't be changed.
+  Following cgroups' swappiness can't be changed.
  - root cgroup (uses /proc/sys/vm/swappiness).
  - a cgroup which uses hierarchy and it has child cgroup.
  - a cgroup which uses hierarchy and not the root of hierarchy.
@@ -377,7 +381,8 @@ The feature can be disabled by
 NOTE1: Enabling/disabling will fail if the cgroup already has other
 cgroups created below it.
-NOTE2: This feature can be enabled/disabled per subtree.
+NOTE2: When panic_on_oom is set to "2", the whole system will panic in
+case of an oom event in any cgroup.
 7. Soft limits
@@ -414,7 +419,76 @@ NOTE1: Soft limits take effect over a long period of time, since they involve
 NOTE2: It is recommended to set the soft limit always below the hard limit,
       otherwise the hard limit will take precedence.
-8. TODO
+8. Move charges at task migration
+Users can move charges associated with a task along with task migration, that
+is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
+This feature is not supported in !CONFIG_MMU environments because of lack of
+page tables.
+8.1 Interface
+This feature is disabled by default. It can be enabled (and disabled again) by
+writing to memory.move_charge_at_immigrate of the destination cgroup.
+If you want to enable it:
+# echo (some positive value) > memory.move_charge_at_immigrate
+Note: Each bit of move_charge_at_immigrate has its own meaning about what type
+      of charges should be moved. See 8.2 for details.
+Note: Charges are moved only when you move mm->owner, IOW, a leader of a thread
+      group.
+Note: If we cannot find enough space for the task in the destination cgroup, we
+      try to make space by reclaiming memory. Task migration may fail if we
+      cannot make enough space.
+Note: It can take several seconds if you move charges on the order of
+      gigabytes.
+And if you want to disable it again:
+# echo 0 > memory.move_charge_at_immigrate
+8.2 Types of charges which can be moved
+Each bit of move_charge_at_immigrate has its own meaning about what type of
+charges should be moved.
+  bit | what type of charges would be moved ?
+ -----+------------------------------------------------------------------------
+   0  | A charge of an anonymous page (or its swap) used by the target task.
+      | Those pages and swaps must be used only by the target task. You must
+      | enable Swap Extension (see 2.4) to enable moving of swap charges.
+Note: Those pages and swaps must be charged to the old cgroup.
+Note: More types of pages (e.g. file cache, shmem) will be supported by other
+      bits in the future.
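Since the value written to memory.move_charge_at_immigrate is a bitmask, it can help to build it from named bits rather than a bare number. The Python sketch below is illustrative only. The names MOVE_CHARGE_ANON and move_charge_value are hypothetical and do not exist in the kernel; only bit 0 is defined by the table above, so the helper rejects any other bit.

```python
# Hypothetical name for bit 0 from the table above:
# charges of anonymous pages (and their swap) used only by the target task.
MOVE_CHARGE_ANON = 0

# Only bit 0 is documented here; other bits are reserved for future types.
KNOWN_BITS = {MOVE_CHARGE_ANON}


def move_charge_value(*bits):
    """Combine bit numbers into the integer to write to the control file."""
    value = 0
    for bit in bits:
        if bit not in KNOWN_BITS:
            raise ValueError(f"bit {bit} is not documented")
        value |= 1 << bit
    return value
```

With no bits given the helper returns 0, which matches the value that disables the feature; move_charge_value(MOVE_CHARGE_ANON) gives the value that enables moving anonymous page charges.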
+8.3 TODO
+- Add support for other types of pages (e.g. file cache, shmem, etc.).
+- Implement madvise(2) to let users decide the vma to be moved or not to be
+  moved.
+- All charge-moving operations are done under cgroup_mutex. It's not good
+  behavior to hold the mutex too long, so we may need some trick.
+9. Memory thresholds
+The memory controller implements memory thresholds using the cgroups
+notification API (see cgroups.txt). An application can register multiple
+memory and memsw thresholds and is notified when usage crosses any of them.
+To register a threshold, an application needs to:
+ - create an eventfd using eventfd(2);
+ - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
+ - write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>"
+   to cgroup.event_control.
+The application will be notified through the eventfd when memory usage crosses
+the threshold in either direction.
+This applies to both root and non-root cgroups.
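The three registration steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the reference listener (that is cgroup_event_listener.c). It assumes Python 3.10 or later on Linux for os.eventfd, a threshold given in bytes, and that the second field of the control string is the file descriptor of the opened usage file. The function names event_control_line and wait_for_threshold are hypothetical.

```python
import os


def event_control_line(event_fd, usage_fd, threshold_bytes):
    """Build the string written to cgroup.event_control."""
    return f"{event_fd} {usage_fd} {threshold_bytes}"


def wait_for_threshold(cgroup_dir, threshold_bytes):
    """Register one threshold and block until usage crosses it once."""
    # Step 1: create the eventfd used for notification.
    event_fd = os.eventfd(0)
    # Step 2: open the usage file whose value is being watched.
    usage_fd = os.open(
        os.path.join(cgroup_dir, "memory.usage_in_bytes"), os.O_RDONLY
    )
    try:
        # Step 3: hand both descriptors and the threshold to the kernel.
        control = os.path.join(cgroup_dir, "cgroup.event_control")
        with open(control, "w") as ctl:
            ctl.write(event_control_line(event_fd, usage_fd, threshold_bytes))
        # Blocks until the kernel signals a crossing; returns the counter.
        return os.eventfd_read(event_fd)
    finally:
        os.close(usage_fd)
        os.close(event_fd)
```

A call such as wait_for_threshold("/cgroup/A", 5 * 1024 * 1024) would correspond to the 5M listener example in memcg_test.txt; the /cgroup/A path is taken from that example and must already exist.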
+10. TODO
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first