author | Li Zefan <lizf@cn.fujitsu.com> | 2009-01-15 16:50:59 -0500 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2009-01-15 19:39:37 -0500 |
commit | 45ce80fb6b6f9594d1396d44dd7e7c02d596fef8 (patch) | |
tree | 2409270f7073c08329ac01c82df0509a264af48c /Documentation/controllers | |
parent | 23964d2d02984d44aeb2d84d7ffb3359e728df43 (diff) |
cgroups: consolidate cgroup documents
Move Documentation/cpusets.txt and Documentation/controllers/* to
Documentation/cgroups/
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'Documentation/controllers')
-rw-r--r-- | Documentation/controllers/cpuacct.txt | 32 | ||||
-rw-r--r-- | Documentation/controllers/devices.txt | 52 | ||||
-rw-r--r-- | Documentation/controllers/memcg_test.txt | 342 | ||||
-rw-r--r-- | Documentation/controllers/memory.txt | 399 | ||||
-rw-r--r-- | Documentation/controllers/resource_counter.txt | 181 |
5 files changed, 0 insertions, 1006 deletions
diff --git a/Documentation/controllers/cpuacct.txt b/Documentation/controllers/cpuacct.txt deleted file mode 100644 index bb775fbe43d7..000000000000 --- a/Documentation/controllers/cpuacct.txt +++ /dev/null | |||
@@ -1,32 +0,0 @@ | |||
1 | CPU Accounting Controller | ||
2 | ------------------------- | ||
3 | |||
4 | The CPU accounting controller is used to group tasks using cgroups and | ||
5 | account the CPU usage of these groups of tasks. | ||
6 | |||
7 | The CPU accounting controller supports multi-hierarchy groups. An accounting | ||
8 | group accumulates the CPU usage of all of its child groups and the tasks | ||
9 | directly present in its group. | ||
10 | |||
11 | Accounting groups can be created by first mounting the cgroup filesystem. | ||
12 | |||
13 | # mkdir /cgroups | ||
14 | # mount -t cgroup -ocpuacct none /cgroups | ||
15 | |||
16 | With the above step, the initial or the parent accounting group | ||
17 | becomes visible at /cgroups. At bootup, this group includes all the | ||
18 | tasks in the system. /cgroups/tasks lists the tasks in this cgroup. | ||
19 | /cgroups/cpuacct.usage gives the CPU time (in nanoseconds) obtained by | ||
20 | this group which is essentially the CPU time obtained by all the tasks | ||
21 | in the system. | ||
22 | |||
23 | New accounting groups can be created under the parent group /cgroups. | ||
24 | |||
25 | # cd /cgroups | ||
26 | # mkdir g1 | ||
27 | # echo $$ > g1/tasks | ||
28 | |||
29 | The above steps create a new group g1 and move the current shell | ||
30 | process (bash) into it. CPU time consumed by this bash and its children | ||
31 | can be obtained from g1/cpuacct.usage and the same is accumulated in | ||
32 | /cgroups/cpuacct.usage also. | ||
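Since cpuacct.usage reports CPU time in nanoseconds, a small helper can make the values easier to read. The following is only a sketch: the function names `ns_to_sec` and `show_usage` are hypothetical, and the group path is assumed to follow the /cgroups mount point used in the text.

```shell
#!/bin/sh
# Convert a nanosecond count (as read from cpuacct.usage) to seconds
# with millisecond precision, using integer arithmetic only.
ns_to_sec() {
    ns=$1
    printf '%d.%03d\n' "$((ns / 1000000000))" "$(( (ns % 1000000000) / 1000000 ))"
}

# Print the CPU time of one accounting group in seconds.
# $1 is the group directory, e.g. /cgroups/g1 (assumed mount point).
show_usage() {
    if [ -r "$1/cpuacct.usage" ]; then
        ns_to_sec "$(cat "$1/cpuacct.usage")"
    else
        echo "no cpuacct.usage under $1" >&2
        return 1
    fi
}
```

For example, `show_usage /cgroups/g1` would print the time consumed by the shell moved into g1 and its children, provided the cpuacct hierarchy is mounted as described.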
diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt deleted file mode 100644 index 7cc6e6a60672..000000000000 --- a/Documentation/controllers/devices.txt +++ /dev/null | |||
@@ -1,52 +0,0 @@ | |||
1 | Device Whitelist Controller | ||
2 | |||
3 | 1. Description: | ||
4 | |||
5 | Implement a cgroup to track and enforce open and mknod restrictions | ||
6 | on device files. A device cgroup associates a device access | ||
7 | whitelist with each cgroup. A whitelist entry has 4 fields. | ||
8 | 'type' is a (all), c (char), or b (block). 'all' means it applies | ||
9 | to all types and all major and minor numbers. Major and minor are | ||
10 | either an integer or * for all. Access is a composition of r | ||
11 | (read), w (write), and m (mknod). | ||
12 | |||
13 | The root device cgroup starts with rwm to 'all'. A child device | ||
14 | cgroup gets a copy of the parent. Administrators can then remove | ||
15 | devices from the whitelist or add new entries. A child cgroup can | ||
16 | never receive a device access which is denied by its parent. However | ||
17 | when a device access is removed from a parent it will not also be | ||
18 | removed from the child(ren). | ||
19 | |||
20 | 2. User Interface | ||
21 | |||
22 | An entry is added using devices.allow, and removed using | ||
23 | devices.deny. For instance | ||
24 | |||
25 | echo 'c 1:3 mr' > /cgroups/1/devices.allow | ||
26 | |||
27 | allows cgroup 1 to read and mknod the device usually known as | ||
28 | /dev/null. Doing | ||
29 | |||
30 | echo a > /cgroups/1/devices.deny | ||
31 | |||
32 | will remove the default 'a *:* rwm' entry. Doing | ||
33 | |||
34 | echo a > /cgroups/1/devices.allow | ||
35 | |||
36 | will add the 'a *:* rwm' entry to the whitelist. | ||
37 | |||
38 | 3. Security | ||
39 | |||
40 | Any task can move itself between cgroups. This clearly won't | ||
41 | suffice, but we can decide the best way to adequately restrict | ||
42 | movement as people get some experience with this. We may just want | ||
43 | to require CAP_SYS_ADMIN, which at least is a separate bit from | ||
44 | CAP_MKNOD. We may want to just refuse moving to a cgroup which | ||
45 | isn't a descendent of the current one. Or we may want to use | ||
46 | CAP_MAC_ADMIN, since we really are trying to lock down root. | ||
47 | |||
48 | CAP_SYS_ADMIN is needed to modify the whitelist or move another | ||
49 | task to a new cgroup. (Again we'll probably want to change that). | ||
50 | |||
51 | A cgroup may not be granted more permissions than the cgroup's | ||
52 | parent has. | ||
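The entry format described in section 1 (type, major:minor, access) can be checked before a string is written to devices.allow or devices.deny. The sketch below is a format check only and is not the kernel's parser; the helper names `is_num_or_star` and `valid_entry` are hypothetical.

```shell
#!/bin/sh
# Succeed if $1 is an integer or the wildcard "*".
is_num_or_star() {
    case $1 in
        '*') return 0 ;;
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Succeed if $1 looks like "<type> <major>:<minor> <access>",
# or is the bare entry "a" used in the examples above.
valid_entry() {
    entry=$1
    [ "$entry" = "a" ] && return 0
    type=${entry%% *}
    rest=${entry#* }
    dev=${rest%% *}
    access=${rest#* }
    # Reject anything that is not exactly three space-separated fields.
    [ "$entry" = "$type $dev $access" ] || return 1
    case $type in a|b|c) ;; *) return 1 ;; esac
    case $dev in *:*) ;; *) return 1 ;; esac
    is_num_or_star "${dev%%:*}" || return 1
    is_num_or_star "${dev#*:}" || return 1
    # Access must be a non-empty combination of r, w and m.
    case $access in ''|*[!rwm]*) return 1 ;; esac
    return 0
}
```

A caller might guard a write with `valid_entry 'c 1:3 mr' && echo 'c 1:3 mr' > /cgroups/1/devices.allow`, which still requires the privileges discussed in the Security section.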
diff --git a/Documentation/controllers/memcg_test.txt b/Documentation/controllers/memcg_test.txt deleted file mode 100644 index 08d4d3ea0d79..000000000000 --- a/Documentation/controllers/memcg_test.txt +++ /dev/null | |||
@@ -1,342 +0,0 @@ | |||
1 | Memory Resource Controller(Memcg) Implementation Memo. | ||
2 | Last Updated: 2008/12/15 | ||
3 | Base Kernel Version: based on 2.6.28-rc8-mm. | ||
4 | |||
5 | Because the VM is getting complex (one of the reasons is memcg...), memcg's | ||
6 | behavior is complex, too. This document describes memcg's internal behavior. | ||
7 | Please note that implementation details can change. | ||
8 | |||
9 | (*) Topics on the API should be in Documentation/controllers/memory.txt. | ||
10 | |||
11 | 0. How to record usage ? | ||
12 | 2 objects are used. | ||
13 | |||
14 | page_cgroup ....an object per page. | ||
15 | Allocated at boot or memory hotplug. Freed at memory hot removal. | ||
16 | |||
17 | swap_cgroup ... an entry per swp_entry. | ||
18 | Allocated at swapon(). Freed at swapoff(). | ||
19 | |||
20 | The page_cgroup has a USED bit, so a page is never charged twice against | ||
21 | one page_cgroup. swap_cgroup is used only when a charged page is swapped out. | ||
22 | |||
23 | 1. Charge | ||
24 | |||
25 | a page/swp_entry may be charged (usage += PAGE_SIZE) at | ||
26 | |||
27 | mem_cgroup_newpage_charge() | ||
28 | Called at new page fault and Copy-On-Write. | ||
29 | |||
30 | mem_cgroup_try_charge_swapin() | ||
31 | Called at do_swap_page() (page fault on swap entry) and swapoff. | ||
32 | Followed by charge-commit-cancel protocol. (With swap accounting) | ||
33 | At commit, a charge recorded in swap_cgroup is removed. | ||
34 | |||
35 | mem_cgroup_cache_charge() | ||
36 | Called at add_to_page_cache() | ||
37 | |||
38 | mem_cgroup_cache_charge_swapin() | ||
39 | Called at shmem's swapin. | ||
40 | |||
41 | mem_cgroup_prepare_migration() | ||
42 | Called before migration. "extra" charge is done and followed by | ||
43 | charge-commit-cancel protocol. | ||
44 | At commit, charge against oldpage or newpage will be committed. | ||
45 | |||
46 | 2. Uncharge | ||
47 | a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by | ||
48 | |||
49 | mem_cgroup_uncharge_page() | ||
50 | Called when an anonymous page is fully unmapped. I.e., mapcount goes | ||
51 | to 0. If the page is SwapCache, uncharge is delayed until | ||
52 | mem_cgroup_uncharge_swapcache(). | ||
53 | |||
54 | mem_cgroup_uncharge_cache_page() | ||
55 | Called when a page-cache is deleted from radix-tree. If the page is | ||
56 | SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache(). | ||
57 | |||
58 | mem_cgroup_uncharge_swapcache() | ||
59 | Called when SwapCache is removed from radix-tree. The charge itself | ||
60 | is moved to swap_cgroup. (If mem+swap controller is disabled, no | ||
61 | charge to swap occurs.) | ||
62 | |||
63 | mem_cgroup_uncharge_swap() | ||
64 | Called when swp_entry's refcnt goes down to 0. A charge against swap | ||
65 | disappears. | ||
66 | |||
67 | mem_cgroup_end_migration(old, new) | ||
68 | At success of migration old is uncharged (if necessary), a charge | ||
69 | to new page is committed. At failure, charge to old page is committed. | ||
70 | |||
71 | 3. charge-commit-cancel | ||
72 | In some cases, we cannot know at charge time whether the "charge" is valid | ||
73 | (because of races). | ||
74 | To handle such cases, there are charge-commit-cancel functions: | ||
75 | mem_cgroup_try_charge_XXX | ||
76 | mem_cgroup_commit_charge_XXX | ||
77 | mem_cgroup_cancel_charge_XXX | ||
78 | These are used in swap-in and migration. | ||
79 | |||
80 | At try_charge(), no flag yet says "this page is charged", but | ||
81 | usage += PAGE_SIZE is already done at this point. | ||
82 | |||
83 | At commit(), the function checks whether the page should be charged, | ||
84 | and either sets the flag or drops the charge (usage -= PAGE_SIZE). | ||
85 | |||
86 | At cancel(), simply usage -= PAGE_SIZE. | ||
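The protocol above can be modeled with two variables: a usage counter and a per-page "charged" flag. This is a toy model for one page, not kernel code; the 4096-byte PAGE_SIZE and the function names are assumptions made for illustration.

```shell
#!/bin/sh
PAGE_SIZE=4096   # assumed page size
usage=0          # models the cgroup's usage counter
charged=0        # models the per-page "this page is charged" flag

# try: account the page up front, before we know whether the charge is valid.
try_charge() { usage=$((usage + PAGE_SIZE)); }

# commit: if the page is already charged, drop the extra charge;
# otherwise keep the charge and set the flag.
commit_charge() {
    if [ "$charged" -eq 1 ]; then
        usage=$((usage - PAGE_SIZE))
    else
        charged=1
    fi
}

# cancel: simply give the charge back.
cancel_charge() { usage=$((usage - PAGE_SIZE)); }
```

In this model, calling try_charge followed by commit_charge twice leaves usage at one PAGE_SIZE, and a try_charge followed by cancel_charge leaves it unchanged, which mirrors the "no double count" property described in section 0.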
87 | |||
88 | In the explanation below, we assume CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y. | ||
89 | |||
90 | 4. Anonymous | ||
91 | An anonymous page is newly allocated at | ||
92 | - page fault into a MAP_ANONYMOUS mapping. | ||
93 | - Copy-On-Write. | ||
94 | It is charged right after it is allocated, before any page-table-related | ||
95 | operations are done. Of course, it is uncharged when another page is used | ||
96 | for the fault address. | ||
97 | |||
98 | When an anonymous page is freed (by exit() or munmap()), zap_pte() is called | ||
99 | and the pages behind the ptes are freed one by one (see mm/memory.c). Uncharges | ||
100 | are done at page_remove_rmap() when page_mapcount() goes down to 0. | ||
101 | |||
102 | Pages are also freed by page reclaim (vmscan.c), where anonymous pages | ||
103 | are swapped out. In this case, the page is marked as | ||
104 | PageSwapCache(). The uncharge() routine doesn't uncharge a page marked | ||
105 | as SwapCache. That is delayed until __delete_from_swap_cache(). | ||
106 | |||
107 | 4.1 Swap-in. | ||
108 | At swap-in, the page is taken from swap-cache. There are 2 cases. | ||
109 | |||
110 | (a) If the SwapCache is newly allocated and read, it has no charges. | ||
111 | (b) If the SwapCache has been mapped by processes, it has been | ||
112 | charged already. | ||
113 | |||
114 | Swap-in is one of the most complicated operations. In do_swap_page(), | ||
115 | the following events occur when the pte is unchanged. | ||
116 | |||
117 | (1) the page (SwapCache) is looked up. | ||
118 | (2) lock_page() | ||
119 | (3) try_charge_swapin() | ||
120 | (4) reuse_swap_page() (may call delete_from_swap_cache()) | ||
121 | (5) commit_charge_swapin() | ||
122 | (6) swap_free(). | ||
123 | |||
124 | Consider the following situations, for example. | ||
125 | |||
126 | (A) The page has not been charged before (2) and reuse_swap_page() | ||
127 | doesn't call delete_from_swap_cache(). | ||
128 | (B) The page has not been charged before (2) and reuse_swap_page() | ||
129 | calls delete_from_swap_cache(). | ||
130 | (C) The page has been charged before (2) and reuse_swap_page() doesn't | ||
131 | call delete_from_swap_cache(). | ||
132 | (D) The page has been charged before (2) and reuse_swap_page() calls | ||
133 | delete_from_swap_cache(). | ||
134 | |||
135 | memory.usage/memsw.usage changes to this page/swp_entry will be | ||
136 | Case          (A)      (B)      (C)      (D) | ||
137 | Event | ||
138 | Before (2)    0/ 1     0/ 1     1/ 1     1/ 1 | ||
139 | =========================================== | ||
140 | (3)          +1/+1    +1/+1    +1/+1    +1/+1 | ||
141 | (4)            -       0/ 0      -      -1/ 0 | ||
142 | (5)           0/-1     0/ 0    -1/-1     0/ 0 | ||
143 | (6)            -       0/-1      -       0/-1 | ||
144 | =========================================== | ||
145 | Result        1/ 1     1/ 1     1/ 1     1/ 1 | ||
146 | |||
147 | In every case, the charges for this page should end up as 1/ 1. | ||
148 | |||
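The arithmetic in the table above can be checked column by column: starting from the "Before (2)" value, add the deltas of events (3) to (6), treating "-" as "event does not occur". The script below is a sketch of that check; `sum_deltas` is a hypothetical helper name.

```shell
#!/bin/sh
# sum_deltas START DELTA...: add signed integers to START and print the result.
sum_deltas() {
    total=$1
    shift
    for d in "$@"; do
        total=$((total + d))
    done
    echo "$total"
}

# Each line prints memory.usage/memsw.usage after event (6).
# Case (A): before 0/1, (3) +1/+1, (5) 0/-1
echo "A: $(sum_deltas 0 1 0)/$(sum_deltas 1 1 -1)"
# Case (B): before 0/1, (3) +1/+1, (4) 0/0, (5) 0/0, (6) 0/-1
echo "B: $(sum_deltas 0 1 0 0 0)/$(sum_deltas 1 1 0 0 -1)"
# Case (C): before 1/1, (3) +1/+1, (5) -1/-1
echo "C: $(sum_deltas 1 1 -1)/$(sum_deltas 1 1 -1)"
# Case (D): before 1/1, (3) +1/+1, (4) -1/0, (5) 0/0, (6) 0/-1
echo "D: $(sum_deltas 1 1 -1 0 0)/$(sum_deltas 1 1 0 0 -1)"
```

Every line prints 1/1, matching the "Result" row.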
149 | 4.2 Swap-out. | ||
150 | At swap-out, typical state transition is below. | ||
151 | |||
152 | (a) add to swap cache. (marked as SwapCache) | ||
153 | swp_entry's refcnt += 1. | ||
154 | (b) fully unmapped. | ||
155 | swp_entry's refcnt += # of ptes. | ||
156 | (c) write back to swap. | ||
157 | (d) delete from swap cache. (remove from SwapCache) | ||
158 | swp_entry's refcnt -= 1. | ||
159 | |||
160 | |||
161 | At (b), the page is marked as SwapCache and not uncharged. | ||
162 | At (d), the page is removed from SwapCache and a charge in page_cgroup | ||
163 | is moved to swap_cgroup. | ||
164 | |||
165 | Finally, at task exit, | ||
166 | (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0. | ||
167 | Here, a charge in swap_cgroup disappears. | ||
168 | |||
169 | 5. Page Cache | ||
170 | Page Cache is charged at | ||
171 | - add_to_page_cache_locked(). | ||
172 | |||
173 | uncharged at | ||
174 | - __remove_from_page_cache(). | ||
175 | |||
176 | The logic is very clear. (About migration, see below) | ||
177 | Note: __remove_from_page_cache() is called by remove_from_page_cache() | ||
178 | and __remove_mapping(). | ||
179 | |||
180 | 6. Shmem(tmpfs) Page Cache | ||
181 | Memcg's charge/uncharge have special handlers of shmem. The best way | ||
182 | to understand shmem's page state transition is to read mm/shmem.c. | ||
183 | But brief explanation of the behavior of memcg around shmem will be | ||
184 | helpful to understand the logic. | ||
185 | |||
186 | Shmem's page (just leaf page, not direct/indirect block) can be on | ||
187 | - radix-tree of shmem's inode. | ||
188 | - SwapCache. | ||
189 | - Both on radix-tree and SwapCache. This happens at swap-in | ||
190 | and swap-out. | ||
191 | |||
192 | It's charged when... | ||
193 | - A new page is added to shmem's radix-tree. | ||
194 | - A swp page is read. (move a charge from swap_cgroup to page_cgroup) | ||
195 | It's uncharged when | ||
196 | - A page is removed from radix-tree and not SwapCache. | ||
197 | - When SwapCache is removed, a charge is moved to swap_cgroup. | ||
198 | - When swp_entry's refcnt goes down to 0, a charge in swap_cgroup | ||
199 | disappears. | ||
200 | |||
201 | 7. Page Migration | ||
202 | One of the most complicated functions is page-migration-handler. | ||
203 | Memcg has 2 routines. Assume that we are migrating a page's contents | ||
204 | from OLDPAGE to NEWPAGE. | ||
205 | |||
206 | Usual migration logic is.. | ||
207 | (a) remove the page from LRU. | ||
208 | (b) allocate NEWPAGE (migration target) | ||
209 | (c) lock by lock_page(). | ||
210 | (d) unmap all mappings. | ||
211 | (e-1) If necessary, replace entry in radix-tree. | ||
212 | (e-2) move contents of a page. | ||
213 | (f) map all mappings again. | ||
214 | (g) pushback the page to LRU. | ||
215 | (-) OLDPAGE will be freed. | ||
216 | |||
217 | Before (g), memcg should complete all necessary charge/uncharge to | ||
218 | NEWPAGE/OLDPAGE. | ||
219 | |||
220 | The point is.... | ||
221 | - If OLDPAGE is anonymous, all charges will be dropped at (d) because | ||
222 | try_to_unmap() drops all mapcount and the page will not be | ||
223 | SwapCache. | ||
224 | |||
225 | - If OLDPAGE is SwapCache, charges will be kept at (g) because | ||
226 | __delete_from_swap_cache() isn't called at (e-1) | ||
227 | |||
228 | - If OLDPAGE is page-cache, charges will be kept at (g) because | ||
229 | remove_from_page_cache() isn't called at (e-1) | ||
230 | |||
231 | memcg provides following hooks. | ||
232 | |||
233 | - mem_cgroup_prepare_migration(OLDPAGE) | ||
234 | Called after (b) to account a charge (usage += PAGE_SIZE) against | ||
235 | memcg which OLDPAGE belongs to. | ||
236 | |||
237 | - mem_cgroup_end_migration(OLDPAGE, NEWPAGE) | ||
238 | Called after (f) before (g). | ||
239 | If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already | ||
240 | charged, a charge by prepare_migration() is automatically canceled. | ||
241 | If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE. | ||
242 | |||
243 | Because zap_pte() (by exit or munmap) can be called during migration, | ||
244 | we have to check whether OLDPAGE/NEWPAGE is a valid page after commit(). | ||
245 | |||
246 | 8. LRU | ||
247 | Each memcg has its own private LRU. Currently, its handling is under the global | ||
248 | VM's control (that is, it is handled under the global zone->lru_lock). | ||
249 | Almost all routines around memcg's LRU are called by the global LRU's | ||
250 | list management functions under zone->lru_lock. | ||
251 | |||
252 | A special function is mem_cgroup_isolate_pages(). This scans | ||
253 | memcg's private LRU and calls __isolate_lru_page() to extract a page | ||
254 | from the LRU. | ||
255 | (By __isolate_lru_page(), the page is removed from both the global and | ||
256 | private LRUs.) | ||
257 | |||
258 | |||
259 | 9. Typical Tests. | ||
260 | |||
261 | Tests for racy cases. | ||
262 | |||
263 | 9.1 Small limit to memcg. | ||
264 | When testing racy cases, it is a good idea to set memcg's limit | ||
265 | to a very small value rather than GBs. Many races were found in tests under | ||
266 | xKB or xxMB limits. | ||
267 | (Memory behavior under a GB limit and under an MB limit is very | ||
268 | different.) | ||
269 | |||
270 | 9.2 Shmem | ||
271 | Historically, memcg's shmem handling was poor and we saw a fair amount | ||
272 | of trouble here. This is because shmem is page-cache but can be | ||
273 | SwapCache. Testing with shmem/tmpfs is always a good test. | ||
274 | |||
275 | 9.3 Migration | ||
276 | For NUMA, migration is another special case. For an easy test, cpuset | ||
277 | is useful. The following is a sample script to do migration. | ||
278 | |||
279 | mount -t cgroup -o cpuset none /opt/cpuset | ||
280 | |||
281 | mkdir /opt/cpuset/01 | ||
282 | echo 1 > /opt/cpuset/01/cpuset.cpus | ||
283 | echo 0 > /opt/cpuset/01/cpuset.mems | ||
284 | echo 1 > /opt/cpuset/01/cpuset.memory_migrate | ||
285 | mkdir /opt/cpuset/02 | ||
286 | echo 1 > /opt/cpuset/02/cpuset.cpus | ||
287 | echo 1 > /opt/cpuset/02/cpuset.mems | ||
288 | echo 1 > /opt/cpuset/02/cpuset.memory_migrate | ||
289 | |||
290 | With the above setup, when you move a task from 01 to 02, page migration | ||
291 | from node 0 to node 1 will occur. The following is a script to migrate all | ||
292 | tasks under a cpuset. | ||
293 | -- | ||
294 | move_task() | ||
295 | { | ||
296 | for pid in $1 | ||
297 | do | ||
298 | /bin/echo $pid >$2/tasks 2>/dev/null | ||
299 | echo -n $pid | ||
300 | echo -n " " | ||
301 | done | ||
302 | echo END | ||
303 | } | ||
304 | |||
305 | G1_TASK=`cat ${G1}/tasks` | ||
306 | G2_TASK=`cat ${G2}/tasks` | ||
307 | move_task "${G1_TASK}" ${G2} & | ||
308 | -- | ||
309 | 9.4 Memory hotplug. | ||
310 | A memory hotplug test is another good test. | ||
311 | To offline memory, do the following. | ||
312 | # echo offline > /sys/devices/system/memory/memoryXXX/state | ||
313 | (XXX identifies the memory section) | ||
314 | This is an easy way to test page migration, too. | ||
315 | |||
316 | 9.5 mkdir/rmdir | ||
317 | When using hierarchy, mkdir/rmdir tests should be done. | ||
318 | Use tests like the following. | ||
319 | |||
320 | echo 1 >/opt/cgroup/01/memory.use_hierarchy | ||
321 | mkdir /opt/cgroup/01/child_a | ||
322 | mkdir /opt/cgroup/01/child_b | ||
323 | |||
324 | Set a limit on 01. | ||
325 | Add a limit on 01/child_b. | ||
326 | Run jobs under child_a and child_b. | ||
327 | |||
328 | Create/delete the following groups at random while jobs are running. | ||
329 | /opt/cgroup/01/child_a/child_aa | ||
330 | /opt/cgroup/01/child_b/child_bb | ||
331 | /opt/cgroup/01/child_c | ||
332 | |||
333 | Running new jobs in a new group is also good. | ||
334 | |||
335 | 9.6 Mount with other subsystems. | ||
336 | Mounting with other subsystems is a good test because there are | ||
337 | races and lock dependencies with other cgroup subsystems. | ||
338 | |||
339 | Example: | ||
340 | # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices | ||
341 | |||
342 | and do task moves, mkdir, rmdir, etc. under this. | ||
diff --git a/Documentation/controllers/memory.txt b/Documentation/controllers/memory.txt deleted file mode 100644 index e1501964df1e..000000000000 --- a/Documentation/controllers/memory.txt +++ /dev/null | |||
@@ -1,399 +0,0 @@ | |||
1 | Memory Resource Controller | ||
2 | |||
3 | NOTE: The Memory Resource Controller is generically referred to | ||
4 | as the memory controller in this document. Do not confuse the memory controller | ||
5 | used here with the memory controller that is used in hardware. | ||
6 | |||
7 | Salient features | ||
8 | |||
9 | a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages | ||
10 | b. The infrastructure allows easy addition of other types of memory to control | ||
11 | c. Provides *zero overhead* for non memory controller users | ||
12 | d. Provides a double LRU: global memory pressure causes reclaim from the | ||
13 | global LRU; a cgroup on hitting a limit, reclaims from the per | ||
14 | cgroup LRU | ||
15 | |||
16 | NOTE: Swap Cache (unmapped) is not accounted now. | ||
17 | |||
18 | Benefits and Purpose of the memory controller | ||
19 | |||
20 | The memory controller isolates the memory behaviour of a group of tasks | ||
21 | from the rest of the system. The article on LWN [12] mentions some probable | ||
22 | uses of the memory controller. The memory controller can be used to | ||
23 | |||
24 | a. Isolate an application or a group of applications | ||
25 | Memory hungry applications can be isolated and limited to a smaller | ||
26 | amount of memory. | ||
27 | b. Create a cgroup with a limited amount of memory; this can be used | ||
28 | as a good alternative to booting with mem=XXXX. | ||
29 | c. Virtualization solutions can control the amount of memory they want | ||
30 | to assign to a virtual machine instance. | ||
31 | d. A CD/DVD burner could control the amount of memory used by the | ||
32 | rest of the system to ensure that burning does not fail due to lack | ||
33 | of available memory. | ||
34 | e. There are several other use cases; find one or use the controller just | ||
35 | for fun (to learn and hack on the VM subsystem). | ||
36 | |||
37 | 1. History | ||
38 | |||
39 | The memory controller has a long history. A request for comments for the memory | ||
40 | controller was posted by Balbir Singh [1]. At the time the RFC was posted | ||
41 | there were several implementations for memory control. The goal of the | ||
42 | RFC was to build consensus and agreement for the minimal features required | ||
43 | for memory control. The first RSS controller was posted by Balbir Singh[2] | ||
44 | in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the | ||
45 | RSS controller. At OLS, at the resource management BoF, everyone suggested | ||
46 | that we handle both page cache and RSS together. Another request was raised | ||
47 | to allow user space handling of OOM. The current memory controller is | ||
48 | at version 6; it combines both mapped (RSS) and unmapped Page | ||
49 | Cache Control [11]. | ||
50 | |||
51 | 2. Memory Control | ||
52 | |||
53 | Memory is a unique resource in the sense that it is present in a limited | ||
54 | amount. If a task requires a lot of CPU processing, the task can spread | ||
55 | its processing over a period of hours, days, months or years, but with | ||
56 | memory, the same physical memory needs to be reused to accomplish the task. | ||
57 | |||
58 | The memory controller implementation has been divided into phases. These | ||
59 | are: | ||
60 | |||
61 | 1. Memory controller | ||
62 | 2. mlock(2) controller | ||
63 | 3. Kernel user memory accounting and slab control | ||
64 | 4. user mappings length controller | ||
65 | |||
66 | The memory controller is the first controller developed. | ||
67 | |||
68 | 2.1. Design | ||
69 | |||
70 | The core of the design is a counter called the res_counter. The res_counter | ||
71 | tracks the current memory usage and limit of the group of processes associated | ||
72 | with the controller. Each cgroup has a memory controller specific data | ||
73 | structure (mem_cgroup) associated with it. | ||
74 | |||
75 | 2.2. Accounting | ||
76 | |||
77 | +--------------------+ | ||
78 | | mem_cgroup | | ||
79 | | (res_counter) | | ||
80 | +--------------------+ | ||
81 | / ^ \ | ||
82 | / | \ | ||
83 | +---------------+ | +---------------+ | ||
84 | | mm_struct | |.... | mm_struct | | ||
85 | | | | | | | ||
86 | +---------------+ | +---------------+ | ||
87 | | | ||
88 | + --------------+ | ||
89 | | | ||
90 | +---------------+ +------+--------+ | ||
91 | | page +----------> page_cgroup| | ||
92 | | | | | | ||
93 | +---------------+ +---------------+ | ||
94 | |||
95 | (Figure 1: Hierarchy of Accounting) | ||
96 | |||
97 | |||
98 | Figure 1 shows the important aspects of the controller | ||
99 | |||
100 | 1. Accounting happens per cgroup | ||
101 | 2. Each mm_struct knows about which cgroup it belongs to | ||
102 | 3. Each page has a pointer to the page_cgroup, which in turn knows the | ||
103 | cgroup it belongs to | ||
104 | |||
105 | The accounting is done as follows: mem_cgroup_charge() is invoked to set up | ||
106 | the necessary data structures and check if the cgroup that is being charged | ||
107 | is over its limit. If it is, then reclaim is invoked on the cgroup. | ||
108 | More details can be found in the reclaim section of this document. | ||
109 | If everything goes well, a page meta-data-structure called page_cgroup is | ||
110 | allocated and associated with the page. This routine also adds the page to | ||
111 | the per cgroup LRU. | ||
112 | |||
113 | 2.2.1 Accounting details | ||
114 | |||
115 | All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. | ||
116 | (Some pages which are never reclaimable and will not be on the global LRU | ||
117 | are not accounted. We only account pages under usual VM management.) | ||
118 | |||
119 | RSS pages are accounted at page_fault unless they've already been accounted | ||
120 | for earlier. A file page will be accounted for as Page Cache when it's | ||
121 | inserted into inode (radix-tree). While it's mapped into the page tables of | ||
122 | processes, duplicate accounting is carefully avoided. | ||
123 | |||
124 | An RSS page is unaccounted when it's fully unmapped. A PageCache page is | ||
125 | unaccounted when it's removed from radix-tree. | ||
126 | |||
127 | At page migration, accounting information is kept. | ||
128 | |||
129 | Note: we only account pages on the LRU because our purpose is to control the | ||
130 | amount of used pages. Pages not on the LRU tend to be out of control from the VM's view. | ||
131 | |||
132 | 2.3 Shared Page Accounting | ||
133 | |||
134 | Shared pages are accounted on the basis of the first touch approach. The | ||
135 | cgroup that first touches a page is accounted for the page. The principle | ||
136 | behind this approach is that a cgroup that aggressively uses a shared | ||
137 | page will eventually get charged for it (once it is uncharged from | ||
138 | the cgroup that brought it in -- this will happen on memory pressure). | ||
139 | |||
140 | Exception: If CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used, | ||
141 | when you do swapoff and force swapped-out pages of shmem (tmpfs) to | ||
142 | be brought back into memory, charges for the pages are accounted against the | ||
143 | caller of swapoff rather than the users of shmem. | ||
144 | |||
145 | |||
146 | 2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP) | ||
147 | Swap Extension allows you to record charges for swap. A swapped-in page is | ||
148 | charged back to the original page allocator if possible. | ||
149 | |||
150 | When swap is accounted, the following files are added. | ||
151 | - memory.memsw.usage_in_bytes. | ||
152 | - memory.memsw.limit_in_bytes. | ||
153 | |||
154 | Usage of mem+swap is limited by memsw.limit_in_bytes. | ||
155 | |||
156 | Note: why 'mem+swap' rather than swap? | ||
157 | The global LRU (kswapd) can swap out arbitrary pages. Swap-out means | ||
158 | moving the account from memory to swap, so there is no change in the usage of | ||
159 | mem+swap. | ||
160 | |||
161 | In other words, when we want to limit the usage of swap without affecting | ||
162 | the global LRU, a mem+swap limit is better than just limiting swap from the OS | ||
163 | point of view. | ||
164 | |||
165 | 2.5 Reclaim | ||
166 | |||
167 | Each cgroup maintains a per cgroup LRU that consists of an active | ||
168 | and inactive list. When a cgroup goes over its limit, we first try | ||
169 | to reclaim memory from the cgroup so as to make space for the new | ||
170 | pages that the cgroup has touched. If the reclaim is unsuccessful, | ||
171 | an OOM routine is invoked to select and kill the bulkiest task in the | ||
172 | cgroup. | ||
173 | |||
174 | The reclaim algorithm has not been modified for cgroups, except that | ||
175 | pages that are selected for reclaiming come from the per cgroup LRU | ||
176 | list. | ||
177 | |||
178 | 2.6 Locking | ||
179 | |||
180 | The memory controller uses the following hierarchy | ||
181 | |||
182 | 1. zone->lru_lock is used for selecting pages to be isolated | ||
183 | 2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone) | ||
184 | 3. lock_page_cgroup() is used to protect page->page_cgroup | ||
185 | |||
186 | 3. User Interface | ||
187 | |||
188 | 0. Configuration | ||
189 | |||
190 | a. Enable CONFIG_CGROUPS | ||
191 | b. Enable CONFIG_RESOURCE_COUNTERS | ||
192 | c. Enable CONFIG_CGROUP_MEM_RES_CTLR | ||
193 | |||
194 | 1. Prepare the cgroups | ||
195 | # mkdir -p /cgroups | ||
196 | # mount -t cgroup none /cgroups -o memory | ||
197 | |||
198 | 2. Make the new group and move bash into it | ||
199 | # mkdir /cgroups/0 | ||
200 | # echo $$ > /cgroups/0/tasks | ||
201 | |||
202 | Since we are now in the 0 cgroup, | ||
203 | we can alter the memory limit: | ||
204 | # echo 4M > /cgroups/0/memory.limit_in_bytes | ||
205 | |||
206 | NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, | ||
207 | mega or gigabytes. | ||
208 | |||
209 | # cat /cgroups/0/memory.limit_in_bytes | ||
210 | 4194304 | ||
211 | |||
212 | NOTE: The interface has now changed to display the usage in bytes | ||
213 | instead of pages | ||
214 | |||
215 | We can check the usage: | ||
216 | # cat /cgroups/0/memory.usage_in_bytes | ||
217 | 1216512 | ||
218 | |||
219 | A successful write to this file does not guarantee a successful set of | ||
220 | this limit to the value written into the file. This can be due to a | ||
221 | number of factors, such as rounding up to page boundaries or the total | ||
222 | availability of memory on the system. The user is required to re-read | ||
223 | this file after a write to guarantee the value committed by the kernel. | ||
224 | |||
225 | # echo 1 > memory.limit_in_bytes | ||
226 | # cat memory.limit_in_bytes | ||
227 | 4096 | ||
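The two behaviours shown above, suffix handling and rounding up to a page boundary, can be sketched in shell. This only illustrates the arithmetic and is not how the kernel parses the value; the 4096-byte page size is an assumption consistent with the example output, and the names `to_bytes` and `round_to_page` are hypothetical.

```shell
#!/bin/sh
PAGE_SIZE=4096  # assumed page size

# Convert "4M", "512k", "1G" or a plain byte count to bytes.
to_bytes() {
    value=$1
    case $value in
        *[kK]) echo $(( ${value%?} * 1024 )) ;;
        *[mM]) echo $(( ${value%?} * 1024 * 1024 )) ;;
        *[gG]) echo $(( ${value%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$value" ;;
    esac
}

# Round a byte count up to the next multiple of PAGE_SIZE.
round_to_page() {
    echo $(( ($1 + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE ))
}
```

Under these assumptions, `to_bytes 4M` gives 4194304 and `round_to_page 1` gives 4096, the same values read back from memory.limit_in_bytes in the examples. Re-reading the file after a write remains the only way to know what the kernel actually committed.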
228 | |||
229 | The memory.failcnt field gives the number of times that the cgroup limit was | ||
230 | exceeded. | ||
231 | |||
232 | The memory.stat file gives accounting information. Now, the number of | ||
233 | caches, RSS and Active pages/Inactive pages are shown. | ||
234 | |||
235 | 4. Testing | ||
236 | |||
237 | Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11]. | ||
238 | Apart from that v6 has been tested with several applications and regular | ||
239 | daily use. The controller has also been tested on the PPC64, x86_64 and | ||
240 | UML platforms. | ||
241 | |||
242 | 4.1 Troubleshooting | ||
243 | |||
244 | Sometimes a user might find that the application under a cgroup is | ||
245 | terminated. There are several causes for this: | ||
246 | |||
247 | 1. The cgroup limit is too low (just too low to do anything useful) | ||
248 | 2. The user is using anonymous memory and swap is turned off or too low | ||
249 | |||
250 | A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of | ||
251 | some of the pages cached in the cgroup (page cache pages). | ||
252 | |||
253 | 4.2 Task migration | ||
254 | |||
255 | When a task migrates from one cgroup to another, its charge is not | ||
256 | carried forward. The pages allocated from the original cgroup still | ||
257 | remain charged to it; the charge is dropped when the page is freed or | ||
258 | reclaimed. | ||
259 | |||
260 | 4.3 Removing a cgroup | ||
261 | |||
262 | A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a | ||
263 | cgroup might have some charge associated with it, even though all | ||
264 | tasks have migrated away from it. | ||
265 | Such charges are freed (by default) or moved to the parent cgroup. When moved, | ||
266 | both RSS and cache charges are moved to the parent. | ||
267 | If both of them are busy, rmdir() returns -EBUSY. See also section 5.1. | ||
268 | |||
269 | Charges recorded in swap information are not updated when a cgroup is removed. | ||
270 | The recorded information is discarded, and a cgroup which later uses that swap | ||
271 | (swapcache) will be charged as its new owner. | ||
272 | |||
273 | |||
274 | 5. Misc. interfaces. | ||
275 | |||
276 | 5.1 force_empty | ||
277 | The memory.force_empty interface is provided to empty a cgroup's memory usage. | ||
278 | You can use this interface only when the cgroup has no tasks. | ||
279 | When anything is written to this file, for example: | ||
280 | |||
281 | # echo 0 > memory.force_empty | ||
282 | |||
283 | almost all pages tracked by this memcg will be unmapped and freed. Some | ||
284 | pages cannot be freed because they are locked or in use. Such pages are moved | ||
285 | to the parent, and this cgroup becomes empty. The write may still return | ||
286 | -EBUSY in some excessively busy cases. | ||
287 | |||
288 | The typical use case of this interface is calling it before rmdir(). | ||
289 | Because rmdir() moves all pages to the parent, some out-of-use page caches can | ||
290 | be moved to the parent. If you want to avoid that, force_empty will be useful. | ||
291 | |||
292 | 5.2 stat file | ||
293 | The memory.stat file currently includes the following statistics: | ||
294 | cache - # of pages from page-cache and shmem. | ||
295 | rss - # of pages from anonymous memory. | ||
296 | pgpgin - # of charging events | ||
297 | pgpgout - # of uncharging events | ||
298 | active_anon - # of pages on active lru of anon, shmem. | ||
299 | inactive_anon - # of pages on inactive lru of anon, shmem | ||
300 | active_file - # of pages on active lru of file-cache | ||
301 | inactive_file - # of pages on inactive lru of file cache | ||
302 | unevictable - # of pages that cannot be reclaimed (mlocked etc.) | ||
303 | |||
304 | The following depend on CONFIG_DEBUG_VM: | ||
305 | inactive_ratio - VM internal parameter. (see mm/page_alloc.c) | ||
306 | recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) | ||
307 | recent_rotated_file - VM internal parameter. (see mm/vmscan.c) | ||
308 | recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) | ||
309 | recent_scanned_file - VM internal parameter. (see mm/vmscan.c) | ||
310 | |||
311 | Memo: | ||
312 | recent_rotated means the recent frequency of lru rotation. | ||
313 | recent_scanned means the recent # of scans of the lru. | ||
314 | These are shown to aid debugging; please see the code for their exact meanings. | ||
315 | |||
316 | |||
317 | 5.3 swappiness | ||
318 | Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. | ||
319 | |||
320 | The swappiness of the following cgroups cannot be changed: | ||
321 | - root cgroup (uses /proc/sys/vm/swappiness). | ||
322 | - a cgroup which uses hierarchy and has child cgroups. | ||
323 | - a cgroup which uses hierarchy and is not the root of the hierarchy. | ||
324 | |||
325 | |||
326 | 6. Hierarchy support | ||
327 | |||
328 | The memory controller supports a deep hierarchy and hierarchical accounting. | ||
329 | The hierarchy is created by creating the appropriate cgroups in the | ||
330 | cgroup filesystem. Consider for example, the following cgroup filesystem | ||
331 | hierarchy | ||
332 | |||
333 | root | ||
334 | / | \ | ||
335 | / | \ | ||
336 | a b c | ||
337 | | \ | ||
338 | | \ | ||
339 | d e | ||
340 | |||
341 | In the diagram above, with hierarchical accounting enabled, all memory | ||
342 | usage of e is accounted to its ancestors up to the root (i.e., c and root) | ||
343 | that have memory.use_hierarchy enabled. If one of the ancestors goes over its | ||
344 | limit, the reclaim algorithm reclaims from the tasks in the ancestor and the | ||
345 | children of the ancestor. | ||
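The accounting rule above can be sketched in userspace C as a walk up the parent chain. The structure sketch_group, the helpers sketch_next() and sketch_hier_charge(), and the choice to propagate a charge only while the parent has use_hierarchy set are assumptions made for this illustration, not the controller's actual implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a memory cgroup. */
struct sketch_group {
	struct sketch_group *parent;	/* NULL for the root */
	unsigned long long usage;
	unsigned long long limit;
	int use_hierarchy;		/* mirrors memory.use_hierarchy */
};

/* Next group to charge: the parent, if it accounts hierarchically. */
static struct sketch_group *sketch_next(const struct sketch_group *g)
{
	if (g->parent != NULL && g->parent->use_hierarchy)
		return g->parent;
	return NULL;
}

/*
 * Charge val to g and to every ancestor reached through sketch_next().
 * Returns NULL on success. If some group would go over its limit, the
 * partial charges are rolled back and that group is returned, since it
 * is the one whose subtree would have to be reclaimed from.
 */
static struct sketch_group *sketch_hier_charge(struct sketch_group *g,
					       unsigned long long val)
{
	struct sketch_group *c, *u;

	assert(g != NULL);

	for (c = g; c != NULL; c = sketch_next(c)) {
		if (c->usage + val > c->limit) {
			/* Undo the charges already made below c. */
			for (u = g; u != c; u = sketch_next(u))
				u->usage -= val;
			return c;
		}
		c->usage += val;
	}
	return NULL;
}
```

For the diagram above, charging e also raises the usage of c and root. If c would exceed its limit, the charge fails at c and the usage of e is restored.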
346 | |||
347 | 6.1 Enabling hierarchical accounting and reclaim | ||
348 | |||
349 | The memory controller by default disables the hierarchy feature. Support | ||
350 | can be enabled by writing 1 to the memory.use_hierarchy file of the root cgroup: | ||
351 | |||
352 | # echo 1 > memory.use_hierarchy | ||
353 | |||
354 | The feature can be disabled by | ||
355 | |||
356 | # echo 0 > memory.use_hierarchy | ||
357 | |||
358 | NOTE1: Enabling/disabling will fail if the cgroup already has other | ||
359 | cgroups created below it. | ||
360 | |||
361 | NOTE2: This feature can be enabled/disabled per subtree. | ||
362 | |||
363 | 7. TODO | ||
364 | |||
365 | 1. Add support for accounting huge pages (as a separate controller) | ||
366 | 2. Make per-cgroup scanner reclaim not-shared pages first | ||
367 | 3. Teach controller to account for shared-pages | ||
368 | 4. Start reclamation in the background when the limit is | ||
369 | not yet hit but the usage is getting closer | ||
370 | |||
371 | Summary | ||
372 | |||
373 | Overall, the memory controller has been a stable controller and has been | ||
374 | commented on and discussed quite extensively in the community. | ||
375 | |||
376 | References | ||
377 | |||
378 | 1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ | ||
379 | 2. Singh, Balbir. Memory Controller (RSS Control), | ||
380 | http://lwn.net/Articles/222762/ | ||
381 | 3. Emelianov, Pavel. Resource controllers based on process cgroups | ||
382 | http://lkml.org/lkml/2007/3/6/198 | ||
383 | 4. Emelianov, Pavel. RSS controller based on process cgroups (v2) | ||
384 | http://lkml.org/lkml/2007/4/9/78 | ||
385 | 5. Emelianov, Pavel. RSS controller based on process cgroups (v3) | ||
386 | http://lkml.org/lkml/2007/5/30/244 | ||
387 | 6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/ | ||
388 | 7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control | ||
389 | subsystem (v3), http://lwn.net/Articles/235534/ | ||
390 | 8. Singh, Balbir. RSS controller v2 test results (lmbench), | ||
391 | http://lkml.org/lkml/2007/5/17/232 | ||
392 | 9. Singh, Balbir. RSS controller v2 AIM9 results | ||
393 | http://lkml.org/lkml/2007/5/18/1 | ||
394 | 10. Singh, Balbir. Memory controller v6 test results, | ||
395 | http://lkml.org/lkml/2007/8/19/36 | ||
396 | 11. Singh, Balbir. Memory controller introduction (v6), | ||
397 | http://lkml.org/lkml/2007/8/17/69 | ||
398 | 12. Corbet, Jonathan, Controlling memory use in cgroups, | ||
399 | http://lwn.net/Articles/243795/ | ||
diff --git a/Documentation/controllers/resource_counter.txt b/Documentation/controllers/resource_counter.txt deleted file mode 100644 index f196ac1d7d25..000000000000 --- a/Documentation/controllers/resource_counter.txt +++ /dev/null | |||
@@ -1,181 +0,0 @@ | |||
1 | |||
2 | The Resource Counter | ||
3 | |||
4 | The resource counter, declared in include/linux/res_counter.h, | ||
5 | is supposed to facilitate the resource management by controllers | ||
6 | by providing common stuff for accounting. | ||
7 | |||
8 | This "stuff" includes the res_counter structure and routines | ||
9 | to work with it. | ||
10 | |||
11 | |||
12 | |||
13 | 1. Crucial parts of the res_counter structure | ||
14 | |||
15 | a. unsigned long long usage | ||
16 | |||
17 | The usage value shows the amount of a resource that is consumed | ||
18 | by a group at a given time. The units of measurement should be | ||
19 | determined by the controller that uses this counter. E.g. it can | ||
20 | be bytes, items or any other unit the controller operates on. | ||
21 | |||
22 | b. unsigned long long max_usage | ||
23 | |||
24 | The maximal value of the usage over time. | ||
25 | |||
26 | This value is useful when gathering statistical information about | ||
27 | the particular group, as it shows the actual resource requirements | ||
28 | for a particular group, not just some usage snapshot. | ||
29 | |||
30 | c. unsigned long long limit | ||
31 | |||
32 | The maximum amount of the resource the group is allowed to consume. If | ||
33 | the group requests more resources, so that the usage value | ||
34 | would exceed the limit, the resource allocation is rejected (see | ||
35 | the next section). | ||
36 | |||
37 | d. unsigned long long failcnt | ||
38 | |||
39 | The failcnt stands for "failures counter". This is the number of | ||
40 | resource allocation attempts that failed. | ||
41 | |||
42 | e. spinlock_t lock | ||
43 | |||
44 | Protects changes of the above values. | ||
45 | |||
46 | |||
47 | |||
48 | 2. Basic accounting routines | ||
49 | |||
50 | a. void res_counter_init(struct res_counter *rc) | ||
51 | |||
52 | Initializes the resource counter. As usual, should be the first | ||
53 | routine called for a new counter. | ||
54 | |||
55 | b. int res_counter_charge[_locked] | ||
56 | (struct res_counter *rc, unsigned long val) | ||
57 | |||
58 | When a resource is about to be allocated it has to be accounted | ||
59 | with the appropriate resource counter (controller should determine | ||
60 | which one to use on its own). This operation is called "charging". | ||
61 | |||
62 | It is not very important which operation - resource allocation | ||
63 | or charging - is performed first, but | ||
64 | * if the allocation is performed first, this may create a | ||
65 | temporary resource over-usage by the time the resource counter is | ||
66 | charged; | ||
67 | * if the charging is performed first, then it should be uncharged | ||
68 | on the error path (if one is taken). | ||
69 | |||
70 | c. void res_counter_uncharge[_locked] | ||
71 | (struct res_counter *rc, unsigned long val) | ||
72 | |||
73 | When a resource is released (freed) it should be de-accounted | ||
74 | from the resource counter it was accounted to. This is called | ||
75 | "uncharging". | ||
76 | |||
77 | The _locked routines expect that the res_counter->lock is already taken by the caller. | ||
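The charge and uncharge semantics described above can be modelled in a few lines of userspace C. This is only a sketch: the sketch_counter structure and its functions are hypothetical names, locking is omitted entirely, and initializing the limit to the largest representable value is an assumption rather than a statement about the kernel's res_counter_init().

```c
#include <assert.h>

/* Hypothetical userspace model of the counter fields from section 1. */
struct sketch_counter {
	unsigned long long usage;
	unsigned long long max_usage;
	unsigned long long limit;
	unsigned long long failcnt;
};

static void sketch_counter_init(struct sketch_counter *rc)
{
	rc->usage = 0;
	rc->max_usage = 0;
	rc->failcnt = 0;
	rc->limit = ~0ULL;	/* assumption: start effectively unlimited */
}

/* Returns 0 on success, -1 if the charge would exceed the limit. */
static int sketch_counter_charge(struct sketch_counter *rc, unsigned long val)
{
	if (rc->usage + val > rc->limit) {
		rc->failcnt++;		/* a rejected charge is a failure */
		return -1;
	}
	rc->usage += val;
	if (rc->usage > rc->max_usage)
		rc->max_usage = rc->usage;
	return 0;
}

static void sketch_counter_uncharge(struct sketch_counter *rc, unsigned long val)
{
	/* The caller must not uncharge more than it charged. */
	assert(val <= rc->usage);
	rc->usage -= val;
}
```

A rejected charge leaves usage untouched and only increments failcnt, while max_usage keeps the high-water mark even after the resource is uncharged.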
78 | |||
79 | |||
80 | 2.1 Other accounting routines | ||
81 | |||
82 | There are more routines that may help you with common needs, like | ||
83 | checking whether the limit is reached or resetting the max_usage | ||
84 | value. They are all declared in include/linux/res_counter.h. | ||
85 | |||
86 | |||
87 | |||
88 | 3. Analyzing the resource counter registrations | ||
89 | |||
90 | a. If the failcnt value constantly grows, this means that the counter's | ||
91 | limit is too tight. Either the group is misbehaving and consumes too | ||
92 | many resources, or the configuration is not suitable for the group | ||
93 | and the limit should be increased. | ||
94 | |||
95 | b. The max_usage value can be used to quickly tune the group. One may | ||
96 | set the limits to maximal values and either load the container with | ||
97 | a common pattern or leave it running for a while. After this the max_usage | ||
98 | value shows the amount of memory the container would require during | ||
99 | its common activity. | ||
100 | |||
101 | Setting the limit a bit above this value gives a pretty good | ||
102 | configuration that works in most cases. | ||
103 | |||
104 | c. If the max_usage is much less than the limit, but the failcnt value | ||
105 | is growing, then the group tries to allocate a big chunk of resource | ||
106 | at once. | ||
107 | |||
108 | d. If the max_usage is much less than the limit, but the failcnt value | ||
109 | is 0, then this group has been given a higher limit than it | ||
110 | requires. It is better to lower the limit a bit, leaving more resource | ||
111 | for other groups. | ||
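The rules of thumb a, c and d above can be expressed as a small decision function. In this C sketch the names are hypothetical, "much less than the limit" is arbitrarily taken to mean less than half of it, and "constantly grows" is approximated by comparing two failcnt samples; all of these are assumptions for illustration only.

```c
#include <assert.h>

enum sketch_advice {
	SKETCH_OK,			/* nothing obvious to tune */
	SKETCH_LIMIT_TOO_TIGHT,		/* rule a */
	SKETCH_BURSTY_ALLOCATIONS,	/* rule c */
	SKETCH_LIMIT_TOO_HIGH		/* rule d */
};

/*
 * Classify a counter from its max_usage, its limit and two failcnt
 * samples taken some time apart without resetting failcnt in between.
 */
static enum sketch_advice sketch_analyze(unsigned long long max_usage,
					 unsigned long long limit,
					 unsigned long long failcnt_before,
					 unsigned long long failcnt_now)
{
	int growing, far_below;

	assert(failcnt_now >= failcnt_before);

	growing = failcnt_now > failcnt_before;
	/* Assumption: "much less than the limit" means under half of it. */
	far_below = max_usage < limit / 2;

	if (far_below && growing)
		return SKETCH_BURSTY_ALLOCATIONS;
	if (growing)
		return SKETCH_LIMIT_TOO_TIGHT;
	if (far_below && failcnt_now == 0)
		return SKETCH_LIMIT_TOO_HIGH;
	return SKETCH_OK;
}
```

Rule b is not covered here, because it describes a tuning procedure rather than a condition that can be read from a single pair of samples.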
112 | |||
113 | |||
114 | |||
115 | 4. Communication with the control groups subsystem (cgroups) | ||
116 | |||
117 | All the resource controllers that are using cgroups and resource counters | ||
118 | should provide files (in the cgroup filesystem) to work with the resource | ||
119 | counter fields. They are recommended to adhere to the following rules: | ||
120 | |||
121 | a. File names | ||
122 | |||
123 | Field name File name | ||
124 | --------------------------------------------------- | ||
125 | usage usage_in_<unit_of_measurement> | ||
126 | max_usage max_usage_in_<unit_of_measurement> | ||
127 | limit limit_in_<unit_of_measurement> | ||
128 | failcnt failcnt | ||
129 | lock no file :) | ||
130 | |||
131 | b. Reading from file should show the corresponding field value in the | ||
132 | appropriate format. | ||
133 | |||
134 | c. Writing to file | ||
135 | |||
136 | Field Expected behavior | ||
137 | ---------------------------------- | ||
138 | usage prohibited | ||
139 | max_usage reset to usage | ||
140 | limit set the limit | ||
141 | failcnt reset to zero | ||
142 | |||
143 | |||
144 | |||
145 | 5. Usage example | ||
146 | |||
147 | a. Declare a task group (take a look at cgroups subsystem for this) and | ||
148 | fold a res_counter into it | ||
149 | |||
150 | struct my_group { | ||
151 | struct res_counter res; | ||
152 | |||
153 | <other fields> | ||
154 | }; | ||
155 | |||
156 | b. Put hooks in resource allocation/release paths | ||
157 | |||
158 | int alloc_something(...) | ||
159 | { | ||
160 | if (res_counter_charge(res_counter_ptr, amount) < 0) | ||
161 | return -ENOMEM; | ||
162 | |||
163 | <allocate the resource and return to the caller> | ||
164 | } | ||
165 | |||
166 | void release_something(...) | ||
167 | { | ||
168 | res_counter_uncharge(res_counter_ptr, amount); | ||
169 | |||
170 | <release the resource> | ||
171 | } | ||
172 | |||
173 | In order to keep the usage value self-consistent, both the | ||
174 | "res_counter_ptr" and the "amount" in release_something() should be | ||
175 | the same as they were in alloc_something() when the resource being | ||
176 | released was allocated. | ||
177 | |||
178 | c. Provide a way to read res_counter values and set them (cgroups | ||
179 | can help with this as well). | ||
180 | |||
181 | d. Compile and run :) | ||