author	Linus Torvalds <torvalds@ppc970.osdl.org>	2005-04-16 18:20:36 -0400
committer	Linus Torvalds <torvalds@ppc970.osdl.org>	2005-04-16 18:20:36 -0400
commit	1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 (patch)
tree	0bba044c4ce775e45a88a51686b5d9f90697ea9d /Documentation/vm
Linux-2.6.12-rc2 (tag: v2.6.12-rc2)
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
Diffstat (limited to 'Documentation/vm')
-rw-r--r--  Documentation/vm/balance                  93
-rw-r--r--  Documentation/vm/hugetlbpage.txt         284
-rw-r--r--  Documentation/vm/locking                 131
-rw-r--r--  Documentation/vm/numa                     41
-rw-r--r--  Documentation/vm/overcommit-accounting    73
5 files changed, 622 insertions, 0 deletions
diff --git a/Documentation/vm/balance b/Documentation/vm/balance
new file mode 100644
index 000000000000..bd3d31bc4915
--- /dev/null
+++ b/Documentation/vm/balance
@@ -0,0 +1,93 @@
Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>

Memory balancing is needed for non __GFP_WAIT as well as for non
__GFP_IO allocations.

There are two reasons to request a non __GFP_WAIT allocation: the caller
cannot sleep (typically because it is in interrupt context), or it does
not want to incur the cost of page stealing and possible swap I/O for
whatever reason.

Non __GFP_IO allocation requests are made to prevent file system deadlocks.

In the absence of non-sleepable allocation requests, it seems detrimental
to be doing balancing. Page reclamation can be kicked off lazily, that
is, only when needed (i.e. when a zone's free memory drops to 0), instead
of making it a proactive process.

That being said, the kernel should try to fulfill requests for direct
mapped pages from the direct mapped pool, instead of falling back on
the dma pool, so as to keep the dma pool filled for dma requests (atomic
or not). A similar argument applies to highmem and direct mapped pages.
OTOH, if there are a lot of free dma pages, it is preferable to satisfy
regular memory requests by allocating one from the dma pool, instead
of incurring the overhead of regular zone balancing.

In 2.2, memory balancing/page reclamation would kick off only when the
_total_ number of free pages fell below 1/64th of total memory. With the
right ratio of dma and regular memory, it is quite possible that balancing
would not be done even when the dma zone was completely empty. 2.2 has
been running on production machines of varying memory sizes, and seems to
be doing fine even in the presence of this problem. In 2.3, this problem
is aggravated by HIGHMEM.

In 2.3, zone balancing can be done in one of two ways: depending on the
zone size (and possibly the size of its lower class zones), we can decide
at init time how many free pages we should aim for while balancing any
zone. The good part is that, while balancing, we do not need to look at
the sizes of the lower class zones; the bad part is that we might balance
too frequently because we ignore the possibly lower usage in the lower
class zones. Also, with a slight change in the allocation routine, it is
possible to reduce the memclass() macro to a simple equality check.

Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones. This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible. Also,
the balancing algorithm works the same way on the various architectures,
which have different numbers and types of zones. If we wanted to get
fancy, we could assign different weights to free pages in different
zones in the future.

Note that if the size of the regular zone is huge compared to the dma zone,
it becomes less significant to consider the free dma pages while
deciding whether to balance the regular zone. The first solution
becomes more attractive then.

The appended patch implements the second solution. It also "fixes" two
problems: first, kswapd is woken up as in 2.2 on low memory conditions
for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
so as to give a fighting chance for replace_with_highmem() to get a
HIGHMEM page, as well as to ensure that HIGHMEM allocations do not fall
back into the regular zone. This also makes sure that HIGHMEM pages are
not leaked (for example, in situations where a HIGHMEM page is in the
swapcache but is not being used by anyone).

kswapd also needs to know about the zones it should balance. kswapd is
primarily needed in a situation where balancing cannot be done,
probably because all allocation requests are coming from interrupt
context and all process contexts are sleeping. For 2.3, kswapd does not
really need to balance the highmem zone, since interrupt context does
not request highmem pages. kswapd looks at the zone_wake_kswapd field
in the zone structure to decide whether a zone needs balancing.

Page stealing from process memory and shm is done if stealing the page
would alleviate memory pressure on any zone in the page's node that has
fallen below its watermark.

pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are
per-zone fields, used to determine when a zone needs to be balanced. When
the number of free pages falls below pages_min, the hysteresis field
low_on_memory gets set. It stays set until the number of free pages
reaches pages_high. While low_on_memory is set, page allocation requests
will try to free some pages in the zone (provided __GFP_WAIT is set in
the request). Orthogonal to this is the decision to poke kswapd to free
some zone pages. That decision is not hysteresis based, and is made when
the number of free pages drops below pages_low; in that case
zone_wake_kswapd is also set.
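
As an illustration only, here is a minimal sketch of the two decisions
described above (the struct and function names are made up for this
example; only the field names follow this document, and the real code
differs in detail):

	/*
	 * Illustrative sketch only.  It shows the hysteresis on
	 * low_on_memory and the separate, threshold-based kswapd wakeup.
	 */
	struct zone_sketch {
		unsigned long free_pages;
		unsigned long pages_min, pages_low, pages_high;
		int low_on_memory;	/* hysteresis flag */
		int zone_wake_kswapd;
	};

	static void zone_balance_check(struct zone_sketch *z)
	{
		/* Set below pages_min, cleared only once we reach pages_high. */
		if (z->free_pages < z->pages_min)
			z->low_on_memory = 1;
		else if (z->free_pages >= z->pages_high)
			z->low_on_memory = 0;

		/* kswapd wakeup is a plain threshold, no hysteresis. */
		z->zone_wake_kswapd = (z->free_pages < z->pages_low);
	}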

(Good) Ideas that I have heard:
1. Dynamic experience should influence balancing: the number of failed
   requests for a zone can be tracked and fed into the balancing scheme
   (jalvo@mbay.net)
2. Implement a replace_with_highmem()-like replace_with_regular() to
   preserve dma pages. (lkd@tantalophile.demon.co.uk)
diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
new file mode 100644
index 000000000000..1b9bcd1fe98b
--- /dev/null
+++ b/Documentation/vm/hugetlbpage.txt
@@ -0,0 +1,284 @@

The intent of this file is to give a brief summary of hugetlbpage support in
the Linux kernel. This support is built on top of the multiple page size
support provided by most modern architectures. For example, the i386
architecture supports 4K and 4M (2M in PAE mode) page sizes, the ia64
architecture supports multiple page sizes (4K, 8K, 64K, 256K, 1M, 4M, 16M,
256M), and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical
translations, and is typically a very scarce resource on a processor.
Operating systems try to make the best use of the limited number of TLB
entries. This optimization is more critical now that bigger and bigger
physical memories (several GBs) are readily available.

Users can use the huge page support in the Linux kernel either via the mmap
system call or via the standard SYSV shared memory system calls (shmget,
shmat).

First, the Linux kernel needs to be built with the CONFIG_HUGETLB_PAGE
(present under "Processor type and features") and CONFIG_HUGETLBFS (present
under "File systems" in the config menu) config options.

On a kernel built with hugepage support, the number of configured hugepages
in the system can be seen by running the "cat /proc/meminfo" command.

/proc/meminfo reports the total number of hugetlb pages configured in the
kernel, the number of hugetlb pages that are currently free, and the
configured hugepage size. The hugepage size is needed for generating the
proper alignment and size of the arguments to the system calls mentioned
above.

The output of "cat /proc/meminfo" will include lines like:

.....
HugePages_Total: xxx
HugePages_Free:  yyy
Hugepagesize:    zzz kB

/proc/filesystems should also show a filesystem of type "hugetlbfs"
configured in the kernel.

/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
pages in the kernel. The superuser can dynamically request more (or free some
pre-configured) hugepages.
The allocation (or deallocation) of hugetlb pages is possible only if there
are enough physically contiguous free pages in the system (freeing of
hugepages is possible only if there are enough free hugetlb pages that can
be transferred back to the regular memory pool).

Pages that are used as hugetlb pages are reserved inside the kernel and
cannot be used for other purposes.

Once the kernel with hugetlb page support is built and running, a user can
use either the mmap system call or the shared memory system calls to start
using huge pages. It is required that the system administrator preallocate
enough memory for huge pages.

Use the following command to dynamically allocate/deallocate hugepages:

	echo 20 > /proc/sys/vm/nr_hugepages

This command will try to configure 20 hugepages in the system. The success
or failure of the allocation depends on the amount of physically contiguous
memory that is present in the system at that time. System administrators may
want to put this command in one of the local rc init files, so that the
kernel can allocate huge pages early in the boot process (when the
possibility of getting physically contiguous pages is still very high).

If user applications are going to request huge pages using the mmap system
call, then the system administrator needs to mount a file system of type
hugetlbfs:

	mount none /mnt/huge -t hugetlbfs <uid=value> <gid=value> <mode=value>
		<size=value> <nr_inodes=value>

This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
/mnt/huge. Any file created on /mnt/huge uses huge pages. The uid and gid
options set the owner and group of the root of the file system. By default
the uid and gid of the current process are taken. The mode option sets the
mode of the root of the file system to value & 0777. This value is given in
octal. By default the value 0755 is picked. The size option sets the maximum
amount of memory (huge pages) allowed for that filesystem (/mnt/huge). The
size is rounded down to HPAGE_SIZE. The option nr_inodes sets the maximum
number of inodes that /mnt/huge can use. If the size or nr_inodes options
are not provided on the command line then no limits are set. For the size
and nr_inodes options, you can use [G|g]/[M|m]/[K|k] to represent
giga/mega/kilo. For example, size=2K has the same meaning as size=2048. An
example is given at the end of this document.

read and write system calls are not supported on files that reside on hugetlb
file systems.

The regular chown, chgrp and chmod commands (with the right permissions) can
be used to change the file attributes on hugetlbfs.

Also, it is important to note that no such mount command is required if
applications are only going to use the shmat/shmget system calls. Users who
wish to use hugetlb pages via shared memory segments should be members of
a supplementary group, and the system administrator needs to configure that
gid into /proc/sys/vm/hugetlb_shm_group. It is possible for the same or
different applications to use any combination of mmap and shm* calls, though
mounting the filesystem is required for using mmap.

*******************************************************************

/*
 * Example of using hugepage memory in a user application using Sys V shared
 * memory system calls.  In this example the app is requesting 256MB of
 * memory that is backed by huge pages.  The application uses the flag
 * SHM_HUGETLB in the shmget system call to inform the kernel that it is
 * requesting hugepages.
 *
 * For the ia64 architecture, the Linux kernel reserves Region number 4 for
 * hugepages.  That means the addresses starting with 0x800000... will need
 * to be specified.  Specifying a fixed address is not required on ppc64,
 * i386 or x86_64.
 *
 * Note: The default shared memory limit is quite low on many kernels,
 * you may need to increase it via:
 *
 * echo 268435456 > /proc/sys/kernel/shmmax
 *
 * This will increase the maximum size per shared memory segment to 256MB.
 * The other limit that you will hit eventually is shmall which is the
 * total amount of shared memory in pages. To set it to 16GB on a system
 * with a 4kB pagesize do:
 *
 * echo 4194304 > /proc/sys/kernel/shmall
 */
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000
#endif

#define LENGTH (256UL*1024*1024)

#define dprintf(x)  printf(x)

/* Only ia64 requires this */
#ifdef __ia64__
#define ADDR (void *)(0x8000000000000000UL)
#define SHMAT_FLAGS (SHM_RND)
#else
#define ADDR (void *)(0x0UL)
#define SHMAT_FLAGS (0)
#endif

int main(void)
{
	int shmid;
	unsigned long i;
	char *shmaddr;

	if ((shmid = shmget(2, LENGTH,
			    SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0) {
		perror("shmget");
		exit(1);
	}
	printf("shmid: 0x%x\n", shmid);

	shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
	if (shmaddr == (char *)-1) {
		perror("Shared memory attach failure");
		shmctl(shmid, IPC_RMID, NULL);
		exit(2);
	}
	printf("shmaddr: %p\n", shmaddr);

	dprintf("Starting the writes:\n");
	for (i = 0; i < LENGTH; i++) {
		shmaddr[i] = (char)(i);
		if (!(i % (1024 * 1024)))
			dprintf(".");
	}
	dprintf("\n");

	dprintf("Starting the Check...");
	for (i = 0; i < LENGTH; i++)
		if (shmaddr[i] != (char)i)
			printf("\nIndex %lu mismatched\n", i);
	dprintf("Done.\n");

	if (shmdt((const void *)shmaddr) != 0) {
		perror("Detach failure");
		shmctl(shmid, IPC_RMID, NULL);
		exit(3);
	}

	shmctl(shmid, IPC_RMID, NULL);

	return 0;
}

*******************************************************************

/*
 * Example of using hugepage memory in a user application using the mmap
 * system call.  Before running this application, make sure that the
 * administrator has mounted the hugetlbfs filesystem (on some directory
 * like /mnt) using the command mount -t hugetlbfs nodev /mnt.  In this
 * example, the app is requesting memory of size 256MB that is backed by
 * huge pages.
 *
 * For the ia64 architecture, the Linux kernel reserves Region number 4 for
 * hugepages.  That means the addresses starting with 0x800000... will need
 * to be specified.  Specifying a fixed address is not required on ppc64,
 * i386 or x86_64.
 */
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>

#define FILE_NAME "/mnt/hugepagefile"
#define LENGTH (256UL*1024*1024)
#define PROTECTION (PROT_READ | PROT_WRITE)

/* Only ia64 requires this */
#ifdef __ia64__
#define ADDR (void *)(0x8000000000000000UL)
#define FLAGS (MAP_SHARED | MAP_FIXED)
#else
#define ADDR (void *)(0x0UL)
#define FLAGS (MAP_SHARED)
#endif

void check_bytes(char *addr)
{
	printf("First hex is %x\n", *((unsigned int *)addr));
}

void write_bytes(char *addr)
{
	unsigned long i;

	for (i = 0; i < LENGTH; i++)
		*(addr + i) = (char)i;
}

void read_bytes(char *addr)
{
	unsigned long i;

	check_bytes(addr);
	for (i = 0; i < LENGTH; i++)
		if (*(addr + i) != (char)i) {
			printf("Mismatch at %lu\n", i);
			break;
		}
}

int main(void)
{
	void *addr;
	int fd;

	fd = open(FILE_NAME, O_CREAT | O_RDWR, 0755);
	if (fd < 0) {
		perror("Open failed");
		exit(1);
	}

	addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		unlink(FILE_NAME);
		exit(1);
	}

	printf("Returned address is %p\n", addr);
	check_bytes(addr);
	write_bytes(addr);
	read_bytes(addr);

	munmap(addr, LENGTH);
	close(fd);
	unlink(FILE_NAME);

	return 0;
}
diff --git a/Documentation/vm/locking b/Documentation/vm/locking
new file mode 100644
index 000000000000..c3ef09ae3bb1
--- /dev/null
+++ b/Documentation/vm/locking
@@ -0,0 +1,131 @@
Started Oct 1999 by Kanoj Sarcar <kanojsarcar@yahoo.com>

The intent of this file is to have an up-to-date, running commentary
from different people about how locking and synchronization are done
in the Linux vm code.

page_table_lock & mmap_sem
--------------------------------------

Page stealers pick processes out of the process pool and scan for
the best process to steal pages from. To guarantee the existence
of the victim mm, an mm_count inc and a mmdrop are done in swap_out().
Page stealers hold kernel_lock to protect against a bunch of races.
The vma list of the victim mm is also scanned by the stealer,
and the page_table_lock is used to preserve list sanity against the
process adding to/deleting from the list. This also guarantees existence
of the vma. Vma existence is not guaranteed once try_to_swap_out()
drops the page_table_lock. To guarantee the existence of the underlying
file structure, a get_file is done before the swapout() method is
invoked. The page passed into swapout() is guaranteed not to be reused
for a different purpose because the page reference count due to it being
present in the user's pte is not released until after swapout() returns.

Any code that modifies the vmlist, or the vm_start/vm_end/
vm_flags:VM_LOCKED/vm_next of any vma *in the list* must prevent
kswapd from looking at the chain.

The rules are:
1. To scan the vmlist (look but don't touch) you must hold the
   mmap_sem with read bias, i.e. down_read(&mm->mmap_sem).
2. To modify the vmlist you need to hold the mmap_sem with
   read&write bias, i.e. down_write(&mm->mmap_sem), *AND*
   you need to take the page_table_lock.
3. The swapper takes _just_ the page_table_lock; this is done
   because the mmap_sem can be an extremely long lived lock
   and the swapper just cannot sleep on that.
4. The exception to this rule is expand_stack, which just
   takes the read lock and the page_table_lock; this is ok
   because it doesn't really modify fields anybody relies on.
5. You must be able to guarantee that while holding the mmap_sem
   or page_table_lock of mm A, you will not try to get either lock
   for mm B.
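
As an illustration only (the helper names below are hypothetical and this
is a sketch, not a quote of kernel code), rules 1 and 2 look roughly like
this for a reader and a writer of the vmlist:

	#include <linux/mm.h>
	#include <linux/sched.h>

	static void scan_vmas(struct mm_struct *mm)
	{
		struct vm_area_struct *vma;

		down_read(&mm->mmap_sem);	/* rule 1: look but don't touch */
		for (vma = mm->mmap; vma; vma = vma->vm_next)
			;	/* inspect vm_start, vm_end, vm_flags ... */
		up_read(&mm->mmap_sem);
	}

	static void modify_vmas(struct mm_struct *mm)
	{
		down_write(&mm->mmap_sem);		/* rule 2: exclusive ... */
		spin_lock(&mm->page_table_lock);	/* ... plus page_table_lock */
		/* add/delete vmas, change vm_start/vm_end here */
		spin_unlock(&mm->page_table_lock);
		up_write(&mm->mmap_sem);
	}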

The caveats are:
1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
   The update of mmap_cache is racy (a page stealer can race with other
   code that invokes find_vma with mmap_sem held), but that is okay,
   since it is only a hint. This can be fixed, if desired, by having
   find_vma grab the page_table_lock.

Code that adds/deletes elements from the vmlist chain:
1. callers of insert_vm_struct
2. callers of merge_segments
3. callers of avl_remove

Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vmas on
the list:
1. expand_stack
2. mprotect
3. mlock
4. mremap

It is advisable that changes to vm_start/vm_end be protected, although
in some cases it is not really needed. E.g., vm_start is modified by
expand_stack(), and it is hard to come up with a destructive scenario
without the vmlist protection in this case.

The page_table_lock nests with the inode i_mmap_lock and the kmem cache
c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
dropping c_spinlock. The page_table_lock also nests with the pagecache_lock
and pagemap_lru_lock spinlocks, and no code asks for memory with these locks
held.

The page_table_lock is grabbed while holding the kernel_lock spinning monitor.

The page_table_lock is a spin lock.

Note: PTL can also be used to guarantee that no new clones using the
mm start up ... this is a loose form of stability on mm_users. For
example, it is used in copy_mm to protect against a racing tlb_gather_mmu
single address space optimization, so that the zap_page_range (from
vmtruncate) does not miss sending IPIs to cloned threads that might
be spawned underneath it and go to user mode to drag ptes into their TLBs.

swap_list_lock/swap_device_lock
-------------------------------
The swap devices are chained in priority order from the "swap_list" header.
The "swap_list" is used for the round-robin swaphandle allocation strategy.
The number of free swaphandles is maintained in "nr_swap_pages". These two
together are protected by the swap_list_lock.

The swap_device_lock, which is per swap device, protects the reference
counts on the corresponding swaphandles, maintained in the "swap_map"
array, and the "highest_bit" and "lowest_bit" fields.

Both of these are spinlocks, and are never acquired from intr level. The
locking hierarchy is swap_list_lock -> swap_device_lock.

To prevent races between swap space deletion or async readahead swapins
deciding whether a swap handle is being used (i.e. is worth being read in
from disk) and an unmap -> swap_free making the handle unused, the swap
delete and readahead code grabs a temporary reference on the swaphandle
to prevent warning messages from swap_duplicate <- read_swap_cache_async.

Swap cache locking
------------------
Pages are added to the swap cache with kernel_lock held, to make sure
that multiple pages are not being added (and hence lost) by associating
all of them with the same swaphandle.

Pages are guaranteed not to be removed from the scache if the page is
"shared": i.e., other processes hold a reference on the page or the
associated swap handle. The only code that does not follow this rule is
shrink_mmap, which deletes pages from the swap cache if no process has a
reference on the page (multiple processes might have references on the
corresponding swap handle though). lookup_swap_cache() races with
shrink_mmap when establishing a reference on a scache page, so it must
check whether the page it located is still in the swapcache, or whether
shrink_mmap deleted it. (This race exists because shrink_mmap looks at
the page ref count with pagecache_lock held, but then drops pagecache_lock
before deleting the page from the scache.)

do_wp_page and do_swap_page have MP races in them while trying to figure
out whether a page is "shared", by looking at the page_count + swap_count.
To preserve the sum of the counts, the page lock _must_ be acquired before
calling is_page_shared (else processes might switch their swap_count refs
to page count refs after the page count ref has been snapshotted).

Swap device deletion code currently breaks all the scache assumptions,
since it grabs neither mmap_sem nor page_table_lock.
diff --git a/Documentation/vm/numa b/Documentation/vm/numa
new file mode 100644
index 000000000000..4b8db1bd3b78
--- /dev/null
+++ b/Documentation/vm/numa
@@ -0,0 +1,41 @@
Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>

The intent of this file is to have an up-to-date, running commentary
from different people about NUMA specific code in the Linux vm.

What is NUMA? It is an architecture where the memory access times
for different regions of memory from a given processor vary
according to the "distance" of the memory region from the processor.
Each region of memory to which access times are the same from any
cpu is called a node. On such architectures, it is beneficial if
the kernel tries to minimize inter-node communication. Schemes
for this range from replicating kernel text and read-only data
across nodes, to trying to house all the data structures that
key components of the kernel need in memory on the local node.

Currently, all the NUMA support is there to provide efficient handling
of widely discontiguous physical memory, so architectures which
are not NUMA but can have huge holes in the physical address space
can use the same code. All this code is bracketed by CONFIG_DISCONTIGMEM.

The initial port includes NUMAizing the bootmem allocator code by
encapsulating all the pieces of information into a bootmem_data_t
structure. Node specific calls have been added to the allocator.
In theory, any platform which uses the bootmem allocator should
be able to put the bootmem and mem_map data structures anywhere
it deems best.

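As an illustration only (place_table_on_node is a hypothetical helper, not
a kernel function), the node-aware bootmem call lets early boot code ask
for memory out of a particular node's range:

	#include <linux/bootmem.h>
	#include <linux/init.h>
	#include <linux/mmzone.h>

	/* Sketch: place an early boot table on the given node's memory. */
	static void __init place_table_on_node(pg_data_t *pgdat,
					       unsigned long bytes)
	{
		void *table = alloc_bootmem_node(pgdat, bytes);

		/* ... initialize the table ... */
		(void)table;
	}
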
Each node's page allocation data structures have also been encapsulated
into a pg_data_t. The bootmem_data_t is just one part of this. To
make the code look uniform between NUMA and regular UMA platforms,
UMA platforms have a statically allocated pg_data_t too (contig_page_data).
For the sake of uniformity, the function num_online_nodes() is also defined
for all platforms. As we run benchmarks, we might decide to NUMAize
more variables such as low_on_memory and nr_free_pages into the pg_data_t.

The NUMA aware page allocation code currently tries to allocate pages
from different nodes in a round robin manner. This will be changed to
a concentric circle search, starting from the current node, once the
NUMA port achieves more maturity. The call alloc_pages_node has been
added, so that drivers can make the call and not worry about whether
they are running on a NUMA or UMA platform.
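
As an illustration only (grab_page_on_node is a hypothetical driver helper,
not a quote from the kernel), alloc_pages_node can be used like this on both
NUMA and UMA builds:

	#include <linux/gfp.h>
	#include <linux/mm.h>

	/* Sketch: allocate one page on (or as close as possible to) node nid. */
	static struct page *grab_page_on_node(int nid)
	{
		struct page *page = alloc_pages_node(nid, GFP_KERNEL, 0);

		return page;	/* caller frees with __free_pages(page, 0) */
	}
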
diff --git a/Documentation/vm/overcommit-accounting b/Documentation/vm/overcommit-accounting
new file mode 100644
index 000000000000..21c7b1f8f32b
--- /dev/null
+++ b/Documentation/vm/overcommit-accounting
@@ -0,0 +1,73 @@
The Linux kernel supports the following overcommit handling modes:

0	-	Heuristic overcommit handling. Obvious overcommits of
		address space are refused. Used for a typical system. It
		ensures a seriously wild allocation fails while allowing
		overcommit to reduce swap usage. root is allowed to
		allocate slightly more memory in this mode. This is the
		default.

1	-	Always overcommit. Appropriate for some scientific
		applications.

2	-	Don't overcommit. The total address space commit
		for the system is not permitted to exceed swap + a
		configurable percentage (default is 50) of physical RAM.
		Depending on the percentage you use, in most situations
		this means a process will not be killed while accessing
		pages but will receive errors on memory allocation as
		appropriate.

The overcommit policy is set via the sysctl `vm.overcommit_memory'.

The overcommit percentage is set via `vm.overcommit_ratio'.

The current overcommit limit and amount committed are viewable in
/proc/meminfo as CommitLimit and Committed_AS respectively.

Gotchas
-------

The C language stack growth does an implicit mremap. If you want absolute
guarantees and run close to the edge you MUST mmap your stack for the
largest size you think you will need. For typical stack usage this does
not matter much, but it's a corner case if you really, really care.

In mode 2 the MAP_NORESERVE flag is ignored.


How It Works
------------

The overcommit is based on the following rules:

For a file backed map
	SHARED or READ-only	-	0 cost (the file is the map, not swap)
	PRIVATE WRITABLE	-	size of mapping per instance

For an anonymous or /dev/zero map
	SHARED			-	size of mapping
	PRIVATE READ-only	-	0 cost (but of little use)
	PRIVATE WRITABLE	-	size of mapping per instance

Additional accounting
	Pages made writable copies by mmap
	shmfs memory drawn from the same pool

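As an illustration only (a standalone userspace sketch, not from the kernel
tree): under these rules an anonymous PRIVATE WRITABLE mapping is charged in
full when it is created, so in mode 2 the mmap() call itself fails with
ENOMEM once the commit limit would be exceeded, rather than the process
being killed later on first touch:

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 1UL << 30;	/* 1GB charged against CommitLimit */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");	/* ENOMEM in mode 2 if over the limit */
			return 1;
		}
		munmap(p, len);
		return 0;
	}
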
Status
------

o We account mmap memory mappings
o We account mprotect changes in commit
o We account mremap changes in size
o We account brk
o We account munmap
o We report the commit status in /proc
o Account and check on fork
o Review stack handling/building on exec
o SHMfs accounting
o Implement actual limit enforcement

To Do
-----
o Account ptrace pages (this is hard)