author | Peter W Morreale <pmorreale@novell.com> | 2009-01-15 16:50:42 -0500 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2009-01-15 19:39:35 -0500 |
commit | db0fb1848a645b0b1b033765f3a5244e7afd2e3c (patch) | |
tree | cf6b63e52fad2fa626e2a08251815f07626682dd /Documentation | |
parent | b5db0e38653bfada34a92f360b4111566ede3842 (diff) |
Update of Documentation: vm.txt and proc.txt
Update Documentation/sysctl/vm.txt and Documentation/filesystems/proc.txt.
More specifically, the section on /proc/sys/vm in
Documentation/filesystems/proc.txt was removed and a link to
Documentation/sysctl/vm.txt added.
Most of the verbiage from proc.txt was simply moved into vm.txt, with new
additional text for "swappiness" and "stat_interval".
Signed-off-by: Peter W Morreale <pmorreale@novell.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/filesystems/proc.txt | 288 | ||||
-rw-r--r-- | Documentation/sysctl/vm.txt | 619 |
2 files changed, 437 insertions, 470 deletions
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index d105eb45282a..bbebc3a43ac0 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt | |||
@@ -1371,292 +1371,8 @@ auto_msgmni default value is 1. | |||
1371 | 2.4 /proc/sys/vm - The virtual memory subsystem | 1371 | 2.4 /proc/sys/vm - The virtual memory subsystem |
1372 | ----------------------------------------------- | 1372 | ----------------------------------------------- |
1373 | 1373 | ||
1374 | The files in this directory can be used to tune the operation of the virtual | 1374 | Please see: Documentation/sysctl/vm.txt for a description of these |
1375 | memory (VM) subsystem of the Linux kernel. | 1375 | entries. |
1376 | |||
1377 | vfs_cache_pressure | ||
1378 | ------------------ | ||
1379 | |||
1380 | Controls the tendency of the kernel to reclaim the memory which is used for | ||
1381 | caching of directory and inode objects. | ||
1382 | |||
1383 | At the default value of vfs_cache_pressure=100 the kernel will attempt to | ||
1384 | reclaim dentries and inodes at a "fair" rate with respect to pagecache and | ||
1385 | swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer | ||
1386 | to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 | ||
1387 | causes the kernel to prefer to reclaim dentries and inodes. | ||
1388 | |||
1389 | dirty_background_bytes | ||
1390 | ---------------------- | ||
1391 | |||
1392 | Contains the amount of dirty memory at which the pdflush background writeback | ||
1393 | daemon will start writeback. | ||
1394 | |||
1395 | If dirty_background_bytes is written, dirty_background_ratio becomes a function | ||
1396 | of its value (dirty_background_bytes / the amount of dirtyable system memory). | ||
1397 | |||
1398 | dirty_background_ratio | ||
1399 | ---------------------- | ||
1400 | |||
1401 | Contains, as a percentage of the dirtyable system memory (free pages + mapped | ||
1402 | pages + file cache, not including locked pages and HugePages), the number of | ||
1403 | pages at which the pdflush background writeback daemon will start writing out | ||
1404 | dirty data. | ||
1405 | |||
1406 | If dirty_background_ratio is written, dirty_background_bytes becomes a function | ||
1407 | of its value (dirty_background_ratio * the amount of dirtyable system memory). | ||
1408 | |||
1409 | dirty_bytes | ||
1410 | ----------- | ||
1411 | |||
1412 | Contains the amount of dirty memory at which a process generating disk writes | ||
1413 | will itself start writeback. | ||
1414 | |||
1415 | If dirty_bytes is written, dirty_ratio becomes a function of its value | ||
1416 | (dirty_bytes / the amount of dirtyable system memory). | ||
1417 | |||
1418 | dirty_ratio | ||
1419 | ----------- | ||
1420 | |||
1421 | Contains, as a percentage of the dirtyable system memory (free pages + mapped | ||
1422 | pages + file cache, not including locked pages and HugePages), the number of | ||
1423 | pages at which a process which is generating disk writes will itself start | ||
1424 | writing out dirty data. | ||
1425 | |||
1426 | If dirty_ratio is written, dirty_bytes becomes a function of its value | ||
1427 | (dirty_ratio * the amount of dirtyable system memory). | ||
1428 | |||
1429 | dirty_writeback_centisecs | ||
1430 | ------------------------- | ||
1431 | |||
1432 | The pdflush writeback daemons will periodically wake up and write `old' data | ||
1433 | out to disk. This tunable expresses the interval between those wakeups, in | ||
1434 | 100'ths of a second. | ||
1435 | |||
1436 | Setting this to zero disables periodic writeback altogether. | ||
1437 | |||
1438 | dirty_expire_centisecs | ||
1439 | ---------------------- | ||
1440 | |||
1441 | This tunable is used to define when dirty data is old enough to be eligible | ||
1442 | for writeout by the pdflush daemons. It is expressed in 100'ths of a second. | ||
1443 | Data which has been dirty in-memory for longer than this interval will be | ||
1444 | written out next time a pdflush daemon wakes up. | ||
1445 | |||
1446 | highmem_is_dirtyable | ||
1447 | -------------------- | ||
1448 | |||
1449 | Only present if CONFIG_HIGHMEM is set. | ||
1450 | |||
1451 | This defaults to 0 (false), meaning that the ratios set above are calculated | ||
1452 | as a percentage of lowmem only. This protects against excessive scanning | ||
1453 | in page reclaim, swapping and general VM distress. | ||
1454 | |||
1455 | Setting this to 1 can be useful on 32 bit machines where you want to make | ||
1456 | random changes within an MMAPed file that is larger than your available | ||
1457 | lowmem without causing large quantities of random IO. It is safe if the | ||
1458 | behavior of all programs running on the machine is known and memory will | ||
1459 | not be otherwise stressed. | ||
1460 | |||
1461 | legacy_va_layout | ||
1462 | ---------------- | ||
1463 | |||
1464 | If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel | ||
1465 | will use the legacy (2.4) layout for all processes. | ||
1466 | |||
1467 | lowmem_reserve_ratio | ||
1468 | --------------------- | ||
1469 | |||
1470 | For some specialised workloads on highmem machines it is dangerous for | ||
1471 | the kernel to allow process memory to be allocated from the "lowmem" | ||
1472 | zone. This is because that memory could then be pinned via the mlock() | ||
1473 | system call, or by unavailability of swapspace. | ||
1474 | |||
1475 | And on large highmem machines this lack of reclaimable lowmem memory | ||
1476 | can be fatal. | ||
1477 | |||
1478 | So the Linux page allocator has a mechanism which prevents allocations | ||
1479 | which _could_ use highmem from using too much lowmem. This means that | ||
1480 | a certain amount of lowmem is defended from the possibility of being | ||
1481 | captured into pinned user memory. | ||
1482 | |||
1483 | (The same argument applies to the old 16 megabyte ISA DMA region. This | ||
1484 | mechanism will also defend that region from allocations which could use | ||
1485 | highmem or lowmem). | ||
1486 | |||
1487 | The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is | ||
1488 | in defending these lower zones. | ||
1489 | |||
1490 | If you have a machine which uses highmem or ISA DMA and your | ||
1491 | applications are using mlock(), or if you are running with no swap then | ||
1492 | you probably should change the lowmem_reserve_ratio setting. | ||
1493 | |||
1494 | lowmem_reserve_ratio is an array. You can see the values by reading this file. | ||
1495 | - | ||
1496 | % cat /proc/sys/vm/lowmem_reserve_ratio | ||
1497 | 256 256 32 | ||
1498 | - | ||
1499 | Note: the number of elements is one less than the number of zones, because | ||
1500 | the highest zone's value is not needed for the calculation below. | ||
1501 | |||
1502 | These values are not used directly. The kernel calculates the number of | ||
1503 | protection pages for each zone from them. These are shown as the array of | ||
1504 | protection pages in /proc/zoneinfo, as in the following example from an | ||
1505 | x86-64 box. Each zone has an array of protection pages like this. | ||
1506 | |||
1507 | - | ||
1508 | Node 0, zone DMA | ||
1509 | pages free 1355 | ||
1510 | min 3 | ||
1511 | low 3 | ||
1512 | high 4 | ||
1513 | : | ||
1514 | : | ||
1515 | numa_other 0 | ||
1516 | protection: (0, 2004, 2004, 2004) | ||
1517 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
1518 | pagesets | ||
1519 | cpu: 0 pcp: 0 | ||
1520 | : | ||
1521 | - | ||
1522 | These protections are added to the watermark score used to judge whether a | ||
1523 | zone should be used for page allocation or should be reclaimed. | ||
1524 | |||
1525 | In this example, if normal pages (index=2) are requested from this DMA zone | ||
1526 | and pages_high is used as the watermark, the kernel judges that this zone | ||
1527 | should not be used because pages_free (1355) is smaller than watermark + | ||
1528 | protection[2] (4 + 2004 = 2008). If this protection value were 0, this zone | ||
1529 | could be used to satisfy a normal page request. If the request is for the | ||
1530 | DMA zone itself (index=0), protection[0] (=0) is used. | ||
1531 | |||
1532 | zone[i]'s protection[j] is calculated by the following expression: | ||
1533 | |||
1534 | (i < j): | ||
1535 | zone[i]->protection[j] | ||
1536 | = (total of present_pages from zone[i+1] to zone[j] on the node) | ||
1537 | / lowmem_reserve_ratio[i]; | ||
1538 | (i = j): | ||
1539 | 0 (the zone does not need to protect itself) | ||
1540 | (i > j): | ||
1541 | 0 (not used; reported as 0) | ||
1542 | |||
1543 | The default values of lowmem_reserve_ratio[i] are | ||
1544 | 256 (if zone[i] means DMA or DMA32 zone) | ||
1545 | 32 (others). | ||
1546 | As the expression above shows, these values are reciprocals of the ratio: | ||
1547 | 256 means 1/256, so the number of protection pages becomes about 0.39% of | ||
1548 | the total present pages of the higher zones on the node. | ||
1549 | |||
1550 | If you would like to protect more pages, smaller values are effective. | ||
1551 | The minimum value is 1 (1/1 -> 100%). | ||
1552 | |||
1553 | page-cluster | ||
1554 | ------------ | ||
1555 | |||
1556 | page-cluster controls the number of pages which are written to swap in | ||
1557 | a single attempt; that is, the swap I/O size. | ||
1558 | |||
1559 | It is a logarithmic value - setting it to zero means "1 page", setting | ||
1560 | it to 1 means "2 pages", setting it to 2 means "4 pages", etc. | ||
1561 | |||
1562 | The default value is three (eight pages at a time). There may be some | ||
1563 | small benefits in tuning this to a different value if your workload is | ||
1564 | swap-intensive. | ||
1565 | |||
1566 | overcommit_memory | ||
1567 | ----------------- | ||
1568 | |||
1569 | Controls overcommit of system memory, possibly allowing processes | ||
1570 | to allocate (but not use) more memory than is actually available. | ||
1571 | |||
1572 | |||
1573 | 0 - Heuristic overcommit handling. Obvious overcommits of | ||
1574 | address space are refused. Used for a typical system. It | ||
1575 | ensures a seriously wild allocation fails while allowing | ||
1576 | overcommit to reduce swap usage. root is allowed to | ||
1577 | allocate slightly more memory in this mode. This is the | ||
1578 | default. | ||
1579 | |||
1580 | 1 - Always overcommit. Appropriate for some scientific | ||
1581 | applications. | ||
1582 | |||
1583 | 2 - Don't overcommit. The total address space commit | ||
1584 | for the system is not permitted to exceed swap plus a | ||
1585 | configurable percentage (default is 50) of physical RAM. | ||
1586 | Depending on the percentage you use, in most situations | ||
1587 | this means a process will not be killed while attempting | ||
1588 | to use already-allocated memory but will receive errors | ||
1589 | on memory allocation as appropriate. | ||
1590 | |||
1591 | overcommit_ratio | ||
1592 | ---------------- | ||
1593 | |||
1594 | Percentage of physical memory size to include in overcommit calculations | ||
1595 | (see above.) | ||
1596 | |||
1597 | Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100) | ||
1598 | |||
1599 | swapspace = total size of all swap areas | ||
1600 | physmem = size of physical memory in system | ||
1601 | |||
1602 | nr_hugepages and hugetlb_shm_group | ||
1603 | ---------------------------------- | ||
1604 | |||
1605 | nr_hugepages configures the number of hugetlb pages reserved for the system. | ||
1606 | |||
1607 | hugetlb_shm_group contains the group ID that is allowed to create SysV shared | ||
1608 | memory segments using hugetlb pages. | ||
1609 | |||
1610 | hugepages_treat_as_movable | ||
1611 | -------------------------- | ||
1612 | |||
1613 | This parameter is only useful when kernelcore= is specified at boot time to | ||
1614 | create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages | ||
1615 | are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero | ||
1616 | value written to hugepages_treat_as_movable allows huge pages to be allocated | ||
1617 | from ZONE_MOVABLE. | ||
1618 | |||
1619 | Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge | ||
1620 | pages pool can easily grow or shrink within. Assuming that applications are | ||
1621 | not running that mlock() a lot of memory, it is likely the huge pages pool | ||
1622 | can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value | ||
1623 | into nr_hugepages and triggering page reclaim. | ||
1624 | |||
1625 | laptop_mode | ||
1626 | ----------- | ||
1627 | |||
1628 | laptop_mode is a knob that controls "laptop mode". All the things that are | ||
1629 | controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt. | ||
1630 | |||
1631 | block_dump | ||
1632 | ---------- | ||
1633 | |||
1634 | block_dump enables block I/O debugging when set to a nonzero value. More | ||
1635 | information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. | ||
1636 | |||
1637 | swap_token_timeout | ||
1638 | ------------------ | ||
1639 | |||
1640 | This file contains the valid hold time of the swap out protection token. The | ||
1641 | Linux VM has a token based thrashing control mechanism and uses the token to | ||
1642 | prevent unnecessary page faults in a thrashing situation. The unit of the | ||
1643 | value is seconds. The value would be useful to tune thrashing behavior. | ||
1644 | |||
1645 | drop_caches | ||
1646 | ----------- | ||
1647 | |||
1648 | Writing to this will cause the kernel to drop clean caches, dentries and | ||
1649 | inodes from memory, causing that memory to become free. | ||
1650 | |||
1651 | To free pagecache: | ||
1652 | echo 1 > /proc/sys/vm/drop_caches | ||
1653 | To free dentries and inodes: | ||
1654 | echo 2 > /proc/sys/vm/drop_caches | ||
1655 | To free pagecache, dentries and inodes: | ||
1656 | echo 3 > /proc/sys/vm/drop_caches | ||
1657 | |||
1658 | As this is a non-destructive operation and dirty objects are not freeable, the | ||
1659 | user should run `sync' first. | ||
1660 | 1376 | ||
1661 | 1377 | ||
1662 | 2.5 /proc/sys/dev - Device specific parameters | 1378 | 2.5 /proc/sys/dev - Device specific parameters |
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index a3415070bcac..3197fc83bc51 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt | |||
@@ -1,12 +1,13 @@ | |||
1 | Documentation for /proc/sys/vm/* kernel version 2.2.10 | 1 | Documentation for /proc/sys/vm/* kernel version 2.6.29 |
2 | (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> | 2 | (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> |
3 | (c) 2008 Peter W. Morreale <pmorreale@novell.com> | ||
3 | 4 | ||
4 | For general info and legal blurb, please look in README. | 5 | For general info and legal blurb, please look in README. |
5 | 6 | ||
6 | ============================================================== | 7 | ============================================================== |
7 | 8 | ||
8 | This file contains the documentation for the sysctl files in | 9 | This file contains the documentation for the sysctl files in |
9 | /proc/sys/vm and is valid for Linux kernel version 2.2. | 10 | /proc/sys/vm and is valid for Linux kernel version 2.6.29. |
10 | 11 | ||
11 | The files in this directory can be used to tune the operation | 12 | The files in this directory can be used to tune the operation |
12 | of the virtual memory (VM) subsystem of the Linux kernel and | 13 | of the virtual memory (VM) subsystem of the Linux kernel and |
@@ -16,180 +17,274 @@ Default values and initialization routines for most of these | |||
16 | files can be found in mm/swap.c. | 17 | files can be found in mm/swap.c. |
17 | 18 | ||
18 | Currently, these files are in /proc/sys/vm: | 19 | Currently, these files are in /proc/sys/vm: |
19 | - overcommit_memory | 20 | |
20 | - page-cluster | 21 | - block_dump |
21 | - dirty_ratio | 22 | - dirty_background_bytes |
22 | - dirty_background_ratio | 23 | - dirty_background_ratio |
24 | - dirty_bytes | ||
23 | - dirty_expire_centisecs | 25 | - dirty_expire_centisecs |
26 | - dirty_ratio | ||
24 | - dirty_writeback_centisecs | 27 | - dirty_writeback_centisecs |
25 | - highmem_is_dirtyable (only if CONFIG_HIGHMEM set) | 28 | - drop_caches |
29 | - hugepages_treat_as_movable | ||
30 | - hugetlb_shm_group | ||
31 | - laptop_mode | ||
32 | - legacy_va_layout | ||
33 | - lowmem_reserve_ratio | ||
26 | - max_map_count | 34 | - max_map_count |
27 | - min_free_kbytes | 35 | - min_free_kbytes |
28 | - laptop_mode | ||
29 | - block_dump | ||
30 | - drop-caches | ||
31 | - zone_reclaim_mode | ||
32 | - min_unmapped_ratio | ||
33 | - min_slab_ratio | 36 | - min_slab_ratio |
34 | - panic_on_oom | 37 | - min_unmapped_ratio |
35 | - oom_dump_tasks | 38 | - mmap_min_addr |
36 | - oom_kill_allocating_task | ||
37 | - mmap_min_address | ||
38 | - numa_zonelist_order | ||
39 | - nr_hugepages | 39 | - nr_hugepages |
40 | - nr_overcommit_hugepages | 40 | - nr_overcommit_hugepages |
41 | - nr_trim_pages (only if CONFIG_MMU=n) | 41 | - nr_pdflush_threads |
42 | - nr_trim_pages (only if CONFIG_MMU=n) | ||
43 | - numa_zonelist_order | ||
44 | - oom_dump_tasks | ||
45 | - oom_kill_allocating_task | ||
46 | - overcommit_memory | ||
47 | - overcommit_ratio | ||
48 | - page-cluster | ||
49 | - panic_on_oom | ||
50 | - percpu_pagelist_fraction | ||
51 | - stat_interval | ||
52 | - swappiness | ||
53 | - vfs_cache_pressure | ||
54 | - zone_reclaim_mode | ||
55 | |||
42 | 56 | ||
43 | ============================================================== | 57 | ============================================================== |
44 | 58 | ||
45 | dirty_bytes, dirty_ratio, dirty_background_bytes, | 59 | block_dump |
46 | dirty_background_ratio, dirty_expire_centisecs, | ||
47 | dirty_writeback_centisecs, highmem_is_dirtyable, | ||
48 | vfs_cache_pressure, laptop_mode, block_dump, swap_token_timeout, | ||
49 | drop-caches, hugepages_treat_as_movable: | ||
50 | 60 | ||
51 | See Documentation/filesystems/proc.txt | 61 | block_dump enables block I/O debugging when set to a nonzero value. More |
62 | information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. | ||
52 | 63 | ||
53 | ============================================================== | 64 | ============================================================== |
54 | 65 | ||
55 | overcommit_memory: | 66 | dirty_background_bytes |
56 | 67 | ||
57 | This value contains a flag that enables memory overcommitment. | 68 | Contains the amount of dirty memory at which the pdflush background writeback |
69 | daemon will start writeback. | ||
58 | 70 | ||
59 | When this flag is 0, the kernel attempts to estimate the amount | 71 | If dirty_background_bytes is written, dirty_background_ratio becomes a function |
60 | of free memory left when userspace requests more memory. | 72 | of its value (dirty_background_bytes / the amount of dirtyable system memory). |
61 | 73 | ||
62 | When this flag is 1, the kernel pretends there is always enough | 74 | ============================================================== |
63 | memory until it actually runs out. | ||
64 | 75 | ||
65 | When this flag is 2, the kernel uses a "never overcommit" | 76 | dirty_background_ratio |
66 | policy that attempts to prevent any overcommit of memory. | ||
67 | 77 | ||
68 | This feature can be very useful because there are a lot of | 78 | Contains, as a percentage of total system memory, the number of pages at which |
69 | programs that malloc() huge amounts of memory "just-in-case" | 79 | the pdflush background writeback daemon will start writing out dirty data. |
70 | and don't use much of it. | ||
71 | 80 | ||
72 | The default value is 0. | 81 | ============================================================== |
73 | 82 | ||
74 | See Documentation/vm/overcommit-accounting and | 83 | dirty_bytes |
75 | security/commoncap.c::cap_vm_enough_memory() for more information. | 84 | |
85 | Contains the amount of dirty memory at which a process generating disk writes | ||
86 | will itself start writeback. | ||
87 | |||
88 | If dirty_bytes is written, dirty_ratio becomes a function of its value | ||
89 | (dirty_bytes / the amount of dirtyable system memory). | ||
76 | 90 | ||
77 | ============================================================== | 91 | ============================================================== |
78 | 92 | ||
79 | overcommit_ratio: | 93 | dirty_expire_centisecs |
80 | 94 | ||
81 | When overcommit_memory is set to 2, the committed address | 95 | This tunable is used to define when dirty data is old enough to be eligible |
82 | space is not permitted to exceed swap plus this percentage | 96 | for writeout by the pdflush daemons. It is expressed in 100'ths of a second. |
83 | of physical RAM. See above. | 97 | Data which has been dirty in-memory for longer than this interval will be |
98 | written out next time a pdflush daemon wakes up. | ||
99 | |||
100 | ============================================================== | ||
101 | |||
102 | dirty_ratio | ||
103 | |||
104 | Contains, as a percentage of total system memory, the number of pages at which | ||
105 | a process which is generating disk writes will itself start writing out dirty | ||
106 | data. | ||
84 | 107 | ||
85 | ============================================================== | 108 | ============================================================== |
86 | 109 | ||
87 | page-cluster: | 110 | dirty_writeback_centisecs |
88 | 111 | ||
89 | The Linux VM subsystem avoids excessive disk seeks by reading | 112 | The pdflush writeback daemons will periodically wake up and write `old' data |
90 | multiple pages on a page fault. The number of pages it reads | 113 | out to disk. This tunable expresses the interval between those wakeups, in |
91 | is dependent on the amount of memory in your machine. | 114 | 100'ths of a second. |
92 | 115 | ||
93 | The number of pages the kernel reads in at once is equal to | 116 | Setting this to zero disables periodic writeback altogether. |
94 | 2 ^ page-cluster. Values above 2 ^ 5 don't make much sense | ||
95 | for swap because we only cluster swap data in 32-page groups. | ||
96 | 117 | ||
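As a sketch of how these writeback tunables are typically inspected and adjusted (the 15-second interval is illustrative, not a recommendation):

  cat /proc/sys/vm/dirty_writeback_centisecs             # default: 500 (5 seconds)
  echo 1500 > /proc/sys/vm/dirty_writeback_centisecs     # wake pdflush every 15s
  echo 0 > /proc/sys/vm/dirty_writeback_centisecs        # disable periodic writeback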
97 | ============================================================== | 118 | ============================================================== |
98 | 119 | ||
99 | max_map_count: | 120 | drop_caches |
100 | 121 | ||
101 | This file contains the maximum number of memory map areas a process | 122 | Writing to this will cause the kernel to drop clean caches, dentries and |
102 | may have. Memory map areas are used as a side-effect of calling | 123 | inodes from memory, causing that memory to become free. |
103 | malloc, directly by mmap and mprotect, and also when loading shared | ||
104 | libraries. | ||
105 | 124 | ||
106 | While most applications need less than a thousand maps, certain | 125 | To free pagecache: |
107 | programs, particularly malloc debuggers, may consume lots of them, | 126 | echo 1 > /proc/sys/vm/drop_caches |
108 | e.g., up to one or two maps per allocation. | 127 | To free dentries and inodes: |
128 | echo 2 > /proc/sys/vm/drop_caches | ||
129 | To free pagecache, dentries and inodes: | ||
130 | echo 3 > /proc/sys/vm/drop_caches | ||
109 | 131 | ||
110 | The default value is 65536. | 132 | As this is a non-destructive operation and dirty objects are not freeable, the |
133 | user should run `sync' first. | ||
111 | 134 | ||
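For example, a typical invocation pairs sync with the drop; the meminfo reads are only there to observe the effect:

  sync                                    # write out dirty data first
  grep -E 'MemFree|Cached' /proc/meminfo
  echo 3 > /proc/sys/vm/drop_caches       # drop pagecache, dentries and inodes
  grep -E 'MemFree|Cached' /proc/meminfo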
112 | ============================================================== | 135 | ============================================================== |
113 | 136 | ||
114 | min_free_kbytes: | 137 | hugepages_treat_as_movable |
115 | 138 | ||
116 | This is used to force the Linux VM to keep a minimum number | 139 | This parameter is only useful when kernelcore= is specified at boot time to |
117 | of kilobytes free. The VM uses this number to compute a pages_min | 140 | create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages |
118 | value for each lowmem zone in the system. Each lowmem zone gets | 141 | are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero |
119 | a number of reserved free pages based proportionally on its size. | 142 | value written to hugepages_treat_as_movable allows huge pages to be allocated |
143 | from ZONE_MOVABLE. | ||
120 | 144 | ||
121 | Some minimal amount of memory is needed to satisfy PF_MEMALLOC | 145 | Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge |
122 | allocations; if you set this to lower than 1024KB, your system will | 146 | pages pool can easily grow or shrink within. Assuming that applications are |
123 | become subtly broken, and prone to deadlock under high loads. | 147 | not running that mlock() a lot of memory, it is likely the huge pages pool |
124 | 148 | can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value | |
125 | Setting this too high will OOM your machine instantly. | 149 | into nr_hugepages and triggering page reclaim. |
126 | 150 | ||
127 | ============================================================== | 151 | ============================================================== |
128 | 152 | ||
129 | percpu_pagelist_fraction | 153 | hugetlb_shm_group |
130 | 154 | ||
131 | This is the fraction of pages at most (high mark pcp->high) in each zone that | 155 | hugetlb_shm_group contains the group ID that is allowed to create SysV |
132 | are allocated for each per cpu page list. The min value for this is 8. It | 156 | shared memory segments using hugetlb pages. |
133 | means that we don't allow more than 1/8th of pages in each zone to be | ||
134 | allocated in any single per_cpu_pagelist. This entry only changes the value | ||
135 | of hot per cpu pagelists. User can specify a number like 100 to allocate | ||
136 | 1/100th of each zone to each per cpu page list. | ||
137 | 157 | ||
138 | The batch value of each per cpu pagelist is also updated as a result. It is | 158 | ============================================================== |
139 | set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) | ||
140 | 159 | ||
141 | The initial value is zero. Kernel does not use this value at boot time to set | 160 | laptop_mode |
142 | the high water marks for each per cpu page list. | ||
143 | 161 | ||
144 | =============================================================== | 162 | laptop_mode is a knob that controls "laptop mode". All the things that are |
163 | controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt. | ||
145 | 164 | ||
146 | zone_reclaim_mode: | 165 | ============================================================== |
147 | 166 | ||
148 | Zone_reclaim_mode allows someone to set more or less aggressive approaches to | 167 | legacy_va_layout |
149 | reclaim memory when a zone runs out of memory. If it is set to zero then no | ||
150 | zone reclaim occurs. Allocations will be satisfied from other zones / nodes | ||
151 | in the system. | ||
152 | 168 | ||
153 | The value is formed by ORing together: | 169 | If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel |
170 | will use the legacy (2.4) layout for all processes. | ||
154 | 171 | ||
155 | 1 = Zone reclaim on | 172 | ============================================================== |
156 | 2 = Zone reclaim writes dirty pages out | ||
157 | 4 = Zone reclaim swaps pages | ||
158 | 173 | ||
159 | zone_reclaim_mode is set during bootup to 1 if it is determined that pages | 174 | lowmem_reserve_ratio |
160 | from remote zones will cause a measurable performance reduction. The | 175 | |
161 | page allocator will then reclaim easily reusable pages (those page | 176 | For some specialised workloads on highmem machines it is dangerous for |
162 | cache pages that are currently not used) before allocating off node pages. | 177 | the kernel to allow process memory to be allocated from the "lowmem" |
178 | zone. This is because that memory could then be pinned via the mlock() | ||
179 | system call, or by unavailability of swapspace. | ||
180 | |||
181 | And on large highmem machines this lack of reclaimable lowmem memory | ||
182 | can be fatal. | ||
183 | |||
184 | So the Linux page allocator has a mechanism which prevents allocations | ||
185 | which _could_ use highmem from using too much lowmem. This means that | ||
186 | a certain amount of lowmem is defended from the possibility of being | ||
187 | captured into pinned user memory. | ||
188 | |||
189 | (The same argument applies to the old 16 megabyte ISA DMA region. This | ||
190 | mechanism will also defend that region from allocations which could use | ||
191 | highmem or lowmem). | ||
192 | |||
193 | The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is | ||
194 | in defending these lower zones. | ||
195 | |||
196 | If you have a machine which uses highmem or ISA DMA and your | ||
197 | applications are using mlock(), or if you are running with no swap then | ||
198 | you probably should change the lowmem_reserve_ratio setting. | ||
199 | |||
200 | lowmem_reserve_ratio is an array. You can see the values by reading this file. | ||
201 | - | ||
202 | % cat /proc/sys/vm/lowmem_reserve_ratio | ||
203 | 256 256 32 | ||
204 | - | ||
205 | Note: the number of elements is one less than the number of zones, because | ||
206 | the highest zone's value is not needed for the calculation below. | ||
207 | |||
208 | These values are not used directly. The kernel calculates the number of | ||
209 | protection pages for each zone from them. These are shown as the array of | ||
210 | protection pages in /proc/zoneinfo, as in the following example from an | ||
211 | x86-64 box. Each zone has an array of protection pages like this. | ||
212 | |||
213 | - | ||
214 | Node 0, zone DMA | ||
215 | pages free 1355 | ||
216 | min 3 | ||
217 | low 3 | ||
218 | high 4 | ||
219 | : | ||
220 | : | ||
221 | numa_other 0 | ||
222 | protection: (0, 2004, 2004, 2004) | ||
223 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
224 | pagesets | ||
225 | cpu: 0 pcp: 0 | ||
226 | : | ||
227 | - | ||
228 | These protections are added to the watermark score used to judge whether a | ||
229 | zone should be used for page allocation or should be reclaimed. | ||
230 | |||
231 | In this example, if normal pages (index=2) are requested from this DMA zone | ||
232 | and pages_high is used as the watermark, the kernel judges that this zone | ||
233 | should not be used because pages_free (1355) is smaller than watermark + | ||
234 | protection[2] (4 + 2004 = 2008). If this protection value were 0, this zone | ||
235 | could be used to satisfy a normal page request. If the request is for the | ||
236 | DMA zone itself (index=0), protection[0] (=0) is used. | ||
237 | |||
238 | zone[i]'s protection[j] is calculated by the following expression: | ||
239 | |||
240 | (i < j): | ||
241 | zone[i]->protection[j] | ||
242 | = (total of present_pages from zone[i+1] to zone[j] on the node) | ||
243 | / lowmem_reserve_ratio[i]; | ||
244 | (i = j): | ||
245 | 0 (the zone does not need to protect itself) | ||
246 | (i > j): | ||
247 | 0 (not used; reported as 0) | ||
248 | |||
249 | The default values of lowmem_reserve_ratio[i] are | ||
250 | 256 (if zone[i] means DMA or DMA32 zone) | ||
251 | 32 (others). | ||
252 | As the expression above shows, these values are reciprocals of the ratio: | ||
253 | 256 means 1/256, so the number of protection pages becomes about 0.39% of | ||
254 | the total present pages of the higher zones on the node. | ||
255 | |||
256 | If you would like to protect more pages, smaller values are effective. | ||
257 | The minimum value is 1 (1/1 -> 100%). | ||
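To make the arithmetic concrete: in the zoneinfo example above, protection[1] = 2004 pages for the DMA zone, which with a ratio of 256 implies roughly 2004 * 256 = 513,024 present pages in the higher zones (about 2GB with 4KB pages). Adjusting is a simple write; the halved DMA ratio below is illustrative only:

  cat /proc/sys/vm/lowmem_reserve_ratio                    # e.g. 256 256 32
  echo "128 256 32" > /proc/sys/vm/lowmem_reserve_ratio    # protect twice as many DMA pages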
163 | 258 | ||
164 | It may be beneficial to switch off zone reclaim if the system is | 259 | ============================================================== |
165 | used for a file server and all of memory should be used for caching files | ||
166 | from disk. In that case the caching effect is more important than | ||
167 | data locality. | ||
168 | 260 | ||
169 | Allowing zone reclaim to write out pages stops processes that are | 261 | max_map_count: |
170 | writing large amounts of data from dirtying pages on other nodes. Zone | ||
171 | reclaim will write out dirty pages if a zone fills up and so effectively | ||
172 | throttle the process. This may decrease the performance of a single process | ||
173 | since it cannot use all of system memory to buffer the outgoing writes | ||
174 | anymore, but it preserves the memory on other nodes so that the performance | ||
175 | of other processes running on other nodes will not be affected. | ||
176 | 262 | ||
177 | Allowing regular swap effectively restricts allocations to the local | 263 | This file contains the maximum number of memory map areas a process |
178 | node unless explicitly overridden by memory policies or cpuset | 264 | may have. Memory map areas are used as a side-effect of calling |
179 | configurations. | 265 | malloc, directly by mmap and mprotect, and also when loading shared |
266 | libraries. | ||
180 | 267 | ||
181 | ============================================================= | 268 | While most applications need less than a thousand maps, certain |
269 | programs, particularly malloc debuggers, may consume lots of them, | ||
270 | e.g., up to one or two maps per allocation. | ||
182 | 271 | ||
183 | min_unmapped_ratio: | 272 | The default value is 65536. |
184 | 273 | ||
185 | This is available only on NUMA kernels. | 274 | ============================================================== |
186 | 275 | ||
187 | A percentage of the total pages in each zone. Zone reclaim will only | 276 | min_free_kbytes: |
188 | occur if more than this percentage of pages are file backed and unmapped. | ||
189 | This is to ensure that a minimal amount of local pages is still available for | ||
190 | file I/O even if the node is overallocated. | ||
191 | 277 | ||
192 | The default is 1 percent. | 278 | This is used to force the Linux VM to keep a minimum number |
279 | of kilobytes free. The VM uses this number to compute a pages_min | ||
280 | value for each lowmem zone in the system. Each lowmem zone gets | ||
281 | a number of reserved free pages based proportionally on its size. | ||
282 | |||
283 | Some minimal amount of memory is needed to satisfy PF_MEMALLOC | ||
284 | allocations; if you set this to lower than 1024KB, your system will | ||
285 | become subtly broken, and prone to deadlock under high loads. | ||
286 | |||
287 | Setting this too high will OOM your machine instantly. | ||
193 | 288 | ||
194 | ============================================================= | 289 | ============================================================= |
195 | 290 | ||
@@ -211,82 +306,73 @@ and may not be fast. | |||
211 | 306 | ||
212 | ============================================================= | 307 | ============================================================= |
213 | 308 | ||
214 | panic_on_oom | 309 | min_unmapped_ratio: |
215 | 310 | ||
216 | This enables or disables the panic-on-out-of-memory feature. | 311 | This is available only on NUMA kernels. |
217 | 312 | ||
218 | If this is set to 0, the kernel will kill some rogue process via the | 313 | A percentage of the total pages in each zone. Zone reclaim will only |
219 | oom_killer. Usually the oom_killer can kill a rogue process and the | 314 | occur if more than this percentage of pages are file backed and unmapped. |
220 | system will survive. | 315 | This is to ensure that a minimal amount of local pages is still available for |
316 | file I/O even if the node is overallocated. | ||
221 | 317 | ||
222 | If this is set to 1, the kernel panics when out-of-memory happens. | 318 | The default is 1 percent. |
223 | However, if a process limits its allocations to certain nodes using | ||
224 | mempolicy/cpusets, and those nodes run out of memory, one process | ||
225 | may be killed by the oom-killer. No panic occurs in this case, | ||
226 | because memory on other nodes may still be free and the system as | ||
227 | a whole may not yet be in a fatal state. | ||
228 | 319 | ||
229 | If this is set to 2, the kernel always panics, even in the | 320 | ============================================================== |
230 | situation described above. | ||
231 | 321 | ||
232 | The default value is 0. | 322 | mmap_min_addr |
233 | Values 1 and 2 are intended for cluster failover. Please select | ||
234 | whichever matches your failover policy. | ||
235 | 323 | ||
236 | ============================================================= | 324 | This file indicates the amount of address space which a user process will |
325 | be restricted from mmaping. Since kernel null dereference bugs could | ||
326 | accidentally operate based on the information in the first couple of pages | ||
327 | of memory userspace processes should not be allowed to write to them. By | ||
328 | default this value is set to 0 and no protections will be enforced by the | ||
329 | security module. Setting this value to something like 64k will allow the | ||
330 | vast majority of applications to work correctly and provide defense in depth | ||
331 | against future potential kernel bugs. | ||
237 | 332 | ||
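A sketch of raising the floor at runtime, using the 64k value suggested above:

  cat /proc/sys/vm/mmap_min_addr            # default: 0 (no protection)
  echo 65536 > /proc/sys/vm/mmap_min_addr   # forbid mappings below 64KB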
238 | oom_dump_tasks | 333 | ============================================================== |
239 | 334 | ||
240 | Enables a system-wide task dump (excluding kernel threads) to be | 335 | nr_hugepages |
241 | produced when the kernel performs an OOM-killing and includes such | ||
242 | information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and | ||
243 | name. This is helpful to determine why the OOM killer was invoked | ||
244 | and to identify the rogue task that caused it. | ||
245 | 336 | ||
246 | If this is set to zero, this information is suppressed. On very | 337 | Change the minimum size of the hugepage pool. |
247 | large systems with thousands of tasks it may not be feasible to dump | ||
248 | the memory state information for each one. Such systems should not | ||
249 | be forced to incur a performance penalty in OOM conditions when the | ||
250 | information may not be desired. | ||
251 | 338 | ||
252 | If this is set to non-zero, this information is shown whenever the | 339 | See Documentation/vm/hugetlbpage.txt |
253 | OOM killer actually kills a memory-hogging task. | ||
254 | 340 | ||
255 | The default value is 0. | 341 | ============================================================== |
256 | 342 | ||
257 | ============================================================= | 343 | nr_overcommit_hugepages |
258 | 344 | ||
259 | oom_kill_allocating_task | 345 | Change the maximum size of the hugepage pool. The maximum is |
346 | nr_hugepages + nr_overcommit_hugepages. | ||
260 | 347 | ||
261 | This enables or disables killing the OOM-triggering task in | 348 | See Documentation/vm/hugetlbpage.txt |
262 | out-of-memory situations. | ||
263 | 349 | ||
264 | If this is set to zero, the OOM killer will scan through the entire | 350 | ============================================================== |
265 | tasklist and select a task based on heuristics to kill. This normally | ||
266 | selects a rogue memory-hogging task that frees up a large amount of | ||
267 | memory when killed. | ||
268 | 351 | ||
269 | If this is set to non-zero, the OOM killer simply kills the task that | 352 | nr_pdflush_threads |
270 | triggered the out-of-memory condition. This avoids the expensive | ||
271 | tasklist scan. | ||
272 | 353 | ||
273 | If panic_on_oom is selected, it takes precedence over whatever value | 354 | The current number of pdflush threads. This value is read-only. |
274 | is used in oom_kill_allocating_task. | 355 | The value changes according to the number of dirty pages in the system. |
275 | 356 | ||
276 | The default value is 0. | 357 | When necessary, additional pdflush threads are created, one per second, up to |
358 | nr_pdflush_threads_max. | ||
277 | 359 | ||
278 | ============================================================== | 360 | ============================================================== |
279 | 361 | ||
280 | mmap_min_addr | 362 | nr_trim_pages |
281 | 363 | ||
282 | This file indicates the amount of address space which a user process will | 364 | This is available only on NOMMU kernels. |
283 | be restricted from mmaping. Since kernel null dereference bugs could | 365 | |
284 | accidentally operate based on the information in the first couple of pages | 366 | This value adjusts the excess page trimming behaviour of power-of-2 aligned |
285 | of memory userspace processes should not be allowed to write to them. By | 367 | NOMMU mmap allocations. |
286 | default this value is set to 0 and no protections will be enforced by the | 368 | |
287 | security module. Setting this value to something like 64k will allow the | 369 | A value of 0 disables trimming of allocations entirely, while a value of 1 |
288 | vast majority of applications to work correctly and provide defense in depth | 370 | trims excess pages aggressively. Any value >= 1 acts as the watermark where |
289 | against future potential kernel bugs. | 371 | trimming of allocations is initiated. |
372 | |||
373 | The default value is 1. | ||
374 | |||
375 | See Documentation/nommu-mmap.txt for more information. | ||
290 | 376 | ||
291 | ============================================================== | 377 | ============================================================== |
292 | 378 | ||
@@ -335,34 +421,199 @@ this is causing problems for your system/application. | |||
335 | 421 | ||
336 | ============================================================== | 422 | ============================================================== |
337 | 423 | ||
338 | nr_hugepages | 424 | oom_dump_tasks |
339 | 425 | ||
340 | Change the minimum size of the hugepage pool. | 426 | Enables a system-wide task dump (excluding kernel threads) to be |
427 | produced when the kernel performs an OOM-killing and includes such | ||
428 | information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and | ||
429 | name. This is helpful to determine why the OOM killer was invoked | ||
430 | and to identify the rogue task that caused it. | ||
341 | 431 | ||
342 | See Documentation/vm/hugetlbpage.txt | 432 | If this is set to zero, this information is suppressed. On very |
433 | large systems with thousands of tasks it may not be feasible to dump | ||
434 | the memory state information for each one. Such systems should not | ||
435 | be forced to incur a performance penalty in OOM conditions when the | ||
436 | information may not be desired. | ||
437 | |||
438 | If this is set to non-zero, this information is shown whenever the | ||
439 | OOM killer actually kills a memory-hogging task. | ||
440 | |||
441 | The default value is 0. | ||
343 | 442 | ||
344 | ============================================================== | 443 | ============================================================== |
345 | 444 | ||
346 | nr_overcommit_hugepages | 445 | oom_kill_allocating_task |
347 | 446 | ||
348 | Change the maximum size of the hugepage pool. The maximum is | 447 | This enables or disables killing the OOM-triggering task in |
349 | nr_hugepages + nr_overcommit_hugepages. | 448 | out-of-memory situations. |
350 | 449 | ||
351 | See Documentation/vm/hugetlbpage.txt | 450 | If this is set to zero, the OOM killer will scan through the entire |
451 | tasklist and select a task based on heuristics to kill. This normally | ||
452 | selects a rogue memory-hogging task that frees up a large amount of | ||
453 | memory when killed. | ||
454 | |||
455 | If this is set to non-zero, the OOM killer simply kills the task that | ||
456 | triggered the out-of-memory condition. This avoids the expensive | ||
457 | tasklist scan. | ||
458 | |||
459 | If panic_on_oom is selected, it takes precedence over whatever value | ||
460 | is used in oom_kill_allocating_task. | ||
461 | |||
462 | The default value is 0. | ||
352 | 463 | ||
353 | ============================================================== | 464 | ============================================================== |
354 | 465 | ||
355 | nr_trim_pages | 466 | overcommit_memory: |
356 | 467 | ||
357 | This is available only on NOMMU kernels. | 468 | This value contains a flag that enables memory overcommitment. |
358 | 469 | ||
359 | This value adjusts the excess page trimming behaviour of power-of-2 aligned | 470 | When this flag is 0, the kernel attempts to estimate the amount |
360 | NOMMU mmap allocations. | 471 | of free memory left when userspace requests more memory. |
361 | 472 | ||
362 | A value of 0 disables trimming of allocations entirely, while a value of 1 | 473 | When this flag is 1, the kernel pretends there is always enough |
363 | trims excess pages aggressively. Any value >= 1 acts as the watermark where | 474 | memory until it actually runs out. |
364 | trimming of allocations is initiated. | ||
365 | 475 | ||
366 | The default value is 1. | 476 | When this flag is 2, the kernel uses a "never overcommit" |
477 | policy that attempts to prevent any overcommit of memory. | ||
367 | 478 | ||
368 | See Documentation/nommu-mmap.txt for more information. | 479 | This feature can be very useful because there are a lot of |
480 | programs that malloc() huge amounts of memory "just-in-case" | ||
481 | and don't use much of it. | ||
482 | |||
483 | The default value is 0. | ||
484 | |||
485 | See Documentation/vm/overcommit-accounting and | ||
486 | security/commoncap.c::cap_vm_enough_memory() for more information. | ||
487 | |||
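A minimal sketch of switching policies; mode 2 works together with overcommit_ratio below:

  cat /proc/sys/vm/overcommit_memory                  # 0 = heuristic (default)
  echo 2 > /proc/sys/vm/overcommit_memory             # never overcommit
  grep -E 'CommitLimit|Committed_AS' /proc/meminfo    # accounting now enforced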
488 | ============================================================== | ||
489 | |||
490 | overcommit_ratio: | ||
491 | |||
492 | When overcommit_memory is set to 2, the committed address | ||
493 | space is not permitted to exceed swap plus this percentage | ||
494 | of physical RAM. See above. | ||
495 | |||
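A worked example: with 2GB of swap, 4GB of RAM and the default ratio of 50, the commit limit is 2GB + 4GB * 50/100 = 4GB. The kernel exposes the resulting limit as CommitLimit:

  cat /proc/sys/vm/overcommit_ratio   # default: 50
  grep CommitLimit /proc/meminfo      # swap + physmem * (overcommit_ratio / 100)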
496 | ============================================================== | ||
497 | |||
498 | page-cluster | ||
499 | |||
500 | page-cluster controls the number of pages which are written to swap in |
501 | a single attempt; that is, the swap I/O size. |
502 | |||
503 | It is a logarithmic value - setting it to zero means "1 page", setting | ||
504 | it to 1 means "2 pages", setting it to 2 means "4 pages", etc. | ||
505 | |||
506 | The default value is three (eight pages at a time). There may be some | ||
507 | small benefits in tuning this to a different value if your workload is | ||
508 | swap-intensive. | ||
509 | |||
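Since the value is an exponent, the default of 3 means 2^3 = 8 pages, i.e. 32KB per swap I/O with 4KB pages. A sketch (the value 4 is illustrative):

  cat /proc/sys/vm/page-cluster        # default: 3 (8 pages)
  echo 4 > /proc/sys/vm/page-cluster   # 16 pages (64KB) per swap I/O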
510 | ============================================================= | ||
511 | |||
512 | panic_on_oom | ||
513 | |||
514 | This enables or disables the panic-on-out-of-memory feature. |
515 | |||
516 | If this is set to 0, the kernel will kill some rogue process via the |
517 | oom_killer. Usually the oom_killer can kill a rogue process and the |
518 | system will survive. |
519 | |||
520 | If this is set to 1, the kernel panics when out-of-memory happens. | ||
521 | However, if a process limits its allocations to certain nodes using |
522 | mempolicy/cpusets, and those nodes run out of memory, one process |
523 | may be killed by the oom-killer. No panic occurs in this case, |
524 | because memory on other nodes may still be free and the system as |
525 | a whole may not yet be in a fatal state. |
526 | |||
527 | If this is set to 2, the kernel always panics, even in the |
528 | situation described above. |
529 | |||
530 | The default value is 0. | ||
531 | Values 1 and 2 are intended for cluster failover. Please select |
532 | whichever matches your failover policy. |
533 | |||
534 | ============================================================= | ||
535 | |||
536 | percpu_pagelist_fraction | ||
537 | |||
538 | This is the fraction of pages at most (high mark pcp->high) in each zone that | ||
539 | are allocated for each per cpu page list. The min value for this is 8. It | ||
540 | means that we don't allow more than 1/8th of pages in each zone to be | ||
541 | allocated in any single per_cpu_pagelist. This entry only changes the value | ||
542 | of hot per cpu pagelists. Users can specify a number like 100 to allocate |
543 | 1/100th of each zone to each per cpu page list. | ||
544 | |||
545 | The batch value of each per cpu pagelist is also updated as a result. It is | ||
546 | set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8). |
547 | |||
548 | The initial value is zero. The kernel does not use this value at boot time |
549 | to set the high water marks for each per cpu page list. |
550 | |||
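To illustrate the arithmetic: for a zone of 100,000 pages and a fraction of 100, pcp->high becomes 100,000 / 100 = 1,000 and batch becomes min(1000/4, PAGE_SHIFT * 8) = min(250, 96) = 96 on a 4KB-page system (PAGE_SHIFT = 12). A sketch:

  echo 100 > /proc/sys/vm/percpu_pagelist_fraction   # 1/100th of each zone per list
  grep -E 'high:|batch:' /proc/zoneinfo | head       # observe the new pcp values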
551 | ============================================================== | ||
552 | |||
553 | stat_interval | ||
554 | |||
555 | The interval at which vm statistics are updated. The default |
556 | is 1 second. | ||
557 | |||
558 | ============================================================== | ||
559 | |||
560 | swappiness | ||
561 | |||
562 | This control defines how aggressively the kernel swaps memory pages. |
563 | Higher values increase aggressiveness; lower values decrease the |
564 | amount of swap. |
565 | |||
566 | The default value is 60. | ||
567 | |||
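For instance, an interactive box whose working set should stay resident might be tuned down (the value 10 is illustrative):

  cat /proc/sys/vm/swappiness         # default: 60
  echo 10 > /proc/sys/vm/swappiness   # prefer reclaiming page cache over swapping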
568 | ============================================================== | ||
569 | |||
570 | vfs_cache_pressure | ||
572 | |||
573 | Controls the tendency of the kernel to reclaim the memory which is used for | ||
574 | caching of directory and inode objects. | ||
575 | |||
576 | At the default value of vfs_cache_pressure=100 the kernel will attempt to | ||
577 | reclaim dentries and inodes at a "fair" rate with respect to pagecache and | ||
578 | swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer | ||
579 | to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 | ||
580 | causes the kernel to prefer to reclaim dentries and inodes. | ||
581 | |||
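As a sketch, a file server that benefits from a warm dentry/inode cache might lower the value (50 is illustrative, not a recommendation):

  cat /proc/sys/vm/vfs_cache_pressure         # default: 100
  echo 50 > /proc/sys/vm/vfs_cache_pressure   # favor retaining dentries/inodes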
582 | ============================================================== | ||
583 | |||
584 | zone_reclaim_mode: | ||
585 | |||
586 | zone_reclaim_mode allows an administrator to set more or less aggressive |
587 | approaches to reclaiming memory when a zone runs out of memory. If it is set |
588 | to zero then no zone reclaim occurs; allocations will be satisfied from |
589 | other zones / nodes in the system. |
590 | |||
591 | The value is formed by ORing together: |
592 | |||
593 | 1 = Zone reclaim on | ||
594 | 2 = Zone reclaim writes dirty pages out | ||
595 | 4 = Zone reclaim swaps pages | ||
596 | |||
597 | zone_reclaim_mode is set during bootup to 1 if it is determined that pages | ||
598 | from remote zones will cause a measurable performance reduction. The | ||
599 | page allocator will then reclaim easily reusable pages (those page | ||
600 | cache pages that are currently not used) before allocating off node pages. | ||
601 | |||
602 | It may be beneficial to switch off zone reclaim if the system is | ||
603 | used for a file server and all of memory should be used for caching files | ||
604 | from disk. In that case the caching effect is more important than | ||
605 | data locality. | ||
606 | |||
607 | Allowing zone reclaim to write out pages stops processes that are | ||
608 | writing large amounts of data from dirtying pages on other nodes. Zone | ||
609 | reclaim will write out dirty pages if a zone fills up and so effectively | ||
610 | throttle the process. This may decrease the performance of a single process | ||
611 | since it cannot use all of system memory to buffer the outgoing writes | ||
612 | anymore, but it preserves the memory on other nodes so that the performance |
613 | of other processes running on other nodes will not be affected. | ||
614 | |||
615 | Allowing regular swap effectively restricts allocations to the local | ||
616 | node unless explicitly overridden by memory policies or cpuset | ||
617 | configurations. | ||
618 | |||
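Because the value is a bit mask, the flags are combined by addition; for example 1 + 2 = 3 turns zone reclaim on and allows it to write out dirty pages:

  echo 3 > /proc/sys/vm/zone_reclaim_mode   # reclaim on (1) + write dirty pages (2)
  echo 0 > /proc/sys/vm/zone_reclaim_mode   # disable, e.g. for a file server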
619 | ============ End of Document ================================= | ||