| author | Peter W Morreale <pmorreale@novell.com> | 2009-01-15 16:50:42 -0500 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2009-01-15 19:39:35 -0500 |
| commit | db0fb1848a645b0b1b033765f3a5244e7afd2e3c (patch) | |
| tree | cf6b63e52fad2fa626e2a08251815f07626682dd /Documentation | |
| parent | b5db0e38653bfada34a92f360b4111566ede3842 (diff) | |
Update of Documentation: vm.txt and proc.txt
Update Documentation/sysctl/vm.txt and Documentation/filesystems/proc.txt.
More specifically, the section on /proc/sys/vm in
Documentation/filesystems/proc.txt was removed and a link to
Documentation/sysctl/vm.txt added.
Most of the verbiage from proc.txt was simply moved into vm.txt, with
additional text for "swappiness" and "stat_interval".
Signed-off-by: Peter W Morreale <pmorreale@novell.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'Documentation')
| -rw-r--r-- | Documentation/filesystems/proc.txt | 288 |
| -rw-r--r-- | Documentation/sysctl/vm.txt | 619 |
2 files changed, 437 insertions, 470 deletions
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d105eb45282..bbebc3a43ac 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
| @@ -1371,292 +1371,8 @@ auto_msgmni default value is 1. | |||
| 1371 | 2.4 /proc/sys/vm - The virtual memory subsystem | 1371 | 2.4 /proc/sys/vm - The virtual memory subsystem |
| 1372 | ----------------------------------------------- | 1372 | ----------------------------------------------- |
| 1373 | 1373 | ||
| 1374 | The files in this directory can be used to tune the operation of the virtual | 1374 | Please see: Documentation/sysctl/vm.txt for a description of these |
| 1375 | memory (VM) subsystem of the Linux kernel. | 1375 | entries. |
| 1376 | |||
| 1377 | vfs_cache_pressure | ||
| 1378 | ------------------ | ||
| 1379 | |||
| 1380 | Controls the tendency of the kernel to reclaim the memory which is used for | ||
| 1381 | caching of directory and inode objects. | ||
| 1382 | |||
| 1383 | At the default value of vfs_cache_pressure=100 the kernel will attempt to | ||
| 1384 | reclaim dentries and inodes at a "fair" rate with respect to pagecache and | ||
| 1385 | swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer | ||
| 1386 | to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 | ||
| 1387 | causes the kernel to prefer to reclaim dentries and inodes. | ||
| 1388 | |||
| 1389 | dirty_background_bytes | ||
| 1390 | ---------------------- | ||
| 1391 | |||
| 1392 | Contains the amount of dirty memory at which the pdflush background writeback | ||
| 1393 | daemon will start writeback. | ||
| 1394 | |||
| 1395 | If dirty_background_bytes is written, dirty_background_ratio becomes a function | ||
| 1396 | of its value (dirty_background_bytes / the amount of dirtyable system memory). | ||
| 1397 | |||
| 1398 | dirty_background_ratio | ||
| 1399 | ---------------------- | ||
| 1400 | |||
| 1401 | Contains, as a percentage of the dirtyable system memory (free pages + mapped | ||
| 1402 | pages + file cache, not including locked pages and HugePages), the number of | ||
| 1403 | pages at which the pdflush background writeback daemon will start writing out | ||
| 1404 | dirty data. | ||
| 1405 | |||
| 1406 | If dirty_background_ratio is written, dirty_background_bytes becomes a function | ||
| 1407 | of its value (dirty_background_ratio * the amount of dirtyable system memory). | ||
| 1408 | |||
| 1409 | dirty_bytes | ||
| 1410 | ----------- | ||
| 1411 | |||
| 1412 | Contains the amount of dirty memory at which a process generating disk writes | ||
| 1413 | will itself start writeback. | ||
| 1414 | |||
| 1415 | If dirty_bytes is written, dirty_ratio becomes a function of its value | ||
| 1416 | (dirty_bytes / the amount of dirtyable system memory). | ||
| 1417 | |||
| 1418 | dirty_ratio | ||
| 1419 | ----------- | ||
| 1420 | |||
| 1421 | Contains, as a percentage of the dirtyable system memory (free pages + mapped | ||
| 1422 | pages + file cache, not including locked pages and HugePages), the number of | ||
| 1423 | pages at which a process which is generating disk writes will itself start | ||
| 1424 | writing out dirty data. | ||
| 1425 | |||
| 1426 | If dirty_ratio is written, dirty_bytes becomes a function of its value | ||
| 1427 | (dirty_ratio * the amount of dirtyable system memory). | ||
| 1428 | |||
| 1429 | dirty_writeback_centisecs | ||
| 1430 | ------------------------- | ||
| 1431 | |||
| 1432 | The pdflush writeback daemons will periodically wake up and write `old' data | ||
| 1433 | out to disk. This tunable expresses the interval between those wakeups, in | ||
| 1434 | 100'ths of a second. | ||
| 1435 | |||
| 1436 | Setting this to zero disables periodic writeback altogether. | ||
| 1437 | |||
| 1438 | dirty_expire_centisecs | ||
| 1439 | ---------------------- | ||
| 1440 | |||
| 1441 | This tunable is used to define when dirty data is old enough to be eligible | ||
| 1442 | for writeout by the pdflush daemons. It is expressed in 100'ths of a second. | ||
| 1443 | Data which has been dirty in-memory for longer than this interval will be | ||
| 1444 | written out next time a pdflush daemon wakes up. | ||
| 1445 | |||
| 1446 | highmem_is_dirtyable | ||
| 1447 | -------------------- | ||
| 1448 | |||
| 1449 | Only present if CONFIG_HIGHMEM is set. | ||
| 1450 | |||
| 1451 | This defaults to 0 (false), meaning that the ratios set above are calculated | ||
| 1452 | as a percentage of lowmem only. This protects against excessive scanning | ||
| 1453 | in page reclaim, swapping and general VM distress. | ||
| 1454 | |||
| 1455 | Setting this to 1 can be useful on 32 bit machines where you want to make | ||
| 1456 | random changes within an MMAPed file that is larger than your available | ||
| 1457 | lowmem without causing large quantities of random IO. It is safe if the | ||
| 1458 | behavior of all programs running on the machine is known and memory will | ||
| 1459 | not be otherwise stressed. | ||
| 1460 | |||
| 1461 | legacy_va_layout | ||
| 1462 | ---------------- | ||
| 1463 | |||
| 1464 | If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel | ||
| 1465 | will use the legacy (2.4) layout for all processes. | ||
| 1466 | |||
| 1467 | lowmem_reserve_ratio | ||
| 1468 | --------------------- | ||
| 1469 | |||
| 1470 | For some specialised workloads on highmem machines it is dangerous for | ||
| 1471 | the kernel to allow process memory to be allocated from the "lowmem" | ||
| 1472 | zone. This is because that memory could then be pinned via the mlock() | ||
| 1473 | system call, or by unavailability of swapspace. | ||
| 1474 | |||
| 1475 | And on large highmem machines this lack of reclaimable lowmem memory | ||
| 1476 | can be fatal. | ||
| 1477 | |||
| 1478 | So the Linux page allocator has a mechanism which prevents allocations | ||
| 1479 | which _could_ use highmem from using too much lowmem. This means that | ||
| 1480 | a certain amount of lowmem is defended from the possibility of being | ||
| 1481 | captured into pinned user memory. | ||
| 1482 | |||
| 1483 | (The same argument applies to the old 16 megabyte ISA DMA region. This | ||
| 1484 | mechanism will also defend that region from allocations which could use | ||
| 1485 | highmem or lowmem). | ||
| 1486 | |||
| 1487 | The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is | ||
| 1488 | in defending these lower zones. | ||
| 1489 | |||
| 1490 | If you have a machine which uses highmem or ISA DMA and your | ||
| 1491 | applications are using mlock(), or if you are running with no swap then | ||
| 1492 | you probably should change the lowmem_reserve_ratio setting. | ||
| 1493 | |||
| 1494 | The lowmem_reserve_ratio is an array. You can see its values by reading this file. | ||
| 1495 | - | ||
| 1496 | % cat /proc/sys/vm/lowmem_reserve_ratio | ||
| 1497 | 256 256 32 | ||
| 1498 | - | ||
| 1499 | Note: the number of elements is one fewer than the number of zones, because | ||
| 1500 | the highest zone's value is not needed for the following calculation. | ||
| 1501 | |||
| 1502 | These values are not used directly, however. The kernel calculates the number | ||
| 1503 | of protection pages for each zone from them. These are shown as an array of | ||
| 1504 | protection pages in /proc/zoneinfo, as in the following example from an x86-64 box. | ||
| 1505 | Each zone has an array of protection pages like this. | ||
| 1506 | |||
| 1507 | - | ||
| 1508 | Node 0, zone DMA | ||
| 1509 | pages free 1355 | ||
| 1510 | min 3 | ||
| 1511 | low 3 | ||
| 1512 | high 4 | ||
| 1513 | : | ||
| 1514 | : | ||
| 1515 | numa_other 0 | ||
| 1516 | protection: (0, 2004, 2004, 2004) | ||
| 1517 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
| 1518 | pagesets | ||
| 1519 | cpu: 0 pcp: 0 | ||
| 1520 | : | ||
| 1521 | - | ||
| 1522 | These protections are added to score to judge whether this zone should be used | ||
| 1523 | for page allocation or should be reclaimed. | ||
| 1524 | |||
| 1525 | In this example, if normal pages (index=2) are requested from this DMA zone | ||
| 1526 | and pages_high is used as the watermark, the kernel judges that this zone | ||
| 1527 | should not be used because pages_free (1355) is smaller than watermark + | ||
| 1528 | protection[2] (4 + 2004 = 2008). If this protection value were 0, this zone | ||
| 1529 | could be used to satisfy a normal page request. If the request is for the DMA | ||
| 1530 | zone itself (index=0), protection[0] (=0) is used. | ||
| 1531 | |||
| 1532 | zone[i]'s protection[j] is calculated by the following expression: | ||
| 1533 | |||
| 1534 | (i < j): | ||
| 1535 | zone[i]->protection[j] | ||
| 1536 | = (total sums of present_pages from zone[i+1] to zone[j] on the node) | ||
| 1537 | / lowmem_reserve_ratio[i]; | ||
| 1538 | (i = j): | ||
| 1539 | = 0; (a zone need not be protected from itself) | ||
| 1540 | (i > j): | ||
| 1541 | = 0; (not needed, so it simply reads as 0) | ||
| 1542 | |||
| 1543 | The default values of lowmem_reserve_ratio[i] are | ||
| 1544 | 256 (if zone[i] means DMA or DMA32 zone) | ||
| 1545 | 32 (others). | ||
| 1546 | As in the expression above, these are reciprocals of the ratio: | ||
| 1547 | 256 means 1/256, so the number of protection pages becomes about 0.39% of the | ||
| 1548 | total present pages of the higher zones on the node. | ||
| 1549 | |||
| 1550 | If you would like to protect more pages, smaller values are effective. | ||
| 1551 | The minimum value is 1 (1/1 -> 100%). | ||
| 1552 | |||
| 1553 | page-cluster | ||
| 1554 | ------------ | ||
| 1555 | |||
| 1556 | page-cluster controls the number of pages which are written to swap in | ||
| 1557 | a single attempt; that is, the swap I/O size. | ||
| 1558 | |||
| 1559 | It is a logarithmic value - setting it to zero means "1 page", setting | ||
| 1560 | it to 1 means "2 pages", setting it to 2 means "4 pages", etc. | ||
| 1561 | |||
| 1562 | The default value is three (eight pages at a time). There may be some | ||
| 1563 | small benefits in tuning this to a different value if your workload is | ||
| 1564 | swap-intensive. | ||
| 1565 | |||
| 1566 | overcommit_memory | ||
| 1567 | ----------------- | ||
| 1568 | |||
| 1569 | Controls overcommit of system memory, possibly allowing processes | ||
| 1570 | to allocate (but not use) more memory than is actually available. | ||
| 1571 | |||
| 1572 | |||
| 1573 | 0 - Heuristic overcommit handling. Obvious overcommits of | ||
| 1574 | address space are refused. Used for a typical system. It | ||
| 1575 | ensures a seriously wild allocation fails while allowing | ||
| 1576 | overcommit to reduce swap usage. root is allowed to | ||
| 1577 | allocate slightly more memory in this mode. This is the | ||
| 1578 | default. | ||
| 1579 | |||
| 1580 | 1 - Always overcommit. Appropriate for some scientific | ||
| 1581 | applications. | ||
| 1582 | |||
| 1583 | 2 - Don't overcommit. The total address space commit | ||
| 1584 | for the system is not permitted to exceed swap plus a | ||
| 1585 | configurable percentage (default is 50) of physical RAM. | ||
| 1586 | Depending on the percentage you use, in most situations | ||
| 1587 | this means a process will not be killed while attempting | ||
| 1588 | to use already-allocated memory but will receive errors | ||
| 1589 | on memory allocation as appropriate. | ||
| 1590 | |||
| 1591 | overcommit_ratio | ||
| 1592 | ---------------- | ||
| 1593 | |||
| 1594 | Percentage of physical memory size to include in overcommit calculations | ||
| 1595 | (see above.) | ||
| 1596 | |||
| 1597 | Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100) | ||
| 1598 | |||
| 1599 | swapspace = total size of all swap areas | ||
| 1600 | physmem = size of physical memory in system | ||
| 1601 | |||
| 1602 | nr_hugepages and hugetlb_shm_group | ||
| 1603 | ---------------------------------- | ||
| 1604 | |||
| 1605 | nr_hugepages configures the number of hugetlb pages reserved for the system. | ||
| 1606 | |||
| 1607 | hugetlb_shm_group contains the group id that is allowed to create SysV shared | ||
| 1608 | memory segments using hugetlb pages. | ||
| 1609 | |||
| 1610 | hugepages_treat_as_movable | ||
| 1611 | -------------------------- | ||
| 1612 | |||
| 1613 | This parameter is only useful when kernelcore= is specified at boot time to | ||
| 1614 | create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages | ||
| 1615 | are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero | ||
| 1616 | value written to hugepages_treat_as_movable allows huge pages to be allocated | ||
| 1617 | from ZONE_MOVABLE. | ||
| 1618 | |||
| 1619 | Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge | ||
| 1620 | pages pool can easily grow or shrink within. Assuming that applications are | ||
| 1621 | not running that mlock() a lot of memory, it is likely the huge pages pool | ||
| 1622 | can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value | ||
| 1623 | into nr_hugepages and triggering page reclaim. | ||
| 1624 | |||
| 1625 | laptop_mode | ||
| 1626 | ----------- | ||
| 1627 | |||
| 1628 | laptop_mode is a knob that controls "laptop mode". All the things that are | ||
| 1629 | controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt. | ||
| 1630 | |||
| 1631 | block_dump | ||
| 1632 | ---------- | ||
| 1633 | |||
| 1634 | block_dump enables block I/O debugging when set to a nonzero value. More | ||
| 1635 | information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. | ||
| 1636 | |||
| 1637 | swap_token_timeout | ||
| 1638 | ------------------ | ||
| 1639 | |||
| 1640 | This file contains the valid hold time of the swap out protection token. The | ||
| 1641 | Linux VM has a token based thrashing control mechanism and uses the token to | ||
| 1642 | prevent unnecessary page faults in thrashing situations. The unit of the value | ||
| 1643 | is seconds. The value is useful for tuning thrashing behavior. | ||
| 1644 | |||
| 1645 | drop_caches | ||
| 1646 | ----------- | ||
| 1647 | |||
| 1648 | Writing to this will cause the kernel to drop clean caches, dentries and | ||
| 1649 | inodes from memory, causing that memory to become free. | ||
| 1650 | |||
| 1651 | To free pagecache: | ||
| 1652 | echo 1 > /proc/sys/vm/drop_caches | ||
| 1653 | To free dentries and inodes: | ||
| 1654 | echo 2 > /proc/sys/vm/drop_caches | ||
| 1655 | To free pagecache, dentries and inodes: | ||
| 1656 | echo 3 > /proc/sys/vm/drop_caches | ||
| 1657 | |||
| 1658 | As this is a non-destructive operation and dirty objects are not freeable, the | ||
| 1659 | user should run `sync' first. | ||
| 1660 | 1376 | ||
| 1661 | 1377 | ||
| 1662 | 2.5 /proc/sys/dev - Device specific parameters | 1378 | 2.5 /proc/sys/dev - Device specific parameters |
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index a3415070bca..3197fc83bc5 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
| @@ -1,12 +1,13 @@ | |||
| 1 | Documentation for /proc/sys/vm/* kernel version 2.2.10 | 1 | Documentation for /proc/sys/vm/* kernel version 2.6.29 |
| 2 | (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> | 2 | (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> |
| 3 | (c) 2008 Peter W. Morreale <pmorreale@novell.com> | ||
| 3 | 4 | ||
| 4 | For general info and legal blurb, please look in README. | 5 | For general info and legal blurb, please look in README. |
| 5 | 6 | ||
| 6 | ============================================================== | 7 | ============================================================== |
| 7 | 8 | ||
| 8 | This file contains the documentation for the sysctl files in | 9 | This file contains the documentation for the sysctl files in |
| 9 | /proc/sys/vm and is valid for Linux kernel version 2.2. | 10 | /proc/sys/vm and is valid for Linux kernel version 2.6.29. |
| 10 | 11 | ||
| 11 | The files in this directory can be used to tune the operation | 12 | The files in this directory can be used to tune the operation |
| 12 | of the virtual memory (VM) subsystem of the Linux kernel and | 13 | of the virtual memory (VM) subsystem of the Linux kernel and |
| @@ -16,180 +17,274 @@ Default values and initialization routines for most of these | |||
| 16 | files can be found in mm/swap.c. | 17 | files can be found in mm/swap.c. |
| 17 | 18 | ||
| 18 | Currently, these files are in /proc/sys/vm: | 19 | Currently, these files are in /proc/sys/vm: |
| 19 | - overcommit_memory | 20 | |
| 20 | - page-cluster | 21 | - block_dump |
| 21 | - dirty_ratio | 22 | - dirty_background_bytes |
| 22 | - dirty_background_ratio | 23 | - dirty_background_ratio |
| 24 | - dirty_bytes | ||
| 23 | - dirty_expire_centisecs | 25 | - dirty_expire_centisecs |
| 26 | - dirty_ratio | ||
| 24 | - dirty_writeback_centisecs | 27 | - dirty_writeback_centisecs |
| 25 | - highmem_is_dirtyable (only if CONFIG_HIGHMEM set) | 28 | - drop_caches |
| 29 | - hugepages_treat_as_movable | ||
| 30 | - hugetlb_shm_group | ||
| 31 | - laptop_mode | ||
| 32 | - legacy_va_layout | ||
| 33 | - lowmem_reserve_ratio | ||
| 26 | - max_map_count | 34 | - max_map_count |
| 27 | - min_free_kbytes | 35 | - min_free_kbytes |
| 28 | - laptop_mode | ||
| 29 | - block_dump | ||
| 30 | - drop-caches | ||
| 31 | - zone_reclaim_mode | ||
| 32 | - min_unmapped_ratio | ||
| 33 | - min_slab_ratio | 36 | - min_slab_ratio |
| 34 | - panic_on_oom | 37 | - min_unmapped_ratio |
| 35 | - oom_dump_tasks | 38 | - mmap_min_addr |
| 36 | - oom_kill_allocating_task | ||
| 37 | - mmap_min_address | ||
| 38 | - numa_zonelist_order | ||
| 39 | - nr_hugepages | 39 | - nr_hugepages |
| 40 | - nr_overcommit_hugepages | 40 | - nr_overcommit_hugepages |
| 41 | - nr_trim_pages (only if CONFIG_MMU=n) | 41 | - nr_pdflush_threads |
| 42 | - nr_trim_pages (only if CONFIG_MMU=n) | ||
| 43 | - numa_zonelist_order | ||
| 44 | - oom_dump_tasks | ||
| 45 | - oom_kill_allocating_task | ||
| 46 | - overcommit_memory | ||
| 47 | - overcommit_ratio | ||
| 48 | - page-cluster | ||
| 49 | - panic_on_oom | ||
| 50 | - percpu_pagelist_fraction | ||
| 51 | - stat_interval | ||
| 52 | - swappiness | ||
| 53 | - vfs_cache_pressure | ||
| 54 | - zone_reclaim_mode | ||
| 55 | |||
| 42 | 56 | ||
| 43 | ============================================================== | 57 | ============================================================== |
| 44 | 58 | ||
| 45 | dirty_bytes, dirty_ratio, dirty_background_bytes, | 59 | block_dump |
| 46 | dirty_background_ratio, dirty_expire_centisecs, | ||
| 47 | dirty_writeback_centisecs, highmem_is_dirtyable, | ||
| 48 | vfs_cache_pressure, laptop_mode, block_dump, swap_token_timeout, | ||
| 49 | drop-caches, hugepages_treat_as_movable: | ||
| 50 | 60 | ||
| 51 | See Documentation/filesystems/proc.txt | 61 | block_dump enables block I/O debugging when set to a nonzero value. More |
| 62 | information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. | ||
| 52 | 63 | ||
| 53 | ============================================================== | 64 | ============================================================== |
| 54 | 65 | ||
| 55 | overcommit_memory: | 66 | dirty_background_bytes |
| 56 | 67 | ||
| 57 | This value contains a flag that enables memory overcommitment. | 68 | Contains the amount of dirty memory at which the pdflush background writeback |
| 69 | daemon will start writeback. | ||
| 58 | 70 | ||
| 59 | When this flag is 0, the kernel attempts to estimate the amount | 71 | If dirty_background_bytes is written, dirty_background_ratio becomes a function |
| 60 | of free memory left when userspace requests more memory. | 72 | of its value (dirty_background_bytes / the amount of dirtyable system memory). |
| 61 | 73 | ||
| 62 | When this flag is 1, the kernel pretends there is always enough | 74 | ============================================================== |
| 63 | memory until it actually runs out. | ||
| 64 | 75 | ||
| 65 | When this flag is 2, the kernel uses a "never overcommit" | 76 | dirty_background_ratio |
| 66 | policy that attempts to prevent any overcommit of memory. | ||
| 67 | 77 | ||
| 68 | This feature can be very useful because there are a lot of | 78 | Contains, as a percentage of total system memory, the number of pages at which |
| 69 | programs that malloc() huge amounts of memory "just-in-case" | 79 | the pdflush background writeback daemon will start writing out dirty data. |
| 70 | and don't use much of it. | ||
| 71 | 80 | ||
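
For example (a sketch; the 64 MB figure is arbitrary), writing the bytes form
makes the ratio form a derived value, and vice versa:

    echo 67108864 > /proc/sys/vm/dirty_background_bytes
    cat /proc/sys/vm/dirty_background_ratio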
| 72 | The default value is 0. | 81 | ============================================================== |
| 73 | 82 | ||
| 74 | See Documentation/vm/overcommit-accounting and | 83 | dirty_bytes |
| 75 | security/commoncap.c::cap_vm_enough_memory() for more information. | 84 | |
| 85 | Contains the amount of dirty memory at which a process generating disk writes | ||
| 86 | will itself start writeback. | ||
| 87 | |||
| 88 | If dirty_bytes is written, dirty_ratio becomes a function of its value | ||
| 89 | (dirty_bytes / the amount of dirtyable system memory). | ||
| 76 | 90 | ||
| 77 | ============================================================== | 91 | ============================================================== |
| 78 | 92 | ||
| 79 | overcommit_ratio: | 93 | dirty_expire_centisecs |
| 80 | 94 | ||
| 81 | When overcommit_memory is set to 2, the committed address | 95 | This tunable is used to define when dirty data is old enough to be eligible |
| 82 | space is not permitted to exceed swap plus this percentage | 96 | for writeout by the pdflush daemons. It is expressed in 100'ths of a second. |
| 83 | of physical RAM. See above. | 97 | Data which has been dirty in-memory for longer than this interval will be |
| 98 | written out next time a pdflush daemon wakes up. | ||
| 99 | |||
| 100 | ============================================================== | ||
| 101 | |||
| 102 | dirty_ratio | ||
| 103 | |||
| 104 | Contains, as a percentage of total system memory, the number of pages at which | ||
| 105 | a process which is generating disk writes will itself start writing out dirty | ||
| 106 | data. | ||
| 84 | 107 | ||
| 85 | ============================================================== | 108 | ============================================================== |
| 86 | 109 | ||
| 87 | page-cluster: | 110 | dirty_writeback_centisecs |
| 88 | 111 | ||
| 89 | The Linux VM subsystem avoids excessive disk seeks by reading | 112 | The pdflush writeback daemons will periodically wake up and write `old' data |
| 90 | multiple pages on a page fault. The number of pages it reads | 113 | out to disk. This tunable expresses the interval between those wakeups, in |
| 91 | is dependent on the amount of memory in your machine. | 114 | 100'ths of a second. |
| 92 | 115 | ||
| 93 | The number of pages the kernel reads in at once is equal to | 116 | Setting this to zero disables periodic writeback altogether. |
| 94 | 2 ^ page-cluster. Values above 2 ^ 5 don't make much sense | ||
| 95 | for swap because we only cluster swap data in 32-page groups. | ||
| 96 | 117 | ||
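
As a usage sketch, since the unit is 100'ths of a second, waking the pdflush
daemons every 15 seconds would be:

    echo 1500 > /proc/sys/vm/dirty_writeback_centisecs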
| 97 | ============================================================== | 118 | ============================================================== |
| 98 | 119 | ||
| 99 | max_map_count: | 120 | drop_caches |
| 100 | 121 | ||
| 101 | This file contains the maximum number of memory map areas a process | 122 | Writing to this will cause the kernel to drop clean caches, dentries and |
| 102 | may have. Memory map areas are used as a side-effect of calling | 123 | inodes from memory, causing that memory to become free. |
| 103 | malloc, directly by mmap and mprotect, and also when loading shared | ||
| 104 | libraries. | ||
| 105 | 124 | ||
| 106 | While most applications need less than a thousand maps, certain | 125 | To free pagecache: |
| 107 | programs, particularly malloc debuggers, may consume lots of them, | 126 | echo 1 > /proc/sys/vm/drop_caches |
| 108 | e.g., up to one or two maps per allocation. | 127 | To free dentries and inodes: |
| 128 | echo 2 > /proc/sys/vm/drop_caches | ||
| 129 | To free pagecache, dentries and inodes: | ||
| 130 | echo 3 > /proc/sys/vm/drop_caches | ||
| 109 | 131 | ||
| 110 | The default value is 65536. | 132 | As this is a non-destructive operation and dirty objects are not freeable, the |
| 133 | user should run `sync' first. | ||
| 111 | 134 | ||
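
Putting the note above into practice, a typical invocation flushes dirty data
first so that more cache is actually freeable:

    sync
    echo 3 > /proc/sys/vm/drop_caches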
| 112 | ============================================================== | 135 | ============================================================== |
| 113 | 136 | ||
| 114 | min_free_kbytes: | 137 | hugepages_treat_as_movable |
| 115 | 138 | ||
| 116 | This is used to force the Linux VM to keep a minimum number | 139 | This parameter is only useful when kernelcore= is specified at boot time to |
| 117 | of kilobytes free. The VM uses this number to compute a pages_min | 140 | create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages |
| 118 | value for each lowmem zone in the system. Each lowmem zone gets | 141 | are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero |
| 119 | a number of reserved free pages based proportionally on its size. | 142 | value written to hugepages_treat_as_movable allows huge pages to be allocated |
| 143 | from ZONE_MOVABLE. | ||
| 120 | 144 | ||
| 121 | Some minimal amount of memory is needed to satisfy PF_MEMALLOC | 145 | Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge |
| 122 | allocations; if you set this to lower than 1024KB, your system will | 146 | pages pool can easily grow or shrink within. Assuming that applications are |
| 123 | become subtly broken, and prone to deadlock under high loads. | 147 | not running that mlock() a lot of memory, it is likely the huge pages pool |
| 124 | 148 | can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value | |
| 125 | Setting this too high will OOM your machine instantly. | 149 | into nr_hugepages and triggering page reclaim. |
| 126 | 150 | ||
| 127 | ============================================================== | 151 | ============================================================== |
| 128 | 152 | ||
| 129 | percpu_pagelist_fraction | 153 | hugetlb_shm_group |
| 130 | 154 | ||
| 131 | This is the fraction of pages at most (high mark pcp->high) in each zone that | 155 | hugetlb_shm_group contains the group id that is allowed to create SysV |
| 132 | are allocated for each per cpu page list. The min value for this is 8. It | 156 | shared memory segments using hugetlb pages. |
| 133 | means that we don't allow more than 1/8th of pages in each zone to be | ||
| 134 | allocated in any single per_cpu_pagelist. This entry only changes the value | ||
| 135 | of hot per cpu pagelists. User can specify a number like 100 to allocate | ||
| 136 | 1/100th of each zone to each per cpu page list. | ||
| 137 | 157 | ||
| 138 | The batch value of each per cpu pagelist is also updated as a result. It is | 158 | ============================================================== |
| 139 | set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) | ||
| 140 | 159 | ||
| 141 | The initial value is zero. Kernel does not use this value at boot time to set | 160 | laptop_mode |
| 142 | the high water marks for each per cpu page list. | ||
| 143 | 161 | ||
| 144 | =============================================================== | 162 | laptop_mode is a knob that controls "laptop mode". All the things that are |
| 163 | controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt. | ||
| 145 | 164 | ||
| 146 | zone_reclaim_mode: | 165 | ============================================================== |
| 147 | 166 | ||
| 148 | Zone_reclaim_mode allows someone to set more or less aggressive approaches to | 167 | legacy_va_layout |
| 149 | reclaim memory when a zone runs out of memory. If it is set to zero then no | ||
| 150 | zone reclaim occurs. Allocations will be satisfied from other zones / nodes | ||
| 151 | in the system. | ||
| 152 | 168 | ||
| 153 | This value is an OR of the following: | 169 | If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel |
| 170 | will use the legacy (2.4) layout for all processes. | ||
| 154 | 171 | ||
| 155 | 1 = Zone reclaim on | 172 | ============================================================== |
| 156 | 2 = Zone reclaim writes dirty pages out | ||
| 157 | 4 = Zone reclaim swaps pages | ||
| 158 | 173 | ||
| 159 | zone_reclaim_mode is set during bootup to 1 if it is determined that pages | 174 | lowmem_reserve_ratio |
| 160 | from remote zones will cause a measurable performance reduction. The | 175 | |
| 161 | page allocator will then reclaim easily reusable pages (those page | 176 | For some specialised workloads on highmem machines it is dangerous for |
| 162 | cache pages that are currently not used) before allocating off node pages. | 177 | the kernel to allow process memory to be allocated from the "lowmem" |
| 178 | zone. This is because that memory could then be pinned via the mlock() | ||
| 179 | system call, or by unavailability of swapspace. | ||
| 180 | |||
| 181 | And on large highmem machines this lack of reclaimable lowmem memory | ||
| 182 | can be fatal. | ||
| 183 | |||
| 184 | So the Linux page allocator has a mechanism which prevents allocations | ||
| 185 | which _could_ use highmem from using too much lowmem. This means that | ||
| 186 | a certain amount of lowmem is defended from the possibility of being | ||
| 187 | captured into pinned user memory. | ||
| 188 | |||
| 189 | (The same argument applies to the old 16 megabyte ISA DMA region. This | ||
| 190 | mechanism will also defend that region from allocations which could use | ||
| 191 | highmem or lowmem). | ||
| 192 | |||
| 193 | The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is | ||
| 194 | in defending these lower zones. | ||
| 195 | |||
| 196 | If you have a machine which uses highmem or ISA DMA and your | ||
| 197 | applications are using mlock(), or if you are running with no swap then | ||
| 198 | you probably should change the lowmem_reserve_ratio setting. | ||
| 199 | |||
| 200 | The lowmem_reserve_ratio is an array. You can see its values by reading this file. | ||
| 201 | - | ||
| 202 | % cat /proc/sys/vm/lowmem_reserve_ratio | ||
| 203 | 256 256 32 | ||
| 204 | - | ||
| 205 | Note: the number of elements is one fewer than the number of zones, because | ||
| 206 | the highest zone's value is not needed for the following calculation. | ||
| 207 | |||
| 208 | These values are not used directly, however. The kernel calculates the number | ||
| 209 | of protection pages for each zone from them. These are shown as an array of | ||
| 210 | protection pages in /proc/zoneinfo, as in the following example from an x86-64 box. | ||
| 211 | Each zone has an array of protection pages like this. | ||
| 212 | |||
| 213 | - | ||
| 214 | Node 0, zone DMA | ||
| 215 | pages free 1355 | ||
| 216 | min 3 | ||
| 217 | low 3 | ||
| 218 | high 4 | ||
| 219 | : | ||
| 220 | : | ||
| 221 | numa_other 0 | ||
| 222 | protection: (0, 2004, 2004, 2004) | ||
| 223 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
| 224 | pagesets | ||
| 225 | cpu: 0 pcp: 0 | ||
| 226 | : | ||
| 227 | - | ||
| 228 | These protections are added to score to judge whether this zone should be used | ||
| 229 | for page allocation or should be reclaimed. | ||
| 230 | |||
| 231 | In this example, if normal pages (index=2) are requested from this DMA zone | ||
| 232 | and pages_high is used as the watermark, the kernel judges that this zone | ||
| 233 | should not be used because pages_free (1355) is smaller than watermark + | ||
| 234 | protection[2] (4 + 2004 = 2008). If this protection value were 0, this zone | ||
| 235 | could be used to satisfy a normal page request. If the request is for the DMA | ||
| 236 | zone itself (index=0), protection[0] (=0) is used. | ||
| 237 | |||
| 238 | zone[i]'s protection[j] is calculated by the following expression: | ||
| 239 | |||
| 240 | (i < j): | ||
| 241 | zone[i]->protection[j] | ||
| 242 | = (total sums of present_pages from zone[i+1] to zone[j] on the node) | ||
| 243 | / lowmem_reserve_ratio[i]; | ||
| 244 | (i = j): | ||
| 245 | = 0; (a zone need not be protected from itself) | ||
| 246 | (i > j): | ||
| 247 | = 0; (not needed, so it simply reads as 0) | ||
| 248 | |||
| 249 | The default values of lowmem_reserve_ratio[i] are | ||
| 250 | 256 (if zone[i] means DMA or DMA32 zone) | ||
| 251 | 32 (others). | ||
| 252 | As in the expression above, these are reciprocals of the ratio: | ||
| 253 | 256 means 1/256, so the number of protection pages becomes about 0.39% of the | ||
| 254 | total present pages of the higher zones on the node. | ||
| 255 | |||
| 256 | If you would like to protect more pages, smaller values are effective. | ||
| 257 | The minimum value is 1 (1/1 -> 100%). | ||
| 163 | 258 | ||
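
To make the expression concrete: on a hypothetical node where the zones above
the DMA zone together hold 513024 present pages, the default ratio of 256 gives

    zone[0]->protection[1] = 513024 / 256 = 2004 pages

which matches the protection value shown in the /proc/zoneinfo example above.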
| 164 | It may be beneficial to switch off zone reclaim if the system is | 259 | ============================================================== |
| 165 | used for a file server and all of memory should be used for caching files | ||
| 166 | from disk. In that case the caching effect is more important than | ||
| 167 | data locality. | ||
| 168 | 260 | ||
| 169 | Allowing zone reclaim to write out pages stops processes that are | 261 | max_map_count: |
| 170 | writing large amounts of data from dirtying pages on other nodes. Zone | ||
| 171 | reclaim will write out dirty pages if a zone fills up and so effectively | ||
| 172 | throttle the process. This may decrease the performance of a single process | ||
| 173 | since it cannot use all of system memory to buffer the outgoing writes | ||
| 174 | anymore but it preserves the memory on other nodes so that the performance | 265 | malloc, directly by mmap and mprotect, and also when loading shared |
| 175 | of other processes running on other nodes will not be affected. | ||
| 176 | 262 | ||
| 177 | Allowing regular swap effectively restricts allocations to the local | 263 | This file contains the maximum number of memory map areas a process |
| 178 | node unless explicitly overridden by memory policies or cpuset | 264 | may have. Memory map areas are used as a side-effect of calling |
| 179 | configurations. | 265 | malloc, directly by mmap and mprotect, and also when loading shared |
| 266 | libraries. | ||
| 180 | 267 | ||
| 181 | ============================================================= | 268 | While most applications need less than a thousand maps, certain |
| 269 | programs, particularly malloc debuggers, may consume lots of them, | ||
| 270 | e.g., up to one or two maps per allocation. | ||
| 182 | 271 | ||
| 183 | min_unmapped_ratio: | 272 | The default value is 65536. |
| 184 | 273 | ||
| 185 | This is available only on NUMA kernels. | 274 | ============================================================== |
| 186 | 275 | ||
| 187 | A percentage of the total pages in each zone. Zone reclaim will only | 276 | min_free_kbytes: |
| 188 | occur if more than this percentage of pages are file backed and unmapped. | ||
| 189 | This is to ensure that a minimal amount of local pages is still available for | ||
| 190 | file I/O even if the node is overallocated. | ||
| 191 | 277 | ||
| 192 | The default is 1 percent. | 278 | This is used to force the Linux VM to keep a minimum number |
| 279 | of kilobytes free. The VM uses this number to compute a pages_min | ||
| 280 | value for each lowmem zone in the system. Each lowmem zone gets | ||
| 281 | a number of reserved free pages based proportionally on its size. | ||
| 282 | |||
| 283 | Some minimal amount of memory is needed to satisfy PF_MEMALLOC | ||
| 284 | allocations; if you set this to lower than 1024KB, your system will | ||
| 285 | become subtly broken, and prone to deadlock under high loads. | ||
| 286 | |||
| 287 | Setting this too high will OOM your machine instantly. | ||
| 193 | 288 | ||
| 194 | ============================================================= | 289 | ============================================================= |
| 195 | 290 | ||
| @@ -211,82 +306,73 @@ and may not be fast. | |||
| 211 | 306 | ||
| 212 | ============================================================= | 307 | ============================================================= |
| 213 | 308 | ||
| 214 | panic_on_oom | 309 | min_unmapped_ratio: |
| 215 | 310 | ||
| 216 | This enables or disables panic on out-of-memory feature. | 311 | This is available only on NUMA kernels. |
| 217 | 312 | ||
| 218 | If this is set to 0, the kernel will kill some rogue process via the | 313 | A percentage of the total pages in each zone. Zone reclaim will only |
| 219 | oom_killer. Usually, the oom_killer can kill a rogue process and the | 314 | occur if more than this percentage of pages are file backed and unmapped. |
| 220 | system will survive. | 315 | This is to ensure that a minimal amount of local pages is still available for |
| 316 | file I/O even if the node is overallocated. | ||
| 221 | 317 | ||
| 222 | If this is set to 1, the kernel panics when out-of-memory happens. | 318 | The default is 1 percent. |
| 223 | However, if a process limits its nodes via mempolicy/cpusets and | ||
| 224 | those nodes reach memory exhaustion, one process may be killed by | ||
| 225 | the oom-killer and no panic occurs, because the memory of other | ||
| 226 | nodes may still be free and the system as a whole may not yet be | ||
| 227 | in a fatal state. | ||
| 228 | 319 | ||
| 229 | If this is set to 2, the kernel always panics on out-of-memory, even | 320 | ============================================================== |
| 230 | in the case described above. | ||
| 231 | 321 | ||
| 232 | The default value is 0. | 322 | mmap_min_addr |
| 233 | Values of 1 and 2 are intended for cluster failover; select one | ||
| 234 | according to your policy of failover. | ||
| 235 | 323 | ||
| 236 | ============================================================= | 324 | This file indicates the amount of address space which a user process will |
| 325 | be restricted from mmaping. Since kernel null dereference bugs could | ||
| 326 | accidentally operate based on the information in the first couple of pages | ||
| 327 | of memory userspace processes should not be allowed to write to them. By | ||
| 328 | default this value is set to 0 and no protections will be enforced by the | ||
| 329 | security module. Setting this value to something like 64k will allow the | ||
| 330 | vast majority of applications to work correctly and provide defense in depth | ||
| 331 | against future potential kernel bugs. | ||
| 237 | 332 | ||
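
For instance, to reserve the low 64k of address space as suggested above:

    echo 65536 > /proc/sys/vm/mmap_min_addr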
| 238 | oom_dump_tasks | 333 | ============================================================== |
| 239 | 334 | ||
| 240 | Enables a system-wide task dump (excluding kernel threads) to be | 335 | nr_hugepages |
| 241 | produced when the kernel performs an OOM-killing and includes such | ||
| 242 | information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and | ||
| 243 | name. This is helpful to determine why the OOM killer was invoked | ||
| 244 | and to identify the rogue task that caused it. | ||
| 245 | 336 | ||
| 246 | If this is set to zero, this information is suppressed. On very | 337 | Change the minimum size of the hugepage pool. |
| 247 | large systems with thousands of tasks it may not be feasible to dump | ||
| 248 | the memory state information for each one. Such systems should not | ||
| 249 | be forced to incur a performance penalty in OOM conditions when the | ||
| 250 | information may not be desired. | ||
| 251 | 338 | ||
| 252 | If this is set to non-zero, this information is shown whenever the | 339 | See Documentation/vm/hugetlbpage.txt |
| 253 | OOM killer actually kills a memory-hogging task. | ||
| 254 | 340 | ||
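
A usage sketch (the pool size here is arbitrary):

    echo 20 > /proc/sys/vm/nr_hugepages
    grep HugePages /proc/meminfo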
| 255 | The default value is 0. | 341 | ============================================================== |
| 256 | 342 | ||
| 257 | ============================================================= | 343 | nr_overcommit_hugepages |
| 258 | 344 | ||
| 259 | oom_kill_allocating_task | 345 | Change the maximum size of the hugepage pool. The maximum is |
| 346 | nr_hugepages + nr_overcommit_hugepages. | ||
| 260 | 347 | ||
| 261 | This enables or disables killing the OOM-triggering task in | 348 | See Documentation/vm/hugetlbpage.txt |
| 262 | out-of-memory situations. | ||
| 263 | 349 | ||
| 264 | If this is set to zero, the OOM killer will scan through the entire | 350 | ============================================================== |
| 265 | tasklist and select a task based on heuristics to kill. This normally | ||
| 266 | selects a rogue memory-hogging task that frees up a large amount of | ||
| 267 | memory when killed. | ||
| 268 | 351 | ||
| 269 | If this is set to non-zero, the OOM killer simply kills the task that | 352 | nr_pdflush_threads |
| 270 | triggered the out-of-memory condition. This avoids the expensive | ||
| 271 | tasklist scan. | ||
| 272 | 353 | ||
| 273 | If panic_on_oom is selected, it takes precedence over whatever value | 354 | The current number of pdflush threads. This value is read-only. |
| 274 | is used in oom_kill_allocating_task. | 355 | The value changes according to the number of dirty pages in the system. |
| 275 | 356 | ||
| 276 | The default value is 0. | 357 | When necessary, additional pdflush threads are created, one per second, up to |
| 358 | nr_pdflush_threads_max. | ||
| 277 | 359 | ||
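
Since this value is read-only, it can only be inspected:

    cat /proc/sys/vm/nr_pdflush_threads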
| 278 | ============================================================== | 360 | ============================================================== |
| 279 | 361 | ||
| 280 | mmap_min_addr | 362 | nr_trim_pages |
| 281 | 363 | ||
| 282 | This file indicates the amount of address space which a user process will | 364 | This is available only on NOMMU kernels. |
| 283 | be restricted from mmaping. Since kernel null dereference bugs could | 365 | |
| 284 | accidentally operate based on the information in the first couple of pages | 366 | This value adjusts the excess page trimming behaviour of power-of-2 aligned |
| 285 | of memory userspace processes should not be allowed to write to them. By | 367 | NOMMU mmap allocations. |
| 286 | default this value is set to 0 and no protections will be enforced by the | 368 | |
| 287 | security module. Setting this value to something like 64k will allow the | 369 | A value of 0 disables trimming of allocations entirely, while a value of 1 |
| 288 | vast majority of applications to work correctly and provide defense in depth | 370 | trims excess pages aggressively. Any value >= 1 acts as the watermark where |
| 289 | against future potential kernel bugs. | 371 | trimming of allocations is initiated. |
| 372 | |||
| 373 | The default value is 1. | ||
| 374 | |||
| 375 | See Documentation/nommu-mmap.txt for more information. | ||
| 290 | 376 | ||
| 291 | ============================================================== | 377 | ============================================================== |
| 292 | 378 | ||
| @@ -335,34 +421,199 @@ this is causing problems for your system/application. | |||
| 335 | 421 | ||
| 336 | ============================================================== | 422 | ============================================================== |
| 337 | 423 | ||
| 338 | nr_hugepages | 424 | oom_dump_tasks |
| 339 | 425 | ||
| 340 | Change the minimum size of the hugepage pool. | 426 | Enables a system-wide task dump (excluding kernel threads) to be |
| 427 | produced when the kernel performs an OOM-killing and includes such | ||
| 428 | information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and | ||
| 429 | name. This is helpful to determine why the OOM killer was invoked | ||
| 430 | and to identify the rogue task that caused it. | ||
| 341 | 431 | ||
| 342 | See Documentation/vm/hugetlbpage.txt | 432 | If this is set to zero, this information is suppressed. On very |
| 433 | large systems with thousands of tasks it may not be feasible to dump | ||
| 434 | the memory state information for each one. Such systems should not | ||
| 435 | be forced to incur a performance penalty in OOM conditions when the | ||
| 436 | information may not be desired. | ||
| 437 | |||
| 438 | If this is set to non-zero, this information is shown whenever the | ||
| 439 | OOM killer actually kills a memory-hogging task. | ||
| 440 | |||
| 441 | The default value is 0. | ||
| 343 | 442 | ||
| 344 | ============================================================== | 443 | ============================================================== |
| 345 | 444 | ||
| 346 | nr_overcommit_hugepages | 445 | oom_kill_allocating_task |
| 347 | 446 | ||
| 348 | Change the maximum size of the hugepage pool. The maximum is | 447 | This enables or disables killing the OOM-triggering task in |
| 349 | nr_hugepages + nr_overcommit_hugepages. | 448 | out-of-memory situations. |
| 350 | 449 | ||
| 351 | See Documentation/vm/hugetlbpage.txt | 450 | If this is set to zero, the OOM killer will scan through the entire |
| 451 | tasklist and select a task based on heuristics to kill. This normally | ||
| 452 | selects a rogue memory-hogging task that frees up a large amount of | ||
| 453 | memory when killed. | ||
| 454 | |||
| 455 | If this is set to non-zero, the OOM killer simply kills the task that | ||
| 456 | triggered the out-of-memory condition. This avoids the expensive | ||
| 457 | tasklist scan. | ||
| 458 | |||
| 459 | If panic_on_oom is selected, it takes precedence over whatever value | ||
| 460 | is used in oom_kill_allocating_task. | ||
| 461 | |||
| 462 | The default value is 0. | ||
| 352 | 463 | ||
| 353 | ============================================================== | 464 | ============================================================== |
| 354 | 465 | ||
| 355 | nr_trim_pages | 466 | overcommit_memory: |
| 356 | 467 | ||
| 357 | This is available only on NOMMU kernels. | 468 | This value contains a flag that enables memory overcommitment. |
| 358 | 469 | ||
| 359 | This value adjusts the excess page trimming behaviour of power-of-2 aligned | 470 | When this flag is 0, the kernel attempts to estimate the amount |
| 360 | NOMMU mmap allocations. | 471 | of free memory left when userspace requests more memory. |
| 361 | 472 | ||
| 362 | A value of 0 disables trimming of allocations entirely, while a value of 1 | 473 | When this flag is 1, the kernel pretends there is always enough |
| 363 | trims excess pages aggressively. Any value >= 1 acts as the watermark where | 474 | memory until it actually runs out. |
| 364 | trimming of allocations is initiated. | ||
| 365 | 475 | ||
| 366 | The default value is 1. | 476 | When this flag is 2, the kernel uses a "never overcommit" |
| 477 | policy that attempts to prevent any overcommit of memory. | ||
| 367 | 478 | ||
| 368 | See Documentation/nommu-mmap.txt for more information. | 479 | This feature can be very useful because there are a lot of |
| 480 | programs that malloc() huge amounts of memory "just-in-case" | ||
| 481 | and don't use much of it. | ||
| 482 | |||
| 483 | The default value is 0. | ||
| 484 | |||
| 485 | See Documentation/vm/overcommit-accounting and | ||
| 486 | security/commoncap.c::cap_vm_enough_memory() for more information. | ||
| 487 | |||
| 488 | ============================================================== | ||
| 489 | |||
| 490 | overcommit_ratio: | ||
| 491 | |||
| 492 | When overcommit_memory is set to 2, the committed address | ||
| 493 | space is not permitted to exceed swap plus this percentage | ||
| 494 | of physical RAM. See above. | ||
| 495 | |||
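
A worked sketch using the formula quoted from proc.txt above (the 4 GB RAM /
2 GB swap machine is hypothetical):

    echo 2 > /proc/sys/vm/overcommit_memory
    echo 50 > /proc/sys/vm/overcommit_ratio
    # commit limit = swapspace + physmem * (overcommit_ratio / 100)
    #              = 2 GB + 4 GB * 0.50 = 4 GB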
| 496 | ============================================================== | ||
| 497 | |||
| 498 | page-cluster | ||
| 499 | |||
| 500 | page-cluster controls the number of pages which are written to swap in | ||
| 501 | a single attempt; that is, the swap I/O size. | ||
| 502 | |||
| 503 | It is a logarithmic value - setting it to zero means "1 page", setting | ||
| 504 | it to 1 means "2 pages", setting it to 2 means "4 pages", etc. | ||
| 505 | |||
| 506 | The default value is three (eight pages at a time). There may be some | ||
| 507 | small benefits in tuning this to a different value if your workload is | ||
| 508 | swap-intensive. | ||
| 509 | |||
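
For example, since the value is logarithmic, clustering 16 pages (2^4) per
swap write would be:

    echo 4 > /proc/sys/vm/page-cluster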
| 510 | ============================================================= | ||
| 511 | |||
| 512 | panic_on_oom | ||
| 513 | |||
| 514 | This enables or disables panic on out-of-memory feature. | ||
| 515 | |||
| 516 | If this is set to 0, the kernel will kill some rogue process via the | ||
| 517 | oom_killer. Usually, the oom_killer can kill a rogue process and the | ||
| 518 | system will survive. | ||
| 519 | |||
| 520 | If this is set to 1, the kernel panics when out-of-memory happens. | ||
| 521 | However, if a process limits its nodes via mempolicy/cpusets and | ||
| 522 | those nodes reach memory exhaustion, one process may be killed by | ||
| 523 | the oom-killer and no panic occurs, because the memory of other | ||
| 524 | nodes may still be free and the system as a whole may not yet be | ||
| 525 | in a fatal state. | ||
| 526 | |||
| 527 | If this is set to 2, the kernel always panics on out-of-memory, even | ||
| 528 | in the case described above. | ||
| 529 | |||
| 530 | The default value is 0. | ||
| 531 | Values of 1 and 2 are intended for cluster failover; select one | ||
| 532 | according to your policy of failover. | ||
| 533 | |||
| 534 | ============================================================= | ||
| 535 | |||
| 536 | percpu_pagelist_fraction | ||
| 537 | |||
| 538 | This is the fraction of pages at most (high mark pcp->high) in each zone that | ||
| 539 | are allocated for each per cpu page list. The min value for this is 8. It | ||
| 540 | means that we don't allow more than 1/8th of pages in each zone to be | ||
| 541 | allocated in any single per_cpu_pagelist. This entry only changes the value | ||
| 542 | of hot per cpu pagelists. User can specify a number like 100 to allocate | ||
| 543 | 1/100th of each zone to each per cpu page list. | ||
| 544 | |||
| 545 | The batch value of each per cpu pagelist is also updated as a result. It is | ||
| 546 | set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) | ||
| 547 | |||
| 548 | The initial value is zero. Kernel does not use this value at boot time to set | ||
| 549 | the high water marks for each per cpu page list. | ||
| 550 | |||
| 551 | ============================================================== | ||
| 552 | |||
| 553 | stat_interval | ||
| 554 | |||
| 555 | The time interval at which vm statistics are updated. The default | ||
| 556 | is 1 second. | ||
| 557 | |||
| 558 | ============================================================== | ||
| 559 | |||
| 560 | swappiness | ||
| 561 | |||
| 562 | This control is used to define how aggressively the kernel will swap | ||
| 563 | memory pages. Higher values increase aggressiveness; lower values | ||
| 564 | decrease the amount of swap. | ||
| 565 | |||
| 566 | The default value is 60. | ||
| 567 | |||
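
For example, to make the kernel less eager to swap (the value here is a
sketch, not a recommendation):

    echo 10 > /proc/sys/vm/swappiness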
| 568 | ============================================================== | ||
| 569 | |||
| 570 | vfs_cache_pressure | ||
| 571 | ------------------ | ||
| 572 | |||
| 573 | Controls the tendency of the kernel to reclaim the memory which is used for | ||
| 574 | caching of directory and inode objects. | ||
| 575 | |||
| 576 | At the default value of vfs_cache_pressure=100 the kernel will attempt to | ||
| 577 | reclaim dentries and inodes at a "fair" rate with respect to pagecache and | ||
| 578 | swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer | ||
| 579 | to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 | ||
| 580 | causes the kernel to prefer to reclaim dentries and inodes. | ||
| 581 | |||
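
For instance, to bias reclaim toward retaining dentry and inode caches:

    echo 50 > /proc/sys/vm/vfs_cache_pressure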
| 582 | ============================================================== | ||
| 583 | |||
| 584 | zone_reclaim_mode: | ||
| 585 | |||
| 586 | Zone_reclaim_mode allows someone to set more or less aggressive approaches to | ||
| 587 | reclaim memory when a zone runs out of memory. If it is set to zero then no | ||
| 588 | zone reclaim occurs. Allocations will be satisfied from other zones / nodes | ||
| 589 | in the system. | ||
| 590 | |||
| 591 | This value is an OR of the following (see the example after the list): | ||
| 592 | |||
| 593 | 1 = Zone reclaim on | ||
| 594 | 2 = Zone reclaim writes dirty pages out | ||
| 595 | 4 = Zone reclaim swaps pages | ||
| 596 | |||
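
As the values are ORed together, enabling zone reclaim plus writeback of
dirty pages (1 | 2) would be:

    echo 3 > /proc/sys/vm/zone_reclaim_mode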
| 597 | zone_reclaim_mode is set during bootup to 1 if it is determined that pages | ||
| 598 | from remote zones will cause a measurable performance reduction. The | ||
| 599 | page allocator will then reclaim easily reusable pages (those page | ||
| 600 | cache pages that are currently not used) before allocating off node pages. | ||
| 601 | |||
| 602 | It may be beneficial to switch off zone reclaim if the system is | ||
| 603 | used for a file server and all of memory should be used for caching files | ||
| 604 | from disk. In that case the caching effect is more important than | ||
| 605 | data locality. | ||
| 606 | |||
| 607 | Allowing zone reclaim to write out pages stops processes that are | ||
| 608 | writing large amounts of data from dirtying pages on other nodes. Zone | ||
| 609 | reclaim will write out dirty pages if a zone fills up and so effectively | ||
| 610 | throttle the process. This may decrease the performance of a single process | ||
| 611 | since it cannot use all of system memory to buffer the outgoing writes | ||
| 612 | anymore but it preserves the memory on other nodes so that the performance | ||
| 613 | of other processes running on other nodes will not be affected. | ||
| 614 | |||
| 615 | Allowing regular swap effectively restricts allocations to the local | ||
| 616 | node unless explicitly overridden by memory policies or cpuset | ||
| 617 | configurations. | ||
| 618 | |||
| 619 | ============ End of Document ================================= | ||
