diff options
author | Ingo Molnar <mingo@elte.hu> | 2009-02-12 07:08:57 -0500 |
---|---|---|
committer | Ingo Molnar <mingo@elte.hu> | 2009-02-12 07:08:57 -0500 |
commit | 871cafcc962fa1655c44b4f0e54d4c5cc14e273c (patch) | |
tree | fdb7bc65d2606c85b7be6c33ba0dfd5b4e472245 /Documentation/filesystems | |
parent | cf2592f59c0e8ed4308adbdb2e0a88655379d579 (diff) | |
parent | b578f3fcca1e78624dfb5f358776e63711d7fda2 (diff) |
Merge branch 'linus' into core/softlockup
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/nfs-rdma.txt | 4 | ||||
-rw-r--r-- | Documentation/filesystems/proc.txt | 316 | ||||
-rw-r--r-- | Documentation/filesystems/sysfs-pci.txt | 13 | ||||
-rw-r--r-- | Documentation/filesystems/ubifs.txt | 7 |
4 files changed, 44 insertions, 296 deletions
diff --git a/Documentation/filesystems/nfs-rdma.txt b/Documentation/filesystems/nfs-rdma.txt index 44bd766f2e5d..85eaeaddd27c 100644 --- a/Documentation/filesystems/nfs-rdma.txt +++ b/Documentation/filesystems/nfs-rdma.txt | |||
@@ -251,7 +251,7 @@ NFS/RDMA Setup | |||
251 | 251 | ||
252 | Instruct the server to listen on the RDMA transport: | 252 | Instruct the server to listen on the RDMA transport: |
253 | 253 | ||
254 | $ echo rdma 2050 > /proc/fs/nfsd/portlist | 254 | $ echo rdma 20049 > /proc/fs/nfsd/portlist |
255 | 255 | ||
256 | - On the client system | 256 | - On the client system |
257 | 257 | ||
@@ -263,7 +263,7 @@ NFS/RDMA Setup | |||
263 | Regardless of how the client was built (module or built-in), use this | 263 | Regardless of how the client was built (module or built-in), use this |
264 | command to mount the NFS/RDMA server: | 264 | command to mount the NFS/RDMA server: |
265 | 265 | ||
266 | $ mount -o rdma,port=2050 <IPoIB-server-name-or-address>:/<export> /mnt | 266 | $ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt |
267 | 267 | ||
268 | To verify that the mount is using RDMA, run "cat /proc/mounts" and check | 268 | To verify that the mount is using RDMA, run "cat /proc/mounts" and check |
269 | the "proto" field for the given mount. | 269 | the "proto" field for the given mount. |
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index d105eb45282a..a87be42f8211 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt | |||
@@ -1371,292 +1371,8 @@ auto_msgmni default value is 1. | |||
1371 | 2.4 /proc/sys/vm - The virtual memory subsystem | 1371 | 2.4 /proc/sys/vm - The virtual memory subsystem |
1372 | ----------------------------------------------- | 1372 | ----------------------------------------------- |
1373 | 1373 | ||
1374 | The files in this directory can be used to tune the operation of the virtual | 1374 | Please see: Documentation/sysctls/vm.txt for a description of these |
1375 | memory (VM) subsystem of the Linux kernel. | 1375 | entries. |
1376 | |||
1377 | vfs_cache_pressure | ||
1378 | ------------------ | ||
1379 | |||
1380 | Controls the tendency of the kernel to reclaim the memory which is used for | ||
1381 | caching of directory and inode objects. | ||
1382 | |||
1383 | At the default value of vfs_cache_pressure=100 the kernel will attempt to | ||
1384 | reclaim dentries and inodes at a "fair" rate with respect to pagecache and | ||
1385 | swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer | ||
1386 | to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 | ||
1387 | causes the kernel to prefer to reclaim dentries and inodes. | ||
1388 | |||
1389 | dirty_background_bytes | ||
1390 | ---------------------- | ||
1391 | |||
1392 | Contains the amount of dirty memory at which the pdflush background writeback | ||
1393 | daemon will start writeback. | ||
1394 | |||
1395 | If dirty_background_bytes is written, dirty_background_ratio becomes a function | ||
1396 | of its value (dirty_background_bytes / the amount of dirtyable system memory). | ||
1397 | |||
1398 | dirty_background_ratio | ||
1399 | ---------------------- | ||
1400 | |||
1401 | Contains, as a percentage of the dirtyable system memory (free pages + mapped | ||
1402 | pages + file cache, not including locked pages and HugePages), the number of | ||
1403 | pages at which the pdflush background writeback daemon will start writing out | ||
1404 | dirty data. | ||
1405 | |||
1406 | If dirty_background_ratio is written, dirty_background_bytes becomes a function | ||
1407 | of its value (dirty_background_ratio * the amount of dirtyable system memory). | ||
1408 | |||
1409 | dirty_bytes | ||
1410 | ----------- | ||
1411 | |||
1412 | Contains the amount of dirty memory at which a process generating disk writes | ||
1413 | will itself start writeback. | ||
1414 | |||
1415 | If dirty_bytes is written, dirty_ratio becomes a function of its value | ||
1416 | (dirty_bytes / the amount of dirtyable system memory). | ||
1417 | |||
1418 | dirty_ratio | ||
1419 | ----------- | ||
1420 | |||
1421 | Contains, as a percentage of the dirtyable system memory (free pages + mapped | ||
1422 | pages + file cache, not including locked pages and HugePages), the number of | ||
1423 | pages at which a process which is generating disk writes will itself start | ||
1424 | writing out dirty data. | ||
1425 | |||
1426 | If dirty_ratio is written, dirty_bytes becomes a function of its value | ||
1427 | (dirty_ratio * the amount of dirtyable system memory). | ||
1428 | |||
1429 | dirty_writeback_centisecs | ||
1430 | ------------------------- | ||
1431 | |||
1432 | The pdflush writeback daemons will periodically wake up and write `old' data | ||
1433 | out to disk. This tunable expresses the interval between those wakeups, in | ||
1434 | 100'ths of a second. | ||
1435 | |||
1436 | Setting this to zero disables periodic writeback altogether. | ||
1437 | |||
1438 | dirty_expire_centisecs | ||
1439 | ---------------------- | ||
1440 | |||
1441 | This tunable is used to define when dirty data is old enough to be eligible | ||
1442 | for writeout by the pdflush daemons. It is expressed in 100'ths of a second. | ||
1443 | Data which has been dirty in-memory for longer than this interval will be | ||
1444 | written out next time a pdflush daemon wakes up. | ||
1445 | |||
1446 | highmem_is_dirtyable | ||
1447 | -------------------- | ||
1448 | |||
1449 | Only present if CONFIG_HIGHMEM is set. | ||
1450 | |||
1451 | This defaults to 0 (false), meaning that the ratios set above are calculated | ||
1452 | as a percentage of lowmem only. This protects against excessive scanning | ||
1453 | in page reclaim, swapping and general VM distress. | ||
1454 | |||
1455 | Setting this to 1 can be useful on 32 bit machines where you want to make | ||
1456 | random changes within an MMAPed file that is larger than your available | ||
1457 | lowmem without causing large quantities of random IO. Is is safe if the | ||
1458 | behavior of all programs running on the machine is known and memory will | ||
1459 | not be otherwise stressed. | ||
1460 | |||
1461 | legacy_va_layout | ||
1462 | ---------------- | ||
1463 | |||
1464 | If non-zero, this sysctl disables the new 32-bit mmap mmap layout - the kernel | ||
1465 | will use the legacy (2.4) layout for all processes. | ||
1466 | |||
1467 | lowmem_reserve_ratio | ||
1468 | --------------------- | ||
1469 | |||
1470 | For some specialised workloads on highmem machines it is dangerous for | ||
1471 | the kernel to allow process memory to be allocated from the "lowmem" | ||
1472 | zone. This is because that memory could then be pinned via the mlock() | ||
1473 | system call, or by unavailability of swapspace. | ||
1474 | |||
1475 | And on large highmem machines this lack of reclaimable lowmem memory | ||
1476 | can be fatal. | ||
1477 | |||
1478 | So the Linux page allocator has a mechanism which prevents allocations | ||
1479 | which _could_ use highmem from using too much lowmem. This means that | ||
1480 | a certain amount of lowmem is defended from the possibility of being | ||
1481 | captured into pinned user memory. | ||
1482 | |||
1483 | (The same argument applies to the old 16 megabyte ISA DMA region. This | ||
1484 | mechanism will also defend that region from allocations which could use | ||
1485 | highmem or lowmem). | ||
1486 | |||
1487 | The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is | ||
1488 | in defending these lower zones. | ||
1489 | |||
1490 | If you have a machine which uses highmem or ISA DMA and your | ||
1491 | applications are using mlock(), or if you are running with no swap then | ||
1492 | you probably should change the lowmem_reserve_ratio setting. | ||
1493 | |||
1494 | The lowmem_reserve_ratio is an array. You can see them by reading this file. | ||
1495 | - | ||
1496 | % cat /proc/sys/vm/lowmem_reserve_ratio | ||
1497 | 256 256 32 | ||
1498 | - | ||
1499 | Note: # of this elements is one fewer than number of zones. Because the highest | ||
1500 | zone's value is not necessary for following calculation. | ||
1501 | |||
1502 | But, these values are not used directly. The kernel calculates # of protection | ||
1503 | pages for each zones from them. These are shown as array of protection pages | ||
1504 | in /proc/zoneinfo like followings. (This is an example of x86-64 box). | ||
1505 | Each zone has an array of protection pages like this. | ||
1506 | |||
1507 | - | ||
1508 | Node 0, zone DMA | ||
1509 | pages free 1355 | ||
1510 | min 3 | ||
1511 | low 3 | ||
1512 | high 4 | ||
1513 | : | ||
1514 | : | ||
1515 | numa_other 0 | ||
1516 | protection: (0, 2004, 2004, 2004) | ||
1517 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
1518 | pagesets | ||
1519 | cpu: 0 pcp: 0 | ||
1520 | : | ||
1521 | - | ||
1522 | These protections are added to score to judge whether this zone should be used | ||
1523 | for page allocation or should be reclaimed. | ||
1524 | |||
1525 | In this example, if normal pages (index=2) are required to this DMA zone and | ||
1526 | pages_high is used for watermark, the kernel judges this zone should not be | ||
1527 | used because pages_free(1355) is smaller than watermark + protection[2] | ||
1528 | (4 + 2004 = 2008). If this protection value is 0, this zone would be used for | ||
1529 | normal page requirement. If requirement is DMA zone(index=0), protection[0] | ||
1530 | (=0) is used. | ||
1531 | |||
1532 | zone[i]'s protection[j] is calculated by following expression. | ||
1533 | |||
1534 | (i < j): | ||
1535 | zone[i]->protection[j] | ||
1536 | = (total sums of present_pages from zone[i+1] to zone[j] on the node) | ||
1537 | / lowmem_reserve_ratio[i]; | ||
1538 | (i = j): | ||
1539 | (should not be protected. = 0; | ||
1540 | (i > j): | ||
1541 | (not necessary, but looks 0) | ||
1542 | |||
1543 | The default values of lowmem_reserve_ratio[i] are | ||
1544 | 256 (if zone[i] means DMA or DMA32 zone) | ||
1545 | 32 (others). | ||
1546 | As above expression, they are reciprocal number of ratio. | ||
1547 | 256 means 1/256. # of protection pages becomes about "0.39%" of total present | ||
1548 | pages of higher zones on the node. | ||
1549 | |||
1550 | If you would like to protect more pages, smaller values are effective. | ||
1551 | The minimum value is 1 (1/1 -> 100%). | ||
1552 | |||
1553 | page-cluster | ||
1554 | ------------ | ||
1555 | |||
1556 | page-cluster controls the number of pages which are written to swap in | ||
1557 | a single attempt. The swap I/O size. | ||
1558 | |||
1559 | It is a logarithmic value - setting it to zero means "1 page", setting | ||
1560 | it to 1 means "2 pages", setting it to 2 means "4 pages", etc. | ||
1561 | |||
1562 | The default value is three (eight pages at a time). There may be some | ||
1563 | small benefits in tuning this to a different value if your workload is | ||
1564 | swap-intensive. | ||
1565 | |||
1566 | overcommit_memory | ||
1567 | ----------------- | ||
1568 | |||
1569 | Controls overcommit of system memory, possibly allowing processes | ||
1570 | to allocate (but not use) more memory than is actually available. | ||
1571 | |||
1572 | |||
1573 | 0 - Heuristic overcommit handling. Obvious overcommits of | ||
1574 | address space are refused. Used for a typical system. It | ||
1575 | ensures a seriously wild allocation fails while allowing | ||
1576 | overcommit to reduce swap usage. root is allowed to | ||
1577 | allocate slightly more memory in this mode. This is the | ||
1578 | default. | ||
1579 | |||
1580 | 1 - Always overcommit. Appropriate for some scientific | ||
1581 | applications. | ||
1582 | |||
1583 | 2 - Don't overcommit. The total address space commit | ||
1584 | for the system is not permitted to exceed swap plus a | ||
1585 | configurable percentage (default is 50) of physical RAM. | ||
1586 | Depending on the percentage you use, in most situations | ||
1587 | this means a process will not be killed while attempting | ||
1588 | to use already-allocated memory but will receive errors | ||
1589 | on memory allocation as appropriate. | ||
1590 | |||
1591 | overcommit_ratio | ||
1592 | ---------------- | ||
1593 | |||
1594 | Percentage of physical memory size to include in overcommit calculations | ||
1595 | (see above.) | ||
1596 | |||
1597 | Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100) | ||
1598 | |||
1599 | swapspace = total size of all swap areas | ||
1600 | physmem = size of physical memory in system | ||
1601 | |||
1602 | nr_hugepages and hugetlb_shm_group | ||
1603 | ---------------------------------- | ||
1604 | |||
1605 | nr_hugepages configures number of hugetlb page reserved for the system. | ||
1606 | |||
1607 | hugetlb_shm_group contains group id that is allowed to create SysV shared | ||
1608 | memory segment using hugetlb page. | ||
1609 | |||
1610 | hugepages_treat_as_movable | ||
1611 | -------------------------- | ||
1612 | |||
1613 | This parameter is only useful when kernelcore= is specified at boot time to | ||
1614 | create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages | ||
1615 | are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero | ||
1616 | value written to hugepages_treat_as_movable allows huge pages to be allocated | ||
1617 | from ZONE_MOVABLE. | ||
1618 | |||
1619 | Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge | ||
1620 | pages pool can easily grow or shrink within. Assuming that applications are | ||
1621 | not running that mlock() a lot of memory, it is likely the huge pages pool | ||
1622 | can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value | ||
1623 | into nr_hugepages and triggering page reclaim. | ||
1624 | |||
1625 | laptop_mode | ||
1626 | ----------- | ||
1627 | |||
1628 | laptop_mode is a knob that controls "laptop mode". All the things that are | ||
1629 | controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt. | ||
1630 | |||
1631 | block_dump | ||
1632 | ---------- | ||
1633 | |||
1634 | block_dump enables block I/O debugging when set to a nonzero value. More | ||
1635 | information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. | ||
1636 | |||
1637 | swap_token_timeout | ||
1638 | ------------------ | ||
1639 | |||
1640 | This file contains valid hold time of swap out protection token. The Linux | ||
1641 | VM has token based thrashing control mechanism and uses the token to prevent | ||
1642 | unnecessary page faults in thrashing situation. The unit of the value is | ||
1643 | second. The value would be useful to tune thrashing behavior. | ||
1644 | |||
1645 | drop_caches | ||
1646 | ----------- | ||
1647 | |||
1648 | Writing to this will cause the kernel to drop clean caches, dentries and | ||
1649 | inodes from memory, causing that memory to become free. | ||
1650 | |||
1651 | To free pagecache: | ||
1652 | echo 1 > /proc/sys/vm/drop_caches | ||
1653 | To free dentries and inodes: | ||
1654 | echo 2 > /proc/sys/vm/drop_caches | ||
1655 | To free pagecache, dentries and inodes: | ||
1656 | echo 3 > /proc/sys/vm/drop_caches | ||
1657 | |||
1658 | As this is a non-destructive operation and dirty objects are not freeable, the | ||
1659 | user should run `sync' first. | ||
1660 | 1376 | ||
1661 | 1377 | ||
1662 | 2.5 /proc/sys/dev - Device specific parameters | 1378 | 2.5 /proc/sys/dev - Device specific parameters |
@@ -2311,6 +2027,34 @@ increase the likelihood of this process being killed by the oom-killer. Valid | |||
2311 | values are in the range -16 to +15, plus the special value -17, which disables | 2027 | values are in the range -16 to +15, plus the special value -17, which disables |
2312 | oom-killing altogether for this process. | 2028 | oom-killing altogether for this process. |
2313 | 2029 | ||
2030 | The process to be killed in an out-of-memory situation is selected among all others | ||
2031 | based on its badness score. This value equals the original memory size of the process | ||
2032 | and is then updated according to its CPU time (utime + stime) and the | ||
2033 | run time (uptime - start time). The longer it runs the smaller is the score. | ||
2034 | Badness score is divided by the square root of the CPU time and then by | ||
2035 | the double square root of the run time. | ||
2036 | |||
2037 | Swapped out tasks are killed first. Half of each child's memory size is added to | ||
2038 | the parent's score if they do not share the same memory. Thus forking servers | ||
2039 | are the prime candidates to be killed. Having only one 'hungry' child will make | ||
2040 | parent less preferable than the child. | ||
2041 | |||
2042 | /proc/<pid>/oom_score shows process' current badness score. | ||
2043 | |||
2044 | The following heuristics are then applied: | ||
2045 | * if the task was reniced, its score doubles | ||
2046 | * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE | ||
2047 | or CAP_SYS_RAWIO) have their score divided by 4 | ||
2048 | * if oom condition happened in one cpuset and checked task does not belong | ||
2049 | to it, its score is divided by 8 | ||
2050 | * the resulting score is multiplied by two to the power of oom_adj, i.e. | ||
2051 | points <<= oom_adj when it is positive and | ||
2052 | points >>= -(oom_adj) otherwise | ||
2053 | |||
2054 | The task with the highest badness score is then selected and its children | ||
2055 | are killed, process itself will be killed in an OOM situation when it does | ||
2056 | not have children or some of them disabled oom like described above. | ||
2057 | |||
2314 | 2.13 /proc/<pid>/oom_score - Display current oom-killer score | 2058 | 2.13 /proc/<pid>/oom_score - Display current oom-killer score |
2315 | ------------------------------------------------------------- | 2059 | ------------------------------------------------------------- |
2316 | 2060 | ||
diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt index 68ef48839c04..9f8740ca3f3b 100644 --- a/Documentation/filesystems/sysfs-pci.txt +++ b/Documentation/filesystems/sysfs-pci.txt | |||
@@ -9,6 +9,7 @@ that support it. For example, a given bus might look like this: | |||
9 | | |-- class | 9 | | |-- class |
10 | | |-- config | 10 | | |-- config |
11 | | |-- device | 11 | | |-- device |
12 | | |-- enable | ||
12 | | |-- irq | 13 | | |-- irq |
13 | | |-- local_cpus | 14 | | |-- local_cpus |
14 | | |-- resource | 15 | | |-- resource |
@@ -32,6 +33,7 @@ files, each with their own function. | |||
32 | class PCI class (ascii, ro) | 33 | class PCI class (ascii, ro) |
33 | config PCI config space (binary, rw) | 34 | config PCI config space (binary, rw) |
34 | device PCI device (ascii, ro) | 35 | device PCI device (ascii, ro) |
36 | enable Whether the device is enabled (ascii, rw) | ||
35 | irq IRQ number (ascii, ro) | 37 | irq IRQ number (ascii, ro) |
36 | local_cpus nearby CPU mask (cpumask, ro) | 38 | local_cpus nearby CPU mask (cpumask, ro) |
37 | resource PCI resource host addresses (ascii, ro) | 39 | resource PCI resource host addresses (ascii, ro) |
@@ -57,10 +59,19 @@ used to do actual device programming from userspace. Note that some platforms | |||
57 | don't support mmapping of certain resources, so be sure to check the return | 59 | don't support mmapping of certain resources, so be sure to check the return |
58 | value from any attempted mmap. | 60 | value from any attempted mmap. |
59 | 61 | ||
62 | The 'enable' file provides a counter that indicates how many times the device | ||
63 | has been enabled. If the 'enable' file currently returns '4', and a '1' is | ||
64 | echoed into it, it will then return '5'. Echoing a '0' into it will decrease | ||
65 | the count. Even when it returns to 0, though, some of the initialisation | ||
66 | may not be reversed. | ||
67 | |||
60 | The 'rom' file is special in that it provides read-only access to the device's | 68 | The 'rom' file is special in that it provides read-only access to the device's |
61 | ROM file, if available. It's disabled by default, however, so applications | 69 | ROM file, if available. It's disabled by default, however, so applications |
62 | should write the string "1" to the file to enable it before attempting a read | 70 | should write the string "1" to the file to enable it before attempting a read |
63 | call, and disable it following the access by writing "0" to the file. | 71 | call, and disable it following the access by writing "0" to the file. Note |
72 | that the device must be enabled for a rom read to return data succesfully. | ||
73 | In the event a driver is not bound to the device, it can be enabled using the | ||
74 | 'enable' file, documented above. | ||
64 | 75 | ||
65 | Accessing legacy resources through sysfs | 76 | Accessing legacy resources through sysfs |
66 | ---------------------------------------- | 77 | ---------------------------------------- |
diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt index 84da2a4ba25a..12fedb7834c6 100644 --- a/Documentation/filesystems/ubifs.txt +++ b/Documentation/filesystems/ubifs.txt | |||
@@ -79,13 +79,6 @@ Mount options | |||
79 | 79 | ||
80 | (*) == default. | 80 | (*) == default. |
81 | 81 | ||
82 | norm_unmount (*) commit on unmount; the journal is committed | ||
83 | when the file-system is unmounted so that the | ||
84 | next mount does not have to replay the journal | ||
85 | and it becomes very fast; | ||
86 | fast_unmount do not commit on unmount; this option makes | ||
87 | unmount faster, but the next mount slower | ||
88 | because of the need to replay the journal. | ||
89 | bulk_read read more in one go to take advantage of flash | 82 | bulk_read read more in one go to take advantage of flash |
90 | media that read faster sequentially | 83 | media that read faster sequentially |
91 | no_bulk_read (*) do not bulk-read | 84 | no_bulk_read (*) do not bulk-read |