author | Linus Torvalds <torvalds@linux-foundation.org> | 2015-09-10 21:19:42 -0400 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2015-09-10 21:19:42 -0400 |
commit | 33e247c7e58d335d70ecb84fd869091e2e4b8dcb (patch) | |
tree | e8561e1993dff03f8e56d10a5795fe9d379a3390 /Documentation/vm | |
parent | d71fc239b6915a8b750e9a447311029ff45b6580 (diff) | |
parent | 452e06af1f0149b01201f94264d452cd7a95db7a (diff) |
Merge branch 'akpm' (patches from Andrew)
Merge third patch-bomb from Andrew Morton:
- even more of the rest of MM
- lib/ updates
- checkpatch updates
- small changes to a few scruffy filesystems
- kmod fixes/cleanups
- kexec updates
- a dma-mapping cleanup series from hch
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (81 commits)
dma-mapping: consolidate dma_set_mask
dma-mapping: consolidate dma_supported
dma-mapping: cosolidate dma_mapping_error
dma-mapping: consolidate dma_{alloc,free}_noncoherent
dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent}
mm: use vma_is_anonymous() in create_huge_pmd() and wp_huge_pmd()
mm: make sure all file VMAs have ->vm_ops set
mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff()
mm: mark most vm_operations_struct const
namei: fix warning while make xmldocs caused by namei.c
ipc: convert invalid scenarios to use WARN_ON
zlib_deflate/deftree: remove bi_reverse()
lib/decompress_unlzma: Do a NULL check for pointer
lib/decompressors: use real out buf size for gunzip with kernel
fs/affs: make root lookup from blkdev logical size
sysctl: fix int -> unsigned long assignments in INT_MIN case
kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo
kexec: align crash_notes allocation to make it be inside one physical page
kexec: remove unnecessary test in kimage_alloc_crash_control_pages()
kexec: split kexec_load syscall from kexec core code
...
Diffstat (limited to 'Documentation/vm')
-rw-r--r-- | Documentation/vm/00-INDEX | 2 |
-rw-r--r-- | Documentation/vm/idle_page_tracking.txt | 98 |
-rw-r--r-- | Documentation/vm/pagemap.txt | 13 |
-rw-r--r-- | Documentation/vm/zswap.txt | 36 |
4 files changed, 140 insertions, 9 deletions
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 081c49777abb..6a5e2a102a45 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -14,6 +14,8 @@ hugetlbpage.txt | |||
14 | - a brief summary of hugetlbpage support in the Linux kernel. | 14 | - a brief summary of hugetlbpage support in the Linux kernel. |
15 | hwpoison.txt | 15 | hwpoison.txt |
16 | - explains what hwpoison is | 16 | - explains what hwpoison is |
17 | idle_page_tracking.txt | ||
18 | - description of the idle page tracking feature. | ||
17 | ksm.txt | 19 | ksm.txt |
18 | - how to use the Kernel Samepage Merging feature. | 20 | - how to use the Kernel Samepage Merging feature. |
19 | numa | 21 | numa |
diff --git a/Documentation/vm/idle_page_tracking.txt b/Documentation/vm/idle_page_tracking.txt
new file mode 100644
index 000000000000..85dcc3bb85dc
--- /dev/null
+++ b/Documentation/vm/idle_page_tracking.txt
@@ -0,0 +1,98 @@ | |||
1 | MOTIVATION | ||
2 | |||
3 | The idle page tracking feature allows to track which memory pages are being | ||
4 | accessed by a workload and which are idle. This information can be useful for | ||
5 | estimating the workload's working set size, which, in turn, can be taken into | ||
6 | account when configuring the workload parameters, setting memory cgroup limits, | ||
7 | or deciding where to place the workload within a compute cluster. | ||
8 | |||
9 | It is enabled by CONFIG_IDLE_PAGE_TRACKING=y. | ||
10 | |||
11 | USER API | ||
12 | |||
13 | The idle page tracking API is located at /sys/kernel/mm/page_idle. Currently, | ||
14 | it consists of the only read-write file, /sys/kernel/mm/page_idle/bitmap. | ||
15 | |||
16 | The file implements a bitmap where each bit corresponds to a memory page. The | ||
17 | bitmap is represented by an array of 8-byte integers, and the page at PFN #i is | ||
18 | mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is | ||
19 | set, the corresponding page is idle. | ||
20 | |||
21 | A page is considered idle if it has not been accessed since it was marked idle | ||
22 | (for more details on what "accessed" actually means see the IMPLEMENTATION | ||
23 | DETAILS section). To mark a page idle one has to set the bit corresponding to | ||
24 | the page by writing to the file. A value written to the file is OR-ed with the | ||
25 | current bitmap value. | ||
26 | |||
27 | Only accesses to user memory pages are tracked. These are pages mapped to a | ||
28 | process address space, page cache and buffer pages, swap cache pages. For other | ||
29 | page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored, | ||
30 | and hence such pages are never reported idle. | ||
31 | |||
32 | For huge pages the idle flag is set only on the head page, so one has to read | ||
33 | /proc/kpageflags in order to correctly count idle huge pages. | ||
34 | |||
35 | Reading from or writing to /sys/kernel/mm/page_idle/bitmap will return | ||
36 | -EINVAL if you are not starting the read/write on an 8-byte boundary, or | ||
37 | if the size of the read/write is not a multiple of 8 bytes. Writing to | ||
38 | this file beyond max PFN will return -ENXIO. | ||
39 | |||
40 | That said, in order to estimate the amount of pages that are not used by a | ||
41 | workload one should: | ||
42 | |||
43 | 1. Mark all the workload's pages as idle by setting corresponding bits in | ||
44 | /sys/kernel/mm/page_idle/bitmap. The pages can be found by reading | ||
45 | /proc/pid/pagemap if the workload is represented by a process, or by | ||
46 | filtering out alien pages using /proc/kpagecgroup in case the workload is | ||
47 | placed in a memory cgroup. | ||
48 | |||
49 | 2. Wait until the workload accesses its working set. | ||
50 | |||
51 | 3. Read /sys/kernel/mm/page_idle/bitmap and count the number of bits set. If | ||
52 | one wants to ignore certain types of pages, e.g. mlocked pages since they | ||
53 | are not reclaimable, he or she can filter them out using /proc/kpageflags. | ||
54 | |||
55 | See Documentation/vm/pagemap.txt for more information about /proc/pid/pagemap, | ||
56 | /proc/kpageflags, and /proc/kpagecgroup. | ||
57 | |||
58 | IMPLEMENTATION DETAILS | ||
59 | |||
60 | The kernel internally keeps track of accesses to user memory pages in order to | ||
61 | reclaim unreferenced pages first on memory shortage conditions. A page is | ||
62 | considered referenced if it has been recently accessed via a process address | ||
63 | space, in which case one or more PTEs it is mapped to will have the Accessed bit | ||
64 | set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The | ||
65 | latter happens when: | ||
66 | |||
67 | - a userspace process reads or writes a page using a system call (e.g. read(2) | ||
68 | or write(2)) | ||
69 | |||
70 | - a page that is used for storing filesystem buffers is read or written, | ||
71 | because a process needs filesystem metadata stored in it (e.g. lists a | ||
72 | directory tree) | ||
73 | |||
74 | - a page is accessed by a device driver using get_user_pages() | ||
75 | |||
76 | When a dirty page is written to swap or disk as a result of memory reclaim or | ||
77 | exceeding the dirty memory limit, it is not marked referenced. | ||
78 | |||
79 | The idle memory tracking feature adds a new page flag, the Idle flag. This flag | ||
80 | is set manually, by writing to /sys/kernel/mm/page_idle/bitmap (see the USER API | ||
81 | section), and cleared automatically whenever a page is referenced as defined | ||
82 | above. | ||
83 | |||
84 | When a page is marked idle, the Accessed bit must be cleared in all PTEs it is | ||
85 | mapped to, otherwise we will not be able to detect accesses to the page coming | ||
86 | from a process address space. To avoid interference with the reclaimer, which, | ||
87 | as noted above, uses the Accessed bit to promote actively referenced pages, one | ||
88 | more page flag is introduced, the Young flag. When the PTE Accessed bit is | ||
89 | cleared as a result of setting or updating a page's Idle flag, the Young flag | ||
90 | is set on the page. The reclaimer treats the Young flag as an extra PTE | ||
91 | Accessed bit and therefore will consider such a page as referenced. | ||
92 | |||
93 | Since the idle memory tracking feature is based on the memory reclaimer logic, | ||
94 | it only works with pages that are on an LRU list, other pages are silently | ||
95 | ignored. That means it will ignore a user memory page if it is isolated, but | ||
96 | since there are usually not many of them, it should not affect the overall | ||
97 | result noticeably. In order not to stall scanning of the idle page bitmap, | ||
98 | locked pages may be skipped too. | ||
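Note on the bitmap layout described in idle_page_tracking.txt above: the mapping from a PFN to a position in /sys/kernel/mm/page_idle/bitmap, and steps 1 and 3 of the procedure, could be sketched in Python roughly as follows. This is an illustration only and not part of the patch. The helper names (pfn_location, idle_masks, count_idle, mark_idle) are hypothetical, and mark_idle assumes a kernel built with CONFIG_IDLE_PAGE_TRACKING=y and enough privilege to write the file.

```python
import os
import struct

BITS_PER_WORD = 64  # the bitmap is an array of 8-byte integers
WORD_BYTES = 8


def pfn_location(pfn):
    """Return (byte offset of the 8-byte word, bit index in that word)."""
    return (pfn // BITS_PER_WORD) * WORD_BYTES, pfn % BITS_PER_WORD


def idle_masks(pfns):
    """Group PFNs into {byte_offset: 64-bit mask}, one entry per word."""
    masks = {}
    for pfn in pfns:
        offset, bit = pfn_location(pfn)
        masks[offset] = masks.get(offset, 0) | (1 << bit)
    return masks


def count_idle(raw, base_offset, pfns):
    """Count how many of `pfns` have their bit set in `raw`.

    `raw` is assumed to have been read from the bitmap starting at
    `base_offset` (a multiple of 8) and to cover every PFN in `pfns`.
    """
    # "=" selects native byte order, matching the documented layout.
    words = struct.unpack("=%dQ" % (len(raw) // WORD_BYTES), raw)
    idle = 0
    for pfn in pfns:
        offset, bit = pfn_location(pfn)
        idle += (words[(offset - base_offset) // WORD_BYTES] >> bit) & 1
    return idle


def mark_idle(pfns, path="/sys/kernel/mm/page_idle/bitmap"):
    """Hypothetical helper: set the idle bit for each PFN (step 1)."""
    fd = os.open(path, os.O_WRONLY)
    try:
        # Each write is exactly 8 bytes at an 8-byte boundary, which is
        # what the documentation requires to avoid -EINVAL.
        for offset, mask in sorted(idle_masks(pfns).items()):
            os.pwrite(fd, struct.pack("=Q", mask), offset)
    finally:
        os.close(fd)
```

Because a written value is OR-ed with the current bitmap, writing a mask that contains only the workload's bits leaves the other pages in the same word untouched.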
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 3cd38438242a..0e1e55588b59 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow | |||
5 | userspace programs to examine the page tables and related information by | 5 | userspace programs to examine the page tables and related information by |
6 | reading files in /proc. | 6 | reading files in /proc. |
7 | 7 | ||
8 | There are three components to pagemap: | 8 | There are four components to pagemap: |
9 | 9 | ||
10 | * /proc/pid/pagemap. This file lets a userspace process find out which | 10 | * /proc/pid/pagemap. This file lets a userspace process find out which |
11 | physical frame each virtual page is mapped to. It contains one 64-bit | 11 | physical frame each virtual page is mapped to. It contains one 64-bit |
@@ -70,6 +70,11 @@ There are three components to pagemap: | |||
70 | 22. THP | 70 | 22. THP |
71 | 23. BALLOON | 71 | 23. BALLOON |
72 | 24. ZERO_PAGE | 72 | 24. ZERO_PAGE |
73 | 25. IDLE | ||
74 | |||
75 | * /proc/kpagecgroup. This file contains a 64-bit inode number of the | ||
76 | memory cgroup each page is charged to, indexed by PFN. Only available when | ||
77 | CONFIG_MEMCG is set. | ||
73 | 78 | ||
74 | Short descriptions to the page flags: | 79 | Short descriptions to the page flags: |
75 | 80 | ||
@@ -116,6 +121,12 @@ Short descriptions to the page flags: | |||
116 | 24. ZERO_PAGE | 121 | 24. ZERO_PAGE |
117 | zero page for pfn_zero or huge_zero page | 122 | zero page for pfn_zero or huge_zero page |
118 | 123 | ||
124 | 25. IDLE | ||
125 | page has not been accessed since it was marked idle (see | ||
126 | Documentation/vm/idle_page_tracking.txt). Note that this flag may be | ||
127 | stale in case the page was accessed via a PTE. To make sure the flag | ||
128 | is up-to-date one has to read /sys/kernel/mm/page_idle/bitmap first. | ||
129 | |||
119 | [IO related page flags] | 130 | [IO related page flags] |
120 | 1. ERROR IO error occurred | 131 | 1. ERROR IO error occurred |
121 | 3. UPTODATE page has up-to-date data | 132 | 3. UPTODATE page has up-to-date data |
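Note on the pagemap.txt additions above: the new IDLE flag and the /proc/kpagecgroup file can both be decoded with simple per-PFN arithmetic. The Python sketch below is illustrative only and not part of the patch. The function names are hypothetical, and it assumes that /proc/kpageflags, like /proc/kpagecgroup, holds one 64-bit entry per page indexed by PFN.

```python
import struct

KPF_IDLE = 25    # bit number listed for IDLE in the flag list above
ENTRY_BYTES = 8  # one 64-bit entry per page, indexed by PFN


def entry_offset(pfn):
    """Byte offset of the entry for `pfn` in kpageflags or kpagecgroup."""
    return pfn * ENTRY_BYTES


def decode_entries(raw):
    """Unpack a raw read into a tuple of 64-bit values (native byte order)."""
    return struct.unpack("=%dQ" % (len(raw) // ENTRY_BYTES), raw)


def is_idle(flags):
    """True if the IDLE bit is set in one kpageflags entry."""
    return bool((flags >> KPF_IDLE) & 1)


def pfns_in_cgroup(raw, first_pfn, cgroup_ino):
    """PFNs whose kpagecgroup entry equals the given cgroup inode number.

    `raw` is assumed to have been read from /proc/kpagecgroup starting at
    entry_offset(first_pfn).
    """
    return [first_pfn + i
            for i, ino in enumerate(decode_entries(raw))
            if ino == cgroup_ino]
```

As the flag description says, the IDLE bit in /proc/kpageflags may be stale for pages accessed via a PTE, so a reader would normally read /sys/kernel/mm/page_idle/bitmap first and only then consult the flags.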
diff --git a/Documentation/vm/zswap.txt b/Documentation/vm/zswap.txt
index 8458c0861e4e..89fff7d611cc 100644
--- a/Documentation/vm/zswap.txt
+++ b/Documentation/vm/zswap.txt
@@ -32,7 +32,7 @@ can also be enabled and disabled at runtime using the sysfs interface. | |||
32 | An example command to enable zswap at runtime, assuming sysfs is mounted | 32 | An example command to enable zswap at runtime, assuming sysfs is mounted |
33 | at /sys, is: | 33 | at /sys, is: |
34 | 34 | ||
35 | echo 1 > /sys/modules/zswap/parameters/enabled | 35 | echo 1 > /sys/module/zswap/parameters/enabled |
36 | 36 | ||
37 | When zswap is disabled at runtime it will stop storing pages that are | 37 | When zswap is disabled at runtime it will stop storing pages that are |
38 | being swapped out. However, it will _not_ immediately write out or fault | 38 | being swapped out. However, it will _not_ immediately write out or fault |
@@ -49,14 +49,26 @@ Zswap receives pages for compression through the Frontswap API and is able to | |||
49 | evict pages from its own compressed pool on an LRU basis and write them back to | 49 | evict pages from its own compressed pool on an LRU basis and write them back to |
50 | the backing swap device in the case that the compressed pool is full. | 50 | the backing swap device in the case that the compressed pool is full. |
51 | 51 | ||
52 | Zswap makes use of zbud for the managing the compressed memory pool. Each | 52 | Zswap makes use of zpool for the managing the compressed memory pool. Each |
53 | allocation in zbud is not directly accessible by address. Rather, a handle is | 53 | allocation in zpool is not directly accessible by address. Rather, a handle is |
54 | returned by the allocation routine and that handle must be mapped before being | 54 | returned by the allocation routine and that handle must be mapped before being |
55 | accessed. The compressed memory pool grows on demand and shrinks as compressed | 55 | accessed. The compressed memory pool grows on demand and shrinks as compressed |
56 | pages are freed. The pool is not preallocated. | 56 | pages are freed. The pool is not preallocated. By default, a zpool of type |
57 | zbud is created, but it can be selected at boot time by setting the "zpool" | ||
58 | attribute, e.g. zswap.zpool=zbud. It can also be changed at runtime using the | ||
59 | sysfs "zpool" attribute, e.g. | ||
60 | |||
61 | echo zbud > /sys/module/zswap/parameters/zpool | ||
62 | |||
63 | The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which | ||
64 | means the compression ratio will always be 2:1 or worse (because of half-full | ||
65 | zbud pages). The zsmalloc type zpool has a more complex compressed page | ||
66 | storage method, and it can achieve greater storage densities. However, | ||
67 | zsmalloc does not implement compressed page eviction, so once zswap fills it | ||
68 | cannot evict the oldest page, it can only reject new pages. | ||
57 | 69 | ||
58 | When a swap page is passed from frontswap to zswap, zswap maintains a mapping | 70 | When a swap page is passed from frontswap to zswap, zswap maintains a mapping |
59 | of the swap entry, a combination of the swap type and swap offset, to the zbud | 71 | of the swap entry, a combination of the swap type and swap offset, to the zpool |
60 | handle that references that compressed swap page. This mapping is achieved | 72 | handle that references that compressed swap page. This mapping is achieved |
61 | with a red-black tree per swap type. The swap offset is the search key for the | 73 | with a red-black tree per swap type. The swap offset is the search key for the |
62 | tree nodes. | 74 | tree nodes. |
@@ -74,9 +86,17 @@ controlled policy: | |||
74 | * max_pool_percent - The maximum percentage of memory that the compressed | 86 | * max_pool_percent - The maximum percentage of memory that the compressed |
75 | pool can occupy. | 87 | pool can occupy. |
76 | 88 | ||
77 | Zswap allows the compressor to be selected at kernel boot time by setting the | 89 | The default compressor is lzo, but it can be selected at boot time by setting |
78 | “compressor” attribute. The default compressor is lzo. e.g. | 90 | the “compressor” attribute, e.g. zswap.compressor=lzo. It can also be changed |
79 | zswap.compressor=deflate | 91 | at runtime using the sysfs "compressor" attribute, e.g. |
92 | |||
93 | echo lzo > /sys/module/zswap/parameters/compressor | ||
94 | |||
95 | When the zpool and/or compressor parameter is changed at runtime, any existing | ||
96 | compressed pages are not modified; they are left in their own zpool. When a | ||
97 | request is made for a page in an old zpool, it is uncompressed using its | ||
98 | original compressor. Once all pages are removed from an old zpool, the zpool | ||
99 | and its compressor are freed. | ||
80 | 100 | ||
81 | A debugfs interface is provided for various statistic about pool size, number | 101 | A debugfs interface is provided for various statistic about pool size, number |
82 | of pages stored, and various counters for the reasons pages are rejected. | 102 | of pages stored, and various counters for the reasons pages are rejected. |
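Note on the zbud paragraph added in zswap.txt above: the "2:1 or worse" bound follows directly from each zbud page holding at most two compressed pages. A small Python sketch of that arithmetic, with hypothetical function names and no claim about how zbud actually packs pages internally:

```python
def zbud_min_pages(compressed_pages):
    """Fewest zbud pages that can hold the given number of compressed
    pages, assuming every zbud page is filled with two of them."""
    return (compressed_pages + 1) // 2


def best_case_ratio(compressed_pages):
    """Upper bound on the compression ratio (stored pages per zbud page)."""
    pages = zbud_min_pages(compressed_pages)
    return compressed_pages / pages if pages else 0.0
```

With an odd number of stored pages at least one zbud page is half full, so the bound drops below 2:1. Real pools will usually do worse, since pages that do not pair up well also leave zbud pages half full.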