Diffstat (limited to 'Documentation/vm/hugetlbpage.txt')
-rw-r--r-- | Documentation/vm/hugetlbpage.txt | 147 |
1 files changed, 95 insertions, 52 deletions
diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index ea8714fcc3ad..82a7bd1800b2 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -18,13 +18,13 @@ First the Linux kernel needs to be built with the CONFIG_HUGETLBFS | |||
18 | automatically when CONFIG_HUGETLBFS is selected) configuration | 18 | automatically when CONFIG_HUGETLBFS is selected) configuration |
19 | options. | 19 | options. |
20 | 20 | ||
21 | The kernel built with hugepage support should show the number of configured | 21 | The kernel built with huge page support should show the number of configured |
22 | hugepages in the system by running the "cat /proc/meminfo" command. | 22 | huge pages in the system by running the "cat /proc/meminfo" command. |
23 | 23 | ||
24 | /proc/meminfo also provides information about the total number of hugetlb | 24 | /proc/meminfo also provides information about the total number of hugetlb |
25 | pages configured in the kernel. It also displays information about the | 25 | pages configured in the kernel. It also displays information about the |
26 | number of free hugetlb pages at any time. It also displays information about | 26 | number of free hugetlb pages at any time. It also displays information about |
27 | the configured hugepage size - this is needed for generating the proper | 27 | the configured huge page size - this is needed for generating the proper |
28 | alignment and size of the arguments to the above system calls. | 28 | alignment and size of the arguments to the above system calls. |
29 | 29 | ||
30 | The output of "cat /proc/meminfo" will have lines like: | 30 | The output of "cat /proc/meminfo" will have lines like: |
@@ -37,25 +37,27 @@ HugePages_Surp: yyy | |||
37 | Hugepagesize: zzz kB | 37 | Hugepagesize: zzz kB |
38 | 38 | ||
39 | where: | 39 | where: |
40 | HugePages_Total is the size of the pool of hugepages. | 40 | HugePages_Total is the size of the pool of huge pages. |
41 | HugePages_Free is the number of hugepages in the pool that are not yet | 41 | HugePages_Free is the number of huge pages in the pool that are not yet |
42 | allocated. | 42 | allocated. |
43 | HugePages_Rsvd is short for "reserved," and is the number of hugepages | 43 | HugePages_Rsvd is short for "reserved," and is the number of huge pages for |
44 | for which a commitment to allocate from the pool has been made, but no | 44 | which a commitment to allocate from the pool has been made, |
45 | allocation has yet been made. It's vaguely analogous to overcommit. | 45 | but no allocation has yet been made. Reserved huge pages |
46 | HugePages_Surp is short for "surplus," and is the number of hugepages in | 46 | guarantee that an application will be able to allocate a |
47 | the pool above the value in /proc/sys/vm/nr_hugepages. The maximum | 47 | huge page from the pool of huge pages at fault time. |
48 | number of surplus hugepages is controlled by | 48 | HugePages_Surp is short for "surplus," and is the number of huge pages in |
49 | /proc/sys/vm/nr_overcommit_hugepages. | 49 | the pool above the value in /proc/sys/vm/nr_hugepages. The |
50 | maximum number of surplus huge pages is controlled by | ||
51 | /proc/sys/vm/nr_overcommit_hugepages. | ||
50 | 52 | ||
51 | /proc/filesystems should also show a filesystem of type "hugetlbfs" configured | 53 | /proc/filesystems should also show a filesystem of type "hugetlbfs" configured |
52 | in the kernel. | 54 | in the kernel. |
53 | 55 | ||
54 | /proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb | 56 | /proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb |
55 | pages in the kernel. Super user can dynamically request more (or free some | 57 | pages in the kernel. Super user can dynamically request more (or free some |
56 | pre-configured) hugepages. | 58 | pre-configured) huge pages. |
57 | The allocation (or deallocation) of hugetlb pages is possible only if there are | 59 | The allocation (or deallocation) of hugetlb pages is possible only if there are |
58 | enough physically contiguous free pages in system (freeing of hugepages is | 60 | enough physically contiguous free pages in system (freeing of huge pages is |
59 | possible only if there are enough hugetlb pages free that can be transferred | 61 | possible only if there are enough hugetlb pages free that can be transferred |
60 | back to regular memory pool). | 62 | back to regular memory pool). |
61 | 63 | ||
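Editor's sketch (not part of the patch): the HugePages_* fields described in
the hunk above can be read programmatically. The following is a minimal
example, assuming the field names appear in /proc/meminfo exactly as listed
in the document. The struct and function names (hugepage_info,
parse_hugepage_info, read_hugepage_info) are hypothetical and exist only for
illustration.

```c
#include <stdio.h>
#include <string.h>

/* Huge page counters as reported in /proc/meminfo (-1 means "not found"). */
struct hugepage_info {
    long total;      /* HugePages_Total: size of the pool */
    long free_pages; /* HugePages_Free: pages in the pool not yet allocated */
    long rsvd;       /* HugePages_Rsvd: committed but not yet allocated */
    long surp;       /* HugePages_Surp: pages above nr_hugepages */
    long size_kb;    /* Hugepagesize, in kB */
};

/*
 * Parse the huge page fields out of meminfo-formatted text.
 * Returns the number of huge page fields found (0 to 5).
 */
int parse_hugepage_info(const char *text, struct hugepage_info *out)
{
    const char *line = text;
    int found = 0;
    long v;

    out->total = out->free_pages = out->rsvd = out->surp = out->size_kb = -1;

    while (line && *line) {
        if (sscanf(line, "HugePages_Total: %ld", &v) == 1) {
            out->total = v; found++;
        } else if (sscanf(line, "HugePages_Free: %ld", &v) == 1) {
            out->free_pages = v; found++;
        } else if (sscanf(line, "HugePages_Rsvd: %ld", &v) == 1) {
            out->rsvd = v; found++;
        } else if (sscanf(line, "HugePages_Surp: %ld", &v) == 1) {
            out->surp = v; found++;
        } else if (sscanf(line, "Hugepagesize: %ld kB", &v) == 1) {
            out->size_kb = v; found++;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return found;
}

/*
 * Read a meminfo-style file (normally "/proc/meminfo") and parse it.
 * Returns -1 if the file cannot be opened. Assumes the file fits in 8 kB.
 */
int read_hugepage_info(const char *path, struct hugepage_info *out)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_hugepage_info(buf, out);
}
```

A caller would typically invoke read_hugepage_info("/proc/meminfo", &info)
after writing to /proc/sys/vm/nr_hugepages, to verify how many huge pages
were actually allocated.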
@@ -67,43 +69,82 @@ use either the mmap system call or shared memory system calls to start using | |||
67 | the huge pages. It is required that the system administrator preallocate | 69 | the huge pages. It is required that the system administrator preallocate |
68 | enough memory for huge page purposes. | 70 | enough memory for huge page purposes. |
69 | 71 | ||
70 | Use the following command to dynamically allocate/deallocate hugepages: | 72 | The administrator can preallocate huge pages on the kernel boot command line by |
73 | specifying the "hugepages=N" parameter, where 'N' = the number of huge pages | ||
74 | requested. This is the most reliable method for preallocating huge pages as | ||
75 | memory has not yet become fragmented. | ||
76 | |||
77 | Some platforms support multiple huge page sizes. To preallocate huge pages | ||
78 | of a specific size, one must precede the huge pages boot command parameters | ||
79 | with a huge page size selection parameter "hugepagesz=<size>". <size> must | ||
80 | be specified in bytes with optional scale suffix [kKmMgG]. The default huge | ||
81 | page size may be selected with the "default_hugepagesz=<size>" boot parameter. | ||
82 | |||
83 | /proc/sys/vm/nr_hugepages indicates the current number of configured [default | ||
84 | size] hugetlb pages in the kernel. Super user can dynamically request more | ||
85 | (or free some pre-configured) huge pages. | ||
86 | |||
87 | Use the following command to dynamically allocate/deallocate default sized | ||
88 | huge pages: | ||
71 | 89 | ||
72 | echo 20 > /proc/sys/vm/nr_hugepages | 90 | echo 20 > /proc/sys/vm/nr_hugepages |
73 | 91 | ||
74 | This command will try to configure 20 hugepages in the system. The success | 92 | This command will try to configure 20 default sized huge pages in the system. |
75 | or failure of allocation depends on the amount of physically contiguous | 93 | On a NUMA platform, the kernel will attempt to distribute the huge page pool |
76 | memory that is preset in system at this time. System administrators may want | 94 | over the all on-line nodes. These huge pages, allocated when nr_hugepages |
77 | to put this command in one of the local rc init files. This will enable the | 95 | is increased, are called "persistent huge pages". |
78 | kernel to request huge pages early in the boot process (when the possibility | 96 | |
79 | of getting physical contiguous pages is still very high). In either | 97 | The success or failure of huge page allocation depends on the amount of |
80 | case, administrators will want to verify the number of hugepages actually | 98 | physically contiguous memory that is preset in system at the time of the |
81 | allocated by checking the sysctl or meminfo. | 99 | allocation attempt. If the kernel is unable to allocate huge pages from |
82 | 100 | some nodes in a NUMA system, it will attempt to make up the difference by | |
83 | /proc/sys/vm/nr_overcommit_hugepages indicates how large the pool of | 101 | allocating extra pages on other nodes with sufficient available contiguous |
84 | hugepages can grow, if more hugepages than /proc/sys/vm/nr_hugepages are | 102 | memory, if any. |
85 | requested by applications. echo'ing any non-zero value into this file | 103 | |
86 | indicates that the hugetlb subsystem is allowed to try to obtain | 104 | System administrators may want to put this command in one of the local rc init |
87 | hugepages from the buddy allocator, if the normal pool is exhausted. As | 105 | files. This will enable the kernel to request huge pages early in the boot |
88 | these surplus hugepages go out of use, they are freed back to the buddy | 106 | process when the possibility of getting physical contiguous pages is still |
107 | very high. Administrators can verify the number of huge pages actually | ||
108 | allocated by checking the sysctl or meminfo. To check the per node | ||
109 | distribution of huge pages in a NUMA system, use: | ||
110 | |||
111 | cat /sys/devices/system/node/node*/meminfo | fgrep Huge | ||
112 | |||
113 | /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of | ||
114 | huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are | ||
115 | requested by applications. Writing any non-zero value into this file | ||
116 | indicates that the hugetlb subsystem is allowed to try to obtain "surplus" | ||
117 | huge pages from the buddy allocator, when the normal pool is exhausted. As | ||
118 | these surplus huge pages go out of use, they are freed back to the buddy | ||
89 | allocator. | 119 | allocator. |
90 | 120 | ||
121 | When increasing the huge page pool size via nr_hugepages, any surplus | ||
122 | pages will first be promoted to persistent huge pages. Then, additional | ||
123 | huge pages will be allocated, if necessary and if possible, to fulfill | ||
124 | the new huge page pool size. | ||
125 | |||
126 | The administrator may shrink the pool of preallocated huge pages for | ||
127 | the default huge page size by setting the nr_hugepages sysctl to a | ||
128 | smaller value. The kernel will attempt to balance the freeing of huge pages | ||
129 | across all on-line nodes. Any free huge pages on the selected nodes will | ||
130 | be freed back to the buddy allocator. | ||
131 | |||
91 | Caveat: Shrinking the pool via nr_hugepages such that it becomes less | 132 | Caveat: Shrinking the pool via nr_hugepages such that it becomes less |
92 | than the number of hugepages in use will convert the balance to surplus | 133 | than the number of huge pages in use will convert the balance to surplus |
93 | huge pages even if it would exceed the overcommit value. As long as | 134 | huge pages even if it would exceed the overcommit value. As long as |
94 | this condition holds, however, no more surplus huge pages will be | 135 | this condition holds, however, no more surplus huge pages will be |
95 | allowed on the system until one of the two sysctls are increased | 136 | allowed on the system until one of the two sysctls are increased |
96 | sufficiently, or the surplus huge pages go out of use and are freed. | 137 | sufficiently, or the surplus huge pages go out of use and are freed. |
97 | 138 | ||
98 | With support for multiple hugepage pools at run-time available, much of | 139 | With support for multiple huge page pools at run-time available, much of |
99 | the hugepage userspace interface has been duplicated in sysfs. The above | 140 | the huge page userspace interface has been duplicated in sysfs. The above |
100 | information applies to the default hugepage size (which will be | 141 | information applies to the default huge page size which will be |
101 | controlled by the proc interfaces for backwards compatibility). The root | 142 | controlled by the /proc interfaces for backwards compatibility. The root |
102 | hugepage control directory is | 143 | huge page control directory in sysfs is: |
103 | 144 | ||
104 | /sys/kernel/mm/hugepages | 145 | /sys/kernel/mm/hugepages |
105 | 146 | ||
106 | For each hugepage size supported by the running kernel, a subdirectory | 147 | For each huge page size supported by the running kernel, a subdirectory |
107 | will exist, of the form | 148 | will exist, of the form |
108 | 149 | ||
109 | hugepages-${size}kB | 150 | hugepages-${size}kB |
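Editor's sketch (not part of the patch): the hunk above describes a size
notation ("<size>" in bytes with an optional [kKmMgG] suffix) and a sysfs
directory named hugepages-${size}kB per supported size. The following
minimal example shows how the two relate. The helper names
(parse_hugepage_size, hugepage_sysfs_dir) are hypothetical, and the parser
accepts decimal numbers only, which is a simplifying assumption and may be
stricter than the kernel's own parsing.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HUGEPAGES_SYSFS_ROOT "/sys/kernel/mm/hugepages"

/*
 * Parse "<number>[kKmMgG]" into a byte count.
 * Returns 0 on malformed input (decimal digits only in this sketch).
 */
unsigned long long parse_hugepage_size(const char *s)
{
    char *end;
    unsigned long long v;

    if (*s < '0' || *s > '9')
        return 0;
    v = strtoull(s, &end, 10);

    switch (*end) {
    case 'g': case 'G':
        v <<= 10;
        /* fall through */
    case 'm': case 'M':
        v <<= 10;
        /* fall through */
    case 'k': case 'K':
        v <<= 10;
        end++;
        break;
    case '\0':
        break;
    default:
        return 0;
    }
    /* Reject trailing characters such as the "B" in "2MB". */
    return *end == '\0' ? v : 0;
}

/*
 * Format the per-size control directory, for example
 * /sys/kernel/mm/hugepages/hugepages-2048kB for a 2 MB huge page.
 * Returns 0 on success, -1 if the size is not a non-zero multiple of
 * 1 kB or the buffer is too small.
 */
int hugepage_sysfs_dir(unsigned long long size_bytes, char *buf, size_t len)
{
    int n;

    if (size_bytes == 0 || size_bytes % 1024 != 0)
        return -1;
    n = snprintf(buf, len, "%s/hugepages-%llukB",
                 HUGEPAGES_SYSFS_ROOT, size_bytes / 1024);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Whether a given directory actually exists depends on the huge page sizes the
running kernel supports, so a caller should still check for it at run time.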
@@ -116,9 +157,9 @@ Inside each of these directories, the same set of files will exist: | |||
116 | resv_hugepages | 157 | resv_hugepages |
117 | surplus_hugepages | 158 | surplus_hugepages |
118 | 159 | ||
119 | which function as described above for the default hugepage-sized case. | 160 | which function as described above for the default huge page-sized case. |
120 | 161 | ||
121 | If the user applications are going to request hugepages using mmap system | 162 | If the user applications are going to request huge pages using mmap system |
122 | call, then it is required that system administrator mount a file system of | 163 | call, then it is required that system administrator mount a file system of |
123 | type hugetlbfs: | 164 | type hugetlbfs: |
124 | 165 | ||
@@ -127,7 +168,7 @@ type hugetlbfs: | |||
127 | none /mnt/huge | 168 | none /mnt/huge |
128 | 169 | ||
129 | This command mounts a (pseudo) filesystem of type hugetlbfs on the directory | 170 | This command mounts a (pseudo) filesystem of type hugetlbfs on the directory |
130 | /mnt/huge. Any files created on /mnt/huge uses hugepages. The uid and gid | 171 | /mnt/huge. Any files created on /mnt/huge uses huge pages. The uid and gid |
131 | options sets the owner and group of the root of the file system. By default | 172 | options sets the owner and group of the root of the file system. By default |
132 | the uid and gid of the current process are taken. The mode option sets the | 173 | the uid and gid of the current process are taken. The mode option sets the |
133 | mode of root of file system to value & 0777. This value is given in octal. | 174 | mode of root of file system to value & 0777. This value is given in octal. |
@@ -146,24 +187,26 @@ Regular chown, chgrp, and chmod commands (with right permissions) could be | |||
146 | used to change the file attributes on hugetlbfs. | 187 | used to change the file attributes on hugetlbfs. |
147 | 188 | ||
148 | Also, it is important to note that no such mount command is required if the | 189 | Also, it is important to note that no such mount command is required if the |
149 | applications are going to use only shmat/shmget system calls. Users who | 190 | applications are going to use only shmat/shmget system calls or mmap with |
150 | wish to use hugetlb page via shared memory segment should be a member of | 191 | MAP_HUGETLB. Users who wish to use hugetlb page via shared memory segment |
151 | a supplementary group and system admin needs to configure that gid into | 192 | should be a member of a supplementary group and system admin needs to |
152 | /proc/sys/vm/hugetlb_shm_group. It is possible for same or different | 193 | configure that gid into /proc/sys/vm/hugetlb_shm_group. It is possible for |
153 | applications to use any combination of mmaps and shm* calls, though the | 194 | same or different applications to use any combination of mmaps and shm* |
154 | mount of filesystem will be required for using mmap calls. | 195 | calls, though the mount of filesystem will be required for using mmap calls |
196 | without MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see | ||
197 | map_hugetlb.c. | ||
155 | 198 | ||
156 | ******************************************************************* | 199 | ******************************************************************* |
157 | 200 | ||
158 | /* | 201 | /* |
159 | * Example of using hugepage memory in a user application using Sys V shared | 202 | * Example of using huge page memory in a user application using Sys V shared |
160 | * memory system calls. In this example the app is requesting 256MB of | 203 | * memory system calls. In this example the app is requesting 256MB of |
161 | * memory that is backed by huge pages. The application uses the flag | 204 | * memory that is backed by huge pages. The application uses the flag |
162 | * SHM_HUGETLB in the shmget system call to inform the kernel that it is | 205 | * SHM_HUGETLB in the shmget system call to inform the kernel that it is |
163 | * requesting hugepages. | 206 | * requesting huge pages. |
164 | * | 207 | * |
165 | * For the ia64 architecture, the Linux kernel reserves Region number 4 for | 208 | * For the ia64 architecture, the Linux kernel reserves Region number 4 for |
166 | * hugepages. That means the addresses starting with 0x800000... will need | 209 | * huge pages. That means the addresses starting with 0x800000... will need |
167 | * to be specified. Specifying a fixed address is not required on ppc64, | 210 | * to be specified. Specifying a fixed address is not required on ppc64, |
168 | * i386 or x86_64. | 211 | * i386 or x86_64. |
169 | * | 212 | * |
@@ -252,14 +295,14 @@ int main(void) | |||
252 | ******************************************************************* | 295 | ******************************************************************* |
253 | 296 | ||
254 | /* | 297 | /* |
255 | * Example of using hugepage memory in a user application using the mmap | 298 | * Example of using huge page memory in a user application using the mmap |
256 | * system call. Before running this application, make sure that the | 299 | * system call. Before running this application, make sure that the |
257 | * administrator has mounted the hugetlbfs filesystem (on some directory | 300 | * administrator has mounted the hugetlbfs filesystem (on some directory |
258 | * like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this | 301 | * like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this |
259 | * example, the app is requesting memory of size 256MB that is backed by | 302 | * example, the app is requesting memory of size 256MB that is backed by |
260 | * huge pages. | 303 | * huge pages. |
261 | * | 304 | * |
262 | * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages. | 305 | * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages. |
263 | * That means the addresses starting with 0x800000... will need to be | 306 | * That means the addresses starting with 0x800000... will need to be |
264 | * specified. Specifying a fixed address is not required on ppc64, i386 | 307 | * specified. Specifying a fixed address is not required on ppc64, i386 |
265 | * or x86_64. | 308 | * or x86_64. |