Diffstat (limited to 'Documentation/vm/numa_memory_policy.txt')
-rw-r--r-- | Documentation/vm/numa_memory_policy.txt | 332 |
1 files changed, 332 insertions, 0 deletions
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
new file mode 100644
index 000000000000..8242f52d0f22
--- /dev/null
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -0,0 +1,332 @@
1 | |||
2 | What is Linux Memory Policy? | ||
3 | |||
4 | In the Linux kernel, "memory policy" determines from which node the kernel will | ||
5 | allocate memory in a NUMA system or in an emulated NUMA system. Linux has | ||
6 | supported platforms with Non-Uniform Memory Access architectures since 2.4.?. | ||
7 | The current memory policy support was added to Linux 2.6 around May 2004. This | ||
8 | document attempts to describe the concepts and APIs of the 2.6 memory policy | ||
9 | support. | ||
10 | |||
11 | Memory policies should not be confused with cpusets (Documentation/cpusets.txt) | ||
12 | which is an administrative mechanism for restricting the nodes from which | ||
13 | memory may be allocated by a set of processes. Memory policies are a | ||
14 | programming interface that a NUMA-aware application can take advantage of. When | ||
15 | both cpusets and policies are applied to a task, the restrictions of the cpuset | ||
16 | take priority. See "MEMORY POLICIES AND CPUSETS" below for more details. | ||
17 | |||
18 | MEMORY POLICY CONCEPTS | ||
19 | |||
20 | Scope of Memory Policies | ||
21 | |||
22 | The Linux kernel supports _scopes_ of memory policy, described here from | ||
23 | most general to most specific: | ||
24 | |||
25 | System Default Policy: this policy is "hard coded" into the kernel. It | ||
26 | is the policy that governs all page allocations that aren't controlled | ||
27 | by one of the more specific policy scopes discussed below. When the | ||
28 | system is "up and running", the system default policy will use "local | ||
29 | allocation" described below. However, during boot up, the system | ||
30 | default policy will be set to interleave allocations across all nodes | ||
31 | with "sufficient" memory, so as not to overload the initial boot node | ||
32 | with boot-time allocations. | ||
33 | |||
34 | Task/Process Policy: this is an optional, per-task policy. When defined | ||
35 | for a specific task, this policy controls all page allocations made by or | ||
36 | on behalf of the task that aren't controlled by a more specific scope. | ||
37 | If a task does not define a task policy, then all page allocations that | ||
38 | would have been controlled by the task policy "fall back" to the System | ||
39 | Default Policy. | ||
40 | |||
41 | The task policy applies to the entire address space of a task. Thus, | ||
42 | it is inheritable, and indeed is inherited, across both fork() | ||
43 | [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task | ||
44 | to establish the task policy for a child task exec()'d from an | ||
45 | executable image that has no awareness of memory policy. See the | ||
46 | MEMORY POLICY APIS section, below, for an overview of the system call | ||
47 | that a task may use to set/change its task/process policy. | ||
48 | |||
49 | In a multi-threaded task, task policies apply only to the thread | ||
50 | [Linux kernel task] that installs the policy and any threads | ||
51 | subsequently created by that thread. Any sibling threads existing | ||
52 | at the time a new task policy is installed retain their current | ||
53 | policy. | ||
54 | |||
55 | A task policy applies only to pages allocated after the policy is | ||
56 | installed. Any pages already faulted in by the task when the task | ||
57 | changes its task policy remain where they were allocated based on | ||
58 | the policy at the time they were allocated. | ||
59 | |||
60 | VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's | ||
61 | virtual address space. A task may define a specific policy for a range | ||
62 | of its virtual address space. See the MEMORY POLICY APIS section, | ||
63 | below, for an overview of the mbind() system call used to set a VMA | ||
64 | policy. | ||
65 | |||
66 | A VMA policy will govern the allocation of pages that back this region of | ||
67 | the address space. Any regions of the task's address space that don't | ||
68 | have an explicit VMA policy will fall back to the task policy, which may | ||
69 | itself fall back to the System Default Policy. | ||
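The fallback chain described above (VMA policy, then task policy, then System Default Policy) can be sketched in a few lines of C. This is only an illustration of the lookup order, not the kernel's code: the names sketch_policy, sketch_mode and sketch_effective_policy are hypothetical, and a NULL pointer stands in for "no policy installed at this scope".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mode tags for this sketch; not the kernel's MPOL_* values. */
enum sketch_mode { SK_LOCAL, SK_BIND, SK_PREFERRED, SK_INTERLEAVE };

struct sketch_policy {
    enum sketch_mode mode;
    unsigned long nodes;   /* bit i set => node i is in the policy's node set */
};

/* System default once the system is up and running: local allocation. */
static const struct sketch_policy sketch_system_default = { SK_LOCAL, 0UL };

/*
 * Pick the policy that governs an allocation, most specific scope first.
 * A NULL argument means no policy is installed at that scope.
 */
static const struct sketch_policy *
sketch_effective_policy(const struct sketch_policy *vma_policy,
                        const struct sketch_policy *task_policy)
{
    if (vma_policy != NULL)
        return vma_policy;
    if (task_policy != NULL)
        return task_policy;
    return &sketch_system_default;
}
```

Calling sketch_effective_policy(NULL, NULL) yields the system default, which matches the case of a task that has installed neither a VMA policy nor a task policy.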
70 | |||
71 | VMA policies have a few complicating details: | ||
72 | |||
73 | VMA policy applies ONLY to anonymous pages. These include pages | ||
74 | allocated for anonymous segments, such as the task stack and heap, and | ||
75 | any regions of the address space mmap()ed with the MAP_ANONYMOUS flag. | ||
76 | If a VMA policy is applied to a file mapping, it will be ignored if | ||
77 | the mapping used the MAP_SHARED flag. If the file mapping used the | ||
78 | MAP_PRIVATE flag, the VMA policy will only be applied when an | ||
79 | anonymous page is allocated on an attempt to write to the mapping-- | ||
80 | i.e., at Copy-On-Write. | ||
81 | |||
82 | VMA policies are shared between all tasks that share a virtual address | ||
83 | space--a.k.a. threads--independent of when the policy is installed; and | ||
84 | they are inherited across fork(). However, because VMA policies refer | ||
85 | to a specific region of a task's address space, and because the address | ||
86 | space is discarded and recreated on exec*(), VMA policies are NOT | ||
87 | inheritable across exec(). Thus, only NUMA-aware applications may | ||
88 | use VMA policies. | ||
89 | |||
90 | A task may install a new VMA policy on a sub-range of a previously | ||
91 | mmap()ed region. When this happens, Linux splits the existing virtual | ||
92 | memory area into 2 or 3 VMAs, each with its own policy. | ||
93 | |||
94 | By default, VMA policy applies only to pages allocated after the policy | ||
95 | is installed. Any pages already faulted into the VMA range remain | ||
96 | where they were allocated based on the policy at the time they were | ||
97 | allocated. However, since 2.6.16, Linux supports page migration via | ||
98 | the mbind() system call, so that page contents can be moved to match | ||
99 | a newly installed policy. | ||
100 | |||
101 | Shared Policy: Conceptually, shared policies apply to "memory objects" | ||
102 | mapped shared into one or more tasks' distinct address spaces. An | ||
103 | application installs a shared policy the same way as a VMA policy--using | ||
104 | the mbind() system call specifying a range of virtual addresses that map | ||
105 | the shared object. However, unlike VMA policies, which can be considered | ||
106 | to be an attribute of a range of a task's address space, shared policies | ||
107 | apply directly to the shared object. Thus, all tasks that attach to the | ||
108 | object share the policy, and all pages allocated for the shared object, | ||
109 | by any task, will obey the shared policy. | ||
110 | |||
111 | As of 2.6.22, only shared memory segments, created by shmget() or | ||
112 | mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared | ||
113 | policy support was added to Linux, the associated data structures were | ||
114 | added to hugetlbfs shmem segments. At the time, hugetlbfs did not | ||
115 | support allocation at fault time--a.k.a lazy allocation--so hugetlbfs | ||
116 | shmem segments were never "hooked up" to the shared policy support. | ||
117 | Although hugetlbfs segments now support lazy allocation, their support | ||
118 | for shared policy has not been completed. | ||
119 | |||
120 | As mentioned above [re: VMA policies], allocations of page cache | ||
121 | pages for regular files mmap()ed with MAP_SHARED ignore any VMA | ||
122 | policy installed on the virtual address range backed by the shared | ||
123 | file mapping. Rather, shared page cache pages, including pages backing | ||
124 | private mappings that have not yet been written by the task, follow | ||
125 | task policy, if any, else System Default Policy. | ||
126 | |||
127 | The shared policy infrastructure supports different policies on subset | ||
128 | ranges of the shared object. However, Linux still splits the VMA of | ||
129 | the task that installs the policy for each range of distinct policy. | ||
130 | Thus, different tasks that attach to a shared memory segment can have | ||
131 | different VMA configurations mapping that one shared object. This | ||
132 | can be seen by examining the /proc/<pid>/numa_maps of tasks sharing | ||
133 | a shared memory region, when one task has installed shared policy on | ||
134 | one or more ranges of the region. | ||
135 | |||
136 | Components of Memory Policies | ||
137 | |||
138 | A Linux memory policy is a tuple consisting of a "mode" and an optional set | ||
139 | of nodes. The mode determines the behavior of the policy, while the | ||
140 | optional set of nodes can be viewed as the arguments to the behavior. | ||
141 | |||
142 | Internally, memory policies are implemented by a reference counted | ||
143 | structure, struct mempolicy. Details of this structure will be discussed | ||
144 | in context, below, as required to explain the behavior. | ||
145 | |||
146 | Note: in some functions AND in the struct mempolicy itself, the mode | ||
147 | is called "policy". However, to avoid confusion with the policy tuple, | ||
148 | this document will continue to use the term "mode". | ||
149 | |||
150 | Linux memory policy supports the following 4 behavioral modes: | ||
151 | |||
152 | Default Mode--MPOL_DEFAULT: The behavior specified by this mode is | ||
153 | context or scope dependent. | ||
154 | |||
155 | As mentioned in the Policy Scope section above, during normal | ||
156 | system operation, the System Default Policy is hard coded to | ||
157 | contain the Default mode. | ||
158 | |||
159 | In this context, default mode means "local" allocation--that is | ||
160 | attempt to allocate the page from the node associated with the cpu | ||
161 | where the fault occurs. If the "local" node has no memory, or the | ||
162 | node's memory is exhausted [no free pages available], local | ||
163 | allocation will "fallback to"--attempt to allocate pages from-- | ||
164 | "nearby" nodes, in order of increasing "distance". | ||
165 | |||
166 | Implementation detail -- subject to change: "Fallback" uses | ||
167 | a per node list of sibling nodes--called zonelists--built at | ||
168 | boot time, or when nodes or memory are added or removed from | ||
169 | the system [memory hotplug]. These per node zonelists are | ||
170 | constructed with nodes in order of increasing distance based | ||
171 | on information provided by the platform firmware. | ||
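As a rough illustration of "fallback in order of increasing distance", the sketch below sorts node ids by their distance from the local node and then walks that order until it finds a node with free pages. It assumes a small fixed node count and a plain distance table; the names (SKETCH_MAX_NODES, sketch_build_fallback_order, sketch_alloc_node) are hypothetical, and real zonelists are organized per zone rather than as a flat node array.

```c
#include <assert.h>

#define SKETCH_MAX_NODES 4   /* arbitrary size for this sketch */

/*
 * Fill order[] with all node ids sorted by increasing distance from 'local'.
 * distance[a][b] stands in for the firmware-provided distance table.
 * Nodes at equal distance keep ascending node-id order.
 */
static void sketch_build_fallback_order(
        const int distance[SKETCH_MAX_NODES][SKETCH_MAX_NODES],
        int local, int order[SKETCH_MAX_NODES])
{
    assert(local >= 0 && local < SKETCH_MAX_NODES);

    for (int i = 0; i < SKETCH_MAX_NODES; i++)
        order[i] = i;

    /* Stable insertion sort keyed on distance from the local node. */
    for (int i = 1; i < SKETCH_MAX_NODES; i++) {
        int node = order[i];
        int j = i - 1;

        while (j >= 0 && distance[local][order[j]] > distance[local][node]) {
            order[j + 1] = order[j];
            j--;
        }
        order[j + 1] = node;
    }
}

/*
 * Walk the fallback order and return the first node with a free page,
 * or -1 if every node is exhausted.
 */
static int sketch_alloc_node(const int order[SKETCH_MAX_NODES],
                             const int free_pages[SKETCH_MAX_NODES])
{
    for (int i = 0; i < SKETCH_MAX_NODES; i++)
        if (free_pages[order[i]] > 0)
            return order[i];
    return -1;
}
```

Building the order once and reusing it for every allocation mirrors the point made above that the lists are constructed at boot or hotplug time rather than at fault time.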
172 | |||
173 | When a task/process policy or a shared policy contains the Default | ||
174 | mode, this also means "local allocation", as described above. | ||
175 | |||
176 | In the context of a VMA, Default mode means "fall back to task | ||
177 | policy"--which may or may not specify Default mode. Thus, Default | ||
178 | mode can not be counted on to mean local allocation when used | ||
179 | on a non-shared region of the address space. However, see | ||
180 | MPOL_PREFERRED below. | ||
181 | |||
182 | The Default mode does not use the optional set of nodes. | ||
183 | |||
184 | MPOL_BIND: This mode specifies that memory must come from the | ||
185 | set of nodes specified by the policy. | ||
186 | |||
187 | The memory policy APIs do not specify an order in which the nodes | ||
188 | will be searched. However, unlike "local allocation", the Bind | ||
189 | policy does not consider the distance between the nodes. Rather, | ||
190 | allocations will fallback to the nodes specified by the policy in | ||
191 | order of numeric node id. Like everything in Linux, this is subject | ||
192 | to change. | ||
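The Bind behavior described above, fallback in order of numeric node id with no spill-over outside the policy's node set, can be sketched as follows. This is a simplified model under stated assumptions: the node set fits in one unsigned long, free_pages[] is a made-up per-node counter, and sketch_bind_alloc_node is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>

/*
 * Return the lowest-numbered node in 'bind_nodes' that still has a free
 * page, or -1 if no node in the set can satisfy the allocation.
 * Bit i of bind_nodes set means node i is in the policy's node set.
 * Nodes outside the set are never considered, even if they have memory.
 */
static int sketch_bind_alloc_node(unsigned long bind_nodes,
                                  const int *free_pages, int nr_nodes)
{
    assert(nr_nodes >= 0 &&
           (unsigned int)nr_nodes <= sizeof(bind_nodes) * CHAR_BIT);

    for (int node = 0; node < nr_nodes; node++) {
        if ((bind_nodes & (1UL << node)) && free_pages[node] > 0)
            return node;
    }
    return -1;
}
```

Contrast this with the distance-ordered fallback of local allocation: here the search order depends only on node numbering, which is why the text notes that Bind does not consider the distance between nodes.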
193 | |||
194 | MPOL_PREFERRED: This mode specifies that the allocation should be | ||
195 | attempted from the single node specified in the policy. If that | ||
196 | allocation fails, the kernel will search other nodes, exactly as | ||
197 | it would for a local allocation that started at the preferred node | ||
198 | in increasing distance from the preferred node. "Local" allocation | ||
199 | policy can be viewed as a Preferred policy that starts at the node | ||
200 | containing the cpu where the allocation takes place. | ||
201 | |||
202 | Internally, the Preferred policy uses a single node--the | ||
203 | preferred_node member of struct mempolicy. A "distinguished" | ||
204 | value of this preferred_node, currently '-1', is interpreted | ||
205 | as "the node containing the cpu where the allocation takes | ||
206 | place"--local allocation. This is the way to specify | ||
207 | local allocation for a specific range of addresses--i.e. for | ||
208 | VMA policies. | ||
209 | |||
210 | MPOL_INTERLEAVE: This mode specifies that page allocations be | ||
211 | interleaved, on a page granularity, across the nodes specified in | ||
212 | the policy. This mode also behaves slightly differently, based on | ||
213 | the context where it is used: | ||
214 | |||
215 | For allocation of anonymous pages and shared memory pages, | ||
216 | Interleave mode indexes the set of nodes specified by the policy | ||
217 | using the page offset of the faulting address into the segment | ||
218 | [VMA] containing the address modulo the number of nodes specified | ||
219 | by the policy. It then attempts to allocate a page, starting at | ||
220 | the selected node, as if the node had been specified by a Preferred | ||
221 | policy or had been selected by a local allocation. That is, | ||
222 | allocation will follow the per node zonelist. | ||
223 | |||
224 | For allocation of page cache pages, Interleave mode indexes the set | ||
225 | of nodes specified by the policy using a node counter maintained | ||
226 | per task. This counter wraps around to the lowest specified node | ||
227 | after it reaches the highest specified node. This will tend to | ||
228 | spread the pages out over the nodes specified by the policy based | ||
229 | on the order in which they are allocated, rather than based on any | ||
230 | page offset into an address range or file. During system boot up, | ||
231 | the temporary interleaved system default policy works in this | ||
232 | mode. | ||
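The two node-selection rules for Interleave mode can be compared side by side in a small sketch. It assumes the node set fits in one unsigned long; sketch_task and its il_index field are hypothetical stand-ins for the per-task counter, and only the choice of the starting node is modeled, not the zonelist fallback that follows it.

```c
#include <assert.h>

/* Return the id of the n-th (0-based) set bit in 'nodes', or -1 if absent. */
static int sketch_nth_node(unsigned long nodes, unsigned int n)
{
    for (int node = 0; nodes != 0; node++, nodes >>= 1) {
        if (nodes & 1UL) {
            if (n == 0)
                return node;
            n--;
        }
    }
    return -1;
}

/* Count the nodes in the set. */
static unsigned int sketch_weight(unsigned long nodes)
{
    unsigned int count = 0;

    for (; nodes != 0; nodes &= nodes - 1)
        count++;
    return count;
}

/* Anonymous and shared memory: page offset into the VMA, modulo set size. */
static int sketch_interleave_by_offset(unsigned long nodes,
                                       unsigned long page_offset)
{
    unsigned int nr = sketch_weight(nodes);

    assert(nr > 0);
    return sketch_nth_node(nodes, (unsigned int)(page_offset % nr));
}

struct sketch_task {
    unsigned int il_index;   /* hypothetical per-task interleave counter */
};

/* Page cache: per-task counter that wraps from the highest node to the lowest. */
static int sketch_interleave_by_counter(struct sketch_task *task,
                                        unsigned long nodes)
{
    unsigned int nr = sketch_weight(nodes);
    int node;

    assert(nr > 0);
    node = sketch_nth_node(nodes, task->il_index % nr);
    task->il_index = (task->il_index + 1) % nr;
    return node;
}
```

With the offset rule, the same page of a segment always maps to the same starting node; with the counter rule, placement depends only on allocation order within the task.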
233 | |||
234 | MEMORY POLICY APIs | ||
235 | |||
236 | Linux supports 3 system calls for controlling memory policy. These APIs | ||
237 | always affect only the calling task, the calling task's address space, or | ||
238 | some shared object mapped into the calling task's address space. | ||
239 | |||
240 | Note: the headers that define these APIs and the parameter data types | ||
241 | for user space applications reside in a package that is not part of | ||
242 | the Linux kernel. The kernel system call interfaces, with the 'sys_' | ||
243 | prefix, are defined in <linux/syscalls.h>; the mode and flag | ||
244 | definitions are defined in <linux/mempolicy.h>. | ||
245 | |||
246 | Set [Task] Memory Policy: | ||
247 | |||
248 | long set_mempolicy(int mode, const unsigned long *nmask, | ||
249 | unsigned long maxnode); | ||
250 | |||
251 | Sets the calling task's "task/process memory policy" to the mode | ||
252 | specified by the 'mode' argument and the set of nodes defined | ||
253 | by 'nmask'. 'nmask' points to a bit mask of node ids containing | ||
254 | at least 'maxnode' ids. | ||
255 | |||
256 | See the set_mempolicy(2) man page for more details. | ||
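A minimal sketch of building 'nmask' and installing an interleave task policy follows. It invokes the system call through syscall(2) so that it does not depend on the out-of-kernel wrapper library mentioned in the note above. The helper names and the two-word mask size are assumptions made for illustration; 'maxnode' is simply taken to be the number of bits in the mask, and set_mempolicy(2) should be consulted for its exact semantics.

```c
#define _GNU_SOURCE            /* for the syscall() declaration */
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <sys/syscall.h>       /* SYS_set_mempolicy */
#include <unistd.h>
#include <linux/mempolicy.h>   /* MPOL_INTERLEAVE */

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_MASK_LONGS    2   /* arbitrary mask size for this sketch */

/* Set the bit for 'node' in a mask of SKETCH_MASK_LONGS words. */
static void sketch_mask_set(unsigned long *nmask, unsigned int node)
{
    assert(node < SKETCH_MASK_LONGS * SKETCH_BITS_PER_LONG);
    nmask[node / SKETCH_BITS_PER_LONG] |= 1UL << (node % SKETCH_BITS_PER_LONG);
}

/*
 * Ask the kernel to interleave the calling task's future allocations
 * across node_a and node_b. Returns 0 on success, or -1 with errno set
 * (for example when a node is absent or not allowed for this task).
 */
long sketch_interleave_two_nodes(unsigned int node_a, unsigned int node_b)
{
    unsigned long nmask[SKETCH_MASK_LONGS];

    memset(nmask, 0, sizeof(nmask));
    sketch_mask_set(nmask, node_a);
    sketch_mask_set(nmask, node_b);

    return syscall(SYS_set_mempolicy, MPOL_INTERLEAVE, nmask,
                   (unsigned long)(SKETCH_MASK_LONGS * SKETCH_BITS_PER_LONG));
}
```

Because task policy affects only pages allocated after it is installed, a call such as sketch_interleave_two_nodes(0, 1) would normally be made early, before the task faults in the memory it wants spread across nodes.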
257 | |||
258 | |||
259 | Get [Task] Memory Policy or Related Information | ||
260 | |||
261 | long get_mempolicy(int *mode, | ||
262 | unsigned long *nmask, unsigned long maxnode, | ||
263 | void *addr, int flags); | ||
264 | |||
265 | Queries the "task/process memory policy" of the calling task, or | ||
266 | the policy or location of a specified virtual address, depending | ||
267 | on the 'flags' argument. | ||
268 | |||
269 | See the get_mempolicy(2) man page for more details. | ||
270 | |||
271 | |||
272 | Install VMA/Shared Policy for a Range of Task's Address Space | ||
273 | |||
274 | long mbind(void *start, unsigned long len, int mode, | ||
275 | const unsigned long *nmask, unsigned long maxnode, | ||
276 | unsigned flags); | ||
277 | |||
278 | mbind() installs the policy specified by (mode, nmask, maxnode) as | ||
279 | a VMA policy for the range of the calling task's address space | ||
280 | specified by the 'start' and 'len' arguments. Additional actions | ||
281 | may be requested via the 'flags' argument. | ||
282 | |||
283 | See the mbind(2) man page for more details. | ||
284 | |||
285 | MEMORY POLICY COMMAND LINE INTERFACE | ||
286 | |||
287 | Although not strictly part of the Linux implementation of memory policy, | ||
288 | a command line tool, numactl(8), exists that allows one to: | ||
289 | |||
290 | + set the task policy for a specified program via set_mempolicy(2), fork(2) and | ||
291 | exec(2) | ||
292 | |||
293 | + set the shared policy for a shared memory segment via mbind(2) | ||
294 | |||
295 | The numactl(8) tool is packaged with the run-time version of the library | ||
296 | containing the memory policy system call wrappers. Some distributions | ||
297 | package the headers and compile-time libraries in a separate development | ||
298 | package. | ||
299 | |||
300 | |||
301 | MEMORY POLICIES AND CPUSETS | ||
302 | |||
303 | Memory policies work within cpusets as described above. For memory policies | ||
304 | that require a node or set of nodes, the nodes are restricted to the set of | ||
305 | nodes whose memories are allowed by the cpuset constraints. If the | ||
306 | intersection of the set of nodes specified for the policy and the set of nodes | ||
307 | allowed by the cpuset is the empty set, the policy is considered invalid and | ||
308 | cannot be installed. | ||
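The restriction described above amounts to intersecting two node sets and rejecting an empty result. The sketch below models that with single-word bit masks; the function name is hypothetical, and returning -EINVAL for the empty case is an assumption of this sketch rather than something this document specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Restrict a policy's node set to the nodes allowed by the cpuset.
 * On success, stores the usable nodes in *effective and returns 0.
 * Returns -EINVAL (an assumed error code) when the sets do not overlap,
 * modeling the "invalid, cannot be installed" case; *effective is then
 * left untouched.
 */
static int sketch_restrict_to_cpuset(unsigned long policy_nodes,
                                     unsigned long cpuset_allowed,
                                     unsigned long *effective)
{
    unsigned long usable = policy_nodes & cpuset_allowed;

    assert(effective != NULL);
    if (usable == 0)
        return -EINVAL;
    *effective = usable;
    return 0;
}
```

For example, a policy naming nodes 1 and 2 inside a cpuset that allows nodes 0 and 1 ends up using node 1 only, while a policy naming nodes 2 and 3 in the same cpuset is rejected.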
309 | |||
310 | The interaction of memory policies and cpusets can be problematic for a | ||
311 | couple of reasons: | ||
312 | |||
313 | 1) the memory policy APIs take physical node id's as arguments. However, the | ||
314 | memory policy APIs do not provide a way to determine what nodes are valid | ||
315 | in the context where the application is running. An application MAY consult | ||
316 | the cpuset file system [directly or via an out of tree, and not generally | ||
317 | available, libcpuset API] to obtain this information, but then the | ||
318 | application must be aware that it is running in a cpuset and use what are | ||
319 | intended primarily as administrative APIs. | ||
320 | |||
321 | However, as long as the policy specifies at least one node that is valid | ||
322 | in the controlling cpuset, the policy can be used. | ||
323 | |||
324 | 2) when tasks in two cpusets share access to a memory region, such as shared | ||
325 | memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and | ||
326 | MAP_SHARED flags, and any of the tasks install shared policy on the region, | ||
327 | only nodes whose memories are allowed in both cpusets may be used in the | ||
328 | policies. Again, obtaining this information requires "stepping outside" | ||
329 | the memory policy APIs, as well as knowing in what cpusets other tasks might | ||
330 | be attaching to the shared region, to use the cpuset information. | ||
331 | Furthermore, if the cpusets' allowed memory sets are disjoint, "local" | ||
332 | allocation is the only valid policy. | ||