author KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> 2007-07-16 02:38:01 -0400
committer Linus Torvalds <torvalds@woody.linux-foundation.org> 2007-07-16 12:05:35 -0400
commit f0c0b2b808f232741eadac272bd4bc51f18df0f4
tree c2568efdc496cc165a4e72d8aa2542b22035e342 /Documentation
parent 18a8bd949d6adb311ea816125ff65050df1f3f6e
change zonelist order: zonelist order selection logic
Make zonelist creation policy selectable from sysctl/boot option v6.

This patch makes the order of NUMA's zonelist (of pgdat) selectable.
Available orders are Default (automatic), Node-based, and Zone-based.

[Default Order]
The kernel selects Node-based or Zone-based order automatically.

[Node-based Order]
This policy treats the locality of memory as the most important parameter.
The zonelist is ordered by each zone's locality. This means lower zones
(e.g. ZONE_DMA) can be used before higher zones (e.g. ZONE_NORMAL) are
exhausted. In other words, ZONE_DMA will be in the middle of the zonelist.
The current 2.6.21 kernel uses this order.

Pros.
 * A user can expect local memory as much as possible.
Cons.
 * A lower zone will be exhausted before the higher zones. This may cause
   OOM_KILL.

Maybe suitable if ZONE_DMA is relatively big, you never see OOM_KILL
because of ZONE_DMA exhaustion, and you need the best locality.

(example)
Assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

 * node(0)'s memory allocation order:
   node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.
 * node(1)'s memory allocation order:
   node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

[Zone-based order]
This policy treats the zone type as the most important parameter.
The zonelist is ordered by zone type. This means a lower zone is never used
before the higher zones are exhausted. In other words, ZONE_DMA will always
be at the tail of the zonelist.

Pros.
 * OOM_KILL (because of a lower zone) occurs only if all zones are exhausted.
Cons.
 * Memory locality may not be the best.

(example)
Assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

 * node(0)'s memory allocation order:
   node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.
 * node(1)'s memory allocation order:
   node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

The boot option "numa_zonelist_order=" and proc/sysctl are supported.

command:
 % echo N > /proc/sys/vm/numa_zonelist_order
Will rebuild the zonelist in Node-based order.

command:
 % echo Z > /proc/sys/vm/numa_zonelist_order
Will rebuild the zonelist in Zone-based order.

Thanks to Lee Schermerhorn, who gave me much help and code.

[Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
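
As a rough illustration of the two orderings described in the commit message, here is a toy model in Python. It is not the kernel's zonelist-building code: the function names, the `TOPOLOGY` dictionary, and the "nodes listed nearest first" input are all assumptions made for this sketch, and ZONE_HIGHMEM/ZONE_DMA32 are ignored just as in the examples.

```python
# Toy model of the two zonelist orders; not the kernel's implementation.
# Zone types from highest to lowest, matching the examples (HIGHMEM/DMA32 ignored).
ZONE_TYPES = ("NORMAL", "DMA")


def node_order(nodes_nearest_first, zones_on_node):
    """Node-based order: nearest node first, then each node's zones high to low."""
    return [(node, zone)
            for node in nodes_nearest_first
            for zone in ZONE_TYPES
            if zone in zones_on_node[node]]


def zone_order(nodes_nearest_first, zones_on_node):
    """Zone-based order: highest zone type first, then nodes nearest first."""
    return [(node, zone)
            for zone in ZONE_TYPES
            for node in nodes_nearest_first
            if zone in zones_on_node[node]]


# Hypothetical encoding of the commit message's example:
# node 0 has DMA and NORMAL, node 1 has NORMAL only.
TOPOLOGY = {0: {"DMA", "NORMAL"}, 1: {"NORMAL"}}


def show(zonelist):
    """Format a zonelist the way the commit message writes it."""
    return " -> ".join(f"node({n})'s {z}" for n, z in zonelist)


if __name__ == "__main__":
    print("node order, node(0):", show(node_order([0, 1], TOPOLOGY)))
    print("zone order, node(0):", show(zone_order([0, 1], TOPOLOGY)))
```

The only difference between the two functions is which loop is outermost: iterating nodes first keeps node(0)'s DMA ahead of remote memory, while iterating zone types first pushes every DMA zone to the tail.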
Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/kernel-parameters.txt | 5
-rw-r--r-- Documentation/sysctl/vm.txt | 45
2 files changed, 50 insertions, 0 deletions
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 62aab585d9d7..4344f69ae24a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1196,6 +1196,11 @@ and is between 256 and 4096 characters. It is defined in the file
 
 	nowb		[ARM]
 
+	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
+			one of ['zone', 'node', 'default'] can be specified
+			This can be set from sysctl after boot.
+			See Documentation/sysctl/vm.txt for details.
+
 	nr_uarts=	[SERIAL] maximum number of UARTs to be registered.
 
 	opl3=		[HW,OSS]
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 8cfca173d4bc..df3ff2095f9d 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
 - min_slab_ratio
 - panic_on_oom
 - mmap_min_address
+- numa_zonelist_order
 
 ==============================================================
 
@@ -231,3 +232,47 @@ security module. Setting this value to something like 64k will allow the
 vast majority of applications to work correctly and provide defense in depth
 against future potential kernel bugs.
 
+==============================================================
+
+numa_zonelist_order
+
+This sysctl is only for NUMA.
+Where memory is allocated from is controlled by zonelists.
+(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 to keep the explanation
+ simple. You may be able to read ZONE_DMA as ZONE_DMA32...)
+
+In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows.
+ZONE_NORMAL -> ZONE_DMA
+This means that a memory allocation request for GFP_KERNEL will
+get memory from ZONE_DMA only when ZONE_NORMAL is not available.
+
+In the NUMA case, you can think of the following 2 types of order.
+Assume a 2-node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL.
+
+(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
+(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA
+
+Type(A) offers the best locality for processes on Node(0), but ZONE_DMA
+will be used before ZONE_NORMAL is exhausted. This increases the possibility
+of out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small.
+
+Type(B) cannot offer the best locality but is more robust against OOM of
+the DMA zone.
+
+Type(A) is called "Node" order. Type(B) is called "Zone" order.
+
+"Node order" orders the zonelists by node, then by zone within each node.
+Specify "[Nn]ode" for node order.
+
+"Zone order" orders the zonelists by zone type, then by node within each
+zone. Specify "[Zz]one" for zone order.
+
+Specify "[Dd]efault" to request automatic configuration. Autoconfiguration
+will select "node" order in the following cases:
+(1) if the DMA zone does not exist, or
+(2) if the DMA zone comprises greater than 50% of the available memory, or
+(3) if any node's DMA zone comprises greater than 60% of its local memory and
+    the amount of local memory is big enough.
+
+Otherwise, "zone" order will be selected. Default order is recommended unless
+this is causing problems for your system/application.
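
The three autoconfiguration rules documented in the vm.txt hunk can be read as a small decision function. The Python sketch below is only a reading of that text, not the kernel's code. Several things are assumptions: the `Node` record, treating "available memory" as the sum of all nodes' memory, and the `min_local_bytes` parameter, which stands in for "big enough" because the documentation gives no number for it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical per-node summary used only by this sketch."""
    dma_bytes: int    # size of this node's DMA zone (0 if it has none)
    total_bytes: int  # all local memory on this node, DMA zone included


def default_order(nodes, min_local_bytes):
    """Return "node" or "zone" following rules (1) to (3) in the text.

    min_local_bytes is an assumed stand-in for "the amount of local
    memory is big enough"; the documentation does not define it.
    """
    total = sum(n.total_bytes for n in nodes)
    dma = sum(n.dma_bytes for n in nodes)

    # (1) The DMA zone does not exist.
    if dma == 0:
        return "node"
    # (2) The DMA zone is more than 50% of the available memory.
    if dma * 2 > total:
        return "node"
    # (3) Some sufficiently large node has more than 60% of its memory in DMA.
    for n in nodes:
        if n.total_bytes >= min_local_bytes and n.dma_bytes * 10 > n.total_bytes * 6:
            return "node"
    return "zone"


if __name__ == "__main__":
    small_dma = [Node(dma_bytes=16, total_bytes=1000), Node(0, 1000)]
    print(default_order(small_dma, min_local_bytes=500))
```

Integer comparisons (`dma * 2 > total`) are used instead of floating-point ratios so the thresholds are exact; the sample topology has a small DMA zone on one node, which none of the three rules match.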