Diffstat (limited to 'Documentation/memory-hotplug.txt')
-rw-r--r-- | Documentation/memory-hotplug.txt | 322 |
1 files changed, 322 insertions, 0 deletions
diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt new file mode 100644 index 000000000000..5fbcc22c98e9 --- /dev/null +++ b/Documentation/memory-hotplug.txt | |||
@@ -0,0 +1,322 @@ | |||
1 | ============== | ||
2 | Memory Hotplug | ||
3 | ============== | ||
4 | |||
5 | Last Updated: Jul 28 2007 | ||
6 | |||
7 | This document is about memory hotplug including how-to-use and current status. | ||
8 | Because Memory Hotplug is still under development, contents of this text will | ||
9 | be changed often. | ||
10 | |||
11 | 1. Introduction | ||
12 | 1.1 purpose of memory hotplug | ||
13 | 1.2. Phases of memory hotplug | ||
14 | 1.3. Unit of Memory online/offline operation | ||
15 | 2. Kernel Configuration | ||
16 | 3. sysfs files for memory hotplug | ||
17 | 4. Physical memory hot-add phase | ||
18 | 4.1 Hardware(Firmware) Support | ||
19 | 4.2 Notify memory hot-add event by hand | ||
20 | 5. Logical Memory hot-add phase | ||
21 | 5.1. State of memory | ||
22 | 5.2. How to online memory | ||
23 | 6. Logical memory remove | ||
24 | 6.1 Memory offline and ZONE_MOVABLE | ||
25 | 6.2. How to offline memory | ||
26 | 7. Physical memory remove | ||
27 | 8. Future Work | |
28 | |||
29 | Note(1): x86_64 has a special implementation for memory hotplug. | |
30 | This text does not describe it. | ||
31 | Note(2): This text assumes that sysfs is mounted at /sys. | ||
32 | |||
33 | |||
34 | --------------- | ||
35 | 1. Introduction | ||
36 | --------------- | ||
37 | |||
38 | 1.1 purpose of memory hotplug | ||
39 | ------------ | ||
40 | Memory Hotplug allows users to increase/decrease the amount of memory. | ||
41 | Generally, there are two purposes. | ||
42 | |||
43 | (A) For changing the amount of memory. | ||
44 | This is to allow a feature like capacity on demand. | ||
45 | (B) For installing/removing DIMMs or NUMA-nodes physically. | ||
46 | This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc. | ||
47 | |||
48 | (A) is required by highly virtualized environments and (B) is required by | ||
49 | hardware which supports memory power management. | ||
50 | |||
51 | Linux memory hotplug is designed for both purposes. | |
52 | |||
53 | |||
54 | 1.2. Phases of memory hotplug | ||
55 | --------------- | ||
56 | There are 2 phases in Memory Hotplug. | ||
57 | 1) Physical Memory Hotplug phase | ||
58 | 2) Logical Memory Hotplug phase. | ||
59 | |||
60 | The first phase communicates with the hardware/firmware and sets up or tears | |
61 | down the environment for hotplugged memory. This phase is necessary mainly | |
62 | for purpose (B), but it is also a good place for communication with highly | |
63 | virtualized environments. | |
64 | |||
65 | When memory is hotplugged, the kernel recognizes new memory, makes new memory | ||
66 | management tables, and makes sysfs files for new memory's operation. | ||
67 | |||
68 | If firmware supports notifying the OS that new memory has been connected, | |
69 | this phase is triggered automatically. ACPI can notify this event. If not, | |
70 | a "probe" operation by the system administrator is used instead | |
71 | (see Section 4). | |
72 | |||
73 | The Logical Memory Hotplug phase changes the memory state to | |
74 | available/unavailable for users. The amount of memory from the user's view is | |
75 | changed by this phase. When a memory range becomes available, the kernel makes | |
76 | all memory in it free pages. | |
77 | |||
78 | In this document, this phase is described as online/offline. | ||
79 | |||
80 | The Logical Memory Hotplug phase is triggered by a write to a sysfs file by the | |
81 | system administrator. For the hot-add case, it must be executed by hand after | |
82 | the Physical Memory Hotplug phase. | |
83 | (However, if you write udev hotplug scripts for memory hotplug, these | |
84 | phases can be executed seamlessly.) | |
85 | |||
86 | |||
87 | 1.3. Unit of Memory online/offline operation | ||
88 | ------------ | ||
89 | Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory | ||
90 | into chunks of the same size. The chunk is called a "section". The size of | ||
91 | a section is architecture dependent. For example, power uses 16MiB, ia64 uses | ||
92 | 1GiB. The unit of online/offline operation is "one section". (see Section 3.) | ||
93 | |||
94 | To determine the size of sections, please read this file: | ||
95 | |||
96 | /sys/devices/system/memory/block_size_bytes | ||
97 | |||
98 | This file shows the size of sections in bytes. | |
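
As a minimal sketch of reading the section size, the helper below converts the
value to MiB. It assumes that block_size_bytes reports a hexadecimal number
without a "0x" prefix; that format is an assumption not stated in this text, so
check the output on your kernel. section_size_mib is a hypothetical helper name.

```shell
# Hypothetical helper: convert a hex byte count (no "0x" prefix) to MiB.
section_size_mib() {
    # $1: hex string such as 40000000 (assumed format of block_size_bytes)
    echo $(( 0x$1 / 1024 / 1024 ))
}

f=/sys/devices/system/memory/block_size_bytes
if [ -r "$f" ]; then
    echo "section size: $(section_size_mib "$(cat "$f")") MiB"
fi
```

For the two sizes mentioned above, section_size_mib 1000000 gives 16 and
section_size_mib 40000000 gives 1024.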
99 | |||
100 | ----------------------- | ||
101 | 2. Kernel Configuration | ||
102 | ----------------------- | ||
103 | To use the memory hotplug feature, the kernel must be compiled with the | |
104 | following config options. | |
105 | |||
106 | - For all memory hotplug | ||
107 | Memory model -> Sparse Memory (CONFIG_SPARSEMEM) | ||
108 | Allow for memory hot-add (CONFIG_MEMORY_HOTPLUG) | ||
109 | |||
110 | - To enable memory removal, the following are also necessary | |
111 | Allow for memory hot remove (CONFIG_MEMORY_HOTREMOVE) | ||
112 | Page Migration (CONFIG_MIGRATION) | ||
113 | |||
114 | - For ACPI memory hotplug, the following is also necessary | |
115 | Memory hotplug (under ACPI Support menu) (CONFIG_ACPI_HOTPLUG_MEMORY) | |
116 | This option can be built as a kernel module. | |
117 | |||
118 | - As a related configuration, if your box has a feature of NUMA-node hotplug | ||
119 | via ACPI, then this option is necessary too. | ||
120 | ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu) | ||
121 | (CONFIG_ACPI_CONTAINER). | ||
122 | This option can be built as a kernel module too. | |
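
The options listed above can be checked against a kernel config file. The
following is only a sketch: check_hotplug_config is a hypothetical helper, and
the location /boot/config-$(uname -r) is a common distribution convention, not
something this text guarantees.

```shell
# Hypothetical helper: report which memory hotplug options are set to y or m.
check_hotplug_config() {
    # $1: path to a kernel config file
    for opt in CONFIG_SPARSEMEM CONFIG_MEMORY_HOTPLUG \
               CONFIG_MEMORY_HOTREMOVE CONFIG_MIGRATION; do
        if grep -q "^$opt=[ym]" "$1"; then
            echo "$opt: enabled"
        else
            echo "$opt: missing"
        fi
    done
}

# Assumed config location; adjust for your distribution.
cfg=/boot/config-$(uname -r)
if [ -r "$cfg" ]; then
    check_hotplug_config "$cfg"
fi
```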
123 | |||
124 | --------------------------------- | |
125 | 3. sysfs files for memory hotplug | |
126 | --------------------------------- | |
127 | All sections have their device information under /sys/devices/system/memory as | ||
128 | |||
129 | /sys/devices/system/memory/memoryXXX | ||
130 | (XXX is section id.) | ||
131 | |||
132 | Now, XXX is defined as start_address_of_section / section_size. | ||
133 | |||
134 | For example, assume 1GiB section size. A device for a memory starting at | ||
135 | 0x100000000 is /sys/devices/system/memory/memory4 | |
136 | (0x100000000 / 1GiB = 4) | |
137 | This device covers address range [0x100000000 ... 0x140000000) | ||
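
The arithmetic in this example can be reproduced with shell arithmetic. This
is a small sketch; section_id and section_range are hypothetical helper names
used only for illustration.

```shell
# Hypothetical helpers reproducing the section id arithmetic above.
section_id() {
    # $1: start address, $2: section size (hex with 0x prefix is accepted)
    echo $(( $1 / $2 ))
}

section_range() {
    # $1: section id, $2: section size
    printf '[0x%x ... 0x%x)\n' $(( $1 * $2 )) $(( ($1 + 1) * $2 ))
}

section_id 0x100000000 0x40000000      # prints 4
section_range 4 0x40000000             # prints [0x100000000 ... 0x140000000)
```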
138 | |||
139 | Under each section, you can see 3 files. | ||
140 | |||
141 | /sys/devices/system/memory/memoryXXX/phys_index | ||
142 | /sys/devices/system/memory/memoryXXX/phys_device | ||
143 | /sys/devices/system/memory/memoryXXX/state | ||
144 | |||
145 | 'phys_index' : read-only and contains section id, same as XXX. | ||
146 | 'state' : read-write | ||
147 | at read: contains online/offline state of memory. | ||
148 | at write: user can specify "online", "offline" command | ||
149 | 'phys_device': read-only: designed to show the name of physical memory device. | ||
150 | This is not well implemented now. | ||
151 | |||
152 | NOTE: | ||
153 | These directories/files appear after physical memory hotplug phase. | ||
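
To get an overview of all sections, the 'state' files can be read in a loop.
The sketch below uses a hypothetical helper, list_memory_states, whose optional
argument (an alternative base directory) exists only to make the helper easy to
try outside of sysfs.

```shell
# Hypothetical helper: print "memoryXXX: state" for every section directory.
list_memory_states() {
    # $1: base directory (optional, defaults to the sysfs location)
    base=${1:-/sys/devices/system/memory}
    for dir in "$base"/memory*; do
        # Skip entries without a readable state file (or an unmatched glob).
        [ -r "$dir/state" ] || continue
        printf '%s: %s\n' "${dir##*/}" "$(cat "$dir/state")"
    done
}

list_memory_states
```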
154 | |||
155 | |||
156 | -------------------------------- | ||
157 | 4. Physical memory hot-add phase | ||
158 | -------------------------------- | ||
159 | |||
160 | 4.1 Hardware(Firmware) Support | ||
161 | ------------ | ||
162 | On x86_64/ia64 platforms, memory hotplug by ACPI is supported. | |
163 | |||
164 | In general, firmware (ACPI) which supports memory hotplug defines a | |
165 | memory class object with _HID "PNP0C80". When a notification is asserted to | |
166 | PNP0C80, Linux's ACPI handler hot-adds memory to the system and calls a hotplug | |
167 | udev script. This is done automatically. | |
168 | |||
169 | However, scripts for memory hotplug are not yet included in the generic udev | |
170 | package. You may have to write one yourself or online/offline memory by hand. | |
171 | Please see "How to online memory", "How to offline memory" in this text. | ||
172 | |||
173 | If firmware supports NUMA-node hotplug and defines an object with _HID "ACPI0004", | |
174 | "PNP0A05", or "PNP0A06", a notification is asserted to that object, and the ACPI | |
175 | handler calls hotplug code for all objects defined under it. | |
176 | If a memory device is found, memory hotplug code will be called. | |
177 | |||
178 | |||
179 | 4.2 Notify memory hot-add event by hand | ||
180 | ------------ | ||
181 | In some environments, especially virtualized environments, firmware will not | |
182 | notify the kernel of memory hotplug events. For such environments, a "probe" | |
183 | interface is supported. This interface depends on CONFIG_ARCH_MEMORY_PROBE. | |
184 | |||
185 | Currently, CONFIG_ARCH_MEMORY_PROBE is supported only by powerpc, but the | |
186 | interface does not contain highly architecture-specific code. Please add the | |
187 | config to your architecture if you need the "probe" interface. | |
188 | |||
189 | Probe interface is located at | ||
190 | /sys/devices/system/memory/probe | ||
191 | |||
192 | You can tell the physical address of new memory to the kernel by | ||
193 | |||
194 | % echo start_address_of_new_memory > /sys/devices/system/memory/probe | ||
195 | |||
196 | Then, [start_address_of_new_memory, start_address_of_new_memory + section_size) | ||
197 | memory range is hot-added. In this case, hotplug script is not called (in | ||
198 | current implementation). You'll have to online memory by yourself. | ||
199 | Please see "How to online memory" in this text. | ||
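
The probe step can be wrapped in a small helper. This is a sketch only:
probe_memory is a hypothetical name, and the check that the address is a
multiple of the section size is an assumption added for safety, not a rule
stated in this text. Writing to the real probe file needs root privileges.

```shell
# Hypothetical helper: write a start address to the probe file.
probe_memory() {
    # $1: start address, $2: section size, $3: probe file (optional)
    probe=${3:-/sys/devices/system/memory/probe}
    # Assumption: refuse addresses that are not a multiple of the section size.
    if [ $(( $1 % $2 )) -ne 0 ]; then
        echo "address $1 is not section-aligned" >&2
        return 1
    fi
    printf '0x%x\n' $(( $1 )) > "$probe"
}

# Example (not run here; needs root and CONFIG_ARCH_MEMORY_PROBE):
#   probe_memory 0x100000000 0x40000000
```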
200 | |||
201 | |||
202 | |||
203 | ------------------------------ | ||
204 | 5. Logical Memory hot-add phase | ||
205 | ------------------------------ | ||
206 | |||
207 | 5.1. State of memory | ||
208 | ------------ | ||
209 | To see (online/offline) state of memory section, read 'state' file. | ||
210 | |||
211 | % cat /sys/devices/system/memory/memoryXXX/state | |
212 | |||
213 | |||
214 | If the memory section is online, you'll read "online". | ||
215 | If the memory section is offline, you'll read "offline". | ||
216 | |||
217 | |||
218 | 5.2. How to online memory | ||
219 | ------------ | ||
220 | Even after memory is hot-added, it is not in a ready-to-use state. | |
221 | To use newly added memory, you have to "online" the memory section. | |
222 | |||
223 | For onlining, you have to write "online" to the section's state file as: | ||
224 | |||
225 | % echo online > /sys/devices/system/memory/memoryXXX/state | ||
226 | |||
227 | After this, section memoryXXX's state will be 'online' and the amount of | ||
228 | available memory will be increased. | ||
229 | |||
230 | Currently, newly added memory is added as ZONE_NORMAL (for powerpc, ZONE_DMA). | ||
231 | This may be changed in future. | ||
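
The online step and the check of the resulting state can be combined. The
sketch below uses a hypothetical helper, online_section; its optional second
argument (an alternative base directory) is only there so the helper can be
tried without touching real memory sections.

```shell
# Hypothetical helper: online one section and confirm the resulting state.
online_section() {
    # $1: section id (the XXX in memoryXXX), $2: base directory (optional)
    state=${2:-/sys/devices/system/memory}/memory$1/state
    echo online > "$state" || return 1
    # Succeeds only if the state file now reads "online".
    [ "$(cat "$state")" = "online" ]
}

# Example (not run here; needs root):
#   online_section 4 && echo "memory4 is online"
```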
232 | |||
233 | |||
234 | |||
235 | ------------------------ | ||
236 | 6. Logical memory remove | ||
237 | ------------------------ | ||
238 | |||
239 | 6.1 Memory offline and ZONE_MOVABLE | ||
240 | ------------ | ||
241 | Memory offlining is more complicated than memory online. Because memory offline | ||
242 | has to make the whole memory section be unused, memory offline can fail if | ||
243 | the section includes memory which cannot be freed. | ||
244 | |||
245 | In general, memory offline can use 2 techniques. | ||
246 | |||
247 | (1) reclaim and free all memory in the section. | ||
248 | (2) migrate all pages in the section. | ||
249 | |||
250 | In the current implementation, Linux's memory offline uses method (2), freeing | ||
251 | all pages in the section by page migration. But not all pages are | ||
252 | migratable. Under current Linux, migratable pages are anonymous pages and | ||
253 | page caches. For offlining a section by migration, the kernel has to guarantee | ||
254 | that the section contains only migratable pages. | ||
255 | |||
256 | Now, a boot option for making a section which consists of migratable pages is | ||
257 | supported. By specifying "kernelcore=" or "movablecore=" boot option, you can | ||
258 | create ZONE_MOVABLE...a zone which is just used for movable pages. | ||
259 | (See also Documentation/kernel-parameters.txt) | ||
260 | |||
261 | Assuming the system has "TOTAL" amount of memory at boot time, this boot option | |
262 | creates ZONE_MOVABLE as follows. | |
263 | |||
264 | 1) When kernelcore=YYYY boot option is used, | ||
265 | Size of memory not for movable pages (not for offline) is YYYY. | ||
266 | Size of memory for movable pages (for offline) is TOTAL-YYYY. | ||
267 | |||
268 | 2) When movablecore=ZZZZ boot option is used, | ||
269 | Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ. | ||
270 | Size of memory for movable pages (for offline) is ZZZZ. | ||
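
The two cases above are simple subtraction, shown here as a sketch. zone_split
is a hypothetical helper and the sizes are made-up values in MiB; any rounding
or per-node distribution the kernel may apply is not modeled.

```shell
# Hypothetical helper: sizes (in MiB) resulting from either boot option.
zone_split() {
    # $1: option name, $2: option value in MiB, $3: TOTAL in MiB
    case $1 in
        kernelcore)  echo "unmovable=$2 movable=$(( $3 - $2 ))" ;;
        movablecore) echo "unmovable=$(( $3 - $2 )) movable=$2" ;;
        *) return 1 ;;
    esac
}

zone_split kernelcore 2048 8192     # prints unmovable=2048 movable=6144
zone_split movablecore 2048 8192    # prints unmovable=6144 movable=2048
```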
271 | |||
272 | |||
273 | Note) Unfortunately, there is no information to show which section belongs | ||
274 | to ZONE_MOVABLE. This is TBD. | ||
275 | |||
276 | |||
277 | 6.2. How to offline memory | ||
278 | ------------ | ||
279 | You can offline a section by using the same sysfs interface that was used in | ||
280 | memory onlining. | ||
281 | |||
282 | % echo offline > /sys/devices/system/memory/memoryXXX/state | ||
283 | |||
284 | If offline succeeds, the state of the memory section is changed to be "offline". | ||
285 | If it fails, some error code (like -EBUSY) will be returned by the kernel. | |
286 | Even if a section does not belong to ZONE_MOVABLE, you can try to offline it. | ||
287 | If it doesn't contain 'unmovable' memory, you'll get success. | ||
288 | |||
289 | A section under ZONE_MOVABLE is considered to be able to be offlined easily. | ||
290 | But under some busy state, it may return -EBUSY. Even if a memory section | ||
291 | cannot be offlined due to -EBUSY, you can retry offlining it and may be able to | ||
292 | offline it (or not). | ||
293 | (For example, a page is referred to by some kernel internal call and released | ||
294 | soon.) | ||
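
Since a failed offline can succeed on a later attempt, a retry loop is one way
to handle -EBUSY. This is a sketch under assumptions: offline_section is a
hypothetical helper, the one-second pause between attempts is an arbitrary
choice, and any write failure is treated as a reason to retry.

```shell
# Hypothetical helper: try to offline a section, retrying on failure.
offline_section() {
    # $1: section id, $2: maximum attempts, $3: base directory (optional)
    state=${3:-/sys/devices/system/memory}/memory$1/state
    attempt=1
    while :; do
        # The write fails (for example with EBUSY) if offlining fails.
        if echo offline > "$state" 2>/dev/null; then
            return 0
        fi
        [ "$attempt" -ge "$2" ] && return 1
        attempt=$(( attempt + 1 ))
        sleep 1   # arbitrary pause before the next attempt
    done
}

# Example (not run here; needs root):
#   offline_section 4 5 || echo "memory4 could not be offlined"
```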
295 | |||
296 | Consideration: | ||
297 | Memory hotplug's design direction is to make the possibility of memory offlining | ||
298 | higher and to guarantee unplugging memory under any situation. But it needs | ||
299 | more work. Returning -EBUSY under some situations may be good because the user | |
300 | can decide whether or not to retry. Currently, memory offlining code | |
301 | retries a number of times with a 120-second timeout. | |
302 | |||
303 | ------------------------- | ||
304 | 7. Physical memory remove | ||
305 | ------------------------- | ||
306 | More implementation is needed: | |
307 | - Notifying the firmware when the OS has completed its removal work. | |
308 | - Guarding against physical removal before that work is complete. | |
309 | |||
310 | -------------- | ||
311 | 8. Future Work | ||
312 | -------------- | ||
313 | - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like | ||
314 | sysctl or new control file. | ||
315 | - showing memory section and physical device relationship. | ||
316 | - showing memory section and node relationship (maybe good for NUMA) | ||
317 | - showing whether a memory section is under ZONE_MOVABLE | |
318 | - test and improve memory offlining. | |
319 | - support HugeTLB page migration and offlining. | ||
320 | - memmap removing at memory offline. | ||
321 | - physical remove memory. | ||
322 | |||