Diffstat (limited to 'Documentation/powerpc/firmware-assisted-dump.txt')
-rw-r--r-- | Documentation/powerpc/firmware-assisted-dump.txt | 270 |
1 files changed, 270 insertions, 0 deletions
diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt new file mode 100644 index 000000000000..3007bc98af28 --- /dev/null +++ b/Documentation/powerpc/firmware-assisted-dump.txt | |||
@@ -0,0 +1,270 @@ | |||
1 | |||
2 | Firmware-Assisted Dump | ||
3 | ------------------------ | ||
4 | July 2011 | ||
5 | |||
6 | The goal of firmware-assisted dump is to enable the dump of | ||
7 | a crashed system, and to do so from a fully-reset system, and | ||
8 | to minimize the total elapsed time until the system is back | ||
9 | in production use. | ||
10 | |||
11 | - Firmware assisted dump (fadump) infrastructure is intended to replace | ||
12 | the existing phyp assisted dump. | ||
13 | - Fadump uses the same firmware interfaces and memory reservation model | ||
14 | as phyp assisted dump. | ||
15 | - Unlike phyp dump, fadump exports the memory dump through /proc/vmcore | ||
16 | in the ELF format in the same way as kdump. This helps us reuse the | ||
17 | kdump infrastructure for dump capture and filtering. | ||
18 | - Unlike phyp dump, userspace tools do not need to refer to any sysfs | ||
19 | interface while reading /proc/vmcore. | ||
20 | - Unlike phyp dump, fadump allows the user to release all the memory reserved | ||
21 | for dump with a single operation: echo 1 > /sys/kernel/fadump_release_mem. | ||
22 | - Once enabled through a kernel boot parameter, fadump can be | ||
23 | started/stopped through the /sys/kernel/fadump_registered interface (see | ||
24 | sysfs files section below) and can be easily integrated with kdump | ||
25 | service start/stop init scripts. | ||
26 | |||
27 | Compared with kdump or other strategies, firmware-assisted | ||
28 | dump offers several strong, practical advantages: | ||
29 | |||
30 | -- Unlike kdump, the system has been reset, and loaded | ||
31 | with a fresh copy of the kernel. In particular, | ||
32 | PCI and I/O devices have been reinitialized and are | ||
33 | in a clean, consistent state. | ||
34 | -- Once the dump is copied out, the memory that held the dump | ||
35 | is immediately available to the running kernel. Therefore, | ||
36 | unlike kdump, fadump doesn't need a 2nd reboot to get the | ||
37 | system back to the production configuration. | ||
38 | |||
39 | The above can only be accomplished by coordination with, | ||
40 | and assistance from the Power firmware. The procedure is | ||
41 | as follows: | ||
42 | |||
43 | -- The first kernel registers the sections of memory with the | ||
44 | Power firmware for dump preservation during OS initialization. | ||
45 | These registered sections of memory are reserved by the first | ||
46 | kernel during early boot. | ||
47 | |||
48 | -- When a system crashes, the Power firmware will save | ||
49 | the low memory (boot memory, whose size is the larger of 5% of | ||
50 | system RAM or 256MB) to the previously registered region. It will | ||
51 | also save system registers and hardware PTEs. | ||
52 | |||
53 | NOTE: The term 'boot memory' means the size of the low memory chunk | ||
54 | that is required for a kernel to boot successfully when | ||
55 | booted with restricted memory. By default, the boot memory | ||
56 | size will be the larger of 5% of system RAM or 256MB. | ||
57 | Alternatively, the user can specify the boot memory size | ||
58 | through the boot parameter 'fadump_reserve_mem=', which will | ||
59 | override the default calculated size. Use this option | ||
60 | if the default boot memory size is not sufficient for the | ||
61 | second kernel to boot successfully. | ||
62 | |||
63 | -- After the low memory (boot memory) area has been saved, the | ||
64 | firmware will reset PCI and other hardware state. It will | ||
65 | *not* clear the RAM. It will then launch the bootloader, as | ||
66 | normal. | ||
67 | |||
68 | -- The freshly booted kernel will notice that there is a new | ||
69 | node (ibm,dump-kernel) in the device tree, indicating that | ||
70 | there is crash data available from a previous boot. During | ||
71 | early boot, the OS will reserve the rest of the memory above | ||
72 | boot memory size, effectively booting with a restricted memory | ||
73 | size. This ensures that the second kernel will not | ||
74 | touch any of the dump memory area. | ||
75 | |||
76 | -- User-space tools will read /proc/vmcore to obtain the contents | ||
77 | of memory, which holds the previously crashed kernel's dump in ELF | ||
78 | format. The userspace tools may copy this info to disk, or | ||
79 | network, NAS, SAN, iSCSI, etc. as desired. | ||
80 | |||
81 | -- Once the userspace tool is done saving the dump, it will echo | ||
82 | '1' to /sys/kernel/fadump_release_mem to release the reserved | ||
83 | memory back to general use, except the memory required for | ||
84 | the next firmware-assisted dump registration. | ||
85 | |||
86 | e.g. | ||
87 | # echo 1 > /sys/kernel/fadump_release_mem | ||
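
The last two steps of the procedure above can be sketched as a small shell
function. This is only a minimal sketch, not what the kdump scripts actually
do: the function name fadump_save_and_release and the VMCORE and SYSFS_DIR
override variables are hypothetical, introduced so the function can be tried
against ordinary files. The default paths are the ones named in this document.

```shell
#!/bin/sh
# Sketch: copy the dump out of /proc/vmcore, then release the reserved memory.
# VMCORE and SYSFS_DIR are hypothetical overrides for experimentation.
fadump_save_and_release() {
    dest="$1"
    vmcore="${VMCORE:-/proc/vmcore}"
    release="${SYSFS_DIR:-/sys/kernel}/fadump_release_mem"

    # fadump_release_mem exists only when a dump is waiting in the second kernel.
    [ -e "$release" ] || { echo "no fadump dump to save" >&2; return 1; }

    # Save the ELF-format dump before giving the memory back.
    cp "$vmcore" "$dest" || return 1

    # Release the reserved dump memory back to general use.
    echo 1 > "$release"
}
```

A caller would run, for example, fadump_save_and_release /var/crash/vmcore
(the destination path here is only an illustration). Copying before releasing
matters because the dump contents are no longer preserved once the memory is
handed back.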
88 | |||
89 | Please note that the firmware-assisted dump feature | ||
90 | is only available on Power6 and above systems with recent | ||
91 | firmware versions. | ||
92 | |||
93 | Implementation details: | ||
94 | ---------------------- | ||
95 | |||
96 | During boot, a check is made to see if firmware supports | ||
97 | this feature on that particular machine. If it does, then | ||
98 | we check to see if an active dump is waiting for us. If so, | ||
99 | everything but the boot memory size of RAM is reserved during | ||
100 | early boot (See Fig. 2). This area is released once we finish | ||
101 | collecting the dump from the user land scripts (e.g. kdump scripts) | ||
102 | that are run. If there is dump data, then the | ||
103 | /sys/kernel/fadump_release_mem file is created, and the reserved | ||
104 | memory is held. | ||
105 | |||
106 | If there is no waiting dump data, then only the memory required | ||
107 | to hold CPU state, HPTE region, boot memory dump and elfcore | ||
108 | header is reserved at the top of memory (see Fig. 1). This area | ||
109 | is *not* released: this region will be kept permanently reserved, | ||
110 | so that it can act as a receptacle for a copy of the boot memory | ||
111 | content in addition to CPU state and HPTE region, in case a | ||
112 | crash does occur. | ||
113 | |||
114 | o Memory Reservation during first kernel | ||
115 | |||
116 |   Low memory                                        Top of memory | ||
117 |   0      boot memory size                                       | | ||
118 |   |           |                       |<--Reserved dump area -->| | ||
119 |   V           V                       |   Permanent Reservation V | ||
120 |   +-----------+----------/ /----------+---+----+-----------+----+ | ||
121 |   |           |                       |CPU|HPTE|  DUMP     |ELF | | ||
122 |   +-----------+----------/ /----------+---+----+-----------+----+ | ||
123 |         |                                           ^ | ||
124 |         |                                           | | ||
125 |         \                                           / | ||
126 |          ------------------------------------------- | ||
127 |            Boot memory content gets transferred to | ||
128 |            reserved area by firmware at the time of | ||
129 |            crash | ||
130 |                       Fig. 1 | ||
131 | |||
132 | o Memory Reservation during second kernel after crash | ||
133 | |||
134 |   Low memory                                        Top of memory | ||
135 |   0      boot memory size                                       | | ||
136 |   |           |<------------- Reserved dump area ----------- -->| | ||
137 |   V           V                                                 V | ||
138 |   +-----------+----------/ /----------+---+----+-----------+----+ | ||
139 |   |           |                       |CPU|HPTE|  DUMP     |ELF | | ||
140 |   +-----------+----------/ /----------+---+----+-----------+----+ | ||
141 |         |                                                    | | ||
142 |         V                                                    V | ||
143 |   Used by second                                     /proc/vmcore | ||
144 |   kernel to boot | ||
145 |                       Fig. 2 | ||
146 | |||
147 | Currently the dump will be copied from /proc/vmcore to | ||
148 | a new file upon user intervention. The dump data available through | ||
149 | /proc/vmcore will be in ELF format. Hence the existing kdump | ||
150 | infrastructure (kdump scripts) to save the dump works fine with | ||
151 | minor modifications. | ||
152 | |||
153 | The tools to examine the dump will be the same as the ones | ||
154 | used for kdump. | ||
155 | |||
156 | How to enable firmware-assisted dump (fadump): | ||
157 | ------------------------------------- | ||
158 | |||
159 | 1. Set config option CONFIG_FA_DUMP=y and build kernel. | ||
160 | 2. Boot into the Linux kernel with the 'fadump=on' kernel cmdline option. | ||
161 | 3. Optionally, the user can also set the 'fadump_reserve_mem=' kernel | ||
162 | cmdline option to specify the size of the memory to reserve for boot | ||
163 | memory dump preservation. | ||
164 | |||
165 | NOTE: If firmware-assisted dump fails to reserve memory then it will | ||
166 | fall back to the existing kdump mechanism if the 'crashkernel=' option | ||
167 | is set on the kernel cmdline. | ||
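
The boot options and the default size rule stated earlier (the larger of 5%
of system RAM or 256MB) can be illustrated with two small shell helpers. This
is a sketch under stated assumptions: both function names are hypothetical,
the size helper works on a byte count passed by the caller, and it ignores
any rounding or alignment the kernel may apply to the real reservation.

```shell
#!/bin/sh
# Print the default boot memory size in bytes for a given amount of RAM
# (in bytes): the larger of 5% of RAM or 256MB. Illustration only.
fadump_default_boot_mem() {
    total="$1"
    five_pct=$(( total / 20 ))
    min=$(( 256 * 1024 * 1024 ))
    if [ "$five_pct" -gt "$min" ]; then
        echo "$five_pct"
    else
        echo "$min"
    fi
}

# Succeed if the given kernel command line string contains the
# space-separated word fadump=on.
cmdline_has_fadump_on() {
    case " $1 " in
        *" fadump=on "*) return 0 ;;
    esac
    return 1
}
```

On a running system the second helper could be fed the kernel command line,
for example cmdline_has_fadump_on "$(cat /proc/cmdline)".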
168 | |||
169 | Sysfs/debugfs files: | ||
170 | ------------ | ||
171 | |||
172 | The firmware-assisted dump feature uses the sysfs file system to hold | ||
173 | the control files and a debugfs file to display the reserved memory regions. | ||
174 | |||
175 | Here is the list of files under kernel sysfs: | ||
176 | |||
177 | /sys/kernel/fadump_enabled | ||
178 | |||
179 | This is used to display the fadump status. | ||
180 | 0 = fadump is disabled | ||
181 | 1 = fadump is enabled | ||
182 | |||
183 | This interface can be used by kdump init scripts to identify if | ||
184 | fadump is enabled in the kernel and act accordingly. | ||
185 | |||
186 | /sys/kernel/fadump_registered | ||
187 | |||
188 | This is used to display the fadump registration status as well | ||
189 | as to control (start/stop) the fadump registration. | ||
190 | 0 = fadump is not registered. | ||
191 | 1 = fadump is registered and ready to handle system crash. | ||
192 | |||
193 | To register fadump, echo 1 > /sys/kernel/fadump_registered. To | ||
194 | un-register and stop fadump, echo 0 > /sys/kernel/fadump_registered. | ||
195 | Once fadump is un-registered, a system crash will not | ||
196 | be handled and vmcore will not be captured. This interface can be | ||
197 | easily integrated with kdump service start/stop. | ||
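
As an illustration of how an init script might drive these two files, here is
a minimal sketch. The function name fadump_ctl and the SYSFS_DIR override are
hypothetical and exist only for this example; the file names and the 0/1
values are the ones described above.

```shell
#!/bin/sh
# Sketch of a start/stop helper for a kdump-style init script.
# SYSFS_DIR is a hypothetical override; the default is /sys/kernel.
fadump_ctl() {
    dir="${SYSFS_DIR:-/sys/kernel}"

    # Act only if fadump was enabled at boot (fadump_enabled reads 1).
    [ "$(cat "$dir/fadump_enabled" 2>/dev/null)" = "1" ] || return 1

    case "$1" in
        start) echo 1 > "$dir/fadump_registered" ;;  # register with firmware
        stop)  echo 0 > "$dir/fadump_registered" ;;  # un-register
        *)     return 2 ;;                           # unknown action
    esac
}
```

A return value of 1 tells the caller that fadump is not enabled, so the
script can act accordingly, for instance by using plain kdump instead.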
198 | |||
199 | /sys/kernel/fadump_release_mem | ||
200 | |||
201 | This file is available only when fadump is active during the | ||
202 | second kernel. It is used to release the reserved memory | ||
203 | regions that are held for saving the crash dump. To release the | ||
204 | reserved memory echo 1 to it: | ||
205 | |||
206 | echo 1 > /sys/kernel/fadump_release_mem | ||
207 | |||
208 | After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region | ||
209 | file will change to reflect the new memory reservations. | ||
210 | |||
211 | The existing userspace tools (kdump infrastructure) can be easily | ||
212 | enhanced to use this interface to release the memory reserved for | ||
213 | dump and continue without 2nd reboot. | ||
214 | |||
215 | Here is the list of files under powerpc debugfs: | ||
216 | (Assuming debugfs is mounted on /sys/kernel/debug directory.) | ||
217 | |||
218 | /sys/kernel/debug/powerpc/fadump_region | ||
219 | |||
220 | This file shows the reserved memory regions if fadump is | ||
221 | enabled; otherwise this file is empty. The output format | ||
222 | is: | ||
223 | <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size> | ||
224 | |||
225 | e.g. | ||
226 | Contents when fadump is registered during first kernel | ||
227 | |||
228 | # cat /sys/kernel/debug/powerpc/fadump_region | ||
229 | CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0 | ||
230 | HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0 | ||
231 | DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0 | ||
232 | |||
233 | Contents when fadump is active during second kernel | ||
234 | |||
235 | # cat /sys/kernel/debug/powerpc/fadump_region | ||
236 | CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020 | ||
237 | HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000 | ||
238 | DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000 | ||
239 | : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000 | ||
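
The output format above can be read mechanically. The following sketch (the
function name check_fadump_regions is hypothetical) reads fadump_region
lines on standard input and checks that each reserved size equals
end - start + 1, which holds for every line in the examples above. It
assumes the hexadecimal values fit in the shell's integer arithmetic.

```shell
#!/bin/sh
# Read fadump_region lines on stdin and verify that, for each region,
# <reserved-size> == <end> - <start> + 1. Returns 1 on any mismatch.
check_fadump_regions() {
    bad=0
    while IFS= read -r line; do
        # Skip lines that carry no [<start>-<end>] range.
        case "$line" in
            *\[*-*\]*) ;;
            *) continue ;;
        esac
        range=${line#*\[}          # drop everything up to '['
        range=${range%%\]*}        # drop everything from ']'
        start=${range%-*}
        end=${range#*-}
        rest=${line#*\] }          # text after "] "
        size=${rest%% *}           # first word is the reserved size
        if [ $(( $end - $start + 1 )) -ne $(( $size )) ]; then
            echo "size mismatch: $line" >&2
            bad=1
        fi
    done
    return $bad
}
```

It could be used as
check_fadump_regions < /sys/kernel/debug/powerpc/fadump_region.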
240 | |||
241 | NOTE: Please refer to Documentation/filesystems/debugfs.txt on | ||
242 | how to mount the debugfs filesystem. | ||
243 | |||
244 | |||
245 | TODO: | ||
246 | ----- | ||
247 | o Need to come up with a better approach to find a more | ||
248 | accurate boot memory size that is required for a kernel to | ||
249 | boot successfully when booted with restricted memory. | ||
250 | o The fadump implementation introduces a fadump crash info structure | ||
251 | in the scratch area before the ELF core header. The idea of introducing | ||
252 | this structure is to pass some important crash info data to the second | ||
253 | kernel, which will help the second kernel populate the ELF core header | ||
254 | with correct data before it gets exported through /proc/vmcore. The | ||
255 | current design does not address the possibility of introducing | ||
256 | additional fields (in future) to this structure without affecting | ||
257 | compatibility. Need to come up with a better approach to address this. | ||
258 | The possible approaches are: | ||
259 | 1. Introduce version field for version tracking, bump up the version | ||
260 | whenever a new field is added to the structure in future. The version | ||
261 | field can be used to find out what fields are valid for the current | ||
262 | version of the structure. | ||
263 | 2. Reserve the area of predefined size (say PAGE_SIZE) for this | ||
264 | structure and have unused area as reserved (initialized to zero) | ||
265 | for future field additions. | ||
266 | The advantage of approach 1 over 2 is we don't need to reserve extra space. | ||
267 | --- | ||
268 | Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> | ||
269 | This document is based on the original documentation written for phyp | ||
270 | assisted dump by Linas Vepstas and Manish Ahuja. | ||