Diffstat (limited to 'Documentation/powerpc/eeh-pci-error-recovery.txt')
-rw-r--r-- | Documentation/powerpc/eeh-pci-error-recovery.txt | 332 |
1 files changed, 332 insertions, 0 deletions
diff --git a/Documentation/powerpc/eeh-pci-error-recovery.txt b/Documentation/powerpc/eeh-pci-error-recovery.txt
new file mode 100644
index 000000000000..2bfe71beec5b
--- /dev/null
+++ b/Documentation/powerpc/eeh-pci-error-recovery.txt
@@ -0,0 +1,332 @@
1 | |||
2 | |||
3 | PCI Bus EEH Error Recovery | ||
4 | -------------------------- | ||
5 | Linas Vepstas | ||
6 | <linas@austin.ibm.com> | ||
7 | 12 January 2005 | ||
8 | |||
9 | |||
10 | Overview: | ||
11 | --------- | ||
12 | The IBM POWER-based pSeries and iSeries computers include PCI bus | ||
13 | controller chips that have extended capabilities for detecting and | ||
14 | reporting a large variety of PCI bus error conditions. These features | ||
15 | go under the name of "EEH", for "Extended Error Handling". The EEH | ||
16 | hardware features allow PCI bus errors to be cleared and a PCI | ||
17 | card to be "rebooted", without also having to reboot the operating | ||
18 | system. | ||
19 | |||
20 | This is in contrast to traditional PCI error handling, where the | ||
21 | PCI chip is wired directly to the CPU, and an error would cause | ||
22 | a CPU machine-check/check-stop condition, halting the CPU entirely. | ||
23 | Another "traditional" technique is to ignore such errors, which | ||
24 | can lead to data corruption, of both user data and kernel data, | ||
25 | hung/unresponsive adapters, or system crashes/lockups. Thus, | ||
26 | the idea behind EEH is that the operating system can become more | ||
27 | reliable and robust by protecting it from PCI errors, and giving | ||
28 | the OS the ability to "reboot"/recover individual PCI devices. | ||
29 | |||
30 | Future systems from other vendors, based on the PCI-E specification, | ||
31 | may contain similar features. | ||
32 | |||
33 | |||
34 | Causes of EEH Errors | ||
35 | -------------------- | ||
36 | EEH was originally designed to guard against hardware failure, such | ||
37 | as PCI cards dying from heat, humidity, dust, vibration and bad | ||
38 | electrical connections. The vast majority of EEH errors seen in | ||
39 | "real life" are due to either poorly seated PCI cards or, | ||
40 | unfortunately quite commonly, device driver bugs, device firmware | ||
41 | bugs, and sometimes PCI card hardware bugs. | ||
42 | |||
43 | The most common software bug is one that causes the device to | ||
44 | attempt to DMA to a location in system memory that has not been | ||
45 | reserved for DMA access for that card. Catching this is a powerful | ||
46 | feature, as it prevents what would otherwise have been silent memory | ||
47 | corruption caused by the bad DMA. A number of device driver | ||
48 | bugs have been found and fixed in this way over the past few | ||
49 | years. Other possible causes of EEH errors include data or | ||
50 | address line parity errors (for example, due to poor electrical | ||
51 | connectivity due to a poorly seated card), and PCI-X split-completion | ||
52 | errors (due to software, device firmware, or device PCI hardware bugs). | ||
53 | The vast majority of "true hardware failures" can be cured by | ||
54 | physically removing and re-seating the PCI card. | ||
55 | |||
56 | |||
57 | Detection and Recovery | ||
58 | ---------------------- | ||
59 | In the following discussion, a generic overview of how to detect | ||
60 | and recover from EEH errors will be presented. This is followed | ||
61 | by an overview of how the current implementation in the Linux | ||
62 | kernel does it. The actual implementation is subject to change, | ||
63 | and some of the finer points are still being debated. These | ||
64 | may in turn be swayed if or when other architectures implement | ||
65 | similar functionality. | ||
66 | |||
67 | When a PCI Host Bridge (PHB, the bus controller connecting the | ||
68 | PCI bus to the system CPU electronics complex) detects a PCI error | ||
69 | condition, it will "isolate" the affected PCI card. Isolation | ||
70 | will block all writes (either to the card from the system, or | ||
71 | from the card to the system), and it will cause all reads to | ||
72 | return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads). | ||
73 | This value was chosen because it is the same value you would | ||
74 | get if the device was physically unplugged from the slot. | ||
75 | This includes access to PCI memory, I/O space, and PCI config | ||
76 | space. Interrupts, however, will continue to be delivered. | ||
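
The isolation behaviour described above can be modelled in a few lines
of user-space C. This is only a sketch under stated assumptions: the
names slot_model, slot_write32(), slot_read32() and is_all_ones() are
hypothetical and do not exist in the kernel; they merely illustrate
"writes are dropped, reads return all ones".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one PCI slot; not a kernel structure. */
struct slot_model {
    bool     isolated;   /* set when the PHB has frozen the slot */
    uint32_t reg;        /* a single device register, for illustration */
};

/* Writes are silently dropped while the slot is isolated. */
static void slot_write32(struct slot_model *s, uint32_t val)
{
    if (!s->isolated)
        s->reg = val;
}

/* Reads return all ones while the slot is isolated. */
static uint32_t slot_read32(const struct slot_model *s)
{
    return s->isolated ? UINT32_MAX : s->reg;
}

/* All-ones test for an access of the given width in bytes (1, 2 or 4). */
static bool is_all_ones(uint32_t val, unsigned int width)
{
    uint32_t mask = (width >= 4) ? UINT32_MAX : ((1u << (8 * width)) - 1u);
    return (val & mask) == mask;
}
```

Note that an all-ones value alone does not prove isolation; a healthy
device may legitimately return it, which is why the firmware has to be
asked for confirmation.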
77 | |||
78 | Detection and recovery are performed with the aid of ppc64 | ||
79 | firmware. The programming interfaces in the Linux kernel | ||
80 | into the firmware are referred to as RTAS (Run-Time Abstraction | ||
81 | Services). The Linux kernel does not (should not) access | ||
82 | the EEH function in the PCI chipsets directly, primarily because | ||
83 | there are a number of different chipsets out there, each with | ||
84 | different interfaces and quirks. The firmware provides a | ||
85 | uniform abstraction layer that will work with all pSeries | ||
86 | and iSeries hardware (and be forwards-compatible). | ||
87 | |||
88 | If the OS or device driver suspects that a PCI slot has been | ||
89 | EEH-isolated, there is a firmware call it can make to determine if | ||
90 | this is the case. If so, then the device driver should put itself | ||
91 | into a consistent state (given that it won't be able to complete any | ||
92 | pending work) and start recovery of the card. Recovery normally | ||
93 | would consist of resetting the PCI device (holding the PCI #RST | ||
94 | line high for two seconds), followed by setting up the device | ||
95 | config space (the base address registers (BAR's), latency timer, | ||
96 | cache line size, interrupt line, and so on). This is followed by a | ||
97 | reinitialization of the device driver. In a worst-case scenario, | ||
98 | the power to the card can be toggled, at least on hot-plug-capable | ||
99 | slots. In principle, layers far above the device driver probably | ||
100 | do not need to know that the PCI card has been "rebooted" in this | ||
101 | way; ideally, there should be at most a pause in Ethernet/disk/USB | ||
102 | I/O while the card is being reset. | ||
103 | |||
104 | If the card cannot be recovered after three or four resets, the | ||
105 | kernel/device driver should assume the worst-case scenario, that the | ||
106 | card has died completely, and report this error to the sysadmin. | ||
107 | In addition, error messages are reported through RTAS and also through | ||
108 | syslogd (/var/log/messages) to alert the sysadmin of PCI resets. | ||
109 | The correct way to deal with failed adapters is to use the standard | ||
110 | PCI hotplug tools to remove and replace the dead card. | ||
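
The reset/reconfigure/reinitialize cycle with a bounded number of
attempts might be sketched as follows. Everything here is an
illustrative assumption: recovery_ops, recover_card() and the fake_*
helpers are invented names, and MAX_RESETS is set to 3 only because
the text says "three or four".

```c
#include <stdbool.h>

#define MAX_RESETS 3  /* assumed limit; the text says "three or four" */

/* Hypothetical callbacks standing in for firmware and driver work. */
struct recovery_ops {
    void (*reset_slot)(void *ctx);      /* pulse the PCI #RST line */
    void (*restore_config)(void *ctx);  /* BARs, latency timer, ... */
    bool (*reinit_driver)(void *ctx);   /* true if the device came back */
};

/* Returns the number of resets used, or -1 if the card is presumed dead. */
static int recover_card(const struct recovery_ops *ops, void *ctx)
{
    for (int attempt = 1; attempt <= MAX_RESETS; attempt++) {
        ops->reset_slot(ctx);
        ops->restore_config(ctx);
        if (ops->reinit_driver(ctx))
            return attempt;
    }
    return -1; /* caller should report the failure to the sysadmin */
}

/* Demo device: comes back once it has been reset 'heals_after' times. */
struct fake_card { int resets; int heals_after; };

static void fake_reset(void *ctx) { ((struct fake_card *)ctx)->resets++; }
static void fake_restore(void *ctx) { (void)ctx; }
static bool fake_reinit(void *ctx)
{
    struct fake_card *c = ctx;
    return c->resets >= c->heals_after;
}

static const struct recovery_ops fake_ops = {
    fake_reset, fake_restore, fake_reinit
};
```

With a fake_card whose heals_after is 2, recover_card(&fake_ops, &card)
returns 2; a card that never heals yields -1 after MAX_RESETS resets.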
111 | |||
112 | |||
113 | Current PPC64 Linux EEH Implementation | ||
114 | -------------------------------------- | ||
115 | At this time, a generic EEH recovery mechanism has been implemented, | ||
116 | so that individual device drivers do not need to be modified to support | ||
117 | EEH recovery. This generic mechanism piggy-backs on the PCI hotplug | ||
118 | infrastructure, and percolates events up through the hotplug/udev | ||
119 | infrastructure. Following is a detailed description of how this is | ||
120 | accomplished. | ||
121 | |||
122 | EEH must be enabled in the PHBs very early during the boot process, | ||
123 | and again whenever a PCI slot is hot-plugged. The former is performed | ||
124 | by eeh_init() in arch/ppc64/kernel/eeh.c, and the latter by | ||
125 | drivers/pci/hotplug/pSeries_pci.c calling into the eeh.c code. | ||
126 | EEH must be enabled before a PCI scan of the device can proceed. | ||
127 | Current Power5 hardware will not work unless EEH is enabled, | ||
128 | although older Power4 can run with it disabled. Effectively, | ||
129 | EEH can no longer be turned off. PCI devices *must* be | ||
130 | registered with the EEH code; the EEH code needs to know about | ||
131 | the I/O address ranges of the PCI device in order to detect an | ||
132 | error. Given an arbitrary address, the routine | ||
133 | pci_get_device_by_addr() will find the pci device associated | ||
134 | with that address (if any). | ||
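
The text does not say how the address lookup is implemented. As one
possible illustration only, a registration table of I/O ranges with a
linear search could look like this; io_range and lookup_dev_by_addr()
are hypothetical names, and a device name string stands in for the
real PCI device pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registration record: one I/O range owned by one device. */
struct io_range {
    uint64_t    start;  /* first address in the range */
    uint64_t    len;    /* length in bytes */
    const char *dev;    /* device name, standing in for a PCI device */
};

/* Linear search; returns the owning device name, or NULL if none. */
static const char *lookup_dev_by_addr(const struct io_range *tbl,
                                      size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= tbl[i].start && addr - tbl[i].start < tbl[i].len)
            return tbl[i].dev;
    }
    return NULL;
}
```

A real implementation would likely use a faster structure than a
linear scan; the point is only that an unregistered device can never
be matched to a failing address.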
135 | |||
136 | The default include/asm-ppc64/io.h macros readb(), inb(), insb(), | ||
137 | etc. include a check to see if the I/O read returned all-0xff's. | ||
138 | If so, these make a call to eeh_dn_check_failure(), which in turn | ||
139 | asks the firmware if the all-ff's value is the sign of a true EEH | ||
140 | error. If it is not, processing continues as normal. The grand | ||
141 | total number of these false alarms or "false positives" can be | ||
142 | seen in /proc/ppc64/eeh (subject to change). Normally, almost | ||
143 | all of these occur during boot, when the PCI bus is scanned, where | ||
144 | a large number of 0xff reads are part of the bus scan procedure. | ||
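
The check-then-confirm pattern used by the I/O macros can be sketched
in user-space C as below. This is a model, not the io.h code:
eeh_model, firmware_says_frozen() and checked_read8() are hypothetical
names, and the "firmware" is reduced to a boolean flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from the kernel. */
struct eeh_model {
    bool          slot_frozen;      /* what "firmware" would report */
    unsigned long false_positives;  /* all-ones reads that were legitimate */
    unsigned long true_errors;      /* all-ones reads on a frozen slot */
};

/* Models the firmware query: is this slot really isolated? */
static bool firmware_says_frozen(const struct eeh_model *m)
{
    return m->slot_frozen;
}

/* Wraps a raw 8-bit read value with the all-ones check described above. */
static uint8_t checked_read8(struct eeh_model *m, uint8_t raw)
{
    if (raw == 0xff) {                /* cheap test on the fast path */
        if (firmware_says_frozen(m))
            m->true_errors++;         /* real EEH event: start recovery */
        else
            m->false_positives++;     /* device genuinely returned 0xff */
    }
    return raw;
}
```

The design keeps the common case cheap: the comparatively expensive
firmware query is made only when a read returns all ones.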
145 | |||
146 | If a frozen slot is detected, code in arch/ppc64/kernel/eeh.c will | ||
147 | print a stack trace to syslog (/var/log/messages). This stack trace | ||
148 | has proven to be very useful to device-driver authors for finding | ||
149 | out at what point the EEH error was detected, as the error itself | ||
150 | usually occurs slightly beforehand. | ||
151 | |||
152 | Next, it uses the Linux kernel notifier chain/work queue mechanism to | ||
153 | allow any interested parties to find out about the failure. Device | ||
154 | drivers, or other parts of the kernel, can use | ||
155 | eeh_register_notifier(struct notifier_block *) to find out about EEH | ||
156 | events. The event will include a pointer to the pci device, the | ||
157 | device node and some state info. Receivers of the event can "do as | ||
158 | they wish"; the default handler will be described further in this | ||
159 | section. | ||
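
For readers unfamiliar with notifier chains, the idea can be shown
with a minimal user-space model. The kernel's struct notifier_block
has a similar shape, but the code below is a simplified stand-in:
struct notifier, model_register_notifier(), model_notify() and
record_event() are all hypothetical names.

```c
#include <stddef.h>

/* Minimal user-space notifier chain; a simplified model only. */
struct notifier {
    int (*call)(struct notifier *self, unsigned long event, void *data);
    struct notifier *next;
};

static struct notifier *eeh_chain; /* head of the hypothetical chain */

/* Adds a receiver at the head of the chain. */
static void model_register_notifier(struct notifier *nb)
{
    nb->next = eeh_chain;
    eeh_chain = nb;
}

/* Delivers the event to every registered receiver; returns how many ran. */
static int model_notify(unsigned long event, void *data)
{
    int n = 0;

    for (struct notifier *nb = eeh_chain; nb; nb = nb->next) {
        nb->call(nb, event, data);
        n++;
    }
    return n;
}

/* Example receiver: records the last event it saw. */
static unsigned long last_event;

static int record_event(struct notifier *self, unsigned long event, void *data)
{
    (void)self;
    (void)data;
    last_event = event;
    return 0;
}
```

A receiver fills in a struct notifier with its callback, registers it
once, and is then called for every event passed to model_notify().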
160 | |||
161 | To assist in the recovery of the device, eeh.c exports the | ||
162 | following functions: | ||
163 | |||
164 | rtas_set_slot_reset() -- assert the PCI #RST line for 1/8th of a second | ||
165 | rtas_configure_bridge() -- ask firmware to configure any PCI bridges | ||
166 | located topologically under the pci slot. | ||
167 | eeh_save_bars() and eeh_restore_bars(): save and restore the PCI | ||
168 | config-space info for a device and any devices under it. | ||
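
The save/restore idea behind the last pair of functions can be
illustrated with a small model. This is not the eeh.c code: cfg_model
and the model_* functions are hypothetical, the choice of fields
follows the registers named earlier in this document, and the reset
is simplified to "all registers become zero".

```c
#include <stdint.h>
#include <string.h>

#define NUM_BARS 6  /* a type-0 PCI header has six base address registers */

/* Hypothetical slice of config space; field choice follows the text. */
struct cfg_model {
    uint32_t bar[NUM_BARS];
    uint8_t  latency_timer;
    uint8_t  cache_line_size;
    uint8_t  interrupt_line;
};

/* Copy the live registers into a shadow before the reset. */
static void model_save_bars(const struct cfg_model *live,
                            struct cfg_model *shadow)
{
    memcpy(shadow, live, sizeof(*shadow));
}

/* Simplified reset: the device forgets its configuration. */
static void model_reset(struct cfg_model *live)
{
    memset(live, 0, sizeof(*live));
}

/* Write the shadow copy back once the reset has completed. */
static void model_restore_bars(struct cfg_model *live,
                               const struct cfg_model *shadow)
{
    memcpy(live, shadow, sizeof(*live));
}
```

The save has to happen before the reset, since reads from an isolated
device return all ones and can no longer be trusted.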
169 | |||
170 | |||
171 | A handler for the EEH notifier_block events is implemented in | ||
172 | drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events(). | ||
173 | It saves the device BAR's and then calls rpaphp_unconfig_pci_adapter(). | ||
174 | This last call causes the device driver for the card to be stopped, | ||
175 | which causes hotplug events to go out to user space. This triggers | ||
176 | user-space scripts that might issue commands such as "ifdown eth0" | ||
177 | for ethernet cards, and so on. This handler then sleeps for 5 seconds, | ||
178 | hoping to give the user-space scripts enough time to complete. | ||
179 | It then resets the PCI card, reconfigures the device BAR's, and | ||
180 | any bridges underneath. It then calls rpaphp_enable_pci_slot(), | ||
181 | which restarts the device driver and triggers more user-space | ||
182 | events (for example, calling "ifup eth0" for ethernet cards). | ||
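
The order of operations in the handler described above is summarized
by the following sketch. The enum labels and model_handle_eeh_event()
are names made up for this illustration, not kernel identifiers; the
function only records the sequence and performs no real work.

```c
#include <stddef.h>

/* Step names are labels for this sketch, not kernel identifiers. */
enum step {
    STEP_SAVE_BARS,
    STEP_UNCONFIG_ADAPTER,    /* stops the driver, emits hotplug events */
    STEP_WAIT_FOR_USERSPACE,  /* the text describes a 5 second sleep */
    STEP_RESET_SLOT,
    STEP_RESTORE_BARS,
    STEP_CONFIGURE_BRIDGES,
    STEP_ENABLE_SLOT,         /* restarts the driver, emits more events */
    STEP_COUNT
};

/* Runs the steps in order, recording each one; returns steps executed. */
static size_t model_handle_eeh_event(enum step *trace, size_t cap)
{
    size_t n = 0;

    for (int s = STEP_SAVE_BARS; s < STEP_COUNT && n < cap; s++)
        trace[n++] = (enum step)s;
    return n;
}
```

The ordering matters: the BARs are saved before the driver is torn
down, and the slot is only re-enabled after the device and any bridges
below it have been reconfigured.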
183 | |||
184 | |||
185 | Device Shutdown and User-Space Events | ||
186 | ------------------------------------- | ||
187 | This section documents what happens when a pci slot is unconfigured, | ||
188 | focusing on how the device driver gets shut down, and on how the | ||
189 | events get delivered to user-space scripts. | ||
190 | |||
191 | Following is an example sequence of events that cause a device driver | ||
192 | close function to be called during the first phase of an EEH reset. | ||
193 | The sequence shown is for the pcnet32 device driver. | ||
194 | |||
195 | rpaphp_unconfig_pci_adapter (struct slot *) // in rpaphp_pci.c | ||
196 | { | ||
197 | calls | ||
198 | pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c | ||
199 | { | ||
200 | calls | ||
201 | pci_destroy_dev (struct pci_dev *) | ||
202 | { | ||
203 | calls | ||
204 | device_unregister (&dev->dev) // in /drivers/base/core.c | ||
205 | { | ||
206 | calls | ||
207 | device_del (struct device *) | ||
208 | { | ||
209 | calls | ||
210 | bus_remove_device() // in /drivers/base/bus.c | ||
211 | { | ||
212 | calls | ||
213 | device_release_driver() | ||
214 | { | ||
215 | calls | ||
216 | struct device_driver->remove() which is just | ||
217 | pci_device_remove() // in /drivers/pci/pci_driver.c | ||
218 | { | ||
219 | calls | ||
220 | struct pci_driver->remove() which is just | ||
221 | pcnet32_remove_one() // in /drivers/net/pcnet32.c | ||
222 | { | ||
223 | calls | ||
224 | unregister_netdev() // in /net/core/dev.c | ||
225 | { | ||
226 | calls | ||
227 | dev_close() // in /net/core/dev.c | ||
228 | { | ||
229 | calls dev->stop(); | ||
230 | which is just pcnet32_close() // in pcnet32.c | ||
231 | { | ||
232 | which does what you wanted | ||
233 | to stop the device | ||
234 | } | ||
235 | } | ||
236 | } | ||
237 | which | ||
238 | frees pcnet32 device driver memory | ||
239 | } | ||
240 | }}}}}} | ||
241 | |||
242 | |||
243 | in drivers/pci/pci_driver.c, | ||
244 | struct device_driver->remove() is just pci_device_remove() | ||
245 | which calls struct pci_driver->remove() which is pcnet32_remove_one() | ||
246 | which calls unregister_netdev() (in net/core/dev.c) | ||
247 | which calls dev_close() (in net/core/dev.c) | ||
248 | which calls dev->stop() which is pcnet32_close() | ||
249 | which then does the appropriate shutdown. | ||
250 | |||
251 | --- | ||
252 | Following is the analogous stack trace for events sent to user-space | ||
253 | when the pci device is unconfigured. | ||
254 | |||
255 | rpaphp_unconfig_pci_adapter() { // in rpaphp_pci.c | ||
256 | calls | ||
257 | pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c | ||
258 | calls | ||
259 | pci_destroy_dev (struct pci_dev *) { | ||
260 | calls | ||
261 | device_unregister (&dev->dev) { // in /drivers/base/core.c | ||
262 | calls | ||
263 | device_del(struct device * dev) { // in /drivers/base/core.c | ||
264 | calls | ||
265 | kobject_del() { // in /lib/kobject.c | ||
266 | calls | ||
267 | kobject_hotplug() { // in /lib/kobject.c | ||
268 | calls | ||
269 | kset_hotplug() { // in /lib/kobject.c | ||
270 | calls | ||
271 | kset->hotplug_ops->hotplug() which is really just | ||
272 | a call to | ||
273 | dev_hotplug() { // in /drivers/base/core.c | ||
274 | calls | ||
275 | dev->bus->hotplug() which is really just a call to | ||
276 | pci_hotplug () { // in drivers/pci/hotplug.c | ||
277 | which prints device name, etc.... | ||
278 | } | ||
279 | } | ||
280 | then kset_hotplug() calls | ||
281 | call_usermodehelper () with | ||
282 | argv[0]=hotplug_path[] which is "/sbin/hotplug" | ||
283 | --> event to userspace, | ||
284 | } | ||
285 | } | ||
286 | kobject_del() then calls sysfs_remove_dir(), which would | ||
287 | trigger any user-space daemon that was watching /sysfs | ||
288 | and noticed the delete event. | ||
289 | |||
290 | |||
291 | Pros and Cons of the Current Design | ||
292 | ----------------------------------- | ||
293 | There are several issues with the current EEH software recovery design, | ||
294 | which may be addressed in future revisions. But first, note that the | ||
295 | big plus of the current design is that no changes need to be made to | ||
296 | individual device drivers, so that the current design throws a wide net. | ||
297 | The biggest negative of the design is that it potentially disturbs | ||
298 | network daemons and file systems that didn't need to be disturbed. | ||
299 | |||
300 | -- A minor complaint is that resetting the network card causes | ||
301 | user-space back-to-back ifdown/ifup burps that potentially disturb | ||
302 | network daemons that didn't even need to know that the PCI | ||
303 | card was being rebooted. | ||
304 | |||
305 | -- A more serious concern is that the same reset, for SCSI devices, | ||
306 | causes havoc to mounted file systems. Scripts cannot post-facto | ||
307 | unmount a file system without flushing pending buffers, but this | ||
308 | is impossible, because I/O has already been stopped. Thus, | ||
309 | ideally, the reset should happen at or below the block layer, | ||
310 | so that the file systems are not disturbed. | ||
311 | |||
312 | Reiserfs does not tolerate errors returned from the block device. | ||
313 | Ext3fs seems to be tolerant, retrying reads/writes until it does | ||
314 | succeed. Both have been only lightly tested in this scenario. | ||
315 | |||
316 | The SCSI-generic subsystem already has built-in code for performing | ||
317 | SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter | ||
318 | (HBA) resets. These are cascaded into a chain of attempted | ||
319 | resets if a SCSI command fails. These are completely hidden | ||
320 | from the block layer. It would be very natural to add an EEH | ||
321 | reset into this chain of events. | ||
322 | |||
323 | -- If a SCSI error occurs for the root device, all is lost unless | ||
324 | the sysadmin had the foresight to run /bin, /sbin, /etc, /var | ||
325 | and so on, out of ramdisk/tmpfs. | ||
326 | |||
327 | |||
328 | Conclusions | ||
329 | ----------- | ||
330 | There's forward progress ... | ||
331 | |||
332 | |||