author Oza Pawandeep <poza@codeaurora.org> 2018-05-17 17:44:13 -0400
committer Bjorn Helgaas <bhelgaas@google.com> 2018-05-17 17:44:13 -0400
commit 7e9084b36740b2ec263ea35efb203001f755e1d8
tree 76c7d1d5b86834e01167e80192ad443df621d105 /Documentation/PCI
parent 9f5a70f18c5893a30d6c339adc48de43c57dd7e2
PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devices

PCIe ERR_FATAL errors mean the Link is unreliable. Components on the Link
may need to be reset to return to reliable operation (PCIe r4.0, sec
6.2.2).

We previously handled these errors much differently depending on whether
the platform supports Downstream Port Containment (DPC) (PCIe r4.0, sec
6.2.10) or not.

The AER driver has historically logged the error details, called
driver-supplied pci_error_handlers callbacks, and reset the Link. This
reset downstream devices, but did not remove them from the PCI subsystem,
re-enumerate them, or call their driver .remove() or .probe() methods.

DPC is different because the hardware automatically disables the Link when
it detects ERR_FATAL, which resets downstream devices. There's no
opportunity for pci_error_handlers callbacks before resetting the Link.
The DPC driver removes affected devices (which calls their driver .remove()
methods), brings the Link back up, and re-enumerates (which calls driver
.probe() methods).

Align AER ERR_FATAL handling with DPC by resetting the Link in software,
skipping the driver pci_error_handlers callbacks, removing the devices from
the PCI subsystem, and re-enumerating. The idea is that drivers and
devices should see the same behavior for ERR_FATAL events, regardless of
whether they're handled by AER or DPC.

Here are the basic ERR_FATAL recovery steps, showing the previous AER
behavior, the AER behavior after this patch, and the DPC behavior:

                           AER        AER       DPC
                         previous     new     behavior
                         --------     ---     --------
  Log error                yes        yes       yes (minimal)
  drv.error_detected()     yes        no        no
  Reset Link               yes        yes       yes
  drv.mmio_enabled()       yes        no        no
  drv.slot_reset()         yes        no        no
  drv.resume()             yes        no        no
  Remove PCI devices       no         yes       yes (calls drv.remove())
  Re-enumerate             no         yes       yes (calls drv.probe())

N.B. With DPC, the Link reset happens before the driver .remove() calls,
while with AER, the reset happens *after* the .remove() calls. The goal is
to eventually do the reset before .remove() for AER as well.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
[bhelgaas: changelog, squash doc patch into this, remove unused "result_data"]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
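
The step comparison in the commit message can be restated as data. What
follows is a minimal Python sketch, not kernel code. The list names
(AER_PREVIOUS, AER_NEW, DPC) and labels such as "reset_link" are
hypothetical, chosen only for illustration. Only the relative order of
Link reset and device removal comes from the N.B. in the commit message;
the order of the remaining entries follows the table rows and is an
assumption.

```python
# Toy model of the ERR_FATAL recovery flows described in the commit message.
# All names here are illustrative labels, not kernel symbols.

# pci_error_handlers callbacks named in the commit message's table.
CALLBACKS = {
    "drv.error_detected()",
    "drv.mmio_enabled()",
    "drv.slot_reset()",
    "drv.resume()",
}

# Previous AER behavior: callbacks plus a Link reset, no removal.
AER_PREVIOUS = [
    "log_error",
    "drv.error_detected()",
    "reset_link",
    "drv.mmio_enabled()",
    "drv.slot_reset()",
    "drv.resume()",
]

# New AER behavior: devices are removed *before* the Link reset (per the N.B.).
AER_NEW = ["log_error", "remove_devices", "reset_link", "re_enumerate"]

# DPC: hardware disables the Link first, so the reset precedes removal.
DPC = ["log_error", "reset_link", "remove_devices", "re_enumerate"]


def calls_error_handlers(flow):
    """True if the flow invokes any pci_error_handlers callback."""
    return any(step in CALLBACKS for step in flow)


def reset_precedes_remove(flow):
    """True if the Link reset happens before device removal in the flow."""
    return flow.index("reset_link") < flow.index("remove_devices")


if __name__ == "__main__":
    for name, flow in (("AER previous", AER_PREVIOUS),
                       ("AER new", AER_NEW),
                       ("DPC", DPC)):
        print(name, "->", ", ".join(flow))
```

In this model AER_NEW and DPC contain the same set of steps and differ
only in whether the reset precedes the removal, which is the remaining
gap the N.B. describes.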
Diffstat (limited to 'Documentation/PCI')
-rw-r--r-- Documentation/PCI/pci-error-recovery.txt | 35
1 files changed, 25 insertions, 10 deletions
diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt
index 0b6bb3ef449e..688b69121e82 100644
--- a/Documentation/PCI/pci-error-recovery.txt
+++ b/Documentation/PCI/pci-error-recovery.txt
@@ -110,7 +110,7 @@ The actual steps taken by a platform to recover from a PCI error
 event will be platform-dependent, but will follow the general
 sequence described below.
 
-STEP 0: Error Event
+STEP 0: Error Event: ERR_NONFATAL
 -------------------
 A PCI bus error is detected by the PCI hardware. On powerpc, the slot
 is isolated, in that all I/O is blocked: all reads return 0xffffffff,
@@ -228,13 +228,7 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
 If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
 proceeds to STEP 4 (Slot Reset)
 
-STEP 3: Link Reset
-------------------
-The platform resets the link. This is a PCI-Express specific step
-and is done whenever a fatal error has been detected that can be
-"solved" by resetting the link.
-
-STEP 4: Slot Reset
+STEP 3: Slot Reset
 ------------------
 
 In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
@@ -320,7 +314,7 @@ Failure).
 >>> However, it probably should.
 
 
-STEP 5: Resume Operations
+STEP 4: Resume Operations
 -------------------------
 The platform will call the resume() callback on all affected device
 drivers if all drivers on the segment have returned
@@ -332,7 +326,7 @@ a result code.
 At this point, if a new error happens, the platform will restart
 a new error recovery sequence.
 
-STEP 6: Permanent Failure
+STEP 5: Permanent Failure
 -------------------------
 A "permanent failure" has occurred, and the platform cannot recover
 the device. The platform will call error_detected() with a
@@ -355,6 +349,27 @@ errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
 for additional detail on real-life experience of the causes of
 software errors.
 
+STEP 0: Error Event: ERR_FATAL
+-------------------
+PCI bus error is detected by the PCI hardware. On powerpc, the slot is
+isolated, in that all I/O is blocked: all reads return 0xffffffff, all
+writes are ignored.
+
+STEP 1: Remove devices
+--------------------
+Platform removes the devices depending on the error agent, it could be
+this port for all subordinates or upstream component (likely downstream
+port)
+
+STEP 2: Reset link
+--------------------
+The platform resets the link. This is a PCI-Express specific step and is
+done whenever a fatal error has been detected that can be "solved" by
+resetting the link.
+
+STEP 3: Re-enumerate the devices
+--------------------
+Initiates the re-enumeration.
 
 Conclusion; General Remarks
 ---------------------------
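
The ERR_FATAL sequence this patch adds to the documentation (STEP 0
through STEP 3) can be walked through with a toy simulation. This is a
hedged Python sketch, not kernel code: ToyDriver, ToyPort and
recover_err_fatal are hypothetical names. It assumes the simplest case,
in which every device below one port is removed and later re-enumerated.

```python
# Toy simulation of the documented ERR_FATAL steps:
# error event -> remove devices -> reset link -> re-enumerate.
# All classes and functions are hypothetical illustrations.

class ToyDriver:
    """Records probe()/remove() calls for one device."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def probe(self):
        self.log.append(f"{self.name}.probe()")

    def remove(self):
        self.log.append(f"{self.name}.remove()")


class ToyPort:
    """A port with devices physically present below it."""

    def __init__(self, present, log):
        self.present = list(present)  # devices attached below the port
        self.bound = {}               # device name -> bound ToyDriver
        self.log = log

    def enumerate(self):
        # Bind a driver to every present device that is not yet bound.
        for name in self.present:
            if name not in self.bound:
                drv = ToyDriver(name, self.log)
                self.bound[name] = drv
                drv.probe()

    def remove_all(self):
        # Unbind every device, calling its driver's remove().
        for name in list(self.bound):
            self.bound.pop(name).remove()

    def reset_link(self):
        self.log.append("reset_link")


def recover_err_fatal(port):
    port.log.append("ERR_FATAL")  # STEP 0: error event
    port.remove_all()             # STEP 1: remove devices
    port.reset_link()             # STEP 2: reset link
    port.enumerate()              # STEP 3: re-enumerate the devices


log = []
port = ToyPort(["nic", "nvme"], log)
port.enumerate()   # initial enumeration
log.clear()        # keep only the recovery events
recover_err_fatal(port)
print("\n".join(log))
```

In this sketch the drivers see only remove() followed by probe(); none
of the pci_error_handlers callbacks are involved, which matches the
ERR_FATAL path described by the patch.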