-rw-r--r-- | Documentation/pcieaer-howto.txt | 253 |
1 files changed, 253 insertions, 0 deletions
diff --git a/Documentation/pcieaer-howto.txt b/Documentation/pcieaer-howto.txt
new file mode 100644
index 000000000000..16c251230c82
--- /dev/null
+++ b/Documentation/pcieaer-howto.txt
@@ -0,0 +1,253 @@
1 | The PCI Express Advanced Error Reporting Driver Guide HOWTO | ||
2 | T. Long Nguyen <tom.l.nguyen@intel.com> | ||
3 | Yanmin Zhang <yanmin.zhang@intel.com> | ||
4 | 07/29/2006 | ||
5 | |||
6 | |||
7 | 1. Overview | ||
8 | |||
9 | 1.1 About this guide | ||
10 | |||
11 | This guide describes the basics of the PCI Express Advanced Error | ||
12 | Reporting (AER) driver and provides information on how to use it, as | ||
13 | well as how to enable the drivers of endpoint devices to conform with | ||
14 | the PCI Express AER driver. | ||
15 | |||
16 | 1.2 Copyright © Intel Corporation 2006. | ||
17 | |||
18 | 1.3 What is the PCI Express AER Driver? | ||
19 | |||
20 | PCI Express error signaling can occur on the PCI Express link itself | ||
21 | or on behalf of transactions initiated on the link. PCI Express | ||
22 | defines two error reporting paradigms: the baseline capability and | ||
23 | the Advanced Error Reporting capability. The baseline capability is | ||
24 | required of all PCI Express components and provides a minimum defined | ||
25 | set of error reporting requirements. The Advanced Error Reporting | ||
26 | capability is implemented with a PCI Express advanced error reporting | ||
27 | extended capability structure and provides more robust error reporting. | ||
28 | |||
29 | The PCI Express AER driver provides the infrastructure to support PCI | ||
30 | Express Advanced Error Reporting capability. The PCI Express AER | ||
31 | driver provides three basic functions: | ||
32 | |||
33 | - Gathers comprehensive error information when errors occur. | ||
34 | - Reports errors to users. | ||
35 | - Performs error recovery actions. | ||
36 | |||
37 | The AER driver attaches only to root ports that support the PCI Express | ||
38 | AER capability. | ||
39 | |||
40 | |||
41 | 2. User Guide | ||
42 | |||
43 | 2.1 Include the PCI Express AER Root Driver into the Linux Kernel | ||
44 | |||
45 | The PCI Express AER Root driver is a Root Port service driver attached | ||
46 | to the PCI Express Port Bus driver. To use it, the driver must be | ||
47 | compiled into the kernel. Option CONFIG_PCIEAER enables it. It | ||
48 | depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and | ||
49 | CONFIG_PCIEAER=y. | ||
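As a small illustration of the two options named above, the relevant lines of a kernel .config would look like the fragment below. Only these two symbols come from this guide; the rest of the configuration is left out.

```
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
```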
50 | |||
51 | 2.2 Load PCI Express AER Root Driver | ||
52 | Some systems have AER support in the BIOS. Enabling the AER Root driver | ||
53 | while the BIOS also handles AER may result in unpredictable behavior. | ||
54 | To avoid this conflict, a successful load of the AER Root driver | ||
55 | requires ACPI _OSC support in the BIOS, which lets the AER Root driver | ||
56 | request native control of AER. See the PCI FW 3.0 Specification for | ||
57 | details regarding _OSC usage. Currently, much firmware uses PCI Express | ||
58 | without providing _OSC support. To support such firmware, forceload, a | ||
59 | parameter of type bool, lets AER initialization continue even though | ||
60 | the firmware has no _OSC support. To enable the workaround, please add | ||
61 | aerdriver.forceload=y to the kernel boot parameter line when booting | ||
62 | the kernel. Note that forceload=n by default. | ||
63 | |||
64 | 2.3 AER error output | ||
65 | When a PCI Express AER error is captured, an error message is output to | ||
66 | the console. A correctable error is output as a warning; any other | ||
67 | error is printed as an error. Users can therefore choose a log level | ||
68 | that filters out correctable error messages. | ||
69 | |||
70 | An example is shown below. | ||
71 | +------ PCI-Express Device Error -----+ | ||
72 | Error Severity : Uncorrected (Fatal) | ||
73 | PCIE Bus Error type : Transaction Layer | ||
74 | Unsupported Request : First | ||
75 | Requester ID : 0500 | ||
76 | VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h | ||
77 | TLB Header: | ||
78 | 04000001 00200a03 05010000 00050100 | ||
79 | |||
80 | In the example, 'Requester ID' means the ID of the device that sends | ||
81 | the error message to the root port. Please refer to the PCI Express | ||
82 | specifications for the other fields. | ||
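The relation between 'Requester ID : 0500' and 'Bus=05h, Device=00h, Function=00h' in the example output can be sketched in a few lines of userspace C. This is only an illustration and assumes the conventional PCI layout of the 16-bit ID (bus in bits 15:8, device in bits 7:3, function in bits 2:0); the struct and function names are hypothetical and are not part of the AER driver.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type for this sketch; not a kernel structure. */
struct requester_id {
	uint8_t bus;      /* bits 15:8 of the Requester ID */
	uint8_t device;   /* bits 7:3 */
	uint8_t function; /* bits 2:0 */
};

/* Split a 16-bit Requester ID into bus, device and function numbers. */
struct requester_id decode_requester_id(uint16_t id)
{
	struct requester_id r;

	r.bus = (uint8_t)(id >> 8);
	r.device = (uint8_t)((id >> 3) & 0x1f);
	r.function = (uint8_t)(id & 0x07);
	return r;
}

/* Print the fields in the same style as the example output above. */
void print_requester_id(uint16_t id)
{
	struct requester_id r = decode_requester_id(id);

	printf("Bus=%02xh, Device=%02xh, Function=%02xh\n",
	       r.bus, r.device, r.function);
}
```

Calling print_requester_id(0x0500) yields the bus, device and function values shown in the example above.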
83 | |||
84 | |||
85 | 3. Developer Guide | ||
86 | |||
87 | Enabling AER-aware support requires a software driver to configure | ||
88 | the AER capability structure within its device and to provide callbacks. | ||
89 | |||
90 | To support AER well, developers first need to understand how AER | ||
91 | works. | ||
92 | |||
93 | PCI Express errors are classified into two types: correctable errors | ||
94 | and uncorrectable errors. This classification is based on the impact | ||
95 | of those errors, which may result in degraded performance or function | ||
96 | failure. | ||
97 | |||
98 | Correctable errors have no impact on the functionality of the | ||
99 | interface. The PCI Express protocol can recover without any software | ||
100 | intervention or any loss of data. These errors are detected and | ||
101 | corrected by hardware. Unlike correctable errors, uncorrectable | ||
102 | errors impact functionality of the interface. Uncorrectable errors | ||
103 | can cause a particular transaction or a particular PCI Express link | ||
104 | to be unreliable. Depending on those error conditions, uncorrectable | ||
105 | errors are further classified into non-fatal errors and fatal errors. | ||
106 | Non-fatal errors cause the particular transaction to be unreliable, | ||
107 | but the PCI Express link itself is fully functional. Fatal errors, on | ||
108 | the other hand, cause the link to be unreliable. | ||
109 | |||
110 | When AER is enabled, a PCI Express device automatically sends an | ||
111 | error message to the PCI Express Root Port above it when the device | ||
112 | captures an error. The Root Port, upon receiving an error reporting | ||
113 | message, internally processes and logs the error message in its PCI | ||
114 | Express capability structure. Logging the error includes storing the | ||
115 | error reporting agent's requester ID into the Error Source | ||
116 | Identification Registers and setting the error bits of the Root Error | ||
117 | Status Register accordingly. If AER error reporting is enabled in the | ||
118 | Root Error Command Register, the Root Port generates an interrupt when | ||
119 | an error is detected. | ||
120 | |||
121 | Note that the errors as described above are related to the PCI Express | ||
122 | hierarchy and links. These errors do not include any device-specific | ||
123 | errors because device-specific errors will still get sent directly to | ||
124 | the device driver. | ||
125 | |||
126 | 3.1 Configure the AER capability structure | ||
127 | |||
128 | AER-aware drivers of PCI Express components need to change the device | ||
129 | control registers to enable AER. They can also change AER registers, | ||
130 | including the mask and severity registers. The helper function | ||
131 | pci_enable_pcie_error_reporting can be used to enable AER. See | ||
132 | section 3.3. | ||
133 | |||
134 | 3.2 Provide callbacks | ||
135 | |||
136 | 3.2.1 Callback reset_link to reset the PCI Express link | ||
137 | |||
138 | This callback is used to reset the PCI Express physical link when a | ||
139 | fatal error happens. The root port AER service driver provides a | ||
140 | default reset_link function, but different upstream ports might | ||
141 | have different specifications to reset the PCI Express link, so all | ||
142 | upstream ports should provide their own reset_link functions. | ||
143 | |||
144 | In struct pcie_port_service_driver, a new pointer, reset_link, is | ||
145 | added. | ||
146 | |||
147 | pci_ers_result_t (*reset_link) (struct pci_dev *dev); | ||
148 | |||
149 | Section 3.2.2.2 provides more detailed info on when to call | ||
150 | reset_link. | ||
151 | |||
152 | 3.2.2 PCI error-recovery callbacks | ||
153 | |||
154 | The PCI Express AER Root driver uses error callbacks to coordinate | ||
155 | with downstream device drivers associated with a hierarchy in question | ||
156 | when performing error recovery actions. | ||
157 | |||
158 | The struct pci_driver has a pointer, err_handler, that points to a | ||
159 | pci_error_handlers structure consisting of several callback function | ||
160 | pointers. The AER driver follows the rules defined in | ||
161 | pci-error-recovery.txt except for PCI Express specific parts (e.g. | ||
162 | reset_link). Please refer to pci-error-recovery.txt for detailed | ||
163 | definitions of the callbacks. | ||
164 | |||
165 | The sections below specify when the error callback functions are called. | ||
166 | |||
167 | 3.2.2.1 Correctable errors | ||
168 | |||
169 | Correctable errors have no impact on the functionality of | ||
170 | the interface. The PCI Express protocol can recover without any | ||
171 | software intervention or any loss of data. These errors do not | ||
172 | require any recovery actions. The AER driver clears the device's | ||
173 | correctable error status register accordingly and logs these errors. | ||
174 | |||
175 | 3.2.2.2 Uncorrectable (non-fatal and fatal) errors | ||
176 | |||
177 | If an error message indicates a non-fatal error, performing link reset | ||
178 | at upstream is not required. The AER driver calls error_detected(dev, | ||
179 | pci_channel_io_normal) on all drivers associated with the hierarchy in | ||
180 | question. For example, | ||
181 | EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort. | ||
182 | If Upstream port A captures an AER error, the hierarchy consists of | ||
183 | Downstream port B and EndPoint. | ||
184 | |||
185 | A driver may return PCI_ERS_RESULT_CAN_RECOVER, | ||
186 | PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on | ||
187 | whether it can recover. If it can, the AER driver calls mmio_enabled next. | ||
188 | |||
189 | If an error message indicates a fatal error, the kernel broadcasts | ||
190 | error_detected(dev, pci_channel_io_frozen) to all drivers within | ||
191 | the hierarchy in question. Then, a link reset at upstream is | ||
192 | necessary. As different kinds of devices might use different approaches | ||
193 | to reset the link, the AER port service driver is required to provide | ||
194 | the function to reset the link. First, the kernel checks whether the | ||
195 | upstream component has an AER driver. If it does, the kernel uses the | ||
196 | reset_link callback of that driver. If the upstream component has no | ||
197 | AER driver and the port is a downstream port, the kernel uses the AER | ||
198 | driver of the root port that reports the AER error. Upstream ports | ||
199 | should provide their own AER service drivers with a reset_link | ||
200 | function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and | ||
201 | reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes | ||
202 | to mmio_enabled. | ||
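The per-severity behaviour described in sections 3.2.2.1 and 3.2.2.2 can be summarised as a small decision table. The userspace sketch below is only a model of that description, not the AER driver's code; every type and function name in it is hypothetical. The two channel states stand for pci_channel_io_normal and pci_channel_io_frozen.

```c
#include <stdbool.h>

/* Hypothetical model types; these are not kernel definitions. */
enum aer_severity {
	AER_SEV_CORRECTABLE,
	AER_SEV_NONFATAL,
	AER_SEV_FATAL
};

enum channel_state {
	CHANNEL_NONE,      /* error_detected is not called at all */
	CHANNEL_IO_NORMAL, /* stands for pci_channel_io_normal */
	CHANNEL_IO_FROZEN  /* stands for pci_channel_io_frozen */
};

struct recovery_plan {
	bool call_error_detected; /* notify drivers in the hierarchy? */
	enum channel_state state; /* state passed to error_detected */
	bool reset_link;          /* is an upstream link reset needed? */
};

/* Map an error severity to the recovery steps described in the text. */
struct recovery_plan plan_recovery(enum aer_severity sev)
{
	struct recovery_plan p = { false, CHANNEL_NONE, false };

	switch (sev) {
	case AER_SEV_CORRECTABLE:
		/* Logged and cleared only; no recovery action. */
		break;
	case AER_SEV_NONFATAL:
		p.call_error_detected = true;
		p.state = CHANNEL_IO_NORMAL;
		break;
	case AER_SEV_FATAL:
		p.call_error_detected = true;
		p.state = CHANNEL_IO_FROZEN;
		p.reset_link = true;
		break;
	}
	return p;
}
```

The model stops at the first step of recovery. Which callback runs afterwards depends on the values returned by error_detected and reset_link, as described above.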
203 | |||
204 | 3.3 Helper functions | ||
205 | |||
206 | 3.3.1 int pci_find_aer_capability(struct pci_dev *dev); | ||
207 | pci_find_aer_capability locates the PCI Express AER capability | ||
208 | in the device configuration space. If the device doesn't support | ||
209 | PCI-Express AER, the function returns 0. | ||
210 | |||
211 | 3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev); | ||
212 | pci_enable_pcie_error_reporting enables the device to send error | ||
213 | messages to the root port when an error is detected. Note that devices | ||
214 | don't enable error reporting by default, so device drivers need to | ||
215 | call this function to enable it. | ||
216 | |||
217 | 3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev); | ||
218 | pci_disable_pcie_error_reporting stops the device from sending error | ||
219 | messages to the root port when an error is detected. | ||
220 | |||
221 | 3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); | ||
222 | pci_cleanup_aer_uncorrect_error_status cleans up the uncorrectable | ||
223 | error status register. | ||
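The order in which a driver might call the first three helpers can be sketched as below. Kernel code cannot run standalone, so this is a userspace mock-up: struct pci_dev and the helper bodies are stand-ins written for this sketch, and only the prototypes are taken from the section above. The example_probe and example_remove functions are hypothetical. The return convention for the enable and disable helpers (0 on success, negative on failure) is an assumption.

```c
#include <stdbool.h>

/* Userspace stand-in for the kernel's struct pci_dev (mock only). */
struct pci_dev {
	int aer_cap_offset;     /* 0 means the device has no AER capability */
	bool reporting_enabled; /* mock state toggled by the helpers below */
};

/* Mock bodies; only the prototypes match section 3.3. */
int pci_find_aer_capability(struct pci_dev *dev)
{
	return dev->aer_cap_offset;
}

int pci_enable_pcie_error_reporting(struct pci_dev *dev)
{
	if (!dev->aer_cap_offset)
		return -1; /* assumed failure value for this mock */
	dev->reporting_enabled = true;
	return 0;
}

int pci_disable_pcie_error_reporting(struct pci_dev *dev)
{
	dev->reporting_enabled = false;
	return 0;
}

/* Hypothetical probe step: enable reporting only if AER is present. */
int example_probe(struct pci_dev *dev)
{
	if (!pci_find_aer_capability(dev))
		return 0; /* the device still works, just without AER */
	return pci_enable_pcie_error_reporting(dev);
}

/* Hypothetical remove step: undo what example_probe enabled. */
void example_remove(struct pci_dev *dev)
{
	if (dev->reporting_enabled)
		pci_disable_pcie_error_reporting(dev);
}
```

In a real driver these calls would sit inside the driver's own probe and remove routines; the mock only shows the check-then-enable and disable-on-teardown pattern.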
224 | |||
225 | 3.4 Frequently Asked Questions | ||
226 | |||
227 | Q: What happens if a PCI Express device driver does not provide an | ||
228 | error recovery handler (pci_driver->err_handler is equal to NULL)? | ||
229 | |||
230 | A: The devices bound to the driver won't be recovered. If the | ||
231 | error is fatal, the kernel prints warning messages. Please refer | ||
232 | to section 3 for more information. | ||
233 | |||
234 | Q: What happens if an upstream port service driver does not provide | ||
235 | callback reset_link? | ||
236 | |||
237 | A: Fatal error recovery will fail if the errors are reported by the | ||
238 | upstream ports to which the service driver is attached. | ||
239 | |||
240 | Q: How does this infrastructure deal with a driver that is not PCI | ||
241 | Express aware? | ||
242 | |||
243 | A: This infrastructure calls the error callback functions of the | ||
244 | driver when an error happens. But if the driver is not aware of | ||
245 | PCI Express, the device might not report its own errors to root | ||
246 | port. | ||
247 | |||
248 | Q: What modifications will that driver need to make it compatible | ||
249 | with the PCI Express AER Root driver? | ||
250 | |||
251 | A: It can call the helper functions to enable AER in devices and | ||
252 | clean up the uncorrectable status register. Please refer to section 3.3. | ||
253 | |||