author | Randy Dunlap <randy.dunlap@oracle.com> | 2008-03-10 20:16:32 -0400 |
---|---|---|
committer | Greg Kroah-Hartman <gregkh@suse.de> | 2008-04-21 00:46:51 -0400 |
commit | 4b5ff469234b8ab5cd05f4a201cbb229896729d0 (patch) | |
tree | dc44c4e82be76ffc00cb981eb4606276fffa7e1e /Documentation/PCI | |
parent | 3925e6fc1f774048404fdd910b0345b06c699eb4 (diff) |
PCI: doc/pci: create Documentation/PCI/ and move files into it
Create Documentation/PCI/ and move PCI-related files to it.
Fix a few instances of trailing whitespace.
Update references to the new file locations.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Diffstat (limited to 'Documentation/PCI')
-rw-r--r-- | Documentation/PCI/00-INDEX | 12 | ||||
-rw-r--r-- | Documentation/PCI/PCIEBUS-HOWTO.txt | 217 | ||||
-rw-r--r-- | Documentation/PCI/pci-error-recovery.txt | 396 | ||||
-rw-r--r-- | Documentation/PCI/pci.txt | 646 | ||||
-rw-r--r-- | Documentation/PCI/pcieaer-howto.txt | 253 |
5 files changed, 1524 insertions, 0 deletions
diff --git a/Documentation/PCI/00-INDEX b/Documentation/PCI/00-INDEX new file mode 100644 index 000000000000..49f43946c6b6 --- /dev/null +++ b/Documentation/PCI/00-INDEX | |||
@@ -0,0 +1,12 @@ | |||
1 | 00-INDEX | ||
2 | - this file | ||
3 | PCI-DMA-mapping.txt | ||
4 | - info for PCI drivers using DMA portably across all platforms | ||
5 | PCIEBUS-HOWTO.txt | ||
6 | - a guide describing the PCI Express Port Bus driver | ||
7 | pci-error-recovery.txt | ||
8 | - info on PCI error recovery | ||
9 | pci.txt | ||
10 | - info on the PCI subsystem for device driver authors | ||
11 | pcieaer-howto.txt | ||
12 | - the PCI Express Advanced Error Reporting Driver Guide HOWTO | ||
diff --git a/Documentation/PCI/PCIEBUS-HOWTO.txt b/Documentation/PCI/PCIEBUS-HOWTO.txt new file mode 100644 index 000000000000..9a07e38631b0 --- /dev/null +++ b/Documentation/PCI/PCIEBUS-HOWTO.txt | |||
@@ -0,0 +1,217 @@ | |||
1 | The PCI Express Port Bus Driver Guide HOWTO | ||
2 | Tom L Nguyen tom.l.nguyen@intel.com | ||
3 | 11/03/2004 | ||
4 | |||
5 | 1. About this guide | ||
6 | |||
7 | This guide describes the basics of the PCI Express Port Bus driver | ||
8 | and provides information on how to enable the service drivers to | ||
9 | register/unregister with the PCI Express Port Bus Driver. | ||
10 | |||
11 | 2. Copyright 2004 Intel Corporation | ||
12 | |||
13 | 3. What is the PCI Express Port Bus Driver | ||
14 | |||
15 | A PCI Express Port is a logical PCI-PCI Bridge structure. There | ||
16 | are two types of PCI Express Port: the Root Port and the Switch | ||
17 | Port. The Root Port originates a PCI Express link from a PCI Express | ||
18 | Root Complex and the Switch Port connects PCI Express links to | ||
19 | internal logical PCI buses. The Switch Port, which has its secondary | ||
20 | bus representing the switch's internal routing logic, is called the | ||
21 | switch's Upstream Port. The switch's Downstream Port bridges from the | ||
22 | switch's internal routing bus to a bus representing the downstream | ||
23 | PCI Express link from the PCI Express Switch. | ||
24 | |||
25 | A PCI Express Port can provide up to four distinct functions, | ||
26 | referred to in this document as services, depending on its port type. | ||
27 | A PCI Express Port's services include native hotplug support (HP), | ||
28 | power management event support (PME), advanced error reporting | ||
29 | support (AER), and virtual channel support (VC). These services may | ||
30 | be handled by a single complex driver or be individually distributed | ||
31 | and handled by corresponding service drivers. | ||
32 | |||
33 | 4. Why use the PCI Express Port Bus Driver? | ||
34 | |||
35 | In existing Linux kernels, the Linux Device Driver Model allows a | ||
36 | physical device to be handled by only a single driver. The PCI | ||
37 | Express Port is a PCI-PCI Bridge device with multiple distinct | ||
38 | services. To maintain a clean and simple solution each service | ||
39 | may have its own software service driver. In this case several | ||
40 | service drivers will compete for a single PCI-PCI Bridge device. | ||
41 | For example, if the PCI Express Root Port native hotplug service | ||
42 | driver is loaded first, it claims a PCI-PCI Bridge Root Port. The | ||
43 | kernel therefore does not load other service drivers for that Root | ||
44 | Port. In other words, it is impossible to have multiple service | ||
45 | drivers load and run on a PCI-PCI Bridge device simultaneously | ||
46 | using the current driver model. | ||
47 | |||
48 | Running multiple service drivers simultaneously requires | ||
49 | a PCI Express Port Bus driver, which manages all populated | ||
50 | PCI Express Ports and distributes all provided service requests | ||
51 | to the corresponding service drivers as required. Some key | ||
52 | advantages of using the PCI Express Port Bus driver are listed below: | ||
53 | |||
54 | - Allow multiple service drivers to run simultaneously on | ||
55 | a PCI-PCI Bridge Port device. | ||
56 | |||
57 | - Allow service drivers to be implemented in an independent, | ||
58 | staged approach. | ||
59 | |||
60 | - Allow one service driver to run on multiple PCI-PCI Bridge | ||
61 | Port devices. | ||
62 | |||
63 | - Manage and distribute resources of a PCI-PCI Bridge Port | ||
64 | device to requested service drivers. | ||
65 | |||
66 | 5. Configuring the PCI Express Port Bus Driver vs. Service Drivers | ||
67 | |||
68 | 5.1 Including the PCI Express Port Bus Driver Support into the Kernel | ||
69 | |||
70 | Including the PCI Express Port Bus driver depends on whether the PCI | ||
71 | Express support is included in the kernel config. The kernel will | ||
72 | automatically include the PCI Express Port Bus driver as a kernel | ||
73 | driver when the PCI Express support is enabled in the kernel. | ||
74 | |||
75 | 5.2 Enabling Service Driver Support | ||
76 | |||
77 | PCI device drivers are implemented based on the Linux Device Driver Model. | ||
78 | All service drivers are PCI device drivers. As discussed above, it is | ||
79 | impossible to load any service driver once the kernel has loaded the | ||
80 | PCI Express Port Bus Driver. Meeting the PCI Express Port Bus Driver | ||
81 | Model requires some minimal changes to existing service drivers that | ||
82 | have no impact on the functionality of existing service drivers. | ||
83 | |||
84 | A service driver is required to use the two APIs shown below to | ||
85 | register its service with the PCI Express Port Bus driver (see | ||
86 | sections 5.2.1 and 5.2.2). It is important that a service driver | ||
87 | initializes the pcie_port_service_driver data structure, included in | ||
88 | header file /include/linux/pcieport_if.h, before calling these APIs. | ||
89 | Failure to do so will result in an identity mismatch, which prevents | ||
90 | the PCI Express Port Bus driver from loading a service driver. | ||
91 | |||
92 | 5.2.1 pcie_port_service_register | ||
93 | |||
94 | int pcie_port_service_register(struct pcie_port_service_driver *new) | ||
95 | |||
96 | This API replaces the Linux Driver Model's pci_module_init API. A | ||
97 | service driver should always call pcie_port_service_register at | ||
98 | module init. Note that after the service driver is loaded, calls | ||
99 | such as pci_enable_device(dev) and pci_set_master(dev) are no longer | ||
100 | necessary since these calls are executed by the PCI Express Port Bus driver. | ||
101 | |||
102 | 5.2.2 pcie_port_service_unregister | ||
103 | |||
104 | void pcie_port_service_unregister(struct pcie_port_service_driver *new) | ||
105 | |||
106 | pcie_port_service_unregister replaces the Linux Driver Model's | ||
107 | pci_unregister_driver. It is always called by a service driver when its | ||
108 | module exits. | ||
109 | |||
110 | 5.2.3 Sample Code | ||
111 | |||
112 | Below is sample service driver code to initialize the port service | ||
113 | driver data structure. | ||
114 | |||
115 | static struct pcie_port_service_id service_id[] = { { | ||
116 | .vendor = PCI_ANY_ID, | ||
117 | .device = PCI_ANY_ID, | ||
118 | .port_type = PCIE_RC_PORT, | ||
119 | .service_type = PCIE_PORT_SERVICE_AER, | ||
120 | }, { /* end: all zeroes */ } | ||
121 | }; | ||
122 | |||
123 | static struct pcie_port_service_driver root_aerdrv = { | ||
124 | .name = (char *)device_name, | ||
125 | .id_table = &service_id[0], | ||
126 | |||
127 | .probe = aerdrv_load, | ||
128 | .remove = aerdrv_unload, | ||
129 | |||
130 | .suspend = aerdrv_suspend, | ||
131 | .resume = aerdrv_resume, | ||
132 | }; | ||
133 | |||
134 | Below is sample code for registering/unregistering a service | ||
135 | driver. | ||
136 | |||
137 | static int __init aerdrv_service_init(void) | ||
138 | { | ||
139 | int retval = 0; | ||
140 | |||
141 | retval = pcie_port_service_register(&root_aerdrv); | ||
142 | if (!retval) { | ||
143 | /* | ||
144 | * FIX ME | ||
145 | */ | ||
146 | } | ||
147 | return retval; | ||
148 | } | ||
149 | |||
150 | static void __exit aerdrv_service_exit(void) | ||
151 | { | ||
152 | pcie_port_service_unregister(&root_aerdrv); | ||
153 | } | ||
154 | |||
155 | module_init(aerdrv_service_init); | ||
156 | module_exit(aerdrv_service_exit); | ||
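The id_table in the sample above ends with an all-zero entry. As a rough userspace sketch of how a bus driver could walk such a table to decide whether a service driver matches a port, consider the fragment below. The struct, the constant values and the function service_id_match are hypothetical stand-ins chosen for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the values are illustrative, not the kernel's. */
#define ANY_ID          0xffffu
#define PORT_TYPE_ROOT  0x4u
#define SERVICE_AER     0x2u
#define SERVICE_HP      0x4u

struct service_id {
	unsigned int vendor;
	unsigned int device;
	unsigned int port_type;
	unsigned int service_type;
};

/* A table entry with every field zero terminates the table. */
static int is_terminator(const struct service_id *id)
{
	return !id->vendor && !id->device && !id->port_type && !id->service_type;
}

/* Return the first matching entry, or NULL if no entry matches. */
static const struct service_id *
service_id_match(const struct service_id *table, unsigned int vendor,
		 unsigned int device, unsigned int port_type,
		 unsigned int service_type)
{
	const struct service_id *id;

	assert(table != NULL);
	for (id = table; !is_terminator(id); id++) {
		if ((id->vendor == ANY_ID || id->vendor == vendor) &&
		    (id->device == ANY_ID || id->device == device) &&
		    id->port_type == port_type &&
		    id->service_type == service_type)
			return id;
	}
	return NULL;
}

/* Table shaped like the sample above: any Root Port offering AER. */
static const struct service_id aer_ids[] = {
	{ ANY_ID, ANY_ID, PORT_TYPE_ROOT, SERVICE_AER },
	{ 0, 0, 0, 0 }	/* end: all zeroes */
};
```

With this table, a Root Port asking for the AER service matches the first entry, while a request for a different service type on the same port finds no match and returns NULL.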
157 | |||
158 | 6. Possible Resource Conflicts | ||
159 | |||
160 | Since all service drivers of a PCI-PCI Bridge Port device are | ||
161 | allowed to run simultaneously, a few possible resource conflicts | ||
162 | are listed below with proposed solutions. | ||
163 | |||
164 | 6.1 MSI Vector Resource | ||
165 | |||
166 | The MSI capability structure enables a device software driver to call | ||
167 | pci_enable_msi to request MSI based interrupts. Once MSI interrupts | ||
168 | are enabled on a device, it stays in this mode until a device driver | ||
169 | calls pci_disable_msi to disable MSI interrupts and revert back to | ||
170 | INTx emulation mode. Since service drivers of the same PCI-PCI Bridge | ||
171 | port share the same physical device, if an individual service driver | ||
172 | calls pci_enable_msi/pci_disable_msi it may result in unpredictable | ||
173 | behavior. For example, two service drivers run simultaneously on the | ||
174 | same physical Root Port. Both service drivers call pci_enable_msi to | ||
175 | request MSI based interrupts. A service driver may not know whether | ||
176 | any other service drivers have run on this Root Port. If either one | ||
177 | of them calls pci_disable_msi, it puts the other service driver | ||
178 | in a wrong interrupt mode. | ||
179 | |||
180 | To avoid this situation, no service driver is permitted to | ||
181 | switch the interrupt mode on its device. The PCI Express Port Bus driver | ||
182 | is responsible for determining the interrupt mode and this should be | ||
183 | transparent to service drivers. Service drivers need to know only | ||
184 | the vector IRQ assigned to the field irq of struct pcie_device, which | ||
185 | is passed in when the PCI Express Port Bus driver probes each service | ||
186 | driver. Service drivers should use (struct pcie_device*)dev->irq to | ||
187 | call request_irq/free_irq. In addition, the interrupt mode is stored | ||
188 | in the field interrupt_mode of struct pcie_device. | ||
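As a userspace sketch of the rule above, the fragment below models a port bus driver that decides the interrupt mode once and records an IRQ number for each service, while the service side only reads that number. All names here (struct port_service, port_setup_irqs, service_irq, the mode constants) are hypothetical and merely mirror the irq and interrupt_mode fields mentioned in the text. The sketch also assumes a single vector shared by all services.

```c
#include <assert.h>
#include <stddef.h>

enum irq_mode { MODE_INTX, MODE_MSI };	/* hypothetical mode constants */

struct port_service {			/* stand-in for struct pcie_device */
	int irq;			/* number the service passes to request_irq */
	enum irq_mode interrupt_mode;
};

/*
 * Port-side setup: the mode is decided once for the whole port and the
 * same decision is recorded in every service device.
 */
static void port_setup_irqs(struct port_service *svc, size_t n,
			    int msi_available, int intx_irq, int msi_irq)
{
	enum irq_mode mode = msi_available ? MODE_MSI : MODE_INTX;
	size_t i;

	assert(svc != NULL || n == 0);
	for (i = 0; i < n; i++) {
		svc[i].interrupt_mode = mode;
		svc[i].irq = (mode == MODE_MSI) ? msi_irq : intx_irq;
	}
}

/* Service-side view: use the IRQ it was given, never switch the mode. */
static int service_irq(const struct port_service *svc)
{
	return svc->irq;
}
```

Because the services never enable or disable MSI themselves, one service going away cannot leave the other in the wrong interrupt mode.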
189 | |||
190 | 6.2 MSI-X Vector Resources | ||
191 | |||
192 | Similar to MSI, a device driver for an MSI-X capable device can | ||
193 | call pci_enable_msix to request MSI-X interrupts. No service driver | ||
194 | is permitted to switch the interrupt mode on its device. The PCI | ||
195 | Express Port Bus driver is responsible for determining the interrupt | ||
196 | mode and this should be transparent to service drivers. Any attempt | ||
197 | by a service driver to call pci_enable_msix/pci_disable_msix may | ||
198 | result in unpredictable behavior. Service drivers should use | ||
199 | (struct pcie_device*)dev->irq and call request_irq/free_irq. | ||
200 | |||
201 | 6.3 PCI Memory/IO Mapped Regions | ||
202 | |||
203 | Service drivers for PCI Express Power Management (PME), Advanced | ||
204 | Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access | ||
205 | PCI configuration space on the PCI Express port. In all cases the | ||
206 | registers accessed are independent of each other. This patch assumes | ||
207 | that all service drivers will be well behaved and not overwrite | ||
208 | other service drivers' configuration settings. | ||
209 | |||
210 | 6.4 PCI Config Registers | ||
211 | |||
212 | Each service driver runs its PCI config operations on its own | ||
213 | capability structure except the PCI Express capability structure, in | ||
214 | which Root Control register and Device Control register are shared | ||
215 | between PME and AER. This patch assumes that all service drivers | ||
216 | will be well behaved and not overwrite other service drivers' | ||
217 | configuration settings. | ||
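One way for a driver to stay well behaved on a shared register is to change only the bits it owns with a read-modify-write. The sketch below models this on a plain 16-bit value instead of real config space. The helper update_bits and the two ownership masks are illustrative assumptions, not kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative masks for a register shared by two services. */
#define AER_OWNED_BITS 0x0007u
#define PME_OWNED_BITS 0x0008u

/*
 * Return the new register value: bits inside 'mask' are taken from 'val',
 * every other bit keeps its old value. The caller must not pass bits
 * outside the mask it owns.
 */
static uint16_t update_bits(uint16_t old, uint16_t mask, uint16_t val)
{
	assert((val & ~mask) == 0);
	return (uint16_t)((old & ~mask) | val);
}
```

If the AER side writes its bits and the PME side then writes its own, neither update disturbs the other, regardless of the order in which the two drivers run.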
diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt new file mode 100644 index 000000000000..6650af432523 --- /dev/null +++ b/Documentation/PCI/pci-error-recovery.txt | |||
@@ -0,0 +1,396 @@ | |||
1 | |||
2 | PCI Error Recovery | ||
3 | ------------------ | ||
4 | February 2, 2006 | ||
5 | |||
6 | Current document maintainer: | ||
7 | Linas Vepstas <linas@austin.ibm.com> | ||
8 | |||
9 | |||
10 | Many PCI bus controllers are able to detect a variety of hardware | ||
11 | PCI errors on the bus, such as parity errors on the data and address | ||
12 | busses, as well as SERR and PERR errors. Some of the more advanced | ||
13 | chipsets are able to deal with these errors; these include PCI-E chipsets, | ||
14 | and the PCI-host bridges found on IBM Power4 and Power5-based pSeries | ||
15 | boxes. A typical action taken is to disconnect the affected device, | ||
16 | halting all I/O to it. The goal of a disconnection is to avoid system | ||
17 | corruption; for example, to halt system memory corruption due to DMA's | ||
18 | to "wild" addresses. Typically, a reconnection mechanism is also | ||
19 | offered, so that the affected PCI device(s) are reset and put back | ||
20 | into working condition. The reset phase requires coordination | ||
21 | between the affected device drivers and the PCI controller chip. | ||
22 | This document describes a generic API for notifying device drivers | ||
23 | of a bus disconnection, and then performing error recovery. | ||
24 | This API is currently implemented in the 2.6.16 and later kernels. | ||
25 | |||
26 | Reporting and recovery is performed in several steps. First, when | ||
27 | a PCI hardware error has resulted in a bus disconnect, that event | ||
28 | is reported as soon as possible to all affected device drivers, | ||
29 | including multiple instances of a device driver on multi-function | ||
30 | cards. This allows device drivers to avoid deadlocking in spinloops, | ||
31 | waiting for some i/o-space register to change, when it never will. | ||
32 | It also gives the drivers a chance to defer incoming I/O as | ||
33 | needed. | ||
34 | |||
35 | Next, recovery is performed in several stages. Most of the complexity | ||
36 | is forced by the need to handle multi-function devices, that is, | ||
37 | devices that have multiple device drivers associated with them. | ||
38 | In the first stage, each driver is allowed to indicate what type | ||
39 | of reset it desires, the choices being a simple re-enabling of I/O | ||
40 | or requesting a hard reset (a full electrical #RST of the PCI card). | ||
41 | If any driver requests a full reset, that is what will be done. | ||
42 | |||
43 | After a full reset and/or a re-enabling of I/O, all drivers are | ||
44 | again notified, so that they may then perform any device setup/config | ||
45 | that may be required. After these have all completed, a final | ||
46 | "resume normal operations" event is sent out. | ||
47 | |||
48 | The biggest reason for choosing a kernel-based implementation rather | ||
49 | than a user-space implementation was the need to deal with bus | ||
50 | disconnects of PCI devices attached to storage media, and, in particular, | ||
51 | disconnects from devices holding the root file system. If the root | ||
52 | file system is disconnected, a user-space mechanism would have to go | ||
53 | through a large number of contortions to complete recovery. Almost all | ||
54 | of the current Linux file systems are not tolerant of disconnection | ||
55 | from/reconnection to their underlying block device. By contrast, | ||
56 | bus errors are easy to manage in the device driver. Indeed, most | ||
57 | device drivers already handle very similar recovery procedures; | ||
58 | for example, the SCSI-generic layer already provides significant | ||
59 | mechanisms for dealing with SCSI bus errors and SCSI bus resets. | ||
60 | |||
61 | |||
62 | Detailed Design | ||
63 | --------------- | ||
64 | Design and implementation details below, based on a chain of | ||
65 | public email discussions with Ben Herrenschmidt, circa 5 April 2005. | ||
66 | |||
67 | The error recovery API support is exposed to the driver in the form of | ||
68 | a structure of function pointers pointed to by a new field in struct | ||
69 | pci_driver. A driver that fails to provide the structure is "non-aware", | ||
70 | and the actual recovery steps taken are platform dependent. The | ||
71 | arch/powerpc implementation will simulate a PCI hotplug remove/add. | ||
72 | |||
73 | This structure has the form: | ||
74 | struct pci_error_handlers | ||
75 | { | ||
76 | int (*error_detected)(struct pci_dev *dev, enum pci_channel_state); | ||
77 | int (*mmio_enabled)(struct pci_dev *dev); | ||
78 | int (*link_reset)(struct pci_dev *dev); | ||
79 | int (*slot_reset)(struct pci_dev *dev); | ||
80 | void (*resume)(struct pci_dev *dev); | ||
81 | }; | ||
82 | |||
83 | The possible channel states are: | ||
84 | enum pci_channel_state { | ||
85 | pci_channel_io_normal, /* I/O channel is in normal state */ | ||
86 | pci_channel_io_frozen, /* I/O to channel is blocked */ | ||
87 | pci_channel_io_perm_failure, /* PCI card is dead */ | ||
88 | }; | ||
89 | |||
90 | Possible return values are: | ||
91 | enum pci_ers_result { | ||
92 | PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */ | ||
93 | PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */ | ||
94 | PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */ | ||
95 | PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */ | ||
96 | PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */ | ||
97 | }; | ||
98 | |||
99 | A driver does not have to implement all of these callbacks; however, | ||
100 | if it implements any, it must implement error_detected(). If a callback | ||
101 | is not implemented, the corresponding feature is considered unsupported. | ||
102 | For example, if mmio_enabled() and resume() aren't there, then it | ||
103 | is assumed that the driver is not doing any direct recovery and requires | ||
104 | a reset. If link_reset() is not implemented, the card is assumed to | ||
105 | not care about link resets. Typically a driver will want to know about | ||
106 | a slot_reset(). | ||
107 | |||
108 | The actual steps taken by a platform to recover from a PCI error | ||
109 | event will be platform-dependent, but will follow the general | ||
110 | sequence described below. | ||
111 | |||
112 | STEP 0: Error Event | ||
113 | ------------------- | ||
114 | A PCI bus error is detected by the PCI hardware. On powerpc, the slot | ||
115 | is isolated, in that all I/O is blocked: all reads return 0xffffffff, | ||
116 | all writes are ignored. | ||
117 | |||
118 | |||
119 | STEP 1: Notification | ||
120 | -------------------- | ||
121 | Platform calls the error_detected() callback on every instance of | ||
122 | every driver affected by the error. | ||
123 | |||
124 | At this point, the device might not be accessible anymore, depending on | ||
125 | the platform (the slot will be isolated on powerpc). The driver may | ||
126 | already have "noticed" the error because of a failing I/O, but this | ||
127 | is the proper "synchronization point", that is, it gives the driver | ||
128 | a chance to cleanup, waiting for pending stuff (timers, whatever, etc...) | ||
129 | to complete; it can take semaphores, schedule, etc... everything but | ||
130 | touch the device. Within this function and after it returns, the driver | ||
131 | shouldn't do any new IOs. Called in task context. This is sort of a | ||
132 | "quiesce" point. See note about interrupts at the end of this doc. | ||
133 | |||
134 | All drivers participating in this system must implement this call. | ||
135 | The driver must return one of the following result codes: | ||
136 | - PCI_ERS_RESULT_CAN_RECOVER: | ||
137 | Driver returns this if it thinks it might be able to recover | ||
138 | the HW by just banging IOs or if it wants to be given | ||
139 | a chance to extract some diagnostic information (see | ||
140 | mmio_enabled, below). | ||
141 | - PCI_ERS_RESULT_NEED_RESET: | ||
142 | Driver returns this if it can't recover without a hard | ||
143 | slot reset. | ||
144 | - PCI_ERS_RESULT_DISCONNECT: | ||
145 | Driver returns this if it doesn't want to recover at all. | ||
146 | |||
147 | The next step taken will depend on the result codes returned by the | ||
148 | drivers. | ||
149 | |||
150 | If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER, | ||
151 | then the platform should re-enable IOs on the slot (or do nothing in | ||
152 | particular, if the platform doesn't isolate slots), and recovery | ||
153 | proceeds to STEP 2 (MMIO Enable). | ||
154 | |||
155 | If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET), | ||
156 | then recovery proceeds to STEP 4 (Slot Reset). | ||
157 | |||
158 | If the platform is unable to recover the slot, the next step | ||
159 | is STEP 6 (Permanent Failure). | ||
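The decision described above can be sketched in plain C as follows. The function after_error_detected, the enum next_step and the platform_can_reset flag are hypothetical names introduced for this illustration. Treating every combination the text does not list (for example a driver returning PCI_ERS_RESULT_DISCONNECT) as a permanent failure is an assumption of the sketch, not something the text specifies.

```c
#include <assert.h>
#include <stddef.h>

enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED,
};

enum next_step {			/* hypothetical names for the STEPs */
	STEP_MMIO_ENABLED = 2,
	STEP_SLOT_RESET = 4,
	STEP_PERMANENT_FAILURE = 6,
};

/* Combine the error_detected() results of all drivers on a segment. */
static enum next_step after_error_detected(const enum pci_ers_result *res,
					   size_t n, int platform_can_reset)
{
	int all_can_recover = 1;
	int any_need_reset = 0;
	size_t i;

	assert(res != NULL && n > 0);
	for (i = 0; i < n; i++) {
		if (res[i] != PCI_ERS_RESULT_CAN_RECOVER)
			all_can_recover = 0;
		if (res[i] == PCI_ERS_RESULT_NEED_RESET)
			any_need_reset = 1;
	}
	if (all_can_recover)
		return STEP_MMIO_ENABLED;
	if (any_need_reset && platform_can_reset)
		return STEP_SLOT_RESET;
	return STEP_PERMANENT_FAILURE;	/* assumption: everything else fails */
}
```

A single driver asking for a reset is enough to take the whole segment down the slot reset path, which matches the rule that a full reset is done if any driver requests one.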
160 | |||
161 | >>> The current powerpc implementation assumes that a device driver will | ||
162 | >>> *not* schedule or semaphore in this routine; the current powerpc | ||
163 | >>> implementation uses one kernel thread to notify all devices; | ||
164 | >>> thus, if one device sleeps/schedules, all devices are affected. | ||
165 | >>> Doing better requires complex multi-threaded logic in the error | ||
166 | >>> recovery implementation (e.g. waiting for all notification threads | ||
167 | >>> to "join" before proceeding with recovery.) This seems excessively | ||
168 | >>> complex and not worth implementing. | ||
169 | |||
170 | >>> The current powerpc implementation doesn't much care if the device | ||
171 | >>> attempts I/O at this point, or not. I/O's will fail, returning | ||
172 | >>> all ones (0xff in every byte) on read, and writes will be dropped. If the device | ||
173 | >>> driver attempts more than 10K I/O's to a frozen adapter, the platform will | ||
174 | >>> assume that the device driver has gone into an infinite loop, and | ||
175 | >>> it will panic the kernel. There doesn't seem to be any other | ||
176 | >>> way of stopping a device driver that insists on spinning on I/O. | ||
177 | |||
178 | STEP 2: MMIO Enabled | ||
179 | -------------------- | ||
180 | The platform re-enables MMIO to the device (but typically not the | ||
181 | DMA), and then calls the mmio_enabled() callback on all affected | ||
182 | device drivers. | ||
183 | |||
184 | This is the "early recovery" call. IOs are allowed again, but DMA is | ||
185 | not (hrm... to be discussed, I prefer not), with some restrictions. This | ||
186 | is NOT a callback for the driver to start operations again, only to | ||
187 | peek/poke at the device, extract diagnostic information, if any, and | ||
188 | eventually do things like trigger a device local reset or some such, | ||
189 | but not restart operations. This callback is made if all drivers on | ||
190 | a segment agree that they can try to recover and if no automatic link reset | ||
191 | was performed by the HW. If the platform can't just re-enable IOs without | ||
192 | a slot reset or a link reset, it won't call this callback, and instead | ||
193 | will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset). | ||
194 | |||
195 | >>> The following is proposed; no platform implements this yet: | ||
196 | >>> Proposal: All I/O's should be done _synchronously_ from within | ||
197 | >>> this callback, errors triggered by them will be returned via | ||
198 | >>> the normal pci_check_whatever() API, no new error_detected() | ||
199 | >>> callback will be issued due to an error happening here. However, | ||
200 | >>> such an error might cause IOs to be re-blocked for the whole | ||
201 | >>> segment, and thus invalidate the recovery that other devices | ||
202 | >>> on the same segment might have done, forcing the whole segment | ||
203 | >>> into one of the next states, that is, link reset or slot reset. | ||
204 | |||
205 | The driver should return one of the following result codes: | ||
206 | - PCI_ERS_RESULT_RECOVERED | ||
207 | Driver returns this if it thinks the device is fully | ||
208 | functional and thinks it is ready to start | ||
209 | normal driver operations again. There is no | ||
210 | guarantee that the driver will actually be | ||
211 | allowed to proceed, as another driver on the | ||
212 | same segment might have failed and thus triggered a | ||
213 | slot reset on platforms that support it. | ||
214 | |||
215 | - PCI_ERS_RESULT_NEED_RESET | ||
216 | Driver returns this if it thinks the device is not | ||
217 | recoverable in its current state and it needs a slot | ||
218 | reset to proceed. | ||
219 | |||
220 | - PCI_ERS_RESULT_DISCONNECT | ||
221 | Same as above. Total failure, no recovery even after | ||
222 | reset; driver dead. (To be defined more precisely) | ||
223 | |||
224 | The next step taken depends on the results returned by the drivers. | ||
225 | If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform | ||
226 | proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations). | ||
227 | |||
228 | If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform | ||
229 | proceeds to STEP 4 (Slot Reset). | ||
230 | |||
231 | >>> The current powerpc implementation does not implement this callback. | ||
232 | |||
233 | |||
234 | STEP 3: Link Reset | ||
235 | ------------------ | ||
236 | The platform resets the link, and then calls the link_reset() callback | ||
237 | on all affected device drivers. This is a PCI-Express specific state | ||
238 | and is done whenever a non-fatal error has been detected that can be | ||
239 | "solved" by resetting the link. This call informs the driver of the | ||
240 | reset and the driver should check to see if the device appears to be | ||
241 | in working condition. | ||
242 | |||
243 | The driver is not supposed to restart normal driver I/O operations | ||
244 | at this point. It should limit itself to "probing" the device to | ||
245 | check its recoverability status. If all is right, then the platform | ||
246 | will call resume() once all drivers have ack'd link_reset(). | ||
247 | |||
248 | Result codes: | ||
249 | (identical to STEP 2 (MMIO Enabled)) | ||
250 | |||
251 | The platform then proceeds to either STEP 4 (Slot Reset) or STEP 5 | ||
252 | (Resume Operations). | ||
253 | |||
254 | >>> The current powerpc implementation does not implement this callback. | ||
255 | |||
256 | |||
257 | STEP 4: Slot Reset | ||
258 | ------------------ | ||
259 | The platform performs a soft or hard reset of the device, and then | ||
260 | calls the slot_reset() callback. | ||
261 | |||
262 | A soft reset consists of asserting the adapter #RST line and then | ||
263 | restoring the PCI BAR's and PCI configuration header to a state | ||
264 | that is equivalent to what it would be after a fresh system | ||
265 | power-on followed by power-on BIOS/system firmware initialization. | ||
266 | If the platform supports PCI hotplug, then the reset might be | ||
267 | performed by toggling the slot electrical power off/on. | ||
268 | |||
269 | It is important for the platform to restore the PCI config space | ||
270 | to the "fresh poweron" state, rather than the "last state". After | ||
271 | a slot reset, the device driver will almost always use its standard | ||
272 | device initialization routines, and an unusual config space setup | ||
273 | may result in hung devices, kernel panics, or silent data corruption. | ||
274 | |||
275 | This call gives drivers the chance to re-initialize the hardware | ||
276 | (re-download firmware, etc.). At this point, the driver may assume | ||
277 | that the card is in a fresh state and is fully functional. In | ||
278 | particular, interrupt generation should work normally. | ||
279 | |||
280 | Drivers should not yet restart normal I/O processing operations | ||
281 | at this point. If all device drivers report success on this | ||
282 | callback, the platform will call resume() to complete the sequence, | ||
283 | and let the driver restart normal I/O processing. | ||
284 | |||
285 | A driver can still return a critical failure for this function if | ||
286 | it can't get the device operational after reset. If the platform | ||
287 | previously tried a soft reset, it might now try a hard reset (power | ||
288 | cycle) and then call slot_reset() again. If the device still can't | ||
289 | be recovered, there is nothing more that can be done; the platform | ||
290 | will typically report a "permanent failure" in such a case. The | ||
291 | device will be considered "dead" in this case. | ||
292 | |||
293 | Drivers for multi-function cards will need to coordinate among | ||
294 | themselves as to which driver instance will perform any "one-shot" | ||
295 | or global device initialization. For example, the Symbios sym53cxx2 | ||
296 | driver performs device init only from PCI function 0: | ||
297 | |||
298 | + if (PCI_FUNC(pdev->devfn) == 0) | ||
299 | + sym_reset_scsi_bus(np, 0); | ||
300 | |||
301 | Result codes: | ||
302 | - PCI_ERS_RESULT_DISCONNECT | ||
303 | Same as above. | ||
304 | |||
305 | Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent | ||
306 | Failure). | ||
307 | |||
308 | >>> The current powerpc implementation does not currently try a | ||
309 | >>> power-cycle reset if the driver returned PCI_ERS_RESULT_DISCONNECT. | ||
310 | >>> However, it probably should. | ||
311 | |||
312 | |||
313 | STEP 5: Resume Operations | ||
314 | ------------------------- | ||
315 | The platform will call the resume() callback on all affected device | ||
316 | drivers if all drivers on the segment have returned | ||
317 | PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks. | ||
318 | The goal of this callback is to tell the driver to restart activity, | ||
319 | that everything is back and running. This callback does not return | ||
320 | a result code. | ||
321 | |||
322 | At this point, if a new error happens, the platform will restart | ||
323 | a new error recovery sequence. | ||
324 | |||
325 | STEP 6: Permanent Failure | ||
326 | ------------------------- | ||
327 | A "permanent failure" has occurred, and the platform cannot recover | ||
328 | the device. The platform will call error_detected() with a | ||
329 | pci_channel_state value of pci_channel_io_perm_failure. | ||
330 | |||
331 | The device driver should, at this point, assume the worst. It should | ||
332 | cancel all pending I/O, refuse all new I/O, returning -EIO to | ||
333 | higher layers. The device driver should then clean up all of its | ||
334 | memory and remove itself from kernel operations, much as it would | ||
335 | during system shutdown. | ||
336 | |||
337 | The platform will typically notify the system operator of the | ||
338 | permanent failure in some way. If the device is hotplug-capable, | ||
339 | the operator will probably want to remove and replace the device. | ||
340 | Note, however, not all failures are truly "permanent". Some are | ||
341 | caused by over-heating, some by a poorly seated card. Many | ||
342 | PCI error events are caused by software bugs, e.g. DMA's to | ||
343 | wild addresses or bogus split transactions due to programming | ||
344 | errors. See the discussion in powerpc/eeh-pci-error-recovery.txt | ||
345 | for additional detail on real-life experience of the causes of | ||
346 | software errors. | ||
347 | |||
348 | |||
349 | Conclusion; General Remarks | ||
350 | --------------------------- | ||
351 | The way those callbacks are called is platform policy. A platform with | ||
352 | no slot reset capability may want to just "ignore" drivers that can't | ||
353 | recover (disconnect them) and try to let other cards on the same segment | ||
354 | recover. Keep in mind that in most real life cases, though, there will | ||
355 | be only one driver per segment. | ||
356 | |||
357 | Now, a note about interrupts. If you get an interrupt and your | ||
358 | device is dead or has been isolated, there is a problem :) | ||
359 | The current policy is to turn this into a platform policy. | ||
360 | That is, the recovery API only requires that: | ||
361 | |||
362 | - There is no guarantee that interrupt delivery can proceed from any | ||
363 | device on the segment starting from the error detection and until the | ||
364 | resume callback is sent, at which point interrupts are expected to be | ||
365 | fully operational. | ||
366 | |||
367 | - There is no guarantee that interrupt delivery is stopped, that is, | ||
368 | a driver that gets an interrupt after detecting an error, or that detects | ||
369 | an error within the interrupt handler such that it prevents proper | ||
370 | ack'ing of the interrupt (and thus removal of the source) should just | ||
371 | return IRQ_NONE. It's up to the platform to deal with that | ||
372 | condition, typically by masking the IRQ source during the duration of | ||
373 | the error handling. It is expected that the platform "knows" which | ||
374 | interrupts are routed to error-management capable slots and can deal | ||
375 | with temporarily disabling that IRQ number during error processing (this | ||
376 | isn't terribly complex). That means some IRQ latency for other devices | ||
377 | sharing the interrupt, but there is simply no other way. High end | ||
378 | platforms aren't supposed to share interrupts between many devices | ||
379 | anyway :) | ||
380 | |||
381 | >>> Implementation details for the powerpc platform are discussed in | ||
382 | >>> the file Documentation/powerpc/eeh-pci-error-recovery.txt | ||
383 | |||
384 | >>> As of this writing, there are six device drivers with patches | ||
385 | >>> implementing error recovery. Not all of these patches are in | ||
386 | >>> mainline yet. These may be used as "examples": | ||
387 | >>> | ||
388 | >>> drivers/scsi/ipr.c | ||
389 | >>> drivers/scsi/sym53c8xx_2 | ||
390 | >>> drivers/net/e100.c | ||
391 | >>> drivers/net/e1000 | ||
392 | >>> drivers/net/ixgb | ||
393 | >>> drivers/net/s2io.c | ||
394 | |||
395 | The End | ||
396 | ------- | ||
diff --git a/Documentation/PCI/pci.txt b/Documentation/PCI/pci.txt new file mode 100644 index 000000000000..8d4dc6250c58 --- /dev/null +++ b/Documentation/PCI/pci.txt | |||
@@ -0,0 +1,646 @@ | |||
1 | |||
2 | How To Write Linux PCI Drivers | ||
3 | |||
4 | by Martin Mares <mj@ucw.cz> on 07-Feb-2000 | ||
5 | updated by Grant Grundler <grundler@parisc-linux.org> on 23-Dec-2006 | ||
6 | |||
7 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
8 | The world of PCI is vast and full of (mostly unpleasant) surprises. | ||
9 | Since each CPU architecture implements different chip-sets and PCI devices | ||
10 | have different requirements (erm, "features"), the result is the PCI support | ||
11 | in the Linux kernel is not as trivial as one would wish. This short paper | ||
12 | tries to introduce all potential driver authors to Linux APIs for | ||
13 | PCI device drivers. | ||
14 | |||
15 | A more complete resource is the third edition of "Linux Device Drivers" | ||
16 | by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. | ||
17 | LDD3 is available for free (under Creative Commons License) from: | ||
18 | |||
19 | http://lwn.net/Kernel/LDD3/ | ||
20 | |||
21 | However, keep in mind that all documents are subject to "bit rot". | ||
22 | Refer to the source code if things are not working as described here. | ||
23 | |||
24 | Please send questions/comments/patches about Linux PCI API to the | ||
25 | "Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list. | ||
26 | |||
27 | |||
28 | |||
29 | 0. Structure of PCI drivers | ||
30 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
31 | PCI drivers "discover" PCI devices in a system via pci_register_driver(). | ||
32 | Actually, it's the other way around. When the PCI generic code discovers | ||
33 | a new device, the driver with a matching "description" will be notified. | ||
34 | Details on this below. | ||
35 | |||
36 | pci_register_driver() leaves most of the probing for devices to | ||
37 | the PCI layer and supports online insertion/removal of devices [thus | ||
38 | supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver]. | ||
39 | pci_register_driver() call requires passing in a table of function | ||
40 | pointers and thus dictates the high level structure of a driver. | ||
41 | |||
42 | Once the driver knows about a PCI device and takes ownership, the | ||
43 | driver generally needs to perform the following initialization: | ||
44 | |||
45 | Enable the device | ||
46 | Request MMIO/IOP resources | ||
47 | Set the DMA mask size (for both coherent and streaming DMA) | ||
48 | Allocate and initialize shared control data (pci_alloc_consistent()) | ||
49 | Access device configuration space (if needed) | ||
50 | Register IRQ handler (request_irq()) | ||
51 | Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) | ||
52 | Enable DMA/processing engines | ||
53 | |||
54 | When done using the device, and perhaps the module needs to be unloaded, | ||
55 | the driver needs to take the following steps: | ||
56 | Disable the device from generating IRQs | ||
57 | Release the IRQ (free_irq()) | ||
58 | Stop all DMA activity | ||
59 | Release DMA buffers (both streaming and coherent) | ||
60 | Unregister from other subsystems (e.g. scsi or netdev) | ||
61 | Release MMIO/IOP resources | ||
62 | Disable the device | ||
63 | |||
64 | Most of these topics are covered in the following sections. | ||
65 | For the rest look at LDD3 or <linux/pci.h> . | ||
66 | |||
67 | If the PCI subsystem is not configured (CONFIG_PCI is not set), most of | ||
68 | the PCI functions described below are defined as inline functions either | ||
69 | completely empty or just returning an appropriate error code to avoid | ||
70 | lots of ifdefs in the drivers. | ||
71 | |||
72 | |||
73 | |||
74 | 1. pci_register_driver() call | ||
75 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
76 | |||
77 | PCI device drivers call pci_register_driver() during their | ||
78 | initialization with a pointer to a structure describing the driver | ||
79 | (struct pci_driver): | ||
80 | |||
81 | field name Description | ||
82 | ---------- ------------------------------------------------------ | ||
83 | id_table Pointer to table of device ID's the driver is | ||
84 | interested in. Most drivers should export this | ||
85 | table using MODULE_DEVICE_TABLE(pci,...). | ||
86 | |||
87 | probe This probing function gets called (during execution | ||
88 | of pci_register_driver() for already existing | ||
89 | devices or later if a new device gets inserted) for | ||
90 | all PCI devices which match the ID table and are not | ||
91 | "owned" by the other drivers yet. This function gets | ||
92 | passed a "struct pci_dev *" for each device whose | ||
93 | entry in the ID table matches the device. The probe | ||
94 | function returns zero when the driver chooses to | ||
95 | take "ownership" of the device or an error code | ||
96 | (negative number) otherwise. | ||
97 | The probe function always gets called from process | ||
98 | context, so it can sleep. | ||
99 | |||
100 | remove The remove() function gets called whenever a device | ||
101 | being handled by this driver is removed (either during | ||
102 | deregistration of the driver or when it's manually | ||
103 | pulled out of a hot-pluggable slot). | ||
104 | The remove function always gets called from process | ||
105 | context, so it can sleep. | ||
106 | |||
107 | suspend Put device into low power state. | ||
108 | suspend_late Put device into low power state. | ||
109 | |||
110 | resume_early Wake device from low power state. | ||
111 | resume Wake device from low power state. | ||
112 | |||
113 | (Please see Documentation/power/pci.txt for descriptions | ||
114 | of PCI Power Management and the related functions.) | ||
115 | |||
116 | shutdown Hook into reboot_notifier_list (kernel/sys.c). | ||
117 | Intended to stop any idling DMA operations. | ||
118 | Useful for enabling wake-on-lan (NIC) or changing | ||
119 | the power state of a device before reboot. | ||
120 | e.g. drivers/net/e100.c. | ||
121 | |||
122 | err_handler See Documentation/PCI/pci-error-recovery.txt | ||
123 | |||
124 | |||
125 | The ID table is an array of struct pci_device_id entries ending with an | ||
126 | all-zero entry; use of the macro DEFINE_PCI_DEVICE_TABLE is the preferred | ||
127 | method of declaring the table. Each entry consists of: | ||
128 | |||
129 | vendor,device Vendor and device ID to match (or PCI_ANY_ID) | ||
130 | |||
131 | subvendor, Subsystem vendor and device ID to match (or PCI_ANY_ID) | ||
132 | subdevice | ||
133 | |||
134 | class Device class, subclass, and "interface" to match. | ||
135 | See Appendix D of the PCI Local Bus Spec or | ||
136 | include/linux/pci_ids.h for a full list of classes. | ||
137 | Most drivers do not need to specify class/class_mask | ||
138 | as vendor/device is normally sufficient. | ||
139 | |||
140 | class_mask Limit which sub-fields of the class field are compared. | ||
141 | See drivers/scsi/sym53c8xx_2/ for example of usage. | ||
142 | |||
143 | driver_data Data private to the driver. | ||
144 | Most drivers don't need to use driver_data field. | ||
145 | Best practice is to use driver_data as an index | ||
146 | into a static list of equivalent device types, | ||
147 | instead of using it as a pointer. | ||
148 | |||
149 | |||
150 | Most drivers only need PCI_DEVICE() or PCI_DEVICE_CLASS() to set up | ||
151 | a pci_device_id table. | ||
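
To make the wildcard and class_mask rules concrete, here is a user-space
model of how one table entry is compared against a device. It is only a
sketch: struct demo_device_id and struct demo_dev are cut-down stand-ins for
struct pci_device_id and struct pci_dev, the demo_* function names are
hypothetical, and PCI_ANY_ID is redefined locally as the same all-ones value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_ANY_ID (~0u)	/* all-ones wildcard, as in the kernel */

/* Stand-in for struct pci_device_id (same field names as described above). */
struct demo_device_id {
	uint32_t vendor, device;
	uint32_t subvendor, subdevice;
	uint32_t class, class_mask;
	unsigned long driver_data;
};

/* Stand-in for the identification fields of struct pci_dev. */
struct demo_dev {
	uint16_t vendor, device;
	uint16_t subsystem_vendor, subsystem_device;
	uint32_t class;		/* base class, subclass, interface */
};

/* An entry matches if every ID is a wildcard or equal, and the class
 * bits selected by class_mask are identical. */
bool demo_match_one(const struct demo_device_id *id,
		    const struct demo_dev *dev)
{
	return (id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
	       (id->device == PCI_ANY_ID || id->device == dev->device) &&
	       (id->subvendor == PCI_ANY_ID ||
		id->subvendor == dev->subsystem_vendor) &&
	       (id->subdevice == PCI_ANY_ID ||
		id->subdevice == dev->subsystem_device) &&
	       !((id->class ^ dev->class) & id->class_mask);
}

/* Walk a table that ends with an all-zero entry; NULL if nothing matches. */
const struct demo_device_id *
demo_match_table(const struct demo_device_id *ids, const struct demo_dev *dev)
{
	for (; ids->vendor || ids->subvendor || ids->class_mask; ids++)
		if (demo_match_one(ids, dev))
			return ids;
	return NULL;
}
```

Note that a class_mask of 0 makes the class comparison always succeed, which
is why entries that only care about vendor/device can leave both class fields
zero.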
152 | |||
153 | New PCI IDs may be added to a device driver pci_ids table at runtime | ||
154 | as shown below: | ||
155 | |||
156 | echo "vendor device subvendor subdevice class class_mask driver_data" > \ | ||
157 | /sys/bus/pci/drivers/{driver}/new_id | ||
158 | |||
159 | All fields are passed in as hexadecimal values (no leading 0x). | ||
160 | The vendor and device fields are mandatory, the others are optional. Users | ||
161 | need pass only as many optional fields as necessary: | ||
162 | o subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF) | ||
163 | o class and class_mask fields default to 0 | ||
164 | o driver_data defaults to 0UL. | ||
165 | |||
166 | Once added, the driver probe routine will be invoked for any unclaimed | ||
167 | PCI devices listed in its (newly updated) pci_ids list. | ||
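
The defaulting rules for new_id can be sketched in user space as follows.
This models the behaviour described above and is not the kernel's own parser;
struct demo_new_id and demo_parse_new_id() are hypothetical names, and the
"at least two fields" check simply encodes the statement that vendor and
device are mandatory.

```c
#include <stdio.h>

#define PCI_ANY_ID (~0u)	/* all-ones wildcard (FFFFFFFF) */

struct demo_new_id {
	unsigned int vendor, device;
	unsigned int subvendor, subdevice;
	unsigned int class, class_mask;
	unsigned long driver_data;
};

/* Parse a new_id style line of hex fields.
 * Returns 0 on success, -1 if vendor or device is missing. */
int demo_parse_new_id(const char *buf, struct demo_new_id *id)
{
	int fields;

	/* Defaults for the optional fields, as listed above. */
	id->vendor = 0;
	id->device = 0;
	id->subvendor = PCI_ANY_ID;
	id->subdevice = PCI_ANY_ID;
	id->class = 0;
	id->class_mask = 0;
	id->driver_data = 0UL;

	/* Fields that are absent from buf keep their defaults. */
	fields = sscanf(buf, "%x %x %x %x %x %x %lx",
			&id->vendor, &id->device,
			&id->subvendor, &id->subdevice,
			&id->class, &id->class_mask, &id->driver_data);

	return fields >= 2 ? 0 : -1;
}
```

For example, the line "8086 1229" yields an entry that matches that
vendor/device pair with any subsystem IDs and any class.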
168 | |||
169 | When the driver exits, it just calls pci_unregister_driver() and the PCI layer | ||
170 | automatically calls the remove hook for all devices handled by the driver. | ||
171 | |||
172 | |||
173 | 1.1 "Attributes" for driver functions/data | ||
174 | |||
175 | Please mark the initialization and cleanup functions where appropriate | ||
176 | (the corresponding macros are defined in <linux/init.h>): | ||
177 | |||
178 | __init Initialization code. Thrown away after the driver | ||
179 | initializes. | ||
180 | __exit Exit code. Ignored for non-modular drivers. | ||
181 | |||
182 | |||
183 | __devinit Device initialization code. | ||
184 | Identical to __init if the kernel is not compiled | ||
185 | with CONFIG_HOTPLUG, normal function otherwise. | ||
186 | __devexit The same for __exit. | ||
187 | |||
188 | Tips on when/where to use the above attributes: | ||
189 | o The module_init()/module_exit() functions (and all | ||
190 | initialization functions called _only_ from these) | ||
191 | should be marked __init/__exit. | ||
192 | |||
193 | o Do not mark the struct pci_driver. | ||
194 | |||
195 | o The ID table array should be marked __devinitconst; this is done | ||
196 | automatically if the table is declared with DEFINE_PCI_DEVICE_TABLE(). | ||
197 | |||
198 | o The probe() and remove() functions should be marked __devinit | ||
199 | and __devexit respectively. All initialization functions | ||
200 | called exclusively by the probe() routine can be marked __devinit. | ||
201 | Ditto for remove() and __devexit. | ||
202 | |||
203 | o If mydriver_remove() is marked with __devexit, then all address | ||
204 | references to mydriver_remove must use __devexit_p(mydriver_remove) | ||
205 | (in the struct pci_driver declaration for example). | ||
206 | __devexit_p() will generate the function name _or_ NULL if the | ||
207 | function will be discarded. For an example, see drivers/net/tg3.c. | ||
208 | |||
209 | o Do NOT mark a function if you are not sure which mark to use. | ||
210 | Better to not mark the function than mark the function wrong. | ||
211 | |||
212 | |||
213 | |||
214 | 2. How to find PCI devices manually | ||
215 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
216 | |||
217 | PCI drivers should have a really good reason for not using the | ||
218 | pci_register_driver() interface to search for PCI devices. | ||
219 | The main reason PCI devices are controlled by multiple drivers | ||
220 | is that one PCI device implements several different HW services, | ||
221 | e.g. a combined serial/parallel port/floppy controller. | ||
222 | |||
223 | A manual search may be performed using the following constructs: | ||
224 | |||
225 | Searching by vendor and device ID: | ||
226 | |||
227 | struct pci_dev *dev = NULL; | ||
228 | while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)) != NULL) | ||
229 | configure_device(dev); | ||
230 | |||
231 | Searching by class ID (iterate in a similar way): | ||
232 | |||
233 | pci_get_class(CLASS_ID, dev) | ||
234 | |||
235 | Searching by both vendor/device and subsystem vendor/device ID: | ||
236 | |||
237 | pci_get_subsys(VENDOR_ID, DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev) | ||
238 | |||
239 | You can use the constant PCI_ANY_ID as a wildcard replacement for | ||
240 | VENDOR_ID or DEVICE_ID. This allows searching for any device from a | ||
241 | specific vendor, for example. | ||
242 | |||
243 | These functions are hotplug-safe. They increment the reference count on | ||
244 | the pci_dev that they return. You must eventually (possibly at module unload) | ||
245 | decrement the reference count on these devices by calling pci_dev_put(). | ||
246 | |||
247 | |||
248 | |||
249 | 3. Device Initialization Steps | ||
250 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
251 | |||
252 | As noted in the introduction, most PCI drivers need the following steps | ||
253 | for device initialization: | ||
254 | |||
255 | Enable the device | ||
256 | Request MMIO/IOP resources | ||
257 | Set the DMA mask size (for both coherent and streaming DMA) | ||
258 | Allocate and initialize shared control data (pci_alloc_consistent()) | ||
259 | Access device configuration space (if needed) | ||
260 | Register IRQ handler (request_irq()) | ||
261 | Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) | ||
262 | Enable DMA/processing engines. | ||
263 | |||
264 | The driver can access PCI config space registers at any time. | ||
265 | (Well, almost. When running BIST, config space can go away...but | ||
266 | that will just result in a PCI Bus Master Abort and config reads | ||
267 | will return garbage). | ||
268 | |||
269 | |||
270 | 3.1 Enable the PCI device | ||
271 | ~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
272 | Before touching any device registers, the driver needs to enable | ||
273 | the PCI device by calling pci_enable_device(). This will: | ||
274 | o wake up the device if it was in suspended state, | ||
275 | o allocate I/O and memory regions of the device (if BIOS did not), | ||
276 | o allocate an IRQ (if BIOS did not). | ||
277 | |||
278 | NOTE: pci_enable_device() can fail! Check the return value. | ||
279 | |||
280 | [ OS BUG: we don't check resource allocations before enabling those | ||
281 | resources. The sequence would make more sense if we called | ||
282 | pci_request_resources() before calling pci_enable_device(). | ||
283 | Currently, the device drivers can't detect the bug when two | ||
284 | devices have been allocated the same range. This is not a common | ||
285 | problem and unlikely to get fixed soon. | ||
286 | |||
287 | This has been discussed before but not changed as of 2.6.19: | ||
288 | http://lkml.org/lkml/2006/3/2/194 | ||
289 | ] | ||
290 | |||
291 | pci_set_master() will enable DMA by setting the bus master bit | ||
292 | in the PCI_COMMAND register. It also fixes the latency timer value if | ||
293 | it's set to something bogus by the BIOS. | ||
294 | |||
295 | If the PCI device can use the PCI Memory-Write-Invalidate transaction, | ||
296 | call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval | ||
297 | and also ensures that the cache line size register is set correctly. | ||
298 | Check the return value of pci_set_mwi() as not all architectures | ||
299 | or chip-sets may support Memory-Write-Invalidate. Alternatively, | ||
300 | if Mem-Wr-Inval would be nice to have but is not required, call | ||
301 | pci_try_set_mwi() to have the system do its best effort at enabling | ||
302 | Mem-Wr-Inval. | ||
303 | |||
304 | |||
305 | 3.2 Request MMIO/IOP resources | ||
306 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
307 | Memory (MMIO) and I/O port addresses should NOT be read directly | ||
308 | from the PCI device config space. Use the values in the pci_dev structure | ||
309 | as the PCI "bus address" might have been remapped to a "host physical" | ||
310 | address by the arch/chip-set specific kernel support. | ||
311 | |||
312 | See Documentation/IO-mapping.txt for how to access device registers | ||
313 | or device memory. | ||
314 | |||
315 | The device driver needs to call pci_request_region() to verify | ||
316 | no other device is already using the same address resource. | ||
317 | Conversely, drivers should call pci_release_region() AFTER | ||
318 | calling pci_disable_device(). | ||
319 | The idea is to prevent two devices colliding on the same address range. | ||
320 | |||
321 | [ See OS BUG comment above. Currently (2.6.19), the driver can only | ||
322 | determine MMIO and IO Port resource availability _after_ calling | ||
323 | pci_enable_device(). ] | ||
324 | |||
325 | Generic flavors of pci_request_region() are request_mem_region() | ||
326 | (for MMIO ranges) and request_region() (for IO Port ranges). | ||
327 | Use these for address resources that are not described by "normal" PCI | ||
328 | BARs. | ||
329 | |||
330 | Also see pci_request_selected_regions() below. | ||
331 | |||
332 | |||
333 | 3.3 Set the DMA mask size | ||
334 | ~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
335 | [ If anything below doesn't make sense, please refer to | ||
336 | Documentation/DMA-API.txt. This section is just a reminder that | ||
337 | drivers need to indicate DMA capabilities of the device and is not | ||
338 | an authoritative source for DMA interfaces. ] | ||
339 | |||
340 | While all drivers should explicitly indicate the DMA capability | ||
341 | (e.g. 32 or 64 bit) of the PCI bus master, devices with more than | ||
342 | 32-bit bus master capability for streaming data need the driver | ||
343 | to "register" this capability by calling pci_set_dma_mask() with | ||
344 | appropriate parameters. In general this allows more efficient DMA | ||
345 | on systems where System RAM exists above 4G _physical_ address. | ||
346 | |||
347 | Drivers for all PCI-X and PCIe compliant devices must call | ||
348 | pci_set_dma_mask() as they are 64-bit DMA devices. | ||
349 | |||
350 | Similarly, drivers must also "register" this capability if the device | ||
351 | can directly address "consistent memory" in System RAM above 4G physical | ||
352 | address by calling pci_set_consistent_dma_mask(). | ||
353 | Again, this includes drivers for all PCI-X and PCIe compliant devices. | ||
354 | Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are | ||
355 | 64-bit DMA capable for payload ("streaming") data but not control | ||
356 | ("consistent") data. | ||
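
As a reminder of what a DMA mask actually expresses, here is a small
user-space sketch. A mask of n bits means the device can only generate bus
addresses that fit in n bits, so a buffer is reachable only if its last byte
lies at or below the mask. DEMO_DMA_BIT_MASK() and demo_dma_addressable() are
hypothetical names used for illustration. In a driver the mask value would be
passed to pci_set_dma_mask() and pci_set_consistent_dma_mask(), and the DMA
API performs this kind of check internally.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build an n-bit mask; 64 is special-cased to avoid an undefined shift. */
#define DEMO_DMA_BIT_MASK(n) \
	(((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* True if the whole buffer [bus_addr, bus_addr + len) is reachable
 * by a device limited to the given mask. */
bool demo_dma_addressable(uint64_t bus_addr, uint64_t len, uint64_t mask)
{
	uint64_t last;

	if (len == 0)
		return false;
	last = bus_addr + len - 1;
	if (last < bus_addr)		/* address arithmetic wrapped */
		return false;
	return last <= mask;
}
```

A page ending exactly at 4G is still reachable with a 32-bit mask, while a
buffer that extends one byte further is not; that is the case where a 32-bit
device needs bounce buffering or an IOMMU and a 64-bit capable device does
not.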
357 | |||
358 | |||
359 | 3.4 Setup shared control data | ||
360 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
361 | Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared) | ||
362 | memory. See Documentation/DMA-API.txt for a full description of | ||
363 | the DMA APIs. This section is just a reminder that it needs to be done | ||
364 | before enabling DMA on the device. | ||
365 | |||
366 | |||
367 | 3.5 Initialize device registers | ||
368 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
369 | Some drivers will need specific "capability" fields programmed | ||
370 | or other "vendor specific" register initialized or reset. | ||
371 | E.g. clearing pending interrupts. | ||
372 | |||
373 | |||
374 | 3.6 Register IRQ handler | ||
375 | ~~~~~~~~~~~~~~~~~~~~~~~~ | ||
376 | While calling request_irq() is the last step described here, | ||
377 | this is often just another intermediate step to initialize a device. | ||
378 | This step can often be deferred until the device is opened for use. | ||
379 | |||
380 | All interrupt handlers for IRQ lines should be registered with IRQF_SHARED | ||
381 | and use the devid to map IRQs to devices (remember that all PCI IRQ lines | ||
382 | can be shared). | ||
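
The shared-line rule can be illustrated with a user-space model. It is a
sketch under stated assumptions: irqreturn_t is a local stand-in for the
kernel type, struct demo_dev and both demo_* functions are hypothetical, and
demo_dispatch_shared() only imitates the way every handler registered on a
shared line gets called with its own dev_id.

```c
#include <stddef.h>

/* Stand-in for the kernel's irqreturn_t values. */
typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;

struct demo_dev {
	unsigned int irq_status;	/* models the device's status register */
	unsigned int handled;		/* interrupts serviced so far */
};

/* Handler: dev_id identifies which device this registration belongs to. */
irqreturn_t demo_interrupt(int irq, void *dev_id)
{
	struct demo_dev *dev = dev_id;

	(void)irq;
	if (!dev->irq_status)
		return IRQ_NONE;	/* some other device on the line */

	dev->irq_status = 0;		/* acknowledge the source */
	dev->handled++;
	return IRQ_HANDLED;
}

/* Model of a shared line: every registered handler is called in turn.
 * Returns how many handlers claimed the interrupt. */
int demo_dispatch_shared(int irq, struct demo_dev *devs[], size_t n)
{
	int claimed = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (demo_interrupt(irq, devs[i]) == IRQ_HANDLED)
			claimed++;
	return claimed;
}
```

The important point is that a handler on a shared line must check its own
device before doing any work, and must report IRQ_NONE when the interrupt was
not its own.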
383 | |||
384 | request_irq() will associate an interrupt handler and device handle | ||
385 | with an interrupt number. Historically interrupt numbers represent | ||
386 | IRQ lines which run from the PCI device to the Interrupt controller. | ||
387 | With MSI and MSI-X (more below) the interrupt number is a CPU "vector". | ||
388 | |||
389 | request_irq() also enables the interrupt. Make sure the device is | ||
390 | quiesced and does not have any interrupts pending before registering | ||
391 | the interrupt handler. | ||
392 | |||
393 | MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts" | ||
394 | which deliver interrupts to the CPU via a DMA write to a Local APIC. | ||
395 | The fundamental difference between MSI and MSI-X is how multiple | ||
396 | "vectors" get allocated. MSI requires contiguous blocks of vectors | ||
397 | while MSI-X can allocate several individual ones. | ||
398 | |||
399 | MSI capability can be enabled by calling pci_enable_msi() or | ||
400 | pci_enable_msix() before calling request_irq(). This causes | ||
401 | the PCI support to program CPU vector data into the PCI device | ||
402 | capability registers. | ||
403 | |||
404 | If your PCI device supports both, try to enable MSI-X first. | ||
405 | Only one can be enabled at a time. Many architectures, chip-sets, | ||
406 | or BIOSes do NOT support MSI or MSI-X and the call to pci_enable_msi/msix | ||
407 | will fail. This is important to note since many drivers have | ||
408 | two (or more) interrupt handlers: one for MSI/MSI-X and another for IRQs. | ||
409 | They choose which handler to register with request_irq() based on the | ||
410 | return value from pci_enable_msi/msix(). | ||
411 | |||
412 | There are (at least) two really good reasons for using MSI: | ||
413 | 1) MSI is an exclusive interrupt vector by definition. | ||
414 | This means the interrupt handler doesn't have to verify | ||
415 | its device caused the interrupt. | ||
416 | |||
417 | 2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed | ||
418 | to be visible to the host CPU(s) when the MSI is delivered. This | ||
419 | is important for both data coherency and avoiding stale control data. | ||
420 | This guarantee allows the driver to omit MMIO reads to flush | ||
421 | the DMA stream. | ||
422 | |||
423 | See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples | ||
424 | of MSI/MSI-X usage. | ||
425 | |||
426 | |||
427 | |||
428 | 4. PCI device shutdown | ||
429 | ~~~~~~~~~~~~~~~~~~~~~~~ | ||
430 | |||
431 | When a PCI device driver is being unloaded, most of the following | ||
432 | steps need to be performed: | ||
433 | |||
434 | Disable the device from generating IRQs | ||
435 | Release the IRQ (free_irq()) | ||
436 | Stop all DMA activity | ||
437 | Release DMA buffers (both streaming and consistent) | ||
438 | Unregister from other subsystems (e.g. scsi or netdev) | ||
439 | Disable device from responding to MMIO/IO Port addresses | ||
440 | Release MMIO/IO Port resource(s) | ||
441 | |||
442 | |||
443 | 4.1 Stop IRQs on the device | ||
444 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
445 | How to do this is chip/device specific. If it's not done, it opens | ||
446 | the possibility of a "screaming interrupt" if (and only if) | ||
447 | the IRQ is shared with another device. | ||
448 | |||
449 | When the shared IRQ handler is "unhooked", the remaining devices | ||
450 | using the same IRQ line will still need the IRQ enabled. Thus if the | ||
451 | "unhooked" device asserts IRQ line, the system will respond assuming | ||
452 | that one of the remaining devices asserted the IRQ line. Since none | ||
453 | of the other devices will handle the IRQ, the system will "hang" until | ||
454 | it decides the IRQ isn't going to get handled and masks the IRQ (100,000 | ||
455 | iterations later). Once the shared IRQ is masked, the remaining devices | ||
456 | will stop functioning properly. Not a nice situation. | ||
457 | |||
458 | This is another reason to use MSI or MSI-X if it's available. | ||
459 | MSI and MSI-X are defined to be exclusive interrupts and thus | ||
460 | are not susceptible to the "screaming interrupt" problem. | ||
461 | |||
462 | |||
463 | 4.2 Release the IRQ | ||
464 | ~~~~~~~~~~~~~~~~~~~ | ||
465 | Once the device is quiesced (no more IRQs), one can call free_irq(). | ||
466 | This function will return control once any pending IRQs are handled, | ||
467 | "unhook" the driver's IRQ handler from that IRQ, and finally release | ||
468 | the IRQ if no one else is using it. | ||
469 | |||
470 | |||
471 | 4.3 Stop all DMA activity | ||
472 | ~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
473 | It's extremely important to stop all DMA operations BEFORE attempting | ||
474 | to deallocate DMA control data. Failure to do so can result in memory | ||
475 | corruption, hangs, and on some chip-sets a hard crash. | ||
476 | |||
477 | Stopping DMA after stopping the IRQs can avoid races where the | ||
478 | IRQ handler might restart DMA engines. | ||
479 | |||
480 | While this step sounds obvious and trivial, several "mature" drivers | ||
481 | didn't get this step right in the past. | ||
482 | |||
483 | |||
484 | 4.4 Release DMA buffers | ||
485 | ~~~~~~~~~~~~~~~~~~~~~~~ | ||
486 | Once DMA is stopped, clean up streaming DMA first. | ||
487 | I.e. unmap data buffers and return buffers to "upstream" | ||
488 | owners if there is one. | ||
489 | |||
490 | Then clean up "consistent" buffers which contain the control data. | ||
491 | |||
492 | See Documentation/DMA-API.txt for details on unmapping interfaces. | ||
493 | |||
494 | |||
495 | 4.5 Unregister from other subsystems | ||
496 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
497 | Most low level PCI device drivers support some other subsystem | ||
498 | like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your | ||
499 | driver isn't losing resources from that other subsystem. | ||
500 | If this happens, typically the symptom is an Oops (panic) when | ||
501 | the subsystem attempts to call into a driver that has been unloaded. | ||
502 | |||
503 | |||
504 | 4.6 Disable Device from responding to MMIO/IO Port addresses | ||
505 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
506 | iounmap() MMIO or IO Port resources and then call pci_disable_device(). | ||
507 | This is the symmetric opposite of pci_enable_device(). | ||
508 | Do not access device registers after calling pci_disable_device(). | ||
509 | |||
510 | |||
511 | 4.7 Release MMIO/IO Port Resource(s) | ||
512 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
513 | Call pci_release_region() to mark the MMIO or IO Port range as available. | ||
514 | Failure to do so usually results in the inability to reload the driver. | ||
515 | |||
516 | |||
517 | |||
518 | 5. How to access PCI config space | ||
519 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
520 | |||
521 | You can use pci_(read|write)_config_(byte|word|dword) to access the config | ||
522 | space of a device represented by struct pci_dev *. All these functions return 0 | ||
523 | when successful or an error code (PCIBIOS_...) which can be translated to a text | ||
524 | string by pcibios_strerror. Most drivers expect that accesses to valid PCI | ||
525 | devices don't fail. | ||
526 | |||
527 | If you don't have a struct pci_dev available, you can call | ||
528 | pci_bus_(read|write)_config_(byte|word|dword) to access a given device | ||
529 | and function on that bus. | ||
530 | |||
531 | If you access fields in the standard portion of the config header, please | ||
532 | use symbolic names of locations and bits declared in <linux/pci.h>. | ||
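
To show why symbolic names matter, here is a user-space model of the first
few registers of the standard header. Everything prefixed demo_ is
hypothetical and operates on a plain byte array rather than real hardware;
in a driver the equivalent accesses would go through
pci_read_config_word() and pci_write_config_word(). The register offsets and
the bus master bit are redefined locally with the standard values so the
sketch builds outside the kernel.

```c
#include <stdint.h>

/* Standard header offsets and the bus master bit in the command register. */
#define PCI_VENDOR_ID		0x00	/* 16 bits */
#define PCI_DEVICE_ID		0x02	/* 16 bits */
#define PCI_COMMAND		0x04	/* 16 bits */
#define PCI_COMMAND_MASTER	0x4	/* enable bus mastering */

/* Model of one function's 256-byte config space (little-endian). */
struct demo_cfg {
	uint8_t bytes[256];
};

/* Returns 0 on success, -1 for an out-of-range or unaligned offset. */
int demo_read_config_word(const struct demo_cfg *cfg, int where, uint16_t *val)
{
	if (where < 0 || where > 254 || (where & 1))
		return -1;
	*val = (uint16_t)(cfg->bytes[where] | (cfg->bytes[where + 1] << 8));
	return 0;
}

int demo_write_config_word(struct demo_cfg *cfg, int where, uint16_t val)
{
	if (where < 0 || where > 254 || (where & 1))
		return -1;
	cfg->bytes[where] = (uint8_t)(val & 0xff);
	cfg->bytes[where + 1] = (uint8_t)(val >> 8);
	return 0;
}

/* Read-modify-write of PCI_COMMAND that sets the bus master bit. */
int demo_set_master(struct demo_cfg *cfg)
{
	uint16_t cmd;

	if (demo_read_config_word(cfg, PCI_COMMAND, &cmd))
		return -1;
	if (cmd & PCI_COMMAND_MASTER)
		return 0;		/* already enabled */
	return demo_write_config_word(cfg, PCI_COMMAND,
				      cmd | PCI_COMMAND_MASTER);
}
```

Using PCI_COMMAND and PCI_COMMAND_MASTER instead of 0x04 and 0x4 keeps the
read-modify-write self-describing, which is the reason for the advice above.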
533 | |||
534 | If you need to access Extended PCI Capability registers, just call | ||
535 | pci_find_capability() for the particular capability and it will find the | ||
536 | corresponding register block for you. | ||
537 | |||
538 | |||
539 | |||
540 | 6. Other interesting functions | ||
541 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
542 | |||
543 | pci_find_slot() Find pci_dev corresponding to given bus and | ||
544 | slot numbers. | ||
545 | pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3) | ||
546 | pci_find_capability() Find specified capability in device's capability | ||
547 | list. | ||
548 | pci_resource_start() Returns bus start address for a given PCI region | ||
549 | pci_resource_end() Returns bus end address for a given PCI region | ||
550 | pci_resource_len() Returns the byte length of a PCI region | ||
551 | pci_set_drvdata() Set private driver data pointer for a pci_dev | ||
552 | pci_get_drvdata() Return private driver data pointer for a pci_dev | ||
553 | pci_set_mwi() Enable Memory-Write-Invalidate transactions. | ||
554 | pci_clear_mwi() Disable Memory-Write-Invalidate transactions. | ||
555 | |||
556 | |||
557 | |||
558 | 7. Miscellaneous hints | ||
559 | ~~~~~~~~~~~~~~~~~~~~~~ | ||
560 | |||
561 | When displaying PCI device names to the user (for example when a driver wants | ||
562 | to tell the user what card it has found), please use pci_name(pci_dev). | ||
563 | |||
564 | Always refer to the PCI devices by a pointer to the pci_dev structure. | ||
565 | All PCI layer functions use this identification and it's the only | ||
566 | reasonable one. Don't use bus/slot/function numbers except for very | ||
567 | special purposes -- on systems with multiple primary buses their semantics | ||
568 | can be pretty complex. | ||
569 | |||
570 | Don't try to turn on Fast Back to Back writes in your driver. All devices | ||
571 | on the bus need to be capable of doing it, so this is something which needs | ||
572 | to be handled by platform and generic code, not individual drivers. | ||
573 | |||
574 | |||
575 | |||
576 | 8. Vendor and device identifications | ||
577 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
578 | |||
579 | One is not required to add new device ids to include/linux/pci_ids.h. | ||
580 | Please add PCI_VENDOR_ID_xxx for vendors and a hex constant for device ids. | ||
581 | |||
582 | PCI_VENDOR_ID_xxx constants are re-used. The device ids are arbitrary | ||
583 | hex numbers (vendor controlled) and normally used only in a single | ||
584 | location, the pci_device_id table. | ||
585 | |||
586 | Please DO submit new vendor/device ids to the pciids.sourceforge.net project. | ||
587 | |||
588 | |||
589 | |||
590 | 9. Obsolete functions | ||
591 | ~~~~~~~~~~~~~~~~~~~~~ | ||
592 | |||
593 | There are several functions which you might come across when trying to | ||
594 | port an old driver to the new PCI interface. They are no longer present | ||
595 | in the kernel as they aren't compatible with hotplug, PCI domains, or | ||
596 | sane locking. | ||
597 | |||
598 | pci_find_device() Superseded by pci_get_device() | ||
599 | pci_find_subsys() Superseded by pci_get_subsys() | ||
600 | pci_find_slot() Superseded by pci_get_slot() | ||
601 | |||
602 | |||
603 | The alternative is the traditional PCI device driver that walks PCI | ||
604 | device lists. This is still possible but discouraged. | ||
605 | |||
606 | |||
607 | |||
608 | 10. MMIO Space and "Write Posting" | ||
609 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
610 | |||
611 | Converting a driver from using I/O Port space to using MMIO space | ||
612 | often requires some additional changes. Specifically, "write posting" | ||
613 | needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2) | ||
614 | already do this. I/O Port space guarantees write transactions reach the PCI | ||
615 | device before the CPU can continue. Writes to MMIO space allow the CPU | ||
616 | to continue before the transaction reaches the PCI device. HW weenies | ||
617 | call this "Write Posting" because the write completion is "posted" to | ||
618 | the CPU before the transaction has reached its destination. | ||
619 | |||
620 | Thus, timing sensitive code should add readl() where the CPU is | ||
621 | expected to wait before doing other work. The classic "bit banging" | ||
622 | sequence works fine for I/O Port space: | ||
623 | |||
624 | for (i = 8; i--; val >>= 1) { /* 8 iterations, one per bit */ | ||
625 | outb(val & 1, ioport_reg); /* write bit */ | ||
626 | udelay(10); | ||
627 | } | ||
628 | |||
629 | The same sequence for MMIO space should be: | ||
630 | |||
631 | for (i = 8; i--; val >>= 1) { /* 8 iterations, one per bit */ | ||
632 | writeb(val & 1, mmio_reg); /* write bit */ | ||
633 | readb(safe_mmio_reg); /* flush posted write */ | ||
634 | udelay(10); | ||
635 | } | ||
636 | |||
637 | It is important that "safe_mmio_reg" not have any side effects that | ||
638 | interfere with the correct operation of the device. | ||
639 | |||
640 | Another case to watch out for is when resetting a PCI device. Use PCI | ||
641 | Configuration space reads to flush the writel(). This will gracefully | ||
642 | handle the PCI master abort on all platforms if the PCI device is | ||
643 | expected to not respond to a readl(). Most x86 platforms will allow | ||
644 | MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage | ||
645 | (e.g. ~0). But many RISC platforms will crash (a.k.a. "Hard Fail"). | ||
646 | |||
diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt new file mode 100644 index 000000000000..16c251230c82 --- /dev/null +++ b/Documentation/PCI/pcieaer-howto.txt | |||
@@ -0,0 +1,253 @@ | |||
1 | The PCI Express Advanced Error Reporting Driver Guide HOWTO | ||
2 | T. Long Nguyen <tom.l.nguyen@intel.com> | ||
3 | Yanmin Zhang <yanmin.zhang@intel.com> | ||
4 | 07/29/2006 | ||
5 | |||
6 | |||
7 | 1. Overview | ||
8 | |||
9 | 1.1 About this guide | ||
10 | |||
11 | This guide describes the basics of the PCI Express Advanced Error | ||
12 | Reporting (AER) driver and provides information on how to use it, as | ||
13 | well as how to enable the drivers of endpoint devices to conform with | ||
14 | the PCI Express AER driver. | ||
15 | |||
16 | 1.2 Copyright © Intel Corporation 2006. | ||
17 | |||
18 | 1.3 What is the PCI Express AER Driver? | ||
19 | |||
20 | PCI Express error signaling can occur on the PCI Express link itself | ||
21 | or on behalf of transactions initiated on the link. PCI Express | ||
22 | defines two error reporting paradigms: the baseline capability and | ||
23 | the Advanced Error Reporting capability. The baseline capability is | ||
24 | required of all PCI Express components providing a minimum defined | ||
25 | set of error reporting requirements. Advanced Error Reporting | ||
26 | capability is implemented with a PCI Express advanced error reporting | ||
27 | extended capability structure providing more robust error reporting. | ||
28 | |||
29 | The PCI Express AER driver provides the infrastructure to support PCI | ||
30 | Express Advanced Error Reporting capability. The PCI Express AER | ||
31 | driver provides three basic functions: | ||
32 | |||
33 | - Gathers comprehensive error information when errors occur. | ||
34 | - Reports errors to users. | ||
35 | - Performs error recovery actions. | ||
36 | |||
37 | The AER driver attaches only to Root Ports that support the PCI Express | ||
38 | AER capability. | ||
39 | |||
40 | |||
41 | 2. User Guide | ||
42 | |||
43 | 2.1 Include the PCI Express AER Root Driver into the Linux Kernel | ||
44 | |||
45 | The PCI Express AER Root driver is a Root Port service driver attached | ||
46 | to the PCI Express Port Bus driver. To use the driver, it has to be | ||
47 | compiled. Option CONFIG_PCIEAER enables this capability. It | ||
48 | depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and | ||
49 | CONFIG_PCIEAER=y. | ||
50 | |||
51 | 2.2 Load PCI Express AER Root Driver | ||
52 | Some systems have AER support in the BIOS. Enabling the AER | ||
53 | Root driver while the BIOS also handles AER may result in unpredictable | ||
54 | behavior. To avoid this conflict, a successful load of the AER Root driver | ||
55 | requires ACPI _OSC support in the BIOS to allow the AER Root driver to | ||
56 | request native control of AER. See the PCI FW 3.0 Specification for | ||
57 | details regarding _OSC usage. Currently, much firmware does not provide | ||
58 | _OSC support even though it uses PCI Express. To support such firmware, | ||
59 | forceload, a parameter of type bool, allows AER initialization to | ||
60 | proceed even though the firmware has no _OSC support. To enable the | ||
61 | workaround, please add aerdriver.forceload=y to the kernel boot parameter | ||
62 | line. Note that forceload=n by default. | ||
63 | |||
64 | 2.3 AER error output | ||
65 | When a PCI-E AER error is captured, an error message is printed to the | ||
66 | console. If it's a correctable error, it is printed as a warning. | ||
67 | Otherwise, it is printed as an error. Users can therefore choose a | ||
68 | log level that filters out correctable error messages. | ||
69 | |||
70 | An example is shown below. | ||
71 | +------ PCI-Express Device Error -----+ | ||
72 | Error Severity : Uncorrected (Fatal) | ||
73 | PCIE Bus Error type : Transaction Layer | ||
74 | Unsupported Request : First | ||
75 | Requester ID : 0500 | ||
76 | VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h | ||
77 | TLB Header: | ||
78 | 04000001 00200a03 05010000 00050100 | ||
79 | |||
80 | In the example, 'Requester ID' means the ID of the device that sent | ||
81 | the error message to the root port. Please refer to the PCI Express | ||
82 | specifications for the other fields. | ||
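
The Requester ID in the example output can be decoded by hand. As a
minimal sketch, assuming the standard PCI Express layout of the 16-bit
ID (bus number in bits 15:8, device number in bits 7:3, function number
in bits 2:0), the standalone C helper below splits the value into its
parts. The names struct requester_id, decode_requester_id and
check_example are hypothetical and exist only for illustration; they
are not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type for illustration; not a kernel structure. */
struct requester_id {
	uint8_t bus;		/* bits 15:8 */
	uint8_t device;		/* bits 7:3 */
	uint8_t function;	/* bits 2:0 */
};

/* Split a 16-bit Requester ID into bus, device and function numbers. */
struct requester_id decode_requester_id(uint16_t id)
{
	struct requester_id r;

	r.bus = (uint8_t)(id >> 8);
	r.device = (uint8_t)((id >> 3) & 0x1f);
	r.function = (uint8_t)(id & 0x07);
	return r;
}

/* Check the value from the sample message: Requester ID 0500. */
void check_example(void)
{
	struct requester_id r = decode_requester_id(0x0500);

	assert(r.bus == 0x05);
	assert(r.device == 0x00);
	assert(r.function == 0x00);
}
```

For the sample message, 0500 decodes to Bus=05h, Device=00h,
Function=00h, which agrees with the line printed under the Requester ID.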
83 | |||
84 | |||
85 | 3. Developer Guide | ||
86 | |||
87 | Making a device AER aware requires its software driver to configure | ||
88 | the AER capability structure within its device and to provide callbacks. | ||
89 | |||
90 | To support AER better, developers first need to understand how AER | ||
91 | works. | ||
92 | |||
93 | PCI Express errors are classified into two types: correctable errors | ||
94 | and uncorrectable errors. This classification is based on the impacts | ||
95 | of those errors, which may result in degraded performance or function | ||
96 | failure. | ||
97 | |||
98 | Correctable errors have no impact on the functionality of the | ||
99 | interface. The PCI Express protocol can recover without any software | ||
100 | intervention or any loss of data. These errors are detected and | ||
101 | corrected by hardware. Unlike correctable errors, uncorrectable | ||
102 | errors impact functionality of the interface. Uncorrectable errors | ||
103 | can cause a particular transaction or a particular PCI Express link | ||
104 | to be unreliable. Depending on those error conditions, uncorrectable | ||
105 | errors are further classified into non-fatal errors and fatal errors. | ||
106 | Non-fatal errors cause the particular transaction to be unreliable, | ||
107 | but the PCI Express link itself is fully functional. Fatal errors, on | ||
108 | the other hand, cause the link to be unreliable. | ||
109 | |||
110 | When AER is enabled, a PCI Express device will automatically send an | ||
111 | error message to the PCIE root port above it when the device captures | ||
112 | an error. The Root Port, upon receiving an error reporting message, | ||
113 | internally processes and logs the error message in its PCI Express | ||
114 | capability structure. Logging the error includes storing | ||
115 | the error reporting agent's requester ID into the Error Source | ||
116 | Identification Registers and setting the error bits of the Root Error | ||
117 | Status Register accordingly. If AER error reporting is enabled in Root | ||
118 | Error Command Register, the Root Port generates an interrupt if an | ||
119 | error is detected. | ||
120 | |||
121 | Note that the errors as described above are related to the PCI Express | ||
122 | hierarchy and links. These errors do not include any device specific | ||
123 | errors because device specific errors will still get sent directly to | ||
124 | the device driver. | ||
125 | |||
126 | 3.1 Configure the AER capability structure | ||
127 | |||
128 | AER aware drivers of PCI Express components need to change the device | ||
129 | control registers to enable AER. They can also change AER registers, | ||
130 | including mask and severity registers. The helper function | ||
131 | pci_enable_pcie_error_reporting can be used to enable AER. See | ||
132 | section 3.3. | ||
133 | |||
134 | 3.2 Provide callbacks | ||
135 | |||
136 | 3.2.1 callback reset_link to reset PCI Express link | ||
137 | |||
138 | This callback is used to reset the PCI Express physical link when a | ||
139 | fatal error happens. The root port AER service driver provides a | ||
140 | default reset_link function, but different upstream ports might | ||
141 | have different specifications to reset the PCI Express link, so all | ||
142 | upstream ports should provide their own reset_link functions. | ||
143 | |||
144 | In struct pcie_port_service_driver, a new pointer, reset_link, is | ||
145 | added. | ||
146 | |||
147 | pci_ers_result_t (*reset_link) (struct pci_dev *dev); | ||
148 | |||
149 | Section 3.2.2.2 provides more detailed info on when to call | ||
150 | reset_link. | ||
151 | |||
152 | 3.2.2 PCI error-recovery callbacks | ||
153 | |||
154 | The PCI Express AER Root driver uses error callbacks to coordinate | ||
155 | with downstream device drivers associated with a hierarchy in question | ||
156 | when performing error recovery actions. | ||
157 | |||
158 | The pci_driver struct has a pointer, err_handler, that points to a | ||
159 | pci_error_handlers struct, which consists of several callback function | ||
160 | pointers. The AER driver follows the rules defined in | ||
161 | pci-error-recovery.txt except for PCI Express specific parts (e.g. | ||
162 | reset_link). Please refer to pci-error-recovery.txt for detailed | ||
163 | definitions of the callbacks. | ||
164 | |||
165 | The sections below specify when the error callback functions are called. | ||
166 | |||
167 | 3.2.2.1 Correctable errors | ||
168 | |||
169 | Correctable errors have no impact on the functionality of | ||
170 | the interface. The PCI Express protocol can recover without any | ||
171 | software intervention or any loss of data. These errors do not | ||
172 | require any recovery actions. The AER driver clears the device's | ||
173 | correctable error status register accordingly and logs these errors. | ||
174 | |||
175 | 3.2.2.2 Non-correctable (non-fatal and fatal) errors | ||
176 | |||
177 | If an error message indicates a non-fatal error, performing link reset | ||
178 | at upstream is not required. The AER driver calls error_detected(dev, | ||
179 | pci_channel_io_normal) for all drivers associated with the hierarchy in | ||
180 | question. For example, | ||
181 | EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort. | ||
182 | If Upstream port A captures an AER error, the hierarchy consists of | ||
183 | Downstream port B and EndPoint. | ||
184 | |||
185 | A driver may return PCI_ERS_RESULT_CAN_RECOVER, | ||
186 | PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on | ||
187 | whether it can recover; if it can, the AER driver calls mmio_enabled next. | ||
188 | |||
189 | If an error message indicates a fatal error, the kernel broadcasts | ||
190 | error_detected(dev, pci_channel_io_frozen) to all drivers within | ||
191 | the hierarchy in question. Then, performing link reset at upstream is | ||
192 | necessary. As different kinds of devices might use different approaches | ||
193 | to reset the link, the AER port service driver is required to provide the | ||
194 | function to reset the link. First, the kernel checks whether the upstream | ||
195 | component has an AER driver. If it has, the kernel uses the reset_link | ||
196 | callback of that AER driver. If the upstream component has no AER driver | ||
197 | and the port is a downstream port, the kernel uses the AER driver of the | ||
198 | root port that reports the AER error. As for upstream ports, | ||
199 | they should provide their own AER service drivers with a reset_link | ||
200 | function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and | ||
201 | reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes | ||
202 | to mmio_enabled. | ||
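
The rules in sections 3.2.2.1 and 3.2.2.2 can be summarized as a small
decision table. The standalone C sketch below is only an illustration
of the text above, not the AER driver's code: every identifier in it
(the sketch_* enums, struct sketch_plan and sketch_plan_for) is
hypothetical, and it models only the first recovery step, namely
whether drivers are notified through error_detected, which channel
state they are given, and whether a link reset follows.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative names only; these are not the kernel's definitions. */
enum sketch_severity { SKETCH_CORRECTABLE, SKETCH_NONFATAL, SKETCH_FATAL };
enum sketch_channel { SKETCH_IO_NORMAL, SKETCH_IO_FROZEN };

struct sketch_plan {
	bool notify_drivers;		/* call error_detected on the hierarchy */
	enum sketch_channel state;	/* channel state passed to error_detected */
	bool reset_link;		/* link reset at upstream is required */
};

/* Map an error severity to the first recovery step described above. */
struct sketch_plan sketch_plan_for(enum sketch_severity sev)
{
	/* Correctable: cleared and logged only, no recovery action. */
	struct sketch_plan plan = { false, SKETCH_IO_NORMAL, false };

	switch (sev) {
	case SKETCH_CORRECTABLE:
		break;
	case SKETCH_NONFATAL:
		/* Drivers are told the channel is still normal; no reset. */
		plan.notify_drivers = true;
		break;
	case SKETCH_FATAL:
		/* Drivers are told the channel is frozen; link is reset. */
		plan.notify_drivers = true;
		plan.state = SKETCH_IO_FROZEN;
		plan.reset_link = true;
		break;
	}
	return plan;
}
```

Calling sketch_plan_for(SKETCH_FATAL) yields a plan that notifies
drivers with the frozen channel state and requests a link reset, while
SKETCH_NONFATAL notifies drivers with the normal state and requests no
reset.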
203 | |||
204 | 3.3 Helper functions | ||
205 | |||
206 | 3.3.1 int pci_find_aer_capability(struct pci_dev *dev); | ||
207 | pci_find_aer_capability locates the PCI Express AER capability | ||
208 | in the device configuration space. If the device doesn't support | ||
209 | PCI-Express AER, the function returns 0. | ||
210 | |||
211 | 3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev); | ||
212 | pci_enable_pcie_error_reporting enables the device to send error | ||
213 | messages to the root port when an error is detected. Note that devices | ||
214 | don't enable error reporting by default, so device drivers need to | ||
215 | call this function to enable it. | ||
216 | |||
217 | 3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev); | ||
218 | pci_disable_pcie_error_reporting stops the device from sending error | ||
219 | messages to the root port when an error is detected. | ||
220 | |||
221 | 3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); | ||
222 | pci_cleanup_aer_uncorrect_error_status clears the uncorrectable | ||
223 | error status register. | ||
224 | |||
225 | 3.4 Frequently Asked Questions | ||
226 | |||
227 | Q: What happens if a PCI Express device driver does not provide an | ||
228 | error recovery handler (pci_driver->err_handler is equal to NULL)? | ||
229 | |||
230 | A: The devices bound to the driver won't be recovered. If the | ||
231 | error is fatal, the kernel will print warning messages. Please refer | ||
232 | to section 3 for more information. | ||
233 | |||
234 | Q: What happens if an upstream port service driver does not provide | ||
235 | callback reset_link? | ||
236 | |||
237 | A: Fatal error recovery will fail if the errors are reported by the | ||
238 | upstream ports to which the service driver is attached. | ||
239 | |||
240 | Q: How does this infrastructure deal with a driver that is not PCI | ||
241 | Express aware? | ||
242 | |||
243 | A: This infrastructure calls the error callback functions of the | ||
244 | driver when an error happens. But if the driver is not aware of | ||
245 | PCI Express, the device might not report its own errors to the root | ||
246 | port. | ||
247 | |||
248 | Q: What modifications will that driver need to make it compatible | ||
249 | with the PCI Express AER Root driver? | ||
250 | |||
251 | A: It could call the helper functions to enable AER in devices and | ||
252 | clean up the uncorrectable status register. Please refer to section 3.3. | ||
253 | |||