Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/networking/rds.txt | 356 |
1 files changed, 356 insertions, 0 deletions
diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
new file mode 100644
index 000000000000..c67077cbeb80
--- /dev/null
+++ b/Documentation/networking/rds.txt
@@ -0,0 +1,356 @@ | |||
1 | |||
2 | Overview | ||
3 | ======== | ||
4 | |||
5 | This readme tries to provide some background on the hows and whys of RDS, | ||
6 | and will hopefully help you find your way around the code. | ||
7 | |||
8 | In addition, please see this email about RDS origins: | ||
9 | http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html | ||
10 | |||
11 | RDS Architecture | ||
12 | ================ | ||
13 | |||
14 | RDS provides reliable, ordered datagram delivery by using a single | ||
15 | reliable connection between any two nodes in the cluster. This allows | ||
16 | applications to use a single socket to talk to any other process in the | ||
17 | cluster - so in a cluster with N processes you need N sockets, in contrast | ||
18 | to N*N if you use a connection-oriented socket transport like TCP. | ||
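The socket-count argument above can be made concrete with a little arithmetic. This is only an illustrative sketch: the function names are hypothetical, and the TCP figure counts one socket per peer in every process, which is N*(N-1), i.e. on the order of N*N.

```c
/* Total sockets needed cluster-wide for all-to-all communication. */
unsigned long rds_socket_count(unsigned long n)
{
	return n;			/* one RDS socket per process reaches every peer */
}

unsigned long tcp_socket_count(unsigned long n)
{
	return n ? n * (n - 1) : 0;	/* each process holds one socket per peer */
}
```

For 100 processes this gives 100 sockets with RDS against 9900 with a full mesh of connection-oriented sockets.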
19 | |||
20 | RDS is not Infiniband-specific; it was designed to support different | ||
21 | transports. Earlier versions of the implementation supported RDS over TCP as well | ||
22 | as IB. Work is in progress to support RDS over iWARP, and using DCE to | ||
23 | guarantee no dropped packets on Ethernet, it may be possible to use RDS over | ||
24 | UDP in the future. | ||
25 | |||
26 | The high-level semantics of RDS from the application's point of view are: | ||
27 | |||
28 | * Addressing | ||
29 | RDS uses IPv4 addresses and 16-bit port numbers to identify | ||
30 | the end point of a connection. All socket operations that involve | ||
31 | passing addresses between kernel and user space generally | ||
32 | use a struct sockaddr_in. | ||
33 | |||
34 | The fact that IPv4 addresses are used does not mean the underlying | ||
35 | transport has to be IP-based. In fact, RDS over IB uses a | ||
36 | reliable IB connection; the IP address is used exclusively to | ||
37 | locate the remote node's GID (by ARPing for the given IP). | ||
38 | |||
39 | The port space is entirely independent of UDP, TCP or any other | ||
40 | protocol. | ||
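As a minimal sketch of the addressing rules above, an RDS end point can be described with an ordinary struct sockaddr_in. The helper name rds_endpoint() is hypothetical and used only for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Fill *sin with an RDS end point; returns 0 on success, -1 on a bad address. */
int rds_endpoint(struct sockaddr_in *sin, const char *ipv4, uint16_t port)
{
	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;	/* RDS addresses are plain IPv4 sockaddr_in */
	sin->sin_port = htons(port);	/* RDS port space, unrelated to TCP or UDP */
	return inet_pton(AF_INET, ipv4, &sin->sin_addr) == 1 ? 0 : -1;
}
```

The same structure is what bind() and sendmsg() would be given, regardless of whether the transport underneath is IP-based.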
41 | |||
42 | * Socket interface | ||
43 | RDS sockets work *mostly* as you would expect from a BSD | ||
44 | socket. The next section will cover the details. At any rate, | ||
45 | all I/O is performed through the standard BSD socket API. | ||
46 | Some additions like zerocopy support are implemented through | ||
47 | control messages, while other extensions use the getsockopt/ | ||
48 | setsockopt calls. | ||
49 | |||
50 | Sockets must be bound before you can send or receive data. | ||
51 | This is needed because binding also selects a transport and | ||
52 | attaches it to the socket. Once bound, the transport assignment | ||
53 | does not change. RDS will tolerate IPs moving around (e.g. in | ||
54 | an active-active HA scenario), but only as long as the address | ||
55 | doesn't move to a different transport. | ||
56 | |||
57 | * sysctls | ||
58 | RDS supports a number of sysctls in /proc/sys/net/rds | ||
59 | |||
60 | |||
61 | Socket Interface | ||
62 | ================ | ||
63 | |||
64 | AF_RDS, PF_RDS, SOL_RDS | ||
65 | These constants haven't been assigned yet, because RDS isn't in | ||
66 | mainline yet. Currently, the kernel module assigns some constant | ||
67 | and publishes it to user space through two sysctl files | ||
68 | /proc/sys/net/rds/pf_rds | ||
69 | /proc/sys/net/rds/sol_rds | ||
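A minimal sketch of how user space could pick up these values at run time, assuming each sysctl file contains a single decimal integer (the file format is an assumption, and read_sysctl_int() is a hypothetical helper):

```c
#include <stdio.h>

/* Read one integer from a sysctl file; returns 0 on success, -1 otherwise. */
int read_sysctl_int(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;	/* e.g. the rds module is not loaded */
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

An application would call read_sysctl_int("/proc/sys/net/rds/pf_rds", &pf_rds) once at startup and pass the result to socket().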
70 | |||
71 | fd = socket(PF_RDS, SOCK_SEQPACKET, 0); | ||
72 | This creates a new, unbound RDS socket. | ||
73 | |||
74 | setsockopt(SOL_SOCKET): send and receive buffer size | ||
75 | RDS honors the send and receive buffer size socket options. | ||
76 | You are not allowed to queue more than SO_SNDBUF bytes to | ||
77 | a socket. A message is queued when sendmsg is called, and | ||
78 | it leaves the queue when the remote system acknowledges | ||
79 | its arrival. | ||
80 | |||
81 | The SO_RCVBUF option controls the maximum receive queue length. | ||
82 | This is a soft limit rather than a hard limit - RDS will | ||
83 | continue to accept and queue incoming messages, even if that | ||
84 | takes the queue length over the limit. However, it will also | ||
85 | mark the port as "congested" and send a congestion update to | ||
86 | the source node. The source node is supposed to throttle any | ||
87 | processes sending to this congested port. | ||
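A minimal sketch of setting both queue limits, using the standard SO_SNDBUF and SO_RCVBUF socket options (assumed here to be the options this section refers to). The helper name is hypothetical.

```c
#include <sys/socket.h>

/* Set the send and receive queue limits (in bytes) on an existing socket. */
int rds_set_queue_limits(int fd, int sndbuf, int rcvbuf)
{
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
		return -1;
	return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
}
```

Since the receive limit is soft, a generous value mainly delays the point at which the port is reported as congested.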
88 | |||
89 | bind(fd, &sockaddr_in, ...) | ||
90 | This binds the socket to a local IP address and port, and a | ||
91 | transport. | ||
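Putting socket() and bind() together might look like the sketch below. The fallback value 21 for PF_RDS is the one found in later mainline headers and is an assumption relative to this document; rds_open_bound() is a hypothetical helper.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef PF_RDS
#define PF_RDS 21	/* assumption: value from later mainline headers */
#endif

/* Create an RDS socket bound to ipv4:port. Returns the fd, or -1 on error. */
int rds_open_bound(const char *ipv4, uint16_t port)
{
	struct sockaddr_in sin;
	int fd;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	if (inet_pton(AF_INET, ipv4, &sin.sin_addr) != 1)
		return -1;		/* not a valid IPv4 address */

	fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
	if (fd < 0)
		return -1;		/* e.g. rds module not loaded */

	/* Binding also selects the transport that owns this address. */
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The address passed in has to be a local IP that one of the loaded transports can serve, otherwise bind() is expected to fail.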
92 | |||
93 | sendmsg(fd, ...) | ||
94 | Sends a message to the indicated recipient. The kernel will | ||
95 | transparently establish the underlying reliable connection | ||
96 | if it isn't up yet. | ||
97 | |||
98 | An attempt to send a message that exceeds SO_SNDBUF will | ||
99 | return EMSGSIZE. | ||
100 | |||
101 | An attempt to send a message that would take the total number | ||
102 | of queued bytes over the SO_SNDBUF threshold will return | ||
103 | EAGAIN. | ||
104 | |||
105 | An attempt to send a message to a destination that is marked | ||
106 | as "congested" will return ENOBUFS. | ||
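The three error cases above call for different reactions from the sender. A minimal sketch of that dispatch, with hypothetical enum and function names:

```c
#include <errno.h>

enum send_action {
	SEND_OK,		/* message was queued */
	SEND_TOO_BIG,		/* EMSGSIZE: message larger than the send buffer */
	SEND_QUEUE_FULL,	/* EAGAIN: wait for acks to drain the send queue */
	SEND_CONGESTED,		/* ENOBUFS: destination port is congested */
	SEND_FAILED		/* anything else */
};

/* Classify the result of sendmsg(): ret is its return value, err is errno. */
enum send_action classify_send(long ret, int err)
{
	if (ret >= 0)
		return SEND_OK;
	switch (err) {
	case EMSGSIZE:
		return SEND_TOO_BIG;
	case EAGAIN:
		return SEND_QUEUE_FULL;
	case ENOBUFS:
		return SEND_CONGESTED;
	default:
		return SEND_FAILED;
	}
}
```

Only the queue-full and congested cases are worth retrying; an oversized message will fail the same way every time.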
107 | |||
108 | recvmsg(fd, ...) | ||
109 | Receives a message that was queued to this socket. The socket's | ||
110 | recv queue accounting is adjusted, and if the queue length | ||
111 | drops below SO_RCVBUF, the port is marked uncongested, and | ||
112 | a congestion update is sent to all peers. | ||
113 | |||
114 | Applications can ask the RDS kernel module to receive | ||
115 | notifications via control messages (for instance, there is a | ||
116 | notification when a congestion update arrives, or when an RDMA | ||
117 | operation completes). These notifications are received through | ||
118 | the msg.msg_control buffer of struct msghdr. The format of the | ||
119 | messages is described in the manpages. | ||
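After recvmsg() returns, the notifications can be picked out of the control buffer with the standard CMSG macros. The sketch below only counts them; the fallback value 276 for SOL_RDS comes from later mainline headers and is an assumption here, and the function name is hypothetical.

```c
#include <sys/socket.h>

#ifndef SOL_RDS
#define SOL_RDS 276	/* assumption: value from later mainline headers */
#endif

/* Count the control messages in msg that were generated by RDS. */
int rds_count_notifications(struct msghdr *msg)
{
	struct cmsghdr *cmsg;
	int n = 0;

	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg))
		if (cmsg->cmsg_level == SOL_RDS)
			n++;
	return n;
}
```

A real receiver would switch on cmsg->cmsg_type inside the loop and decode CMSG_DATA(cmsg) according to the manpages.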
120 | |||
121 | poll(fd) | ||
122 | RDS supports the poll interface to allow the application | ||
123 | to implement async I/O. | ||
124 | |||
125 | POLLIN handling is pretty straightforward. When there's an | ||
126 | incoming message queued to the socket, or a pending notification, | ||
127 | we signal POLLIN. | ||
128 | |||
129 | POLLOUT is a little harder. Since you can essentially send | ||
130 | to any destination, RDS will always signal POLLOUT as long as | ||
131 | there's room on the send queue (i.e. the number of bytes queued | ||
132 | is less than the sendbuf size). | ||
133 | |||
134 | However, the kernel will refuse to accept messages to | ||
135 | a destination marked congested - in this case you will loop | ||
136 | forever if you rely on poll to tell you what to do. | ||
137 | This isn't a trivial problem, but applications can deal with | ||
138 | this - by using congestion notifications, and by checking for | ||
139 | ENOBUFS errors returned by sendmsg. | ||
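One way to avoid the busy loop described above is to stop waiting for POLLOUT once sendmsg has returned ENOBUFS, and to wait for POLLIN instead, since a pending notification is signalled as POLLIN. A minimal sketch of the wait primitive, with a hypothetical helper name:

```c
#include <poll.h>

/*
 * Wait up to timeout_ms for any of `events` on fd.
 * Returns the ready events, 0 on timeout, -1 on error.
 */
int rds_wait(int fd, short events, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = events };
	int rc = poll(&pfd, 1, timeout_ms);

	if (rc < 0)
		return -1;
	return rc == 0 ? 0 : pfd.revents;
}
```

A sender would normally call rds_wait(fd, POLLOUT, ...) before sendmsg, and switch to rds_wait(fd, POLLIN, ...) after an ENOBUFS until the congestion update for that destination has been read.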
140 | |||
141 | setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) | ||
142 | This allows the application to discard all messages queued to a | ||
143 | specific destination on this particular socket. | ||
144 | |||
145 | This allows the application to cancel outstanding messages if | ||
146 | it detects a timeout. For instance, if it tried to send a message, | ||
147 | and the remote host is unreachable, RDS will keep trying forever. | ||
148 | The application may decide it's not worth it, and cancel the | ||
149 | operation. In this case, it would use RDS_CANCEL_SENT_TO to | ||
150 | nuke any pending messages. | ||
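A minimal sketch of the cancel call. The fallback values for SOL_RDS (276) and RDS_CANCEL_SENT_TO (1) are those of later mainline headers and are assumptions relative to this document; the wrapper name is hypothetical.

```c
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef SOL_RDS
#define SOL_RDS 276		/* assumption: value from later mainline headers */
#endif
#ifndef RDS_CANCEL_SENT_TO
#define RDS_CANCEL_SENT_TO 1	/* assumption: value from later mainline headers */
#endif

/* Drop every message this socket still has queued for dest. */
int rds_cancel_sent_to(int fd, const struct sockaddr_in *dest)
{
	return setsockopt(fd, SOL_RDS, RDS_CANCEL_SENT_TO, dest, sizeof(*dest));
}
```

The application's own timeout logic decides when to call this; RDS itself never gives up on a destination.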
151 | |||
152 | |||
153 | RDMA for RDS | ||
154 | ============ | ||
155 | |||
156 | see rds-rdma(7) manpage (available in rds-tools) | ||
157 | |||
158 | |||
159 | Congestion Notifications | ||
160 | ======================== | ||
161 | |||
162 | see rds(7) manpage | ||
163 | |||
164 | |||
165 | RDS Protocol | ||
166 | ============ | ||
167 | |||
168 | Message header | ||
169 | |||
170 | The message header is a 'struct rds_header' (see rds.h): | ||
171 | Fields: | ||
172 | h_sequence: | ||
173 | per-packet sequence number | ||
174 | h_ack: | ||
175 | piggybacked acknowledgment of last packet received | ||
176 | h_len: | ||
177 | length of data, not including header | ||
178 | h_sport: | ||
179 | source port | ||
180 | h_dport: | ||
181 | destination port | ||
182 | h_flags: | ||
183 | CONG_BITMAP - this is a congestion update bitmap | ||
184 | ACK_REQUIRED - receiver must ack this packet | ||
185 | RETRANSMITTED - packet has previously been sent | ||
186 | h_credit: | ||
187 | indicate to other end of connection that | ||
188 | it has more credits available (i.e. there is | ||
189 | more send room) | ||
190 | h_padding[4]: | ||
191 | unused, for future use | ||
192 | h_csum: | ||
193 | header checksum | ||
194 | h_exthdr: | ||
195 | optional data can be passed here. This is currently used for | ||
196 | passing RDMA-related information. | ||
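The field list above maps onto a structure along the following lines. The field widths, the flag bit values and the 16-byte extension area are assumptions modeled on later mainline net/rds/rds.h; this document does not state them. Multi-byte fields are big-endian on the wire.

```c
#include <stdint.h>

#define RDS_FLAG_CONG_BITMAP	0x01	/* assumed bit values */
#define RDS_FLAG_ACK_REQUIRED	0x02
#define RDS_FLAG_RETRANSMITTED	0x04
#define RDS_HEADER_EXT_SPACE	16	/* assumed size of h_exthdr */

struct rds_header {
	uint64_t h_sequence;	/* per-packet sequence number */
	uint64_t h_ack;		/* piggybacked ack of last packet received */
	uint32_t h_len;		/* payload length, header excluded */
	uint16_t h_sport;	/* source port */
	uint16_t h_dport;	/* destination port */
	uint8_t  h_flags;	/* RDS_FLAG_* bits */
	uint8_t  h_credit;	/* new send credits granted to the peer */
	uint8_t  h_padding[4];	/* unused */
	uint16_t h_csum;	/* header checksum */
	uint8_t  h_exthdr[RDS_HEADER_EXT_SPACE];
};

/* Does the sender of this header want an explicit ack? */
int rds_header_wants_ack(const struct rds_header *h)
{
	return (h->h_flags & RDS_FLAG_ACK_REQUIRED) != 0;
}
```

With these widths the header is 48 bytes and needs no compiler-inserted padding.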
197 | |||
198 | ACK and retransmit handling | ||
199 | |||
200 | One might think that with reliable IB connections you wouldn't need | ||
201 | to ack messages that have been received. The problem is that IB | ||
202 | hardware generates an ack message before it has DMAed the message | ||
203 | into memory. This creates a potential message loss if the HCA is | ||
204 | disabled for any reason after it sends the ack but before | ||
205 | the message is DMAed and processed. This is only a potential issue | ||
206 | if another HCA is available for fail-over. | ||
207 | |||
208 | Sending an ack immediately would allow the sender to free the sent | ||
209 | message from their send queue quickly, but could cause excessive | ||
210 | traffic to be used for acks. RDS piggybacks acks on sent data | ||
211 | packets. Ack-only packets are reduced by only allowing one to be | ||
212 | in flight at a time, and by the sender only asking for acks when | ||
213 | its send buffers start to fill up. All retransmissions are also | ||
214 | acked. | ||
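On the sending side, a received h_ack lets the sender free messages from the head of its send queue. The sketch below assumes acks are cumulative, i.e. h_ack covers every sequence number up to and including its value; that is a simplification for illustration, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * queued_seq holds the sequence numbers still on the send queue, in
 * ascending order. Returns how many leading entries h_ack covers, i.e.
 * how many messages can now be freed.
 */
size_t rds_acked_prefix(const uint64_t *queued_seq, size_t n, uint64_t h_ack)
{
	size_t i = 0;

	while (i < n && queued_seq[i] <= h_ack)
		i++;
	return i;
}
```

Messages beyond the returned prefix stay queued and are candidates for retransmission if the connection drops.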
215 | |||
216 | Flow Control | ||
217 | |||
218 | RDS's IB transport uses a credit-based mechanism to verify that | ||
219 | there is space in the peer's receive buffers for more data. This | ||
220 | eliminates the need for hardware retries on the connection. | ||
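The credit idea can be modeled in a few lines. This is a deliberately simplified sketch, not the IB transport's code: it assumes one credit per posted send, and the structure and function names are hypothetical.

```c
#include <stdint.h>

struct credit_state {
	unsigned int send_credits;	/* receive buffers the peer has free for us */
};

/* Try to consume one credit before posting a send; returns 1 if allowed. */
int credit_take(struct credit_state *cs)
{
	if (cs->send_credits == 0)
		return 0;	/* peer has no room: hold the message back */
	cs->send_credits--;
	return 1;
}

/* Apply the h_credit field of a header received from the peer. */
void credit_grant(struct credit_state *cs, uint8_t h_credit)
{
	cs->send_credits += h_credit;
}
```

Because a send is only posted when a credit is held, the receiver always has a buffer ready and the hardware never has to retry.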
221 | |||
222 | Congestion | ||
223 | |||
224 | Messages waiting in the receive queue on the receiving socket | ||
225 | are accounted against the socket's SO_RCVBUF option value. Only | ||
226 | the payload bytes in the message are accounted for. If the | ||
227 | number of bytes queued equals or exceeds rcvbuf then the socket | ||
228 | is congested. All sends attempted to this socket's address | ||
229 | should block or return -EWOULDBLOCK. | ||
230 | |||
231 | Applications are expected to be reasonably tuned such that this | ||
232 | situation very rarely occurs. An application encountering this | ||
233 | "back-pressure" is considered a bug. | ||
234 | |||
235 | This is implemented by having each node maintain bitmaps which | ||
236 | indicate which ports on bound addresses are congested. As the | ||
237 | bitmap changes it is sent through all the connections which | ||
238 | terminate in the local address of the bitmap which changed. | ||
239 | |||
240 | The bitmaps are allocated as connections are brought up. This | ||
241 | avoids allocation in the interrupt handling path which queues | ||
242 | messages on sockets. The dense bitmaps let transports send the | ||
243 | entire bitmap on any bitmap change reasonably efficiently. This | ||
244 | is much easier to implement than some finer-grained | ||
245 | communication of per-port congestion. The sender does a very | ||
246 | inexpensive bit test to test if the port it's about to send to | ||
247 | is congested or not. | ||
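A dense per-port bitmap and the sender's bit test can be sketched as follows. This is a flat, user-space illustration with hypothetical names; it ignores how the kernel actually lays out and byte-orders the map.

```c
#include <limits.h>
#include <stdint.h>

#define CONG_PORTS	65536	/* one bit per 16-bit port */
#define CONG_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

struct cong_map {
	unsigned long bits[CONG_PORTS / (sizeof(unsigned long) * CHAR_BIT)];
};

/* Mark a port congested (receive queue at or over its limit). */
void cong_set(struct cong_map *m, uint16_t port)
{
	m->bits[port / CONG_WORD_BITS] |= 1UL << (port % CONG_WORD_BITS);
}

/* Mark a port uncongested again. */
void cong_clear(struct cong_map *m, uint16_t port)
{
	m->bits[port / CONG_WORD_BITS] &= ~(1UL << (port % CONG_WORD_BITS));
}

/* The sender's check before queueing a message to this port. */
int cong_test(const struct cong_map *m, uint16_t port)
{
	return (m->bits[port / CONG_WORD_BITS] >> (port % CONG_WORD_BITS)) & 1UL;
}
```

The whole map is 8 KB per address, which is small enough to resend in full whenever a bit changes.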
248 | |||
249 | |||
250 | RDS Transport Layer | ||
251 | =================== | ||
252 | |||
253 | As mentioned above, RDS is not IB-specific. Its code is divided | ||
254 | into a general RDS layer and a transport layer. | ||
255 | |||
256 | The general layer handles the socket API, congestion handling, | ||
257 | loopback, stats, usermem pinning, and the connection state machine. | ||
258 | |||
259 | The transport layer handles the details of the transport. The IB | ||
260 | transport, for example, handles all the queue pairs, work requests, | ||
261 | CM event handlers, and other Infiniband details. | ||
262 | |||
263 | |||
264 | RDS Kernel Structures | ||
265 | ===================== | ||
266 | |||
267 | struct rds_message | ||
268 | aka possibly "rds_outgoing", the generic RDS layer copies data to | ||
269 | be sent and sets header fields as needed, based on the socket API. | ||
270 | This is then queued for the individual connection and sent by the | ||
271 | connection's transport. | ||
272 | struct rds_incoming | ||
273 | a generic struct referring to incoming data that can be handed from | ||
274 | the transport to the general code and queued by the general code | ||
275 | while the socket is awoken. It is then passed back to the transport | ||
276 | code to handle the actual copy-to-user. | ||
277 | struct rds_socket | ||
278 | per-socket information | ||
279 | struct rds_connection | ||
280 | per-connection information | ||
281 | struct rds_transport | ||
282 | pointers to transport-specific functions | ||
283 | struct rds_statistics | ||
284 | non-transport-specific statistics | ||
285 | struct rds_cong_map | ||
286 | wraps the raw congestion bitmap, contains rbnode, waitq, etc. | ||
287 | |||
288 | Connection management | ||
289 | ===================== | ||
290 | |||
291 | Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and | ||
292 | ERROR states. | ||
293 | |||
294 | The first time an attempt is made by an RDS socket to send data to | ||
295 | a node, a connection is allocated and connected. That connection is | ||
296 | then maintained forever -- if there are transport errors, the | ||
297 | connection will be dropped and re-established. | ||
298 | |||
299 | Dropping a connection while packets are queued will cause queued or | ||
300 | partially-sent datagrams to be retransmitted when the connection is | ||
301 | re-established. | ||
302 | |||
303 | |||
304 | The send path | ||
305 | ============= | ||
306 | |||
307 | rds_sendmsg() | ||
308 | struct rds_message built from incoming data | ||
309 | CMSGs parsed (e.g. RDMA ops) | ||
310 | transport connection alloced and connected if not already | ||
311 | rds_message placed on send queue | ||
312 | send worker awoken | ||
313 | rds_send_worker() | ||
314 | calls rds_send_xmit() until queue is empty | ||
315 | rds_send_xmit() | ||
316 | transmits congestion map if one is pending | ||
317 | may set ACK_REQUIRED | ||
318 | calls transport to send either non-RDMA or RDMA message | ||
319 | (RDMA ops never retransmitted) | ||
320 | rds_ib_xmit() | ||
321 | allocs work requests from send ring | ||
322 | adds any new send credits available to peer (h_credit) | ||
323 | maps the rds_message's sg list | ||
324 | piggybacks ack | ||
325 | populates work requests | ||
326 | post send to connection's queue pair | ||
327 | |||
328 | The recv path | ||
329 | ============= | ||
330 | |||
331 | rds_ib_recv_cq_comp_handler() | ||
332 | looks at write completions | ||
333 | unmaps recv buffer from device | ||
334 | if no errors, call rds_ib_process_recv() | ||
335 | refill recv ring | ||
336 | rds_ib_process_recv() | ||
337 | validate header checksum | ||
338 | copy header to rds_ib_incoming struct if start of a new datagram | ||
339 | add to ibinc's fraglist | ||
340 | if completed datagram: | ||
341 | update cong map if datagram was cong update | ||
342 | call rds_recv_incoming() otherwise | ||
343 | note if ack is required | ||
344 | rds_recv_incoming() | ||
345 | drop duplicate packets | ||
346 | respond to pings | ||
347 | find the sock associated with this datagram | ||
348 | add to sock queue | ||
349 | wake up sock | ||
350 | do some congestion calculations | ||
351 | rds_recvmsg() | ||
352 | copy data into user iovec | ||
353 | handle CMSGs | ||
354 | return to application | ||
355 | |||
356 | |||