author | Jesse Gross <jesse@nicira.com> | 2011-10-25 22:26:31 -0400 |
---|---|---|
committer | Jesse Gross <jesse@nicira.com> | 2011-12-03 12:35:17 -0500 |
commit | ccb1352e76cff0524e7ccb2074826a092dd13016 (patch) | |
tree | 9122ceff5d75ec64e327a9fad4ad2013744c2999 /Documentation | |
parent | 75f2811c6460ccc59d83c66059943ce9c9f81a18 (diff) |
net: Add Open vSwitch kernel components.

Open vSwitch is a multilayer Ethernet switch targeted at virtualized
environments. In addition to supporting a variety of features
expected in a traditional hardware switch, it enables fine-grained
programmatic extension and flow-based control of the network.
This control is useful in a wide variety of applications but is
particularly important in multi-server virtualization deployments,
which are often characterized by highly dynamic endpoints and the need
to maintain logical abstractions for multiple tenants.

The Open vSwitch datapath provides an in-kernel fast path for packet
forwarding. It is complemented by a userspace daemon, ovs-vswitchd,
which is able to accept configuration from a variety of sources and
translate it into packet processing rules.

See http://openvswitch.org for more information and userspace
utilities.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/networking/00-INDEX | 2 | ||||
-rw-r--r-- | Documentation/networking/openvswitch.txt | 195 |
2 files changed, 197 insertions, 0 deletions
diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
index bbce1215434a..9ad9ddeb384c 100644
--- a/Documentation/networking/00-INDEX
+++ b/Documentation/networking/00-INDEX
@@ -144,6 +144,8 @@ nfc.txt | |||
144 | - The Linux Near Field Communication (NFC) subsystem. | 144 | - The Linux Near Field Communication (NFC) subsystem. |
145 | olympic.txt | 145 | olympic.txt |
146 | - IBM PCI Pit/Pit-Phy/Olympic Token Ring driver info. | 146 | - IBM PCI Pit/Pit-Phy/Olympic Token Ring driver info. |
147 | openvswitch.txt | ||
148 | - Open vSwitch developer documentation. | ||
147 | operstates.txt | 149 | operstates.txt |
148 | - Overview of network interface operational states. | 150 | - Overview of network interface operational states. |
149 | packet_mmap.txt | 151 | packet_mmap.txt |
diff --git a/Documentation/networking/openvswitch.txt b/Documentation/networking/openvswitch.txt
new file mode 100644
index 000000000000..b8a048b8df3a
--- /dev/null
+++ b/Documentation/networking/openvswitch.txt
@@ -0,0 +1,195 @@ | |||
1 | Open vSwitch datapath developer documentation | ||
2 | ============================================= | ||
3 | |||
4 | The Open vSwitch kernel module allows flexible userspace control over | ||
5 | flow-level packet processing on selected network devices. It can be | ||
6 | used to implement a plain Ethernet switch, network device bonding, | ||
7 | VLAN processing, network access control, flow-based network control, | ||
8 | and so on. | ||
9 | |||
10 | The kernel module implements multiple "datapaths" (analogous to | ||
11 | bridges), each of which can have multiple "vports" (analogous to ports | ||
12 | within a bridge). Each datapath also has associated with it a "flow | ||
13 | table" that userspace populates with "flows" that map from keys based | ||
14 | on packet headers and metadata to sets of actions. The most common | ||
15 | action forwards the packet to another vport; other actions are also | ||
16 | implemented. | ||
17 | |||
18 | When a packet arrives on a vport, the kernel module processes it by | ||
19 | extracting its flow key and looking it up in the flow table. If there | ||
20 | is a matching flow, it executes the associated actions. If there is | ||
21 | no match, it queues the packet to userspace for processing (as part of | ||
22 | its processing, userspace will likely set up a flow to handle further | ||
23 | packets of the same type entirely in-kernel). | ||
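The lookup-then-upcall behavior described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the kernel implementation: the `Datapath` class, its `receive` method, the tuple-shaped flow key, and the `"output:2"` action string are all names invented for illustration.

```python
class Datapath:
    """Toy model of a datapath: a flow table plus a queue of upcalls."""

    def __init__(self):
        self.flow_table = {}   # flow key -> list of actions
        self.upcalls = []      # (key, packet) pairs queued for userspace

    def extract_key(self, in_port, packet):
        # Hypothetical key: receiving vport plus Ethernet addresses.
        return (in_port, packet["eth_src"], packet["eth_dst"])

    def receive(self, in_port, packet):
        key = self.extract_key(in_port, packet)
        actions = self.flow_table.get(key)
        if actions is None:
            # Miss: hand the packet and its parsed key to userspace.
            self.upcalls.append((key, packet))
            return None
        # Hit: the caller would execute these actions in the fast path.
        return actions


dp = Datapath()
pkt = {"eth_src": "e0:91:f5:21:d0:b2", "eth_dst": "00:02:e3:0f:80:a4"}

first = dp.receive(1, pkt)           # miss, queued for userspace
key, _ = dp.upcalls.pop()
dp.flow_table[key] = ["output:2"]    # userspace installs a flow
second = dp.receive(1, pkt)          # hit, handled without an upcall
```

Only the first packet of the flow reaches userspace; once the flow is installed, later packets with the same key are matched in the table.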
24 | |||
25 | |||
26 | Flow key compatibility | ||
27 | ---------------------- | ||
28 | |||
29 | Network protocols evolve over time. New protocols become important | ||
30 | and existing protocols lose their prominence. For the Open vSwitch | ||
31 | kernel module to remain relevant, it must be possible for newer | ||
32 | versions to parse additional protocols as part of the flow key. It | ||
33 | might even be desirable, someday, to drop support for parsing | ||
34 | protocols that have become obsolete. Therefore, the Netlink interface | ||
35 | to Open vSwitch is designed to allow carefully written userspace | ||
36 | applications to work with any version of the flow key, past or future. | ||
37 | |||
38 | To support this forward and backward compatibility, whenever the | ||
39 | kernel module passes a packet to userspace, it also passes along the | ||
40 | flow key that it parsed from the packet. Userspace then extracts its | ||
41 | own notion of a flow key from the packet and compares it against the | ||
42 | kernel-provided version: | ||
43 | |||
44 | - If userspace's notion of the flow key for the packet matches the | ||
45 | kernel's, then nothing special is necessary. | ||
46 | |||
47 | - If the kernel's flow key includes more fields than the userspace | ||
48 | version of the flow key, for example if the kernel decoded IPv6 | ||
49 | headers but userspace stopped at the Ethernet type (because it | ||
50 | does not understand IPv6), then again nothing special is | ||
51 | necessary. Userspace can still set up a flow in the usual way, | ||
52 | as long as it uses the kernel-provided flow key to do it. | ||
53 | |||
54 | - If the userspace flow key includes more fields than the | ||
55 | kernel's, for example if userspace decoded an IPv6 header but | ||
56 | the kernel stopped at the Ethernet type, then userspace can | ||
57 | forward the packet manually, without setting up a flow in the | ||
58 | kernel. This case is bad for performance because every packet | ||
59 | that the kernel considers part of the flow must go to userspace, | ||
60 | but the forwarding behavior is correct. (If userspace can | ||
61 | determine that the values of the extra fields would not affect | ||
62 | forwarding behavior, then it could set up a flow anyway.) | ||
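The three cases above reduce to one question: did userspace parse any field that the kernel did not? A minimal Python sketch of that decision follows. It assumes flow keys are modelled as dicts from attribute name to value; `choose_handling` and the returned strings are hypothetical names. A key pair where each side has fields the other lacks is not discussed in the text, so the sketch conservatively treats it as the slow-path case.

```python
def choose_handling(kernel_key, user_key):
    """Return (decision, key_to_install) for a packet sent to userspace.

    Keys are modelled as dicts mapping attribute name to value.
    """
    kernel_attrs = set(kernel_key)
    user_attrs = set(user_key)
    if user_attrs <= kernel_attrs:
        # Same fields, or the kernel parsed more than userspace did:
        # install a flow, always using the kernel-provided key.
        return ("install_flow", kernel_key)
    # Userspace sees fields the kernel did not parse: forward by hand.
    return ("forward_in_userspace", None)


with_ipv6 = {"eth": "...", "eth_type": 0x86DD, "ipv6": "..."}
eth_only = {"eth": "...", "eth_type": 0x86DD}

newer_kernel = choose_handling(kernel_key=with_ipv6, user_key=eth_only)
newer_userspace = choose_handling(kernel_key=eth_only, user_key=with_ipv6)
```

The sketch leaves out the optimization mentioned in the last bullet, where userspace installs a flow anyway after determining that the extra fields cannot affect forwarding.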
63 | |||
64 | How flow keys evolve over time is important to making this work, so | ||
65 | the following sections go into detail. | ||
66 | |||
67 | |||
68 | Flow key format | ||
69 | --------------- | ||
70 | |||
71 | A flow key is passed over a Netlink socket as a sequence of Netlink | ||
72 | attributes. Some attributes represent packet metadata, defined as any | ||
73 | information about a packet that cannot be extracted from the packet | ||
74 | itself, e.g. the vport on which the packet was received. Most | ||
75 | attributes, however, are extracted from headers within the packet, | ||
76 | e.g. source and destination addresses from Ethernet, IP, or TCP | ||
77 | headers. | ||
78 | |||
79 | The <linux/openvswitch.h> header file defines the exact format of the | ||
80 | flow key attributes. For informal explanatory purposes here, we write | ||
81 | them as comma-separated strings, with parentheses indicating arguments | ||
82 | and nesting. For example, the following could represent a flow key | ||
83 | corresponding to a TCP packet that arrived on vport 1: | ||
84 | |||
85 | in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4), | ||
86 | eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0, | ||
87 | frag=no), tcp(src=49163, dst=80) | ||
88 | |||
89 | Often we ellipsize arguments not important to the discussion, e.g.: | ||
90 | |||
91 | in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...) | ||
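To make the informal notation concrete, the following Python sketch renders a list of attributes as a comma-separated string with parenthesized arguments and nesting. The data model is an assumption made for illustration (name/value pairs, with a dict for named arguments and a list for nested attributes); it is not the Netlink encoding defined in <linux/openvswitch.h>.

```python
def format_key(attrs):
    """Render a list of (name, value) pairs in the informal notation.

    A dict value becomes name=value arguments, a list value becomes
    nested attributes, and anything else is printed as a single argument.
    """
    parts = []
    for name, value in attrs:
        if isinstance(value, dict):
            inner = ", ".join(f"{k}={v}" for k, v in value.items())
        elif isinstance(value, list):
            inner = format_key(value)
        else:
            inner = str(value)
        parts.append(f"{name}({inner})")
    return ", ".join(parts)


key = [
    ("in_port", 1),
    ("eth_type", "0x0800"),
    ("tcp", {"src": 49163, "dst": 80}),
]
text = format_key(key)
# text is "in_port(1), eth_type(0x0800), tcp(src=49163, dst=80)"
```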
92 | |||
93 | |||
94 | Basic rule for evolving flow keys | ||
95 | --------------------------------- | ||
96 | |||
97 | Some care is needed to really maintain forward and backward | ||
98 | compatibility for applications that follow the rules listed under | ||
99 | "Flow key compatibility" above. | ||
100 | |||
101 | The basic rule is obvious: | ||
102 | |||
103 | ------------------------------------------------------------------ | ||
104 | New network protocol support must only supplement existing flow | ||
105 | key attributes. It must not change the meaning of already defined | ||
106 | flow key attributes. | ||
107 | ------------------------------------------------------------------ | ||
108 | |||
109 | This rule does have less-obvious consequences so it is worth working | ||
110 | through a few examples. Suppose, for example, that the kernel module | ||
111 | did not already implement VLAN parsing. Instead, it just interpreted | ||
112 | the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the | ||
113 | packet. The flow key for any packet with an 802.1Q header would look | ||
114 | essentially like this, ignoring metadata: | ||
115 | |||
116 | eth(...), eth_type(0x8100) | ||
117 | |||
118 | Naively, to add VLAN support, it makes sense to add a new "vlan" flow | ||
119 | key attribute to contain the VLAN tag, then continue to decode the | ||
120 | encapsulated headers beyond the VLAN tag using the existing field | ||
121 | definitions. With this change, a TCP packet in VLAN 10 would have a | ||
122 | flow key much like this: | ||
123 | |||
124 | eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...) | ||
125 | |||
126 | But this change would negatively affect a userspace application that | ||
127 | has not been updated to understand the new "vlan" flow key attribute. | ||
128 | The application could, following the flow compatibility rules above, | ||
129 | ignore the "vlan" attribute that it does not understand and therefore | ||
130 | assume that the flow contained IP packets. This is a bad assumption | ||
131 | (the flow only contains IP packets if one parses and skips over the | ||
132 | 802.1Q header) and it could cause the application's behavior to change | ||
133 | across kernel versions even though it follows the compatibility rules. | ||
134 | |||
135 | The solution is to use a set of nested attributes. This is, for | ||
136 | example, why 802.1Q support uses nested attributes. A TCP packet in | ||
137 | VLAN 10 is actually expressed as: | ||
138 | |||
139 | eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800), | ||
140 | ip(proto=6, ...), tcp(...)) | ||
141 | |||
142 | Notice how the "eth_type", "ip", and "tcp" flow key attributes are | ||
143 | nested inside the "encap" attribute. Thus, an application that does | ||
144 | not understand the "vlan" key will not see any of those attributes | ||
145 | and therefore will not misinterpret them. (Also, the outer eth_type | ||
146 | is still 0x8100, not changed to 0x0800.) | ||
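The difference between the naive layout and the nested one can be seen from the point of view of an application written before VLAN support existed. The Python sketch below is illustrative only: `KNOWN`, `visible`, and the list-of-pairs key model are assumptions, and the application is assumed to ignore any top-level attribute it does not recognize, as the compatibility rules allow.

```python
# Attributes understood by a hypothetical application that predates
# the "vlan" and "encap" attributes.
KNOWN = {"eth", "eth_type", "ip", "tcp"}


def visible(attrs, known=KNOWN):
    """Return the top-level attributes that the old application acts on."""
    return {name: value for name, value in attrs if name in known}


# Naive layout: "vlan" inserted, inner headers left at the top level.
flat = [("eth", "..."), ("vlan", {"vid": 10, "pcp": 0}),
        ("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")]

# Nested layout: inner headers hidden inside "encap".
nested = [("eth", "..."), ("eth_type", 0x8100),
          ("vlan", {"vid": 10, "pcp": 0}),
          ("encap", [("eth_type", 0x0800), ("ip", {"proto": 6}),
                     ("tcp", "...")])]

old_view_flat = visible(flat)       # wrongly looks like an untagged IP flow
old_view_nested = visible(nested)   # eth_type 0x8100, no ip or tcp
```

With the flat layout the old application sees eth_type 0x0800 plus ip and tcp and treats the flow as plain IP. With the nested layout it sees only eth and eth_type 0x8100, which is what it would have seen before VLAN parsing was added.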
147 | |||
148 | Handling malformed packets | ||
149 | -------------------------- | ||
150 | |||
151 | Don't drop packets in the kernel for malformed protocol headers, bad | ||
152 | checksums, etc. This would prevent userspace from implementing a | ||
153 | simple Ethernet switch that forwards every packet. | ||
154 | |||
155 | Instead, in such a case, include an attribute with "empty" content. | ||
156 | It doesn't matter if the empty content could be valid protocol values, | ||
157 | as long as those values are rarely seen in practice, because userspace | ||
158 | can always arrange for all packets with those values to be sent to it and | ||
159 | handle them individually. | ||
160 | |||
161 | For example, consider a packet that contains an IP header that | ||
162 | indicates protocol 6 for TCP, but which is truncated just after the IP | ||
163 | header, so that the TCP header is missing. The flow key for this | ||
164 | packet would include a tcp attribute with all-zero src and dst, like | ||
165 | this: | ||
166 | |||
167 | eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0) | ||
168 | |||
169 | As another example, consider a packet with an Ethernet type of 0x8100, | ||
170 | indicating that a VLAN TCI should follow, but which is truncated just | ||
171 | after the Ethernet type. The flow key for this packet would include | ||
172 | an all-zero-bits vlan and an empty encap attribute, like this: | ||
173 | |||
174 | eth(...), eth_type(0x8100), vlan(0), encap() | ||
175 | |||
176 | Unlike a TCP packet with source and destination ports 0, an | ||
177 | all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka | ||
178 | VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan | ||
179 | attribute expressly to allow this situation to be distinguished. | ||
180 | Thus, the flow key in this second example unambiguously indicates a | ||
181 | missing or malformed VLAN TCI. | ||
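The role of the CFI bit can be shown with a little bit arithmetic. The sketch below assumes the standard 16-bit TCI layout (3-bit PCP, 1-bit CFI, 12-bit VID), which puts the CFI bit at 0x1000; the constant and function names are invented for illustration and are not taken from the kernel headers.

```python
VLAN_CFI_BIT = 0x1000    # CFI bit of the 16-bit TCI (assumed layout)
VLAN_VID_MASK = 0x0FFF   # low 12 bits: VLAN ID
VLAN_PCP_SHIFT = 13      # top 3 bits: priority


def vlan_attr(vid, pcp):
    """Encode a TCI that was actually present in the packet."""
    return (pcp << VLAN_PCP_SHIFT) | VLAN_CFI_BIT | (vid & VLAN_VID_MASK)


def tci_missing(attr_value):
    """True if the vlan attribute marks a missing or malformed TCI."""
    return (attr_value & VLAN_CFI_BIT) == 0


real_zero_tci = vlan_attr(vid=0, pcp=0)   # 0x1000: tag present, all fields 0
truncated = 0                             # all-zero-bits: no TCI in the packet
```

Because a TCI that is really present always carries the marker bit, the all-zero-bits value is left free to mean "no usable TCI", as in the truncated-packet example above.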
182 | |||
183 | Other rules | ||
184 | ----------- | ||
185 | |||
186 | The other rules for flow keys are much less subtle: | ||
187 | |||
188 | - Duplicate attributes are not allowed at a given nesting level. | ||
189 | |||
190 | - Ordering of attributes is not significant. | ||
191 | |||
192 | - When the kernel sends a given flow key to userspace, it always | ||
193 | composes it the same way. This allows userspace to hash and | ||
194 | compare entire flow keys that it may not be able to fully | ||
195 | interpret. | ||
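The first two rules can be checked mechanically. The Python sketch below is a hedged illustration, again modelling a key as a list of (name, value) pairs with a list value for nested attributes; `has_duplicates`, `canonical`, and `same_key` are hypothetical helper names.

```python
def has_duplicates(attrs):
    """Check one nesting level, then recurse into nested attribute lists."""
    names = [name for name, _ in attrs]
    if len(names) != len(set(names)):
        return True
    return any(isinstance(v, list) and has_duplicates(v) for _, v in attrs)


def canonical(attrs):
    """Sort attributes by name at every nesting level."""
    return sorted(
        ((name, canonical(v) if isinstance(v, list) else v)
         for name, v in attrs),
        key=lambda item: item[0],
    )


def same_key(a, b):
    """Compare two keys without regard to attribute order."""
    return canonical(a) == canonical(b)


a = [("eth_type", 0x8100), ("encap", [("ip", 6), ("eth_type", 0x0800)])]
b = [("encap", [("eth_type", 0x0800), ("ip", 6)]), ("eth_type", 0x8100)]
equal = same_key(a, b)   # same attributes, different order
```

Note that "eth_type" appears twice in `a` without violating the first rule, because the two occurrences sit at different nesting levels. The third rule is what lets userspace skip this kind of normalization for kernel-provided keys and hash or compare them as opaque data.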