diff options
-rw-r--r-- | Documentation/networking/fib_trie.txt | 145 |
1 files changed, 145 insertions, 0 deletions
diff --git a/Documentation/networking/fib_trie.txt b/Documentation/networking/fib_trie.txt new file mode 100644 index 000000000000..f50d0c673c57 --- /dev/null +++ b/Documentation/networking/fib_trie.txt | |||
@@ -0,0 +1,145 @@ | |||
1 | LC-trie implementation notes. | ||
2 | |||
3 | Node types | ||
4 | ---------- | ||
5 | leaf | ||
6 | An end node with data. This has a copy of the relevant key, along | ||
7 | with 'hlist' with routing table entries sorted by prefix length. | ||
8 | See struct leaf and struct leaf_info. | ||
9 | |||
10 | trie node or tnode | ||
11 | An internal node, holding an array of child (leaf or tnode) pointers, | ||
12 | indexed through a subset of the key. See Level Compression. | ||
13 | |||
14 | A few concepts explained | ||
15 | ------------------------ | ||
16 | Bits (tnode) | ||
17 | The number of bits in the key segment used for indexing into the | ||
18 | child array - the "child index". See Level Compression. | ||
19 | |||
20 | Pos (tnode) | ||
21 | The position (in the key) of the key segment used for indexing into | ||
22 | the child array. See Path Compression. | ||
23 | |||
24 | Path Compression / skipped bits | ||
25 | Any given tnode is linked to from the child array of its parent, using | ||
26 | a segment of the key specified by the parent's "pos" and "bits" | ||
27 | In certain cases, this tnode's own "pos" will not be immediately | ||
28 | adjacent to the parent (pos+bits), but there will be some bits | ||
29 | in the key skipped over because they represent a single path with no | ||
30 | deviations. These "skipped bits" constitute Path Compression. | ||
31 | Note that the search algorithm will simply skip over these bits when | ||
32 | searching, making it necessary to save the keys in the leaves to | ||
33 | verify that they actually do match the key we are searching for. | ||
34 | |||
35 | Level Compression / child arrays | ||
36 | the trie is kept level balanced moving, under certain conditions, the | ||
37 | children of a full child (see "full_children") up one level, so that | ||
38 | instead of a pure binary tree, each internal node ("tnode") may | ||
39 | contain an arbitrarily large array of links to several children. | ||
40 | Conversely, a tnode with a mostly empty child array (see empty_children) | ||
41 | may be "halved", having some of its children moved downwards one level, | ||
42 | in order to avoid ever-increasing child arrays. | ||
43 | |||
44 | empty_children | ||
45 | the number of positions in the child array of a given tnode that are | ||
46 | NULL. | ||
47 | |||
48 | full_children | ||
49 | the number of children of a given tnode that aren't path compressed. | ||
50 | (in other words, they aren't NULL or leaves and their "pos" is equal | ||
51 | to this tnode's "pos"+"bits"). | ||
52 | |||
53 | (The word "full" here is used more in the sense of "complete" than | ||
54 | as the opposite of "empty", which might be a tad confusing.) | ||
55 | |||
56 | Comments | ||
57 | --------- | ||
58 | |||
59 | We have tried to keep the structure of the code as close to fib_hash as | ||
60 | possible to allow verification and help up reviewing. | ||
61 | |||
62 | fib_find_node() | ||
63 | A good start for understanding this code. This function implements a | ||
64 | straightforward trie lookup. | ||
65 | |||
66 | fib_insert_node() | ||
67 | Inserts a new leaf node in the trie. This is bit more complicated than | ||
68 | fib_find_node(). Inserting a new node means we might have to run the | ||
69 | level compression algorithm on part of the trie. | ||
70 | |||
71 | trie_leaf_remove() | ||
72 | Looks up a key, deletes it and runs the level compression algorithm. | ||
73 | |||
74 | trie_rebalance() | ||
75 | The key function for the dynamic trie after any change in the trie | ||
76 | it is run to optimize and reorganize. Tt will walk the trie upwards | ||
77 | towards the root from a given tnode, doing a resize() at each step | ||
78 | to implement level compression. | ||
79 | |||
80 | resize() | ||
81 | Analyzes a tnode and optimizes the child array size by either inflating | ||
82 | or shrinking it repeatedly until it fullfills the criteria for optimal | ||
83 | level compression. This part follows the original paper pretty closely | ||
84 | and there may be some room for experimentation here. | ||
85 | |||
86 | inflate() | ||
87 | Doubles the size of the child array within a tnode. Used by resize(). | ||
88 | |||
89 | halve() | ||
90 | Halves the size of the child array within a tnode - the inverse of | ||
91 | inflate(). Used by resize(); | ||
92 | |||
93 | fn_trie_insert(), fn_trie_delete(), fn_trie_select_default() | ||
94 | The route manipulation functions. Should conform pretty closely to the | ||
95 | corresponding functions in fib_hash. | ||
96 | |||
97 | fn_trie_flush() | ||
98 | This walks the full trie (using nextleaf()) and searches for empty | ||
99 | leaves which have to be removed. | ||
100 | |||
101 | fn_trie_dump() | ||
102 | Dumps the routing table ordered by prefix length. This is somewhat | ||
103 | slower than the corresponding fib_hash function, as we have to walk the | ||
104 | entire trie for each prefix length. In comparison, fib_hash is organized | ||
105 | as one "zone"/hash per prefix length. | ||
106 | |||
107 | Locking | ||
108 | ------- | ||
109 | |||
110 | fib_lock is used for an RW-lock in the same way that this is done in fib_hash. | ||
111 | However, the functions are somewhat separated for other possible locking | ||
112 | scenarios. It might conceivably be possible to run trie_rebalance via RCU | ||
113 | to avoid read_lock in the fn_trie_lookup() function. | ||
114 | |||
115 | Main lookup mechanism | ||
116 | --------------------- | ||
117 | fn_trie_lookup() is the main lookup function. | ||
118 | |||
119 | The lookup is in its simplest form just like fib_find_node(). We descend the | ||
120 | trie, key segment by key segment, until we find a leaf. check_leaf() does | ||
121 | the fib_semantic_match in the leaf's sorted prefix hlist. | ||
122 | |||
123 | If we find a match, we are done. | ||
124 | |||
125 | If we don't find a match, we enter prefix matching mode. The prefix length, | ||
126 | starting out at the same as the key length, is reduced one step at a time, | ||
127 | and we backtrack upwards through the trie trying to find a longest matching | ||
128 | prefix. The goal is always to reach a leaf and get a positive result from the | ||
129 | fib_semantic_match mechanism. | ||
130 | |||
131 | Inside each tnode, the search for longest matching prefix consists of searching | ||
132 | through the child array, chopping off (zeroing) the least significant "1" of | ||
133 | the child index until we find a match or the child index consists of nothing but | ||
134 | zeros. | ||
135 | |||
136 | At this point we backtrack (t->stats.backtrack++) up the trie, continuing to | ||
137 | chop off part of the key in order to find the longest matching prefix. | ||
138 | |||
139 | At this point we will repeatedly descend subtries to look for a match, and there | ||
140 | are some optimizations available that can provide us with "shortcuts" to avoid | ||
141 | descending into dead ends. Look for "HL_OPTIMIZE" sections in the code. | ||
142 | |||
143 | To alleviate any doubts about the correctness of the route selection process, | ||
144 | a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which | ||
145 | gives userland access to fib_lookup(). | ||