Diffstat (limited to 'Documentation/nvdimm/btt.txt')
-rw-r--r-- | Documentation/nvdimm/btt.txt | 283 |
1 files changed, 283 insertions, 0 deletions
diff --git a/Documentation/nvdimm/btt.txt b/Documentation/nvdimm/btt.txt new file mode 100644 index 000000000000..b91443f577dc --- /dev/null +++ b/Documentation/nvdimm/btt.txt | |||
@@ -0,0 +1,283 @@ | |||
1 | BTT - Block Translation Table | ||
2 | ============================= | ||
3 | |||
4 | |||
5 | 1. Introduction | ||
6 | --------------- | ||
7 | |||
8 | Persistent memory based storage is able to perform IO at byte (or more | ||
9 | accurately, cache line) granularity. However, we often want to expose such | ||
10 | storage as traditional block devices. The block drivers for persistent memory | ||
11 | will do exactly this. However, they do not provide any atomicity guarantees. | ||
12 | Traditional SSDs typically provide protection against torn sectors in hardware, | ||
13 | using stored energy in capacitors to complete in-flight block writes, or perhaps | ||
14 | in firmware. We don't have this luxury with persistent memory - if a write is in | ||
15 | progress, and we experience a power failure, the block will contain a mix of old | ||
16 | and new data. Applications may not be prepared to handle such a scenario. | ||
17 | |||
18 | The Block Translation Table (BTT) provides atomic sector update semantics for | ||
19 | persistent memory devices, so that applications that rely on sector writes not | ||
20 | being torn can continue to do so. The BTT manifests itself as a stacked block | ||
21 | device, and reserves a portion of the underlying storage for its metadata. At | ||
22 | the heart of it is an indirection table that re-maps all the blocks on the | ||
23 | volume. It can be thought of as an extremely simple file system that only | ||
24 | provides atomic sector updates. | ||
25 | |||
26 | |||
27 | 2. Static Layout | ||
28 | ---------------- | ||
29 | |||
30 | The underlying storage on which a BTT can be laid out is not limited in any way. | ||
31 | The BTT, however, splits the available space into chunks of up to 512 GiB, | ||
32 | called "Arenas". | ||
33 | |||
34 | Each arena follows the same layout for its metadata, and all references in an | ||
35 | arena are internal to it (with the exception of one field that points to the | ||
36 | next arena). The following depicts the "On-disk" metadata layout: | ||
37 | |||
38 | |||
39 |   Backing Store     +------->  Arena | ||
40 | +---------------+   |   +------------------+ | ||
41 | |               |   |   | Arena info block | | ||
42 | |    Arena 0    +---+   |       4K         | | ||
43 | |     512G      |       +------------------+ | ||
44 | |               |       |                  | | ||
45 | +---------------+       |                  | | ||
46 | |               |       |                  | | ||
47 | |    Arena 1    |       |   Data Blocks    | | ||
48 | |     512G      |       |                  | | ||
49 | |               |       |                  | | ||
50 | +---------------+       |                  | | ||
51 | |       .       |       |                  | | ||
52 | |       .       |       |                  | | ||
53 | |       .       |       |                  | | ||
54 | |               |       |                  | | ||
55 | |               |       |                  | | ||
56 | +---------------+       +------------------+ | ||
57 |                         |                  | | ||
58 |                         |     BTT Map      | | ||
59 |                         |                  | | ||
60 |                         |                  | | ||
61 |                         +------------------+ | ||
62 |                         |                  | | ||
63 |                         |     BTT Flog     | | ||
64 |                         |                  | | ||
65 |                         +------------------+ | ||
66 |                         | Info block copy  | | ||
67 |                         |       4K         | | ||
68 |                         +------------------+ | ||
69 | |||
70 | |||
71 | 3. Theory of Operation | ||
72 | ---------------------- | ||
73 | |||
74 | |||
75 | a. The BTT Map | ||
76 | -------------- | ||
77 | |||
78 | The map is a simple lookup/indirection table that maps an LBA to an internal | ||
79 | block. Each map entry is 32 bits. The two most significant bits are special | ||
80 | flags, and the remaining form the internal block number. | ||
81 | |||
82 | Bit         Description | ||
83 | 31 - 30 :   Error and Zero flags - Used in the following way: | ||
84 |             Bit      Description | ||
85 |             31 30 | ||
86 |             --------------------------------------------------------------- | ||
87 |             00       Initial state. Reads return zeroes; Premap = Postmap | ||
88 |             01       Zero state: Reads return zeroes | ||
89 |             10       Error state: Reads fail; Writes clear 'E' bit | ||
90 |             11       Normal Block - has valid postmap | ||
91 | |||
92 | |||
93 | 29 - 0 : Mappings to internal 'postmap' blocks | ||
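
The bit layout above can be decoded with a few masks. The following is a minimal sketch, not the driver's code: the macro, enum, and function names are hypothetical, and only the bit positions and the four states come from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the map entry table. */
#define MAP_ERR_BIT   (UINT32_C(1) << 31)              /* 'E' flag, bit 31 */
#define MAP_ZERO_BIT  (UINT32_C(1) << 30)              /* 'Z' flag, bit 30 */
#define MAP_LBA_MASK  (~(MAP_ERR_BIT | MAP_ZERO_BIT))  /* bits 29..0 */

enum map_state { MAP_INITIAL, MAP_ZERO, MAP_ERROR, MAP_NORMAL };

/* Classify a 32-bit map entry by its two most significant bits. */
static enum map_state map_entry_state(uint32_t entry)
{
	switch (entry >> 30) {
	case 0:  return MAP_INITIAL;  /* 00 */
	case 1:  return MAP_ZERO;     /* 01 */
	case 2:  return MAP_ERROR;    /* 10 */
	default: return MAP_NORMAL;   /* 11 */
	}
}

/*
 * Resolve the postmap ABA for an entry. In the initial state the
 * entry carries no mapping yet, so premap = postmap.
 */
static uint32_t map_entry_postmap(uint32_t entry, uint32_t premap_aba)
{
	if (map_entry_state(entry) == MAP_INITIAL)
		return premap_aba;
	return entry & MAP_LBA_MASK;
}
```

For instance, an entry of 0xC0000040 is a normal block whose postmap ABA is 64, while an all-zero entry at premap ABA 7 resolves to postmap ABA 7.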
94 | |||
95 | |||
96 | Some of the terminology that will be subsequently used: | ||
97 | |||
98 | External LBA : LBA as made visible to upper layers. | ||
99 | ABA : Arena Block Address - Block offset/number within an arena | ||
100 | Premap ABA : The block offset into an arena, which was decided upon by range | ||
101 | checking the External LBA | ||
102 | Postmap ABA : The block number in the "Data Blocks" area obtained after | ||
103 | indirection from the map | ||
104 | nfree : The number of free blocks that are maintained at any given time. | ||
105 | This is the number of concurrent writes that can happen to the | ||
106 | arena. | ||
107 | |||
108 | |||
109 | For example, after adding a BTT, we surface a disk of 1024G. We get a read for | ||
110 | the external LBA at 768G. This falls into the second arena, and of the 512G | ||
111 | worth of blocks that this arena contributes, this block is at 256G. Thus, the | ||
112 | premap ABA is 256G. We now refer to the map, and find out the mapping for block | ||
113 | 'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64. | ||
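
The range check in this example can be sketched as a walk over the arenas. This is an illustration under assumptions: `struct arena_range` and `lba_to_arena` are hypothetical names, addresses are counted in blocks, and the metadata overhead of each arena is ignored, as it is in the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: number of external blocks each arena contributes. */
struct arena_range {
	uint64_t external_nlba;
};

/*
 * Range-check an external LBA against the arenas in order.
 * On success, stores the arena index and the premap ABA (the block
 * offset within that arena) and returns 0. Returns -1 if the LBA
 * lies beyond the last arena.
 */
static int lba_to_arena(const struct arena_range *arenas, size_t narenas,
			uint64_t external_lba,
			size_t *arena_idx, uint64_t *premap_aba)
{
	uint64_t base = 0;  /* first external LBA of the current arena */

	for (size_t i = 0; i < narenas; i++) {
		if (external_lba < base + arenas[i].external_nlba) {
			*arena_idx = i;
			*premap_aba = external_lba - base;
			return 0;
		}
		base += arenas[i].external_nlba;
	}
	return -1;
}
```

With 512-byte blocks, each 512G arena contributes 2^30 blocks, so the LBA at 768G (block 3 * 2^29) lands in the second arena (index 1) at premap ABA 2^29, which is the 256G offset from the example.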
114 | |||
115 | |||
116 | b. The BTT Flog | ||
117 | --------------- | ||
118 | |||
119 | The BTT provides sector atomicity by making every write an "allocating write", | ||
120 | i.e. every write goes to a "free" block. A running list of free blocks is | ||
121 | maintained in the form of the BTT flog. 'Flog' is a combination of the words | ||
122 | "free list" and "log". The flog contains 'nfree' entries, and an entry contains: | ||
123 | |||
124 | lba : The premap ABA that is being written to | ||
125 | old_map : The old postmap ABA - after 'this' write completes, this will be a | ||
126 | free block. | ||
127 | new_map : The new postmap ABA. The map will be updated to reflect this | ||
128 | lba->postmap_aba mapping, but we log it here in case we have to | ||
129 | recover. | ||
130 | seq : Sequence number to mark which of the 2 sections of this flog entry is | ||
131 | valid/newest. It cycles between 01->10->11->01 (binary) under normal | ||
132 | operation, with 00 indicating an uninitialized state. | ||
133 | lba' : alternate lba entry | ||
134 | old_map': alternate old postmap entry | ||
135 | new_map': alternate new postmap entry | ||
136 | seq' : alternate sequence number. | ||
137 | |||
138 | Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also | ||
139 | padded to 64 bytes to avoid cache line sharing or aliasing. For any entry | ||
140 | being written, a flog update: | ||
141 | a. overwrites the 'old' section in the entry based on sequence numbers | ||
142 | b. writes the 'new' section such that the sequence number is written last. | ||
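
The entry layout and the update rule can be sketched as follows. This is a simplified model, not the on-media code: the type and function names are hypothetical, persistence barriers are only marked by a comment, and an entry whose sequence numbers are both zero or otherwise inconsistent is simply treated as fresh.

```c
#include <assert.h>
#include <stdint.h>

/* One of the two sections of a flog entry: four 32-bit fields. */
struct flog_section {
	uint32_t lba;
	uint32_t old_map;
	uint32_t new_map;
	uint32_t seq;
};

/* Two sections (32 bytes) padded out to 64 bytes. */
struct flog_entry {
	struct flog_section sec[2];
	uint8_t pad[32];
};

/* Sequence numbers cycle 01 -> 10 -> 11 -> 01; 00 means uninitialized. */
static uint32_t flog_next_seq(uint32_t seq)
{
	return seq == 3 ? 1 : seq + 1;
}

/* Index (0 or 1) of the newest section, or -1 if that cannot be decided. */
static int flog_newest(const struct flog_entry *e)
{
	uint32_t s0 = e->sec[0].seq, s1 = e->sec[1].seq;

	if (s0 > 3 || s1 > 3 || s0 == s1)
		return -1;
	if (s0 == 0)
		return 1;
	if (s1 == 0)
		return 0;
	return flog_next_seq(s0) == s1 ? 1 : 0;
}

/* Overwrite the older section, storing the sequence number last. */
static void flog_update(struct flog_entry *e, uint32_t lba,
			uint32_t old_map, uint32_t new_map)
{
	int newest = flog_newest(e);
	int target = (newest == 0) ? 1 : 0;
	uint32_t seq = newest < 0 ? 1 : flog_next_seq(e->sec[newest].seq);

	e->sec[target].lba = lba;
	e->sec[target].old_map = old_map;
	e->sec[target].new_map = new_map;
	/* A real implementation would need the stores above to be
	 * persistent before the sequence number is written. */
	e->sec[target].seq = seq;
}
```

Because the sequence number is stored last, a section that was interrupted part-way still carries its previous sequence number, so the other section continues to be treated as the newest.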
143 | |||
144 | |||
145 | c. The concept of lanes | ||
146 | ----------------------- | ||
147 | |||
148 | While 'nfree' describes the number of IOs an arena can process | ||
149 | concurrently, 'nlanes' is the number of IOs the BTT device as a whole can | ||
150 | process. | ||
151 | nlanes = min(nfree, num_cpus) | ||
152 | A lane number is obtained at the start of any IO, and is used for indexing into | ||
153 | all the on-disk and in-memory data structures for the duration of the IO. If | ||
154 | there are more CPUs than the max number of available lanes, then lanes are | ||
155 | protected by spinlocks. | ||
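
A small sketch of the lane arithmetic follows. Only the nlanes formula comes from the text; mapping a CPU to a lane with a modulo is an assumption made for illustration, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* nlanes = min(nfree, num_cpus), as given above. */
static uint32_t btt_nlanes(uint32_t nfree, uint32_t num_cpus)
{
	return nfree < num_cpus ? nfree : num_cpus;
}

/* Assumption: a CPU picks its lane by wrapping its id around nlanes. */
static uint32_t lane_for_cpu(uint32_t cpu, uint32_t nlanes)
{
	return cpu % nlanes;
}

/*
 * Lanes can be shared by several CPUs only when there are more CPUs
 * than lanes; that is the case in which a lane lock is needed.
 */
static bool lanes_need_lock(uint32_t nlanes, uint32_t num_cpus)
{
	return num_cpus > nlanes;
}
```

With nfree = 256 and 8 CPUs there are 8 lanes and no sharing; with 512 CPUs there are 256 lanes, and under the modulo assumption CPUs 44 and 300 would contend for lane 44.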
156 | |||
157 | |||
158 | d. In-memory data structure: Read Tracking Table (RTT) | ||
159 | ------------------------------------------------------ | ||
160 | |||
161 | Consider a case where we have two threads, one doing reads and the other, | ||
162 | writes. We can hit a condition where the writer thread grabs a free block to do | ||
163 | a new IO, but the (slow) reader thread is still reading from it. In other words, | ||
164 | the reader consulted a map entry, and started reading the corresponding block. A | ||
165 | writer started writing to the same external LBA, and finished the write updating | ||
166 | the map for that external LBA to point to its new postmap ABA. At this point the | ||
167 | internal, postmap block that the reader is (still) reading has been inserted | ||
168 | into the list of free blocks. If another write comes in for the same LBA, it can | ||
169 | grab this free block, and start writing to it, causing the reader to read | ||
170 | incorrect data. To prevent this, we introduce the RTT. | ||
171 | |||
172 | The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts | ||
173 | into rtt[lane_number], the postmap ABA it is reading, and clears it after the | ||
174 | read is complete. Every writer thread, after grabbing a free block, checks the | ||
175 | RTT for its presence. If the postmap free block is in the RTT, it waits till the | ||
176 | reader clears the RTT entry, and only then starts writing to it. | ||
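
The RTT protocol can be modelled with a small array of atomics. This is a sketch under assumptions: the table size, the `RTT_VALID` marker bit (used so that an empty slot is distinguishable from a read of block 0), and the function names are illustrative choices, not taken from the text.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NFREE      4u                   /* toy value for 'nfree' */
#define RTT_VALID  (UINT32_C(1) << 31)  /* assumed "slot in use" marker */

/* One slot per lane; zero means no read is in flight on that lane. */
static _Atomic uint32_t rtt[NFREE];

/* Reader: publish the postmap ABA before reading the data block. */
static void rtt_enter(uint32_t lane, uint32_t postmap_aba)
{
	atomic_store(&rtt[lane], RTT_VALID | postmap_aba);
}

/* Reader: clear the slot once the read has completed. */
static void rtt_clear(uint32_t lane)
{
	atomic_store(&rtt[lane], 0);
}

/* Writer: is any reader still using this free block? */
static bool rtt_block_in_use(uint32_t free_block)
{
	for (uint32_t i = 0; i < NFREE; i++)
		if (atomic_load(&rtt[i]) == (RTT_VALID | free_block))
			return true;
	return false;
}

/* Writer: spin until no reader references the free block. */
static void rtt_wait_for_readers(uint32_t free_block)
{
	while (rtt_block_in_use(free_block))
		;  /* busy-wait; a real driver might relax the CPU here */
}
```

A writer would call rtt_wait_for_readers() on the block it took from the free list before writing any data into it.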
177 | |||
178 | |||
179 | e. In-memory data structure: map locks | ||
180 | -------------------------------------- | ||
181 | |||
182 | Consider a case where two writer threads are writing to the same LBA. There can | ||
183 | be a race in the following sequence of steps: | ||
184 | |||
185 | free[lane] = map[premap_aba] | ||
186 | map[premap_aba] = postmap_aba | ||
187 | |||
188 | Both threads can update their respective free[lane] with the same old, freed | ||
189 | postmap_aba. This has made the layout inconsistent by losing a free entry, and | ||
190 | at the same time, duplicating another free entry for two lanes. | ||
191 | |||
192 | To solve this, we could have a single map lock (per arena) that has to be taken | ||
193 | before performing the above sequence, but we feel that could be too contentious. | ||
194 | Instead we use an array of (nfree) map_locks that is indexed by | ||
195 | (premap_aba modulo nfree). | ||
196 | |||
197 | |||
198 | f. Reconstruction from the Flog | ||
199 | ------------------------------- | ||
200 | |||
201 | On startup, we analyze the BTT flog to create our list of free blocks. We walk | ||
202 | through all the entries, and for each lane, of the set of two possible | ||
203 | 'sections', we always look at the most recent one only (based on the sequence | ||
204 | number). The reconstruction rules/steps are simple: | ||
205 | - Read map[log_entry.lba]. | ||
206 | - If log_entry.new matches the map entry, then log_entry.old is free. | ||
207 | - If log_entry.new does not match the map entry, then log_entry.new is free. | ||
208 | (This case can only be caused by power-fails/unsafe shutdowns) | ||
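
The rules above reduce to one comparison per lane. The sketch below assumes that the newest section of the flog entry has already been selected and that the map entry is in the normal state; `struct log_entry` and `flog_free_block` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MAP_LBA_MASK UINT32_C(0x3FFFFFFF)  /* bits 29..0 of a map entry */

/* Newest section of one lane's flog entry (sequence number omitted). */
struct log_entry {
	uint32_t lba;      /* premap ABA that was written */
	uint32_t old_map;  /* postmap ABA before the write */
	uint32_t new_map;  /* postmap ABA after the write */
};

/*
 * Decide which block this lane contributes to the free list:
 * - map matches new_map: the write completed, so old_map is free;
 * - otherwise the map update never happened, so new_map is free.
 */
static uint32_t flog_free_block(const uint32_t *map,
				const struct log_entry *le)
{
	uint32_t mapped = map[le->lba] & MAP_LBA_MASK;

	return mapped == le->new_map ? le->old_map : le->new_map;
}
```

In the second case the data already written to new_map is discarded, and the external LBA keeps returning its old contents.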
209 | |||
210 | |||
211 | g. Summarizing - Read and Write flows | ||
212 | ------------------------------------- | ||
213 | |||
214 | Read: | ||
215 | |||
216 | 1. Convert external LBA to arena number + pre-map ABA | ||
217 | 2. Get a lane (and take lane_lock) | ||
218 | 3. Read map to get the entry for this pre-map ABA | ||
219 | 4. Enter post-map ABA into RTT[lane] | ||
220 | 5. If Zero (TRIM) flag set in map, return zeroes, and end IO (go to step 8) | ||
221 | 6. If ERROR flag set in map, end IO with EIO (go to step 8) | ||
222 | 7. Read data from this block | ||
223 | 8. Remove post-map ABA entry from RTT[lane] | ||
224 | 9. Release lane (and lane_lock) | ||
225 | |||
226 | Write: | ||
227 | |||
228 | 1. Convert external LBA to Arena number + pre-map ABA | ||
229 | 2. Get a lane (and take lane_lock) | ||
230 | 3. Use lane to index into in-memory free list and obtain a new block, next flog | ||
231 | index, next sequence number | ||
232 | 4. Scan the RTT to check if free block is present, and spin/wait if it is. | ||
233 | 5. Write data to this free block | ||
234 | 6. Read map to get the existing post-map ABA entry for this pre-map ABA | ||
235 | 7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num] | ||
236 | 8. Write new post-map ABA into map. | ||
237 | 9. Write old post-map entry into the free list | ||
238 | 10. Calculate next sequence number and write into the free list entry | ||
239 | 11. Release lane (and lane_lock) | ||
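
The write flow can be followed end to end in a toy, single-threaded, in-memory model. Everything here is an illustrative assumption: the sizes, the `toy_*` names, the identity-mapped starting state, and the use of a single flog section per lane. Lane locking and the RTT scan (steps 2, 4 and 11) are left out because nothing runs concurrently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR     16u                 /* toy sector size in bytes */
#define NLBA       4u                  /* external blocks in the toy arena */
#define NFREE      2u                  /* free blocks, one per lane */
#define MAP_NORMAL (UINT32_C(3) << 30) /* flag bits 11: valid postmap */
#define MAP_MASK   UINT32_C(0x3FFFFFFF)

struct toy_arena {
	uint8_t  data[NLBA + NFREE][SECTOR];
	uint32_t map[NLBA];
	uint32_t free_block[NFREE];    /* in-memory free list, per lane */
	uint32_t seq[NFREE];           /* next sequence number, per lane */
	struct { uint32_t lba, old_map, new_map, seq; } flog[NFREE];
};

static void toy_init(struct toy_arena *a)
{
	memset(a, 0, sizeof(*a));
	for (uint32_t i = 0; i < NLBA; i++)
		a->map[i] = MAP_NORMAL | i;      /* start identity-mapped */
	for (uint32_t l = 0; l < NFREE; l++) {
		a->free_block[l] = NLBA + l;     /* spare blocks at the end */
		a->seq[l] = 1;
	}
}

static void toy_write(struct toy_arena *a, uint32_t lane,
		      uint32_t premap, const uint8_t *buf)
{
	uint32_t new_blk = a->free_block[lane];         /* step 3 */
	memcpy(a->data[new_blk], buf, SECTOR);          /* step 5 */
	uint32_t old_blk = a->map[premap] & MAP_MASK;   /* step 6 */

	a->flog[lane].lba = premap;                     /* step 7 */
	a->flog[lane].old_map = old_blk;
	a->flog[lane].new_map = new_blk;
	a->flog[lane].seq = a->seq[lane];               /* seq written last */

	a->map[premap] = MAP_NORMAL | new_blk;          /* step 8 */
	a->free_block[lane] = old_blk;                  /* step 9 */
	a->seq[lane] = a->seq[lane] == 3 ? 1 : a->seq[lane] + 1; /* step 10 */
}

static void toy_read(const struct toy_arena *a, uint32_t premap,
		     uint8_t *buf)
{
	memcpy(buf, a->data[a->map[premap] & MAP_MASK], SECTOR);
}
```

After one write to premap ABA 2 on lane 0, the map points at the former free block and block 2 becomes lane 0's free block. The old data is never overwritten in place, which is what makes the update atomic from the reader's point of view.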
240 | |||
241 | |||
242 | 4. Error Handling | ||
243 | ----------------- | ||
244 | |||
245 | An arena would be in an error state if any of the metadata is corrupted | ||
246 | irrecoverably, either due to a bug or a media error. The following conditions | ||
247 | indicate an error: | ||
248 | - Info block checksum does not match (and recovering from the copy also fails) | ||
249 | - Not all internal available blocks are uniquely and entirely addressed by the | ||
250 | sum of mapped blocks and free blocks (from the BTT flog). | ||
251 | - Rebuilding free list from the flog reveals missing/duplicate/impossible | ||
252 | entries | ||
253 | - A map entry is out of bounds | ||
254 | |||
255 | If any of these error conditions is encountered, the arena is put into a | ||
256 | read-only state using a flag in the info block. | ||
257 | |||
258 | |||
259 | 5. In-kernel usage | ||
260 | ------------------ | ||
261 | |||
262 | Any block driver that supports byte granularity IO to the storage may register | ||
263 | with the BTT. It will have to provide the rw_bytes interface in its | ||
264 | block_device_operations struct: | ||
265 | |||
266 | int (*rw_bytes)(struct gendisk *, void *, size_t, off_t, int rw); | ||
267 | |||
268 | It may register with the BTT after it adds its own gendisk, using btt_init: | ||
269 | |||
270 | struct btt *btt_init(struct gendisk *disk, unsigned long long rawsize, | ||
271 | u32 lbasize, u8 uuid[], int maxlane); | ||
272 | |||
273 | Note that maxlane is the maximum amount of concurrency the driver wishes to | ||
274 | allow the BTT to use. | ||
275 | |||
276 | The BTT 'disk' appears as a stacked block device that grabs the underlying block | ||
277 | device in the O_EXCL mode. | ||
278 | |||
279 | When the driver wishes to remove the backing disk, it should similarly call | ||
280 | btt_fini using the same struct btt* handle that was provided to it by btt_init. | ||
281 | |||
282 | void btt_fini(struct btt *btt); | ||
283 | |||