author    Vishal Verma <vishal.l.verma@intel.com>    2015-06-25 04:20:32 -0400
committer Dan Williams <dan.j.williams@intel.com>    2015-06-26 11:23:38 -0400
commit    5212e11fde4d40fa627668b4f2222d20db488f71 (patch)
tree      153bae097b056dfc44f1781c69e24a8d1e71584a
parent    8c2f7e8658df1d3b7cbfa62706941d14c715823a (diff)
nd_btt: atomic sector updates
BTT stands for Block Translation Table, and is a way to provide power fail
sector atomicity semantics for block devices that have the ability to perform
byte granularity IO. It relies on the capability of libnvdimm namespace devices
to do byte aligned IO.

The BTT works as a stacked block device, and reserves a chunk of space from the
backing device for its accounting metadata. It is a bio-based driver because
all IO is done synchronously, and there is no queuing or asynchronous
completions at either the device or the driver level.

The BTT uses 'lanes' to index into various 'on-disk' data structures, and lanes
also act as a synchronization mechanism in case there are more CPUs than
available lanes. We did a comparison between two lane lock strategies - first
where we kept an atomic counter around that tracked which was the last lane
that was used, and 'our' lane was determined by atomically incrementing that.
That way, for the nr_cpus > nr_lanes case, theoretically, no CPU would be
blocked waiting for a lane. The other strategy was to use the cpu number we're
scheduled on, and hash it to a lane number. Theoretically, this could block an
IO that could've otherwise run using a different, free lane. But some fio
workloads showed that the direct cpu -> lane hash performed faster than
tracking 'last lane' - my reasoning is the cache thrash caused by moving the
atomic variable made that approach slower than simply waiting out the
in-progress IO. This supports the conclusion that the driver can be a very
simple bio-based one that does synchronous IOs instead of queuing.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-rw-r--r--  Documentation/nvdimm/btt.txt      273
-rw-r--r--  drivers/acpi/nfit.c                 1
-rw-r--r--  drivers/nvdimm/Kconfig             28
-rw-r--r--  drivers/nvdimm/Makefile             3
-rw-r--r--  drivers/nvdimm/btt.c             1371
-rw-r--r--  drivers/nvdimm/btt.h              141
-rw-r--r--  drivers/nvdimm/btt_devs.c           3
-rw-r--r--  drivers/nvdimm/namespace_devs.c    24
-rw-r--r--  drivers/nvdimm/nd.h                22
-rw-r--r--  drivers/nvdimm/pmem.c              14
-rw-r--r--  drivers/nvdimm/region.c            12
-rw-r--r--  drivers/nvdimm/region_devs.c       82
-rw-r--r--  include/linux/libnvdimm.h           1
13 files changed, 1950 insertions, 25 deletions
diff --git a/Documentation/nvdimm/btt.txt b/Documentation/nvdimm/btt.txt
new file mode 100644
index 000000000000..95134d5ec4a0
--- /dev/null
+++ b/Documentation/nvdimm/btt.txt
@@ -0,0 +1,273 @@
1BTT - Block Translation Table
2=============================
3
4
51. Introduction
6---------------
7
8Persistent memory based storage is able to perform IO at byte (or more
9accurately, cache line) granularity. However, we often want to expose such
10storage as traditional block devices. The block drivers for persistent memory
11will do exactly this. However, they do not provide any atomicity guarantees.
12Traditional SSDs typically provide protection against torn sectors in hardware,
13using stored energy in capacitors to complete in-flight block writes, or perhaps
14in firmware. We don't have this luxury with persistent memory - if a write is in
15progress, and we experience a power failure, the block will contain a mix of old
16and new data. Applications may not be prepared to handle such a scenario.
17
18The Block Translation Table (BTT) provides atomic sector update semantics for
19persistent memory devices, so that applications that rely on sector writes not
20being torn can continue to do so. The BTT manifests itself as a stacked block
21device, and reserves a portion of the underlying storage for its metadata. At
22the heart of it is an indirection table that re-maps all the blocks on the
23volume. It can be thought of as an extremely simple file system that only
24provides atomic sector updates.
25
26
272. Static Layout
28----------------
29
30The underlying storage on which a BTT can be laid out is not limited in any way.
31The BTT, however, splits the available space into chunks of up to 512 GiB,
32called "Arenas".
33
34Each arena follows the same layout for its metadata, and all references in an
35arena are internal to it (with the exception of one field that points to the
36next arena). The following depicts the "On-disk" metadata layout:
37
38
39 Backing Store +-------> Arena
40+---------------+ | +------------------+
41| | | | Arena info block |
42| Arena 0 +---+ | 4K |
43| 512G | +------------------+
44| | | |
45+---------------+ | |
46| | | |
47| Arena 1 | | Data Blocks |
48| 512G | | |
49| | | |
50+---------------+ | |
51| . | | |
52| . | | |
53| . | | |
54| | | |
55| | | |
56+---------------+ +------------------+
57 | |
58 | BTT Map |
59 | |
60 | |
61 +------------------+
62 | |
63 | BTT Flog |
64 | |
65 +------------------+
66 | Info block copy |
67 | 4K |
68 +------------------+
69
70
713. Theory of Operation
72----------------------
73
74
75a. The BTT Map
76--------------
77
78The map is a simple lookup/indirection table that maps an LBA to an internal
79block. Each map entry is 32 bits. The two most significant bits are special
80flags, and the remaining form the internal block number.
81
82Bit Description
8331 : TRIM flag - marks if the block was trimmed or discarded
8430 : ERROR flag - marks an error block. Cleared on write.
8529 - 0 : Mappings to internal 'postmap' blocks
86
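As a rough illustration of the bit layout above, decoding a raw 32-bit map
entry could look like the sketch below (illustrative userspace C with made-up
names, not the driver code; note that btt_map_read() in btt.c further down
additionally treats the 'both flags set' encoding as a normal entry and an
all-zeroes entry as the initial identity mapping):

    #include <stdint.h>

    /* Bit positions as defined in btt.h later in this patch */
    #define MAP_TRIM_SHIFT 31
    #define MAP_ERR_SHIFT  30
    #define MAP_LBA_MASK   (~((1u << MAP_TRIM_SHIFT) | (1u << MAP_ERR_SHIFT)))

    struct map_ent {
            uint32_t postmap;       /* internal 'postmap' block number */
            int trim;               /* block was trimmed/discarded */
            int error;              /* block holds an error; cleared on write */
    };

    static struct map_ent decode_map_entry(uint32_t raw)
    {
            struct map_ent ent = {
                    .postmap = raw & MAP_LBA_MASK,
                    .trim    = (raw >> MAP_TRIM_SHIFT) & 1,
                    .error   = (raw >> MAP_ERR_SHIFT) & 1,
            };

            return ent;
    }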
87
88Some of the terminology that will be subsequently used:
89
90External LBA : LBA as made visible to upper layers.
91ABA : Arena Block Address - Block offset/number within an arena
92Premap ABA : The block offset into an arena, which was decided upon by range
93 checking the External LBA
94Postmap ABA : The block number in the "Data Blocks" area obtained after
95 indirection from the map
96nfree : The number of free blocks that are maintained at any given time.
97 This is the number of concurrent writes that can happen to the
98 arena.
99
100
101For example, after adding a BTT, we surface a disk of 1024G. We get a read for
102the external LBA at 768G. This falls into the second arena, and of the 512G
103worth of blocks that this arena contributes, this block is at 256G. Thus, the
104premap ABA is 256G. We now refer to the map, and find that the mapping for block
105'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
106
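That walk over the arenas can be sketched as follows; it mirrors
lba_to_arena() further down in this patch, minus the sector-size conversion,
and the struct/function names here are made up for the sketch:

    #include <stdint.h>

    /* Made-up per-arena geometry for this sketch */
    struct arena_geom {
            uint64_t external_nlba;         /* blocks this arena contributes */
    };

    /* Subtract each arena's contribution until the external LBA falls inside one */
    static int lba_to_arena_idx(uint64_t ext_lba, const struct arena_geom *arenas,
                                unsigned narenas, unsigned *idx, uint64_t *premap)
    {
            for (unsigned i = 0; i < narenas; i++) {
                    if (ext_lba < arenas[i].external_nlba) {
                            *idx = i;
                            *premap = ext_lba;
                            return 0;
                    }
                    ext_lba -= arenas[i].external_nlba;
            }

            return -1;      /* past the end of the device */
    }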
107
108b. The BTT Flog
109---------------
110
111The BTT provides sector atomicity by making every write an "allocating write",
112i.e. every write goes to a "free" block. A running list of free blocks is
113maintained in the form of the BTT flog. 'Flog' is a combination of the words
114"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
115
116lba : The premap ABA that is being written to
117old_map : The old postmap ABA - after 'this' write completes, this will be a
118 free block.
119new_map : The new postmap ABA. The map will be updated to reflect this
120 lba->postmap_aba mapping, but we log it here in case we have to
121 recover.
122seq : Sequence number to mark which of the 2 sections of this flog entry is
123 valid/newest. It cycles between 01->10->11->01 (binary) under normal
124 operation, with 00 indicating an uninitialized state.
125lba' : alternate lba entry
126old_map': alternate old postmap entry
127new_map': alternate new postmap entry
128seq' : alternate sequence number.
129
130Each of the above fields is 32-bit, so each of the two sections is 16 bytes.
131Flog updates are done such that, for any entry being written, we:
132a. overwrite the 'old' section in the entry, chosen based on the sequence numbers
133b. write the new data such that the sequence number is written last.
134
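A userspace-style sketch of one flog section and of the ordering rule above is
shown below. The names are made up for the sketch; the in-kernel
__btt_log_write() instead splits the 16 bytes into two 8-byte halves with the
sequence number in the second half, which preserves the same "seq written
last" property:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* One 'section' of a flog entry; a slot on media holds two, back to back */
    struct flog_section {
            uint32_t lba;           /* premap ABA being written */
            uint32_t old_map;       /* postmap ABA freed by this write */
            uint32_t new_map;       /* postmap ABA the map will point to */
            uint32_t seq;           /* 01 -> 10 -> 11 -> 01..., 00 = uninitialized */
    };

    static unsigned char media[1u << 20];   /* stand-in for the backing store */

    static int media_write(uint64_t off, const void *buf, size_t len)
    {
            memcpy(media + off, buf, len);
            return 0;
    }

    /*
     * Overwrite the older of the two sections, writing the sequence number
     * last so the section only becomes valid once the rest of it is durable.
     */
    static int flog_update(uint64_t slot_off, int old_sub,
                           const struct flog_section *ent)
    {
            uint64_t off = slot_off + old_sub * sizeof(*ent);
            int rc;

            rc = media_write(off, ent, offsetof(struct flog_section, seq));
            if (rc)
                    return rc;

            return media_write(off + offsetof(struct flog_section, seq),
                               &ent->seq, sizeof(ent->seq));
    }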
135
136c. The concept of lanes
137-----------------------
138
139While 'nfree' describes the number of IOs an arena can process
140concurrently, 'nlanes' is the number of IOs the BTT device as a whole can
141process.
142 nlanes = min(nfree, num_cpus)
143A lane number is obtained at the start of any IO, and is used for indexing into
144all the on-disk and in-memory data structures for the duration of the IO. It is
145protected by a spinlock.
146
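The commit message above notes that hashing the current CPU to a lane
performed well; a minimal userspace sketch of that idea follows (fixed lane
count and pthread locks purely for illustration - the driver uses
nd_region_acquire_lane()/nd_region_release_lane()):

    #include <pthread.h>

    #define NLANES 256      /* nlanes = min(nfree, num_cpus) per the text above */

    static pthread_mutex_t lane_lock[NLANES];       /* stand-in for per-lane spinlocks */

    static void lane_locks_init(void)
    {
            for (unsigned i = 0; i < NLANES; i++)
                    pthread_mutex_init(&lane_lock[i], NULL);
    }

    /* Hash the CPU we are running on to a lane; block if that lane is in use */
    static unsigned acquire_lane(unsigned cpu)
    {
            unsigned lane = cpu % NLANES;

            pthread_mutex_lock(&lane_lock[lane]);
            return lane;
    }

    static void release_lane(unsigned lane)
    {
            pthread_mutex_unlock(&lane_lock[lane]);
    }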
147
148d. In-memory data structure: Read Tracking Table (RTT)
149------------------------------------------------------
150
151Consider a case where we have two threads, one doing reads and the other,
152writes. We can hit a condition where the writer thread grabs a free block to do
153a new IO, but the (slow) reader thread is still reading from it. In other words,
154the reader consulted a map entry, and started reading the corresponding block. A
155writer started writing to the same external LBA, and finished the write, updating
156the map for that external LBA to point to its new postmap ABA. At this point the
157internal, postmap block that the reader is (still) reading has been inserted
158into the list of free blocks. If another write comes in for the same LBA, it can
159grab this free block, and start writing to it, causing the reader to read
160incorrect data. To prevent this, we introduce the RTT.
161
162The RTT is a simple, per-arena table with 'nfree' entries. Every reader inserts
163the postmap ABA it is reading into rtt[lane_number], and clears it after the
164read is complete. Every writer thread, after grabbing a free block, checks the
165RTT for its presence. If the postmap free block is in the RTT, it waits till the
166reader clears the RTT entry, and only then starts writing to it.
167
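A rough userspace sketch of this handshake, using the same
RTT_VALID/RTT_INVALID convention that btt.h defines later in this patch
(function names made up for the sketch):

    #include <stdatomic.h>
    #include <stdint.h>

    #define NFREE           256
    #define RTT_VALID       (1u << 31)      /* same convention as btt.h below */
    #define RTT_INVALID     0u

    static _Atomic uint32_t rtt[NFREE];     /* zero-initialized: every slot invalid */

    /* Reader: advertise the postmap block we are about to read from */
    static void rtt_enter(unsigned lane, uint32_t postmap)
    {
            atomic_store(&rtt[lane], RTT_VALID | postmap);
    }

    static void rtt_exit(unsigned lane)
    {
            atomic_store(&rtt[lane], RTT_INVALID);
    }

    /* Writer: before reusing a freed block, wait until no reader holds it */
    static void rtt_wait_for_block(uint32_t postmap)
    {
            for (unsigned i = 0; i < NFREE; i++)
                    while (atomic_load(&rtt[i]) == (RTT_VALID | postmap))
                            ;       /* spin, as btt_write_pg() does with cpu_relax() */
    }

The reader side in btt_read_pg() additionally re-reads the map entry after
storing into the RTT, to catch a remap that raced with the RTT store.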
168
169e. In-memory data structure: map locks
170--------------------------------------
171
172Consider a case where two writer threads are writing to the same LBA. There can
173be a race in the following sequence of steps:
174
175free[lane] = map[premap_aba]
176map[premap_aba] = postmap_aba
177
178Both threads can update their respective free[lane] with the same old, freed
179postmap_aba. This has made the layout inconsistent by losing a free entry, and
180at the same time, duplicating another free entry for two lanes.
181
182To solve this, we could have a single map lock (per arena) that has to be taken
183before performing the above sequence, but we feel that could be too contentious.
184Instead we use an array of (nfree) map_locks that is indexed by
185(premap_aba modulo nfree).
186
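In sketch form, the striped locking is just an index computation around the
two-step update shown above (illustrative userspace C with made-up names; the
driver's lock_map() additionally scales the index by the map entry size
relative to the cache line size):

    #include <pthread.h>
    #include <stdint.h>

    #define NFREE 256

    /* Assume each lock was set up with pthread_mutex_init() at start-up */
    static pthread_mutex_t map_locks[NFREE];
    static uint32_t map[1024];              /* external_nlba entries in reality */
    static uint32_t free_block[NFREE];      /* per-lane free block, as in the text */

    /*
     * Do "free[lane] = map[premap]; map[premap] = postmap" under the stripe
     * lock so that two writers to the same premap ABA cannot interleave.
     */
    static void map_update(unsigned lane, uint32_t premap, uint32_t postmap)
    {
            pthread_mutex_t *lock = &map_locks[premap % NFREE];

            pthread_mutex_lock(lock);
            free_block[lane] = map[premap];
            map[premap] = postmap;
            pthread_mutex_unlock(lock);
    }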
187
188f. Reconstruction from the Flog
189-------------------------------
190
191On startup, we analyze the BTT flog to create our list of free blocks. We walk
192through all the entries, and for each lane, out of its two possible 'sections',
193we only look at the most recent one (based on the sequence number). The
194reconstruction rules/steps are simple:
195- Read map[log_entry.lba].
196- If log_entry.new matches the map entry, then log_entry.old is free.
197- If log_entry.new does not match the map entry, then log_entry.new is free.
198 (This case can only be caused by power-fails/unsafe shutdowns)
199
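A minimal sketch of those rules (names made up for the sketch; the in-kernel
btt_freelist_init() handles the power-fail case differently, by completing the
interrupted map write and then treating old_map as free):

    #include <stdint.h>

    struct flog_section {
            uint32_t lba, old_map, new_map, seq;
    };

    /*
     * Given the newest section for a lane and the current map entry for
     * section->lba, pick the internal block that must be free for this lane.
     */
    static uint32_t free_block_for_lane(const struct flog_section *newest,
                                        uint32_t map_entry)
    {
            if (newest->new_map == map_entry)
                    return newest->old_map; /* normal case: the old block is free */

            /* The map write never completed (power fail / unsafe shutdown) */
            return newest->new_map;
    }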
200
201g. Summarizing - Read and Write flows
202-------------------------------------
203
204Read:
205
2061. Convert external LBA to arena number + pre-map ABA
2072. Get a lane (and take lane_lock)
2083. Read map to get the entry for this pre-map ABA
2094. Enter post-map ABA into RTT[lane]
2105. If TRIM flag set in map, return zeroes, and end IO (go to step 8)
2116. If ERROR flag set in map, end IO with EIO (go to step 8)
2127. Read data from this block
2138. Remove post-map ABA entry from RTT[lane]
2149. Release lane (and lane_lock)
215
216Write:
217
2181. Convert external LBA to Arena number + pre-map ABA
2192. Get a lane (and take lane_lock)
2203. Use lane to index into in-memory free list and obtain a new block, next flog
221 index, next sequence number
2224. Scan the RTT to check if free block is present, and spin/wait if it is.
2235. Write data to this free block
2246. Read map to get the existing post-map ABA entry for this pre-map ABA
2257. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
2268. Write new post-map ABA into map.
2279. Write old post-map entry into the free list
22810. Calculate next sequence number and write into the free list entry
22911. Release lane (and lane_lock)
230
231
2324. Error Handling
233=================
234
235An arena would be in an error state if any of the metadata is corrupted
236irrecoverably, either due to a bug or a media error. The following conditions
237indicate an error:
238- Info block checksum does not match (and recovering from the copy also fails)
239- The sum of mapped blocks and free blocks (from the BTT flog) does not
240  uniquely and entirely address all the available internal blocks.
241- Rebuilding free list from the flog reveals missing/duplicate/impossible
242 entries
243- A map entry is out of bounds
244
245If any of these error conditions are encountered, the arena is put into a read
246only state using a flag in the info block.
247
248
2495. In-kernel usage
250==================
251
252Any block driver that supports byte granularity IO to the storage may register
253with the BTT. It will have to provide the rw_bytes interface in its
254block_device_operations struct:
255
256 int (*rw_bytes)(struct gendisk *, void *, size_t, off_t, int rw);
257
258It may register with the BTT after it adds its own gendisk, using btt_init:
259
260 struct btt *btt_init(struct gendisk *disk, unsigned long long rawsize,
261 u32 lbasize, u8 uuid[], int maxlane);
262
263Note that maxlane is the maximum amount of concurrency the driver wishes to
264allow the BTT to use.
265
266The BTT 'disk' appears as a stacked block device that grabs the underlying block
267device in O_EXCL mode.
268
269When the driver wishes to remove the backing disk, it should similarly call
270btt_fini using the same struct btt* handle that was returned by btt_init.
271
272 void btt_fini(struct btt *btt);
273
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 35af6f7f0abd..fc38b49eff7d 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -902,6 +902,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
 	} else {
 		nd_mapping->size = nfit_mem->bdw->capacity;
 		nd_mapping->start = nfit_mem->bdw->start_address;
+		ndr_desc->num_lanes = nfit_mem->bdw->windows;
 		blk_valid = 1;
 	}
 
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 5680e8e7a7aa..204ee0796411 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -8,11 +8,11 @@ menuconfig LIBNVDIMM
 	  NFIT, or otherwise can discover NVDIMM resources, a libnvdimm
 	  bus is registered to advertise PMEM (persistent memory)
 	  namespaces (/dev/pmemX) and BLK (sliding mmio window(s))
-	  namespaces (/dev/ndX). A PMEM namespace refers to a memory
-	  resource that may span multiple DIMMs and support DAX (see
-	  CONFIG_DAX). A BLK namespace refers to an NVDIMM control
-	  region which exposes an mmio register set for windowed
-	  access mode to non-volatile memory.
+	  namespaces (/dev/ndblkX.Y). A PMEM namespace refers to a
+	  memory resource that may span multiple DIMMs and support DAX
+	  (see CONFIG_DAX). A BLK namespace refers to an NVDIMM control
+	  region which exposes an mmio register set for windowed access
+	  mode to non-volatile memory.
 
 if LIBNVDIMM
 
@@ -20,6 +20,7 @@ config BLK_DEV_PMEM
 	tristate "PMEM: Persistent memory block device support"
 	default LIBNVDIMM
 	depends on HAS_IOMEM
+	select ND_BTT if BTT
 	help
 	  Memory ranges for PMEM are described by either an NFIT
 	  (NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
@@ -33,7 +34,22 @@ config BLK_DEV_PMEM
 
 	  Say Y if you want to use an NVDIMM
 
+config ND_BTT
+	tristate
+
 config BTT
-	def_bool y
+	bool "BTT: Block Translation Table (atomic sector updates)"
+	default y if LIBNVDIMM
+	help
+	  The Block Translation Table (BTT) provides atomic sector
+	  update semantics for persistent memory devices, so that
+	  applications that rely on sector writes not being torn (a
+	  guarantee that typical disks provide) can continue to do so.
+	  The BTT manifests itself as an alternate personality for an
+	  NVDIMM namespace, i.e. a namespace can be in raw mode (pmemX,
+	  ndblkX.Y, etc...), or 'sectored' mode, (pmemXs, ndblkX.Ys,
+	  etc...).
+
+	  Select Y if unsure
 
 endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 6085b4bd7312..d2aab6c58492 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -1,8 +1,11 @@
 obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
+obj-$(CONFIG_ND_BTT) += nd_btt.o
 
 nd_pmem-y := pmem.o
 
+nd_btt-y := btt.o
+
 libnvdimm-y := core.o
 libnvdimm-y += bus.o
 libnvdimm-y += dimm_devs.o
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
new file mode 100644
index 000000000000..7ae38aac2c25
--- /dev/null
+++ b/drivers/nvdimm/btt.c
@@ -0,0 +1,1371 @@
1/*
2 * Block Translation Table
3 * Copyright (c) 2014-2015, Intel Corporation.
4 *
5 * This program is free software; you can redistribute it and/or modify it
6 * under the terms and conditions of the GNU General Public License,
7 * version 2, as published by the Free Software Foundation.
8 *
9 * This program is distributed in the hope it will be useful, but WITHOUT
10 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
11 * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
12 * more details.
13 */
14#include <linux/highmem.h>
15#include <linux/debugfs.h>
16#include <linux/blkdev.h>
17#include <linux/module.h>
18#include <linux/device.h>
19#include <linux/mutex.h>
20#include <linux/hdreg.h>
21#include <linux/genhd.h>
22#include <linux/sizes.h>
23#include <linux/ndctl.h>
24#include <linux/fs.h>
25#include <linux/nd.h>
26#include "btt.h"
27#include "nd.h"
28
29enum log_ent_request {
30 LOG_NEW_ENT = 0,
31 LOG_OLD_ENT
32};
33
34static int btt_major;
35
36static int arena_read_bytes(struct arena_info *arena, resource_size_t offset,
37 void *buf, size_t n)
38{
39 struct nd_btt *nd_btt = arena->nd_btt;
40 struct nd_namespace_common *ndns = nd_btt->ndns;
41
42 /* arena offsets are 4K from the base of the device */
43 offset += SZ_4K;
44 return nvdimm_read_bytes(ndns, offset, buf, n);
45}
46
47static int arena_write_bytes(struct arena_info *arena, resource_size_t offset,
48 void *buf, size_t n)
49{
50 struct nd_btt *nd_btt = arena->nd_btt;
51 struct nd_namespace_common *ndns = nd_btt->ndns;
52
53 /* arena offsets are 4K from the base of the device */
54 offset += SZ_4K;
55 return nvdimm_write_bytes(ndns, offset, buf, n);
56}
57
58static int btt_info_write(struct arena_info *arena, struct btt_sb *super)
59{
60 int ret;
61
62 ret = arena_write_bytes(arena, arena->info2off, super,
63 sizeof(struct btt_sb));
64 if (ret)
65 return ret;
66
67 return arena_write_bytes(arena, arena->infooff, super,
68 sizeof(struct btt_sb));
69}
70
71static int btt_info_read(struct arena_info *arena, struct btt_sb *super)
72{
73 WARN_ON(!super);
74 return arena_read_bytes(arena, arena->infooff, super,
75 sizeof(struct btt_sb));
76}
77
78/*
79 * 'raw' version of btt_map write
80 * Assumptions:
81 * mapping is in little-endian
82 * mapping contains 'E' and 'Z' flags as desired
83 */
84static int __btt_map_write(struct arena_info *arena, u32 lba, __le32 mapping)
85{
86 u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE);
87
88 WARN_ON(lba >= arena->external_nlba);
89 return arena_write_bytes(arena, ns_off, &mapping, MAP_ENT_SIZE);
90}
91
92static int btt_map_write(struct arena_info *arena, u32 lba, u32 mapping,
93 u32 z_flag, u32 e_flag)
94{
95 u32 ze;
96 __le32 mapping_le;
97
98 /*
99 * This 'mapping' is supposed to be just the LBA mapping, without
100 * any flags set, so strip the flag bits.
101 */
102 mapping &= MAP_LBA_MASK;
103
104 ze = (z_flag << 1) + e_flag;
105 switch (ze) {
106 case 0:
107 /*
108 * We want to set neither of the Z or E flags, and
109 * in the actual layout, this means setting the bit
110 * positions of both to '1' to indicate a 'normal'
111 * map entry
112 */
113 mapping |= MAP_ENT_NORMAL;
114 break;
115 case 1:
116 mapping |= (1 << MAP_ERR_SHIFT);
117 break;
118 case 2:
119 mapping |= (1 << MAP_TRIM_SHIFT);
120 break;
121 default:
122 /*
123 * The case where Z and E are both sent in as '1' could be
124 * construed as a valid 'normal' case, but we decide not to,
125 * to avoid confusion
126 */
127 WARN_ONCE(1, "Invalid use of Z and E flags\n");
128 return -EIO;
129 }
130
131 mapping_le = cpu_to_le32(mapping);
132 return __btt_map_write(arena, lba, mapping_le);
133}
134
135static int btt_map_read(struct arena_info *arena, u32 lba, u32 *mapping,
136 int *trim, int *error)
137{
138 int ret;
139 __le32 in;
140 u32 raw_mapping, postmap, ze, z_flag, e_flag;
141 u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE);
142
143 WARN_ON(lba >= arena->external_nlba);
144
145 ret = arena_read_bytes(arena, ns_off, &in, MAP_ENT_SIZE);
146 if (ret)
147 return ret;
148
149 raw_mapping = le32_to_cpu(in);
150
151 z_flag = (raw_mapping & MAP_TRIM_MASK) >> MAP_TRIM_SHIFT;
152 e_flag = (raw_mapping & MAP_ERR_MASK) >> MAP_ERR_SHIFT;
153 ze = (z_flag << 1) + e_flag;
154 postmap = raw_mapping & MAP_LBA_MASK;
155
156 /* Reuse the {z,e}_flag variables for *trim and *error */
157 z_flag = 0;
158 e_flag = 0;
159
160 switch (ze) {
161 case 0:
162 /* Initial state. Return postmap = premap */
163 *mapping = lba;
164 break;
165 case 1:
166 *mapping = postmap;
167 e_flag = 1;
168 break;
169 case 2:
170 *mapping = postmap;
171 z_flag = 1;
172 break;
173 case 3:
174 *mapping = postmap;
175 break;
176 default:
177 return -EIO;
178 }
179
180 if (trim)
181 *trim = z_flag;
182 if (error)
183 *error = e_flag;
184
185 return ret;
186}
187
188static int btt_log_read_pair(struct arena_info *arena, u32 lane,
189 struct log_entry *ent)
190{
191 WARN_ON(!ent);
192 return arena_read_bytes(arena,
193 arena->logoff + (2 * lane * LOG_ENT_SIZE), ent,
194 2 * LOG_ENT_SIZE);
195}
196
197static struct dentry *debugfs_root;
198
199static void arena_debugfs_init(struct arena_info *a, struct dentry *parent,
200 int idx)
201{
202 char dirname[32];
203 struct dentry *d;
204
205 /* If for some reason, parent bttN was not created, exit */
206 if (!parent)
207 return;
208
209 snprintf(dirname, 32, "arena%d", idx);
210 d = debugfs_create_dir(dirname, parent);
211 if (IS_ERR_OR_NULL(d))
212 return;
213 a->debugfs_dir = d;
214
215 debugfs_create_x64("size", S_IRUGO, d, &a->size);
216 debugfs_create_x64("external_lba_start", S_IRUGO, d,
217 &a->external_lba_start);
218 debugfs_create_x32("internal_nlba", S_IRUGO, d, &a->internal_nlba);
219 debugfs_create_u32("internal_lbasize", S_IRUGO, d,
220 &a->internal_lbasize);
221 debugfs_create_x32("external_nlba", S_IRUGO, d, &a->external_nlba);
222 debugfs_create_u32("external_lbasize", S_IRUGO, d,
223 &a->external_lbasize);
224 debugfs_create_u32("nfree", S_IRUGO, d, &a->nfree);
225 debugfs_create_u16("version_major", S_IRUGO, d, &a->version_major);
226 debugfs_create_u16("version_minor", S_IRUGO, d, &a->version_minor);
227 debugfs_create_x64("nextoff", S_IRUGO, d, &a->nextoff);
228 debugfs_create_x64("infooff", S_IRUGO, d, &a->infooff);
229 debugfs_create_x64("dataoff", S_IRUGO, d, &a->dataoff);
230 debugfs_create_x64("mapoff", S_IRUGO, d, &a->mapoff);
231 debugfs_create_x64("logoff", S_IRUGO, d, &a->logoff);
232 debugfs_create_x64("info2off", S_IRUGO, d, &a->info2off);
233 debugfs_create_x32("flags", S_IRUGO, d, &a->flags);
234}
235
236static void btt_debugfs_init(struct btt *btt)
237{
238 int i = 0;
239 struct arena_info *arena;
240
241 btt->debugfs_dir = debugfs_create_dir(dev_name(&btt->nd_btt->dev),
242 debugfs_root);
243 if (IS_ERR_OR_NULL(btt->debugfs_dir))
244 return;
245
246 list_for_each_entry(arena, &btt->arena_list, list) {
247 arena_debugfs_init(arena, btt->debugfs_dir, i);
248 i++;
249 }
250}
251
252/*
253 * This function accepts two log entries, and uses the
254 * sequence number to find the 'older' entry.
255 * It also updates the sequence number in this old entry to
256 * make it the 'new' one if the mark_flag is set.
257 * Finally, it returns which of the entries was the older one.
258 *
259 * TODO The logic feels a bit kludge-y. make it better..
260 */
261static int btt_log_get_old(struct log_entry *ent)
262{
263 int old;
264
265 /*
266 * the first ever time this is seen, the entry goes into [0]
267 * the next time, the following logic works out to put this
268 * (next) entry into [1]
269 */
270 if (ent[0].seq == 0) {
271 ent[0].seq = cpu_to_le32(1);
272 return 0;
273 }
274
275 if (ent[0].seq == ent[1].seq)
276 return -EINVAL;
277 if (le32_to_cpu(ent[0].seq) + le32_to_cpu(ent[1].seq) > 5)
278 return -EINVAL;
279
280 if (le32_to_cpu(ent[0].seq) < le32_to_cpu(ent[1].seq)) {
281 if (le32_to_cpu(ent[1].seq) - le32_to_cpu(ent[0].seq) == 1)
282 old = 0;
283 else
284 old = 1;
285 } else {
286 if (le32_to_cpu(ent[0].seq) - le32_to_cpu(ent[1].seq) == 1)
287 old = 1;
288 else
289 old = 0;
290 }
291
292 return old;
293}
294
295static struct device *to_dev(struct arena_info *arena)
296{
297 return &arena->nd_btt->dev;
298}
299
300/*
301 * This function copies the desired (old/new) log entry into ent if
302 * it is not NULL. It returns the sub-slot number (0 or 1)
303 * where the desired log entry was found. Negative return values
304 * indicate errors.
305 */
306static int btt_log_read(struct arena_info *arena, u32 lane,
307 struct log_entry *ent, int old_flag)
308{
309 int ret;
310 int old_ent, ret_ent;
311 struct log_entry log[2];
312
313 ret = btt_log_read_pair(arena, lane, log);
314 if (ret)
315 return -EIO;
316
317 old_ent = btt_log_get_old(log);
318 if (old_ent < 0 || old_ent > 1) {
319 dev_info(to_dev(arena),
320 "log corruption (%d): lane %d seq [%d, %d]\n",
321 old_ent, lane, log[0].seq, log[1].seq);
322 /* TODO set error state? */
323 return -EIO;
324 }
325
326 ret_ent = (old_flag ? old_ent : (1 - old_ent));
327
328 if (ent != NULL)
329 memcpy(ent, &log[ret_ent], LOG_ENT_SIZE);
330
331 return ret_ent;
332}
333
334/*
335 * This function commits a log entry to media
336 * It does _not_ prepare the freelist entry for the next write
337 * btt_flog_write is the wrapper for updating the freelist elements
338 */
339static int __btt_log_write(struct arena_info *arena, u32 lane,
340 u32 sub, struct log_entry *ent)
341{
342 int ret;
343 /*
344 * Ignore the padding in log_entry for calculating log_half.
345 * The entry is 'committed' when we write the sequence number,
346 * and we want to ensure that that is the last thing written.
347 * We don't bother writing the padding as that would be extra
348 * media wear and write amplification
349 */
350 unsigned int log_half = (LOG_ENT_SIZE - 2 * sizeof(u64)) / 2;
351 u64 ns_off = arena->logoff + (((2 * lane) + sub) * LOG_ENT_SIZE);
352 void *src = ent;
353
354 /* split the 16B write into atomic, durable halves */
355 ret = arena_write_bytes(arena, ns_off, src, log_half);
356 if (ret)
357 return ret;
358
359 ns_off += log_half;
360 src += log_half;
361 return arena_write_bytes(arena, ns_off, src, log_half);
362}
363
364static int btt_flog_write(struct arena_info *arena, u32 lane, u32 sub,
365 struct log_entry *ent)
366{
367 int ret;
368
369 ret = __btt_log_write(arena, lane, sub, ent);
370 if (ret)
371 return ret;
372
373 /* prepare the next free entry */
374 arena->freelist[lane].sub = 1 - arena->freelist[lane].sub;
375 if (++(arena->freelist[lane].seq) == 4)
376 arena->freelist[lane].seq = 1;
377 arena->freelist[lane].block = le32_to_cpu(ent->old_map);
378
379 return ret;
380}
381
382/*
383 * This function initializes the BTT map to the initial state, which is
384 * all-zeroes, and indicates an identity mapping
385 */
386static int btt_map_init(struct arena_info *arena)
387{
388 int ret = -EINVAL;
389 void *zerobuf;
390 size_t offset = 0;
391 size_t chunk_size = SZ_2M;
392 size_t mapsize = arena->logoff - arena->mapoff;
393
394 zerobuf = kzalloc(chunk_size, GFP_KERNEL);
395 if (!zerobuf)
396 return -ENOMEM;
397
398 while (mapsize) {
399 size_t size = min(mapsize, chunk_size);
400
401 ret = arena_write_bytes(arena, arena->mapoff + offset, zerobuf,
402 size);
403 if (ret)
404 goto free;
405
406 offset += size;
407 mapsize -= size;
408 cond_resched();
409 }
410
411 free:
412 kfree(zerobuf);
413 return ret;
414}
415
416/*
417 * This function initializes the BTT log with 'fake' entries pointing
418 * to the initial reserved set of blocks as being free
419 */
420static int btt_log_init(struct arena_info *arena)
421{
422 int ret;
423 u32 i;
424 struct log_entry log, zerolog;
425
426 memset(&zerolog, 0, sizeof(zerolog));
427
428 for (i = 0; i < arena->nfree; i++) {
429 log.lba = cpu_to_le32(i);
430 log.old_map = cpu_to_le32(arena->external_nlba + i);
431 log.new_map = cpu_to_le32(arena->external_nlba + i);
432 log.seq = cpu_to_le32(LOG_SEQ_INIT);
433 ret = __btt_log_write(arena, i, 0, &log);
434 if (ret)
435 return ret;
436 ret = __btt_log_write(arena, i, 1, &zerolog);
437 if (ret)
438 return ret;
439 }
440
441 return 0;
442}
443
444static int btt_freelist_init(struct arena_info *arena)
445{
446 int old, new, ret;
447 u32 i, map_entry;
448 struct log_entry log_new, log_old;
449
450 arena->freelist = kcalloc(arena->nfree, sizeof(struct free_entry),
451 GFP_KERNEL);
452 if (!arena->freelist)
453 return -ENOMEM;
454
455 for (i = 0; i < arena->nfree; i++) {
456 old = btt_log_read(arena, i, &log_old, LOG_OLD_ENT);
457 if (old < 0)
458 return old;
459
460 new = btt_log_read(arena, i, &log_new, LOG_NEW_ENT);
461 if (new < 0)
462 return new;
463
464 /* sub points to the next one to be overwritten */
465 arena->freelist[i].sub = 1 - new;
466 arena->freelist[i].seq = nd_inc_seq(le32_to_cpu(log_new.seq));
467 arena->freelist[i].block = le32_to_cpu(log_new.old_map);
468
469 /* This implies a newly created or untouched flog entry */
470 if (log_new.old_map == log_new.new_map)
471 continue;
472
473 /* Check if map recovery is needed */
474 ret = btt_map_read(arena, le32_to_cpu(log_new.lba), &map_entry,
475 NULL, NULL);
476 if (ret)
477 return ret;
478 if ((le32_to_cpu(log_new.new_map) != map_entry) &&
479 (le32_to_cpu(log_new.old_map) == map_entry)) {
480 /*
481 * Last transaction wrote the flog, but wasn't able
482 * to complete the map write. So fix up the map.
483 */
484 ret = btt_map_write(arena, le32_to_cpu(log_new.lba),
485 le32_to_cpu(log_new.new_map), 0, 0);
486 if (ret)
487 return ret;
488 }
489
490 }
491
492 return 0;
493}
494
495static int btt_rtt_init(struct arena_info *arena)
496{
497 arena->rtt = kcalloc(arena->nfree, sizeof(u32), GFP_KERNEL);
498 if (arena->rtt == NULL)
499 return -ENOMEM;
500
501 return 0;
502}
503
504static int btt_maplocks_init(struct arena_info *arena)
505{
506 u32 i;
507
508 arena->map_locks = kcalloc(arena->nfree, sizeof(struct aligned_lock),
509 GFP_KERNEL);
510 if (!arena->map_locks)
511 return -ENOMEM;
512
513 for (i = 0; i < arena->nfree; i++)
514 spin_lock_init(&arena->map_locks[i].lock);
515
516 return 0;
517}
518
519static struct arena_info *alloc_arena(struct btt *btt, size_t size,
520 size_t start, size_t arena_off)
521{
522 struct arena_info *arena;
523 u64 logsize, mapsize, datasize;
524 u64 available = size;
525
526 arena = kzalloc(sizeof(struct arena_info), GFP_KERNEL);
527 if (!arena)
528 return NULL;
529 arena->nd_btt = btt->nd_btt;
530
531 if (!size)
532 return arena;
533
534 arena->size = size;
535 arena->external_lba_start = start;
536 arena->external_lbasize = btt->lbasize;
537 arena->internal_lbasize = roundup(arena->external_lbasize,
538 INT_LBASIZE_ALIGNMENT);
539 arena->nfree = BTT_DEFAULT_NFREE;
540 arena->version_major = 1;
541 arena->version_minor = 1;
542
543 if (available % BTT_PG_SIZE)
544 available -= (available % BTT_PG_SIZE);
545
546 /* Two pages are reserved for the super block and its copy */
547 available -= 2 * BTT_PG_SIZE;
548
549 /* The log takes a fixed amount of space based on nfree */
550 logsize = roundup(2 * arena->nfree * sizeof(struct log_entry),
551 BTT_PG_SIZE);
552 available -= logsize;
553
554 /* Calculate optimal split between map and data area */
555 arena->internal_nlba = div_u64(available - BTT_PG_SIZE,
556 arena->internal_lbasize + MAP_ENT_SIZE);
557 arena->external_nlba = arena->internal_nlba - arena->nfree;
558
559 mapsize = roundup((arena->external_nlba * MAP_ENT_SIZE), BTT_PG_SIZE);
560 datasize = available - mapsize;
561
562 /* 'Absolute' values, relative to start of storage space */
563 arena->infooff = arena_off;
564 arena->dataoff = arena->infooff + BTT_PG_SIZE;
565 arena->mapoff = arena->dataoff + datasize;
566 arena->logoff = arena->mapoff + mapsize;
567 arena->info2off = arena->logoff + logsize;
568 return arena;
569}
570
571static void free_arenas(struct btt *btt)
572{
573 struct arena_info *arena, *next;
574
575 list_for_each_entry_safe(arena, next, &btt->arena_list, list) {
576 list_del(&arena->list);
577 kfree(arena->rtt);
578 kfree(arena->map_locks);
579 kfree(arena->freelist);
580 debugfs_remove_recursive(arena->debugfs_dir);
581 kfree(arena);
582 }
583}
584
585/*
586 * This function checks if the metadata layout is valid and error free
587 */
588static int arena_is_valid(struct arena_info *arena, struct btt_sb *super,
589 u8 *uuid, u32 lbasize)
590{
591 u64 checksum;
592
593 if (memcmp(super->uuid, uuid, 16))
594 return 0;
595
596 checksum = le64_to_cpu(super->checksum);
597 super->checksum = 0;
598 if (checksum != nd_btt_sb_checksum(super))
599 return 0;
600 super->checksum = cpu_to_le64(checksum);
601
602 if (lbasize != le32_to_cpu(super->external_lbasize))
603 return 0;
604
605 /* TODO: figure out action for this */
606 if ((le32_to_cpu(super->flags) & IB_FLAG_ERROR_MASK) != 0)
607 dev_info(to_dev(arena), "Found arena with an error flag\n");
608
609 return 1;
610}
611
612/*
613 * This function reads an existing valid btt superblock and
614 * populates the corresponding arena_info struct
615 */
616static void parse_arena_meta(struct arena_info *arena, struct btt_sb *super,
617 u64 arena_off)
618{
619 arena->internal_nlba = le32_to_cpu(super->internal_nlba);
620 arena->internal_lbasize = le32_to_cpu(super->internal_lbasize);
621 arena->external_nlba = le32_to_cpu(super->external_nlba);
622 arena->external_lbasize = le32_to_cpu(super->external_lbasize);
623 arena->nfree = le32_to_cpu(super->nfree);
624 arena->version_major = le16_to_cpu(super->version_major);
625 arena->version_minor = le16_to_cpu(super->version_minor);
626
627 arena->nextoff = (super->nextoff == 0) ? 0 : (arena_off +
628 le64_to_cpu(super->nextoff));
629 arena->infooff = arena_off;
630 arena->dataoff = arena_off + le64_to_cpu(super->dataoff);
631 arena->mapoff = arena_off + le64_to_cpu(super->mapoff);
632 arena->logoff = arena_off + le64_to_cpu(super->logoff);
633 arena->info2off = arena_off + le64_to_cpu(super->info2off);
634
635 arena->size = (super->nextoff > 0) ? (le64_to_cpu(super->nextoff)) :
636 (arena->info2off - arena->infooff + BTT_PG_SIZE);
637
638 arena->flags = le32_to_cpu(super->flags);
639}
640
641static int discover_arenas(struct btt *btt)
642{
643 int ret = 0;
644 struct arena_info *arena;
645 struct btt_sb *super;
646 size_t remaining = btt->rawsize;
647 u64 cur_nlba = 0;
648 size_t cur_off = 0;
649 int num_arenas = 0;
650
651 super = kzalloc(sizeof(*super), GFP_KERNEL);
652 if (!super)
653 return -ENOMEM;
654
655 while (remaining) {
656 /* Alloc memory for arena */
657 arena = alloc_arena(btt, 0, 0, 0);
658 if (!arena) {
659 ret = -ENOMEM;
660 goto out_super;
661 }
662
663 arena->infooff = cur_off;
664 ret = btt_info_read(arena, super);
665 if (ret)
666 goto out;
667
668 if (!arena_is_valid(arena, super, btt->nd_btt->uuid,
669 btt->lbasize)) {
670 if (remaining == btt->rawsize) {
671 btt->init_state = INIT_NOTFOUND;
672 dev_info(to_dev(arena), "No existing arenas\n");
673 goto out;
674 } else {
675 dev_info(to_dev(arena),
676 "Found corrupted metadata!\n");
677 ret = -ENODEV;
678 goto out;
679 }
680 }
681
682 arena->external_lba_start = cur_nlba;
683 parse_arena_meta(arena, super, cur_off);
684
685 ret = btt_freelist_init(arena);
686 if (ret)
687 goto out;
688
689 ret = btt_rtt_init(arena);
690 if (ret)
691 goto out;
692
693 ret = btt_maplocks_init(arena);
694 if (ret)
695 goto out;
696
697 list_add_tail(&arena->list, &btt->arena_list);
698
699 remaining -= arena->size;
700 cur_off += arena->size;
701 cur_nlba += arena->external_nlba;
702 num_arenas++;
703
704 if (arena->nextoff == 0)
705 break;
706 }
707 btt->num_arenas = num_arenas;
708 btt->nlba = cur_nlba;
709 btt->init_state = INIT_READY;
710
711 kfree(super);
712 return ret;
713
714 out:
715 kfree(arena);
716 free_arenas(btt);
717 out_super:
718 kfree(super);
719 return ret;
720}
721
722static int create_arenas(struct btt *btt)
723{
724 size_t remaining = btt->rawsize;
725 size_t cur_off = 0;
726
727 while (remaining) {
728 struct arena_info *arena;
729 size_t arena_size = min_t(u64, ARENA_MAX_SIZE, remaining);
730
731 remaining -= arena_size;
732 if (arena_size < ARENA_MIN_SIZE)
733 break;
734
735 arena = alloc_arena(btt, arena_size, btt->nlba, cur_off);
736 if (!arena) {
737 free_arenas(btt);
738 return -ENOMEM;
739 }
740 btt->nlba += arena->external_nlba;
741 if (remaining >= ARENA_MIN_SIZE)
742 arena->nextoff = arena->size;
743 else
744 arena->nextoff = 0;
745 cur_off += arena_size;
746 list_add_tail(&arena->list, &btt->arena_list);
747 }
748
749 return 0;
750}
751
752/*
753 * This function completes arena initialization by writing
754 * all the metadata.
755 * It is only called for an uninitialized arena when a write
756 * to that arena occurs for the first time.
757 */
758static int btt_arena_write_layout(struct arena_info *arena, u8 *uuid)
759{
760 int ret;
761 struct btt_sb *super;
762
763 ret = btt_map_init(arena);
764 if (ret)
765 return ret;
766
767 ret = btt_log_init(arena);
768 if (ret)
769 return ret;
770
771 super = kzalloc(sizeof(struct btt_sb), GFP_NOIO);
772 if (!super)
773 return -ENOMEM;
774
775 strncpy(super->signature, BTT_SIG, BTT_SIG_LEN);
776 memcpy(super->uuid, uuid, 16);
777 super->flags = cpu_to_le32(arena->flags);
778 super->version_major = cpu_to_le16(arena->version_major);
779 super->version_minor = cpu_to_le16(arena->version_minor);
780 super->external_lbasize = cpu_to_le32(arena->external_lbasize);
781 super->external_nlba = cpu_to_le32(arena->external_nlba);
782 super->internal_lbasize = cpu_to_le32(arena->internal_lbasize);
783 super->internal_nlba = cpu_to_le32(arena->internal_nlba);
784 super->nfree = cpu_to_le32(arena->nfree);
785 super->infosize = cpu_to_le32(sizeof(struct btt_sb));
786 super->nextoff = cpu_to_le64(arena->nextoff);
787 /*
788 * Subtract arena->infooff (arena start) so numbers are relative
789 * to 'this' arena
790 */
791 super->dataoff = cpu_to_le64(arena->dataoff - arena->infooff);
792 super->mapoff = cpu_to_le64(arena->mapoff - arena->infooff);
793 super->logoff = cpu_to_le64(arena->logoff - arena->infooff);
794 super->info2off = cpu_to_le64(arena->info2off - arena->infooff);
795
796 super->flags = 0;
797 super->checksum = cpu_to_le64(nd_btt_sb_checksum(super));
798
799 ret = btt_info_write(arena, super);
800
801 kfree(super);
802 return ret;
803}
804
805/*
806 * This function completes the initialization for the BTT namespace
807 * such that it is ready to accept IOs
808 */
809static int btt_meta_init(struct btt *btt)
810{
811 int ret = 0;
812 struct arena_info *arena;
813
814 mutex_lock(&btt->init_lock);
815 list_for_each_entry(arena, &btt->arena_list, list) {
816 ret = btt_arena_write_layout(arena, btt->nd_btt->uuid);
817 if (ret)
818 goto unlock;
819
820 ret = btt_freelist_init(arena);
821 if (ret)
822 goto unlock;
823
824 ret = btt_rtt_init(arena);
825 if (ret)
826 goto unlock;
827
828 ret = btt_maplocks_init(arena);
829 if (ret)
830 goto unlock;
831 }
832
833 btt->init_state = INIT_READY;
834
835 unlock:
836 mutex_unlock(&btt->init_lock);
837 return ret;
838}
839
840/*
841 * This function calculates the arena in which the given LBA lies
842 * by doing a linear walk. This is acceptable since we expect only
843 * a few arenas. If we have backing devices that get much larger,
844 * we can construct a balanced binary tree of arenas at init time
845 * so that this range search becomes faster.
846 */
847static int lba_to_arena(struct btt *btt, sector_t sector, __u32 *premap,
848 struct arena_info **arena)
849{
850 struct arena_info *arena_list;
851 __u64 lba = div_u64(sector << SECTOR_SHIFT, btt->sector_size);
852
853 list_for_each_entry(arena_list, &btt->arena_list, list) {
854 if (lba < arena_list->external_nlba) {
855 *arena = arena_list;
856 *premap = lba;
857 return 0;
858 }
859 lba -= arena_list->external_nlba;
860 }
861
862 return -EIO;
863}
864
865/*
866 * The following (lock_map, unlock_map) are mostly just to improve
867 * readability, since they index into an array of locks
868 */
869static void lock_map(struct arena_info *arena, u32 premap)
870 __acquires(&arena->map_locks[idx].lock)
871{
872 u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree;
873
874 spin_lock(&arena->map_locks[idx].lock);
875}
876
877static void unlock_map(struct arena_info *arena, u32 premap)
878 __releases(&arena->map_locks[idx].lock)
879{
880 u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree;
881
882 spin_unlock(&arena->map_locks[idx].lock);
883}
884
885static u64 to_namespace_offset(struct arena_info *arena, u64 lba)
886{
887 return arena->dataoff + ((u64)lba * arena->internal_lbasize);
888}
889
890static int btt_data_read(struct arena_info *arena, struct page *page,
891 unsigned int off, u32 lba, u32 len)
892{
893 int ret;
894 u64 nsoff = to_namespace_offset(arena, lba);
895 void *mem = kmap_atomic(page);
896
897 ret = arena_read_bytes(arena, nsoff, mem + off, len);
898 kunmap_atomic(mem);
899
900 return ret;
901}
902
903static int btt_data_write(struct arena_info *arena, u32 lba,
904 struct page *page, unsigned int off, u32 len)
905{
906 int ret;
907 u64 nsoff = to_namespace_offset(arena, lba);
908 void *mem = kmap_atomic(page);
909
910 ret = arena_write_bytes(arena, nsoff, mem + off, len);
911 kunmap_atomic(mem);
912
913 return ret;
914}
915
916static void zero_fill_data(struct page *page, unsigned int off, u32 len)
917{
918 void *mem = kmap_atomic(page);
919
920 memset(mem + off, 0, len);
921 kunmap_atomic(mem);
922}
923
924static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
925 sector_t sector, unsigned int len)
926{
927 int ret = 0;
928 int t_flag, e_flag;
929 struct arena_info *arena = NULL;
930 u32 lane = 0, premap, postmap;
931
932 while (len) {
933 u32 cur_len;
934
935 lane = nd_region_acquire_lane(btt->nd_region);
936
937 ret = lba_to_arena(btt, sector, &premap, &arena);
938 if (ret)
939 goto out_lane;
940
941 cur_len = min(btt->sector_size, len);
942
943 ret = btt_map_read(arena, premap, &postmap, &t_flag, &e_flag);
944 if (ret)
945 goto out_lane;
946
947 /*
948 * We loop to make sure that the post map LBA didn't change
949 * from under us between writing the RTT and doing the actual
950 * read.
951 */
952 while (1) {
953 u32 new_map;
954
955 if (t_flag) {
956 zero_fill_data(page, off, cur_len);
957 goto out_lane;
958 }
959
960 if (e_flag) {
961 ret = -EIO;
962 goto out_lane;
963 }
964
965 arena->rtt[lane] = RTT_VALID | postmap;
966 /*
967 * Barrier to make sure this write is not reordered
968 * to do the verification map_read before the RTT store
969 */
970 barrier();
971
972 ret = btt_map_read(arena, premap, &new_map, &t_flag,
973 &e_flag);
974 if (ret)
975 goto out_rtt;
976
977 if (postmap == new_map)
978 break;
979
980 postmap = new_map;
981 }
982
983 ret = btt_data_read(arena, page, off, postmap, cur_len);
984 if (ret)
985 goto out_rtt;
986
987 arena->rtt[lane] = RTT_INVALID;
988 nd_region_release_lane(btt->nd_region, lane);
989
990 len -= cur_len;
991 off += cur_len;
992 sector += btt->sector_size >> SECTOR_SHIFT;
993 }
994
995 return 0;
996
997 out_rtt:
998 arena->rtt[lane] = RTT_INVALID;
999 out_lane:
1000 nd_region_release_lane(btt->nd_region, lane);
1001 return ret;
1002}
1003
1004static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
1005 unsigned int off, unsigned int len)
1006{
1007 int ret = 0;
1008 struct arena_info *arena = NULL;
1009 u32 premap = 0, old_postmap, new_postmap, lane = 0, i;
1010 struct log_entry log;
1011 int sub;
1012
1013 while (len) {
1014 u32 cur_len;
1015
1016 lane = nd_region_acquire_lane(btt->nd_region);
1017
1018 ret = lba_to_arena(btt, sector, &premap, &arena);
1019 if (ret)
1020 goto out_lane;
1021 cur_len = min(btt->sector_size, len);
1022
1023 if ((arena->flags & IB_FLAG_ERROR_MASK) != 0) {
1024 ret = -EIO;
1025 goto out_lane;
1026 }
1027
1028 new_postmap = arena->freelist[lane].block;
1029
1030 /* Wait if the new block is being read from */
1031 for (i = 0; i < arena->nfree; i++)
1032 while (arena->rtt[i] == (RTT_VALID | new_postmap))
1033 cpu_relax();
1034
1035
1036 if (new_postmap >= arena->internal_nlba) {
1037 ret = -EIO;
1038 goto out_lane;
1039 } else
1040 ret = btt_data_write(arena, new_postmap, page,
1041 off, cur_len);
1042 if (ret)
1043 goto out_lane;
1044
1045 lock_map(arena, premap);
1046 ret = btt_map_read(arena, premap, &old_postmap, NULL, NULL);
1047 if (ret)
1048 goto out_map;
1049 if (old_postmap >= arena->internal_nlba) {
1050 ret = -EIO;
1051 goto out_map;
1052 }
1053
1054 log.lba = cpu_to_le32(premap);
1055 log.old_map = cpu_to_le32(old_postmap);
1056 log.new_map = cpu_to_le32(new_postmap);
1057 log.seq = cpu_to_le32(arena->freelist[lane].seq);
1058 sub = arena->freelist[lane].sub;
1059 ret = btt_flog_write(arena, lane, sub, &log);
1060 if (ret)
1061 goto out_map;
1062
1063 ret = btt_map_write(arena, premap, new_postmap, 0, 0);
1064 if (ret)
1065 goto out_map;
1066
1067 unlock_map(arena, premap);
1068 nd_region_release_lane(btt->nd_region, lane);
1069
1070 len -= cur_len;
1071 off += cur_len;
1072 sector += btt->sector_size >> SECTOR_SHIFT;
1073 }
1074
1075 return 0;
1076
1077 out_map:
1078 unlock_map(arena, premap);
1079 out_lane:
1080 nd_region_release_lane(btt->nd_region, lane);
1081 return ret;
1082}
1083
1084static int btt_do_bvec(struct btt *btt, struct page *page,
1085 unsigned int len, unsigned int off, int rw,
1086 sector_t sector)
1087{
1088 int ret;
1089
1090 if (rw == READ) {
1091 ret = btt_read_pg(btt, page, off, sector, len);
1092 flush_dcache_page(page);
1093 } else {
1094 flush_dcache_page(page);
1095 ret = btt_write_pg(btt, sector, page, off, len);
1096 }
1097
1098 return ret;
1099}
1100
1101static void btt_make_request(struct request_queue *q, struct bio *bio)
1102{
1103 struct btt *btt = q->queuedata;
1104 struct bvec_iter iter;
1105 struct bio_vec bvec;
1106 int err = 0, rw;
1107
1108 rw = bio_data_dir(bio);
1109 bio_for_each_segment(bvec, bio, iter) {
1110 unsigned int len = bvec.bv_len;
1111
1112 BUG_ON(len > PAGE_SIZE);
1113 /* Make sure len is in multiples of sector size. */
1114 /* XXX is this right? */
1115 BUG_ON(len < btt->sector_size);
1116 BUG_ON(len % btt->sector_size);
1117
1118 err = btt_do_bvec(btt, bvec.bv_page, len, bvec.bv_offset,
1119 rw, iter.bi_sector);
1120 if (err) {
1121 dev_info(&btt->nd_btt->dev,
1122 "io error in %s sector %lld, len %d,\n",
1123 (rw == READ) ? "READ" : "WRITE",
1124 (unsigned long long) iter.bi_sector, len);
1125 goto out;
1126 }
1127 }
1128
1129out:
1130 bio_endio(bio, err);
1131}
1132
1133static int btt_rw_page(struct block_device *bdev, sector_t sector,
1134 struct page *page, int rw)
1135{
1136 struct btt *btt = bdev->bd_disk->private_data;
1137
1138 btt_do_bvec(btt, page, PAGE_CACHE_SIZE, 0, rw, sector);
1139 page_endio(page, rw & WRITE, 0);
1140 return 0;
1141}
1142
1143
1144static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo)
1145{
1146 /* some standard values */
1147 geo->heads = 1 << 6;
1148 geo->sectors = 1 << 5;
1149 geo->cylinders = get_capacity(bd->bd_disk) >> 11;
1150 return 0;
1151}
1152
1153static const struct block_device_operations btt_fops = {
1154 .owner = THIS_MODULE,
1155 .rw_page = btt_rw_page,
1156 .getgeo = btt_getgeo,
1157};
1158
1159static int btt_blk_init(struct btt *btt)
1160{
1161 struct nd_btt *nd_btt = btt->nd_btt;
1162 struct nd_namespace_common *ndns = nd_btt->ndns;
1163
1164 /* create a new disk and request queue for btt */
1165 btt->btt_queue = blk_alloc_queue(GFP_KERNEL);
1166 if (!btt->btt_queue)
1167 return -ENOMEM;
1168
1169 btt->btt_disk = alloc_disk(0);
1170 if (!btt->btt_disk) {
1171 blk_cleanup_queue(btt->btt_queue);
1172 return -ENOMEM;
1173 }
1174
1175 nvdimm_namespace_disk_name(ndns, btt->btt_disk->disk_name);
1176 btt->btt_disk->driverfs_dev = &btt->nd_btt->dev;
1177 btt->btt_disk->major = btt_major;
1178 btt->btt_disk->first_minor = 0;
1179 btt->btt_disk->fops = &btt_fops;
1180 btt->btt_disk->private_data = btt;
1181 btt->btt_disk->queue = btt->btt_queue;
1182 btt->btt_disk->flags = GENHD_FL_EXT_DEVT;
1183
1184 blk_queue_make_request(btt->btt_queue, btt_make_request);
1185 blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
1186 blk_queue_max_hw_sectors(btt->btt_queue, UINT_MAX);
1187 blk_queue_bounce_limit(btt->btt_queue, BLK_BOUNCE_ANY);
1188 queue_flag_set_unlocked(QUEUE_FLAG_NONROT, btt->btt_queue);
1189 btt->btt_queue->queuedata = btt;
1190
1191 set_capacity(btt->btt_disk,
1192 btt->nlba * btt->sector_size >> SECTOR_SHIFT);
1193 add_disk(btt->btt_disk);
1194
1195 return 0;
1196}
1197
1198static void btt_blk_cleanup(struct btt *btt)
1199{
1200 del_gendisk(btt->btt_disk);
1201 put_disk(btt->btt_disk);
1202 blk_cleanup_queue(btt->btt_queue);
1203}
1204
1205/**
1206 * btt_init - initialize a block translation table for the given device
1207 * @nd_btt: device with BTT geometry and backing device info
1208 * @rawsize: raw size in bytes of the backing device
1209 * @lbasize: lba size of the backing device
1210 * @uuid: A uuid for the backing device - this is stored on media
1211 * @maxlane: maximum number of parallel requests the device can handle
1212 *
1213 * Initialize a Block Translation Table on a backing device to provide
1214 * single sector power fail atomicity.
1215 *
1216 * Context:
1217 * Might sleep.
1218 *
1219 * Returns:
1220 * Pointer to a new struct btt on success, NULL on failure.
1221 */
1222static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
1223 u32 lbasize, u8 *uuid, struct nd_region *nd_region)
1224{
1225 int ret;
1226 struct btt *btt;
1227 struct device *dev = &nd_btt->dev;
1228
1229 btt = kzalloc(sizeof(struct btt), GFP_KERNEL);
1230 if (!btt)
1231 return NULL;
1232
1233 btt->nd_btt = nd_btt;
1234 btt->rawsize = rawsize;
1235 btt->lbasize = lbasize;
1236 btt->sector_size = ((lbasize >= 4096) ? 4096 : 512);
1237 INIT_LIST_HEAD(&btt->arena_list);
1238 mutex_init(&btt->init_lock);
1239 btt->nd_region = nd_region;
1240
1241 ret = discover_arenas(btt);
1242 if (ret) {
1243 dev_err(dev, "init: error in arena_discover: %d\n", ret);
1244 goto out_free;
1245 }
1246
1247 if (btt->init_state != INIT_READY) {
1248 btt->num_arenas = (rawsize / ARENA_MAX_SIZE) +
1249 ((rawsize % ARENA_MAX_SIZE) ? 1 : 0);
1250 dev_dbg(dev, "init: %d arenas for %llu rawsize\n",
1251 btt->num_arenas, rawsize);
1252
1253 ret = create_arenas(btt);
1254 if (ret) {
1255 dev_info(dev, "init: create_arenas: %d\n", ret);
1256 goto out_free;
1257 }
1258
1259 ret = btt_meta_init(btt);
1260 if (ret) {
1261 dev_err(dev, "init: error in meta_init: %d\n", ret);
1262 return NULL;
1263 }
1264 }
1265
1266 ret = btt_blk_init(btt);
1267 if (ret) {
1268 dev_err(dev, "init: error in blk_init: %d\n", ret);
1269 goto out_free;
1270 }
1271
1272 btt_debugfs_init(btt);
1273
1274 return btt;
1275
1276 out_free:
1277 kfree(btt);
1278 return NULL;
1279}
1280
1281/**
1282 * btt_fini - de-initialize a BTT
1283 * @btt: the BTT handle that was generated by btt_init
1284 *
1285 * De-initialize a Block Translation Table on device removal
1286 *
1287 * Context:
1288 * Might sleep.
1289 */
1290static void btt_fini(struct btt *btt)
1291{
1292 if (btt) {
1293 btt_blk_cleanup(btt);
1294 free_arenas(btt);
1295 debugfs_remove_recursive(btt->debugfs_dir);
1296 kfree(btt);
1297 }
1298}
1299
1300int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns)
1301{
1302 struct nd_btt *nd_btt = to_nd_btt(ndns->claim);
1303 struct nd_region *nd_region;
1304 struct btt *btt;
1305 size_t rawsize;
1306
1307 if (!nd_btt->uuid || !nd_btt->ndns || !nd_btt->lbasize)
1308 return -ENODEV;
1309
1310 rawsize = nvdimm_namespace_capacity(ndns) - SZ_4K;
1311 if (rawsize < ARENA_MIN_SIZE) {
1312 return -ENXIO;
1313 }
1314 nd_region = to_nd_region(nd_btt->dev.parent);
1315 btt = btt_init(nd_btt, rawsize, nd_btt->lbasize, nd_btt->uuid,
1316 nd_region);
1317 if (!btt)
1318 return -ENOMEM;
1319 nd_btt->btt = btt;
1320
1321 return 0;
1322}
1323EXPORT_SYMBOL(nvdimm_namespace_attach_btt);
1324
1325int nvdimm_namespace_detach_btt(struct nd_namespace_common *ndns)
1326{
1327 struct nd_btt *nd_btt = to_nd_btt(ndns->claim);
1328 struct btt *btt = nd_btt->btt;
1329
1330 btt_fini(btt);
1331 nd_btt->btt = NULL;
1332
1333 return 0;
1334}
1335EXPORT_SYMBOL(nvdimm_namespace_detach_btt);
1336
1337static int __init nd_btt_init(void)
1338{
1339 int rc;
1340
1341 BUILD_BUG_ON(sizeof(struct btt_sb) != SZ_4K);
1342
1343 btt_major = register_blkdev(0, "btt");
1344 if (btt_major < 0)
1345 return btt_major;
1346
1347 debugfs_root = debugfs_create_dir("btt", NULL);
1348 if (IS_ERR_OR_NULL(debugfs_root)) {
1349 rc = -ENXIO;
1350 goto err_debugfs;
1351 }
1352
1353 return 0;
1354
1355 err_debugfs:
1356 unregister_blkdev(btt_major, "btt");
1357
1358 return rc;
1359}
1360
1361static void __exit nd_btt_exit(void)
1362{
1363 debugfs_remove_recursive(debugfs_root);
1364 unregister_blkdev(btt_major, "btt");
1365}
1366
1367MODULE_ALIAS_ND_DEVICE(ND_DEVICE_BTT);
1368MODULE_AUTHOR("Vishal Verma <vishal.l.verma@linux.intel.com>");
1369MODULE_LICENSE("GPL v2");
1370module_init(nd_btt_init);
1371module_exit(nd_btt_exit);
diff --git a/drivers/nvdimm/btt.h b/drivers/nvdimm/btt.h
index e8f6d8e0ddd3..8c95a7792c3e 100644
--- a/drivers/nvdimm/btt.h
+++ b/drivers/nvdimm/btt.h
@@ -19,6 +19,39 @@
19
20#define BTT_SIG_LEN 16
21#define BTT_SIG "BTT_ARENA_INFO\0"
22#define MAP_ENT_SIZE 4
23#define MAP_TRIM_SHIFT 31
24#define MAP_TRIM_MASK (1 << MAP_TRIM_SHIFT)
25#define MAP_ERR_SHIFT 30
26#define MAP_ERR_MASK (1 << MAP_ERR_SHIFT)
27#define MAP_LBA_MASK (~((1 << MAP_TRIM_SHIFT) | (1 << MAP_ERR_SHIFT)))
28#define MAP_ENT_NORMAL 0xC0000000
29#define LOG_ENT_SIZE sizeof(struct log_entry)
30#define ARENA_MIN_SIZE (1UL << 24) /* 16 MB */
31#define ARENA_MAX_SIZE (1ULL << 39) /* 512 GB */
32#define RTT_VALID (1UL << 31)
33#define RTT_INVALID 0
34#define INT_LBASIZE_ALIGNMENT 256
35#define BTT_PG_SIZE 4096
36#define BTT_DEFAULT_NFREE ND_MAX_LANES
37#define LOG_SEQ_INIT 1
38
39#define IB_FLAG_ERROR 0x00000001
40#define IB_FLAG_ERROR_MASK 0x00000001
41
42enum btt_init_state {
43 INIT_UNCHECKED = 0,
44 INIT_NOTFOUND,
45 INIT_READY
46};
47
48struct log_entry {
49 __le32 lba;
50 __le32 old_map;
51 __le32 new_map;
52 __le32 seq;
53 __le64 padding[2];
54};
55
56struct btt_sb {
57	u8 signature[BTT_SIG_LEN];
@@ -42,4 +75,112 @@ struct btt_sb {
75	__le64 checksum;
76};
77
78struct free_entry {
79 u32 block;
80 u8 sub;
81 u8 seq;
82};
83
84struct aligned_lock {
85 union {
86 spinlock_t lock;
87 u8 cacheline_padding[L1_CACHE_BYTES];
88 };
89};
90
91/**
92 * struct arena_info - handle for an arena
93 * @size: Size in bytes this arena occupies on the raw device.
94 * This includes arena metadata.
95 * @external_lba_start: The first external LBA in this arena.
96 * @internal_nlba: Number of internal blocks available in the arena
97 * including nfree reserved blocks
 98 * @internal_lbasize:	Internal and external lba sizes may differ, since
 99 * 			'odd' external lbasizes such as 520B are rounded up
 100 * 			to an aligned internal size.
101 * @external_nlba: Number of blocks contributed by the arena to the number
102 * reported to upper layers. (internal_nlba - nfree)
103 * @external_lbasize: LBA size as exposed to upper layers.
 104 * @nfree:		A reserved number of 'free' blocks used to handle
 105 * 			incoming writes.
106 * @version_major: Metadata layout version major.
107 * @version_minor: Metadata layout version minor.
108 * @nextoff: Offset in bytes to the start of the next arena.
109 * @infooff: Offset in bytes to the info block of this arena.
110 * @dataoff: Offset in bytes to the data area of this arena.
111 * @mapoff: Offset in bytes to the map area of this arena.
112 * @logoff: Offset in bytes to the log area of this arena.
113 * @info2off: Offset in bytes to the backup info block of this arena.
114 * @freelist: Pointer to in-memory list of free blocks
115 * @rtt: Pointer to in-memory "Read Tracking Table"
116 * @map_locks: Spinlocks protecting concurrent map writes
117 * @nd_btt: Pointer to parent nd_btt structure.
118 * @list: List head for list of arenas
119 * @debugfs_dir: Debugfs dentry
120 * @flags: Arena flags - may signify error states.
121 *
122 * arena_info is a per-arena handle. Once an arena is narrowed down for an
123 * IO, this struct is passed around for the duration of the IO.
124 */
125struct arena_info {
126 u64 size; /* Total bytes for this arena */
127 u64 external_lba_start;
128 u32 internal_nlba;
129 u32 internal_lbasize;
130 u32 external_nlba;
131 u32 external_lbasize;
132 u32 nfree;
133 u16 version_major;
134 u16 version_minor;
135 /* Byte offsets to the different on-media structures */
136 u64 nextoff;
137 u64 infooff;
138 u64 dataoff;
139 u64 mapoff;
140 u64 logoff;
141 u64 info2off;
142 /* Pointers to other in-memory structures for this arena */
143 struct free_entry *freelist;
144 u32 *rtt;
145 struct aligned_lock *map_locks;
146 struct nd_btt *nd_btt;
147 struct list_head list;
148 struct dentry *debugfs_dir;
149 /* Arena flags */
150 u32 flags;
151};
152
153/**
154 * struct btt - handle for a BTT instance
155 * @btt_disk: Pointer to the gendisk for BTT device
156 * @btt_queue: Pointer to the request queue for the BTT device
157 * @arena_list: Head of the list of arenas
158 * @debugfs_dir: Debugfs dentry
159 * @nd_btt: Parent nd_btt struct
160 * @nlba: Number of logical blocks exposed to the upper layers
161 * after removing the amount of space needed by metadata
162 * @rawsize: Total size in bytes of the available backing device
163 * @lbasize: LBA size as requested and presented to upper layers.
164 * This is sector_size + size of any metadata.
165 * @sector_size: The Linux sector size - 512 or 4096
166 * @lanes: Per-lane spinlocks
167 * @init_lock: Mutex used for the BTT initialization
168 * @init_state: Flag describing the initialization state for the BTT
169 * @num_arenas: Number of arenas in the BTT instance
170 */
171struct btt {
172 struct gendisk *btt_disk;
173 struct request_queue *btt_queue;
174 struct list_head arena_list;
175 struct dentry *debugfs_dir;
176 struct nd_btt *nd_btt;
177 u64 nlba;
178 unsigned long long rawsize;
179 u32 lbasize;
180 u32 sector_size;
181 struct nd_region *nd_region;
182 struct mutex init_lock;
183 int init_state;
184 int num_arenas;
185};
45#endif 186#endif
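
The MAP_* constants added above describe a 32-bit map entry: bits 0-29
hold the post-map (internal) block number, bit 30 is the error flag and
bit 31 the zero/trim flag; MAP_ENT_NORMAL (both flags set) marks a
normally mapped block, while an entry with both flags clear is still in
its initial identity-mapped state. A minimal decoding sketch, assuming a
raw entry already converted to cpu byte order; the example_* helpers are
illustrative and not part of the patch:

	static inline u32 example_ent_lba(u32 ent)
	{
		return ent & MAP_LBA_MASK;	/* bits 0-29: internal block */
	}

	static inline bool example_ent_is_error(u32 ent)
	{
		return !!(ent & MAP_ERR_MASK);	/* bit 30: media error */
	}

	static inline bool example_ent_is_zero(u32 ent)
	{
		return !!(ent & MAP_TRIM_MASK);	/* bit 31: zeroed/trimmed */
	}

	static inline bool example_ent_is_initial(u32 ent)
	{
		/* neither flag set: identity mapping from BTT initialization */
		return (ent & MAP_ENT_NORMAL) == 0;
	}
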
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index effb70a88347..470fbdccd0ac 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -348,7 +348,8 @@ struct device *nd_btt_create(struct nd_region *nd_region)
348 */ 348 */
349u64 nd_btt_sb_checksum(struct btt_sb *btt_sb) 349u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
350{ 350{
351 u64 sum, sum_save; 351 u64 sum;
352 __le64 sum_save;
352 353
353 sum_save = btt_sb->checksum; 354 sum_save = btt_sb->checksum;
354 btt_sb->checksum = 0; 355 btt_sb->checksum = 0;
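
The hunk above only changes the declaration of sum_save; the rest of
nd_btt_sb_checksum() is unchanged. Keeping the saved copy as __le64
matters because the checksum field is stored on media in little-endian
form, so assigning it to a plain u64 draws a sparse endianness warning.
A sketch of the save/zero/compute/restore pattern, assuming the
nd_fletcher64() helper exported by the libnvdimm core (the example_*
name is illustrative only):

	static u64 example_sb_checksum(struct btt_sb *btt_sb)
	{
		u64 sum;
		__le64 sum_save;

		sum_save = btt_sb->checksum;	/* preserve the on-media value */
		btt_sb->checksum = 0;		/* checksum computed with the field zeroed */
		sum = nd_fletcher64(btt_sb, sizeof(*btt_sb), 1);
		btt_sb->checksum = sum_save;	/* restore before returning */
		return sum;
	}
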
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 2c50a0719f8d..4aa647c8d644 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -76,6 +76,30 @@ static bool is_namespace_io(struct device *dev)
76 return dev ? dev->type == &namespace_io_device_type : false; 76 return dev ? dev->type == &namespace_io_device_type : false;
77} 77}
78 78
79const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
80 char *name)
81{
82 struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
83 const char *suffix = "";
84
85 if (ndns->claim && is_nd_btt(ndns->claim))
86 suffix = "s";
87
88 if (is_namespace_pmem(&ndns->dev) || is_namespace_io(&ndns->dev))
89 sprintf(name, "pmem%d%s", nd_region->id, suffix);
90 else if (is_namespace_blk(&ndns->dev)) {
91 struct nd_namespace_blk *nsblk;
92
93 nsblk = to_nd_namespace_blk(&ndns->dev);
94 sprintf(name, "ndblk%d.%d%s", nd_region->id, nsblk->id, suffix);
95 } else {
96 return NULL;
97 }
98
99 return name;
100}
101EXPORT_SYMBOL(nvdimm_namespace_disk_name);
102
79static ssize_t nstype_show(struct device *dev, 103static ssize_t nstype_show(struct device *dev,
80 struct device_attribute *attr, char *buf) 104 struct device_attribute *attr, char *buf)
81{ 105{
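
nvdimm_namespace_disk_name() centralizes block device naming so the BTT
can reuse it: the "s" suffix marks the sector-atomic (BTT-claimed)
personality, e.g. "pmem0" vs "pmem0s", or "ndblk0.0" vs "ndblk0.0s". A
minimal caller sketch, assuming the DISK_NAME_LEN buffer size from
<linux/genhd.h>; the example_* name is illustrative only:

	static void example_log_disk_name(struct nd_namespace_common *ndns)
	{
		char name[DISK_NAME_LEN];	/* sized to fit gendisk->disk_name */

		if (nvdimm_namespace_disk_name(ndns, name))
			pr_debug("namespace will surface as /dev/%s\n", name);
	}
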
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index d13eccbb67e9..1b937c235913 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -20,6 +20,12 @@
20#include "label.h" 20#include "label.h"
21 21
22enum { 22enum {
23 /*
24 * Limits the maximum number of block apertures a dimm can
25 * support and is an input to the geometry/on-disk-format of a
26 * BTT instance
27 */
28 ND_MAX_LANES = 256,
23 SECTOR_SHIFT = 9, 29 SECTOR_SHIFT = 9,
24}; 30};
25 31
@@ -75,6 +81,11 @@ static inline struct nd_namespace_index *to_next_namespace_index(
75 for (res = (ndd)->dpa.child, next = res ? res->sibling : NULL; \ 81 for (res = (ndd)->dpa.child, next = res ? res->sibling : NULL; \
76 res; res = next, next = next ? next->sibling : NULL) 82 res; res = next, next = next ? next->sibling : NULL)
77 83
84struct nd_percpu_lane {
85 int count;
86 spinlock_t lock;
87};
88
78struct nd_region { 89struct nd_region {
79 struct device dev; 90 struct device dev;
80 struct ida ns_ida; 91 struct ida ns_ida;
@@ -84,9 +95,10 @@ struct nd_region {
84 u16 ndr_mappings; 95 u16 ndr_mappings;
85 u64 ndr_size; 96 u64 ndr_size;
86 u64 ndr_start; 97 u64 ndr_start;
87 int id; 98 int id, num_lanes;
88 void *provider_data; 99 void *provider_data;
89 struct nd_interleave_set *nd_set; 100 struct nd_interleave_set *nd_set;
101 struct nd_percpu_lane __percpu *lane;
90 struct nd_mapping mapping[0]; 102 struct nd_mapping mapping[0];
91}; 103};
92 104
@@ -100,9 +112,11 @@ static inline unsigned nd_inc_seq(unsigned seq)
100 return next[seq & 3]; 112 return next[seq & 3];
101} 113}
102 114
115struct btt;
103struct nd_btt { 116struct nd_btt {
104 struct device dev; 117 struct device dev;
105 struct nd_namespace_common *ndns; 118 struct nd_namespace_common *ndns;
119 struct btt *btt;
106 unsigned long lbasize; 120 unsigned long lbasize;
107 u8 *uuid; 121 u8 *uuid;
108 int id; 122 int id;
@@ -157,6 +171,8 @@ static inline struct device *nd_btt_create(struct nd_region *nd_region)
157 171
158#endif 172#endif
159struct nd_region *to_nd_region(struct device *dev); 173struct nd_region *to_nd_region(struct device *dev);
174unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
175void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
160int nd_region_to_nstype(struct nd_region *nd_region); 176int nd_region_to_nstype(struct nd_region *nd_region);
161int nd_region_register_namespaces(struct nd_region *nd_region, int *err); 177int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
162u64 nd_region_interleave_set_cookie(struct nd_region *nd_region); 178u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
@@ -172,4 +188,8 @@ struct resource *nvdimm_allocate_dpa(struct nvdimm_drvdata *ndd,
172 resource_size_t n); 188 resource_size_t n);
173resource_size_t nvdimm_namespace_capacity(struct nd_namespace_common *ndns); 189resource_size_t nvdimm_namespace_capacity(struct nd_namespace_common *ndns);
174struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev); 190struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev);
191int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns);
192int nvdimm_namespace_detach_btt(struct nd_namespace_common *ndns);
193const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
194 char *name);
175#endif /* __ND_H__ */ 195#endif /* __ND_H__ */
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index d0c6b4bdba69..7346054bccbb 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -160,7 +160,6 @@ static void pmem_detach_disk(struct pmem_device *pmem)
160static int pmem_attach_disk(struct nd_namespace_common *ndns, 160static int pmem_attach_disk(struct nd_namespace_common *ndns,
161 struct pmem_device *pmem) 161 struct pmem_device *pmem)
162{ 162{
163 struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
164 struct gendisk *disk; 163 struct gendisk *disk;
165 164
166 pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL); 165 pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
@@ -183,7 +182,7 @@ static int pmem_attach_disk(struct nd_namespace_common *ndns,
183 disk->private_data = pmem; 182 disk->private_data = pmem;
184 disk->queue = pmem->pmem_queue; 183 disk->queue = pmem->pmem_queue;
185 disk->flags = GENHD_FL_EXT_DEVT; 184 disk->flags = GENHD_FL_EXT_DEVT;
186 sprintf(disk->disk_name, "pmem%d", nd_region->id); 185 nvdimm_namespace_disk_name(ndns, disk->disk_name);
187 disk->driverfs_dev = &ndns->dev; 186 disk->driverfs_dev = &ndns->dev;
188 set_capacity(disk, pmem->size >> 9); 187 set_capacity(disk, pmem->size >> 9);
189 pmem->pmem_disk = disk; 188 pmem->pmem_disk = disk;
@@ -211,17 +210,6 @@ static int pmem_rw_bytes(struct nd_namespace_common *ndns,
211 return 0; 210 return 0;
212} 211}
213 212
214static int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns)
215{
216 /* TODO */
217 return -ENXIO;
218}
219
220static void nvdimm_namespace_detach_btt(struct nd_namespace_common *ndns)
221{
222 /* TODO */
223}
224
225static void pmem_free(struct pmem_device *pmem) 213static void pmem_free(struct pmem_device *pmem)
226{ 214{
227 iounmap(pmem->virt_addr); 215 iounmap(pmem->virt_addr);
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index 2a5f3f53d79d..eb8aebcd4800 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -10,6 +10,7 @@
10 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 10 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
11 * General Public License for more details. 11 * General Public License for more details.
12 */ 12 */
13#include <linux/cpumask.h>
13#include <linux/module.h> 14#include <linux/module.h>
14#include <linux/device.h> 15#include <linux/device.h>
15#include <linux/nd.h> 16#include <linux/nd.h>
@@ -18,10 +19,21 @@
18static int nd_region_probe(struct device *dev) 19static int nd_region_probe(struct device *dev)
19{ 20{
20 int err; 21 int err;
22 static unsigned long once;
21 struct nd_region_namespaces *num_ns; 23 struct nd_region_namespaces *num_ns;
22 struct nd_region *nd_region = to_nd_region(dev); 24 struct nd_region *nd_region = to_nd_region(dev);
23 int rc = nd_region_register_namespaces(nd_region, &err); 25 int rc = nd_region_register_namespaces(nd_region, &err);
24 26
27 if (nd_region->num_lanes > num_online_cpus()
28 && nd_region->num_lanes < num_possible_cpus()
29 && !test_and_set_bit(0, &once)) {
30 dev_info(dev, "online cpus (%d) < concurrent i/o lanes (%d) < possible cpus (%d)\n",
31 num_online_cpus(), nd_region->num_lanes,
32 num_possible_cpus());
33 dev_info(dev, "setting nr_cpus=%d may yield better libnvdimm device performance\n",
34 nd_region->num_lanes);
35 }
36
25 num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL); 37 num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL);
26 if (!num_ns) 38 if (!num_ns)
27 return -ENOMEM; 39 return -ENOMEM;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 4fd92389fa7e..fe8ec21fe3d7 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -32,6 +32,7 @@ static void nd_region_release(struct device *dev)
32 32
33 put_device(&nvdimm->dev); 33 put_device(&nvdimm->dev);
34 } 34 }
35 free_percpu(nd_region->lane);
35 ida_simple_remove(&region_ida, nd_region->id); 36 ida_simple_remove(&region_ida, nd_region->id);
36 kfree(nd_region); 37 kfree(nd_region);
37} 38}
@@ -531,13 +532,66 @@ void *nd_region_provider_data(struct nd_region *nd_region)
531} 532}
532EXPORT_SYMBOL_GPL(nd_region_provider_data); 533EXPORT_SYMBOL_GPL(nd_region_provider_data);
533 534
535/**
536 * nd_region_acquire_lane - allocate and lock a lane
 537 * @nd_region: region providing the id and number of available lanes
538 *
539 * A lane correlates to a BLK-data-window and/or a log slot in the BTT.
540 * We optimize for the common case where there are 256 lanes, one
 541 * per CPU. For larger systems we need to lock to share lanes. For now
542 * this implementation assumes the cost of maintaining an allocator for
543 * free lanes is on the order of the lock hold time, so it implements a
544 * static lane = cpu % num_lanes mapping.
545 *
546 * In the case of a BTT instance on top of a BLK namespace a lane may be
547 * acquired recursively. We lock on the first instance.
548 *
549 * In the case of a BTT instance on top of PMEM, we only acquire a lane
550 * for the BTT metadata updates.
551 */
552unsigned int nd_region_acquire_lane(struct nd_region *nd_region)
553{
554 unsigned int cpu, lane;
555
556 cpu = get_cpu();
557 if (nd_region->num_lanes < nr_cpu_ids) {
558 struct nd_percpu_lane *ndl_lock, *ndl_count;
559
560 lane = cpu % nd_region->num_lanes;
561 ndl_count = per_cpu_ptr(nd_region->lane, cpu);
562 ndl_lock = per_cpu_ptr(nd_region->lane, lane);
563 if (ndl_count->count++ == 0)
564 spin_lock(&ndl_lock->lock);
565 } else
566 lane = cpu;
567
568 return lane;
569}
570EXPORT_SYMBOL(nd_region_acquire_lane);
571
572void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane)
573{
574 if (nd_region->num_lanes < nr_cpu_ids) {
575 unsigned int cpu = get_cpu();
576 struct nd_percpu_lane *ndl_lock, *ndl_count;
577
578 ndl_count = per_cpu_ptr(nd_region->lane, cpu);
579 ndl_lock = per_cpu_ptr(nd_region->lane, lane);
580 if (--ndl_count->count == 0)
581 spin_unlock(&ndl_lock->lock);
582 put_cpu();
583 }
584 put_cpu();
585}
586EXPORT_SYMBOL(nd_region_release_lane);
587
534static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus, 588static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
535 struct nd_region_desc *ndr_desc, struct device_type *dev_type, 589 struct nd_region_desc *ndr_desc, struct device_type *dev_type,
536 const char *caller) 590 const char *caller)
537{ 591{
538 struct nd_region *nd_region; 592 struct nd_region *nd_region;
539 struct device *dev; 593 struct device *dev;
540 u16 i; 594 unsigned int i;
541 595
542 for (i = 0; i < ndr_desc->num_mappings; i++) { 596 for (i = 0; i < ndr_desc->num_mappings; i++) {
543 struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i]; 597 struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
@@ -557,9 +611,19 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
557 if (!nd_region) 611 if (!nd_region)
558 return NULL; 612 return NULL;
559 nd_region->id = ida_simple_get(&region_ida, 0, 0, GFP_KERNEL); 613 nd_region->id = ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
560 if (nd_region->id < 0) { 614 if (nd_region->id < 0)
561 kfree(nd_region); 615 goto err_id;
562 return NULL; 616
617 nd_region->lane = alloc_percpu(struct nd_percpu_lane);
618 if (!nd_region->lane)
619 goto err_percpu;
620
621 for (i = 0; i < nr_cpu_ids; i++) {
622 struct nd_percpu_lane *ndl;
623
624 ndl = per_cpu_ptr(nd_region->lane, i);
625 spin_lock_init(&ndl->lock);
626 ndl->count = 0;
563 } 627 }
564 628
565 memcpy(nd_region->mapping, ndr_desc->nd_mapping, 629 memcpy(nd_region->mapping, ndr_desc->nd_mapping,
@@ -573,6 +637,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
573 nd_region->ndr_mappings = ndr_desc->num_mappings; 637 nd_region->ndr_mappings = ndr_desc->num_mappings;
574 nd_region->provider_data = ndr_desc->provider_data; 638 nd_region->provider_data = ndr_desc->provider_data;
575 nd_region->nd_set = ndr_desc->nd_set; 639 nd_region->nd_set = ndr_desc->nd_set;
640 nd_region->num_lanes = ndr_desc->num_lanes;
576 ida_init(&nd_region->ns_ida); 641 ida_init(&nd_region->ns_ida);
577 ida_init(&nd_region->btt_ida); 642 ida_init(&nd_region->btt_ida);
578 dev = &nd_region->dev; 643 dev = &nd_region->dev;
@@ -585,11 +650,18 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
585 nd_device_register(dev); 650 nd_device_register(dev);
586 651
587 return nd_region; 652 return nd_region;
653
654 err_percpu:
655 ida_simple_remove(&region_ida, nd_region->id);
656 err_id:
657 kfree(nd_region);
658 return NULL;
588} 659}
589 660
590struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus, 661struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
591 struct nd_region_desc *ndr_desc) 662 struct nd_region_desc *ndr_desc)
592{ 663{
664 ndr_desc->num_lanes = ND_MAX_LANES;
593 return nd_region_create(nvdimm_bus, ndr_desc, &nd_pmem_device_type, 665 return nd_region_create(nvdimm_bus, ndr_desc, &nd_pmem_device_type,
594 __func__); 666 __func__);
595} 667}
@@ -600,6 +672,7 @@ struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
600{ 672{
601 if (ndr_desc->num_mappings > 1) 673 if (ndr_desc->num_mappings > 1)
602 return NULL; 674 return NULL;
675 ndr_desc->num_lanes = min(ndr_desc->num_lanes, ND_MAX_LANES);
603 return nd_region_create(nvdimm_bus, ndr_desc, &nd_blk_device_type, 676 return nd_region_create(nvdimm_bus, ndr_desc, &nd_blk_device_type,
604 __func__); 677 __func__);
605} 678}
@@ -608,6 +681,7 @@ EXPORT_SYMBOL_GPL(nvdimm_blk_region_create);
608struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus, 681struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
609 struct nd_region_desc *ndr_desc) 682 struct nd_region_desc *ndr_desc)
610{ 683{
684 ndr_desc->num_lanes = ND_MAX_LANES;
611 return nd_region_create(nvdimm_bus, ndr_desc, &nd_volatile_device_type, 685 return nd_region_create(nvdimm_bus, ndr_desc, &nd_volatile_device_type,
612 __func__); 686 __func__);
613} 687}
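
A hedged sketch of the calling convention for the lane API added above:
an IO path acquires a lane (which disables preemption and, when lanes
are shared across CPUs, takes the per-lane spinlock), performs its
transfer using the lane-private resources (a BLK data window and/or a
BTT free-block/log slot), then releases the lane. example_do_io() is a
hypothetical stand-in for the driver's real transfer routine:

	/* hypothetical transfer helper, not part of the patch */
	static int example_do_io(void *buf, size_t len, unsigned int lane);

	static int example_rw_with_lane(struct nd_region *nd_region,
			void *buf, size_t len)
	{
		unsigned int lane;
		int rc;

		lane = nd_region_acquire_lane(nd_region); /* get_cpu() + maybe spin_lock() */
		rc = example_do_io(buf, len, lane);	  /* lane-private log/data window */
		nd_region_release_lane(nd_region, lane);  /* unlock + put_cpu() */

		return rc;
	}
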
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index a59dca17b3aa..531d99dfac68 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -85,6 +85,7 @@ struct nd_region_desc {
85 const struct attribute_group **attr_groups; 85 const struct attribute_group **attr_groups;
86 struct nd_interleave_set *nd_set; 86 struct nd_interleave_set *nd_set;
87 void *provider_data; 87 void *provider_data;
88 int num_lanes;
88}; 89};
89 90
90struct nvdimm_bus; 91struct nvdimm_bus;