block: update documentation for REQ_FLUSH / REQ_FUA

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
author: Christoph Hellwig <hch@lst.de> 2010-09-03 05:56:17 -0400
committer: Jens Axboe <jaxboe@fusionio.com> 2010-09-10 06:35:37 -0400
commit: 04ccc65cd1f57aee861708e08cd2272c5a0d088c (patch)
tree: 60b72d9acc0f94f9abfd3a3cee7de2e58549c9eb
parent: 09d60c701b64b509f328cac72970eb894f485b9e (diff)
3 files changed, 88 insertions, 263 deletions
diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index a406286f6f3e..d111e3b23db0 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -1,7 +1,5 @@
 00-INDEX
        - This file
-barrier.txt
-        - I/O Barriers
 biodoc.txt
        - Notes on the Generic Block Layer Rewrite in Linux 2.5
 capability.txt
@@ -16,3 +14,5 @@ stat.txt
        - Block layer statistics in /sys/block/<dev>/stat
 switching-sched.txt
        - Switching I/O schedulers at runtime
+writeback_cache_control.txt
+        - Control of volatile write back caches
diff --git a/Documentation/block/barrier.txt b/Documentation/block/barrier.txt
deleted file mode 100644
index 2c2f24f634e4..000000000000
--- a/Documentation/block/barrier.txt
+++ /dev/null
@@ -1,261 +0,0 @@
-I/O Barriers
-============
-Tejun Heo <htejun@gmail.com>, July 22 2005
-I/O barrier requests are used to guarantee ordering around the barrier
-requests.  Unless you're crazy enough to use disk drives for
-implementing synchronization constructs (wow, sounds interesting...),
-the ordering is meaningful only for write requests for things like
-journal checkpoints.  All requests queued before a barrier request
-must be finished (made it to the physical medium) before the barrier
-request is started, and all requests queued after the barrier request
-must be started only after the barrier request is finished (again,
-made it to the physical medium).
-In other words, I/O barrier requests have the following two properties.
-1. Request ordering
-Requests cannot pass the barrier request.  Preceding requests are
-processed before the barrier and following requests after.
-Depending on what features a drive supports, this can be done in one
-of the following three ways.
-i. For devices which have queue depth greater than 1 (TCQ devices) and
-support ordered tags, block layer can just issue the barrier as an
-ordered request and the lower level driver, controller and drive
-itself are responsible for making sure that the ordering constraint is
-met.  Most modern SCSI controllers/drives should support this.
-NOTE: SCSI ordered tag isn't currently used due to limitation in the
-      SCSI midlayer, see the following random notes section.
-ii. For devices which have queue depth greater than 1 but don't
-support ordered tags, block layer ensures that the requests preceding
-a barrier request finishes before issuing the barrier request.  Also,
-it defers requests following the barrier until the barrier request is
-finished.  Older SCSI controllers/drives and SATA drives fall in this
-category.
-iii. Devices which have queue depth of 1.  This is a degenerate case
-of ii.  Just keeping issue order suffices.  Ancient SCSI
-controllers/drives and IDE drives are in this category.
-2. Forced flushing to physical medium
-Again, if you're not gonna do synchronization with disk drives (dang,
-it sounds even more appealing now!), the reason you use I/O barriers
-is mainly to protect filesystem integrity when power failure or some
-other events abruptly stop the drive from operating and possibly make
-the drive lose data in its cache.  So, I/O barriers need to guarantee
-that requests actually get written to non-volatile medium in order.
-There are four cases,
-i. No write-back cache.  Keeping requests ordered is enough.
-ii. Write-back cache but no flush operation.  There's no way to
-guarantee physical-medium commit order.  This kind of devices can't to
-I/O barriers.
-iii. Write-back cache and flush operation but no FUA (forced unit
-access).  We need two cache flushes - before and after the barrier
-request.
-iv. Write-back cache, flush operation and FUA.  We still need one
-flush to make sure requests preceding a barrier are written to medium,
-but post-barrier flush can be avoided by using FUA write on the
-barrier itself.
-How to support barrier requests in drivers
------------------------------------------
-All barrier handling is done inside block layer proper.  All low level
-drivers have to are implementing its prepare_flush_fn and using one
-the following two functions to indicate what barrier type it supports
-and how to prepare flush requests.  Note that the term 'ordered' is
-used to indicate the whole sequence of performing barrier requests
-including draining and flushing.
-typedef void (prepare_flush_fn)(struct request_queue *q, struct request *rq);
-int blk_queue_ordered(struct request_queue *q, unsigned ordered,
-                      prepare_flush_fn *prepare_flush_fn);
-@q                      : the queue in question
-@ordered                : the ordered mode the driver/device supports
-@prepare_flush_fn       : this function should prepare @rq such that it
-                          flushes cache to physical medium when executed
-For example, SCSI disk driver's prepare_flush_fn looks like the
-following.
-static void sd_prepare_flush(struct request_queue *q, struct request *rq)
-{
-        memset(rq->cmd, 0, sizeof(rq->cmd));
-        rq->cmd_type = REQ_TYPE_BLOCK_PC;
-        rq->timeout = SD_TIMEOUT;
-        rq->cmd[0] = SYNCHRONIZE_CACHE;
-        rq->cmd_len = 10;
-}
-The following seven ordered modes are supported.  The following table
-shows which mode should be used depending on what features a
-device/driver supports.  In the leftmost column of table,
-QUEUE_ORDERED_ prefix is omitted from the mode names to save space.
-The table is followed by description of each mode.  Note that in the
-descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
-used for QUEUE_ORDERED_TAG* descriptions.  '=>' indicates that the
-preceding step must be complete before proceeding to the next step.
-'->' indicates that the next step can start as soon as the previous
-step is issued.
-            write-back cache    ordered tag     flush           FUA
-----------------------------------------------------------------------
-NONE            yes/no          N/A             no              N/A
-DRAIN           no              no              N/A             N/A
-DRAIN_FLUSH     yes             no              yes             no
-DRAIN_FUA       yes             no              yes             yes
-TAG             no              yes             N/A             N/A
-TAG_FLUSH       yes             yes             yes             no
-TAG_FUA         yes             yes             yes             yes
-QUEUE_ORDERED_NONE
-        I/O barriers are not needed and/or supported.
-        Sequence: N/A
-QUEUE_ORDERED_DRAIN
-        Requests are ordered by draining the request queue and cache
-        flushing isn't needed.
-        Sequence: drain => barrier
-QUEUE_ORDERED_DRAIN_FLUSH
-        Requests are ordered by draining the request queue and both
-        pre-barrier and post-barrier cache flushings are needed.
-        Sequence: drain => preflush => barrier => postflush
-QUEUE_ORDERED_DRAIN_FUA
-        Requests are ordered by draining the request queue and
-        pre-barrier cache flushing is needed.  By using FUA on barrier
-        request, post-barrier flushing can be skipped.
-        Sequence: drain => preflush => barrier
-QUEUE_ORDERED_TAG
-        Requests are ordered by ordered tag and cache flushing isn't
-        needed.
-        Sequence: barrier
-QUEUE_ORDERED_TAG_FLUSH
-        Requests are ordered by ordered tag and both pre-barrier and
-        post-barrier cache flushings are needed.
-        Sequence: preflush -> barrier -> postflush
-QUEUE_ORDERED_TAG_FUA
-        Requests are ordered by ordered tag and pre-barrier cache
-        flushing is needed.  By using FUA on barrier request,
-        post-barrier flushing can be skipped.
-        Sequence: preflush -> barrier
-Random notes/caveats
--------------------
-* SCSI layer currently can't use TAG ordering even if the drive,
-controller and driver support it.  The problem is that SCSI midlayer
-request dispatch function is not atomic.  It releases queue lock and
-switch to SCSI host lock during issue and it's possible and likely to
-happen in time that requests change their relative positions.  Once
-this problem is solved, TAG ordering can be enabled.
-* Currently, no matter which ordered mode is used, there can be only
-one barrier request in progress.  All I/O barriers are held off by
-block layer until the previous I/O barrier is complete.  This doesn't
-make any difference for DRAIN ordered devices, but, for TAG ordered
-devices with very high command latency, passing multiple I/O barriers
-to low level *might* be helpful if they are very frequent.  Well, this
-certainly is a non-issue.  I'm writing this just to make clear that no
-two I/O barrier is ever passed to low-level driver.
-* Completion order.  Requests in ordered sequence are issued in order
-but not required to finish in order.  Barrier implementation can
-handle out-of-order completion of ordered sequence.  IOW, the requests
-MUST be processed in order but the hardware/software completion paths
-are allowed to reorder completion notifications - eg. current SCSI
-midlayer doesn't preserve completion order during error handling.
-* Requeueing order.  Low-level drivers are free to requeue any request
-after they removed it from the request queue with
-blkdev_dequeue_request().  As barrier sequence should be kept in order
-when requeued, generic elevator code takes care of putting requests in
-order around barrier.  See blk_ordered_req_seq() and
-ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.
-Note that block drivers must not requeue preceding requests while
-completing latter requests in an ordered sequence.  Currently, no
-error checking is done against this.
-* Error handling.  Currently, block layer will report error to upper
-layer if any of requests in an ordered sequence fails.  Unfortunately,
-this doesn't seem to be enough.  Look at the following request flow.
-QUEUE_ORDERED_TAG_FLUSH is in use.
- [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
-                                          still in elevator
-Let's say request [2], [3] are write requests to update file system
-metadata (journal or whatever) and [barrier] is used to mark that
-those updates are valid.  Consider the following sequence.
- i.     Requests [0] ~ [post] leaves the request queue and enters
-        low-level driver.
- ii.    After a while, unfortunately, something goes wrong and the
-        drive fails [2].  Note that any of [0], [1] and [3] could have
-        completed by this time, but [pre] couldn't have been finished
-        as the drive must process it in order and it failed before
-        processing that command.
- iii.   Error handling kicks in and determines that the error is
-        unrecoverable and fails [2], and resumes operation.
- iv.    [pre] [barrier] [post] gets processed.
- v.     *BOOM* power fails
-The problem here is that the barrier request is *supposed* to indicate
-that filesystem update requests [2] and [3] made it safely to the
-physical medium and, if the machine crashes after the barrier is
-written, filesystem recovery code can depend on that.  Sadly, that
-isn't true in this case anymore.  IOW, the success of a I/O barrier
-should also be dependent on success of some of the preceding requests,
-where only upper layer (filesystem) knows what 'some' is.
-This can be solved by implementing a way to tell the block layer which
-requests affect the success of the following barrier request and
-making lower lever drivers to resume operation on error only after
-block layer tells it to do so.
-As the probability of this happening is very low and the drive should
-be faulty, implementing the fix is probably an overkill.  But, still,
-it's there.
-* In previous drafts of barrier implementation, there was fallback
-mechanism such that, if FUA or ordered TAG fails, less fancy ordered
-mode can be selected and the failed barrier request is retried
-automatically.  The rationale for this feature was that as FUA is
-pretty new in ATA world and ordered tag was never used widely, there
-could be devices which report to support those features but choke when
-actually given such requests.
- This was removed for two reasons 1. it's an overkill 2. it's
-impossible to implement properly when TAG ordering is used as low
-level drivers resume after an error automatically.  If it's ever
-needed adding it back and modifying low level drivers accordingly
-shouldn't be difficult.
diff --git a/Documentation/block/writeback_cache_control.txt b/Documentation/block/writeback_cache_control.txt
new file mode 100644
index 000000000000..83407d36630a
--- /dev/null
+++ b/Documentation/block/writeback_cache_control.txt
@@ -0,0 +1,86 @@
+Explicit volatile write back cache control
+==========================================
+Introduction
+------------
+Many storage devices, especially in the consumer market, come with volatile
+write back caches.  That means the devices signal I/O completion to the
+operating system before data actually has hit the non-volatile storage.  This
+behavior obviously speeds up various workloads, but it means the operating
+system needs to force data out to the non-volatile storage when it performs
+a data integrity operation like fsync, sync or an unmount.
+The Linux block layer provides two simple mechanisms that let filesystems
+control the caching behavior of the storage device.  These mechanisms are
+a forced cache flush, and the Force Unit Access (FUA) flag for requests.
+Explicit cache flushes
+----------------------
+The REQ_FLUSH flag can be ORed into the r/w flags of a bio submitted from
+the filesystem and will make sure the volatile cache of the storage device
+has been flushed before the actual I/O operation is started.  This explicitly
+guarantees that previously completed write requests are on non-volatile
+storage before the flagged bio starts. In addition the REQ_FLUSH flag can be
+set on an otherwise empty bio structure, which causes only an explicit cache
+flush without any dependent I/O.  It is recommended to use
+the blkdev_issue_flush() helper for a pure cache flush.
+Forced Unit Access
+------------------
+The REQ_FUA flag can be ORed into the r/w flags of a bio submitted from the
+filesystem and will make sure that I/O completion for this request is only
+signaled after the data has been committed to non-volatile storage.
+Implementation details for filesystems
+--------------------------------------
+Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
+worry if the underlying devices need any explicit cache flushing and how
+the Forced Unit Access is implemented.  The REQ_FLUSH and REQ_FUA flags
+may both be set on a single bio.
+Implementation details for make_request_fn based block drivers
+--------------------------------------------------------------
+These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
+directly below the submit_bio interface.  For remapping drivers the REQ_FUA
+bit needs to be propagated to underlying devices, and a global flush needs
+to be implemented for bios with the REQ_FLUSH bit set.  For real device
+drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
+on non-empty bios can simply be ignored, and REQ_FLUSH requests without
+data can be completed successfully without doing any work.  Drivers for
+devices with volatile caches need to implement the support for these
+flags themselves without any help from the block layer.
+Implementation details for request_fn based block drivers
+---------------------------------------------------------
+For devices that do not support volatile write caches no driver support
+is required; the block layer completes empty REQ_FLUSH requests before
+entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
+requests that have a payload.  For devices with volatile write caches the
+driver needs to tell the block layer that it supports flushing caches by
+doing:
+        blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);
+and handle empty REQ_FLUSH requests in its prep_fn/request_fn.  Note that
+REQ_FLUSH requests with a payload are automatically turned into a sequence
+of an empty REQ_FLUSH request followed by the actual write by the block
+layer.  For devices that also support the FUA bit the block layer needs
+to be told to pass through the REQ_FUA bit using:
+        blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);
+and the driver must handle write requests that have the REQ_FUA bit set
+in prep_fn/request_fn.  If the FUA bit is not natively supported the block
+layer turns it into an empty REQ_FLUSH request after the actual write.
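
Note on the request_fn sequencing described in the new document: the rules
can be modelled in plain user-space C.  The following is only a sketch under
stated assumptions, not kernel code.  The names sk_decompose(), SK_FLUSH,
SK_FUA and enum sk_step are hypothetical stand-ins invented for this
illustration; "caps" plays the role of the flags a driver would pass to
blk_queue_flush().  A queue that advertises FUA without FLUSH is treated
here as having no volatile cache, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for REQ_FLUSH / REQ_FUA; the values are arbitrary. */
enum { SK_FLUSH = 1u << 0, SK_FUA = 1u << 1 };

/* One step as a request_fn based driver would see it in this model. */
enum sk_step {
	SK_STEP_PREFLUSH,	/* empty flush issued before the write */
	SK_STEP_WRITE,		/* plain write, all flag bits stripped */
	SK_STEP_WRITE_FUA,	/* write with the FUA bit passed through */
	SK_STEP_POSTFLUSH	/* empty flush emulating a missing FUA */
};

/*
 * Turn the flags of one bio into the sequence of steps handed to the driver.
 *   bio_flags: SK_FLUSH and/or SK_FUA as set by the filesystem
 *   has_data:  non-zero if the bio carries a payload
 *   caps:      what the driver advertised (0, SK_FLUSH, SK_FLUSH | SK_FUA)
 * Returns the number of steps stored in out[] (at most 3).
 */
size_t sk_decompose(unsigned bio_flags, int has_data, unsigned caps,
		    enum sk_step out[3])
{
	size_t n = 0;

	assert(out != NULL);

	if (!(caps & SK_FLUSH)) {
		/*
		 * No volatile cache: an empty flush completes without
		 * reaching the driver, and a write loses both flag bits.
		 */
		if (has_data)
			out[n++] = SK_STEP_WRITE;
		return n;
	}

	/* A flush request becomes an empty flush ahead of any payload. */
	if (bio_flags & SK_FLUSH)
		out[n++] = SK_STEP_PREFLUSH;

	if (has_data) {
		if ((bio_flags & SK_FUA) && (caps & SK_FUA)) {
			/* Native FUA: the bit is passed through. */
			out[n++] = SK_STEP_WRITE_FUA;
		} else {
			out[n++] = SK_STEP_WRITE;
			/* No native FUA: emulate it with a flush afterwards. */
			if (bio_flags & SK_FUA)
				out[n++] = SK_STEP_POSTFLUSH;
		}
	}
	return n;
}
```

In this model a bio with both flags set, sent to a queue that only
advertised flush support, yields three steps: preflush, write, postflush.
The same bio on a queue that also advertised FUA yields two: preflush and a
FUA write.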