Linux-2.6.12-rc2v2.6.12-rc2

Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
author: Linus Torvalds <torvalds@ppc970.osdl.org> 2005-04-16 18:20:36 -0400
committer: Linus Torvalds <torvalds@ppc970.osdl.org> 2005-04-16 18:20:36 -0400
commit: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 (patch)
tree: 0bba044c4ce775e45a88a51686b5d9f90697ea9d /Documentation/block/as-iosched.txt
1 files changed, 165 insertions, 0 deletions
diff --git a/Documentation/block/as-iosched.txt b/Documentation/block/as-iosched.txt
new file mode 100644
index 000000000000..6f47332c883d
--- /dev/null
+++ b/Documentation/block/as-iosched.txt
@@ -0,0 +1,165 @@
+Anticipatory IO scheduler
+-------------------------
+Nick Piggin <piggin@cyberone.com.au>    13 Sep 2003
+Attention! Database servers, especially those using "TCQ" disks should
+investigate performance with the 'deadline' IO scheduler. Any system with high
+disk performance requirements should do so, in fact.
+If you see unusual performance characteristics of your disk systems, or you
+see big performance regressions versus the deadline scheduler, please email
+me. Database users don't bother unless you're willing to test a lot of patches
+from me ;) its a known issue.
+Also, users with hardware RAID controllers, doing striping, may find
+highly variable performance results with using the as-iosched. The
+as-iosched anticipatory implementation is based on the notion that a disk
+device has only one physical seeking head.  A striped RAID controller
+actually has a head for each physical device in the logical RAID device.
+However, setting the antic_expire (see tunable parameters below) produces
+very similar behavior to the deadline IO scheduler.
+Selecting IO schedulers
+-----------------------
+To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
+'noop' and 'as' (the default) are also available. IO schedulers are assigned
+globally at boot time only presently.
+Anticipatory IO scheduler Policies
+----------------------------------
+The as-iosched implementation implements several layers of policies
+to determine when an IO request is dispatched to the disk controller.
+Here are the policies outlined, in order of application.
+1. one-way Elevator algorithm.
+The elevator algorithm is similar to that used in deadline scheduler, with
+the addition that it allows limited backward movement of the elevator
+(i.e. seeks backwards).  A seek backwards can occur when choosing between
+two IO requests where one is behind the elevator's current position, and
+the other is in front of the elevator's position. If the seek distance to
+the request in back of the elevator is less than half the seek distance to
+the request in front of the elevator, then the request in back can be chosen.
+Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors.
+This favors forward movement of the elevator, while allowing opportunistic
+"short" backward seeks.
+2. FIFO expiration times for reads and for writes.
+This is again very similar to the deadline IO scheduler.  The expiration
+times for requests on these lists is tunable using the parameters read_expire
+and write_expire discussed below.  When a read or a write expires in this way,
+the IO scheduler will interrupt its current elevator sweep or read anticipation
+to service the expired request.
+3. Read and write request batching
+A batch is a collection of read requests or a collection of write
+requests.  The as scheduler alternates dispatching read and write batches
+to the driver.  In the case a read batch, the scheduler submits read
+requests to the driver as long as there are read requests to submit, and
+the read batch time limit has not been exceeded (read_batch_expire).
+The read batch time limit begins counting down only when there are
+competing write requests pending.
+In the case of a write batch, the scheduler submits write requests to
+the driver as long as there are write requests available, and the
+write batch time limit has not been exceeded (write_batch_expire).
+However, the length of write batches will be gradually shortened
+when read batches frequently exceed their time limit.
+When changing between batch types, the scheduler waits for all requests
+from the previous batch to complete before scheduling requests for the
+next batch.
+The read and write fifo expiration times described in policy 2 above
+are checked only when in scheduling IO of a batch for the corresponding
+(read/write) type.  So for example, the read FIFO timeout values are
+tested only during read batches.  Likewise, the write FIFO timeout
+values are tested only during write batches.  For this reason,
+it is generally not recommended for the read batch time
+to be longer than the write expiration time, nor for the write batch
+time to exceed the read expiration time (see tunable parameters below).
+When the IO scheduler changes from a read to a write batch,
+it begins the elevator from the request that is on the head of the
+write expiration FIFO.  Likewise, when changing from a write batch to
+a read batch, scheduler begins the elevator from the first entry
+on the read expiration FIFO.
+4. Read anticipation.
+Read anticipation occurs only when scheduling a read batch.
+This implementation of read anticipation allows only one read request
+to be dispatched to the disk controller at a time.  In
+contrast, many write requests may be dispatched to the disk controller
+at a time during a write batch.  It is this characteristic that can make
+the anticipatory scheduler perform anomalously with controllers supporting
+TCQ, or with hardware striped RAID devices. Setting the antic_expire
+queue paramter (see below) to zero disables this behavior, and the anticipatory
+scheduler behaves essentially like the deadline scheduler.
+When read anticipation is enabled (antic_expire is not zero), reads
+are dispatched to the disk controller one at a time.
+At the end of each read request, the IO scheduler examines its next
+candidate read request from its sorted read list.  If that next request
+is from the same process as the request that just completed,
+or if the next request in the queue is "very close" to the
+just completed request, it is dispatched immediately.  Otherwise,
+statistics (average think time, average seek distance) on the process
+that submitted the just completed request are examined.  If it seems
+likely that that process will submit another request soon, and that
+request is likely to be near the just completed request, then the IO
+scheduler will stop dispatching more read requests for up time (antic_expire)
+milliseconds, hoping that process will submit a new request near the one
+that just completed.  If such a request is made, then it is dispatched
+immediately.  If the antic_expire wait time expires, then the IO scheduler
+will dispatch the next read request from the sorted read queue.
+To decide whether an anticipatory wait is worthwhile, the scheduler
+maintains statistics for each process that can be used to compute
+mean "think time" (the time between read requests), and mean seek
+distance for that process.  One observation is that these statistics
+are associated with each process, but those statistics are not associated
+with a specific IO device.  So for example, if a process is doing IO
+on several file systems on separate devices, the statistics will be
+a combination of IO behavior from all those devices.
+Tuning the anticipatory IO scheduler
+------------------------------------
+When using 'as', the anticipatory IO scheduler there are 5 parameters under
+/sys/block/*/queue/iosched/. All are units of milliseconds.
+The parameters are:
+* read_expire
+    Controls how long until a read request becomes "expired". It also controls the
+    interval between which expired requests are served, so set to 50, a request
+    might take anywhere < 100ms to be serviced _if_ it is the next on the
+    expired list. Obviously request expiration strategies won't make the disk
+    go faster. The result basically equates to the timeslice a single reader
+    gets in the presence of other IO. 100*((seek time / read_expire) + 1) is
+    very roughly the % streaming read efficiency your disk should get with
+    multiple readers.
+* read_batch_expire
+    Controls how much time a batch of reads is given before pending writes are
+    served. A higher value is more efficient. This might be set below read_expire
+    if writes are to be given higher priority than reads, but reads are to be
+    as efficient as possible when there are no writes. Generally though, it
+    should be some multiple of read_expire.
+* write_expire, and
+* write_batch_expire are equivalent to the above, for writes.
+* antic_expire
+    Controls the maximum amount of time we can anticipate a good read (one
+    with a short seek distance from the most recently completed request) before
+    giving up. Many other factors may cause anticipation to be stopped early,
+    or some processes will not be "anticipated" at all. Should be a bit higher
+    for big seek time devices though not a linear correspondence - most
+    processes have only a few ms thinktime.
author	Linus Torvalds <torvalds@ppc970.osdl.org>	2005-04-16 18:20:36 -0400
committer	Linus Torvalds <torvalds@ppc970.osdl.org>	2005-04-16 18:20:36 -0400
commit	1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 (patch)
tree	0bba044c4ce775e45a88a51686b5d9f90697ea9d /Documentation/block/as-iosched.txt

diff --git a/Documentation/block/as-iosched.txt b/Documentation/block/as-iosched.txt new file mode 100644 index 000000000000..6f47332c883d --- /dev/null +++ b/Documentation/block/as-iosched.txt
@@ -0,0 +1,165 @@
	1	Anticipatory IO scheduler
	2	-------------------------
	3	Nick Piggin <piggin@cyberone.com.au> 13 Sep 2003
	4
	5	Attention! Database servers, especially those using "TCQ" disks should
	6	investigate performance with the 'deadline' IO scheduler. Any system with high
	7	disk performance requirements should do so, in fact.
	8
	9	If you see unusual performance characteristics of your disk systems, or you
	10	see big performance regressions versus the deadline scheduler, please email
	11	me. Database users don't bother unless you're willing to test a lot of patches
	12	from me ;) its a known issue.
	13
	14	Also, users with hardware RAID controllers, doing striping, may find
	15	highly variable performance results with using the as-iosched. The
	16	as-iosched anticipatory implementation is based on the notion that a disk
	17	device has only one physical seeking head. A striped RAID controller
	18	actually has a head for each physical device in the logical RAID device.
	19
	20	However, setting the antic_expire (see tunable parameters below) produces
	21	very similar behavior to the deadline IO scheduler.
	22
	23
	24	Selecting IO schedulers
	25	-----------------------
	26	To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
	27	'noop' and 'as' (the default) are also available. IO schedulers are assigned
	28	globally at boot time only presently.
	29
	30
	31	Anticipatory IO scheduler Policies
	32	----------------------------------
	33	The as-iosched implementation implements several layers of policies
	34	to determine when an IO request is dispatched to the disk controller.
	35	Here are the policies outlined, in order of application.
	36
	37	1. one-way Elevator algorithm.
	38
	39	The elevator algorithm is similar to that used in deadline scheduler, with
	40	the addition that it allows limited backward movement of the elevator
	41	(i.e. seeks backwards). A seek backwards can occur when choosing between
	42	two IO requests where one is behind the elevator's current position, and
	43	the other is in front of the elevator's position. If the seek distance to
	44	the request in back of the elevator is less than half the seek distance to
	45	the request in front of the elevator, then the request in back can be chosen.
	46	Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors.
	47	This favors forward movement of the elevator, while allowing opportunistic
	48	"short" backward seeks.
	49
	50	2. FIFO expiration times for reads and for writes.
	51
	52	This is again very similar to the deadline IO scheduler. The expiration
	53	times for requests on these lists is tunable using the parameters read_expire
	54	and write_expire discussed below. When a read or a write expires in this way,
	55	the IO scheduler will interrupt its current elevator sweep or read anticipation
	56	to service the expired request.
	57
	58	3. Read and write request batching
	59
	60	A batch is a collection of read requests or a collection of write
	61	requests. The as scheduler alternates dispatching read and write batches
	62	to the driver. In the case a read batch, the scheduler submits read
	63	requests to the driver as long as there are read requests to submit, and
	64	the read batch time limit has not been exceeded (read_batch_expire).
	65	The read batch time limit begins counting down only when there are
	66	competing write requests pending.
	67
	68	In the case of a write batch, the scheduler submits write requests to
	69	the driver as long as there are write requests available, and the
	70	write batch time limit has not been exceeded (write_batch_expire).
	71	However, the length of write batches will be gradually shortened
	72	when read batches frequently exceed their time limit.
	73
	74	When changing between batch types, the scheduler waits for all requests
	75	from the previous batch to complete before scheduling requests for the
	76	next batch.
	77
	78	The read and write fifo expiration times described in policy 2 above
	79	are checked only when in scheduling IO of a batch for the corresponding
	80	(read/write) type. So for example, the read FIFO timeout values are
	81	tested only during read batches. Likewise, the write FIFO timeout
	82	values are tested only during write batches. For this reason,
	83	it is generally not recommended for the read batch time
	84	to be longer than the write expiration time, nor for the write batch
	85	time to exceed the read expiration time (see tunable parameters below).
	86
	87	When the IO scheduler changes from a read to a write batch,
	88	it begins the elevator from the request that is on the head of the
	89	write expiration FIFO. Likewise, when changing from a write batch to
	90	a read batch, scheduler begins the elevator from the first entry
	91	on the read expiration FIFO.
	92
	93	4. Read anticipation.
	94
	95	Read anticipation occurs only when scheduling a read batch.
	96	This implementation of read anticipation allows only one read request
	97	to be dispatched to the disk controller at a time. In
	98	contrast, many write requests may be dispatched to the disk controller
	99	at a time during a write batch. It is this characteristic that can make
	100	the anticipatory scheduler perform anomalously with controllers supporting
	101	TCQ, or with hardware striped RAID devices. Setting the antic_expire
	102	queue paramter (see below) to zero disables this behavior, and the anticipatory
	103	scheduler behaves essentially like the deadline scheduler.
	104
	105	When read anticipation is enabled (antic_expire is not zero), reads
	106	are dispatched to the disk controller one at a time.
	107	At the end of each read request, the IO scheduler examines its next
	108	candidate read request from its sorted read list. If that next request
	109	is from the same process as the request that just completed,
	110	or if the next request in the queue is "very close" to the
	111	just completed request, it is dispatched immediately. Otherwise,
	112	statistics (average think time, average seek distance) on the process
	113	that submitted the just completed request are examined. If it seems
	114	likely that that process will submit another request soon, and that
	115	request is likely to be near the just completed request, then the IO
	116	scheduler will stop dispatching more read requests for up time (antic_expire)
	117	milliseconds, hoping that process will submit a new request near the one
	118	that just completed. If such a request is made, then it is dispatched
	119	immediately. If the antic_expire wait time expires, then the IO scheduler
	120	will dispatch the next read request from the sorted read queue.
	121
	122	To decide whether an anticipatory wait is worthwhile, the scheduler
	123	maintains statistics for each process that can be used to compute
	124	mean "think time" (the time between read requests), and mean seek
	125	distance for that process. One observation is that these statistics
	126	are associated with each process, but those statistics are not associated
	127	with a specific IO device. So for example, if a process is doing IO
	128	on several file systems on separate devices, the statistics will be
	129	a combination of IO behavior from all those devices.
	130
	131
	132	Tuning the anticipatory IO scheduler
	133	------------------------------------
	134	When using 'as', the anticipatory IO scheduler there are 5 parameters under
	135	/sys/block/*/queue/iosched/. All are units of milliseconds.
	136
	137	The parameters are:
	138	* read_expire
	139	Controls how long until a read request becomes "expired". It also controls the
	140	interval between which expired requests are served, so set to 50, a request
	141	might take anywhere < 100ms to be serviced _if_ it is the next on the
	142	expired list. Obviously request expiration strategies won't make the disk
	143	go faster. The result basically equates to the timeslice a single reader
	144	gets in the presence of other IO. 100*((seek time / read_expire) + 1) is
	145	very roughly the % streaming read efficiency your disk should get with
	146	multiple readers.
	147
	148	* read_batch_expire
	149	Controls how much time a batch of reads is given before pending writes are
	150	served. A higher value is more efficient. This might be set below read_expire
	151	if writes are to be given higher priority than reads, but reads are to be
	152	as efficient as possible when there are no writes. Generally though, it
	153	should be some multiple of read_expire.
	154
	155	* write_expire, and
	156	* write_batch_expire are equivalent to the above, for writes.
	157
	158	* antic_expire
	159	Controls the maximum amount of time we can anticipate a good read (one
	160	with a short seek distance from the most recently completed request) before
	161	giving up. Many other factors may cause anticipation to be stopped early,
	162	or some processes will not be "anticipated" at all. Should be a bit higher
	163	for big seek time devices though not a linear correspondence - most
	164	processes have only a few ms thinktime.
	165