2 files changed, 77 insertions, 3 deletions
diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt
index e578feed6d81..6d670f570451 100644
--- a/Documentation/block/cfq-iosched.txt
+++ b/Documentation/block/cfq-iosched.txt
@@ -43,3 +43,74 @@ If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
 to IOPS mode and starts providing fairness in terms of number of requests
 dispatched. Note that this mode switching takes effect only for group
 scheduling. For non-cgroup users nothing should change.
+CFQ IO scheduler Idling Theory
+===============================
+Idling on a queue is primarily about waiting for the next request to come
+on the same queue after completion of a request. In this process CFQ will not
+dispatch requests from other cfq queues even if requests are pending there.
+The rationale behind idling is that it can cut down on number of seeks
+on rotational media. For example, if a process is doing dependent
+sequential reads (the next read arrives only after the previous one
+completes), then not dispatching requests from other queues should help,
+as the disk head does not move and we keep dispatching sequential IO
+from one queue.
+CFQ has the following service trees, and various queues are put on these trees.
+        sync-idle       sync-noidle     async
+All cfq queues doing synchronous sequential IO go on to sync-idle tree.
+On this tree we idle on each queue individually.
+All synchronous non-sequential queues go on sync-noidle tree. Also, any
+requests marked with REQ_NOIDLE go on this service tree. On this tree
+we do not idle on individual queues; instead we idle on the whole group
+of queues, i.e. the tree. So if there are 4 queues waiting for IO to dispatch,
+we will idle only once the last queue has dispatched its IO and there is
+no more IO on this service tree.
+All async writes go on async service tree. There is no idling on async
+queues.
+CFQ has some optimizations for SSDs. If it detects non-rotational
+media which can support a higher queue depth (multiple requests in
+flight at a time), then it cuts down on idling of individual queues:
+all the queues move to sync-noidle tree and only tree idle remains. This
+tree idling provides isolation from buffered write queues on async tree.
+FAQ
+===
+Q1. Why idle at all on queues marked with REQ_NOIDLE?
+A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
+    with REQ_NOIDLE. This helps in providing isolation from all the sync-idle
+    queues. Otherwise, in the presence of many sequential readers, other
+    synchronous IO might not get a fair share of the disk.
+    For example, suppose there are 10 sequential readers doing IO and they
+    get 100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
+    roughly after 1 second. If we do not idle after completion of that
+    REQ_NOIDLE request, and a couple of milliseconds later another REQ_NOIDLE
+    request comes in, it will again be scheduled after 1 second. Repeat this
+    and notice how a workload can lose its disk share and suffer due to
+    multiple sequential readers.
+    fsync can generate dependent IO where a bunch of data is written in the
+    context of fsync, and later some journaling data is written. Journaling
+    data comes in only after fsync has finished its IO (at least for ext4
+    that seemed to be the case). Now if one decides not to idle on the fsync
+    thread due to REQ_NOIDLE, then the next journaling write will not get
+    scheduled for another second. A process doing small fsyncs will suffer
+    badly in the presence of multiple sequential readers.
+    Hence doing tree idling on threads using REQ_NOIDLE flag on requests
+    provides isolation from multiple sequential readers and at the same
+    time we do not idle on individual threads.
+Q2. When to specify REQ_NOIDLE?
+A2. I would think that whenever one is doing a synchronous write and not
+    expecting more writes to be dispatched from the same context soon, one
+    should be able to specify REQ_NOIDLE on the writes, and that should
+    probably work well for most cases.
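
The tree placement rules and the A1 arithmetic added by the hunk above can be restated as a small model. This is only a sketch of what the documentation text says, not of the CFQ source: the names ServiceTree, pick_service_tree and wait_behind_readers_ms are hypothetical, and the SSD case is simplified to a single boolean.

```python
from enum import Enum


class ServiceTree(Enum):
    SYNC_IDLE = "sync-idle"
    SYNC_NOIDLE = "sync-noidle"
    ASYNC = "async"


def pick_service_tree(sync, sequential, req_noidle=False, fast_nonrot=False):
    """Place a queue on a service tree following the documentation text.

    fast_nonrot is an assumed flag standing for "non-rotational media that
    supports a higher queue depth"; the real detection logic is not modeled.
    """
    if not sync:
        # Async writes always go on the async tree; no idling there.
        return ServiceTree.ASYNC
    if req_noidle or not sequential or fast_nonrot:
        # Only the tree as a whole is idled on, not individual queues.
        return ServiceTree.SYNC_NOIDLE
    # Synchronous sequential IO: idle on each queue individually.
    return ServiceTree.SYNC_IDLE


def wait_behind_readers_ms(readers, slice_ms):
    """Rough wait for a request that must sit out one slice per reader."""
    return readers * slice_ms


if __name__ == "__main__":
    print(pick_service_tree(sync=True, sequential=True).value)
    print(pick_service_tree(sync=True, sequential=True, req_noidle=True).value)
    print(wait_behind_readers_ms(readers=10, slice_ms=100), "ms")
```

With the numbers from A1 (10 readers, 100ms each), the helper gives the "roughly 1 second" delay that a REQ_NOIDLE request sees each time the tree is not idled on.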
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 6ca1f5cb71e0..614d0382e2cb 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1350,9 +1350,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
                        it is equivalent to "nosmp", which also disables
                        the IO APIC.
-        max_loop=       [LOOP] Maximum number of loopback devices that can
-                        be mounted
-                        Format: <1-256>
+        max_loop=       [LOOP] The number of loop block devices that get
+        (loop.max_loop) unconditionally pre-created at init time. The default
+                        number is configured by BLK_DEV_LOOP_MIN_COUNT. Instead
+                        of statically allocating a predefined number, loop
+                        devices can be requested on-demand with the
+                        /dev/loop-control interface.
        mcatest=        [IA-64]
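
For the on-demand path mentioned in the max_loop= text, a minimal sketch of asking /dev/loop-control for a free device might look like the following. The ioctl numbers are the ones defined in <linux/loop.h>; the helper name get_free_loop_device is hypothetical, and the call assumes a Linux system with the loop driver available and enough privilege to open the control node.

```python
import fcntl
import os

# ioctl request numbers from <linux/loop.h>
LOOP_CTL_ADD = 0x4C80
LOOP_CTL_REMOVE = 0x4C81
LOOP_CTL_GET_FREE = 0x4C82


def get_free_loop_device(control_path="/dev/loop-control"):
    """Return the path of an unused loop device, allocating one if needed.

    LOOP_CTL_GET_FREE returns the index of the first free loop device and
    creates that device when none is free.
    """
    fd = os.open(control_path, os.O_RDWR)
    try:
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    finally:
        os.close(fd)
    return f"/dev/loop{index}"


if __name__ == "__main__":
    # Typically needs root; prints something like /dev/loopN.
    print(get_free_loop_device())
```

This avoids depending on how many devices max_loop pre-created at init time: the caller simply asks for the next free index.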
