author | Linus Torvalds <torvalds@linux-foundation.org> | 2011-08-19 13:47:07 -0400 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2011-08-19 13:47:07 -0400 |
commit | 5ccc38740a283aba81a00e92941310d0c1aeb2ee (patch) | |
tree | ba7d725947975a9391e085bd1d5958b004bfdc3e /Documentation | |
parent | 0c3bef612881ee6216a36952ffaabfc35b83545c (diff) | |
parent | b53d1ed734a2b9af8da115b836b658daa7d47a48 (diff) |
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
* 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
Revert "cfq: Remove special treatment for metadata rqs."
block: fix flush machinery for stacking drivers with differring flush flags
block: improve rq_affinity placement
blktrace: add FLUSH/FUA support
Move some REQ flags to the common bio/request area
allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
xen/blkback: Make description more obvious.
cfq-iosched: Add documentation about idling
block: Make rq_affinity = 1 work as expected
block: swim3: fix unterminated of_device_id table
block/genhd.c: remove useless cast in diskstats_show()
drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
bsg-lib: add module.h include
cfq-iosched: Reduce linked group count upon group destruction
blk-throttle: correctly determine sync bio
loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
loop: add management interface for on-demand device allocation
loop: replace linked list of allocated devices with an idr index
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/block/cfq-iosched.txt | 71 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 9 |
2 files changed, 77 insertions, 3 deletions
diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt index e578feed6d81..6d670f570451 100644 --- a/Documentation/block/cfq-iosched.txt +++ b/Documentation/block/cfq-iosched.txt | |||
@@ -43,3 +43,74 @@ If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches | |||
43 | to IOPS mode and starts providing fairness in terms of number of requests | 43 | to IOPS mode and starts providing fairness in terms of number of requests |
44 | dispatched. Note that this mode switching takes effect only for group | 44 | dispatched. Note that this mode switching takes effect only for group |
45 | scheduling. For non-cgroup users nothing should change. | 45 | scheduling. For non-cgroup users nothing should change. |
46 | |||
47 | CFQ IO scheduler Idling Theory | ||
48 | =============================== | ||
49 | Idling on a queue is primarily about waiting for the next request to come | ||
50 | on same queue after completion of a request. In this process CFQ will not | ||
51 | dispatch requests from other cfq queues even if requests are pending there. | ||
52 | |||
53 | The rationale behind idling is that it can cut down on number of seeks | ||
54 | on rotational media. For example, if a process is doing dependent | ||
55 | sequential reads (the next read will come in only after completion of the | ||
56 | previous one), then not dispatching requests from other queues should help, as | ||
57 | we do not move the disk head and keep on dispatching sequential IO from | ||
58 | one queue. | ||
59 | |||
60 | CFQ has following service trees and various queues are put on these trees. | ||
61 | |||
62 | sync-idle sync-noidle async | ||
63 | |||
64 | All cfq queues doing synchronous sequential IO go on to sync-idle tree. | ||
65 | On this tree we idle on each queue individually. | ||
66 | |||
67 | All synchronous non-sequential queues go on sync-noidle tree. Also any | ||
68 | requests which are marked with REQ_NOIDLE go on this service tree. On this | ||
69 | tree we do not idle on individual queues but instead idle on the whole group | ||
70 | of queues or the tree. So if there are 4 queues waiting for IO to dispatch | ||
71 | we will idle only once last queue has dispatched the IO and there is | ||
72 | no more IO on this service tree. | ||
73 | |||
74 | All async writes go on async service tree. There is no idling on async | ||
75 | queues. | ||
76 | |||
77 | CFQ has some optimizations for SSDs and if it detects a non-rotational | ||
78 | media which can support higher queue depth (multiple requests in | ||
79 | flight at a time), then it cuts down on idling of individual queues and | ||
80 | all the queues move to sync-noidle tree and only tree idle remains. This | ||
81 | tree idling provides isolation with buffered write queues on async tree. | ||
82 | |||
83 | FAQ | ||
84 | === | ||
85 | Q1. Why idle at all on queues marked with REQ_NOIDLE? | ||
86 | |||
87 | A1. We only do tree idle (all queues on sync-noidle tree) on queues marked | ||
88 | with REQ_NOIDLE. This helps in providing isolation with all the sync-idle | ||
89 | queues. Otherwise in presence of many sequential readers, other | ||
90 | synchronous IO might not get fair share of disk. | ||
91 | |||
92 | For example, suppose there are 10 sequential readers doing IO and they get | ||
93 | 100ms each. If a REQ_NOIDLE request comes in, it will be scheduled | ||
94 | roughly after 1 second. If after completion of the REQ_NOIDLE request we | ||
95 | do not idle, and after a couple of milliseconds another REQ_NOIDLE | ||
96 | request comes in, again it will be scheduled after 1 second. Repeat it | ||
97 | and notice how a workload can lose its disk share and suffer due to | ||
98 | multiple sequential readers. | ||
99 | |||
100 | fsync can generate dependent IO where a bunch of data is written in the | ||
101 | context of fsync, and later some journaling data is written. Journaling | ||
102 | data comes in only after fsync has finished its IO (at least for ext4 | ||
103 | that seemed to be the case). Now if one decides not to idle on the fsync | ||
104 | thread due to REQ_NOIDLE, then the next journaling write will not get | ||
105 | scheduled for another second. A process doing small fsyncs will suffer | ||
106 | badly in presence of multiple sequential readers. | ||
107 | |||
108 | Hence doing tree idling on threads using REQ_NOIDLE flag on requests | ||
109 | provides isolation from multiple sequential readers and at the same | ||
110 | time we do not idle on individual threads. | ||
111 | |||
112 | Q2. When to specify REQ_NOIDLE? | ||
113 | A2. I would think whenever one is doing a synchronous write and is not expecting | ||
114 | more writes to be dispatched from the same context soon, one should be able | ||
115 | to specify REQ_NOIDLE on writes, and that probably should work well for | ||
116 | most of the cases. | ||
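
The arithmetic behind A1 can be sketched as a toy model in Python. This is illustrative only and is not part of the patch or of CFQ itself: the function names `round_wait_ms` and `total_wait_ms` are hypothetical. The model assumes every sequential reader uses its full slice, that service time and the idle window itself are negligible, and that with tree idling each follow-up request arrives while CFQ is still idling on the sync-noidle tree.

```python
def round_wait_ms(num_seq_readers: int, slice_ms: int) -> int:
    """Wait for one request that must sit behind a full round of readers.

    Assumption: each sequential reader on the sync-idle tree consumes
    its whole time slice before the waiting request is dispatched.
    """
    return num_seq_readers * slice_ms


def total_wait_ms(num_requests: int, num_seq_readers: int,
                  slice_ms: int, tree_idle: bool) -> int:
    """Total wait for a chain of dependent REQ_NOIDLE-style requests.

    Without tree idling, every request in the chain waits a full round.
    With tree idling (simplified), only the first request waits, because
    the follow-ups arrive while the scheduler idles on the tree.
    """
    if num_requests <= 0:
        return 0
    rounds = 1 if tree_idle else num_requests
    return rounds * round_wait_ms(num_seq_readers, slice_ms)
```

With the numbers from the text, `round_wait_ms(10, 100)` gives 1000 ms, the "roughly 1 second" quoted above. For a chain of five dependent requests, `total_wait_ms(5, 10, 100, tree_idle=False)` gives 5000 ms, while the same call with `tree_idle=True` gives 1000 ms under these simplifying assumptions.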
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 6ca1f5cb71e0..614d0382e2cb 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt | |||
@@ -1350,9 +1350,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. | |||
1350 | it is equivalent to "nosmp", which also disables | 1350 | it is equivalent to "nosmp", which also disables |
1351 | the IO APIC. | 1351 | the IO APIC. |
1352 | 1352 | ||
1353 | max_loop= [LOOP] Maximum number of loopback devices that can | 1353 | max_loop= [LOOP] The number of loop block devices that get |
1354 | be mounted | 1354 | (loop.max_loop) unconditionally pre-created at init time. The default |
1355 | Format: <1-256> | 1355 | number is configured by BLK_DEV_LOOP_MIN_COUNT. Instead |
1356 | of statically allocating a predefined number, loop | ||
1357 | devices can be requested on-demand with the | ||
1358 | /dev/loop-control interface. | ||
1356 | 1359 | ||
1357 | mcatest= [IA-64] | 1360 | mcatest= [IA-64] |
1358 | 1361 | ||
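
As a rough illustration of the on-demand allocation mentioned in the new max_loop text, the sketch below asks /dev/loop-control for a free loop device from Python. The hunk above names only the /dev/loop-control node. The ioctl name `LOOP_CTL_GET_FREE` and its request number are taken from `<linux/loop.h>` rather than from this diff, and the helper name `get_free_loop_device` is hypothetical. The call normally needs root privileges and may allocate a new loop device as a side effect.

```python
import fcntl
import os

# ioctl request numbers for /dev/loop-control, as defined in <linux/loop.h>.
LOOP_CTL_ADD = 0x4C80
LOOP_CTL_REMOVE = 0x4C81
LOOP_CTL_GET_FREE = 0x4C82


def get_free_loop_device(control_path="/dev/loop-control"):
    """Return the path of a free loop device, or None on failure.

    Failure covers a missing control node (older kernel, loop driver not
    loaded) as well as insufficient permissions.
    """
    try:
        fd = os.open(control_path, os.O_RDWR)
    except OSError:
        return None
    try:
        # With an integer argument, fcntl.ioctl returns the ioctl's own
        # return value, which here is the index of a free loop device.
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    except OSError:
        return None
    finally:
        os.close(fd)
    return f"/dev/loop{index}"
```

A caller would typically check the result for `None` before attaching a backing file to the returned device, for example with losetup.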