diff options
Diffstat (limited to 'Documentation/hrtimers/hrtimers.txt')
-rw-r--r-- | Documentation/hrtimers/hrtimers.txt | 178 |
1 files changed, 178 insertions, 0 deletions
diff --git a/Documentation/hrtimers/hrtimers.txt b/Documentation/hrtimers/hrtimers.txt new file mode 100644 index 000000000000..ce31f65e12e7 --- /dev/null +++ b/Documentation/hrtimers/hrtimers.txt | |||
@@ -0,0 +1,178 @@ | |||
1 | |||
2 | hrtimers - subsystem for high-resolution kernel timers | ||
3 | ---------------------------------------------------- | ||
4 | |||
5 | This patch introduces a new subsystem for high-resolution kernel timers. | ||
6 | |||
7 | One might ask the question: we already have a timer subsystem | ||
8 | (kernel/timers.c), why do we need two timer subsystems? After a lot of | ||
9 | back and forth trying to integrate high-resolution and high-precision | ||
10 | features into the existing timer framework, and after testing various | ||
11 | such high-resolution timer implementations in practice, we came to the | ||
12 | conclusion that the timer wheel code is fundamentally not suitable for | ||
13 | such an approach. We initially didn't believe this ('there must be a way | ||
14 | to solve this'), and spent a considerable effort trying to integrate | ||
15 | things into the timer wheel, but we failed. In hindsight, there are | ||
16 | several reasons why such integration is hard/impossible: | ||
17 | |||
18 | - the forced handling of low-resolution and high-resolution timers in | ||
19 | the same way leads to a lot of compromises, macro magic and #ifdef | ||
20 | mess. The timers.c code is very "tightly coded" around jiffies and | ||
21 | 32-bitness assumptions, and has been honed and micro-optimized for a | ||
22 | relatively narrow use case (jiffies in a relatively narrow HZ range) | ||
23 | for many years - and thus even small extensions to it easily break | ||
24 | the wheel concept, leading to even worse compromises. The timer wheel | ||
25 | code is very good and tight code, there's zero problems with it in its | ||
26 | current usage - but it is simply not suitable to be extended for | ||
27 | high-res timers. | ||
28 | |||
29 | - the unpredictable [O(N)] overhead of cascading leads to delays which | ||
30 | necessitate a more complex handling of high resolution timers, which | ||
31 | in turn decreases robustness. Such a design still led to rather large | ||
32 | timing inaccuracies. Cascading is a fundamental property of the timer | ||
33 | wheel concept, it cannot be 'designed out' without unevitably | ||
34 | degrading other portions of the timers.c code in an unacceptable way. | ||
35 | |||
36 | - the implementation of the current posix-timer subsystem on top of | ||
37 | the timer wheel has already introduced a quite complex handling of | ||
38 | the required readjusting of absolute CLOCK_REALTIME timers at | ||
39 | settimeofday or NTP time - further underlying our experience by | ||
40 | example: that the timer wheel data structure is too rigid for high-res | ||
41 | timers. | ||
42 | |||
43 | - the timer wheel code is most optimal for use cases which can be | ||
44 | identified as "timeouts". Such timeouts are usually set up to cover | ||
45 | error conditions in various I/O paths, such as networking and block | ||
46 | I/O. The vast majority of those timers never expire and are rarely | ||
47 | recascaded because the expected correct event arrives in time so they | ||
48 | can be removed from the timer wheel before any further processing of | ||
49 | them becomes necessary. Thus the users of these timeouts can accept | ||
50 | the granularity and precision tradeoffs of the timer wheel, and | ||
51 | largely expect the timer subsystem to have near-zero overhead. | ||
52 | Accurate timing for them is not a core purpose - in fact most of the | ||
53 | timeout values used are ad-hoc. For them it is at most a necessary | ||
54 | evil to guarantee the processing of actual timeout completions | ||
55 | (because most of the timeouts are deleted before completion), which | ||
56 | should thus be as cheap and unintrusive as possible. | ||
57 | |||
58 | The primary users of precision timers are user-space applications that | ||
59 | utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel | ||
60 | users like drivers and subsystems which require precise timed events | ||
61 | (e.g. multimedia) can benefit from the availability of a separate | ||
62 | high-resolution timer subsystem as well. | ||
63 | |||
64 | While this subsystem does not offer high-resolution clock sources just | ||
65 | yet, the hrtimer subsystem can be easily extended with high-resolution | ||
66 | clock capabilities, and patches for that exist and are maturing quickly. | ||
67 | The increasing demand for realtime and multimedia applications along | ||
68 | with other potential users for precise timers gives another reason to | ||
69 | separate the "timeout" and "precise timer" subsystems. | ||
70 | |||
71 | Another potential benefit is that such a separation allows even more | ||
72 | special-purpose optimization of the existing timer wheel for the low | ||
73 | resolution and low precision use cases - once the precision-sensitive | ||
74 | APIs are separated from the timer wheel and are migrated over to | ||
75 | hrtimers. E.g. we could decrease the frequency of the timeout subsystem | ||
76 | from 250 Hz to 100 HZ (or even smaller). | ||
77 | |||
78 | hrtimer subsystem implementation details | ||
79 | ---------------------------------------- | ||
80 | |||
81 | the basic design considerations were: | ||
82 | |||
83 | - simplicity | ||
84 | |||
85 | - data structure not bound to jiffies or any other granularity. All the | ||
86 | kernel logic works at 64-bit nanoseconds resolution - no compromises. | ||
87 | |||
88 | - simplification of existing, timing related kernel code | ||
89 | |||
90 | another basic requirement was the immediate enqueueing and ordering of | ||
91 | timers at activation time. After looking at several possible solutions | ||
92 | such as radix trees and hashes, we chose the red black tree as the basic | ||
93 | data structure. Rbtrees are available as a library in the kernel and are | ||
94 | used in various performance-critical areas of e.g. memory management and | ||
95 | file systems. The rbtree is solely used for time sorted ordering, while | ||
96 | a separate list is used to give the expiry code fast access to the | ||
97 | queued timers, without having to walk the rbtree. | ||
98 | |||
99 | (This separate list is also useful for later when we'll introduce | ||
100 | high-resolution clocks, where we need separate pending and expired | ||
101 | queues while keeping the time-order intact.) | ||
102 | |||
103 | Time-ordered enqueueing is not purely for the purposes of | ||
104 | high-resolution clocks though, it also simplifies the handling of | ||
105 | absolute timers based on a low-resolution CLOCK_REALTIME. The existing | ||
106 | implementation needed to keep an extra list of all armed absolute | ||
107 | CLOCK_REALTIME timers along with complex locking. In case of | ||
108 | settimeofday and NTP, all the timers (!) had to be dequeued, the | ||
109 | time-changing code had to fix them up one by one, and all of them had to | ||
110 | be enqueued again. The time-ordered enqueueing and the storage of the | ||
111 | expiry time in absolute time units removes all this complex and poorly | ||
112 | scaling code from the posix-timer implementation - the clock can simply | ||
113 | be set without having to touch the rbtree. This also makes the handling | ||
114 | of posix-timers simpler in general. | ||
115 | |||
116 | The locking and per-CPU behavior of hrtimers was mostly taken from the | ||
117 | existing timer wheel code, as it is mature and well suited. Sharing code | ||
118 | was not really a win, due to the different data structures. Also, the | ||
119 | hrtimer functions now have clearer behavior and clearer names - such as | ||
120 | hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly | ||
121 | equivalent to del_timer() and del_timer_sync()] - so there's no direct | ||
122 | 1:1 mapping between them on the algorithmical level, and thus no real | ||
123 | potential for code sharing either. | ||
124 | |||
125 | Basic data types: every time value, absolute or relative, is in a | ||
126 | special nanosecond-resolution type: ktime_t. The kernel-internal | ||
127 | representation of ktime_t values and operations is implemented via | ||
128 | macros and inline functions, and can be switched between a "hybrid | ||
129 | union" type and a plain "scalar" 64bit nanoseconds representation (at | ||
130 | compile time). The hybrid union type optimizes time conversions on 32bit | ||
131 | CPUs. This build-time-selectable ktime_t storage format was implemented | ||
132 | to avoid the performance impact of 64-bit multiplications and divisions | ||
133 | on 32bit CPUs. Such operations are frequently necessary to convert | ||
134 | between the storage formats provided by kernel and userspace interfaces | ||
135 | and the internal time format. (See include/linux/ktime.h for further | ||
136 | details.) | ||
137 | |||
138 | hrtimers - rounding of timer values | ||
139 | ----------------------------------- | ||
140 | |||
141 | the hrtimer code will round timer events to lower-resolution clocks | ||
142 | because it has to. Otherwise it will do no artificial rounding at all. | ||
143 | |||
144 | one question is, what resolution value should be returned to the user by | ||
145 | the clock_getres() interface. This will return whatever real resolution | ||
146 | a given clock has - be it low-res, high-res, or artificially-low-res. | ||
147 | |||
148 | hrtimers - testing and verification | ||
149 | ---------------------------------- | ||
150 | |||
151 | We used the high-resolution clock subsystem ontop of hrtimers to verify | ||
152 | the hrtimer implementation details in praxis, and we also ran the posix | ||
153 | timer tests in order to ensure specification compliance. We also ran | ||
154 | tests on low-resolution clocks. | ||
155 | |||
156 | The hrtimer patch converts the following kernel functionality to use | ||
157 | hrtimers: | ||
158 | |||
159 | - nanosleep | ||
160 | - itimers | ||
161 | - posix-timers | ||
162 | |||
163 | The conversion of nanosleep and posix-timers enabled the unification of | ||
164 | nanosleep and clock_nanosleep. | ||
165 | |||
166 | The code was successfully compiled for the following platforms: | ||
167 | |||
168 | i386, x86_64, ARM, PPC, PPC64, IA64 | ||
169 | |||
170 | The code was run-tested on the following platforms: | ||
171 | |||
172 | i386(UP/SMP), x86_64(UP/SMP), ARM, PPC | ||
173 | |||
174 | hrtimers were also integrated into the -rt tree, along with a | ||
175 | hrtimers-based high-resolution clock implementation, so the hrtimers | ||
176 | code got a healthy amount of testing and use in practice. | ||
177 | |||
178 | Thomas Gleixner, Ingo Molnar | ||