Diffstat (limited to 'Documentation/timers/highres.rst')
| -rw-r--r-- | Documentation/timers/highres.rst | 250 |
1 files changed, 250 insertions, 0 deletions
diff --git a/Documentation/timers/highres.rst b/Documentation/timers/highres.rst
new file mode 100644
index 000000000000..bde5eb7e5c9e
--- /dev/null
+++ b/Documentation/timers/highres.rst
@@ -0,0 +1,250 @@
| 1 | ===================================================== | ||
| 2 | High resolution timers and dynamic ticks design notes | ||
| 3 | ===================================================== | ||
| 4 | |||
| 5 | Further information can be found in the paper of the OLS 2006 talk "hrtimers | ||
| 6 | and beyond". The paper is part of the OLS 2006 Proceedings Volume 1, which can | ||
| 7 | be found on the OLS website: | ||
| 8 | https://www.kernel.org/doc/ols/2006/ols2006v1-pages-333-346.pdf | ||
| 9 | |||
| 10 | The slides to this talk are available from: | ||
| 11 | http://www.cs.columbia.edu/~nahum/w6998/papers/ols2006-hrtimers-slides.pdf | ||
| 12 | |||
| 13 | The slides contain five figures (pages 2, 15, 18, 20, 22), which illustrate the | ||
| 14 | changes in the time(r) related Linux subsystems. Figure #1 (p. 2) shows the | ||
| 15 | design of the Linux time(r) system before hrtimers and other building blocks | ||
| 16 | got merged into mainline. | ||
| 17 | |||
| 18 | Note: the paper and the slides are talking about "clock event source", while we | ||
| 19 | switched to the name "clock event devices" in the meantime. | ||
| 20 | |||
| 21 | The design contains the following basic building blocks: | ||
| 22 | |||
| 23 | - hrtimer base infrastructure | ||
| 24 | - timeofday and clock source management | ||
| 25 | - clock event management | ||
| 26 | - high resolution timer functionality | ||
| 27 | - dynamic ticks | ||
| 28 | |||
| 29 | |||
| 30 | hrtimer base infrastructure | ||
| 31 | --------------------------- | ||
| 32 | |||
| 33 | The hrtimer base infrastructure was merged into the 2.6.16 kernel. Details of | ||
| 34 | the base implementation are covered in Documentation/timers/hrtimers.rst. See | ||
| 35 | also figure #2 (OLS slides p. 15). | ||
| 36 | |||
| 37 | The main differences from the timer wheel, which holds the armed timer_list type | ||
| 38 | timers, are: | ||
| 39 | |||
| 40 | - time ordered enqueueing into a rb-tree | ||
| 41 | - independent of ticks (the processing is based on nanoseconds) | ||
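
The two properties listed above can be illustrated with a toy model. This is a sketch in Python, not kernel code: a binary heap from ``heapq`` stands in for the rb-tree (both make the earliest expiry cheap to find), and the names ``TimerQueue``, ``enqueue``, ``next_expiry`` and ``run_expired`` are hypothetical.

```python
import heapq
import itertools


class TimerQueue:
    """Toy time-ordered timer queue; expiry times are absolute nanoseconds."""

    def __init__(self):
        self._heap = []                # entries are (expires_ns, seq, callback)
        self._seq = itertools.count()  # tie-breaker for timers with equal expiry

    def enqueue(self, expires_ns, callback):
        """Insert a timer; ordering depends only on its expiry time."""
        heapq.heappush(self._heap, (expires_ns, next(self._seq), callback))

    def next_expiry(self):
        """Return the earliest expiry in ns, or None if no timer is armed."""
        return self._heap[0][0] if self._heap else None

    def run_expired(self, now_ns):
        """Run every callback whose expiry is <= now_ns, earliest first."""
        results = []
        while self._heap and self._heap[0][0] <= now_ns:
            expires_ns, _, callback = heapq.heappop(self._heap)
            results.append(callback(expires_ns))
        return results
```

No tick or jiffies value appears anywhere in the model: a timer armed for 1,000,250 ns expires at that time rather than at the next tick boundary.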
| 42 | |||
| 43 | |||
| 44 | timeofday and clock source management | ||
| 45 | ------------------------------------- | ||
| 46 | |||
| 47 | John Stultz's Generic Time Of Day (GTOD) framework moves a large portion of | ||
| 48 | code out of the architecture-specific areas into a generic management | ||
| 49 | framework, as illustrated in figure #3 (OLS slides p. 18). The architecture | ||
| 50 | specific portion is reduced to the low level hardware details of the clock | ||
| 51 | sources, which are registered in the framework and selected on a quality based | ||
| 52 | decision. The low level code provides hardware setup and readout routines and | ||
| 53 | initializes data structures, which are used by the generic time keeping code to | ||
| 54 | convert the clock ticks to nanosecond based time values. All other time keeping | ||
| 55 | related functionality is moved into the generic code. The GTOD base patch got | ||
| 56 | merged into the 2.6.18 kernel. | ||
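
The split described above, where low level code supplies readout parameters and generic code does the selection and conversion, can be sketched as follows. This is an illustrative Python model under stated assumptions, not the kernel implementation: the helpers ``make_clock_source`` and ``select_clock_source`` are hypothetical, and the fixed-point conversion ``(cycles * mult) >> shift`` is assumed here as one common way to turn clock ticks into nanoseconds.

```python
from dataclasses import dataclass


@dataclass
class ClockSource:
    name: str
    rating: int  # quality value; higher is assumed to mean better
    mult: int    # multiplier of the fixed-point cycle -> ns conversion
    shift: int   # right shift applied after the multiplication

    def cycles_to_ns(self, cycles):
        # Fixed-point arithmetic: ns = cycles * mult / 2**shift
        return (cycles * self.mult) >> self.shift


def make_clock_source(name, rating, freq_hz, shift=22):
    """Derive mult so that freq_hz cycles convert to about 10**9 ns."""
    mult = (1_000_000_000 << shift) // freq_hz
    return ClockSource(name, rating, mult, shift)


def select_clock_source(registered):
    """Generic code picks the registered clock source with the best rating."""
    return max(registered, key=lambda cs: cs.rating)
```

With a hypothetical 1 MHz source, ``make_clock_source("slow", 100, 1_000_000).cycles_to_ns(5)`` yields 5000 ns, since one cycle lasts 1000 ns.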
| 57 | |||
| 58 | Further information about the Generic Time Of Day framework is available in the | ||
| 59 | OLS 2005 Proceedings Volume 1: | ||
| 60 | |||
| 61 | http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf | ||
| 62 | |||
| 63 | The paper "We Are Not Getting Any Younger: A New Approach to Time and | ||
| 64 | Timers" was written by J. Stultz, D.V. Hart, & N. Aravamudan. | ||
| 65 | |||
| 66 | Figure #3 (OLS slides p.18) illustrates the transformation. | ||
| 67 | |||
| 68 | |||
| 69 | clock event management | ||
| 70 | ---------------------- | ||
| 71 | |||
| 72 | While clock sources provide read access to the monotonically increasing time | ||
| 73 | value, clock event devices are used to schedule the next event | ||
| 74 | interrupt(s). The next event is currently defined to be periodic, with its | ||
| 75 | period defined at compile time. The setup and selection of the event device | ||
| 76 | for various event driven functionalities is hardwired into the architecture | ||
| 77 | dependent code. This results in duplicated code across all architectures and | ||
| 78 | makes it extremely difficult to change the configuration of the system to use | ||
| 79 | event interrupt devices other than those already built into the | ||
| 80 | architecture. Another implication of the current design is that it is necessary | ||
| 81 | to touch all the architecture-specific implementations in order to provide new | ||
| 82 | functionality like high resolution timers or dynamic ticks. | ||
| 83 | |||
| 84 | The clock events subsystem tries to address this problem by providing a generic | ||
| 85 | solution to manage clock event devices and their usage for the various clock | ||
| 86 | event driven kernel functionalities. The goal of the clock event subsystem is | ||
| 87 | to minimize the clock event related architecture dependent code to the pure | ||
| 88 | hardware related handling and to allow easy addition and utilization of new | ||
| 89 | clock event devices. It also minimizes the duplicated code across the | ||
| 90 | architectures as it provides generic functionality down to the interrupt | ||
| 91 | service handler, which is almost inherently hardware dependent. | ||
| 92 | |||
| 93 | Clock event devices are registered either by the architecture dependent boot | ||
| 94 | code or at module insertion time. Each clock event device fills a data | ||
| 95 | structure with clock-specific property parameters and callback functions. The | ||
| 96 | clock event management decides, by using the specified property parameters, the | ||
| 97 | set of system functions a clock event device will be used to support. This | ||
| 98 | includes the distinction of per-CPU and per-system global event devices. | ||
| 99 | |||
| 100 | System-level global event devices are used for the Linux periodic tick. Per-CPU | ||
| 101 | event devices are used to provide local CPU functionality such as process | ||
| 102 | accounting, profiling, and high resolution timers. | ||
| 103 | |||
| 104 | The management layer assigns one or more of the following functions to a clock | ||
| 105 | event device: | ||
| 106 | |||
| 107 | - system global periodic tick (jiffies update) | ||
| 108 | - cpu local update_process_times | ||
| 109 | - cpu local profiling | ||
| 110 | - cpu local next event interrupt (non periodic mode) | ||
| 111 | |||
| 112 | The clock event device delegates the selection of those timer interrupt related | ||
| 113 | functions completely to the management layer. The clock management layer stores | ||
| 114 | a function pointer in the device description structure, which has to be called | ||
| 115 | from the hardware level handler. This removes a lot of duplicated code from the | ||
| 116 | architecture specific timer interrupt handlers and hands the control over the | ||
| 117 | clock event devices and the assignment of timer interrupt related functionality | ||
| 118 | to the core code. | ||
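
The delegation described above can be modelled in a few lines. The sketch below is a simplified Python illustration, not the kernel API: the class, the ``assign_function`` helper and the three handler functions are hypothetical, and the decision rule is reduced to the per-CPU versus global distinction and a one-shot flag.

```python
class ClockEventDevice:
    """Hypothetical device description; only the fields the sketch needs."""

    def __init__(self, name, per_cpu):
        self.name = name
        self.per_cpu = per_cpu     # property parameter given at registration
        self.event_handler = None  # function pointer stored by the management layer

    def hardware_interrupt(self):
        # The hardware level handler only calls what the core code installed.
        return self.event_handler(self)


def global_periodic_tick(dev):
    return [f"{dev.name}: jiffies update"]


def local_periodic(dev):
    return [f"{dev.name}: update_process_times", f"{dev.name}: profiling"]


def local_next_event(dev):
    return [f"{dev.name}: next event interrupt"]


def assign_function(dev, one_shot):
    """Management layer decision, based on the device's property parameters."""
    if not dev.per_cpu:
        dev.event_handler = global_periodic_tick
    elif one_shot:
        dev.event_handler = local_next_event
    else:
        dev.event_handler = local_periodic
    return dev
```

The architecture specific handler never decides what a timer interrupt means. Changing the role of a device is a matter of the core code storing a different function pointer.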
| 119 | |||
| 120 | The clock event layer API is rather small. Aside from the clock event device | ||
| 121 | registration interface it provides functions to schedule the next event | ||
| 122 | interrupt, a clock event device notification service, and support for suspend and | ||
| 123 | resume. | ||
| 124 | |||
| 125 | The framework adds about 700 lines of code, which results in a 2KB increase of | ||
| 126 | the kernel binary size. The conversion of i386 removes about 100 lines of | ||
| 127 | code, which decreases the binary size by about 400 bytes. We believe that the | ||
| 128 | increase of flexibility and the avoidance of duplicated code across | ||
| 129 | architectures justifies the slight increase of the binary size. | ||
| 130 | |||
| 131 | The conversion of an architecture has no functional impact, but makes it | ||
| 132 | possible to use the high resolution and dynamic tick functionalities without any change | ||
| 133 | to the clock event device and timer interrupt code. After the conversion the | ||
| 134 | enabling of high resolution timers and dynamic ticks is simply provided by | ||
| 135 | adding the kernel/time/Kconfig file to the architecture specific Kconfig and | ||
| 136 | adding the dynamic tick specific calls to the idle routine (a total of 3 lines | ||
| 137 | added to the idle function and the Kconfig file). | ||
| 138 | |||
| 139 | Figure #4 (OLS slides p.20) illustrates the transformation. | ||
| 140 | |||
| 141 | |||
| 142 | high resolution timer functionality | ||
| 143 | ----------------------------------- | ||
| 144 | |||
| 145 | During system boot it is not possible to use the high resolution timer | ||
| 146 | functionality; making it possible would be difficult and would serve no | ||
| 147 | useful purpose. The initialization of the clock event device framework, the | ||
| 148 | clock source framework (GTOD) and hrtimers itself has to be done and | ||
| 149 | appropriate clock sources and clock event devices have to be registered before | ||
| 150 | the high resolution functionality can work. Up to the point where hrtimers are | ||
| 151 | initialized, the system works in the usual low resolution periodic mode. The | ||
| 152 | clock source and the clock event device layers provide notification functions | ||
| 153 | which inform hrtimers about availability of new hardware. hrtimers validates | ||
| 154 | the usability of the registered clock sources and clock event devices before | ||
| 155 | switching to high resolution mode. This also ensures that a kernel which is | ||
| 156 | configured for high resolution timers can run on a system which lacks the | ||
| 157 | necessary hardware support. | ||
| 158 | |||
| 159 | The high resolution timer code does not support SMP machines which have only | ||
| 160 | global clock event devices. The support of such hardware would involve IPI | ||
| 161 | calls when an interrupt happens. The overhead would be much larger than the | ||
| 162 | benefit. This is the reason why we currently disable high resolution and | ||
| 163 | dynamic ticks on i386 SMP systems which stop the local APIC in C3 power | ||
| 164 | state. A workaround is available as an idea, but the problem has not been | ||
| 165 | tackled yet. | ||
| 166 | |||
| 167 | The time ordered insertion of timers provides all the infrastructure to decide | ||
| 168 | whether the event device has to be reprogrammed when a timer is added. The | ||
| 169 | decision is made per timer base and synchronized across per-cpu timer bases in | ||
| 170 | a support function. The design allows the system to utilize separate per-CPU | ||
| 171 | clock event devices for the per-CPU timer bases, but currently only one | ||
| 172 | reprogrammable clock event device per-CPU is utilized. | ||
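
The reprogramming decision follows directly from the time ordering: the event device only needs to be touched when a newly added timer expires before the event it is currently programmed for. A minimal Python sketch of that rule for a single timer base follows. The ``TimerBase`` class and its fields are hypothetical, and the synchronization across per-CPU bases is left out.

```python
import bisect


class TimerBase:
    """Toy timer base that tracks what the event device is programmed for."""

    def __init__(self):
        self.expiries_ns = []      # armed timers, kept sorted, earliest first
        self.programmed_ns = None  # expiry the clock event device is set to

    def add_timer(self, expires_ns):
        """Enqueue a timer; return True if the event device was reprogrammed."""
        bisect.insort(self.expiries_ns, expires_ns)
        if self.programmed_ns is None or expires_ns < self.programmed_ns:
            self.programmed_ns = expires_ns  # new timer is now the earliest
            return True
        return False  # an earlier event is already programmed
```

Adding timers at 5000 ns, 9000 ns and 1000 ns in that order reprograms the device for the first and third timer only.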
| 173 | |||
| 174 | When the timer interrupt happens, the next event interrupt handler is called | ||
| 175 | from the clock event distribution code and moves expired timers from the | ||
| 176 | red-black tree to a separate doubly linked list and invokes the softirq | ||
| 177 | handler. An additional mode field in the hrtimer structure allows the system to | ||
| 178 | execute callback functions directly from the next event interrupt handler. This | ||
| 179 | is restricted to code which can safely be executed in the hard interrupt | ||
| 180 | context. This applies, for example, to the common case of a wakeup function as | ||
| 181 | used by nanosleep. The advantage of executing the handler in the interrupt | ||
| 182 | context is the avoidance of up to two context switches - from the interrupted | ||
| 183 | context to the softirq and to the task which is woken up by the expired | ||
| 184 | timer. | ||
| 185 | |||
| 186 | Once a system has switched to high resolution mode, the periodic tick is | ||
| 187 | switched off. This disables the per system global periodic clock event device - | ||
| 188 | e.g. the PIT on i386 SMP systems. | ||
| 189 | |||
| 190 | The periodic tick functionality is provided by a per-CPU hrtimer. The callback | ||
| 191 | function is executed in the next event interrupt context and updates jiffies | ||
| 192 | and calls update_process_times and profiling. The implementation of the hrtimer | ||
| 193 | based periodic tick is designed to be extended with dynamic tick functionality. | ||
| 194 | This allows a single clock event device to schedule both high resolution | ||
| 195 | timer and periodic events (jiffies tick, profiling, process accounting) on UP | ||
| 196 | systems. This has been proven to work with the PIT on i386 and the Incrementer | ||
| 197 | on PPC. | ||
| 198 | |||
| 199 | The softirq for running the hrtimer queues and executing the callbacks has been | ||
| 200 | separated from the tick bound timer softirq to allow accurate delivery of high | ||
| 201 | resolution timer signals which are used by itimer and POSIX interval | ||
| 202 | timers. The execution of this softirq can still be delayed by other softirqs, | ||
| 203 | but the overall latencies have been significantly improved by this separation. | ||
| 204 | |||
| 205 | Figure #5 (OLS slides p.22) illustrates the transformation. | ||
| 206 | |||
| 207 | |||
| 208 | dynamic ticks | ||
| 209 | ------------- | ||
| 210 | |||
| 211 | Dynamic ticks are the logical consequence of the hrtimer based periodic tick | ||
| 212 | replacement (sched_tick). The functionality of the sched_tick hrtimer is | ||
| 213 | extended by three functions: | ||
| 214 | |||
| 215 | - hrtimer_stop_sched_tick | ||
| 216 | - hrtimer_restart_sched_tick | ||
| 217 | - hrtimer_update_jiffies | ||
| 218 | |||
| 219 | hrtimer_stop_sched_tick() is called when a CPU goes into idle state. The code | ||
| 220 | evaluates the next scheduled timer event (from both hrtimers and the timer | ||
| 221 | wheel) and, if the next event is further away than the next tick, it | ||
| 222 | reprograms the sched_tick to this future event, to allow longer idle sleeps | ||
| 223 | without needless interruption by the periodic tick. The function is also | ||
| 224 | called when an interrupt happens during the idle period, which does not cause a | ||
| 225 | reschedule. The call is necessary as the interrupt handler might have armed a | ||
| 226 | new timer whose expiry time is before the time which was identified as the | ||
| 227 | nearest event in the previous call to hrtimer_stop_sched_tick. | ||
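
The decision made on idle entry can be written down as a small function. This is a Python sketch of the rule as described in the paragraph above, not the kernel code. The function name and parameters are hypothetical, and all times are absolute nanoseconds.

```python
def sched_tick_target(next_tick_ns, next_hrtimer_ns, next_wheel_ns):
    """Return the time the sched_tick should be programmed for on idle entry.

    The next scheduled timer event is the earlier of the next hrtimer and the
    next timer wheel expiry. The tick is deferred to that event only when the
    event lies beyond the next regular tick.
    """
    next_event_ns = min(next_hrtimer_ns, next_wheel_ns)
    if next_event_ns > next_tick_ns:
        return next_event_ns  # skip the ticks in between, sleep longer
    return next_tick_ns       # an event is due soon, keep the regular tick
```

For example, with the next tick due at 4 ms and the nearest timer event at 30 ms, the function returns 30 ms, so the idle CPU skips the ticks in between. Calling it again after an interrupt that armed a timer at 10 ms returns 10 ms, which is why the re-evaluation described above is needed.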
| 228 | |||
| 229 | hrtimer_restart_sched_tick() is called when the CPU leaves the idle state before | ||
| 230 | it calls schedule(). hrtimer_restart_sched_tick() resumes the periodic tick, | ||
| 231 | which is kept active until the next call to hrtimer_stop_sched_tick(). | ||
| 232 | |||
| 233 | hrtimer_update_jiffies() is called from irq_enter() when an interrupt happens | ||
| 234 | in the idle period to make sure that jiffies is up to date and the interrupt | ||
| 235 | handler does not have to deal with a possibly stale jiffies value. | ||
| 236 | |||
| 237 | The dynamic tick feature provides statistical values which are exported to | ||
| 238 | userspace via /proc/stat and can be made available for enhanced power | ||
| 239 | management control. | ||
| 240 | |||
| 241 | The implementation leaves room for further development like full tickless | ||
| 242 | systems, where the time slice is controlled by the scheduler, variable | ||
| 243 | frequency profiling, and a complete removal of jiffies in the future. | ||
| 244 | |||
| 245 | |||
| 246 | Aside from the current initial submission of i386 support, the patchset has been | ||
| 247 | extended to x86_64 and ARM already. Initial (work in progress) support is also | ||
| 248 | available for MIPS and PowerPC. | ||
| 249 | |||
| 250 | Thomas, Ingo | ||
