author | Linus Walleij <linus.walleij@linaro.org> | 2014-07-10 03:52:27 -0400 |
---|---|---|
committer | John Stultz <john.stultz@linaro.org> | 2014-07-23 18:07:13 -0400 |
commit | 7806f60e1d205db46eca6ad24429b3f86eda2588 (patch) | |
tree | cac0d38e36b81b91d58b11c6eef36631b8c058db /Documentation/timers | |
parent | 375f45b5b53a91dfa8f0c11328e0e044f82acbed (diff) |
clocksource: document some basic timekeeping concepts
This adds some documentation about clock sources, clock events,
the weak sched_clock() function and delay timers that answers
questions that repeatedly arise on the mailing lists.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Colin Cross <ccross@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Diffstat (limited to 'Documentation/timers')
-rw-r--r-- | Documentation/timers/00-INDEX | 2 | ||||
-rw-r--r-- | Documentation/timers/timekeeping.txt | 179 |
2 files changed, 181 insertions, 0 deletions
diff --git a/Documentation/timers/00-INDEX b/Documentation/timers/00-INDEX index 6d042dc1cce0..ee212a27772f 100644 --- a/Documentation/timers/00-INDEX +++ b/Documentation/timers/00-INDEX | |||
@@ -12,6 +12,8 @@ Makefile | |||
12 | - Build and link hpet_example | 12 | - Build and link hpet_example |
13 | NO_HZ.txt | 13 | NO_HZ.txt |
14 | - Summary of the different methods for the scheduler clock-interrupts management. | 14 | - Summary of the different methods for the scheduler clock-interrupts management. |
15 | timekeeping.txt | ||
16 | - Clock sources, clock events, sched_clock() and delay timer notes | ||
15 | timers-howto.txt | 17 | timers-howto.txt |
16 | - how to insert delays in the kernel the right (tm) way. | 18 | - how to insert delays in the kernel the right (tm) way. |
17 | timer_stats.txt | 19 | timer_stats.txt |
diff --git a/Documentation/timers/timekeeping.txt b/Documentation/timers/timekeeping.txt new file mode 100644 index 000000000000..f3a8cf28f802 --- /dev/null +++ b/Documentation/timers/timekeeping.txt | |||
@@ -0,0 +1,179 @@ | |||
1 | Clock sources, Clock events, sched_clock() and delay timers | ||
2 | ----------------------------------------------------------- | ||
3 | |||
4 | This document tries to briefly explain some basic kernel timekeeping | ||
5 | abstractions. It partly pertains to the drivers usually found in | ||
6 | drivers/clocksource in the kernel tree, but the code may be spread out | ||
7 | across the kernel. | ||
8 | |||
9 | If you grep through the kernel source you will find a number of architecture- | ||
10 | specific implementations of clock sources, clockevents and several likewise | ||
11 | architecture-specific overrides of the sched_clock() function and some | ||
12 | delay timers. | ||
13 | |||
14 | To provide timekeeping for your platform, the clock source provides | ||
15 | the basic timeline, whereas clock events shoot interrupts on certain points | ||
16 | on this timeline, providing facilities such as high-resolution timers. | ||
17 | sched_clock() is used for scheduling and timestamping, and delay timers | ||
18 | provide an accurate delay source using hardware counters. | ||
19 | |||
20 | |||
21 | Clock sources | ||
22 | ------------- | ||
23 | |||
24 | The purpose of the clock source is to provide a timeline for the system that | ||
25 | tells you where you are in time. For example issuing the command 'date' on | ||
26 | a Linux system will eventually read the clock source to determine exactly | ||
27 | what time it is. | ||
28 | |||
29 | Typically the clock source is a monotonic, atomic counter which will provide | ||
30 | n bits which count from 0 to (2^n)-1 and then wrap around to 0 and start over. | ||
31 | It will ideally NEVER stop ticking as long as the system is running. It | ||
32 | may stop during system suspend. | ||
33 | |||
34 | The clock source shall have as high resolution as possible, and the frequency | ||
35 | shall be as stable and correct as possible as compared to a real-world wall | ||
36 | clock. It should not move unpredictably back and forth in time or miss a few | ||
37 | cycles here and there. | ||
38 | |||
39 | It must be immune to the kind of effects that occur in hardware where, e.g., | ||
40 | the counter register is read in two phases on the bus: the lowest 16 bits first | ||
41 | and the higher 16 bits in a second bus cycle, with the counter bits | ||
42 | potentially being updated in between, leading to the risk of very strange | ||
43 | values from the counter. | ||
44 | |||
45 | When the wall-clock accuracy of the clock source isn't satisfactory, there | ||
46 | are various quirks and layers in the timekeeping code for e.g. synchronizing | ||
47 | the user-visible time to RTC clocks in the system or against networked time | ||
48 | servers using NTP, but all they do basically is update an offset against | ||
49 | the clock source, which provides the fundamental timeline for the system. | ||
50 | These measures do not affect the clock source per se, they only adapt the | ||
51 | system to the shortcomings of it. | ||
52 | |||
53 | The clock source struct shall provide means to translate the provided counter | ||
54 | into a nanosecond value as an unsigned long long (unsigned 64 bit) number. | ||
55 | Since this operation may be invoked very often, doing this in a strict | ||
56 | mathematical sense is not desirable: instead the number is taken as close as | ||
57 | possible to a nanosecond value using only the arithmetic operations | ||
58 | multiply and shift, so in clocksource_cyc2ns() you find: | ||
59 | |||
60 | ns ~= (clocksource * mult) >> shift | ||
61 | |||
62 | You will find a number of helper functions in the clock source code intended | ||
63 | to aid in providing these mult and shift values, such as | ||
64 | clocksource_khz2mult(), clocksource_hz2mult() that help determine the | ||
65 | mult factor from a fixed shift, and clocksource_register_hz() and | ||
66 | clocksource_register_khz() which will help out assigning both shift and mult | ||
67 | factors using the frequency of the clock source as the only input. | ||
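
As a rough user-space illustration of the mult/shift arithmetic above (this is
not the kernel's own code; the names sketch_hz2mult() and sketch_cyc2ns() are
made up for this note), a mult factor can be derived from the counter frequency
and a fixed shift roughly like this:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* Hypothetical helper: choose mult so that (cycles * mult) >> shift ~= ns. */
static uint32_t sketch_hz2mult(uint32_t hz, uint32_t shift)
{
	uint64_t tmp;

	assert(hz != 0 && shift < 32);
	tmp = SKETCH_NSEC_PER_SEC << shift;  /* ns per second, scaled by 2^shift */
	tmp += hz / 2;                       /* round to nearest */
	tmp /= hz;
	assert(tmp <= UINT32_MAX);           /* mult is assumed to be 32 bits wide */
	return (uint32_t)tmp;
}

/* The conversion itself: one multiply and one shift, no division. */
static uint64_t sketch_cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

With a 100 MHz counter and shift = 24 this gives mult = 167772160, and one
second's worth of cycles converts to 1000000000 ns. Note that the product
cycles * mult can overflow 64 bits for large cycle counts, so a sketch like
this is only valid for bounded intervals.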
68 | |||
69 | For real simple clock sources accessed from a single I/O memory location | ||
70 | there is nowadays even clocksource_mmio_init() which will take a memory | ||
71 | location, bit width, a parameter telling whether the counter in the | ||
72 | register counts up or down, and the timer clock rate, and then conjure all | ||
73 | necessary parameters. | ||
74 | |||
75 | Since a 32-bit counter at say 100 MHz will wrap around to zero after some 43 | ||
76 | seconds, the code handling the clock source will have to compensate for this. | ||
77 | That is the reason why the clock source struct also contains a 'mask' | ||
78 | member telling how many bits of the source are valid. This way the timekeeping | ||
79 | code knows when the counter will wrap around and can insert the necessary | ||
80 | compensation code on both sides of the wrap point so that the system timeline | ||
81 | remains monotonic. | ||
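
The wrap handling described above essentially comes down to a masked
subtraction. A minimal sketch, assuming a free-running up-counter that is read
at least once per wrap period (the function names are hypothetical):

```c
#include <stdint.h>

/* Mask with the low 'bits' bits set, for a counter that is 'bits' wide. */
static uint64_t sketch_mask(unsigned int bits)
{
	return bits < 64 ? (1ULL << bits) - 1 : ~0ULL;
}

/* Cycles elapsed between two reads; correct across at most one wrap. */
static uint64_t sketch_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	return (now - last) & mask;
}
```

For a 32-bit counter, a previous read of 0xfffffffb followed by a read of 5
yields a delta of 10 cycles, even though the raw value went "backwards".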
82 | |||
83 | |||
84 | Clock events | ||
85 | ------------ | ||
86 | |||
87 | Clock events are the conceptual reverse of clock sources: they take a | ||
88 | desired time specification value and calculate the values to poke into | ||
89 | hardware timer registers. | ||
90 | |||
91 | Clock events are orthogonal to clock sources. The same hardware | ||
92 | and register range may be used for the clock event, but it is essentially | ||
93 | a different thing. The hardware driving clock events has to be able to | ||
94 | fire interrupts, so as to trigger events on the system timeline. On an SMP | ||
95 | system, it is ideal (and customary) to have one such event driving timer per | ||
96 | CPU core, so that each core can trigger events independently of any other | ||
97 | core. | ||
98 | |||
99 | You will notice that the clock event device code is based on the same basic | ||
100 | idea about translating counters to nanoseconds using mult and shift | ||
101 | arithmetic, and you find the same family of helper functions again for | ||
102 | assigning these values. The clock event driver does not need a 'mask' | ||
103 | attribute however: the system will not try to plan events beyond the time | ||
104 | horizon of the clock event. | ||
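
To illustrate the reverse direction, here is a small user-space sketch that
turns a requested delay in nanoseconds into timer ticks with the same kind of
mult and shift arithmetic. The names and the clamping policy are assumptions
made for this example, not the kernel's clock event code:

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper: mult such that (ns * mult) >> shift ~= timer ticks.
 * The caller is assumed to pick a shift for which the result fits in 32 bits.
 */
static uint32_t sketch_evt_mult(uint32_t hz, uint32_t shift)
{
	return (uint32_t)(((uint64_t)hz << shift) / SKETCH_NSEC_PER_SEC);
}

/* Convert a delay in ns to ticks, clamped to what the timer can hold. */
static uint64_t sketch_ns2ticks(uint64_t ns, uint32_t mult, uint32_t shift,
				uint64_t max_ticks)
{
	uint64_t ticks = (ns * mult) >> shift;

	if (ticks < 1)
		ticks = 1;          /* never program a zero-length timeout */
	if (ticks > max_ticks)
		ticks = max_ticks;  /* stay within the timer's horizon */
	return ticks;
}
```

The clamp against max_ticks stands in for the "time horizon" mentioned above:
a request that lies beyond what the hardware can count is shortened rather
than wrapped.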
105 | |||
106 | |||
107 | sched_clock() | ||
108 | ------------- | ||
109 | |||
110 | In addition to the clock sources and clock events there is a special weak | ||
111 | function in the kernel called sched_clock(). This function shall return the | ||
112 | number of nanoseconds since the system was started. An architecture may or | ||
113 | may not provide an implementation of sched_clock() on its own. If a local | ||
114 | implementation is not provided, the system jiffy counter will be used as | ||
115 | sched_clock(). | ||
116 | |||
117 | As the name suggests, sched_clock() is used for scheduling the system, | ||
118 | determining the absolute timeslice for a certain process in the CFS scheduler | ||
119 | for example. It is also used for printk timestamps when you have selected to | ||
120 | include time information in printk for things like bootcharts. | ||
121 | |||
122 | Compared to clock sources, sched_clock() has to be very fast: it is called | ||
123 | much more often, especially by the scheduler. If you have to make a trade-off | ||
124 | between speed and accuracy, you may, unlike for the clock source, sacrifice | ||
125 | accuracy for speed in sched_clock(). It does, however, require some of the same | ||
126 | basic characteristics as the clock source, i.e. it should be monotonic. | ||
127 | |||
128 | The sched_clock() function may wrap only on unsigned long long boundaries, | ||
129 | i.e. after 64 bits. Since this is a nanosecond value this will mean it wraps | ||
130 | after circa 585 years. (For most practical systems this means "never".) | ||
131 | |||
132 | If an architecture does not provide its own implementation of this function, | ||
133 | it will fall back to using jiffies, limiting its resolution to one jiffy, i.e. | ||
134 | 1/HZ seconds for the architecture. This will affect scheduling accuracy | ||
135 | and will likely show up in system benchmarks. | ||
136 | |||
137 | The clock driving sched_clock() may stop or reset to zero during system | ||
138 | suspend/sleep. This does not matter to the function it serves of scheduling | ||
139 | events on the system. However it may result in interesting timestamps in | ||
140 | printk(). | ||
141 | |||
142 | The sched_clock() function should be callable in any context, IRQ- and | ||
143 | NMI-safe and return a sane value in any context. | ||
144 | |||
145 | Some architectures may have a limited set of time sources and lack a nice | ||
146 | counter to derive a 64-bit nanosecond value, so for example on the ARM | ||
147 | architecture, special helper functions have been created to provide a | ||
148 | sched_clock() nanosecond base from a 16- or 32-bit counter. Sometimes the | ||
149 | same counter that is also used as clock source is used for this purpose. | ||
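
One way to picture how a narrow counter can back a 64-bit nanosecond value is
to keep an epoch (the counter value and accumulated nanoseconds at the last
update) and add the masked delta since then. The following is a user-space
sketch under that assumption; the struct and function names are hypothetical,
and locking and suspend handling are left out entirely:

```c
#include <stdint.h>

struct sketch_sched_clock {
	uint64_t epoch_ns;   /* nanoseconds accumulated at the last update */
	uint64_t epoch_cyc;  /* raw counter value at the last update */
	uint64_t mask;       /* valid bits of the hardware counter */
	uint32_t mult;       /* cycles -> ns multiplier */
	uint32_t shift;      /* cycles -> ns shift */
};

/* Nanoseconds corresponding to raw counter value 'cyc'. */
static uint64_t sketch_sched_clock_read(const struct sketch_sched_clock *sc,
					uint64_t cyc)
{
	uint64_t delta = (cyc - sc->epoch_cyc) & sc->mask;

	return sc->epoch_ns + ((delta * sc->mult) >> sc->shift);
}

/* Must run at least once per counter wrap period to move the epoch forward. */
static void sketch_sched_clock_update(struct sketch_sched_clock *sc,
				      uint64_t cyc)
{
	sc->epoch_ns = sketch_sched_clock_read(sc, cyc);
	sc->epoch_cyc = cyc & sc->mask;
}
```

With a 16-bit counter at 1 MHz (mult = 1000, shift = 0), updating the epoch at
counter value 60000 and then reading the wrapped value 4464 returns 70000000
ns, i.e. the timeline keeps increasing across the wrap.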
150 | |||
151 | On SMP systems, it is crucial for performance that sched_clock() can be called | ||
152 | independently on each CPU without any synchronization performance hits. | ||
153 | Some hardware (such as the x86 TSC) will cause the sched_clock() function to | ||
154 | drift between the CPUs on the system. The kernel can work around this by | ||
155 | enabling the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK option. This is another aspect | ||
156 | that makes sched_clock() different from the ordinary clock source. | ||
157 | |||
158 | |||
159 | Delay timers (some architectures only) | ||
160 | -------------------------------------- | ||
161 | |||
162 | On systems with variable CPU frequency, the various kernel delay() functions | ||
163 | will sometimes behave strangely. Basically these delays usually use a hard | ||
164 | loop to delay a certain number of jiffy fractions using a "lpj" (loops per | ||
165 | jiffy) value, calibrated on boot. | ||
166 | |||
167 | Let's hope that your system is running on maximum frequency when this value | ||
168 | is calibrated: as an effect when the frequency is geared down to half the | ||
169 | full frequency, any delay() will be twice as long. Usually this does not | ||
170 | hurt, as you're commonly requesting that amount of delay *or more*. But | ||
171 | basically the semantics are quite unpredictable on such systems. | ||
172 | |||
173 | Enter timer-based delays. Using these, a timer read may be used instead of | ||
174 | a hard-coded loop for providing the desired delay. | ||
175 | |||
176 | This is done by declaring a struct delay_timer and assigning the appropriate | ||
177 | function pointers and rate settings for this delay timer. | ||
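
The idea of a timer-based delay can be sketched in user space as follows. The
descriptor below is hypothetical and only loosely modelled on the concept; the
field names are assumptions for this example, and the "counter" is a stand-in
that advances by one tick per read so that the example is deterministic:

```c
#include <stdint.h>

/* Hypothetical descriptor: a counter read function and its rate. */
struct sketch_delay_timer {
	uint32_t (*read)(void);  /* returns the free-running counter value */
	uint32_t freq_hz;        /* counter frequency in Hz */
};

/* Stand-in counter for illustration: advances one tick per read. */
static uint32_t sketch_fake_ticks;

static uint32_t sketch_fake_read(void)
{
	return sketch_fake_ticks++;
}

/*
 * Busy-wait for at least 'usecs' microseconds of counter time.
 * Assumes the requested delay is shorter than the counter's wrap period.
 */
static void sketch_timer_udelay(const struct sketch_delay_timer *t,
				uint32_t usecs)
{
	/* Round up so the delay is never shorter than requested. */
	uint64_t cycles = ((uint64_t)usecs * t->freq_hz + 999999) / 1000000;
	uint32_t start = t->read();

	/* Unsigned subtraction stays correct across a counter wrap. */
	while ((uint32_t)(t->read() - start) < cycles)
		;
}
```

Because the loop compares against a hardware counter rather than counting
loop iterations, the delay length does not depend on the CPU frequency at the
time of the call.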
178 | |||
179 | This is available on some architectures like OpenRISC or ARM. | ||