clocksource: document some basic timekeeping concepts

This adds some documentation about clock sources, clock events, the weak
sched_clock() function and delay timers that answers questions that
repeatedly arise on the mailing lists.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Colin Cross <ccross@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
author: Linus Walleij <linus.walleij@linaro.org> 2014-07-10 03:52:27 -0400
committer: John Stultz <john.stultz@linaro.org> 2014-07-23 18:07:13 -0400
commit: 7806f60e1d205db46eca6ad24429b3f86eda2588 (patch)
tree: cac0d38e36b81b91d58b11c6eef36631b8c058db /Documentation/timers
parent: 375f45b5b53a91dfa8f0c11328e0e044f82acbed (diff)
2 files changed, 181 insertions, 0 deletions
diff --git a/Documentation/timers/00-INDEX b/Documentation/timers/00-INDEX
index 6d042dc1cce0..ee212a27772f 100644
--- a/Documentation/timers/00-INDEX
+++ b/Documentation/timers/00-INDEX
@@ -12,6 +12,8 @@ Makefile
        - Build and link hpet_example
 NO_HZ.txt
        - Summary of the different methods for the scheduler clock-interrupts management.
+timekeeping.txt
+        - Clock sources, clock events, sched_clock() and delay timer notes
 timers-howto.txt
        - how to insert delays in the kernel the right (tm) way.
 timer_stats.txt
diff --git a/Documentation/timers/timekeeping.txt b/Documentation/timers/timekeeping.txt
new file mode 100644
index 000000000000..f3a8cf28f802
--- /dev/null
+++ b/Documentation/timers/timekeeping.txt
@@ -0,0 +1,179 @@
+Clock sources, Clock events, sched_clock() and delay timers
+-----------------------------------------------------------
+
+This document tries to briefly explain some basic kernel timekeeping
+abstractions. It partly pertains to the drivers usually found in
+drivers/clocksource in the kernel tree, but the code may be spread out
+across the kernel.
+
+If you grep through the kernel source you will find a number of architecture-
+specific implementations of clock sources, clockevents and several likewise
+architecture-specific overrides of the sched_clock() function and some
+delay timers.
+
+To provide timekeeping for your platform, the clock source provides
+the basic timeline, whereas clock events shoot interrupts on certain points
+on this timeline, providing facilities such as high-resolution timers.
+sched_clock() is used for scheduling and timestamping, and delay timers
+provide an accurate delay source using hardware counters.
+
+
+Clock sources
+-------------
+
+The purpose of the clock source is to provide a timeline for the system that
+tells you where you are in time. For example issuing the command 'date' on
+a Linux system will eventually read the clock source to determine exactly
+what time it is.
+
+Typically the clock source is a monotonic, atomic counter which will provide
+n bits which count from 0 to (2^n)-1 and then wrap around to 0 and start over.
+It will ideally NEVER stop ticking as long as the system is running. It
+may stop during system suspend.
+
+The clock source shall have as high resolution as possible, and the frequency
+shall be as stable and correct as possible as compared to a real-world wall
+clock. It should not move unpredictably back and forth in time or miss a few
+cycles here and there.
+
+It must be immune to the kind of effects that occur in hardware where e.g.
+the counter register is read in two phases on the bus, lowest 16 bits first
+and the higher 16 bits in a second bus cycle, with the counter bits
+potentially being updated in between, leading to the risk of very strange
+values from the counter.
+
+When the wall-clock accuracy of the clock source isn't satisfactory, there
+are various quirks and layers in the timekeeping code for e.g. synchronizing
+the user-visible time to RTC clocks in the system or against networked time
+servers using NTP, but all they do basically is update an offset against
+the clock source, which provides the fundamental timeline for the system.
+These measures do not affect the clock source per se, they only adapt the
+system to its shortcomings.
+
+The clock source struct shall provide means to translate the provided counter
+into a nanosecond value as an unsigned long long (unsigned 64 bit) number.
+Since this operation may be invoked very often, doing this in a strict
+mathematical sense is not desirable: instead the number is taken as close as
+possible to a nanosecond value using only the arithmetic operations
+multiply and shift, so in clocksource_cyc2ns() you find:
+
+  ns ~= (clocksource * mult) >> shift
+
+You will find a number of helper functions in the clock source code intended
+to aid in providing these mult and shift values, such as
+clocksource_khz2mult(), clocksource_hz2mult() that help determine the
+mult factor from a fixed shift, and clocksource_register_hz() and
+clocksource_register_khz() which will help out assigning both shift and mult
+factors using the frequency of the clock source as the only input.
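
To make the mult and shift arithmetic concrete, here is a small user-space sketch. It assumes mult is chosen as the rounded value of (10^9 << shift) / hz for a fixed shift. The sketch_ function names are hypothetical and only mirror the idea; they are not the kernel helpers themselves.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Derive mult from the counter frequency and a fixed shift:
 * mult = round((10^9 << shift) / hz).
 * The caller has to pick a shift small enough for the result to fit
 * into 32 bits.
 */
static uint32_t sketch_hz2mult(uint32_t hz, uint32_t shift)
{
	uint64_t tmp = NSEC_PER_SEC << shift;

	tmp += hz / 2;			/* round to nearest */
	return (uint32_t)(tmp / hz);
}

/*
 * ns ~= (cycles * mult) >> shift
 * The multiplication can overflow 64 bits for very large cycle counts,
 * so this is only valid for reasonably small deltas.
 */
static uint64_t sketch_cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

For a 1 MHz counter and shift = 20 this gives mult = 1048576000, so 1000 cycles translate to exactly 1000000 ns. For a 24 MHz counter and shift = 24 the rounded mult is 699050667, and one second worth of cycles (24000000) comes out as 1000000000 ns after the shift truncates the small rounding error.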
+
+For really simple clock sources accessed from a single I/O memory location
+there is nowadays even clocksource_mmio_init() which will take a memory
+location, bit width, a parameter telling whether the counter in the
+register counts up or down, and the timer clock rate, and then conjure all
+necessary parameters.
+
+Since a 32-bit counter at say 100 MHz will wrap around to zero after some 43
+seconds, the code handling the clock source will have to compensate for this.
+That is the reason why the clock source struct also contains a 'mask'
+member telling how many bits of the source are valid. This way the timekeeping
+code knows when the counter will wrap around and can insert the necessary
+compensation code on both sides of the wrap point so that the system timeline
+remains monotonic.
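
The role of the mask can be illustrated with a short sketch: the elapsed cycles between two reads are computed as a masked unsigned difference, which stays correct across one wrap of the counter. The function names below are hypothetical, and the sketch assumes the counter is sampled at least once per wrap period.

```c
#include <stdint.h>

/* Mask with the lowest 'bits' bits set, for a counter 1..64 bits wide. */
static uint64_t sketch_mask(unsigned int bits)
{
	return bits < 64 ? (1ULL << bits) - 1 : ~0ULL;
}

/*
 * Cycles elapsed between two counter reads. The unsigned subtraction
 * followed by the mask gives the right answer even if the counter
 * wrapped around once between 'last' and 'now'.
 */
static uint64_t sketch_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	return (now - last) & mask;
}
```

For a 32-bit counter read as 0xfffffff0 and later as 0x00000010, the masked difference is 0x20 cycles, although the raw value went "backwards".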
+
+
+Clock events
+------------
+
+Clock events are the conceptual reverse of clock sources: they take a
+desired time specification value and calculate the values to poke into
+hardware timer registers.
+
+Clock events are orthogonal to clock sources. The same hardware
+and register range may be used for the clock event, but it is essentially
+a different thing. The hardware driving clock events has to be able to
+fire interrupts, so as to trigger events on the system timeline. On an SMP
+system, it is ideal (and customary) to have one such event-driving timer per
+CPU core, so that each core can trigger events independently of any other
+core.
+
+You will notice that the clock event device code is based on the same basic
+idea about translating counters to nanoseconds using mult and shift
+arithmetic, and you find the same family of helper functions again for
+assigning these values. The clock event driver does not need a 'mask'
+attribute however: the system will not try to plan events beyond the time
+horizon of the clock event.
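
As a rough sketch of the reverse direction, the code below converts a requested delay in nanoseconds into timer ticks with the same multiply-and-shift scheme, and clamps the request to the range the device can handle. The struct and function names are hypothetical stand-ins, and choosing mult as (hz << shift) / 10^9 is an assumption of this sketch.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical, stripped-down description of a clock event device. */
struct sketch_event_dev {
	uint32_t mult;		/* ns to ticks multiplier */
	uint32_t shift;		/* ns to ticks shift */
	uint64_t min_delta_ns;	/* shortest programmable delay */
	uint64_t max_delta_ns;	/* time horizon of the device */
};

/* mult = (hz << shift) / 10^9, so that ticks ~= (ns * mult) >> shift. */
static uint32_t sketch_evt_mult(uint32_t hz, uint32_t shift)
{
	return (uint32_t)(((uint64_t)hz << shift) / NSEC_PER_SEC);
}

/* Clamp the request to what the device supports, then convert to ticks. */
static uint64_t sketch_ns_to_ticks(const struct sketch_event_dev *dev,
				   uint64_t delta_ns)
{
	if (delta_ns < dev->min_delta_ns)
		delta_ns = dev->min_delta_ns;
	if (delta_ns > dev->max_delta_ns)
		delta_ns = dev->max_delta_ns;

	return (delta_ns * dev->mult) >> dev->shift;
}
```

With a 250 MHz timer and shift = 32, mult becomes 2^30 and a 1000 ns request maps to 250 ticks. A request beyond max_delta_ns is simply shortened to the horizon, which is why no wrap-around mask is needed on this side.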
+
+
+sched_clock()
+-------------
+
+In addition to the clock sources and clock events there is a special weak
+function in the kernel called sched_clock(). This function shall return the
+number of nanoseconds since the system was started. An architecture may or
+may not provide an implementation of sched_clock() on its own. If a local
+implementation is not provided, the system jiffy counter will be used as
+sched_clock().
+
+As the name suggests, sched_clock() is used for scheduling the system,
+determining the absolute timeslice for a certain process in the CFS scheduler
+for example. It is also used for printk timestamps when you have selected to
+include time information in printk for things like bootcharts.
+
+Compared to clock sources, sched_clock() has to be very fast: it is called
+much more often, especially by the scheduler. If you have to trade accuracy
+against speed, you may sacrifice accuracy for speed in sched_clock(). It
+however requires some of the same basic characteristics as the clock source,
+i.e. it should be monotonic.
+
+The sched_clock() function may wrap only on unsigned long long boundaries,
+i.e. after 64 bits. Since this is a nanosecond value this will mean it wraps
+after circa 585 years. (For most practical systems this means "never".)
+
+If an architecture does not provide its own implementation of this function,
+it will fall back to using jiffies, limiting its resolution to one jiffy,
+i.e. 1/HZ seconds, for the architecture. This will affect scheduling accuracy
+and will likely show up in system benchmarks.
+
+The clock driving sched_clock() may stop or reset to zero during system
+suspend/sleep. This does not matter to the function it serves of scheduling
+events on the system. However it may result in interesting timestamps in
+printk().
+
+The sched_clock() function should be callable in any context, IRQ- and
+NMI-safe and return a sane value in any context.
+
+Some architectures may have a limited set of time sources and lack a nice
+counter to derive a 64-bit nanosecond value, so for example on the ARM
+architecture, special helper functions have been created to provide a
+sched_clock() nanosecond base from a 16- or 32-bit counter. Sometimes the
+same counter that is also used as clock source is used for this purpose.
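
One way such a helper could extend a narrow counter into a 64-bit nanosecond value is to remember an epoch (a raw counter value and the nanoseconds accumulated up to it) and add the converted, masked delta on each read. The sketch below shows only that arithmetic. All names are hypothetical, and it ignores the locking a real implementation would need to stay IRQ- and NMI-safe.

```c
#include <stdint.h>

/* Hypothetical state for extending a narrow counter to 64-bit ns. */
struct sketch_sched_clock {
	uint64_t epoch_ns;	/* nanoseconds accumulated at the last update */
	uint32_t epoch_cyc;	/* raw counter value at the last update */
	uint32_t mask;		/* valid counter bits, e.g. 0xffff */
	uint32_t mult;		/* cycles to ns multiplier */
	uint32_t shift;		/* cycles to ns shift */
};

static uint64_t sketch_cyc_to_ns(uint64_t cyc, uint32_t mult, uint32_t shift)
{
	return (cyc * mult) >> shift;
}

/* Read: epoch plus the masked delta since the epoch, converted to ns. */
static uint64_t sketch_sched_clock_read(const struct sketch_sched_clock *sc,
					uint32_t now_cyc)
{
	uint32_t delta = (now_cyc - sc->epoch_cyc) & sc->mask;

	return sc->epoch_ns + sketch_cyc_to_ns(delta, sc->mult, sc->shift);
}

/*
 * Move the epoch forward. This has to run at least once per wrap period
 * of the counter, otherwise whole wraps are silently lost.
 */
static void sketch_sched_clock_update(struct sketch_sched_clock *sc,
				      uint32_t now_cyc)
{
	sc->epoch_ns = sketch_sched_clock_read(sc, now_cyc);
	sc->epoch_cyc = now_cyc & sc->mask;
}
```

For a 16-bit counter at 1 MHz (mult = 1000, shift = 0), updating the epoch at counter value 0xfff0 and then reading at 0x0010 after the wrap yields 65520000 + 32000 = 65552000 ns, so the nanosecond value keeps increasing although the raw counter wrapped.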
+
+On SMP systems, it is crucial for performance that sched_clock() can be called
+independently on each CPU without any synchronization performance hits.
+Some hardware (such as the x86 TSC) will cause the sched_clock() function to
+drift between the CPUs on the system. The kernel can work around this by
+enabling the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK option. This is another aspect
+that makes sched_clock() different from the ordinary clock source.
+
+
+Delay timers (some architectures only)
+--------------------------------------
+
+On systems with variable CPU frequency, the various kernel delay() functions
+will sometimes behave strangely. Basically these delays usually use a hard
+loop to delay a certain number of jiffy fractions using a "lpj" (loops per
+jiffy) value, calibrated on boot.
+
+Let's hope that your system is running on maximum frequency when this value
+is calibrated: as a result, when the frequency is geared down to half the
+full frequency, any delay() will be twice as long. Usually this does not
+hurt, as you're commonly requesting that amount of delay *or more*. But
+basically the semantics are quite unpredictable on such systems.
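
The arithmetic behind this effect can be sketched as follows, assuming the loop count for a delay is derived from lpj and HZ as usecs * lpj * HZ / 10^6. The function names and the example numbers are hypothetical.

```c
#include <stdint.h>

/* Loops to spin for 'usecs', given loops-per-jiffy calibrated at boot. */
static uint64_t sketch_delay_loops(uint64_t usecs, uint64_t lpj, uint64_t hz)
{
	return usecs * lpj * hz / 1000000ULL;
}

/*
 * Time in microseconds that 'loops' iterations really take when the CPU
 * currently executes 'loops_per_sec_now' iterations per second.
 */
static uint64_t sketch_actual_usecs(uint64_t loops, uint64_t loops_per_sec_now)
{
	return loops * 1000000ULL / loops_per_sec_now;
}
```

With HZ = 100 and lpj = 5000000 (500 million loops per second at calibration time), a 100 us delay needs 50000 loops. At the calibrated speed that takes 100 us, while at half the frequency (250 million loops per second) the same 50000 loops take 200 us.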
+
+Enter timer-based delays. Using these, a timer read may be used instead of
+a hard-coded loop for providing the desired delay.
+
+This is done by declaring a struct delay_timer and assigning the appropriate
+function pointers and rate settings for this delay timer.
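
The idea can be sketched in user space as below. The struct is a hypothetical stand-in with one read callback and one rate field, and the "timer" is a fake counter that advances by one tick per read; the real struct layout and registration call are architecture-specific and are not shown here.

```c
#include <stdint.h>

/* Hypothetical stand-in for a delay timer description. */
struct sketch_delay_timer {
	unsigned long (*read_current_timer)(void);	/* raw timer read */
	unsigned long freq;				/* ticks per second */
};

/* Fake free-running timer: advances by one tick on every read. */
static unsigned long fake_ticks;

static unsigned long fake_timer_read(void)
{
	return fake_ticks++;
}

/*
 * Busy-wait until at least 'usecs' worth of timer ticks have passed.
 * The unsigned subtraction keeps the comparison correct if the timer
 * wraps around during the wait.
 */
static void sketch_timer_udelay(const struct sketch_delay_timer *t,
				unsigned long usecs)
{
	/* Round up so the delay is never shorter than requested. */
	unsigned long ticks = (unsigned long)
		(((uint64_t)usecs * t->freq + 999999ULL) / 1000000ULL);
	unsigned long start = t->read_current_timer();

	while (t->read_current_timer() - start < ticks)
		;	/* spin on the timer, not on a calibrated loop count */
}
```

Because the wait is bounded by timer ticks rather than by a loop count calibrated at boot, the resulting delay depends only on the timer rate and not on the current CPU frequency.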
+
+This is available on some architectures like OpenRISC or ARM.
		179	This is available on some architectures like OpenRISC or ARM.