author    Linus Torvalds <torvalds@linux-foundation.org>  2013-02-28 22:48:26 -0500
committer Linus Torvalds <torvalds@linux-foundation.org>  2013-02-28 22:48:26 -0500
commit    2af78448fff61e13392daf4f770cfbcf9253316a
tree      6c0494284dd1dd737d5f76ee19c553618e8d0e54 /Documentation
parent    5e04f4b4290e03deb91b074087ae8d7c169d947d
parent    f5b6d45f8cf688f51140fd21f1da3b90562762a9
Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux
Pull thermal management updates from Zhang Rui:
 "Highlights:

   - introduction of Dove thermal sensor driver.
   - introduction of Kirkwood thermal sensor driver.
   - introduction of intel_powerclamp thermal cooling device driver.
   - add interrupt and DT support for rcar thermal driver.
   - add thermal emulation support which allows platform thermal driver
     to do software/hardware emulation for thermal issues."

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (36 commits)
  thermal: rcar: remove __devinitconst
  thermal: return an error on failure to register thermal class
  Thermal: rename thermal governor Kconfig option to avoid generic naming
  thermal: exynos: Use the new thermal trend type for quick cooling action.
  Thermal: exynos: Add support for temperature falling interrupt.
  Thermal: Dove: Add Themal sensor support for Dove.
  thermal: Add support for the thermal sensor on Kirkwood SoCs
  thermal: rcar: add Device Tree support
  thermal: rcar: remove machine_power_off() from rcar_thermal_notify()
  thermal: rcar: add interrupt support
  thermal: rcar: add read/write functions for common/priv data
  thermal: rcar: multi channel support
  thermal: rcar: use mutex lock instead of spin lock
  thermal: rcar: enable CPCTL to use hardware TSC deciding
  thermal: rcar: use parenthesis on macro
  Thermal: fix a build warning when CONFIG_THERMAL_EMULATION cleared
  Thermal: fix a wrong comment
  thermal: sysfs: Add a new sysfs node emul_temp for thermal emulation
  PM: intel_powerclamp: off by one in start_power_clamp()
  thermal: exynos: Miscellaneous fixes to support falling threshold interrupt
  ...
Diffstat (limited to 'Documentation')
-rw-r--r--  Documentation/devicetree/bindings/thermal/dove-thermal.txt      |  18
-rw-r--r--  Documentation/devicetree/bindings/thermal/kirkwood-thermal.txt  |  15
-rw-r--r--  Documentation/devicetree/bindings/thermal/rcar-thermal.txt      |  29
-rw-r--r--  Documentation/thermal/exynos_thermal_emulation                  |  53
-rw-r--r--  Documentation/thermal/intel_powerclamp.txt                      | 307
-rw-r--r--  Documentation/thermal/sysfs-api.txt                             |  18
6 files changed, 438 insertions, 2 deletions
diff --git a/Documentation/devicetree/bindings/thermal/dove-thermal.txt b/Documentation/devicetree/bindings/thermal/dove-thermal.txt
new file mode 100644
index 000000000000..6f474677d472
--- /dev/null
+++ b/Documentation/devicetree/bindings/thermal/dove-thermal.txt
@@ -0,0 +1,18 @@
* Dove Thermal

This driver is for Dove SoCs which contain a thermal sensor.

Required properties:
- compatible : "marvell,dove-thermal"
- reg : Address range of the thermal registers

The reg property should contain two ranges. The first is for the
three Thermal Manager registers, while the second range contains the
Thermal Diode Control Registers.

Example:

	thermal@d001c {
		compatible = "marvell,dove-thermal";
		reg = <0xd001c 0x0c>, <0xd005c 0x08>;
	};
diff --git a/Documentation/devicetree/bindings/thermal/kirkwood-thermal.txt b/Documentation/devicetree/bindings/thermal/kirkwood-thermal.txt
new file mode 100644
index 000000000000..8c0f5eb86da7
--- /dev/null
+++ b/Documentation/devicetree/bindings/thermal/kirkwood-thermal.txt
@@ -0,0 +1,15 @@
* Kirkwood Thermal

This version is for Kirkwood 88F6282 & 88F6283 SoCs. Other Kirkwoods
don't contain a thermal sensor.

Required properties:
- compatible : "marvell,kirkwood-thermal"
- reg : Address range of the thermal registers

Example:

	thermal@10078 {
		compatible = "marvell,kirkwood-thermal";
		reg = <0x10078 0x4>;
	};
diff --git a/Documentation/devicetree/bindings/thermal/rcar-thermal.txt b/Documentation/devicetree/bindings/thermal/rcar-thermal.txt
new file mode 100644
index 000000000000..28ef498a66e5
--- /dev/null
+++ b/Documentation/devicetree/bindings/thermal/rcar-thermal.txt
@@ -0,0 +1,29 @@
* Renesas R-Car Thermal

Required properties:
- compatible		: "renesas,rcar-thermal"
- reg			: Address range of the thermal registers.
			  If the node has "interrupts", the first reg
			  entry is treated as the common register block.

Optional properties:

- interrupts		: use interrupt

Example (without interrupt support):

thermal@e61f0100 {
	compatible = "renesas,rcar-thermal";
	reg = <0xe61f0100 0x38>;
};

Example (with interrupt support):

thermal@e61f0000 {
	compatible = "renesas,rcar-thermal";
	reg = <0xe61f0000 0x14
		0xe61f0100 0x38
		0xe61f0200 0x38
		0xe61f0300 0x38>;
	interrupts = <0 69 4>;
};
diff --git a/Documentation/thermal/exynos_thermal_emulation b/Documentation/thermal/exynos_thermal_emulation
new file mode 100644
index 000000000000..b73bbfb697bb
--- /dev/null
+++ b/Documentation/thermal/exynos_thermal_emulation
@@ -0,0 +1,53 @@
EXYNOS EMULATION MODE
========================

Copyright (C) 2012 Samsung Electronics

Written by Jonghwa Lee <jonghwa3.lee@samsung.com>

Description
-----------

Exynos 4x12 (4212, 4412) and 5 series SoCs provide an emulation mode for the
thermal management unit (TMU). Emulation mode supports software debugging of
TMU operation: the user sets a temperature manually from software, and the TMU
reads the current temperature from that value instead of from the sensor.

Enabling the CONFIG_EXYNOS_THERMAL_EMUL option makes this support available.
When it is enabled, a sysfs node named 'emulation' is created under
/sys/bus/platform/devices/'exynos device name'/.

The 'emulation' node contains 0 in its initial state. Writing a temperature
to the node automatically enables emulation mode and changes the current
temperature to the written value.
(Exynos also supports a user-changeable delay time, which delays the
 temperature change. However, this node always uses the same delay as the real
 sensing time, 938us.)

Exynos emulation mode requires that a value change and the enable operation
happen together. To update the delay or the next temperature, you must
enable emulation mode at the same time (or keep the mode enabled). Otherwise
the update fails and the last successful value is used repeatedly. That is why
this node lets users change the temperature only: a single interface is
simpler to use.

Disabling emulation mode only requires writing 0 to the sysfs node.


TEMP    120 |
            |
        100 |
            |
         80 |
            |                            +-----------
         60 |                            |          |
            |              +-------------|          |
         40 |              |             |          |
            |              |             |          |
         20 |              |             |          +----------
            |              |             |          |          |
          0 |______________|_____________|__________|__________|_________
                   A             A          A                  A     TIME
                   |<----->|     |<----->|  |<----->|          |
                   | 938us |     |       |  |       |          |
emulation    :  0  50      |     70      |  20      |          0
current temp : sensor            50           70         20      sensor
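
The write sequence in the diagram can be sketched from userspace in Python. This is an illustration under assumptions, not part of the driver: write_emulation() and read_emulation() are hypothetical helper names, and the node path is passed in because it depends on the 'exynos device name' of the platform. The demo runs against a temporary file that stands in for the real sysfs node.

```python
import os
import tempfile


def write_emulation(node_path, temperature):
    """Write a temperature to the 'emulation' node.

    Per the text above, writing 0 disables emulation mode and any other
    value enables it with that temperature.
    """
    with open(node_path, "w") as node:
        node.write(f"{int(temperature)}\n")


def read_emulation(node_path):
    """Return the integer currently held by the node (0 means disabled)."""
    with open(node_path) as node:
        return int(node.read().strip())


if __name__ == "__main__":
    # A temporary file stands in for the real sysfs node in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        node = os.path.join(tmp, "emulation")
        for value in (50, 70, 20, 0):  # same sequence as the diagram
            write_emulation(node, value)
            print("emulation =", read_emulation(node))
```

On real hardware the temperature reported by the TMU follows each write only after the sensing delay described above, so a test script would need to wait before reading the zone temperature back.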
diff --git a/Documentation/thermal/intel_powerclamp.txt b/Documentation/thermal/intel_powerclamp.txt
new file mode 100644
index 000000000000..332de4a39b5a
--- /dev/null
+++ b/Documentation/thermal/intel_powerclamp.txt
@@ -0,0 +1,307 @@
			 =======================
			 INTEL POWERCLAMP DRIVER
			 =======================
By: Arjan van de Ven <arjan@linux.intel.com>
    Jacob Pan <jacob.jun.pan@linux.intel.com>

Contents:
	(*) Introduction
	    - Goals and Objectives

	(*) Theory of Operation
	    - Idle Injection
	    - Calibration

	(*) Performance Analysis
	    - Effectiveness and Limitations
	    - Power vs Performance
	    - Scalability
	    - Calibration
	    - Comparison with Alternative Techniques

	(*) Usage and Interfaces
	    - Generic Thermal Layer (sysfs)
	    - Kernel APIs (TBD)

============
INTRODUCTION
============

Consider the situation where a system’s power consumption must be
reduced at runtime, due to power budget, thermal constraint, or noise
level, and where active cooling is not preferred. Software managed
passive power reduction must be performed to prevent the hardware
actions that are designed for catastrophic scenarios.

Currently, P-states, T-states (clock modulation), and CPU offlining
are used for CPU throttling.

On Intel CPUs, C-states provide effective power reduction, but so far
they’re only used opportunistically, based on workload. The
intel_powerclamp driver introduces a method of synchronizing idle
injection across all online CPU threads. The goal is to achieve
forced and controllable C-state residency.

Tests and analysis have been carried out in the areas of power,
performance, scalability, and user experience. In many cases, a clear
advantage is shown over taking the CPU offline or modulating the CPU
clock.


===================
THEORY OF OPERATION
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later), package level C-state
residency is available in MSRs, thus also available to the kernel.

These MSRs are:
      #define MSR_PKG_C2_RESIDENCY	0x60D
      #define MSR_PKG_C3_RESIDENCY	0x3F8
      #define MSR_PKG_C6_RESIDENCY	0x3F9
      #define MSR_PKG_C7_RESIDENCY	0x3FA

If the kernel can also inject idle time into the system, then a
closed-loop control system can be established that manages package
level C-state. The intel_powerclamp driver is conceived as such a
control system, where the target set point is a user-selected idle
ratio (based on power reduction), and the error is the difference
between the actual package level C-state residency ratio and the
target idle ratio.
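
The error term of this control loop can be sketched in Python as follows. This is not the driver's code: the function names are hypothetical, and the sketch assumes that the residency counters advance at the same rate as the TSC, so that the ratio of the two deltas over a sampling window gives the idle fraction. The res_* arguments would be the sum of the package C-state residency MSRs listed above, sampled at the start and end of the window.

```python
def residency_percent(res_prev, res_now, tsc_prev, tsc_now):
    """Percentage of the sampling window spent in package C-states.

    Assumption: the residency counters tick at the TSC rate. Counter
    wraparound is ignored in this sketch.
    """
    tsc_delta = tsc_now - tsc_prev
    if tsc_delta <= 0:
        return 0
    return 100 * (res_now - res_prev) // tsc_delta


def control_error(actual_percent, target_percent):
    """Error as defined in the text: actual residency minus target ratio."""
    return actual_percent - target_percent


if __name__ == "__main__":
    # Made-up counter samples: 2000 idle ticks out of a 10000 tick window.
    actual = residency_percent(1_000, 3_000, 10_000, 20_000)
    print(actual, control_error(actual, 25))  # prints "20 -5"
```

A negative error means the package idled less than requested, which is the case the compensation described under Calibration is meant to correct.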

Injection is controlled by high priority kernel threads, spawned for
each online CPU.

These kernel threads, with SCHED_FIFO class, are created to perform
clamping actions of controlled duty ratio and duration. Each per-CPU
thread synchronizes its idle time and duration, based on the rounding
of jiffies, so accumulated errors can be prevented to avoid a jittery
effect. Threads are also bound to the CPU such that they cannot be
migrated, unless the CPU is taken offline. In this case, threads
belonging to the offlined CPUs will be terminated immediately.

Running as SCHED_FIFO at a relatively high priority also allows this
scheme to work for both preemptible and non-preemptible kernels.
Alignment of idle time around jiffies ensures scalability for HZ
values. This effect can be better visualized using a Perf timechart.
The diagram below shows the behavior of the kernel thread
kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
for a given "duration", then relinquishes the CPU to other tasks,
until the next time interval.

The NOHZ scheduler tick is disabled during idle time, but interrupts
are not masked. Tests show that the extra wakeups from the scheduler
tick have a dramatic impact on the effectiveness of the powerclamp
driver on large scale systems (Westmere system with 80 processors).

CPU0
                  ____________          ____________
kidle_inject/0   |   sleep    |  mwait |  sleep     |
        _________|            |________|            |_______
                               duration
CPU1
                  ____________          ____________
kidle_inject/1   |   sleep    |  mwait |  sleep     |
        _________|            |________|            |_______
                              ^
                              |
                              |
                              roundup(jiffies, interval)
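
The alignment shown by roundup(jiffies, interval) in the diagram can be illustrated with a few lines of Python. This is a sketch of the arithmetic only: next_injection_start() is a hypothetical name, and the interval of 10 jiffies is a made-up value.

```python
def roundup(value, step):
    """Round value up to the nearest multiple of step (integers only)."""
    return ((value + step - 1) // step) * step


def next_injection_start(jiffies, interval):
    """Jiffy at which a thread would begin its next idle period."""
    return roundup(jiffies, interval)


if __name__ == "__main__":
    interval = 10  # hypothetical interval, in jiffies
    # Threads that wake at slightly different times pick the same boundary.
    for cpu, now in enumerate((1001, 1004, 1009)):
        start = next_injection_start(now, interval)
        print(f"cpu{cpu}: now={now} start={start}")
```

Because every thread rounds up to the same boundary, small differences in wakeup time do not accumulate from one interval to the next.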

Only one CPU is allowed to collect statistics and update global
control parameters. This CPU is referred to as the controlling CPU in
this document. The controlling CPU is elected at runtime, with a
policy that favors the BSP (bootstrap processor), taking into account
the possibility of a CPU hot-plug.

In terms of dynamics of the idle control system, package level idle
time is considered largely as a non-causal system where its behavior
cannot be based on the past or current input. Therefore, the
intel_powerclamp driver attempts to enforce the desired idle time
instantly as given input (target idle ratio). After injection,
powerclamp monitors the actual idle for a given time window and adjusts
the next injection accordingly to avoid over/under correction.

When used in a causal control system, such as a temperature control,
it is up to the user of this driver to implement algorithms where
past samples and outputs are included in the feedback. For example, a
PID-based thermal controller can use the powerclamp driver to
maintain a desired target temperature, based on integral and
derivative gains of the past samples.
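
A minimal sketch of such a PID-based controller, in Python, is given below. It is one possible userspace design and not something this document specifies: the class name, the gains and the target temperature are hypothetical, and the output is clamped to 0..50 because 50 percent is the injection cap this document gives. The sketch has no integral anti-windup, which a real controller would likely need.

```python
class PidIdleController:
    """Map temperature samples to an idle ratio for powerclamp.

    All parameters are illustrative assumptions. max_idle defaults to
    50, the idle injection cap quoted in this document.
    """

    def __init__(self, kp, ki, kd, target_temp, max_idle=50):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.target_temp = target_temp
        self.max_idle = max_idle
        self.integral = 0.0
        self.prev_error = None

    def update(self, temp, dt):
        """Return the idle percentage to request after one sample.

        temp uses the same unit as target_temp; dt is the positive time
        elapsed since the previous sample.
        """
        error = temp - self.target_temp  # positive when too hot
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        output = (self.kp * error
                  + self.ki * self.integral
                  + self.kd * derivative)
        return int(min(max(output, 0), self.max_idle))


if __name__ == "__main__":
    ctrl = PidIdleController(kp=2.0, ki=0.1, kd=0.0, target_temp=70.0)
    for sample in (72.0, 75.0, 80.0, 68.0):
        print(sample, ctrl.update(sample, dt=1.0))
```

The value returned by update() is what the controller would write to the cooling device's cur_state node described under Usage and Interfaces.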



Calibration
-----------
During scalability testing, it is observed that synchronized actions
among CPUs become challenging as the number of cores grows. This is
also true for the ability of a system to enter package level C-states.

To make sure the intel_powerclamp driver scales well, online
calibration is implemented. The goals for doing such a calibration
are:

a) determine the effective range of idle injection ratio
b) determine the amount of compensation needed at each target ratio

Compensation to each target ratio consists of two parts:

        a) steady state error compensation
        This is to offset the error occurring when the system can
        enter idle without extra wakeups (such as external interrupts).

        b) dynamic error compensation
        When an excessive amount of wakeups occurs during idle, an
        additional idle ratio can be added to quiet interrupts, by
        slowing down CPU activities.

A debugfs file is provided for the user to examine compensation
progress and results. The example below was taken on a Westmere system.
[jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
controlling cpu: 0
pct confidence steady dynamic (compensation)
0	0	0	0
1	1	0	0
2	1	1	0
3	3	1	0
4	3	1	0
5	3	1	0
6	3	1	0
7	3	1	0
8	3	1	0
...
30	3	2	0
31	3	2	0
32	3	1	0
33	3	2	0
34	3	1	0
35	3	2	0
36	3	1	0
37	3	2	0
38	3	1	0
39	3	2	0
40	3	3	0
41	3	1	0
42	3	2	0
43	3	1	0
44	3	1	0
45	3	2	0
46	3	3	0
47	3	0	0
48	3	2	0
49	3	3	0
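
For scripted inspection, the calibration output can be parsed with a short Python sketch like the one below. CalibRow and parse_calib() are hypothetical names, and the parser assumes that every data row consists of exactly four non-negative integers, as in the sample above.

```python
from collections import namedtuple

CalibRow = namedtuple("CalibRow", "pct confidence steady dynamic")


def parse_calib(text):
    """Extract the numeric rows from powerclamp_calib output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows are exactly four non-negative integers; the header
        # lines and the "..." elision do not match and are skipped.
        if len(fields) == 4 and all(f.isdigit() for f in fields):
            rows.append(CalibRow(*(int(f) for f in fields)))
    return rows


if __name__ == "__main__":
    sample = """controlling cpu: 0
pct confidence steady dynamic (compensation)
0 0 0 0
1 1 0 0
2 1 1 0
"""
    for row in parse_calib(sample):
        print(row)
```

Each row then exposes the target ratio (pct), its confidence level and the two compensation values by name, which makes it easy to check how far calibration has progressed.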

Calibration occurs during runtime. No offline method is available.
Steady state compensation is used only when confidence levels of all
adjacent ratios have reached a satisfactory level. A confidence level
is accumulated based on clean data collected at runtime. Data
collected during a period without extra interrupts is considered
clean.

To compensate for excessive amounts of wakeups during idle, additional
idle time is injected when such a condition is detected. Currently,
we have a simple algorithm to double the injection ratio. A possible
enhancement might be to throttle the offending IRQ, such as delaying
EOI for level triggered interrupts. But it is a challenge to be
non-intrusive to the scheduler or the IRQ core code.


CPU Online/Offline
------------------
Per-CPU kernel threads are started/stopped upon receiving
notifications of CPU hotplug activities. The intel_powerclamp driver
keeps track of clamping kernel threads, even after they are migrated
to other CPUs, after a CPU offline event.


=====================
Performance Analysis
=====================
This section describes the general performance data collected on
multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).

Effectiveness and Limitations
-----------------------------
The maximum idle injection ratio is capped at 50 percent. As
mentioned earlier, since interrupts are allowed during forced idle
time, excessive interrupts could result in less effectiveness. The
extreme case would be doing a ping -f to generate a flood of network
interrupts without much CPU acknowledgement. In this case, little can
be done from the idle injection threads. In most normal cases, such
as using scp to copy a large file, applications can be throttled by
the powerclamp driver, since slowing down the CPU also slows down
network protocol processing, which in turn reduces interrupts.

When control parameters are changed at runtime by the controlling
CPU, it may take an additional period for the rest of the CPUs to
catch up with the changes. During this time, idle injection is out of
sync, and thus not able to enter package C-states at the expected
ratio. But this effect is minor, because in most cases the target
ratio is updated much less frequently than the idle injection
frequency.

Scalability
-----------
Tests also show a minor, but measurable, difference between the 4P/8P
Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
More compensation is needed on Westmere for the same target idle
ratio. The compensation also increases as the idle ratio gets larger.
This is why the calibration code is needed.

On the IVB 8P system, compared to taking a CPU offline, powerclamp can
achieve up to 40% better performance per watt (measured by a spin
counter summed over per-CPU counting threads spawned for all running
CPUs).

====================
Usage and Interfaces
====================
The powerclamp driver is registered to the generic thermal layer as a
cooling device. Currently, it’s not bound to any thermal zones.

jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
cur_state:0
max_state:50
type:intel_powerclamp

Example usage:
- To inject 25% idle time (the cooling_device index differs between systems)
$ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"

If the system is not busy and has more than 25% idle time already,
then the powerclamp driver will not start idle injection. Using top
will not show idle injection kernel threads.
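
Because the cooling device index is not fixed, a script can look the device up by its type attribute instead of hard-coding a number. The Python sketch below shows one way to do that. The helper names are hypothetical, and the only sysfs attributes it relies on are the ones shown in the grep output above (type, max_state, cur_state). Writing cur_state requires root.

```python
import glob
import os


def find_powerclamp(base="/sys/class/thermal"):
    """Return the cooling device directory of type intel_powerclamp."""
    pattern = os.path.join(base, "cooling_device*", "type")
    for type_path in sorted(glob.glob(pattern)):
        with open(type_path) as f:
            if f.read().strip() == "intel_powerclamp":
                return os.path.dirname(type_path)
    return None  # driver not loaded, or no such device


def set_idle_percent(device_dir, percent):
    """Write the requested idle percentage, bounded by max_state."""
    with open(os.path.join(device_dir, "max_state")) as f:
        max_state = int(f.read().strip())
    if not 0 <= percent <= max_state:
        raise ValueError(f"percent must be within 0..{max_state}")
    with open(os.path.join(device_dir, "cur_state"), "w") as f:
        f.write(f"{percent}\n")


if __name__ == "__main__":
    device = find_powerclamp()
    print("powerclamp cooling device:", device)
```

Calling set_idle_percent(device, 25) then has the same effect as the echo command above, and set_idle_percent(device, 0) stops injection.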

If the system is busy (spin test below) and has less than 25% natural
idle time, powerclamp kernel threads will do idle injection. These
threads appear as running to the scheduler, but the overall system
idle is still reflected. In this example, 24.1% idle is shown. This
helps the system admin or user determine the cause of slowdown when
the powerclamp driver is in action.


Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
Swap:  4087804k total,        0k used,  4087804k free,   945336k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
 3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
 3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
 3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
 3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
 2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
 1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
 2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz

Tests have shown that by using the powerclamp driver as a cooling
device, a PID based userspace thermal controller can manage to
control CPU temperature effectively, when no other thermal influence
is added. For example, an Ultrabook user can compile the kernel under
a certain temperature (below most active trip points).
diff --git a/Documentation/thermal/sysfs-api.txt b/Documentation/thermal/sysfs-api.txt
index 88c02334e356..6859661c9d31 100644
--- a/Documentation/thermal/sysfs-api.txt
+++ b/Documentation/thermal/sysfs-api.txt
@@ -55,6 +55,8 @@ temperature) and throttle appropriate devices.
     .get_trip_type: get the type of certain trip point.
     .get_trip_temp: get the temperature above which the certain trip point
 			will be fired.
+    .set_emul_temp: set the emulation temperature which helps in debugging
+			different threshold temperature points.
 
 1.1.2 void thermal_zone_device_unregister(struct thermal_zone_device *tz)
 
@@ -153,6 +155,7 @@ Thermal zone device sys I/F, created once it's registered:
     |---trip_point_[0-*]_temp:	Trip point temperature
     |---trip_point_[0-*]_type:	Trip point type
     |---trip_point_[0-*]_hyst:	Hysteresis value for this trip point
+    |---emul_temp:		Emulated temperature set node
 
 Thermal cooling device sys I/F, created once it's registered:
 /sys/class/thermal/cooling_device[0-*]:
@@ -252,6 +255,16 @@ passive
 	Valid values: 0 (disabled) or greater than 1000
 	RW, Optional
 
+emul_temp
+	Interface to set the emulated temperature of a thermal zone
+	(sensor). After this temperature is set, the thermal zone may pass
+	it to the platform emulation function, if one is registered, or
+	cache it locally. This is useful for debugging different temperature
+	thresholds and their associated cooling actions. This is a write-only
+	node, and writing 0 to it should disable emulation.
+	Unit: millidegree Celsius
+	WO, Optional
+
 *****************************
 * Cooling device attributes *
 *****************************
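
The emul_temp attribute added in the hunk above can be exercised from userspace with a Python sketch like the following. The helper names are hypothetical, and the zone directory is a parameter because zone numbering depends on the system (for example a thermal_zone[0-*] directory under /sys/class/thermal). The only facts taken from the text are the unit (millidegree Celsius), that the node is write only, and that writing 0 should disable emulation.

```python
import os


def celsius_to_millidegrees(celsius):
    """Convert degrees Celsius to the millidegree unit used by emul_temp."""
    return int(round(celsius * 1000))


def set_emul_temp(zone_dir, celsius):
    """Write an emulated temperature to <zone_dir>/emul_temp.

    Passing 0 writes 0, which per the text should disable emulation.
    The node is write only, so nothing is read back here.
    """
    value = celsius_to_millidegrees(celsius)
    with open(os.path.join(zone_dir, "emul_temp"), "w") as node:
        node.write(f"{value}\n")
    return value


if __name__ == "__main__":
    print(celsius_to_millidegrees(85.5))  # prints 85500
```

Stepping the emulated value across each trip point in turn is one way to check that the expected cooling action is bound to that trip point.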
@@ -329,8 +342,9 @@ The framework includes a simple notification mechanism, in the form of a
 netlink event. Netlink socket initialization is done during the _init_
 of the framework. Drivers which intend to use the notification mechanism
 just need to call thermal_generate_netlink_event() with two arguments viz
-(originator, event). Typically the originator will be an integer assigned
-to a thermal_zone_device when it registers itself with the framework. The
+(originator, event). The originator is a pointer to the struct
+thermal_zone_device from which the event originated. An integer which
+represents the thermal zone device is used in the message to identify the
+zone. The
 event will be one of:{THERMAL_AUX0, THERMAL_AUX1, THERMAL_CRITICAL,
 THERMAL_DEV_FAULT}. Notification can be sent when the current temperature
 crosses any of the configured thresholds.