-rw-r--r-- | Documentation/cpu-freq/intel-pstate.txt | 241 |
1 files changed, 199 insertions, 42 deletions
diff --git a/Documentation/cpu-freq/intel-pstate.txt b/Documentation/cpu-freq/intel-pstate.txt
index be8d4006bf76..f7b12c071d53 100644
--- a/Documentation/cpu-freq/intel-pstate.txt
+++ b/Documentation/cpu-freq/intel-pstate.txt
@@ -1,61 +1,131 @@ | |||
1 | Intel P-state driver | 1 | Intel P-State driver |
2 | -------------------- | 2 | -------------------- |
3 | 3 | ||
4 | This driver provides an interface to control the P state selection for | 4 | This driver provides an interface to control the P-State selection for the |
5 | SandyBridge+ Intel processors. The driver can operate two different | 5 | SandyBridge+ Intel processors. |
6 | modes based on the processor model, legacy mode and Hardware P state (HWP) | 6 | |
7 | mode. | 7 | The following document explains P-States: |
8 | 8 | http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf | |
9 | In legacy mode, the Intel P-state implements two internal governors, | 9 | As stated in the document, P-State doesn’t exactly mean a frequency. However, for |
10 | performance and powersave, that differ from the general cpufreq governors of | 10 | the sake of the relationship with cpufreq, P-State and frequency are used |
11 | the same name (the general cpufreq governors implement target(), whereas the | 11 | interchangeably. |
12 | internal Intel P-state governors implement setpolicy()). The internal | 12 | |
13 | performance governor sets the max_perf_pct and min_perf_pct to 100; that is, | 13 | Understanding the cpufreq core governors and policies is important before |
14 | the governor selects the highest available P state to maximize the performance | 14 | discussing more details about the Intel P-State driver. Based on what callbacks |
15 | of the core. The internal powersave governor selects the appropriate P state | 15 | a cpufreq driver provides to the cpufreq core, it can support two types of |
16 | based on the current load on the CPU. | 16 | drivers: |
17 | 17 | - with target_index() callback: In this mode, the drivers using cpufreq core | |
18 | In HWP mode P state selection is implemented in the processor | 18 | simply provide the minimum and maximum frequency limits and an additional |
19 | itself. The driver provides the interfaces between the cpufreq core and | 19 | interface target_index() to set the current frequency. The cpufreq subsystem |
20 | the processor to control P state selection based on user preferences | 20 | has a number of scaling governors ("performance", "powersave", "ondemand", |
21 | and reporting frequency to the cpufreq core. In this mode the | 21 | etc.). Depending on which governor is in use, cpufreq core will call for |
22 | internal Intel P-state governor code is disabled. | 22 | transitions to a specific frequency using target_index() callback. |
23 | 23 | - setpolicy() callback: In this mode, drivers do not provide target_index() | |
24 | In addition to the interfaces provided by the cpufreq core for | 24 | callback, so cpufreq core can't request a transition to a specific frequency. |
25 | controlling frequency the driver provides sysfs files for | 25 | The driver provides minimum and maximum frequency limits and callbacks to set a |
26 | controlling P state selection. These files have been added to | 26 | policy. The policy in cpufreq sysfs is referred to as the "scaling governor". |
27 | /sys/devices/system/cpu/intel_pstate/ | 27 | The cpufreq core can request the driver to operate in any of the two policies: |
28 | 28 | "performance" and "powersave". The driver decides which frequency to use based | |
29 | max_perf_pct: limits the maximum P state that will be requested by | 29 | on the above policy selection considering minimum and maximum frequency limits. |
30 | the driver stated as a percentage of the available performance. The | 30 | |
31 | available (P states) performance may be reduced by the no_turbo | 31 | The Intel P-State driver falls under the latter category, which implements the |
32 | setpolicy() callback. This driver decides what P-State to use based on the | ||
33 | requested policy from the cpufreq core. If the processor is capable of | ||
34 | selecting its next P-State internally, then the driver will offload this | ||
35 | responsibility to the processor (aka HWP: Hardware P-States). If not, the | ||
36 | driver implements algorithms to select the next P-State. | ||
37 | |||
38 | Since these policies are implemented in the driver, they are not the same as the | ||
39 | cpufreq scaling governors implementation, even if they have the same name in | ||
40 | the cpufreq sysfs (scaling_governors). For example the "performance" policy is | ||
41 | similar to cpufreq’s "performance" governor, but "powersave" is completely | ||
42 | different than the cpufreq "powersave" governor. The strategy here is similar | ||
43 | to cpufreq "ondemand", where the requested P-State is related to the system load. | ||
44 | |||
45 | Sysfs Interface | ||
46 | |||
47 | In addition to the frequency-controlling interfaces provided by the cpufreq | ||
48 | core, the driver provides its own sysfs files to control the P-State selection. | ||
49 | These files have been added to /sys/devices/system/cpu/intel_pstate/. | ||
50 | Any changes made to these files are applicable to all CPUs (even in a | ||
51 | multi-package system). | ||
52 | |||
53 | max_perf_pct: Limits the maximum P-State that will be requested by | ||
54 | the driver. It states it as a percentage of the available performance. The | ||
55 | available (P-State) performance may be reduced by the no_turbo | ||
32 | setting described below. | 56 | setting described below. |
33 | 57 | ||
34 | min_perf_pct: limits the minimum P state that will be requested by | 58 | min_perf_pct: Limits the minimum P-State that will be requested by |
35 | the driver stated as a percentage of the max (non-turbo) | 59 | the driver. It states it as a percentage of the max (non-turbo) |
36 | performance level. | 60 | performance level. |
37 | 61 | ||
38 | no_turbo: limits the driver to selecting P states below the turbo | 62 | no_turbo: Limits the driver to selecting P-State below the turbo |
39 | frequency range. | 63 | frequency range. |
40 | 64 | ||
41 | turbo_pct: displays the percentage of the total performance that | 65 | turbo_pct: Displays the percentage of the total performance that |
42 | is supported by hardware that is in the turbo range. This number | 66 | is supported by hardware that is in the turbo range. This number |
43 | is independent of whether turbo has been disabled or not. | 67 | is independent of whether turbo has been disabled or not. |
44 | 68 | ||
45 | num_pstates: displays the number of pstates that are supported | 69 | num_pstates: Displays the number of P-States that are supported |
46 | by hardware. This number is independent of whether turbo has | 70 | by hardware. This number is independent of whether turbo has |
47 | been disabled or not. | 71 | been disabled or not. |
48 | 72 | ||
73 | For example, if a system has these parameters: | ||
74 | Max 1 core turbo ratio: 0x21 (Max 1 core ratio is the maximum P-State) | ||
75 | Max non turbo ratio: 0x17 | ||
76 | Minimum ratio : 0x08 (Here the ratio is called max efficiency ratio) | ||
77 | |||
78 | Sysfs will show : | ||
79 | max_perf_pct:100, which corresponds to 1 core ratio | ||
80 | min_perf_pct:24, max_efficiency_ratio / max 1 Core ratio | ||
81 | no_turbo:0, turbo is not disabled | ||
82 | num_pstates:26 = (max 1 Core ratio - Max Efficiency Ratio + 1) | ||
83 | turbo_pct:39 = (max 1 core ratio - max non turbo ratio) / num_pstates | ||
84 | |||
85 | Refer to "Intel® 64 and IA-32 Architectures Software Developer’s Manual | ||
86 | Volume 3: System Programming Guide" to understand ratios. | ||
87 | |||
88 | cpufreq sysfs for Intel P-State | ||
89 | |||
90 | Since this driver registers with cpufreq, cpufreq sysfs is also presented. | ||
91 | There are some important differences, which need to be considered. | ||
92 | |||
93 | scaling_cur_freq: This displays the real frequency which was used during | ||
94 | the last sample period instead of what is requested. Some other cpufreq driver, | ||
95 | like acpi-cpufreq, displays what is requested (Some changes are on the | ||
96 | way to fix this for acpi-cpufreq driver). The same is true for frequencies | ||
97 | displayed at /proc/cpuinfo. | ||
98 | |||
99 | scaling_governor: This displays current active policy. Since each CPU has a | ||
100 | cpufreq sysfs, it is possible to set a scaling governor to each CPU. But this | ||
101 | is not possible with Intel P-States, as there is one common policy for all | ||
102 | CPUs. Here, the last requested policy will be applicable to all CPUs. It is | ||
103 | suggested that one use the cpupower utility to change policy to all CPUs at the | ||
104 | same time. | ||
105 | |||
106 | scaling_setspeed: This attribute can never be used with Intel P-State. | ||
107 | |||
108 | scaling_max_freq/scaling_min_freq: This interface can be used similarly to | ||
109 | the max_perf_pct/min_perf_pct of Intel P-State sysfs. However since frequencies | ||
110 | are converted to nearest possible P-State, this is prone to rounding errors. | ||
111 | This method is not preferred to limit performance. | ||
112 | |||
113 | affected_cpus: Not used | ||
114 | related_cpus: Not used | ||
115 | |||
49 | For contemporary Intel processors, the frequency is controlled by the | 116 | For contemporary Intel processors, the frequency is controlled by the |
50 | processor itself and the P-states exposed to software are related to | 117 | processor itself and the P-State exposed to software is related to |
51 | performance levels. The idea that frequency can be set to a single | 118 | performance levels. The idea that frequency can be set to a single |
52 | frequency is fiction for Intel Core processors. Even if the scaling | 119 | frequency is fictional for Intel Core processors. Even if the scaling |
53 | driver selects a single P state the actual frequency the processor | 120 | driver selects a single P-State, the actual frequency the processor |
54 | will run at is selected by the processor itself. | 121 | will run at is selected by the processor itself. |
55 | 122 | ||
56 | For legacy mode debugfs files have also been added to allow tuning of | 123 | Tuning Intel P-State driver |
57 | the internal governor algorythm. These files are located at | 124 | |
58 | /sys/kernel/debug/pstate_snb/ These files are NOT present in HWP mode. | 125 | When HWP mode is not used, debugfs files have also been added to allow the |
126 | tuning of the internal governor algorithm. These files are located at | ||
127 | /sys/kernel/debug/pstate_snb/. The algorithm uses a PID (Proportional | ||
128 | Integral Derivative) controller. The PID tunable parameters are: | ||
59 | 129 | ||
60 | deadband | 130 | deadband |
61 | d_gain_pct | 131 | d_gain_pct |
@@ -63,3 +133,90 @@ the internal governor algorythm. These files are located at | |||
63 | p_gain_pct | 133 | p_gain_pct |
64 | sample_rate_ms | 134 | sample_rate_ms |
65 | setpoint | 135 | setpoint |
136 | |||
137 | To adjust these parameters, some understanding of driver implementation is | ||
138 | necessary. There are some tweaks described here, but be very careful. Adjusting | ||
139 | them requires expert level understanding of power and performance relationship. | ||
140 | These limits are only useful when the "powersave" policy is active. | ||
141 | |||
142 | -To make the system more responsive to load changes, sample_rate_ms can | ||
143 | be adjusted (current default is 10ms). | ||
144 | -To make the system use higher performance, even if the load is lower, setpoint | ||
145 | can be adjusted to a lower number. This will also lead to faster ramp up time | ||
146 | to reach the maximum P-State. | ||
147 | If there are no derivative and integral coefficients, the next P-State will be | ||
148 | equal to: | ||
149 | current P-State - ((setpoint - current cpu load) * (p_gain_pct / 100)) | ||
150 | |||
151 | For example, if the current PID parameters are (which are the defaults for Core | ||
152 | processors like SandyBridge): | ||
153 | deadband = 0 | ||
154 | d_gain_pct = 0 | ||
155 | i_gain_pct = 0 | ||
156 | p_gain_pct = 20 | ||
157 | sample_rate_ms = 10 | ||
158 | setpoint = 97 | ||
159 | |||
160 | If the current P-State = 0x08 and current load = 100, this will result in the | ||
161 | next P-State = 0x08 - ((97 - 100) * 0.2) = 8.6 (rounded to 9). Here the P-State | ||
162 | goes up by only 1. If during the next sample interval the current load doesn't | ||
163 | change and is still 100, then the P-State goes up by one again. This process will | ||
164 | continue as long as the load is more than the setpoint until the maximum P-State | ||
165 | is reached. | ||
166 | |||
167 | For the same load at setpoint = 60, this will result in the next P-State | ||
168 | = 0x08 - ((60 - 100) * 0.2) = 16 | ||
169 | So by changing the setpoint from 97 to 60, there is an increase of the | ||
170 | next P-State from 9 to 16. So this will make processor execute at higher | ||
171 | P-State for the same CPU load. If the load continues to be more than the | ||
172 | setpoint during next sample intervals, then P-State will go up again till the | ||
173 | maximum P-State is reached. But the ramp up time to reach the maximum P-State | ||
174 | will be much faster when the setpoint is 60 compared to 97. | ||
175 | |||
176 | Debugging Intel P-State driver | ||
177 | |||
178 | Event tracing | ||
179 | To debug P-State transition, the Linux event tracing interface can be used. | ||
180 | There are two specific events, which can be enabled (Provided the kernel | ||
181 | configs related to event tracing are enabled). | ||
182 | |||
183 | # cd /sys/kernel/debug/tracing/ | ||
184 | # echo 1 > events/power/pstate_sample/enable | ||
185 | # echo 1 > events/power/cpu_frequency/enable | ||
186 | # cat trace | ||
187 | gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 | ||
188 | scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 | ||
189 | freq=2474476 | ||
190 | cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2 | ||
191 | |||
192 | |||
193 | Using ftrace | ||
194 | |||
195 | If function level tracing is required, the Linux ftrace interface can be used. | ||
196 | For example if we want to check how often a function to set a P-State is | ||
197 | called, we can set ftrace filter to intel_pstate_set_pstate. | ||
198 | |||
199 | # cd /sys/kernel/debug/tracing/ | ||
200 | # cat available_filter_functions | grep -i pstate | ||
201 | intel_pstate_set_pstate | ||
202 | intel_pstate_cpu_init | ||
203 | ... | ||
204 | |||
205 | # echo intel_pstate_set_pstate > set_ftrace_filter | ||
206 | # echo function > current_tracer | ||
207 | # cat trace | head -15 | ||
208 | # tracer: function | ||
209 | # | ||
210 | # entries-in-buffer/entries-written: 80/80 #P:4 | ||
211 | # | ||
212 | # _-----=> irqs-off | ||
213 | # / _----=> need-resched | ||
214 | # | / _---=> hardirq/softirq | ||
215 | # || / _--=> preempt-depth | ||
216 | # ||| / delay | ||
217 | # TASK-PID CPU# |||| TIMESTAMP FUNCTION | ||
218 | # | | | |||| | | | ||
219 | Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func | ||
220 | gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func | ||
221 | gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func | ||
222 | <idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func | ||
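
The arithmetic in the patched text (the sysfs example values and the proportional-only next P-State formula) can be checked with a short script. This is a sketch in Python, not the driver's code: the function names, the rounding choices (truncation for min_perf_pct, rounding up for turbo_pct, nearest integer for the next P-State) and the clamping to a minimum and maximum P-State are assumptions chosen to reproduce the figures quoted in the document. The ramp-up count reuses the ratios 0x08 and 0x21 from the sysfs example as hypothetical limits.

```python
def sysfs_example(max_turbo_ratio, max_non_turbo_ratio, min_ratio):
    """Recompute the sysfs example values from the three ratios."""
    num_pstates = max_turbo_ratio - min_ratio + 1
    # Assumption: min_perf_pct is truncated (24.24 -> 24).
    min_perf_pct = (min_ratio * 100) // max_turbo_ratio
    # Assumption: turbo_pct is rounded up (38.46 -> 39), as the example implies.
    turbo_range = (max_turbo_ratio - max_non_turbo_ratio) * 100
    turbo_pct = (turbo_range + num_pstates - 1) // num_pstates
    return {
        "min_perf_pct": min_perf_pct,
        "num_pstates": num_pstates,
        "turbo_pct": turbo_pct,
    }


def next_pstate(current, load, setpoint, p_gain_pct, min_pstate, max_pstate):
    """Proportional-only step: the derivative and integral gains are zero."""
    raw = current - (setpoint - load) * p_gain_pct / 100
    # Assumption: round to the nearest P-State, then clamp to the limits.
    return max(min_pstate, min(max_pstate, round(raw)))


def samples_to_max(start, load, setpoint, p_gain_pct, min_pstate, max_pstate):
    """Count sample intervals needed to reach max_pstate at a constant load."""
    pstate, samples = start, 0
    while pstate < max_pstate:
        nxt = next_pstate(pstate, load, setpoint, p_gain_pct,
                          min_pstate, max_pstate)
        if nxt <= pstate:
            raise ValueError("load does not exceed setpoint; no ramp-up")
        pstate = nxt
        samples += 1
    return samples


if __name__ == "__main__":
    print(sysfs_example(0x21, 0x17, 0x08))
    for setpoint in (97, 60):
        step = next_pstate(0x08, 100, setpoint, 20, 0x08, 0x21)
        ramp = samples_to_max(0x08, 100, setpoint, 20, 0x08, 0x21)
        print(f"setpoint={setpoint}: next P-State={step}, samples to max={ramp}")
```

Under these assumptions, a constant load of 100 takes 25 sample intervals to climb from 0x08 to 0x21 with setpoint = 97 and 4 intervals with setpoint = 60, which is consistent with the faster ramp-up the document describes.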