Diffstat (limited to 'Documentation/arm/cluster-pm-race-avoidance.txt')
| -rw-r--r-- | Documentation/arm/cluster-pm-race-avoidance.txt | 498 |
1 files changed, 498 insertions, 0 deletions
diff --git a/Documentation/arm/cluster-pm-race-avoidance.txt b/Documentation/arm/cluster-pm-race-avoidance.txt new file mode 100644 index 000000000000..750b6fc24af9 --- /dev/null +++ b/Documentation/arm/cluster-pm-race-avoidance.txt | |||
| @@ -0,0 +1,498 @@ | |||
| 1 | Cluster-wide Power-up/power-down race avoidance algorithm | ||
| 2 | ========================================================= | ||
| 3 | |||
| 4 | This file documents the algorithm which is used to coordinate CPU and | ||
| 5 | cluster setup and teardown operations and to manage hardware coherency | ||
| 6 | controls safely. | ||
| 7 | |||
| 8 | The section "Rationale" explains what the algorithm is for and why it is | ||
| 9 | needed. "Basic model" explains general concepts using a simplified view | ||
| 10 | of the system. The other sections explain the actual details of the | ||
| 11 | algorithm in use. | ||
| 12 | |||
| 13 | |||
| 14 | Rationale | ||
| 15 | --------- | ||
| 16 | |||
| 17 | In a system containing multiple CPUs, it is desirable to have the | ||
| 18 | ability to turn off individual CPUs when the system is idle, reducing | ||
| 19 | power consumption and thermal dissipation. | ||
| 20 | |||
| 21 | In a system containing multiple clusters of CPUs, it is also desirable | ||
| 22 | to have the ability to turn off entire clusters. | ||
| 23 | |||
| 24 | Turning entire clusters off and on is a risky business, because it | ||
| 25 | involves performing potentially destructive operations affecting a group | ||
| 26 | of independently running CPUs, while the OS continues to run. This | ||
| 27 | means that we need some coordination in order to ensure that critical | ||
| 28 | cluster-level operations are only performed when it is truly safe to do | ||
| 29 | so. | ||
| 30 | |||
| 31 | Simple locking may not be sufficient to solve this problem, because | ||
| 32 | mechanisms like Linux spinlocks may rely on coherency mechanisms which | ||
| 33 | are not immediately enabled when a cluster powers up. Since enabling or | ||
| 34 | disabling those mechanisms may itself be a non-atomic operation (such as | ||
| 35 | writing some hardware registers and invalidating large caches), other | ||
| 36 | methods of coordination are required in order to guarantee safe | ||
| 37 | power-down and power-up at the cluster level. | ||
| 38 | |||
| 39 | The mechanism presented in this document describes a coherent memory | ||
| 40 | based protocol for performing the needed coordination. It aims to be as | ||
| 41 | lightweight as possible, while providing the required safety properties. | ||
| 42 | |||
| 43 | |||
| 44 | Basic model | ||
| 45 | ----------- | ||
| 46 | |||
| 47 | Each cluster and CPU is assigned a state, as follows: | ||
| 48 | |||
| 49 | DOWN | ||
| 50 | COMING_UP | ||
| 51 | UP | ||
| 52 | GOING_DOWN | ||
| 53 | |||
| 54 |       +---------> UP ----------+ | ||
| 55 |       |                        v | ||
| 56 | |||
| 57 | COMING_UP                GOING_DOWN | ||
| 58 | |||
| 59 |       ^                        | | ||
| 60 |       +--------- DOWN <--------+ | ||
| 61 | |||
| 62 | |||
| 63 | DOWN: The CPU or cluster is not coherent, and is either powered off or | ||
| 64 | suspended, or is ready to be powered off or suspended. | ||
| 65 | |||
| 66 | COMING_UP: The CPU or cluster has committed to moving to the UP state. | ||
| 67 | It may be part way through the process of initialisation and | ||
| 68 | enabling coherency. | ||
| 69 | |||
| 70 | UP: The CPU or cluster is active and coherent at the hardware | ||
| 71 | level. A CPU in this state is not necessarily being used | ||
| 72 | actively by the kernel. | ||
| 73 | |||
| 74 | GOING_DOWN: The CPU or cluster has committed to moving to the DOWN | ||
| 75 | state. It may be part way through the process of teardown and | ||
| 76 | coherency exit. | ||
| 77 | |||
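The four-state cycle of the basic model can be written down as a small
C sketch.  The names pm_state, pm_next_state() and pm_transition_ok()
are hypothetical and exist only for illustration; they do not appear in
the kernel sources.

```c
#include <assert.h>
#include <stdbool.h>

/* The four states of the basic model. */
enum pm_state { DOWN, COMING_UP, UP, GOING_DOWN };

/* Return the only successor that the basic model's cycle allows. */
static enum pm_state pm_next_state(enum pm_state s)
{
	switch (s) {
	case DOWN:       return COMING_UP;
	case COMING_UP:  return UP;
	case UP:         return GOING_DOWN;
	case GOING_DOWN: return DOWN;
	}
	return s;	/* not reached for valid states */
}

/* A transition is legal only if it advances the cycle by one step. */
static bool pm_transition_ok(enum pm_state from, enum pm_state to)
{
	return pm_next_state(from) == to;
}
```

In this simplified view, pm_transition_ok(UP, DOWN) is false: an entity
must pass through GOING_DOWN before it can be considered DOWN.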
| 78 | |||
| 79 | Each CPU has one of these states assigned to it at any point in time. | ||
| 80 | The CPU states are described in the "CPU state" section, below. | ||
| 81 | |||
| 82 | Each cluster is also assigned a state, but it is necessary to split the | ||
| 83 | state value into two parts (the "cluster" state and "inbound" state) and | ||
| 84 | to introduce additional states in order to avoid races between different | ||
| 85 | CPUs in the cluster simultaneously modifying the state. The cluster- | ||
| 86 | level states are described in the "Cluster state" section. | ||
| 87 | |||
| 88 | To help distinguish the CPU states from cluster states in this | ||
| 89 | discussion, the state names are given a CPU_ prefix for the CPU states, | ||
| 90 | and a CLUSTER_ or INBOUND_ prefix for the cluster states. | ||
| 91 | |||
| 92 | |||
| 93 | CPU state | ||
| 94 | --------- | ||
| 95 | |||
| 96 | In this algorithm, each individual core in a multi-core processor is | ||
| 97 | referred to as a "CPU". CPUs are assumed to be single-threaded: | ||
| 98 | therefore, a CPU can only be doing one thing at a single point in time. | ||
| 99 | |||
| 100 | This means that CPUs fit the basic model closely. | ||
| 101 | |||
| 102 | The algorithm defines the following states for each CPU in the system: | ||
| 103 | |||
| 104 | CPU_DOWN | ||
| 105 | CPU_COMING_UP | ||
| 106 | CPU_UP | ||
| 107 | CPU_GOING_DOWN | ||
| 108 | |||
| 109 |  cluster setup and | ||
| 110 | CPU setup complete          policy decision | ||
| 111 |       +-----------> CPU_UP ------------+ | ||
| 112 |       |                                v | ||
| 113 | |||
| 114 | CPU_COMING_UP                   CPU_GOING_DOWN | ||
| 115 | |||
| 116 |       ^                                | | ||
| 117 |       +----------- CPU_DOWN <----------+ | ||
| 118 |  policy decision           CPU teardown complete | ||
| 119 | or hardware event | ||
| 120 | |||
| 121 | |||
| 122 | The definitions of the four states correspond closely to the states of | ||
| 123 | the basic model. | ||
| 124 | |||
| 125 | Transitions between states occur as follows. | ||
| 126 | |||
| 127 | A trigger event marked "(spontaneous)" means that the CPU can transition | ||
| 128 | to the next state as a result of making local progress only, with no | ||
| 129 | requirement for any external event to happen. | ||
| 130 | |||
| 131 | |||
| 132 | CPU_DOWN: | ||
| 133 | |||
| 134 | A CPU reaches the CPU_DOWN state when it is ready for | ||
| 135 | power-down. On reaching this state, the CPU will typically | ||
| 136 | power itself down or suspend itself, via a WFI instruction or a | ||
| 137 | firmware call. | ||
| 138 | |||
| 139 | Next state: CPU_COMING_UP | ||
| 140 | Conditions: none | ||
| 141 | |||
| 142 | Trigger events: | ||
| 143 | |||
| 144 | a) an explicit hardware power-up operation, resulting | ||
| 145 | from a policy decision on another CPU; | ||
| 146 | |||
| 147 | b) a hardware event, such as an interrupt. | ||
| 148 | |||
| 149 | |||
| 150 | CPU_COMING_UP: | ||
| 151 | |||
| 152 | A CPU cannot start participating in hardware coherency until the | ||
| 153 | cluster is set up and coherent. If the cluster is not ready, | ||
| 154 | then the CPU will wait in the CPU_COMING_UP state until the | ||
| 155 | cluster has been set up. | ||
| 156 | |||
| 157 | Next state: CPU_UP | ||
| 158 | Conditions: The CPU's parent cluster must be in CLUSTER_UP. | ||
| 159 | Trigger events: Transition of the parent cluster to CLUSTER_UP. | ||
| 160 | |||
| 161 | Refer to the "Cluster state" section for a description of the | ||
| 162 | CLUSTER_UP state. | ||
| 163 | |||
| 164 | |||
| 165 | CPU_UP: | ||
| 166 | When a CPU reaches the CPU_UP state, it is safe for the CPU to | ||
| 167 | start participating in local coherency. | ||
| 168 | |||
| 169 | This is done by jumping to the kernel's CPU resume code. | ||
| 170 | |||
| 171 | Note that the definition of this state is slightly different | ||
| 172 | from the basic model definition: CPU_UP does not mean that the | ||
| 173 | CPU is coherent yet, but it does mean that it is safe to resume | ||
| 174 | the kernel. The kernel handles the rest of the resume | ||
| 175 | procedure, so the remaining steps are not visible as part of the | ||
| 176 | race avoidance algorithm. | ||
| 177 | |||
| 178 | The CPU remains in this state until an explicit policy decision | ||
| 179 | is made to shut down or suspend the CPU. | ||
| 180 | |||
| 181 | Next state: CPU_GOING_DOWN | ||
| 182 | Conditions: none | ||
| 183 | Trigger events: explicit policy decision | ||
| 184 | |||
| 185 | |||
| 186 | CPU_GOING_DOWN: | ||
| 187 | |||
| 188 | While in this state, the CPU exits coherency, including any | ||
| 189 | operations required to achieve this (such as cleaning data | ||
| 190 | caches). | ||
| 191 | |||
| 192 | Next state: CPU_DOWN | ||
| 193 | Conditions: local CPU teardown complete | ||
| 194 | Trigger events: (spontaneous) | ||
| 195 | |||
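The CPU-level rules above can be summarised as a single step function.
This is a minimal sketch, not the kernel's code: the cpu_event
enumeration and cpu_step() are hypothetical names invented for
illustration, and the real transitions are spread across mcpm_head.S
and mcpm_entry.c.

```c
#include <assert.h>

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };
enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };

/* Illustrative trigger events (hypothetical names). */
enum cpu_event {
	EV_NONE,		/* nothing happened */
	EV_WAKE,		/* explicit power-up or hardware event */
	EV_POLICY_DOWN,		/* explicit policy decision to shut down */
	EV_TEARDOWN_DONE,	/* local CPU teardown complete */
};

/* Return the CPU's next state given an event and the parent cluster. */
static enum cpu_state cpu_step(enum cpu_state s, enum cpu_event ev,
			       enum cluster_state parent)
{
	switch (s) {
	case CPU_DOWN:
		return ev == EV_WAKE ? CPU_COMING_UP : s;
	case CPU_COMING_UP:
		/* Wait here until the parent cluster is CLUSTER_UP. */
		return parent == CLUSTER_UP ? CPU_UP : s;
	case CPU_UP:
		return ev == EV_POLICY_DOWN ? CPU_GOING_DOWN : s;
	case CPU_GOING_DOWN:
		return ev == EV_TEARDOWN_DONE ? CPU_DOWN : s;
	}
	return s;	/* not reached for valid states */
}
```

Note that only the CPU_COMING_UP case consults the cluster state; every
other CPU transition depends on local progress or on a trigger event.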
| 196 | |||
| 197 | Cluster state | ||
| 198 | ------------- | ||
| 199 | |||
| 200 | A cluster is a group of connected CPUs with some common resources. | ||
| 201 | Because a cluster contains multiple CPUs, it can be doing multiple | ||
| 202 | things at the same time. This has some implications. In particular, a | ||
| 203 | CPU can start up while another CPU is tearing the cluster down. | ||
| 204 | |||
| 205 | In this discussion, the "outbound side" is the view of the cluster state | ||
| 206 | as seen by a CPU tearing the cluster down. The "inbound side" is the | ||
| 207 | view of the cluster state as seen by a CPU setting the cluster up. | ||
| 208 | |||
| 209 | In order to enable safe coordination in such situations, it is important | ||
| 210 | that a CPU which is setting up the cluster can advertise its state | ||
| 211 | independently of the CPU which is tearing down the cluster. For this | ||
| 212 | reason, the cluster state is split into two parts: | ||
| 213 | |||
| 214 | "cluster" state: The global state of the cluster; or the state | ||
| 215 | on the outbound side: | ||
| 216 | |||
| 217 | CLUSTER_DOWN | ||
| 218 | CLUSTER_UP | ||
| 219 | CLUSTER_GOING_DOWN | ||
| 220 | |||
| 221 | "inbound" state: The state of the cluster on the inbound side. | ||
| 222 | |||
| 223 | INBOUND_NOT_COMING_UP | ||
| 224 | INBOUND_COMING_UP | ||
| 225 | |||
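One way to picture the split is as two separate fields, each normally
written by one side only.  The structure below is a sketch under that
assumption; cluster_sync, outbound_set() and inbound_set() are
hypothetical names, and the cache maintenance and barriers that a real
implementation needs are deliberately left out.

```c
#include <assert.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

/* The cluster-level state, split so that each side owns one field. */
struct cluster_sync {
	volatile enum cluster_state cluster;	/* outbound side */
	volatile enum inbound_state inbound;	/* inbound side */
};

/* The outbound CPU advertises its progress in "cluster" only. */
static void outbound_set(struct cluster_sync *c, enum cluster_state s)
{
	c->cluster = s;
}

/* The inbound CPU advertises its progress in "inbound" only. */
static void inbound_set(struct cluster_sync *c, enum inbound_state s)
{
	c->inbound = s;
}

/* Example instance: a cluster that is up and idle. */
static struct cluster_sync demo = { CLUSTER_UP, INBOUND_NOT_COMING_UP };
```

Because neither helper touches the other side's field, an inbound CPU
can announce INBOUND_COMING_UP while the outbound CPU is part way
through teardown without either update overwriting the other.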
| 226 | |||
| 227 | The different pairings of these states result in six possible | ||
| 228 | states for the cluster as a whole: | ||
| 229 | |||
| 230 |                          CLUSTER_UP | ||
| 231 |         +==========> INBOUND_NOT_COMING_UP -------------+ | ||
| 232 |         #                                               | | ||
| 233 |                                                         | | ||
| 234 |    CLUSTER_UP     <----+                                | | ||
| 235 | INBOUND_COMING_UP      |                                v | ||
| 236 | |||
| 237 |         ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN | ||
| 238 |         #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP | ||
| 239 | |||
| 240 |   CLUSTER_DOWN         |                                | | ||
| 241 | INBOUND_COMING_UP <----+                                | | ||
| 242 |                                                         | | ||
| 243 |         ^                                               | | ||
| 244 |         +===========     CLUSTER_DOWN      <------------+ | ||
| 245 |                      INBOUND_NOT_COMING_UP | ||
| 246 | |||
| 247 | Transitions -----> can only be made by the outbound CPU, and | ||
| 248 | only involve changes to the "cluster" state. | ||
| 249 | |||
| 250 | Transitions ===##> can only be made by the inbound CPU, and only | ||
| 251 | involve changes to the "inbound" state, except where there is no | ||
| 252 | further transition possible on the outbound side (i.e., the | ||
| 253 | outbound CPU has put the cluster into the CLUSTER_DOWN state). | ||
| 254 | |||
| 255 | The race avoidance algorithm does not provide a way to determine | ||
| 256 | which exact CPUs within the cluster play these roles. This must | ||
| 257 | be decided in advance by some other means. Refer to the section | ||
| 258 | "Last man and First man selection" for more explanation. | ||
| 259 | |||
| 260 | |||
| 261 | CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the | ||
| 262 | cluster can actually be powered down. | ||
| 263 | |||
| 264 | The parallelism of the inbound and outbound CPUs is observed by | ||
| 265 | the existence of two different paths from CLUSTER_GOING_DOWN/ | ||
| 266 | INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic | ||
| 267 | model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to | ||
| 268 | COMING_UP in the basic model). The second path avoids cluster | ||
| 269 | teardown completely. | ||
| 270 | |||
| 271 | CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic | ||
| 272 | model. The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP | ||
| 273 | is trivial and merely resets the state machine ready for the | ||
| 274 | next cycle. | ||
| 275 | |||
| 276 | Details of the allowable transitions follow. | ||
| 277 | |||
| 278 | The next state in each case is notated | ||
| 279 | |||
| 280 | <cluster state>/<inbound state> (<transitioner>) | ||
| 281 | |||
| 282 | where the <transitioner> is the side on which the transition | ||
| 283 | can occur; either the inbound or the outbound side. | ||
| 284 | |||
| 285 | |||
| 286 | CLUSTER_DOWN/INBOUND_NOT_COMING_UP: | ||
| 287 | |||
| 288 | Next state: CLUSTER_DOWN/INBOUND_COMING_UP (inbound) | ||
| 289 | Conditions: none | ||
| 290 | Trigger events: | ||
| 291 | |||
| 292 | a) an explicit hardware power-up operation, resulting | ||
| 293 | from a policy decision on another CPU; | ||
| 294 | |||
| 295 | b) a hardware event, such as an interrupt. | ||
| 296 | |||
| 297 | |||
| 298 | CLUSTER_DOWN/INBOUND_COMING_UP: | ||
| 299 | |||
| 300 | In this state, an inbound CPU sets up the cluster, including | ||
| 301 | enabling of hardware coherency at the cluster level and any | ||
| 302 | other operations (such as cache invalidation) which are required | ||
| 303 | in order to achieve this. | ||
| 304 | |||
| 305 | The purpose of this state is to do sufficient cluster-level | ||
| 306 | setup to enable other CPUs in the cluster to enter coherency | ||
| 307 | safely. | ||
| 308 | |||
| 309 | Next state: CLUSTER_UP/INBOUND_COMING_UP (inbound) | ||
| 310 | Conditions: cluster-level setup and hardware coherency complete | ||
| 311 | Trigger events: (spontaneous) | ||
| 312 | |||
| 313 | |||
| 314 | CLUSTER_UP/INBOUND_COMING_UP: | ||
| 315 | |||
| 316 | Cluster-level setup is complete and hardware coherency is | ||
| 317 | enabled for the cluster. Other CPUs in the cluster can safely | ||
| 318 | enter coherency. | ||
| 319 | |||
| 320 | This is a transient state, leading immediately to | ||
| 321 | CLUSTER_UP/INBOUND_NOT_COMING_UP. All other CPUs on the cluster | ||
| 322 | should treat these two states as equivalent. | ||
| 323 | |||
| 324 | Next state: CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound) | ||
| 325 | Conditions: none | ||
| 326 | Trigger events: (spontaneous) | ||
| 327 | |||
| 328 | |||
| 329 | CLUSTER_UP/INBOUND_NOT_COMING_UP: | ||
| 330 | |||
| 331 | Cluster-level setup is complete and hardware coherency is | ||
| 332 | enabled for the cluster. Other CPUs in the cluster can safely | ||
| 333 | enter coherency. | ||
| 334 | |||
| 335 | The cluster will remain in this state until a policy decision is | ||
| 336 | made to power the cluster down. | ||
| 337 | |||
| 338 | Next state: CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound) | ||
| 339 | Conditions: none | ||
| 340 | Trigger events: policy decision to power down the cluster | ||
| 341 | |||
| 342 | |||
| 343 | CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP: | ||
| 344 | |||
| 345 | An outbound CPU is tearing the cluster down. The selected CPU | ||
| 346 | must wait in this state until all CPUs in the cluster are in the | ||
| 347 | CPU_DOWN state. | ||
| 348 | |||
| 349 | When all CPUs are in the CPU_DOWN state, the cluster can be torn | ||
| 350 | down, for example by cleaning data caches and exiting | ||
| 351 | cluster-level coherency. | ||
| 352 | |||
| 353 | To avoid unnecessary teardown operations, the outbound CPU | ||
| 354 | should check the inbound cluster state for asynchronous | ||
| 355 | transitions to INBOUND_COMING_UP. Alternatively, individual | ||
| 356 | CPUs can be checked for entry into CPU_COMING_UP or CPU_UP. | ||
| 357 | |||
| 358 | |||
| 359 | Next states: | ||
| 360 | |||
| 361 | CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound) | ||
| 362 | Conditions: cluster torn down and ready to power off | ||
| 363 | Trigger events: (spontaneous) | ||
| 364 | |||
| 365 | CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound) | ||
| 366 | Conditions: none | ||
| 367 | Trigger events: | ||
| 368 | |||
| 369 | a) an explicit hardware power-up operation, | ||
| 370 | resulting from a policy decision on another | ||
| 371 | CPU; | ||
| 372 | |||
| 373 | b) a hardware event, such as an interrupt. | ||
| 374 | |||
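The checks that the outbound CPU makes while waiting in
CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP can be sketched as a
non-blocking poll over a snapshot of the shared state.  Everything here
is an assumption made for illustration: the names cluster_view,
outbound_poll() and CPUS_PER_CLUSTER are hypothetical, the outbound CPU
is assumed to skip its own entry, and the cache maintenance and
barriers required before coherency is available are omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define CPUS_PER_CLUSTER 4	/* illustrative cluster size */

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

enum outbound_verdict {
	OUTBOUND_WAIT,		/* some CPU is still going down */
	OUTBOUND_PROCEED,	/* safe to tear the cluster down */
	OUTBOUND_ABORT,		/* somebody is coming up: back out */
};

/* A snapshot of the state that the outbound CPU inspects. */
struct cluster_view {
	enum inbound_state inbound;
	enum cpu_state cpus[CPUS_PER_CLUSTER];
};

static enum outbound_verdict outbound_poll(const struct cluster_view *v,
					   unsigned int this_cpu)
{
	bool all_down = true;
	unsigned int i;

	/* An inbound CPU has started cluster setup: avoid teardown. */
	if (v->inbound == INBOUND_COMING_UP)
		return OUTBOUND_ABORT;

	for (i = 0; i < CPUS_PER_CLUSTER; i++) {
		if (i == this_cpu)
			continue;	/* the outbound CPU itself */
		switch (v->cpus[i]) {
		case CPU_DOWN:
			break;
		case CPU_GOING_DOWN:
			all_down = false;	/* keep waiting */
			break;
		default:	/* CPU_COMING_UP or CPU_UP */
			return OUTBOUND_ABORT;
		}
	}
	return all_down ? OUTBOUND_PROCEED : OUTBOUND_WAIT;
}
```

A caller would repeat the poll while it returns OUTBOUND_WAIT.  The
sketch performs both the inbound-state check and the per-CPU check
described above, although the text presents them as alternatives.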
| 375 | |||
| 376 | CLUSTER_GOING_DOWN/INBOUND_COMING_UP: | ||
| 377 | |||
| 378 | The cluster is (or was) being torn down, but another CPU has | ||
| 379 | come online in the meantime and is trying to set up the cluster | ||
| 380 | again. | ||
| 381 | |||
| 382 | If the outbound CPU observes this state, it has two choices: | ||
| 383 | |||
| 384 | a) back out of teardown, restoring the cluster to the | ||
| 385 | CLUSTER_UP state; | ||
| 386 | |||
| 387 | b) finish tearing the cluster down and put the cluster | ||
| 388 | in the CLUSTER_DOWN state; the inbound CPU will | ||
| 389 | set up the cluster again from there. | ||
| 390 | |||
| 391 | Choice (a) permits the removal of some latency by avoiding | ||
| 392 | unnecessary teardown and setup operations in situations where | ||
| 393 | the cluster is not really going to be powered down. | ||
| 394 | |||
| 395 | |||
| 396 | Next states: | ||
| 397 | |||
| 398 | CLUSTER_UP/INBOUND_COMING_UP (outbound) | ||
| 399 | Conditions: cluster-level setup and hardware | ||
| 400 | coherency complete | ||
| 401 | Trigger events: (spontaneous) | ||
| 402 | |||
| 403 | CLUSTER_DOWN/INBOUND_COMING_UP (outbound) | ||
| 404 | Conditions: cluster torn down and ready to power off | ||
| 405 | Trigger events: (spontaneous) | ||
| 406 | |||
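The cluster-level transitions listed in this section can be collected
into one table, recording for each transition which side is allowed to
make it.  This is only a sketch for checking one's understanding of the
rules; the cluster_transition structure and cluster_transition_ok() are
hypothetical and have no counterpart in the kernel sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };
enum side { INBOUND, OUTBOUND };

struct cluster_transition {
	enum cluster_state from_c;
	enum inbound_state from_i;
	enum cluster_state to_c;
	enum inbound_state to_i;
	enum side by;		/* the <transitioner> */
};

/* Every transition documented in the "Cluster state" section. */
static const struct cluster_transition cluster_transitions[] = {
	{ CLUSTER_DOWN, INBOUND_NOT_COMING_UP,
	  CLUSTER_DOWN, INBOUND_COMING_UP, INBOUND },
	{ CLUSTER_DOWN, INBOUND_COMING_UP,
	  CLUSTER_UP, INBOUND_COMING_UP, INBOUND },
	{ CLUSTER_UP, INBOUND_COMING_UP,
	  CLUSTER_UP, INBOUND_NOT_COMING_UP, INBOUND },
	{ CLUSTER_UP, INBOUND_NOT_COMING_UP,
	  CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP, OUTBOUND },
	{ CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP,
	  CLUSTER_DOWN, INBOUND_NOT_COMING_UP, OUTBOUND },
	{ CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP,
	  CLUSTER_GOING_DOWN, INBOUND_COMING_UP, INBOUND },
	{ CLUSTER_GOING_DOWN, INBOUND_COMING_UP,
	  CLUSTER_UP, INBOUND_COMING_UP, OUTBOUND },
	{ CLUSTER_GOING_DOWN, INBOUND_COMING_UP,
	  CLUSTER_DOWN, INBOUND_COMING_UP, OUTBOUND },
};

/* Return true if the given side may make the given transition. */
static bool cluster_transition_ok(enum cluster_state fc,
				  enum inbound_state fi,
				  enum cluster_state tc,
				  enum inbound_state ti, enum side by)
{
	size_t n = sizeof(cluster_transitions) /
		   sizeof(cluster_transitions[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		const struct cluster_transition *t = &cluster_transitions[i];

		if (t->from_c == fc && t->from_i == fi &&
		    t->to_c == tc && t->to_i == ti && t->by == by)
			return true;
	}
	return false;
}
```

The table makes the ownership rule visible: every OUTBOUND entry leaves
the inbound field unchanged, and the only INBOUND entry that changes
the cluster field starts from CLUSTER_DOWN.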
| 407 | |||
| 408 | Last man and First man selection | ||
| 409 | -------------------------------- | ||
| 410 | |||
| 411 | The CPU which performs cluster tear-down operations on the outbound side | ||
| 412 | is commonly referred to as the "last man". | ||
| 413 | |||
| 414 | The CPU which performs cluster setup on the inbound side is commonly | ||
| 415 | referred to as the "first man". | ||
| 416 | |||
| 417 | The race avoidance algorithm documented above does not provide a | ||
| 418 | mechanism to choose which CPUs should play these roles. | ||
| 419 | |||
| 420 | |||
| 421 | Last man: | ||
| 422 | |||
| 423 | When shutting down the cluster, all the CPUs involved are initially | ||
| 424 | executing Linux and hence coherent. Therefore, ordinary spinlocks can | ||
| 425 | be used to select a last man safely, before the CPUs become | ||
| 426 | non-coherent. | ||
| 427 | |||
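A minimal sketch of last man selection follows, assuming a per-cluster
count of CPUs still in use that is protected by a lock.  A C11
atomic_flag spinlock stands in for the kernel spinlock so that the
example is self-contained; cluster_usage and cluster_put_cpu() are
hypothetical names, and this is not the code used by any particular
platform back-end.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Per-cluster bookkeeping, valid only while the CPUs are coherent. */
struct cluster_usage {
	atomic_flag lock;		/* stand-in for a kernel spinlock */
	unsigned int cpus_in_use;	/* CPUs that have not yet left */
};

static void usage_lock(struct cluster_usage *u)
{
	while (atomic_flag_test_and_set_explicit(&u->lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void usage_unlock(struct cluster_usage *u)
{
	atomic_flag_clear_explicit(&u->lock, memory_order_release);
}

/*
 * Called by a CPU that has decided to power down.  The caller must
 * currently be counted in cpus_in_use.  Returns true if the caller
 * is the last man and should tear the cluster down.
 */
static bool cluster_put_cpu(struct cluster_usage *u)
{
	bool last_man;

	usage_lock(u);
	last_man = (--u->cpus_in_use == 0);
	usage_unlock(u);
	return last_man;
}

/* Example: a cluster with two CPUs currently in use. */
static struct cluster_usage demo_cluster = { ATOMIC_FLAG_INIT, 2 };
```

With two CPUs in use, the first call to cluster_put_cpu() returns false
and the second returns true.  Being chosen as last man does not make
teardown safe on its own: the race avoidance algorithm must still be
followed, because another CPU may be powering up in the meantime.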
| 428 | |||
| 429 | First man: | ||
| 430 | |||
| 431 | Because CPUs may power up asynchronously in response to external wake-up | ||
| 432 | events, a dynamic mechanism is needed to make sure that only one CPU | ||
| 433 | attempts to play the first man role and do the cluster-level | ||
| 434 | initialisation: any other CPUs must wait for this to complete before | ||
| 435 | proceeding. | ||
| 436 | |||
| 437 | Cluster-level initialisation may involve actions such as configuring | ||
| 438 | coherency controls in the bus fabric. | ||
| 439 | |||
| 440 | The current implementation in mcpm_head.S uses a separate mutual exclusion | ||
| 441 | mechanism to do this arbitration. This mechanism is documented in | ||
| 442 | detail in vlocks.txt. | ||
| 443 | |||
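The idea of arbitrating by voting can be illustrated with the
simplified sketch below.  It is not the mcpm_head.S implementation and
should not be read as a restatement of vlocks.txt, which remains the
authoritative description.  The names first_man_trylock(),
first_man_unlock(), NR_CPUS and NO_VOTE are hypothetical, and the
memory barriers and cache maintenance that a non-coherent CPU would
need are omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* illustrative number of CPUs in the cluster */
#define NO_VOTE (-1)

static volatile int currently_voting[NR_CPUS];	/* 1 while a CPU votes */
static volatile int last_vote = NO_VOTE;	/* most recent candidate */

/* Returns true if this_cpu won the vote and is the first man. */
static bool first_man_trylock(int this_cpu)
{
	int i;

	/* Signal the intention to vote. */
	currently_voting[this_cpu] = 1;
	if (last_vote != NO_VOTE) {
		/* Somebody has already volunteered: stand down. */
		currently_voting[this_cpu] = 0;
		return false;
	}

	/* Propose this CPU as the candidate. */
	last_vote = this_cpu;
	currently_voting[this_cpu] = 0;

	/* Wait until every CPU that started voting has finished. */
	for (i = 0; i < NR_CPUS; i++)
		while (currently_voting[i])
			;	/* spin */

	/* The last vote written wins. */
	return last_vote == this_cpu;
}

/* Called by the winner once cluster-level setup is complete. */
static void first_man_unlock(void)
{
	last_vote = NO_VOTE;
}
```

A scheme of this shape uses only plain loads and stores, which matters
here because the usual exclusive-access primitives depend on coherency
that may not be enabled yet.  Losing CPUs would wait for the cluster to
reach CLUSTER_UP rather than retry the vote.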
| 444 | |||
| 445 | Features and Limitations | ||
| 446 | ------------------------ | ||
| 447 | |||
| 448 | Implementation: | ||
| 449 | |||
| 450 | The current ARM-based implementation is split between | ||
| 451 | arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and | ||
| 452 | arch/arm/common/mcpm_entry.c (everything else): | ||
| 453 | |||
| 454 | __mcpm_cpu_going_down() signals the transition of a CPU to the | ||
| 455 | CPU_GOING_DOWN state. | ||
| 456 | |||
| 457 | __mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN | ||
| 458 | state. | ||
| 459 | |||
| 460 | A CPU transitions to CPU_COMING_UP and then to CPU_UP via the | ||
| 461 | low-level power-up code in mcpm_head.S. This could | ||
| 462 | involve CPU-specific setup code, but in the current | ||
| 463 | implementation it does not. | ||
| 464 | |||
| 465 | __mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical() | ||
| 466 | handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN | ||
| 467 | and from there to CLUSTER_DOWN or back to CLUSTER_UP (in | ||
| 468 | the case of an aborted cluster power-down). | ||
| 469 | |||
| 470 | These functions are more complex than the __mcpm_cpu_*() | ||
| 471 | functions due to the extra inter-CPU coordination which | ||
| 472 | is needed for safe transitions at the cluster level. | ||
| 473 | |||
| 474 | A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via | ||
| 475 | the low-level power-up code in mcpm_head.S. This | ||
| 476 | typically involves platform-specific setup code, | ||
| 477 | provided by the platform-specific power_up_setup | ||
| 478 | function registered via mcpm_sync_init(). | ||
| 479 | |||
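The order in which a platform's power-down path might use these helpers
can be sketched with stubs.  The stub_*() functions below merely record
that they were called; they stand in for __mcpm_cpu_going_down(),
__mcpm_outbound_enter_critical(), __mcpm_outbound_leave_critical() and
__mcpm_cpu_down(), whose real signatures are not reproduced here.  The
ordering shown, and the assumption that a failed attempt to enter the
critical section has already restored CLUSTER_UP, are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static char call_log[160];	/* order in which the stubs were called */
static bool enter_critical_ok;	/* result the enter stub should report */

static void log_call(const char *name)
{
	strcat(call_log, name);
	strcat(call_log, ";");
}

/* Stubs standing in for the helpers named in the text above. */
static void stub_cpu_going_down(void) { log_call("going_down"); }
static void stub_cpu_down(void) { log_call("cpu_down"); }
static void stub_leave_critical_down(void) { log_call("leave_critical_down"); }
static bool stub_enter_critical(void)
{
	log_call("enter_critical");
	return enter_critical_ok;
}

/* One plausible power-down sequence for a single CPU. */
static void sketch_power_down(bool last_man)
{
	stub_cpu_going_down();			/* -> CPU_GOING_DOWN */

	if (last_man && stub_enter_critical()) {
		/* -> CLUSTER_GOING_DOWN: tear down the whole cluster. */
		log_call("cluster_teardown");
		stub_leave_critical_down();	/* -> CLUSTER_DOWN */
	} else {
		/* Not last man, or teardown aborted: CPU-level only. */
		log_call("cpu_teardown");
	}

	stub_cpu_down();			/* -> CPU_DOWN */
}

/* Run the sketch from a clean log and return the recorded calls. */
static const char *run_sketch(bool last_man, bool enter_ok)
{
	call_log[0] = '\0';
	enter_critical_ok = enter_ok;
	sketch_power_down(last_man);
	return call_log;
}
```

For example, run_sketch(false, true) records only the CPU-level steps,
because a CPU that is not the last man never attempts to enter the
cluster-level critical section.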
| 480 | Deep topologies: | ||
| 481 | |||
| 482 | As currently described and implemented, the algorithm does not | ||
| 483 | support CPU topologies involving more than two levels (i.e., | ||
| 484 | clusters of clusters are not supported). The algorithm could be | ||
| 485 | extended by replicating the cluster-level states for the | ||
| 486 | additional topological levels, and modifying the transition | ||
| 487 | rules for the intermediate (non-outermost) cluster levels. | ||
| 488 | |||
| 489 | |||
| 490 | Colophon | ||
| 491 | -------- | ||
| 492 | |||
| 493 | Originally created and documented by Dave Martin for Linaro Limited, in | ||
| 494 | collaboration with Nicolas Pitre and Achin Gupta. | ||
| 495 | |||
| 496 | Copyright (C) 2012-2013 Linaro Limited | ||
| 497 | Distributed under the terms of Version 2 of the GNU General Public | ||
| 498 | License, as defined in linux/COPYING. | ||
