author    Changbin Du <changbin.du@gmail.com>       2019-05-08 11:21:31 -0400
committer Jonathan Corbet <corbet@lwn.net>          2019-05-08 16:34:11 -0400
commit    1cd7af509dc223905dce622c07ec62e26044e3c0 (patch)
tree      9bfae71e7c0b5248634f08b60f8616572a3f00f3 /Documentation/x86
parent    3d07bc393f9b63ca4c6f9953922f9122a11f29c3 (diff)
Documentation: x86: convert resctrl_ui.txt to reST
This converts the plain text documentation to reStructuredText format and
adds it to the Sphinx TOC tree. No essential content change.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Diffstat (limited to 'Documentation/x86'):
 -rw-r--r-- Documentation/x86/index.rst                                                    |   1
 -rw-r--r-- Documentation/x86/resctrl_ui.rst (renamed from Documentation/x86/resctrl_ui.txt) | 916
 2 files changed, 494 insertions, 423 deletions
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index ae29c026be72..6e3c887a0c3b 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -23,3 +23,4 @@ x86-specific Documentation
23 | amd-memory-encryption | 23 | amd-memory-encryption |
24 | pti | 24 | pti |
25 | microcode | 25 | microcode |
26 | resctrl_ui | ||
diff --git a/Documentation/x86/resctrl_ui.txt b/Documentation/x86/resctrl_ui.rst
index c1f95b59e14d..225cfd4daaee 100644
--- a/Documentation/x86/resctrl_ui.txt
+++ b/Documentation/x86/resctrl_ui.rst
@@ -1,33 +1,44 @@
1 | .. SPDX-License-Identifier: GPL-2.0 | ||
2 | .. include:: <isonum.txt> | ||
3 | |||
4 | =========================================== | ||
1 | User Interface for Resource Control feature | 5 | User Interface for Resource Control feature |
6 | =========================================== | ||
2 | 7 | ||
3 | Intel refers to this feature as Intel Resource Director Technology(Intel(R) RDT). | 8 | :Copyright: |copy| 2016 Intel Corporation |
4 | AMD refers to this feature as AMD Platform Quality of Service(AMD QoS). | 9 | :Authors: - Fenghua Yu <fenghua.yu@intel.com> |
10 | - Tony Luck <tony.luck@intel.com> | ||
11 | - Vikas Shivappa <vikas.shivappa@intel.com> | ||
5 | 12 | ||
6 | Copyright (C) 2016 Intel Corporation | ||
7 | 13 | ||
8 | Fenghua Yu <fenghua.yu@intel.com> | 14 | Intel refers to this feature as Intel Resource Director Technology(Intel(R) RDT). |
9 | Tony Luck <tony.luck@intel.com> | 15 | AMD refers to this feature as AMD Platform Quality of Service(AMD QoS). |
10 | Vikas Shivappa <vikas.shivappa@intel.com> | ||
11 | 16 | ||
12 | This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo | 17 | This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo |
13 | flag bits: | 18 | flag bits: |
14 | RDT (Resource Director Technology) Allocation - "rdt_a" | ||
15 | CAT (Cache Allocation Technology) - "cat_l3", "cat_l2" | ||
16 | CDP (Code and Data Prioritization ) - "cdp_l3", "cdp_l2" | ||
17 | CQM (Cache QoS Monitoring) - "cqm_llc", "cqm_occup_llc" | ||
18 | MBM (Memory Bandwidth Monitoring) - "cqm_mbm_total", "cqm_mbm_local" | ||
19 | MBA (Memory Bandwidth Allocation) - "mba" | ||
20 | 19 | ||
21 | To use the feature mount the file system: | 20 | ============================================= ================================ |
21 | RDT (Resource Director Technology) Allocation "rdt_a" | ||
22 | CAT (Cache Allocation Technology) "cat_l3", "cat_l2" | ||
23 | CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2" | ||
24 | CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc" | ||
25 | MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local" | ||
26 | MBA (Memory Bandwidth Allocation) "mba" | ||
27 | ============================================= ================================ | ||
28 | |||
29 | To use the feature mount the file system:: | ||
22 | 30 | ||
23 | # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl | 31 | # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl |
24 | 32 | ||
25 | mount options are: | 33 | mount options are: |
26 | 34 | ||
27 | "cdp": Enable code/data prioritization in L3 cache allocations. | 35 | "cdp": |
28 | "cdpl2": Enable code/data prioritization in L2 cache allocations. | 36 | Enable code/data prioritization in L3 cache allocations. |
29 | "mba_MBps": Enable the MBA Software Controller(mba_sc) to specify MBA | 37 | "cdpl2": |
30 | bandwidth in MBps | 38 | Enable code/data prioritization in L2 cache allocations. |
39 | "mba_MBps": | ||
40 | Enable the MBA Software Controller(mba_sc) to specify MBA | ||
41 | bandwidth in MBps | ||
31 | 42 | ||
32 | L2 and L3 CDP are controlled separately. | 43 | L2 and L3 CDP are controlled separately.
33 | 44 | ||
@@ -44,7 +55,7 @@ For more details on the behavior of the interface during monitoring
44 | and allocation, see the "Resource alloc and monitor groups" section. | 55 | and allocation, see the "Resource alloc and monitor groups" section. |
45 | 56 | ||
46 | Info directory | 57 | Info directory |
47 | -------------- | 58 | ============== |
48 | 59 | ||
49 | The 'info' directory contains information about the enabled | 60 | The 'info' directory contains information about the enabled |
50 | resources. Each resource has its own subdirectory. The subdirectory | 61 | resources. Each resource has its own subdirectory. The subdirectory |
@@ -56,77 +67,93 @@ allocation:
56 | Cache resource(L3/L2) subdirectory contains the following files | 67 | Cache resource(L3/L2) subdirectory contains the following files |
57 | related to allocation: | 68 | related to allocation: |
58 | 69 | ||
59 | "num_closids": The number of CLOSIDs which are valid for this | 70 | "num_closids": |
60 | resource. The kernel uses the smallest number of | 71 | The number of CLOSIDs which are valid for this |
61 | CLOSIDs of all enabled resources as limit. | 72 | resource. The kernel uses the smallest number of |
62 | 73 | CLOSIDs of all enabled resources as limit. | |
63 | "cbm_mask": The bitmask which is valid for this resource. | 74 | "cbm_mask": |
64 | This mask is equivalent to 100%. | 75 | The bitmask which is valid for this resource. |
65 | 76 | This mask is equivalent to 100%. | |
66 | "min_cbm_bits": The minimum number of consecutive bits which | 77 | "min_cbm_bits": |
67 | must be set when writing a mask. | 78 | The minimum number of consecutive bits which |
68 | 79 | must be set when writing a mask. | |
69 | "shareable_bits": Bitmask of shareable resource with other executing | 80 | |
70 | entities (e.g. I/O). User can use this when | 81 | "shareable_bits": |
71 | setting up exclusive cache partitions. Note that | 82 | Bitmask of shareable resource with other executing |
72 | some platforms support devices that have their | 83 | entities (e.g. I/O). User can use this when |
73 | own settings for cache use which can over-ride | 84 | setting up exclusive cache partitions. Note that |
74 | these bits. | 85 | some platforms support devices that have their |
75 | "bit_usage": Annotated capacity bitmasks showing how all | 86 | own settings for cache use which can over-ride |
76 | instances of the resource are used. The legend is: | 87 | these bits. |
77 | "0" - Corresponding region is unused. When the system's | 88 | "bit_usage": |
89 | Annotated capacity bitmasks showing how all | ||
90 | instances of the resource are used. The legend is: | ||
91 | |||
92 | "0": | ||
93 | Corresponding region is unused. When the system's | ||
78 | resources have been allocated and a "0" is found | 94 | resources have been allocated and a "0" is found |
79 | in "bit_usage" it is a sign that resources are | 95 | in "bit_usage" it is a sign that resources are |
80 | wasted. | 96 | wasted. |
81 | "H" - Corresponding region is used by hardware only | 97 | |
98 | "H": | ||
99 | Corresponding region is used by hardware only | ||
82 | but available for software use. If a resource | 100 | but available for software use. If a resource |
83 | has bits set in "shareable_bits" but not all | 101 | has bits set in "shareable_bits" but not all |
84 | of these bits appear in the resource groups' | 102 | of these bits appear in the resource groups' |
85 | schematas then the bits appearing in | 103 | schematas then the bits appearing in |
86 | "shareable_bits" but no resource group will | 104 | "shareable_bits" but no resource group will |
87 | be marked as "H". | 105 | be marked as "H". |
88 | "X" - Corresponding region is available for sharing and | 106 | "X": |
107 | Corresponding region is available for sharing and | ||
89 | used by hardware and software. These are the | 108 | used by hardware and software. These are the |
90 | bits that appear in "shareable_bits" as | 109 | bits that appear in "shareable_bits" as |
91 | well as a resource group's allocation. | 110 | well as a resource group's allocation. |
92 | "S" - Corresponding region is used by software | 111 | "S": |
112 | Corresponding region is used by software | ||
93 | and available for sharing. | 113 | and available for sharing. |
94 | "E" - Corresponding region is used exclusively by | 114 | "E": |
115 | Corresponding region is used exclusively by | ||
95 | one resource group. No sharing allowed. | 116 | one resource group. No sharing allowed. |
96 | "P" - Corresponding region is pseudo-locked. No | 117 | "P": |
118 | Corresponding region is pseudo-locked. No | ||
97 | sharing allowed. | 119 | sharing allowed. |
98 | 120 | ||
99 | Memory bandwidth (MB) subdirectory contains the following files | 121 | Memory bandwidth (MB) subdirectory contains the following files
100 | with respect to allocation: | 122 | with respect to allocation: |
101 | 123 | ||
102 | "min_bandwidth": The minimum memory bandwidth percentage which | 124 | "min_bandwidth": |
103 | user can request. | 125 | The minimum memory bandwidth percentage which |
126 | user can request. | ||
104 | 127 | ||
105 | "bandwidth_gran": The granularity in which the memory bandwidth | 128 | "bandwidth_gran": |
106 | percentage is allocated. The allocated | 129 | The granularity in which the memory bandwidth |
107 | b/w percentage is rounded off to the next | 130 | percentage is allocated. The allocated |
108 | control step available on the hardware. The | 131 | b/w percentage is rounded off to the next |
109 | available bandwidth control steps are: | 132 | control step available on the hardware. The |
110 | min_bandwidth + N * bandwidth_gran. | 133 | available bandwidth control steps are: |
134 | min_bandwidth + N * bandwidth_gran. | ||
111 | 135 | ||
112 | "delay_linear": Indicates if the delay scale is linear or | 136 | "delay_linear": |
113 | non-linear. This field is purely informational | 137 | Indicates if the delay scale is linear or |
114 | only. | 138 | non-linear. This field is purely informational |
139 | only. | ||
115 | 140 | ||
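The valid bandwidth steps described by "min_bandwidth" and "bandwidth_gran" can be enumerated with a short shell sketch. The 10/10 values below are assumed for illustration (they match the example later in this document); a live system would read them from the info files instead:

```shell
#!/bin/sh
# Enumerate valid MBA percentages: min_bandwidth + N * bandwidth_gran.
# The two values are assumed here; on a real system read them from
# /sys/fs/resctrl/info/MB/min_bandwidth and .../bandwidth_gran.
min_bandwidth=10
bandwidth_gran=10

steps=""
bw=$min_bandwidth
while [ "$bw" -le 100 ]; do
    steps="$steps $bw"
    bw=$(( bw + bandwidth_gran ))
done
echo "valid MBA percentages:$steps"
```

A write of any other percentage is rounded to the next available control step by the hardware, as described above.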
116 | If RDT monitoring is available there will be an "L3_MON" directory | 141 | If RDT monitoring is available there will be an "L3_MON" directory |
117 | with the following files: | 142 | with the following files: |
118 | 143 | ||
119 | "num_rmids": The number of RMIDs available. This is the | 144 | "num_rmids": |
120 | upper bound for how many "CTRL_MON" + "MON" | 145 | The number of RMIDs available. This is the |
121 | groups can be created. | 146 | upper bound for how many "CTRL_MON" + "MON" |
147 | groups can be created. | ||
122 | 148 | ||
123 | "mon_features": Lists the monitoring events if | 149 | "mon_features": |
124 | monitoring is enabled for the resource. | 150 | Lists the monitoring events if |
151 | monitoring is enabled for the resource. | ||
125 | 152 | ||
126 | "max_threshold_occupancy": | 153 | "max_threshold_occupancy": |
127 | Read/write file provides the largest value (in | 154 | Read/write file provides the largest value (in |
128 | bytes) at which a previously used LLC_occupancy | 155 | bytes) at which a previously used LLC_occupancy |
129 | counter can be considered for re-use. | 156 | counter can be considered for re-use. |
130 | 157 | ||
131 | Finally, in the top level of the "info" directory there is a file | 158 | Finally, in the top level of the "info" directory there is a file |
132 | named "last_cmd_status". This is reset with every "command" issued | 159 | named "last_cmd_status". This is reset with every "command" issued |
@@ -134,6 +161,7 @@ via the file system (making new directories or writing to any of the
134 | control files). If the command was successful, it will read as "ok". | 161 | control files). If the command was successful, it will read as "ok". |
135 | If the command failed, it will provide more information that can be | 162 | If the command failed, it will provide more information that can be |
136 | conveyed in the error returns from file operations. E.g. | 163 | conveyed in the error returns from file operations. E.g. |
164 | :: | ||
137 | 165 | ||
138 | # echo L3:0=f7 > schemata | 166 | # echo L3:0=f7 > schemata |
139 | bash: echo: write error: Invalid argument | 167 | bash: echo: write error: Invalid argument |
@@ -141,7 +169,7 @@ conveyed in the error returns from file operations. E.g.
141 | mask f7 has non-consecutive 1-bits | 169 | mask f7 has non-consecutive 1-bits |
142 | 170 | ||
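The "non-consecutive 1-bits" error above reflects the contiguity rule for capacity bitmasks. A minimal shell sketch of that check (the helper name is illustrative, not a resctrl interface):

```shell
#!/bin/sh
# A capacity bitmask is accepted only if its set bits form one
# contiguous run. Adding the lowest set bit (x & -x) to x clears
# that run; any bits left over mean a second, separate run existed.
is_contiguous() {
    x=$(( 0x$1 ))
    [ "$x" -ne 0 ] && [ $(( x & (x + (x & -x)) )) -eq 0 ]
}

for mask in f7 3c0; do
    if is_contiguous "$mask"; then
        echo "mask $mask: ok"
    else
        echo "mask $mask has non-consecutive 1-bits"
    fi
done
# prints:
#   mask f7 has non-consecutive 1-bits
#   mask 3c0: ok
```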
143 | Resource alloc and monitor groups | 171 | Resource alloc and monitor groups |
144 | --------------------------------- | 172 | ================================= |
145 | 173 | ||
146 | Resource groups are represented as directories in the resctrl file | 174 | Resource groups are represented as directories in the resctrl file |
147 | system. The default group is the root directory which, immediately | 175 | system. The default group is the root directory which, immediately |
@@ -226,6 +254,7 @@ When monitoring is enabled all MON groups will also contain:
226 | 254 | ||
227 | Resource allocation rules | 255 | Resource allocation rules |
228 | ------------------------- | 256 | ------------------------- |
257 | |||
229 | When a task is running the following rules define which resources are | 258 | When a task is running the following rules define which resources are |
230 | available to it: | 259 | available to it: |
231 | 260 | ||
@@ -252,7 +281,7 @@ Resource monitoring rules
252 | 281 | ||
253 | 282 | ||
254 | Notes on cache occupancy monitoring and control | 283 | Notes on cache occupancy monitoring and control |
255 | ----------------------------------------------- | 284 | =============================================== |
256 | When moving a task from one group to another you should remember that | 285 | When moving a task from one group to another you should remember that |
257 | this only affects *new* cache allocations by the task. E.g. you may have | 286 | this only affects *new* cache allocations by the task. E.g. you may have |
258 | a task in a monitor group showing 3 MB of cache occupancy. If you move | 287 | a task in a monitor group showing 3 MB of cache occupancy. If you move |
@@ -321,7 +350,7 @@ of the capacity of the cache. You could partition the cache into four
321 | equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000. | 350 | equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000. |
322 | 351 | ||
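A quick sketch confirming that the four masks above really partition a 20-bit capacity bitmask, i.e. they are pairwise disjoint and their union is 0xfffff:

```shell
#!/bin/sh
# Check that 0x1f, 0x3e0, 0x7c00, 0xf8000 are pairwise disjoint and
# together cover all 20 bits of the capacity bitmask (0xfffff).
union=0
overlap=0
for mask in 0x1f 0x3e0 0x7c00 0xf8000; do
    [ $(( union & mask )) -ne 0 ] && overlap=1
    union=$(( union | mask ))
done
printf 'union=0x%x overlap=%d\n' "$union" "$overlap"
# prints: union=0xfffff overlap=0
```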
323 | Memory bandwidth Allocation and monitoring | 352 | Memory bandwidth Allocation and monitoring |
324 | ------------------------------------------ | 353 | ========================================== |
325 | 354 | ||
326 | For Memory bandwidth resource, by default the user controls the resource | 355 | For Memory bandwidth resource, by default the user controls the resource |
327 | by indicating the percentage of total memory bandwidth. | 356 | by indicating the percentage of total memory bandwidth. |
@@ -369,7 +398,7 @@ In order to mitigate this and make the interface more user friendly,
369 | resctrl added support for specifying the bandwidth in MBps as well. The | 398 | resctrl added support for specifying the bandwidth in MBps as well. The |
370 | kernel underneath would use a software feedback mechanism or a "Software | 399 | kernel underneath would use a software feedback mechanism or a "Software |
371 | Controller(mba_sc)" which reads the actual bandwidth using MBM counters | 400 | Controller(mba_sc)" which reads the actual bandwidth using MBM counters |
372 | and adjusts the memory bandwidth percentages to ensure | 401 | and adjusts the memory bandwidth percentages to ensure::
373 | 402 | ||
374 | "actual bandwidth < user specified bandwidth". | 403 | "actual bandwidth < user specified bandwidth". |
375 | 404 | ||
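The feedback idea can be sketched as a toy loop. This simulates the measurement and is not the kernel's actual mba_sc algorithm: lower the granted percentage until the measured bandwidth drops below the user-specified MBps target.

```shell
#!/bin/sh
# Toy model of the mba_sc feedback loop. measure() stands in for
# reading MBM counters; here bandwidth is pretended to scale
# linearly with the granted percentage.
target_mbps=1000
grant=100                      # percentage currently granted

measure() {
    echo $(( grant * 20 ))     # simulated actual bandwidth in MBps
}

while [ "$(measure)" -ge "$target_mbps" ] && [ "$grant" -gt 10 ]; do
    grant=$(( grant - 10 ))    # step down one bandwidth_gran
done
echo "granted=$grant% actual=$(measure)MBps"
# prints: granted=40% actual=800MBps
```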
@@ -380,14 +409,14 @@ sections.
380 | 409 | ||
381 | L3 schemata file details (code and data prioritization disabled) | 410 | L3 schemata file details (code and data prioritization disabled) |
382 | ---------------------------------------------------------------- | 411 | ---------------------------------------------------------------- |
383 | With CDP disabled the L3 schemata format is: | 412 | With CDP disabled the L3 schemata format is:: |
384 | 413 | ||
385 | L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... | 414 | L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... |
386 | 415 | ||
387 | L3 schemata file details (CDP enabled via mount option to resctrl) | 416 | L3 schemata file details (CDP enabled via mount option to resctrl) |
388 | ------------------------------------------------------------------ | 417 | ------------------------------------------------------------------ |
389 | When CDP is enabled L3 control is split into two separate resources | 418 | When CDP is enabled L3 control is split into two separate resources |
390 | so you can specify independent masks for code and data like this: | 419 | so you can specify independent masks for code and data like this:: |
391 | 420 | ||
392 | L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... | 421 | L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... |
393 | L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... | 422 | L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... |
@@ -395,7 +424,7 @@ so you can specify independent masks for code and data like this:
395 | L2 schemata file details | 424 | L2 schemata file details |
396 | ------------------------ | 425 | ------------------------ |
397 | L2 cache does not support code and data prioritization, so the | 426 | L2 cache does not support code and data prioritization, so the |
398 | schemata format is always: | 427 | schemata format is always:: |
399 | 428 | ||
400 | L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... | 429 | L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... |
401 | 430 | ||
@@ -403,6 +432,7 @@ Memory bandwidth Allocation (default mode)
403 | ------------------------------------------ | 432 | ------------------------------------------ |
404 | 433 | ||
405 | Memory b/w domain is L3 cache. | 434 | Memory b/w domain is L3 cache. |
435 | :: | ||
406 | 436 | ||
407 | MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;... | 437 | MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;... |
408 | 438 | ||
@@ -410,6 +440,7 @@ Memory bandwidth Allocation specified in MBps
410 | --------------------------------------------- | 440 | --------------------------------------------- |
411 | 441 | ||
412 | Memory bandwidth domain is L3 cache. | 442 | Memory bandwidth domain is L3 cache. |
443 | :: | ||
413 | 444 | ||
414 | MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;... | 445 | MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;... |
415 | 446 | ||
@@ -418,17 +449,18 @@ Reading/writing the schemata file
418 | Reading the schemata file will show the state of all resources | 449 | Reading the schemata file will show the state of all resources |
419 | on all domains. When writing you only need to specify those values | 450 | on all domains. When writing you only need to specify those values |
420 | which you wish to change. E.g. | 451 | which you wish to change. E.g. |
452 | :: | ||
421 | 453 | ||
422 | # cat schemata | 454 | # cat schemata |
423 | L3DATA:0=fffff;1=fffff;2=fffff;3=fffff | 455 | L3DATA:0=fffff;1=fffff;2=fffff;3=fffff |
424 | L3CODE:0=fffff;1=fffff;2=fffff;3=fffff | 456 | L3CODE:0=fffff;1=fffff;2=fffff;3=fffff |
425 | # echo "L3DATA:2=3c0;" > schemata | 457 | # echo "L3DATA:2=3c0;" > schemata |
426 | # cat schemata | 458 | # cat schemata |
427 | L3DATA:0=fffff;1=fffff;2=3c0;3=fffff | 459 | L3DATA:0=fffff;1=fffff;2=3c0;3=fffff |
428 | L3CODE:0=fffff;1=fffff;2=fffff;3=fffff | 460 | L3CODE:0=fffff;1=fffff;2=fffff;3=fffff |
429 | 461 | ||
430 | Cache Pseudo-Locking | 462 | Cache Pseudo-Locking |
431 | -------------------- | 463 | ==================== |
432 | CAT enables a user to specify the amount of cache space that an | 464 | CAT enables a user to specify the amount of cache space that an |
433 | application can fill. Cache pseudo-locking builds on the fact that a | 465 | application can fill. Cache pseudo-locking builds on the fact that a |
434 | CPU can still read and write data pre-allocated outside its current | 466 | CPU can still read and write data pre-allocated outside its current |
@@ -442,6 +474,7 @@ a region of memory with reduced average read latency.
442 | The creation of a cache pseudo-locked region is triggered by a request | 474 | The creation of a cache pseudo-locked region is triggered by a request |
443 | from the user to do so that is accompanied by a schemata of the region | 475 | from the user to do so that is accompanied by a schemata of the region |
444 | to be pseudo-locked. The cache pseudo-locked region is created as follows: | 476 | to be pseudo-locked. The cache pseudo-locked region is created as follows: |
477 | |||
445 | - Create a CAT allocation CLOSNEW with a CBM matching the schemata | 478 | - Create a CAT allocation CLOSNEW with a CBM matching the schemata |
446 | from the user of the cache region that will contain the pseudo-locked | 479 | from the user of the cache region that will contain the pseudo-locked |
447 | memory. This region must not overlap with any current CAT allocation/CLOS | 480 | memory. This region must not overlap with any current CAT allocation/CLOS |
@@ -480,6 +513,7 @@ initial mmap() handling, there is no enforcement afterwards and the
480 | application itself needs to ensure it remains affine to the correct cores. | 513 | application itself needs to ensure it remains affine to the correct cores.
481 | 514 | ||
482 | Pseudo-locking is accomplished in two stages: | 515 | Pseudo-locking is accomplished in two stages: |
516 | |||
483 | 1) During the first stage the system administrator allocates a portion | 517 | 1) During the first stage the system administrator allocates a portion |
484 | of cache that should be dedicated to pseudo-locking. At this time an | 518 | of cache that should be dedicated to pseudo-locking. At this time an |
485 | equivalent portion of memory is allocated, loaded into allocated | 519 | equivalent portion of memory is allocated, loaded into allocated |
@@ -506,7 +540,7 @@ by user space in order to obtain access to the pseudo-locked memory region.
506 | An example of cache pseudo-locked region creation and usage can be found below. | 540 | An example of cache pseudo-locked region creation and usage can be found below. |
507 | 541 | ||
508 | Cache Pseudo-Locking Debugging Interface | 542 | Cache Pseudo-Locking Debugging Interface |
509 | --------------------------------------- | 543 | ---------------------------------------- |
510 | The pseudo-locking debugging interface is enabled by default (if | 544 | The pseudo-locking debugging interface is enabled by default (if |
511 | CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl. | 545 | CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl. |
512 | 546 | ||
@@ -514,6 +548,7 @@ There is no explicit way for the kernel to test if a provided memory
514 | location is present in the cache. The pseudo-locking debugging interface uses | 548 | location is present in the cache. The pseudo-locking debugging interface uses |
515 | the tracing infrastructure to provide two ways to measure cache residency of | 549 | the tracing infrastructure to provide two ways to measure cache residency of |
516 | the pseudo-locked region: | 550 | the pseudo-locked region: |
551 | |||
517 | 1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data | 552 | 1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data |
518 | from these measurements are best visualized using a hist trigger (see | 553 | from these measurements are best visualized using a hist trigger (see |
519 | example below). In this test the pseudo-locked region is traversed at | 554 | example below). In this test the pseudo-locked region is traversed at |
@@ -529,87 +564,97 @@ it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
529 | write-only file, pseudo_lock_measure, is present in this directory. The | 564 | write-only file, pseudo_lock_measure, is present in this directory. The |
530 | measurement of the pseudo-locked region depends on the number written to this | 565 | measurement of the pseudo-locked region depends on the number written to this |
531 | debugfs file: | 566 | debugfs file: |
532 | 1 - writing "1" to the pseudo_lock_measure file will trigger the latency | 567 | |
568 | 1: | ||
569 | writing "1" to the pseudo_lock_measure file will trigger the latency | ||
533 | measurement captured in the pseudo_lock_mem_latency tracepoint. See | 570 | measurement captured in the pseudo_lock_mem_latency tracepoint. See |
534 | example below. | 571 | example below. |
535 | 2 - writing "2" to the pseudo_lock_measure file will trigger the L2 cache | 572 | 2: |
573 | writing "2" to the pseudo_lock_measure file will trigger the L2 cache | ||
536 | residency (cache hits and misses) measurement captured in the | 574 | residency (cache hits and misses) measurement captured in the |
537 | pseudo_lock_l2 tracepoint. See example below. | 575 | pseudo_lock_l2 tracepoint. See example below. |
538 | 3 - writing "3" to the pseudo_lock_measure file will trigger the L3 cache | 576 | 3: |
577 | writing "3" to the pseudo_lock_measure file will trigger the L3 cache | ||
539 | residency (cache hits and misses) measurement captured in the | 578 | residency (cache hits and misses) measurement captured in the |
540 | pseudo_lock_l3 tracepoint. | 579 | pseudo_lock_l3 tracepoint. |
541 | 580 | ||
542 | All measurements are recorded with the tracing infrastructure. This requires | 581 | All measurements are recorded with the tracing infrastructure. This requires |
543 | the relevant tracepoints to be enabled before the measurement is triggered. | 582 | the relevant tracepoints to be enabled before the measurement is triggered. |
544 | 583 | ||
545 | Example of latency debugging interface: | 584 | Example of latency debugging interface |
585 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
546 | In this example a pseudo-locked region named "newlock" was created. Here is | 586 | In this example a pseudo-locked region named "newlock" was created. Here is |
547 | how we can measure the latency in cycles of reading from this region and | 587 | how we can measure the latency in cycles of reading from this region and |
548 | visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS | 588 | visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS |
549 | is set: | 589 | is set:: |
550 | # :> /sys/kernel/debug/tracing/trace | 590 | |
551 | # echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/trigger | 591 | # :> /sys/kernel/debug/tracing/trace |
552 | # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable | 592 | # echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/trigger |
553 | # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure | 593 | # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable |
554 | # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable | 594 | # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure |
555 | # cat /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/hist | 595 | # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable |
556 | 596 | # cat /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/hist | |
557 | # event histogram | 597 | |
558 | # | 598 | # event histogram |
559 | # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active] | 599 | # |
560 | # | 600 | # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active] |
561 | 601 | # | |
562 | { latency: 456 } hitcount: 1 | 602 | |
563 | { latency: 50 } hitcount: 83 | 603 | { latency: 456 } hitcount: 1 |
564 | { latency: 36 } hitcount: 96 | 604 | { latency: 50 } hitcount: 83 |
565 | { latency: 44 } hitcount: 174 | 605 | { latency: 36 } hitcount: 96 |
566 | { latency: 48 } hitcount: 195 | 606 | { latency: 44 } hitcount: 174 |
567 | { latency: 46 } hitcount: 262 | 607 | { latency: 48 } hitcount: 195 |
568 | { latency: 42 } hitcount: 693 | 608 | { latency: 46 } hitcount: 262 |
569 | { latency: 40 } hitcount: 3204 | 609 | { latency: 42 } hitcount: 693 |
570 | { latency: 38 } hitcount: 3484 | 610 | { latency: 40 } hitcount: 3204 |
571 | 611 | { latency: 38 } hitcount: 3484 | |
572 | Totals: | 612 | |
573 | Hits: 8192 | 613 | Totals: |
574 | Entries: 9 | 614 | Hits: 8192 |
575 | Dropped: 0 | 615 | Entries: 9 |
576 | 616 | Dropped: 0 | |
577 | Example of cache hits/misses debugging: | 617 | |
618 | Example of cache hits/misses debugging | ||
619 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
578 | In this example a pseudo-locked region named "newlock" was created on the L2 | 620 | In this example a pseudo-locked region named "newlock" was created on the L2 |
579 | cache of a platform. Here is how we can obtain details of the cache hits | 621 | cache of a platform. Here is how we can obtain details of the cache hits |
580 | and misses using the platform's precision counters. | 622 | and misses using the platform's precision counters. |
623 | :: | ||
581 | 624 | ||
582 | # :> /sys/kernel/debug/tracing/trace | 625 | # :> /sys/kernel/debug/tracing/trace |
583 | # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable | 626 | # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable |
584 | # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure | 627 | # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure |
585 | # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable | 628 | # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable |
586 | # cat /sys/kernel/debug/tracing/trace | 629 | # cat /sys/kernel/debug/tracing/trace |
587 | 630 | ||
588 | # tracer: nop | 631 | # tracer: nop |
589 | # | 632 | # |
590 | # _-----=> irqs-off | 633 | # _-----=> irqs-off |
591 | # / _----=> need-resched | 634 | # / _----=> need-resched |
592 | # | / _---=> hardirq/softirq | 635 | # | / _---=> hardirq/softirq |
593 | # || / _--=> preempt-depth | 636 | # || / _--=> preempt-depth |
594 | # ||| / delay | 637 | # ||| / delay |
595 | # TASK-PID CPU# |||| TIMESTAMP FUNCTION | 638 | # TASK-PID CPU# |||| TIMESTAMP FUNCTION |
596 | # | | | |||| | | | 639 | # | | | |||| | | |
597 | pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0 | 640 | pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0 |
598 | 641 | ||
599 | 642 | ||
600 | Examples for RDT allocation usage: | 643 | Examples for RDT allocation usage |
644 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
645 | |||
646 | 1) Example 1 | ||
601 | 647 | ||
602 | Example 1 | ||
603 | --------- | ||
604 | On a two socket machine (one L3 cache per socket) with just four bits | 648 | On a two socket machine (one L3 cache per socket) with just four bits |
605 | for cache bit masks, minimum b/w of 10% with a memory bandwidth | 649 | for cache bit masks, minimum b/w of 10% with a memory bandwidth |
606 | granularity of 10% | 650 | granularity of 10%. |
651 | :: | ||
607 | 652 | ||
608 | # mount -t resctrl resctrl /sys/fs/resctrl | 653 | # mount -t resctrl resctrl /sys/fs/resctrl |
609 | # cd /sys/fs/resctrl | 654 | # cd /sys/fs/resctrl |
610 | # mkdir p0 p1 | 655 | # mkdir p0 p1 |
611 | # echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata | 656 | # echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata |
612 | # echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata | 657 | # echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata |
613 | 658 | ||
614 | The default resource group is unmodified, so we have access to all parts | 659 | The default resource group is unmodified, so we have access to all parts |
615 | of all caches (its schemata file reads "L3:0=f;1=f"). | 660 | of all caches (its schemata file reads "L3:0=f;1=f"). |
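The fraction of cache a mask grants is just its popcount over the mask width (four bits in this example). A small helper, sketched for illustration (`cbm_percent` is our name, not part of resctrl):

```shell
# Percentage of cache covered by a capacity bitmask of a given width.
cbm_percent() {           # usage: cbm_percent <hex-cbm> <mask-width>
    val=$(( 0x$1 )); bits=0
    while [ "$val" -gt 0 ]; do
        bits=$(( bits + (val & 1) ))   # count set bits
        val=$(( val >> 1 ))
    done
    echo $(( bits * 100 / $2 ))
}
cbm_percent 3 4   # "0=3" in p0's schemata: half the ways on cache id 0
cbm_percent f 4   # "0=f" in the default group: the whole cache
```

The two calls print 50 and 100, matching the 50%/100% splits described in the example.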
@@ -628,100 +673,106 @@ the b/w accordingly. | |||
628 | 673 | ||
629 | If the MBA is specified in MB (megabytes) then the user can enter the max b/w | 674 | If the MBA is specified in MB (megabytes) then the user can enter the max b/w |
630 | in MB rather than as percentage values. | 675 | in MB rather than as percentage values. |

676 | :: | ||
631 | 677 | ||
632 | # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata | 678 | # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata |
633 | # echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata | 679 | # echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata |
634 | 680 | ||
635 | In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w | 681 | In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w |
636 | of 1024MB, whereas on socket 1 they would use 500MB. | 682 | of 1024MB, whereas on socket 1 they would use 500MB. |
637 | 683 | ||
638 | Example 2 | 684 | 2) Example 2 |
639 | --------- | 685 | |
640 | Again two sockets, but this time with a more realistic 20-bit mask. | 686 | Again two sockets, but this time with a more realistic 20-bit mask. |
641 | 687 | ||
642 | Two real time tasks pid=1234 running on processor 0 and pid=5678 running on | 688 | Two real time tasks pid=1234 running on processor 0 and pid=5678 running on |
643 | processor 1 of socket 0 on a two-socket, dual-core machine. To avoid noisy | 689 | processor 1 of socket 0 on a two-socket, dual-core machine. To avoid noisy |
644 | neighbors, each of the two real-time tasks exclusively occupies one quarter | 690 | neighbors, each of the two real-time tasks exclusively occupies one quarter |
645 | of L3 cache on socket 0. | 691 | of L3 cache on socket 0. |
692 | :: | ||
646 | 693 | ||
647 | # mount -t resctrl resctrl /sys/fs/resctrl | 694 | # mount -t resctrl resctrl /sys/fs/resctrl |
648 | # cd /sys/fs/resctrl | 695 | # cd /sys/fs/resctrl |
649 | 696 | ||
650 | First we reset the schemata for the default group so that the "upper" | 697 | First we reset the schemata for the default group so that the "upper" |
651 | 50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by | 698 | 50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by |
652 | ordinary tasks: | 699 | ordinary tasks:: |
653 | 700 | ||
654 | # echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata | 701 | # echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata |
655 | 702 | ||
656 | Next we make a resource group for our first real time task and give | 703 | Next we make a resource group for our first real time task and give |
657 | it access to the "top" 25% of the cache on socket 0. | 704 | it access to the "top" 25% of the cache on socket 0. |
705 | :: | ||
658 | 706 | ||
659 | # mkdir p0 | 707 | # mkdir p0 |
660 | # echo "L3:0=f8000;1=fffff" > p0/schemata | 708 | # echo "L3:0=f8000;1=fffff" > p0/schemata |
661 | 709 | ||
662 | Finally we move our first real time task into this resource group. We | 710 | Finally we move our first real time task into this resource group. We |
663 | also use taskset(1) to ensure the task always runs on a dedicated CPU | 711 | also use taskset(1) to ensure the task always runs on a dedicated CPU |
664 | on socket 0. Most uses of resource groups will also constrain which | 712 | on socket 0. Most uses of resource groups will also constrain which |
665 | processors tasks run on. | 713 | processors tasks run on. |
714 | :: | ||
666 | 715 | ||
667 | # echo 1234 > p0/tasks | 716 | # echo 1234 > p0/tasks |
668 | # taskset -cp 1 1234 | 717 | # taskset -cp 1 1234 |
669 | 718 | ||
670 | Ditto for the second real time task (with the remaining 25% of cache): | 719 | Ditto for the second real time task (with the remaining 25% of cache):: |
671 | 720 | ||
672 | # mkdir p1 | 721 | # mkdir p1 |
673 | # echo "L3:0=7c00;1=fffff" > p1/schemata | 722 | # echo "L3:0=7c00;1=fffff" > p1/schemata |
674 | # echo 5678 > p1/tasks | 723 | # echo 5678 > p1/tasks |
675 | # taskset -cp 2 5678 | 724 | # taskset -cp 2 5678 |
676 | 725 | ||
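The masks above can be derived mechanically: a CBM must be a contiguous run of set bits, so a quarter of a 20-bit mask is a run of five bits shifted into position. A sketch (the `cbm` helper name is ours):

```shell
# Contiguous capacity bitmask: `width` set bits starting at bit `shift`.
cbm() { printf '%x\n' $(( ((1 << $1) - 1) << $2 )); }

cbm 5 15   # top quarter of a 20-bit mask, as used for p0
cbm 5 10   # next quarter down, as used for p1
```

The two calls print f8000 and 7c00, matching the masks written to p0 and p1 above.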
677 | For the same 2 socket system with memory b/w resource and CAT L3 the | 726 | For the same 2 socket system with memory b/w resource and CAT L3 the |
678 | schemata would look like (assuming min_bandwidth is 10 and bandwidth_gran is | 727 | schemata would look like (assuming min_bandwidth is 10 and bandwidth_gran is |
679 | 10): | 728 | 10): |
680 | 729 | ||
681 | For our first real time task this would request 20% memory b/w on socket | 730 | For our first real time task this would request 20% memory b/w on socket 0. |
682 | 0. | 731 | :: |
683 | 732 | ||
684 | # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata | 733 | # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata |
685 | 734 | ||
686 | For our second real time task this would request another 20% of memory b/w | 735 | For our second real time task this would request another 20% of memory b/w |
687 | on socket 0. | 736 | on socket 0. |
737 | :: | ||
688 | 738 | ||
689 | # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata | 739 | # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata |
690 | 740 | ||
691 | Example 3 | 741 | 3) Example 3 |
692 | --------- | ||
693 | 742 | ||
694 | A single socket system which has real-time tasks running on cores 4-7 and a | 743 | A single socket system which has real-time tasks running on cores 4-7 and a |
695 | non real-time workload assigned to cores 0-3. The real-time tasks share text | 744 | non real-time workload assigned to cores 0-3. The real-time tasks share text |
696 | and data, so a per task association is not required and due to interaction | 745 | and data, so a per task association is not required and due to interaction |
697 | with the kernel it's desired that the kernel on these cores shares L3 with | 746 | with the kernel it's desired that the kernel on these cores shares L3 with |
698 | the tasks. | 747 | the tasks. |
748 | :: | ||
699 | 749 | ||
700 | # mount -t resctrl resctrl /sys/fs/resctrl | 750 | # mount -t resctrl resctrl /sys/fs/resctrl |
701 | # cd /sys/fs/resctrl | 751 | # cd /sys/fs/resctrl |
702 | 752 | ||
703 | First we reset the schemata for the default group so that the "upper" | 753 | First we reset the schemata for the default group so that the "upper" |
704 | 50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0 | 754 | 50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0 |
705 | cannot be used by ordinary tasks: | 755 | cannot be used by ordinary tasks:: |
706 | 756 | ||
707 | # echo "L3:0=3ff\nMB:0=50" > schemata | 757 | # echo "L3:0=3ff\nMB:0=50" > schemata |
708 | 758 | ||
709 | Next we make a resource group for our real time cores and give it access | 759 | Next we make a resource group for our real time cores and give it access |
710 | to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on | 760 | to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on |
711 | socket 0. | 761 | socket 0. |
762 | :: | ||
712 | 763 | ||
713 | # mkdir p0 | 764 | # mkdir p0 |
714 | # echo "L3:0=ffc00\nMB:0=50" > p0/schemata | 765 | # echo "L3:0=ffc00\nMB:0=50" > p0/schemata |
715 | 766 | ||
716 | Finally we move core 4-7 over to the new group and make sure that the | 767 | Finally we move core 4-7 over to the new group and make sure that the |
717 | kernel and the tasks running there get 50% of the cache. They should | 768 | kernel and the tasks running there get 50% of the cache. They should |
718 | also get 50% of memory bandwidth assuming that the cores 4-7 are SMT | 769 | also get 50% of memory bandwidth assuming that the cores 4-7 are SMT |
719 | siblings and only the real time threads are scheduled on cores 4-7. | 770 | siblings and only the real time threads are scheduled on cores 4-7. |
771 | :: | ||
720 | 772 | ||
721 | # echo F0 > p0/cpus | 773 | # echo F0 > p0/cpus |
722 | 774 | ||
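The value F0 written above is simply the CPU bitmask with bits 4-7 set; decoding it back confirms which cores joined the group:

```shell
# Decode a CPU mask into the CPUs it covers by walking its bits.
mask=$(( 0xf0 )); cpu=0; cpus=
while [ "$mask" -ne 0 ]; do
    if [ $(( mask & 1 )) -ne 0 ]; then
        cpus="$cpus $cpu"              # this bit is set: CPU belongs to the group
    fi
    mask=$(( mask >> 1 )); cpu=$(( cpu + 1 ))
done
echo "cpus:$cpus"
```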
723 | Example 4 | 775 | 4) Example 4 |
724 | --------- | ||
725 | 776 | ||
726 | The resource groups in previous examples were all in the default "shareable" | 777 | The resource groups in previous examples were all in the default "shareable" |
727 | mode allowing sharing of their cache allocations. If one resource group | 778 | mode allowing sharing of their cache allocations. If one resource group |
@@ -732,157 +783,168 @@ In this example a new exclusive resource group will be created on a L2 CAT | |||
732 | system with two L2 cache instances that can be configured with an 8-bit | 783 | system with two L2 cache instances that can be configured with an 8-bit |
733 | capacity bitmask. The new exclusive resource group will be configured to use | 784 | capacity bitmask. The new exclusive resource group will be configured to use |
734 | 25% of each cache instance. | 785 | 25% of each cache instance. |
786 | :: | ||
735 | 787 | ||
736 | # mount -t resctrl resctrl /sys/fs/resctrl/ | 788 | # mount -t resctrl resctrl /sys/fs/resctrl/ |
737 | # cd /sys/fs/resctrl | 789 | # cd /sys/fs/resctrl |
738 | 790 | ||
739 | First, we observe that the default group is configured to allocate to all L2 | 791 | First, we observe that the default group is configured to allocate to all L2 |
740 | cache: | 792 | cache:: |
741 | 793 | ||
742 | # cat schemata | 794 | # cat schemata |
743 | L2:0=ff;1=ff | 795 | L2:0=ff;1=ff |
744 | 796 | ||
745 | We could attempt to create the new resource group at this point, but it will | 797 | We could attempt to create the new resource group at this point, but it will |
746 | fail because of the overlap with the schemata of the default group: | 798 | fail because of the overlap with the schemata of the default group:: |
747 | # mkdir p0 | 799 | |
748 | # echo 'L2:0=0x3;1=0x3' > p0/schemata | 800 | # mkdir p0 |
749 | # cat p0/mode | 801 | # echo 'L2:0=0x3;1=0x3' > p0/schemata |
750 | shareable | 802 | # cat p0/mode |
751 | # echo exclusive > p0/mode | 803 | shareable |
752 | -sh: echo: write error: Invalid argument | 804 | # echo exclusive > p0/mode |
753 | # cat info/last_cmd_status | 805 | -sh: echo: write error: Invalid argument |
754 | schemata overlaps | 806 | # cat info/last_cmd_status |
807 | schemata overlaps | ||
755 | 808 | ||
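The overlap the kernel complains about is a plain bitwise AND between capacity bitmasks; a sketch with the values from this example (variable names are ours):

```shell
# "schemata overlaps": p0's requested mask shares bits with the default group's.
default_cbm=0xff   # default group still claims all eight ways
p0_cbm=0x3         # mask requested for p0
if [ $(( default_cbm & p0_cbm )) -ne 0 ]; then
    echo "overlaps"
fi
```

Once the default group is trimmed to 0xfc, the AND with 0x3 is zero and the exclusive mode change can succeed.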
756 | To ensure that there is no overlap with another resource group, the default | 809 | To ensure that there is no overlap with another resource group, the default |
757 | resource group's schemata has to change, making it possible for the new | 810 | resource group's schemata has to change, making it possible for the new |
758 | resource group to become exclusive. | 811 | resource group to become exclusive. |
759 | # echo 'L2:0=0xfc;1=0xfc' > schemata | 812 | :: |
760 | # echo exclusive > p0/mode | 813 | |
761 | # grep . p0/* | 814 | # echo 'L2:0=0xfc;1=0xfc' > schemata |
762 | p0/cpus:0 | 815 | # echo exclusive > p0/mode |
763 | p0/mode:exclusive | 816 | # grep . p0/* |
764 | p0/schemata:L2:0=03;1=03 | 817 | p0/cpus:0 |
765 | p0/size:L2:0=262144;1=262144 | 818 | p0/mode:exclusive |
819 | p0/schemata:L2:0=03;1=03 | ||
820 | p0/size:L2:0=262144;1=262144 | ||
766 | 821 | ||
767 | A newly created resource group will not overlap with an exclusive resource | 822 | A newly created resource group will not overlap with an exclusive resource |
768 | group: | 823 | group:: |
769 | # mkdir p1 | 824 | |
770 | # grep . p1/* | 825 | # mkdir p1 |
771 | p1/cpus:0 | 826 | # grep . p1/* |
772 | p1/mode:shareable | 827 | p1/cpus:0 |
773 | p1/schemata:L2:0=fc;1=fc | 828 | p1/mode:shareable |
774 | p1/size:L2:0=786432;1=786432 | 829 | p1/schemata:L2:0=fc;1=fc |
775 | 830 | p1/size:L2:0=786432;1=786432 | |
776 | The bit_usage will reflect how the cache is used: | 831 | |
777 | # cat info/L2/bit_usage | 832 | The bit_usage will reflect how the cache is used:: |
778 | 0=SSSSSSEE;1=SSSSSSEE | 833 | |
779 | 834 | # cat info/L2/bit_usage | |
780 | A resource group cannot be forced to overlap with an exclusive resource group: | 835 | 0=SSSSSSEE;1=SSSSSSEE |
781 | # echo 'L2:0=0x1;1=0x1' > p1/schemata | 836 | |
782 | -sh: echo: write error: Invalid argument | 837 | A resource group cannot be forced to overlap with an exclusive resource group:: |
783 | # cat info/last_cmd_status | 838 | |
784 | overlaps with exclusive group | 839 | # echo 'L2:0=0x1;1=0x1' > p1/schemata |
840 | -sh: echo: write error: Invalid argument | ||
841 | # cat info/last_cmd_status | ||
842 | overlaps with exclusive group | ||
785 | 843 | ||
786 | Example of Cache Pseudo-Locking | 844 | Example of Cache Pseudo-Locking |
787 | ------------------------------- | 845 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
788 | Lock a portion of the L2 cache from cache id 1 using CBM 0x3. The pseudo-locked | 846 | Lock a portion of the L2 cache from cache id 1 using CBM 0x3. The pseudo-locked |
789 | region is exposed at /dev/pseudo_lock/newlock, which can be provided to an | 847 | region is exposed at /dev/pseudo_lock/newlock, which can be provided to an |
790 | application as an argument to mmap(). | 848 | application as an argument to mmap(). |
849 | :: | ||
791 | 850 | ||
792 | # mount -t resctrl resctrl /sys/fs/resctrl/ | 851 | # mount -t resctrl resctrl /sys/fs/resctrl/ |
793 | # cd /sys/fs/resctrl | 852 | # cd /sys/fs/resctrl |
794 | 853 | ||
795 | Ensure that there are bits available that can be pseudo-locked. Since only | 854 | Ensure that there are bits available that can be pseudo-locked. Since only |
796 | unused bits can be pseudo-locked, the bits to be pseudo-locked need to be | 855 | unused bits can be pseudo-locked, the bits to be pseudo-locked need to be |
797 | removed from the default resource group's schemata: | 856 | removed from the default resource group's schemata:: |
798 | # cat info/L2/bit_usage | 857 | |
799 | 0=SSSSSSSS;1=SSSSSSSS | 858 | # cat info/L2/bit_usage |
800 | # echo 'L2:1=0xfc' > schemata | 859 | 0=SSSSSSSS;1=SSSSSSSS |
801 | # cat info/L2/bit_usage | 860 | # echo 'L2:1=0xfc' > schemata |
802 | 0=SSSSSSSS;1=SSSSSS00 | 861 | # cat info/L2/bit_usage |
862 | 0=SSSSSSSS;1=SSSSSS00 | ||
803 | 863 | ||
804 | Create a new resource group that will be associated with the pseudo-locked | 864 | Create a new resource group that will be associated with the pseudo-locked |
805 | region, indicate that it will be used for a pseudo-locked region, and | 865 | region, indicate that it will be used for a pseudo-locked region, and |
806 | configure the requested pseudo-locked region capacity bitmask: | 866 | configure the requested pseudo-locked region capacity bitmask:: |
807 | 867 | ||
808 | # mkdir newlock | 868 | # mkdir newlock |
809 | # echo pseudo-locksetup > newlock/mode | 869 | # echo pseudo-locksetup > newlock/mode |
810 | # echo 'L2:1=0x3' > newlock/schemata | 870 | # echo 'L2:1=0x3' > newlock/schemata |
811 | 871 | ||
812 | On success the resource group's mode will change to pseudo-locked, the | 872 | On success the resource group's mode will change to pseudo-locked, the |
813 | bit_usage will reflect the pseudo-locked region, and the character device | 873 | bit_usage will reflect the pseudo-locked region, and the character device |
814 | exposing the pseudo-locked region will exist: | 874 | exposing the pseudo-locked region will exist:: |
815 | 875 | ||
816 | # cat newlock/mode | 876 | # cat newlock/mode |
817 | pseudo-locked | 877 | pseudo-locked |
818 | # cat info/L2/bit_usage | 878 | # cat info/L2/bit_usage |
819 | 0=SSSSSSSS;1=SSSSSSPP | 879 | 0=SSSSSSSS;1=SSSSSSPP |
820 | # ls -l /dev/pseudo_lock/newlock | 880 | # ls -l /dev/pseudo_lock/newlock |
821 | crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock | 881 | crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock |
822 | 882 | ||
823 | /* | 883 | :: |
824 | * Example code to access one page of pseudo-locked cache region | 884 | |
825 | * from user space. | 885 | /* |
826 | */ | 886 | * Example code to access one page of pseudo-locked cache region |
827 | #define _GNU_SOURCE | 887 | * from user space. |
828 | #include <fcntl.h> | 888 | */ |
829 | #include <sched.h> | 889 | #define _GNU_SOURCE |
830 | #include <stdio.h> | 890 | #include <fcntl.h> |
831 | #include <stdlib.h> | 891 | #include <sched.h> |
832 | #include <unistd.h> | 892 | #include <stdio.h> |
833 | #include <sys/mman.h> | 893 | #include <stdlib.h> |
834 | 894 | #include <unistd.h> | |
835 | /* | 895 | #include <sys/mman.h> |
836 | * It is required that the application runs with affinity to only | 896 | |
837 | * cores associated with the pseudo-locked region. Here the cpu | 897 | /* |
838 | * is hardcoded for convenience of example. | 898 | * It is required that the application runs with affinity to only |
839 | */ | 899 | * cores associated with the pseudo-locked region. Here the cpu |
840 | static int cpuid = 2; | 900 | * is hardcoded for convenience of example. |
841 | 901 | */ | |
842 | int main(int argc, char *argv[]) | 902 | static int cpuid = 2; |
843 | { | 903 | |
844 | cpu_set_t cpuset; | 904 | int main(int argc, char *argv[]) |
845 | long page_size; | 905 | { |
846 | void *mapping; | 906 | cpu_set_t cpuset; |
847 | int dev_fd; | 907 | long page_size; |
848 | int ret; | 908 | void *mapping; |
849 | 909 | int dev_fd; | |
850 | page_size = sysconf(_SC_PAGESIZE); | 910 | int ret; |
851 | 911 | ||
852 | CPU_ZERO(&cpuset); | 912 | page_size = sysconf(_SC_PAGESIZE); |
853 | CPU_SET(cpuid, &cpuset); | 913 | |
854 | ret = sched_setaffinity(0, sizeof(cpuset), &cpuset); | 914 | CPU_ZERO(&cpuset); |
855 | if (ret < 0) { | 915 | CPU_SET(cpuid, &cpuset); |
856 | perror("sched_setaffinity"); | 916 | ret = sched_setaffinity(0, sizeof(cpuset), &cpuset); |
857 | exit(EXIT_FAILURE); | 917 | if (ret < 0) { |
858 | } | 918 | perror("sched_setaffinity"); |
859 | 919 | exit(EXIT_FAILURE); | |
860 | dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR); | 920 | } |
861 | if (dev_fd < 0) { | 921 | |
862 | perror("open"); | 922 | dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR); |
863 | exit(EXIT_FAILURE); | 923 | if (dev_fd < 0) { |
864 | } | 924 | perror("open"); |
865 | 925 | exit(EXIT_FAILURE); | |
866 | mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, | 926 | } |
867 | dev_fd, 0); | 927 | |
868 | if (mapping == MAP_FAILED) { | 928 | mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, |
869 | perror("mmap"); | 929 | dev_fd, 0); |
870 | close(dev_fd); | 930 | if (mapping == MAP_FAILED) { |
871 | exit(EXIT_FAILURE); | 931 | perror("mmap"); |
872 | } | 932 | close(dev_fd); |
873 | 933 | exit(EXIT_FAILURE); | |
874 | /* Application interacts with pseudo-locked memory @mapping */ | 934 | } |
875 | 935 | ||
876 | ret = munmap(mapping, page_size); | 936 | /* Application interacts with pseudo-locked memory @mapping */ |
877 | if (ret < 0) { | 937 | |
878 | perror("munmap"); | 938 | ret = munmap(mapping, page_size); |
879 | close(dev_fd); | 939 | if (ret < 0) { |
880 | exit(EXIT_FAILURE); | 940 | perror("munmap"); |
881 | } | 941 | close(dev_fd); |
882 | 942 | exit(EXIT_FAILURE); | |
883 | close(dev_fd); | 943 | } |
884 | exit(EXIT_SUCCESS); | 944 | |
885 | } | 945 | close(dev_fd); |
946 | exit(EXIT_SUCCESS); | ||
947 | } | ||
886 | 948 | ||
887 | Locking between applications | 949 | Locking between applications |
888 | ---------------------------- | 950 | ---------------------------- |
@@ -921,86 +983,86 @@ Read lock: | |||
921 | B) If successful, read the directory structure. | 983 | B) If successful, read the directory structure. |
922 | C) funlock | 984 | C) funlock |
923 | 985 | ||
924 | Example with bash: | 986 | Example with bash:: |
925 | 987 | ||
926 | # Atomically read directory structure | 988 | # Atomically read directory structure |
927 | $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl | 989 | $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl |
928 | 990 | ||
929 | # Read directory contents and create new subdirectory | 991 | # Read directory contents and create new subdirectory |
930 | 992 | ||
931 | $ cat create-dir.sh | 993 | $ cat create-dir.sh |
932 | find /sys/fs/resctrl/ > output.txt | 994 | find /sys/fs/resctrl/ > output.txt |
933 | mask = function-of(output.txt) | 995 | mask = function-of(output.txt) |
934 | mkdir /sys/fs/resctrl/newres/ | 996 | mkdir /sys/fs/resctrl/newres/ |
935 | echo mask > /sys/fs/resctrl/newres/schemata | 997 | echo mask > /sys/fs/resctrl/newres/schemata |
936 | 998 | ||
937 | $ flock /sys/fs/resctrl/ ./create-dir.sh | 999 | $ flock /sys/fs/resctrl/ ./create-dir.sh |
938 | 1000 | ||
939 | Example with C: | 1001 | Example with C:: |
940 | 1002 | ||
941 | /* | 1003 | /* |
942 | * Example code to take advisory locks | 1004 | * Example code to take advisory locks |
943 | * before accessing resctrl filesystem | 1005 | * before accessing resctrl filesystem |
944 | */ | 1006 | */ |
945 | #include <sys/file.h> | 1007 | #include <sys/file.h> |
946 | #include <stdlib.h> | 1008 | #include <stdlib.h> |
947 | 1009 | ||
948 | void resctrl_take_shared_lock(int fd) | 1010 | void resctrl_take_shared_lock(int fd) |
949 | { | 1011 | { |
950 | int ret; | 1012 | int ret; |
951 | 1013 | ||
952 | /* take shared lock on resctrl filesystem */ | 1014 | /* take shared lock on resctrl filesystem */ |
953 | ret = flock(fd, LOCK_SH); | 1015 | ret = flock(fd, LOCK_SH); |
954 | if (ret) { | 1016 | if (ret) { |
955 | perror("flock"); | 1017 | perror("flock"); |
956 | exit(-1); | 1018 | exit(-1); |
957 | } | 1019 | } |
958 | } | 1020 | } |
959 | 1021 | ||
960 | void resctrl_take_exclusive_lock(int fd) | 1022 | void resctrl_take_exclusive_lock(int fd) |
961 | { | 1023 | { |
962 | int ret; | 1024 | int ret; |
963 | 1025 | ||
964 | /* take exclusive lock on resctrl filesystem */ | 1026 | /* take exclusive lock on resctrl filesystem */ |
965 | ret = flock(fd, LOCK_EX); | 1027 | ret = flock(fd, LOCK_EX); |
966 | if (ret) { | 1028 | if (ret) { |
967 | perror("flock"); | 1029 | perror("flock"); |
968 | exit(-1); | 1030 | exit(-1); |
969 | } | 1031 | } |
970 | } | 1032 | } |
971 | 1033 | ||
972 | void resctrl_release_lock(int fd) | 1034 | void resctrl_release_lock(int fd) |
973 | { | 1035 | { |
974 | int ret; | 1036 | int ret; |
975 | 1037 | ||
976 | /* release lock on resctrl filesystem */ | 1038 | /* release lock on resctrl filesystem */ |
977 | ret = flock(fd, LOCK_UN); | 1039 | ret = flock(fd, LOCK_UN); |
978 | if (ret) { | 1040 | if (ret) { |
979 | perror("flock"); | 1041 | perror("flock"); |
980 | exit(-1); | 1042 | exit(-1); |
981 | } | 1043 | } |
982 | } | 1044 | } |
983 | 1045 | ||
984 | int main(void) | 1046 | int main(void) |
985 | { | 1047 | { |
986 | int fd; | 1048 | int fd; |
987 | 1049 | ||
988 | fd = open("/sys/fs/resctrl", O_DIRECTORY); | 1050 | fd = open("/sys/fs/resctrl", O_DIRECTORY); |
989 | if (fd == -1) { | 1051 | if (fd == -1) { |
990 | perror("open"); | 1052 | perror("open"); |
991 | exit(-1); | 1053 | exit(-1); |
992 | } | 1054 | } |
993 | resctrl_take_shared_lock(fd); | 1055 | resctrl_take_shared_lock(fd); |
994 | /* code to read directory contents */ | 1056 | /* code to read directory contents */ |
995 | resctrl_release_lock(fd); | 1057 | resctrl_release_lock(fd); |
996 | 1058 | ||
997 | resctrl_take_exclusive_lock(fd); | 1059 | resctrl_take_exclusive_lock(fd); |
998 | /* code to read and write directory contents */ | 1060 | /* code to read and write directory contents */ |
999 | resctrl_release_lock(fd); | 1061 | resctrl_release_lock(fd); |
1000 | } | 1062 | } |
1001 | 1063 | ||
1002 | Examples for RDT Monitoring along with allocation usage: | 1064 | Examples for RDT Monitoring along with allocation usage |
1003 | 1065 | ======================================================= | |
1004 | Reading monitored data | 1066 | Reading monitored data |
1005 | ---------------------- | 1067 | ---------------------- |
1006 | Reading an event file (for example, mon_data/mon_L3_00/llc_occupancy) would | 1068 | Reading an event file (for example, mon_data/mon_L3_00/llc_occupancy) would |
@@ -1009,17 +1071,17 @@ group or CTRL_MON group. | |||
1009 | 1071 | ||
1010 | 1072 | ||
1011 | Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group) | 1073 | Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group) |
1012 | --------- | 1074 | ------------------------------------------------------------------------ |
1013 | On a two socket machine (one L3 cache per socket) with just four bits | 1075 | On a two socket machine (one L3 cache per socket) with just four bits |
1014 | for cache bit masks | 1076 | for cache bit masks:: |
1015 | 1077 | ||
1016 | # mount -t resctrl resctrl /sys/fs/resctrl | 1078 | # mount -t resctrl resctrl /sys/fs/resctrl |
1017 | # cd /sys/fs/resctrl | 1079 | # cd /sys/fs/resctrl |
1018 | # mkdir p0 p1 | 1080 | # mkdir p0 p1 |
1019 | # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata | 1081 | # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata |
1020 | # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata | 1082 | # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata |
1021 | # echo 5678 > p1/tasks | 1083 | # echo 5678 > p1/tasks |
1022 | # echo 5679 > p1/tasks | 1084 | # echo 5679 > p1/tasks |
1023 | 1085 | ||
1024 | The default resource group is unmodified, so we have access to all parts | 1086 | The default resource group is unmodified, so we have access to all parts |
1025 | of all caches (its schemata file reads "L3:0=f;1=f"). | 1087 | of all caches (its schemata file reads "L3:0=f;1=f"). |
@@ -1029,47 +1091,51 @@ Tasks that are under the control of group "p0" may only allocate from the | |||
1029 | Tasks in group "p1" use the "lower" 50% of cache on both sockets. | 1091 | Tasks in group "p1" use the "lower" 50% of cache on both sockets. |
1030 | 1092 | ||
1031 | Create monitor groups and assign a subset of tasks to each monitor group. | 1093 | Create monitor groups and assign a subset of tasks to each monitor group. |
1094 | :: | ||
1032 | 1095 | ||
1033 | # cd /sys/fs/resctrl/p1/mon_groups | 1096 | # cd /sys/fs/resctrl/p1/mon_groups |
1034 | # mkdir m11 m12 | 1097 | # mkdir m11 m12 |
1035 | # echo 5678 > m11/tasks | 1098 | # echo 5678 > m11/tasks |
1036 | # echo 5679 > m12/tasks | 1099 | # echo 5679 > m12/tasks |
1037 | 1100 | ||
1038 | Fetch data (shown in bytes) | 1101 | Fetch data (shown in bytes) |
1102 | :: | ||
1039 | 1103 | ||
1040 | # cat m11/mon_data/mon_L3_00/llc_occupancy | 1104 | # cat m11/mon_data/mon_L3_00/llc_occupancy |
1041 | 16234000 | 1105 | 16234000 |
1042 | # cat m11/mon_data/mon_L3_01/llc_occupancy | 1106 | # cat m11/mon_data/mon_L3_01/llc_occupancy |
1043 | 14789000 | 1107 | 14789000 |
1044 | # cat m12/mon_data/mon_L3_00/llc_occupancy | 1108 | # cat m12/mon_data/mon_L3_00/llc_occupancy |
1045 | 16789000 | 1109 | 16789000 |
1046 | 1110 | ||
1047 | The parent ctrl_mon group shows the aggregated data. | 1111 | The parent ctrl_mon group shows the aggregated data. |
1112 | :: | ||
1048 | 1113 | ||
1049 | # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy | 1114 | # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy |
1050 | 31234000 | 1115 | 31234000 |
1051 | 1116 | ||
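Per-domain occupancy values can be totalled in the shell as well; `occ_total` below is our own helper, and the example values are fed in directly rather than read from /sys/fs/resctrl:

```shell
# Sum occupancy values (bytes), one per line, as read from mon_L3_* files.
# On a live system: cat /sys/fs/resctrl/p1/mon_data/mon_L3_*/llc_occupancy | occ_total
occ_total() { awk '{ s += $1 } END { print s }'; }

printf '16234000\n14789000\n' | occ_total   # m11's two domains from above
```

This prints 31023000, the total bytes of L3 occupied by m11 across both sockets.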
1052 | Example 2 (Monitor a task from its creation) | 1117 | Example 2 (Monitor a task from its creation) |
1053 | --------- | 1118 | -------------------------------------------- |
1054 | On a two socket machine (one L3 cache per socket) | 1119 | On a two socket machine (one L3 cache per socket):: |
1055 | 1120 | ||
1056 | # mount -t resctrl resctrl /sys/fs/resctrl | 1121 | # mount -t resctrl resctrl /sys/fs/resctrl |
1057 | # cd /sys/fs/resctrl | 1122 | # cd /sys/fs/resctrl |
1058 | # mkdir p0 p1 | 1123 | # mkdir p0 p1 |
1059 | 1124 | ||
1060 | An RMID is allocated to the group once it is created and hence the <cmd> | 1125 | An RMID is allocated to the group once it is created and hence the <cmd> |
1061 | below is monitored from its creation. | 1126 | below is monitored from its creation. |
1127 | :: | ||
1062 | 1128 | ||
1063 | # echo $$ > /sys/fs/resctrl/p1/tasks | 1129 | # echo $$ > /sys/fs/resctrl/p1/tasks |
1064 | # <cmd> | 1130 | # <cmd> |
1065 | 1131 | ||
1066 | Fetch the data | 1132 | Fetch the data:: |
1067 | 1133 | ||
1068 | # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy | 1134 | # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy |
1069 | 31789000 | 1135 | 31789000 |
1070 | 1136 | ||
1071 | Example 3 (Monitor without CAT support or before creating CAT groups) | 1137 | Example 3 (Monitor without CAT support or before creating CAT groups) |
1072 | --------- | 1138 | --------------------------------------------------------------------- |
1073 | 1139 | ||
1074 | Assume a system like HSW (Haswell) has only CQM and no CAT support. In this | 1140 | Assume a system like HSW (Haswell) has only CQM and no CAT support. In this |
1075 | case resctrl will still mount but cannot create CTRL_MON directories. | 1141 | case resctrl will still mount but cannot create CTRL_MON directories. |
@@ -1078,27 +1144,29 @@ able to monitor all tasks including kernel threads. | |||
1078 | 1144 | ||
1079 | This can also be used to profile a job's cache footprint before being | 1145 | This can also be used to profile a job's cache footprint before being |
1080 | able to allocate it to different allocation groups. | 1146 | able to allocate it to different allocation groups. |
1147 | :: | ||
1081 | 1148 | ||
1082 | # mount -t resctrl resctrl /sys/fs/resctrl | 1149 | # mount -t resctrl resctrl /sys/fs/resctrl |
1083 | # cd /sys/fs/resctrl | 1150 | # cd /sys/fs/resctrl |
1084 | # mkdir mon_groups/m01 | 1151 | # mkdir mon_groups/m01 |
1085 | # mkdir mon_groups/m02 | 1152 | # mkdir mon_groups/m02 |
1086 | 1153 | ||
1087 | # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks | 1154 | # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks |
1088 | # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks | 1155 | # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks |
1089 | 1156 | ||
1090 | Monitor the groups separately and also get per-domain data. From the | 1157 | Monitor the groups separately and also get per-domain data. From the |
1091 | output below it is apparent that the tasks are mostly doing work on | 1158 | output below it is apparent that the tasks are mostly doing work on |
1092 | domain (socket) 0. | 1159 | domain (socket) 0. |
1160 | :: | ||
1093 | 1161 | ||
1094 | # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy | 1162 | # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy |
1095 | 31234000 | 1163 | 31234000 |
1096 | # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy | 1164 | # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy |
1097 | 34555 | 1165 | 34555 |
1098 | # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy | 1166 | # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy |
1099 | 31234000 | 1167 | 31234000 |
1100 | # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy | 1168 | # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy |
1101 | 32789 | 1169 | 32789 |
1102 | 1170 | ||
1103 | 1171 | ||
1104 | Example 4 (Monitor real time tasks) | 1172 | Example 4 (Monitor real time tasks) |
@@ -1107,15 +1175,17 @@ Example 4 (Monitor real time tasks) | |||
1107 | A single socket system which has real time tasks running on cores 4-7 | 1175 | A single socket system which has real time tasks running on cores 4-7 |
1108 | and non real time tasks on other cpus. We want to monitor the cache | 1176 | and non real time tasks on other cpus. We want to monitor the cache |
1109 | occupancy of the real time threads on these cores. | 1177 | occupancy of the real time threads on these cores. |
1178 | :: | ||
1179 | |||
1180 | # mount -t resctrl resctrl /sys/fs/resctrl | ||
1181 | # cd /sys/fs/resctrl | ||
1182 | # mkdir p1 | ||
1110 | 1183 | ||
1111 | # mount -t resctrl resctrl /sys/fs/resctrl | 1184 | Move the cpus 4-7 over to p1:: |
1112 | # cd /sys/fs/resctrl | ||
1113 | # mkdir p1 | ||
1114 | 1185 | ||
1115 | Move the cpus 4-7 over to p1 | 1186 | # echo f0 > p1/cpus |
1116 | # echo f0 > p1/cpus | ||
1117 | 1187 | ||
1118 | View the llc occupancy snapshot | 1188 | View the llc occupancy snapshot:: |
1119 | 1189 | ||
1120 | # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy | 1190 | # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy |
1121 | 11234000 | 1191 | 11234000 |