author | Joshua Bakita <bakitajoshua@gmail.com> | 2024-09-25 13:28:56 -0400 |
---|---|---|
committer | Joshua Bakita <bakitajoshua@gmail.com> | 2024-09-25 13:28:56 -0400 |
commit | e2fe4cb56e6252b9cf0b43c6180efbb20a168ce0 (patch) | |
tree | 925d1dd31efe5e066e5a962b3dfbc761ad7caca8 /README.md | |
parent | 2104d15eb12b03ed4cfa8eb4dc95ad13cee43227 (diff) |
Add a README
See also the RTAS'23 and RTAS'24 papers.
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 97 |
1 files changed, 97 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..da3e5d7
--- /dev/null
+++ b/README.md
@@ -0,0 +1,97 @@ | |||
1 | # nvdebug | ||
2 | Copyright 2021-2024 Joshua Bakita | ||
3 | |||
4 | Written to support my research on increasing the throughput and predictability of NVIDIA GPUs when running multiple tasks. | ||
5 | |||
6 | Please cite the following papers if using this in any published work: | ||
7 | |||
8 | 1. J. Bakita and J. Anderson, “Hardware Compute Partitioning on NVIDIA GPUs”, Proceedings of the 29th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 54–66, May 2023. | ||
9 | 2. J. Bakita and J. Anderson, “Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management”, Proceedings of the 30th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 294-305, May 2024. | ||
10 | |||
11 | ## API Overview | ||
12 | This module creates a virtual folder at `/proc/gpuX` for each GPU X in the system. | ||
13 | The contained virtual files and folders, when read, return plaintext representations of various aspects of GPU state. | ||
14 | Some files can also be written to, to modify GPU behavior. | ||
15 | |||
16 | The major API surfaces are composed of the following files: | ||
17 | |||
18 | - Device info (`device_info`, `copy_topology`...) | ||
19 | - Scheduling examination (`runlist`) | ||
20 | - Scheduling manipulation (`enable/disable_channel`, `switch_to/preempt_tsg`, `resubmit_runlist`) | ||
21 | |||
22 | As of Sept 2024, this module supports all generations of NVIDIA GPUs. | ||
23 | Very old GPUs (pre-Kepler) are mostly untested, and only a few APIs are likely to work. | ||
24 | APIs are designed to detect and handle errors, and should not crash your system in any circumstance. | ||
25 | |||
26 | We now detail how to use each of these APIs. | ||
27 | The following documentation assumes you have already `cd`ed to `/proc/gpuX` for your GPU of interest. | ||
28 | |||
29 | ## Device Information APIs | ||
30 | |||
31 | ### List of Engines | ||
32 | Use `cat device_info` to get a pretty-printed breakdown of which engines this GPU contains. | ||
33 | This information is pulled directly from the GPU topology registers, and should be very reliable. | ||
34 | |||
35 | ### Copy Engines | ||
36 | **See our RTAS'24 paper for why this is important.** | ||
37 | |||
38 | Use `cat copy_topology` to get a pretty-printed mapping of how each configured logical copy engine is serviced. | ||
39 | Each may be serviced by a physical copy engine, or configured to map onto another logical copy engine. | |
40 | This is pulled directly from the GPU copy configuration registers, and should be very reliable. | ||
41 | See the RTAS'24 paper cited at the top of this README for details on why this is important. | |
42 | |||
43 | Use `cat num_ces` to get the number of available copy engines (number of logical copy engines on Pascal+). | ||
44 | |||
45 | ### Texture Processing Cluster (TPC)/Graphics Processing Cluster (GPC) Floorsweeping | ||
46 | **See our RTAS'23 paper for why this is important.** | ||
47 | |||
48 | Use `cat num_gpcs` to get the number of __on-chip__ GPCs. | ||
49 | Not all these GPCs will necessarily be enabled. | |
50 | |||
51 | Use `cat gpc_mask` to get a bit mask of which GPCs are disabled. | ||
52 | A set bit indicates a disabled GPC. | ||
53 | Bit 0 corresponds to GPC 0, bit 1 to GPC 1, and so on, up to the total number of on-chip GPCs. | ||
54 | Bits at positions at or beyond the number of on-chip GPCs should be ignored (it may appear that non-existent GPCs are "disabled"). | |
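
The mask rules above can be sketched in Python. This is a minimal illustration, not part of nvdebug: the helper names `parse_mask` and `disabled_units` are hypothetical, and the sketch assumes the mask is printed as decimal or `0x`-prefixed hexadecimal text, which the README does not specify.

```python
def parse_mask(text: str) -> int:
    """Parse mask text. Assumption: decimal or 0x-prefixed hexadecimal."""
    return int(text.strip(), 0)


def disabled_units(mask: int, num_on_chip: int) -> list[int]:
    """Return the indices of units whose bit is set (disabled) in the mask.

    Bits at positions >= num_on_chip are ignored, since they do not
    correspond to units that exist on the chip.
    """
    return [i for i in range(num_on_chip) if (mask >> i) & 1]


# Made-up values: 4 on-chip GPCs and a mask of 0x25 (binary 100101).
# Bit 5 is set, but it lies beyond the 4 on-chip GPCs and is skipped.
mask = parse_mask("0x25\n")
print(disabled_units(mask, 4))
```

The same decoding would apply to `gpcX_tpc_mask`, with `num_tpc_per_gpc` as the unit count.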
55 | |||
56 | Use `cat num_tpc_per_gpc` to get the number of __on-chip__ TPCs per GPC. | ||
57 | Not all these TPCs will necessarily be enabled in every GPC. | |
58 | |||
59 | Use `cat gpcX_tpc_mask` to get a bit mask of which TPCs are disabled for GPC X. | ||
60 | A set bit indicates a disabled TPC. | ||
61 | This API is only available on enabled GPCs. | ||
62 | |||
63 | Example usage: To get the number of on-chip SMs on Volta+ GPUs, multiply the output of `cat num_gpcs` by the output of `cat num_tpc_per_gpc`, then multiply by 2 (SMs per TPC). | |
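
As a rough sketch, the calculation above could be scripted as follows. The function names are hypothetical, `/proc/gpu0` is only an example of `/proc/gpuX`, and the code assumes each file holds a single integer in decimal or `0x`-prefixed hexadecimal.

```python
from pathlib import Path


def read_int(gpu_dir: Path, name: str) -> int:
    """Read one integer from a file under the GPU's procfs folder."""
    return int((gpu_dir / name).read_text().strip(), 0)


def on_chip_sm_count(gpu_dir: Path, sms_per_tpc: int = 2) -> int:
    """On-chip SMs = GPCs * TPCs per GPC * SMs per TPC (2 on Volta+)."""
    num_gpcs = read_int(gpu_dir, "num_gpcs")
    num_tpc_per_gpc = read_int(gpu_dir, "num_tpc_per_gpc")
    return num_gpcs * num_tpc_per_gpc * sms_per_tpc


if __name__ == "__main__":
    gpu_dir = Path("/proc/gpu0")  # example path; substitute your GPU
    if gpu_dir.is_dir():
        print(on_chip_sm_count(gpu_dir))
```

This counts on-chip SMs only. Some of them may be disabled, as reported by `gpc_mask` and `gpcX_tpc_mask`.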
64 | |||
65 | ## Scheduling Examination and Manipulation | ||
66 | **See our RTAS'24 paper for some uses of this.** | ||
67 | |||
68 | Some of these APIs operate within the scope of a runlist. | ||
69 | `runlistY` represents one of the `runlist0`, `runlist1`, `runlist2`, etc. folders. | |
70 | |||
71 | Use `cat runlistY/runlist` to view the contents and status of all channels in runlist Y. | ||
72 | **This is nvdebug's most substantial API.** | ||
73 | The runlist is composed of time-slice groups (TSGs, also called channel groups in nouveau) and channels. | ||
74 | Channels are indented in the output to indicate that they belong to the preceding TSG. | |
75 | |||
76 | Use `echo Z > disable_channel` or `echo Z > runlistY/disable_channel` to disable channel with ID Z. | ||
77 | |||
78 | Use `echo Z > enable_channel` or `echo Z > runlistY/enable_channel` to enable channel with ID Z. | ||
79 | |||
80 | Use `echo Z > preempt_tsg` or `echo Z > runlistY/preempt_tsg` to trigger a preempt of TSG with ID Z. | ||
81 | |||
82 | Use `echo Z > runlistY/switch_to_tsg` to switch the GPU to run only the specified TSG with ID Z on runlist Y. | ||
83 | |||
84 | Use `echo Y > resubmit_runlist` to resubmit runlist Y (useful to prompt newer GPUs to pick up on re-enabled channels). | ||
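
The `echo` commands above can also be issued from a script. The following is a minimal sketch with hypothetical helper names. It assumes IDs are written as decimal text followed by a newline (as `echo` would do) and that the script has permission to write to these files. It has not been tested against the module.

```python
from pathlib import Path


def write_id(gpu_dir: Path, rel_path: str, ident: int) -> None:
    """Write an ID to a control file, mirroring `echo ident > rel_path`."""
    (gpu_dir / rel_path).write_text(f"{ident}\n")


def pause_channel_then_resume(gpu_dir: Path, runlist: int, channel: int) -> None:
    """Disable a channel, re-enable it, then resubmit its runlist."""
    write_id(gpu_dir, f"runlist{runlist}/disable_channel", channel)
    # The channel is disabled at this point; inspect runlistY/runlist
    # or run an experiment here before re-enabling it.
    write_id(gpu_dir, f"runlist{runlist}/enable_channel", channel)
    # Resubmitting prompts newer GPUs to pick up the re-enabled channel.
    write_id(gpu_dir, "resubmit_runlist", runlist)
```

For example, `pause_channel_then_resume(Path("/proc/gpu0"), 0, 5)` would act on channel 5 of runlist 0. Both values are placeholders.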
85 | |||
86 | ## General Codebase Structure | ||
87 | - `nvdebug.h` defines and describes all GPU data structures. This does not depend on any kernel-internal headers. | ||
88 | - `nvdebug_entry.h` contains module startup, device detection, initialization, and module teardown logic. | ||
89 | - `runlist.c`, `bus.c`, and `mmu.c` describe Linux-independent (as far as practicable) GPU data structure accessors. | ||
90 | - `*_procfs.c` define `/proc/gpuX/` interfaces for reading or writing to GPU data structures. | ||
91 | - `nvdebug_linux.c` contains Linux-specific accessors. | ||
92 | |||
93 | ## Known Issues and Workarounds | ||
94 | |||
95 | - The runlist-printing API does not work when runlist management is delegated to the GPU System Processor (GSP) (most Turing+ datacenter GPUs). | ||
96 | To work around this, enable the `FALLBACK_TO_PRAMIN` define in `runlist.c`, or reload the `nvidia` kernel module with the `NVreg_EnableGpuFirmware=0` parameter. | |
97 | (E.g., on A100: end all GPU-using processes, then `sudo rmmod nvidia_uvm nvidia; sudo modprobe nvidia NVreg_EnableGpuFirmware=0`.) | |