# nvdebug
Copyright 2021-2024 Joshua Bakita

Written to support my research on increasing the throughput and predictability of NVIDIA GPUs when running multiple tasks.

Please cite the following papers if using this in any published work:

1. J. Bakita and J. Anderson, “Hardware Compute Partitioning on NVIDIA GPUs”, Proceedings of the 29th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 54–66, May 2023.
2. J. Bakita and J. Anderson, “Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management”, Proceedings of the 30th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 294–305, May 2024.

## API Overview
This module creates a virtual folder at `/proc/gpuX` for each GPU X in the system.
The contained virtual files and folders, when read, return plaintext representations of various aspects of GPU state.
Some files can also be written to, to modify GPU behavior.

The major API surfaces are composed of the following files:

- Device info (`device_info`, `copy_topology`...)
- Scheduling examination (`runlist`)
- Scheduling manipulation (`enable/disable_channel`, `switch_to/preempt_tsg`, `resubmit_runlist`)

As of Sept 2024, this module supports all generations of NVIDIA GPUs.
Very old GPUs (pre-Kepler) are mostly untested, and only a few APIs are likely to work.
APIs are designed to detect and handle errors, and should not crash your system in any circumstance.

We now detail how to use each of these APIs.
The following documentation assumes you have already `cd`ed to `/proc/gpuX` for your GPU of interest.
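
For example, a minimal session on a hypothetical GPU 0 might look like the following (the exact set of files present varies by GPU generation):

```bash
# Enter the nvdebug directory for GPU 0
cd /proc/gpu0
# List the API files and runlist folders available for this GPU
ls
# Read one of the plaintext GPU state representations
cat device_info
```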
| 28 | |||
| 29 | ## Device Information APIs | ||
| 30 | |||
| 31 | ### List of Engines | ||
| 32 | Use `cat device_info` to get a pretty-printed breakdown of which engines this GPU contains. | ||
| 33 | This information is pulled directly from the GPU topology registers, and should be very reliable. | ||
| 34 | |||
| 35 | ### Copy Engines | ||
| 36 | **See our RTAS'24 paper for why this is important.** | ||
| 37 | |||
| 38 | Use `cat copy_topology` to get a pretty-printed mapping of how each configured logical copy engine is serviced. | ||
Each may be serviced by a physical copy engine, or configured to map onto another logical copy engine.
This is pulled directly from the GPU copy configuration registers, and should be very reliable.
See the RTAS'24 paper listed in the "Citing" section for details on why this is important.

Use `cat num_ces` to get the number of available copy engines (number of logical copy engines on Pascal+).
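
For example (illustrative only; the exact output format varies by GPU and configuration):

```bash
# How many (logical, on Pascal+) copy engines this GPU exposes
cat num_ces
# How each logical copy engine is serviced: by a physical copy engine,
# or by mapping onto another logical copy engine
cat copy_topology
```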
| 44 | |||
| 45 | ### Texture Processing Cluster (TPC)/Graphics Processing Cluster (GPC) Floorsweeping | ||
| 46 | **See our RTAS'23 paper for why this is important.** | ||
| 47 | |||
| 48 | Use `cat num_gpcs` to get the number of __on-chip__ GPCs. | ||
Not all these GPCs will necessarily be enabled.
| 50 | |||
| 51 | Use `cat gpc_mask` to get a bit mask of which GPCs are disabled. | ||
| 52 | A set bit indicates a disabled GPC. | ||
| 53 | Bit 0 corresponds to GPC 0, bit 1 to GPC 1, and so on, up to the total number of on-chip GPCs. | ||
Bits at or above the number of on-chip GPCs should be ignored (it may appear that non-existent GPCs are "disabled").
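
A minimal sketch of interpreting the mask in shell, assuming `gpc_mask` prints a number that shell arithmetic can parse (decimal or `0x`-prefixed hexadecimal):

```bash
# Count enabled GPCs: check each bit position below the on-chip GPC count
gpcs=$(cat num_gpcs)
mask=$(cat gpc_mask)
enabled=0
for ((i = 0; i < gpcs; i++)); do
    # Bit i clear => GPC i is enabled
    (( ((mask >> i) & 1) == 0 )) && enabled=$((enabled + 1))
done
echo "$enabled of $gpcs on-chip GPCs are enabled"
```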
| 55 | |||
| 56 | Use `cat num_tpc_per_gpc` to get the number of __on-chip__ TPCs per GPC. | ||
Not all these TPCs will necessarily be enabled in every GPC.
| 58 | |||
| 59 | Use `cat gpcX_tpc_mask` to get a bit mask of which TPCs are disabled for GPC X. | ||
| 60 | A set bit indicates a disabled TPC. | ||
| 61 | This API is only available on enabled GPCs. | ||
| 62 | |||
Example usage: To get the number of on-chip SMs on Volta+ GPUs, multiply the output of `cat num_gpcs` by the output of `cat num_tpc_per_gpc`, and then by 2 (SMs per TPC).
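
A one-line sketch of that computation in shell (the factor of 2 SMs per TPC applies to Volta and later, as noted above):

```bash
# On-chip SMs on Volta+ = GPCs x TPCs-per-GPC x 2 SMs-per-TPC
echo $(( $(cat num_gpcs) * $(cat num_tpc_per_gpc) * 2 ))
```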
| 64 | |||
| 65 | ## Scheduling Examination and Manipulation | ||
| 66 | **See our RTAS'24 paper for some uses of this.** | ||
| 67 | |||
| 68 | Some of these APIs operate within the scope of a runlist. | ||
`runlistY` represents one of the `runlist0`, `runlist1`, `runlist2`, etc. folders.
| 70 | |||
| 71 | Use `cat runlistY/runlist` to view the contents and status of all channels in runlist Y. | ||
| 72 | **This is nvdebug's most substantial API.** | ||
| 73 | The runlist is composed of time-slice groups (TSGs, also called channel groups in nouveau) and channels. | ||
Channels are indented in the output to indicate that they belong to the preceding TSG.
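
For example, to dump every runlist on this GPU in one go (a sketch; the number of `runlistY` folders varies by GPU):

```bash
# Print the contents and status of every runlist on this GPU
for r in runlist*/; do
    echo "=== ${r%/} ==="
    cat "${r}runlist"
done
```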
| 75 | |||
| 76 | Use `echo Z > disable_channel` or `echo Z > runlistY/disable_channel` to disable channel with ID Z. | ||
| 77 | |||
| 78 | Use `echo Z > enable_channel` or `echo Z > runlistY/enable_channel` to enable channel with ID Z. | ||
| 79 | |||
| 80 | Use `echo Z > preempt_tsg` or `echo Z > runlistY/preempt_tsg` to trigger a preempt of TSG with ID Z. | ||
| 81 | |||
| 82 | Use `echo Z > runlistY/switch_to_tsg` to switch the GPU to run only the specified TSG with ID Z on runlist Y. | ||
| 83 | |||
| 84 | Use `echo Y > resubmit_runlist` to resubmit runlist Y (useful to prompt newer GPUs to pick up on re-enabled channels). | ||
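
Putting these together, a sketch of temporarily idling one channel (the channel ID 8, TSG ID 5, and runlist 0 below are placeholders; read `runlist0/runlist` to find real IDs on your system):

```bash
# Stop channel 8 from being scheduled, and kick its TSG off the GPU if running
echo 8 > runlist0/disable_channel
echo 5 > runlist0/preempt_tsg
# Prompt newer GPUs to re-fetch the runlist and notice the change
echo 0 > resubmit_runlist

# ...later, undo it
echo 8 > runlist0/enable_channel
echo 0 > resubmit_runlist
```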
| 85 | |||
| 86 | ## General Codebase Structure | ||
| 87 | - `nvdebug.h` defines and describes all GPU data structures. This does not depend on any kernel-internal headers. | ||
| 88 | - `nvdebug_entry.h` contains module startup, device detection, initialization, and module teardown logic. | ||
| 89 | - `runlist.c`, `bus.c`, and `mmu.c` describe Linux-independent (as far as practicable) GPU data structure accessors. | ||
| 90 | - `*_procfs.c` define `/proc/gpuX/` interfaces for reading or writing to GPU data structures. | ||
| 91 | - `nvdebug_linux.c` contains Linux-specific accessors. | ||
| 92 | |||
| 93 | ## Known Issues and Workarounds | ||
| 94 | |||
| 95 | - The runlist-printing API does not work when runlist management is delegated to the GPU System Processor (GSP) (most Turing+ datacenter GPUs). | ||
  To work around this, enable the `FALLBACK_TO_PRAMIN` define in `runlist.c`, or reload the `nvidia` kernel module with the `NVreg_EnableGpuFirmware=0` parameter setting.
  (E.g., on an A100: end all GPU-using processes, then `sudo rmmod nvidia_uvm nvidia; sudo modprobe nvidia NVreg_EnableGpuFirmware=0`.)
