nvdebug
Copyright 2021-2024 Joshua Bakita
Written to support my research on increasing the throughput and predictability of NVIDIA GPUs when running multiple tasks.
Please cite the following papers if using this in any published work:
- J. Bakita and J. Anderson, “Hardware Compute Partitioning on NVIDIA GPUs”, Proceedings of the 29th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 54–66, May 2023.
- J. Bakita and J. Anderson, “Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management”, Proceedings of the 30th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 294–305, May 2024.
API Overview
This module creates a virtual folder at /proc/gpuX for each GPU X in the system.
The contained virtual files and folders, when read, return plaintext representations of various aspects of GPU state.
Some files can also be written to, to modify GPU behavior.
The major API surfaces are composed of the following files:
- Device info (device_info, copy_topology, ...)
- Scheduling examination (runlist)
- Scheduling manipulation (enable/disable_channel, switch_to/preempt_tsg, resubmit_runlist)
As of Sept 2024, this module supports all generations of NVIDIA GPUs. Very old GPUs (pre-Kepler) are mostly untested, and only a few APIs are likely to work. APIs are designed to detect and handle errors, and should not crash your system in any circumstance.
We now detail how to use each of these APIs.
The following documentation assumes you have already cd'ed to /proc/gpuX for your GPU of interest.
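On a multi-GPU system it can help to enumerate the per-GPU folders before choosing one. The following is a minimal sketch, not part of nvdebug itself: list_gpu_dirs is a hypothetical helper name, and the optional root argument exists only so the function can be tried against a directory other than /proc.

```sh
#!/bin/sh
# list_gpu_dirs [ROOT]
# Print every gpuX folder found under ROOT (default: /proc), one per line.
# Hypothetical helper; ROOT is a parameter so this can run without a GPU.
list_gpu_dirs() {
    root="${1:-/proc}"
    for dir in "$root"/gpu[0-9]*; do
        # When nothing matches, the glob stays unexpanded, so check first.
        if [ -d "$dir" ]; then
            echo "$dir"
        fi
    done
    return 0
}
```

For example, `for d in $(list_gpu_dirs); do cat "$d/device_info"; done` would print the engine list of every detected GPU.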
Device Information APIs
List of Engines
Use cat device_info to get a pretty-printed breakdown of which engines this GPU contains.
This information is pulled directly from the GPU topology registers, and should be very reliable.
Copy Engines
See our RTAS'24 paper for why this is important.
Use cat copy_topology to get a pretty-printed mapping of how each configured logical copy engine is serviced.
Each logical copy engine may be serviced by a physical copy engine, or configured to map onto another logical copy engine.
This is pulled directly from the GPU copy configuration registers, and should be very reliable.
Use cat num_ces to get the number of available copy engines (number of logical copy engines on Pascal+).
Texture Processing Cluster (TPC)/Graphics Processing Cluster (GPC) Floorsweeping
See our RTAS'23 paper for why this is important.
Use cat num_gpcs to get the number of on-chip GPCs.
Not all these GPCs will necessarily be enabled.
Use cat gpc_mask to get a bit mask of which GPCs are disabled.
A set bit indicates a disabled GPC.
Bit 0 corresponds to GPC 0, bit 1 to GPC 1, and so on, up to the total number of on-chip GPCs.
Bits at positions beyond the number of on-chip GPCs should be ignored (it may appear that non-existent GPCs are "disabled").
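As an illustration of the mask semantics described above, the sketch below turns a disable mask into a list of enabled GPC indices. enabled_gpcs is a hypothetical helper, not part of nvdebug, and it assumes the mask is printed in a form that shell arithmetic accepts (decimal or 0x-prefixed hex); check the actual output of cat gpc_mask on your system before relying on it.

```sh
#!/bin/sh
# enabled_gpcs MASK NUM_GPCS
# Print the indices of enabled GPCs, i.e. the bits that are clear in MASK.
# Bits at positions >= NUM_GPCS are never examined, so they are ignored.
# ASSUMPTION: MASK is decimal or 0x-prefixed hex.
enabled_gpcs() {
    mask="$1"
    num="$2"
    i=0
    out=""
    while [ "$i" -lt "$num" ]; do
        # A set bit means "disabled", so keep only the clear bits.
        if [ $(( ($mask >> $i) & 1 )) -eq 0 ]; then
            out="$out $i"
        fi
        i=$((i + 1))
    done
    # Strip the leading space before printing.
    echo "${out# }"
}
```

From within /proc/gpuX, this could be invoked as `enabled_gpcs "$(cat gpc_mask)" "$(cat num_gpcs)"`.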
Use cat num_tpc_per_gpc to get the number of on-chip TPCs per GPC.
Not all these TPCs will necessarily be enabled in every GPC.
Use cat gpcX_tpc_mask to get a bit mask of which TPCs are disabled for GPC X.
A set bit indicates a disabled TPC.
This API is only available on enabled GPCs.
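Combining the per-GPC masks gives a count of TPCs that are actually enabled. The sketch below is one way to do that under stated assumptions: count_enabled_tpcs is a hypothetical helper, num_tpc_per_gpc is assumed to contain a plain decimal number, and each gpcX_tpc_mask is assumed to be decimal or 0x-prefixed hex. Because the mask files only exist for enabled GPCs, disabled GPCs contribute nothing.

```sh
#!/bin/sh
# count_enabled_tpcs [DIR]
# Sum the enabled TPCs over every gpcX_tpc_mask file in DIR (default: .).
# ASSUMPTION: masks are decimal or 0x-prefixed hex; a set bit = disabled TPC.
count_enabled_tpcs() {
    dir="${1:-.}"
    per_gpc=$(cat "$dir/num_tpc_per_gpc")
    total=0
    for f in "$dir"/gpc[0-9]*_tpc_mask; do
        # Skip the unexpanded glob if no mask files are present.
        [ -r "$f" ] || continue
        mask=$(cat "$f")
        i=0
        while [ "$i" -lt "$per_gpc" ]; do
            # Count only clear bits below the per-GPC TPC count.
            if [ $(( ($mask >> $i) & 1 )) -eq 0 ]; then
                total=$((total + 1))
            fi
            i=$((i + 1))
        done
    done
    echo "$total"
}
```

Run it as `count_enabled_tpcs /proc/gpu0`, or with no argument after cd'ing into the GPU folder.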
Example usage: To get the number of on-chip SMs on Volta+ GPUs, multiply the output of cat num_gpcs by the output of cat num_tpc_per_gpc, then multiply by 2 (SMs per TPC).
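The calculation above can be scripted as follows. This is a small sketch, assuming both files contain a plain decimal number; on_chip_sms is a hypothetical helper name. Note that the result counts on-chip SMs, including any that are floorswept.

```sh
#!/bin/sh
# on_chip_sms [DIR]
# Print num_gpcs * num_tpc_per_gpc * 2 for the GPU folder DIR (default: .).
# The factor of 2 (SMs per TPC) applies to Volta+ GPUs.
# ASSUMPTION: both files contain a plain decimal number.
on_chip_sms() {
    dir="${1:-.}"
    gpcs=$(cat "$dir/num_gpcs")
    tpcs=$(cat "$dir/num_tpc_per_gpc")
    echo $(( $gpcs * $tpcs * 2 ))
}
```

For example, a GPU reporting 6 GPCs and 7 TPCs per GPC would yield 84 on-chip SMs.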
Scheduling Examination and Manipulation
See our RTAS'24 paper for some uses of this.
Some of these APIs operate within the scope of a runlist.
runlistY represents one of the runlist0, runlist1, runlist2, etc. folders.
Use cat runlistY/runlist to view the contents and status of all channels in runlist Y.
This is nvdebug's most substantial API.
The runlist is composed of time-slice groups (TSGs, also called channel groups in nouveau) and channels.
Channels are indented in the output to indicate that they belong to the preceding TSG.
Use echo Z > disable_channel or echo Z > runlistY/disable_channel to disable the channel with ID Z.
Use echo Z > enable_channel or echo Z > runlistY/enable_channel to enable the channel with ID Z.
Use echo Z > preempt_tsg or echo Z > runlistY/preempt_tsg to trigger a preempt of the TSG with ID Z.
Use echo Z > runlistY/switch_to_tsg to switch the GPU to run only the specified TSG with ID Z on runlist Y.
Use echo Y > resubmit_runlist to resubmit runlist Y (useful to prompt newer GPUs to pick up on re-enabled channels).
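The write APIs above can be chained. As a sketch, the following temporarily disables a channel, re-enables it, and then resubmits the runlist so that newer GPUs notice the change. pause_channel is a hypothetical helper, not part of nvdebug; writing to these files will likely require root, and the channel and runlist IDs should first be confirmed via cat runlistY/runlist.

```sh
#!/bin/sh
# pause_channel GPU_DIR RUNLIST CHANNEL_ID SECONDS
# Disable a channel for SECONDS seconds, re-enable it, then resubmit the
# runlist. Hypothetical helper built only from the echo commands above.
pause_channel() {
    gpu_dir="$1"
    runlist="$2"
    chan="$3"
    secs="$4"
    echo "$chan" > "$gpu_dir/runlist$runlist/disable_channel" || return 1
    sleep "$secs"
    echo "$chan" > "$gpu_dir/runlist$runlist/enable_channel" || return 1
    # Resubmit so that newer GPUs pick up on the re-enabled channel.
    echo "$runlist" > "$gpu_dir/resubmit_runlist"
}
```

For example, `pause_channel /proc/gpu0 0 5 2` would pause channel 5 on runlist 0 of GPU 0 for two seconds.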
General Codebase Structure
- nvdebug.h defines and describes all GPU data structures. This does not depend on any kernel-internal headers.
- nvdebug_entry.h contains module startup, device detection, initialization, and module teardown logic.
- runlist.c, bus.c, and mmu.c describe Linux-independent (as far as practicable) GPU data structure accessors.
- *_procfs.c define /proc/gpuX/ interfaces for reading or writing to GPU data structures.
- nvdebug_linux.c contains Linux-specific accessors.
Known Issues and Workarounds
- The runlist-printing API does not work when runlist management is delegated to the GPU System Processor (GSP) (most Turing+ datacenter GPUs).
To work around this, enable the FALLBACK_TO_PRAMIN define in runlist.c, or reload the nvidia kernel module with the NVreg_EnableGpuFirmware=0 parameter setting. (E.g., on A100: end all GPU-using processes, then sudo rmmod nvidia_uvm nvidia; sudo modprobe nvidia NVreg_EnableGpuFirmware=0.)