nvdebug

Written to support my research on increasing the throughput and predictability of NVIDIA GPUs when running multiple tasks.

Please cite the following papers if using this in any published work:

J. Bakita and J. Anderson, “Hardware Compute Partitioning on NVIDIA GPUs”, Proceedings of the 29th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 54–66, May 2023.
J. Bakita and J. Anderson, “Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management”, Proceedings of the 30th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 294–305, May 2024.

API Overview

This module creates a virtual folder at /proc/gpuX for each GPU X in the system. The contained virtual files and folders, when read, return plaintext representations of various aspects of GPU state. Some files can also be written to, to modify GPU behavior.

The major API surfaces are composed of the following files:

Device info (device_info, copy_topology...)
Scheduling examination (runlist)
Scheduling manipulation (enable/disable_channel, switch_to/preempt_tsg, resubmit_runlist)

As of Sept 2024, this module supports all generations of NVIDIA GPUs. Very old GPUs (pre-Kepler) are mostly untested, and only a few APIs are likely to work. APIs are designed to detect and handle errors, and should not crash your system in any circumstance.

We now detail how to use each of these APIs. The following documentation assumes you have already changed directory (cd) to /proc/gpuX for your GPU of interest.
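As a minimal sketch of locating these folders, the hypothetical helper below (list_gpu_dirs is not part of nvdebug) prints every gpuX directory under a proc root. The root is a parameter only so that the helper can be tried against a mock directory tree; it defaults to /proc.

```shell
# Hypothetical helper (not part of nvdebug): print each per-GPU directory.
# $1: proc root to search (defaults to /proc).
list_gpu_dirs() (
    proc_root=${1:-/proc}
    for d in "$proc_root"/gpu[0-9]*; do
        # Skip the unexpanded pattern when nothing matches.
        if [ -d "$d" ]; then
            echo "$d"
        fi
    done
)
```

With the module loaded, something like cd "$(list_gpu_dirs | head -n 1)" would then enter the folder of the first GPU found.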

Device Information APIs

List of Engines

Use cat device_info to get a pretty-printed breakdown of which engines this GPU contains. This information is pulled directly from the GPU topology registers, and should be very reliable.

Copy Engines

See our RTAS'24 paper for why this is important.

Use cat copy_topology to get a pretty-printed mapping of how each configured logical copy engine is serviced. Each may be serviced by a physical copy engine, or configured to map onto another logical copy engine. This is pulled directly from the GPU copy configuration registers, and should be very reliable.

Use cat num_ces to get the number of available copy engines (number of logical copy engines on Pascal+).

Texture Processing Cluster (TPC)/Graphics Processing Cluster (GPC) Floorsweeping

See our RTAS'23 paper for why this is important.

Use cat num_gpcs to get the number of on-chip GPCs. Not all these GPCs will necessarily be enabled.

Use cat gpc_mask to get a bit mask of which GPCs are disabled. A set bit indicates a disabled GPC. Bit 0 corresponds to GPC 0, bit 1 to GPC 1, and so on, up to the total number of on-chip GPCs. Bits at or above the number of on-chip GPCs should be ignored (it may appear that non-existent GPCs are "disabled").
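The bit layout described above can be decoded with a short loop. This is only a sketch: disabled_gpcs is a hypothetical helper, and it assumes the mask is printed as a decimal or 0x-prefixed hexadecimal integer, which this document does not specify.

```shell
# Hypothetical helper: print the IDs of disabled GPCs, space-separated.
# $1: mask as read from gpc_mask (assumed decimal or 0x-prefixed hex)
# $2: number of on-chip GPCs as read from num_gpcs
disabled_gpcs() (
    mask=$(( $1 ))
    num_gpcs=$2
    i=0
    out=""
    # Only bits 0..num_gpcs-1 are meaningful; higher bits are ignored.
    while [ "$i" -lt "$num_gpcs" ]; do
        if [ $(( (mask >> i) & 1 )) -eq 1 ]; then
            out="$out $i"
        fi
        i=$(( i + 1 ))
    done
    echo "${out# }"
)
```

From within /proc/gpuX, it could be invoked as disabled_gpcs "$(cat gpc_mask)" "$(cat num_gpcs)". For a mask of 0x5 on a GPU with 4 on-chip GPCs, it prints 0 2.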

Use cat num_tpc_per_gpc to get the number of on-chip TPCs per GPC. Not all these TPCs will necessarily be enabled in every GPC.

Use cat gpcX_tpc_mask to get a bit mask of which TPCs are disabled for GPC X. A set bit indicates a disabled TPC. This API is only available on enabled GPCs.
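As a rough sketch, the per-GPC masks can be combined to count the TPCs that are actually enabled. The helper name count_enabled_tpcs is hypothetical, and two assumptions are made that this document does not confirm: each file holds a single decimal or 0x-prefixed hexadecimal integer, and bits at or above num_tpc_per_gpc can be ignored (by analogy with gpc_mask).

```shell
# Hypothetical helper: count enabled TPCs across all enabled GPCs.
# $1: GPU directory, e.g. /proc/gpu0 (defaults to the current directory)
count_enabled_tpcs() (
    gpu_dir=${1:-.}
    tpcs_per_gpc=$(cat "$gpu_dir/num_tpc_per_gpc")
    total=0
    # gpcX_tpc_mask only exists for enabled GPCs, so disabled GPCs are skipped.
    for f in "$gpu_dir"/gpc*_tpc_mask; do
        [ -e "$f" ] || continue   # the pattern matched nothing
        mask=$(( $(cat "$f") ))
        i=0
        while [ "$i" -lt "$tpcs_per_gpc" ]; do
            # A clear bit indicates an enabled TPC.
            if [ $(( (mask >> i) & 1 )) -eq 0 ]; then
                total=$(( total + 1 ))
            fi
            i=$(( i + 1 ))
        done
    done
    echo "$total"
)
```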

Example usage: To get the number of on-chip SMs on a Volta+ GPU, multiply the output of cat num_gpcs by the output of cat num_tpc_per_gpc, then multiply by 2 (the number of SMs per TPC).
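The calculation above can be sketched as a small shell function. The name onchip_sms is hypothetical, and the sketch assumes both files contain a single plain integer.

```shell
# Hypothetical helper: number of on-chip SMs on a Volta+ GPU.
# $1: GPU directory, e.g. /proc/gpu0 (defaults to the current directory)
onchip_sms() (
    gpu_dir=${1:-.}
    num_gpcs=$(cat "$gpu_dir/num_gpcs")
    num_tpc_per_gpc=$(cat "$gpu_dir/num_tpc_per_gpc")
    # 2 SMs per TPC on Volta+.
    echo $(( num_gpcs * num_tpc_per_gpc * 2 ))
)
```

Note that this counts on-chip SMs, not enabled ones; floorswept GPCs and TPCs are still included in the total.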

Scheduling Examination and Manipulation

See our RTAS'24 paper for some uses of this.

Some of these APIs operate within the scope of a runlist. Below, runlistY stands for one of the runlist0, runlist1, runlist2, etc. folders.

Use cat runlistY/runlist to view the contents and status of all channels in runlist Y. This is nvdebug's most substantial API. The runlist is composed of time-slice groups (TSGs, also called channel groups in nouveau) and channels. Channels are indented in the output to indicate that they belong to the preceding TSG.

Use echo Z > disable_channel or echo Z > runlistY/disable_channel to disable channel with ID Z.

Use echo Z > enable_channel or echo Z > runlistY/enable_channel to enable channel with ID Z.

Use echo Z > preempt_tsg or echo Z > runlistY/preempt_tsg to trigger a preempt of TSG with ID Z.

Use echo Z > runlistY/switch_to_tsg to switch the GPU to run only the specified TSG with ID Z on runlist Y.

Use echo Y > resubmit_runlist to resubmit runlist Y (useful to prompt newer GPUs to pick up on re-enabled channels).
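The manipulation files above can be combined, for example to keep one channel off the GPU while some other command runs. The following is a sketch only: with_channel_disabled is a hypothetical helper, and writing to these files on a real system likely requires root privileges (an assumption, not stated in this document).

```shell
# Hypothetical helper: disable a channel, run a command, then re-enable it.
# $1: GPU directory (e.g. /proc/gpu0)   $2: runlist ID Y   $3: channel ID Z
# Remaining arguments: the command to run while the channel is disabled.
with_channel_disabled() (
    gpu_dir=$1
    runlist=$2
    chan=$3
    shift 3
    echo "$chan" > "$gpu_dir/runlist$runlist/disable_channel" || exit 1
    "$@"
    status=$?
    echo "$chan" > "$gpu_dir/runlist$runlist/enable_channel"
    # Resubmit so that newer GPUs pick up the re-enabled channel.
    echo "$runlist" > "$gpu_dir/resubmit_runlist"
    exit "$status"
)
```

A call such as with_channel_disabled /proc/gpu0 0 5 sleep 1 would disable channel 5 on runlist 0 for about one second. The function returns the exit status of the wrapped command.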

General Codebase Structure

nvdebug.h defines and describes all GPU data structures. This does not depend on any kernel-internal headers.
nvdebug_entry.h contains module startup, device detection, initialization, and module teardown logic.
runlist.c, bus.c, and mmu.c describe Linux-independent (as far as practicable) GPU data structure accessors.
*_procfs.c define /proc/gpuX/ interfaces for reading or writing to GPU data structures.
nvdebug_linux.c contains Linux-specific accessors.

Known Issues and Workarounds

The runlist-printing API does not work when runlist management is delegated to the GPU System Processor (GSP) (most Turing+ datacenter GPUs). To work around this, enable the FALLBACK_TO_PRAMIN define in runlist.c, or reload the nvidia kernel module with the NVreg_EnableGpuFirmware=0 parameter setting. (E.g., on A100: end all GPU-using processes, then sudo rmmod nvidia_uvm nvidia; sudo modprobe nvidia NVreg_EnableGpuFirmware=0.)