From e2fe4cb56e6252b9cf0b43c6180efbb20a168ce0 Mon Sep 17 00:00:00 2001
From: Joshua Bakita
Date: Wed, 25 Sep 2024 13:28:56 -0400
Subject: Add a README

See also the RTAS'23 and RTAS'24 papers.
---
 README.md | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..da3e5d7
--- /dev/null
+++ b/README.md
@@ -0,0 +1,97 @@
+# nvdebug
+Copyright 2021-2024 Joshua Bakita
+
+Written to support my research on increasing the throughput and predictability of NVIDIA GPUs when running multiple tasks.
+
+Please cite the following papers if using this in any published work:
+
+1. J. Bakita and J. Anderson, “Hardware Compute Partitioning on NVIDIA GPUs”, Proceedings of the 29th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 54–66, May 2023.
+2. J. Bakita and J. Anderson, “Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management”, Proceedings of the 30th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 294–305, May 2024.
+
+## API Overview
+This module creates a virtual folder at `/proc/gpuX` for each GPU X in the system.
+The contained virtual files and folders, when read, return plaintext representations of various aspects of GPU state.
+Some files can also be written to in order to modify GPU behavior.
+
+The major API surfaces are composed of the following files:
+
+- Device information (`device_info`, `copy_topology`, ...)
+- Scheduling examination (`runlist`)
+- Scheduling manipulation (`enable/disable_channel`, `switch_to/preempt_tsg`, `resubmit_runlist`)
+
+As of Sept 2024, this module supports all generations of NVIDIA GPUs.
+Very old GPUs (pre-Kepler) are mostly untested, and only a few APIs are likely to work on them.
+APIs are designed to detect and handle errors, and should not crash your system under any circumstances.
+
+We now detail how to use each of these APIs.
+The following documentation assumes you have already `cd`ed to `/proc/gpuX` for your GPU of interest.
+
+## Device Information APIs
+
+### List of Engines
+Use `cat device_info` to get a pretty-printed breakdown of which engines this GPU contains.
+This information is pulled directly from the GPU topology registers, and should be very reliable.
+
+### Copy Engines
+**See our RTAS'24 paper (cited above) for why this is important.**
+
+Use `cat copy_topology` to get a pretty-printed mapping of how each configured logical copy engine is serviced.
+Each may be serviced by a physical copy engine, or configured to map onto another logical copy engine.
+This is pulled directly from the GPU copy configuration registers, and should be very reliable.
+
+Use `cat num_ces` to get the number of available copy engines (the number of logical copy engines on Pascal+).
+
+### Texture Processing Cluster (TPC)/Graphics Processing Cluster (GPC) Floorsweeping
+**See our RTAS'23 paper for why this is important.**
+
+Use `cat num_gpcs` to get the number of __on-chip__ GPCs.
+Not all of these GPCs will necessarily be enabled.
+
+Use `cat gpc_mask` to get a bit mask of which GPCs are disabled.
+A set bit indicates a disabled GPC.
+Bit 0 corresponds to GPC 0, bit 1 to GPC 1, and so on, up to the total number of on-chip GPCs.
+Bits greater than the number of on-chip GPCs should be ignored (it may appear that non-existent GPCs are "disabled").
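+
+For example, here is a minimal sketch of counting the enabled GPCs from a shell. This assumes `gpc_mask` and `num_gpcs` each print a single C-style integer (e.g. a hexadecimal value) that shell arithmetic expansion can parse; check the actual output format on your GPU.
+```sh
+#!/bin/sh
+# Run from /proc/gpuX. Read the on-chip GPC count and the GPC disable mask.
+gpcs=$(cat num_gpcs)
+mask=$(cat gpc_mask)
+enabled=0
+i=0
+while [ "$i" -lt "$gpcs" ]; do
+    # A clear bit in the low num_gpcs bits indicates an enabled GPC.
+    [ $(( ($mask >> $i) & 1 )) -eq 0 ] && enabled=$((enabled + 1))
+    i=$((i + 1))
+done
+echo "$enabled of $gpcs on-chip GPCs are enabled"
+```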
+
+Use `cat num_tpc_per_gpc` to get the number of __on-chip__ TPCs per GPC.
+Not all of these TPCs will necessarily be enabled in every GPC.
+
+Use `cat gpcX_tpc_mask` to get a bit mask of which TPCs are disabled for GPC X.
+A set bit indicates a disabled TPC.
+This API is only available on enabled GPCs.
+
+Example usage: To get the number of on-chip SMs on Volta+ GPUs, multiply the result of `cat num_gpcs` by the result of `cat num_tpc_per_gpc`, then multiply by 2 (the number of SMs per TPC).
+
+## Scheduling Examination and Manipulation
+**See our RTAS'24 paper for some uses of this.**
+
+Some of these APIs operate within the scope of a runlist.
+`runlistY` represents one of the `runlist0`, `runlist1`, `runlist2`, etc. folders.
+
+Use `cat runlistY/runlist` to view the contents and status of all channels in runlist Y.
+**This is nvdebug's most substantial API.**
+The runlist is composed of time-slice groups (TSGs, also called channel groups in nouveau) and channels.
+Channels are indented in the output to indicate that they belong to the preceding TSG.
+
+Use `echo Z > disable_channel` or `echo Z > runlistY/disable_channel` to disable the channel with ID Z.
+
+Use `echo Z > enable_channel` or `echo Z > runlistY/enable_channel` to enable the channel with ID Z.
+
+Use `echo Z > preempt_tsg` or `echo Z > runlistY/preempt_tsg` to trigger preemption of the TSG with ID Z.
+
+Use `echo Z > runlistY/switch_to_tsg` to switch the GPU to run only the TSG with ID Z on runlist Y.
+
+Use `echo Y > resubmit_runlist` to resubmit runlist Y (useful to prompt newer GPUs to pick up re-enabled channels).
+
+## General Codebase Structure
+- `nvdebug.h` defines and describes all GPU data structures. It does not depend on any kernel-internal headers.
+- `nvdebug_entry.c` contains module startup, device detection, initialization, and module teardown logic.
+- `runlist.c`, `bus.c`, and `mmu.c` implement Linux-independent (as far as practicable) GPU data structure accessors.
+- The `*_procfs.c` files define the `/proc/gpuX/` interfaces for reading from and writing to GPU data structures.
+- `nvdebug_linux.c` contains Linux-specific accessors.
+
+## Known Issues and Workarounds
+
+- The runlist-printing API does not work when runlist management is delegated to the GPU System Processor (GSP), as on most Turing+ datacenter GPUs.
+  To work around this, enable the `FALLBACK_TO_PRAMIN` define in `runlist.c`, or reload the `nvidia` kernel module with the `NVreg_EnableGpuFirmware=0` parameter set, as sketched below.
+  (E.g., on an A100: end all GPU-using processes, then run `sudo rmmod nvidia_uvm nvidia; sudo modprobe nvidia NVreg_EnableGpuFirmware=0`.)
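+
+A minimal sketch of the module-reload workaround follows. The `grep` check is an assumption (recent NVIDIA drivers report a "GSP Firmware" line in `/proc/driver/nvidia/gpus/*/information`) and is not part of nvdebug; adjust for your driver version.
+```sh
+#!/bin/sh
+# Check whether the driver is currently running GSP firmware.
+grep -i "gsp firmware" /proc/driver/nvidia/gpus/*/information
+# If so, end all GPU-using processes, then reload with GSP firmware disabled.
+sudo rmmod nvidia_uvm nvidia
+sudo modprobe nvidia NVreg_EnableGpuFirmware=0
+```
--
cgit v1.2.2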