nvdebug.git, branch master

Fix a race condition in nvdebug_{readl,readq,writel,writeq}

2025-04-04T14:29:54+00:00

When the GPU is powered off, attempts to read any of its registers
(such as via nvdebug_readl()) result in a fatal interrupt. The
pm_runtime_get() call included in nvdebug sent a request to nvgpu
to turn the GPU back on. **However,** this call did not wait for
the power-on command to take effect. This resulted in a race between
nvdebug and the power management logic: the GPU might not yet have
powered on by the time nvdebug attempted to read its
registers.

Use pm_runtime_get_sync() instead, which explicitly waits for the
power-on command to complete (or fail) before returning. This
eliminates the race condition.
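
The difference between the two calls can be illustrated with a small
userspace model. This is only a sketch: `model_gpu`, `model_pm_get()`,
`model_pm_worker()`, `model_pm_get_sync()`, and `model_readl()` are
hypothetical names invented here, not nvdebug or kernel code, and the
failed read returns -1 rather than raising a fatal interrupt.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a runtime-PM-managed GPU (not nvdebug code). */
struct model_gpu {
    bool powered;        /* true once the resume has actually completed */
    bool resume_pending; /* a resume was requested but not yet performed */
    uint32_t reg;        /* the single register this model exposes */
};

/* Models pm_runtime_get(): queue a resume request, return immediately. */
static void model_pm_get(struct model_gpu *gpu)
{
    if (!gpu->powered)
        gpu->resume_pending = true;
}

/* Models the power management logic performing the queued resume later. */
static void model_pm_worker(struct model_gpu *gpu)
{
    if (gpu->resume_pending) {
        gpu->powered = true;
        gpu->resume_pending = false;
    }
}

/* Models pm_runtime_get_sync(): request the resume and wait for it.
 * "Waiting" is modelled by running the worker before returning. */
static int model_pm_get_sync(struct model_gpu *gpu)
{
    model_pm_get(gpu);
    model_pm_worker(gpu);
    return gpu->powered ? 0 : -1;
}

/* Models a register read: fails if the GPU is still powered off. */
static int model_readl(const struct model_gpu *gpu, uint32_t *out)
{
    if (!gpu->powered)
        return -1;
    *out = gpu->reg;
    return 0;
}
```

In this model, a read issued right after `model_pm_get()` fails until
the worker has run, whereas a read issued after `model_pm_get_sync()`
returns 0 always finds the GPU powered.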

Thank you to Diego Alejandro Parra Guzman, who brought this issue to
my attention.

Fix a critical regression in 71be6bb5 causing multiple API failures

2024-11-04T16:28:21+00:00

Instead of printing the value read in `nvdebug_reg32_read()`, a second
read was being performed, using the value of the first read as the
register offset! This was caused by an incomplete removal of the old
pre-error-handling logic in 71be6bb5.

This broke every API using this function: they returned bizarre or
incorrect values, or crashed the system on Jetson boards.
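
The shape of the bug can be sketched against a fake register file.
Everything here (`toy_regs`, `toy_readl()`, `toy_read_buggy()`,
`toy_read_fixed()`) is a hypothetical stand-in, not the actual
nvdebug code.

```c
#include <stdint.h>

#define TOY_NUM_REGS 8u

/* Hypothetical register file standing in for the GPU's register space. */
static const uint32_t toy_regs[TOY_NUM_REGS] = { 5, 0, 0, 0, 0, 77, 0, 0 };

/* Returns the register at idx, or all-ones if idx is out of range. */
static uint32_t toy_readl(uint32_t idx)
{
    return idx < TOY_NUM_REGS ? toy_regs[idx] : 0xffffffffu;
}

/* Buggy shape: the first read's value is reused as a register offset. */
static uint32_t toy_read_buggy(uint32_t idx)
{
    uint32_t val = toy_readl(idx);
    return toy_readl(val); /* second, unintended read */
}

/* Fixed shape: read once and hand that value back. */
static uint32_t toy_read_fixed(uint32_t idx)
{
    return toy_readl(idx);
}
```

With this table, `toy_read_fixed(0)` yields the register's value (5),
while `toy_read_buggy(0)` yields whatever happens to live at offset 5.
On real hardware the second offset is arbitrary, which is consistent
with the bizarre values and crashes described above.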

Delete no-longer-needed nvgpu headers

2024-09-25T20:09:09+00:00

The dependency on these was removed in commit 8340d234.

Remove dependency on Jetson (nvgpu) driver internals

2024-09-25T19:58:37+00:00

For integrated (Jetson) GPUs:
- Directly retrieve and map GPU register region 0
- Directly check GPU power-on state before a register read/write
- Resume the GPU as needed for a register read/write

Most nvgpu APIs can now be called on TX2+ integrated GPUs without
first having to start some task on the GPU to make it non-suspended.

Tested on Jetson TX1, TX2, Xavier, and Orin.

Add a README

2024-09-25T17:28:56+00:00

See also the RTAS'23 and RTAS'24 papers.

Correct an off-by-one error in addr_to_pramin_mut()

2024-09-25T16:52:42+00:00

Not known to cause any current bugs, but could cause the returned
address to be inaccessible.

Add IDs and names of new Hopper+ engines

2024-09-25T15:05:54+00:00

Draws from NVIDIA's open-gpu-kernel-modules project.

Return an error, rather than a flag value, from `nvdebug_reg32_read()`

2024-09-19T19:40:33+00:00

This is used to back APIs like `num_gpcs`. Better to return an error
to the caller, rather than -1 (which may be confused for an actual
result).
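
A minimal sketch of the two calling conventions, using hypothetical
names (`demo_reg`, `demo_read_flag()`, `demo_read_err()`) rather than
the real nvdebug functions, and assuming -EIO as the error code:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical register: either readable with a value, or not readable. */
struct demo_reg {
    int readable;
    uint32_t value;
};

/* Flag-value style: failure is encoded as -1 in the result itself. */
static uint32_t demo_read_flag(const struct demo_reg *r)
{
    return r->readable ? r->value : (uint32_t)-1;
}

/* Error style: status in the return value, data through a pointer. */
static int demo_read_err(const struct demo_reg *r, uint32_t *out)
{
    if (!r->readable)
        return -EIO;
    *out = r->value;
    return 0;
}
```

With the flag-value style, a failed read hands the caller 0xffffffff,
which looks like any other number once printed. With the error style,
the caller gets a negative errno and `*out` is left untouched.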

Correctly check for read errors in the nvdebug_read* functions

2024-09-19T19:38:53+00:00

Follows how NVIDIA's open-source GPU driver checks for bad reads.
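
As a rough sketch of what such a check can look like: the code below
assumes that a bad read shows up either as all-ones or as a value of
the form 0xbadXXXXX. These patterns, and the names
`SKETCH_PRI_ERROR_MASK`, `SKETCH_PRI_ERROR_CODE`, and
`sketch_is_bad_read()`, are assumptions made for illustration; this
commit message does not list the exact patterns checked.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed error patterns (not confirmed by the commit message):
 * - 0xffffffff: the device did not respond to the read
 * - 0xbadXXXXX: the GPU reported an error for this register */
#define SKETCH_PRI_ERROR_MASK 0xfff00000u
#define SKETCH_PRI_ERROR_CODE 0xbad00000u

/* Returns true if val looks like an error marker rather than data. */
static bool sketch_is_bad_read(uint32_t val)
{
    if (val == 0xffffffffu)
        return true;
    return (val & SKETCH_PRI_ERROR_MASK) == SKETCH_PRI_ERROR_CODE;
}
```

A read wrapper would call a check like this on every value it gets
back and turn a match into an error for its caller.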

Ampere: disable/enable_channel, preempt/switch_to_tsg, and resubmit_runlist

2024-09-19T17:59:56+00:00

**Modifies the user API. For example, to switch to TSG 1 on runlist 0
of GPU X on a pre-Ampere GPU, `echo 1 > /proc/gpuX/switch_to_tsg`
becomes `echo 1 > /proc/gpuX/runlist0/switch_to_tsg`.**

Feature changes:
- switch_to_tsg only makes sense on a per-runlist level. Before, this
  always operated on runlist0; this commit allows operating on any
  runlist by moving the API to the per-runlist paths.
- On Ampere+, channel and TSG IDs are per-runlist, and no longer
  GPU-global. Consequently, the disable/enable_channel and
  preempt_tsg APIs have been moved from GPU-global to per-runlist
  paths on Ampere+.
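
For programs that drive these files directly, the per-runlist layout
can be sketched as follows. The path format is taken from the example
above; the helper names `build_runlist_path()` and
`write_switch_to_tsg()` are hypothetical and not part of nvdebug.

```c
#include <stdio.h>

/* Builds "/proc/gpu<gpu>/runlist<runlist>/<api>" into buf.
 * Returns 0 on success, -1 if the path does not fit in buf. */
static int build_runlist_path(char *buf, size_t len, int gpu, int runlist,
                              const char *api)
{
    int n = snprintf(buf, len, "/proc/gpu%d/runlist%d/%s", gpu, runlist, api);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Writes tsg_id to the per-runlist switch_to_tsg file.
 * Returns 0 on success, -1 on any failure (e.g. nvdebug not loaded). */
static int write_switch_to_tsg(int gpu, int runlist, int tsg_id)
{
    char path[64];
    FILE *f;
    int ok;

    if (build_runlist_path(path, sizeof(path), gpu, runlist, "switch_to_tsg"))
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fprintf(f, "%d\n", tsg_id) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling `write_switch_to_tsg(0, 0, 1)` mirrors
`echo 1 > /proc/gpu0/runlist0/switch_to_tsg`.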

Bug fixes:
- `preempt_runlist()` is now supported on Maxwell and Pascal.
- `resubmit_runlist()` detects too-old GPUs.
- MAX_CHID corrected from 512 to 511 and documented.
- switch_to_tsg now includes a runlist resubmit, which appears to be
  necessary on Turing+ GPUs.

Tested on GK104 (Quadro K5000), GM204 (GTX 970), GP106 (GTX 1060 3GB),
GP104 (GTX 1080 Ti), GP10B (Jetson TX2), GV11B (Jetson Xavier), GV100
(Titan V), TU102 (RTX 2080 Ti), and AD102 (RTX 6000 Ada).