Commit message
|
Code built with CUDA > 6.5 cannot run on CUDA 6.5 or older, so the
check added unnecessary overhead.
Tested that builds under CUDA 6.5 and CUDA 10.2 generate the correct
code, and that global and next-kernel masking work on a GTX 1060 3 GB
with either build while using CUDA 10.2 at runtime.
|
- Add an "all" build target
- Fix build if libcuda.so is not on linker search path
- Do not assume that nvcc is available on $PATH
- Allow specifying CFLAGS and LDFLAGS when running make
- Allow passing non-standard CUDA build locations to make
Suggested usage if CUDA is installed in a non-standard location,
say, /playpen/jbakita/CUDA/cuda-archive/cuda-12.2:
make CUDA=/playpen/jbakita/CUDA/cuda-archive/cuda-12.2
|
The prefix "lib" should not be included, per the documentation.
This was not caught in local tests as the fallback was always used
locally.
Patch courtesy of Guanbin Xu <xugb@mail.ustc.edu.cn>.
|
Make automatically provides CXX and CC, and these manual
definitions were being ignored.
Also fix a missing space in one of the messages from the tests.
|
Use a different callback to intercept the TMD/QMD later in the
launch pipeline.
Major improvements:
- Fix bug with next mask not overriding stream mask on CUDA 11.0+
- Add CUDA 6.5-10.2 support for next- and global-granularity
partitioning masks on x86_64 and aarch64 Jetson
- Remove libdl dependency
- Partially support TMD/QMD Version 4 (Hopper)
Minor improvements:
- Check for sufficient CUDA version before attempting to apply a
  next-granularity partitioning mask
- Only check for sufficient CUDA version on the first call to
`libsmctrl_set_next_mask()` or `libsmctrl_set_global_mask()`,
rather than checking every time (lowers overheads)
- Check that TMD version is sufficient before modifying it
- Improve documentation
Issues:
- Partitioning mask bits have a different meaning in TMD/QMD
Version 4 and require floorsweeping and remapping information to
properly construct. This information will be forthcoming in
future releases of libsmctrl and nvdebug.
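The "check only on the first call" optimization can be sketched as follows. This is an illustrative reconstruction, not libsmctrl's actual internals; the function names and the hardcoded version are assumptions.

```c
#include <stdbool.h>

/* Placeholder for a real driver query (e.g. via cuDriverGetVersion);
 * hardcoded here so the sketch is self-contained. */
static int query_cuda_version(void) {
    return 11040; /* pretend the runtime is CUDA 11.4 */
}

/* Query the CUDA version only on the first call and cache the result,
 * so subsequent mask-setting calls skip the check entirely. */
static bool cuda_version_at_least(int minimum) {
    static int cached_version = -1; /* -1 means "not yet queried" */
    if (cached_version < 0)
        cached_version = query_cuda_version();
    return cached_version >= minimum;
}
```

The `static` local makes the cached value persist across calls, so the (comparatively expensive) driver query runs at most once per process.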
|
Also test and note that stream masking on CUDA 6.5 seems impossible.
|
Also update a comment
|
Commit 3f9bda39 made an error by using the pre-CUDA-12 mask
structure layout on CUDA 12.6 on aarch64 Jetson. Switch to the
CUDA 12+ layout (as used on x86_64).
Tests work either way on the Jetson Orin, so this change is not
strictly required, but seems advisable to support potential large
(PCIe-attached?) GPUs on Jetson/DRIVE platforms.
|
Credit to Nordine Feddal for testing CUDA 12.4 on 550.544.14.
|
Also allow building with an alternate version of g++ for backwards
compatibility.
|
Stream-level masks should always override globally-set masks.
Next-kernel masks should always override both stream-level masks
and globally-set masks.
Tests reveal an issue with the next-kernel mask not overriding the
stream mask on CUDA 11.0+. CUDA appears to apply the per-stream
mask to the QMD/TMD after `launchCallback()` is triggered, making
it impossible to override as currently implemented.
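The intended precedence can be sketched as a simple resolver. This is a hypothetical illustration only (libsmctrl applies masks inside the launch callback, not through a function like this); here a zero mask is treated as "unset".

```c
#include <stdint.h>

/* Resolve which TPC mask applies to a launch: the next-kernel mask,
 * if set, wins over the per-stream mask, which in turn wins over the
 * global mask. A value of 0 means "unset". */
static uint64_t effective_mask(uint64_t global_mask, uint64_t stream_mask,
                               uint64_t next_mask) {
    if (next_mask)
        return next_mask;   /* highest priority, applies to one launch */
    if (stream_mask)
        return stream_mask; /* per-stream override of the global default */
    return global_mask;     /* fallback for all launches */
}
```

The bug described above is equivalent to CUDA 11.0+ rewriting the result of this resolution with the stream mask after libsmctrl's callback has already run.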
|
Also rewrite the global masking test to be much more thorough.
|
Previously did not distinguish between aarch64 and x86_64 stream
offsets, causing incorrect offsets to be used in many circumstances.
This has now been fixed.
A new function, libsmctrl_set_stream_mask_ext(), has been added that
supports masking up to 128 TPCs (rather than just 64).
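A 128-TPC mask can be represented as two 64-bit halves. The sketch below shows the bit bookkeeping involved; the struct layout and names are assumptions for illustration, not the library's actual type.

```c
#include <stdint.h>

/* Hypothetical 128-bit TPC mask split across two 64-bit words:
 * TPCs 0-63 in `lower`, TPCs 64-127 in `upper`. Following libsmctrl's
 * convention, a set bit disables the corresponding TPC. */
struct tpc_mask128 {
    uint64_t lower;
    uint64_t upper;
};

static void mask_disable_tpc(struct tpc_mask128 *m, unsigned tpc) {
    if (tpc < 64)
        m->lower |= 1ULL << tpc;
    else if (tpc < 128)
        m->upper |= 1ULL << (tpc - 64);
    /* TPC indices >= 128 are silently ignored in this sketch */
}
```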
|
nvcc links against a stub version of libcuda.so by default which is
missing a required symbol starting around CUDA 11.8. Use libdl to
resolve the symbol at runtime instead.
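The technique is ordinary dlsym()-based lookup. As a self-contained illustration, the sketch below resolves printf from the process's already-loaded C library; libsmctrl presumably does the same for the CUDA symbol missing from the stub.

```c
#define _GNU_SOURCE /* for RTLD_DEFAULT */
#include <dlfcn.h>
#include <stdio.h>

/* Look up a symbol in the process's already-loaded libraries at
 * runtime instead of binding to it at link time. This is how a symbol
 * absent from the nvcc stub libcuda.so can still be found in the real
 * driver library once the program is running. */
static void *resolve_symbol(const char *name) {
    void *sym = dlsym(RTLD_DEFAULT, name);
    if (!sym)
        fprintf(stderr, "failed to resolve %s: %s\n", name, dlerror());
    return sym;
}
```

Note that on glibc older than 2.34 this requires linking with -ldl.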
|
Also use static linking for tests, to avoid a need to set
LD_LIBRARY_PATH to include the libsmctrl directory.
|
Also improve documentation and abort with an error message if
attempting to set a global SM mask on an unsupported CUDA version.
(Would crash/corrupt state before.)
Also uncomment a line which errantly disabled global masking on
CUDA 10.2 on aarch64.
Tested with CUDA 10.2 on:
- x86_64 (GTX 1060 3GB, driver 440.100, jbakita-old.cs.unc.edu)
- aarch64 (Jetson TX2, driver r32.5, grizzly.cs.unc.edu)
|
Necessary for declarations of included functions. Absence would
result in a compilation error for programs omitting this include.
|
This function was previously unreliable when using CUDA 9.0 on the
Jetson TX2.
Also update some version comments and remove `set_sm_mask()`---a
legacy partitioning function that's no longer used.
|
Initially supports the GPU information functions via:
- pysmctrl.get_gpc_info(dev_id)
- pysmctrl.get_tpc_info(dev_id)
- pysmctrl.get_tpc_info_cuda(cuda_dev_id)
All functions are extensively documented. See pysmctrl/__init__.py
for details.
Device partitioning functions have yet to be mapped into Python, as
these will require more testing.
As part of this:
- libsmctrl_get_*_info() functions have been modified to consistently
return positive error codes.
- libsmctrl_get_tpc_info() now uses nvdebug-style device numbering and
uses libsmctrl_get_gpc_info() under the covers. This should be more
reliable.
- libsmctrl_get_tpc_info_cuda() has been introduced as an improved
  version of the old libsmctrl_get_tpc_info() function. It continues
  to use CUDA-style device numbering, but is now resilient to CUDA
  failures.
- Various minor style improvements in libsmctrl.c
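The positive-error-code convention can be illustrated with a hypothetical getter; the function name and the placeholder TPC count are inventions for this sketch, not part of the library's API.

```c
#include <errno.h>

/* Hypothetical getter following the convention: return 0 on success
 * and a positive errno-style code on failure, never a negative value,
 * so callers (including the Python bindings) only need to test one
 * sign convention. */
static int example_get_tpc_count(int dev_id, unsigned *count_out) {
    if (!count_out)
        return EINVAL; /* positive code, not -EINVAL */
    if (dev_id < 0)
        return ENODEV;
    *count_out = 28; /* placeholder value for the sketch */
    return 0;
}
```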
|
This function previously yielded invalid results for GPUs with more
than 31 TPCs.
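A likely cause of such a bug is 32-bit shift overflow; building the mask with 64-bit arithmetic avoids it. The following is a generic sketch of the pitfall, not the library's actual code:

```c
#include <stdint.h>

/* Build a mask with the low `num_tpcs` bits set. Using 1ULL keeps the
 * shift in 64-bit arithmetic; with a plain int literal (1 << n), any
 * n >= 31 overflows or is undefined, yielding invalid masks for GPUs
 * with more than 31 TPCs. */
static uint64_t tpc_mask_all(unsigned num_tpcs) {
    if (num_tpcs >= 64)
        return ~0ULL; /* shifting by the full width is also undefined */
    return (1ULL << num_tpcs) - 1;
}
```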
|
- Tested working with cuda_scheduling_examiner
- Supports everything described in the accepted RTAS'23 paper
- Can be used as either a shared or statically-linked library
- Documented in libsmctrl.h