From 16a483083435014ca4af1b66473fb400ac0b5c90 Mon Sep 17 00:00:00 2001
From: Joshua Bakita
Date: Thu, 5 Nov 2020 13:13:30 -0500
Subject: Transcribe the documentation to Markdown

---
 README.md       | 147 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 readme_fig1.png | Bin 0 -> 34845 bytes
 readme_fig2.png | Bin 0 -> 25812 bytes
 3 files changed, 147 insertions(+)
 create mode 100644 README.md
 create mode 100644 readme_fig1.png
 create mode 100644 readme_fig2.png

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..6b53ca1
--- /dev/null
+++ b/README.md
@@ -0,0 +1,147 @@

# GM Software Deliverable

There are three parts to this deliverable:

- Source code implementing GPU kernel-scheduling middleware
- Source code implementing a CUDA micro-benchmark program
- A document giving an introduction to the two implementations, along with instructions for installation and an example of running the middleware

## Kernel-Scheduling Middleware

**Introduction**

This library implements a transparent extension to the NVIDIA runtime API (libcudart) that is dynamically linked with CUDA programs. The extension provides a "middleware" scheduling infrastructure that controls CUDA kernel launch requests. It was designed to control kernel scheduling for CUDA programs having the following characteristics, which are commonly used for concurrent GPU sharing. Using the middleware with a CUDA program having these characteristics does not require any changes to the CUDA source code.

- A main process that creates multiple threads (pthreads) sharing a single process address space (i.e., the conditions under which kernels can run concurrently on a GPU).
- Each thread creates one user-defined CUDA stream (FIFO queue) that it manages and uses for invoking GPU operations. There is a one-to-one relationship between threads and streams.
- The program launches kernels using the angle-brackets syntax (`<<<...>>>`) and synchronizes the CPU and GPU with at least one call to `cudaStreamSynchronize()` between successive kernel launches in a given stream.
- The CUDA program is dynamically linked with the CUDA library libcudart.

For a CUDA program with multiple user-defined streams, the NVIDIA scheduling rules for choosing among multiple streams with kernels at the top of their FIFO queues are not documented. This middleware attempts to implement and control some of the scheduling choices that can be made.

The library functions are transparently invoked by "wrapping" calls to certain of the original CUDA API functions and making scheduling decisions before or after invoking the "real" CUDA code. Control over which kernel launch requests are issued to the NVIDIA software and hardware scheduling mechanisms is achieved by blocking and signaling operations on the program threads.

The new library functions were designed following the fundamental principle of separation between mechanism and policy. Most of the library implements the mechanisms that are required for any policy; many scheduling policies are possible given adequate mechanisms for carrying out a given policy. The separation of mechanism and policy makes it easy to try out and evaluate different policies. In the library code, all aspects of policy are implemented in a single function, `find_next_kernel()`, which returns either an identifier for a stream that may launch a kernel or -1 to indicate that no new launch is allowed. Policy functions are intended to be implemented as instances of `find_next_kernel()`, each contained in a .h file named in a `#include` statement.
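The exact signature of `find_next_kernel()` and the bookkeeping it consults are defined in the gpuScheduler source; purely as an illustration of the return contract described above (a stream identifier, or -1 when no launch is allowed), a minimal FIFO-style policy header might look like the sketch below. The `stream_state_t` type and the function parameters are invented for this example and are not the middleware's actual data structures.

```
/*
 * fifo_policy_example.h -- illustrative sketch only.  The policy files that
 * ship with the middleware (e.g. MinFitMinIntfR2.h) are written against the
 * middleware's own bookkeeping structures, documented in the gpuScheduler
 * source comments; the type and signature here are invented.
 */

/* Hypothetical per-stream state the middleware might track. */
typedef struct {
    int  has_pending_kernel;  /* nonzero if a launch request is queued for this stream */
    long arrival_order;       /* monotonically increasing stamp of when that request arrived */
} stream_state_t;

/*
 * Return the identifier of the stream allowed to launch its pending kernel,
 * or -1 if no new launch is allowed (the requesting threads stay blocked).
 * This instance implements a simple oldest-request-first (FIFO) policy.
 */
static int find_next_kernel(const stream_state_t *streams, int n_streams)
{
    int chosen = -1;
    for (int i = 0; i < n_streams; i++) {
        if (!streams[i].has_pending_kernel)
            continue;
        if (chosen == -1 || streams[i].arrival_order < streams[chosen].arrival_order)
            chosen = i;  /* oldest pending request wins */
    }
    return chosen;
}
```

Trying a different policy then amounts to naming a different .h file in that `#include` statement and rebuilding the library.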
A complete description of the software, including specific details, can be found in the extensive comments embedded in the source code.

## CUDA Micro-Benchmark Program

**Introduction**

The CUDA program `gpuBench.cu` is a multi-thread, multi-kernel micro-benchmark program that emulates running multiple kernel instances drawn from a specified set of kernel descriptions. The kernels launched are randomly selected from the set using the specified parameters for each kernel. The kernel parameters are:

- frequency with which each kernel from the set is launched
- execution attribute (compute-intensive or memory-intensive)
- number of blocks in the kernel
- number of threads per block
- kernel execution times (expressed as loop counts in the kernels)
- delays between kernel launches

Kernel execution attributes are implemented by two micro-kernels:

- a compute-intensive kernel that uses 32-bit floating-point operations with data held in registers only
- a memory-intensive kernel that uses both 32-bit integer operations and memory references

The program creates multiple POSIX threads, and each thread manages one user-defined stream with asynchronous kernel launches. `cudaStreamSynchronize()` is used to wait for the operations in a stream to complete. Host pinned memory is used.

A complete description of the software, including specific details, can be found in the extensive comments embedded in the source code.

## Instructions for Installing and Executing the Programs

**Installation**

The software has been tested on an NVIDIA Jetson TX2 with CUDA version V9.0.252 but should work with any CUDA Version 9 installation.

- Place the tar file in a directory you will be using for running the software and extract the contents. This should produce a top-level directory containing:
  - a directory named `gpuScheduler`, which contains the source code for the scheduling middleware
  - a directory named `gpuBenchmark`, which contains the source code for the micro-benchmark program
- Change directory to `gpuScheduler` and enter `make`. This should produce a dynamic-link library named `libcudart_wrapper.so`.
- Change directory to `gpuBenchmark` and enter `make`. This should produce an executable CUDA program named `gpuBench`.

**Example of using the middleware scheduler**

This example illustrates running the middleware to schedule kernels executed by the micro-benchmark program, `gpuBench`, produced by the `make` step described above. It launches compute-intensive kernels as defined in the include file `compute_config.h` in the `gpuBenchmark` directory. The middleware dynamic-link library generated by the `make` step above uses the scheduling policy implemented in the file `MinFitMinIntfR2.h`. The policy is "min thread use, min interference": find the ready-to-launch kernel that will occupy the smallest number of available GPU threads AND does not fail a test for interference effects. The interference test requires that the ratio between the number of threads in the kernel under consideration and the number of threads in any kernel already scheduled does not exceed a threshold (2.0 in this implementation). The test is motivated by empirical measurements that have shown interference effects of 500% or more for large thread ratios between concurrently executing kernels; an illustrative sketch of the ratio test is shown below.
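The bookkeeping that `MinFitMinIntfR2.h` actually consults is internal to the middleware; the sketch below only illustrates the ratio test just described, with the structure and function names invented for the example.

```
#include <stdbool.h>

/* Threshold from the policy description above (2.0 in this implementation). */
#define MAX_THREAD_RATIO 2.0

/* Hypothetical record of a kernel that has already been scheduled on the GPU. */
typedef struct {
    int total_threads;  /* blocks-per-grid * threads-per-block */
} scheduled_kernel_t;

/*
 * Return true if a candidate kernel with `candidate_threads` total threads
 * keeps its thread-count ratio with every already-scheduled kernel at or
 * below MAX_THREAD_RATIO (measured larger-over-smaller, so the test is
 * symmetric); return false if launching it would risk heavy interference.
 */
static bool passes_interference_test(int candidate_threads,
                                     const scheduled_kernel_t *scheduled,
                                     int n_scheduled)
{
    for (int i = 0; i < n_scheduled; i++) {
        double a = (double)candidate_threads;
        double b = (double)scheduled[i].total_threads;
        double ratio = (a > b) ? a / b : b / a;
        if (ratio > MAX_THREAD_RATIO)
            return false;
    }
    return true;
}
```

In the documented policy, a check of this kind is combined with the "min thread use" criterion: among the ready kernels that pass the test, the one that occupies the smallest number of available GPU threads is chosen.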
Figure 1 (below) shows these interference effects when the compute-intensive benchmark is run without the scheduling middleware, using only the default NVIDIA software and hardware. Note that the kernels with smaller numbers of threads (128 to 640) have longer-than-expected execution times, and that those times are highly variable, with worst-case to best-case ratios of 7:1 or more. These data were obtained by using NVIDIA's nvprof tool to generate a GPU trace of all process activity.

To run the benchmark with the scheduling middleware, change directory to `gpuBenchmark` and run the following two commands at the command line (these commands can also be found in the README file in that directory, and help for the program parameters can be obtained by invoking gpuBench with the -h switch):

```
cp ../gpuScheduler/libcudart_wrapper.so .
LD_PRELOAD=./libcudart_wrapper.so ./gpuBench -k 200 -t 4 -rs 0
```

This runs the gpuBench program (dynamically linked with the scheduling middleware) with 4 threads, each executing 200 kernels and using a fixed random-number seed for each thread. The program executes for about 7 to 8 minutes on a TX2 running at the maximum CPU, GPU, and memory clock speeds.

Figure 2 (below) shows the effects of the scheduling policy on compute-intensive kernels. Comparing Figure 2 with Figure 1, we see that kernel execution times are reduced for all kernels, with dramatic reductions in both execution time and execution variability for kernels having between 128 and 640 threads. This scheduling policy does involve a trade-off between reduced execution times for kernels and potential blocking of launches to enforce minimal interference. Whether the reduced execution times offset the increased blocking times will depend on the particular mix of kernels and should be evaluated for each application.

![Figure 1](./readme_fig1.png)

Figure 1. Compute-intensive kernel execution times for different numbers of threads when scheduled by the default NVIDIA scheduling software and hardware.

![Figure 2](./readme_fig2.png)

Figure 2. Compute-intensive kernel execution times for different numbers of threads when scheduled by middleware extensions to the CUDA runtime using a policy of minimizing inter-kernel interference.

diff --git a/readme_fig1.png b/readme_fig1.png
new file mode 100644
index 0000000..bf9bd32
Binary files /dev/null and b/readme_fig1.png differ
diff --git a/readme_fig2.png b/readme_fig2.png
new file mode 100644
index 0000000..2a0e4c1
Binary files /dev/null and b/readme_fig2.png differ
--
cgit v1.2.2