<!--
Copyright 2020 The University of North Carolina at Chapel Hill
Document transcribed from SoftwareDocumentation.docx on Jan 22nd 2019
-->
# GM Software Deliverable
There are three parts to this deliverable:
- Source code implementing GPU kernel-scheduling middleware
- Source code implementing a CUDA micro-benchmark program
- A document introducing the two implementations, with instructions for installation and an example of running the middleware

## Kernel-Scheduling Middleware
**Introduction**
This library implements a transparent extension to the NVIDIA runtime API (libcudart) that is
dynamically linked with CUDA programs. The extension provides a "middleware" scheduling
infrastructure that controls CUDA kernel launch requests. It was designed to control kernel
scheduling for CUDA programs that have the following characteristics, which are commonly
used for concurrent GPU sharing. Using the middleware with a CUDA program having these
characteristics does not require any changes to the CUDA source code.

- A main process that creates multiple threads (pthreads) sharing a single process address
  space (i.e., the conditions under which kernels can run concurrently on a GPU).
- Each thread creates one user-defined CUDA stream (FIFO queue) that it manages and
  uses for invoking GPU operations. There is a one-to-one relationship between threads
  and streams.
- The program launches kernels using the angle-brackets syntax (`<<<...>>>`) and
  synchronizes the CPU and GPU with at least one call to cudaStreamSynchronize()
  between successive kernel launches in a given stream.
- The CUDA program is dynamically linked with the CUDA runtime library, libcudart.

In the case of a CUDA program with multiple user-defined streams, the NVIDIA scheduling rules
for choosing among multiple streams with kernels at the top of their FIFO queues are not
documented. This middleware attempts to implement and control some of the scheduling
choices that can be made.

The library functions are transparently invoked by "wrapping" calls to certain original
CUDA API functions and performing scheduling actions before or after invoking the "real" CUDA
code. Control over which kernel launch requests can be issued to the NVIDIA software and
hardware scheduling mechanisms is achieved by blocking and signaling operations on the
program threads.
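
A minimal sketch of this wrapping technique, assuming interception via `LD_PRELOAD` and
`dlsym(RTLD_NEXT, ...)`; the specific function wrapped and the hook placement are
illustrative, not the actual middleware source:

```
/* Sketch: the preloaded library defines a function with the same name as a
 * CUDA runtime call, runs scheduling logic, then forwards to the real one. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>

cudaError_t cudaStreamSynchronize(cudaStream_t stream)
{
    /* Look up the real libcudart implementation on first use. */
    static cudaError_t (*real_sync)(cudaStream_t) = NULL;
    if (real_sync == NULL)
        real_sync = (cudaError_t (*)(cudaStream_t))dlsym(RTLD_NEXT, "cudaStreamSynchronize");

    cudaError_t err = real_sync(stream);   /* wait for the stream's work */
    /* A scheduling hook could run here, e.g., signaling a blocked thread
     * that its stream may now launch its next kernel (illustrative only). */
    return err;
}
```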

The new library functions were designed following the fundamental principle of separating
mechanism from policy. Most of the library implements the mechanisms required for any
policy; given adequate mechanisms, many different scheduling policies are possible, and the
separation makes it easy to try out and evaluate them. In the library code, all aspects of
policy are implemented in a single function, `find_next_kernel()`, which returns either an
identifier for a stream that may launch a kernel or -1 to indicate that no new launch is
allowed. Policy functions are intended to be implemented as alternative instances of
`find_next_kernel()`, each contained in a .h file named in a #include statement.
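
For illustration only, a trivial policy instance might look like the sketch below; the state
variables it queries are hypothetical names, not the middleware's actual ones. It launches
from the lowest-numbered stream that has a kernel waiting:

```
/* hypothetical_policy.h -- illustrative find_next_kernel() instance */
static int find_next_kernel(void)
{
    /* num_streams and stream_has_waiting_kernel[] stand in for whatever
     * per-stream state the middleware mechanisms actually maintain.     */
    for (int s = 0; s < num_streams; s++) {
        if (stream_has_waiting_kernel[s])
            return s;      /* identifier of the stream allowed to launch */
    }
    return -1;             /* no new launch is allowed right now         */
}
```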

A complete description of the software, including specific details, can be found in the extensive
comments embedded in the source code.

## CUDA Micro-Benchmark Program
**Introduction**
The CUDA program `gpuBench.cu` is a multi-threaded, multi-kernel micro-benchmark program
that emulates running multiple kernel instances drawn from a specified set of kernel
descriptions. The kernels launched are randomly selected from the set using the specified
parameters for each kernel. The kernel parameters are listed below (a sketch of one possible
way to represent such a description follows the list):
- frequency with which each kernel from the set is launched
- execution attribute (compute-intensive or memory-intensive)
- number of blocks in the kernel
- number of threads in the block
- kernel execution times (expressed as loop counts in kernels)
- delays between kernel launches
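
For illustration, one possible way to hold such a kernel description is sketched below; the
field names are hypothetical and do not come from `gpuBench.cu`:

```
/* Hypothetical kernel-description record matching the parameters above.     */
struct kernel_desc {
    float launch_frequency;   /* relative frequency of selection              */
    int   memory_intensive;   /* 0 = compute-intensive, 1 = memory-intensive  */
    int   num_blocks;         /* blocks in the launched grid                  */
    int   threads_per_block;  /* threads in each block                        */
    long  loop_count;         /* controls the kernel's execution time         */
    int   launch_delay_us;    /* delay before the next launch                 */
};
```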

Kernel execution attributes are implemented by two micro-kernels (sketched after this list):
- a compute-intensive kernel that uses 32-bit floating-point operations with data held in
  registers only.
- a memory-intensive kernel that uses both 32-bit integer operations and memory
  references.
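
The sketches below show what such micro-kernels might look like; the kernel names, signatures,
and loop bodies are illustrative assumptions rather than the code in `gpuBench.cu`:

```
/* Illustrative compute-intensive micro-kernel: FP32 work in registers only. */
__global__ void compute_kernel(float *out, long loop_count)
{
    float a = 1.0f, b = 1.000001f;
    for (long i = 0; i < loop_count; i++)
        a = a * b + 0.5f;                      /* 32-bit floating-point ops  */
    if (threadIdx.x == 0)
        out[blockIdx.x] = a;                   /* keep the result observable */
}

/* Illustrative memory-intensive micro-kernel: 32-bit integer operations
 * plus repeated global-memory references.                                   */
__global__ void memory_kernel(int *buf, long loop_count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int v = 0;
    for (long i = 0; i < loop_count; i++) {
        v += buf[idx];                         /* global-memory read         */
        buf[idx] = v;                          /* global-memory write        */
    }
}
```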

The program creates multiple POSIX threads, and each thread manages one user-defined
stream into which it launches kernels asynchronously. cudaStreamSynchronize() is used to wait
for the operations in the stream to complete, and host pinned memory is used. A sketch of the
per-thread pattern follows.
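
The fragment below is a minimal sketch of that pattern, not the benchmark source; it reuses
the illustrative `compute_kernel` from the earlier sketch, and all parameter values are made up:

```
/* Illustrative per-thread worker: one stream per pthread, asynchronous
 * launches, and a cudaStreamSynchronize() between successive launches.     */
void *worker(void *arg)
{
    const int  num_blocks = 4, threads_per_block = 256;   /* made-up values */
    const long loop_count = 1000000;
    float *out;
    cudaMalloc(&out, num_blocks * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);                 /* one user-defined stream    */
    for (int k = 0; k < 200; k++) {            /* e.g., 200 kernels (-k 200) */
        compute_kernel<<<num_blocks, threads_per_block, 0, stream>>>(out, loop_count);
        cudaStreamSynchronize(stream);         /* CPU/GPU synchronization    */
    }
    cudaStreamDestroy(stream);
    cudaFree(out);
    return NULL;
}
```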

A complete description of the software, including specific details, can be found in the extensive
comments embedded in the source code.

## Instructions for Installing and Executing the Programs
**Installation**
The software has been tested on an NVIDIA Jetson TX2 with CUDA version V9.0.252 but should
work with any CUDA Version 9 installation.
- Place the tar file in a directory you will be using for running the software and extract the
  contents. This should produce a top-level directory containing:
  - a directory named `gpuScheduler`, which contains the source code for the
    scheduling middleware
  - a directory named `gpuBenchmark`, which contains the source code for the
    micro-benchmark program

- Change directory to gpuScheduler and enter `make`. This should produce a dynamic link
  library named `libcudart_wrapper.so`.
- Change directory to gpuBenchmark and enter `make`. This should produce an
  executable CUDA program named `gpuBench`.

**Example of using the middleware scheduler**
This example illustrates running the middleware for scheduling kernels executed by the
micro-benchmark program, gpuBench, produced from the `make` described above. It launches
compute-intensive kernels as defined in the include file `compute_config.h` in the
gpuBenchmark directory. The middleware dynamic-link library generated from the `make` step
above uses the scheduling policy implemented in the file `MinFitMinIntfR2.h`. The policy is
"min thread use, min interference", i.e., find the ready-to-launch kernel that will occupy the
smallest number of available GPU threads AND does not fail a test for interference effects. The
test for interference effects requires that the ratio between the number of threads in the kernel
under consideration and any kernel already scheduled does not exceed a threshold (2.0 in this
implementation). This test is motivated by empirical measurements that have shown
interference effects of 500% or more for large thread ratios between concurrently executing
kernels.

Figure 1 (below) shows these effects when the compute-intensive benchmark is run without the
scheduling middleware, using only the default NVIDIA software and hardware scheduling. Note
that the kernels with smaller numbers of threads (128 to 640) have longer-than-expected
execution times that are highly variable, with worst-case to best-case ratios of 7:1 or more.
These data were obtained by using NVIDIA’s nvprof tool to generate a GPU trace of all process
activity.
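
As a sketch only (not the contents of `MinFitMinIntfR2.h`, and with hypothetical names for the
state the mechanisms maintain), the policy described above might be expressed roughly as:

```
/* Pick the ready stream whose waiting kernel uses the fewest GPU threads,
 * rejecting any candidate whose thread count differs from a currently
 * running kernel's by more than the 2.0 interference threshold.            */
static int find_next_kernel(void)
{
    int best = -1;
    for (int s = 0; s < num_streams; s++) {
        if (!stream_has_waiting_kernel[s])
            continue;
        int threads = waiting_kernel_threads[s];   /* blocks * block size   */
        int passes_interference_test = 1;
        for (int r = 0; r < num_running_kernels; r++) {
            float ratio = (float)threads / (float)running_kernel_threads[r];
            if (ratio > 2.0f || ratio < 0.5f)      /* threshold of 2.0      */
                passes_interference_test = 0;
        }
        if (passes_interference_test &&
            (best == -1 || threads < waiting_kernel_threads[best]))
            best = s;                              /* smallest thread count */
    }
    return best;                                   /* -1 blocks all launches */
}
```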

To run the benchmark with the scheduling middleware, change directory to gpuBenchmark and
run the following two commands at the command line (these commands can also be found in
the README file in that directory, and help for the program parameters can be obtained by
invoking gpuBench with the -h switch).

```
cp ../gpuScheduler/libcudart_wrapper.so .
LD_PRELOAD=./libcudart_wrapper.so ./gpuBench -k 200 -t 4 -rs 0
```

This runs the gpuBench program (dynamically linked with the scheduling middleware) with 4
threads, each executing 200 kernels and using a fixed random-number seed for each thread.
The program executes for about 7 to 8 minutes on a TX2 running at the maximum CPU, GPU,
and memory clock speeds.

Figure 2 (below) shows the effects of the scheduling policy on compute-intensive kernels.
Comparing Figure 2 with Figure 1, we see that kernel execution times are reduced for all kernels,
with dramatic reductions in execution time and variability for kernels having between 128 and
640 threads. This scheduling policy does require a trade-off between reduced execution times
for kernels and potential blocking of launches to enforce minimal interference. Whether reduced
execution times offset increased blocking times will depend on the particular mix of kernels and
should be evaluated for each application.

![Figure 1](./readme_fig1.png)

Figure 1. Compute-intensive kernel execution times for different numbers of threads when
scheduled by the default NVIDIA scheduling software and hardware.

![Figure 2](./readme_fig2.png)

Figure 2. Compute-intensive kernel execution times for different numbers of threads when
scheduled by middleware extensions to the CUDA runtime using a policy of minimizing
inter-kernel interference.