-rw-r--r--  README.md        147
-rw-r--r--  readme_fig1.png  bin 0 -> 34845 bytes
-rw-r--r--  readme_fig2.png  bin 0 -> 25812 bytes
3 files changed, 147 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..6b53ca1
--- /dev/null
+++ b/README.md
@@ -0,0 +1,147 @@
<!--
Copyright 2020 The University of North Carolina at Chapel Hill
Document transcribed from SoftwareDocumentation.docx on Jan 22nd 2019
-->
# GM Software Deliverable
There are three parts to this deliverable:
- Source code implementing GPU kernel-scheduling middleware
- Source code implementing a CUDA micro-benchmark program
- A document giving an introduction to the two implementations, along with instructions for
installation and an example of using the middleware

## Kernel-Scheduling Middleware
**Introduction**

This library implements a transparent extension to the NVIDIA runtime API (libcudart) that is
dynamically linked with CUDA programs. The extension provides a "middleware" scheduling
infrastructure that controls CUDA kernel launch requests. It was designed to control kernel
scheduling for CUDA programs that have the following characteristics, which are common in
programs written for concurrent GPU sharing. Using the middleware with a CUDA program
having these characteristics does not require any changes to the CUDA source code.

- A main process that creates multiple threads (pthreads) sharing a single process address
space (i.e., the conditions under which kernels can run concurrently on a GPU).
- Each thread creates one user-defined CUDA stream (FIFO queue) that it manages and
uses for invoking GPU operations. There is a one-to-one relationship between threads
and streams.
- The program launches kernels using the angle-bracket syntax (`<<<...>>>`) and
synchronizes the CPU and GPU with at least one call to `cudaStreamSynchronize()`
between successive kernel launches in a given stream.
- The CUDA program is dynamically linked with the CUDA runtime library, libcudart.
A minimal sketch of a program with this structure is shown below.

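The following is a minimal sketch of such a program, given only as a point of reference for the
characteristics above; it is not part of the deliverable, and the kernel and function names
(`busyKernel`, `workerThread`) are invented for the illustration.

```
// Sketch only: one pthread per user-defined stream, angle-bracket launches,
// and cudaStreamSynchronize() between successive launches in the stream.
#include <pthread.h>
#include <cuda_runtime.h>

__global__ void busyKernel(int loops)             // stand-in kernel
{
    for (volatile int i = 0; i < loops; i++) { }  // spin for a while
}

static void *workerThread(void *arg)
{
    (void)arg;
    cudaStream_t stream;
    cudaStreamCreate(&stream);                    // one stream per thread

    for (int i = 0; i < 10; i++) {
        busyKernel<<<4, 128, 0, stream>>>(1 << 20);  // asynchronous launch into this stream
        cudaStreamSynchronize(stream);               // wait before the next launch
    }
    cudaStreamDestroy(stream);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];
    for (int t = 0; t < 4; t++)
        pthread_create(&threads[t], NULL, workerThread, NULL);
    for (int t = 0; t < 4; t++)
        pthread_join(threads[t], NULL);
    return 0;
}
```
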
In the case of a CUDA program with multiple user-defined streams, the NVIDIA scheduling rules
for choosing among multiple streams with kernels at the top of their FIFO queues are not
documented. This middleware attempts to implement and control some of the scheduling
choices that can be made.

The library functions are transparently invoked by "wrapping" calls to certain of the original
CUDA API functions and performing scheduling choices before or after invoking the "real" CUDA
code. Control over which kernel launch requests can be issued to the NVIDIA software and
hardware scheduling mechanisms is achieved by blocking and signaling operations on the
program threads.

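The middleware source is the definitive reference for how this wrapping is done; the fragment
below is only a generic sketch of the interposition technique, assuming the angle-bracket
launches are lowered to a runtime entry point such as `cudaLaunchKernel` (the toolkit version
determines which entry points are actually generated) and using a hypothetical scheduling hook
named `wait_for_launch_permission()`.

```
/* Generic sketch of the wrapping technique, not the deliverable's actual code.
 * A shared library loaded with LD_PRELOAD can define cudaLaunchKernel itself,
 * apply a scheduling decision, and then forward to the real libcudart symbol. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>

/* Hypothetical hook: block the calling thread until its stream may launch. */
extern void wait_for_launch_permission(cudaStream_t stream);

cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim, dim3 blockDim,
                             void **args, size_t sharedMem, cudaStream_t stream)
{
    typedef cudaError_t (*real_fn)(const void *, dim3, dim3, void **, size_t, cudaStream_t);
    static real_fn real_launch = NULL;
    if (real_launch == NULL)
        real_launch = (real_fn)dlsym(RTLD_NEXT, "cudaLaunchKernel");

    wait_for_launch_permission(stream);   /* scheduling choice is made here */
    return real_launch(func, gridDim, blockDim, args, sharedMem, stream);
}
```
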
The new library functions were designed following the fundamental principle of separation
between mechanism and policy. Most of the library implements the mechanisms that are
required for any policy; many scheduling policies are possible given adequate mechanisms for
carrying out a given policy. The separation of mechanism and policy makes it easy to try out
and evaluate different policies. In the library code, all aspects of policy are implemented in a
single function, `find_next_kernel()`, which returns either the identifier of a stream that may
launch a kernel or -1 to indicate that no new launch is allowed. Policy functions are intended to
be implemented as instances of the `find_next_kernel()` function, each contained in a .h file
named in a #include statement.

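As an illustration of this pluggable-policy structure only, a trivial policy file might look like the
sketch below. The per-stream bookkeeping shown here (`NUM_STREAMS`, `stream_state`, and its
fields) is invented for this README; the middleware defines its own state and names.

```
/* simple_policy.h -- hypothetical example of a pluggable policy file. */
#define NUM_STREAMS 4                 /* illustrative stream limit */

struct stream_state {
    int has_pending_kernel;           /* a launch request is waiting in this stream */
    int kernel_in_flight;             /* a kernel from this stream is currently scheduled */
    int pending_threads;              /* total threads the waiting kernel would occupy */
    int running_threads;              /* total threads of the scheduled kernel */
};
static struct stream_state streams[NUM_STREAMS];

/* Return the identifier of a stream allowed to launch a kernel, or -1 for "no launch". */
static int find_next_kernel(void)
{
    for (int s = 0; s < NUM_STREAMS; s++)
        if (streams[s].has_pending_kernel && !streams[s].kernel_in_flight)
            return s;                 /* lowest-numbered ready stream wins */
    return -1;
}
```

The `MinFitMinIntfR2.h` policy used in the example later in this document supplies a different
implementation of this one function.
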
A complete description of the software, including specific details, can be found in the extensive
comments embedded in the source code.

## CUDA Micro-Benchmark Program
**Introduction**

The CUDA program `gpuBench.cu` is a multi-threaded, multi-kernel micro-benchmark program
that emulates running multiple kernel instances from a specified set of kernel descriptions. The
kernels launched are randomly selected from the set using the specified parameters for each
kernel. The kernel parameters are:
- the frequency with which each kernel from the set is launched
- execution attribute (compute-intensive or memory-intensive)
- number of blocks in the kernel
- number of threads per block
- kernel execution time (expressed as a loop count in the kernel)
- delay between kernel launches

Kernel execution attributes are implemented by two micro-kernels, sketched below:
- a compute-intensive kernel that uses 32-bit floating-point operations with data held in
registers only.
- a memory-intensive kernel that uses both 32-bit integer operations and memory
references.

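The real micro-kernels are defined in `gpuBench.cu`; the pair below is only an illustrative sketch
of the two attribute classes, with invented names and arguments.

```
/* Illustrative sketches only -- see gpuBench.cu for the actual micro-kernels. */

/* compute-intensive: 32-bit floating-point work kept in registers */
__global__ void computeKernel(int loops, float *sink)
{
    float a = 1.0f + threadIdx.x;
    for (int i = 0; i < loops; i++)
        a = a * 0.999f + 1.0f;           /* FP32 multiply-add chain on register data */
    if (a < 0.0f) *sink = a;             /* never taken; keeps the loop from being optimized away */
}

/* memory-intensive: 32-bit integer work mixed with global-memory references */
__global__ void memoryKernel(int *buf, int n, int loops)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < loops; i++)
        buf[(idx + i) % n] += idx + i;   /* integer arithmetic plus read-modify-write traffic */
}
```
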
The program creates multiple POSIX threads, and each thread manages one user-defined
stream with asynchronous kernel launches. `cudaStreamSynchronize()` is used to wait for all
operations in the stream to complete. Host pinned memory is used.

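A brief sketch of the host-side pattern this describes (pinned allocation plus asynchronous
operations on a per-thread stream) is shown below; buffer names and sizes are illustrative only.

```
/* Illustrative host-side pattern: pinned memory plus asynchronous work per stream. */
#include <cuda_runtime.h>

void run_one_stream(int n)
{
    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, n * sizeof(float));  /* pinned host memory enables async copies */
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    /* ... asynchronous kernel launches on `stream` would go here ... */
    cudaMemcpyAsync(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);                       /* wait for all work in this stream */

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}
```
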
A complete description of the software, including specific details, can be found in the extensive
comments embedded in the source code.

## Instructions for Installing and Executing the Programs
**Installation**

The software has been tested on an NVIDIA Jetson TX2 with CUDA version V9.0.252 but should
work with any CUDA Version 9 installation.
- Place the tar file in a directory you will be using for running the software and extract the
contents. This should produce a top-level directory containing:
  - a directory named `gpuScheduler`, which contains the source code for the
  scheduling middleware
  - a directory named `gpuBenchmark`, which contains the source code for the
  micro-benchmark program

- Change directory to `gpuScheduler` and enter `make`. This should produce a dynamic-link
library named `libcudart_wrapper.so`.
- Change directory to `gpuBenchmark` and enter `make`. This should produce an
executable CUDA program named `gpuBench`.

**Example of using the middleware scheduler**

This example illustrates running the middleware for scheduling kernels executed by the
micro-benchmark program, gpuBench, produced from the `make` described above. It launches
compute-intensive kernels as defined in the include file `compute_config.h` in the
gpuBenchmark directory. The middleware dynamic-link library generated from the `make` step
above uses the scheduling policy implemented in the file `MinFitMinIntfR2.h`. The policy is
"min thread use, min interference", i.e., find the ready-to-launch kernel that will occupy the
smallest number of available GPU threads AND does not fail a test for interference effects. The
test for interference effects requires that the ratio between the number of threads in the kernel
under consideration and any kernel already scheduled does not exceed a threshold (in this
implementation, 2.0). This test is motivated by empirical measurements that have shown
interference effects as high as 500% or more for large thread ratios between concurrently
executing kernels. Figure 1 (below) shows these effects when the compute-intensive
benchmark is run without the scheduling middleware, using only the default NVIDIA software
and hardware. Note that the kernels with smaller numbers of threads (128 to 640) have
longer-than-expected execution times, and the times are highly variable, with worst-case to
best-case ratios of 7:1 or more. These data were obtained by using NVIDIA’s nvprof tool to
generate a GPU trace of all process activity.

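`MinFitMinIntfR2.h` is the authoritative implementation of this policy; the fragment below is only
a sketch of the selection rule just described, reusing the invented `stream_state` bookkeeping
from the earlier policy example: choose the ready kernel with the smallest thread footprint,
rejecting any candidate whose thread count differs from an already-scheduled kernel by more
than a 2.0 ratio.

```
/* Sketch of a "min thread use, min interference" rule -- not the deliverable's code. */
#define MAX_RATIO 2.0f                               /* interference threshold used in this example */

static int find_next_kernel(void)
{
    int best = -1;
    for (int s = 0; s < NUM_STREAMS; s++) {
        if (!streams[s].has_pending_kernel)
            continue;
        int cand = streams[s].pending_threads;       /* threads the waiting kernel would occupy */

        /* interference test: thread ratio against every kernel already scheduled */
        int ok = 1;
        for (int r = 0; r < NUM_STREAMS; r++) {
            if (!streams[r].kernel_in_flight)
                continue;
            float hi = (float)(cand > streams[r].running_threads ? cand : streams[r].running_threads);
            float lo = (float)(cand < streams[r].running_threads ? cand : streams[r].running_threads);
            if (lo > 0.0f && hi / lo > MAX_RATIO) { ok = 0; break; }
        }

        /* "min fit": keep the candidate with the smallest thread footprint */
        if (ok && (best == -1 || cand < streams[best].pending_threads))
            best = s;
    }
    return best;                                     /* stream identifier, or -1: no launch allowed */
}
```
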
To run the benchmark with the scheduling middleware, change directory to gpuBenchmark and
run the following two commands at the command line (these commands can also be found in
the README file in that directory, and help for the program parameters can be obtained by
invoking gpuBench with the -h switch).

```
cp ../gpuScheduler/libcudart_wrapper.so .
LD_PRELOAD=./libcudart_wrapper.so ./gpuBench -k 200 -t 4 -rs 0
```

This runs the gpuBench program (dynamically linked with the scheduling middleware) with 4
threads each executing 200 kernels and using a fixed random number seed for each thread.
The program executes for about 7 to 8 minutes on a TX2 running at the maximum CPU, GPU,
and memory clock speeds.

Figure 2 (below) shows the effects of the scheduling policy on compute-intensive kernels.
Comparing Figure 2 with Figure 1, we see that kernel execution times are reduced for all kernels,
with dramatic reductions in execution time and execution variability for kernels having between
128 and 640 threads. This scheduling policy does require a trade-off between reduced execution
times for kernels and potential blocking of launches to enforce minimal interference. Whether
the reduced execution times offset the increased blocking times will depend on the particular
mix of kernels and should be evaluated for each application.

![Figure 1](./readme_fig1.png)

Figure 1. Compute-intensive kernel execution times for different numbers of threads when
scheduled by the default NVIDIA scheduling software and hardware.

![Figure 2](./readme_fig2.png)

Figure 2. Compute-intensive kernel execution times for different numbers of threads when
scheduled by middleware extensions to the CUDA runtime using a policy of minimizing
inter-kernel interference.
diff --git a/readme_fig1.png b/readme_fig1.png
new file mode 100644
index 0000000..bf9bd32
--- /dev/null
+++ b/readme_fig1.png
Binary files differ
diff --git a/readme_fig2.png b/readme_fig2.png
new file mode 100644
index 0000000..2a0e4c1
--- /dev/null
+++ b/readme_fig2.png
Binary files differ