<!--
Copyright 2020 The University of North Carolina at Chapel Hill
Document transcribed from SoftwareDocumentation.docx on Jan 22nd 2019
-->
# GM Software Deliverable
There are three parts to this deliverable:
- Source code implementing GPU kernel-scheduling middleware
- Source code implementing a CUDA micro-benchmark program
- This document, which introduces the two implementations and gives instructions for installation and for running an example that uses the middleware
## Kernel-Scheduling Middleware
**Introduction**
This library implements a transparent extension to the NVIDIA runtime API (`libcudart`) that is
dynamically linked with CUDA programs. The extension provides a "middleware" scheduling
infrastructure that controls CUDA kernel launch requests. It was designed to control kernel
scheduling for CUDA programs that have the following characteristics, which are common in
programs that share a GPU concurrently. Using the middleware with such a program requires no
changes to its CUDA source code (a minimal sketch of a conforming program follows the list).
- A main process that creates multiple threads (pthreads) sharing a single process address
space (i.e., the conditions under which kernels can run concurrently on a GPU).
- Each thread creates one user-defined CUDA stream (FIFO queue) that it manages and
uses for invoking GPU operations. There is a one-to-one relationship between threads
and streams.
- The program is written to launch kernels using the angle-brackets syntax (<<<.....>>>)
and synchronizes the CPU and GPU with at least one call to cudaStreamSynchronize()
between successive instances of kernel launches in a given stream.
- The CUDA program is dynamically linked with the CUDA runtime library, `libcudart`.
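For concreteness, here is a minimal sketch of a program with this structure (the kernel and
variable names are hypothetical; the middleware requires no changes to such a program):
```
// Hypothetical sketch of the structure listed above: pthreads sharing one
// address space, one user-defined stream per thread, angle-brackets kernel
// launches, and cudaStreamSynchronize() between successive launches.
#include <cuda_runtime.h>
#include <pthread.h>

__global__ void work(float *data)           // placeholder kernel
{
    data[threadIdx.x] += 1.0f;
}

static void *stream_main(void *arg)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);              // one stream per thread
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    for (int i = 0; i < 10; i++) {
        work<<<1, 256, 0, stream>>>(d);     // angle-brackets launch
        cudaStreamSynchronize(stream);      // sync between launches
    }
    cudaFree(d);
    cudaStreamDestroy(stream);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, stream_main, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return 0;
}
```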
In the case of a CUDA program with multiple user-defined streams, the NVIDIA scheduling rules
for choosing among multiple streams with kernels at the top of their FIFO queues are not
documented. This middleware attempts to implement and control some of the scheduling
choices that can be made.
The library functions are invoked transparently by "wrapping" calls to certain of the original
CUDA API functions and making scheduling decisions before or after invoking the "real" CUDA
code. Control over which kernel launch requests can be issued to the NVIDIA software and
hardware scheduling mechanisms is achieved by blocking and signaling operations on the
program threads.
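As an illustration of the wrapping technique only (the actual mechanism is in the
`gpuScheduler` sources), a preloaded shared library can define a runtime API function itself,
do its scheduling work, and forward to the real `libcudart` implementation located with
`dlsym(RTLD_NEXT, ...)`:
```
// Illustrative interposition sketch, not the middleware source.
// Built as, e.g.: gcc -shared -fPIC -o wrapper.so wrapper.c -ldl
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime_api.h>

cudaError_t cudaStreamSynchronize(cudaStream_t stream)
{
    static cudaError_t (*real_sync)(cudaStream_t);
    if (!real_sync)   /* locate the real libcudart function once */
        real_sync = (cudaError_t (*)(cudaStream_t))
                        dlsym(RTLD_NEXT, "cudaStreamSynchronize");
    /* ... scheduling bookkeeping before the call, e.g. noting that this
       thread's stream is about to drain ... */
    cudaError_t err = real_sync(stream);
    /* ... bookkeeping and signaling of blocked threads after the call ... */
    return err;
}
```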
The new library functions were designed following the fundamental principle of separation
between mechanism and policy. Most of the library implements the mechanisms that are
required for any policy. Many scheduling policies are possible given adequate mechanisms for
carrying out a given policy. The separation of mechanism and policy makes it easy to try out
and evaluate different policies. In the library code, all aspects of policy are implemented in a
single function, `find_next_kernel()`, which returns either the identifier of a stream that may
launch a kernel or -1 to indicate that no new launch is allowed. Policies are intended to be
implemented as alternative instances of `find_next_kernel()`, each contained in a `.h` file
named in a `#include` statement.
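As an illustration of the intended pattern (the interface shown here is hypothetical; the real
one is described in the source comments), a trivial policy instance could look like this:
```
/* Hypothetical policy file, e.g. my_policy.h (illustrative only).
 * The mechanism layer is assumed to expose one pending-launch flag per
 * stream; find_next_kernel() returns a stream identifier or -1. */
#define MAX_STREAMS 16

extern int pending[MAX_STREAMS];   /* assumed: 1 if stream has a kernel ready */

static int find_next_kernel(void)
{
    /* Trivial policy: allow the lowest-numbered ready stream to launch. */
    for (int s = 0; s < MAX_STREAMS; s++)
        if (pending[s])
            return s;
    return -1;                     /* no new launch allowed */
}
```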
A complete description of the software, including specific details, can be found in the extensive
comments embedded in the source code.
## CUDA Micro-Benchmark Program
**Introduction**
The CUDA program `gpuBench.cu` is a multi-threaded, multi-kernel micro-benchmark that
emulates running multiple kernel instances from a specified set of kernel descriptions. The
kernels launched are randomly selected from the set using the specified parameters for each kernel.
The kernel parameters are:
- frequency with which each kernel from the set is launched
- execution attribute (compute-intensive or memory-intensive)
- number of blocks in the kernel
- number of threads in the block
- kernel execution times (expressed as loop counts in kernels)
- delays between kernel launches
Kernel execution attributes are implemented by two micro-kernels (sketched below):
- a compute-intensive kernel that uses 32-bit floating-point operations with data held in
registers only.
- a memory-intensive kernel that uses both 32-bit integer operations and memory
references.
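The two attributes might be realized roughly as follows (a sketch under assumptions, not the
benchmark's exact kernels; the loop count stands in for execution time, as described above):
```
// Illustrative micro-kernels, not gpuBench's actual code.
__global__ void compute_kernel(float *sink, int loops)
{
    float x = threadIdx.x * 0.5f;            // data held in registers only
    for (int i = 0; i < loops; i++)
        x = x * 1.000001f + 0.000001f;       // 32-bit floating-point work
    if (threadIdx.x == 0) *sink = x;         // keep the result live
}

__global__ void memory_kernel(int *buf, int n, int loops)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < loops; i++)
        buf[(idx + i) % n] += 1;             // 32-bit integer ops + memory refs
}
```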
The program creates multiple POSIX threads, and each thread manages one user-defined
stream with asynchronous kernel launches. `cudaStreamSynchronize()` is used to wait for all
operations in a stream to complete. Host pinned memory is used (see the sketch below).
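For example (an assumed sketch, not the `gpuBench` source, reusing `memory_kernel` from the
sketch above), one iteration of a thread's launch loop might combine pinned host memory, an
asynchronous copy, and a stream synchronization like this:
```
// Assumed sketch of one launch-loop iteration. Pinned host memory lets
// cudaMemcpyAsync run asynchronously with respect to the host thread.
void launch_once(cudaStream_t stream, int loops)
{
    const int N = 256;
    int *h_buf, *d_buf;
    cudaMallocHost(&h_buf, N * sizeof(int));      // pinned host memory
    cudaMalloc(&d_buf, N * sizeof(int));
    for (int i = 0; i < N; i++) h_buf[i] = 0;
    cudaMemcpyAsync(d_buf, h_buf, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream);
    memory_kernel<<<1, N, 0, stream>>>(d_buf, N, loops);  // async launch
    cudaStreamSynchronize(stream);                // wait for the stream to drain
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
}
```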
A complete description of the software, including specific details, can be found in the extensive
comments embedded in the source code.
## Instructions for Installing and Executing the Programs
**Installation**
The software has been tested on an NVIDIA Jetson TX2 with CUDA version V9.0.252 but should
work with any CUDA version 9 installation.
- Place the tar file in a directory you will be using for running the software and extract the
contents. This should produce a top-level directory containing:
- a directory named `gpuScheduler` which contains the source code for the
scheduling middleware
- a directory named `gpuBenchmark` which contains the source code for the
micro-benchmark program
- Change directory to `gpuScheduler` and enter `make`. This should produce a dynamic-link
library named `libcudart_wrapper.so`.
- Change directory to `gpuBenchmark` and enter `make`. This should produce an
executable CUDA program named `gpuBench`.
**Example of using the middleware scheduler**
This example illustrates running the middleware to schedule kernels executed by the micro-
benchmark program, `gpuBench`, produced from the `make` described above. It launches
compute-intensive kernels as defined in the include file `compute_config.h` in the
`gpuBenchmark` directory. The middleware dynamic-link library generated from the `make` step
above uses the scheduling policy implemented in the file `MinFitMinIntfR2.h`. The policy is
"min thread use, min interference": find the ready-to-launch kernel that will occupy the
smallest number of available GPU threads AND does not fail a test for interference effects. The
interference test requires that the ratio between the number of threads in the kernel
under consideration and in any kernel already scheduled does not exceed a threshold (2.0 in this
implementation). This test is motivated by empirical measurements that have shown
interference effects of 500% or more for large thread ratios between concurrently
executing kernels.
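Concretely, the interference test has roughly the following form (an illustrative sketch with
hypothetical names; `MinFitMinIntfR2.h` contains the actual implementation):
```
#define RATIO_LIMIT 2.0   /* threshold used in this implementation */

/* Returns 1 if a candidate kernel with cand_threads total threads keeps
 * its thread-count ratio with every already-scheduled kernel within the
 * threshold, 0 if the launch must wait. */
static int passes_interference_test(int cand_threads,
                                    const int *sched_threads, int n_sched)
{
    for (int i = 0; i < n_sched; i++) {
        double big   = cand_threads > sched_threads[i] ? cand_threads
                                                       : sched_threads[i];
        double small = cand_threads < sched_threads[i] ? cand_threads
                                                       : sched_threads[i];
        if (big / small > RATIO_LIMIT)
            return 0;   /* ratio too large: interference risk */
    }
    return 1;
}
```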
Figure 1 (below) shows these effects when the compute-intensive benchmark is run without the
scheduling middleware, using only the default NVIDIA software and hardware scheduling. Note
that the kernels with smaller numbers of threads (128 to 640) have longer-than-expected
execution times that are highly variable, with worst-case to best-case ratios of 7:1 or more.
These data were obtained by using NVIDIA's nvprof tool to generate a GPU trace of all process
activity.
To run the benchmark with the scheduling middleware, change directory to gpuBenchmark and
run the following two commands at the command line (these commands can also be found in
the README file in that directory, and help for the program parameters can be obtained by
invoking `gpuBench` with the `-h` switch).
```
cp ../gpuScheduler/libcudart_wrapper.so .
LD_PRELOAD=./libcudart_wrapper.so ./gpuBench -k 200 -t 4 -rs 0
```
This runs the gpuBench program (dynamically linked with the scheduling middleware) with 4
threads each executing 200 kernels and using a fixed random number seed for each thread.
The program executes for about 7 to 8 minutes on a TX2 running at the maximum CPU, GPU,
and memory clock speeds.
Figure 2 (below) shows the effects of the scheduling policy on compute-intensive kernels.
Comparing Figure 2 with Figure 1, we see that kernel execution times are reduced for all kernels
but with dramatic reductions in execution time and execution variability for kernels having a
number of threads between 128 and 640. This scheduling policy does require a trade-off
between reduced execution times for kernels and potential blocking of launches to enforce
minimal interference. Whether the reduced execution times offset the increased blocking times
depends on the particular mix of kernels and should be evaluated for each application.

Figure 1. Compute-intensive kernel execution times for different numbers of threads when
scheduled by the default NVIDIA scheduling software and hardware.

Figure 2. Compute-intensive kernel execution times for different numbers of threads when
scheduled by middleware extensions to the CUDA runtime using a policy of minimizing inter-
kernel interference.