<!--
Copyright 2020 The University of North Carolina at Chapel Hill
Document transcribed from SoftwareDocumentation.docx on Jan 22nd 2019
-->
# GM Software Deliverable
There are three parts to this deliverable:
- Source code implementing GPU kernel-scheduling middleware
- Source code implementing a CUDA micro-benchmark program
- A document introducing the two implementations, with instructions for installation and an example of running the middleware

## Kernel-Scheduling Middleware
**Introduction**
This library implements a transparent extension to the NVIDIA runtime API (libcudart) that is
dynamically linked with CUDA programs. The extension provides a "middleware" scheduling
infrastructure that controls CUDA kernel launch requests. It was designed to control kernel
scheduling for CUDA programs that have the following characteristics, which are commonly
used for concurrent GPU sharing. Using the middleware with a CUDA program having these
characteristics does not require any changes to the CUDA source code.

- A main process that creates multiple threads (pthreads) sharing a single process address
  space (i.e., the conditions under which kernels can run concurrently on a GPU).
- Each thread creates one user-defined CUDA stream (FIFO queue) that it manages and
  uses for invoking GPU operations. There is a one-to-one relationship between threads
  and streams.
- The program launches kernels using the angle-brackets syntax (`<<<...>>>`) and
  synchronizes the CPU and GPU with at least one call to cudaStreamSynchronize()
  between successive kernel launches in a given stream.
- The CUDA program is dynamically linked with the CUDA runtime library, libcudart.

In the case of a CUDA program with multiple user-defined streams, the NVIDIA scheduling rules
for choosing among multiple streams with kernels at the top of their FIFO queues are not
documented. This middleware attempts to implement and control some of the scheduling
choices that can be made.

The library functions are transparently invoked by "wrapping" calls to certain original
CUDA API functions and performing scheduling actions before or after invoking the "real" CUDA
code. Control over which kernel launch requests can be issued to the NVIDIA software and
hardware scheduling mechanisms is achieved by blocking and signaling operations on the
program threads.
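
A minimal sketch of this wrapping technique, assuming interception via `LD_PRELOAD` and
`dlsym(RTLD_NEXT, ...)`; the specific function wrapped and the hook placement are
illustrative, not the actual middleware source:

```
/* Sketch: the preloaded library defines a function with the same name as a
 * CUDA runtime call, runs scheduling logic, then forwards to the real one. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>

cudaError_t cudaStreamSynchronize(cudaStream_t stream)
{
    /* Look up the real libcudart implementation on first use. */
    static cudaError_t (*real_sync)(cudaStream_t) = NULL;
    if (real_sync == NULL)
        real_sync = (cudaError_t (*)(cudaStream_t))dlsym(RTLD_NEXT, "cudaStreamSynchronize");

    cudaError_t err = real_sync(stream);   /* wait for the stream's work */
    /* A scheduling hook could run here, e.g., signaling a blocked thread
     * that its stream may now launch its next kernel (illustrative only). */
    return err;
}
```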

The new library functions were designed following the fundamental principle of separating
mechanism from policy. Most of the library implements the mechanisms required for any
policy; given adequate mechanisms, many different scheduling policies are possible, and the
separation makes it easy to try out and evaluate them. In the library code, all aspects of
policy are implemented in a single function, `find_next_kernel()`, which returns either an
identifier for a stream that may launch a kernel or -1 to indicate that no new launch is
allowed. Policy functions are intended to be implemented as alternative instances of
`find_next_kernel()`, each contained in a .h file named in a #include statement.
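
For illustration only, a trivial policy instance might look like the sketch below; the state
variables it queries are hypothetical names, not the middleware's actual ones. It launches
from the lowest-numbered stream that has a kernel waiting:

```
/* hypothetical_policy.h -- illustrative find_next_kernel() instance */
static int find_next_kernel(void)
{
    /* num_streams and stream_has_waiting_kernel[] stand in for whatever
     * per-stream state the middleware mechanisms actually maintain.     */
    for (int s = 0; s < num_streams; s++) {
        if (stream_has_waiting_kernel[s])
            return s;      /* identifier of the stream allowed to launch */
    }
    return -1;             /* no new launch is allowed right now         */
}
```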

A complete description of the software, including specific details, can be found in the extensive
comments embedded in the source code.

## CUDA Micro-Benchmark Program
**Introduction**
The CUDA program `gpuBench.cu` is a multi-threaded, multi-kernel micro-benchmark program
that emulates running multiple kernel instances drawn from a specified set of kernel
descriptions. The kernels launched are randomly selected from the set using the specified
parameters for each kernel. The kernel parameters are listed below (a sketch of one possible
way to represent such a description follows the list):
- frequency with which each kernel from the set is launched
- execution attribute (compute-intensive or memory-intensive)
- number of blocks in the kernel
- number of threads in the block
- kernel execution times (expressed as loop counts in kernels)
- delays between kernel launches
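
For illustration, one possible way to hold such a kernel description is sketched below; the
field names are hypothetical and do not come from `gpuBench.cu`:

```
/* Hypothetical kernel-description record matching the parameters above.     */
struct kernel_desc {
    float launch_frequency;   /* relative frequency of selection              */
    int   memory_intensive;   /* 0 = compute-intensive, 1 = memory-intensive  */
    int   num_blocks;         /* blocks in the launched grid                  */
    int   threads_per_block;  /* threads in each block                        */
    long  loop_count;         /* controls the kernel's execution time         */
    int   launch_delay_us;    /* delay before the next launch                 */
};
```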

Kernel execution attributes are implemented by two micro-kernels (sketched after this list):
- a compute-intensive kernel that uses 32-bit floating-point operations with data held in
  registers only.
- a memory-intensive kernel that uses both 32-bit integer operations and memory
  references.
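
The sketches below show what such micro-kernels might look like; the kernel names, signatures,
and loop bodies are illustrative assumptions rather than the code in `gpuBench.cu`:

```
/* Illustrative compute-intensive micro-kernel: FP32 work in registers only. */
__global__ void compute_kernel(float *out, long loop_count)
{
    float a = 1.0f, b = 1.000001f;
    for (long i = 0; i < loop_count; i++)
        a = a * b + 0.5f;                      /* 32-bit floating-point ops  */
    if (threadIdx.x == 0)
        out[blockIdx.x] = a;                   /* keep the result observable */
}

/* Illustrative memory-intensive micro-kernel: 32-bit integer operations
 * plus repeated global-memory references.                                   */
__global__ void memory_kernel(int *buf, long loop_count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int v = 0;
    for (long i = 0; i < loop_count; i++) {
        v += buf[idx];                         /* global-memory read         */
        buf[idx] = v;                          /* global-memory write        */
    }
}
```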

The program creates multiple POSIX threads, and each thread manages one user-defined
stream into which it launches kernels asynchronously. cudaStreamSynchronize() is used to wait
for the operations in the stream to complete, and host pinned memory is used. A sketch of the
per-thread pattern follows.
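
The fragment below is a minimal sketch of that pattern, not the benchmark source; it reuses
the illustrative `compute_kernel` from the earlier sketch, and all parameter values are made up:

```
/* Illustrative per-thread worker: one stream per pthread, asynchronous
 * launches, and a cudaStreamSynchronize() between successive launches.     */
void *worker(void *arg)
{
    const int  num_blocks = 4, threads_per_block = 256;   /* made-up values */
    const long loop_count = 1000000;
    float *out;
    cudaMalloc(&out, num_blocks * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);                 /* one user-defined stream    */
    for (int k = 0; k < 200; k++) {            /* e.g., 200 kernels (-k 200) */
        compute_kernel<<<num_blocks, threads_per_block, 0, stream>>>(out, loop_count);
        cudaStreamSynchronize(stream);         /* CPU/GPU synchronization    */
    }
    cudaStreamDestroy(stream);
    cudaFree(out);
    return NULL;
}
```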

A complete description of the software, including specific details, can be found in the extensive
comments embedded in the source code.

## Instructions for Installing and Executing the Programs
**Installation**
The software has been tested on an NVIDIA Jetson TX2 with CUDA version V9.0.252 but should
work with any CUDA Version 9 installation.
- Place the tar file in a directory you will be using for running the software and extract the
  contents. This should produce a top-level directory containing:
  - a directory named `gpuScheduler`, which contains the source code for the
    scheduling middleware
  - a directory named `gpuBenchmark`, which contains the source code for the
    micro-benchmark program

- Change directory to gpuScheduler and enter `make`. This should produce a dynamic link
  library named `libcudart_wrapper.so`.
- Change directory to gpuBenchmark and enter `make`. This should produce an
  executable CUDA program named `gpuBench`.

**Example of using the middleware scheduler**
This example illustrates running the middleware for scheduling kernels executed by the
micro-benchmark program, gpuBench, produced from the `make` described above. It launches
compute-intensive kernels as defined in the include file `compute_config.h` in the
gpuBenchmark directory. The middleware dynamic-link library generated from the `make` step
above uses the scheduling policy implemented in the file `MinFitMinIntfR2.h`. The policy is
"min thread use, min interference", i.e., find the ready-to-launch kernel that will occupy the
smallest number of available GPU threads AND does not fail a test for interference effects. The
test for interference effects requires that the ratio between the number of threads in the kernel
under consideration and any kernel already scheduled does not exceed a threshold (2.0 in this
implementation). This test is motivated by empirical measurements that have shown
interference effects of 500% or more for large thread ratios between concurrently executing
kernels.

Figure 1 (below) shows these effects when the compute-intensive benchmark is run without the
scheduling middleware, using only the default NVIDIA software and hardware scheduling. Note
that the kernels with smaller numbers of threads (128 to 640) have longer-than-expected
execution times that are highly variable, with worst-case to best-case ratios of 7:1 or more.
These data were obtained by using NVIDIA’s nvprof tool to generate a GPU trace of all process
activity.
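
As a sketch only (not the contents of `MinFitMinIntfR2.h`, and with hypothetical names for the
state the mechanisms maintain), the policy described above might be expressed roughly as:

```
/* Pick the ready stream whose waiting kernel uses the fewest GPU threads,
 * rejecting any candidate whose thread count differs from a currently
 * running kernel's by more than the 2.0 interference threshold.            */
static int find_next_kernel(void)
{
    int best = -1;
    for (int s = 0; s < num_streams; s++) {
        if (!stream_has_waiting_kernel[s])
            continue;
        int threads = waiting_kernel_threads[s];   /* blocks * block size   */
        int passes_interference_test = 1;
        for (int r = 0; r < num_running_kernels; r++) {
            float ratio = (float)threads / (float)running_kernel_threads[r];
            if (ratio > 2.0f || ratio < 0.5f)      /* threshold of 2.0      */
                passes_interference_test = 0;
        }
        if (passes_interference_test &&
            (best == -1 || threads < waiting_kernel_threads[best]))
            best = s;                              /* smallest thread count */
    }
    return best;                                   /* -1 blocks all launches */
}
```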

To run the benchmark with the scheduling middleware, change directory to gpuBenchmark and
run the following two commands at the command line (these commands can also be found in
the README file in that directory, and help for the program parameters can be obtained by
invoking gpuBench with the -h switch).

```
cp ../gpuScheduler/libcudart_wrapper.so .
LD_PRELOAD=./libcudart_wrapper.so ./gpuBench -k 200 -t 4 -rs 0
```

This runs the gpuBench program (dynamically linked with the scheduling middleware) with 4
threads, each executing 200 kernels and using a fixed random-number seed for each thread.
The program executes for about 7 to 8 minutes on a TX2 running at the maximum CPU, GPU,
and memory clock speeds.

Figure 2 (below) shows the effects of the scheduling policy on compute-intensive kernels.
Comparing Figure 2 with Figure 1, we see that kernel execution times are reduced for all kernels,
with dramatic reductions in execution time and variability for kernels having between 128 and
640 threads. This scheduling policy does require a trade-off between reduced execution times
for kernels and potential blocking of launches to enforce minimal interference. Whether reduced
execution times offset increased blocking times will depend on the particular mix of kernels and
should be evaluated for each application.

![Figure 1](./readme_fig1.png)

Figure 1. Compute-intensive kernel execution times for different numbers of threads when
scheduled by the default NVIDIA scheduling software and hardware.

![Figure 2](./readme_fig2.png)

Figure 2. Compute-intensive kernel execution times for different numbers of threads when
scheduled by middleware extensions to the CUDA runtime using a policy of minimizing
inter-kernel interference.