The Cuda module contains a set of Mutators, Filters and Templates capable of building CUDA kernels from llc-annotated regions in the original source code by reading the yaCF Intermediate Representation.
A short introduction to CUDA
The CUDA programming model is similar in style to a single-program multiple-data (SPMD) software model. The GPU is treated as a coprocessor that executes data-parallel kernel functions.
CUDA provides three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. Threads have a three-level hierarchy: a grid is the set of thread blocks that execute a kernel function, and each block is composed of hundreds of threads. Threads within one block can share data through shared memory and can be synchronized at a barrier. All threads within a block are executed concurrently on a multithreaded architecture.
The programmer specifies the number of threads per block, and the number of blocks per grid. A thread in the CUDA programming language is much lighter weight than a thread in traditional operating systems. A thread in CUDA typically processes one data element at a time. The CUDA programming model has two shared read-write memory spaces, the shared memory space and the global memory space. The shared memory is local to a block and the global memory space is accessible by all blocks. CUDA also provides two read-only memory spaces, the constant space and the texture space, which reside in external DRAM, and are accessed via read-only caches.
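The following minimal sketch (not taken from the yaCF code base; the kernel name, the 256-thread block size and the reduction it performs are chosen only for illustration) shows these abstractions together: the thread hierarchy, per-block shared memory, a barrier, and the explicit launch configuration.

```cuda
// Minimal sketch: per-block partial sum. It touches the thread hierarchy
// (threadIdx/blockIdx/blockDim), per-block shared memory and a barrier.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float buf[256];                  // shared memory: visible to this block only

    int tid = threadIdx.x;                      // index of the thread inside its block
    int gid = blockIdx.x * blockDim.x + tid;    // global index across the whole grid

    buf[tid] = (gid < n) ? in[gid] : 0.0f;      // global memory: visible to every block
    __syncthreads();                            // barrier: wait for the whole block

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction inside the block
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];               // one partial result per block
}

// Host side: the programmer chooses the launch configuration explicitly,
// here 256 threads per block and enough blocks to cover n elements.
// block_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```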
The GPU architecture consists of a scalable number of streaming multiprocessors (SMs), each containing eight streaming processor (SP) cores, two special function units (SFUs), a multithreaded instruction fetch and issue unit, a read-only constant cache, and a 16 KB read/write shared memory.
The SM executes a batch of 32 threads together, called a warp. Executing a warp instruction applies the instruction to 32 threads, similar to executing a SIMD instruction such as an SSE instruction on x86. However, unlike SIMD instructions, the concept of a warp is not exposed to the programmer; instead, programmers write a program for one thread and then specify the number of parallel threads in a block and the number of blocks in a kernel grid. The Tesla architecture forms a warp from a batch of 32 threads, and in the rest of this document a warp likewise means a batch of 32 threads.
All the threads in one block are executed on one SM together. One SM can also have multiple concurrently running blocks. The number of blocks that run on one SM is determined by the resource requirements of each block, such as the number of registers and the shared memory usage. The blocks that are running on one SM at a given time are called active blocks in this document. Since one block typically has several warps (the number of warps is the number of threads in a block divided by 32), the total number of active warps per SM is equal to the number of warps per block times the number of active blocks. For example, a block of 256 threads contains 8 warps; with 3 active blocks, 24 warps are active on the SM.
The shared memory is implemented within each SM as an SRAM, and the global memory is part of the off-chip DRAM. The shared memory has very low access latency (almost the same as that of a register) and high bandwidth. However, since a warp of 32 threads accesses the shared memory together, a bank conflict within a warp makes the shared memory access take multiple cycles.
The SM executes one warp at a time and schedules warps in a time-sharing fashion. The processor has enough functional units and register read/write ports to execute 32 threads (i.e. one warp) together. Since an SM has only 8 functional units, executing 32 threads takes 4 SM processor cycles for computation instructions.
When the SM executes a memory instruction, it generates memory requests and switches to another warp until all the memory values for the warp are ready. Ideally, all the memory accesses within a warp can be combined into one memory transaction, but whether this happens depends on the memory access pattern within the warp. If the memory addresses are sequential, all of the memory requests within a warp can be coalesced into a single memory transaction; otherwise, each memory address generates a separate transaction. Both cases are sketched below. The CUDA manual provides detailed algorithms to identify the types of coalesced and uncoalesced memory accesses. If the memory requests in a warp are uncoalesced, the warp cannot be executed until all the memory transactions from that warp are serviced, which takes significantly longer than waiting for a single coalesced request.
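A minimal sketch of the two patterns, assuming a simple copy kernel (the kernel names and the stride parameter are illustrative, not part of yaCF):

```cuda
// Sketch of the two access patterns; kernel names and the stride are illustrative.
__global__ void coalesced_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];          // consecutive threads read consecutive addresses:
                                 // the whole warp is served by one (or few) transactions
}

__global__ void strided_copy(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                   // assumes 'in' holds at least n * stride elements
        out[i] = in[i * stride]; // with a large stride, the 32 addresses of a warp fall
                                 // in different segments: one transaction per thread
}
```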
Algorithm to translate llc code to CUDA:
Create the source storage to save the destination files.
Encapsulate the parallel region into a separate function using regionEncapsulate.
The process of creating a kernel from a loop is not a trivial task. Some details must be taken into account:
** Loop-specific constructs (e.g. break, continue) cannot be used inside the kernel and need to be ported to it in some way (see the sketch after this list).
** Control must not exit the kernel region; thus, the block must be a SESE block (see XXX for information).
** Pointer accesses cannot be used freely (although pointer arithmetic may be rewritten as array indexing).
** No dynamic memory can be allocated inside the loop.
** All function calls inside the loop must be executable on the device where the kernel is running.
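As an illustration of how two of these restrictions can be handled (assuming the one-iteration-per-thread mapping described next; the variable and kernel names are made up for the example), a continue becomes an early exit of the thread and pointer arithmetic is rewritten as array indexing:

```cuda
/* Original loop (illustrative):
 *     for (i = 0; i < n; i++) {
 *         if (a[i] < 0.0f) continue;       // loop-specific construct
 *         *(b + i) = *(a + i) * 2.0f;      // pointer arithmetic
 *     }
 */
__global__ void loop_body(const float *a, float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one iteration per thread
    if (i >= n)
        return;                 // surplus threads fall outside the iteration space
    if (a[i] < 0.0f)
        return;                 // 'continue' becomes an early exit of the thread
    b[i] = a[i] * 2.0f;         // pointer arithmetic rewritten as array indexing
}
```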
If we assume that one iteration is one thread:
** No assumption can be made about the order of thread execution (all parallel, all sequential, thread 0 first, thread 0 last).
** Thread numbering always starts at 0 and runs up to nblocks * nthreads; thus, if a loop starts at another value, or its stride is not one, a mapping function must be defined (a sketch is given after this list).
** There is a maximum number of threads available and, therefore, a maximum number of iterations exists for this mapping scheme.
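A minimal sketch of such a mapping function, assuming a loop of the form for (i = start; i < end; i += stride); the parameter and kernel names are hypothetical, not part of the yaCF API:

```cuda
// Sketch of the mapping from thread id to loop index for a loop of the form
//     for (i = start; i < end; i += stride) { ... }
__global__ void mapped_kernel(float *a, int start, int end, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // in [0, nblocks * nthreads)
    int i = start + tid * stride;                     // mapping function

    if (i < end)                // threads beyond the iteration count do nothing
        a[i] = 0.0f;            // placeholder for the original loop body
}

// Host side: enough threads must be launched to cover every iteration, which is
// where the maximum-thread-count limit mentioned above becomes a constraint.
// int iterations = (end - start + stride - 1) / stride;
// mapped_kernel<<<(iterations + 255) / 256, 256>>>(d_a, start, end, stride);
```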
If we assume that one thread performs N iterations:
We can simplify this case by applying an "unroll (N)" transformation to the loop.
However, it is important to note the performance penalty introduced if the accesses to global memory are not coalesced, as shown below.
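The following sketch contrasts two ways of assigning N iterations to each thread (the kernel names and the blocked/interleaved schemes are illustrative, not the exact code yaCF emits); only the interleaved assignment keeps the global memory accesses of a warp coalesced:

```cuda
// Blocked assignment: thread t handles iterations [t*N, t*N + N). For a given k,
// neighbouring threads touch addresses N elements apart, so the accesses of a
// warp are not contiguous and cannot be coalesced.
__global__ void blocked(float *a, int n, int N)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < N; k++) {
        int i = t * N + k;
        if (i < n)
            a[i] *= 2.0f;
    }
}

// Interleaved assignment: for a given k, neighbouring threads touch neighbouring
// addresses, so each warp access coalesces into a single transaction.
__global__ void interleaved(float *a, int n, int N)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int total = gridDim.x * blockDim.x;     // total number of threads launched
    for (int k = 0; k < N; k++) {
        int i = t + k * total;
        if (i < n)
            a[i] *= 2.0f;
    }
}
```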
Currently, the templates are held inside the Mutators' code.
A separate Mutator has been written for each OpenMP construct; their common parent is Backends.Cuda.Mutators.Common.
The following constructs have been implemented:
OpenMP Parallel
OpenMP Parallel For
OpenMP For
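As a hypothetical example of the kind of input these mutators handle (plain OpenMP syntax; any llc-specific clauses are omitted), a parallel loop such as the following would be turned into a CUDA kernel using the mapping schemes described above:

```c
/* Hypothetical input for the "OpenMP Parallel For" mutator (standard OpenMP
 * syntax; llc-specific clauses are omitted). Each loop iteration would become
 * one CUDA thread following the one-iteration-per-thread mapping above. */
void vector_add(const float *a, const float *b, float *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```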
This module has one member to store CUDA source files:
A file type to store CUDA source code files.
Get the preprocessor macros from a given string and store them in the file.
Currently supported macros are:
#include <[A-Za-z0-9.]+>
#include "[A-Za-z0-9.]+"
Parameters: line_list – The original text
Return a pretty representation of the file
Parameters: text – The original text to be pretty printed