CUDA Backend

The Cuda module contains a set of Mutators, Filters and Templates capable of building CUDA kernels from llc-annotated regions in the original source code by reading the yaCF Intermediate Representation.

CUDA devices

This section gives a short introduction to CUDA devices and the CUDA programming model used by the backend.

Background on the CUDA Programming Model

The CUDA programming model is similar in style to a single-program multiple-data (SPMD) software model. The GPU is treated as a coprocessor that executes data-parallel kernel functions.

CUDA provides three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. Threads have a three-level hierarchy. A grid is a set of thread blocks that execute a kernel function. Each grid consists of blocks of threads, and each block is composed of hundreds of threads. Threads within one block can share data using shared memory and can be synchronized at a barrier. All threads within a block are executed concurrently on a multithreaded architecture.
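
A minimal kernel sketch of block-level sharing and barrier synchronization (the kernel and buffer names are made up, and it assumes a launch with 256 threads per block):

    __global__ void reverse_in_block(float *data)
    {
        /* One shared buffer per block, visible only to the threads of that block */
        __shared__ float buffer[256];

        int tid = threadIdx.x;
        buffer[tid] = data[blockIdx.x * blockDim.x + tid];

        /* Barrier: no thread reads the buffer until every thread of the block has written it */
        __syncthreads();

        /* Each thread reads an element written by another thread of the same block */
        data[blockIdx.x * blockDim.x + tid] = buffer[blockDim.x - 1 - tid];
    }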

The programmer specifies the number of threads per block and the number of blocks per grid. A thread in CUDA is much lighter weight than a thread in traditional operating systems, and typically processes one data element at a time. The CUDA programming model has two shared read-write memory spaces: the shared memory space, which is local to a block, and the global memory space, which is accessible by all blocks. CUDA also provides two read-only memory spaces, the constant space and the texture space, which reside in external DRAM and are accessed via read-only caches.
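
A host-side sketch of how these launch parameters are specified (the kernel, names and sizes are hypothetical, chosen only for illustration):

    #include <cuda_runtime.h>

    __global__ void add_one(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] += 1.0f;          /* one data element per thread */
    }

    void launch_add_one(float *d_data, int n)
    {
        /* The programmer chooses the threads per block and derives the blocks per grid */
        int threadsPerBlock = 256;
        int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;

        /* d_data resides in global memory, accessible by every block of the grid */
        add_one<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);
        cudaDeviceSynchronize();
    }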

Background on the GPU Architecture

The GPU architecture consists of a scalable number of streaming multiprocessors (SMs), each containing eight streaming processor (SP) cores, two special function units (SFUs), a multithreaded instruction fetch and issue unit, a read-only constant cache, and a 16KB read/write shared memory.
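
These figures vary between devices and can be queried at run time through the CUDA runtime API; a minimal sketch using cudaGetDeviceProperties:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */

        printf("SMs:                     %d\n", prop.multiProcessorCount);
        printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("Registers per block:     %d\n", prop.regsPerBlock);
        printf("Warp size:               %d\n", prop.warpSize);
        return 0;
    }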

The SM executes a batch of 32 threads together, called a warp. Executing a warp instruction applies the instruction to 32 threads, similar to executing a SIMD instruction such as an SSE instruction in x86. However, unlike SIMD instructions, the concept of a warp is not exposed to the programmer: programmers write a program for one thread and then specify the number of parallel threads in a block and the number of blocks in a kernel grid. The Tesla architecture forms a warp from a batch of 32 threads, and in the rest of this document a warp also means a batch of 32 threads.

All the threads in one block are executed on one SM together. One SM can also have multiple concurrently running blocks. The number of blocks running on one SM is determined by the resource requirements of each block, such as the number of registers and the shared memory usage. The blocks that are running on one SM at a given time are called active blocks in this document. Since one block typically has several warps (the number of warps is the number of threads in a block divided by 32), the total number of active warps per SM is equal to the number of warps per block times the number of active blocks.
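
As a worked example of this arithmetic (the block size and active-block count are arbitrary illustrative values):

    /* 256 threads per block => 256 / 32 = 8 warps per block */
    int threadsPerBlock   = 256;
    int warpsPerBlock     = threadsPerBlock / 32;

    /* If register and shared memory usage allow 3 active blocks per SM,
       the SM has 8 * 3 = 24 active warps to switch between */
    int activeBlocksPerSM = 3;
    int activeWarpsPerSM  = warpsPerBlock * activeBlocksPerSM;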

The shared memory is implemented within each SM as an SRAM, and the global memory is part of the off-chip DRAM. The shared memory has very low access latency (almost the same as that of a register) and high bandwidth. However, since a warp of 32 threads accesses the shared memory together, when there is a bank conflict within a warp, accessing the shared memory takes multiple cycles.
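
A sketch of the two shared memory access patterns, assuming 4-byte elements and the usual power-of-two number of banks (the kernel and array names are made up, and a launch with a single warp of 32 threads is assumed):

    __global__ void bank_conflict_demo(float *out)
    {
        __shared__ float tile[32 * 32];
        int tid = threadIdx.x;

        for (int i = tid; i < 32 * 32; i += 32)
            tile[i] = (float)i;
        __syncthreads();

        /* Conflict-free: consecutive threads read consecutive 4-byte words,
           which fall into different banks */
        float a = tile[tid];

        /* Bank conflict: a power-of-two stride that is a multiple of the bank
           count maps the threads of the warp onto the same bank, so the reads
           are serialized over multiple cycles */
        float b = tile[tid * 32];

        out[tid] = a + b;
    }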

Coalesced and Uncoalesced Memory Accesses

The SM processor executes one warp at a time and schedules warps in a time-sharing fashion. The processor has enough register read/write ports to operate on 32 threads (i.e., one warp) together, but since an SM has only 8 functional units, executing a computation instruction for a whole warp takes 4 SM processor cycles.

When the SM processor executes a memory instruction, it generates memory requests and switches to another warp until all the memory values for the warp are ready. Ideally, all the memory accesses within a warp can be combined into one memory transaction. Unfortunately, whether this happens depends on the memory access pattern within the warp. If the memory addresses are sequential, all of the memory requests within the warp can be coalesced into a single memory transaction. Otherwise, each memory address generates a separate transaction. The CUDA manual provides detailed algorithms to identify the types of coalesced and uncoalesced memory accesses. If the memory requests in a warp are uncoalesced, the warp cannot be executed until all memory transactions from that warp are serviced, which takes significantly longer than waiting for a single memory request (the coalesced case).
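
The two cases can be sketched as follows (illustrative kernels, not generated code; the stride parameter is arbitrary):

    __global__ void coalesced_copy(const float *in, float *out, int n)
    {
        /* Sequential addresses: thread k of a warp reads element base + k,
           so the whole warp is served by a single memory transaction */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    __global__ void uncoalesced_copy(const float *in, float *out, int n, int stride)
    {
        /* Scattered addresses: consecutive threads read elements far apart,
           so each access may generate its own memory transaction */
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];
    }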

Runner

The algorithm used to translate llc code to CUDA is as follows (a sketch of the kind of code this translation produces appears after the list):

  1. Create the source storage to save the destination files
  2. For each parallel for:
    • Encapsulate the parallel region into a separate function using regionEncapsulate
    • Create a kernel from the loop using the Kernelize algorithm
  3. For each parallel region:
    • Encapsulate the parallel region into a separate function using regionEncapsulate
    • For each for construct:
      • Create a kernel from the loop using the Kernelize algorithm
    • For each Task:
      • Ignore it for now
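
A hand-written sketch of the kind of code this translation aims at, for a simple llc/OpenMP-annotated loop; the naming and structure of the code actually produced by the backend differ:

    #include <cuda_runtime.h>

    /* Hypothetical input region:
     *   #pragma omp parallel for
     *   for (i = 0; i < n; i++)
     *       c[i] = a[i] + b[i];
     */

    /* Sketch of the kernel that encapsulating the region and kernelizing
       the loop would yield */
    __global__ void parallel_for_kernel(const double *a, const double *b,
                                        double *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one iteration per thread */
        if (i < n)
            c[i] = a[i] + b[i];
    }

    /* Host-side replacement of the original loop */
    void run_parallel_for(const double *a, const double *b, double *c, int n)
    {
        double *d_a, *d_b, *d_c;
        size_t bytes = n * sizeof(double);
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);
        cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

        int threads = 128;
        int blocks  = (n + threads - 1) / threads;
        parallel_for_kernel<<<blocks, threads>>>(d_a, d_b, d_c, n);

        cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    }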

Kernelize

The process of creating a kernel from a loop is not a trivial task. Some details must be taken into account:

  • Syntactic constraints:
    • Loop-specific constructs (e.g. break, continue) cannot be used inside the kernel and need to be ported to the kernel in some way.
    • Control must not exit the kernel region (thus, the block must be a SESE block, see XXX for information).
    • Pointer accesses cannot be used freely (although vector arithmetic might be reused as an indexing position).
    • No dynamic memory can be allocated inside the loop.
    • All function calls inside the loop must be executable on the device where the kernel is running.

  • Semantic constraints:

If we assume that one iteration is one thread:

    • No assumption can be made about the order of thread execution (all parallel, all sequential, 0 first, 0 last).
    • Thread numbering always starts at 0 and ends at nblocks * nthreads; thus, if a loop starts at another value, or its stride is not one, a mapping function must be defined (see the first sketch after this list).
    • There is a maximum number of threads available and, thus, a maximum number of iterations exists for this mapping scheme.

If we assume that one thread performs N iterations:

We can simplify this case by applying an "unroll (N)" transformation to the loop. However, it is important to note the performance penalty introduced if accesses to global variables are not coalesced (see the second sketch after this list).
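
A sketch of such a mapping function for the one-iteration-per-thread case, assuming a loop of the form for (i = start; i < end; i += stride) (the kernel name and body are illustrative only):

    __global__ void mapped_loop_kernel(float *data, int start, int end, int stride)
    {
        /* Thread numbering runs from 0 to nblocks * nthreads - 1 ... */
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        /* ... so it is mapped back onto the original iteration space */
        int i = start + tid * stride;

        /* Guard: more threads than iterations may have been launched */
        if (i < end)
            data[i] = 2.0f * data[i];
    }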
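
For the case where one thread performs N iterations, a common way to keep global memory accesses coalesced is a grid-stride loop, in which consecutive threads touch consecutive elements on every pass. This is an illustrative sketch, not the literal transformation applied by the backend:

    __global__ void n_iterations_kernel(float *data, int n)
    {
        int tid      = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;

        /* Each thread performs roughly n / nthreads iterations; in every pass
           a warp still reads a contiguous block, so accesses stay coalesced */
        for (int i = tid; i < n; i += nthreads)
            data[i] = 2.0f * data[i];
    }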

Filter Visitors

Templates

Currently, templates are held inside the Mutators code.

Mutators

A separate Mutator has been written for each OpenMP construct. Their common parent is Backends.Cuda.Mutators.Common.

The following constructs have been implemented:

OpenMP Parallel

OpenMP Parallel For

OpenMP For

Files

This module has one member, used to store CUDA source files:

class Backends.Cuda.Files.FileTypes.Cuda_FileType

A file type to store Cuda source code files

importMacros(line_list)

Get the preprocessor macros from a given string and store them in the file.

Currently supported macros are:

#include <[A-Za-z0-9.]+>
#include "[A-Za-z0-9.]+"

Parameters: line_list – The original text

pretty_print(text)

Return a pretty representation of the file

Parameters: text – The original text to be pretty printed