Development environment for multi-platform CUDA software

Vincent Jordan (KDE lab.)

The CUDA GPU computing development framework is available for three operating systems: Microsoft Windows, GNU/Linux and Mac OS X. The last two share the same UNIX-like architecture, so the CUDA toolkit is quite similar on both. For host (CPU) compilation, the Nvidia CUDA buildchain falls back on the default compiler available on the operating system: the GNU Compiler Collection (gcc), the Microsoft Visual Studio compiler (cl) or the Intel C++ Compiler (icc) can be used by nvcc. The first CUDA SDK released to the public was the 1.1 Beta version in June 2007. At the time of writing, the latest version is 3.1, but the compilation workflow has remained the same despite the numerous improvements to the toolkit.
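As a small sketch (assuming a CUDA 3.x toolkit is installed and on the PATH; the compiler path and file names are illustrative), the host compiler used by nvcc can be chosen explicitly:

```shell
# Report the installed toolkit version, then build a CUDA source file
# while delegating host (CPU) compilation to a specific gcc installation.
nvcc --version
nvcc --compiler-bindir=/usr/bin -o app app.cu   # -ccbin is the short form
```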

Before describing each step of building a CUDA program, let us recall the main stages of building any C/C++ software. To start off, the preprocessing stage matches text strings and replaces them with others according to macro rules. Then, the compilation stage translates the source code into assembly code. Next, the assembly code is converted into machine code. Finally, the linking stage connects the program to the operating system's primitives; this includes adding the runtime library, which mainly consists of memory-management routines.
This process applies to the CUDA language as well, since it takes place after the C++ and CUDA language extensions have been converted into regular ANSI C.

CUDA compilation workflow

The buildchain consists of several different tools. The following figure shows the complete compilation process and the intermediate files, from the CUDA source file to the final executable file.
The first part is performed by cudafe, which splits device (GPU) code from host (CPU) code. The device code is then compiled by nvopencc into Parallel Thread eXecution (PTX) code, an intermediate assembly language. This assembly code is then compiled into a CUDA binary (Cubin) by the proprietary ptxas tool. The Cubin format is the machine code of the targeted GPU instruction set; it is proprietary, undocumented and subject to change.
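These intermediate files can be inspected directly (a sketch assuming a CUDA 3.x toolkit; kernel.cu is a placeholder name): nvcc's --keep option preserves them, and the -ptx/-cubin options stop the workflow at the corresponding stage.

```shell
nvcc --keep kernel.cu    # keep the intermediate .ii, .ptx, .cubin ... files
nvcc -ptx kernel.cu      # stop after nvopencc: emit kernel.ptx
nvcc -cubin kernel.cu    # stop after ptxas: emit kernel.cubin
nvcc --dryrun kernel.cu  # list the toolchain commands without running them
```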

Figure: overview of the CUDA compilation workflow (based on the explanation found in [NVCC31])

CUDA front end

cudafe stands for CUDA front end and has two purposes: preprocessing (with the -E option) and CUDA source-code analysis. This tool is based on gcc.

Unlike in the standard compilation scheme, the preprocessing stage is performed three times. Please note that the .ii extension refers to C++ preprocessed files, while .i refers to C preprocessed files.

The figure shows that the CUDA front end is invoked twice.

Nvidia compiler

The compilation stage of the CUDA toolchain is actually divided into two parts: high-level and low-level compilation. The intermediate language between these two parts is the PTX assembly. Unlike well-known assembly languages (ARM, x86, ...), PTX is not simply translated one-to-one into Cubin (machine code).
PTX defines a virtual machine and ISA (Instruction Set Architecture) for general-purpose parallel thread execution. This intermediate stage was introduced to provide a stable ISA that spans multiple GPU generations.
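As an illustration, a minimal PTX module looks like the following (a hand-written sketch, not compiler output; the .version and .target directives are examples matching the toolkits of that era):

```
.version 1.4          // PTX ISA version
.target sm_10         // lowest targeted virtual architecture

.entry empty_kernel   // a kernel that does nothing
{
    ret;              // return control to the host
}
```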
It is worth noting that nvcc is different from nvopencc: nvcc refers to the whole compilation process (preprocessing plus one or both compilation stages), while nvopencc refers only to the first, high-level stage, which produces PTX code.
There are several options for the low-level compilation stage: the PTX code can be compiled ahead of time by ptxas and the resulting Cubin embedded in the executable, or the PTX itself can be embedded and compiled just in time by the graphics driver when the program is loaded.

nvopencc (high-level) and ptxas/the graphics driver (low-level) each perform part of the compilation.
nvopencc is a fork of a subset of the open-source Open64 compiler [OPEN64], developed by the Computer Architecture and Parallel Systems Laboratory (CAPSL) of the University of Delaware. According to Mike Murphy from Nvidia in [MURPHY08], Open64 was chosen for the strength of its optimizations over GCC.
The high-level compiler only uses a subset of Open64 because its input is always C language. Another simplification is that nvopencc does not perform any cross-file inter-procedural analysis (IPA); the whole kernel source code therefore has to be contained in a single source file (this might change in the future).

Low-level compilation is performed by Nvidia's proprietary Optimized Code Generator (OCG). PTX provides a virtual machine model and is independent of the underlying processor; OCG allocates registers and schedules instructions for the targeted GPU chip, producing the Cubin format. The decuda/cudasm tools [DECUDA], created through reverse engineering, can disassemble/assemble these files for the G8X and G9X architectures, even though they are not supported by Nvidia.
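The choice between embedding a Cubin (compiled ahead of time by ptxas/OCG) and embedding PTX (compiled just in time by the driver) can be sketched with nvcc's -arch and -code options (assuming a CUDA 3.x toolkit; sm_13 is just an example target):

```shell
# Ahead of time: compile PTX for compute capability 1.3 and embed the
# matching Cubin, so no driver JIT step is needed on sm_13 GPUs.
nvcc -arch=compute_13 -code=sm_13 -o app app.cu

# Just in time: embed only the PTX; the graphics driver compiles it to
# the actual GPU's machine code when the program is loaded.
nvcc -arch=compute_13 -code=compute_13 -o app app.cu
```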

Last update: September 2010