What is a device, a compute unit (CU), a work-group, a work-item, a command-queue?
Work-group vs Work-item
- From :
- A multi-core CPU or multiple CPUs (in a multi-socket machine) constitute a single OpenCL device. Separate cores are compute units. A work-group is a collection of related work-items that execute on a single compute unit. The work-items in the group execute the same kernel and share local memory and work-group barriers.
To continue the CPU analogy, a work-group is executed on one core, and its work-items are processed together using vector instructions (such as MMX and SSE). On a CPU, the elements of a work-group usually share the same L1 and L2 caches; on a GPU, local memory is more likely a scratchpad. Later GPU architectures make it possible to use this scratchpad memory as a cache as well, bringing them closer to a CPU design.
It is also worth noting that a condition resulting in divergent execution paths within a work-group (e.g., some work-items taking path A, others taking path B) kills performance on a GPU: work-groups are larger there, and path A and path B cannot be executed in parallel, so more silicon is "wasted" doing nothing.
This analogy also exposes the mismatched terminology between CPU and GPU: a CPU core is totally different from a so-called CUDA core. A CUDA "core" is an ALU lane in CPU terminology, while a CPU core corresponds to a "Streaming Multiprocessor" (SM) in CUDA terminology, a "Compute Unit" (CU) in OpenCL terminology, and an "Execution Unit" (EU) in Intel GPU terminology.
CUDA compute capability table from Wikipedia:
The line "Number of ALU lanes" is what NVidia calls a core.
The "GeForce GTX 1080 Ti" is based on the Pascal architecture with compute capability 6.1, so each compute unit has 128 ALU lanes. 3584 / 128 = 28: the chip is more accurately seen as a 28-core GPU with vector instructions operating on up to 128 32-bit data elements.
Confirmation at :
CL_DEVICE_MAX_COMPUTE_UNITS               28
CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT  28
To fully utilize this GPU, you could launch an OpenCL kernel with 28 work-groups of 128 work-items each.
AMD does the same, but advertises the compute-unit and stream-processor counts together:
The "Radeon Pro", based on the Vega architecture, features a 64-core GPU with vector instructions on up to 64 32-bit data elements (organized in two groups of 32).
Following this logic, one could say that the 72-core Intel Xeon Phi processor with AVX-512 vector instructions, which are 512 bits wide and can process 16 32-bit data elements in one instruction, features 16 * 72 = 1152 "cores" (NVidia-style) or "stream processors" (AMD-style).
Example of clinfo output for an Intel HD Graphics GPU:
Max compute units                 24
[...]
Preferred / native vector sizes
  char     16 / 16
  short     8 / 8
  int       4 / 4
  long      1 / 1
  half      8 / 8   (cl_khr_fp16)
  float     1 / 1
  double    1 / 1   (cl_khr_fp64)
Example of clinfo output for an Intel Core CPU:
Max compute units                 4
[...]
Preferred / native vector sizes
  char      1 / 32
  short     1 / 16
  int       1 / 8
  long      1 / 4
  half      0 / 0   (n/a)
  float     1 / 8
  double    1 / 4   (cl_khr_fp64)
Using 256-bit AVX vector instructions, the CPU can process 32 8-bit chars, or 16 16-bit shorts, or 8 32-bit ints, or 4 64-bit longs. The preferred size is 1, since the CPU can also process those data one-by-one, using the normal x86 instructions.
On the GPU side, the preferred sizes equal the native sizes, because there are no dedicated instructions for processing one element at a time; doing so wastes resources, much like using an SSE vector instruction to process a single int.
Data-parallel vs Task-parallel
OpenCL is strongly oriented toward data parallelism, while typical CPU parallel processing is task parallelism.
Task-parallel programming model – the OpenCL programming model in which a kernel runs as a single work-group containing a single work-item.
Interested in having command-queues on specific compute units, as if they were different OpenCL devices? This is called "Device Fission".
How to size OpenCL kernel execution?
cl_int clEnqueueNDRangeKernel(
    cl_command_queue command_queue,
    cl_kernel        kernel,
    cl_uint          work_dim,
    const size_t    *global_work_offset,
    const size_t    *global_work_size,  /* number of work-groups x work-group size */
    const size_t    *local_work_size,   /* work-group size */
    [...] )
- Wikipedia, CUDA, https://en.wikipedia.org/wiki/CUDA
- Wikipedia, AVX-512, https://en.wikipedia.org/wiki/AVX-512
- https://www.youtube.com/watch?v=_to3xz12Ojo
- http://infocenter.arm.com/help/topic/com.arm.doc.dui0538f/DUI0538F_mali_t600_opencl_dg.pdf
- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0538e/index.html
- https://stackoverflow.com/a/25446487
- http://sc13.supercomputing.org/sites/default/files/PostersArchive/src_posters/spost110s2-file2.pdf
- https://www.youtube.com/watch?v=mo5zVbCg12I