OpenCL with pocl

pocl will convert your opencl kernels into AVX-vectorized CPU binaries

pocl stands for portable compute language [1].

Building pocl

For sweet and crispy pocl, you'll need:

  • warm up your favorite C++ compiler
  • prepare the follow baking packages (if you have an Ubuntu 16.04 kitchen)
# llvm clang 5.0 or 6.0
# Note: at each release, pocl supports current previous llvm version
sudo apt install clang-5.0 libclang-5.0-dev

sudo apt install cmake pkg-config
sudo apt install libhwloc-dev

# For ICD support (not necessary if linking directly to libpocl)
sudo apt install opencl-headers
sudo apt install ocl-icd-opencl-dev
wget https://github.com/pocl/pocl/archive/v1.1.tar.gz
tar xf v1.1.tar.gz
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_ICD=ON ../pocl-1.1
# To build with clang
#cmake -DCMAKE_C_COMPILER=clang-5.0 -DCMAKE_CXX_COMPILER=clang++-5.0 -DCMAKE_BUILD_TYPE=Release -DENABLE_ICD=ON ../pocl-1.1
make -j`nproc`
sudo make install

If you want clinfo and others to find the new pocl ICD:

sudo mkdir -p /etc/OpenCL/vendors/
sudo cp /usr/local/etc/OpenCL/vendors/pocl.icd /etc/OpenCL/vendors/

The output of clinfo is [CLINFO].

The installed files are [INSTALL].

About poclcc: quoting official doucmentation "You can build a runtime-only pocl to run prebuilt pocl binaries on a device" [2].

Building simple_add

Find the sources at [SRC] and build them with:

g++ -c -g -std=c++11 -o main.o main.cpp
g++ -o simple_add main.o -lOpenCL

Running simple_add

Normal execution:

Using platform: Portable Computing Language
Using device: pthread-Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
 result:
0 2 4 3 5 7 6 8 10 9

The execution of pocl can be customized via environment variables [3]. Advices about debugging can be found there [4], but the env vars needs to be updated:

mkdir tmp
export POCL_CACHE_DIR=tmp
export POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES=1
./simple_add
cd tmp/KA/KEBEMMEIICNNHAKIHDLLOEMLKLGMDJJNIAENE/simple_add/10-1-1
llvm-dis-5.0 < parallel.bc > parallel.ll

The bytecode disassembly result is [BYTECODE].

Quote from Kalle Raiskilla:

But beware the debugger. Dark side of the source it is. If once you start down the dark path, forever will it dominate your destiny. Consume you it will.

For dumping the binary assembly:

llvm-objdump-5.0 -disassemble  simple_add.so

The binary disassembly result is [BIN].

This part is especially interesting:

2f7:    c5 fe 6f 04 86  vmovdqu (%rsi,%rax,4), %ymm0
2fc:    c5 fd fe 04 87  vpaddd  (%rdi,%rax,4), %ymm0, %ymm0
301:    c5 fe 7f 04 82  vmovdqu %ymm0, (%rdx,%rax,4)

The vpaddd is an AVX vector instruction for: "Add packed doubleword integers from xmm2, xmm3/m128 and store in xmm1" [5].

Note: "doubleword" is for double 16-bit word, therefore 32 bits.

clinfo output for pocl on "Intel Core i7-5500U" (broadwell architecture)

[CLINFO]
  Platform Name                                   Portable Computing Language
  Platform Vendor                                 The pocl project
  Platform Version                                OpenCL 1.2 pocl 1.1 Release, LLVM 5.0.0, SPIR, SLEEF, POCL_DEBUG
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd
  Platform Extensions function suffix             POCL


  Platform Name                                   Portable Computing Language
Number of devices                                 1
  Device Name                                     pthread-Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
  Device Vendor                                   GenuineIntel
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 1.2 pocl HSTR: pthread-x86_64-pc-linux-gnu-broadwell
  Driver Version                                  1.1
  Device OpenCL C Version                         OpenCL C 1.2 pocl
  Device Type                                     CPU
  Device Profile                                  FULL_PROFILE
  Max compute units                               4
  Max clock frequency                             3000MHz
  Device Partition                                (core)
        Max number of sub-devices                     4
        Supported partition types                     equally, by counts
  Max work item dimensions                        3
  Max work item sizes                             4096x4096x4096
  Max work group size                             4096
  Preferred work group size multiple              8
  Preferred / native vector sizes
        char                                                16 / 16
        short                                               16 / 16
        int                                                  8 / 8
        long                                                 4 / 4
        half                                                16 / 16       (n/a)
        float                                                8 / 8
        double                                               4 / 4        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
        Denormals                                     Yes
        Infinity and NANs                             Yes
        Round to nearest                              Yes
        Round to zero                                 Yes
        Round to infinity                             Yes
        IEEE754-2008 fused multiply-add               Yes
        Support is emulated in software               No
        Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
        Denormals                                     Yes
        Infinity and NANs                             Yes
        Round to nearest                              Yes
        Round to zero                                 Yes
        Round to infinity                             Yes
        IEEE754-2008 fused multiply-add               Yes
        Support is emulated in software               No
        Correctly-rounded divide and sqrt operations  Yes
  Address bits                                    64, Little-Endian
  Global memory size                              6127976448 (5.707GiB)
  Error Correction support                        No
  Max memory allocation                           2147483648 (2GiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        4194304
  Global Memory cache line                        64 bytes
  Image support                                   Yes
        Max number of samplers per kernel             16
        Max size for 1D images from buffer            134217728 pixels
        Max 1D or 2D image array size                 2048 images
        Max 2D image size                             8192x8192 pixels
        Max 3D image size                             2048x2048x2048 pixels
        Max number of read image args                 128
        Max number of write image args                128
  Local memory type                               Global
  Local memory size                               2097152 (2MiB)
  Max constant buffer size                        2097152 (2MiB)
  Max number of constant args                     8
  Max size of kernel argument                     1024
  Queue properties
        Out-of-order execution                        No
        Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1ns
  Execution capabilities
        Run OpenCL kernels                            Yes
        Run native kernels                            Yes
        SPIR versions                                 1.2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_spir cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64

pocl installed files

[INSTALL]
Install the project...
-- Install configuration: "Release"
-- Installing: /usr/local/etc/OpenCL/vendors/pocl.icd
-- Installing: /usr/local/lib/pkgconfig/pocl.pc
-- Installing: /usr/local/share/pocl/include/_enable_all_exts.h
-- Installing: /usr/local/share/pocl/include/_builtin_renames.h
-- Installing: /usr/local/share/pocl/include/_kernel.h
-- Installing: /usr/local/share/pocl/include/_kernel_c.h
-- Installing: /usr/local/share/pocl/include/_kernel_constants.h
-- Installing: /usr/local/share/pocl/include/pocl_types.h
-- Installing: /usr/local/share/pocl/include/pocl_device.h
-- Installing: /usr/local/share/pocl/include/pocl.h
-- Installing: /usr/local/share/pocl/kernel-x86_64-pc-linux-gnu-broadwell.bc
-- Installing: /usr/local/lib/pocl/libllvmopencl.so
-- Installing: /usr/local/lib/libpocl.so.2.1.0
-- Installing: /usr/local/lib/libpocl.so.2
-- Installing: /usr/local/lib/libpocl.so
-- Set runtime path of "/usr/local/lib/libpocl.so.2.1.0" to ""
-- Installing: /usr/local/bin/poclcc

OpenCL "hello world", simple add

[SRC]
// Inspired from:
// - http://simpleopencl.blogspot.de/2013/06/tutorial-simple-start-with-opencl-and-c.html
// - http://github.khronos.org/OpenCL-CLHPP/

#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION  120

//#define CL_HPP_ENABLE_EXCEPTIONS
//#define CL_HPP_CL_1_2_DEFAULT_BUILD

#include <iostream>
#include <CL/cl2.hpp>   // backward compatible with OpenCL 1.x

int main() {
    // ### Step A: Select a device
    // get all platforms (drivers)
    std::vector<cl::Platform> all_platforms;
    cl::Platform::get(&all_platforms);
    if(all_platforms.size() == 0) {
        std::cout << " No platforms found. Check OpenCL installation!" << std::endl;
        exit(1);
    }

    // set default plateform
    cl::Platform default_platform = all_platforms[1];
    std::cout << "Using platform: " << default_platform.getInfo<CL_PLATFORM_NAME>() << std::endl;

    // get default device of the default platform
    std::vector<cl::Device> all_devices;
    default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
    if(all_devices.size() == 0){
        std::cout << " No devices found. Check OpenCL installation!" << std::endl;
        exit(1);
    }

    // set default device
    cl::Device default_device = all_devices[0];
    std::cout << "Using device: " << default_device.getInfo<CL_DEVICE_NAME>() << std::endl;

    // ### Step B: Build kernel and prepare arguments
    cl::Context context({default_device});

    // kernel calculates for each element C=A+B
    // Note: use C++11 raw string literals for kernel source code
    std::string kernel_code{R"CLC(
        void kernel simple_add(global const int* A, global const int* B, global int* C)
        {
                C[get_global_id(0)] = A[get_global_id(0)] + B[get_global_id(0)];
        }
    )CLC"};

    // Note: new simpler string interface style
    std::vector<std::string> sources{kernel_code};
    cl::Program program(context, sources);
    if(program.build({default_device}) != CL_SUCCESS) {
        std::cout << " Error building: " << std::endl;

        auto buildInfo = program.getBuildInfo<CL_PROGRAM_BUILD_LOG>();
        for (auto &pair : buildInfo) {
            std::cerr << pair.second << std::endl << std::endl;
        }
        exit(1);
    }

    // create buffers on the device
    cl::Buffer buffer_A(context, CL_MEM_READ_WRITE, sizeof(int)*10);
    cl::Buffer buffer_B(context, CL_MEM_READ_WRITE, sizeof(int)*10);
    cl::Buffer buffer_C(context, CL_MEM_READ_WRITE, sizeof(int)*10);

    int A[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    int B[] = {0, 1, 2, 0, 1, 2, 0, 1, 2, 0};

    // create queue to which we will push commands for the device.
    cl::CommandQueue queue(context, default_device);

    // write arrays A and B to the device
    queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(int)*10, A);
    queue.enqueueWriteBuffer(buffer_B, CL_TRUE, 0, sizeof(int)*10, B);

    // ### Step C: Run the kernel
    cl::Kernel kernel_add = cl::Kernel(program, "simple_add");
    kernel_add.setArg(0, buffer_A);
    kernel_add.setArg(1, buffer_B);
    kernel_add.setArg(2, buffer_C);
    queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, cl::NDRange(10), cl::NullRange);
    queue.finish();

    // ### Step D: Get the results
    // read result C from the device to array C
    int C[10];
    queue.enqueueReadBuffer(buffer_C, CL_TRUE, 0, sizeof(int)*10, C);

    std::cout << " result: " << std::endl;
    for(int i=0; i<10; i++) {
        std::cout << C[i] << " ";
    }

    return 0;
}

llvm bytecode disassembly

[BYTECODE]
define void @_pocl_launcher_simple_add(i32* nocapture readonly, i32* nocapture readonly, i32* nocapture, { i32, [3 x i64], [3 x i64], [3 x i64], [3 x i64] }* nocapture readonly) local_unnamed_addr #0 {
vector.ph:
  %4 = getelementptr { i32, [3 x i64], [3 x i64], [3 x i64], [3 x i64] }, { i32, [3 x i64], [3 x i64], [3 x i64], [3 x i64] }* %3, i64 0, i32 2, i64 0
  %5 = load i64, i64* %4, align 8
  %6 = getelementptr { i32, [3 x i64], [3 x i64], [3 x i64], [3 x i64] }, { i32, [3 x i64], [3 x i64], [3 x i64], [3 x i64] }* %3, i64 0, i32 3, i64 0
  %7 = load i64, i64* %6, align 8
  %8 = mul i64 %5, 10
  %9 = add i64 %8, %7
  %10 = getelementptr inbounds i32, i32* %0, i64 %9
  %11 = bitcast i32* %10 to <8 x i32>*
  %wide.load = load <8 x i32>, <8 x i32>* %11, align 4, !tbaa !2
  %12 = getelementptr inbounds i32, i32* %1, i64 %9
  %13 = bitcast i32* %12 to <8 x i32>*
  %wide.load1 = load <8 x i32>, <8 x i32>* %13, align 4, !tbaa !2
  %14 = add nsw <8 x i32> %wide.load1, %wide.load
  %15 = getelementptr inbounds i32, i32* %2, i64 %9
  %16 = bitcast i32* %15 to <8 x i32>*
  store <8 x i32> %14, <8 x i32>* %16, align 4, !tbaa !2
  %17 = add i64 %9, 8
  %arrayidx.i = getelementptr inbounds i32, i32* %0, i64 %17
  %18 = load i32, i32* %arrayidx.i, align 4, !tbaa !2, !llvm.mem.parallel_loop_access !6
  %arrayidx2.i = getelementptr inbounds i32, i32* %1, i64 %17
  %19 = load i32, i32* %arrayidx2.i, align 4, !tbaa !2, !llvm.mem.parallel_loop_access !6
  %add.i = add nsw i32 %19, %18
  %arrayidx4.i = getelementptr inbounds i32, i32* %2, i64 %17
  store i32 %add.i, i32* %arrayidx4.i, align 4, !tbaa !2, !llvm.mem.parallel_loop_access !6
  %20 = add i64 %9, 9
  %arrayidx.i.1 = getelementptr inbounds i32, i32* %0, i64 %20
  %21 = load i32, i32* %arrayidx.i.1, align 4, !tbaa !2, !llvm.mem.parallel_loop_access !6
  %arrayidx2.i.1 = getelementptr inbounds i32, i32* %1, i64 %20
  %22 = load i32, i32* %arrayidx2.i.1, align 4, !tbaa !2, !llvm.mem.parallel_loop_access !6
  %add.i.1 = add nsw i32 %22, %21
  %arrayidx4.i.1 = getelementptr inbounds i32, i32* %2, i64 %20
  store i32 %add.i.1, i32* %arrayidx4.i.1, align 4, !tbaa !2, !llvm.mem.parallel_loop_access !6
  ret void
}

binary disassembly

[BIN]
   _pocl_launcher_simple_add:
2e0:       55      pushq   %rbp
2e1:       48 89 e5        movq    %rsp, %rbp
2e4:       48 83 e4 f8     andq    $-8, %rsp
2e8:       48 8b 41 20     movq    32(%rcx), %rax
2ec:       48 8d 04 80     leaq    (%rax,%rax,4), %rax
2f0:       48 01 c0        addq    %rax, %rax
2f3:       48 03 41 38     addq    56(%rcx), %rax
2f7:       c5 fe 6f 04 86  vmovdqu (%rsi,%rax,4), %ymm0
2fc:       c5 fd fe 04 87  vpaddd  (%rdi,%rax,4), %ymm0, %ymm0
301:       c5 fe 7f 04 82  vmovdqu %ymm0, (%rdx,%rax,4)
306:       8b 4c 86 20     movl    32(%rsi,%rax,4), %ecx
30a:       03 4c 87 20     addl    32(%rdi,%rax,4), %ecx
30e:       89 4c 82 20     movl    %ecx, 32(%rdx,%rax,4)
312:       8b 4c 86 24     movl    36(%rsi,%rax,4), %ecx
316:       03 4c 87 24     addl    36(%rdi,%rax,4), %ecx
31a:       89 4c 82 24     movl    %ecx, 36(%rdx,%rax,4)
31e:       48 89 ec        movq    %rbp, %rsp
321:       5d      popq    %rbp
322:       c5 f8 77        vzeroupper
325:       c3      retq
326:       66 2e 0f 1f 84 00 00 00 00 00   nopw    %cs:(%rax,%rax)