Day 6: OpenCL malloc

How can OpenCL memory allocation be made to look more like plain malloc/memcpy?

The problem

The problem with clCreateBuffer [1] is that it returns an opaque cl_mem object rather than a good old pointer to the allocated memory.

cl_mem clCreateBuffer (
    cl_context context,
    cl_mem_flags flags,
    size_t size,
    void *host_ptr,
    cl_int *errcode_ret
)

This makes it awkward to use with software that is architected to take function pointers to malloc-like and memcpy-like functions, such as cudaMalloc and cudaMemcpy.
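To make the constraint concrete, here is a minimal sketch of the kind of interface such software tends to expect. The allocator_ops type, the host_ops table, and the duplicate_buffer helper are hypothetical names chosen for illustration; they are not taken from any particular library. Every entry works in terms of plain void pointers, which is why a function returning cl_mem cannot be dropped in directly.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical allocator table: the shape that code written around
// malloc/memcpy-style function pointers typically expects.
typedef struct {
    void *(*alloc)(size_t size);
    void  (*release)(void *ptr);
    void *(*copy)(void *dst, const void *src, size_t size);
} allocator_ops;

// Host backend: the standard functions already have the right signatures.
static const allocator_ops host_ops = { malloc, free, memcpy };

// Backend-agnostic helper: allocates a new buffer through the table and
// copies size bytes from src into it. Returns NULL if allocation fails.
static void *duplicate_buffer(const allocator_ops *ops,
                              const void *src, size_t size)
{
    void *dst = ops->alloc(size);
    if (dst != NULL)
        ops->copy(dst, src, size);
    return dst;
}
```

An OpenCL backend would need an alloc entry that returns a void pointer, and clCreateBuffer alone does not provide one.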

Of course, an easy solution would be to simplify everything with shared memory, but this feature is only available starting from OpenCL 2.0 (OpenCL 2.0 SVM [2]), and many devices do not support it. It would also make the solution incompatible with discrete GPUs.

Another solution

The flags parameter can take these host-related values:

  • CL_MEM_USE_HOST_PTR
  • CL_MEM_ALLOC_HOST_PTR
  • CL_MEM_COPY_HOST_PTR

This Intel article [3] gives advice on creating zero-copy buffers in OpenCL 1.2.

Use CL_MEM_ALLOC_HOST_PTR and let the runtime handle creating a zero copy allocation buffer for you

As the function prototype shows, clCreateBuffer never returns a pointer. No combination of flags will therefore both allocate the memory for you and give you a pointer to that memory (which is what is actually wanted here).

With CL_MEM_ALLOC_HOST_PTR alone, host_ptr has to be NULL.

The only solution is to allocate the memory yourself with a normal malloc-style host allocation, pass it to clCreateBuffer with CL_MEM_USE_HOST_PTR, and follow the vendor-specific alignment and size restrictions that make zero-copy possible.

When using the CL_MEM_USE_HOST_PTR flag, if we want to guarantee a zero copy buffer on Intel processor graphics, we need to ensure that we adhere to two device-dependent alignment and size rules. We must create a buffer that is aligned to a 4096 byte boundary and have a size that is a multiple of 64 bytes.

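The example code [4] checks buffers like this (the pointer is cast through uintptr_t here so that the check also holds on 64-bit platforms, where a cast to unsigned int would truncate the address):

#include <stdint.h> // uintptr_t

unsigned int verifyZeroCopyPtr(void *ptr, unsigned int sizeOfContentsOfPtr)
{
        unsigned int status; // so we only have one exit point from function
        if((uintptr_t)ptr % 4096 == 0) // page alignment and cache alignment
        {
                if(sizeOfContentsOfPtr % 64 == 0) // multiple of cache line size
                {
                        status = 1;
                }
                else status = 0;
        }
        else status = 0;
        return status;
}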

"How to allocate aligned memory only using the standard library?" [5]

C11 has a nice solution to this:

#include <stdlib.h>
void *aligned_alloc(size_t alignment, size_t size);
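As a rough sketch of how aligned_alloc could be combined with the Intel rules quoted above, the helper below pads the requested size and allocates on a 4096-byte boundary. The names zero_copy_alloc and is_zero_copy_candidate are hypothetical, and the constants 4096 and 64 are the Intel-specific values from the article; other vendors may need different ones. The size is rounded up to a multiple of 4096 (which is also a multiple of 64) because C11 expects the size passed to aligned_alloc to be a multiple of the alignment. Note that aligned_alloc is not available on every toolchain.

```c
#include <stdint.h>
#include <stdlib.h>

// Assumed values, taken from the Intel rule quoted in the text.
#define ZERO_COPY_ALIGNMENT     4096u
#define ZERO_COPY_SIZE_MULTIPLE 64u

// Hypothetical helper: allocates at least size bytes on a 4096-byte
// boundary. The padded size actually allocated is stored in *padded_size.
// Returns NULL on failure or if the padded size would overflow size_t.
static void *zero_copy_alloc(size_t size, size_t *padded_size)
{
    const size_t a = ZERO_COPY_ALIGNMENT;
    if (size > SIZE_MAX - (a - 1))
        return NULL;
    size_t padded = (size + a - 1) / a * a; // round up to a multiple of a
    if (padded == 0)
        padded = a;                         // never request zero bytes
    void *ptr = aligned_alloc(a, padded);
    if (ptr != NULL && padded_size != NULL)
        *padded_size = padded;
    return ptr;
}

// Same check as verifyZeroCopyPtr, written with size_t and uintptr_t.
static int is_zero_copy_candidate(const void *ptr, size_t size)
{
    return (uintptr_t)ptr % ZERO_COPY_ALIGNMENT == 0
        && size % ZERO_COPY_SIZE_MULTIPLE == 0;
}
```

The returned pointer and the padded size are what would then be passed as host_ptr and size to clCreateBuffer together with CL_MEM_USE_HOST_PTR. The memory is released with free once the buffer object is no longer in use.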

What happens with a discrete GPU?

This specific memory buffer setting will create pinned memory.

Pinned memory is slower to allocate, but offers faster transfers over PCIe [6].