Day 9: Week summary

In the "thesis reboot" post, I brought back from the dead the dusty implementation of my master's thesis. Seeing that it was still (almost) working, I planned a week off work to prepare its adaptation to the almighty OpenCL. This post summarizes what this week was all about.

TL;DR aka meta-conclusion

The thesis implementation could have worked with OpenCL 1.2, but the v_array stack would have become even more complicated than it already was with CUDA because of: (1) the requirement to declare the memory address space every pointer points to, and (2) the lack of a closure-like construct.

OpenCL 2.0 solves the above-mentioned shortcomings with (1) the generic address space and (2) Clang blocks, but it is not widely supported and the Intel solution is very unstable.

Despite seven years of improvement since 2010, unless the Intel GPU debugger works better than expected, OpenCL development will require the same "primitive" strategy that was applied to the early CUDA implementation: turn the OpenCL sources into plain C sources using macros, and use the normal debugging tools (gdb and valgrind).
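For the record, a minimal sketch of what that macro strategy could look like (illustrative names only, not the actual thesis code): the OpenCL qualifiers are defined away and the work-item built-ins are emulated, so the same kernel source compiles as plain C and runs under gdb and valgrind.

```c
/* opencl_as_c.h -- hypothetical shim sketching the "primitive" strategy.
 * When the file is compiled as plain C, the OpenCL qualifiers vanish and
 * get_global_id() is emulated, so gdb and valgrind can be used as usual. */
#ifndef __OPENCL_VERSION__
  #define __kernel
  #define __global
  #define __local
  #define __constant const

  #include <stddef.h>
  static size_t emulated_gid;  /* set by the host-side launch loop below */
  static size_t get_global_id(unsigned dim) { (void)dim; return emulated_gid; }
#endif

/* The same source then builds both as an OpenCL kernel and as plain C. */
__kernel void vadd(__global const float *a, __global const float *b,
                   __global float *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}

#ifndef __OPENCL_VERSION__
/* Host-side "launcher": run the kernel sequentially over the whole NDRange. */
static void run_vadd(const float *a, const float *b, float *c, size_t n)
{
    for (emulated_gid = 0; emulated_gid < n; ++emulated_gid)
        vadd(a, b, c);
}
#endif
```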

Day 1, Saturday 30 September

Installation of the Intel drivers and a test of OpenCL performance with Ethereum coin mining.

Wasted time on: Wondering why ethminer did not see any CPU-based OpenCL device and attempting to change that. Looking up the differences between miners and mining pools.

Mini-conclusion: OpenCL mining performance is far from impressive. Learned that Ethereum mining was purposely designed to be I/O-bound so as to prevent the use of ASICs. Intel integrated GPUs have poor memory bandwidth because they share their memory with the CPU, unlike discrete graphics cards, which have dedicated high-bandwidth memory controllers. The discrepancy is even larger now that AMD and Nvidia use on-chip HBM. Ethereum mining is only worthwhile on those newer devices.

Not a problem for the master's thesis implementation, because the benefit of zero-copy outweighs the lack of fast memory. Also, high memory bandwidth is only an advantage for large, aligned, sequential reads, which is not the kind of memory access used here.

See Day 1 notes.

Day 2, Sunday 1 October

Refreshed and refactored the metadata generator to have it work on the larger XML datasets generated by Doxygen.

Wasted time on: qt-creator issues related to gdb auto-detection.

Mini-conclusion: Works fine now. The resulting sqlite database could be included in the preprocessing phase of in-app documentation generation. The libxml2 SAX parser is still quite fast. A 25 MB XML input gives a 100 MB sqlite database output. The fancy UTF-8-friendly word breaking that was disabled in the original implementation is still not needed here.
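For context, the SAX approach boils down to registering a couple of callbacks and letting libxml2 stream the document, so the 25 MB input never has to fit in a DOM tree. A minimal sketch (not the actual generator; the element name is only an example taken from the Doxygen output):

```c
/* Minimal libxml2 SAX sketch -- just the shape of the metadata generator,
 * not the real thing: callbacks fire while the XML streams through. */
#include <libxml/parser.h>
#include <stdio.h>
#include <string.h>

static void on_start_element(void *ctx, const xmlChar *name,
                             const xmlChar **attrs)
{
    (void)attrs;
    /* In the real generator this is where a row would be queued for
     * insertion into the sqlite database. */
    if (strcmp((const char *)name, "memberdef") == 0)
        ++*(unsigned long *)ctx;
}

int main(int argc, char **argv)
{
    unsigned long member_count = 0;
    xmlSAXHandler handler;
    memset(&handler, 0, sizeof handler);
    handler.startElement = on_start_element;

    if (argc < 2 || xmlSAXUserParseFile(&handler, &member_count, argv[1]) != 0) {
        fprintf(stderr, "usage: %s doxygen.xml\n", argv[0]);
        return 1;
    }
    printf("memberdef elements: %lu\n", member_count);
    xmlCleanupParser();
    return 0;
}
```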

There were no notes in an additional post.

Day 3, Monday 2 October

The simple OpenCL "Hello world" demo (aka "vector add"). Integration into qt-creator, use of the OpenCL C++ wrapper. Discovered oclgrind. First attempt to prepare a "v_array" demo with dynamic memory allocation in an OpenCL kernel.

Wasted time on: Why does set_default for {platform, device} not work? First taste of the "documentation mess" of the Intel OpenCL tools. Much time wasted extracting gdbserver-igfx from Intel Parallel Studio and following deprecated documentation.

Mini-conclusion: OpenCL does not integrate as well as CUDA with the host source code, because the kernel sources are provided to the OpenCL compiler as one big string and built at runtime. Intel releases its gdb-based debugging tools in several products and pushes its non-free development software. Went in the wrong direction with Parallel Studio; see day 8.
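To make the "big string" point concrete, this is roughly what the vector-add demo looks like with the C++ wrapper (a sketch; the header name and error handling vary between wrapper versions):

```cpp
// Sketch of the "vector add" demo with the OpenCL C++ wrapper
// (the header is cl.hpp, cl2.hpp or opencl.hpp depending on the version).
#include <CL/cl.hpp>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // The kernel source really is one big string handed to the OpenCL
    // compiler and built at runtime -- nothing like CUDA's nvcc pipeline.
    const std::string src = R"(
        __kernel void vadd(__global const float *a,
                           __global const float *b,
                           __global float *c) {
            size_t i = get_global_id(0);
            c[i] = a[i] + b[i];
        })";

    cl::Context ctx(CL_DEVICE_TYPE_DEFAULT);
    cl::Program prog(ctx, src, /*build=*/true);   // runtime compilation
    cl::CommandQueue queue(ctx);

    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    cl::Buffer da(ctx, a.begin(), a.end(), /*readOnly=*/true);
    cl::Buffer db(ctx, b.begin(), b.end(), /*readOnly=*/true);
    cl::Buffer dc(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float));

    cl::Kernel vadd(prog, "vadd");
    vadd.setArg(0, da);
    vadd.setArg(1, db);
    vadd.setArg(2, dc);
    queue.enqueueNDRangeKernel(vadd, cl::NullRange, cl::NDRange(n));
    cl::copy(queue, dc, c.begin(), c.end());

    std::cout << "c[0] = " << c[0] << std::endl;  // expect 3
    return 0;
}
```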

See Day 3 notes.

Day 4, Tuesday 3 October

The custom dynamic memory allocator uses atomics. Updated my knowledge of atomics. Discovered C11 and its new standardized atomics mirroring the C++11 ones. Found that OpenCL 2.0 nicely supports C11/C++11-style atomics. Adopted a new locking strategy after checking the state of the art.
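As an illustration, the two flavors of atomics differ roughly like this in a bump-pointer allocation step (a sketch, not the actual thesis allocator; all names are made up):

```c
// Sketch of a bump-pointer allocation step, not the actual thesis allocator.

// OpenCL 1.2 flavor: legacy 32-bit atomics, explicit address-space qualifiers.
__global char *pool_alloc_12(__global char *pool,
                             volatile __global uint *next,
                             uint size)
{
    uint offset = atomic_add(next, size);   // old value = start of our chunk
    return pool + offset;
}

// OpenCL 2.0 flavor: C11-style atomics plus the generic address space,
// so a single version of the helper covers global and local pools.
char *pool_alloc_20(char *pool, volatile atomic_uint *next, uint size)
{
    uint offset = atomic_fetch_add_explicit(next, size,
                                            memory_order_relaxed,
                                            memory_scope_device);
    return pool + offset;
}
```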

Mini-conclusion: Had to rethink the scope of the project: do I really want OpenCL 1.2 support? OpenCL 2.0 and shared virtual memory would solve this issue in a second.

See Day 4 notes.

Note for later: Lock-free memory manager, and an atomics benchmark, especially comparing OpenCL 1.2 custom atomics, OpenCL 2.0 with the generic address space, OpenCL 2.0 with coarse- and fine-grained memory sharing, and OpenCL 2.0 with shared atomics for cooperative CPU/GPU architectures.

Day 5, Wednesday 4 October

Restarting from the basics. Checked the implementation of clCreateBuffer. Checked the different ways to allocate memory buffers between CPU and GPU. Learned about pinned memory.
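The different allocation paths look roughly like this on the host side (a sketch assuming a valid context and queue; error handling omitted):

```c
/* Host-side sketch of the buffer allocation variants (C API). */
#include <CL/cl.h>

static void allocate_variants(cl_context ctx, cl_command_queue queue,
                              size_t size, void *host_data)
{
    cl_int err;

    /* 1. Plain device buffer: the runtime decides where it lives. */
    cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

    /* 2. "Pinned" buffer: the runtime allocates page-locked host memory;
     *    mapping it yields a host pointer usable for fast transfers. */
    cl_mem pinned = clCreateBuffer(ctx,
                                   CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                   size, NULL, &err);
    void *pinned_ptr = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                          0, size, 0, NULL, NULL, &err);

    /* 3. Zero-copy wrapper around existing host memory: the interesting
     *    case on integrated GPUs that share their memory with the CPU. */
    cl_mem wrapped = clCreateBuffer(ctx,
                                    CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    size, host_data, &err);

    (void)dev_buf; (void)pinned_ptr; (void)wrapped;  /* sketch only */
}
```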

Wasted time on: First issue with the architecture of the legacy implementation: malloc/memcpy work differently in OpenCL and do not fit the design of the original implementation; tried to trick it by casting cl_mem objects, and failed.

Mini-conclusion: Thinking of dropping OpenCL 1.2 support altogether in favor of clSVMAlloc, but concerned that most OpenCL hardware will remain stuck with OpenCL 1.2.
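For comparison, the OpenCL 2.0 shared virtual memory path that would make the malloc/memcpy mismatch disappear (a coarse-grained SVM sketch; the kernel and sizes are illustrative):

```c
/* OpenCL 2.0 coarse-grained SVM sketch: one pointer valid on both sides. */
#include <CL/cl.h>

static void svm_example(cl_context ctx, cl_command_queue queue,
                        cl_kernel kernel, size_t n)
{
    /* Memory reachable from host and device through the very same pointer. */
    float *data = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                      n * sizeof(float), 0);

    /* Coarse-grained SVM: map before touching it on the host... */
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(float),
                    0, NULL, NULL);
    for (size_t i = 0; i < n; ++i)
        data[i] = (float)i;
    clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

    /* ...then hand the same pointer to the kernel, no cl_mem involved. */
    clSetKernelArgSVMPointer(kernel, 0, data);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);

    clSVMFree(ctx, data);
}
```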

See Day 5 notes.

Day 6, Thursday 5 October

Used pinned memory as a solution to get something like a clMalloc very similar to cudaMalloc and standard malloc, thinking that OpenCL 1.2 compatibility could be saved after all. Discovered the mess of having to declare the global/local address space of what every pointer points to in OpenCL (which did not exist in CUDA).
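Concretely, under OpenCL 1.2 every helper that a v_array-like container relies on has to exist once per address-space combination, whereas the OpenCL 2.0 generic address space only needs one version. A made-up illustration:

```c
// Illustration of the address-space duplication (made-up helper, not v_array).

// OpenCL 1.2: the pointee's address space is part of the type, so the same
// trivial copy helper must be written for every source/destination pair.
void copy_g2l(__local float *dst, __global const float *src, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = src[i];
}
void copy_l2g(__global float *dst, __local const float *src, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = src[i];
}
void copy_g2g(__global float *dst, __global const float *src, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = src[i];
}
// ...and so on for the private/constant combinations.

// OpenCL 2.0: unqualified pointers are generic, so one helper covers them
// all (much closer to what CUDA allowed).
void copy_any(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = src[i];
}
```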

Wasted time on: Tried to adapt the old implementation to use it. Many problems arose. Lack of templates to support all the combinations of read/write from/to local/global memory.

Mini-conclusion: Needed to check the ARM Mali implementation, to see whether OpenCL 1.2 can be dropped again. The ARM Bifrost GPU architecture will support OpenCL 2.0. Interesting that ARM Mali GPUs are quite similar to Intel GPUs, and are friendlier to divergent threads than AMD's and Nvidia's.

See Day 6-1 and Day 6-2 notes.

Day 7, Friday 6 October

Focusing back on CPU-only with the old implementation. Managed to run some Doxygen XPath queries, but not all of them worked. The original implementation is a bit broken (e.g., the XPath parser). Tried different thread sizes, but that does not seem to change the execution time.

Gave up on OpenCL 1.2. Forced to use OpenCL 2.0 for the "v_array" demo. Using blocks as a kind of generic template (see the sketch below). Cannot use the laptop GPU anymore. Cannot use the AMD CPU device anymore.
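A rough idea of what "blocks as a kind of generic template" means (an OpenCL C 2.0 sketch, not the actual v_array code; block support comes with restrictions, so take it as an illustration only):

```c
// OpenCL C 2.0 sketch: a Clang block used as a closure-like callback.
__kernel void scale_and_bias(__global const int *in, __global int *out,
                             int bias)
{
    size_t i = get_global_id(0);

    // The block captures 'bias' from the enclosing scope, which is the
    // closure-like behaviour that plain OpenCL 1.2 C has no equivalent for.
    int (^transform)(int) = ^(int x) { return 2 * x + bias; };

    out[i] = transform(in[i]);
}
```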

Wasted time on: Finding the hard-coded tricks in the legacy implementation. OpenCL printf is very inconsistent. The Intel OpenCL CPU target simply crashes during kernel compilation, while the GPU target does not. No useful error message from the Intel runtime.

Mini-conclusion: The algorithm of the legacy implementation seems well suited to this kind of symbol request (find the enum in this class in this namespace in this file included there). Zero-copy buffers are a must-have feature. Maybe learn more about offline kernel compilation for Intel devices.

Note for later: GPU-optimized loading of the element streams from the sqlite database? Prepare the stream lists offline and simply load them on demand? GPU-optimized partitioning of the XML tree?

See Day 7 notes.

Day 8, Saturday 7 October

"OpenCL kernel debugging, an Intel adventure" day! The last hope is to use Intel GPU debugger.

Wasted _much_ time on: Finding the right Intel documentation among the outdated ones. Finding out that the "Intel OpenCL SDK" is the free "product" that also contains gdb-igfx and gdbserver-igfx, and that it differs from the one found in Intel Parallel Studio. The Intel OpenCL runtime crashed all the time.

Mini-conclusion: The Intel GPU debugger is hard to use because it is an unstable black box. The CPU runtime is not a reliable fallback either, since it crashes often too. Better to use the CPU manually, by actually running the code on the CPU directly.

Note for later: Set up the Intel mini computer once again and install CentOS as described in the Intel docs.

See Day 8 notes.