Let's make a friendly array for CPU/GPU processing
(A) Very well supported, but impractical to use and error-prone.
(B) Cannot be memcpy'ed: contains a pointer, and indexing is indirect.
(C) Perfect if the array does not change.
(D) Needs unpacking; no direct indexing possible.
(C) is memcpy-able, does not use pointers, and allows direct indexing.
Flat containers for the win!
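As a minimal sketch of what "flat" means here (the struct, sizes, and accessor names below are illustrative, not from the original): a fixed-size POD tile with the payload stored inline, so it is trivially copyable and directly indexable from both C++ and plain C.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Assumed tile geometry, for illustration only.
constexpr int kTileW = 64;
constexpr int kTileH = 64;

// A "flat" tile: no pointers, payload inline, fixed layout.
struct FlatTile {
    uint32_t width;
    uint32_t height;
    float samples[kTileW * kTileH];  // stored inline, not behind a pointer
};

// Trivially copyable => safe to memcpy (or DMA to a GPU buffer).
static_assert(std::is_trivially_copyable<FlatTile>::value,
              "FlatTile must stay flat");

// Direct 2-D indexing; the same code works verbatim as plain C.
inline float tile_at(const FlatTile& t, int x, int y) {
    return t.samples[y * kTileW + x];
}

inline void tile_set(FlatTile& t, int x, int y, float v) {
    t.samples[y * kTileW + x] = v;
}
```

A byte-wise copy of the whole struct is a valid copy of the tile, which is exactly what option (C) above buys you.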
Wouldn't it be nice to describe your flat container in some sort of description language and get the matching accessors generated in both plain C (for the OpenCL side) and C++ (for the host side)?
There is a Google project for that: FlatBuffers.
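A FlatBuffers schema for such a tile could look like the following (a hypothetical sketch; the `Pixel`/`Tile` names and fields are made up for illustration). `flatc` then generates the matching accessors for C, C++, and other languages.

```
// tile.fbs -- hypothetical schema, for illustration only
struct Pixel { r: ubyte; g: ubyte; b: ubyte; }

table Tile {
  width:  uint;
  height: uint;
  pixels: [Pixel];
}

root_type Tile;
```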
Skeleton for your data-processing cluster
A node is like this:
- Data storage in compressed files of serialized tiled data.
- On event, a tile file archive is downloaded from an HTTP content delivery network (CDN).
- Archive is decompressed and memory-mapped into RAM.
An instance is like this:
- On request, CPU and GPU read the memory-mapped data in shared RAM.
Each instance is a Docker container that shares the memory-mapped dataset with the other containers on the same node.
Faster? On each new dataset, accessors can be rebuilt just-in-time (JIT) to optimize for the now-fixed array sizes and container offsets.
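One way to sketch this (names and the generated kernel source are illustrative): bake the dataset's geometry into the accessor source as constants before handing it to the OpenCL JIT via `clBuildProgram`, so index arithmetic folds to compile-time offsets.

```cpp
#include <string>

// Sketch: specialize the accessor source per dataset by baking the
// tile geometry in as preprocessor constants. The resulting string
// would be fed to the OpenCL runtime compiler (clBuildProgram).
inline std::string specialized_accessor_source(int tile_w, int tile_h) {
    return "#define TILE_W " + std::to_string(tile_w) + "\n"
           "#define TILE_H " + std::to_string(tile_h) + "\n"
           "float tile_at(__global const float* t, int x, int y) {\n"
           "    return t[y * TILE_W + x]; /* offset folds to a constant */\n"
           "}\n";
}
```

The same idea works with offline specialization; JIT just makes it automatic per dataset.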
Memory bandwidth is limiting? The dataset can be sub-tiled to match the memory banks, and the node can run in NUMA ("non-uniform memory access") mode: the Docker containers processing a sub-tile are pinned to the CPU cores attached to the memory controller that has direct access to that sub-tile's memory bank.
Google FlatBuffers: https://google.github.io/flatbuffers/