Day 7: OpenCL on ARM Mali

Surprisingly, ARM Mali and Intel HD Graphics GPUs are similar in design, especially when compared with NVidia's and AMD's architectures.

Intel and ARM are friendlier to divergent threads

On the left, ARM Mali [ARM], on the right Intel IGP [INTEL].

The similarity seems to stem from the two companies having designed CPUs before entering the GPU world. They already had silicon-proven SIMD IPs (ARM NEON, Intel SSE), and it looks like they reused them in their GPUs. It is also worth noting that both ARM and Intel GPUs are designed to run on battery power: smartphones for ARM, laptops for Intel. Saving power also means saving memory bandwidth, as described in [1] ("Rule of thumb: 100mW / GB/s" according to [2], with the corresponding presentation at [3]), which makes unified memory more appealing.
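To get a feel for that rule of thumb, here is a small sketch applying "100 mW per GB/s" to two hypothetical bandwidth figures (the numbers below are illustrative orders of magnitude, not measured values):

```python
# Illustrative only: the "100 mW / GB/s" rule of thumb from [2],
# applied to hypothetical memory bandwidth figures.

MW_PER_GBPS = 100  # rule of thumb: 100 mW of power per GB/s of bandwidth

def memory_power_watts(bandwidth_gbps):
    """Estimated power (in watts) spent moving data at the given GB/s."""
    return bandwidth_gbps * MW_PER_GBPS / 1000.0

# Hypothetical figures, roughly a phone-class bus vs a desktop-class bus:
for name, gbps in [("mobile-class bus", 15), ("desktop-class bus", 300)]:
    print(f"{name}: {gbps} GB/s -> ~{memory_power_watts(gbps):.1f} W")
```

Under this model a desktop-class memory system spends tens of watts on data movement alone, which a battery-powered device simply cannot afford; hence the pressure toward unified memory and lower bandwidth.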

Quote from "ARM Mali Compute Architecture Fundamentals" [ARM]:

Finally, Midgard is a Fine-Grain Multi-Threaded (FGMT) architecture, such that each core runs its threads in a round-robin fashion, on every cycle switching to the next ready-to-execute thread. What's interesting, each thread has its individual program counter (unlike warp-based designs, where threads in a warp share the same program counter).

The reference to warp-based designs points at NVidia's architecture, where threads are grouped into warps that execute in lockstep. Note that an OpenCL work-group corresponds to a CUDA thread block; the hardware then executes a block as one or more warps.
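To see why per-thread program counters matter, here is a toy cost model (a sketch of the general behavior, not vendor data): on a lockstep warp, a divergent branch forces the whole warp to serialize both paths, while an FGMT core with per-thread program counters lets each thread run only its own path.

```python
# Toy cost model of branch divergence (illustrative, not vendor data).
# Each thread takes a branch whose body costs `then_cost` or `else_cost` cycles.

def lockstep_cycles(takes_then, then_cost, else_cost):
    """Warp in lockstep: if any thread takes a path, the whole warp
    spends those cycles (the other threads are masked but still wait)."""
    cycles = 0
    if any(takes_then):
        cycles += then_cost
    if not all(takes_then):
        cycles += else_cost
    return cycles

def fgmt_cycles(takes_then, then_cost, else_cost):
    """FGMT with per-thread program counters: each thread pays only
    for its own path; the slowest thread bounds the wall-clock cycles."""
    return max(then_cost if t else else_cost for t in takes_then)

branches = [True, False, True, False]     # half the threads diverge
print(lockstep_cycles(branches, 10, 10))  # 20: both paths serialized
print(fgmt_cycles(branches, 10, 10))      # 10: each thread runs one path
```

When all threads agree on the branch, both models cost the same; the lockstep penalty appears only under divergence, which is why ARM and Intel are "friendlier to divergent threads."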

NVidia and AMD are friendlier to lockstep threads

Unlike ARM and Intel, NVidia and ATI designed GPUs before CPUs. They give priority to raw computational power, with large ALUs. Power requirements can be high since the GPU is made to run in a desktop computer. Impressive memory bandwidths are achieved, with the latest generation making use of on-package HBM ("High Bandwidth Memory").

AMD nCU (compute-unit) design in Vega [AMD].

All Roads Lead to Rome

Probably to improve the gaming performance of its GPUs, Intel is increasing the size of its ALUs and decreasing the number of compute units in the latest Iris GPU.

On the opposite side, NVidia adds more warp schedulers, each able to issue multiple instructions at once, potentially improving per-thread performance in general-purpose computing.