Comparing CPU/GPU memory bandwidth on shared-memory systems

Did you know that you could enable tons of examples in pocl?

Setup

Installation of pocl on Ubuntu 16.04 with the AMDSDK2.9 examples:

apt install build-essential cmake git pkg-config make ocl-icd-libopencl1 ocl-icd-dev ocl-icd-opencl-dev libhwloc-dev zlib1g zlib1g-dev clinfo
apt install libclang-5.0-dev clang-5.0 llvm-5.0
mkdir pocl
cd pocl
git clone https://github.com/pocl/pocl.git
cd pocl
git checkout v1.1
cd ..
mkdir build
cd build
[ manually copy AMD-APP-SDK-v2.9-RC-lnx64.tgz to "pocl/examples/AMDSDK2.9" ]
cmake -DCMAKE_INSTALL_PREFIX=/usr -DKERNELLIB_HOST_CPU_VARIANTS=distro -DPOCL_ICD_ABSOLUTE_PATH=OFF -DENABLE_TESTSUITES=AMDSDK2.9 ../pocl/
make -j`nproc`
make install
clinfo
make prepare_examples
cd examples/AMDSDK2.9/src/AMDSDK2.9/AMD-APP-SDK-v2.9-RC-lnx64/samples/opencl/bin/x86_64/

./GlobalMemoryBandwidth -c 1
./GlobalMemoryBandwidth -c 8

./MatrixMultiplication -t -i 500 -x 512 -y 512 -z 512

Results

A+B: Acer laptop
C:   Scaleway C2M
D:   Scaleway X64-15GB
E+F: Intel NUC Skull Canyon
G:   Build server

A: Intel Core i7-5500U CPU @ 2.40GHz
B: Intel HD Graphics 5500 (Broadwell GT2) (NEO driver)
C: Intel Atom CPU  C2750  @ 2.40GHz
D: Intel Xeon CPU D-1531 @ 2.20GHz
E: Intel Core i7-6770HQ CPU @ 2.60GHz
F: Intel Iris Pro Graphics 580 (NEO driver)
G: Intel Xeon CPU E5-2640v4 @ 2.40GHz

                              A       B       C       D       E       F       G
---------------------------------------------------------------------------------
vector instructions     |  AVX2 |       |  SSE2 |  AVX2 |  AVX2 |       |  AVX2 |

vector1   (33554432 bytes = 32 MiB)
GB/s                          A       B       C       D       E       F       G
---------------------------------------------------------------------------------
Read: Linear Uncached   |   5.6 |  29.2 |   6.9 |   7.8 |  16.4 |  36.5 |  34.0 (170)
Read: Linear Cached     |  16.9 |   6.7 |  22.8 |  24.2 |  33.6 | 161.0 |  58.4 (433)
Read: Single            | 165.2 |   8.9 |  88.7 | 211.4 | 300.8 | 859.6 | 115.3 (498)
Read: Random            |   1.1 |   1.9 |   1.2 |   1.4 |   2.3 |  23.9 |   7.5 (8)
Read: UnCombine_unCache |  11.1 |  11.2 |  14.6 |  16.8 |  35.4 |  45.2 |  46.1 (72)
Write: Linear           | 106.4 |  27.7 |  37.2 | 109.2 | 136.1 | 238.4 |  50.0 (50)

vector8   (268435456 bytes = 256 MiB)
GB/s                          A       B     16C       D       E       F       G
---------------------------------------------------------------------------------
Read: Linear Uncached   |  20.5 |  16.4 |  13.3 |  26.2 |  39.2 |  46.5 | 146.5
Read: Linear Cached     | 114.9 |  19.0 | 148.5 | 160.8 | 217.5 | 145.3 | 421.8
Read: Single            | 166.4 | 131.5 | 182.3 | 260.5 | 336.5 | 953.5 | 488.7
Read: Random            |   5.7 |   4.9 |   7.8 |   4.7 |  13.0 |  91.8 |  42.3
Read: UnCombine_unCache |  18.1 |   9.8 |  17.2 |  26.1 |  14.8 |  13.7 |  25.9
Write: Linear           | 143.0 |  14.5 | 112.4 | 221.3 | 304.1 | 160.7 | 426.7

GFlops                        A       B       C       D       E       F       G
---------------------------------------------------------------------------------
matrix multiplication   |    14 |    33 |     5 |    25 |    43 |   212 |   188
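
The GFlops row can be related to run time with a little arithmetic. The sketch below assumes the conventional count of 2*x*y*z floating-point operations for one multiplication with the dimensions passed via -x, -y and -z. How the sample itself counts operations is not shown here, and both the gflops helper and the 1 ms timing are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: GFlops for one matrix multiplication.
# Assumption: one multiplication costs 2*x*y*z floating-point operations
# (one multiply and one add per inner-loop step).
# $1 $2 $3: matrix dimensions (as given to -x -y -z)
# $4:       seconds spent on one multiplication
gflops() {
    awk -v x="$1" -v y="$2" -v z="$3" -v t="$4" \
        'BEGIN { printf "%.1f\n", 2 * x * y * z / t / 1e9 }'
}

# Example with an invented timing of 1 ms per 512x512x512 multiplication.
gflops 512 512 512 0.001
```

Under that assumption a 512x512x512 multiplication is 268435456 operations, so the helper lets you convert between a measured time per iteration and the GFlops numbers in the table.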

What is the "vector size"?

__kernel void read_linear(__global DATATYPE *input, __global DATATYPE *output)
{
        DATATYPE val = (DATATYPE)(0.0f);
        uint gid = get_global_id(0);

        val = val + input[gid + 0];
        val = val + input[gid + 1];
        val = val + input[gid + 2];
        val = val + input[gid + 3];
        /* [...] */
        val = val + input[gid + 31];

        output[gid] = val;
}
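
A rough way to see what the vector size changes is to work out the sizes involved. The sketch below assumes that -c selects the number of float components in DATATYPE (1 for a scalar float, 8 for an 8-wide float vector). That assumption is consistent with the buffer sizes in the tables: 33554432 = 8388608 * 4 * 1 and 268435456 = 8388608 * 4 * 8. The element count is inferred from those sizes, and the helper names are hypothetical.

```shell
#!/bin/sh
# Assumption: DATATYPE is a float vector with "width" components (-c 1 or -c 8).
ELEMENTS=8388608     # inferred: 33554432 bytes / 4 bytes per float
FLOAT_BYTES=4        # size of one float component
READS_PER_ITEM=32    # the kernel reads input[gid + 0] .. input[gid + 31]

# Total size of the buffer for a given vector width.
buffer_bytes() {
    echo $(( ELEMENTS * FLOAT_BYTES * $1 ))
}

# Bytes a single work-item reads in read_linear for a given vector width.
bytes_per_work_item() {
    echo $(( READS_PER_ITEM * FLOAT_BYTES * $1 ))
}

for width in 1 8; do
    echo "vector$width: buffer $(buffer_bytes "$width") bytes," \
         "$(bytes_per_work_item "$width") bytes read per work-item"
done
```

With the same number of elements, each work-item therefore moves eight times as much data at -c 8 as at -c 1.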

Target