What is a device, a compute unit (CU), a work-group, a work-item, a command-queue?
Work-group vs Work-item
- From :
- A multi-core CPU or multiple CPUs (in a multi-socket machine) constitute a single OpenCL device. Separate cores are compute units. A work-group is a collection of related work-items that execute on a single compute unit. The work-items in the group execute the same kernel and share local memory and work-group barriers.
To continue the CPU analogy, a work-group is executed on one core, and its work-items are processed together using vector instructions (such as MMX and SSE). On a CPU, the elements of a work-group usually share the same L1 and L2 caches; on a GPU, local memory is more likely a scratchpad. Later GPU architectures make it possible to use this scratchpad memory as a cache as well, bringing them closer to a CPU design.
It is also worth noting that a condition resulting in divergent execution paths within a work-group (e.g., some work-items taking path A, others taking path B) kills performance on a GPU: work-groups are larger there, and path A and path B cannot be executed in parallel, so more silicon is "wasted" doing nothing.
This analogy also exposes the mismatched terminology between CPU and GPU: a CPU core is totally different from a so-called CUDA core. A CUDA "core" is an ALU lane in CPU terminology, while a CPU core corresponds to a "Streaming Multiprocessor" (SM) in CUDA terminology, a "Compute Unit" (CU) in OpenCL terminology, and an "Execution Unit" (EU) in Intel GPU terminology.
CUDA compute capability table from Wikipedia:
The line "Number of ALU lanes" is what NVidia calls a core.
The "GeForce GTX 1080 Ti" is based on the Pascal architecture with compute capability 6.1, so each compute unit has 128 ALU lanes. 3584 / 128 = 28: the chip is more accurately seen as a 28-core GPU with vector instructions operating on up to 128 32-bit data elements.
Confirmation at :
CL_DEVICE_MAX_COMPUTE_UNITS               28
CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT  28
To fully utilize this GPU, you could launch an OpenCL kernel with 28 work-groups of 128 work-items each.
AMD does the same, but advertises the compute-unit and stream-processor counts together:
The "Radeon Pro", based on the Vega architecture, features a 64-core GPU with vector instructions on up to 64 32-bit data elements (organized in two groups of 32).
Following this logic, one could say that the 72-core Intel Xeon Phi processor with AVX-512 vector instructions, which are 512 bits wide and can process 16 32-bit data elements in one instruction, features 16 * 72 = 1152 "cores" (NVidia-style) or "stream processors" (AMD-style).
Example of clinfo output for an Intel HD Graphics GPU:
Max compute units                 24
[...]
Preferred / native vector sizes
  char     16 / 16
  short     8 / 8
  int       4 / 4
  long      1 / 1
  half      8 / 8   (cl_khr_fp16)
  float     1 / 1
  double    1 / 1   (cl_khr_fp64)
Example of clinfo output for an Intel Core CPU:
Max compute units                 4
[...]
Preferred / native vector sizes
  char      1 / 32
  short     1 / 16
  int       1 / 8
  long      1 / 4
  half      0 / 0   (n/a)
  float     1 / 8
  double    1 / 4   (cl_khr_fp64)
Using 256-bit AVX vector instructions, the CPU can process 32 8-bit chars, or 16 16-bit shorts, or 8 32-bit ints, or 4 64-bit longs. The preferred size is 1, since the CPU can also process those data one-by-one, using the normal x86 instructions.
On the GPU side, the preferred sizes equal the native sizes, because there are no dedicated instructions for processing one element at a time; doing so wastes resources, much like using an SSE vector instruction to process a single int.
Data-parallel vs Task-parallel
OpenCL is strongly oriented toward data parallelism, while typical CPU parallel processing is task parallelism.
Task-parallel programming model – the OpenCL programming model in which a kernel runs as a single work-group containing a single work-item.
Interested in having command-queues on specific compute units, as if they were different OpenCL devices? This is called "Device Fission".
How to size OpenCL kernel execution?
cl_int clEnqueueNDRangeKernel(
    cl_command_queue command_queue,
    cl_kernel        kernel,
    cl_uint          work_dim,
    const size_t    *global_work_offset,
    const size_t    *global_work_size,  /* number of work-groups x work-group size */
    const size_t    *local_work_size,   /* work-group size */
    [...] )
- Wikipedia, CUDA, https://en.wikipedia.org/wiki/CUDA
- Wikipedia, AVX-512, https://en.wikipedia.org/wiki/AVX-512
- https://www.youtube.com/watch?v=_to3xz12Ojo
- http://infocenter.arm.com/help/topic/com.arm.doc.dui0538f/DUI0538F_mali_t600_opencl_dg.pdf
- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0538e/index.html
- https://stackoverflow.com/a/25446487
- http://sc13.supercomputing.org/sites/default/files/PostersArchive/src_posters/spost110s2-file2.pdf
- https://www.youtube.com/watch?v=mo5zVbCg12I