Day 4: v_mem_manager and spinlock

The simple memory manager of v_array [V_ARRAY] needs a lock on the bitfield which keeps the state of the memory chunks (i.e., allocated or free).

How to implement this lock on CPU and on GPU? [1].

Bonus question: in an unified memory environment, how to make it so that the CPU and GPU can use the memory pool at the same time? (e.g., ARM chip with OpenCL-compatible Mali GPU, Intel chip with integrated HD Graphics GPU, AMD fusion chip with Radeon GPU)

spinlock

According to Wikipedia [2]:

[...] a spinlock is a lock which causes a thread trying to acquire it to simply wait in a loop ("spin") while repeatedly checking if the lock is available. Since the thread remains active but is not performing a useful task, the use of such a lock is a kind of busy waiting.

An example of userspace spinlock implementation is found in the documentation of atomic_exchange at [3].

A discussion of the use of atomics in different kinds of mutex is available in at [4], with corresponding implementation at [5].

Atomic operation in C11

The C11 specification presents a new header: stdatomic.h. gcc has its own specific implementation [6].

Example from [7]:

static atomic_flag cat = ATOMIC_FLAG_INIT;
static unsigned volatile counter= 0;

// critical section starts
do {} while (atomic_flag_test_and_set(&cat));
++counter;
// critical section ends
atomic_flag_clear(&cat);

Atomic operation in OpenCL

[8]

atomic_xchg [9]

In OpenCL 2.0, there is an effort to be compatible with C11/C++11 specifications [10].

Note myself

Suggestion for a lock-free solution: loop over the bitfield and try to set a bit to one using atomic compare-and-swap function (aka CAS) directly on the words of the bitfield. If the previous value changed during the call, the CAS will fail, and the next bitfield word is tried.

Inspired from this [11] [12]:

Several other atomic operations can be used to provide mutual exclusion of data structures; most notable of these is compare-and-swap (CAS). CAS can be used to achieve wait-free mutual exclusion for any shared data structure by creating a linked list where each node represents the desired operation to be performed. CAS is then used to change the pointers in the linked list during the insertion of a new node. Only one process can be successful in its CAS; all other processes attempting to add a node at the same time will have to try again. Each process can then keep a local copy of the data structure, and upon traversing the linked list, can perform each operation from the list on its local copy.

The C11 atomic for CAS is atomic_compare_exchange [13]. There are weak and strong, and implicit and explicit implementations.

When a compare-and-exchange is in a loop, the weak version will yield better performance on some platforms.

Be careful about what is compared [14].

explicit gives the possibility to choose a specific memory order [15]:

On strongly-ordered systems (x86, SPARC TSO, IBM mainframe), release-acquire ordering is automatic for the majority of operations. No additional CPU instructions are issued for this synchronization mode, only certain compiler optimizations are affected (e.g. the compiler is prohibited from moving non-atomic stores past the atomic store-release or perform non-atomic loads earlier than the atomic load-acquire). On weakly-ordered systems (ARM, Itanium, PowerPC), special CPU load or memory fence instructions have to be used.

Legacy atomic operation in v_array 2010 (CUDA)

do {} while(atomicCAS(&data->lock_i, 0, 1));

[16] [17]