Symmetric Multi-Processing

Race conditions

Race condition: resource counter release

void release_resource()
{
    counter--;

    if (!counter)
        free_resource();
}

Race condition scenario

 

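The race arises because counter-- compiles to a separate load, decrement and store, so two CPUs can both read the same old value, both store a non-zero result, and free_resource() is never called. The same lost update can be reproduced in user space; a minimal pthreads sketch (hypothetical names, a user-space analogue of the kernel code above):

#include <pthread.h>
#include <stdio.h>

static int counter = 2;         /* two users of the resource */
static int freed;               /* how many times we "freed" it */

static void free_resource(void)
{
    freed++;
}

static void *release_resource(void *arg)
{
    /* non-atomic read-modify-write: load, decrement, store */
    counter--;

    if (!counter)
        free_resource();

    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, release_resource, NULL);
    pthread_create(&t2, NULL, release_resource, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* with an unlucky interleaving both threads store 1, so counter
     * never reaches 0 and free_resource() is never called */
    printf("counter=%d, freed %d time(s)\n", counter, freed);
    return 0;
}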

Avoiding race conditions

Linux kernel concurrency sources

Atomic operations

Using atomic_dec_and_test() to implement resource counter release

void release_resource()
{
    if (atomic_dec_and_test(&counter))
        free_resource();
}
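Here counter is an atomic_t rather than a plain integer. A minimal sketch of how it might be declared and acquired (the initial value and get_resource() are illustrative, not part of the example above):

#include <linux/atomic.h>

static atomic_t counter = ATOMIC_INIT(2);   /* e.g. two initial users */

void get_resource(void)
{
    atomic_inc(&counter);   /* atomic read-modify-write, no lock needed */
}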

Atomic operations may not be atomic on SMP systems

 


Fixing atomic operations for SMP systems (x86)

 

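On a uniprocessor an instruction such as dec cannot be interleaved with the local CPU's own interrupt handlers, but on SMP two cores can interleave their read-modify-write cycles. On x86 the fix is the lock prefix, which keeps the cache line exclusively owned for the whole read-modify-write. A sketch with GCC inline assembly (illustrative only; in the kernel this is hidden behind atomic_dec() and friends):

/* not atomic on SMP: another core can sneak in between load and store */
static inline void dec(int *v)
{
    asm volatile ("decl %0" : "+m" (*v));
}

/* atomic on SMP: the lock prefix makes the whole read-modify-write
 * indivisible with respect to the other cores */
static inline void smp_safe_dec(int *v)
{
    asm volatile ("lock; decl %0" : "+m" (*v));
}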

Synchronization with interrupts (x86)

#define local_irq_disable() \
    asm volatile ("cli" : : : "memory")

#define local_irq_enable() \
    asm volatile ("sti" : : : "memory")

#define local_irq_save(flags) \
    asm volatile ("pushf ; pop %0" \
                  : "=g" (flags) \
                  : /* no input */ \
                  : "memory"); \
    asm volatile ("cli" : : : "memory")

#define local_irq_restore(flags) \
    asm volatile ("push %0 ; popf" \
                  : /* no output */ \
                  : "g" (flags) \
                  : "memory", "cc")

Spin Lock Implementation Example (x86)

spin_lock:
    lock bts [my_lock], 0
    jc spin_lock

/* critical section */

spin_unlock:
    mov [my_lock], 0

bts dst, src - bit test and set; it copies bit src of the dst memory operand into the carry flag and then sets that bit:

CF <- dst[src]
dst[src] <- 1

Lock Contention

Cache Thrashing

Cache thrashing occurs when multiple cores repeatedly read and write the same memory location, causing the corresponding cache line to bounce between their caches and resulting in excessive cache misses.

Since spin locks continuously access memory while they are contended, cache thrashing is a common occurrence during lock contention due to the way cache coherency is implemented.
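The effect does not even require a lock: if two cores keep writing to variables that share a cache line, the line bounces between their caches on every write. A sketch of the usual mitigation, giving each hot variable its own cache line (the 64-byte line size is an assumption; kernel code uses the ____cacheline_aligned annotations for this):

/* both counters share one cache line: each write by one core
 * invalidates the line in the other core's cache */
struct counters_bad {
    unsigned long cpu0_hits;
    unsigned long cpu1_hits;
};

/* one cache line per counter: each core keeps its own line in the
 * Exclusive/Modified state and writes stay local */
struct counters_good {
    unsigned long cpu0_hits __attribute__((aligned(64)));
    unsigned long cpu1_hits __attribute__((aligned(64)));
};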

Synchronized caches and memory

 


Unsynchronized caches and memory

 


Cache Coherency Protocols

Bus snooping is simpler but it performs poorly when the number of cores goes beyond 32-64.

Directory based cache coherence protocols scale much better (up to thousands of cores) and are usually used in NUMA systems.

MESI Cache Coherence Protocol

MESI State Transitions

Cache thrashing due to spin lock contention

 


Optimized spin lock (KeAcquireSpinLock)

 

spin_lock:
    rep ; nop                 /* PAUSE: tell the CPU we are spin-waiting */
    test lock_addr, 1         /* read-only check, spins on the cached value */
    jnz spin_lock             /* still locked: keep spinning locally */
    lock bts lock_addr, 0     /* looks free: try to take it atomically */
    jc spin_lock              /* someone else took it first, start over */
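The same optimization in C is the classic test-and-test-and-set: spin read-only on the cached value (with the pause hint) and attempt the atomic write only when the lock looks free. A sketch with compiler builtins (illustrative; __builtin_ia32_pause() emits the same rep ; nop as above):

static char my_lock;

static void spin_lock_optimized(void)
{
    for (;;) {
        /* read-only spin: hits the local cache, no bus traffic */
        while (__atomic_load_n(&my_lock, __ATOMIC_RELAXED))
            __builtin_ia32_pause();         /* PAUSE / rep ; nop */

        /* lock looked free: try to take it with one atomic RMW */
        if (!__atomic_test_and_set(&my_lock, __ATOMIC_ACQUIRE))
            return;
    }
}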

Queued Spin Locks

 


Process and Interrupt Handler Synchronization Deadlock

Interrupt Synchronization for SMP

Bottom-Half Synchronization for SMP

Preemption

 

Preemption is configurable: when enabled it provides better latency and response times, while when disabled it provides better throughput.

Preemption is disabled by spin locks, and it can be manually disabled as well by core kernel code (via preempt_disable()).
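A minimal sketch of manual preemption control around a short per-CPU critical section, the typical pattern in core kernel code (the per-CPU counter and function are hypothetical):

#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(unsigned long, hit_count);

void count_hit(void)
{
    preempt_disable();          /* we cannot be migrated to another CPU now */
    __this_cpu_inc(hit_count);  /* safe: only this CPU touches its copy here */
    preempt_enable();           /* may reschedule if preemption became pending */
}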

Preemption and Bottom-Half Masking

/* bit-field widths within preempt_count */
#define PREEMPT_BITS      8
#define SOFTIRQ_BITS      8
#define HARDIRQ_BITS      4
#define NMI_BITS          1

#define preempt_disable() preempt_count_inc()

#define local_bh_disable() add_preempt_count(SOFTIRQ_OFFSET)

#define local_bh_enable() sub_preempt_count(SOFTIRQ_OFFSET)

#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK))

#define in_interrupt() irq_count()

asmlinkage void do_softirq(void)
{
    if (in_interrupt()) return;
    ...
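Process-context code that shares data with a softirq or timer on the local CPU uses local_bh_disable()/local_bh_enable() as the masking primitive shown above. A minimal sketch (the packet counter and function are hypothetical); on SMP the softirq may run on another CPU, so this is usually combined with a spin lock via spin_lock_bh():

#include <linux/bottom_half.h>

static unsigned long packet_count;

void update_stats_from_process(void)
{
    local_bh_disable();     /* softirqs cannot run on this CPU now */
    packet_count++;
    local_bh_enable();      /* runs any softirqs raised in the meantime */
}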

Mutexes

mutex_lock() fast path

void __sched mutex_lock(struct mutex *lock)
{
  might_sleep();

  if (!__mutex_trylock_fast(lock))
    __mutex_lock_slowpath(lock);
}

static __always_inline bool __mutex_trylock_fast(struct mutex *lock)
{
  unsigned long curr = (unsigned long)current;

  if (!atomic_long_cmpxchg_acquire(&lock->owner, 0UL, curr))
    return true;

  return false;
}
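Since the fast path can fall back to sleeping in the slow path, mutex_lock() may only be used in process context; might_sleep() warns if it is called from atomic context in debug builds. A typical usage sketch (the configuration variable and function are hypothetical):

#include <linux/mutex.h>

static DEFINE_MUTEX(cfg_mutex);
static int cfg_value;

void set_cfg(int val)
{
    mutex_lock(&cfg_mutex);     /* may sleep while waiting for the owner */
    cfg_value = val;
    mutex_unlock(&cfg_mutex);
}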

mutex_lock() slow path

...
  spin_lock(&lock->wait_lock);
...
  /* add waiting tasks to the end of the waitqueue (FIFO): */
  list_add_tail(&waiter.list, &lock->wait_list);
...
  waiter.task = current;
...
  for (;;) {
    if (__mutex_trylock(lock))
      goto acquired;
  ...
    spin_unlock(&lock->wait_lock);
  ...
    set_current_state(state);
    spin_lock(&lock->wait_lock);
  }
  spin_lock(&lock->wait_lock);
acquired:
  __set_current_state(TASK_RUNNING);
  mutex_remove_waiter(lock, &waiter, current);
  spin_lock(&lock->wait_lock);
...

mutex_unlock() fast path

void __sched mutex_unlock(struct mutex *lock)
{
  if (__mutex_unlock_fast(lock))
    return;
  __mutex_unlock_slowpath(lock, _RET_IP_);
}

static __always_inline bool __mutex_unlock_fast(struct mutex *lock)
{
  unsigned long curr = (unsigned long)current;

  if (atomic_long_cmpxchg_release(&lock->owner, curr, 0UL) == curr)
    return true;

  return false;
}

void __mutex_lock_slowpath(struct mutex *lock)
{
...
  if (__mutex_waiter_is_first(lock, &waiter))
          __mutex_set_flag(lock, MUTEX_FLAG_WAITERS);
...

mutex_unlock() slow path

...
spin_lock(&lock->wait_lock);
if (!list_empty(&lock->wait_list)) {
  /* get the first entry from the wait-list: */
  struct mutex_waiter *waiter;
  waiter = list_first_entry(&lock->wait_list, struct mutex_waiter,
                            list);
  next = waiter->task;
  wake_q_add(&wake_q, next);
}
...
spin_unlock(&lock->wait_lock);
...
wake_up_q(&wake_q);

Per CPU data

Out of Order Compiler Generated Code

C code            Compiler generated code
----------------  -----------------------
a = 1;            MOV   R10, 1
b = 2;            MOV   R11, 2
                  STORE R11, b
                  STORE R10, a
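If the order of the two stores matters (for example when b acts as a flag telling another observer that a is valid), the compiler must be told not to reorder them. A sketch using a compiler barrier (the kernel's barrier() macro expands to essentially this; ordering across CPUs additionally needs smp_wmb()/smp_rmb()):

#define barrier() asm volatile("" : : : "memory")

int a, b;

void producer(void)
{
    a = 1;
    barrier();      /* the compiler may not move the store to b above this */
    b = 2;
}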

Barriers

Read Copy Update (RCU)

Removal and Reclamation

RCU List Delete

 


RCU list APIs cheat sheet

/* list traversal */
rcu_read_lock();
list_for_each_entry_rcu(i, head, list) {
  /* no sleeping, blocking calls or context switch allowed */
}
rcu_read_unlock();


/* list element delete  */
spin_lock(&lock);
list_del_rcu(&node->list);
spin_unlock(&lock);
synchronize_rcu();
kfree(node);

/* list element add  */
spin_lock(&lock);
list_add_rcu(&node->list, head);
spin_unlock(&lock);