Linux Heap, Contention in free()

David Boreham

Introduction

The user space heap (implemented by the functions malloc(), free() and realloc()) can be critical to the performance of many applications. In addition, on SMP hardware, where multiple threads within a single process make heavy use of the heap, the potential for contention arises. This problem, long recognized[1], has been largely addressed in the current Linux heap implementation[2]. However, the potential for contention still exists inside the free() function. This paper describes that contention potential and a code modification which aims to remove it.

The Problem

The heap code resides inside the glibc component and is packaged in the libc.so.x shared library. The current implementation of the heap uses multiple independent sub-heaps called arenas. Each arena has its own mutex for concurrency protection. Thus, if there are sufficient arenas within a process's heap, and a mechanism to distribute the threads' heap accesses evenly between them, contention for the mutexes should be minimal. It turns out that this works well for allocations. In malloc(), a test (trylock) is made to see if the mutex of the current thread's target arena is free. If so, the arena is now locked and the allocation proceeds. If the mutex is busy, each remaining arena is tried in turn and used if its mutex is not busy. In the event that no arena can be locked without blocking, a fresh arena is created. This arena is by definition not already locked, so the allocation can proceed without blocking. Lastly, the ID of the arena last used by a thread is retained in thread-local storage and subsequently used as the first arena to try when malloc() is next called by that thread. Therefore all calls to malloc() can proceed without blocking.
However, when it comes time to free a block, the heap implementation simply attempts to acquire the relevant arena mutex. By its design, the heap implementation must free a block back to the arena from which it was allocated. If the arena mutex cannot be acquired, the thread blocks. This introduces a potential for contention between a thread allocating from one arena and one or more other threads attempting to free to the same arena. This syndrome is in fact observed in some threaded applications and can lead to a significant loss in performance. The effect is most noticeable where an application's design leads to memory blocks being freed by threads other than the one which made the corresponding allocation. This can happen, for example, where a pool of worker threads all access a common in-memory data cache whose objects are allocated from the heap.

The Solution

If, rather than waiting for the target arena's mutex to become unlocked, we were able to store the address of the block and free it later, then free() could proceed without blocking. Some other thread which later acquires the arena lock (without waiting) could then perform the actual work of freeing any stored block addresses. This notion of a free queue or free buffer is not trivial to implement, since it is necessary to ensure that the enqueuing of a block is done without blocking (otherwise the point of contention has simply been moved to the queue).
One way to implement a non-blocking queue of limited size is to use a single memory word and compare-exchange instructions to ensure atomic access to the queue elements. Each bit in the memory word controls access to a single queue or buffer entry. Hence on a 32-bit architecture, the queue can be up to 32 elements deep. Note that a short queue and an unordered de-queuing process are acceptable for this particular problem. This considerably simplifies the implementation.

Enqueuing process:

  1. Examine the queue word in memory, looking for the first clear bit, scanning from least significant towards most significant.
  2. If no clear bit is found, the enqueue fails and we revert to the blocking behavior.
  3. If a clear bit is found, attempt to set it atomically, using compare-and-swap in a spin loop.
  4. If we fail to set the bit because it is now already set, proceed to try the next clear bit.
  5. If we succeed in setting a bit, store the address of the block to be freed in the queue slot at the offset matching the bit position.
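The enqueuing steps above can be sketched as follows. This is a hedged illustration: it uses C11 atomics where the actual patch pastes in IA-32 compare-and-swap code, and the names (queue_bits, queue_slots, enqueue_free) are invented for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_DEPTH 32

static _Atomic uint32_t queue_bits;             /* bit i set => slot i claimed */
static void *_Atomic queue_slots[QUEUE_DEPTH];  /* deferred block addresses */

bool enqueue_free(void *block) {
    for (;;) {
        uint32_t bits = atomic_load(&queue_bits);
        if (bits == UINT32_MAX)
            return false;            /* no clear bit: caller must block */
        /* Find the first clear bit, LSB towards MSB. */
        int i = 0;
        while (bits & (1u << i))
            i++;
        /* Try to claim the bit atomically; on failure another thread
           changed the word, so re-examine it and try again. */
        if (atomic_compare_exchange_weak(&queue_bits, &bits,
                                         bits | (1u << i))) {
            atomic_store(&queue_slots[i], block);
            return true;
        }
    }
}
```

Note the ordering: the bit is claimed before the slot is written, so a dequeuer can momentarily see a set bit with a null slot. This is why the dequeuing process must check the slot for null.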

Dequeuing process:

  1. While holding the arena lock, examine the queue word. If it is non-zero there are queued blocks to free; otherwise there is no work to do.
  2. Examine each queue word bit in turn, beginning at the most significant bit and proceeding towards the least significant.
  3. If a bit is set, read the corresponding queue slot. If the slot is non-null, free the block, then set the slot back to null.
  4. Finally, reset the queue word bit to zero, using compare-and-swap in a spin loop.
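The dequeuing steps can be sketched in the same style, again with C11 atomics standing in for the patch's compare-and-swap and with illustrative names; free_to_arena is a stand-in for the real arena free path. A set bit with a null slot means an enqueue is still in flight, so that entry is simply skipped and picked up on a later drain.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 32

static _Atomic uint32_t queue_bits;
static void *_Atomic queue_slots[QUEUE_DEPTH];

static void free_to_arena(void *block) { (void)block; /* real work elided */ }

/* Called while holding the arena lock. */
void drain_free_queue(void) {
    if (atomic_load(&queue_bits) == 0)
        return;                           /* nothing deferred */
    /* Scan MSB towards LSB, as described above. */
    for (int i = QUEUE_DEPTH - 1; i >= 0; i--) {
        if (!(atomic_load(&queue_bits) & (1u << i)))
            continue;
        void *block = atomic_load(&queue_slots[i]);
        if (block != NULL) {              /* null => enqueue still in flight */
            free_to_arena(block);
            atomic_store(&queue_slots[i], NULL);
            /* Clear the ownership bit; spin on CAS since other bits
               may be changing concurrently. */
            uint32_t bits = atomic_load(&queue_bits);
            while (!atomic_compare_exchange_weak(&queue_bits, &bits,
                                                 bits & ~(1u << i)))
                ;
        }
    }
}
```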

Blocks are enqueued from free(), whenever trylock() fails to acquire the arena mutex at the first attempt.
Queued blocks are freed in malloc() and free(), whenever the arena lock is about to be released.
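Putting the hook points together, the modified free() path might look like the following toy model. This is an assumption-laden sketch, not the patch itself: pthreads stand in for the arena mutex, the names (arena_t, arena_free, try_enqueue, drain) are invented, free() stands in for the real arena bookkeeping, and live_blocks exists only so the behavior can be observed.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define QUEUE_DEPTH 32

typedef struct {
    pthread_mutex_t mutex;
    _Atomic uint32_t qbits;                /* bit i set => qslots[i] claimed */
    void *_Atomic qslots[QUEUE_DEPTH];
    int live_blocks;                       /* stands in for real arena state */
} arena_t;

static int try_enqueue(arena_t *ar, void *block) {
    for (;;) {
        uint32_t bits = atomic_load(&ar->qbits);
        if (bits == UINT32_MAX)
            return 0;                      /* buffer full */
        int i = 0;
        while (bits & (1u << i))
            i++;
        if (atomic_compare_exchange_weak(&ar->qbits, &bits, bits | (1u << i))) {
            atomic_store(&ar->qslots[i], block);
            return 1;
        }
    }
}

static void drain(arena_t *ar) {           /* caller holds ar->mutex */
    for (int i = QUEUE_DEPTH - 1; i >= 0; i--) {
        if (!(atomic_load(&ar->qbits) & (1u << i)))
            continue;
        void *b = atomic_load(&ar->qslots[i]);
        if (b) {
            free(b);
            ar->live_blocks--;
            atomic_store(&ar->qslots[i], NULL);
            uint32_t bits = atomic_load(&ar->qbits);
            while (!atomic_compare_exchange_weak(&ar->qbits, &bits,
                                                 bits & ~(1u << i)))
                ;
        }
    }
}

void arena_free(arena_t *ar, void *block) {
    if (pthread_mutex_trylock(&ar->mutex) != 0) {
        if (try_enqueue(ar, block))
            return;                        /* deferred; no blocking */
        pthread_mutex_lock(&ar->mutex);    /* queue full: fall back to blocking */
    }
    free(block);
    ar->live_blocks--;
    drain(ar);                             /* release any deferred blocks too */
    pthread_mutex_unlock(&ar->mutex);
}
```

In the model, the only time arena_free() blocks is when the mutex is busy and all 32 queue slots are already claimed.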

Note that strictly speaking the data structure isn't a queue (there isn't a strict ordering of elements). It's perhaps more accurately termed a scratchpad. Please excuse my sloppy terminology.

Test Results

Using the test program below, on a 4 CPU Xeon III 700MHz machine running Kernel 2.4.18 and glibc 2.2.4, these test results were obtained:

  # threads (-w n)   Frees/s, unpatched   Frees/s, patched   Speedup X
  1                  411780               594225             1.44
  2                  351896               837319             2.38
  3                  356521               606856             1.70
  4                  355365               492932             1.39
  8                  136447               336099             2.46

(Tests conducted with a commercial server application show similar results.)

Code Patches

Notes:
  1. These patches are experimental. They target glibc version 2.2.4 (malloc.c CVS version 1.87).
  2. The code requires the use of atomic compare-and-swap instructions. There is a C function giving portable access to this instruction (where it is present in the ISA) within libpthread (part of the Linux glibc distribution). However, the function is private to libpthread. malloc.c is in libc and therefore does not have access to __compare_and_swap(). Rather than changing the export status of __compare_and_swap() within libpthread, for this experimental patch I have pasted the IA-32 source code for the function into malloc.c. A production patch would need to provide linkage to __compare_and_swap().
  3. There may be the potential for a pathological deadlock condition due to the spin loops being unbounded. No such syndrome has been observed in testing, but a production version of this code should adequately address this issue.

Get the patch file here

Test Program

This test program exercises contention in the heap between threads calling malloc() and threads calling free(), passing blocks between the threads.

Get the test program code here.

References

[1] Hoard : A Scalable Memory Allocator for Multithreaded Applications, by Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. The Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX). Cambridge, MA, November 2000.
[2] malloc() Performance in a Multithreaded Linux Environment, by Chuck Lever and David Boreham.