Last year, I published a Rust library called basedrop, which implements a memory reclamation system tailored to the constraints of real-time audio scenarios. The purpose of basedrop is to make it easy to share dynamically allocated memory with a real-time audio thread while ensuring that no allocations or deallocations happen on that thread. This is accomplished by providing a set of smart pointers (analogous to `Box` and `Arc` from the Rust standard library) which do not directly free their associated allocation when dropped, but instead automatically push it onto a lock-free queue to be collected later on another thread.

Basedrop's design has some compelling benefits: it frees you from writing code by hand every time you want to transfer an object to another thread to be freed, and if you restrict yourself to its vocabulary of smart pointers, it eliminates the possibility of accidentally dropping an allocation on the real-time thread (a mistake which can easily remain invisible if you don't have something like `assert_no_alloc` to catch it). However, after talking with some developers trying to use basedrop in real projects, it became clear to me that these benefits come at the cost of a somewhat opinionated API, which makes it difficult to integrate with certain program architectures. I decided that a stripped-down version of the core linked-list queue would probably have some value on its own, and the end result was the llq crate.

A central piece of basedrop's design is the `Node<T>` type, which represents a node that can potentially be added to the collector queue's linked list. Each of basedrop's smart pointers allocates a `Node<T>` on the heap at creation time, so when the time comes to mark the contained object as ready for deallocation, that node already exists, stored inline as part of the original allocation, and can simply be linked into the queue. This makes it possible to send an object back from the real-time thread to be reclaimed without performing any allocator operations.

The llq crate extracts just that core functionality, of a wait-free linked-list queue with preallocated nodes, and presents it in an unopinionated way. With llq, you can create some nodes:

```rust
use llq::{Node, Queue};

let x = Node::new(0);
let y = Node::new(1);
let z = Node::new(2);
```

push them onto a queue:

```rust
let (mut tx, mut rx) = Queue::<usize>::new().split();

tx.push(x);
tx.push(y);
tx.push(z);
```

pull them off the other end:

```rust
let x = rx.pop().unwrap();
let y = rx.pop().unwrap();
let z = rx.pop().unwrap();
```

and even reuse them with a separate queue:

```rust
let (mut tx2, mut rx2) = Queue::<usize>::new().split();

tx2.push(x);
tx2.push(y);
tx2.push(z);
```

and none of the above `push` or `pop` operations will ever allocate or free memory, lock a mutex, or even enter an unbounded compare-exchange loop.

It's worth noting that essentially the only synchronization operations in the entire source of llq are a single acquire load in the body of `pop` and a release store in the body of `push`. I consider it a pretty compelling demonstration of Rust's type system and safety guarantees that a concurrent data structure with such minimal synchronization overhead can still have a data-race-free public API (assuming llq's implementation is bug-free, of course!).
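For a concrete picture, here is a minimal single-threaded sketch of that kind of node-based queue, in the spirit of the 1024cores SPSC design llq is based on. This is an illustration, not llq's actual source: all names are ad hoc, and a real implementation would split the producer and consumer ends into separate `Send` handles and reclaim remaining nodes on drop.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// A preallocated queue node: the payload plus a link to the next node.
struct Node<T> {
    next: AtomicPtr<Node<T>>,
    data: Option<T>,
}

impl<T> Node<T> {
    fn new(value: T) -> Box<Node<T>> {
        Box::new(Node { next: AtomicPtr::new(ptr::null_mut()), data: Some(value) })
    }

    // The stub node carries no data; it gives `head` and `tail`
    // something to point at when the queue is empty.
    fn stub() -> Box<Node<T>> {
        Box::new(Node { next: AtomicPtr::new(ptr::null_mut()), data: None })
    }
}

// Unbounded node-based queue with a stub node. `head` is touched only by
// the consumer and `tail` only by the producer; the only shared state is
// the `next` field of each node. (For brevity, this sketch leaks any
// nodes still in the queue when it is dropped.)
struct Queue<T> {
    head: *mut Node<T>, // consumer end: points at the current stub
    tail: *mut Node<T>, // producer end: the last node linked in
}

impl<T> Queue<T> {
    fn new() -> Self {
        let stub = Box::into_raw(Node::stub());
        Queue { head: stub, tail: stub }
    }

    // Producer: link a preallocated node in with a single release store.
    // No allocation, no locks, no compare-exchange loop.
    fn push(&mut self, node: Box<Node<T>>) {
        let node = Box::into_raw(node);
        unsafe { (*self.tail).next.store(node, Ordering::Release) };
        self.tail = node;
    }

    // Consumer: a single acquire load tells us whether a node has been
    // published. The popped value rides out in the previous stub node,
    // and the node it was stored in becomes the new stub.
    fn pop(&mut self) -> Option<Box<Node<T>>> {
        unsafe {
            let next = (*self.head).next.load(Ordering::Acquire);
            if next.is_null() {
                return None;
            }
            let mut node = Box::from_raw(self.head);
            node.data = (*next).data.take();
            node.next = AtomicPtr::new(ptr::null_mut());
            self.head = next;
            Some(node)
        }
    }
}
```

Since both ends need `&mut self` here, this sketch only works on one thread; llq's `split` instead hands each thread exclusive ownership of its own end of the queue.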

For reference, the queue design in llq is based on a particular unbounded SPSC queue design from 1024cores. There's also an implementation of a similar queue design in the internals of the Rust standard library.

It's important to note that llq is designed for a specific, uncommon set of requirements, and several aspects of its design are more or less the opposite of what one would want out of a general-purpose channel for communicating between threads. llq is designed for the scenario where

- you don't want the channel to interact with the OS scheduler at all (no blocking)
- you want to ensure that one thread never allocates or deallocates in the process of sending or receiving items
- you want sending to be infallible for one thread (e.g., for returning objects from the audio thread to be deallocated)

A general-purpose queue should probably

- have blocking semantics and interact with the system scheduler, when receiving from an empty channel or sending to a full channel
- store items inline together, for less pointer-chasing and less memory usage overall
- have a bounded capacity, for backpressure

So, if you're just looking for a generic SPSC queue, there's a good chance llq is not what you want. But if you're implementing real-time audio software, or otherwise facing a situation where some threads in your program have much stricter latency requirements than others, llq might be worth a look.

You can check it out over on GitHub or on crates.io.



In real-time audio, deadlines are critical. Your code has on the order of several milliseconds to fill a buffer with samples to be shipped off to the DAC, milliseconds which it may be sharing with a number of other audio plugins. If your code takes too long to produce those samples, there are no second chances; they simply won't get played, and the user will hear an objectionable glitch or stutter instead.

In order to prevent this, real-time audio code must avoid performing any operations that can block the audio thread for an unbounded or unpredictable amount of time. Such operations include file and network I/O, memory allocation and deallocation, and the use of locks to synchronize with non-audio threads; these operations are not considered "real-time safe." Instead, operations like I/O and memory allocation should be performed on other threads, and synchronization should be performed using primitives that are wait-free for the audio thread. A more thorough overview of the subject can be found in Ross Bencina's now-classic blog post "Time Waits for Nothing".

Given that audio software generally does need to allocate memory and make use of it from the audio thread, the question becomes how to accomplish this in a manageable and efficient way while subject to the above constraints. Basedrop is my attempt at providing one answer to this question.

Consider a simple scenario: we have a buffer of samples stored in a `Vec<f32>`, possibly synthesized or loaded from disk, and we would like to use it from the audio thread. As an initial sketch of a solution, we could use a wait-free bounded-capacity SPSC channel (such as the `rtrb` crate) to send the buffer over to the audio thread, and then when we're done using it and want to reclaim the memory, we could send it back to a non-real-time thread over another SPSC channel to be freed.

In simple cases, this solution works well. However, it has drawbacks as an application grows in complexity. For instance, if a large number of allocations are being transferred to and from the audio thread, the fixed-capacity channel for returning allocations can fill up. Since it is not acceptable to block the audio thread in this case, the application needs to ensure either that the channel is polled frequently enough to keep up, that the channel always has worst-case capacity (using a more complex dynamically allocated design), or that the audio thread can continue without error if it is not currently possible to send back an allocation. Additionally, this solution relies on programmer discipline to ensure that allocations are always sent back to be freed, and Rust's RAII design makes mistakes in this regard largely invisible. Diagnostic tools like the `assert_no_alloc` crate can go a long way towards detecting such mistakes, but it would be nice to have a guarantee at compile time.
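The initial sketch can be made concrete with a small illustrative round trip. Note the stand-in: std's `mpsc` channels allocate and block, so they are not real-time safe themselves; a real implementation would use a wait-free ring buffer such as `rtrb`. The structure of the pattern is the same either way.

```rust
use std::sync::mpsc::channel;
use std::thread;

// Send a buffer to the "audio" thread, use it there, and send it back
// to the main thread to be deallocated. (std mpsc stands in for a
// wait-free SPSC ring buffer purely for illustration.)
fn round_trip(buffer: Vec<f32>) -> f32 {
    let (to_audio, audio_rx) = channel::<Vec<f32>>();
    let (to_main, main_rx) = channel::<Vec<f32>>();

    let audio = thread::spawn(move || {
        let buffer = audio_rx.recv().unwrap();
        let sum: f32 = buffer.iter().sum(); // "use" the samples
        to_main.send(buffer).unwrap();      // return it instead of dropping it
        sum
    });

    to_audio.send(buffer).unwrap();
    let returned = main_rx.recv().unwrap();
    drop(returned); // deallocation happens here, on the non-real-time thread
    audio.join().unwrap()
}
```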

Basedrop's solution is to replace the fixed-capacity ring buffer for returning allocations with an MPSC linked-list queue whose nodes are created at allocation time for (and stored inline next to) any piece of memory intended to be shared with the audio thread. When the audio thread is ready to release a piece of memory for reclamation, the corresponding node can be pushed onto the queue in an allocation-free, wait-free operation. This pattern is encapsulated by a pair of smart pointers, `Owned<T>` and `Shared<T>`, analogous to `Box<T>` and `Arc<T>`, which push their contents onto the queue for deferred reclamation rather than dropping them directly. The queue can then be processed periodically on another thread using basedrop's `Collector` type.
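The deferred-drop idea can be sketched in a few lines. The following is an illustration of the pattern rather than basedrop's actual API: the names are ad hoc, and a std `mpsc` channel stands in for basedrop's allocation-free queue.

```rust
use std::ops::Deref;
use std::sync::mpsc::{channel, Sender};

// A smart pointer whose Drop sends its allocation to a collector
// instead of freeing it in place.
struct Owned<T: Send + 'static> {
    value: Option<Box<T>>,
    collector: Sender<Box<dyn Send>>,
}

impl<T: Send + 'static> Owned<T> {
    fn new(value: T, collector: Sender<Box<dyn Send>>) -> Self {
        Owned { value: Some(Box::new(value)), collector }
    }
}

impl<T: Send + 'static> Deref for Owned<T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.value.as_ref().unwrap()
    }
}

impl<T: Send + 'static> Drop for Owned<T> {
    fn drop(&mut self) {
        // Hand the allocation off to the collector; nothing is freed
        // here. (If the collector is gone, the send fails and the box
        // is dropped in place — a case real basedrop rules out.)
        let boxed: Box<dyn Send> = self.value.take().unwrap();
        let _ = self.collector.send(boxed);
    }
}
```

A collector thread on the receiving end of the channel can then drop the boxes it receives at its leisure, far away from any real-time deadline.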

This system has the advantage that it is impossible for the reclamation channel to become full (short of a full-on OOM). It is also impossible to forget to send something back to be collected, as long as it was initially wrapped in an `Owned<T>` or `Shared<T>`. `Shared<T>` in particular opens up exciting possibilities for sharing immutable and persistent data structures between audio and non-audio threads in ways that would be cumbersome or impossible with the manual message-passing approach.

## `SharedCell`

Basedrop provides another primitive for sharing memory with the audio thread, called `SharedCell<T>`. `SharedCell<T>` acts as a thread-safe mutable memory location for storing `Shared<T>` pointers, providing `get`, `set`, and `replace` methods (much like `Cell`) for fetching and updating the contents. I envision this being used as a way for a non-real-time thread to atomically publish data which can then be immutably observed by the real-time audio thread.

The main difficulty in implementing this pattern in a lock-free way lies in the fact that getting a copy of a reference-counted pointer actually consists of two steps: first, fetching the actual pointer, and then incrementing the reference count. In between these two steps, writers must not be allowed to replace the pointer with a new value, decrement the reference count for the previous value to zero, and then free its referent, as this would result in a use-after-free for the reader. There are various possible solutions to this problem with different tradeoffs.

The approach taken by `SharedCell<T>` is to keep a reader count alongside the stored pointer. Readers increment this count while fetching the pointer and only decrement it after successfully incrementing the pointer's reference count. Writers, in turn, after replacing the stored pointer, spin until the count is observed to be zero before they are allowed to move on and possibly decrement the reference count. This scheme is designed to be low-cost and non-blocking for readers, while being somewhat higher-overhead for writers, which I deem the appropriate tradeoff for real-time audio, where the reader (the audio thread) has much tighter latency deadlines and executes much more often than the writer.
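Here is an illustrative sketch of that protocol. It is not basedrop's source: the atomic orderings are conservatively `SeqCst` for clarity, and for brevity the cell leaks its final value rather than implementing `Drop`.

```rust
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};
use std::sync::Arc;

struct SharedCell<T> {
    ptr: AtomicPtr<T>,    // raw pointer into the currently stored Arc<T>
    readers: AtomicUsize, // readers between their load and refcount bump
}

impl<T> SharedCell<T> {
    fn new(value: Arc<T>) -> Self {
        SharedCell {
            ptr: AtomicPtr::new(Arc::into_raw(value) as *mut T),
            readers: AtomicUsize::new(0),
        }
    }

    // Reader (audio thread): wait-free, no loops.
    fn get(&self) -> Arc<T> {
        self.readers.fetch_add(1, Ordering::SeqCst);
        let ptr = self.ptr.load(Ordering::SeqCst);
        // While our presence is recorded in `readers`, the writer cannot
        // drop the value this pointer refers to, so bumping its
        // reference count here is safe.
        let arc = unsafe {
            Arc::increment_strong_count(ptr);
            Arc::from_raw(ptr)
        };
        self.readers.fetch_sub(1, Ordering::SeqCst);
        arc
    }

    // Writer (non-real-time thread): may spin while a reader is
    // mid-`get`, but never blocks the reader.
    fn set(&self, value: Arc<T>) {
        let old = self.ptr.swap(Arc::into_raw(value) as *mut T, Ordering::SeqCst);
        // Wait until no reader can still be holding the old pointer
        // without having incremented its reference count.
        while self.readers.load(Ordering::SeqCst) != 0 {
            std::hint::spin_loop();
        }
        // Now it is safe to release our reference to the old value.
        unsafe { drop(Arc::from_raw(old)) };
    }
}
```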

Basedrop doesn't currently support dynamically sized types, like `Owned<[T]>` or `Owned<dyn Trait>`. This should become possible when `CoerceUnsized` or an equivalent is stabilized. For now, it can be worked around without much issue by wrapping the DST in another layer of allocation.

Additionally, `Shared<T>` doesn't currently support weak references for cyclic data structures the way `Arc<T>` does. This would complicate the reference-counting logic (see the `Arc` source), and I wanted to start with something simple that I could be sure was correct. However, this would certainly be nice to have.

I would also like to explore memory reclamation strategies with less overhead than reference counting, such as the RCU pattern found in the Linux kernel, epoch-based reclamation, and quiescent state-based reclamation. I haven't yet been able to come up with a design in this vein that both dovetails with Rust ownership and satisfies the constraints of real-time audio (and audio plugins), but I think it's a promising direction for the future.

Basedrop is available on crates.io! Please feel free to give it a try in your own projects. Feedback and bug reports are welcome.

Lastly, I would like to thank William Light for some very helpful conversations while I was working out the design of basedrop.


Complex numbers have a representation as $2\times 2$ matrices, which can serve to illuminate some initially non-obvious aspects of how they work. A real number $a$ can be represented as a multiple of the identity matrix:

$aI = \begin{bmatrix} a & 0 \\ 0 & a \end{bmatrix}$

with addition and multiplication given by the corresponding matrix operations. In order to extend this representation to the complex numbers, we need a matrix $J$ such that $J^2 = -I$:

$\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}^2 = \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix}$

We can thus represent any complex number $a+bi$ as:

$aI + bJ = \begin{bmatrix} a & -b \\ b & a \end{bmatrix}$

It can be verified that addition and multiplication of these matrices is equivalent to addition and multiplication of the complex numbers they represent (meaning that matrices of this form comprise a field isomorphic to $\mathbb{C}$).
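As a quick numeric sanity check of that claim, the following illustrative snippet (all names are ad hoc) multiplies two such matrices and compares the result against the usual complex product $(a+bi)(c+di) = (ac-bd) + (ad+bc)i$:

```rust
type Mat2 = [[f64; 2]; 2];

// Represent a + bi as the matrix [[a, -b], [b, a]].
fn rep(a: f64, b: f64) -> Mat2 {
    [[a, -b], [b, a]]
}

// Plain 2x2 matrix multiplication.
fn mat_mul(x: Mat2, y: Mat2) -> Mat2 {
    let mut out = [[0.0; 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            for k in 0..2 {
                out[i][j] += x[i][k] * y[k][j];
            }
        }
    }
    out
}

// Complex multiplication on (re, im) pairs.
fn complex_mul((a, b): (f64, f64), (c, d): (f64, f64)) -> (f64, f64) {
    (a * c - b * d, a * d + b * c)
}
```

For example, $(1+2i)(3+4i) = -5+10i$, and multiplying `rep(1.0, 2.0)` by `rep(3.0, 4.0)` yields `rep(-5.0, 10.0)`.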

Just as any complex number $a+bi$ can be written in polar form $re^{i\theta}$, a matrix of the above form can be written as a scaled rotation (or an “amplitwist,” as Tristan Needham refers to it in *Visual Complex Analysis*):

$\begin{bmatrix} a & -b \\ b & a \end{bmatrix} = r\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$

(This is a special case of the more general polar decomposition for matrices, by which a square matrix can be written as the product of a symmetric positive-definite matrix and an orthogonal matrix.)

In fact, these matrices act on vectors in the plane in the same way that complex numbers act on one another: by scaling and rotation. So, we can look at complex multiplication as a particular binary operation on vectors in $\mathbb{R}^2$, or we can look at it as standard matrix multiplication on a particular class of matrices in $\mathbb{R}^{2\times 2}$, *or* we can look at it as multiplication of vectors in $\mathbb{R}^2$ by that particular class of matrices.

When moving from $\mathbb{R}$ to $\mathbb{C}$, the proper generalizations of many constructions from linear algebra involve the complex conjugate $\overline{a+bi} = a-bi$ and the conjugate transpose $X^* = \overline{X}^T$:

- The real inner product $\langle u,v\rangle = u^T v$ generalizes to the complex inner product $\langle u,v\rangle = u^* v$
- Symmetric matrices $A = A^T$ generalize to Hermitian matrices $A = A^*$
- Orthogonal matrices $A^{-1} = A^T$ generalize to unitary matrices $A^{-1} = A^*$

and so on. There are various explanations for this. One that I am fond of involves replacing the individual complex elements in a matrix or vector with their $2\times 2$ matrix representations, turning a complex column vector of length $n$ into a $2n\times 2$ real block matrix:

$\begin{bmatrix} a+bi \\ \vdots \\ c+di \end{bmatrix} \Rightarrow \begin{bmatrix} a & -b \\ b & a \\ \vdots & \vdots \\ c & -d \\ d & c \end{bmatrix}$

and a complex matrix into a $2m\times 2n$ real block matrix:

$\begin{bmatrix} a+bi & \cdots & c+di \\ \vdots & \ddots & \vdots \\ e+fi & \cdots & g+hi \end{bmatrix} \Rightarrow \begin{bmatrix} a & -b & \cdots & c & -d \\ b & a & \cdots & d & c \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ e & -f & \cdots & g & -h \\ f & e & \cdots & h & g \end{bmatrix}$

Since the transpose of an individual $2\times 2$ block

$\begin{bmatrix} a & -b \\ b & a \end{bmatrix}^T = \begin{bmatrix} a & b \\ -b & a \end{bmatrix}$

corresponds to the conjugate of the original complex number, the original notions of inner product, symmetric matrix, orthogonal matrix, and so on give the same results over these block matrices as their complex generalizations do over complex vectors and matrices.

A complex function $\mathbb{C}\to\mathbb{C}$ can be looked at as a function $\mathbb{R}^2\to\mathbb{R}^2$. The conditions for continuity are the same. However, the conditions for differentiability are different.

The derivative of a real function $f(x,y)=(u(x,y),v(x,y))$ at a point $(x,y)$ is the linear function $\mathbb{R}^2\to\mathbb{R}^2$ which best locally approximates $f$ at that point. It can be written as the $2\times 2$ Jacobian matrix of $f$'s partial derivatives:

$df_{(x,y)} = \begin{bmatrix} \frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} \\ \frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} \end{bmatrix}$

All that is necessary for such a function to be differentiable is for each of these partial derivatives to exist. If $f$ is instead considered as a complex function $f(x+yi)=u(x,y)+v(x,y)i$, its derivative at a point $z=x+yi$ should again be the best local linear approximation, but this time it should be a linear function $\mathbb{C}\to\mathbb{C}$ of a single complex variable, meaning that it can be expressed as a single complex number to be multiplied by its argument.

Which $2\times 2$ Jacobian matrices can we pack into a single complex number? In other words, which $2\times 2$ real matrices act on a vector in $\mathbb{R}^2$ the way complex numbers act on one another? We discovered this above: they are the matrices of the form

$\begin{bmatrix} a & -b \\ b & a \end{bmatrix}$

i.e. scaled rotation matrices or amplitwists. So a function $f(x+yi)=u(x,y)+v(x,y)i$ is differentiable if and only if the following conditions hold:

$\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}, \qquad \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}$

These are known as the Cauchy-Riemann equations.
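The equations can be checked numerically for a concrete holomorphic function. The illustrative sketch below (helper names are ad hoc) uses $f(z) = z^2$, for which $u = x^2 - y^2$ and $v = 2xy$, and approximates the partials with central finite differences:

```rust
// Real and imaginary parts of f(z) = z^2 with z = x + yi.
fn u(x: f64, y: f64) -> f64 { x * x - y * y }
fn v(x: f64, y: f64) -> f64 { 2.0 * x * y }

// Central finite-difference approximations of the partial derivatives.
fn d_dx(f: fn(f64, f64) -> f64, x: f64, y: f64) -> f64 {
    let h = 1e-6;
    (f(x + h, y) - f(x - h, y)) / (2.0 * h)
}

fn d_dy(f: fn(f64, f64) -> f64, x: f64, y: f64) -> f64 {
    let h = 1e-6;
    (f(x, y + h) - f(x, y - h)) / (2.0 * h)
}
```

At any sample point, `d_dx(u, x, y)` should agree with `d_dy(v, x, y)`, and `d_dy(u, x, y)` with `-d_dx(v, x, y)`, up to the accuracy of the finite differences.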

Functions with this property are known as holomorphic. This turns out to be a much stronger condition than differentiability over $\mathbb{R}^2$, with correspondingly much stronger implications:

- Holomorphic functions are analytic, i.e. they are everywhere locally equal to their Taylor series
- Both the real and imaginary parts of a holomorphic function are harmonic, i.e. their Laplacian vanishes everywhere
- A holomorphic function is conformal, i.e. it locally preserves angles, as long as its derivative is nonzero everywhere


The following is a video of a rotating cube:

This can be visualized as a plane sweeping through a cube of spacetime, or a flipbook:

If the plane is instead swept as follows,

the resulting video is:
