Arrays and Boxing in Common Lisp & SIMD in SBCL

<Hackmore> Hey Releston what are you up to?

<Releston> Looking at this here sensor code, it’s supposed to be rated for much higher resolution that we’re running. I mean a rock could come at us flying and the ship wouldn’t react in time.

<Hackmore> What?!? Am I supposed to feel even more scared than I already am? Okay, show me.

<Releston> Here… look at this for example, it’s supposed to remove the background noise from the sensor array.

(defun distance-sensors-delete-background-noise (distances backgrounds)
  (loop :for distance :across distances
        :for background :across backgrounds
        :for index :from 0
        :do (setf (aref distances index)
                  (if (< distance background) 0 distance))))

<Releston> It takes over 300ms just to clear the noise, who knows how slow everything else here is.

Array Specialization🔗

<Hackmore> Right, I see some weirdness here I think I could help with, first of all, what is the type of that input array?

<Releston> It’s just a simple-array nothing special, why?

<Hackmore> Oh, that might mean the values are all boxed, let’s look.

CL-USER> (sb-vm:hexdump v)
120CA3FA20: 0000000000000089
120CA3FA28: 0000000000000006 = 3
120CA3FA30: 000000120413C3CF = 1.3d0
120CA3FA38: 0000001204E5C89F = 1.2134d0
120CA3FA40: 000000120A7FA69F = 1.3d0
120CA3FA48: 0000000000000000 = 0

<Hackmore> You see those values in there? Or rather a lack thereof? All of those are just pointers to the heap, not actually the numbers you’re working with.

<Releston> How do you see that?

<Hackmore> Honestly, I’m not sure how to tell for sure, however there are a few giveaways. The addresses show you generally where you can expect allocated heap objects to end up, the values in the cells of the vector are suspiciously close to that. The other more obvious telltale is that the two ~double-float~s with an equal value have completely different binary representations, both of which are not actually representations of the number.

<Releston> Oh, that’s ugly, so Common Lisp doesn’t have non-boxed arrays?

<Hackmore> Of course it has, they are however sadly quite implementation dependent.

Let’s start with the fact that array semantics are based on boxes, for various reasons, by default they can hold absolutely any type and if you don’t know the sizes of anything you can’t make an array out of them, since an array is literally just “places of the same size next to each other”.

So without any further knowledge Common Lisp is forced to assume you might want to do anything and just box. However, nothing is stopping you from actually telling it what you’ll be using your array with.

Common Lisp defines something called “Specialized Arrays”, these allow the implementation to store the elements in a “more efficient” manner. Common Lisp itself only requires (vector bit) (a.k.a. bit-vector) and (vector character) (a.k.a. string) specializations, however implementations are allowed to create additional specializations as they see fit.

To inspect what is optimized in a particular environment the standard provides #'upgraded-array-element-type. This function takes a type and returns what the most specialized array capable of holding it is actually capable of holding. So if I wanted to store an (unsigned-byte 3), I could run (upgraded-array-element-type '(unsigned-byte 3)) which for me returns (unsigned-byte 4). This means that the implementation has some alternative way to treat arrays of (unsigned-byte 4), however it will also use that specialized version to store your (unsigned-byte 3). If the #'upgraded-array-element-type is ever just T, that means the classic generic array will be used. In fact you’ll find that the :element-type argument to #'make-array defaults to T.

It is worth noting that this is independent of type-checking. If you set the :element-type to (unsigned-byte 3) but then try to assign 8 to it, it might work if the array has no declared type. The array subtyping rules apply here to prevent nastiness¹, so if you declare the type of an array it has to be at least as specific as its :element-type. This means you can bind an :element-type '(unsigned-byte 4) into (array (unsigned-byte 3)) (otherwise element-type upgrading wouldn’t make sense) but not the other way around.

That mouthful matters if you’re curious what exactly happens, in general you can just say you want an (array single-float), and if the implementation can make that faster it will, the fact that the array itself might not be storing exactly what you asked for should be 100% transparent to you.

Nonetheless, we can use this to inspect the various better options an implementation offers us. In the case of SBCL I found a special variable which happens to line up with what gets optimized:

(map 'vector
  (alexandria:rcurry #'slot-value 'sb-vm::specifier)
  sb-vm:*specialized-array-element-type-properties*)

Which lists their specializations for us:

NIL
BASE-CHAR
CHARACTER
SINGLE-FLOAT
DOUBLE-FLOAT
BIT
UNSIGNED-BYTE
- 2
- 4
- 7
- 8
- 15
- 16
- 31
- 32
- 62
- 63
- 64
SIGNED-BYTE
- 8
- 16
- 32
- 64
FIXNUM
COMPLEX
- SINGLE-FLOAT
- DOUBLE-FLOAT

The list looks kinda bizarre at times, I’m sure every entry has a very good reason. Including the fun NIL entry (CL’s equivalent of the bottom type²), which just says congratulates you for trying to use it:

An attempt to access an array of element-type NIL was made. Congratulations!

This itself doesn’t tell you the elements aren’t boxed, but might give you a pretty good idea what types to look for when optimizing. CL does not guarantee any specific optimization behaviour only how to find out that the implementation would like to know about this specific type because it has some trick up its sleeve. With SBCL we can verify our assumption quite trivially, by constructing v with :element-type 'double-float:

CL-USER> (sb-vm:hexdump v)
1209ABBB60: 00000000000000D5
1209ABBB68: 0000000000000006 = 3
1209ABBB70: 3FF4CCCCCCCCCCCD
1209ABBB78: 3FF36A161E4F7660
1209ABBB80: 3FF4CCCCCCCCCCCD
1209ABBB88: 0000000000000000 = 0

<Hackmore> See? Now try it with this.

<Releston> … … … It’s slower!!!

<Hackmore> Well, they do always say to benchmark and test, the reason why this is slower is currently slightly beyond my grasp.

I am reasonably sure it has to do with re-boxing floats, as the main problem is that with no element-type there are no allocations when run, but when we set it equal to double-float the program allocates a lot. Even if I’m operating on an unboxed array the values I’m using in the rest of the program may still be boxed, which means we end up with the item being reboxed on each access:

If I want to operate on a generic array I have to call the generic functions with standard Lisp values, which for a double-float means it needs to be boxed, so every read from the array is sort of associated with allocating memory. We need to show to Lisp that the array is safely able to provide and accept unboxed floats. Adding a declaration suffices to get rid of that issue.

This also changes speed due to plenty of inlining of math, etc. that however seems to speed up only by about half without the specialization.

(defun distance-sensors-delete-background-noise2 (distances backgrounds)
  (declare (type (simple-array double-float) distances backgrounds))
  (loop :for index :below (length distances)
        :for distance := (aref distances index)
        :for background := (aref backgrounds index)
        :do 
           (setf (aref distances index)
                 (if (< distance background) 0.0d0 distance))))

<Releston> That’s more like it, under 20ms or so, that’s one hell of an improvement.

<Hackmore> This was supposed to be rather educational, not just about optimizing, so don’t go doing this everywhere before really finding out why something isn’t the way you want and make sure the edit actually makes it that way.

<Releston> Okay, I guess that’s it, we can’t get below that, there’s minimal work being done save for actually going over the elements one by one.

SIMD🔗

<Hackmore> Actually there might still be one thing, it’s actually why I came down here. I just read in the manual that they fitted our ship’s computer with AVX2.

<Releston> Elaborate

The SBCL Manual actually already has a good section to introduce you to this, however I had my own problems understanding it so I will re-iterate and partly show how this works anyway.

Modern processors are fitted with chonky registers to help with operations on vectors. These vary from 128-bits in SSE all the way up to 512-bits in AVX-512. All these SSE and AVX acronyms are descriptions of which registers (and associated operations) are present on a CPU. The most commonly supported version around me at the time of writing is AVX2, which includes 256-bit registers.

These registers can load their entire size from memory all at once and then apply operations that act as if they had multiple separate numbers inside them. Emulating this behaviour with a normal register would be problematic, since placing let’s say 2 32-bit integers into a 64-bit register and trying to add them could overflow from one number to the other. Other operations would provide their own little problems, so we have specialized registers and instructions for this process.

For AVX2 SBCL provides functions like f64.4+ or f32.8<, these mean that you are using a 256-bit register to operate on 4 double-floats or 8 single-floats respectively. If you do the math you’ll find that 4×f64 and 8×f32 are both 256-bits. SBCL provides these inside the sb-simd package, along with sb-simd-<extension-version> for those parts of SIMD only applicable to certain extensions.

Things like sb-simd:f64 and sb-simd:u8 are mostly just type aliases for the actual built-in Common Lisp types. So using sb-simd:u16+ on two numbers or sb-simd:f32* on two floats is perfectly acceptable.

How do we get multiple double-floats then, so we can use these functions on them? We use <type>.<count>-aref on an array restricted to the given type. It’s that easy, this returns a “SIMD pack” as SBCL calls it, and we can operate on it, before setf-ing back. In essence you only replace aref with ...-aref and + with ...-+ and you’re done. SIMD if always evaluates both branches, since it is a low-level register operation that needs to make a choice over multiple numbers at the same time. < creates a boolean mask which SIMD if uses to choose a number from one of the other two arguments, so to do that both must already be calculated.

If our dear crew strictly assumes AVX2 is present they could write something like this:

(defun distance-sensors-delete-background-noise3 (distances backgrounds)
  (loop :for index :below (length distances) :by 4
        :for distance := (sb-simd-avx2:f64.4-aref distances index)
        :for background := (sb-simd-avx2:f64.4-aref backgrounds index)
        :do (setf (sb-simd-avx2:f64.4-aref distances index)
                  (sb-simd-avx2:f64.4-if
                   (sb-simd-avx2:f64.4< distance background)
                   0.0d0
                   distance))))

This is slower than before, but I meant mostly to show off the features as they took me way too long for me to understand and I find them interesting and productive to know. For more complex calculations or when used with larger leverage, like with bytes where the register can support 32 of them at once, the performance of SIMD is incomparable.

Footnotes🔗

¹ Though it is still up to the user to keep things coherent, declaring the type incorrectly can easily lead to values with a declared type that are not of that type.

² This means that the type has no values, no values are subtypes of it, nothing. Just completely vast emptiness of no possible values ever.