Arrays and Boxing in Common Lisp & SIMD in SBCL

<Hackmore> Hey Releston what are you up to?

<Releston> Looking at this here sensor code, it’s supposed to be rated for much higher resolution that we’re running. I mean a rock could come at us flying and the ship wouldn’t react in time.

<Hackmore> What?!? Am I supposed to feel even more scared than I already am? Okay, show me.

<Releston> Here… look at this for example, it’s supposed to remove the background noise from the sensor array.

(defun distance-sensors-delete-background-noise (distances backgrounds)
  (loop :for distance :across distances
        :for background :across backgrounds
        :for index :from 0
        :do (setf (aref distances index)
                  (if (< distance background) 0 distance))))

<Releston> It takes over 300ms just to clear the noise, who knows how slow everything else here is.

Array Specialization🔗

<Hackmore> Right, I see some weirdness here I think I could help with, first of all, what is the type of that input array?

<Releston> It’s just a simple-array nothing special, why?

<Hackmore> Oh, that might mean the values are all boxed, let’s look.

CL-USER> (sb-vm:hexdump v)
120CA3FA20: 0000000000000089
120CA3FA28: 0000000000000006 = 3
120CA3FA30: 000000120413C3CF = 1.3d0
120CA3FA38: 0000001204E5C89F = 1.2134d0
120CA3FA40: 000000120A7FA69F = 1.3d0
120CA3FA48: 0000000000000000 = 0

<Hackmore> You see those values in there? Or rather a lack thereof? All of those are just pointers to the heap, not actually the numbers you’re working with.

<Releston> How do you see that?

<Hackmore> Honestly, I’m not sure how to tell for sure, however there are a few giveaways. The addresses show you generally where you can expect allocated heap objects to end up, the values in the cells of the vector are suspiciously close to that. The other more obvious telltale is that the two double-floats with an equal value have completely different binary representations, both of which are not actually representations of the number.

<Releston> Oh, that’s ugly, so Common Lisp doesn’t have non-boxed arrays?

<Hackmore> Of course it has, they are however sadly quite implementation dependent.

Let’s start with the fact that array semantics are based on boxes, for various reasons, by default they can hold absolutely any type and if you don’t know the sizes of anything you can’t make an array out of them, since an array is literally just “places of the same size next to each other”.

So without any further knowledge Common Lisp is forced to assume you might want to do anything and just box. However, nothing is stopping you from actually telling it what you’ll be using your array with.

Common Lisp defines something called “Specialized Arrays”, these allow the implementation to store the elements in a “more efficient” manner. Common Lisp itself only requires (vector bit) (a.k.a. bit-vector) and (vector character) (a.k.a. string) specializations, however implementations are allowed to create additional specializations as they see fit.

To inspect what is optimized in a particular environment the standard provides #'upgraded-array-element-type. This function takes a type and returns what the most specialized array capable of holding it is actually capable of holding. That might’ve just sounded weird, why am I interested in what it holds rather than the type itself? It would be redundant, since the type that holds is just an (array <that-type>) anyways. So if I wanted to store an (unsigned-byte 3), I could run (upgraded-array-element-type '(unsigned-byte 3)) which for me returns (unsigned-byte 4). This means that the implementations has some alternative way to treat arrays of your type and if you ask for one it’ll use the one that is specialized for use with (unsigned-byte 4) but you can feel free to store your (unsigned-byte 3) in there the array’ll handle it.

Nonetheless, we can use this to inspect the various better options an implementation offers us. In the case of SBCL I found a special variable which happens to line up with what gets optimized:

(map 'vector
  (alexandria:rcurry #'slot-value 'sb-vm::specifier)
  sb-vm:*specialized-array-element-type-properties*)

Which lists their specializations for us:

NIL
BASE-CHAR
CHARACTER
SINGLE-FLOAT
DOUBLE-FLOAT
BIT
UNSIGNED-BYTE
- 2
- 4
- 7
- 8
- 15
- 16
- 31
- 32
- 62
- 63
- 64
SIGNED-BYTE
- 8
- 16
- 32
- 64
FIXNUM
COMPLEX
- SINGLE-FLOAT
- DOUBLE-FLOAT

The list looks kinda bizarre at times, I’m sure every entry has a very good reason. Including the fun NIL entry (CL’s equivalent of the bottom type), which just says congratulates you for trying to use it.

This itself doesn’t tell you the elements aren’t boxed but might give you a pretty good idea what types to look for when optimizing. With SBCL we can verify our assumption quite trivially, by constructing v with :element-type 'double-float:

CL-USER> (sb-vm:hexdump v)
1209ABBB60: 00000000000000D5
1209ABBB68: 0000000000000006 = 3
1209ABBB70: 3FF4CCCCCCCCCCCD
1209ABBB78: 3FF36A161E4F7660
1209ABBB80: 3FF4CCCCCCCCCCCD
1209ABBB88: 0000000000000000 = 0

<Hackmore> See? Now try it with this.

<Releston> … … … It’s slower!!!

<Hackmore> Well, they do always say to benchmark and test, the reason why this is slower is currently slightly beyond my grasp.

I’m reasonably sure it has to do with re-boxing floats, as the main problem is that with no element-type there are no allocations when run, but with it equal to double-float it allocates a lot. If I want to operate on a generic array I have to call the generic functions with standard Lisp values, which for a double-float means it needs to be boxed, so every read from the array is sort of associated with allocating memory. We need to show to Lisp that the array is safely able to provide and accept unboxed floats. Adding a declaration suffices to get rid of that issue.

This also changes speed due to plenty of inlining of math, etc. that however seems to speed up only by about half without the specialization.

(defun distance-sensors-delete-background-noise2 (distances backgrounds)
  (declare (type (simple-array double-float) distances backgrounds))
  (loop :for index :below (length distances)
        :for distance := (aref distances index)
        :for background := (aref backgrounds index)
        :do 
           (setf (aref distances index)
                 (if (< distance background) 0.0d0 distance))))

<Releston> That’s more like it, under 20ms or so, that’s one hell of an improvement.

<Hackmore> This was supposed to be rather educational, not just about optimizing, so don’t go doing this everywhere before really finding out why something isn’t the way you want and make sure the edit actually makes it that way.

<Releston> Okay, I guess that’s it, we can’t get below that, there’s minimal work being done save for actually going over the elements one by one.

SIMD🔗

<Hackmore> Actually there might still be one thing, it’s actually why I came down here. I just read in the manual that they fitted our ship’s computer with AVX2.

<Releston> Elaborate

The SBCL Manual actually has a good section to introduce you to this, however I had my own problems understanding it so I will re-iterate and partly show how this works anyway.

Modern processors are fitted with chonky registers to help with operations on vectors. These vary from 128-bits in SSE all the way up to 512-bits in AVX-512. All these SSE and AVX acronyms are descriptions of which registers are present on a CPU. The most common around me at the time of writing is AVX2, which includes 256-bit registers.

These registers can load from memory all at once and then apply operations that act as if they had multiple separate numbers inside them. Emulating this behaviour with a normal register would be problematic, since placing let’s say 2 32-bit integers into a 64-bit register and trying to add them could overflow from one number to the other. Other operations would provide their own little problems, so we have specialized registers and instructions for this process.

For AVX2 SBCL provides functions like f64.4+ or f32.8<, if you do the math you’ll find that 4xf64 and 8xf32 is both 256-bits. This means that you are using a 256-bit register to operate on 4 double-floats or 8 single-floats respectively. SBCL provides these inside the sb-simd package, along with sb-simd-<extension-version> for those parts of SIMD only applicable to certain extensions.

Things like sb-simd:f64 and sb-simd:u8 are mostly just type aliases for the actual built-in Common Lisp types. So using sb-simd:u16+ on two numbers or sb-simd:f32* on two floats is perfectly acceptable.

How do we get multiple double-floats for example then, so we can use these functions on them? We use <type>.<count>-aref on an array restricted to the given type. It’s that easy, this returns a “SIMD pack” as SBCL calls it, and we can operate on it, before setfing back. In essence you only replace aref with ...-aref and + with ...-+ and you’re done. One thing to note is that SIMD if works a bit differently, since we need to make a choice over multiple numbers at the same time. < actually creates a boolean mask which SIMD if uses to choose a number from one of the other two arguments, however to do that both must already be pre-calculated.

If our dear crew strictly assumes AVX2 is present they could write something like this:

(defun distance-sensors-delete-background-noise3 (distances backgrounds)
  (loop :for index :below (length distances) :by 4
        :for distance := (sb-simd-avx2:f64.4-aref distances index)
        :for background := (sb-simd-avx2:f64.4-aref backgrounds index)
        :do (setf (sb-simd-avx2:f64.4-aref distances index)
                  (sb-simd-avx2:f64.4-if
                   (sb-simd-avx2:f64.4< distance background)
                   0.0d0
                   distance))))

This is slower than before, but I meant mostly to show off the features as they took me way too long for me to understand and I find them interesting and productive to know. For more complex calculations or when used with larger leverage, like with bytes where the register can support 32 of them at once, the performance of SIMD is incomparable.