Background wikipedia article: https://en.wikipedia.org/wiki/Z-order_curve
(WARNING: This article may contain assembly, not suitable for children under 4)
I'm always rediscovering solutions to long-answered questions. A particular problem I had at work recently was bit-banging multiple I2C channels from a multiplexed datastream. In itself, that's not a particularly hard problem. I2C is, after all, one of the simpler protocols you can implement. I think most firmware engineers have written at least one I2C driver at some point.
The problem I encountered was simply one of bandwidth. I was using an ATTiny running at a face-melting 20MHz. Even with 100% optimal assembly, the clock speed here means you can only go up to about 250KHz on your I2C bus. That's typically not a problem, but it really matters when you're streaming data in real-time.
So let's think about how data flows through this system. We have an incoming stream of data which contains multiplexed streams for each of 5 output channels. We want to take the next byte for each channel, and push the X'th bit of each byte onto the wire. There are lots and lots of ways you can accomplish this, but most of them are horrendously slow when you peek under the hood.
So we've received some samples for our channels.
|      | S0   | S1   | S2   |
|------|------|------|------|
| Ch 0 | 0x3A | 0xF1 | 0x9C |
| Ch 1 | 0x7D | 0x00 | 0xAA |
| Ch 2 | 0x42 | 0xC3 | 0x1F |
| Ch 3 | 0xE7 | 0x5A | 0x6D |
| Ch 4 | 0x18 | 0xBE | 0x73 |
As arrays in C are row-major, each channel is stored contiguously in memory. The above 2D array is identical in memory to:
uint8_t samples[] = {0x3A, 0xF1, 0x9C, 0x7D, 0x00, 0xAA, 0x42, 0xC3, 0x1F, 0xE7, 0x5A, 0x6D, 0x18, 0xBE, 0x73};
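A quick host-side sanity check of that layout (plain C; the 5×3 shape matches the table above):

```c
#include <stdint.h>

/* Row-major layout: element [ch][s] of a 5x3 array lives at
 * offset ch * 3 + s in the flat backing array. */
static uint8_t sample_at(const uint8_t *flat, uint8_t ch, uint8_t s) {
    return flat[ch * 3 + s];
}
```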
If we expand the first column into a binary matrix, we can start to see the problem:
|      | Bit 7 | Bit 6 | Bit 5 | Bit 4 | Bit 3 | Bit 2 | Bit 1 | Bit 0 |
|------|-------|-------|-------|-------|-------|-------|-------|-------|
| Ch 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| Ch 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| Ch 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| Ch 3 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| Ch 4 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
If you think about it, this matrix of bits is a direct representation of the states of our output IO lines. All we need to do is take a column of this bit matrix and throw it straight out the IO register. But the data is physically in memory as rows. Extracting the X'th bit from each byte turns out to be way too slow to run in real-time. What we want is to somehow rotate this bit matrix by 90 degrees:
|       | Ch 4 | Ch 3 | Ch 2 | Ch 1 | Ch 0 |
|-------|------|------|------|------|------|
| Bit 7 | 0 | 1 | 0 | 0 | 0 |
| Bit 6 | 0 | 1 | 1 | 1 | 0 |
| Bit 5 | 0 | 1 | 0 | 1 | 1 |
| Bit 4 | 1 | 0 | 0 | 1 | 1 |
| Bit 3 | 1 | 0 | 0 | 1 | 1 |
| Bit 2 | 0 | 1 | 0 | 1 | 0 |
| Bit 1 | 0 | 1 | 1 | 0 | 1 |
| Bit 0 | 0 | 1 | 0 | 1 | 0 |
Hopefully you can now see the power available in this technique. If we can transpose a byte array into a column-major format, we can "simply" pop a byte off the top and push it to the IO register. That is a grand total of 2 clock cycles if you do it right.
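As a sketch of what the output path buys us (host-side C; `port` is a stand-in for an AVR IO register such as PORTB, which is an assumption about the actual wiring):

```c
#include <stdint.h>

/* With column-major data, emitting the next bit-slice to all channels at
 * once is a single byte copy. On AVR this is roughly "ld rN, Z+ / out PORT, rN". */
static const uint8_t *push_slice(volatile uint8_t *port, const uint8_t *col) {
    *port = *col++;  /* one byte drives all five IO lines in parallel */
    return col;      /* advanced pointer, like the Z+ post-increment */
}
```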
The next question is: how do we perform this transposition? Well, this is the solution that I rediscovered. I'd gotten pretty close on my own, but I wasn't satisfied that I was thinking about it the right way, so I went looking to see how we solved this problem back in the TTL days. Really I was just looking for some inspiration because I felt like I'd boxed myself into this idea.
What I found was that I was actually already on the right track. I found the Wikipedia article on Morton ordering and decided that this was definitely the correct solution. So then we get to the implementation.
The canonical naive C implementation is not in any way surprising:
void slice(uint8_t *in_bytes, uint8_t *out_bytes){
    //assumes out_bytes[] starts zeroed
    for(uint8_t i = 0; i < 8; i++)
        for(uint8_t j = 0; j < 8; j++)
            for(uint8_t k = 0; k < 5; k++)
                out_bytes[j] |= ((in_bytes[k] >> j) & 1) << k;
}
If you think about it, it can't really be any other way, right? You must shift your inputs, your masks, or your outputs around to get the bits to land in the right place. This is the only logical way to shuffle bits around, right? Before I answer that, let's inspect the assembly for this function:
slice:
push r17
push r28
push r29
/* prologue: function */
/* frame size = 0 */
/* stack size = 3 */
.L__stack_usage = 3
ldi r17,lo8(8)
.L2:
ldi r21,0
ldi r20,0
rjmp .L6
.L3:
ld r18,X+ //load an input byte to R18
ldi r19,0 //zero a scratch register
mov r0,r20 //shift count = j
rjmp 2f
1: asr r19 //shift r19 right, lowest bit shifted to carry
ror r18 //rotate r18 right, carry shifted into bit 7
2: dec r0
brpl 1b //shift right j times
andi r18,1 //get only the lowest bit
clr r19 //zero scratch register
mov r0,r30 //shift count = k
rjmp 2f
1: lsl r18 //shift r18 left
2: dec r0
brpl 1b //shift left k times
movw r28,r22 //set up Y pointer based on our loop indices
ld r19,Y //load the value at *Y
or r18,r19 //OR it into our result
st Y,r18 //store it back to *Y
adiw r30,1
cpi r30,5
cpc r31,__zero_reg__
brne .L3 //innermost loop: k over the 5 channels
subi r20,-1
sbci r21,-1
cpi r20,8
cpc r21,__zero_reg__
breq .L4
.L6:
movw r26,r24
ldi r31,0
ldi r30,0
rjmp .L3
.L4:
subi r17,lo8(-(-1))
subi r22,-1
sbci r23,-1
cpse r17,__zero_reg__
rjmp .L2
/* epilogue start */
pop r29
pop r28
pop r17
ret
As you may or may not be able to see, the vast majority of this code is spent managing loop counters and jumping around. All those jumps and branches are expensive! The assembly is (unsurprisingly) quite close to how we expressed it in C and looks fairly clean on first inspection. However, when you actually count up all the loop iterations, the problem becomes immediately apparent: the body of this function is executed 8 * 8 * 5 = 320 times! That's several thousand instructions to compute eight bytes of output! If we (very) generously say each instruction is one cycle, that's hundreds of microseconds per output frame! At 250KHz, you have about 40 microseconds to output one frame. Granted, the real accounting is more subtle than that, but we're still looking at an order of magnitude difference!
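To put rough numbers on that budget (assumed figures: ~10 instructions per loop-body execution, one cycle each, a 20MHz clock):

```c
/* Back-of-the-envelope: naive transpose cost vs. the I2C frame budget.
 * The per-iteration instruction count is an assumption, not a measurement. */
static double naive_frame_us(void) {
    const double iterations = 8.0 * 8.0 * 5.0; /* 320 loop-body executions */
    const double instr_per_iter = 10.0;        /* assumed average */
    const double cycles = iterations * instr_per_iter;
    return cycles / 20.0;                      /* 20 cycles per microsecond at 20MHz */
}
```

Even with these generous assumptions, the result blows well past the ~40 microsecond frame budget.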
That's this approach dead in the water then. The CPU just does not physically have enough time to process this data, right?
Well, now it's time to talk about the AVR CPU architecture. The particular chip I'm working with uses the AVRxt instruction set. This includes two extremely interesting instructions: BST and BLD, the Bit STore and Bit LoaD instructions.
The AVRxt CPU has opcodes that operate on a single bit position in a single register. This is intended for very fast access to an individual IO line or config-register flag, but with the BST and BLD instructions, it becomes much, much more versatile.
BST copies the bit at a given position in a register into the T bit of the STATUS register. BLD then does the reverse, copying the T bit into a given position in a (possibly different) register. Each of these instructions takes only one cycle to complete, so constructing one bit of our output byte costs two full cycles, no shifting required!
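To make those semantics concrete, here's a tiny host-side C model of the two instructions (the T flag is modeled as a plain variable; this is an illustration, not AVR code):

```c
#include <stdint.h>

static uint8_t T; /* stands in for the T bit of SREG */

/* BST Rd,b: copy bit b of rd into T */
static void bst(uint8_t rd, uint8_t b) {
    T = (rd >> b) & 1;
}

/* BLD Rd,b: copy T into bit b of *rd */
static void bld(uint8_t *rd, uint8_t b) {
    *rd = (uint8_t)((*rd & ~(1u << b)) | (uint8_t)(T << b));
}
```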
After several days of debugging and whiteboard scribbling, I came up with this:
void slice_bits(const uint8_t *in_bytes, uint8_t *out_bytes) {
//the optimized assembly below does the same operation in <100 instructions without loops
//this relies on One Weird Trick the CPU has that Big Compiler doesn't want you to know!
//we can load a bit from a position in a register to a temporary bit in the status register
//then that bit can be put into a different position in a different register
//load and store are single-cycle instructions! That means ~10 bit-move instructions per output byte
//there are 32 8-bit registers and six of those are combined into three 16-bit registers X, Y, Z
//registers Y and Z (and X, kind of) are pointer registers that allow faster access to RAM
//we store the 5 input bytes in regular registers, and the output pointer at Z
//we use a 6th register to accumulate bits from the input bytes, then set the value
//at the Z address with an atomic increment of the Z pointer
//in this way, constructing one output byte requires:
//5 bit-loads from input bytes (1cy)
//5 bit-stores to output byte (1cy)
//copy output to *Z and increment Z (1cy)
//clear output register (1cy)
//that's 12 cycles per byte, 96 cycles per 8B frame, or ~4.8 microseconds
//at 250KHz I2C bus speed it takes ~40us to write one byte, we have plenty of headroom
#define LOAD_BITS(pos) \
"bst %[b0],"#pos"\n\t"\
"bld r10,1\n\t"\
"bst %[b1],"#pos"\n\t"\
"bld r10,2\n\t"\
"bst %[b2],"#pos"\n\t"\
"bld r10,3\n\t"\
"bst %[b3],"#pos"\n\t"\
"bld r10,4\n\t"\
"bst %[b4],"#pos"\n\t"\
"bld r10,5\n\t"\
"st Z+,r10\n\t"\
"clr r10\n\t"
_fastPtr_z(out_bytes, out_bytes) //store output pointer in z
_fastPtr_y(in_bytes, in_bytes) //pointer to input in Y register. GCC makes worse assembly otherwise
asm volatile(
//assembly laid out here for clarity
//first, load all input bytes into registers. This is defined at the end of the listing,
//but the implementation looks something like:
//"ld Y,%[&in_bytes]\n\t"
//"ld %[b0],Y+\n\t"
//"ld %[b1],Y+\n\t"
//"ld %[b2],Y+\n\t"
//"ld %[b3],Y+\n\t"
//"ld %[b4],Y+\n\t"
//the LOAD_BITS macro contains the following
//"bst %[b0],"#pos"\n\t" //load the 'pos'th bit from byte 0 into the T bit of SREG
//"bld r10,1\n\t" //copy that bit to bit 1 of our output byte
//"bst %[b1],"#pos"\n\t" //repeat for all five input bytes
//"bld r10,2\n\t"
//"bst %[b2],"#pos"\n\t"
//"bld r10,3\n\t"
//"bst %[b3],"#pos"\n\t"
//"bld r10,4\n\t"
//"bst %[b4],"#pos"\n\t"
//"bld r10,5\n\t"
//"st Z+,r10\n\t" //copy the output byte to *Z and increment Z as a single operation
//"clr r10\n\t" //set our temp register back to zero
//this could be done as a loop from 0..7, but we save something like 40
//CPU cycles by unrolling the loop, at the cost of a few hundred bytes of program space
LOAD_BITS(0)
LOAD_BITS(1)
LOAD_BITS(2)
LOAD_BITS(3)
LOAD_BITS(4)
LOAD_BITS(5)
LOAD_BITS(6)
LOAD_BITS(7)
//this construction allows the compiler to select registers for us
: //no output registers
: [b0] "r" (in_bytes[0]), //move input bytes to input registers
[b1] "r" (in_bytes[1]),
[b2] "r" (in_bytes[2]),
[b3] "r" (in_bytes[3]),
[b4] "r" (in_bytes[4])
: "r10", "cc" //mark r10 and SREG as clobbered ("cc" is GCC's name for the condition flags, i.e. SREG on AVR)
//clobbering tells GCC these get modified, so it won't keep live values in them across the block
//r10 is just an arbitrary register used for working memory
//SREG is the status register. Since we directly manipulate its T bit, we have to declare it,
//otherwise the compiler might keep a comparison result in SREG across this block and we'd destroy it
);
}
Hopefully that makes sense. We leverage some unexpectedly powerful features on a very underpowered CPU to go tens of times faster than C. And so we got something like 10KB/s throughput on five bitbanged I2C channels.
(what looks like an off-by-one is an implementation detail; IO bit 0 is our clock pin)
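For testing on a host, the transform the assembly performs can be modeled in plain C (channel bits land at IO positions 1-5, with bit 0 left free for the clock, per the note above):

```c
#include <stdint.h>

/* Reference model: out[pos] collects bit 'pos' of every channel byte,
 * with channel k mapped to IO bit k+1 (IO bit 0 is the clock line). */
static void slice_bits_model(const uint8_t in[5], uint8_t out[8]) {
    for (uint8_t pos = 0; pos < 8; pos++) {
        uint8_t b = 0;
        for (uint8_t k = 0; k < 5; k++)
            b |= (uint8_t)(((in[k] >> pos) & 1) << (k + 1));
        out[pos] = b;
    }
}
```

Running it on the first sample column from the tables above reproduces the transposed matrix, one byte per bit position.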
One very important feature is the use of the Z register. AVRxt has several instructions for the X, Y, and Z registers that can load from or store to the pointed-to address and increment or decrement the pointer as a single operation. This saves a handful of additional instructions we'd otherwise have to spend managing that pointer ourselves. That trick comes from the ATTiny Arduino core, where it's used a lot in the I2C and UART drivers in much the same way. If you're curious, you should look it up. Without this trick, GCC creates some awful assembly. Truly dreadful stuff.
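In C, the shape that maps onto those post-increment addressing modes is the classic pointer-walking loop (a generic illustration, not the driver's actual code):

```c
#include <stdint.h>
#include <stddef.h>

/* Each "*dst++ = *src++" can compile to "ld rN, Y+ / st Z+, rN" on AVR:
 * the address registers advance as part of the load/store itself. */
static void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n) {
    while (n--)
        *dst++ = *src++;
}
```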
The surrounding I2C driver was also designed to run this process in the dead time between I2C frames, all very carefully balanced to extract the most compute this CPU can possibly give. Unfortunately I can't share the whole driver, but the rest of it looks pretty similar to the above.
I had to learn a lot through this process. Not only did I have to learn AVR assembly, I had to learn about GCC inline assembly and all the rules associated with it. It was a fun project and I definitely spent way too much time on it, but I think the additional performance is worth it.
If you noticed that the assembly above would only read one channel instead of all five, congratulations! You're an even bigger nerd than me. Send me an email, we should be friends ;)
I couldn't find a great place to mention it, but at some point I implemented a column-major array. When we demux the input stream, we separate the channels into a ColArray structure. This lines up the memory in a way that's (slightly) faster for the assembly to consume. It really isn't that interesting:
struct ColArray {
    ColArray(const uint8_t rows, const uint8_t cols) : rows(rows) {
        backing = new uint8_t[rows * cols];
    }
    ~ColArray() {
        delete[] backing;
    }
    uint8_t* operator[](const uint8_t idx) const {
        return &backing[idx * rows]; //each column is 'rows' bytes long
    }
private:
    uint8_t *backing;
    uint8_t rows; //number of rows per column
};
One of the more interesting facets of this BLD and BST technique is that C has no way to express such a concept! There is no way to tell the compiler that you're working on individual bits, so there's no good way for it to realize that it should use BLD and BST in this way. The technique of shifting bits through the carry flag comes close, but it still requires spending significant time rotating your bytes to get the bits in the right places before you even do the operation! In some specific cases, GCC does use the bit-access opcodes: if you do something like PORTA |= 1, you'll actually get a single instruction that sets bit 0 of the PORTA register. I assume that this is a specific optimization case someone has written for AVR IO, and that no one has thought to emit BLD and BST this way.
Addendum: Go back to the Wikipedia article and take a look at the images. Consider: 2D arrays are a Z-order curve on a 1D array. Logically, we can then extend a 2D array into 3D or higher. If a 1D byte array is a 2D bit array in a trenchcoat, a 2D byte array can then be considered a 3D bit array.
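For reference, the 2-D Morton encode from that article is just bit interleaving, which is exactly the shape of our output bytes (a portable C sketch):

```c
#include <stdint.h>

/* Interleave the bits of x and y: x occupies the even bit positions of
 * the result, y the odd ones. This is the Z-order / Morton index. */
static uint32_t morton2(uint16_t x, uint16_t y) {
    uint32_t z = 0;
    for (int i = 0; i < 16; i++) {
        z |= (uint32_t)((x >> i) & 1u) << (2 * i);
        z |= (uint32_t)((y >> i) & 1u) << (2 * i + 1);
    }
    return z;
}
```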
What we've done here is treat a multiplexed 1D byte stream as a 3D bit matrix and made optimal use of the CPU to transform that matrix into a new interleaved 1D byte stream formatted exactly to stream out through an IO port to the physical world.
That juxtaposition is one of the things I find most exciting and satisfying about embedded programming. We have a very primitive physical input, we transform it into this highly abstract concept in a physically impossible data-space, poke at it a bit, then go from there straight to directly manipulating physical reality.