In this, our next exciting installment of STM32 and Rust for USB device drivers, we're going to look at what the STM32 calls the 'packet memory area'. If you've been reading along with the course, including reading up on the datasheet content then you'll be aware that as well as the STM32's normal SRAM, there's a 512 byte SRAM dedicated to the USB peripheral. This SRAM is called the 'packet memory area' and is shared between the main bus and the USB peripheral core. Its purpose is, simply, to store packets in transit. Both those IN to the host (so stored queued for transmission) or OUT from the host (so stored, queued for the application to extract and consume).

It's time to actually put hand to keyboard on some Rust code, and the PMA is the perfect starting point, since it involves two basic structures. Packets are the obvious first structure, and they are contiguous sets of bytes which for the purpose of our work we shall assume are one to sixty-four bytes long. The second is what the STM32 datasheet refers to as the BTABLE or Buffer Descriptor Table. Let's consider the BTABLE first.

The Buffer Descriptor Table

The BTABLE is arranged in quads of 16bit words. For "normal" endpoints this is a pair of descriptors, each consisting of two words, one for transmission, and one for reception. The STM32 also has a concept of double buffered endpoints, but we're not going to consider those in our proof-of-concept work. The STM32 allows for up to eight endpoints (EP0 through EP7) in internal register naming, though they support endpoints numbered from zero to fifteen in the sense of the endpoint address numbering. As such there're eight descriptors each four 16bit words long (eight bytes) making for a buffer descriptor table which is 64 bytes in size at most.

Buffer Descriptor Table
Byte offset in PMA Field name Description
(EPn * 8) + 0 USB_ADDRn_TX The address (inside the PMA) of the TX buffer for EPn
(EPn * 8) + 2 USB_COUNTn_TX The number of bytes present in the TX buffer for EPn
(EPn * 8) + 4 USB_ADDRn_RX The address (inside the PMA) of the RX buffer for EPn
(EPn * 8) + 6 USB_COUNTn_RX The number of bytes of space available for the RX buffer for EPn (and once received, the number of bytes received)

The TX entries are trivial to comprehend. To transmit a packet, part of the process involves writing the packet into the PMA, putting the address into the appropriate USB_ADDRn_TX entry, and the length into the corresponding USB_COUNTn_TX entry, before marking the endpoint as ready to transmit.

To receive a packet though is slightly more complex. The application must allocate some space in the PMA, setting the address into the USB_ADDRn_RX entry of the BTABLE before filling out the top half of the USB_COUNTn_RX entry. For ease of bit sizing, the STM32 only supports space allocations of two to sixty-two bytes in steps of two bytes at a time, or thirty-two to five-hundred-twelve bytes in steps of thirty-two bytes at a time. Once the packet is received, the USB peripheral will fill out the lower bits of the USB_COUNTn_RX entry with the actual number of bytes filled out in the buffer.

Packets themselves

Since packets are, typically, a maximum of 64 bytes long (for USB 2.0) and are simply sequences of bytes with no useful structure to them (as far as the USB peripheral itself is concerned) the PMA simply requires that they be present and contiguous in PMA memory space. Addresses of packets are relative to the base of the PMA and are byte-addressed, however they cannot start on an odd byte, so essentially they are 16bit addressed. Since the BTABLE can be anywhere within the PMA, as can the packets, the application will have to do some memory management (either statically, or dynamically) to manage the packets in the PMA.

Accessing the PMA

The PMA is accessed in 16bit word sections. It's not possible to access single bytes of the PMA, nor is it conveniently structured as far as the CPU is concerned. Instead the PMA's 16bit words are spread on 32bit word boundaries as far as the CPU knows. This is done for convenience and simplicity of hardware, but it means that we need to ensure our library code knows how to deal with this.

First up, to convert an address in the PMA into something which the CPU can use we need to know where in the CPU's address space the PMA is. Fortunately this is fixed at 0x4000_6000. Secondly we need to know what address in the PMA we wish to access, so we can determine which 16bit word that is, and thus what the address is as far as the CPU is concerned. If we assume we only ever want to access 16bit entries, we can just multiply the PMA offset by two before adding it to the PMA base address. So, to access the 16bit word at byte-offset 8 in the PMA, we'd look for the 16bit word at 0x4000_6000 + (0x08 * 2) => 0x4000_6010.

Bundling the PMA into something we can use

I said we'd do some Rust, and so we shall…

    // Thanks to the work by Jorge Aparicio, we have a convenient wrapper
    // for peripherals which means we can declare a PMA peripheral:
    pub const PMA: Peripheral<PMA> = unsafe { Peripheral::new(0x4000_6000) };

    // The PMA struct type which the peripheral will return a ref to
    pub struct PMA {
        pma_area: PMA_Area,
    }

    // And the way we turn that ref into something we can put a useful impl on
    impl Deref for PMA {
        type Target = PMA_Area;
        fn deref(&self) -> &PMA_Area {
            &self.pma_area
        }
    }

    // This is the actual representation of the peripheral, we use the C repr
    // in order to ensure it ends up packed nicely together
    #[repr(C)]
    pub struct PMA_Area {
        // The PMA consists of 256 u16 words separated by u16 gaps, so lets
        // represent that as 512 u16 words which we'll only use every other of.
        words: [VolatileCell<u16>; 512],
    }

That block of code gives us three important things. Firstly a peripheral object which we will be able to (later) manage nicely as part of the set of peripherals which RTFM will look after for us. Secondly we get a convenient packed array of u16s which will be considered volatile (the compiler won't optimise around the ordering of writes etc). Finally we get a struct on which we can hang an implementation to give our PMA more complex functionality.

A useful first pair of functions would be to simply let us get and put u16s in and out of that word array, since we're only using every other word…

    impl PMA_Area {
        pub fn get_u16(&self, offset: usize) -> u16 {
            assert!((offset & 0x01) == 0);
            self.words[offset].get()
        }

        pub fn set_u16(&self, offset: usize, val: u16) {
            assert!((offset & 0x01) == 0);
            self.words[offset].set(val);
        }
    }

These two functions take an offset in the PMA and return the u16 word at that offset. They only work on u16 boundaries and as such they assert that the bottom bit of the offset is unset. In a release build, that will go away, but during debugging this might be essential. Since we're only using 16bit boundaries, this means that the first word in the PMA will be at offset zero, and the second at offset two, then four, then six, etc. Since we allocated our words array to expect to use every other entry, this automatically converts into the addresses we desire.

If we pop (and please don't worry about the unsafe{} stuff for now):

    unsafe { (&*usb::pma::PMA.get()).set_u16(4, 64); }

into our main function somewhere, and then build and objdump our test binary we can see the following set of instructions added:

 80001e4:   f246 0008   movw    r0, #24584  ; 0x6008
 80001e8:   2140        movs    r1, #64 ; 0x40
 80001ea:   f2c4 0000   movt    r0, #16384  ; 0x4000
 80001ee:   8001        strh    r1, [r0, #0]

This boils down to a u16 write of 0x0040 (64) to the address 0x4006008 which is the third 32 bit word in the CPU's view of the PMA memory space (where offset 4 is the third 16bit word) which is exactly what we'd expect to see.

We can, from here, build up some functions for manipulating a BTABLE, though the most useful ones for us to take a look at are the RX counter functions:

    pub fn get_rxcount(&self, ep: usize) -> u16 {
        self.get_u16(BTABLE + (ep * 8) + 6) & 0x3ff
    }

    pub fn set_rxcount(&self, ep: usize, val: u16) {
        assert!(val <= 1024);
        let rval: u16 = {
            if val > 62 {
                assert!((val & 0x1f) == 0);
                (((val >> 5) - 1) << 10) | 0x8000
            } else {
                assert!((val & 1) == 0);
                (val >> 1) << 10
            }
        };
        self.set_u16(BTABLE + (ep * 8) + 6, rval)
    }

The getter is fairly clean and clear, we need the BTABLE base in the PMA, add the address of the USB_COUNTn_RX entry to that, retrieve the u16 and then mask off the bottom ten bits since that's the size of the relevant field.

The setter is a little more complex, since it has to deal with the two possible cases, this isn't pretty and we might be able to write some better peripheral structs in the future, but for now, if the length we're setting is 62 or less, and is divisible by two, then we put a zero in the top bit, and the number of 2-byte lumps in at bits 14:10, and if it's 64 or more, we mask off the bottom to check it's divisible by 32, and then put the count (minus one) of those blocks in, instead, and set the top bit to mark it as such.

Fortunately, when we set constants, Rust's compiler manages to optimise all this very quickly. For a BTABLE at the bottom of the PMA, and an initialisation statement of:

    unsafe { (&*usb::pma::PMA.get()).set_rxcount(1, 64); }

then we end up with the simple instruction sequence:

80001e4:    f246 001c   movw    r0, #24604  ; 0x601c
80001e8:    f44f 4104   mov.w   r1, #33792  ; 0x8400
80001ec:    f2c4 0000   movt    r0, #16384  ; 0x4000
80001f0:    8001        strh    r1, [r0, #0]

We can decompose that into a C like *((u16*)0x4000601c) = 0x8400 and from there we can see that it's writing to the u16 at 0x1c bytes into the CPU's view of the PMA, which is 14 bytes into the PMA itself. Since we know we set the BTABLE at the start of the PMA, it's 14 bytes into the BTABLE which is firmly in the EP1 entries. Specifically it's USB_COUNT1_RX which is what we were hoping for. To confirm this, check out page 651 of the datasheet. The value set was 0x8400 which we can decompose into 0x8000 and 0x0400. The first is the top bit and tells us that BL_SIZE is one, and thus the blocks are 32 bytes long. Next the 0x4000 if we shift it right ten places, we get the value 2 for the field NUM_BLOCK and multiplying 2 by 32 we get the 64 bytes we asked it to set as the size of the RX buffer. It has done exactly what we hoped it would, but the compiler managed to optimise it into a single 16 bit store of a constant value to a constant location. Nice and efficient.

Finally, let's look at what happens if we want to write a packet into the PMA. For now, let's assume packets come as slices of u16s because that'll make our life a little simpler:

    pub fn write_buffer(&self, base: usize, buf: &[u16]) {
        for (ofs, v) in buf.iter().enumerate() {
            self.set_u16(base + (ofs * 2), *v);
        }
    }

Yes, even though we're deep in no_std territory, we can still get an iterator over the slice, and enumerate it, getting a nice iterator of (index, value) though in this case, the value is a ref to the content of the slice, so we end up with *v to deref it. I am sure I could get that automatically happening but for now it's there.

Amazingly, despite using iterators, enumerators, high level for loops, function calls, etc, if we pop:

    unsafe { (&*usb::pma::PMA.get()).write_buffer(0, &[0x1000, 0x2000, 0x3000]); }

into our main function and compile it, we end up with the instruction sequence:

80001e4:    f246 0000   movw    r0, #24576  ; 0x6000
80001e8:    f44f 5180   mov.w   r1, #4096   ; 0x1000
80001ec:    f2c4 0000   movt    r0, #16384  ; 0x4000
80001f0:    8001        strh    r1, [r0, #0]
80001f2:    f44f 5100   mov.w   r1, #8192   ; 0x2000
80001f6:    8081        strh    r1, [r0, #4]
80001f8:    f44f 5140   mov.w   r1, #12288  ; 0x3000
80001fc:    8101        strh    r1, [r0, #8]

which, as you can see, ends up being three sequential halfword stores directly to the right locations in the CPU's view of the PMA. You have to love seriously aggressive compile-time optimisation :-)

Hopefully, by next time, we'll have layered some more pleasant routines on our PMA code, and begun a foray into the setup necessary before we can begin handling interrupts and start turning up on a USB port.