Digression - fixed point and floating point for STM32F4

This discussion was created from comments split from: Axoloti.

A side note about fixed point vs floating point…

I went crazy last night and rewrote Elements’ resonator code in fixed point - just to make sure I left no stone unturned in terms of performance. Wouldn’t it be a pleasant surprise if Elements could become duophonic due to a fantastic optimization!

Elements’ resonator loop consists of:

  • N zero-delay feedback state variable filters (the proper band-pass elements for each mode). Zero-delay feedback is absolutely necessary for this application because the filters can be very resonant, can go close to Nyquist, and the tuning needs to be accurate.
  • 8 tuned delay lines (no interpolation - the delay is an integer). The dirty secret is that Elements uses a mix of banded waveguide synthesis for bowing/blowing the first modes - and vanilla modal for the rest.
  • 2 x N evaluations of the cosine function on a periodic grid to determine the mode amplitudes and create the stereo comb-filtering effect of POSITION and SPACE. Originally I used a proper comb-filter, but it smeared transients, sounded laggy, and was terrible to modulate (ewww corny flanger) - so I decided to explicitly compute the amplitude of each mode for the left and right channels. This is a super sensitive part since it has to be done for each sample to avoid zipper noise when the POSITION parameter is modulated. It’s a cheap operation though - boiling down to one MAC - but I do it a lot.
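
To make the loop above more concrete, here is a rough floating point sketch of one channel of such a resonator. It is not the actual Elements code: the names (ZdfSvf, cosine_grid, resonator_sample) are made up for illustration, the filter is the usual trapezoidal zero-delay feedback SVF, and the "one MAC per cosine" part is assumed to be the standard recurrence cos((k+1)θ) = 2 cos(θ) cos(kθ) - cos((k-1)θ), which may or may not be what Elements does.

```cpp
#include <cmath>
#include <cstddef>

// Zero-delay feedback (trapezoidal) state variable filter.
// Hypothetical helper for illustration; only the band-pass output is returned.
struct ZdfSvf {
  float g = 0.0f, r = 0.0f, h = 0.0f;  // coefficients
  float s1 = 0.0f, s2 = 0.0f;          // integrator states

  void set(float f0, float sample_rate, float q) {
    const float kPi = 3.14159265f;
    g = std::tan(kPi * f0 / sample_rate);  // pre-warped, so tuning stays exact
    r = 1.0f / q;
    h = 1.0f / (1.0f + r * g + g * g);     // resolves the delay-free loop
  }

  float process_bp(float in) {
    float hp = (in - (r + g) * s1 - s2) * h;
    float bp = g * hp + s1;
    s1 = g * hp + bp;
    float lp = g * bp + s2;
    s2 = g * bp + lp;
    return bp;
  }
};

// Fills amp[k] = cos(k * theta) for k = 0..n-1.
// After the setup, each new value costs one multiply and one subtract.
void cosine_grid(float theta, float* amp, std::size_t n) {
  const float two_cos = 2.0f * std::cos(theta);
  float prev = std::cos(theta);  // cos(-theta)
  float cur = 1.0f;              // cos(0)
  for (std::size_t k = 0; k < n; ++k) {
    amp[k] = cur;
    float next = two_cos * cur - prev;
    prev = cur;
    cur = next;
  }
}

// One output sample for one channel: sum of the modes weighted by amp[].
float resonator_sample(ZdfSvf* modes, const float* amp, std::size_t n, float in) {
  float out = 0.0f;
  for (std::size_t k = 0; k < n; ++k) {
    out += amp[k] * modes[k].process_bp(in);
  }
  return out;
}
```

For stereo, the same modes would be weighted by two different amplitude grids (one per channel), which is where the 2 x N comes from.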

The floating point code is pure C++. The fixed point version calls an inlined smmul for int32xint32 >> 32 and smmla for += int32xint32 >> 32. I tried to use Q1.31 as much as possible, and to shift up by 1 outside loops.
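
For anyone not familiar with these two instructions, here is a portable C++ stand-in for what they compute, plus the "shift up by 1 outside the loop" trick. This is only a sketch: on the Cortex-M4 these would be single instructions reached through inline asm or intrinsics, and dot_q31 is a made-up example function, not something from the Elements source.

```cpp
#include <cstdint>

// Portable equivalent of SMMUL: top 32 bits of the 64-bit product.
// Assumes >> on a negative int64_t is an arithmetic shift (true on the
// usual compilers, guaranteed from C++20).
inline int32_t smmul(int32_t a, int32_t b) {
  return static_cast<int32_t>((static_cast<int64_t>(a) * b) >> 32);
}

// Portable equivalent of SMMLA: acc += top 32 bits of the product.
// The caller is assumed to keep enough headroom that the sum does not overflow.
inline int32_t smmla(int32_t acc, int32_t a, int32_t b) {
  return acc + static_cast<int32_t>((static_cast<int64_t>(a) * b) >> 32);
}

// Q1.31 x Q1.31 has 62 fractional bits; after >> 32 only 30 remain.
// Rather than fixing that up after every multiply, accumulate with 30
// fractional bits inside the loop and shift up by 1 once at the end.
inline int32_t dot_q31(const int32_t* a, const int32_t* b, int n) {
  int32_t acc = 0;
  for (int i = 0; i < n; ++i) {
    acc = smmla(acc, a[i], b[i]);
  }
  return acc * 2;  // back to Q1.31 (same as << 1, without shifting a negative)
}
```

With 0.5 represented as 0x40000000, smmul(0.5, 0.5) gives 0x10000000, and the final doubling turns that into 0x20000000, which is 0.25 in Q1.31.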

Here’s what I get.

Fixed point: maximum N before buffer underrun = 30
Floating point: maximum N before buffer underrun = 64

Some observations from the generated code:

  • The floating point version loads all the stuff for a group of 4 modes into registers, does all the computations, then stores everything at the end - which wouldn’t be possible in fixed point for lack of registers. There are freakishly long stretches of pipeline-friendly code consisting of just vmul.f32, vfma.f32 and vmov.f32 which are just asking to be SIMD’d. Please ARM, give us a new “digital signal controller” Cortex-M8 profile with NEON! And please ST, make a chip out of it!
  • The fixed point version weaves a bunch of load/stores with the code, which might stink from a pipelining point of view. That’s because there aren’t enough registers to hold all state variables, so they are written back to memory and reloaded frequently. Another source of slowness is the occasional extra shift (one SVF coefficient needs to be Q4.27).
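
To illustrate where the extra shift comes from, here is a small sketch of multiplying a Q1.31 value by a Q4.27 coefficient (a coefficient that can exceed 1.0 does not fit in Q1.31). The helper name mul_q31_q427 is hypothetical, and smmul is the same portable stand-in for the instruction as before.

```cpp
#include <cstdint>

// Portable stand-in for SMMUL: top 32 bits of the 64-bit product.
inline int32_t smmul(int32_t a, int32_t b) {
  return static_cast<int32_t>((static_cast<int64_t>(a) * b) >> 32);
}

// x is Q1.31 (31 fractional bits), coeff is Q4.27 (27 fractional bits).
// The product has 58 fractional bits, and 26 are left after >> 32,
// so 5 more bits of shift are needed to land back in Q1.31.
// The result is assumed to stay within [-1, 1); nothing here saturates.
inline int32_t mul_q31_q427(int32_t x, int32_t coeff) {
  return smmul(x, coeff) * 32;  // same as << 5, without shifting a negative
}
```

For example, 0.25 in Q1.31 (0x20000000) times 2.0 in Q4.27 (0x10000000) gives 0x40000000, i.e. 0.5 in Q1.31. Note that the bottom 5 bits of the result are always zero, so this costs precision as well as a cycle.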

I’ve also tried reorganizing the computation so that it needs a smaller “register depth”. It doesn’t help the fixed point much - RAM accesses are still very frequent (everything gets committed back to RAM between chunks of register-intensive computation anyway). It confuses the compiler a lot for floating point, though - because of the narrow focus on small stretches of computation whose results are interdependent.

Fixed point: N = 32
Floating point: N = 55

ARM is a fairly general purpose chip. At what point do DSP chips become worthwhile?

I notice the Cylonix Shapeshifter uses some Terasic FPGA board, big and ugly but powerful? Obviously, making a product based on such things is a risk, as they may discontinue the board.

There’s a big class of algorithms and software where it makes sense to add DSP instructions on top of the general-purpose set.

It would make good business sense to add NEON as soon as the required extra wafer size (manuf cost) is small enough and it doesn’t break the power budget too much.

Obviously this can’t come too soon if you deal with media, streaming, compression or the “small” market that is mobile phones…

> At what point do DSP chips become worthwhile?

I think I’m still operating at a point where the quirkinesses of DSPs (external memory, proprietary development tools, assembly code or rigid libraries, high power consumption) are not worth it. They would certainly be a good architecture for something like Elements, but maybe not for something like Clouds in which there’s quite a lot of “general purpose” computing (shuffling data between buffers, doing arithmetic and comparisons on sample counters…). Even Elements’ exciter section is more suitable for a general purpose architecture.

So I am not getting into DSPs because we’re very close to the point where they won’t be necessary - the point where ARM with SIMD extensions will be cheap and do everything I want (200 / 300 MHz thing with SIMD floating point and an “embedded” profile and set of peripherals). I see markets that would totally benefit from such things (smart objects with sensors that require DSP, the whole thing running with very low power - or sensor coprocessor in a mobile phone) - so it’ll happen eventually.

FPGAs are another story - power hungry (for these, an onboard switcher is a must!), unpleasant packages (it’s quite telling that Intellijel finds it more cost-effective to use a devboard on the back of the module rather than integrate the chip and chipset on their boards). They could be effective for something with a lot of parallelism like Elements’ resonator. Maybe for a granulator. Maybe something polyphonic.

I tried programming a Motorola DSP once, never again :) I guess writing the code in machine code wasn’t the best idea. But I can see that general purpose computing is a much easier environment to work with.

There’s a few interesting FPGA projects around, this one seeks to replicate many old computers.

MIST

Interesting stuff!
I did some experimental re-writes of C++ fx code in inline asm for the M4, and the memory access and pipeline dependencies (FPU instructions execute in a single cycle, but only if you don’t use the result immediately) can really hurt. NEON would have been great…

With some manual loop unrolling, load/store batching and some reorganising/tweaking, I got 30-40% improvements over gcc (4.7.4?) which indeed seems to be easily confused :)
But admittedly it was a lot of work, sometimes the ‘improvement’ was negative, and it’s possible most of the gains could have been had with mindful C++ in the first place. Plus maintainability is questionable…
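
As an illustration of the kind of "mindful C++" that can stand in for hand-written asm, here is a sketch of a dot product with batched loads and four independent accumulators, so that no multiply-accumulate has to wait on the result of the one just before it. The function name dot_unrolled is made up, and whether this actually beats the plain loop depends on the compiler version and flags, so it needs measuring on the target.

```cpp
#include <cstddef>

// Dot product, manually unrolled by 4.
// Each accumulator forms its own dependency chain, which gives the
// compiler room to interleave FPU instructions instead of stalling.
float dot_unrolled(const float* a, const float* b, std::size_t n) {
  float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    // Batch the loads first...
    float a0 = a[i], a1 = a[i + 1], a2 = a[i + 2], a3 = a[i + 3];
    float b0 = b[i], b1 = b[i + 1], b2 = b[i + 2], b3 = b[i + 3];
    // ...then do four independent multiply-accumulates.
    acc0 += a0 * b0;
    acc1 += a1 * b1;
    acc2 += a2 * b2;
    acc3 += a3 * b3;
  }
  // Leftover elements when n is not a multiple of 4.
  for (; i < n; ++i) {
    acc0 += a[i] * b[i];
  }
  return (acc0 + acc1) + (acc2 + acc3);
}
```

Note that the summation order differs from the naive loop, so the result can differ in the last bits for general floating point input.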

IIRC there was a posting on music-dsp about this topic recently too.

I’m going to be the uninformed asshole and wonder if you couldn’t just put a faster and more expensive processor in there? I’d think the impact on the final price of a module like Elements would be marginal…

GCC supports so many different CPU types that it won’t ever be totally optimal for all of them. Intel’s x86 compiler supposedly generates much faster code than GCC.

Question is what sort of faster more expensive processor?

Intel Atom? you’d need much bigger boards and higher component counts for sure. ARM chips tend to integrate most of the components you need onto one chip.

> if you couldn’t just put a faster and more expensive processor in there?

But which one do you suggest?

I’m looking for something that can be bought in hundreds or thousands from Mouser/Digikey (not something bought in 50,000 after a drunken karaoke night with a sales representative and signing an NDA), which doesn’t require expensive development tools, on which I can do bare-metal coding (not something meant to run linux or android), that is relatively low-tech in terms of PCB (no super dense BGA, no external memories and wide buses requiring super dense 8-layer PCBs), and if possible in a small-ish package that can be sneaked in between two pots on the PCB.

All these constraints point to powerful ARM microcontrollers. The STM32F4 is pretty much the best for this application. There’s a 180 MHz version that could give a tiny speed boost (about 7% over the 168 MHz part I use), though it’s only available in larger packages that would need to be on their own board on the back of the module like the erbe-verb - and goodbye skiffability.

Maybe this

The thing is the M4s are AFAIK really close to top-of-the-line as far as microcontrollers with integrated memories are concerned. Next level up means external memory and storage and really an OS.

Again, sorry for being the uninformed asshole, but is there a specific reason why there aren’t any ARM microcontrollers like the STM32F4 that do their thing at a much higher clock rate? Is this just a case of there not being a market large enough to justify developing these things?

Because there’s no market for such a thing.

The people who need more powerful stuff are making phones, TVs, media players, cameras and they don’t mind at all if the chip has to be bought in 10,000s and runs Android or some flavor of Linux.

Conversely, the people who need something lightweight, easy to integrate and low-power rarely need lots of computing power, so there’s not much incentive for chip makers to push FPU and clock rate for these chips. I suspect it will happen though, through new applications - IoT and all…

What we need for the DSP Eurorack module is very specific: very powerful for running code, but very low-tech and minimalistic when it comes to designing a board for it and getting code on it.

@pichenettes Again, I’m coming from a know-nothing angle, but could you envisage ever producing a module that uses multiple Cortex or similar processors on the same board?

a|x

Why not, but at this stage (maintaining several firmware, debugging communication issues, power use…) it could be wiser to think of something smarter…

LXR uses a Cortex M4, Braids uses an M3, not sure of the differences but I know that the LXR’s CPU comes on a daughterboard, so must be a little more bulky.

The M4 could run 5 or 6 instances of the Braids code. But it was very new at the time I had started working on the project (end of 2011), and frankly the M3 was already such an improvement over the AVRs I was using on the Shruthi that it seemed “good enough”.

I think the CPU daughterboard for the LXR is necessary for the DIYability, same for preenfm2 and p6. If you’re adventurous you can clock the STM32F4 at 192 MHz (and higher?) but that’s probably out-of-spec…

The M3s and M4s aren’t too different except for the core features (MAC, FPU), are they?