Arduino Supersaw Polysynth

So we had this synth-jam-meeting in Malmö, Sweden this weekend.
And as a proof-of-concept / fun project we built a polyphonic arduino synth.
At present it features 4 voices, each with 4 detuned saw oscillators in unison,
a sweet sounding 4-pole LP/HP/BP switched-capacitor filter, portamento, pitchbend
and volume shaping.

Here’s a little video of it in action during the final stages of the build

Actually, it was just one person (not me) who did all the hardware design and software…
In just three days, I might add.

I mean, holy crap!

Cool stuff! I like how you keep the pizza sallad close to the keyboard so that you don’t have to eat any spectators in case of a sudden hunger attack.

Pizza sallad is a vital component :slight_smile:

The creator of the synth has published schematics and code.

In December 2009 I developed Shruti-4, a 4-voice paraphonic version of the Shruti-1 firmware.

I dropped it after 1 week. To me, the deal-breakers were:

  • The aliasing. Obviously on an AVR with 4 voices of polyphony you cannot go the whole interpolated wavetable thing at full sample rate ; so it was either full sample-rate “naive” phase counters ; or half-sample rate, bleak sounding, band-limited wavetables.
  • Mono-oscillator. While I love 1 osc synths for basses and leads, I prefer complex timbres for polyphony.
  • The whole paraphonic thing. It felt so awkward…
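For readers who have not met the term, a "naive" phase counter is just a 16-bit accumulator whose top byte is used directly as the saw sample. The sketch below is only an illustration: the 31250 Hz sample rate (16 MHz / 512 cycles) and all names are assumptions, not taken from either firmware.

```cpp
#include <cstdint>

// Assumed sample rate: 16 MHz CPU clock / 512 cycles per sample.
constexpr uint32_t kSampleRate = 31250;

// Phase increment for a 16-bit accumulator: f * 2^16 / fs, rounded.
constexpr uint16_t PhaseIncrement(uint32_t frequency_hz) {
  return static_cast<uint16_t>(
      (frequency_hz * 65536UL + kSampleRate / 2) / kSampleRate);
}

// Hypothetical naive saw oscillator.
struct NaiveSaw {
  uint16_t phase = 0;
  uint16_t increment = 0;

  // Advance by one sample and return the top 8 bits as an unsigned saw.
  // Nothing here is band-limited, so harmonics above fs/2 fold back
  // into the audible range: that is the aliasing discussed above.
  uint8_t Next() {
    phase += increment;  // wraps modulo 65536
    return static_cast<uint8_t>(phase >> 8);
  }
};
```

The appeal is the cost: one 16-bit add per oscillator per sample, with no table lookup or interpolation.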

Ah… didn’t know about the Shruthi-4 :slight_smile:

Yes, quite a bit of aliasing with the syltsynth… But, even so… it sounds very nice.
Especially considering the extreme minimalist design, something like 6 components for the filter and D/A plus the arduino :slight_smile:

There are some improvements to be made, but overall it was just a concept, and shouldn’t be taken too seriously.

This is the least serious project I’ve ever done I think.

It was just for fun and nothing else and seriously, it has a 5% precision resistor ladder D/A.
That alone should tell you a lot about the unpretentiousness of the design.
I believe my own words describing the design were “a cheap abomination designed by a musical illiterate”.

Aliasing you say? You don’t know what aliasing is until you’ve heard this sucker ^^

I make no assertions whatsoever as to the quality of sound that this clusterfuck-of-a-synth produces.

Being a programmer with no knowledge of synth concepts I also haven’t the slightest clue what a phase counter, bleak sounding or band-limited wavetable is. Timbre - that’s wood right? Paraphonic? Sounds awesome!

I’ll say this though, it was fun to build and to see people enjoying what it shat out the other end.

Sounds lo-fi as hell and it fucks a lot more with the auditory parts of my brain than I could ever have imagined but no, it’s not gonna make the Shruti its bitch any time soon.

Hey stg, great to see you here! I didn’t mean to criticize your project, and I’m sorry if you felt that way. I was just replying (in advance) to people who would think “why isn’t the Shruthi already doing this”. It would be cool indeed, and many people would be grateful, if you ported this to the Shruthi hardware :smiley:

BTW a few tips, if you want to do more of those things:
1/ you can use the SPI (or USART in SPI mode) on the AVRs to write to shift registers in a single CPU cycle (transfer is then done by the hardware in the background).
2/ you’d better strobe the latch pin at the beginning of the ISR rather than after all the writes, since I am not sure they execute in constant time.
3/ doing the audio rendering by blocks of 16 or 32 samples outside the ISR to save on interrupt prelude/postlude and to keep everything in registers (with just a ring buffer read inside the ISR) will give you a bunch of CPU cycles “for free”. Probably enough to sneak in a 5th or 6th polyphony voice :slight_smile:
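Tip 3/ can be sketched roughly as follows. This is a host-runnable illustration rather than code from the Shruthi: the buffer size, block size and every name are assumptions, and `render` stands in for whatever oscillator code produces one sample.

```cpp
#include <cstdint>

// Hypothetical single-producer / single-consumer ring buffer of samples.
// The size is a power of two so wrapping is a cheap bit mask.
constexpr uint8_t kBufferSize = 64;
constexpr uint8_t kBlockSize = 16;

struct AudioBuffer {
  volatile uint8_t samples[kBufferSize];
  volatile uint8_t read_ptr = 0;   // only modified by the ISR
  volatile uint8_t write_ptr = 0;  // only modified by the main loop

  uint8_t Readable() const {
    return static_cast<uint8_t>(write_ptr - read_ptr) & (kBufferSize - 1);
  }
  // One slot is kept free so that "full" and "empty" stay distinguishable.
  uint8_t Writable() const {
    return static_cast<uint8_t>(kBufferSize - 1 - Readable());
  }
};

// What the timer ISR would do: pop one sample and nothing else.
// No underrun check: if the main loop falls behind, stale data is replayed.
inline uint8_t IsrPopSample(AudioBuffer& b) {
  uint8_t s = b.samples[b.read_ptr];
  b.read_ptr = (b.read_ptr + 1) & (kBufferSize - 1);
  return s;
}

// What the main loop would do: render a whole block when there is room.
// Returns true if a block was written.
template <typename RenderFn>
bool MainLoopFillBlock(AudioBuffer& b, RenderFn render) {
  if (b.Writable() < kBlockSize) return false;
  uint8_t w = b.write_ptr;
  for (uint8_t i = 0; i < kBlockSize; ++i) {
    b.samples[w] = render();
    w = (w + 1) & (kBufferSize - 1);
  }
  b.write_ptr = w;  // publish once, after the block is complete
  return true;
}
```

On the AVR the pop would live in the timer ISR and the fill would be called repeatedly from `loop()`; since both pointers are single bytes, each read or write of them is atomic.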

Naaaah I didn’t take it that way - I just wanted to clarify what this is and I really wouldn’t take offence regardless of what anyone had to say - it would be another thing if I had really tried to design something good, but I’d hardly use an Arduino for that :wink:

1/ Yep, I know. That’s the way it was designed at first (that’s why two of the registers’ pin numbers correspond to SPI outputs), but then I needed the A output of Timers 0 and 2 to drive the filter - which steals one of the pins required for SPI. At first I hoped to use the B output of these timers, but the hardware doesn’t support the frequency generation stuff for output B - it mostly deals with the different PWM modes.

2/ Good point, and it would work well at the moment - however it’s designed that way because I want to add another shift register sharing the same pins that will be clocked from the main loop in order to input the current state of the HP/LP/BP switch. This register will be circular (ie serial out connected to serial in) so that an interrupt clocking out eight bits will always leave the data in the input register intact. Manipulating the input register will destroy the data in the output register unless I disable interrupts during that whole procedure (which would muck up the sample rate) and make that register circular as well. I will still need to disable interrupts for the eight clock transitions, but that will have minimal effect on timing.

3/ That’s similar to what I did at first - didn’t work out for various reasons. Not sure what your reasoning is regarding saving time on the interrupt. The interrupt will still need to be called at the same rate and the entry/exit code will hence take the same amount of time. All interrupt variables are modular or static to prevent unnecessary stack pointer manipulation on entry/exit and the extra cycles associated with access to temporary variables on the stack.

The way I see it the amount of processing will be identical except that having a loop to render a number of samples outside the isr would actually steal a few cycles counting and comparing, not to mention the extra ram requirements for the ring buffer and cycles consumed checking the state of the ring buffer to determine when it’s time to push more data - or am I missing something here?

Since I’m here I might as well add a fanboyism warning: I fucking love what my friends have shown me of the Shruti - it sounds fantastic :slight_smile:

I forgot that you had only 1 UART (on the 644P it’s convenient… 1 UART for MIDI and 1 UART for the SPI master).

As for the interrupt thing: check the “register depth” of your ISR. At first sight I would say at least 25 registers, since gcc will keep all the ppcm variables in different registers to do the big sum at the end (instead of reloading them one by one to do the sums on line 253). So it’s 25 push/pop per sample on one hand. On the other hand, picking a sample from a ring buffer uses 3 registers, and that’s why it paid a lot on the Shruthi - super thin interrupt handler doing no more than the job normally done by a DMA controller :smiley:

But even without doing much re-layout, I think you could already save by reordering the code to minimize the lifecycle of some variables - reducing the register width of the code greatly:

ppcm[ 0 ][ 0 ] += pfreq[ 0 ][ 0 ];
ppcm[ 0 ][ 1 ] += pfreq[ 0 ][ 1 ];
ppcm[ 0 ][ 2 ] += pfreq[ 0 ][ 2 ];
ppcm[ 0 ][ 3 ] += pfreq[ 0 ][ 3 ];

int16_t pout = ( ( char )( ppcm[ 0 ][ 0 ] >> 8 ) )
  + ( ( char )( ppcm[ 0 ][ 1 ] >> 8 ) )
  + ( ( char )( ppcm[ 0 ][ 2 ] >> 8 ) )
  + ( ( char )( ppcm[ 0 ][ 3 ] >> 8 ) );
iout = ( ( pout * pm[ 0 ].volume ) >> 6 );

// Those will reuse the same registers as ppcm[ 0 ][ 0 ]
ppcm[ 1 ][ 0 ] += pfreq[ 1 ][ 0 ];
ppcm[ 1 ][ 1 ] += pfreq[ 1 ][ 1 ];
ppcm[ 1 ][ 2 ] += pfreq[ 1 ][ 2 ];
ppcm[ 1 ][ 3 ] += pfreq[ 1 ][ 3 ];

pout = ( ( char )( ppcm[ 1 ][ 0 ] >> 8 ) )
  + ( ( char )( ppcm[ 1 ][ 1 ] >> 8 ) )
  + ( ( char )( ppcm[ 1 ][ 2 ] >> 8 ) )
  + ( ( char )( ppcm[ 1 ][ 3 ] >> 8 ) );
iout += ( ( pout * pm[ 1 ].volume ) >> 6 );

And so on (you could even try interleaving the ppcm accumulations with the pout accumulation)…

Ah yes, I thought you meant just moving the majority of the code to outside the interrupt would be the source of more cpu time :slight_smile:

UARTs are not used for SPI (UARTs are asynchronous, SPI is synchronous) but there are actually still two interfaces: the UART (used for MIDI) and the SSP module (AVR’s version of a semi-generic synchronous interface), which requires Arduino’s pin #11. I have only three timers to generate my two filter clocks and two of them share pins with the SSP module, rendering it unavailable, which is the reason I do not use it - not the lack of an interface.

Again, I am not at all used to the avr architecture - I have not looked at the generated code or even the available instruction set for that matter but I am very aware there are many optimizations to be made. The initial code before adding volume multiplication was actually much more optimized but I expanded the code temporarily for readability at that point. Interleaving the code to optimize memory access will no doubt increase performance.

The reason I want optimizations at all is not that I want to generate additional voices but because the isr is taking so much time that midi events are dropped when there is too much incoming midi data causing notes to be locked when pitch bending for example.

I make a living writing embedded code for much larger systems and from that point of view we are either speaking slightly different languages or the avr (or gcc for the avr) doesn’t work at all like what I am used to.

From my perspective gcc shouldn’t generate even a single push/pop (unless the obvious entry/exit and possibly for operations that are not supported by hardware, 16-bit add perhaps?).

When I say push/pop I am referring to a write/read from the top of the stack.
When I say register (in regards to code) I am referring to a register in the cpu, not a ram or stack location.

There should definitely be a whole bunch of completely unnecessary reads/writes to and from static locations in the ram due to the way the code is laid out (there should not be any array indexing calculations since all references to the array use constant indexes).

Since all variables in the ISR are declared as modular (well… global actually, no static) outside the ISR, they should not be placed on the stack in the ISR and the ISR should as such have a stack depth of 0 while the lifetime of all variables should be infinite. They should certainly not be reused when they are global and even used elsewhere as well. Especially ppcm cannot be reused since it is accumulative between calls. pout and iout, however, could - with proper code ordering.

There are also other things that should be taken care of such as the copy from voice 0 to voice 1 during monophonic mode not being atomic which could cause a shift between the waveforms every time the interrupt is called in between these operations but I really mean it when I say I haven’t done much to optimize or take care of any such things - this is a hack n slash coding in every way :slight_smile:

> UARTs are not used for SPI

Actually, you can configure the UART module on the ATMega to be used as a SPI master :slight_smile: Well, I wasn’t right to call it a UART in the first place, this is a USART indeed ; so a lowly ATMega328p can have two SPI master peripherals, one with the “proper” SPI module, and one with the USART (which is, surprisingly, faster in some situations).

> the isr is taking so much time that midi events are dropped when there is too much incoming midi data causing notes to be locked when pitch bending for example.

Two solutions to that :

  • Use a non-blocking ISR (further interrupts are still allowed while the handler is executed) so that the UART RX handler interrupt can be triggered even while the “audio” timer interrupt runs. They are declared with ISR ( TIMER1_OVF_vect, ISR_NOBLOCK )
  • Do not use the arduino serial thing, just poll the UART from your audio timer interrupt (I’m now using this in the Shruthi code, it’s cheaper), put the byte in a ring buffer, and pick data from the ring buffer in your main loop. If there’s already a timer interrupt happening, and going faster than the MIDI data rate, why not poll from there instead of using a dedicated RX interrupt?
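The second option could look roughly like the sketch below. To keep it runnable on a PC the UART is replaced by a stand-in struct; on a real ATmega328P the ready flag would be the RXC0 bit in UCSR0A and the data would come from UDR0. Everything named `Fake*`, the buffer size and the function names are assumptions made for illustration, not the Shruthi code.

```cpp
#include <cstdint>

// Stand-in for the UART receive hardware (hypothetical, host-side only).
struct FakeUart {
  bool byte_ready = false;
  uint8_t data = 0;
};

constexpr uint8_t kMidiBufferSize = 32;  // power of two for cheap wrapping

struct MidiBuffer {
  volatile uint8_t bytes[kMidiBufferSize];
  volatile uint8_t read_ptr = 0;   // advanced by the main loop
  volatile uint8_t write_ptr = 0;  // advanced by the audio ISR
};

// Called once per audio sample from the timer interrupt.
// No overflow check: the main loop is assumed to drain the buffer in time.
inline void PollUartFromAudioIsr(FakeUart& uart, MidiBuffer& buf) {
  if (!uart.byte_ready) return;
  buf.bytes[buf.write_ptr] = uart.data;
  buf.write_ptr = (buf.write_ptr + 1) & (kMidiBufferSize - 1);
  uart.byte_ready = false;  // on real hardware, reading UDR0 clears the flag
}

// Called from the main loop; returns -1 when nothing is pending.
inline int16_t ReadMidiByte(MidiBuffer& buf) {
  if (buf.read_ptr == buf.write_ptr) return -1;
  uint8_t b = buf.bytes[buf.read_ptr];
  buf.read_ptr = (buf.read_ptr + 1) & (kMidiBufferSize - 1);
  return b;
}
```

MIDI runs at 31250 baud with 10 bits per byte on the wire, so at most 3125 bytes arrive per second; any audio interrupt running at tens of kHz polls far more often than bytes can arrive.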

> When I say push/pop I am referring to a write/read from the top of the stack.
> When I say register (in regards to code) I am referring to a register in the cpu, not a ram or stack location.

Yes, that’s what I am referring to :slight_smile: I’ve written a few compilers/code generators in the past (university and commercial products), so I’m a bit obsessed about how things are turned into code. The AVR behaves like many load/store architectures. Variables are loaded into registers from memory, processed in registers, and stored back in memory. If your code does something like this:

// do something with A
// do something with B
// do something with C
// compute A + B + C

There are two options for the compiler: keep A, B and C loaded in different registers to save a load at the 4th step; or reload them from memory at the 4th step. The first option is often the one picked by the compiler because memory load/save are slow ; but the downside is that A will “occupy” a register until step 4. So this code has a “register depth” of 3 ; and that’s the “lifetime” I mentioned. There’s a significant chunk of code during which A will be maintained “alive” in a register.

This variant:

// do something with A
// acc += A
// do something with B
// acc += B
// do something with C
// acc += C

With this code, the compiler can do a much better job : one register is used to hold acc ; and one register is used to hold A, then reused for B, then reused for C. Each variable has a shorter “lifetime” because it needs to be held in a register for a short number of instructions.

And that’s pretty much why the reordered version I gave you is likely to perform better.
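The two shapes can be written out as real functions to compare the generated assembly (for instance with avr-objdump). This is only a sketch with made-up names; both variants compute the same value, and whether a given gcc version and optimization level really allocates registers as described has to be checked in the listing.

```cpp
#include <cstdint>

// Illustrative data layout; not the synth's actual variables.
struct Osc {
  uint16_t phase;
  uint16_t increment;
};

// Variant 1: update A, B and C first, then do the "big sum" at the end.
// All three updated phases are needed again in the final expression,
// which invites the compiler to keep them live in separate registers.
inline int16_t SumAtEnd(Osc (&o)[3]) {
  o[0].phase += o[0].increment;  // do something with A
  o[1].phase += o[1].increment;  // do something with B
  o[2].phase += o[2].increment;  // do something with C
  return static_cast<int16_t>((o[0].phase >> 8) + (o[1].phase >> 8) +
                              (o[2].phase >> 8));  // compute A + B + C
}

// Variant 2: fold each value into the accumulator right after updating it.
// Each phase is dead as soon as it has been added, so its registers
// can be reused for the next oscillator.
inline int16_t SumInterleaved(Osc (&o)[3]) {
  int16_t acc = 0;
  o[0].phase += o[0].increment;
  acc += o[0].phase >> 8;
  o[1].phase += o[1].increment;
  acc += o[1].phase >> 8;
  o[2].phase += o[2].increment;
  acc += o[2].phase >> 8;
  return acc;
}
```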

> From my perspective gcc shouldn’t generate even a single push/pop (unless the obvious entry/exit and possibly for operations that are not supported by hardware, 16-bit add perhaps?).

All the phase accumulator incrementations in your ISR are translated into the following code:

// update pcm values for all 16 polyphonys
// ppcm[ 0 ][ 0 ] += pfreq[ 0 ][ 0 ];
lds r18, 0x08CC
lds r19, 0x08CD
lds r24, 0x08A9
lds r25, 0x08AA
add r18, r24
adc r19, r25
std Y+4, r19 ; 0x04
std Y+3, r18 ; 0x03
sts 0x08AA, r19
sts 0x08A9, r18
// ppcm[ 0 ][ 1 ] += pfreq[ 0 ][ 1 ];
lds r30, 0x08CE
lds r31, 0x08CF
lds r24, 0x08AB
lds r25, 0x08AC
add r30, r24
adc r31, r25
sts 0x08AC, r31
sts 0x08AB, r30
// ppcm[ 0 ][ 2 ] += pfreq[ 0 ][ 2 ];
lds r22, 0x08D0
lds r23, 0x08D1
lds r24, 0x08AD
lds r25, 0x08AE
add r22, r24
adc r23, r25
std Y+2, r23 ; 0x02
std Y+1, r22 ; 0x01
sts 0x08AE, r23
sts 0x08AD, r22

// ppcm[ 2 ][ 0 ] += pfreq[ 2 ][ 0 ];
lds r6, 0x08DC
lds r7, 0x08DD
lds r24, 0x08B9
lds r25, 0x08BA
add r6, r24
adc r7, r25
sts 0x08BA, r7
sts 0x08B9, r6

You see the pattern? gcc is reusing different registers for each incrementation, because this allows each ppcm to be accessed later for the “big sum” (pout) without a reload from memory. The downside is that your ISR uses pretty much all the registers.

And thus, this is no surprise that your ISR starts with:

push r1
push r0
in r0, 0x3f ; 63
push r0
eor r1, r1
push r2
push r3
push r4
push r5
push r6
push r7
push r8
push r9
push r10
push r11
push r12
push r13
push r14
push r15
push r16
push r17
push r18
push r19
push r20
push r21
push r22
push r23
push r24
push r25
push r26
push r27
push r30
push r31
push r29
push r28

> From my perspective gcc shouldn’t generate even a single push/pop (unless the obvious entry/exit and possibly for operations that are not supported by hardware, 16-bit add perhaps?).

What I was talking about from the beginning is the entry/exit. If you write your code in such a way that the compiler keeps many different variables alive in registers, the ISR intro/outro will have to push/pop all 32 registers of your CPU, and there’s no magical instruction to do that in 5 cycles… This amounts to 32 (regs) x 2 (push/pop) x 2 (cycles for the push/pop instruction) = 128 ; out of the 512 you have for audio rendering that’s a hefty 25% ISR tax! If you rearrange your code so that everything can be done with 6 or 7 registers (that’s what the re-arranged code does), you’ll save a bunch of push/pop pairs per sample. This can be further improved by moving the audio rendering code out of the ISR so that: 1/ your ISR intro/outro becomes super light (3 or 4 push/pop) ; 2/ you can take advantage of doing the rendering by blocks (for example, the values of ppcm are at the moment read from memory, incremented, written back to memory for each sample… If you do the rendering in a 16-sample loop, you can load the 8 of them in local variables, do the loop, and copy them back to memory outside the loop ; so they will be kept in registers inside the loop - that’s what happens for the phase accumulators in the Shruthi - for most oscillators they are just kept in registers).
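Both points can be put into a short sketch. The cycle figures are the ones quoted above; `RenderBlock`, its signature and the 4-oscillator layout are assumptions chosen for illustration, not the Shruthi's actual renderer.

```cpp
#include <cstdint>

// Cycle budget quoted above: 512 cycles per sample.
constexpr uint16_t kCyclesPerSample = 512;

// Cost of saving and restoring `regs` registers at 2 cycles per push or pop.
constexpr uint16_t ContextCost(uint8_t regs) { return regs * 2 * 2; }

static_assert(ContextCost(32) == 128, "32 registers cost 128 cycles");
static_assert(ContextCost(32) * 100 / kCyclesPerSample == 25, "25% ISR tax");
static_assert(ContextCost(4) == 16, "a thin ISR costs 16 cycles");

constexpr uint8_t kBlockSize = 16;

// Hypothetical block renderer for one voice of 4 saws. The phases are
// copied into locals once, updated kBlockSize times, and written back
// once, so inside the loop they can stay in registers instead of being
// loaded from and stored to RAM for every sample.
inline void RenderBlock(uint16_t (&phase)[4], const uint16_t (&increment)[4],
                        int16_t (&out)[kBlockSize]) {
  uint16_t p0 = phase[0], p1 = phase[1], p2 = phase[2], p3 = phase[3];
  const uint16_t i0 = increment[0], i1 = increment[1];
  const uint16_t i2 = increment[2], i3 = increment[3];
  for (uint8_t n = 0; n < kBlockSize; ++n) {
    p0 += i0;
    p1 += i1;
    p2 += i2;
    p3 += i3;
    // Sum of four unsigned saws, re-centred around zero (4 x 128 = 512).
    out[n] = static_cast<int16_t>((p0 >> 8) + (p1 >> 8) + (p2 >> 8) +
                                  (p3 >> 8) - 512);
  }
  phase[0] = p0;
  phase[1] = p1;
  phase[2] = p2;
  phase[3] = p3;
}
```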

> Since all variables in the ISR are declared as modular (well… global actually, no static) outside the ISR, they should not be placed on the stack in the ISR and the ISR should as such have a stack depth of 0 while the lifetime of all variables should be infinite. They should certainly not be reused when they are global and even used elsewhere as well. Especially ppcm cannot be reused since it is accumulative between calls. pout however and iout however, could - with proper code ordering.

I was talking about lifetime / depth from the point of view of register allocation.

> Ah yes, I thought you meant just moving the majority of the code to outside the interrupt would be the source of more cpu time :slight_smile:

Yes, I meant it. It paid handsomely when I did this for the Shruthi.

> Duuude… I had no idea AVR had so many registers.

When reading your messages I thought “hmmm, looks like this guy works with PICs”, but I did not want to start an AVR vs PIC flamewar. I’ve recently worked closely with 8-bit PIC code (not writing it myself, but looking at it for bugs or optimizations) and pretty much all the optimizations and ideas that I found while working on the Shruthi did not apply there (the ISR intro/outro is < 10 cycles, there’s only the program counter, the flags and the W register to save and it’s all done by a single instruction).

On ARM the push statement is encoded as a bitfield iirc, so you can do “Push {r0-r12,r14}” all at once! On SPARC you have register windows… but on AVR you are stuck with super expensive context switches, and that’s pretty much why they have gone with this awkward event system / turbocharged DMA thing on the XMega to do all kind of basic plumbing without ISR.

> The volume is already “kind of” 6-bit - it goes from 0x00-0x40 (not 0x3F). I could certainly scale it differently to utilize an 8-bit shift for performance but that wouldn’t scale neatly ie: ((0xFF*0xFF)>>8)!=0xFF while ((0xFF*0x40)>>6)==0xFF so there is actually a thought behind this

Yes, this is not mathematically exact. The Shruthi has a few of those “forced” -0.04dB (255 / 256) attenuations in the signal processing. Maybe it is responsible for its “digital grittiness”?
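The difference between the two scalings is easy to check on a PC. A minimal sketch (the function names are made up for illustration):

```cpp
#include <cstdint>

// Gain in 0x00..0x40, where 0x40 is exactly unity. Needs a 6-bit shift.
constexpr uint8_t ScaleBy6Bit(uint8_t x, uint8_t gain) {
  return static_cast<uint8_t>((x * gain) >> 6);
}

// Gain in 0x00..0xFF. The shift by 8 amounts to taking the high byte of
// the product, but the maximum gain is 255/256 rather than unity.
constexpr uint8_t ScaleBy8Bit(uint8_t x, uint8_t gain) {
  return static_cast<uint8_t>((x * gain) >> 8);
}

static_assert(ScaleBy6Bit(0xFF, 0x40) == 0xFF, "full scale is preserved");
static_assert(ScaleBy8Bit(0xFF, 0xFF) == 0xFE, "full scale loses one step");
```

The 255/256 factor works out to 20·log10(255/256) ≈ -0.034 dB per stage, in line with the rounded figure quoted above.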

> The idea for this project was to do a 20-minute demo on how to quickly build a (filter-less) polyphonic square wave synth using only “standardized” components and libraries. That’s why I used the MIDI library instead of polling the UART. Damn, I feel like I am spending most of my typing explaining myself - I guess I’m just trying to point out that I am not the fool this code makes me look like :wink:

Sure, but there’s always the game of squeezing more features :slight_smile:

> The only part I fail to understand is why avoiding signed extensions should speed things up. Seems to me an 8-bit shift right, addition and subtraction will compile the same for signed and unsigned data and for the >> 6 AVR seems to have an arithmetic shift right instruction.

Your original code casts to signed because you want your saw to be in the -128..127 range (so the final sum is in the -512..508 range). There’s nothing wrong with that (it is necessary to apply the volume after all), so I don’t think we can avoid the -512. But let’s look closely at this:

int16_t x = static_cast<int8_t>(foo >> 8);

The problem is not with the static_cast<int8_t>(foo >> 8) itself, but in assigning an int8_t to an int16_t. The most significant byte (MSB) of the int16_t will be 0xff if the int8_t is negative, or 0x00 if it is positive. That’s sign extension. Assigning a uint8_t to an int16_t is much cheaper, since the MSB will always be 0.

Let’s look at the generated code for

int8_t x;
int16_t y;
y += x;

// r24: x ; r18:r19: y ; r25: MSB of sign-extended version of x
// Clean the MSB
eor r25, r25
// If bit 7 of r24 is not set (= if r24 is positive), skip next statement, which sets the MSB to 0xff
sbrc r24, 7
com r25
// At this stage, r25 is 0x00 if x is positive ; 0xff if x is negative ; so r24:r25 contains the 16 bits signed version of x ; and can be added
// Proper addition
add r18, r24
adc r19, r25

And now:

volatile uint8_t x;
volatile int16_t y;
y += x;

// r24: x ; r18:r19: y
// Proper addition, r1 contains 0, so we are just adding the carry
add r18, r24
adc r19, r1

So you save 4 cycles per addition = 32 cycles => 6% ; and your code needs one less register. On a x86 you would have used MOVSX, but there’s no instruction like that (no sex :frowning: ) on the AVR.
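One subtlety when swapping the signed cast for an unsigned cast plus a constant offset: the two saws cover the same -128..127 range but are not sample-for-sample identical. As far as can be told from the arithmetic, they are the same waveform half a cycle apart, which should not matter for free-running oscillators. The sketch below (hypothetical function names) makes the relationship explicit.

```cpp
#include <cstdint>

// Original form: reinterpret the top byte of the phase as signed.
// (Conversion of values above 127 to int8_t wraps on gcc, and is
// guaranteed to wrap from C++20 on.)
inline int16_t SawSigned(uint16_t phase) {
  return static_cast<int8_t>(phase >> 8);
}

// Suggested form: take the top byte as unsigned, then remove the offset.
// No sign extension is needed before the 16-bit addition.
inline int16_t SawUnsignedMinusOffset(uint16_t phase) {
  return static_cast<int16_t>(static_cast<uint8_t>(phase >> 8) - 128);
}

// The two agree once the phase is shifted by half a cycle (0x8000).
inline bool SameUpToHalfCycle(uint16_t phase) {
  return SawSigned(phase) ==
         SawUnsignedMinusOffset(static_cast<uint16_t>(phase + 0x8000));
}
```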

I’d be curious to know how the performance improves with code like this, optimized for register allocation. Note that I have avoided a sign extension when removing the “DC offset” of the saw. I just subtract 128 x 4.

int16_t iout = 0;
int16_t pout;

// update pcm values for all 16 polyphonys
pout = -512;
ppcm[ 0 ][ 0 ] += pfreq[ 0 ][ 0 ];
pout += static_cast<uint8_t>(ppcm[ 0 ][ 0 ] >> 8);
ppcm[ 0 ][ 1 ] += pfreq[ 0 ][ 1 ];
pout += static_cast<uint8_t>(ppcm[ 0 ][ 1 ] >> 8);
ppcm[ 0 ][ 2 ] += pfreq[ 0 ][ 2 ];
pout += static_cast<uint8_t>(ppcm[ 0 ][ 2 ] >> 8);
ppcm[ 0 ][ 3 ] += pfreq[ 0 ][ 3 ];
pout += static_cast<uint8_t>(ppcm[ 0 ][ 3 ] >> 8);
iout += ( ( pout * pm[ 0 ].volume ) >> 6 );

pout = -512;
ppcm[ 1 ][ 0 ] += pfreq[ 1 ][ 0 ];
pout += static_cast<uint8_t>(ppcm[ 1 ][ 0 ] >> 8);
ppcm[ 1 ][ 1 ] += pfreq[ 1 ][ 1 ];
pout += static_cast<uint8_t>(ppcm[ 1 ][ 1 ] >> 8);
ppcm[ 1 ][ 2 ] += pfreq[ 1 ][ 2 ];
pout += static_cast<uint8_t>(ppcm[ 1 ][ 2 ] >> 8);
ppcm[ 1 ][ 3 ] += pfreq[ 1 ][ 3 ];
pout += static_cast<uint8_t>(ppcm[ 1 ][ 3 ] >> 8);
iout += ( ( pout * pm[ 1 ].volume ) >> 6 );

pout = -512;
ppcm[ 2 ][ 0 ] += pfreq[ 2 ][ 0 ];
pout += static_cast<uint8_t>(ppcm[ 2 ][ 0 ] >> 8);
ppcm[ 2 ][ 1 ] += pfreq[ 2 ][ 1 ];
pout += static_cast<uint8_t>(ppcm[ 2 ][ 1 ] >> 8);
ppcm[ 2 ][ 2 ] += pfreq[ 2 ][ 2 ];
pout += static_cast<uint8_t>(ppcm[ 2 ][ 2 ] >> 8);
ppcm[ 2 ][ 3 ] += pfreq[ 2 ][ 3 ];
pout += static_cast<uint8_t>(ppcm[ 2 ][ 3 ] >> 8);
iout += ( ( pout * pm[ 2 ].volume ) >> 6 );

pout = -512;
ppcm[ 3 ][ 0 ] += pfreq[ 3 ][ 0 ];
pout += static_cast<uint8_t>(ppcm[ 3 ][ 0 ] >> 8);
ppcm[ 3 ][ 1 ] += pfreq[ 3 ][ 1 ];
pout += static_cast<uint8_t>(ppcm[ 3 ][ 1 ] >> 8);
ppcm[ 3 ][ 2 ] += pfreq[ 3 ][ 2 ];
pout += static_cast<uint8_t>(ppcm[ 3 ][ 2 ] >> 8);
ppcm[ 3 ][ 3 ] += pfreq[ 3 ][ 3 ];
pout += static_cast<uint8_t>(ppcm[ 3 ][ 3 ] >> 8);
iout += ( ( pout * pm[ 3 ].volume ) >> 6 );

// limit
if( iout > 2047 ) iout = 2047;
if( iout < -2048 ) iout = -2048;
// scale and convert to unsigned
out = ( ( char )( iout >> 4 ) ) ^ 0x80;
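The final limit-and-convert step can be checked in isolation on a PC. A small sketch, with a made-up function name, that mirrors the last three statements above:

```cpp
#include <cstdint>

// Clamp the 16-bit mix to 12 bits, keep the top 8 of those, and flip the
// sign bit to turn the signed value into an unsigned (offset-binary) byte.
// Relies on >> being an arithmetic shift for negative values, which is
// what gcc does.
inline uint8_t MixToDacByte(int16_t iout) {
  if (iout > 2047) iout = 2047;
  if (iout < -2048) iout = -2048;
  return static_cast<uint8_t>((iout >> 4) ^ 0x80);
}
```

Silence (0) lands on 0x80, and anything beyond the 12-bit range saturates at 0x00 or 0xFF instead of wrapping around.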

If you can live with a 6 bit volume, scale it before, and use >> 8 (no barrel shifter on the AVR :frowning: ). This reduced the code size by 318 so there are at least 120 CPU cycles saved here… 150 or 160 if you remove the >> 6.

Duuude… I had no idea AVR had so many registers. That’s the only point where we were miscommunicating. I just read through the AVR sheets on the architecture. Your initially very strange suggestions now make A LOT of sense. This will definitely change the way I write code for AVR.

Honestly, “at first I was like WTF (is this guy going on about) but then i LOL’d” to quote the meme :wink:
Thank you so much for taking the time to explain what I should have RTFM’d myself.

It seems we are very much alike. I too have a strange fetishism with how things are translated into machine code and I also enjoy writing and optimizing compilers and code alike :smiley:

Also, (regarding static_cast) I had no idea the Arduino was actually compiling C++. Thanks again!

The volume is already “kind of” 6-bit - it goes from 0x00-0x40 (not 0x3F). I could certainly scale it differently to utilize an 8-bit shift for performance but that wouldn’t scale neatly ie: ((0xFF*0xFF)>>8)!=0xFF while ((0xFF*0x40)>>6)==0xFF so there is actually a thought behind this - it imitates how I usually implement integer DSP multiplication when dealing with discrete or FPGA logic. Regardless, the velocity from MIDI is scaled to 0x00-0x3F anyway which breaks that thought completely and seriously - on this sucker performance is vastly more important than the slight quality loss from what you are suggesting. Your way is better hands down in this instance.

The idea for this project was to do a 20-minute demo on how to quickly build a (filter-less) polyphonic square wave synth using only “standardized” components and libraries. That’s why I used the MIDI library instead of polling the UART. Damn, I feel like I am spending most of my typing explaining myself - I guess I’m just trying to point out that I am not the fool this code makes me look like :wink:

The only part I fail to understand is why avoiding signed extensions should speed things up. Seems to me an 8-bit shift right, addition and subtraction will compile the same for signed and unsigned data and for the >> 6 AVR seems to have an arithmetic shift right instruction.

I’d really appreciate if you could find the time to explain what I am missing here, because if this was an error of thought performance could be boosted even further by removing the initial -512 assignment and shoving the static_cast<uint8_t>(ppcm[ n ][ 0 ] >> 8) result directly into pout and adding the remaining three calculations.

To save myself from embarrassment I feel the urge to note that I have absolutely no experience with the AVR architecture whatsoever, apart from what little I’ve done on the Arduino and that is something I really never do unless for instances where I want non-programmers to be able to replicate my work such as in this case.

I work primarily with x86(-64), PIC-RISC and eZ80 and none of those architectures have more than very few general purpose registers. It’s something I’ve missed from the 68k time and one of the reasons I really like ARM.

Well, in conclusion, you are the hero of the day and that certainly deserves a mention in the project.

Gotcha, did not think that one through - thanks again!