Squeeze extra performance out of mixed-radix algorithm

When profiling with `perf`, a large share of the time (40-60% of the entire transform) seems to be spent in the very first "narrow SIMD" pass, where the stride is too small to fill an entire SIMD vector. There is currently an optimization for radix-4 on AVX, but even with it the performance of that pass is underwhelming.
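
To make the "narrow" condition concrete, here is a minimal sketch that lists which passes have a stride smaller than the SIMD vector width. It assumes a decomposition where the stride starts at 1 and is multiplied by the radix after each pass. The function names `pass_strides` and `narrow_passes` are hypothetical and are not taken from the actual code. A 256-bit AVX register holds 4 complex `f32` values or 2 complex `f64` values, so `lanes` would be 4 or 2 respectively.

```rust
/// Stride seen by each pass, assuming it starts at 1 and grows by
/// the radix of every pass that has already run (an assumption of
/// this sketch, not necessarily how the real planner orders passes).
fn pass_strides(radices: &[usize]) -> Vec<usize> {
    let mut strides = Vec::with_capacity(radices.len());
    let mut stride = 1;
    for &radix in radices {
        strides.push(stride);
        stride *= radix;
    }
    strides
}

/// Indices of passes whose stride is smaller than the number of
/// complex elements per SIMD vector, i.e. passes that cannot fill a
/// whole vector with same-position elements from adjacent columns.
fn narrow_passes(radices: &[usize], lanes: usize) -> Vec<usize> {
    let mut narrow = Vec::new();
    for (index, stride) in pass_strides(radices).into_iter().enumerate() {
        if stride < lanes {
            narrow.push(index);
        }
    }
    narrow
}
```

For a size-256 transform split as four radix-4 passes, `narrow_passes(&[4, 4, 4, 4], 4)` reports only pass 0 as narrow, which matches the profile above: the first pass is the only one that cannot use full-width vectors. With smaller radices, such as `[2, 2, 2]` at 4 lanes, more than one pass would be narrow under the same assumption.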