Squeeze extra performance out of mixed-radix algorithm #3

@calebzulawski

Description

When profiling with perf, a huge amount of time (40-60% of the entire transform) seems to be spent in the very first "narrow SIMD" pass, where the stride isn't large enough to fill an entire SIMD vector. Right now there is an optimization for radix-4 on AVX, but even with that the performance is underwhelming.
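To make the "narrow" condition concrete, here is a minimal sketch, not taken from the crate. It assumes a Stockham-style layout in which the stride starts at 1 and is multiplied by the radix after every pass, and it assumes 4 complex `f32` values per 256-bit AVX register. The names `pass_strides` and `narrow_passes` are hypothetical and exist only for illustration.

```rust
/// Stride seen by each pass. ASSUMPTION: the stride starts at 1 and is
/// multiplied by that pass's radix afterwards (hypothetical layout, not
/// necessarily how the library organizes its passes).
fn pass_strides(radices: &[usize]) -> Vec<usize> {
    let mut stride = 1;
    radices
        .iter()
        .map(|&radix| {
            let current = stride;
            stride *= radix;
            current
        })
        .collect()
}

/// Indices of the passes whose stride is smaller than the number of SIMD
/// lanes, i.e. passes that cannot fill a whole vector from one contiguous
/// run of elements and therefore need a "narrow" code path.
fn narrow_passes(radices: &[usize], lanes: usize) -> Vec<usize> {
    pass_strides(radices)
        .into_iter()
        .enumerate()
        .filter(|&(_, stride)| stride < lanes)
        .map(|(pass, _)| pass)
        .collect()
}
```

Under these assumptions, a size-64 transform factored as radix 4, 4, 4 has strides 1, 4, 16, so with 4 lanes only the first pass is narrow. A radix-2 factorization of a size-16 transform has strides 1, 2, 4, 8, which would make the first two passes narrow.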

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)