There's some matrix transpose code we could use:

- https://gist.github.com/nihui/37d98b705a6a28911d77c502282b4748#file-avx512_transpose-cpp-L253
- https://stackoverflow.com/questions/29519222/how-to-transpose-a-16x16-matrix-using-simd-instructions