Hi,
I'm trying to replicate the performance of Qwen2-Audio on AIR-Bench. However, I noticed that the AIR-Bench repository doesn't provide the complete test script; it only includes the inference script and the GPT-4 evaluation generation script.
Could you please clarify how the scores for the Speech, Sound, Music, and Mixed Audio metrics are obtained? It would be very helpful if you could provide the complete test script for these metrics.
Thank you for your assistance!