Description
First of all, thank you so much for your excellent work on the WavTokenizer project! I am currently training the WavTokenizer-small model on the LibriTTS dataset with your provided configuration. Due to my hardware (a single NVIDIA 4090), I set batch_size to 10. By my count, for 3-second clips sampled at 24000 Hz, each batch contains roughly 1200 tokens (frames). During training I observe that the sum of the EMA usage statistics over the codebook entries (i.e., _ema_cluster_size) converges towards a value close to the number of recently processed tokens, which is approximately 1200 in my setup.
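To make the arithmetic concrete, here is a minimal sketch of why I expect this convergence (the decay value, the 40 tokens/s frame rate, and all variable names are my own assumptions for illustration, not taken from the repository): since the per-batch usage counts always sum to the number of tokens in the batch, their EMA sum also settles near that number.

```python
import torch

# Illustrative numbers for my setup (not the repo's exact values)
decay = 0.99                      # typical EMA decay for codebook statistics
codebook_size = 4096
tokens_per_batch = 10 * 3 * 40    # batch_size 10 * 3 s * ~40 frames/s = 1200

ema_cluster_size = torch.zeros(codebook_size)

for step in range(2000):
    # Per-step usage counts; they always sum to tokens_per_batch
    counts = torch.bincount(
        torch.randint(0, codebook_size, (tokens_per_batch,)),
        minlength=codebook_size,
    ).float()
    # Standard EMA update of the per-entry usage statistic
    ema_cluster_size = decay * ema_cluster_size + (1 - decay) * counts

print(ema_cluster_size.sum())                  # ~1200, i.e. tokens_per_batch
print((ema_cluster_size < 2).float().mean())   # fraction of entries below the dead-code threshold
```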
Given a codebook size of 4096, if the sum of _ema_cluster_size is only about 1200, then by the pigeonhole principle a large fraction of entries must have an _ema_cluster_size well below threshold_ema_dead_code (set to 2 in the code). In theory, this means a substantial number of codewords would be repeatedly marked as "dead codes" and reset; the same would still hold with the default batch_size of 40, which raises the sum of _ema_cluster_size but still leaves many individual entries below 2. However, my experiments show that even under these circumstances WavTokenizer still reconstructs speech with reasonably good quality. I would therefore like to ask about the considerations behind setting threshold_ema_dead_code to 2.
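For context, this is roughly the dead-code replacement logic I am reasoning about, written as a minimal EnCodec-style sketch (the function and variable names are mine for illustration and may not match the repository's exact implementation):

```python
import torch

def expire_dead_codes(codebook, ema_cluster_size, batch_vectors, threshold=2.0):
    """Sketch of EMA-based dead-code replacement (illustrative, not the repo's exact code).

    Entries whose EMA usage falls below `threshold` are overwritten with
    randomly sampled encoder outputs from the current batch.
    """
    dead = ema_cluster_size < threshold          # (codebook_size,) bool mask
    n_dead = int(dead.sum())
    if n_dead == 0:
        return codebook
    # Sample replacement vectors from the current batch of encoder outputs
    idx = torch.randint(0, batch_vectors.shape[0], (n_dead,))
    codebook = codebook.clone()
    codebook[dead] = batch_vectors[idx]
    return codebook
```

If most of the 4096 entries sit below the threshold, this kind of reset would be triggered for many entries at almost every step, which is what puzzles me given the good reconstruction quality.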
Thank you very much for your time and any insights you can provide!