
Question regarding threshold_ema_dead_code setting and its interaction with batch_size #82

@y61329697

Description


First of all, thank you so much for your excellent work on the WavTokenizer project! I am currently training the WavTokenizer-small model on the LibriTTS dataset using your provided configuration. Due to my hardware (an NVIDIA 4090), I've set the batch_size to 10. By my understanding, for 3-second audio clips sampled at 24000 Hz, each batch contains approximately 1200 tokens (frames). During training, I've observed that the sum of the EMA statistics tracking the usage frequency of each codebook entry (i.e., _ema_cluster_size) converges towards the number of recently processed tokens; in my setup, this sum is approximately 1200.
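The convergence of that sum can be verified with a few lines. This is only a sketch of the recurrence, not the project's code: the decay value of 0.99 below is an assumption (the actual config may use a different value). The EMA sum follows s ← d·s + (1 − d)·N per batch, whose fixed point is N, the number of tokens per batch.

```python
# Sketch: the sum of _ema_cluster_size obeys s <- d*s + (1-d)*N,
# whose fixed point is N, the tokens processed per batch.
decay = 0.99             # assumed decay; check the repo's actual setting
tokens_per_batch = 1200  # batch_size 10 * 3 s * ~40 tokens/s

s = 0.0
for _ in range(2000):
    s = decay * s + (1 - decay) * tokens_per_batch

print(s)  # approaches 1200 regardless of the starting value
```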

Given a codebook size of 4096, if the sum of _ema_cluster_size approaches 1200, then by the pigeonhole principle a large number of codebook entries must have an _ema_cluster_size well below threshold_ema_dead_code (set to 2 in the code). In theory this means a substantial number of codewords would be repeatedly marked as "dead codes" and reset, even with the default batch_size of 40 (which yields a larger sum for _ema_cluster_size but still leaves many individual entries below 2). However, my experiments show that even under these circumstances, WavTokenizer still reconstructs speech reasonably well. I would therefore like to ask about the considerations behind setting threshold_ema_dead_code to 2.
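To make the pigeonhole argument concrete, here is a minimal NumPy simulation (not the project's quantizer code; uniform code usage and decay 0.99 are assumptions, and real usage is typically far more skewed, which only worsens the effect for rarely used entries):

```python
import numpy as np

# Sketch: feed ~1200 tokens per batch into an EMA usage tracker over a
# 4096-entry codebook and count entries below threshold_ema_dead_code.
rng = np.random.default_rng(0)
codebook_size = 4096
tokens_per_batch = 1200
decay = 0.99       # assumed EMA decay
threshold = 2.0    # threshold_ema_dead_code from the issue

ema_cluster_size = np.zeros(codebook_size)
for _ in range(1000):
    # assumption: codes are used uniformly at random
    codes = rng.integers(0, codebook_size, tokens_per_batch)
    counts = np.bincount(codes, minlength=codebook_size)
    ema_cluster_size = decay * ema_cluster_size + (1 - decay) * counts

dead = int((ema_cluster_size < threshold).sum())
# the mean EMA per entry is 1200/4096 ~= 0.29, so under uniform usage
# essentially every entry sits below the threshold of 2
print(dead, ema_cluster_size.sum())
```

The simulation reproduces both observations from above: the sum of ema_cluster_size settles near 1200, and nearly all 4096 entries fall below the threshold of 2.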

Thank you very much for your time and any insights you can provide!
