WavTokenizer Medium v2 versus Large v2 (Quality Issue) #75

@tanmaylaud

Hi,

I noticed that reconstruction quality is worse with the Large v2 checkpoint than with the Medium v2 checkpoint.
I want to make sure I am using the correct config and settings. Here is the script I run for both checkpoints:

import torch
import torchaudio

from decoder.pretrained import WavTokenizer
from encoder.utils import convert_audio

device = torch.device('cpu')

# The same config file is used for both checkpoints.
config_path = "configs/wavtokenizer_smalldata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
# Large v2 checkpoint:
model_path = "novateur/wavtokenizer_large_speech_320_v2.ckpt"
# OR the Medium v2 checkpoint (uncomment to switch):
# model_path = "novateur/WavTokenizer-medium-speech-75token/wavtokenizer_medium_speech_320_24k_v2.ckpt"

audio_path = "test_audio.wav"
audio_outpath = "wav_test_audio.wav"
wavtokenizer = WavTokenizer.from_pretrained0802(config_path, model_path)
wavtokenizer = wavtokenizer.to(device)

# Load the audio and convert it to 24 kHz mono.
wav, sr = torchaudio.load(audio_path)
wav = convert_audio(wav, sr, 24000, 1)
bandwidth_id = torch.tensor([0])
wav = wav.to(device)

# Encode to continuous features and discrete codes.
features, discrete_code = wavtokenizer.encode_infer(wav, bandwidth_id=bandwidth_id)
print(features.shape)
print(discrete_code.shape)
# Print the codes in chunks of 75 tokens.
for i in range(0, discrete_code.shape[-1], 75):
    print(discrete_code[:, :, i:i+75], end='\n\n')

# Decode back to a waveform and save it as 16-bit PCM.
audio_out = wavtokenizer.decode(features, bandwidth_id=bandwidth_id)
torchaudio.save(audio_outpath, audio_out, sample_rate=24000, encoding='PCM_S', bits_per_sample=16)
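
As a quick sanity check on the settings, the number of codes printed by the script can be compared against what the sample rate implies. The sketch below assumes a hop length of 320 samples, which is only inferred from the "320" in the checkpoint names and the "frame75" in the config name; `tokens_per_second` and `expected_num_tokens` are hypothetical helpers, not part of the WavTokenizer repo.

```python
import math

SAMPLE_RATE = 24000  # target rate passed to convert_audio in the script
HOP_LENGTH = 320     # assumption: inferred from "320" in the checkpoint file names


def tokens_per_second(sample_rate: int = SAMPLE_RATE, hop_length: int = HOP_LENGTH) -> float:
    """Token rate implied by one code per hop of audio."""
    return sample_rate / hop_length


def expected_num_tokens(num_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Approximate code count for a clip; encoder padding may shift it by a frame or so."""
    return math.ceil(num_samples / hop_length)
```

If `discrete_code.shape[-1]` is far from `expected_num_tokens(wav.shape[-1])`, the config and checkpoint may not match; under the assumed hop length, each 75-token chunk printed by the loop corresponds to roughly one second of audio.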

Test audio file:
https://limewire.com/d/XpsaM#fpKfVRRdUd

Notice that the breathing sounds are badly distorted in the Large v2 output compared to the Medium v2 output.
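
To point at where the two outputs diverge, one rough option is to compare per-frame levels of the input and each reconstruction and list the frames that differ most. This is a dependency-free sketch, not a perceptual metric; `frame_rms_db` and `worst_frames` are hypothetical helpers, and the 320-sample frame length is an assumption chosen to line up with one token per frame.

```python
import math


def frame_rms_db(samples, frame_len=320, floor_db=-100.0):
    """RMS level in dB for each non-overlapping frame of a mono signal."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Clamp silent frames to a floor instead of taking log10(0).
        level = 20.0 * math.log10(rms) if rms > 0 else floor_db
        levels.append(max(level, floor_db))
    return levels


def worst_frames(reference, reconstruction, frame_len=320, top_k=5):
    """(frame index, absolute level difference in dB) for the top_k most different frames."""
    n = min(len(reference), len(reconstruction))  # lengths may differ slightly
    ref_db = frame_rms_db(reference[:n], frame_len)
    rec_db = frame_rms_db(reconstruction[:n], frame_len)
    diffs = [abs(a - b) for a, b in zip(ref_db, rec_db)]
    order = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return [(i, diffs[i]) for i in order[:top_k]]
```

Assuming `wav` and `audio_out` have shape `(1, T)`, they can be passed in as `wav.squeeze(0).tolist()` and `audio_out.detach().squeeze(0).tolist()`. Frame `i` starts at `i * 320 / 24000` seconds, so running this once per checkpoint shows whether the largest differences fall on the breath segments.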
@jishengpeng
