Description
With MultiProcDataset and sharding_method = "seq_order" (the default), the seq order shuffling is done by the first worker (sub dataset), and the resulting order is then evenly split over the workers (via init_seq_order(seq_order=...) in each worker).
So normally, in all the other workers, _get_random_seed_for_epoch is not really used anymore.
Except in those cases where it is, e.g. OggZipDataset with speed perturbation. OggZipDataset.init_seq_order does:

```python
random_seed = self._get_random_seed_for_epoch(epoch=epoch)
self._audio_random.seed(random_seed)
```

This self._audio_random RNG is used by the feature extraction code (ExtractAudioFeatures(random_state=self._audio_random, **audio)) and is also passed to user functions, such as my i6_experiments.users.zeyer.speed_pert.librosa_config.speed_pert_librosa_config.
The problem now is that this RNG is seeded in the same way in all workers. So, e.g., such speed perturbation will always perturb in the same way in each worker, which makes the resulting behavior depend on num_workers: for num_workers=1, every single sequence gets a different speed perturbation, but for e.g. num_workers=20, every 20 sequences get the same speed perturbation.
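A minimal standalone sketch of the effect (names are hypothetical, not the RETURNN code): if every worker seeds its perturbation RNG identically, all workers draw the same sequence of perturbation factors, so with round-robin splitting each group of num_workers consecutive sequences shares one factor.

```python
import random


def worker_perturb_factors(num_seqs, seed):
    # Hypothetical stand-in for the per-worker audio RNG.
    # Each worker seeds it the same way -- the bug described above.
    rng = random.Random(seed)
    return [round(rng.uniform(0.9, 1.1), 3) for _ in range(num_seqs)]


# Two workers, same seed: identical perturbation factors in both,
# so corresponding sequences in each worker get the same perturbation.
factors_w0 = worker_perturb_factors(5, seed=42)
factors_w1 = worker_perturb_factors(5, seed=42)
assert factors_w0 == factors_w1
```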
Btw, for sharding_method = "dedicated", we have the same issue: the random seed is the same in every worker. It must be that way because of how get_seq_order_for_epoch is implemented, which also uses that function _get_random_seed_for_epoch: it first sorts/shuffles the whole dataset (and relies on getting the same order in every worker), and then applies partition epoch / sharding, i.e. takes the right slice of the resulting list.
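For illustration, a minimal sketch (helper name and slicing are assumptions, not the actual implementation) of why the seq-order path needs the same seed in every worker: each worker shuffles the full index list with the shared epoch seed and then takes only its own slice, so the shards partition the data exactly.

```python
import random


def seq_order_for_shard(num_seqs, epoch_seed, shard_index, num_shards):
    # Hypothetical sketch of the get_seq_order_for_epoch logic:
    # shuffle the *whole* dataset with the shared epoch seed ...
    order = list(range(num_seqs))
    random.Random(epoch_seed).shuffle(order)
    # ... then take this shard's slice. This only partitions the data
    # correctly because every worker produced the same shuffled order.
    return order[shard_index::num_shards]


shards = [seq_order_for_shard(10, epoch_seed=1234, shard_index=i, num_shards=2)
          for i in range(2)]
# The two shards are disjoint and together cover all sequences:
assert sorted(shards[0] + shards[1]) == list(range(10))
```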
I think what we want is some more custom behavior of _get_random_seed_for_epoch depending on how it is used. When used for get_seq_order_for_epoch, we deliberately want that it returns the same random seed independent of the shard index or worker index.
We should look through all usages of _get_random_seed_for_epoch. I assume most of them (I can also think of target label augmentation, e.g. different random subword segmentations) fall into the same case as speed perturbation, and get_seq_order_for_epoch is the exception here.
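One possible shape of such custom behavior (purely a sketch; the signature and the mixing arithmetic are assumptions, not the RETURNN API): an optional worker index on the seed helper, so seq-order shuffling keeps the shared worker-independent seed while per-worker RNGs diverge.

```python
def get_random_seed_for_epoch(epoch, base_seed=0, worker_index=None):
    # Hypothetical sketch: worker_index=None reproduces the shared,
    # worker-independent seed (what get_seq_order_for_epoch needs);
    # passing the worker/shard index mixes it in, so RNGs used e.g.
    # for speed perturbation differ per worker.
    seed = base_seed * 1000003 + epoch
    if worker_index is not None:
        seed = seed * 1000003 + worker_index + 1
    return seed & 0x7FFFFFFF


# Shared seed for seq ordering is identical across workers:
assert get_random_seed_for_epoch(3) == get_random_seed_for_epoch(3)
# Per-worker seeds differ:
assert (get_random_seed_for_epoch(3, worker_index=0)
        != get_random_seed_for_epoch(3, worker_index=1))
```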
Another issue is how to pass this to the sub dataset in the case of sharding_method = "seq_order". For sharding_method = "dedicated", the dataset knows about the shard index, but for sharding_method = "seq_order", the dataset does not really know about the worker index. It must get that somehow and use it for _get_random_seed_for_epoch, just like the shard index would be used otherwise.
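One way this could look (the set_worker_index hook and the stub class are hypothetical, just to illustrate the idea): MultiProcDataset pushes the worker index into each sub dataset before init_seq_order, and the seed helper only mixes it in for the non-seq-order uses.

```python
class SubDatasetStub:
    """Hypothetical stand-in for the per-worker sub dataset."""

    def __init__(self, base_seed=0):
        self.base_seed = base_seed
        self.worker_index = None

    def set_worker_index(self, idx):
        # Hypothetical hook: MultiProcDataset would call this in each
        # worker before init_seq_order, analogous to how the shard
        # index is known in the "dedicated" case.
        self.worker_index = idx

    def get_random_seed_for_epoch(self, epoch, for_seq_order=False):
        base = (self.base_seed * 1000003 + epoch) & 0x7FFFFFFF
        if for_seq_order or self.worker_index is None:
            return base  # must stay identical across workers
        return base ^ (self.worker_index + 1)  # diverge per worker


ds0, ds1 = SubDatasetStub(), SubDatasetStub()
ds0.set_worker_index(0)
ds1.set_worker_index(1)
# Seq-order seed stays shared; per-worker RNG seeds differ:
assert (ds0.get_random_seed_for_epoch(1, for_seq_order=True)
        == ds1.get_random_seed_for_epoch(1, for_seq_order=True))
assert ds0.get_random_seed_for_epoch(1) != ds1.get_random_seed_for_epoch(1)
```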
(cc @NeoLegends @dorian-K)