Description
With MultiProcDataset and sharding_method = "seq_order" (the default), the seq order shuffling is done by the first worker (sub dataset), and the resulting order is then evenly split over the workers (via init_seq_order(seq_order=...) in each worker).
So normally, in all the other workers, _get_random_seed_for_epoch is not really used anymore.
Except in those cases where it is, e.g. OggZipDataset with speed perturbation. OggZipDataset.init_seq_order does:

```python
random_seed = self._get_random_seed_for_epoch(epoch=epoch)
self._audio_random.seed(random_seed)
```

This self._audio_random RNG is used by the feature extraction code (ExtractAudioFeatures(random_state=self._audio_random, **audio)) and is also passed to user functions, such as my i6_experiments.users.zeyer.speed_pert.librosa_config.speed_pert_librosa_config.
The problem now is that this RNG is seeded in the same way in all workers. So, e.g., such speed perturbation will always perturb in the same way in each worker, which makes the resulting behavior depend on num_workers: for num_workers=1, every single sequence gets a different speed perturbation, but for e.g. num_workers=20, every 20 sequences get the same speed perturbation.
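A minimal standalone sketch of the effect (names are hypothetical, not the RETURNN code): if every worker seeds its perturbation RNG identically, all workers draw the same sequence of perturbation factors, so with round-robin splitting each group of num_workers consecutive sequences shares one factor.

```python
import random


def worker_perturb_factors(num_seqs, seed):
    # Hypothetical stand-in for the per-worker audio RNG.
    # Each worker seeds it the same way -- the bug described above.
    rng = random.Random(seed)
    return [round(rng.uniform(0.9, 1.1), 3) for _ in range(num_seqs)]


# Two workers, same seed: identical perturbation factors in both,
# so corresponding sequences in each worker get the same perturbation.
factors_w0 = worker_perturb_factors(5, seed=42)
factors_w1 = worker_perturb_factors(5, seed=42)
assert factors_w0 == factors_w1
```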
Btw, for sharding_method = "dedicated", we have the same issue: the random seed is the same in every worker. It must be that way because of how get_seq_order_for_epoch is implemented, which also uses that function _get_random_seed_for_epoch: it first sorts/shuffles the whole dataset (and relies on getting the same order in every worker), and then applies partition epoch / sharding, i.e. takes the right slice of the resulting list.
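For illustration, a minimal sketch (helper name and slicing are assumptions, not the actual implementation) of why the seq-order path needs the same seed in every worker: each worker shuffles the full index list with the shared epoch seed and then takes only its own slice, so the shards partition the data exactly.

```python
import random


def seq_order_for_shard(num_seqs, epoch_seed, shard_index, num_shards):
    # Hypothetical sketch of the get_seq_order_for_epoch logic:
    # shuffle the *whole* dataset with the shared epoch seed ...
    order = list(range(num_seqs))
    random.Random(epoch_seed).shuffle(order)
    # ... then take this shard's slice. This only partitions the data
    # correctly because every worker produced the same shuffled order.
    return order[shard_index::num_shards]


shards = [seq_order_for_shard(10, epoch_seed=1234, shard_index=i, num_shards=2)
          for i in range(2)]
# The two shards are disjoint and together cover all sequences:
assert sorted(shards[0] + shards[1]) == list(range(10))
```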
I think what we want is some more custom behavior of _get_random_seed_for_epoch depending on how it is used. When used for get_seq_order_for_epoch, we deliberately want that it returns the same random seed independent of the shard index or worker index.
We should look through all usages of _get_random_seed_for_epoch. I assume most of them (I can also think of target label augmentation, e.g. different random subword segmentations) fall into the same case as speed perturbation, and get_seq_order_for_epoch is the exception here.
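One possible shape of such custom behavior (purely a sketch; the signature and the mixing arithmetic are assumptions, not the RETURNN API): an optional worker index on the seed helper, so seq-order shuffling keeps the shared worker-independent seed while per-worker RNGs diverge.

```python
def get_random_seed_for_epoch(epoch, base_seed=0, worker_index=None):
    # Hypothetical sketch: worker_index=None reproduces the shared,
    # worker-independent seed (what get_seq_order_for_epoch needs);
    # passing the worker/shard index mixes it in, so RNGs used e.g.
    # for speed perturbation differ per worker.
    seed = base_seed * 1000003 + epoch
    if worker_index is not None:
        seed = seed * 1000003 + worker_index + 1
    return seed & 0x7FFFFFFF


# Shared seed for seq ordering is identical across workers:
assert get_random_seed_for_epoch(3) == get_random_seed_for_epoch(3)
# Per-worker seeds differ:
assert (get_random_seed_for_epoch(3, worker_index=0)
        != get_random_seed_for_epoch(3, worker_index=1))
```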
Another issue is how to pass this to the sub dataset in the case of sharding_method = "seq_order". For sharding_method = "dedicated", the dataset knows about the shard index, but for sharding_method = "seq_order", the dataset does not really know about the worker index. It must get that somehow and use it for _get_random_seed_for_epoch, just like the shard index would be used otherwise.
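One way this could look (the set_worker_index hook and the stub class are hypothetical, just to illustrate the idea): MultiProcDataset pushes the worker index into each sub dataset before init_seq_order, and the seed helper only mixes it in for the non-seq-order uses.

```python
class SubDatasetStub:
    """Hypothetical stand-in for the per-worker sub dataset."""

    def __init__(self, base_seed=0):
        self.base_seed = base_seed
        self.worker_index = None

    def set_worker_index(self, idx):
        # Hypothetical hook: MultiProcDataset would call this in each
        # worker before init_seq_order, analogous to how the shard
        # index is known in the "dedicated" case.
        self.worker_index = idx

    def get_random_seed_for_epoch(self, epoch, for_seq_order=False):
        base = (self.base_seed * 1000003 + epoch) & 0x7FFFFFFF
        if for_seq_order or self.worker_index is None:
            return base  # must stay identical across workers
        return base ^ (self.worker_index + 1)  # diverge per worker


ds0, ds1 = SubDatasetStub(), SubDatasetStub()
ds0.set_worker_index(0)
ds1.set_worker_index(1)
# Seq-order seed stays shared; per-worker RNG seeds differ:
assert (ds0.get_random_seed_for_epoch(1, for_seq_order=True)
        == ds1.get_random_seed_for_epoch(1, for_seq_order=True))
assert ds0.get_random_seed_for_epoch(1) != ds1.get_random_seed_for_epoch(1)
```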
(cc @NeoLegends @dorian-K)