Currently, in many of our setups, we use a so-called "devtrain" dataset: a subset of the train dataset that is used for evaluation, i.e. to evaluate on train data under the same conditions as the other dev sets.
RETURNN treats devtrain (and any other dev dataset) just the same. So in practice, train and devtrain would be two separate Dataset instances, which can require quite a bit of memory (see e.g. also #1498).
But often they really use the same underlying data (e.g. the same ogg-zip files), with fixed_random_subset or seq_list_filter_file or so to take out the subset.
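To illustrate the current pattern, here is a hypothetical config sketch: train and devtrain are two separate dataset dicts pointing at the same underlying data, differing only in subset/ordering options. The option names (`fixed_random_subset`, `seq_ordering`, `partition_epoch`) exist in RETURNN, but the concrete path and values here are made up for illustration.

```python
# Hypothetical RETURNN config fragment. The path and concrete values
# are made up; only the option names follow RETURNN conventions.

train_dataset = {
    "class": "OggZipDataset",
    "path": "data/train.ogg.zip",  # hypothetical path
    "partition_epoch": 20,
    "seq_ordering": "laplace:.1000",
}

# devtrain: same underlying ogg-zip data, but only a fixed random subset,
# with deterministic ordering, used for evaluation.
devtrain_dataset = {
    **train_dataset,
    "fixed_random_subset": 3000,  # take out a fixed subset for eval
    "partition_epoch": 1,
    "seq_ordering": "sorted",
}

# RETURNN currently instantiates these as two independent Dataset objects,
# so the shared underlying data is loaded and held in memory twice.
```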
We could have some special logic for devtrain so that it reuses the same dataset instance as train and only resets properties like fixed_random_subset, seq_list_filter_file, the seq ordering, etc.
For PyTorch, we also need to be careful about the DataLoader logic, e.g. when using multi-processing workers on the DataLoader, since each worker holds its own copy of the dataset.
(cc @robin-p-schmitt)