devtrain dataset, reuse train dataset

Currently, many of our setups use a so-called "devtrain" dataset. This is just a subset of the train dataset that is used for evaluation, i.e. to evaluate on train data under the eval conditions, so that the results are comparable to the other dev sets.

RETURNN treats devtrain just like any other dev dataset. So in practice, train and devtrain are two separate `Dataset` instances, which can require quite a bit of memory (see e.g. also #1498).

But often both really use the same underlying data (e.g. ogg-zip files), and devtrain only uses `fixed_random_subset`, `seq_list_filter_file` or similar to select the subset.
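For illustration, a typical setup of this kind might look roughly like the following config fragment. This is only a sketch: the path and the concrete option values are hypothetical, and the audio/target options that `OggZipDataset` needs are omitted for brevity.

```python
# Sketch of a RETURNN-style config fragment (paths and values are hypothetical).
# Both datasets point to the same underlying ogg-zip file.
_base_dataset = {
    "class": "OggZipDataset",
    "path": "/path/to/train.ogg.zip",  # hypothetical path
    # audio/targets options omitted for brevity
}

train = {
    **_base_dataset,
    "partition_epoch": 20,
    "seq_ordering": "laplace:.1000",
}

eval_datasets = {
    "devtrain": {
        **_base_dataset,
        # Same data as train, but only a fixed random subset and eval-style ordering.
        "fixed_random_subset": 3000,
        "seq_ordering": "sorted_reverse",
    },
}
```

Even though `train` and `devtrain` differ only in a few options here, each dict leads to its own `Dataset` instance.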

We could add some special logic for devtrain so that it reuses the train dataset and only resets properties such as `fixed_random_subset`, `seq_list_filter_file`, the seq ordering, etc.
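A minimal toy sketch of that idea, not based on the actual RETURNN `Dataset` API: the names `SharedSeqStore` and `DatasetView` are hypothetical, and the subset and ordering logic is heavily simplified. The expensive part is held once, and train and devtrain are lightweight views with their own properties.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


class SharedSeqStore:
    """Hypothetical stand-in for the memory-heavy part of a dataset."""

    def __init__(self, seq_tags: Sequence[str]):
        self.seq_tags = list(seq_tags)


@dataclass
class DatasetView:
    """Hypothetical view: shares the store, but has its own subset and ordering."""

    store: SharedSeqStore
    fixed_random_subset: Optional[int] = None
    seq_ordering: str = "default"  # this toy only knows "default" and "random"

    def get_seq_order(self, epoch: int) -> List[str]:
        tags = list(self.store.seq_tags)
        if self.fixed_random_subset is not None:
            # Fixed seed, so the subset is the same in every epoch.
            random.Random(42).shuffle(tags)
            tags = sorted(tags[: self.fixed_random_subset])
        if self.seq_ordering == "random":
            # Epoch-dependent seed, so the order changes per epoch.
            random.Random(epoch).shuffle(tags)
        return tags


store = SharedSeqStore([f"seq-{i:03d}" for i in range(100)])
train_view = DatasetView(store, seq_ordering="random")
devtrain_view = DatasetView(store, fixed_random_subset=10)
```

Both views refer to the same `store` object, so the underlying data exists only once, while `devtrain_view` yields the same 10 sequences in every epoch.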

For PyTorch, we also need to be careful with the DataLoader logic, in particular when the DataLoader uses workers (multi-processing).
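One reason for the caution, sketched with a toy example: when a dataset object is handed to a worker process, the worker ends up with its own copy, so sharing by object identity in the main process does not automatically carry over. Here `copy.deepcopy` only stands in for that transfer, and the `Store` and `View` classes are hypothetical, not RETURNN or PyTorch classes.

```python
import copy


class Store:
    """Hypothetical stand-in for the shared, memory-heavy data."""

    def __init__(self, num_seqs: int):
        self.seq_tags = [f"seq-{i}" for i in range(num_seqs)]


class View:
    """Hypothetical dataset view which refers to a shared store."""

    def __init__(self, store: Store, subset: int = 0):
        self.store = store
        self.subset = subset


store = Store(1000)
train_view = View(store)
devtrain_view = View(store, subset=10)

# Stand-in for sending each dataset separately to its own worker process.
train_in_worker = copy.deepcopy(train_view)
devtrain_in_worker = copy.deepcopy(devtrain_view)

shared_in_main = train_view.store is devtrain_view.store
shared_in_workers = train_in_worker.store is devtrain_in_worker.store
```

In this toy, `shared_in_main` is true but `shared_in_workers` is false: the two copies each carry their own store, so the memory saving would be lost unless the sharing is handled explicitly on the worker side.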

(cc @robin-p-schmitt)


devtrain dataset, reuse train dataset #1771
