devtrain dataset, reuse train dataset #1771

@albertz

Currently, many of our setups use a so-called "devtrain" dataset: a subset of the train dataset that is used for evaluation, i.e. to evaluate on train data under the eval conditions, comparable to the other dev sets.

RETURNN treats devtrain just like any other dev dataset. So in practice, train and devtrain are two separate Dataset instances, which can require quite a bit of memory (see e.g. also #1498).

But often they really use the same underlying data (e.g. ogg-zip files), and devtrain just uses fixed_random_subset, seq_list_filter_file or similar to take out the subset.
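To illustrate, such a setup might look roughly like the following config fragment. This is only a sketch: the path, subset size and ordering values are placeholders, the audio/targets options are left out, and the name `_common_opts` is made up for this example.

```python
# Sketch of a RETURNN-style config fragment (all values are placeholders).
# Options shared by train and devtrain: both point at the same ogg-zip data.
_common_opts = {
    "class": "OggZipDataset",
    "path": ["/path/to/train.ogg.zip"],  # placeholder path
    # audio/targets options omitted in this sketch
}

# Full train data, split over sub-epochs, with some training seq ordering.
train = {**_common_opts, "partition_epoch": 20, "seq_ordering": "laplace:.1000"}

# devtrain: same underlying data, but only a fixed random subset of it.
eval_datasets = {
    "devtrain": {
        **_common_opts,
        "fixed_random_subset": 3000,
        "seq_ordering": "sorted_reverse",
    },
}
```

Both dicts describe the same files, but each one is turned into its own Dataset instance, which is where the duplicated memory comes from.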

We could have some special logic for devtrain, so that it makes use of the same dataset instance as train, but just resets properties like fixed_random_subset, seq_list_filter_file, the seq ordering, etc.
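The idea could be sketched with a toy example like the one below. `ToyDataset` and its methods are hypothetical stand-ins and not RETURNN's actual Dataset API; the sketch only shows one shared instance being switched between a train view and a devtrain view by resetting the subset and the seq ordering.

```python
import random
from typing import List, Optional


class ToyDataset:
    """Hypothetical stand-in for a dataset; NOT RETURNN's actual Dataset API."""

    def __init__(self, seqs: List[str]):
        self.seqs = seqs  # the expensive part, loaded once and shared
        self.seq_order: List[int] = []

    def init_seq_order(
        self,
        epoch: int,
        *,
        seq_ordering: str = "default",
        fixed_random_subset: Optional[int] = None,
    ) -> None:
        """(Re)select and (re)order the seqs; the underlying data is untouched."""
        indices = list(range(len(self.seqs)))
        if fixed_random_subset is not None:
            # Fixed seed, independent of the epoch: same subset every time.
            rnd = random.Random(42)
            indices = sorted(rnd.sample(indices, fixed_random_subset))
        if seq_ordering == "random":
            random.Random(epoch).shuffle(indices)
        self.seq_order = indices

    def get_seqs(self) -> List[str]:
        return [self.seqs[i] for i in self.seq_order]


dataset = ToyDataset([f"seq-{i}" for i in range(100)])

# Train pass: all seqs, shuffled per epoch.
dataset.init_seq_order(epoch=1, seq_ordering="random")
train_seqs = dataset.get_seqs()

# devtrain pass on the same instance: properties reset, small fixed subset.
dataset.init_seq_order(epoch=1, seq_ordering="default", fixed_random_subset=10)
devtrain_seqs = dataset.get_seqs()
```

In this sketch the instance can only serve one view at a time, so train and devtrain passes would have to alternate rather than run concurrently.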

For PyTorch, we also need to be careful with the DataLoader logic, e.g. when the DataLoader uses workers (multi-processing).

(cc @robin-p-schmitt)
