fix(dataset): Fix video indexing bug when using merged dataset | Fixes #2328 | (🐛 Bug) | #2438
What it does
First mentioned in issue #2328. When using a merged dataset (version 3.0), `torch.utils.data.DataLoader` raises an exception saying that the given frame index is invalid. The error was caused by not resetting the `latest_duration` offset in `aggregate_videos` when creating a new file, which left the offset equal to the last file's frame number when writing the second episode of the new file. The fix resets `latest_duration` after creating a new file.
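A self-contained sketch of the offset bookkeeping may help illustrate the change; the function and its arguments below are hypothetical, and only the `latest_duration` reset mirrors the actual change in `aggregate_videos`:

```python
def plan_timestamps(episode_durations, episodes_per_file):
    """Assign (from_timestamp, to_timestamp) to each episode across output files."""
    spans = []
    latest_duration = 0.0  # running offset within the current output file
    for i, duration in enumerate(episode_durations):
        if i > 0 and i % episodes_per_file == 0:
            # A new video file starts here.
            latest_duration = 0.0  # the fix: reset the offset for the new file
        spans.append((latest_duration, latest_duration + duration))
        latest_duration += duration
    return spans

# Two files of two episodes each: the second file's timestamps restart at 0.0.
print(plan_timestamps([1.0, 2.0, 1.5, 0.5], episodes_per_file=2))
# [(0.0, 1.0), (1.0, 3.0), (0.0, 1.5), (1.5, 2.0)]
```

Without the reset, the third episode would start at 3.0 instead of 0.0, producing the invalid frame index reported in #2328.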
How it was tested
Using the fixed aggregation, I merged a new copy of my dataset containing 75 episodes with the video file size set to 500 MB, producing 13 video files in the merged dataset.
Viewing the newly merged dataset confirmed the offsets were correct.
I then ran a model training with a batch size of 10 for 100 training steps. After multiple tries no exception was raised, suggesting the problem was solved.
The repository's included tests also passed.
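The training smoke test looked roughly like the following; the stand-in dataset below is a placeholder, as in practice the loader wraps the merged LeRobot dataset object:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder for the merged v3.0 dataset; in practice this is the merged
# dataset built from the aggregated video and parquet files.
dataset = TensorDataset(torch.randn(2000, 3))

loader = DataLoader(dataset, batch_size=10, shuffle=True)

# Before the fix, iterating a merged dataset past the first episode of a
# second video file raised an invalid-frame-index error; the test simply
# runs 100 steps and expects no exception.
for step, (batch,) in enumerate(loader):
    if step >= 100:
        break
```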
How to checkout & try? (for the reviewer)
Try merging smaller datasets so that the new dataset contains at least 2 video files. When viewing `meta/episodes/.../file-000.parquet`, the `from_timestamp` and `to_timestamp` values should be sequential with no skips, resetting to 0.0 when a new file starts. Training a model should also produce no errors.
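The timestamp check can also be done programmatically; this is a sketch assuming the episode metadata is readable with pandas, with an illustrative path:

```python
import numpy as np
import pandas as pd

# Illustrative path; substitute an actual meta/episodes/.../file-000.parquet file.
df = pd.read_parquet("path/to/meta/episodes/file-000.parquet")

starts = df["from_timestamp"].to_numpy()
ends = df["to_timestamp"].to_numpy()

# Each episode should start where the previous one ended, except where the
# offset resets to 0.0 at a new video file boundary.
ok = np.isclose(starts[1:], ends[:-1]) | np.isclose(starts[1:], 0.0)
assert np.isclose(starts[0], 0.0) and ok.all()
```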