Open
Description
David introduced a new Feistel shuffling algorithm to address issues we had seen that looked an awful lot like unshuffled data. We haven't seen such issues recently with the new shuffle algorithm. That said, we should test this in a controlled ablation!
- We'd expect the linear shuffle to have more trouble with data that is sneakily internally sorted, such that neighboring documents are heavily correlated with each other. A good example is Stack v2 EDU Filtered, which is sorted by programming language.
- To test the improved shuffle, I'll train 30M-parameter models to 1B tokens on Stack v2 EDU Filtered.
- We'll compare these runs on Paloma 100 Languages Github, which is a decent measure of high-level programming performance.
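For readers unfamiliar with the idea, here is a minimal sketch of a Feistel-network index permutation. It is a generic illustration and not the implementation David wrote: the function names, the BLAKE2b round function, the round count, and the cycle-walking step for sizes that are not a power of four are all assumptions made for this example.

```python
import hashlib


def _round_fn(value: int, round_idx: int, seed: int, bits: int) -> int:
    """Keyed round function: hash (seed, round, value) down to `bits` bits."""
    digest = hashlib.blake2b(
        f"{seed}:{round_idx}:{value}".encode(), digest_size=8
    ).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def feistel_permute(index: int, n: int, seed: int, rounds: int = 4) -> int:
    """Map `index` in [0, n) to a unique position in [0, n).

    Hypothetical helper for illustration only.
    """
    if not 0 <= index < n:
        raise ValueError("index must be in [0, n)")
    # Width of each half, chosen so the domain 2**(2 * half_bits) covers n.
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    x = index
    while True:
        left, right = x >> half_bits, x & mask
        for r in range(rounds):
            # One Feistel round: swap halves and mix in the keyed function.
            left, right = right, left ^ _round_fn(right, r, seed, half_bits)
        x = (left << half_bits) | right
        # Cycle-walk: re-apply the permutation until we land back inside [0, n).
        if x < n:
            return x


if __name__ == "__main__":
    order = [feistel_permute(i, 10, seed=0) for i in range(10)]
    print(order)  # some permutation of 0..9
```

Because every Feistel round is invertible, the mapping is a bijection, so each document index is visited exactly once per epoch without materializing the full permutation in memory.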
Hypothesis or Goal
We expect the Feistel shuffle to lead to a more stably decreasing loss throughout the training run. Ultimately, we'd also expect it to yield lower validation loss, since the data order is closer to a true i.i.d. shuffle.
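One cheap sanity check on how close a data order is to i.i.d. is to measure how often neighboring documents share a label such as programming language. The sketch below is only an illustration: the helper name, the toy four-language corpus, and the use of `random.shuffle` as the reference shuffle are assumptions for this example, not part of the planned experiment.

```python
import random
from typing import Sequence


def adjacent_same_label_rate(labels_in_order: Sequence[str]) -> float:
    """Fraction of neighboring positions that carry the same label."""
    if len(labels_in_order) < 2:
        return 0.0
    same = sum(a == b for a, b in zip(labels_in_order, labels_in_order[1:]))
    return same / (len(labels_in_order) - 1)


# Toy corpus (hypothetical): 4 languages, 250 documents each, sorted by language.
languages = ["c", "go", "python", "rust"]
sorted_labels = [lang for lang in languages for _ in range(250)]

# Reference order: a uniform random shuffle with a fixed seed.
shuffled_labels = sorted_labels[:]
random.Random(0).shuffle(shuffled_labels)

if __name__ == "__main__":
    print("sorted:  ", adjacent_same_label_rate(sorted_labels))
    print("shuffled:", adjacent_same_label_rate(shuffled_labels))
```

For the sorted order the rate is close to 1, while a well-mixed order over four equally sized groups should sit near 1/4. Running the same measurement on the order each shuffle actually produces would show how far each one is from that baseline.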
Links
- WandB Report: (link)
- Data Browser: (link)
Results
TBD