Controlled Comparison of Feistel vs. Linear Shuffle #1803

@Helw150

Description

David introduced a new Feistel shuffling algorithm to address issues we had seen that looked an awful lot like unshuffled data. The new shuffle hasn't shown such issues recently. That said, we should test this in a controlled ablation!

  1. We'd expect the linear shuffle to have more trouble with data that is sneakily internally sorted, such that neighboring documents are heavily correlated with each other. A good example is the Stack v2 EDU Filtered dataset, which is sorted by programming language.
  2. To test the improved shuffle, I'll train 30M-parameter models to 1B tokens on the Stack v2 EDU Filtered, one with each shuffle.
  3. We'll compare these runs based on their performance on Paloma 100 Languages GitHub, which is a decent measure of high-level programming performance.
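To illustrate why sorted data is a stress test, here is a minimal sketch, assuming the linear shuffle is an affine index map `i -> (a*i + b) mod n`. That is an assumption: the issue does not spell out either algorithm. Under it, consecutive training positions always read source documents a fixed stride apart, while a Feistel-style permutation has no fixed stride. All names (`linear_permutation`, `feistel_index`, `distinct_strides`), the BLAKE2b round function, and the constants are hypothetical and are not the project's implementation.

```python
import hashlib
import math


def linear_permutation(n, a, b):
    """Affine map i -> (a*i + b) mod n; a bijection on [0, n) when gcd(a, n) == 1."""
    if math.gcd(a, n) != 1:
        raise ValueError("a must be coprime with n")
    return [(a * i + b) % n for i in range(n)]


def _round_fn(value, key, rnd, mask):
    # Hypothetical round function: a keyed hash truncated to the half-block width.
    data = f"{value}:{key}:{rnd}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") & mask


def feistel_index(i, n, key, rounds=4):
    """Balanced Feistel network over 2*half bits, with cycle walking into [0, n)."""
    half = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half) - 1
    x = i
    while True:
        left, right = x >> half, x & mask
        for rnd in range(rounds):
            left, right = right, left ^ _round_fn(right, key, rnd, mask)
        x = (left << half) | right
        # The network permutes [0, 2**(2*half)); re-apply until we land in [0, n).
        if x < n:
            return x


def distinct_strides(perm, n):
    """Number of distinct gaps (mod n) between consecutively visited source indices."""
    return len({(perm[i + 1] - perm[i]) % n for i in range(len(perm) - 1)})


n = 1000
linear = linear_permutation(n, a=617, b=42)
feistel = [feistel_index(i, n, key=0) for i in range(n)]
linear_strides = distinct_strides(linear, n)
feistel_strides = distinct_strides(feistel, n)
```

For the affine map, `distinct_strides` is 1 by construction, so any periodic structure in the source order can line up with the stride. Whether that is what caused the behavior described above is exactly what the ablation is meant to check.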

Hypothesis or Goal

We expect the Feistel shuffle to lead to more stably decreasing loss throughout the training run. Ultimately, we'd also expect this to lead to lower validation loss, since the data order is closer to a true i.i.d. shuffle.
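One possible way to put a number on "stably decreasing" is sketched below. It assumes per-step training losses are available as a plain list of floats, and it uses the fraction of steps on which an exponentially smoothed loss goes up as the stability measure. The metric, the function names, and the default `alpha` are illustrative assumptions, not something this issue prescribes.

```python
def ema(values, alpha=0.1):
    """Exponential moving average, seeded with the first value."""
    smoothed = []
    state = values[0]
    for v in values:
        state = alpha * v + (1 - alpha) * state
        smoothed.append(state)
    return smoothed


def fraction_increasing(values, alpha=0.1):
    """Fraction of consecutive steps where the smoothed loss increases."""
    if len(values) < 2:
        raise ValueError("need at least two loss values")
    smoothed = ema(values, alpha)
    ups = sum(1 for prev, cur in zip(smoothed, smoothed[1:]) if cur > prev)
    return ups / (len(smoothed) - 1)
```

A lower value means fewer upward excursions. Computing it with the same `alpha` for both runs keeps the comparison consistent.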

Links

  • WandB Report: (link)
  • Data Browser: (link)

Results

TBD
