Controlled Comparison of Feistel vs. Linear Shuffle #1803

## Description

David introduced a new Feistel shuffling algorithm to address issues we had seen that looked an awful lot like unshuffled data. The new shuffle hasn't shown such issues recently. That said, we should test this in a controlled ablation!

1) We'd expect the linear shuffle to have more issues with data that is sneakily internally sorted, such that neighboring documents are heavily correlated with each other. A good example of this is the Stack v2 EDU Filtered, which is sorted by programming language.
2) To test the improved shuffle, I'll train 30M param models to 1B tokens on the Stack v2 EDU Filtered, once with each shuffle.
3) We'll compare these runs based on their performance on Paloma 100 Languages Github, which is a decent measure of high-level programming performance.
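
To make the intuition in (1) concrete, here is a minimal sketch of the two kinds of index permutation. Everything in it is an assumption for illustration: the issue does not say how either shuffle is implemented, so `linear_shuffle` is taken to be an affine map `(a*i + b) mod n`, and `feistel_shuffle` is a generic balanced Feistel network with cycle-walking and a hash-based round function, not David's actual code. The function names and parameters are hypothetical.

```python
import hashlib


def linear_shuffle(i: int, n: int, a: int, b: int) -> int:
    """Assumed form of a linear shuffle: i -> (a*i + b) mod n.

    This is a permutation of range(n) when gcd(a, n) == 1.
    """
    return (a * i + b) % n


def _round_fn(value: int, key: int, rnd: int, mask: int) -> int:
    # Hypothetical round function: keyed hash truncated to half-block width.
    digest = hashlib.blake2b(
        f"{key}:{rnd}:{value}".encode(), digest_size=8
    ).digest()
    return int.from_bytes(digest, "big") & mask


def feistel_shuffle(i: int, n: int, key: int, rounds: int = 4) -> int:
    """Generic balanced Feistel permutation of range(n) via cycle-walking."""
    # Pick a half-block width so that 2 * half bits cover every index < n.
    half = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half) - 1
    x = i
    while True:
        left, right = x >> half, x & mask
        for rnd in range(rounds):
            left, right = right, left ^ _round_fn(right, key, rnd, mask)
        x = (left << half) | right
        if x < n:  # cycle-walk: re-encrypt until we land back inside range(n)
            return x


def distinct_gaps(order: list[int], n: int) -> int:
    """Number of distinct differences (mod n) between consecutive source indices."""
    return len({(b - a) % n for a, b in zip(order, order[1:])})


n = 1000
linear_order = [linear_shuffle(i, n, a=7, b=3) for i in range(n)]
feistel_order = [feistel_shuffle(i, n, key=42) for i in range(n)]
```

Under these assumptions, `distinct_gaps(linear_order, n)` is 1: consecutive training positions always read source documents a fixed stride apart, so any block structure in the source (such as sorting by programming language) is visited in a regular pattern. The Feistel order has no such fixed stride.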

## Hypothesis or Goal
We expect the Feistel shuffle to give a more stably decreasing loss throughout the training run. Ultimately, we'd also expect this to lead to lower validation loss, since the data order is closer to a true i.i.d. shuffle.
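
"More stably decreasing" is not defined above, so the following is only one possible way to put a number on it when comparing the two runs. It smooths the logged training loss with a moving average and reports how often, and by how much in total, the smoothed curve goes up. The function name, the `window` default, and the choice of metric are all assumptions, not an agreed evaluation protocol.

```python
def loss_stability(losses: list[float], window: int = 10) -> tuple[float, float]:
    """Return (fraction of smoothed steps where loss rose, total upward movement).

    Lower values on both numbers mean a more steadily decreasing curve.
    """
    # Simple moving average over `window` consecutive logged losses.
    smoothed = [
        sum(losses[i : i + window]) / window
        for i in range(len(losses) - window + 1)
    ]
    deltas = [b - a for a, b in zip(smoothed, smoothed[1:])]
    if not deltas:
        return 0.0, 0.0
    ups = [d for d in deltas if d > 0]
    return len(ups) / len(deltas), float(sum(ups))
```

Running this on the loss history of each run (for example, exported from the WandB report) with the same `window` would give a like-for-like comparison; final Paloma validation loss remains the primary metric.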

## Links
* WandB Report: (link)
* Data Browser: (link)

## Results

TBD

