Conversation

@kentslaney kentslaney commented May 7, 2025

prior art

If anyone has more GPUs than ideas, I'd appreciate this being tried (and hearing the feedback). It trains at a small scale without the loss immediately diverging from the original, but my hope is that it might mitigate mode collapse, which happens late and at scale. That said, it's a negative result so far. If I get around to trying it at scale myself, I'll update the thread.

Thoughts and discussion without results are welcome as well.

I also have a standalone implementation for anyone without a training setup.

@aifartist

Unfortunately I have more ideas than GPUs and not enough time to try them. But I do have a Threadripper 7985 system with 256 GB of DDR5-6000 and dual 5090s.

Do you have any interesting experiments you want run? I'm willing to offer some time in exchange for learning a bit more.

Currently I'm all in on Karpathy's nanochat, which is a very fast trainer. I've already gotten about a 20% further performance boost by eliminating graph breaks and recoding the step function to process the params in larger batches.
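
To make the "larger batches" point concrete, here's a rough sketch (not nanochat's actual code) of what a batched optimizer step can look like in PyTorch: instead of looping over parameters one at a time, collect them into lists and update them with `torch._foreach_*` ops, which process the whole list in a few calls and tend to avoid per-parameter Python overhead and graph breaks under `torch.compile`. The `batched_sgd_step` name and the hyperparameters are just illustrative.

```python
import torch

@torch.no_grad()
def batched_sgd_step(params, momentum_buffers, lr=0.02, momentum=0.9):
    # Hypothetical example: SGD with momentum, applied to all params at once
    # via foreach ops rather than a per-parameter loop.
    params = [p for p in params if p.grad is not None]
    grads = [p.grad for p in params]
    bufs = [momentum_buffers[p] for p in params]

    # buf = momentum * buf + grad  (one call per list, not one per tensor)
    torch._foreach_mul_(bufs, momentum)
    torch._foreach_add_(bufs, grads)

    # p = p - lr * buf
    torch._foreach_add_(params, bufs, alpha=-lr)
```

Usage would be something like `momentum_buffers = {p: torch.zeros_like(p) for p in model.parameters()}`, then calling `batched_sgd_step(list(model.parameters()), momentum_buffers)` after backward.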

