How to train your tiny MoE: Going from 60+ hours to 13
Hosted by Zach Mueller
What you'll learn
How to find the easy wins during model training
What are the first levers to pull when trying to get the best performance during training
How to tell *what* is causing slowdowns
What the torch profiler is, and how to use it to see what's slow (see the sketch after this list)
What to do when you've optimized the training loop
How to then find easy wins in the modeling code to increase your FLOPs
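Profiling is the thread that ties these topics together. Below is a minimal, hedged sketch of how the torch profiler can be pointed at a single training step to surface the slowest kernels; the model, batch, and optimizer here are stand-ins, not the actual MoE training code from the sprint.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in model, batch, and optimizer purely for illustration.
model = torch.nn.Linear(1024, 1024).cuda()
batch = torch.randn(32, 1024, device="cuda")
optimizer = torch.optim.AdamW(model.parameters())

# Capture one training step on both CPU and GPU.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    out = model(batch)
    loss = out.float().pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Sort by total CUDA time to see which ops dominate the step.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

In practice you would wrap a few real steps (skipping the first warmup iterations) and read the table for ops that take far longer than their FLOP count justifies; that is where the easy wins tend to hide.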
Why this topic matters
Over a 72-hour sprint, I trained a tiny experimental 0.5-billion-parameter Qwen3-MoE-style model to a Chinchilla-optimal number of tokens (10 billion), cutting the run from 60+ hours down to 13.2 hours on 4 consumer graphics cards at home.
Getting there involved numerous tricks, learning different tools, and figuring out where the bottlenecks were and how to work around them.
Join me as I walk through the secrets most people only learn through experience.
You'll learn from
Zach Mueller
Technical Lead, Hugging Face
I've been in the field for almost a decade now. I first started in the fast.ai community, quickly learning how modern-day training pipelines are built and operated. Then I moved to Hugging Face, where I'm the Technical Lead on the accelerate project and manage the transformers Trainer.