How to train your tiny MoE: Going from 60+ hours to 13
Hosted by Zach Mueller
What you'll learn
How to find the easy wins during model training
What are the first levers to pull when trying to get the best performance during training
How to tell *what* is causing slowdowns
What the torch profiler is, and how to use it to see what's slow (see the sketch after this list)
What to do when you've optimized the training loop
How to then find easy wins in the modeling code to increase your FLOPs
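Profiling is the thread that ties these topics together. Below is a minimal, hedged sketch of how the torch profiler can be pointed at a single training step to surface the slowest kernels; the model, batch, and optimizer here are stand-ins, not the actual MoE training code from the sprint.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in model, batch, and optimizer purely for illustration.
model = torch.nn.Linear(1024, 1024).cuda()
batch = torch.randn(32, 1024, device="cuda")
optimizer = torch.optim.AdamW(model.parameters())

# Capture one training step on both CPU and GPU.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    out = model(batch)
    loss = out.float().pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Sort by total CUDA time to see which ops dominate the step.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

In practice you would wrap a few real steps (skipping the first warmup iterations) and read the table for ops that take far longer than their FLOP count justifies; that is where the easy wins tend to hide.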
Why this topic matters
Over a 72-hour sprint, I trained a tiny experimental 0.5-billion-parameter Qwen3-MoE-style model to a Chinchilla-optimal number of tokens (10 billion), cutting the run from 60+ hours down to 13.2 hours on 4 consumer graphics cards at home.
Getting there involved numerous tricks, learning different tools, and figuring out where the bottlenecks were and how to work around them.
Join me as I walk through the secrets most people only learn through experience.
You'll learn from
Zach Mueller
Technical Lead, Hugging Face
I've been in the field for almost a decade now. I first started in the fast.ai community, quickly learning how modern-day training pipelines are built and operated. Then I moved to Hugging Face, where I'm the Technical Lead on the accelerate project and manage the transformers Trainer.