Lightning Lessons

How to train your tiny MoE: Going from 60+ hours to 13

Hosted by Zach Mueller

Fri, Oct 17, 2025

5:00 PM UTC (1 hour)

Virtual (Zoom)

Free to join

105 students

Go deeper with a course

Framework Fundamentals: Designing Distributed Training APIs
Zachary Mueller

What you'll learn

How to find the easy wins during model training

What are the first levers to pull when trying to get the best performance during training

How to tell *what* is causing slowdowns

What the torch profiler is, and how to use it to see what's slow (a minimal sketch follows this list)

What to do when you've optimized the training loop

How to then find easy wins in the modeling code to increase your FLOPs
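
As a taste of the profiling item above, here is a minimal illustrative sketch (my own, not material from the lesson) of wrapping a few training steps in PyTorch's torch.profiler to see which operations dominate. The Linear model, AdamW optimizer, and batch size are placeholders, and it assumes a CUDA GPU is available.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Placeholder model, optimizer, and data just to have something to profile.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    for _ in range(5):  # a few steps so kernel times average out
        with record_function("train_step"):
            optimizer.zero_grad()
            loss = model(data).pow(2).mean()
            loss.backward()
            optimizer.step()

# Sort by self CUDA time to surface the slowest kernels first.
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))
```

The printed table is usually enough to tell whether time is going to compute kernels, data movement, or CPU-side overhead, which is the first step in deciding which lever to pull.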

Why this topic matters

Over a 72-hour sprint, I trained a tiny experimental 0.5-billion-parameter Qwen3-MoE-style model to a Chinchilla-scale token count (10 billion tokens) on 4 consumer graphics cards at my home, cutting the run from 60+ hours down to 13.2. Getting there involved numerous tricks, learning different tools, and figuring out where the bottlenecks were and how to work around them. Join me as I share the secrets most people only find through experience.

You'll learn from

Zach Mueller

Technical Lead, Hugging Face

I've been in the field for almost a decade. I got my start in the fast.ai community, where I quickly learned how modern-day training pipelines are built and operated. I then moved to Hugging Face, where I'm the Technical Lead on the accelerate project and manage the transformers Trainer.

Sign up to join this lesson
