4 Weeks · Cohort-based Course
Learn the techniques used today to take your model training from Colab to Clusters
This course is popular: 6 people enrolled last week.
Hosted by
Zachary Mueller
🤗 accelerate Technical Lead with a decade of experience
Course overview
Master distributed training and real-world scale techniques from top engineers.
Whether you're an ML engineer looking to move beyond single-GPU experiments, or a product leader seeking to understand the language your AI team speaks, this course will give you the hands-on skills and conceptual clarity to operate confidently at scale.
What Makes This Course Different
Rather than saying "here's PyTorch FSDP, here's how it works, now use this configuration," we're going to build every core parallelism strategy from scratch.
I'm not here to tell you how to use a framework. I'm here to make sure your implementations are sound, that you can ace that interview, and that you know exactly what the tools you're using are doing during large-scale training.
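To give a concrete flavor of what "from scratch" means here, below is a minimal sketch (my own illustration, not course material, assuming torch.distributed) of the core idea behind DDP: every rank computes gradients on its own shard of the data, then the gradients are averaged with an all-reduce before the optimizer step.

```python
# Minimal DDP-style sketch (illustrative only, assumes torch.distributed).
# Launch with: torchrun --nproc_per_node=2 ddp_sketch.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")             # use "nccl" on multi-GPU nodes
    rank, world = dist.get_rank(), dist.get_world_size()

    torch.manual_seed(0)                        # identical initial weights on every rank
    model = torch.nn.Linear(16, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    torch.manual_seed(rank)                     # each rank sees a different data shard
    x, y = torch.randn(32, 16), torch.randn(32, 1)

    for _ in range(5):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        # The heart of DDP: average gradients across ranks so every
        # replica applies the same update and stays in sync.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Production DDP layers gradient bucketing and communication/compute overlap on top of this idea; building it yourself is what makes those optimizations legible.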
All course materials and recordings are available after enrollment, and you’ll get free lifetime access to future cohorts.
💡 What’s included:
6+ core lessons split across 4 weeks.
3+ hands-on workshops covering tricks I've personally learned, used, and am actively developing
All 14+ guest lectures from the prior cohort
$2000 in compute from Modal and Lambda
Class Discord with lifetime access
100% money-back guarantee (within 14 days of starting the course)
🧠 Distributed Training Course
Learn the foundations and modern techniques used in real-world LLM scaleups across hundreds or thousands of GPUs.
5 core instructor-led workshops:
DDP from scratch and avoiding data bottlenecks
ZeRO (Part 1): How model sharding enables scale (a toy code sketch follows below)
ZeRO (Part 2): Efficiency tradeoffs and stage comparison
Pipeline & Tensor Parallelism: Solving communication slowdowns
Multi-Dimensional Parallelism (recorded from last cohort): Combining all methods for throughput
Plus other targeted workshops covering:
How DataLoaders work with distributed training
Using FP8 in the real world (and on consumer hardware)
How PyTorch traces help us verify our implementations
More planned and on the way
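As a taste of the ZeRO workshops above, here's a toy sketch (my own illustration, not course code, assuming torch.distributed) of the Stage 1 idea: gradients are still averaged on every rank as in DDP, but each rank holds optimizer state for, and updates, only its own slice of the parameters, then shares the updated weights.

```python
# Toy ZeRO Stage 1 sketch (illustrative only, assumes torch.distributed).
# Launch with: torchrun --nproc_per_node=2 zero1_sketch.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")             # use "nccl" on multi-GPU nodes
    rank, world = dist.get_rank(), dist.get_world_size()

    torch.manual_seed(0)                        # identical initial weights on every rank
    model = torch.nn.Linear(16, 16)
    params = list(model.parameters())

    # Round-robin ownership: each rank builds Adam state for only ~1/world
    # of the parameters -- that reduced optimizer-state memory is Stage 1.
    owned = [p for i, p in enumerate(params) if i % world == rank]
    opt = torch.optim.Adam(owned, lr=1e-3)

    torch.manual_seed(rank)                     # each rank gets its own data shard
    x = torch.randn(8, 16)

    for _ in range(3):
        model.zero_grad(set_to_none=True)
        loss = model(x).pow(2).mean()
        loss.backward()
        for p in params:                        # gradients averaged as in DDP
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
        opt.step()                              # each rank updates only the slice it owns
        for i, p in enumerate(params):          # owners broadcast their updated weights
            dist.broadcast(p.data, src=i % world)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Stages 2 and 3 extend the same idea to gradients and the parameters themselves, which is the progression the ZeRO workshops walk through.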
Prior (included) cohort talks include:
🏕️ Fireside Chats
Hear from experts about their real experiences taking models to scale, the challenges they hit, and the discoveries they made
Yuxiang Wei (Meta FAIR)
📣 Conference Talks
Applied Track
Hear how industry leaders are solving real-world scale problems:
Robert Nishihara (Ray, Anyscale): Scaling across thousands of GPUs with Ray
Sami Jaghouar (Prime Intellect): Decentralized global-scale training
Tunji Ruwase (Snowflake): Efficient long-context training with Arctic
Prince Canuma: Local ML workloads using Apple Silicon + MLX
Pretraining Track
Deep dives into LLM pretraining at scale:
Phuc Nguyen (Hugging Face): A practitioner's guide to FP8
Elie Bakouch (Hugging Face): Hyper-optimizing LLMs with MoE, MLA & more
Daniel Han (UnslothAI): Speeding up training with Triton & custom kernels
Guest Lectures include:
Sylvain Gugger (Jane Street): Overview of ZeRO
Wanchao Liang (TorchTitan): DTensor and large-scale pretraining
Wing Lian (Axolotl): 2D Parallelism with Axolotl
Ferdinand Mom (Hugging Face): Multi-dimensional parallelism
Less Wright (Meta): Async Tensor Parallelism
Matej Sirovatka (Hugging Face): Expert Parallelism for MoE
Marc Sun (Hugging Face): Deployment strategies at scale
✅ Guarantee
If you're not satisfied, we offer a 100% refund up to 14 days after the course begins. No risk, just learning.
01. Beginner to intermediate MLEs who want to make sure their skills are relevant in today's market
02. Senior engineers tired of piecing together half-solutions from publications, frameworks, and more
03. Team leads who want confidence that their engineers can execute at scale without burning time
04. CTOs who need to make fast, informed decisions about how to scale LLMs
You don't need to be an expert, but you should have some experience training a model of some kind, whether in PyTorch, TensorFlow, or similar.
I'm not here to teach you matrix calculus, and we won't go that advanced. However, some core math is still needed.
PyTorch is written in Python, and so is the whole course, so some experience with the language will serve you well.
Train 100B+ models across 8–1,000 GPUs efficiently
You’ll understand the core problems teams face during large-scale training and how to avoid them using proven methods.
Build real-world experience with modern training techniques
You won't just watch; you'll train models using DDP, ZeRO, pipeline parallelism, and more, each applied in code.
Understand which training methods to use and when
You’ll learn how to match technique to context. Whether it’s model size, hardware limits, or team constraints, you’ll know what fits.
Be ready before training becomes your bottleneck
Most teams wait too long to prepare for scale. This course makes sure you’re ready before your current training setup stops working.
Go from scattered tutorials to production-ready training skills
You’ll connect theory with practice and walk away with working knowledge you can apply in real systems.
Personalized Instruction
Generous office hours ensure that students can ask questions about their specific issues, interests, and needs.
Live sessions
Learn directly from Zachary Mueller in a real-time, interactive format.
Lifetime access
Go back to course content and recordings whenever you need to, and have access to all future cohorts
Generous office hours
Bring your blockers to office hours and leave with answers. Get feedback, debug help, and real support when you need it.
Community of peers
Stay accountable and share insights with like-minded professionals.
Certificate of completion
Share your new skills with your employer or on LinkedIn.
Course notebooks & code
Detailed course notebooks with meticulous notes to walk you through the material and help you learn along the way
Compute Credits
$1000 in Modal compute credits and $1000 in Lambda compute credits
Maven Guarantee
This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.
19 live sessions • 4 lessons
Live sessions run from Nov 4 through Nov 28. Scheduled lessons include:
Distributed Data Parallelism From Scratch
ZeRO: Stage 1 & 2
ZeRO: Stage 3 and Efficient ZeRO Strategies
Distributed Training Lexicon
The Distributed Training Lexicon is a free resource of 49 distributed training terms with paired definitions and accompanying visualizations. The goal is a quick cheat sheet to glance at whenever you need a reminder of what a given method is.
Download it for free
Free Access to Part of Lesson 1
Hi there! To give you a good sense of how the course is oriented and what some of the content looks like, I'm sharing an exclusive preview of the course webpage and how some of the material is shaped. I've worked hard to make sure Quarto and Jupyter help me create educational material that will wow you, so let me know if it does!
(Note: this material preview may change as the course develops, but only in additive ways.)
Get access to the webpage
Hamel Husain
Mark Saroufim
Stas Bekman
Wing Lian
Radek Osmulski
Kevin Bird
Dr. Nathan Lambert
Join an upcoming cohort
Cohort 2
$1,500
Dates
Payment Deadline