5 Weeks
·Cohort-based Course
Learn to train 100B+ parameter models efficiently from the engineers who built the leading frameworks and techniques
This course is popular
23 people enrolled last week.
Hosted by
Zachary Mueller + 14 Guest Experts
🤗 accelerate Technical Lead with a decade of experience
Course overview
🚨 The Problem
Distributed training is full of invisible traps.
Get it wrong, and months of team velocity are gone, along with tens or hundreds of thousands of dollars wasted on unoptimized compute.
Most teams guess their way through ZeRO configs, DDP setups, and pipeline logic. That leads to stalled releases and blown budgets.
✅ The Solution
This course is the only live training that teaches modern distributed AI workflows directly from people who have scaled training across tens of thousands of GPUs and laid the foundations the AI ecosystem is built on today.
You’ll walk away knowing what to use, when to use it, and why it works, across real production-scale scenarios.
🔥 Built for High-Impact Teams
Engineers: Train large models across 8 to 1,000+ GPUs with precision
Tech leads and CTOs: Make confident decisions on systems, tools, and costs
Founders and startups: Build smarter AI infrastructure that scales without waste
🧩 What Makes This Course Different
* Learn directly from a curated list of 14 world-class experts in the field from Meta, Ray, Snowflake, Hugging Face, and more
* Get five focused workshops with matching hands-on labs for real skills
* Gain access to a private alumni network for continued learning and hiring
* Use $500+ in real compute credits to apply your skills immediately
📝 Guest Experts and Case Studies
14 guest speakers from top AI teams. Each session ties directly to a problem you will need to solve when crafting the ideal training scenario.
Applied Track:
Robert Nishihara (Ray)
How to orchestrate GPU training across thousands of nodes effectively
Sami Jaghouar (Prime Intellect)
Building decentralized training systems at a global scale
Tunji Ruwase (Snowflake)
Training long-context models efficiently without exploding memory
Prince Canuma
Running LLMs directly on Apple Silicon for local-first development
Pretraining Track:
Phuc Nguyen (Hugging Face)
Mastering FP8 precision training
Elie Bakouch (Hugging Face)
Advanced MoE and parallelism strategies
Daniel Han (UnslothAI)
How Triton kernels can be an easy optimization win, and other modern practices
Distributed Technique Track:
Sylvain Gugger (Jane Street)
Overview of the ZeRO algorithm
Wanchao Liang (Thinking Machines)
How DTensor helps new engineers understand distributed training faster
Ferdinand Mom (Hugging Face)
How you should stack parallelism strategies to maximize your training capacity
Less Wright (Meta)
Why Async Tensor Parallelism is necessary to train across clusters of thousands of GPUs
Matej Sirovatka (Hugging Face)
Why Expert Parallelism is a necessity when training MoE models at scale
Marc Sun (Hugging Face)
Why we need new strategies for deploying large models at scale, and how to get there
01 · CTOs who need to make fast, informed decisions on how to scale LLMs.
02 · Team leads who want confidence that engineers can execute at scale without burning time.
03 · Senior engineers tired of piecing together half-solutions from publications, frameworks, and more.
This course is for engineers already comfortable training models using PyTorch or Hugging Face Transformers.
Trusted by top builders at Hugging Face, Modal, Snowflake, and Meta.
"Zach is one of the key people making distributed training accessible" - Mark Saroufim (Software Engineer at Meta)
Train 100B+ models across 8–1,000 GPUs efficiently
You’ll understand the core problems teams face during large-scale training and how to avoid them using proven methods.
Build real-world experience with modern training techniques
You won’t just watch; you’ll train models using DDP, ZeRO, pipeline parallelism, and more, each one applied in code.
Understand which training methods to use and when
You’ll learn how to match technique to context. Whether the constraint is model size, hardware limits, or team capacity, you’ll know what fits.
Be ready before training becomes your bottleneck
Most teams wait too long to prepare for scale. This course makes sure you’re ready before your current training setup stops working.
Go from scattered tutorials to production-ready training skills
You’ll connect theory with practice and walk away with working knowledge you can apply in real systems.
Personalized Instruction
Generous office hours ensure that students can ask questions about their specific issues, interests, and needs.
Live sessions
Learn directly from Zachary Mueller + 14 Guest Experts in a real-time, interactive format.
Lifetime access
Return to course content and recordings whenever you need to, with access to all future cohorts.
Generous office hours
Bring your blockers to office hours and leave with answers. Get feedback, debug help, and real support when you need it.
Community of peers
Stay accountable and share insights with like-minded professionals.
Certificate of completion
Share your new skills with your employer or on LinkedIn.
Course notebooks
Detailed course notebooks and materials with meticulous notes to walk you through the content as you learn.
Compute Credits
$500 in Modal compute credits, 6 months of Hugging Face Pro
Maven Guarantee
This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.
Course dates: Sep 2, Sep 4, Sep 9, Sep 16, Sep 23, Sep 25, Sep 30, Oct 2
Free Access to Part of Lesson 1
Hi there! To give you a good grasp of how the course will be oriented and an idea of what some of the content looks like, I can share an exclusive preview of the course webpage and how some of the content is shaped. I've worked hard to make sure Quarto and Jupyter help me create educational material that will wow you, so let me know if it does!
(Note: this preview may change as the course develops, but material will only be added, not removed.)
Get access to the webpage
Hamel Husain
Mark Saroufim
Stas Bekman
Wing Lian
Radek Osmulski
Kevin Bird
Dr. Nathan Lambert
Join an upcoming cohort
Cohort 1
$2,400
Dates
Payment Deadline