From Scratch to Scale: Large Scale Training in the Modern World

New · 5 Weeks · Cohort-based Course

Distributed training is everywhere now as models scale into trillions of parameters. Learn from world experts how it gets done.

This course is popular: 30 people enrolled last week.

Previously at: Hugging Face, Accenture

Course overview

Build the skills to answer the call when it's time to take your models to scale

This started as a distributed training course. It organically grew into an all-encompassing learning event, with world-class speakers who have deep knowledge and experience with the modern techniques used to train at scale today. The distributed training course is still here, in its full capacity.

But now, each week features hand-picked speakers discussing topics directly related to the tips and techniques brought up in that week's lecture,

helping round out your understanding of the concepts and modern practices.


All materials and recordings will be available to participants who enroll. There are 14 talks across three tracks and 5 lessons (and growing) in addition to office hours.


Conference Talks

------------------


Applied Track

------------------

Hear from industry leaders how they have approached the problems of training at scale, and the solutions they've built


Robert Nishihara: Cofounder of Ray, Anyscale

- How Ray and Anyscale help you scale training across thousands of GPUs

Sami Jaghouar: Research Engineer, Prime Intellect

- How decentralized training at a global scale is possible, and how Prime Intellect gets it done

Tunji Ruwase: Software Engineer, Snowflake, formerly Microsoft

- How Arctic Long Sequence Training makes training at multi-million-token context lengths efficient and scalable

Prince Canuma: Machine Learning Research Engineer

- How MLX (Apple Silicon) lets you combine M-series Macs to run machine learning workloads locally for a fraction of the cloud cost


Pretraining Track

------------------

Pretraining is the foundation of modern LLM research. Learn the techniques used today for creating the best model possible.

Phuc Nguyen: Research Engineer, Hugging Face

- A Practitioner's Guide to FP8 Training

Elie Bakouch: Machine Learning Researcher, Hugging Face

- How modern LLMs like DeepSeek and others are hyper-optimized for efficient training through techniques like MLA, MoE, and more

Daniel Han: Creator of UnslothAI, formerly NVIDIA

- How Triton kernels and other techniques save you hundreds of hours in training time


Distributed Training Course

------------------

Learn the techniques used today when training and fine-tuning models at scale (hundreds or thousands of GPUs at once). Five workshops guide you through implementing the most common techniques from the ground up, including Distributed Data Parallelism, the entirety of ZeRO, and more.


Workshop 1: Distributed Data Parallelism from scratch, and how to make sure your data isn't your bottleneck (a minimal code sketch of the core idea follows this list)

Workshop 2: The Zero Redundancy Optimizer (Part 1): How model sharding helps you train large models across multiple smaller GPUs at once

Workshop 3: The Zero Redundancy Optimizer (Part 2): How the different stages of ZeRO affect training time, and which are the most efficient for your use case

Workshop 4: Pipeline and Tensor Parallelism: How and why these techniques help solve the biggest slowdown when training: communication

Workshop 5: Multi-Dimensional Parallelism: Why today we combine all of the techniques above at once to get the highest throughput possible during training
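
For a taste of what Workshop 1 builds toward, here is a minimal sketch of the data-parallel idea using only plain torch.distributed collectives: every rank runs the same toy model on its own slice of data, and gradients are averaged with an all-reduce before each optimizer step. The model, data, and hyperparameters are placeholders for illustration (this is not course code); launch it with something like `torchrun --nproc_per_node=2 ddp_sketch.py`.

```python
# Minimal data-parallel training sketch with raw torch.distributed collectives.
# Assumes one process per device, launched via torchrun (which sets RANK,
# WORLD_SIZE, and LOCAL_RANK). Illustrative only.
import torch
import torch.distributed as dist
from torch import nn

def main():
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend)
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    if torch.cuda.is_available():
        device = torch.device("cuda", rank % torch.cuda.device_count())
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")

    # Identical toy model on every rank: broadcast rank 0's weights so all
    # replicas start from the same parameters.
    model = nn.Linear(32, 1).to(device)
    for p in model.parameters():
        dist.broadcast(p.data, src=0)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)

    for step in range(10):
        # Each rank draws its own (random, illustrative) shard of data.
        x = torch.randn(8, 32, device=device)
        y = torch.randn(8, 1, device=device)
        loss = nn.functional.mse_loss(model(x), y)

        opt.zero_grad()
        loss.backward()

        # The heart of data parallelism: average gradients across ranks so
        # every replica takes the same optimizer step.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
        opt.step()

        if rank == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The workshops build this up (and well beyond it) from first principles, so you understand why the all-reduce sits where it does rather than just calling a wrapper.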


The Distributed Training Course has these guest lecturers and topics:


Sylvain Gugger: Jane Street, formerly Hugging Face & fast.ai

- Introduction to Distributed Training, and an overview of ZeRO

Wanchao Liang: Formerly Meta, Creator of TorchTitan/DTensor

- How TorchTitan has helped developers take model pretraining to scale faster, and how DTensors have made this easier

Ferdinand Mom: Research Engineer, Hugging Face

- What multi-dimensional parallelism is, and why it's a crucial technique for training models at scale today

Less Wright: PyTorch Partner Engineer, Meta

- How Async Tensor Parallelism helps you train at scale efficiently

Matej Sirovatka: Machine Learning Engineer, Hugging Face

- What expert parallelism is, and why it's needed when training Mixture-of-Experts models

Marc Sun: Machine Learning Engineer, Hugging Face

- Tips and tricks needed for deploying large models at scale


Free Compute

------------------


To give you hands-on experience, we're sponsored by the following companies so you can start training models at scale from Day 1:


- Hugging Face: 6 months of Pro

- Modal: $500 in compute credits

More to be announced

Who is this course for

01

People who want to learn what it takes to train models at scale

02

People who want to know how to use those leftover GPUs lying around to their full capacity to train big models at home

03

Field experts who want to know where the industry currently stands and what techniques have emerged in the last few years

Prerequisites

  • One year of PyTorch experience

    You should generally be familiar with core tensor operations and how a model is built with PyTorch

  • Basic understanding of tensor math

    We’re going to be using lots of operations from torch.distributed. I’ll teach them to you, but you should already know the core tensor operations

  • Trained at least one model in your life

    Understanding how model training works on a single GPU and the full flow (data -> outputs -> gradients -> backprop) is necessary; a minimal example of that loop is sketched below
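
If you want to gauge that last prerequisite, the snippet below is roughly the level of familiarity assumed: a toy single-device training loop walking the data -> outputs -> gradients -> backprop cycle. The model and data are made up purely for illustration.

```python
# The single-GPU (or CPU) training flow referenced above:
# data -> outputs -> loss -> gradients (backprop) -> optimizer step.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(32, 16, device=device)      # data
    y = torch.randn(32, 1, device=device)
    outputs = model(x)                           # outputs
    loss = nn.functional.mse_loss(outputs, y)    # loss
    opt.zero_grad()
    loss.backward()                              # gradients via backprop
    opt.step()                                   # parameter update
```

If this loop feels comfortable, you have the baseline the course assumes; everything distributed builds on top of it.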

What you’ll get out of this course

Understand not just what distributed training is, but become an expert in it

I don't want you to take this course and go "okay, I think I get what's happening here." I want you to walk away feeling knowledgeable enough that if someone came up to you and said "here's 1,000 GPUs for a day, do something," you could move into action immediately.

Deep understanding of different parallelization strategies

This won't be a surface-level course teaching you "how to use torch.distributed.fsdp.FullyShardedDataParallel". We're going to understand it from the ground up.

Train a few models on multiple GPUs

Above all, I'm going to make sure everyone gets experience training at least one model in a distributed fashion through the homework by the end of this course.

Hands-On Exercises, Examples, and Code

This is not a course where I bore you with a slide deck the entire time (though for some topics that might be needed). Instead, we'll be down in the weeds of code, with you implementing alongside me.

Personalized Instruction

Generous office hours ensure that students can ask questions about their specific issues, interests, and needs.

What’s included

Live sessions

Learn directly from Zachary Mueller in a real-time, interactive format.

Lifetime access

Go back to course content and recordings whenever you need to.

Course notebooks

Detailed course notebooks with meticulous notes to help walk you through the material and learn along the way

Community of peers

Stay accountable and share insights with like-minded professionals.

Certificate of completion

Share your new skills with your employer or on LinkedIn.

Maven Guarantee

This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.

Course syllabus

Week 1
Sep 1—Sep 7

    Course Introduction and `nbdistributed`: A Jupyter framework for interactive distributed PyTorch
    Tue 9/2, 6:00 PM—7:00 PM (UTC)

    Distributed Data Parallelism From Scratch
    Thu 9/4, 6:00 PM—7:00 PM (UTC)

Week 2
Sep 8—Sep 14

    ZeRO: Stage 1 & 2
    Tue 9/9, 6:00 PM—7:00 PM (UTC)

Week 3
Sep 15—Sep 21

    ZeRO: Stage 3 and Efficient ZeRO Strategies
    Tue 9/16, 6:00 PM—7:30 PM (UTC)

Week 4
Sep 22—Sep 28

    Pipeline Parallelism and Tensor Parallelism
    Tue 9/23, 6:00 PM—7:30 PM (UTC)

    Efficient Strategies for Distributed Inference
    Thu 9/25, 6:00 PM—7:00 PM (UTC)

Week 5
Sep 29—Oct 3

    2D Parallelism
    Tue 9/30, 6:00 PM—7:00 PM (UTC)

    3D Parallelism (Guest Speaker)
    Thu 10/2, 6:00 PM—7:00 PM (UTC)

Free resource

Free Access to Part of Lesson 1

Hi there! To help you get a good grasp of how the course will be oriented and an idea of what some of the content looks like, I can share with you an exclusive preview of what the course webpage will be and how some of the content is shaped. I've worked hard to make sure Quarto and Jupyter help me create educational material that will wow you, so let me know if it does!


(Note: this material preview may change as the course develops, but only for additive purposes)

Get access to the webpage

Instructor is a recognized expert, with hands-on experience

        Zach is my go to person on anything dealing with distributed training. He has maintained the most popular library in the world that helps developers with this problem, which means he’s familiar with all of the issues mere mortals have while tackling this problem. Zach is the best person to teach this subject. I am taking this course.
Hamel Husain

Founder, Parlance Labs | Evals, evals, evals
        Zach is one of the key people in the world making distributed machine learning more accessible. He has firsthand experience building some incredibly popular tools like huggingface/accelerate. If you're GPU poor but considering moving to the GPU middle class then I can't think of a better instructor.
Mark Saroufim

Software Engineer at Meta | Co-founder, GPU MODE
        As a long time maintainer of HF Accelerate, Zach has had to master not only a deep understanding of ML scaling methods, but also to integrate them into a cohesive API for the masses to use. I've seen Zach consistently deliver robust, well-integrated solutions with a deep system-level understanding. You will be in good hands with Zach at the helm.
Stas Bekman

Senior Machine Learning Engineer, Snowflake
        Zach's stewardship of Accelerate and managing the intricacies of multiple distributed technologies (while abstracting it into an easy to use API) make Zach the preeminent leader in distributed training. Zach has shown deep understanding of everything from fundamentals to implementation, and is the first person that would come to mind to teach this
Wing Lian

Founder, Axolotl
        Zach is truly one in a million. I've never met anyone who puts so much time and thought into crafting deep learning code. With his background and experience, learning from him is an invaluable opportunity.
Radek Osmulski

Senior Data Scientist, NVIDIA
        Zach has a strong grasp of the fundamentals of fastai, but what really sets him apart is his ability to teach. He mixes in practical topics throughout his lessons, making every video engaging and worthwhile. With a proven track record of creating high-quality content, I’m confident that any course Zach produces will be worth your time and attention
Kevin Bird

Co-Founder, Problem Solvers Guild
        Zach and I used to work together at Hugging Face; since then and through today, he's been building foundational tools for the open ML community to use and learn distributed training techniques. I've personally used his tools for years to train models such as OLMo and Tülu, along with benefiting from his knowledge to better understand what is going on.
Dr. Nathan Lambert

LLM Post Training Lead, Ai2

Meet your instructor

Zachary Mueller

I've been in the field for almost a decade now. I first started in the fast.ai community, quickly learning how modern-day training pipelines are built and operated. Then I moved to Hugging Face, where I'm the Technical Lead on the accelerate project and manage the transformers Trainer.


I've written numerous blogs, courses, and given talks on distributed training and PyTorch throughout my career.


Through this experience, I've condensed almost a decade of learning into this course, and I'm excited to bring you all along on the learning journey.

Join an upcoming cohort

From Scratch to Scale: Large Scale Training in the Modern World

Cohort 1

$2,200

Dates

Sep 1—Oct 3, 2025

Payment Deadline

Aug 31, 2025
Get reimbursed