From Scratch to Scale: Hands on Distributed Training from the ground up

New · 5 Weeks · Cohort-based Course

Distributed training is now the norm for training models in industry. Learn the best techniques used by the pros.

This course is popular: 14 people enrolled last week.

Course overview

Go from Zero to Distributed Hero

Do any of these sound like you?


1. You’ve heard of multi-GPU training... but haven’t touched it.


2. You’re not sure what “data parallelism” or “pipeline parallelism” actually mean, let alone when to use them.


3. You’ve dabbled with DDP, but it felt more like trial-and-error than engineering.


4. You want to train larger models more efficiently, but you aren’t sure how to do it right.


5. You’ve seen mentions of ZeRO, FSDP, DeepSpeed, but don’t know how they differ or why they matter.


6. You worry that training at scale is only for people at FAANG or OpenAI.


If any of that speaks to you… this course is for you!


This is a hands-on, code-first course designed for engineers and researchers who want to understand how large-scale training actually works and be able to do it themselves. You'll go from:


"I've never used more than one GPU before, ever"

to

"I know how I could scale a training job across 2, 8, 64, or even 256 GPUs, and I understand the best ways of going about it"


If you're ready to stop watching from the sidelines and start building at scale, you belong in this cohort.


---

WHAT YOU CAN EXPECT


This course will provide you with hands-on experience training models at scale. We will meet once or twice a week for five weeks, with generous office hours and guest speakers (see the course schedule below).


I'll also hold office hours and host a Discord community where you can communicate with me and other students. In return, you will learn the skills needed to keep up with today's training practices in the deep learning world and stay ahead of the competition. All sessions will be recorded and available to students asynchronously.


---


FREE COMPUTE


To make sure all students have the resources they need to train models throughout the course, Modal is sponsoring each student with $500 in compute credits for their platform.


On top of this, Hugging Face is sponsoring each student with 6 months of Pro!


---

COURSE CONTENT


Lesson 1: Fundamentals & Distributed Data Parallelism from Scratch

- Understanding how to launch distributed jobs interactively through Jupyter notebooks, to help both bootstrap the learning experience and make debugging smoother

- How distributed training differs when we migrate off a single GPU

- Best practices for making sure your data is consumed as efficiently as possible during training

- How `torch.nn.parallel.DistributedDataParallel` works and how we can build it ourselves
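
To give a flavor of what building it ourselves looks like, here is a minimal sketch (not the course's actual code; `train_step` and its arguments are illustrative) of the core idea behind DDP: every rank keeps a full model replica, computes gradients on its own slice of the batch, and averages the gradients across ranks with an all-reduce before the optimizer step.

```python
# Illustrative sketch only: assumes torch.distributed has already been
# initialized (e.g. via torchrun) and the model is replicated on every rank.
import torch
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Average gradients across ranks so every replica takes the identical step.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
    optimizer.step()
    return loss.detach()
```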


Lesson 2: The Zero Redundancy Optimizer (Part 1)

- Getting a grasp on what "ZeRO"/Fully Sharded Data Parallelism is

- Implementing the first and second stages (optimizer-state sharding, then gradient sharding) from scratch; see the sketch after this list

- Understanding how to benchmark distributed code to get accurate measurements
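
As a taste of the kind of code we'll write, here is a deliberately simplified sketch of the Stage 1 idea (optimizer-state sharding) against plain `torch.distributed`; the round-robin partitioning and function names are my own illustration, not the course's implementation.

```python
# Simplified ZeRO Stage 1 sketch: optimizer state is sharded across ranks while
# parameters and gradients stay replicated. Assumes torch.distributed is
# already initialized.
import torch
import torch.distributed as dist

def make_zero1_optimizer(model, lr=1e-3):
    rank, world_size = dist.get_rank(), dist.get_world_size()
    params = list(model.parameters())
    # Round-robin: each rank keeps Adam state only for the params it "owns".
    owned = [p for i, p in enumerate(params) if i % world_size == rank]
    return torch.optim.Adam(owned, lr=lr), params

def zero1_step(optimizer, params):
    world_size = dist.get_world_size()
    # Gradients are still fully replicated, so average them on every rank.
    for p in params:
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    # Each rank updates only the parameters whose optimizer state it owns...
    optimizer.step()
    # ...then broadcasts the fresh values so every replica stays identical.
    for i, p in enumerate(params):
        dist.broadcast(p.data, src=i % world_size)
    optimizer.zero_grad()
```

Stage 2 then removes the redundant gradient copies as well, reduce-scattering gradients so each rank only ever holds the slice it needs for its own update.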


Lesson 3: The Zero Redundancy Optimizer (Part 2)

- Understand how the third and final stage of ZeRO (parameter sharding) is implemented; see the gather-then-free sketch after this list

- Learn when to use each technique appropriately

- How to choose and bootstrap each technique based on available compute to speed up training
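
Stage 3 additionally shards the parameters themselves, and the heart of it is a gather-then-free pattern around each layer. A highly simplified, hypothetical illustration (real implementations prefetch and overlap these gathers with compute):

```python
# Illustrative only: each rank holds a flat 1D shard of a layer's weight,
# padded so that world_size * shard.numel() >= full_shape.numel().
import torch
import torch.distributed as dist

def gather_full_weight(weight_shard, full_shape):
    world_size = dist.get_world_size()
    pieces = [torch.empty_like(weight_shard) for _ in range(world_size)]
    dist.all_gather(pieces, weight_shard)
    flat = torch.cat(pieces)[: full_shape.numel()]
    return flat.view(full_shape)

def sharded_linear_forward(x, weight_shard, full_shape):
    # Materialize the full weight only for the duration of this layer's forward.
    full_weight = gather_full_weight(weight_shard, full_shape)
    out = torch.nn.functional.linear(x, full_weight)
    del full_weight  # free the gathered copy as soon as the layer is done
    return out
```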


Lesson 4: Other Distributed Strategies and Distributed Inference

- Learn and implement Pipeline Parallelism and Tensor Parallelism from scratch (a one-layer tensor-parallel sketch follows this list)

- Understand how these techniques make it possible to deploy huge models at scale
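
For a flavor of tensor parallelism, here is a minimal, hypothetical sketch of a column-parallel linear layer: each rank stores a slice of the weight's output dimension, computes its slice of the output, and the slices are stitched back together with an all-gather.

```python
# Illustrative sketch only; assumes torch.distributed is initialized and each
# rank holds local_weight of shape (out_features // world_size, in_features).
import torch
import torch.distributed as dist

def column_parallel_linear(x, local_weight):
    world_size = dist.get_world_size()
    # Each rank computes its own slice of the output features.
    local_out = torch.nn.functional.linear(x, local_weight)
    # Stitch the slices back together along the feature dimension.
    gathered = [torch.empty_like(local_out) for _ in range(world_size)]
    dist.all_gather(gathered, local_out)
    return torch.cat(gathered, dim=-1)
```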


Lesson 5: Combining Distributed Strategies

- Understand how to go from 1D parallelism to 3D parallelism (see the device-mesh sketch after this list)

- Learn which frameworks implement this best
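
One way to picture combining strategies is laying the ranks out on a grid, with one axis per parallelism dimension. A small sketch using PyTorch's device-mesh API (available in recent PyTorch releases; the 2 x 4 layout below is just an example):

```python
# Hypothetical layout: 8 GPUs arranged as 2 data-parallel replicas,
# each split into 4 tensor-parallel shards.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
dp_group = mesh["dp"].get_group()  # collectives for gradient averaging across replicas
tp_group = mesh["tp"].get_group()  # collectives inside sharded layers
```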


Guest Speakers:

* Sylvain Gugger (Jane Street): Topic TBD

* Marc Sun (Hugging Face): Distributed Inference

* Matej Sirovatka (Hugging Face): Expert Parallelism

* Less Wright (Meta): Async Tensor Parallelism

* Ferdinand Mom (Hugging Face): Combining DP, TP, and PP

Who is this course for

01

People who want to learn what it takes to train models at scale

02

People who want to put those leftover GPUs lying around to full use training big models at home

03

Field experts wanting to know where the industry currently stands and what techniques have come out in the last few years

What you’ll get out of this course

Understand not just what distributed training is, but become an expert in it

I don't want you to take this course and go, "okay, I think I get what's happening here." I want you to walk away knowledgeable enough that if someone came up to you and said, "here are 1,000 GPUs for a day, do something," you could move into action immediately.

Deep understanding of different parallelization strategies

This won't be a surface-level course teaching you "how to use `torch.distributed.fsdp.FullyShardedDataParallel`". We're going to understand it from the ground up.

Train a few models on multiple GPUs

Above all, I'm going to make sure that by the end of this course everyone gets experience training at least one model in a distributed fashion through the homework.

Hands-On Exercises, Examples, and Code

This is not a course where I bore you with a slide deck the entire time (though for some topics slides might be needed). Instead, we'll be down in the weeds of code, with you implementing alongside me.

Personalized Instruction

Generous office hours ensure that students can ask questions about their specific issues, interests, and needs.

What’s included

Zachary Mueller

Live sessions

Learn directly from Zachary Mueller in a real-time, interactive format.

Lifetime access

Go back to course content and recordings whenever you need to.

Course notebooks

Detailed course notebooks with meticulous notes to walk you through the material and help you learn along the way

Community of peers

Stay accountable and share insights with like-minded professionals.

Certificate of completion

Share your new skills with your employer or on LinkedIn.

Maven Guarantee

This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.

Course syllabus

Week 1

Sep 1—Sep 7

    Tue, Sep 2 (6:00 PM—7:00 PM UTC): Course Introduction and `nbdistributed`: A Jupyter framework for interactive distributed PyTorch

    Thu, Sep 4 (6:00 PM—7:00 PM UTC): Distributed Data Parallelism From Scratch

Week 2

Sep 8—Sep 14

    Tue, Sep 9 (6:00 PM—7:00 PM UTC): ZeRO: Stage 1 & 2

Week 3

Sep 15—Sep 21

    Tue, Sep 16 (6:00 PM—7:30 PM UTC): ZeRO: Stage 3 and Efficient ZeRO Strategies

Week 4

Sep 22—Sep 28

    Tue, Sep 23 (6:00 PM—7:30 PM UTC): Pipeline Parallelism and Tensor Parallelism

    Thu, Sep 25 (6:00 PM—7:00 PM UTC): Efficient Strategies for Distributed Inference

Week 5

Sep 29—Oct 3

    Tue, Sep 30 (6:00 PM—7:00 PM UTC): 2D Parallelism

    Thu, Oct 2 (6:00 PM—7:00 PM UTC): 3D Parallelism (Guest Speaker)

Instructor is a recognized expert with hands-on experience

        Zach is my go-to person on anything dealing with distributed training. He has maintained the most popular library in the world that helps developers with this problem, which means he’s familiar with all of the issues mere mortals have while tackling it. Zach is the best person to teach this subject. I am taking this course.
Hamel Husain

Founder, Parlance Labs | Evals, evals, evals
        Zach is one of the key people in the world making distributed machine learning more accessible. He has firsthand experience building some incredibly popular tools like huggingface/accelerate. If you're GPU poor but considering moving to the GPU middle class, then I can't think of a better instructor.
Mark Saroufim

Software Engineer at Meta | Co-founder, GPU MODE
        As a long-time maintainer of HF Accelerate, Zach has had to not only master a deep understanding of ML scaling methods, but also integrate them into a cohesive API for the masses to use. I've seen Zach consistently deliver robust, well-integrated solutions with a deep system-level understanding. You will be in good hands with Zach at the helm.
Stas Bekman

Senior Machine Learning Engineer, Snowflake
        Zach's stewardship of Accelerate and his management of the intricacies of multiple distributed technologies (while abstracting them into an easy-to-use API) make him the preeminent leader in distributed training. Zach has shown deep understanding of everything from fundamentals to implementation, and is the first person who would come to mind to teach this.
Wing Lian

Founder, Axolotl
        Zach is truly one in a million. I've never met anyone who puts so much time and thought into crafting deep learning code. With his background and experience, learning from him is an invaluable opportunity.
Radek Osmulski

Senior Data Scientist, NVIDIA
        Zach has a strong grasp of the fundamentals of fastai, but what really sets him apart is his ability to teach. He mixes in practical topics throughout his lessons, making every video engaging and worthwhile. With a proven track record of creating high-quality content, I’m confident that any course Zach produces will be worth your time and attention.
Kevin Bird

Co-Founder, Problem Solvers Guild
        Zach and I used to work together at Hugging Face; since then and through today, he’s been building foundational tools that the open ML community uses to learn and apply distributed training techniques. I’ve personally used his tools for years to train models such as OLMo and Tülu, and I've benefited from his knowledge to better understand what is going on.
Dr. Nathan Lambert

LLM Post Training Lead, Ai2

Meet your instructor

Zachary Mueller

I've been in the field for almost a decade now. I first started in the fast.ai community, quickly learning how modern-day training pipelines are built and operated. Then I moved to Hugging Face, where I'm the Technical Lead on the accelerate project and manage the transformers Trainer.


I've written numerous blogs, courses, and given talks on distributed training and PyTorch throughout my career.


Through this experience, I've condensed almost a decade of learning into this course, and I'm excited to bring you all along with me on the learning journey.


Join an upcoming cohort

From Scratch to Scale: Hands on Distributed Training from the ground up

Cohort 1

$1,500

Dates

Sep 1—Oct 3, 2025

Payment Deadline

Aug 31, 2025
Get reimbursed