4 Weeks · Cohort-based Course
Learn the techniques used today to take your model training from Colab to Clusters
This course is popular: 6 people enrolled last week.
Hosted by
Zachary Mueller
🤗 accelerate Technical Lead with a decade of experience
Course overview
Master distributed training and real-world scale techniques from top engineers.
Whether you're an ML engineer looking to move beyond single-GPU experiments, or a product leader seeking to understand the language your AI team speaks, this course will give you the hands-on skills and conceptual clarity to operate confidently at scale.
What Makes This Course Different
Rather than saying "here's PyTorch FSDP, here's how it works, now use this configuration," we're going to build every core parallelism strategy from scratch.
I'm not here to tell you how to use a framework. I'm here to make sure your implementations are sound, that you can ace that interview, and that you know exactly what the tools you're using are doing during large-scale training.
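To give a concrete flavor of what "from scratch" means here, below is a minimal sketch (my own illustration, not course material, assuming torch.distributed) of the core idea behind DDP: every rank computes gradients on its own shard of the data, then the gradients are averaged with an all-reduce before the optimizer step.

```python
# Minimal DDP-style sketch (illustrative only, assumes torch.distributed).
# Launch with: torchrun --nproc_per_node=2 ddp_sketch.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")             # use "nccl" on multi-GPU nodes
    rank, world = dist.get_rank(), dist.get_world_size()

    torch.manual_seed(0)                        # identical initial weights on every rank
    model = torch.nn.Linear(16, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    torch.manual_seed(rank)                     # each rank sees a different data shard
    x, y = torch.randn(32, 16), torch.randn(32, 1)

    for _ in range(5):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        # The heart of DDP: average gradients across ranks so every
        # replica applies the same update and stays in sync.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Production DDP layers gradient bucketing and communication/compute overlap on top of this idea; building it yourself is what makes those optimizations legible.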
All course materials and recordings are available after enrollment, and you’ll get free lifetime access to future cohorts.
💡 What’s included:
6+ core lessons split across 4 weeks.
3+ hands-on workshops covering tricks I've personally learned, used, and am actively developing
All 14+ guest lectures from the prior cohort
$2000 in compute from Modal and Lambda
Class Discord with lifetime access
100% money-back guarantee (within 14 days of starting the course)
🧠 Distributed Training Course
Learn the foundations and modern techniques used in real-world LLM scaleups across hundreds or thousands of GPUs.
5 core instructor-led workshops:
DDP from scratch and avoiding data bottlenecks
ZeRO (Part 1): How model sharding enables scale (a toy code sketch follows below)
ZeRO (Part 2): Efficiency tradeoffs and stage comparison
Pipeline & Tensor Parallelism: Solving communication slowdowns
Multi-Dimensional Parallelism (recorded from last cohort): Combining all methods for throughput
Plus other targeted workshops covering:
How DataLoaders work with distributed training
Using FP8 in the real world (and on consumer hardware)
How PyTorch traces help us verify our implementations
More planned and on the way
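As a taste of the ZeRO workshops above, here's a toy sketch (my own illustration, not course code, assuming torch.distributed) of the Stage 1 idea: gradients are still averaged on every rank as in DDP, but each rank holds optimizer state for, and updates, only its own slice of the parameters, then shares the updated weights.

```python
# Toy ZeRO Stage 1 sketch (illustrative only, assumes torch.distributed).
# Launch with: torchrun --nproc_per_node=2 zero1_sketch.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")             # use "nccl" on multi-GPU nodes
    rank, world = dist.get_rank(), dist.get_world_size()

    torch.manual_seed(0)                        # identical initial weights on every rank
    model = torch.nn.Linear(16, 16)
    params = list(model.parameters())

    # Round-robin ownership: each rank builds Adam state for only ~1/world
    # of the parameters -- that reduced optimizer-state memory is Stage 1.
    owned = [p for i, p in enumerate(params) if i % world == rank]
    opt = torch.optim.Adam(owned, lr=1e-3)

    torch.manual_seed(rank)                     # each rank gets its own data shard
    x = torch.randn(8, 16)

    for _ in range(3):
        model.zero_grad(set_to_none=True)
        loss = model(x).pow(2).mean()
        loss.backward()
        for p in params:                        # gradients averaged as in DDP
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
        opt.step()                              # each rank updates only the slice it owns
        for i, p in enumerate(params):          # owners broadcast their updated weights
            dist.broadcast(p.data, src=i % world)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Stages 2 and 3 extend the same idea to gradients and the parameters themselves, which is the progression the ZeRO workshops walk through.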
Prior (included) cohort talks include:
🏕️ Fireside Chats
Hear from experts about their real experiences taking models to scale, the challenges they hit, and the discoveries they made
Yuxiang Wei (Meta FAIR)
📣 Conference Talks
Applied Track
Hear how industry leaders are solving real-world scale problems:
Robert Nishihara (Ray, Anyscale): Scaling across thousands of GPUs with Ray
Sami Jaghouar (Prime Intellect): Decentralized global-scale training
Tunji Ruwase (Snowflake): Efficient long-context training with Arctic
Prince Canuma: Local ML workloads using Apple Silicon + MLX
Pretraining Track
Deep dives into LLM pretraining at scale:
Phuc Nguyen (Hugging Face): A practitioner's guide to FP8
Elie Bakouch (Hugging Face): Hyper-optimizing LLMs with MoE, MLA & more
Daniel Han (UnslothAI): Speeding up training with Triton & custom kernels
Guest Lectures include:
Sylvain Gugger (Jane Street): Overview of ZeRO
Wanchao Liang (TorchTitan): DTensor and large-scale pretraining
Wing Lian (Axolotl): 2D Parallelism with Axolotl
Ferdinand Mom (Hugging Face): Multi-dimensional parallelism
Less Wright (Meta): Async Tensor Parallelism
Matej Sirovatka (Hugging Face): Expert Parallelism for MoE
Marc Sun (Hugging Face): Deployment strategies at scale
✅ Guarantee
If you're not satisfied, we offer a 100% refund up to 14 days after the course begins. No risk, just learning.
01. Beginner to intermediate MLEs who want to make sure their skills are relevant in today's market
02. Senior engineers tired of piecing together half-solutions from publications, frameworks, and more
03. Team leads who want confidence that their engineers can execute at scale without burning time
04. CTOs who need to make fast, informed decisions about how to scale LLMs
You don't need to be an expert, but you should have some experience training a model of some kind, whether in PyTorch, TensorFlow, or similar.
I'm not here to teach you matrix calculus, and we won't go that advanced. However, some core math is still needed.
PyTorch is written in Python, and so is the whole course, so some experience with the language will serve you well.
Train 100B+ models across 8–1,000 GPUs efficiently
You’ll understand the core problems teams face during large-scale training and how to avoid them using proven methods.
Build real-world experience with modern training techniques
You won't just watch; you'll train models using DDP, ZeRO, pipeline parallelism, and more, each applied in code.
Understand which training methods to use and when
You’ll learn how to match technique to context. Whether it’s model size, hardware limits, or team constraints, you’ll know what fits.
Be ready before training becomes your bottleneck
Most teams wait too long to prepare for scale. This course makes sure you’re ready before your current training setup stops working.
Go from scattered tutorials to production-ready training skills
You’ll connect theory with practice and walk away with working knowledge you can apply in real systems.
Personalized Instruction
Generous office hours ensure that students can ask questions about their specific issues, interests, and needs.
Live sessions
Learn directly from Zachary Mueller in a real-time, interactive format.
Lifetime access
Go back to course content and recordings whenever you need to, and have access to all future cohorts
Generous office hours
Bring your blockers to office hours and leave with answers. Get feedback, debug help, and real support when you need it.
Community of peers
Stay accountable and share insights with like-minded professionals.
Certificate of completion
Share your new skills with your employer or on LinkedIn.
Course notebooks & code
Detailed course notebooks with meticulous notes to walk you through the material and help you learn along the way
Compute Credits
$1000 in Modal compute credits and $1000 in Lambda compute credits
Maven Guarantee
This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.
19 live sessions • 4 lessons
Live sessions run from Nov 4 through Nov 28. Scheduled lessons include:
Distributed Data Parallelism From Scratch
ZeRO: Stage 1 & 2
ZeRO: Stage 3 and Efficient ZeRO Strategies
Distributed Training Lexicon
The Distributed Training Lexicon is a free resource of 49 distributed training terms with paired definitions and accompanying visualizations. The goal is a quick cheat sheet to glance at whenever you need a reminder of what a given method is.
Download it for free
Free Access to Part of Lesson 1
Hi there! To give you a good sense of how the course is oriented and what some of the content looks like, I'm sharing an exclusive preview of the course webpage and how some of the material is shaped. I've worked hard to make sure Quarto and Jupyter help me create educational material that will wow you, so let me know if it does!
(Note: this material preview may change as the course develops, but only in additive ways.)
Get access to the webpage
Hamel Husain
Mark Saroufim
Stas Bekman
Wing Lian
Radek Osmulski
Kevin Bird
Dr. Nathan Lambert
Join an upcoming cohort
Cohort 2
$1,500
Dates
Payment Deadline