5 Weeks · Cohort-based Course
Distributed training is now the norm for training models in industry. Learn the best techniques used by the pros.
This course is popular
14 people enrolled last week.
Course overview
Do any of these sound like you?
1. You’ve heard of multi-GPU training... but haven’t touched it.
2. You’re not sure what “data parallelism” or “pipeline parallelism” actually mean, let alone when to use them.
3. You’ve dabbled with DDP, but it felt more like trial-and-error than engineering.
4. You want to train larger models more efficiently. But you aren’t sure how to do it right.
5. You’ve seen mentions of ZeRO, FSDP, DeepSpeed, but don’t know how they differ or why they matter.
6. You worry that training at scale is only for people at FAANG or OpenAI.
If any of that speaks to you… this course is for you!
This is a hands-on, code-first course designed for engineers and researchers who want to understand how large-scale training actually works and be able to do it themselves. You'll go from:
"I've never used more than one GPU before, ever"
to
"I know how I could scale a training job across 2, 8, 64, or even 256 GPUs, and I understand the best ways of going about it"
If you're ready to stop watching from the sidelines and start building at scale, you belong in this cohort.
---
WHAT YOU CAN EXPECT
This course will provide you with hands-on experience training models at scale. We will meet once or twice a week for five weeks, with generous office hours and guest speakers (see the course schedule below).
I'll also be holding office hours and hosting a Discord community where you can talk with me and other students. In return, you will learn the skills needed to keep up with today's training practices in deep learning and stay ahead of the competition. All sessions will be recorded and available to students asynchronously.
---
FREE COMPUTE
To make sure every student has the resources to train models throughout the course, Modal is sponsoring each student with $500 in compute credits on their platform.
On top of this, Hugging Face is sponsoring each student with 6 months of Pro!
---
COURSE CONTENT
Lesson 1: Fundamentals & Distributed Data Parallelism from Scratch
- Launching distributed jobs interactively from Jupyter Notebooks, which bootstraps the learning experience and makes debugging smoother
- How distributed training differs when we migrate off a single GPU
- Best practices for feeding data to multiple GPUs as efficiently as possible
- How `torch.nn.parallel.DistributedDataParallel` works and how we can build it ourselves (a minimal sketch follows this list)
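To give a flavor of where Lesson 1 lands, here is a minimal sketch of hand-rolled data parallelism: average gradients across ranks after the backward pass so every replica takes the same optimizer step. It is an illustration under the assumption of a `torchrun` launch on one multi-GPU node, not the course's exact code.

```python
# A minimal sketch of hand-rolled data parallelism in the spirit of Lesson 1
# (not the course's exact code). Assumes a single multi-GPU node launched with
# e.g. `torchrun --nproc_per_node=2 ddp_sketch.py`.
import torch
import torch.distributed as dist
from torch import nn

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    torch.cuda.set_device(device)

    model = nn.Linear(32, 4).to(device)
    # Start from identical weights on every rank, as DDP does at construction.
    for p in model.parameters():
        dist.broadcast(p.data, src=0)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # Each rank would normally see a different shard of the dataset;
    # random tensors stand in for that here.
    x = torch.randn(16, 32, device=device)
    y = torch.randn(16, 4, device=device)

    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()

    # The heart of DDP: average gradients across ranks so every replica
    # takes the same optimizer step and the weights stay in sync.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= dist.get_world_size()

    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```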
Lesson 2: The Zero Redundancy Optimizer (Part 1)
- Getting a grasp on what "ZeRO"/Fully Sharded Data Parallelism is
- Implementing the first and second stages from scratch (a stage-1 sketch follows this list)
- Understanding how to benchmark distributed code to get accurate measurements
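For a taste of the first stage, here is a rough sketch of the ZeRO-1 idea: DDP-style gradient averaging, but each rank keeps optimizer state only for the parameters it owns. The round-robin sharding and helper names are illustrative assumptions, not the course's implementation.

```python
# A rough, illustrative sketch of ZeRO stage 1: gradients are still averaged
# everywhere as in DDP, but each rank keeps optimizer state only for the
# parameters it owns, updates those, and broadcasts the results. Assumes the
# process group is already initialized (e.g. via torchrun) and that every
# rank owns at least one parameter.
import torch
import torch.distributed as dist
from torch import nn

def build_local_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    # Round-robin assignment of parameters to ranks: only this rank's shard
    # gets Adam moments allocated, cutting optimizer memory by world_size.
    rank, world_size = dist.get_rank(), dist.get_world_size()
    owned = [p for i, p in enumerate(model.parameters()) if i % world_size == rank]
    return torch.optim.Adam(owned, lr=1e-3)

def zero1_step(model: nn.Module, local_opt: torch.optim.Optimizer) -> None:
    world_size = dist.get_world_size()
    params = list(model.parameters())

    # Stage 1 keeps DDP-style gradient averaging across all ranks.
    for p in params:
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size

    # Only the locally owned parameters are stepped here...
    local_opt.step()
    local_opt.zero_grad()

    # ...then every parameter is broadcast from the rank that owns it, so all
    # replicas agree on the new weights before the next forward pass.
    for i, p in enumerate(params):
        dist.broadcast(p.data, src=i % world_size)
```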
Lesson 3: The Zero Redundancy Optimizer (Part 2)
- Understand how the third and final stage of ZeRO is implemented (a sketch using PyTorch's built-in FSDP follows this list)
- Learn when to use each technique appropriately
- How to apply each technique based on the compute you have available to speed up training
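As a point of reference for what we build by hand in this lesson, PyTorch ships the same idea as FSDP. Here is a minimal sketch of wrapping a toy model, assuming a `torchrun` launch on a multi-GPU node; the toy model and hyperparameters are placeholders.

```python
# Lesson 3 builds stage 3 by hand; for contrast, this is the stock PyTorch API
# that packages the same idea (sharding parameters, gradients, and optimizer
# state). A minimal sketch assuming a multi-GPU node launched with torchrun.
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()

# FULL_SHARD is the ZeRO-3 analogue: parameters are gathered just in time for
# each forward/backward pass and re-sharded afterwards. SHARD_GRAD_OP is the
# stage-2 analogue, which keeps full parameters resident on every rank.
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
opt.step()
dist.destroy_process_group()
```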
Lesson 4: Other Distributed Strategies and Distributed Inference
- Learn and implement Pipeline Parallelism and Tensor Parallelism from scratch
- Understand how these techniques make it practical to deploy huge models at scale (a tensor-parallel sketch follows this list)
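For a taste of the from-scratch work, here is a forward-only sketch of a column-parallel linear layer, one of the building blocks of tensor parallelism. The class and its sharding scheme are illustrative assumptions; training would additionally need a differentiable all-gather.

```python
# One of the building blocks Lesson 4 implements from scratch: a
# column-parallel linear layer, the core trick behind tensor parallelism.
# Forward/inference-only sketch (training needs a custom autograd function
# around the all-gather); assumes the process group is already initialized.
import torch
import torch.distributed as dist
from torch import nn

class ColumnParallelLinear(nn.Module):
    """Each rank stores a vertical slice of the weight, computes a slice of
    the output features, and an all-gather reassembles the full output."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0, "out_features must split evenly"
        local_out = out_features // world_size
        # Only 1/world_size of the weight (and its optimizer state) lives here.
        self.weight = nn.Parameter(torch.randn(local_out, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(local_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_y = nn.functional.linear(x, self.weight, self.bias)
        # Collect every rank's output slice and stitch them back together
        # along the feature dimension.
        slices = [torch.empty_like(local_y) for _ in range(dist.get_world_size())]
        dist.all_gather(slices, local_y)
        return torch.cat(slices, dim=-1)
```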
Lesson 5: Combining Distributed Strategies
- Understand how to go from 1D parallelism to 3D parallelism
- Learn which frameworks implement this best (a device-mesh sketch follows this list)
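To preview what "3D" looks like in code, here is a sketch using PyTorch's device mesh (available in recent releases) to carve 8 GPUs into data-, pipeline-, and tensor-parallel groups. The 2×2×2 shape and dimension names are purely illustrative.

```python
# A taste of what Lesson 5 builds toward: one device mesh carved into data-,
# pipeline-, and tensor-parallel dimensions ("3D parallelism"). A sketch
# assuming 8 GPUs, a recent PyTorch (2.2+), and a torchrun launch.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# 8 GPUs arranged as 2 (data) x 2 (pipeline) x 2 (tensor).
mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "pp", "tp"))

# Each named sub-mesh exposes the process group for that dimension, so a
# framework can all-reduce gradients over "dp", pass activations over "pp",
# and shard matmuls over "tp" independently of one another.
dp_group = mesh["dp"].get_group()
pp_group = mesh["pp"].get_group()
tp_group = mesh["tp"].get_group()
print(f"global rank {dist.get_rank()}: "
      f"dp rank {dist.get_rank(dp_group)}, "
      f"pp rank {dist.get_rank(pp_group)}, "
      f"tp rank {dist.get_rank(tp_group)}")
```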
Guest Speakers:
* Sylvain Gugger (Jane Street): Topic TBD
* Marc Sun (Hugging Face): Distributed Inference
* Matej Sirovatka (Hugging Face): Expert Parallelism
* Less Wright (Meta): Async Tensor Parallelism
* Ferdinand Mom (Hugging Face): Combining DP, TP, and PP
1. People who want to learn what it takes to train models at scale
2. People who want to put those leftover GPUs lying around at home to full use training big models
3. Field experts who want to know where the industry currently stands and which techniques have emerged in the last few years
Understand not just what distributed training is, but become an expert in it
I don't want you to take this course and think, "Okay, I think I get what's happening here." I want you to walk away knowledgeable enough that if someone said, "Here's 1,000 GPUs for a day, do something," you could move into action immediately.
Deep understanding of different parallelization strategies
This won't be a surface-level course teaching you "how to use `torch.distributed.fsdp`." We're going to understand it from the ground up.
Train a few models on multiple GPUs
Above all, I'm going to make sure everyone gets experience training in a distributed fashion by the end of this course on at least one model through the homework.
Hands-On Exercises, Examples, and Code
This is not a course where I bore you with a slide deck the entire time (though a few topics might need one). Instead, we get down in the weeds of the code, and you implement it along with me.
Personalized Instruction
Generous office hours ensure that students can ask questions about their specific issues, interests, and needs.
Live sessions
Learn directly from Zachary Mueller in a real-time, interactive format.
Lifetime access
Go back to course content and recordings whenever you need to.
Course notebooks
Detailed course notebooks with meticulous notes to walk you through the material and help you learn along the way.
Community of peers
Stay accountable and share insights with like-minded professionals.
Certificate of completion
Share your new skills with your employer or on LinkedIn.
Maven Guarantee
This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.
Schedule: Sep 2 · Sep 4 · Sep 9 · Sep 16 · Sep 23 · Sep 25 · Sep 30 · Oct 2
Hamel Husain
Mark Saroufim
Stas Bekman
Wing Lian
Radek Osmulski
Kevin Bird
Dr. Nathan Lambert
I've been in the field for almost a decade now. I first started in the fast.ai community, quickly learning how modern-day training pipelines are built and operated. Then I moved to Hugging Face, where I'm the Technical Lead on the accelerate project and manage the transformers Trainer.
I've written numerous blogs, courses, and given talks on distributed training and PyTorch throughout my career.
Through this experience, I've condensed almost a decade of learning into this course, and I'm excited to bring you along for the journey.
Join an upcoming cohort
Cohort 1
$1,500
Dates
Payment Deadline