5 Weeks
·Cohort-based Course
Distributed training is everywhere now that models scale into the trillions of parameters. Learn from world experts how it gets done.
This course is popular
30 people enrolled last week.
Course overview
This started as a distributed training course. It has since grown into a broader learning event with world-class speakers who have deep knowledge of, and hands-on experience with, the techniques used to train at scale today. The original distributed training course is still here, in full.
Each week now also features hand-picked speakers whose talks tie directly into the tips and techniques raised in that week's lecture,
rounding out your understanding of the concepts and modern practices.
All materials and recordings will be available to participants who enroll. There are 14 talks across three tracks and 5 lessons (and growing) in addition to office hours.
Conference Talks
------------------
Applied Track
------------------
Hear from industry leaders about how they have approached the problems of training at scale, and the solutions they've built
Robert Nishihara: Cofounder of Ray, Anyscale
- How Ray and Anyscale help you scale training across thousands of GPUs
Sami Jaghouar: Research Engineer, Prime Intellect
- How decentralized training at a global scale is possible, and how Prime Intellect gets it done
Tunji Ruwase: Software Engineer, Snowflake, formerly Microsoft
- How Arctic Long Sequence Training makes training multi-million token context length efficient and scalable
Prince Canuma: Machine Learning Research Engineer
- How MLX (Apple Silicon) lets you combine M-series Macs to run machine learning workloads locally for a fraction of the cloud cost
Pretraining Track:
------------------
Pretraining is a core foundation of modern LLM research. Learn what techniques are used today for creating the best model possible
Phuc Nguyen: Research Engineer, Hugging Face
- A Practitioner's Guide to FP8 Training
Elie Bakouch: Machine Learning Researcher, Hugging Face
- How modern LLMs like DeepSeek are hyper-optimized for efficient training through techniques like MLA, MoE, and more
Daniel Han: Creator of UnslothAI, formerly NVIDIA
- How Triton kernels and other techniques save you hundreds of hours in training time
Distributed Training Course
------------------
Learn the techniques used today when training and fine-tuning models at scale (hundreds or thousands of GPUs at once). Five workshops
guide you through implementing the most common techniques from the ground up, including Distributed Data Parallelism, the entirety
of ZeRO, and more.
Workshop 1: Distributed Data Parallelism from scratch, and how to make sure your data isn't your bottleneck (a minimal sketch of the core idea follows this list)
Workshop 2: The Zero Redundancy Optimizer (Part 1): How model sharding helps train large models across smaller, multiple GPUs at once
Workshop 3: The Zero Redundancy Optimizer (Part 2): How the different levels of ZeRO affect training time, and which are the most efficient for your use case
Workshop 4: Pipeline and Tensor Parallelism: How and why these techniques help solve the biggest slowdown when training: communication
Workshop 5: Multi-Dimensional Parallelism: Why today we combine all of the techniques above at once to get the highest throughput possible during training
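To give a flavor of Workshop 1, here is a minimal, illustrative sketch of the core idea behind data parallelism: each rank computes gradients on its own shard of the batch, then the gradients are averaged with torch.distributed before the optimizer step. This is not course code; the helper name average_gradients is hypothetical, and it assumes a process group has already been initialized.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce every gradient so each rank ends up holding the mean across ranks."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Inside a training loop (assumes dist.init_process_group(...) was called at startup):
#   loss.backward()            # each rank computes gradients on its own slice of data
#   average_gradients(model)   # synchronize so every replica applies the same update
#   optimizer.step()
```

Workshop 1 builds this pattern up from scratch and shows how to keep the data pipeline from becoming the bottleneck.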
The Distributed Training Course has these guest lecturers and topics:
Sylvain Gugger: Jane Street, formerly Hugging Face & fast.ai
- Introduction to Distributed Training, and an overview of ZeRO
Wanchao Liang: Formerly Meta, Creator of TorchTitan/DTensor
- How TorchTitan has helped developers take model pretraining to scale faster, and how DTensors have made this easier
Ferdinand Mom: Research Engineer, Hugging Face
- What is multi-dimensional parallelism, and why it is a crucial technique for training models at scale today
Less Wright: PyTorch Partner Engineer, Meta
- How Async TensorParallelism helps you train at scale efficiently
Matej Sirovatka: Machine Learning Engineer, Hugging Face
- What is Expert Parallelism and why it's needed when training Mixture-of-Experts models
Marc Sun: Machine Learning Engineer, Hugging Face
- Tips and tricks needed for deploying large models at scale
Free Compute
------------------
To give you hands-on experience training models at scale from day one, the course is sponsored by the following companies:
- Hugging Face: 6 months of Pro
- Modal: $500 in compute credits
More to be announced
01
People who want to learn what it takes to train models at scale
02
People who want to put those leftover GPUs lying around to full use training big models at home
03
Field experts wanting to know where the industry currently stands and what techniques have come out in the last few years
You should generally be familiar with core tensor operations and how a model is built with PyTorch
We're going to use many operations from torch.distributed. I'll teach them to you, but know the core operations for tensors
Understanding how model training works on a single GPU and the full flow (data -> outputs -> gradients -> backprop) is necessary (see the sketch below)
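For reference, a minimal toy version of that single-GPU baseline looks roughly like this; the model, sizes, and names are hypothetical, not course material.

```python
import torch

# Toy training loop covering the full flow: data -> outputs -> gradients -> backprop -> update.
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(10):
    data = torch.randn(32, 128)             # data
    labels = torch.randint(0, 10, (32,))
    outputs = model(data)                    # outputs
    loss = loss_fn(outputs, labels)
    loss.backward()                          # gradients via backprop
    optimizer.step()                         # apply the update
    optimizer.zero_grad()
```

If this loop is comfortable for you, you have the prerequisites covered; the course extends it across many GPUs.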
Understand not just what distributed training is, but become an expert in it
I don't want you to take this course and go, "okay, I think I get what's happening here." I want you to walk away knowledgeable enough that if someone came up to you and said, "here's 1,000 GPUs for a day, do something," you could move into action immediately.
Deep understanding of different parallelization strategies
This won't be a surface-level course teaching you "how to use torch.distributed.FSDP". We're going to understand it from the ground up.
Train a few models on multiple GPUs
Above all, I'm going to make sure that by the end of this course everyone gets experience training at least one model in a distributed fashion through the homework.
Hands-On Exercises, Examples, and Code
This is not a course where I bore you with a slide deck the entire time (though for some topics slides are needed). Instead, we get down in the weeds of code, with you implementing along with me.
Personalized Instruction
Generous office hours ensure that students can ask questions about their specific issues, interests, and needs.
Live sessions
Learn directly from Zachary Mueller in a real-time, interactive format.
Lifetime access
Go back to course content and recordings whenever you need to.
Course notebooks
Detailed course notebooks and material with meticulous notes to walk you through the content and help you learn along the way
Community of peers
Stay accountable and share insights with like-minded professionals.
Certificate of completion
Share your new skills with your employer or on LinkedIn.
Maven Guarantee
This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.
Session dates: Sep 2, Sep 4, Sep 9, Sep 16, Sep 23, Sep 25, Sep 30, Oct 2
Free Access to Part of Lesson 1
Hi there! To help you get a good grasp of how the course is oriented and an idea of what some of the content looks like, I'm sharing an exclusive preview of the course webpage and how some of the content is shaped. I've worked hard to make sure Quarto and Jupyter help me create educational material that will wow you, so let me know if it does!
(Note: this material preview may change as the course develops, but only in additive ways)
Get access to the webpage
Hamel Husain
Mark Saroufim
Stas Bekman
Wing Lian
Radek Osmulski
Kevin Bird
Dr. Nathan Lambert
I've been in the field for almost a decade now. I first started in the fast.ai community, quickly learning how modern-day training pipelines are built and operated. Then I moved to Hugging Face, where I'm the Technical Lead on the accelerate project and manage the transformers Trainer.
I've written numerous blogs, courses, and given talks on distributed training and PyTorch throughout my career.
Through this experience, I've condensed almost a decade of learning into this course, and I'm excited to bring you all along for the learning journey.
Join an upcoming cohort
Cohort 1
$2,200
Dates
Payment Deadline