From Scratch to Scale: Large Scale Training in the Modern World

New · 5 Weeks · Cohort-based Course

Distributed training is everywhere now as models scale into trillions of parameters. Learn from world experts how it gets done.

This course is popular: 30 people enrolled last week.

Previously at: Hugging Face, Accenture

Course overview

Build the skills to answer the call when it's time to take your models to scale

This started as a distributed training course. It organically grew into an all-encompassing learning event, with world-class speakers who have deep knowledge and experience with the modern techniques used to train at scale today. The distributed training course is still here, in its full capacity.

But now, each week features hand-picked speakers discussing topics directly related to the tips and techniques brought up in that week's lecture,

helping round out your understanding of the concepts and modern practices.


All materials and recordings will be available to participants who enroll. There are 14 talks across three tracks and 5 lessons (and growing) in addition to office hours.


Conference Talks

------------------


Applied Track

------------------

Hear from industry leaders how they have approached the problems of training at scale, and the solutions they've built


Robert Nishihara: Cofounder of Ray, Anyscale

- How Ray and Anyscale help you scale training across thousands of GPUs

Sami Jaghouar: Research Engineer, Prime Intellect

- How decentralized training at a global scale is possible, and how Prime Intellect gets it done

Tunji Ruwase: Software Engineer, Snowflake, formerly Microsoft

- How Arctic Long Sequence Training makes training at multi-million-token context lengths efficient and scalable

Prince Canuma: Machine Learning Research Engineer

- How MLX (Apple Silicon) lets you combine M-series Macs to run machine learning workloads locally for a fraction of the cloud cost


Pretraining Track

------------------

Pretraining is the foundation of modern LLM research. Learn the techniques used today for creating the best model possible.

Phuc Nguyen: Research Engineer, Hugging Face

- A Practitioner's Guide to FP8 Training

Elie Bakouch: Machine Learning Researcher, Hugging Face

- How modern LLMs like DeepSeek and others are hyper-optimized for efficient training through techniques like MLA, MoE, and more

Daniel Han: Creator of UnslothAI, formerly NVIDIA

- How Triton kernels and other techniques save you hundreds of hours in training time


Distributed Training Course

------------------

Learn the techniques used today when training and fine-tuning models at scale (hundreds or thousands of GPUs at once). Five workshops guide you through implementing the most common techniques from the ground up, including Distributed Data Parallelism, the entirety of ZeRO, and more.


Workshop 1: Distributed Data Parallelism from scratch, and how to make sure your data isn't your bottleneck (a minimal code sketch of the core idea follows this list)

Workshop 2: The Zero Redundancy Optimizer (Part 1): How model sharding helps you train large models across multiple smaller GPUs at once

Workshop 3: The Zero Redundancy Optimizer (Part 2): How the different stages of ZeRO affect training time, and which are the most efficient for your use case

Workshop 4: Pipeline and Tensor Parallelism: How and why these techniques help solve the biggest slowdown when training: communication

Workshop 5: Multi-Dimensional Parallelism: Why today we combine all of the techniques above at once to get the highest throughput possible during training
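
For a taste of what Workshop 1 builds toward, here is a minimal sketch of the data-parallel idea using only plain torch.distributed collectives: every rank runs the same toy model on its own slice of data, and gradients are averaged with an all-reduce before each optimizer step. The model, data, and hyperparameters are placeholders for illustration (this is not course code); launch it with something like `torchrun --nproc_per_node=2 ddp_sketch.py`.

```python
# Minimal data-parallel training sketch with raw torch.distributed collectives.
# Assumes one process per device, launched via torchrun (which sets RANK,
# WORLD_SIZE, and LOCAL_RANK). Illustrative only.
import torch
import torch.distributed as dist
from torch import nn

def main():
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend)
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    if torch.cuda.is_available():
        device = torch.device("cuda", rank % torch.cuda.device_count())
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")

    # Identical toy model on every rank: broadcast rank 0's weights so all
    # replicas start from the same parameters.
    model = nn.Linear(32, 1).to(device)
    for p in model.parameters():
        dist.broadcast(p.data, src=0)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)

    for step in range(10):
        # Each rank draws its own (random, illustrative) shard of data.
        x = torch.randn(8, 32, device=device)
        y = torch.randn(8, 1, device=device)
        loss = nn.functional.mse_loss(model(x), y)

        opt.zero_grad()
        loss.backward()

        # The heart of data parallelism: average gradients across ranks so
        # every replica takes the same optimizer step.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
        opt.step()

        if rank == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The workshops build this up (and well beyond it) from first principles, so you understand why the all-reduce sits where it does rather than just calling a wrapper.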


The Distributed Training Course has these guest lecturers and topics:


Sylvain Gugger: Jane Street, formerly Hugging Face & fast.ai

- Introduction to Distributed Training, and an overview of ZeRO

Wanchao Liang: Formerly Meta, Creator of TorchTitan/DTensor

- How TorchTitan has helped developers take model pretraining to scale faster, and how DTensors have made this easier

Ferdinand Mom: Research Engineer, Hugging Face

- What multi-dimensional parallelism is, and why it's a crucial technique for training models at scale today

Less Wright: PyTorch Partner Engineer, Meta

- How Async Tensor Parallelism helps you train at scale efficiently

Matej Sirovatka: Machine Learning Engineer, Hugging Face

- What expert parallelism is, and why it's needed when training Mixture-of-Experts models

Marc Sun: Machine Learning Engineer, Hugging Face

- Tips and tricks needed for deploying large models at scale


Free Compute

------------------


To give you hands-on experience, we're sponsored by the following companies so you can start training models at scale from Day 1:


- Hugging Face: 6 months of Pro

- Modal: $500 in compute credits

More to be announced

Who is this course for

01

People who want to learn what it takes to train models at scale

02

People who want to know how to use those leftover GPUs lying around to their full capacity to train big models at home

03

Field experts who want to know where the industry currently stands and what techniques have emerged in the last few years

Prerequisites

  • One year of PyTorch experience

    You should generally be familiar with core tensor operations and how a model is built with PyTorch

  • Basic understanding of tensor math

    We’re going to be using lots of operations from torch.distributed. I’ll teach them to you, but you should already know the core tensor operations

  • Trained at least one model in your life

    Understanding how model training works on a single GPU and the full flow (data -> outputs -> gradients -> backprop) is necessary; a minimal example of that loop is sketched below
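
If you want to gauge that last prerequisite, the snippet below is roughly the level of familiarity assumed: a toy single-device training loop walking the data -> outputs -> gradients -> backprop cycle. The model and data are made up purely for illustration.

```python
# The single-GPU (or CPU) training flow referenced above:
# data -> outputs -> loss -> gradients (backprop) -> optimizer step.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(32, 16, device=device)      # data
    y = torch.randn(32, 1, device=device)
    outputs = model(x)                           # outputs
    loss = nn.functional.mse_loss(outputs, y)    # loss
    opt.zero_grad()
    loss.backward()                              # gradients via backprop
    opt.step()                                   # parameter update
```

If this loop feels comfortable, you have the baseline the course assumes; everything distributed builds on top of it.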

What you’ll get out of this course

Understand not just what distributed training is, but become an expert in it

I don't want you to take this course and go "okay, I think I get what's happening here." I want you to walk away feeling knowledgeable enough that if someone came up to you and said "here's 1,000 GPUs for a day, do something," you could move into action immediately.

Deep understanding of different parallelization strategies

This won't be a surface-level course teaching you "how to use torch.distributed.fsdp.FullyShardedDataParallel". We're going to understand it from the ground up.

Train a few models on multiple GPUs

Above all, I'm going to make sure everyone gets experience training at least one model in a distributed fashion through the homework by the end of this course.

Hands-On Exercises, Examples, and Code

This is not a course where I bore you with a slide deck the entire time (though for some topics that might be needed). Instead, we'll be down in the weeds of code, with you implementing alongside me.

Personalized Instruction

Generous office hours ensure that students can ask questions about their specific issues, interests, and needs.

What’s included

Live sessions

Learn directly from Zachary Mueller in a real-time, interactive format.

Lifetime access

Go back to course content and recordings whenever you need to.

Course notebooks

Detailed course notebooks with meticulous notes to help walk you through the material and learn along the way

Community of peers

Stay accountable and share insights with like-minded professionals.

Certificate of completion

Share your new skills with your employer or on LinkedIn.

Maven Guarantee

This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.

Course syllabus

Week 1
Sep 1—Sep 7

    Course Introduction and `nbdistributed`: A Jupyter framework for interactive distributed PyTorch
    Tue 9/2, 6:00 PM—7:00 PM (UTC)

    Distributed Data Parallelism From Scratch
    Thu 9/4, 6:00 PM—7:00 PM (UTC)

Week 2
Sep 8—Sep 14

    ZeRO: Stage 1 & 2
    Tue 9/9, 6:00 PM—7:00 PM (UTC)

Week 3
Sep 15—Sep 21

    ZeRO: Stage 3 and Efficient ZeRO Strategies
    Tue 9/16, 6:00 PM—7:30 PM (UTC)

Week 4
Sep 22—Sep 28

    Pipeline Parallelism and Tensor Parallelism
    Tue 9/23, 6:00 PM—7:30 PM (UTC)

    Efficient Strategies for Distributed Inference
    Thu 9/25, 6:00 PM—7:00 PM (UTC)

Week 5
Sep 29—Oct 3

    2D Parallelism
    Tue 9/30, 6:00 PM—7:00 PM (UTC)

    3D Parallelism (Guest Speaker)
    Thu 10/2, 6:00 PM—7:00 PM (UTC)

Free resource

Free Access to Part of Lesson 1

Hi there! To help you get a good grasp of how the course will be oriented and an idea of what some of the content looks like, I can share with you an exclusive preview of what the course webpage will be and how some of the content is shaped. I've worked hard to make sure Quarto and Jupyter help me create educational material that will wow you, so let me know if it does!


(Note: this material preview may change as the course develops, but only for additive purposes)

Get access to the webpage

Instructor is a recognized expert, with hands-on experience

        Zach is my go to person on anything dealing with distributed training. He has maintained the most popular library in the world that helps developers with this problem, which means he’s familiar with all of the issues mere mortals have while tackling this problem. Zach is the best person to teach this subject. I am taking this course.
Hamel Husain

Founder, Parlance Labs | Evals, evals, evals
        Zach is one of the key people in the world making distributed machine learning more accessible. He has firsthand experience building some incredibly popular tools like huggingface/accelerate. If you're GPU poor but considering moving to the GPU middle class then I can't think of a better instructor.
Mark Saroufim

Software Engineer at Meta | Co-founder, GPU MODE
        As a long time maintainer of HF Accelerate, Zach has had to master not only a deep understanding of ML scaling methods, but also to integrate them into a cohesive API for the masses to use. I've seen Zach consistently deliver robust, well-integrated solutions with a deep system-level understanding. You will be in good hands with Zach at the helm.
Stas Bekman

Senior Machine Learning Engineer, Snowflake
        Zach's stewardship of Accelerate and managing the intricacies of multiple distributed technologies (while abstracting it into an easy to use API) make Zach the preeminent leader in distributed training. Zach has shown deep understanding of everything from fundamentals to implementation, and is the first person that would come to mind to teach this
Wing Lian

Founder, Axolotl
        Zach is truly one in a million. I've never met anyone who puts so much time and thought into crafting deep learning code. With his background and experience, learning from him is an invaluable opportunity.
Radek Osmulski

Senior Data Scientist, NVIDIA
        Zach has a strong grasp of the fundamentals of fastai, but what really sets him apart is his ability to teach. He mixes in practical topics throughout his lessons, making every video engaging and worthwhile. With a proven track record of creating high-quality content, I’m confident that any course Zach produces will be worth your time and attention
Kevin Bird

Co-Founder, Problem Solvers Guild
        Zach and I used to work together at Hugging Face; since then and through today, he's been building foundational tools for the open ML community to use and learn distributed training techniques. I've personally used his tools for years to train models such as OLMo and Tülu, along with benefiting from his knowledge to better understand what is going on.
Dr. Nathan Lambert

LLM Post Training Lead, Ai2

Meet your instructor

Zachary Mueller

I've been in the field for almost a decade now. I first started in the fast.ai community, quickly learning how modern-day training pipelines are built and operated. Then I moved to Hugging Face, where I'm the Technical Lead on the accelerate project and manage the transformers Trainer.


I've written numerous blogs, courses, and given talks on distributed training and PyTorch throughout my career.


Through this experience, I've condensed almost a decade of learning into this course, and I'm excited to bring you all along on the learning journey.

Join an upcoming cohort

From Scratch to Scale: Large Scale Training in the Modern World

Cohort 1

$2,200

Dates

Sep 1—Oct 3, 2025

Payment Deadline

Aug 31, 2025
Get reimbursed