Scratch to Scale: Large-Scale Training in the Modern World

New · 5 Weeks · Cohort-based Course

Learn the techniques used today from world-class researchers and engineers at Meta, Anyscale, Hugging Face, and more


Hosted by

Zachary Mueller

🤗 accelerate Technical Lead with a decade of experience

With speakers from

Hugging Face
Anyscale
PyTorch
Snowflake
Unsloth AI

Course overview

Build the skills to answer the call when it's time to take your models to scale

Master distributed training and real-world scale techniques from top engineers.

Whether you're an ML engineer looking to move beyond single-GPU experiments, or a product leader seeking to understand the language your AI team speaks, this course will give you the hands-on skills and conceptual clarity to operate confidently at scale.


What Makes This Course Different

This started as a distributed training course. It organically grew into an all-encompassing learning event with world-class speakers who bring deep, real-world experience in modern large-scale training.

The distributed training curriculum remains intact: five hands-on workshops covering today’s core scale-up methods.

Now, each week features hand-tailored guest lectures from top engineers at Hugging Face, Meta, Snowflake, and more.

These expert sessions are aligned to each workshop’s topic, helping you bridge theory with modern production practices.

All course materials and recordings are available after enrollment, and you’ll get free lifetime access to future cohorts.


💡 What’s included:


5 core workshops

15+ guest talks across 3 curated tracks

Weekly live office hours

Class discord with lifetime access

Over $500 in credits from Modal and Hugging Face

100% money-back guarantee (within 14 days of course completion)


🏕️ Fireside Chats


Hear from the experts about their real experiences taking models to scale: the challenges and the discoveries


Yuxiang Wei (Meta FAIR)


📣 Conference Talks


Applied Track

Hear how industry leaders are solving real-world scale problems:


Robert Nishihara (Ray, Anyscale): Scaling across thousands of GPUs with Ray

Sami Jaghouar (Prime Intellect): Decentralized global-scale training

Tunji Ruwase (Snowflake): Efficient long-context training with Arctic

Prince Canuma: Local ML workloads using Apple Silicon + MLX


Pretraining Track

Deep dives into LLM pretraining at scale:


Phuc Nguyen (Hugging Face): A practitioner's guide to FP8

Elie Bakouch (Hugging Face): Hyper-optimizing LLMs with MoE, MLA & more

Daniel Han (UnslothAI): Speeding up training with Triton & custom kernels


🧠 Distributed Training Course

Learn the foundations and modern techniques used in real-world LLM scaleups across hundreds or thousands of GPUs.

5 instructor-led workshops:

DDP from scratch and avoiding data bottlenecks

ZeRO (Part 1): How model sharding enables scale

ZeRO (Part 2): Efficiency tradeoffs and stage comparison

Pipeline & Tensor Parallelism: Solving communication slowdowns

Multi-Dimensional Parallelism: Combining all methods for throughput
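To give a flavor of the first workshop ("DDP from scratch"), here is a minimal, hypothetical sketch of the core idea: every rank keeps a full copy of the model, and gradients are averaged with an all-reduce before each optimizer step. The toy model, data, and script name are illustrative only; the actual workshops go far deeper.

```python
# Launch with, e.g.: torchrun --nproc_per_node=2 ddp_from_scratch.py
# (script name and toy model are hypothetical, not the course's code)
import torch
import torch.distributed as dist
from torch import nn

def main():
    dist.init_process_group(backend="gloo")   # "nccl" on GPU nodes
    rank, world_size = dist.get_rank(), dist.get_world_size()

    torch.manual_seed(0)                       # identical weights on every rank
    model = nn.Linear(16, 1)                   # toy stand-in for a real network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.MSELoss()

    torch.manual_seed(rank + 1)                # each rank sees different data
    for step in range(10):
        x, y = torch.randn(8, 16), torch.randn(8, 1)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()

        # The heart of data parallelism: average gradients across ranks so
        # every replica applies an identical optimizer update.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The later workshops (ZeRO, pipeline, tensor, and multi-dimensional parallelism) build on this same pattern of coordinating ranks with collective communication.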


Guest Lectures include:

Sylvain Gugger (Jane Street): Overview of ZeRO

Wanchao Liang (TorchTitan): DTensor and large-scale pretraining

Wing Lian (Axolotl): 2D Parallelism with Axolotl

Ferdinand Mom (Hugging Face): Multi-dimensional parallelism with nanotron

Less Wright (Meta): Async TensorParallelism

Matej Sirovatka (Hugging Face): Expert Parallelism for MoE

Marc Sun (Hugging Face): Deployment strategies at scale


🚀 Free Compute & Tools

Get hands-on with real-scale training from Day 1.

We’re proud to be sponsored by:

🤗 Hugging Face — 6 months Pro access

⚙️ Modal — $500 in compute credits

More partnerships coming soon


✅ Guarantee

If you're not satisfied, we offer a 100% refund up to 14 days after the course ends. No risk, just learning.

Built for the People Scaling What’s Next

01

Beginner to intermediate MLEs who want to make sure their skills are relevant in today's market

02

Senior engineers tired of piecing together half-solutions from publications, frameworks, and more.

03

Team leads who want confidence that engineers can execute at scale without burning time.

04

CTOs who need to make fast, informed decisions on how to scale LLMs.

Prerequisites

  • Train any model, at least once

    I don't expect you to be an expert, but you should have some experience training a model of some kind, whether in PyTorch, TensorFlow, or similar (see the short sketch after this list)

  • Understand basic high-school algebra

    I'm not here to teach you matrix calculus, and we won't go that advanced, but some core math is still needed

  • Familiarity with Python coding

    PyTorch is a Python library and the entire course is written in Python, so some experience with the language will serve you well
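As promised above, here is a purely illustrative single-device training loop; if you have written something like it before, in any framework, you meet the first prerequisite.

```python
# A purely illustrative single-device loop: synthetic data, tiny model.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```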

What You'll Achieve

Train 100B+ models across 8–1,000 GPUs efficiently

You’ll understand the core problems teams face during large-scale training and how to avoid them using proven methods.

Build real-world experience with modern training techniques

You won’t just watch; you’ll train models using DDP, ZeRO, pipeline parallelism, and more, with each technique applied in code.

Understand which training methods to use and when

You’ll learn how to match technique to context. Whether it’s model size, hardware limits, or team constraints, you’ll know what fits.

Be ready before training becomes your bottleneck

Most teams wait too long to prepare for scale. This course makes sure you’re ready before your current training setup stops working.

Go from scattered tutorials to production-ready training skills

You’ll connect theory with practice and walk away with working knowledge you can apply in real systems.

Personalized Instruction

Generous office hours ensure that students can ask questions about their specific issues, interests, and needs.

What’s included

Zachary Mueller

Live sessions

Learn directly from Zachary Mueller in a real-time, interactive format.

Lifetime access

Go back to course content and recordings whenever you need to, and have access to all future cohorts

Generous office hours

Bring your blockers to office hours and leave with answers. Get feedback, debug help, and real support when you need it.

Community of peers

Stay accountable and share insights with like-minded professionals.

Certificate of completion

Share your new skills with your employer or on LinkedIn.

Course notebooks

Detailed course notebooks and materials with meticulous notes to walk you through the content and help you learn along the way

Compute Credits

$500 in Modal compute credits, 6 months of Hugging Face Pro

Maven Guarantee

This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.

Course syllabus

Week 1

Sep 1—Sep 7

    Tue, Sep 2 · 5:00 PM—6:00 PM (UTC) · Course Introduction and `nbdistributed`: A Jupyter framework for interactive distributed PyTorch

    Wed, Sep 3 · 6:00 PM—7:00 PM (UTC) · Fireside Chat with Yuxiang

    Thu, Sep 4 · 5:00 PM—6:00 PM (UTC) · Distributed Data Parallelism From Scratch

    Fri, Sep 5 · 5:00 PM—6:00 PM (UTC) · Guest Speaker: Robert Nishihara (Ray, Anyscale)

Week 2

Sep 8—Sep 14

    Tue, Sep 9 · 5:00 PM—6:00 PM (UTC) · An Overview of ZeRO with Sylvain Gugger

    Tue, Sep 9 · 6:00 PM—7:00 PM (UTC) · ZeRO: Stage 1 & 2

    Wed, Sep 10 · 5:00 PM—6:00 PM (UTC) · A Practitioner's Guide to FP8 Training (Phuc Nguyen)

    Thu, Sep 11 · 5:00 PM—6:00 PM (UTC) · Speeding Up Training with Triton and Custom Kernels (Daniel Han)

    Thu, Sep 11 · 6:30 PM—7:30 PM (UTC) · Hyper-optimizing LLMs with MoE, MLA, and More (Elie Bakouch)

Week 3

Sep 15—Sep 21

    Tue, Sep 16 · 6:00 PM—7:30 PM (UTC) · ZeRO: Stage 3 and Efficient ZeRO Strategies

    Wed, Sep 17 · 5:00 PM—6:00 PM (UTC) · DTensor and Large-Scale Pretraining (Wanchao Liang)

    Wed, Sep 17 · 6:30 PM—7:30 PM (UTC) · Parallelizing parallel programming of parallel processors with Modal (Charles Frye)

    Thu, Sep 18 · 5:00 PM—6:00 PM (UTC) · Efficient Long-Context Training with Arctic (Tunji Ruwase)

Week 4

Sep 22—Sep 28

    Tue, Sep 23 · 5:00 PM—6:30 PM (UTC) · Pipeline Parallelism and Tensor Parallelism

    Wed, Sep 24 · 5:00 PM—6:00 PM (UTC) · Async TensorParallelism (Less Wright)

    Wed, Sep 24 · 6:30 PM—7:30 PM (UTC) · Decentralized Global-Scale Training (Sami Jaghouar)

    Thu, Sep 25 · 6:00 PM—7:00 PM (UTC) · Efficient Strategies for Distributed Inference (Marc Sun)

Week 5

Sep 29—Oct 3

    Tue, Sep 30 · 6:00 PM—7:00 PM (UTC) · 2D Parallelism with Wing Lian

    Thu, Oct 2 · 5:00 PM—6:00 PM (UTC) · Guest Speaker: Prince Canuma

    Thu, Oct 2 · 6:30 PM—7:30 PM (UTC) · 3D Parallelism with Ferdinand Mom

Free resource

Distributed Training Lexicon

The Distributed Training Lexicon is a free resource covering 49 distributed training terms, each with a paired definition and, where helpful, a visualization. The goal is a quick cheat sheet you can consult whenever you need a reminder of what a particular method is.

Download it for free

Free resource

Free Access to Part of Lesson 1

Hi there! To give you a feel for how the course is structured and what the content looks like, I'm sharing an exclusive preview of the course webpage and a sample of the material. I've worked hard to make sure Quarto and Jupyter help me create educational material that will wow you, so let me know if it does!


(Note: this preview may change as the course develops, but only additively; nothing will be removed.)

Get access to the webpage


Instructor is a recognized expert with hands-on experience

"Zach is my go-to person on anything dealing with distributed training. He has maintained the most popular library in the world that helps developers with this problem, which means he's familiar with all of the issues mere mortals have while tackling it. Zach is the best person to teach this subject. I am taking this course."
Hamel Husain, Founder, Parlance Labs | Evals, evals, evals

"Zach is one of the key people in the world making distributed machine learning more accessible. He has firsthand experience building some incredibly popular tools like huggingface/accelerate. If you're GPU poor but considering moving to the GPU middle class, I can't think of a better instructor."
Mark Saroufim, Software Engineer at Meta | Co-founder, GPU MODE

"As a long-time maintainer of HF Accelerate, Zach has had to master not only a deep understanding of ML scaling methods, but also how to integrate them into a cohesive API for the masses to use. I've seen Zach consistently deliver robust, well-integrated solutions with a deep system-level understanding. You will be in good hands with Zach at the helm."
Stas Bekman, Senior Machine Learning Engineer, Snowflake

"Zach's stewardship of Accelerate and management of the intricacies of multiple distributed technologies (while abstracting them into an easy-to-use API) make him the preeminent leader in distributed training. Zach has shown deep understanding of everything from fundamentals to implementation, and he's the first person who would come to mind to teach this."
Wing Lian, Founder, Axolotl

"Zach is truly one in a million. I've never met anyone who puts so much time and thought into crafting deep learning code. With his background and experience, learning from him is an invaluable opportunity."
Radek Osmulski, Senior Data Scientist, NVIDIA

"Zach has a strong grasp of the fundamentals of fastai, but what really sets him apart is his ability to teach. He mixes in practical topics throughout his lessons, making every video engaging and worthwhile. With a proven track record of creating high-quality content, I'm confident that any course Zach produces will be worth your time and attention."
Kevin Bird, Co-Founder, Problem Solvers Guild

"Zach and I used to work together at Hugging Face; since then and through today, he's been building foundational tools for the open ML community to use and learn distributed training techniques. I've personally used his tools for years to train models such as OLMo and Tülu, along with benefiting from his knowledge to better understand what is going on."
Dr. Nathan Lambert, LLM Post Training Lead, Ai2

Join an upcoming cohort

Scratch to Scale: Large-Scale Training in the Modern World

Cohort 1

$1,500

Dates

Sep 1—Oct 3, 2025

Payment Deadline

Aug 31, 2025
Get reimbursed