Scratch to Scale: Large Scale Training in the Modern World

New · 5 Weeks · Cohort-based Course

Master the journey from prototype to production with large-scale model training, scaling, and deployment.

This course is popular: 21 people enrolled last week.

Previously at Hugging Face and Accenture

Course overview

Build the skills to answer the call when it's time to take your models to scale

Master distributed training and real-world scale techniques from top engineers.

Whether you're an ML engineer looking to move beyond single-GPU experiments, or a product leader seeking to understand the language your AI team speaks, this course will give you the hands-on skills and conceptual clarity to operate confidently at scale.


What Makes This Course Different

This started as a distributed training course. It organically grew into an all-encompassing learning event with world-class speakers who bring deep, real-world experience in modern large-scale training.

The distributed training curriculum remains intact: five hands-on workshops covering today’s core scale-up methods.

Now, each week features hand-tailored guest lectures from top engineers at Hugging Face, Meta, Snowflake, and more.

These expert sessions are aligned to each workshop’s topic, helping you bridge theory with modern production practices.

All course materials and recordings are available after enrollment, and you’ll get free lifetime access to future cohorts.


💡 What’s included:


5 core workshops

14+ guest talks across 3 curated tracks

Weekly live office hours

Community collaboration

Compute credits from Hugging Face & Modal

100% money-back guarantee (within 14 days of course completion)


📣 Conference Talks


Applied Track

Hear how industry leaders are solving real-world scale problems:


Robert Nishihara (Ray, Anyscale): Scaling across thousands of GPUs with Ray

Sami Jaghouar (Prime Intellect): Decentralized global-scale training

Tunji Ruwase (Snowflake): Efficient long-context training with Arctic

Prince Canuma: Local ML workloads using Apple Silicon + MLX


Pretraining Track

Deep dives into LLM pretraining at scale:


Phuc Nguyen (Hugging Face): A practitioner's guide to FP8

Elie Bakouch (Hugging Face): Hyper-optimizing LLMs with MoE, MLA & more

Daniel Han (UnslothAI): Speeding up training with Triton & custom kernels


🧠 Distributed Training Course

Learn the foundations and modern techniques used in real-world LLM scale-ups across hundreds or thousands of GPUs.

5 instructor-led workshops:

DDP from scratch and avoiding data bottlenecks (see the minimal sketch after this list)

ZeRO (Part 1): How model sharding enables scale

ZeRO (Part 2): Efficiency tradeoffs and stage comparison

Pipeline & Tensor Parallelism: Solving communication slowdowns

Multi-Dimensional Parallelism: Combining all methods for throughput
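
To make "from scratch" concrete, here is a minimal sketch of the core idea behind the first workshop: after each backward pass, every rank averages its gradients with an all_reduce so all replicas apply the identical update. This is an illustrative toy of mine, not the course's actual code; the model, data, and gloo backend are assumptions.

```python
# A minimal from-scratch DDP sketch (illustrative only, not the course code).
# Assumed launch: torchrun --nproc_per_node=2 ddp_sketch.py
import torch
import torch.distributed as dist
from torch import nn

dist.init_process_group(backend="gloo")  # "nccl" when training on GPUs
world_size = dist.get_world_size()

model = nn.Linear(8, 1)
for p in model.parameters():
    # Start every replica from rank 0's weights so they stay in sync.
    dist.broadcast(p.data, src=0)

opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 8), torch.randn(16, 1)  # stand-in for each rank's data shard

for _ in range(3):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # The heart of DDP: average gradients across all ranks so every
    # replica applies the identical update.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size
    opt.step()

dist.destroy_process_group()
```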


Guest Lectures include:

Sylvain Gugger (Jane Street): Overview of ZeRO

Wanchao Liang (TorchTitan): DTensor and large-scale pretraining

Ferdinand Mom (Hugging Face): Multi-dimensional parallelism

Less Wright (Meta): Async TensorParallelism

Matej Sirovatka (Hugging Face): Expert Parallelism for MoE

Marc Sun (Hugging Face): Deployment strategies at scale


🚀 Free Compute & Tools

Get hands-on with real-scale training from Day 1.

We’re proud to be sponsored by:

🤗 Hugging Face — 6 months Pro access

⚙️ Modal — $500 in compute credits

More partnerships coming soon


✅ Guarantee

If you're not satisfied, we offer a 100% refund up to 14 days after the course ends. No risk, just learning.

Who is this course for

01

Recent graduates and early-career Machine Learning Engineers who want to learn the tools of the trade for modern model training

02

Senior ML Engineers dropped into the world of LLMs who need to know which parts to focus on when modernizing their stack

03

Project managers leading ML teams who need to speak the language of scale, efficiency, and delivery in today’s AI training world

Prerequisites

  • One year of PyTorch experience

    You should generally be familiar with core tensor operations and how a model is built in PyTorch

  • Basic understanding of tensor math

    We'll be using lots of operations from torch.distributed. I'll teach them to you, but you should already know the core tensor operations (see the short sketch after this list)

  • Trained at least one model in your life

    You should understand how model training works on a single GPU, including the full flow (data -> outputs -> loss -> backprop -> optimizer step)
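
As promised in the torch.distributed bullet above, here is a short sketch of the core collectives the course builds on, so you can gauge yourself against it. The launch command and backend choice are assumptions for illustration.

```python
# The core torch.distributed collectives the course builds on (sketch).
# Assumed launch: torchrun --nproc_per_node=2 collectives.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # "nccl" when on GPUs
rank, world = dist.get_rank(), dist.get_world_size()

t = torch.tensor([float(rank + 1)])
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # every rank now holds the global sum

parts = [torch.zeros(1) for _ in range(world)]
dist.all_gather(parts, torch.tensor([float(rank)]))  # each rank's value, everywhere

dist.broadcast(t, src=0)  # overwrite every rank's t with rank 0's copy

dist.destroy_process_group()
```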

What you’ll get out of this course

Listen to top experts in the field

This course brings together over a dozen world experts in deep learning and distributed training, all in one place just for you

Understand not just what distributed training is, but become an expert in it

I don't want you to take this course and go, "okay, I think I get what's happening here." I want you to walk away knowledgeable enough that if someone walked up to you and said, "here's 1,000 GPUs for a day, do something," you could move into action immediately.

Deep understanding of different parallelization strategies

This won't be a surface-level course teaching you "how to use torch.distributed.fsdp." We're going to understand it from the ground up.
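
For contrast, the "surface level" is roughly the one-liner sketched below, wrapping a model with PyTorch's built-in FSDP; the setup around it is my assumption for illustration. The course is about everything that single line hides: how parameters, gradients, and optimizer state get sharded and re-gathered.

```python
# The "surface level" wrapper the course digs beneath (sketch; assumes one
# GPU per process and that torchrun has set LOCAL_RANK).
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# One line: parameters, gradients, and optimizer state become sharded
# across ranks. The course unpacks how and when that sharding happens.
model = FSDP(nn.Linear(1024, 1024).cuda())
```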

Train a few models on multiple GPUs

Above all, I'm going to make sure everyone gets hands-on experience training at least one model in a distributed fashion through the homework by the end of this course.

Hands-On Exercises, Examples, and Code

This is not a course where I bore you with a slide deck the entire time (though for some topics slides are needed). Instead, we get down in the weeds of the code, and you implement along with me.

Personalized Instruction

Generous office hours ensure that students can ask questions about their specific issues, interests, and needs.

What’s included

Zachary Mueller

Live sessions

Learn directly from Zachary Mueller in a real-time, interactive format.

Lifetime access

Go back to course content and recordings whenever you need to.

Compute Credits

$500 in Modal compute credits, 6 months of Hugging Face Pro

Course notebooks

Detailed course notebooks with meticulous notes to walk you through the material and help you learn along the way

Community of peers

Stay accountable and share insights with like-minded professionals.

Certificate of completion

Share your new skills with your employer or on LinkedIn.

Generous office hours

I'll be making myself available to you for feedback, questions, and anything else I can help with

Maven Guarantee

This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.

Course syllabus

Week 1

Sep 1—Sep 7

    Tue, Sep 2, 6:00 PM—7:00 PM (UTC): Course Introduction and `nbdistributed`: A Jupyter framework for interactive distributed PyTorch

    Thu, Sep 4, 6:00 PM—7:00 PM (UTC): Distributed Data Parallelism From Scratch

Week 2

Sep 8—Sep 14

    Tue, Sep 9, 6:00 PM—7:00 PM (UTC): ZeRO: Stage 1 & 2

Week 3

Sep 15—Sep 21

    Tue, Sep 16, 6:00 PM—7:30 PM (UTC): ZeRO: Stage 3 and Efficient ZeRO Strategies

Week 4

Sep 22—Sep 28

    Tue, Sep 23, 6:00 PM—7:30 PM (UTC): Pipeline Parallelism and Tensor Parallelism

    Thu, Sep 25, 6:00 PM—7:00 PM (UTC): Efficient Strategies for Distributed Inference

Week 5

Sep 29—Oct 3

    Tue, Sep 30, 6:00 PM—7:00 PM (UTC): 2D Parallelism

    Thu, Oct 2, 6:00 PM—7:00 PM (UTC): 3D Parallelism (Guest Speaker)
Free resource

Free Access to Part of Lesson 1

Hi there! To give you a good sense of how the course is oriented and what some of the content looks like, I'm sharing an exclusive preview of the course webpage and how some of the content is shaped. I've worked hard to make sure Quarto and Jupyter help me create educational material that will wow you, so let me know if it does!


(Note: this material preview may change as the course develops, but only additively)

Get access to the webpage


The instructor is a recognized expert with hands-on experience

        Zach is my go-to person on anything dealing with distributed training. He has maintained the most popular library in the world that helps developers with this problem, which means he's familiar with all of the issues mere mortals have while tackling it. Zach is the best person to teach this subject. I am taking this course.
Hamel Husain

Founder, Parlance Labs | Evals, evals, evals
        Zach is one of the key people in the world making distributed machine learning more accessible. He has firsthand experience building some incredibly popular tools like huggingface/accelerate. If you're GPU poor but considering moving to the GPU middle class, then I can't think of a better instructor.
Mark Saroufim

Software Engineer at Meta | Co-founder, GPU MODE
        As a long-time maintainer of HF Accelerate, Zach has had to not only master a deep understanding of ML scaling methods, but also integrate them into a cohesive API for the masses to use. I've seen Zach consistently deliver robust, well-integrated solutions with a deep system-level understanding. You will be in good hands with Zach at the helm.
Stas Bekman

Senior Machine Learning Engineer, Snowflake
        Zach's stewardship of Accelerate, managing the intricacies of multiple distributed technologies while abstracting them into an easy-to-use API, makes him the preeminent leader in distributed training. Zach has shown deep understanding of everything from fundamentals to implementation, and is the first person who would come to mind to teach this.
Wing Lian

Founder, Axolotl
        Zach is truly one in a million. I've never met anyone who puts so much time and thought into crafting deep learning code. With his background and experience, learning from him is an invaluable opportunity.
Radek Osmulski

Senior Data Scientist, NVIDIA
        Zach has a strong grasp of the fundamentals of fastai, but what really sets him apart is his ability to teach. He mixes in practical topics throughout his lessons, making every video engaging and worthwhile. With a proven track record of creating high-quality content, I'm confident that any course Zach produces will be worth your time and attention.
Kevin Bird

Co-Founder, Problem Solvers Guild
        Zach and I used to work together at Hugging Face; since then and through today, he's been building foundational tools for the open ML community to use and learn distributed training techniques. I've personally used his tools for years to train models such as OLMo and Tülu, along with benefiting from his knowledge to better understand what is going on.
Dr. Nathan Lambert

LLM Post Training Lead, Ai2

Meet your instructor

Zachary Mueller

I've been in the field for almost a decade now. I first started in the fast.ai community, quickly learning how modern-day training pipelines are built and operated. Then I moved to Hugging Face, where I'm the Technical Lead on the accelerate project and manage the transformers Trainer.


I've written numerous blogs, courses, and given talks on distributed training and PyTorch throughout my career.


Through this experience, I've condensed almost a decade of learning into this course, and I'm excited to bring you all with me on the learning journey.


Join an upcoming cohort

Scratch to Scale: Large Scale Training in the Modern World

Cohort 1

$1,500

Dates

Sep 1—Oct 3, 2025

Payment Deadline

Aug 31, 2025
Get reimbursed
