Framework Fundamentals: Designing Distributed Training APIs

New · 3 Weeks · Cohort-based Course

Learn how to create your own training framework like Accelerate


Course overview

The transformers Trainer, Accelerate, Axolotl: what do they all do?

Every training framework wraps around the same core distributed training strategies (a minimal sketch of the pattern follows the list):

- PyTorch's DistributedDataParallel (DDP)

- DeepSpeed

- PyTorch's FullyShardedDataParallel (FSDP)
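
To make "wraps around" concrete, here is a minimal sketch (an illustration, not the course's actual code) of the raw DDP boilerplate such a framework hides, assuming a launch via torchrun:

```python
# Minimal sketch of the raw DDP pattern a framework wraps (illustrative only).
# Launch with: torchrun --nproc_per_node=2 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, etc.; init reads them.
    dist.init_process_group(backend="gloo")  # use "nccl" on GPUs
    model = torch.nn.Linear(10, 10)
    ddp_model = DDP(model)  # gradients sync across ranks during backward()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    loss = ddp_model(torch.randn(8, 10)).sum()
    loss.backward()  # gradient all-reduce happens here
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

FSDP and DeepSpeed swap in different wrappers and engines, but the shape of the loop stays the same, which is exactly what a framework can exploit.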


However, APIs can break. Aggressive deprecations in the name of "clean code" can destroy your entire stack.


This course is designed to help you create your own scalable training framework by incorporating the most common tools in the industry.


This is not a "how does DDP work?" course.

This is a "how do I incorporate DDP into an internal training framework that is stable and works well?" course.


Our aim is to create, by the end of the three weeks, a minimal version of Hugging Face's Accelerate with a few different spins.


This includes:

1. Writing the source code

2. Learning how to test distributed code well (a small taste follows this list)

3. Creating CI pipelines in GitHub for distributed testing
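
As that taste of testing distributed code (a sketch under my own assumptions, not the course material): one common pattern is to spawn worker processes inside the test itself and run a real collective on the CPU-only gloo backend, so the test runs on any machine:

```python
# Sketch: spawn worker processes inside a test and verify a real collective
# on the CPU "gloo" backend (names here are illustrative, not the course's).
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each rank contributes its rank; after all_reduce every rank holds the sum.
    tensor = torch.tensor([float(rank)])
    dist.all_reduce(tensor)
    assert tensor.item() == sum(range(world_size))

    dist.destroy_process_group()

def test_all_reduce_sums_ranks():
    world_size = 2
    # mp.spawn propagates worker failures, which fails the test.
    mp.spawn(_worker, args=(world_size,), nprocs=world_size)
```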


I'll give you all the tricks I've learned over four years of developing Accelerate, condensed into one class, to help you not only write frameworks yourself but also recognize common patterns in other frameworks, making them easier to navigate.


To facilitate the learning, Prime Intellect will sponsor $300 in compute per student.

Who is this course for

01

Students wanting hands-on experience with how real-world training frameworks work and how their code is structured

02

Beginner MLEs who want to understand how to go from one-off scripts to a more robust framework that they own

03

Mid-level MLEs wanting to know how to ensure their training stack is up to par with the latest integrations

Prerequisites

  • 1 year of PyTorch (and Python)

Our aim is to build a training framework in PyTorch using common libraries, so I expect you to know some PyTorch.

  • Have trained one model

We're building training frameworks, so please make sure you understand how back-propagation works (at least as an idea).

What you’ll get out of this course

How to build a scalable API

A framework like this needs a multi-faceted API. You will build one and learn to balance functionality with readability.
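
For a taste of that design space (a hypothetical sketch with made-up names, not Accelerate's actual internals): a single prepare() entry point keeps the user-facing surface small while hiding strategy-specific wrapping.

```python
# Hypothetical sketch of a small API surface (made-up names, not
# Accelerate's real implementation): one prepare() call hides which
# distributed strategy wraps the model.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class MiniAccelerator:
    def __init__(self, strategy: str = "ddp"):
        self.strategy = strategy

    def prepare(self, model: torch.nn.Module) -> torch.nn.Module:
        # The user's training loop stays identical across strategies.
        if self.strategy == "ddp":
            return DDP(model)
        if self.strategy == "fsdp":
            return FSDP(model)
        raise ValueError(f"unknown strategy: {self.strategy!r}")
```

The balance question is how much of each strategy's configuration to surface through that one call versus hide behind sensible defaults.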

How to use FSDP/DDP/DeepSpeed

We'll directly use the latest and most common training APIs and learn how they can work together

How to test distributed code

Testing distributed code isn't as simple as running "pytest do_my_thing.py"
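
For instance (again a sketch, and examples/train.py is a hypothetical script): distributed tests usually guard on available hardware and shell out to a launcher rather than calling the code in-process.

```python
# Sketch: guard a distributed test on available hardware and launch it
# through torchrun in a subprocess ("examples/train.py" is hypothetical).
import subprocess
import sys

import pytest
import torch

@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="needs 2+ GPUs")
def test_train_script_multi_gpu():
    # check=True fails the test if any rank exits non-zero.
    subprocess.run(
        [sys.executable, "-m", "torch.distributed.run",
         "--nproc_per_node=2", "examples/train.py"],
        check=True,
    )
```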

What’s included


Live sessions

Learn directly from Zachary Mueller in a real-time, interactive format.

Lifetime access

Go back to course content and recordings whenever you need to.

Community of peers

Stay accountable and share insights with like-minded professionals.

Certificate of completion

Share your new skills with your employer or on LinkedIn.

Maven Guarantee

This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.

Course syllabus

Week 1

Nov 3—Nov 9

    Nov 4 · Course Introduction and Getting Set Up · Tue 11/4, 6:00 PM—7:30 PM (UTC)

    Nov 6 · Distributed Data Parallelism and Testing · Thu 11/6, 5:00 PM—6:30 PM (UTC)

Week 2

Nov 10—Nov 16

    Nov 11 · Exploring PyTorch FSDP · Tue 11/11, 6:00 PM—7:00 PM (UTC)

    Nov 13 · Exploring PyTorch FSDP: Part 2 · Thu 11/13, 6:00 PM—7:00 PM (UTC)

Week 3

Nov 17—Nov 22

    Nov 18 · Exploring DeepSpeed: Part 1 · Tue 11/18, 6:00 PM—7:00 PM (UTC)

    Nov 20 · Exploring DeepSpeed: Part 2/Finale · Thu 11/20, 6:00 PM—7:30 PM (UTC)

Meet your instructor

Zachary Mueller

Zachary has spent over four years developing Hugging Face's Accelerate.


Join an upcoming cohort

Framework Fundamentals: Designing Distributed Training APIs

Cohort 1

$1,000

Dates

Nov 3—22, 2025

Payment Deadline

Nov 2, 2025


