AI Inference Engineering & Systems Design

5.0 (14)

Abi Aryan

Computer Scientist, ML Engineer & Author

This course is popular

18 people enrolled last week.

From AI Engineer to AI Inference Engineer

If you’re an AI/ML engineer who can already prompt or fine-tune models but you’ve never been able to answer questions like:

- Why is my 70B model using 120 GB of VRAM and still slow?

- How do I serve 500 concurrent users on 4xH100s without going broke?

- What actually happens inside FlashAttention / PagedAttention / tensor parallelism?

- How do I save my company millions running open models in production?

… then this is the course you’ve been waiting for.

A bit of content update is - 𝐭𝐡𝐞𝐫𝐞 𝐢𝐬 𝐧𝐞𝐰 𝐩𝐥𝐚𝐭𝐟𝐨𝐫𝐦 𝐞𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠, 𝐢𝐧𝐟𝐫𝐚 (+𝐜𝐥𝐮𝐬𝐭𝐞𝐫⁣) 𝐝𝐞𝐬𝐢𝐠𝐧 𝐚𝐧𝐝 𝐝𝐢𝐬𝐭𝐫𝐢𝐛𝐮𝐭𝐞𝐝 𝐬𝐲𝐬𝐭𝐞𝐦𝐬 𝐦𝐚𝐭𝐞𝐫𝐢𝐚𝐥 𝐠𝐞𝐭𝐭𝐢𝐧𝐠 𝐚𝐝𝐝𝐞𝐝 for Cohort 2.

_________________________________________________

For private cohort, email learn@abiaryan.com

What you’ll learn

Most engineers load models with Hugging Face then serve something basic and get stuck on cost, latency, or scale. In this course, we will

Learn to serve a tiny model (e.g. Phi-3-mini or TinyLlama) on a single consumer GPU in <15 minutes.
Learn how to develop and deploy inference gateways, do endpoint management and learn steaming vs non-streaming architectures
You'll learn concrete request‑to‑token mental model before hardware and optimization detail.

Use real tools to see exactly where time/memory is wasted.
You'll learn to calculate exactly how many tokens, bytes and VRAM your model will take.
You'll learn how to build an inference engine from scratch and the internals of vLLM, SGLang in detail and an overview of the rest..

You'll learn how to optimize for compute versus memory.
We will implement compute management tricks and then memory management optimizations incl KV Cache, Quantization etc.
How to decide which hardware to use, for which model and what will scale and where will different model/hardware combinations hit bottleneck

Scaling models doesn't just mean throwing more compute at the problem
You'll learn distributed systems architecture design using Ray and Kubernetes - Fleets, Multi-Node and Multi-GPU & how to scale inferencing
We will implement distributed inferencing for 70B–405B models.

The internals- what happens when your system gets a request, how tokenization, batching, memory allocation is done on your GPUs.
Inference Optimization techniques (speculative decoding, chunked prefill etc) both at prefill & decode stages for dense, sparse, and MoE
How to do concurrency management at request/model/hardware level. How to use Load balancing and parallelism (Data, Tensor, Model, Expert).

We have a huge lineup this time around - to be announced soon.

Learn directly from Abi

Abi Aryan

AI Infrastructure Engineer, Stealth

Book Author (LLMOps, GPU Engineering for AI Systems - upcoming)

See all products from Abi (@goabiaryan)

Who this course is for

AI engineers, ML infrastructure engineers, and backend developers who own inference cost and need to 10–50× optimize it.
Founders & engineers building LLM apps who are tired of burning money on OpenAI
AI engineers who don't understand system design for building reliable applications
Anyone on job market for Inferencing/Solns Architect role

Prerequisites

You have shipped at least one non-trivial Python project
We cannot teach you the basics of Python code.
You can already load and run an LLM using HF Transformers
If you have never touched AI models, this course is not for you.
Comfortable with basic terminal/SSH, git, yaml & json files
We can help with Docker and Kubernetes, don't worry.

What's included

Live sessions

Learn directly from Abi Aryan in a real-time, interactive format.

Lifetime access

Anyone who signs up for the course will have lifetime access to the course and can audit recording/materials too for the next cohort as I do update my course to always keep you ahead of the curve.

1400 USD in compute credits

Compute credits to run the inference exercises during the cohort assignments (thanks to our kind sponsors!)

Resume Review

We will have 1-on-1 resume review if you are currently on the job market or looking for a career transition

171-page system design interview guide

It covers 150 practice questions covering interview questions for 1. Systems Architecture Design 2. Inference Optimization & Serving 3. RAG & Context Engineering 4. Reliability and Guardrails 5. Observability, Evaluation & Monitoring, and finally 6. Tradeoffs, Scenarios & Integrations

Community of peers

A discord channel to stay connected with peers.

Guest Speakers - Fireside Chat

Engineers from Nvidia, AMD, Modular, Lambda, Databricks, Crusoe, Superlinked, Hugging Face, vLLM Core team, SGLang core team, Red Hat AI and many more..

Certificate of completion

Share your new skills with your employer or on LinkedIn.

Maven Guarantee

Your purchase is backed by the Maven Guarantee.

Course syllabus

10 live sessions • 11 lessons

Week 1

Oct 10—Oct 11

Oct

GPUs, TPUs, and the Physics of Inference

Sat 10/103:30 PM—6:00 PM (UTC)

GPU/TPU/ASIC architecture, how to read spec sheets and cost modelling

1 item

Week 2

Oct 12—Oct 18

Oct

What Happens When You Make an LLM Call?

Sat 10/173:30 PM—6:00 PM (UTC)

Inference Gateways and Inference System Design

1 item

Benchmarking and Inference Observability Design

1 item

Free resources

Schedule

Live sessions

2-3 hrs / week

We meet every Saturday 4-6/6.30 pm CET giving you one clear Sunday to rest and recuperate to be ready for the work-week on Monday. This is the key to last 2 months without burning out.

Sat, Oct 10
3:30 PM—6:00 PM (UTC)
Sat, Oct 17
3:30 PM—6:00 PM (UTC)
Wed, Oct 21
3:30 PM—4:30 PM (UTC)

Weekly Project

1-3 hrs / week

Abi will be available to help you in case you get stuck. And you will be discord helping debug each other's solutions - helps learn how people from different backgrounds and different industries approach the problem differently.

Quizzes and Async content

1 hr / week

This was the most loved part of Cohort 1. Every week you will get written notes on the class material and a quiz to test your understanding - the answer key will also be shared. No judgements whatsoever.

Frequently asked questions

Cohort 1 (Feb 20-April 10) sponsored by

Cohort 2 (July 18- Sept 12) sponsored by

Lambda, Superlinked, Modal, Baseten (more to be updated)..

Maven for Teams

Reimbursement

Get your company to pay

Everything L&D needs: email template, receipts, and certificate of completion.

Get reimbursed

Private cohort

Run a cohort for your org

A dedicated cohort with a custom schedule and curriculum, tailored to your team.

Book a private cohort

$2,499

USD

Oct 10—Dec 5

Enroll

AI Inference Engineering & Systems Design

Abi Aryan

This course is popular

From AI Engineer to AI Inference Engineer

What you’ll learn

Learn directly from Abi

Abi Aryan

Who this course is for

Prerequisites

You have shipped at least one non-trivial Python project

You can already load and run an LLM using HF Transformers

Comfortable with basic terminal/SSH, git, yaml & json files

What's included

Course syllabus

Week 1

GPUs, TPUs, and the Physics of Inference

GPU/TPU/ASIC architecture, how to read spec sheets and cost modelling

Week 2

What Happens When You Make an LLM Call?

Inference Gateways and Inference System Design

Benchmarking and Inference Observability Design

Free resources

Schedule

Frequently asked questions

Cohort 1 (Feb 20-April 10) sponsored by

Cohort 2 (July 18- Sept 12) sponsored by