CTO at Vizuara AI, MIT PhD
AI PM at Apple, McKinsey, MIT PhD


LLMs are everywhere, but deploying them in production is a completely different challenge from training them. Most engineers struggle with latency, GPU memory limits, and skyrocketing serving costs.
The gap between "it works in a notebook" and "it serves 10,000 users" is inference engineering.
This is the first course dedicated entirely to production-level LLM inference. You won't just learn the theory: you will deploy models on real hardware (Raspberry Pi, Jetson Orin Nano, Android phones), optimize inference pipelines using vLLM, TensorRT-LLM, and FlashAttention, and build two capstone projects: a speed-optimized inference server and a self-improving WhatsApp AI assistant.
You will also learn directly from 9 guest engineers at Anthropic, NVIDIA, Apple, Microsoft, AWS, and more: the people actually building inference infrastructure at scale.
By the end, you'll have the skills to take any open-source LLM from raw weights to a fully deployed, optimized production pipeline.
Whether you're an ML engineer, backend engineer moving into AI, or a researcher who wants to ship — this course bridges the gap.
200+ engineers have already enrolled.
Understand and build end-to-end LLM inference systems that run on laptops, mobile phones, and edge devices like the Raspberry Pi.
Trace every step from tokenization through the forward pass to the autoregressive decoding loop
Master KV cache mechanics — chunked prefill, prefix caching, compression, H2O, and StreamingLLM
Understand why inference is fundamentally different from training and requires its own engineering discipline
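To make the decode loop and KV cache concrete, here is a minimal sketch of greedy autoregressive generation with Hugging Face transformers. GPT-2 and the 20-token horizon are illustrative stand-ins, not course materials:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The KV cache stores", return_tensors="pt").input_ids
past = None  # the KV cache: per-layer keys/values for all previous positions
for _ in range(20):
    with torch.no_grad():
        # prefill processes the whole prompt once; decode steps feed only the new token
        out = model(ids if past is None else ids[:, -1:], past_key_values=past)
    past = out.past_key_values  # reuse cached K/V on the next step
    next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)  # greedy decoding
    ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0]))
```

Without past_key_values, every step would recompute attention over the full prefix; the cache is what turns decoding into one-token-at-a-time work.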
Learn SMs, tensor cores, the memory hierarchy, and how to read a roofline model
Identify whether your workload is compute-bound or memory-bound, and know exactly how to fix it
Apply this analysis to make informed decisions about model size, batch size, and hardware selection
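To see how that analysis works in practice, here is a back-of-the-envelope sketch. The peak-throughput and bandwidth figures are assumed, roughly A100-class numbers, and the cost model is the usual simplification of ~2 FLOPs per parameter per generated token, with every weight read once per step:

```python
# Back-of-the-envelope roofline check for one decode step (simplified model:
# ~2 FLOPs per parameter per token; activations and KV reads are ignored).
peak_flops = 312e12  # assumed peak, roughly A100 BF16 tensor-core throughput
peak_bw = 2.0e12     # assumed HBM bandwidth, bytes/s
balance = peak_flops / peak_bw  # FLOPs per byte needed to be compute-bound
print(f"machine balance: {balance:.0f} FLOP/byte")

params = 7e9          # 7B-parameter model
bytes_per_param = 2   # FP16/BF16 weights
for batch in (1, 32, 256):
    flops = 2 * params * batch              # FLOPs for one decode step
    bytes_moved = params * bytes_per_param  # weights read once, shared by the batch
    intensity = flops / bytes_moved
    bound = "compute-bound" if intensity > balance else "memory-bound"
    print(f"batch {batch:>3}: {intensity:.0f} FLOP/byte -> {bound}")
```

At batch 1 the intensity is ~1 FLOP/byte against a balance point of ~150, which is why single-stream decoding is memory-bound, and why batching (or quantizing weights to fewer bytes) moves you along the roofline.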
Implement FP16, BF16, INT8, and INT4 quantization using GPTQ, AWQ, and GGUF
Understand the precision-performance-quality tradeoffs that matter at serving time
Learn when quantization breaks model quality and how to detect it before production
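As a toy illustration of that tradeoff, the sketch below round-trips a weight tensor through per-tensor symmetric quantization in NumPy. This is deliberately naive: GPTQ and AWQ use calibration data, per-group scales, and error compensation rather than a single max-based scale:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # stand-in weight tensor

def quantize_roundtrip(w, n_bits):
    qmax = 2 ** (n_bits - 1) - 1       # e.g. 127 for INT8, 7 for INT4
    scale = np.abs(w).max() / qmax     # per-tensor symmetric scale
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                   # dequantized weights

for bits in (8, 4):
    err = np.abs(w - quantize_roundtrip(w, bits)).max()
    print(f"INT{bits}: max abs error {err:.4f}")
```

The jump in error from 8 to 4 bits, and the sensitivity of a max-based scale to outlier channels, is a small-scale picture of where quantization starts to hurt quality.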
Deploy with vLLM (PagedAttention, continuous batching), SGLang (RadixAttention, structured generation), and TensorRT-LLM
Implement FlashAttention and kernel fusion for IO-aware attention computation
Use Ray Serve and Megatron-LM for distributed serving and model parallelism at scale
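For orientation, this is roughly what the vLLM offline entry point looks like; the model name is an illustrative placeholder, and a production deployment would more likely run vLLM's OpenAI-compatible server:

```python
# Minimal vLLM offline-inference sketch (PagedAttention and continuous
# batching are handled internally by the engine).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model name
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain PagedAttention in one paragraph.",
     "Why does continuous batching raise GPU utilization?"],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```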
Implement draft-target decoding, n-gram prediction, EAGLE, and Medusa for real speedups
Design production serving systems with cold start optimization, canary deployments, and cache-aware routing
Build structured output pipelines using JSON schema enforcement, logit biasing, and guided decoding
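To show the shape of draft-target decoding, here is a toy greedy variant with stand-in "models" over a five-character vocabulary. Real systems verify all k draft tokens in a single batched target forward pass and use a probabilistic accept/reject rule when sampling; this sketch only keeps the matching prefix:

```python
# Toy greedy draft-target speculative decoding over a character vocabulary.
import random

VOCAB = "abcde"

def target(ctx: str) -> str:   # "expensive" model: deterministic toy rule
    return VOCAB[sum(map(ord, ctx)) % len(VOCAB)]

def draft(ctx: str) -> str:    # cheap model: agrees with target ~80% of the time
    return target(ctx) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_step(ctx: str, k: int = 4) -> str:
    proposal, c = [], ctx
    for _ in range(k):         # 1) draft proposes k tokens autoregressively
        t = draft(c)
        proposal.append(t)
        c += t
    for t in proposal:         # 2) target verifies; keep the matching prefix
        expected = target(ctx)
        if t == expected:
            ctx += t           # accepted draft token: "free" progress
        else:
            ctx += expected    # first mismatch: take target's token and stop
            break
    return ctx

random.seed(0)
ctx = "ab"
for _ in range(6):
    ctx = speculative_step(ctx)
print(ctx)
```

The speedup comes from accepted draft tokens costing one cheap draft call each, while the expensive model's verification is amortized across the whole proposal.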
Run inference on your laptop using llama.cpp and MLX on Apple Silicon
Deploy quantized models on Raspberry Pi 4 and compare INT4 vs INT8 on ARM
Benchmark GPU vs CPU throughput on Jetson Orin Nano with TensorRT-LLM
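As a taste of the laptop lab, a minimal llama-cpp-python run might look like this; the GGUF path is a placeholder for whichever quantized model you pull down, and the thread count depends on your machine:

```python
# Minimal local-inference sketch with llama-cpp-python on a GGUF model.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.2-1b-instruct-q4_k_m.gguf",  # placeholder path
    n_ctx=2048,    # context window
    n_threads=8,   # tune to your CPU; Apple Silicon can also offload via Metal
)

out = llm("Q: What does INT4 quantization trade away? A:", max_tokens=64)
print(out["choices"][0]["text"])
```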
ML & backend engineers who can build models but struggle with serving them at low latency, optimizing GPUs, and deploying to production.
Researchers & data scientists who understand transformers but have never optimized inference or deployed models on real hardware.
Tech leads & architects evaluating inference infrastructure: vLLM vs SGLang, quantization tradeoffs, and GPU memory planning.
Live sessions
Learn directly from Dr. Raj Dandekar & Yash Dixit in a real-time, interactive format.
Lifetime access
Go back to course content and recordings whenever you need to.
Community of peers
Stay accountable and share insights with like-minded professionals.
Certificate of completion
Share your new skills with your employer or on LinkedIn.
Research Paper Mentorship
Two months of 1:1 mentorship with Yash Dixit and Dr. Raj Dandekar. One live call every two weeks, where they review your progress, guide your next steps, and help you work towards a publishable research paper. Get both industrial and research exposure from mentors at Apple, McKinsey, and MIT.
Hardware Labs
Dedicated lab days are included in every phase. Every device has a different bottleneck, and you will benchmark each one live. Labs include:
Deployment on laptops using llama.cpp on Apple Silicon
INT4 vs INT8 quantization experiments on Raspberry Pi 4 (ARM)
On-device LLM inference on Android phones with SmolChat
CUDA inference and TensorRT-LLM
Maven Guarantee
Your purchase is backed by the Maven Guarantee.
22 lessons • 6 projects
Live sessions
12-14 hrs / week
Lectures start at 9 am IST and usually run for two to three hours. All recordings are made available immediately.
Projects
3 hrs / week
Async content
2 hrs / week
I have been a student of Raj's for two courses, Generative AI Fundamentals and Building LLMs from Scratch. Personally, this journey was absolutely enlightening for me because of the very unique pedagogy Raj follows in his teaching. First, his approach is absolutely no-nonsense: he goes into the details of working code and explains how the entire concept works at the grassroots level. At the same time, he has this beautiful ability to abstract things whenever required, because these concepts are so complex and so deep that it is very easy to lose track. A great hands-on experience, lots of practical sessions, and above all, a lot of focus on understanding and building things from scratch.

Samrat Kar
I recently completed the GPT from Scratch course with Vizuara. I really loved the course, especially the interaction between the instructor, Dr. Raj, and the students. We explored the concepts in depth, and the assignments helped me experiment, iterate, and really understand how GPT works under the hood. If you are serious about learning tokenization, attention mechanisms, and transformers, this is the course for you. I highly recommend it. Cheers.

Kiran Bandhakavi
$2,500
USD