LLM Inference Engineering: Theory, Practicals and Research

Dr. Raj Dandekar

CTO at Vizuara AI, MIT PhD

Yash Dixit

AI PM at Apple, Ex-McKinsey, MIT PhD

45 hours of live LLM inference lectures, 4 hardware labs, 9 guest lectures

LLMs are everywhere, but deploying them in production is a completely different challenge from training them. Most engineers struggle with latency, GPU memory limits, and skyrocketing serving costs.

The gap between "it works in a notebook" and "it serves 10,000 users" is inference engineering.

This is the first course dedicated entirely to production-level LLM inference. You won't just learn the theory: you will deploy models on real hardware (Raspberry Pi, Jetson Orin Nano, Android phones), optimize inference pipelines using vLLM, TensorRT-LLM, and FlashAttention, and build two capstone projects: a speed-optimized inference server and a self-improving WhatsApp AI assistant.

You will also learn directly from 9 guest engineers at Anthropic, NVIDIA, Apple, Microsoft, AWS, and more: the people actually building inference infrastructure at scale.

By the end, you'll have the skills to take any open-source LLM from raw weights to a fully deployed, optimized production pipeline.

Whether you're an ML engineer, backend engineer moving into AI, or a researcher who wants to ship — this course bridges the gap.

200+ engineers have already enrolled.

What you’ll learn

Understand and build end-to-end LLM inference systems that run on laptops, mobile phones, and edge devices like the Raspberry Pi.

  • Trace every step from tokenization through the forward pass to the autoregressive decoding loop

  • Master KV cache mechanics — chunked prefill, prefix caching, compression, H2O, and StreamingLLM

  • Understand why inference is fundamentally different from training and requires its own engineering discipline

  • Learn SMs, tensor cores, the memory hierarchy, and how to read a roofline model

  • Identify whether your workload is compute-bound or memory-bound, and know exactly how to fix it

  • Apply this analysis to make informed decisions about model size, batch size, and hardware selection

  • Implement FP16, BF16, INT8, INT4 quantization using GPTQ, AWQ, and GGUF

  • Understand the precision-performance-quality tradeoffs that matter at serving time

  • Learn when quantization breaks model quality and how to detect it before production

  • Deploy with vLLM (PagedAttention, continuous batching), SGLang (RadixAttention, structured generation), and TensorRT-LLM

  • Implement FlashAttention and kernel fusion for IO-aware attention computation

  • Use Ray Serve and Megatron-LM for distributed serving and model parallelism at scale

  • Implement draft-target decoding, n-gram prediction, EAGLE, and Medusa for real speedups

  • Design production serving systems with cold start optimization, canary deployments, and cache-aware routing

  • Build structured output pipelines using JSON schema enforcement, logit biasing, and guided decoding

  • Run inference on your laptop using llama.cpp and MLX on Apple Silicon

  • Deploy quantized models on Raspberry Pi 4 and compare INT4 vs INT8 on ARM

  • Benchmark GPU vs CPU throughput on Jetson Orin Nano with TensorRT-LLM
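The compute-bound vs memory-bound distinction above comes down to arithmetic intensity (FLOPs per byte moved) compared against a GPU's ridge point. As a rough, illustrative sketch — the peak numbers below are assumptions loosely resembling an A100, not figures from the course — the classification can be done on the back of an envelope:

```python
# Hypothetical roofline-style check. PEAK_FLOPS and PEAK_BW are assumed
# illustrative specs (roughly A100-class: ~312 TFLOP/s FP16, ~2 TB/s HBM).
PEAK_FLOPS = 312e12  # peak FP16 throughput, FLOP/s
PEAK_BW = 2.0e12     # peak memory bandwidth, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW  # FLOPs per byte at the ridge point (~156)

def regime(flops: float, bytes_moved: float) -> str:
    """Classify a workload by its arithmetic intensity (FLOPs per byte)."""
    intensity = flops / bytes_moved
    return "compute-bound" if intensity >= RIDGE else "memory-bound"

# Single-token decode on a 7B-parameter FP16 model: every weight (~14 GB)
# is read once for ~2 FLOPs per parameter, so intensity is ~1 FLOP/byte.
print(regime(2 * 7e9, 2 * 7e9))        # memory-bound

# A 512-token prefill amortizes the same weight reads across many tokens,
# multiplying intensity well past the ridge point.
print(regime(2 * 7e9 * 512, 2 * 7e9))  # compute-bound
```

This is why batching and prefill behave so differently from token-by-token decoding: the weights moved stay the same while the useful FLOPs scale with the number of tokens processed.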

Learn directly from Raj & Yash

Dr. Raj Dandekar

MIT PhD | CTO at Vizuara AI Labs | Researcher

Yash Dixit

Apple AI/ML | MIT | Ex-McKinsey | Research mentor for publication-track students

Who this course is for

  • ML & backend engineers who can build models but struggle with serving them at low latency, optimizing GPUs, and deploying to production.

  • Researchers & data scientists who understand transformers but have never optimized inference or deployed models on real hardware.

  • Tech leads & architects evaluating inference infrastructure: vLLM vs SGLang, quantization tradeoffs, and GPU memory planning.

What's included

Live sessions

Learn directly from Dr. Raj Dandekar & Yash Dixit in a real-time, interactive format.

Lifetime access

Go back to course content and recordings whenever you need to.

Community of peers

Stay accountable and share insights with like-minded professionals.

Certificate of completion

Share your new skills with your employer or on LinkedIn.

Research Paper Mentorship

Two months of 1:1 mentorship with Yash Dixit and Dr. Raj Dandekar. One live call every two weeks, where they review your progress, guide your next steps, and help you work toward a publishable research paper. Get both industry and research exposure from mentors at Apple, McKinsey, and MIT.

Hardware Labs

Dedicated lab days are included in every phase. Every device has a different bottleneck, and you will benchmark each one live. Labs include:

  • Deployment on laptops using llama.cpp on Apple Silicon

  • INT4 vs INT8 quantization experiments on Raspberry Pi 4 (ARM)

  • On-device LLM inference on Android phones with SmolChat

  • CUDA inference and TensorRT-LLM

Maven Guarantee

Your purchase is backed by the Maven Guarantee.

Course syllabus

22 lessons • 6 projects

Week 1

Apr 27—May 3

    Phase 1: Foundations

    4 items

Week 2

May 4—May 10

    Phase 2: Optimization

    5 items

Schedule

Live sessions

12-14 hrs / week

Lectures start at 9 AM IST and usually run for two to three hours. All recordings are made available immediately.

Projects

3 hrs / week

Async content

2 hrs / week

Testimonials

  • I have been a student of Raj for two courses, Generative AI Fundamentals and Building LLMs from Scratch. Personally, this journey was absolutely enlightening for me because of the very unique pedagogy that Raj follows in his teaching. First, his approach is absolutely no-nonsense: he goes into the details of working code and explains to everybody how the entire concept actually works at the grassroots level. But at the same time, he has this beautiful ability to abstract things whenever required, because these concepts are so complex and so deep that it's very easy to lose track. A great hands-on experience, lots of practical sessions, and above all, a lot of focus on understanding and building things from scratch.

    Samrat Kar

Software Engineering Manager, Boeing
  • I recently completed the GPT from Scratch course with Vizuara. I really loved the course, especially the interaction between the instructor, Dr. Raj, and the students. We explored the concepts in depth, and the assignments helped me experiment, iterate, and really understand how GPT works under the hood. If you are serious about learning tokenization, attention mechanisms, and transformers, this is the course for you. I highly recommend it. Cheers.

    Kiran Bandhakavi

    Product Manager, Navy Federal Credit Union

Frequently asked questions

$2,500

USD

Apr 28—May 26
Enroll