CTO at Vizuara AI, MIT PhD
AI PM at Apple, McKinsey, MIT PhD


LLMs are everywhere, but deploying them in production is a completely different challenge from training them. Most engineers struggle with latency, GPU memory limits, and skyrocketing serving costs.
The gap between "it works in a notebook" and "it serves 10,000 users" is inference engineering.
This is the first course dedicated entirely to production-level LLM inference. You won't just learn the theory: you will deploy models on real hardware (Raspberry Pi, Jetson Orin Nano, Android phones), optimize inference pipelines using vLLM, TensorRT-LLM, and FlashAttention, and build two capstone projects: a speed-optimized inference server and a self-improving WhatsApp AI assistant.
You will also learn directly from 9 guest engineers at Anthropic, NVIDIA, Apple, Microsoft, AWS, and more: the people actually building inference infrastructure at scale.
By the end, you'll have the skills to take any open-source LLM from raw weights to a fully deployed, optimized production pipeline.
Whether you're an ML engineer, backend engineer moving into AI, or a researcher who wants to ship — this course bridges the gap.
200+ engineers have already enrolled.
Understand and build end-to-end LLM inference systems that run on laptops, mobile phones, and edge devices like the Raspberry Pi.
Trace every step from tokenization through the forward pass to the autoregressive decoding loop
Master KV cache mechanics — chunked prefill, prefix caching, compression, H2O, and StreamingLLM
Understand why inference is fundamentally different from training and requires its own engineering discipline
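To make the decode loop and KV cache concrete, here is a minimal sketch of greedy autoregressive generation with Hugging Face transformers. GPT-2 and the 20-token horizon are illustrative stand-ins, not course materials:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The KV cache stores", return_tensors="pt").input_ids
past = None  # the KV cache: per-layer keys/values for all previous positions
for _ in range(20):
    with torch.no_grad():
        # prefill processes the whole prompt once; decode steps feed only the new token
        out = model(ids if past is None else ids[:, -1:], past_key_values=past)
    past = out.past_key_values  # reuse cached K/V on the next step
    next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)  # greedy decoding
    ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0]))
```

Without past_key_values, every step would recompute attention over the full prefix; the cache is what turns decoding into one-token-at-a-time work.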
Learn SMs, tensor cores, the memory hierarchy, and how to read a roofline model
Identify whether your workload is compute-bound or memory-bound, and know exactly how to fix it
Apply this analysis to make informed decisions about model size, batch size, and hardware selection
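To see how that analysis works in practice, here is a back-of-the-envelope sketch. The peak-throughput and bandwidth figures are assumed, roughly A100-class numbers, and the cost model is the usual simplification of ~2 FLOPs per parameter per generated token, with every weight read once per step:

```python
# Back-of-the-envelope roofline check for one decode step (simplified model:
# ~2 FLOPs per parameter per token; activations and KV reads are ignored).
peak_flops = 312e12  # assumed peak, roughly A100 BF16 tensor-core throughput
peak_bw = 2.0e12     # assumed HBM bandwidth, bytes/s
balance = peak_flops / peak_bw  # FLOPs per byte needed to be compute-bound
print(f"machine balance: {balance:.0f} FLOP/byte")

params = 7e9          # 7B-parameter model
bytes_per_param = 2   # FP16/BF16 weights
for batch in (1, 32, 256):
    flops = 2 * params * batch              # FLOPs for one decode step
    bytes_moved = params * bytes_per_param  # weights read once, shared by the batch
    intensity = flops / bytes_moved
    bound = "compute-bound" if intensity > balance else "memory-bound"
    print(f"batch {batch:>3}: {intensity:.0f} FLOP/byte -> {bound}")
```

At batch 1 the intensity is ~1 FLOP/byte against a balance point of ~150, which is why single-stream decoding is memory-bound, and why batching (or quantizing weights to fewer bytes) moves you along the roofline.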
Implement FP16, BF16, INT8, and INT4 quantization using GPTQ, AWQ, and GGUF
Understand the precision-performance-quality tradeoffs that matter at serving time
Learn when quantization breaks model quality and how to detect it before production
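As a toy illustration of that tradeoff, the sketch below round-trips a weight tensor through per-tensor symmetric quantization in NumPy. This is deliberately naive: GPTQ and AWQ use calibration data, per-group scales, and error compensation rather than a single max-based scale:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # stand-in weight tensor

def quantize_roundtrip(w, n_bits):
    qmax = 2 ** (n_bits - 1) - 1       # e.g. 127 for INT8, 7 for INT4
    scale = np.abs(w).max() / qmax     # per-tensor symmetric scale
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                   # dequantized weights

for bits in (8, 4):
    err = np.abs(w - quantize_roundtrip(w, bits)).max()
    print(f"INT{bits}: max abs error {err:.4f}")
```

The jump in error from 8 to 4 bits, and the sensitivity of a max-based scale to outlier channels, is a small-scale picture of where quantization starts to hurt quality.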
Deploy with vLLM (PagedAttention, continuous batching), SGLang (RadixAttention, structured generation), and TensorRT-LLM
Implement FlashAttention and kernel fusion for IO-aware attention computation
Use Ray Serve and Megatron-LM for distributed serving and model parallelism at scale
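For orientation, this is roughly what the vLLM offline entry point looks like; the model name is an illustrative placeholder, and a production deployment would more likely run vLLM's OpenAI-compatible server:

```python
# Minimal vLLM offline-inference sketch (PagedAttention and continuous
# batching are handled internally by the engine).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model name
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain PagedAttention in one paragraph.",
     "Why does continuous batching raise GPU utilization?"],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```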
Implement draft-target decoding, n-gram prediction, EAGLE, and Medusa for real speedups
Design production serving systems with cold start optimization, canary deployments, and cache-aware routing
Build structured output pipelines using JSON schema enforcement, logit biasing, and guided decoding
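To show the shape of draft-target decoding, here is a toy greedy variant with stand-in "models" over a five-character vocabulary. Real systems verify all k draft tokens in a single batched target forward pass and use a probabilistic accept/reject rule when sampling; this sketch only keeps the matching prefix:

```python
# Toy greedy draft-target speculative decoding over a character vocabulary.
import random

VOCAB = "abcde"

def target(ctx: str) -> str:   # "expensive" model: deterministic toy rule
    return VOCAB[sum(map(ord, ctx)) % len(VOCAB)]

def draft(ctx: str) -> str:    # cheap model: agrees with target ~80% of the time
    return target(ctx) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_step(ctx: str, k: int = 4) -> str:
    proposal, c = [], ctx
    for _ in range(k):         # 1) draft proposes k tokens autoregressively
        t = draft(c)
        proposal.append(t)
        c += t
    for t in proposal:         # 2) target verifies; keep the matching prefix
        expected = target(ctx)
        if t == expected:
            ctx += t           # accepted draft token: "free" progress
        else:
            ctx += expected    # first mismatch: take target's token and stop
            break
    return ctx

random.seed(0)
ctx = "ab"
for _ in range(6):
    ctx = speculative_step(ctx)
print(ctx)
```

The speedup comes from accepted draft tokens costing one cheap draft call each, while the expensive model's verification is amortized across the whole proposal.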
Run inference on your laptop using llama.cpp and MLX on Apple Silicon
Deploy quantized models on Raspberry Pi 4 and compare INT4 vs INT8 on ARM
Benchmark GPU vs CPU throughput on Jetson Orin Nano with TensorRT-LLM
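As a taste of the laptop lab, a minimal llama-cpp-python run might look like this; the GGUF path is a placeholder for whichever quantized model you pull down, and the thread count depends on your machine:

```python
# Minimal local-inference sketch with llama-cpp-python on a GGUF model.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.2-1b-instruct-q4_k_m.gguf",  # placeholder path
    n_ctx=2048,    # context window
    n_threads=8,   # tune to your CPU; Apple Silicon can also offload via Metal
)

out = llm("Q: What does INT4 quantization trade away? A:", max_tokens=64)
print(out["choices"][0]["text"])
```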
ML & backend engineers who can build models but struggle with serving them at low latency, optimizing GPUs, and deploying to production.
Researchers & data scientists who understand transformers but have never optimized inference or deployed models on real hardware.
Tech leads & architects evaluating inference infrastructure: vLLM vs SGLang, quantization tradeoffs, and GPU memory planning.
Live sessions
Learn directly from Dr. Raj Dandekar & Yash Dixit in a real-time, interactive format.
Lifetime access
Go back to course content and recordings whenever you need to.
Community of peers
Stay accountable and share insights with like-minded professionals.
Certificate of completion
Share your new skills with your employer or on LinkedIn.
Research Paper Mentorship
Two months of 1:1 mentorship with Yash Dixit and Dr. Raj Dandekar. One live call every two weeks, where they review your progress, guide your next steps, and help you work towards a publishable research paper. Get both industrial and research exposure from mentors at Apple, McKinsey, and MIT.
Hardware Labs
Dedicated lab days are included in every phase. Every device has a different bottleneck, and you will benchmark each one live. Labs include:
Deployment on laptops using llama.cpp on Apple Silicon
INT4 vs INT8 quantization experiments on Raspberry Pi 4 (ARM)
On-device LLM inference on Android phones with SmolChat
CUDA inference and TensorRT-LLM
Maven Guarantee
Your purchase is backed by the Maven Guarantee.
22 lessons • 6 projects
Live sessions
12-14 hrs / week
Lectures start at 9 am IST and usually run for two to three hours. All recordings are made available immediately.
Projects
3 hrs / week
Async content
2 hrs / week
I have been a student of Raj's for two courses, Generative AI Fundamentals and Building LLMs from Scratch. Personally, this journey was absolutely enlightening for me because of the very unique pedagogy Raj follows in his teaching. First, his approach is absolutely no-nonsense: he goes into the details of working code and explains how the entire concept works at the grassroots level. At the same time, he has this beautiful ability to abstract things whenever required, because these concepts are so complex and so deep that it is very easy to lose track. A great hands-on experience, lots of practical sessions, and above all, a lot of focus on understanding and building things from scratch.

Samrat Kar
I recently completed the GPT from Scratch course with Vizuara. I really loved the course, especially the interaction between the instructor, Dr. Raj, and the students. We explored the concepts in depth, and the assignments helped me experiment, iterate, and really understand how GPT works under the hood. If you are serious about learning tokenization, attention mechanisms, and transformers, this is the course for you. I highly recommend it. Cheers.

Kiran Bandhakavi
$2,500
USD