Build a Document Processing Pipeline for RAG Systems

Hosted by Stefan Krawczyk

526 students

What you'll learn

Setup: the components that you need to have

Understand the different components required: document loading, parsing, text chunking, and embedding creation.

Develop: we'll write some code

Set up a basic pipeline, built with open source tools, that parses technical documentation and gives you a starting point.
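To make the four stages concrete (document loading, parsing, text chunking, embedding creation), here is a minimal sketch using only the Python standard library. It is not the code from the lesson: every function name is hypothetical, the input is assumed to be HTML, and `embed` is a toy hashed bag-of-words stand-in for a real embedding model.

```python
import hashlib
import math
import re
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def load(path: str) -> str:
    """Document loading: read one file from disk (assumes UTF-8)."""
    return Path(path).read_text(encoding="utf-8")


def parse(raw_html: str) -> str:
    """Parsing: strip the markup and collapse runs of whitespace."""
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    return re.sub(r"\s+", " ", " ".join(extractor.parts)).strip()


def chunk(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Text chunking: fixed-size windows of words that overlap slightly."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text: str, dims: int = 16) -> list[float]:
    """Embedding creation: toy stand-in, NOT a real embedding model.

    Each token is hashed into one of `dims` buckets, then the count
    vector is L2-normalized.
    """
    vec = [0.0] * dims
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def run_pipeline(raw_html: str) -> list[tuple[str, list[float]]]:
    """Run parse -> chunk -> embed and return (chunk, vector) pairs."""
    return [(c, embed(c)) for c in chunk(parse(raw_html))]


if __name__ == "__main__":
    doc = "<h1>Install</h1><p>Run pip install to get the package and then import it in your code.</p>"
    for text, vector in run_pipeline(doc):
        print(f"{len(vector)} dims | {text}")
```

In a real system, `embed` would call an actual embedding model and the resulting pairs would be written to a vector store; `load` is kept separate so a different source (a crawler, an object store) can be swapped in without touching the other stages.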

Caveats: going from development to production

Learn the common caveats of taking such a system to production.

Why this topic matters

Retrieval-Augmented Generation (RAG) is a 🔥 hot topic, but to use RAG you need data to retrieve. In most organizations, that data lives in some form of document. Understanding the "what" and "how" of creating a document processing pipeline will enable you to move faster and make better decisions as you build out your RAG system.

You'll learn from

Stefan Krawczyk

Co-creator of Hamilton & Burr, CEO & Co-Founder DAGWorks Inc.

A hands-on leader and Silicon Valley veteran, Stefan has spent over 15 years working across many parts of the stack. For the last decade, he has focused primarily on data and machine learning systems and how they connect to building product applications. He has built many 0 to 1 and 1 to 3 versions of these systems at places like Stanford, Honda Research, LinkedIn, Nextdoor, Idibon, and Stitch Fix.


A regular conference speaker, Stefan has guest lectured at Stanford’s Machine Learning Systems Design course and Building Apps with LLMs Inside course, and is an author of two popular open source frameworks, Hamilton & Burr.

Previously at

Stitch Fix
Nextdoor
LinkedIn
Honda Research Institute
Stanford University
© 2025 Maven Learning, Inc.