In this hands-on training session, you will learn how to build an end-to-end Retrieval-Augmented Generation (RAG) pipeline using cutting-edge open source tools. You'll learn to extract content with Docling, streamline data wrangling with the Data Prep Kit, store and search vectors with Milvus, and answer questions with an open source LLM such as Qwen3, DeepSeek, or GPT-OSS.

By the end of the session, you will:

Understand the tooling for building a RAG pipeline
Ingest unstructured documents reliably
Understand embeddings and storage
Try out various LLMs for use cases
Extract content from various document types (PDF, DOCX, HTML) using Docling
Use Data Prep Kit to streamline data preparation: removing markup, de-duplicating documents, filtering out problematic data such as spam, chunking text, and creating embeddings
Use Milvus, a popular open source vector database, to store and search vectorized data effectively
Use an open source LLM such as Qwen3, DeepSeek, or GPT-OSS to answer questions about the documents
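
To make the flow of these steps concrete, here is a minimal, library-free sketch of the chunk, embed, store, retrieve, and prompt stages. Everything in it is a hypothetical stand-in: `chunk_text`, `embed`, `InMemoryVectorStore`, and `build_prompt` are illustrative names, not the Docling, Data Prep Kit, or Milvus APIs. The word-count "embedding" is a toy substitute for a real embedding model, and the final prompt would be sent to an LLM rather than printed.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into chunks of at most max_words words (toy chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy 'embedding': a bag-of-words count vector. A real pipeline would
    call an embedding model here (assumption: any sentence-embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database such as Milvus."""

    def __init__(self):
        self.rows = []  # list of (vector, chunk_text) pairs

    def insert(self, chunks):
        for chunk in chunks:
            self.rows.append((embed(chunk), chunk))

    def search(self, query, top_k=2):
        query_vec = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(query_vec, row[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]


def build_prompt(question, contexts):
    """Assemble the retrieved chunks and the question into an LLM prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n"
    )


if __name__ == "__main__":
    # In the real pipeline, this text would come from a document extraction step.
    docs = [
        "Docling extracts text from PDF, DOCX and HTML files.",
        "Milvus is a vector database that stores embeddings.",
        "Data Prep Kit removes duplicate documents.",
    ]
    store = InMemoryVectorStore()
    for doc in docs:
        store.insert(chunk_text(doc))

    question = "Which vector database stores embeddings?"
    print(build_prompt(question, store.search(question, top_k=1)))
```

In a real pipeline, each function would be swapped for the corresponding tool while the overall shape (extract, clean and chunk, embed, store, retrieve, prompt) stays the same.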

More about Docling: https://github.com/DS4SD/docling
More about Data Prep Kit: https://github.com/IBM/data-prep-kit
More about Milvus: https://milvus.io/

Open Source RAG Pipeline With Docling + Data Prep Kit + Milvus + Open LLMs

Speaker

Sujee Maniyam

