In this hands-on training session, you will build an end-to-end Retrieval-Augmented Generation (RAG) pipeline using open source tools: Docling for content extraction, Data Prep Kit for data wrangling, Milvus for vector storage, and an open source LLM such as Qwen3, DeepSeek, or GPT-OSS.
In this session, you will:
- Understand the tooling for building a RAG pipeline
- Ingest unstructured documents reliably
- Understand embeddings and storage
- Try out various LLMs for use cases
- Extract content from various documents (PDFs, DOCX, HTML) using Docling.
- Use Data Prep Kit to streamline data preparation, including markup removal, de-duplication, removal of problematic data such as spam, chunking, and embedding creation.
- Use Milvus, a popular open source vector database, to manage and search vectorized data effectively.
- Use an open source LLM such as Qwen3, DeepSeek, or GPT-OSS to answer questions about documents.
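
To make the stages listed above concrete, here is a minimal, dependency-free sketch of the same flow: clean and de-duplicate text, chunk it, embed it, store and search vectors, then build a prompt for an LLM. It is only an illustration under stated assumptions. It does not call Docling, Data Prep Kit, Milvus, or any LLM; every function and class name below (`chunk_text`, `embed`, `InMemoryVectorStore`, `build_prompt`, and so on) is hypothetical, and the bag-of-words "embedding" is a toy stand-in for a real embedding model.

```python
import math
import re
from collections import Counter


def deduplicate(texts):
    """Drop exact duplicates (case-insensitive), keeping first occurrences."""
    seen, unique = set(), []
    for text in texts:
        key = text.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def chunk_text(text, max_words=8):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: word counts. A real pipeline would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database such as Milvus."""

    def __init__(self):
        self.rows = []  # list of (vector, chunk) pairs

    def insert(self, chunk):
        self.rows.append((embed(chunk), chunk))

    def search(self, query, top_k=2):
        query_vec = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(query_vec, row[0]), reverse=True)
        return [chunk for _, chunk in ranked[:top_k]]


def build_prompt(question, context_chunks):
    """Assemble the text that would be sent to an LLM."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    documents = [
        "Milvus stores vectors for similarity search.",
        "Docling extracts text from PDF files.",
        "Milvus stores vectors for similarity search.",  # duplicate, removed below
    ]
    store = InMemoryVectorStore()
    for doc in deduplicate(documents):
        for chunk in chunk_text(doc):
            store.insert(chunk)

    question = "Which tool extracts text from PDF?"
    print(build_prompt(question, store.search(question, top_k=1)))
```

In the session's actual stack, each stand-in maps to a real component: Docling for extraction, Data Prep Kit for cleaning, chunking, and embeddings, Milvus for storage and search, and an open source LLM that receives the assembled prompt.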
More about Docling: https://github.com/DS4SD/docling
More about Data Prep Kit: https://github.com/IBM/data-prep-kit
More about Milvus: https://milvus.io/
Speaker

Sujee Maniyam
Developer Advocate @Nebius AI, OS Contributor
Sujee Maniyam is a seasoned practitioner focusing on AI, Big Data, Distributed Systems, and Cloud technologies. He combines deep technical expertise with a passion for developer advocacy, empowering developers to build impactful AI-driven applications. Sujee loves sharing knowledge through workshops, talks, and open source. You can find more of his work at https://sujee.dev