Simplifying RAG Document Pipelines with Multimodal Embeddings

Arne Grobrügge

Natural Language Processing & Audio (incl. Generative AI NLP)
Python Skill: Novice
Domain Expertise: Intermediate

This talk shows how multimodal embeddings can simplify document processing for RAG systems, grounded in benchmarks on real-world enterprise documents.

What the talk covers

  1. Motivation: Why RAG Is Still Hard
    Why PDFs remain challenging in enterprise RAG systems, and where current document processing approaches break down, especially for presentations and visually structured documents.

  2. The Classical Approach: PDF → Text → Chunks
    An overview of traditional OCR- and layout-based pipelines, including their strengths, typical failure modes, and why they tend to grow into complex and fragile systems over time.

  3. A New Paradigm: Multimodal Page Embeddings
    How embedding entire PDF pages as images changes the ingestion model, what information is preserved compared to text-only approaches, and what this means for retrieval quality and system simplicity.

  4. Benchmark Setup
    How the benchmark comparing classical pipelines and multimodal page embeddings was designed, using anonymized, real-world enterprise documents across multiple document types. Different models and vendors are referenced only as examples, not as the focus.

  5. Results and Key Findings
    Where multimodal page embeddings outperform text-based pipelines, where they do not, and how hybrid approaches that combine both can serve as a practical solution.

  6. Production Best Practices
    Practical guidance for deploying these approaches in real systems, including index design, quality monitoring, cost control, and how to integrate multimodal retrieval cleanly into Python-based RAG architectures.
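
To make the hybrid idea from the outline concrete, here is a minimal sketch of one possible way to combine a text-based ranking with a page-embedding ranking. It is not the method presented in the talk: the choice of reciprocal rank fusion, the function names, the constant k=60, and the toy two-dimensional vectors are all assumptions made for illustration. In a real system the vectors would come from a text embedding model and a multimodal page embedding model.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_pages(query_vec, page_vecs):
    """Return page ids ordered by descending similarity to the query."""
    scores = {pid: cosine(query_vec, vec) for pid, vec in page_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings: each page scores sum(1 / (k + rank))."""
    fused = {}
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            fused[pid] = fused.get(pid, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Toy data (hypothetical): one vector per page in each embedding space.
text_query = [1.0, 0.0]
text_pages = {"p1": [0.9, 0.1], "p2": [0.2, 0.8], "p3": [0.6, 0.4]}

image_query = [0.0, 1.0]
image_pages = {"p1": [0.7, 0.3], "p2": [0.3, 0.7], "p3": [0.1, 0.9]}

text_ranking = rank_pages(text_query, text_pages)
image_ranking = rank_pages(image_query, image_pages)
hybrid_ranking = reciprocal_rank_fusion([text_ranking, image_ranking])

print(text_ranking, image_ranking, hybrid_ranking)
```

Rank-based fusion is used in this sketch because similarity scores from two different embedding spaces are not directly comparable, while rank positions are. Whether this beats either pipeline alone depends on the documents and has to be measured.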

Attendees will leave with a clear understanding of when multimodal embeddings are a strong replacement for classical PDF pipelines, and how to reason about the trade-offs involved.

Arne Grobrügge

Arne has worked on multimodal retrieval-augmented generation (RAG) and agentic LLM systems. He has designed ingestion and retrieval pipelines across text, video, and structured data that integrate with common knowledge platforms such as Microsoft SharePoint. His focus has been on scalable Azure-based infrastructure, multilingual and multimodal document processing, and continuous evaluation for reliability. He also has experience building browser-driven agents using modern orchestration frameworks and MCP integration.