This talk provides an overview of how document processing for RAG systems can be simplified using multimodal embeddings, grounded in benchmarks on real-world enterprise documents.
What the talk covers
Motivation: Why RAG Is Still Hard
Why PDFs remain challenging in enterprise RAG systems, and where current document processing approaches break down—especially for presentations and visually structured documents.
The Classical Approach: PDF → Text → Chunks
An overview of traditional OCR- and layout-based pipelines, including their strengths, typical failure modes, and why they tend to grow into complex and fragile systems over time.
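The core of that classical path, stripped of OCR and layout analysis, can be sketched in a few lines. This is an illustrative minimal version, not the talk's reference implementation; real pipelines layer table extraction, heading detection, and error handling on top of this step, which is where the fragility tends to accumulate.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split already-extracted page text into overlapping character windows."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# Toy input standing in for text extracted from a PDF page.
pages = ["Quarterly revenue grew 12 percent year over year across all regions."]
chunks = [c for page in pages for c in chunk_text(page, size=30, overlap=10)]
```

Each chunk is then embedded and indexed separately, which is exactly where visually structured content (slides, tables, diagrams) loses the layout information the talk's benchmarks examine.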
A New Paradigm: Multimodal Page Embeddings
How embedding entire PDF pages as images changes the ingestion model, what information is preserved compared to text-only approaches, and what this means for retrieval quality and system simplicity.
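The retrieval side of this paradigm reduces to nearest-neighbour search over one vector per rendered page. The sketch below uses hand-made toy vectors in place of real multimodal model output; the page IDs and the query vector are illustrative, and any page-image embedding model could sit in front of it.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# One embedding per page image: no OCR, no chunking, one vector per page.
# These vectors are toy stand-ins, not real model output.
page_vectors = {
    "report.pdf#p1": [0.9, 0.1, 0.0],
    "report.pdf#p2": [0.1, 0.8, 0.2],
    "slides.pdf#p7": [0.0, 0.2, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # embedding of the user's question

ranked = sorted(page_vectors,
                key=lambda p: cosine(page_vectors[p], query_vector),
                reverse=True)
best_page = ranked[0]  # the matching page image is handed to the LLM as-is
```

Because the unit of retrieval is the page image itself, everything on the page (charts, table layout, slide design) survives into the generation step, which is the simplification the talk quantifies.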
Benchmark Setup
How the benchmark comparing classical pipelines and multimodal page embeddings was designed, using anonymized, real-world enterprise documents across multiple document types. Different models and vendors are referenced only as examples, not as the focus.
Results and Key Findings
Where multimodal page embeddings outperform text-based pipelines, where they fall short, and where hybrid approaches emerge as a practical middle ground.
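One common shape such a hybrid can take, shown here as a hedged sketch rather than the talk's specific recipe, is to run both indexes and fuse their rankings with reciprocal rank fusion (RRF). The page IDs and rankings below are illustrative.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; items ranked high in any list float up."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

text_hits = ["p3", "p1", "p9"]   # from the classical text-chunk pipeline
image_hits = ["p1", "p7", "p3"]  # from the multimodal page-embedding index
fused = rrf([text_hits, image_hits])
```

RRF needs no score calibration between the two retrievers, which makes it a low-risk way to combine a text index with a page-image index.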
Production Best Practices
Practical guidance for deploying these approaches in real systems, including index design, quality monitoring, cost control, and how to integrate multimodal retrieval cleanly into Python-based RAG architectures.
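On the quality-monitoring point, a minimal version is to track recall@k over a small labelled query set whenever the index or embedding model changes. The function, query IDs, and gold labels below are hypothetical illustrations, not an API from the talk.

```python
def recall_at_k(results: dict[str, list[str]],
                labels: dict[str, str],
                k: int = 5) -> float:
    """Fraction of queries whose labelled relevant page appears in the top k."""
    hits = sum(1 for q, relevant in labels.items()
               if relevant in results.get(q, [])[:k])
    return hits / len(labels)

labels = {"q1": "p1", "q2": "p7"}  # gold relevant page per labelled query
results = {"q1": ["p1", "p3"], "q2": ["p2", "p4", "p7"]}  # retriever output
score = recall_at_k(results, labels, k=3)
```

A metric this simple is enough to catch regressions from re-embedding, index rebuilds, or model swaps before they reach users.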
Attendees will leave with a clear understanding of when multimodal embeddings are a strong replacement for classical PDF pipelines, and how to reason about the trade-offs involved.