intro (8 min)
- a brief summary of what our company does, to set up the background for the technical challenge this talk addresses (3 min)
- a high-level overview of our distributed systems architecture (5 min)
reducing the number of processed messages (5 min)
- basic Kafka consumer loop (1 min)
- message deduplication by converting a list of keys into a set (1 min)
- skipping recently processed keys by keeping a TTL-based buffer (3 min)
avoiding data loss (8 min)
- processing keys when they are evicted from the buffer rather than when they are added to it (3 min)
- when is it safe to acknowledge a message (5 min)
keeping the ETL service healthy (5 min)
- avoiding Kafka timeouts (5 min)
wrap up (4 min)
- final architecture (2 min)
- performance results (2 min)
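
To make the outline more concrete, here is a minimal Python sketch of the ideas named in the "reducing the number of processed messages" and "avoiding data loss" parts: deduplicating a batch of keys with a set, absorbing repeats in a TTL-based buffer, processing a key when it is evicted rather than when it is added, and tracking the lowest offset that is still safe to acknowledge. The class name `EvictingKeyBuffer`, its methods, the `clock` parameter, and the offset bookkeeping are all assumptions made for illustration. This is not the production implementation, and it uses no Kafka client library.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


class EvictingKeyBuffer:
    """Hypothetical sketch: hold each key for `ttl` seconds, process it on eviction."""

    def __init__(self, ttl: float, process: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.process = process
        self.clock = clock
        # key -> (deadline, offset of the first message that mentioned the key).
        # A plain dict keeps insertion order, and with a constant TTL that is
        # also deadline order.
        self._pending: Dict[str, Tuple[float, int]] = {}

    def add_batch(self, keys: Iterable[str], offset: int) -> None:
        """Buffer the keys of one message (assumed to carry a list of keys)."""
        now = self.clock()
        for key in set(keys):              # deduplicate within the message
            if key not in self._pending:   # repeats inside the TTL window are absorbed
                self._pending[key] = (now + self.ttl, offset)

    def evict_expired(self) -> List[str]:
        """Process and drop every key whose TTL has elapsed."""
        now = self.clock()
        evicted: List[str] = []
        while self._pending:
            key, (deadline, _) = next(iter(self._pending.items()))
            if deadline > now:
                break
            self.process(key)        # process first ...
            del self._pending[key]   # ... then forget the key
            evicted.append(key)
        return evicted

    def safe_offset(self, next_offset: int) -> int:
        """Lowest offset still needed; acknowledging up to it loses no buffered key."""
        if not self._pending:
            return next_offset
        return min(offset for _, offset in self._pending.values())
```

In this sketch a consumer loop would call `add_batch` for each polled message, call `evict_expired` periodically, and acknowledge only up to `safe_offset(...)`, so that after a restart every key that was still waiting in the buffer is read again. The price of that choice is that some messages may be reprocessed after a crash.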

Mirano Tuk

Principal Software Engineer at ReversingLabs, working on large-scale distributed systems and data-intensive architectures.

I design and operate high-throughput, real-time pipelines, with an emphasis on reliability, observability, and performance under real-world conditions, and I take a practical approach to engineering trade-offs and system failures.

Filip Bacic

Software Development Manager at ReversingLabs, leading teams responsible for large-scale data processing, data quality, and technical writing. Specialized in turning complex systems into something that works, produces correct results, and is documented well enough that someone else can understand it, usually in that order.

How to Search Through 800 Billion Records in Real Time

Mirano Tuk, Filip Bacic
