Streaming data between systems—whether across organizations, out of secured environments and isolated networks, or even from home setups—remains a common challenge in modern data engineering and data-sharing workflows. This talk introduces the ZebraStream Protocol: an open, HTTP-based bytestream protocol designed for decoupled systems in which both sides act as clients—no server hosting, no exposed endpoints.
Talk Outline (45 minutes)
1. The Challenge: Data Sharing Between Decoupled Systems (5 min)
- Real-world scenarios: cross-org data exchange, secured environments, isolated networks, home automation, IoT deployments
- Use cases: ETL pipelines, dataset delivery, continuous monitoring, exploratory data access
- Current solutions and their limitations:
  - Message brokers (Kafka): discrete messages, can't coordinate query-response without external notification
  - File storage (S3/SFTP): batch-oriented, lacks streaming
  - HTTP client-server: requires endpoint hosting, security overhead
  - Webhooks: incomplete solution, still needs server hosting
2. ZebraStream Protocol Overview (6 min)
- Why HTTP? Interoperability, evolution (HTTP/2, HTTP/3), standardized infrastructure, firewall-friendly
- Two-part protocol design:
  - Data API: HTTP-based bytestream transfer (like UNIX pipes over HTTP)
  - Connect API: built-in coordination for push and pull patterns
- Key properties: client-to-client via relay, zero-trust security model, ephemeral, direct data flow
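To make the Data API idea concrete, here is a minimal sketch of what a bytestream write could look like on the wire, assuming a placeholder relay URL and token (both hypothetical, not part of the actual service): a plain HTTP PUT carrying a bearer token and a chunk-streamed body.

```python
import urllib.request

# Hypothetical relay endpoint and credential, purely for illustration.
token = "example-token"
body = iter([b"chunk-1,", b"chunk-2"])  # any iterable of bytes streams the body

req = urllib.request.Request(
    "https://relay.example.com/my-stream",  # placeholder, not a real endpoint
    data=body,
    method="PUT",
    headers={"Authorization": f"Bearer {token}"},
)
print(req.get_method())                  # -> PUT
print(req.get_header("Authorization"))   # -> Bearer example-token
```

Because the transfer is ordinary HTTP, any client stack that can stream a request body can participate—no custom wire format required.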
3. Why Bytestreams Matter (8 min)
- Bytestreams vs. messages: continuous byte flow vs. discrete units
- Native format streaming: Parquet, compressed archives, encrypted content
- Supporting event patterns: JSON-lines, CSV within bytestreams
- Python's file-like interface (io.IOBase) as universal abstraction
- Live demo: Streaming Parquet directly into pandas/DuckDB
- Live demo: Log streaming like tail -f
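The file-like abstraction above can be sketched without any network at all: here io.BytesIO stands in for a read handle returned by a stream client, and because it is an ordinary file-like object, standard parsers layer directly on top of it.

```python
import csv
import io

# BytesIO stands in for a stream read handle (any io.IOBase works the same).
byte_stream = io.BytesIO(b"sensor,value\ntemp,21.5\nhumidity,0.43\n")

# Wrap the byte stream in a text layer and parse CSV rows as they arrive.
reader = csv.DictReader(io.TextIOWrapper(byte_stream, encoding="utf-8"))
rows = list(reader)
print(rows[0])  # -> {'sensor': 'temp', 'value': '21.5'}
```

The same pattern carries over to pandas or DuckDB, which accept file-like objects wherever they accept paths.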
4. Coordination for Decoupled Systems (7 min)
- The "who initiates when?" problem
- Symmetric push/pull patterns with same API
- Coordination within open() call
- Live demo: Event-driven pipeline activation
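The "who initiates when?" rendezvous can be illustrated in a single process: both sides behave as clients, and "opening" the stream blocks until the peer shows up. This is only a threading sketch of the idea—the real protocol coordinates the handshake through the relay's Connect API.

```python
import queue
import threading

reader_ready = threading.Event()
pipe = queue.Queue()

def producer():
    reader_ready.wait()            # open-for-write blocks until a reader connects
    for chunk in (b"part-1,", b"part-2"):
        pipe.put(chunk)
    pipe.put(None)                 # end-of-stream marker

def consumer(out):
    reader_ready.set()             # open-for-read releases the waiting writer
    while (chunk := pipe.get()) is not None:
        out.append(chunk)

received = []
writer = threading.Thread(target=producer)
writer.start()
consumer(received)
writer.join()
print(b"".join(received))  # -> b'part-1,part-2'
```

Either side may arrive first; whoever comes second triggers the data flow, which is what makes symmetric push and pull possible with the same API.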
5. Python Integration: File-Like Interface (6 min)
- Why file-like objects matter: universal Python abstraction
- Two dimensions of simplicity: language-agnostic HTTP + Python-specific interface
- Examples: pandas integration, compression layering, encryption composition
- Stream limitations: seekability and Unix pipe compatibility
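The composition point above can be shown with the standard library: because a stream handle is file-like, gzip (or an encryption wrapper) layers transparently on top. BytesIO stands in for the write and read handles here; note that neither direction needs seeking, which matters for non-seekable network streams.

```python
import gzip
import io

# Writing side: compress on the fly into the outgoing byte stream.
outgoing = io.BytesIO()
with gzip.GzipFile(fileobj=outgoing, mode="wb") as gz:
    gz.write(b"compressed on the fly")

# Reading side: layer decompression over the incoming byte stream.
incoming = io.BytesIO(outgoing.getvalue())
with gzip.GzipFile(fileobj=incoming, mode="rb") as gz:
    print(gz.read())  # -> b'compressed on the fly'
```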
6. Open Protocol Specification & Security Model (5 min)
- Open specification: Data API, Connect API, security model
- Security: TLS transport, bearer token auth, ephemeral design, zero-trust with E2EE
- End-to-end encryption patterns (application-layer, protocol-agnostic)
- Comparison with alternatives: Kafka, S3, HTTP client-server (security dimensions)
- Reference implementation: Python client (open source), ZebraStream.io (managed service)
7. Real-World Integration Examples (4 min)
- Data engineering: Cross-org Parquet ETL pipelines with token-based access control
- Privacy-preserving data exchange: End-to-end encrypted datasets (healthcare, research, GDPR compliance)
- Operations: Log streaming and event processing
- IoT & Home automation: Raspberry Pi data delivery from home network without exposed endpoints
- Data science: Ad-hoc dataset sharing for collaborative analysis
- All examples demonstrated with reproducible Python code (open-source client SDK)
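As a flavor of the log-streaming example, here is a JSON-lines sketch: each line is one self-contained event, so a consumer can process logs as they stream in. BytesIO again stands in for a live read handle; iterating a file-like object yields one line at a time.

```python
import io
import json

log_stream = io.BytesIO(
    b'{"level": "info", "msg": "pipeline started"}\n'
    b'{"level": "error", "msg": "retrying"}\n'
)

# Parse each line as an independent event and filter as they arrive.
events = [json.loads(line) for line in log_stream]
errors = [e for e in events if e["level"] == "error"]
print(errors[0]["msg"])  # -> retrying
```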
8. Design Trade-offs & Lessons Learned (3 min)
- Lessons from building and dogfooding the protocol in beta
- Why bytestreams over messages: native format support vs. framing overhead
- Why ephemeral over persistent: privacy by design, no storage footprint
- Why HTTP over custom protocol: infrastructure reuse, firewall-friendly
- Stream limitations: seekability requirement, Unix pipe compatibility rule
- Future directions: protocol evolution, additional language implementations
9. Q&A (6 min)
- Technical deep-dives and audience questions
Johannes Dröge
Johannes holds a PhD in computer science, has developed open-source software, algorithms, and statistical methods for genome data analysis, worked as a data scientist, and led a group of data engineers at a mid-size startup. He is currently bootstrapping SaaS infrastructure software projects with a focus on cross-organizational data sharing.