Data Stream Processing in Massively Distributed Heterogeneous Environments
The ACM International Conference on Distributed and Event‐Based Systems (DEBS) is a premier venue for academia and industry to discuss cutting-edge research of distributed and event-based computing and data processing. DEBS 2024 will be a forum dedicated to the dissemination of original research, the discussion of practical insights, and the reporting of experiences relevant to distributed and event‐based systems.
The 18th DEBS conference will be held between June 25-28, 2024 in Lyon, France. BIFOLD Co-Director Volker Markl will deliver a keynote titled “NebulaStream – Data Stream Processing in Massively Distributed Heterogeneous Environments” on June 26.
In this talk, Volker Markl will describe several challenges arising due to novel applications and architectures for distributed data stream processing. He will present NebulaStream, an innovative open-source system, currently being built to address these challenges. In addition, he will describe NebulaStream's design principles, architecture, performance, application scenarios, as well as the current status of the open-source development.
Abstract:
Modern data-driven applications arising in such domains as smart manufacturing, healthcare, and the Internet of Things pose new challenges to data processing systems. Traditional stream processing systems, such as Flink, Spark, or Kafka, are ill-suited to cope with the massive scale of distribution, the heterogeneous computing landscape, and the requirement for timely processing and actuation. Classical approaches like managed runtimes, interpretation-based query processing, and the optimization of single queries that neglect interactions greatly limit throughput, latency, and the general usability of these systems for emerging applications involving distributed data processing at scale.
At BIFOLD / TU Berlin, we are researching and building NebulaStream, a novel data-stream processing system for massively distributed, heterogeneous environments. NebulaStream supports (potentially resource-constrained) heterogeneous devices, a hierarchical topology (with the distribution of computation and data flow in a cloud-edge-continuum), and the sharing of computations and data across multiple concurrent queries.
The key distinguishing features of NebulaStream, from a technological perspective, include the following:
An incremental and continuous query optimizer that considers the sharing of computation and intermediate results in conjunction with the placement of operations in a massively distributed, heterogeneous cloud-edge continuum.
A compilation-based approach for streaming queries, which avoids the need for managed runtimes and ensures excellent throughput and latency across the board, from small embedded devices to powerful processors.
A distributed runtime that supports in-network processing on a hierarchical topology of heterogeneous devices in an efficient and fault-tolerant way.