BIFOLD Research Talk 06/01/2022
“Darwin: Scale-In Stream Processing”
Speaker: Lawrence Benson
Venue: Virtual event
Time and Date: January 06, 2022. 2 pm – 2:30 pm.
Registration: If you are interested in participating, please contact: email@example.com.
Companies increasingly rely on stream processing engines (SPEs) to quickly analyze data and monitor their infrastructure. These systems enable continuous querying of data at high rates. Current production-level systems, such as Apache Flink and Spark, rely on clusters of servers to scale out processing capacity. Yet these scale-out systems are resource-inefficient and cannot fully utilize the underlying hardware. As a solution, hardware-optimized, single-server, scale-up SPEs were developed. To achieve peak performance, however, they neglect features that are essential for industry adoption, such as larger-than-memory state and recovery. This forces users to choose between high performance and system availability. While some streaming workloads can afford to lose or reprocess large amounts of data, others cannot and must accept lower performance. Users also face a large performance drop once their workloads slightly exceed the capacity of a single server, forcing them onto scale-out SPEs.

Recognizing that real-world stream processing setups have drastically varying performance and availability requirements, we propose scale-in processing, a new paradigm that adapts to diverse application demands by achieving high hardware utilization on a wide range of single- and multi-node hardware setups, reducing overall infrastructure requirements. In contrast to scaling up or out, it focuses on fully utilizing the given hardware instead of demanding more or ever-larger servers. We present Darwin, our scale-in SPE prototype, which tailors its execution to arbitrary target environments by compiling stream processing queries while providing recoverable larger-than-memory state management. Early results show that Darwin achieves an order-of-magnitude speed-up over current scale-out systems and matches the processing rates of scale-up systems.
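To make the notion of continuous querying concrete, the following is a minimal, engine-agnostic sketch of a tumbling-window aggregation in Python — the kind of query an SPE evaluates continuously over a stream. The function name and data layout are illustrative assumptions, not Darwin's or any engine's actual API.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Group (timestamp, key) events into fixed-size tumbling windows
    and count occurrences per key -- a toy stand-in for the windowed
    aggregations that SPEs run continuously over unbounded streams."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        # Each event falls into exactly one window, identified by
        # its start index ts // window_size.
        windows[ts // window_size][key] += 1
    return {w: dict(counts) for w, counts in windows.items()}

# A small stream of (timestamp, sensor_id) readings.
stream = [(1, "a"), (2, "b"), (3, "a"), (11, "a"), (12, "a")]
print(tumbling_window_counts(stream, 10))
# → {0: {'a': 2, 'b': 1}, 1: {'a': 2}}
```

A real SPE evaluates such a query incrementally as events arrive rather than over a finished list; a scale-up or scale-in engine additionally compiles it into hardware-efficient code.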
Lawrence Benson's research focuses on data management with modern hardware. He is passionate about efficiently leveraging hardware in novel system designs. Currently, he works on persistent memory and next-generation stream processing systems and has published multiple papers at top venues (VLDB, SIGMOD, CIDR, EDBT).
Before his PhD, Lawrence completed his M.Sc. at HPI with a strong focus on databases and stream processing. He wrote his thesis in collaboration with the DIMA Group at TU Berlin. During his studies, he did two internships at Google in California and New York, working on stream processing.