VLDB2021: BOSS Workshop features Open Source Big Data Systems
VLDB2021: BOSS Workshop features Open Source Big Data Systems
BIFOLD researchers will present three full research papers as well as three demo papers at the 47th International Conference on Very Large Data Bases (VLDB 2021), which will take place from August 16 – 29, 2021. In conjunction with VLDB, BIFOLD researchers also co-organize the BOSS 2021 workshop on open source big data systems.
BIFOLD researchers will contribute to the leading international conference on the management and analysis of very large datasets, VLDB, with three full research papers and three demos of their latest database system management research. The paper “Automated Feature Engineering for Algorithmic Fairness”, authored by Ricardo Salazar Diaz, Felix Neutatz, and Ziawasch Abedjan proposes a highly accurate fairness-aware approach for machine learning. Condor, a high-performing dataflow system that integrates approximate summarizations, is presented in the second paper, “In the Land of Data Streams where Synopses are Missing, One Framework to Bring Them All” by Rudi Poepsel-Lemaitre, Martin Kiefer, Joscha von Hein, Jorge-Arnulfo Quiane-Ruiz, and Volker Markl. In their paper “Scotch: Generating FPGA-Accelerators for Sketching at Line Rate,” Martin Kiefer, Ilias Poulakis, Sebastian Bress, and Volker Markl present Scotch, a novel system for accelerating sketch main- tenance using FPGAs, that enables faster processing of compressed data. Additionally, two papers from the NebulaStream research program will be presented at VLDB’s VLIoT workshop.
BIFOLD Research Group Lead Dr. Quiané-Ruiz also co-organizes the Big Data Open Source Systems(BOSS) workshop, which is held in conjunction with VLDB. On August 16, BOSS 2021 will feature tutorials on open source big data systems like Apache Calcite, Apache Arrow, Apache AsterixDB, and a presentation on Apache Wayang by BIFOLD researcher Dr. Zoi Kaoudi. Highlight of the workshop will be the keynote on “Lessons learned from building and growing Apache Spark “ by Reynold Xin, co-founder of Databricks and one of the main developers of Apache Spark – one of the most important open source massive data analytics engine currently in use.
The publications in detail:
Full Research papers
Automated Feature Engineering for Algorithmic Fairness
Abstract: One of the fundamental problems of machine ethics is to avoid the perpetuation and amplification of discrimination through machine learning applications. In particular, it is desired to exclude the influence of attributes with sensitive information, such as gender or race, and other causally related attributes on the machine learning task. The state-of-the-art bias reduction algorithm Capuchin breaks the causality chain of such attributes by adding and removing tuples. However, this horizontal approach can be considered invasive because it changes the data distribution. A vertical approach would be to prune sensitive features entirely. While this would ensure fairness without tampering with the data, it could also hurt the machine learning accuracy. Therefore, we propose a novel multi-objective feature selection strategy that leverages feature construction to generate more features that lead to both high accuracy and fairness. On three well-known datasets, our system achieves higher accuracy than other fairness-aware approaches while maintaining similar or higher fairness.
Abstract: In pursuit of real-time data analysis, approximate summarization structures, i.e., synopses, have gained importance over the years. However, existing stream processing systems, such as Flink, Spark, and Storm, do not support synopses as first class citizens, i.e., as pipeline operators. Synopses’ implementation is upon users. This is mainly because of the diversity of synopses, which makes a unified implementation difficult. We present Condor, a framework that supports synopses as first class citizens. Condor facilitates the specification and processing of synopsis-based streaming jobs while hiding all internal processing details. Condor’s key component is its model that represents synopses as a particular case of windowed aggregate functions. An inherent divide and conquer strategy allows Condor to efficiently distribute the computation, allowing for high-performance and linear scalability. Our evaluation shows that Condor outperforms existing approaches by up to a factor of 75x and that it scales linearly with the number of cores.
Scotch: Generating FPGA-Accelerators for Sketching at Line Rate
Authors: Martin Kiefer, Ilias Poulakis, Sebastian Bress, Volker Markl
Abstract: Sketching algorithms are a powerful tool for single-pass data summarization. Their numerous applications include approximate query processing, machine learning, and large-scale network monitoring. In the presence of high-bandwidth interconnects or in-memory data, the throughput of summary maintenance over input data becomes the bottleneck. While FPGAs have shown admirable throughput and energy-efficiency for data processing tasks, developing FPGA accelerators requires a sophisticated hardware design and expensive manual tuning by an expert. We propose Scotch, a novel system for accelerating sketch maintenance using FPGAs. Scotch provides a domain-specific language for the user-friendly, high-level definition of a broad class of sketching algorithms. A code generator performs the heavy-lifting of hardware description, while an auto-tuning algorithm optimizes the summary size. Our evaluation shows that FPGA accelerators generated by Scotch outperform CPU- and GPU-based sketching by up to two orders of magnitude in terms of throughput and up to a factor of five in terms of energy efficiency.
Abstract: In this paper we present our work on compliant geo-distributed data processing. Our work focuses on the new dimension of dataflow constraints that regulate the movement of data across geographical or institutional borders. For example, European directives may regulate transferring only certain information fields (such as non personal information) or aggregated data. Thus, it is crucial for distributed data processing frameworks to consider compliance with respect to dataflow constraints derived from these regulations. We have developed a compliance-based data processing framework, which (i) allows for the declarative specification of dataflow constraints, (ii) determines if a query can be translated into a compliant distributed query execution plan, and (iii) executes the compliant plan over distributed SQL databases. We demonstrate our framework using a geo-distributed adaptation of the TPC-H benchmark data. Our framework provides an interactive dashboard, which allows users to specify dataflow constraints, and analyze and execute compliant distributed query execution plans.
Abstract: Parameter servers ease the implementation of distributed machine learning systems, but their performance can fall behind that of single machine baselines due to communication overhead. We demonstrate LAPSE, a parameter server with dynamic parameter allocation. Previous work has shown that dynamic parameter allocation can improve parameter server performance by up to two orders of magnitude and lead to near-linear speed-ups over single machine baselines. In this demonstration, attendees learn how they can use LAPSE and how LAPSE can provide order-of-magnitude speed-ups over other parameter servers. To do so, this demonstration interactively analyzes and visualizes how dynamic parameter allocation looks like in action.
Abstract: Distributed matrix computation is common in large-scale data processing and machine learning applications. Iterative-convergent algorithms involving matrix computation share a common property: parameters converge non-uniformly. This property can be exploited to avoid redundant computation via incremental evaluation. Unfortunately, existing systems that support distributed matrix computation, like SystemML, do not employ incremental evaluation. Moreover, incremental evaluation does not always outperform classical matrix computation, which we refer to as a full evaluation. To leverage the benefit of increments, we propose a new system called HyMAC, which performs hybrid plans to balance the trade-off between full and incremental evaluation at each iteration. In this demonstration, attendees will have an opportunity to experience the effect that full, incremental, and hybrid plans have on iterative algorithms.
Abstract: The Internet of Things (IoT) enables the usage of resources at the edge of the network for various data management tasks that are traditionally executed in the cloud. However, the heterogeneity of devices and communication methods in a multitiered IoT environment (cloud/fog/edge) exacerbates the problem of deciding which nodes to use for processing and how to route data. In addition, both decisions cannot be made only statically for the entire lifetime of an application, as an IoT environment is highly dynamic and nodes in the same topology can be both stationary and mobile as well as reliable and volatile. As a result of these different characteristics, an IoT data management system that spans across all tiers of an IoT network cannot meet the same availability assumptions for all its nodes. To address the problem of choosing ad-hoc which nodes to use and include in a processing workload, we propose a networking component that uses a-priori as well as ad-hoc routing information from the network. Our approach, called Rime, relies on keeping track of nodes at the gateway level and exchanging routing information with other nodes in the network. By tracking nodes while the topology evolves in a geo-distributed manner, we enable efficient communication even in the case of frequent node failures. Our evaluation shows that Rime keeps in check communication costs and message transmissions by reducing unnecessary message exchange by up to 82.65%.
Abstract: The Internet of Things (IoT) is rapidly growing into a network of billions of interconnected physical devices that constantly stream data. To enable data-driven IoT applications, data management systems like NebulaStream have emerged that manage and process data streams, potentially in combination with data at rest, in a heterogeneous distributed environment of cloud and edge devices. To perform internal optimizations, an IoT data management system requires a monitoring component that collects system metrics of the underlying infrastructure and application metrics of the running processing tasks. In this paper, we explore the applicability of existing cloud-based monitoring solutions for stream processing engines in an IoT environment. To this end, we provide an overview of commonly used approaches, discuss their design, and outline their suitability for the IoT. Furthermore, we experimentally evaluate different monitoring scenarios in an IoT environment and highlight bottlenecks and inefficiencies of existing approaches. Based on our study, we show the need for novel monitoring solutions for the IoT and define a set of requirements.