Six full research papers and one industrial paper by BIFOLD Researchers on data management topics were accepted at SIGMOD 2021. “The acceptance of such a high number of papers from one German research group at SIGMOD is exceptional. I am very proud of this BIFOLD success and the international recognition of our research efforts,” states Volker Markl. Two of the publications were the result of international research collaborations. One of these papers is due to joint work with colleagues at East China Normal University in Shanghai.
The authors propose HyMAC, a system that enables iterative Machine Learning algorithms to run more efficiently on distributed dataflow systems. The approach has the potential to speed up the process of Machine Learning with data from billions of datapoints by reducing the communication cost in dataflow systems, such as Apache Flink.
The other international collaboration resulted in a publication based on work conducted in the ExDRa (Exploratory Data Science on Raw Data) project, jointly with BIFOLD researchers, Siemens AG, TU Graz, and Know-Center GmbH. This paper was accepted in the SIGMOD Industrial Track. The ExDRa system is designed to support the exploratory data science process over heterogeneous and distributed data. Typically, industrial data scientists propose and evaluate hypotheses, integrate the necessary data, and build and execute models, in order to identify interesting patterns. To aid in this process, ExDRa investigates how to design and build a system that can offer support and help optimize the analysis of problems arising in several Siemens use-cases (e.g., chemical, pharmaceutical, water, oil, gas). For this, the project leverages the NebulaStream data management system for the IoT, which is currently under development in the BIFOLD IoT Lab.
>
Authors:
Felix Neutatz, Felix Biessmann, Ziawasch Abedjan
Abstract:
Responsible usage of Machine Learning (ML) systems in practice does not only require enforcing high prediction quality, but also accounting for other constraints, such as fairness, privacy, or execution time. One way to address multiple user-specified constraints on ML systems is feature selection. Yet, optimizing feature selection strategies for multiple metrics is difficult to implement and has been underrepresented in previous experimental studies. Here, we propose Declarative Feature Selection (DFS) to simplify the de-sign and validation of ML systems satisfying diverse user-specified constraints. We benchmark and evaluate a representative series of feature selection algorithms. From our extensive experimental results, we derive concrete suggestions on when to use which strategy and show that a meta-learning-driven optimizer can accurately predict the right strategy for an ML task at hand. These results demonstrate that feature selection can help to build ML systems that meet combinations of user-specified constraints, independent of the ML methods used.
Preprint (PDF)
Authors:
Felix Neutatz, Felix Biessmann, Ziawasch Abedjan
Abstract:
Responsible usage of Machine Learning (ML) systems in practice does not only require enforcing high prediction quality, but also accounting for other constraints, such as fairness, privacy, or execution time. One way to address multiple user-specified constraints on ML systems is feature selection. Yet, optimizing feature selection strategies for multiple metrics is difficult to implement and has been underrepresented in previous experimental studies. Here, we propose Declarative Feature Selection (DFS) to simplify the de-sign and validation of ML systems satisfying diverse user-specified constraints. We benchmark and evaluate a representative series of feature selection algorithms. From our extensive experimental results, we derive concrete suggestions on when to use which strategy and show that a meta-learning-driven optimizer can accurately predict the right strategy for an ML task at hand. These results demonstrate that feature selection can help to build ML systems that meet combinations of user-specified constraints, independent of the ML methods used.
Preprint (PDF)
>
Authors
Gábor E. Gévay, Jorge-Arnulfo Quiané-Ruiz, Volker Markl
Abstract
Many common data analysis tasks, such as performing hyperparameter optimization, processing a partitioned graph, and treating a matrix as a vector of vectors, offer natural opportunities for nested-parallel operations, i.e., launching parallel operations from inside other parallel operations. However, state-of-the-art dataflow engines, such as Spark and Flink, do not support nested parallelism. Users must implement workarounds, causing orders of magnitude slowdowns for their tasks, let alone the implementation effort. We present Matryoshka, a system that enables dataflow engines to support nested parallelism, even in the presence of control flow statements at inner nesting levels. Matryoshka achieves this via a novel two-phase flattening process, which translates nested-parallel programs to flat-parallel programs that can efficiently run on existing dataflow engines. The first phase introduces novel nesting primitives into the code, which allows for dynamic optimizations based on intermediate data characteristics in the second phase at run time. We validate our system using several common data analysis tasks, such as PageRank and K-means. The results show the superiority of Matryoshka over the state-of-the-art approaches (the DIQL system as well as the outer- and inner-parallel workarounds)to support nested parallelism in dataflow engines.
Authors
Gábor E. Gévay, Jorge-Arnulfo Quiané-Ruiz, Volker Markl
Abstract
Many common data analysis tasks, such as performing hyperparameter optimization, processing a partitioned graph, and treating a matrix as a vector of vectors, offer natural opportunities for nested-parallel operations, i.e., launching parallel operations from inside other parallel operations. However, state-of-the-art dataflow engines, such as Spark and Flink, do not support nested parallelism. Users must implement workarounds, causing orders of magnitude slowdowns for their tasks, let alone the implementation effort. We present Matryoshka, a system that enables dataflow engines to support nested parallelism, even in the presence of control flow statements at inner nesting levels. Matryoshka achieves this via a novel two-phase flattening process, which translates nested-parallel programs to flat-parallel programs that can efficiently run on existing dataflow engines. The first phase introduces novel nesting primitives into the code, which allows for dynamic optimizations based on intermediate data characteristics in the second phase at run time. We validate our system using several common data analysis tasks, such as PageRank and K-means. The results show the superiority of Matryoshka over the state-of-the-art approaches (the DIQL system as well as the outer- and inner-parallel workarounds)to support nested parallelism in dataflow engines.
>
Authors
Francesco Ventura, Zoi Kaoudi, Jorge Arnulfo Quiane Ruiz, Volker Markl
Abstract
Machine Learning (ML) is quickly becoming a prominent method in many data management components, including query optimizers which have recently shown very promising results. However, the low availability of training data (i.e., large query workloads with execution time or output cardinality as labels) widely limits further advancement in research and compromises the technology transfer from research to industry. Collecting a labeled query workload has a very high cost in terms of time and money due to the development and execution of thousands of realistic queries/jobs. In this work, we face the problem of generating training data for data management components tailored to users’ needs. We present DataFarm, an innovative framework for efficiently generating and labeling large query workloads. We follow a data-driven white box approach to learn from pre-existing small workload patterns, input data, and computational resources. Our framework allows users to produce a large heterogeneous set of realistic jobs with their labels, which can be used by any ML-based data management component. We show that our framework outperforms the current state-of-the-art both in query generation and label estimation using synthetic and real datasets. It has up to 9× better labeling performance, in terms of R2 score. More importantly, it allows users to reduce the cost of getting labeled query workloads by 54× (and up to an estimated factor of 104×) compared to standard approaches.
Preprint [PDF]
Authors
Francesco Ventura, Zoi Kaoudi, Jorge Arnulfo Quiane Ruiz, Volker Markl
Abstract
Machine Learning (ML) is quickly becoming a prominent method in many data management components, including query optimizers which have recently shown very promising results. However, the low availability of training data (i.e., large query workloads with execution time or output cardinality as labels) widely limits further advancement in research and compromises the technology transfer from research to industry. Collecting a labeled query workload has a very high cost in terms of time and money due to the development and execution of thousands of realistic queries/jobs. In this work, we face the problem of generating training data for data management components tailored to users’ needs. We present DataFarm, an innovative framework for efficiently generating and labeling large query workloads. We follow a data-driven white box approach to learn from pre-existing small workload patterns, input data, and computational resources. Our framework allows users to produce a large heterogeneous set of realistic jobs with their labels, which can be used by any ML-based data management component. We show that our framework outperforms the current state-of-the-art both in query generation and label estimation using synthetic and real datasets. It has up to 9× better labeling performance, in terms of R2 score. More importantly, it allows users to reduce the cost of getting labeled query workloads by 54× (and up to an estimated factor of 104×) compared to standard approaches.
Preprint [PDF]
>
Authors
Sebastian Baunsgaard, Matthias Boehm, Ankit Chaudhary, Behrouz Derakhshan, Stefan Geißelsöder, Philipp Marian Grulich, Michael Hildebrand, Kevin Innerebner, Volker Markl, Claus Neubauer, Sarah Osterburg, Olga Ovcharenko, Sergey Redyuk, Tobias Rieger, Alireza Rezaei Mahdiraji, Sebastian Benjamin Wrede, Steffen Zeuch
Abstract
Data science workflows are largely exploratory, dealing with under-specified objectives, open-ended problems, and unknown business value. Therefore, little investment is made in systematic acquisition, integration, and pre-processing of data. This lack of infrastructure results in redundant manual effort and computation. Furthermore, central data consolidation is not always technically or economically desirable or even feasible (e.g., due to privacy, and/or data ownership). The ExDRa system aims to provide system infrastructure for this exploratory data science process on federated and heterogeneous, raw data sources. Technical focus areas include (1) ad-hoc and federated data integration on raw data, (2) data organization and reuse of intermediates, and (3) optimization of the data science lifecycle, under awareness of partially accessible data. In this paper, we describe use cases, the system architecture, selected features of SystemDS’ new federated backend, and promising results. Beyond existing work on federated learning, ExDRa focuses on enterprise federated ML and related data pre-processing challenges because, in this context, federated ML has the potential to create a more fine-grained spectrum of data ownership and thus, new markets.
Authors
Sebastian Baunsgaard, Matthias Boehm, Ankit Chaudhary, Behrouz Derakhshan, Stefan Geißelsöder, Philipp Marian Grulich, Michael Hildebrand, Kevin Innerebner, Volker Markl, Claus Neubauer, Sarah Osterburg, Olga Ovcharenko, Sergey Redyuk, Tobias Rieger, Alireza Rezaei Mahdiraji, Sebastian Benjamin Wrede, Steffen Zeuch
Abstract
Data science workflows are largely exploratory, dealing with under-specified objectives, open-ended problems, and unknown business value. Therefore, little investment is made in systematic acquisition, integration, and pre-processing of data. This lack of infrastructure results in redundant manual effort and computation. Furthermore, central data consolidation is not always technically or economically desirable or even feasible (e.g., due to privacy, and/or data ownership). The ExDRa system aims to provide system infrastructure for this exploratory data science process on federated and heterogeneous, raw data sources. Technical focus areas include (1) ad-hoc and federated data integration on raw data, (2) data organization and reuse of intermediates, and (3) optimization of the data science lifecycle, under awareness of partially accessible data. In this paper, we describe use cases, the system architecture, selected features of SystemDS’ new federated backend, and promising results. Beyond existing work on federated learning, ExDRa focuses on enterprise federated ML and related data pre-processing challenges because, in this context, federated ML has the potential to create a more fine-grained spectrum of data ownership and thus, new markets.
>
Authors
Kaustubh Beedkar, Jorge Quiane-Ruiz and Volker Markl
Abstract
In this paper, we address the problem of compliant geo-distributed query processing. In particular, we focus on dataflow policies that impose restrictions on movement of data across geographical or institutional borders. Traditional ways to distributed query processing do not consider such restrictions and therefore in geo-distributed environments may lead to non-compliant query execution plans. For example, an execution plan for a query over data sources from Europe, North America, and Asia, which may otherwise be optimal, may not comply with dataflow policies as a result of shipping some restricted (intermediate) data. We pose this problem of compliance in the setting of geo-distributed query processing. We propose a compliance-based query optimizer that takes into account dataflow policies, which are declaratively specified using our policy expressions, to generate compliant geo-distributed execution plans. Our experimental study using a geo-distributed adaptation of the TPC-H benchmark data indicates that our optimization techniques are effective in generating efficient compliant plans and incur low overhead on top of traditional query optimizers.
Preprint [PDF]
Authors
Kaustubh Beedkar, Jorge Quiane-Ruiz and Volker Markl
Abstract
In this paper, we address the problem of compliant geo-distributed query processing. In particular, we focus on dataflow policies that impose restrictions on movement of data across geographical or institutional borders. Traditional ways to distributed query processing do not consider such restrictions and therefore in geo-distributed environments may lead to non-compliant query execution plans. For example, an execution plan for a query over data sources from Europe, North America, and Asia, which may otherwise be optimal, may not comply with dataflow policies as a result of shipping some restricted (intermediate) data. We pose this problem of compliance in the setting of geo-distributed query processing. We propose a compliance-based query optimizer that takes into account dataflow policies, which are declaratively specified using our policy expressions, to generate compliant geo-distributed execution plans. Our experimental study using a geo-distributed adaptation of the TPC-H benchmark data indicates that our optimization techniques are effective in generating efficient compliant plans and incur low overhead on top of traditional query optimizers.
Preprint [PDF]
>
Authors:
Zihao Chen, Chen Xu, Juan Soto, Volker Markl, Weining Qian, Aoying Zhou
Abstract:
Distributed matrix computation is common in large-scale data processing and machine learning applications. Many iterative-convergent algorithms involving matrix computation share a common property: parameters converge non-uniformly. This property can be exploited to eliminate computational redundancy via incremental evaluation. Existing systems that support distributed matrix computation already explore incremental evaluation. However, they are oblivious to the fact that non-zero increments are scattered in different blocks in a distributed environment. Additionally, we observe that incremental evaluation does not always outperform full evaluation. To address these issues, we propose matrix reorganization to optimize the physical layout upon the state-of-art optimized partition schemes, and thereby accelerate the incremental evaluation. More importantly, we propose a hybrid evaluation to efficiently interleave full and incremental evaluation during the iterative process. In particular, it employs a cost model to compare the overhead costs of two types of evaluations and a selective comparison mechanism to reduce the overhead incurred by comparison itself. To demonstrate the efficiency of our techniques, we implement HyMAC, a hybrid matrix computation system based on SystemML. Our experiments show that HyMAC reduces execution time on large datasets by 23% on average in comparison to the state-of-art optimization technique and consequently outperforms SystemML, ScaLAPACK, and SciDB by an order of magnitude.
Preprint [PDF]
Authors:
Zihao Chen, Chen Xu, Juan Soto, Volker Markl, Weining Qian, Aoying Zhou
Abstract:
Distributed matrix computation is common in large-scale data processing and machine learning applications. Many iterative-convergent algorithms involving matrix computation share a common property: parameters converge non-uniformly. This property can be exploited to eliminate computational redundancy via incremental evaluation. Existing systems that support distributed matrix computation already explore incremental evaluation. However, they are oblivious to the fact that non-zero increments are scattered in different blocks in a distributed environment. Additionally, we observe that incremental evaluation does not always outperform full evaluation. To address these issues, we propose matrix reorganization to optimize the physical layout upon the state-of-art optimized partition schemes, and thereby accelerate the incremental evaluation. More importantly, we propose a hybrid evaluation to efficiently interleave full and incremental evaluation during the iterative process. In particular, it employs a cost model to compare the overhead costs of two types of evaluations and a selective comparison mechanism to reduce the overhead incurred by comparison itself. To demonstrate the efficiency of our techniques, we implement HyMAC, a hybrid matrix computation system based on SystemML. Our experiments show that HyMAC reduces execution time on large datasets by 23% on average in comparison to the state-of-art optimization technique and consequently outperforms SystemML, ScaLAPACK, and SciDB by an order of magnitude.
Preprint [PDF]
>
Authors:
Shuhao Zhang, Yancan Mao, Jiong He, Philipp M. Grulich, Steffen Zeuch, Bingsheng He, Richard T. B. Ma, Volker Markl
Abstract:
The intra-window join (IaWJ), i.e., joining two input streams over a single window, is a core operation in modern stream processing applications. This paper presents the first comprehensive study on parallelizing the IaWJ on modern multicore architectures. In particular, we classify IaWJ algorithms into lazy and eager execution approaches. For each approach, there are further design aspects to consider, including different join methods and partitioning schemes, leading to a large design space. Our results show that none of the algorithms always performs the best, and the choice of the most performant algorithm depends on: (i) workload characteristics, (ii) application requirements, and (iii) hardware architectures. Based on the evaluation results, we propose a decision tree that can guide the selection of an appropriate algorithm.
Preprint [PDF]
Authors:
Shuhao Zhang, Yancan Mao, Jiong He, Philipp M. Grulich, Steffen Zeuch, Bingsheng He, Richard T. B. Ma, Volker Markl
Abstract:
The intra-window join (IaWJ), i.e., joining two input streams over a single window, is a core operation in modern stream processing applications. This paper presents the first comprehensive study on parallelizing the IaWJ on modern multicore architectures. In particular, we classify IaWJ algorithms into lazy and eager execution approaches. For each approach, there are further design aspects to consider, including different join methods and partitioning schemes, leading to a large design space. Our results show that none of the algorithms always performs the best, and the choice of the most performant algorithm depends on: (i) workload characteristics, (ii) application requirements, and (iii) hardware architectures. Based on the evaluation results, we propose a decision tree that can guide the selection of an appropriate algorithm.
Preprint [PDF]