BIFOLD at the 2024 ACM SIGMOD/PODS Conference

The annual ACM SIGMOD/PODS Conference is a leading international forum for database researchers, practitioners, developers, and users to explore cutting-edge ideas and results, and to exchange techniques, tools, and experiences. This year, the conference took place in Santiago, Chile, from June 9 to June 14. BIFOLD researchers from the DIMA Group led by Prof. Dr. Volker Markl presented two research papers and a demo. In addition, BIFOLD researchers from the DAMS Group led by Prof. Dr. Matthias Böhm presented a research paper and a demo. A further research paper and a workshop paper involving BIFOLD group leader Prof. Sebastian Schelter were also presented. Furthermore, BIFOLD Prof. Dr. Ziawasch Abedjan, Head of the D2IP (Data Integration and Data Preparation) Lab, participated in a panel on “The Role of Data Management Research for Responsible AI.”

BIFOLD researchers and some of their respective fellows, friends, and alumni.
1st row, ltr: Felix Naumann (HPI), Matthias Weidlich (HU Berlin), Matthias Boehm (DAMS / BIFOLD), Christina Dionysio (DAMS / BIFOLD), Anastasiia Kozar (DIMA / BIFOLD), Stefan Grafberger (CWI | MDS / BIFOLD), Juan Soto (DIMA / BIFOLD), Samira Akili (HU Berlin); 2nd row, ltr: Gereon Dusella (DIMA / BIFOLD), Philipp Grulich (DIMA / BIFOLD), Haralampos Gavriilidis (DIMA / BIFOLD), Ziawasch Abedjan (D2IP / BIFOLD), Lennart Behme (DIMA / BIFOLD), Volker Markl (DIMA / BIFOLD), Odej Kao (DOS / TU Berlin), Steffen Zeuch (DIMA / BIFOLD).

Below, we list the research papers and demos together with their abstracts, and describe the panel discussion.

1. Query Compilation Without Regrets (Research Paper)

Authors: Philipp Grulich (BIFOLD / TU Berlin), Aljoscha Lepping (BIFOLD / TU Berlin), Dwi Nugroho (BIFOLD / TU Berlin), Varun Pandey (BIFOLD / TU Berlin), Bonaventura Del Monte (Observe), Steffen Zeuch (BIFOLD / TU Berlin), and Volker Markl (BIFOLD / TU Berlin)
Abstract: Engineering high-performance query execution engines is a challenging task. Query compilation provides excellent performance but introduces significant system complexity, making the engine hard to build, debug, and maintain. To overcome this complexity, the paper proposes Nautilus, a framework that combines the ease of use of query interpretation with the performance of query compilation. Nautilus features an interpretation-based operator interface and a novel trace-based, multi-backend JIT compiler that translates operators into efficient code, achieving high performance without sacrificing developer productivity.
Link: https://dl.acm.org/doi/10.1145/3654968

2. Fault Tolerance Placement for the Internet of Things (Research Paper)

Authors: Anastasiia Kozar (BIFOLD / TU Berlin), Bonaventura Del Monte (Observe), Steffen Zeuch (BIFOLD / TU Berlin), and Volker Markl (BIFOLD / TU Berlin)
Abstract: Ensuring fault tolerance (FT) in edge computing environments presents unique challenges due to complex network hierarchies and resource-constrained devices. This paper presents a resource-aware FT approach that integrates operator placement (OP) with FT requirements. By treating FT and OP as a combined problem, the approach effectively mitigates failures and significantly improves throughput, outperforming state-of-the-art FT strategies.
Link: https://dl.acm.org/doi/10.1145/3654941

3. SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications (Research Paper)

Authors: Shafaq Siddiqi (Graz University of Technology), Roman Kern (Graz University of Technology), and Matthias Boehm (BIFOLD / TU Berlin)
Abstract: In the exploratory data science lifecycle, data scientists often spend the majority of their time finding, integrating, validating, and cleaning relevant datasets. Despite recent work on data validation, and numerous error detection and correction algorithms, in practice, data cleaning for ML remains largely a manual, unpleasant, and labor-intensive trial and error process, especially in large-scale, distributed computation. However, the target ML application (such as a classification or regression model) can provide valuable feedback for selecting effective data cleaning strategies. In this paper, we introduce SAGA, a framework for automatically generating the top-K most effective data cleaning pipelines. SAGA adopts ideas from Auto-ML, feature selection, and hyper-parameter tuning. Our framework is extensible for user-provided constraints, new data cleaning primitives, and ML applications; automatically generates hybrid runtime plans of local and distributed operations; and performs pruning by interesting properties (e.g., monotonicity). Instead of full automation (which is rather unrealistic), SAGA simplifies the mechanical aspects of data cleaning. Our experiments show that SAGA yields robust accuracy improvements over state-of-the-art, and good scalability regarding increasing data sizes and number of evaluated pipelines.
Link: https://dl.acm.org/doi/abs/10.1145/3617338

4. SchemaPile: A Large Collection of Relational Database Schemas (Research Paper)

Authors: Till Döhmen (University of Amsterdam), Radu Geacu (University of Amsterdam), Madelon Hulsebos (UC Berkeley), Sebastian Schelter (University of Amsterdam)
Abstract: Access to fine-grained schema information is crucial for understanding how relational databases are designed and used in practice, and for building systems that help users interact with them. Furthermore, such information is required as training data to leverage the potential of large language models (LLMs) for improving data preparation, data integration and natural language querying. Existing single-table corpora such as GitTables provide insights into how tables are structured in-the-wild, but lack detailed schema information about how tables relate to each other, as well as metadata like data types or integrity constraints. On the other hand, existing multi-table (or database schema) datasets are rather small and attribute-poor, leaving it unclear to what extent they actually represent typical real-world database schemas.
In order to address these challenges, we present SchemaPile, a corpus of 221,171 database schemas, extracted from SQL files on GitHub. It contains 1.7 million tables with 10 million column definitions, 700 thousand foreign key relationships, seven million integrity constraints, and data content for more than 340 thousand tables. We conduct an in-depth analysis on the millions of schema metadata properties in our corpus, as well as its highly diverse language and topic distribution. In addition, we showcase the potential of SchemaPile to improve a variety of data management applications, e.g., fine-tuning LLMs for schema-only foreign key detection, improving CSV header detection and evaluating multi-dialect SQL parsers. We publish the code and data for recreating SchemaPile and a permissively licensed subset SchemaPile-Perm.
Link: https://dl.acm.org/doi/10.1145/3654975

5. Multi-Backend Zonal Statistics Execution with Raven (Demo Paper)

Authors: Gereon Dusella (BIFOLD / TU Berlin), Haralampos Gavriilidis (BIFOLD / TU Berlin), Laert Nuhu (Deutsche Kreditbank AG), Volker Markl (BIFOLD / TU Berlin), and Eleni Tzirita Zacharatou (ITU Copenhagen)
Abstract: The recent explosion in the number and size of spatial remote sensing datasets from satellite missions creates new opportunities for data-driven approaches in domains such as climate change monitoring and disaster management. These approaches typically involve a feature engineering step that summarizes remote sensing pixel data located within zones of interest defined by another spatial dataset, an operation called zonal statistics. Although several spatial systems support zonal statistics operations, they differ significantly in terms of interfaces, architectures, and algorithms, making it hard for users to select the best system for a specific workload. To address this limitation, we propose Raven, a zonal statistics framework that provides users with a unified interface across multiple execution backends, while facilitating easy benchmarking and comparisons across systems. This demonstration showcases Raven’s multi-backend execution environment, domain-specific declarative language, optimization techniques, and benchmarking capabilities.
Link: https://dl.acm.org/doi/10.1145/3626246.3654730
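For readers unfamiliar with the zonal statistics operation mentioned in the Raven abstract, the following is a minimal sketch in plain Python. It is not Raven's interface or algorithm: the function name zonal_mean, the grid layout, and the assumption that zones have already been rasterized into a grid of zone IDs are all illustrative choices made here.

```python
from collections import defaultdict


def zonal_mean(raster, zone_ids):
    """Return {zone_id: mean pixel value} for two equally shaped 2D grids.

    Hypothetical helper: `raster` holds pixel values, `zone_ids` holds the
    zone each pixel falls into (None if the pixel lies outside every zone).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for raster_row, zone_row in zip(raster, zone_ids):
        for value, zone in zip(raster_row, zone_row):
            if zone is None:  # skip pixels that belong to no zone
                continue
            sums[zone] += value
            counts[zone] += 1
    return {zone: sums[zone] / counts[zone] for zone in sums}


# Toy data: a 2x3 raster and a matching grid of zone labels.
raster = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
zone_ids = [
    ["a", "a", None],
    ["b", "b", "b"],
]
result = zonal_mean(raster, zone_ids)
```

Real systems typically start from vector polygons rather than a pre-rasterized zone grid, and how they perform that step is one of the ways the backends described in the abstract can differ.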

6. PLUTUS: Understanding Data Distribution Tailoring for Machine Learning (Demo Paper)

Authors: Jiwon Chang (University of Rochester), Christina Dionysio (BIFOLD / TU Berlin), Fatemeh Nargesian (University of Rochester), and Matthias Boehm (BIFOLD / TU Berlin)
Abstract: Existing data debugging tools allow users to trace model performance problems all the way to the data by efficiently identifying slices (conjunctions of features and values) for which a trained model performs significantly worse than on the entire dataset. To ensure accurate and fair models, one solution is to acquire enough data for these slices. In addition to crowdsourcing, recent data acquisition techniques design cost-effective algorithms to obtain such data from a union of external sources such as data lakes and data markets. We demonstrate PLUTUS, a tool for human-in-the-loop and model-aware data acquisition pipelines on SystemDS, an open-source ML system for the end-to-end data science lifecycle. In PLUTUS, a user can efficiently identify problematic slices, connect to external data sources, and acquire the right amount of data for these slices in a cost-effective manner.
Link: https://dl.acm.org/doi/10.1145/3626246.3654745
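To make the notion of a slice in the PLUTUS abstract concrete, here is a small sketch in plain Python that enumerates conjunctions of feature values and compares their error rates with the overall error rate. It is only an illustration under simplifying assumptions: the function slice_error_rates, its parameters, and the toy data are hypothetical, a plain threshold stands in for a proper significance test, and this is not the algorithm used by PLUTUS or SystemDS.

```python
from itertools import combinations


def slice_error_rates(rows, max_features=2, min_size=2):
    """Map each slice, a tuple of (feature, value) pairs, to its error rate.

    Hypothetical helper: `rows` is a list of (features_dict, is_error) pairs.
    Slices with fewer than `min_size` rows are dropped as too small to judge.
    """
    stats = {}  # slice -> [number of errors, number of rows]
    for features, is_error in rows:
        items = sorted(features.items())
        for k in range(1, max_features + 1):
            for combo in combinations(items, k):
                entry = stats.setdefault(combo, [0, 0])
                entry[0] += int(is_error)
                entry[1] += 1
    return {s: e / n for s, (e, n) in stats.items() if n >= min_size}


# Toy predictions: each row records its features and whether the model erred.
rows = [
    ({"country": "DE", "device": "mobile"}, True),
    ({"country": "DE", "device": "mobile"}, True),
    ({"country": "DE", "device": "desktop"}, False),
    ({"country": "US", "device": "mobile"}, False),
    ({"country": "US", "device": "desktop"}, False),
    ({"country": "US", "device": "desktop"}, False),
]
overall = sum(err for _, err in rows) / len(rows)
rates = slice_error_rates(rows)
# Slices on which the model does worse than on the dataset as a whole.
worse = {s: r for s, r in rates.items() if r > overall}
```

In this toy data, the slices flagged as worse than average are the candidates for which additional data would be acquired.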

7. Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines" (Workshop Paper)

Authors: Stefan Grafberger (University of Amsterdam), Paul Groth (University of Amsterdam), and Sebastian Schelter (BIFOLD / TU Berlin)
Link: https://dl.acm.org/doi/abs/10.1145/3650203.3663327

8. The Role of Data Management Research for Responsible AI (Panel)

Participants: Steven Whang (KAIST EE and AI), Felix Naumann (Hasso-Plattner-Institut), Boris Glavic (University of Illinois at Chicago), Fatemeh Nargesian (University of Rochester), Leopoldo Bertossi (Carleton University), and Ziawasch Abedjan (BIFOLD / TU Berlin), who leads the Data Integration and Data Preparation Group at BIFOLD.
Description: The first workshop on Governance, Understanding, and Integration of Data for Effective and Responsible AI (GUIDE-AI) hosted a panel on “The Role of Data Management Research for Responsible AI.”
Link: https://guide-ai-workshop.github.io/panel

For more details about the conference click on this link: 2024 ACM SIGMOD/PODS Conference.

Incidentally, the next SIGMOD/PODS Conference will take place in Berlin, Germany, from June 22 to June 27, 2025, with BIFOLD researchers playing an important role in organizing it. For more details about next year’s conference, click on this link: 2025 ACM SIGMOD/PODS Conference.