

BIFOLD at BTW 2025

Exploring the Future of Data Management

The 21st Conference on Database Systems for Business, Technology, and Web (BTW 2025) will be held in Bamberg, Germany, from March 3 to 7, 2025. Three BIFOLD research groups — Database Systems and Information Management (DIMA), Big Data Engineering (DAMS), and Information Integration and Data Cleansing (D2IP) — will participate in BTW 2025 and present seven papers. Their contributions highlight recent advances in database research and underline the relevance of cutting-edge technologies in modern data management.

In addition, we would like to highlight a few items from the program: 

  • DBIS dissertation prize: The Databases and Information Systems (DBIS) department of the German Informatics Society (GI) will award two DBIS dissertation prizes on March 7, 2025. These prizes recognize outstanding dissertations in data management and information systems. One of the prize winners is Clemens Lutz, a former doctoral student in the DIMA research group, which is supervised by BIFOLD co-director Prof. Dr. Volker Markl. His thesis, completed in 2022, is titled "Scalable Data Management using GPUs with Fast Interconnects."
  • Keynote: BIFOLD Fellow Julia Stoyanovich will deliver a keynote speech on March 5 titled "Follow the Data! Responsible AI Starts with Responsible Data Management." 
  • Talk: Stefan Grafberger, a PhD student in the BIFOLD DEEM Lab, which is supervised by Prof. Dr. Sebastian Schelter, will give a talk at the Workshop on ML4Sys and Sys4ML titled "mlwhatif: Data-centric What-If Analysis for Native Machine Learning Pipelines."
  • Workshop co-organization: Prof. Dr. Matthias Böhm, who supervises the Big Data Engineering Group at BIFOLD, is part of the organizing team of the "Workshop on ML4Sys and Sys4ML."
  • Reproducibility co-chair: Prof. Dr. Ziawasch Abedjan, who supervises the Information Integration and Data Cleansing Group at BIFOLD, serves as Reproducibility Co-Chair of the conference.
  • Data science challenge: Seminar students of the D2IP research group and Muaid Mughrabi were invited to the second stage of the Data Science Challenge.

The conference serves as a key platform for networking among scientists from the data management community in Germany and neighbouring countries as well as practitioners from industry. By fostering discussions on foundational topics in database and information systems technology and on emerging areas such as data science, data protection, the integration of machine learning, and highly scalable data processing, BTW continues to drive future research in the field.

Below, we provide a list of contributions sorted by research group. 

Incremental Stream Query Merging In Action

  • Authors: Ankit Chaudhary, Ninghong Zhu, Laura Mons, Steffen Zeuch, Varun Pandey, Volker Markl
  • Abstract: Stream Processing Engines (SPEs) execute long-running queries on unbounded data streams. However, they primarily focus on achieving high throughput and low latency for a single query. To deploy multiple queries, the users instead scale the infrastructure, executing each query in isolation. As a result, SPEs overlook potential data and computation-sharing opportunities among several long-running queries. As streaming queries are continuous and long-running, identifying sharing opportunities among newly arriving and existing queries can reduce resource utilization. This allows for deploying more queries without the need to scale the infrastructure. In this demonstration, we present Incremental Stream Query Merging (ISQM), an end-to-end solution to identify and maintain sharing among stream queries. We showcase six different types of sharing identification techniques and their impact on query optimization and execution time.
  • Link PDF
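To give a flavor of what sharing identification means, the following toy sketch (our own illustration, not the ISQM implementation; the query representation and `merge_queries` are invented for this example) groups continuous queries that filter the same source with a syntactically identical predicate, so one shared filter operator could serve all of them instead of running each query in isolation:

```python
from collections import defaultdict

def merge_queries(queries):
    """Toy sharing identification: queries is a list of
    (query_id, source, predicate_repr) tuples. Queries over the same
    source with a syntactically equal predicate are grouped so they
    could share a single filter operator."""
    shared = defaultdict(list)
    for qid, source, pred in queries:
        shared[(source, pred)].append(qid)
    return dict(shared)

# Two of the three queries below can share one filter on 'cars'.
groups = merge_queries([
    ("q1", "cars", "speed > 50"),
    ("q2", "cars", "speed > 50"),
    ("q3", "bikes", "speed > 50"),
])
```

Real SPEs identify sharing at the level of query plans (including partial and containment-based sharing), which is far richer than this syntactic grouping.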

Scalable Data Management using GPUs with Fast Interconnects 

  • Author: Clemens Lutz
  • Abstract: Modern database management systems (DBMSs) are tasked with analyzing terabytes of data, employing a rich set of relational and machine learning operators. To process data at large scales, research efforts have strived to leverage the high computational throughput and memory bandwidth of specialized co-processors such as graphics processing units (GPUs). However, scaling data management on GPUs is challenging because (1) the on-board memory of GPUs has too little capacity for storing large data volumes, while (2) the interconnect bandwidth is not sufficient for ad hoc transfers from main memory. Thus, data management on GPUs is limited by a data transfer bottleneck. In practice, CPUs process large-scale data faster than GPUs, reducing the utility of GPUs for DBMSs. In this thesis, we investigate how a new class of fast interconnects can address the data transfer bottleneck and scale GPU-enabled data management. Fast interconnects link GPU co-processors to a CPU with high bandwidth and cache-coherence. We apply our insights to process stateful and iterative algorithms out-of-core by the examples of a hash join and k-means clustering. We first analyze the hardware properties. Our experiments show that the high interconnect bandwidth enables the GPU to efficiently process large data sets stored in main memory. Furthermore, cache-coherence facilitates new DBMS designs that tightly integrate CPU and GPU via shared data structures and pageable memory allocations. However, irregular accesses from the GPU to main memory are not efficient. Thus, the operator state of, e.g., joins does not scale beyond the GPU memory capacity. We scale joins to a large state by contributing our new Triton join algorithm. Our main insight is that fast interconnects enable GPUs to efficiently spill the join state by partitioning data out-of-core. Thus, our Triton join breaks through the GPU memory capacity limit and increases throughput by up to 2.5× compared to a radix-partitioned join on the CPU. We scale k-means to large data sets by eliminating two key sources of overhead. In existing strategies, execution crosses from the GPU to the CPU on each iteration, which results in the cross-processing and multi-pass problems. In contrast, our solution requires only a single data pass per iteration and speeds up throughput by up to 20×. Overall, GPU-enabled DBMSs are able to overcome the data transfer bottleneck by employing new out-of-core algorithms that take advantage of fast interconnects.
  • Link PDF

Enhancing In-Memory Spatial Indexing with Learned Search

  • Authors: Varun Pandey, Alexander van Renen, Eleni Tzirita Zacharatou, Andreas Kipf, Ibrahim Sabek, Jialin Ding, Volker Markl, Alfons Kemper
  • Abstract: Spatial data is generated daily from numerous sources such as GPS-enabled devices, consumer applications (e.g., Uber, Strava), and social media (e.g., location-tagged posts). This exponential growth in spatial data is driving the development of efficient spatial data processing systems. In this study, we enhance spatial indexing with a machine-learned search technique developed for single-dimensional sorted data. Specifically, we partition spatial data using six traditional spatial partitioning techniques and employ machine-learned search within each partition to support point, range, distance, and spatial join queries. By instance-optimizing each partitioning technique, we demonstrate that:
    • grid-based index structures outperform tree-based ones (from 1.23x to 2.47x),
    • learning-enhanced spatial index structures are faster than their original counterparts (from 1.44x to 53.34x),
    • machine-learned search within a partition is 11.79% - 39.51% faster than binary search when filtering on one dimension,
    • the benefit of machine-learned search decreases in the presence of other compute-intensive operations (e.g. scan costs in higher selectivity queries, Haversine distance computation, and point-in-polygon tests), and
    • index lookup is the bottleneck for tree-based structures, which could be mitigated by linearizing the indexed partitions.
  • Link 1 PDF
  • Link 2 PDF
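The core idea of machine-learned search within a sorted partition can be sketched in a few lines (a minimal illustration with an endpoint-fitted linear model, not the paper's implementation; the class and method names are ours): fit a model that maps a key to its approximate position, then correct the prediction with a bounded local search.

```python
import bisect

class LearnedSearch:
    """Approximate the position of sorted keys with a simple linear
    model, then correct the prediction with a bounded binary search."""

    def __init__(self, keys):
        self.keys = keys
        n = len(keys)
        first, last = keys[0], keys[-1]
        # Fit position ~ slope * key + intercept via the two endpoints
        # (a least-squares fit would be the more typical choice).
        self.slope = (n - 1) / (last - first) if last != first else 0.0
        self.intercept = -self.slope * first
        # Maximum prediction error, measured once after fitting; it
        # bounds how far the local correction search has to look.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the index of key in the partition, or -1 if absent."""
        pos = self._predict(key)
        lo = max(0, pos - self.err)
        hi = min(len(self.keys), pos + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else -1

index = LearnedSearch([0, 1, 2, 3, 100])
```

The speedup over plain binary search comes from the model narrowing the search range to `2 * err + 1` positions, which is small whenever the key distribution within a partition is close to the model's shape.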

[Poster] Achilles' SPEar: Using Metamorphic Testing to Find Bugs in Stream Processing Engines

  • Authors: Magnus Erk Kroner, Adrian Michalke
  • Abstract: Stream Processing Engines (SPEs) are critical for real-time data processing, relying on aggressive optimizations to meet performance demands. Ensuring the reliability of such systems requires robust testing, yet testing remains costly and challenging due to the oracle problem. This paper investigates adapting query partitioning, a metamorphic testing technique, to the domain of SPEs. We present Achilles, an automated testing framework for SPEs. Achilles utilizes query partitioning to automatically generate, execute, and evaluate diverse test cases, reducing manual effort and targeting stream operators like filters and windowed aggregations. Our evaluation highlights Achilles' ability to detect unique bugs, including division-by-zero errors and predicate evaluation flaws. We further analyze framework parameters and demonstrate the effectiveness of predicate-based query partitioning, finding that increasing predicate depth boosts bug detection by 10%.
  • Link PDF
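The metamorphic relation behind predicate-based query partitioning sidesteps the oracle problem: a query restricted by a predicate P and the same query restricted by NOT P must together return exactly the tuples of the unrestricted query. A simplified sketch (our illustration with an in-memory stand-in for the engine under test; the function names are ours):

```python
def run_query(stream, predicate=None):
    """Stand-in for a filter query on an SPE. A real harness would
    submit the query to the engine under test via its client API."""
    return [t for t in stream if predicate is None or predicate(t)]

def check_partitioning(stream, base, partition_pred):
    """Metamorphic check: Q(base) must equal the union of
    Q(base AND P) and Q(base AND NOT P). A mismatch reveals a bug
    without needing a ground-truth oracle."""
    full = run_query(stream, base)
    left = run_query(stream, lambda t: base(t) and partition_pred(t))
    right = run_query(stream, lambda t: base(t) and not partition_pred(t))
    return sorted(full) == sorted(left + right)

ok = check_partitioning(
    list(range(100)),
    base=lambda t: t % 2 == 0,   # the query under test: even tuples
    partition_pred=lambda t: t < 50,  # partitioning predicate P
)
```

A testing framework in this style generates many partitioning predicates automatically; any violated equality points to a defect in predicate evaluation, optimization, or operator implementation.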

Fast, Parameter-free Time Series Anomaly Detection

  • Authors: Kristiyan Blagov, Carlos Enrique Muñiz-Cuza, Matthias Boehm
  • Abstract: Time series anomaly detection is a common problem across many domains. Despite the introduction of numerous algorithms leveraging deep learning, classical machine learning, and data mining techniques, no dominating approach has emerged. A common challenge is extensive parameter tuning and the high computational costs associated with many of these existing methods. To address this problem, this paper proposes a parameter-free anomaly detection algorithm, STAN (summary statistics ensemble). STAN applies a set of summary statistics over sliding windows and compares the results to the normal behavior learned during training. STAN's flexibility allows for integrating different statistical aggregates, which effectively handle diverse types of anomalies. Our evaluation shows that STAN achieves a detection accuracy of 60.4%, close to the widely used MERLIN algorithm (63.6%), while reducing execution time by more than an order of magnitude compared to all baselines.
  • Link PDF
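A stripped-down sketch of the summary-statistics idea (our own illustration under simplifying assumptions, not the STAN algorithm itself; function names are invented): compute a few statistics per sliding window on anomaly-free training data, record the range each statistic takes, and flag test windows whose statistics leave those ranges.

```python
import statistics

def window_stats(series, w):
    """Summary statistics (mean, stdev, value range) per sliding window."""
    return [
        (statistics.mean(win), statistics.pstdev(win), max(win) - min(win))
        for win in (series[i:i + w] for i in range(len(series) - w + 1))
    ]

def fit(train, w):
    """Learn the min/max bounds of each statistic on normal data."""
    columns = list(zip(*window_stats(train, w)))
    return [(min(col), max(col)) for col in columns]

def detect(test, bounds, w):
    """Return start positions of windows whose statistics fall
    outside the bounds learned on the training data."""
    return [
        i for i, stats in enumerate(window_stats(test, w))
        if any(v < lo or v > hi for v, (lo, hi) in zip(stats, bounds))
    ]

bounds = fit([0, 1, 0, 1] * 10, w=4)          # normal: alternating 0/1
flags = detect([0, 1, 0, 1, 5, 1, 0, 1], bounds, w=4)
```

Because the "model" is just the observed range of each statistic, nothing needs tuning; adding further aggregates (e.g., slope or kurtosis) extends the ensemble to other anomaly types.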

Incremental SliceLine for Iterative ML Model Debugging under Updates

  • Authors: Frederic Caspar Zoepffel, Christina Dionysio, Matthias Boehm
  • Abstract: SliceLine is a model debugging technique for finding the top-k worst slices (in terms of conjunctions of attributes) where a trained machine learning (ML) model performs significantly worse than on average. In contrast to other slice finding techniques, SliceLine introduced an intuitive scoring function, effective pruning strategies, and fast linear-algebra-based evaluation strategies. Together, these allow SliceLine to find the exact top-k worst slices in the full lattice of possible conjunctions in reasonable time. Recently, we have observed an increasing trend towards iterative algorithms that incrementally update the dataset (e.g., selecting samples, augmentation with new instances). Fully computing SliceLine from scratch for every update is unnecessarily wasteful. In this paper, we introduce an incremental problem formulation of SliceLine, new pruning strategies that leverage intermediates of previous slice finding runs on a modified dataset, and an extended linear-algebra-based enumeration algorithm. Our experiments show that incremental SliceLine yields robust performance improvements, running up to an order of magnitude faster than full SliceLine, while still allowing effective parallelization in local, distributed, and federated environments.
  • Link 1 PDF
  • Link 2 ZIP

Scalable Computation of Shapley Additive Explanations

  • Authors: Louis Le Page, Christina Dionysio, Matthias Boehm
  • Abstract: The growing field of explainable AI (XAI) develops methods that help better understand ML model predictions. While SHapley Additive exPlanations (SHAP) is a widely-used, model-agnostic method for explaining predictions, its use comes with a significant computational burden, particularly for complex models and large datasets with many features. The key—and so far unaddressed—challenge lies in efficiently scaling these computations without compromising accuracy. In this paper, we present a scalable, model-agnostic SHAP sampling framework on top of Apache SystemDS. We leverage Antithetic Permutation Sampling for its efficiency and optimization potential, and we devise a carefully vectorized and parallelized implementation for local and distributed operations. Compared with the state-of-the-art Python shap package, our solutions yield similar accuracy but achieve significant speedups of up to 14x for multi-threaded single-node operations as well as up to 35x for distributed Spark operations (on a small 8-node cluster).
  • Link 1 PDF
  • Link 2 ZIP
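The sampling strategy the paper builds on can be sketched as follows (a plain-Python illustration of antithetic permutation sampling for Shapley values; the function and its signature are ours, not the SystemDS API). Each sampled feature permutation is paired with its reverse, which cancels much of the sampling noise and reduces estimator variance:

```python
import random

def shapley_antithetic(f, x, background, n_pairs=50, seed=0):
    """Monte-Carlo Shapley estimates via antithetic permutation sampling.
    f: model taking a feature vector (list) and returning a float,
    x: instance to explain,
    background: reference vector supplying 'absent' feature values."""
    rng = random.Random(seed)
    d = len(x)
    phi = [0.0] * d
    n_perms = 0
    for _ in range(n_pairs):
        perm = list(range(d))
        rng.shuffle(perm)
        for order in (perm, perm[::-1]):      # antithetic pair
            z = list(background)
            prev = f(z)
            for j in order:                   # reveal features one by one
                z[j] = x[j]
                cur = f(z)
                phi[j] += cur - prev          # marginal contribution of j
                prev = cur
            n_perms += 1
    return [p / n_perms for p in phi]

# For a linear model, permutation sampling recovers the Shapley
# values exactly: here phi = [2.0, 3.0].
phi = shapley_antithetic(lambda z: 2 * z[0] + 3 * z[1], [1, 1], [0, 0],
                         n_pairs=10)
```

The paper's contribution is making this embarrassingly parallel sampling loop fast at scale through vectorized linear-algebra formulations; this sketch only shows the estimator itself.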

Identifying Semantic Components for PBE-based Transformation Discovery

  • Authors: Dakai Men, Binger Chen, Ziawasch Abedjan
  • Abstract: Complex data transformations involve a combination of syntactic and semantic operations. Recent LLM-based programming-by-example (PBE) approaches aid in finding sequences of syntactic and semantic operations that satisfy given transformation examples. As testing LLM outputs is expensive, such approaches defer the prompting step until after all syntactic operations have been identified. During this process, sequences of tokens that need semantic look-ups are split and their order is lost, leading to lower accuracy. We address this problem by focusing on challenging transformation tasks and propose a pre-processing step that prevents destructive splits of such sequences.
  • Link PDF