
EDBT/ICDT 2025 Conference Contributions

Four Research Groups Represent BIFOLD

Special Highlights: Panel Discussion & Best Paper Award Nomination

Four research groups from BIFOLD will actively contribute to the joint conference on data management, EDBT/ICDT 2025 (International Conference on Extending Database Technology / International Conference on Database Theory), which will be held from March 25 to 28, 2025, in Barcelona, Spain. The participating groups are as follows:

  1. Database Systems and Information Management (DIMA), led by BIFOLD co-director Prof. Dr. Volker Markl.
  2. Big Data Engineering (DAMS), led by Prof. Dr. Matthias Böhm.
  3. Information Integration and Data Cleansing (D2IP), led by Prof. Dr. Ziawasch Abedjan.
  4. Data Engineering for Machine Learning (DEEM Lab), led by Prof. Dr. Sebastian Schelter.

These groups will present a total of seven research papers at the conference, showcasing their latest advancements in database technologies, big data engineering, and information integration. 

Additionally:

  • Volker Markl will participate as a panelist in the Special EDBT/ICDT Joint Event on the Theory and Practice of Query Processing.
  • Ziawasch Abedjan (D2IP) and Zeyu Zhang (DEEM Lab) will each give a presentation as part of the Next-Generation Data Management Systems lecture series.
  • Special congratulations to Arnab Phani and Matthias Böhm: their paper, "MEMPHIS: Holistic Lineage-based Reuse and Memory Management for Multi-backend ML Systems", has won the EDBT 2025 Best Paper Award.

BIFOLD researchers also contribute to the conference in several organizational roles:

  • Senior Program Committee Member: Matthias Böhm
  • Program Committee Members: Steffen Zeuch, Patrick Damme
  • Workshop Co-chair: Matthias Böhm
  • PhD Workshop Program Committee Member: Ziawasch Abedjan

Below, we provide a list of contributions sorted by research group. 

  • Title: Efficiently Indexing Large Data on GPUs with Fast Interconnects
  • Authors: Josef Schmeißer, Clemens Lutz, Volker Markl
  • Abstract: Modern GPUs have long been capable of processing queries at a high throughput. However, until recently, GPUs faced slow data transfers from CPU main memory, and thus did not reach high processing rates for large, out-of-core data. To cope, database management systems (DBMSs) restrict their data access path to bulk data transfers orchestrated by the CPU, i.e., table scans. When queries expose selectivity, a full table scan wastes bandwidth, leaving performance on the table. With the arrival of fast interconnects, this design choice must be reconsidered. GPUs can directly access data at up to 7× higher bandwidth, whereby bytes are loaded on-demand. We investigate four classic and recent index structures (binary search, B+tree, Harmonia, and RadixSpline), which we access via a fast interconnect. We show that indexing data can reduce transfer volume. However, when embedded into an index-nested loop join, we find that all indexes fail to outperform a hash join in the most interesting case: a highly selective query on large data (over 100 GiB). Therefore, we propose windowed partitioning, an index lookup optimization that generalizes to any index. As a result, index-nested loop joins run up to 3–10× faster than a hash join. Overall, we show that out-of-core indexes are a feasible design choice to exploit selectivity when using a fast interconnect.
  • DOI: https://doi.org/10.48786/EDBT.2025.53 
  • Research track
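
The selectivity argument in the abstract can be illustrated with a toy byte count: under a highly selective query, on-demand index probes over a fast interconnect move only a few tree nodes and matching rows, while a CPU-orchestrated table scan must transfer the entire table. The following sketch is purely illustrative; the function names, tree parameters, and workload numbers are hypothetical and not taken from the paper.

```python
# Toy back-of-the-envelope comparison: bytes moved over the interconnect
# by a full table scan vs. on-demand B+tree-style index probes.

def scan_bytes(num_rows: int, row_bytes: int) -> int:
    """A full table scan transfers every row."""
    return num_rows * row_bytes

def index_probe_bytes(num_matches: int, tree_height: int,
                      node_bytes: int, row_bytes: int) -> int:
    """Each lookup loads one node per tree level plus the matching row."""
    return num_matches * (tree_height * node_bytes + row_bytes)

# Hypothetical workload: ~100 GB table, highly selective query (0.01%).
num_rows, row_bytes = 1_000_000_000, 100
matches = num_rows // 10_000

full_scan = scan_bytes(num_rows, row_bytes)
probes = index_probe_bytes(matches, tree_height=4,
                           node_bytes=4096, row_bytes=row_bytes)

print(f"full scan: {full_scan / 1e9:.1f} GB, index probes: {probes / 1e9:.1f} GB")
```

Even with generous node sizes, the probe volume here stays orders of magnitude below the scan volume, which is why selective access paths become attractive once the interconnect no longer penalizes fine-grained transfers.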
     
  • Title: CompoDB: A Demonstration of Modular Data Systems in Practice
  • Authors: Haralampos Gavriilidis, Lennart Behme, Christian Munz, Varun Pandey, Volker Markl
  • Abstract: The increasing demand for specialization in data management systems (DMSes) has driven a shift from monolithic architectures to modular, composable designs. This approach enables the reuse and integration of diverse components, providing flexibility to tailor systems for specific workloads. In this demonstration, we present CompoDB, a framework for composing modular DMSes using standardized interfaces for query parsers, optimizers, and execution engines. CompoDB includes built-in benchmarking functionality to evaluate the performance of DMS compositions by systematically measuring trade-offs across
  • Link: https://www.openproceedings.org/2025/conf/edbt/paper-316.pdf
  • Demo track
     
  • Title: Enabling Complex Event Processing in NebulaStream
  • Authors: Ariane Ziehn, Lily Seidl, Samira Akili, Steffen Zeuch, Volker Markl
  • Abstract: Complex Event Processing (CEP) and Analytical Stream Processing (ASP) are two dominant paradigms for extracting knowledge from unbounded data streams. While CEP functionality is essential for detecting interesting patterns in vast data volumes, traditional CEP systems often face scalability limitations. To address these limitations, state-of-the-art solutions piggyback on cloud-optimized ASP systems for enhanced scalability and performance. The most common solution embeds CEP functionality as a single unary operator within the ASP execution pipeline. However, this design introduces conceptual bottlenecks, hindering the full utilization of ASP optimizations. To tackle this, we analyzed the synergies between both paradigms and proposed a general operator mapping in recent work. Our mapping translates CEP operators into their ASP counterparts, overcoming the bottlenecks of the unary operator solution. In this demonstration, we integrate our mapping with a declarative pattern specification language tailored to the requirements of ASP systems. This integration automates the mapping process, seamlessly translating high-level pattern definitions into optimized ASP query plans. Our demonstration showcases this approach within the ASP system NebulaStream. It allows the audience to submit declarative CEP patterns via its UI and explore the corresponding query plans resulting from our mapping. Additionally, we guide the audience through key optimization opportunities enabled by our mapping, which are unattainable with the unary operator solution.
  • Link: https://openproceedings.org/2025/conf/edbt/paper-331.pdf 
  • Demo track
     

  • Title: MEMPHIS: Holistic Lineage-based Reuse and Memory Management for Multi-backend ML Systems
  • Authors: Arnab Phani, Matthias Böhm
  • Abstract: Modern machine learning (ML) systems leverage multiple backends, including CPUs, GPUs, and distributed execution platforms like Apache Spark or Ray. Depending on workload and cluster characteristics, these systems typically compile an ML pipeline into hybrid plans of in-memory CPU, GPU, and distributed operations. Prior work found that exploratory data science processes exhibit a high degree of redundancy, and accordingly applied tailor-made techniques for reusing intermediates in specific backend scenarios. However, achieving efficient holistic reuse in multi-backend ML systems remains a challenge due to its tight coupling with other aspects such as memory management, data exchange, and operator scheduling. In this paper, we introduce MEMPHIS, a principled framework for holistic, application-agnostic, multi-backend reuse and memory management. MEMPHIS's core component is a hierarchical lineage-based reuse cache, which acts as a unified abstraction and manages the reuse, recycling, exchange, and cache eviction across different backends. To address challenges of different backends such as lazy evaluation, asynchronous execution, memory allocation overheads, small available memory, and different interconnect bandwidths, we devise a suite of cache management policies. Moreover, we extend an optimizing ML system compiler by special operators for asynchronous exchange, workload-aware speculative cache management, and related operator ordering for concurrent execution. Our experiments across diverse ML tasks and pipelines show improvements up to 9.6x compared to state-of-the-art ML systems.
  • Link: https://openproceedings.org/2025/conf/edbt/paper-82.pdf 
  • Research track
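
The central idea of a lineage-based reuse cache, as described in the abstract, can be sketched in a few lines: intermediates are keyed by a hash over the operator and the lineage of its inputs, so that a repeated (sub-)computation is detected as a cache hit instead of being recomputed. This is a minimal, single-backend toy; the class and method names are hypothetical and omit everything MEMPHIS actually contributes (hierarchy, eviction policies, cross-backend exchange).

```python
import hashlib

class LineageCache:
    """Minimal sketch of a lineage-keyed reuse cache: intermediates are
    keyed by a hash of the operator name and the lineage keys of its
    inputs, so identical (sub-)computations can be detected and reused."""

    def __init__(self):
        self._cache = {}

    @staticmethod
    def lineage_key(op: str, *input_keys: str) -> str:
        payload = op + "|" + "|".join(input_keys)
        return hashlib.sha256(payload.encode()).hexdigest()

    def execute(self, op: str, fn, *inputs):
        # inputs are (key, value) pairs produced by earlier execute() calls
        key = self.lineage_key(op, *(k for k, _ in inputs))
        if key in self._cache:                 # reuse hit: skip recompute
            return key, self._cache[key]
        result = fn(*(v for _, v in inputs))   # miss: compute and cache
        self._cache[key] = result
        return key, result

cache = LineageCache()
x = ("input:X", [1, 2, 3])                     # source data with fixed lineage
k1, r1 = cache.execute("square", lambda v: [e * e for e in v], x)
k2, r2 = cache.execute("square", lambda v: [e * e for e in v], x)
assert k1 == k2 and r1 is r2                   # second call is a cache hit
```

Hashing lineage rather than data values is what makes the approach cheap enough to apply holistically: equality of computations can be decided without materializing or comparing the intermediates themselves.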

  • Title: MaTElDa: Multi-Table Error Detection
  • Authors: Fatemeh Ahmadi, Marc Speckmann, Malte Kuhlmann, Ziawasch Abedjan
  • Abstract: As data-driven applications gain popularity, ensuring high data quality is a growing concern. Yet, data cleaning techniques are limited to treating one table at a time. A table-by-table application of such methods is cumbersome, because these methods either require previous knowledge about constraints or often require labor-intensive configurations and manual labeling for each individual table. As a result, they hardly scale beyond a few tables and miss the chance for optimizing the cleaning process. To tackle these issues, we introduce a novel semi-supervised error detection approach, Matelda, that organizes a given set of tables by folding their cells with regard to domain and quality similarity to facilitate user supervision. We propose a unified feature embedding that makes cell values comparable across tables. Experimental evaluations demonstrate that Matelda outperforms various configurations of existing single-table cleaning methodologies in the multi-table scenario.
  • Link: https://openproceedings.org/2025/conf/edbt/paper-98.pdf 
  • Research track 
     
  • Title: How Green is AutoML for Tabular Data?
  • Authors: Felix Neutatz, Marius Lindauer, Ziawasch Abedjan
  • Abstract: AutoML has risen to one of the most commonly used tools for day-to-day data science pipeline development, and several popular packages exist. While AutoML systems support data scientists during the tedious process of pipeline generation, they can lead to high computation costs that result from extensive search or pre-training. In light of concerns with regard to the environment and the need for Green IT, we holistically analyze the computational cost of pipelines generated through various AutoML systems by combining the cost of system development, execution, and the downstream inference cost. Our findings show the benefits and disadvantages of implementation designs and their potential for Green AutoML.
  • Link: https://openproceedings.org/2025/conf/edbt/paper-97.pdf 
  • Experiments & Analysis Papers 
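
The holistic cost accounting described in the abstract, combining development, execution, and downstream inference cost, can be sketched as simple arithmetic. The function and the two example systems below are hypothetical, with illustrative numbers not taken from the paper; the point is only that the ranking of systems can flip once inference volume is included.

```python
def holistic_cost(dev_cost: float, exec_cost: float,
                  inference_cost: float, num_inferences: int) -> float:
    """Total cost of an AutoML-generated pipeline: one-off development
    and search/execution cost plus per-prediction inference cost."""
    return dev_cost + exec_cost + inference_cost * num_inferences

# Hypothetical systems: A searches longer but yields a cheap model;
# B is quick to develop but produces an expensive-to-serve model.
a = lambda n: holistic_cost(dev_cost=50.0, exec_cost=10.0,
                            inference_cost=0.001, num_inferences=n)
b = lambda n: holistic_cost(dev_cost=5.0, exec_cost=2.0,
                            inference_cost=0.01, num_inferences=n)

print(a(1_000), b(1_000))        # low volume: B is cheaper overall
print(a(100_000), b(100_000))    # high volume: A's cheap model wins
```

This is why an analysis that stops at search cost can mislabel a system as "green": the downstream inference term dominates once a pipeline is deployed at scale.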

  • Title: A Deep Dive Into Zero-Shot Entity Matching with Large and Small Language Models
  • Authors: Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter
  • Abstract: Entity matching (EM) is the problem of determining whether two data records refer to the same real-world entity. A particularly challenging scenario is cross-dataset entity matching, where the matcher has to work with an unseen target dataset for which no labelled examples are available. Cross-dataset EM is crucial in scenarios where a high level of automation is required, and where it is unlikely or impractical to force a domain expert to manually label training data. Recently, approaches based on language models have become popular for EM, and often promise impressive transfer capabilities. However, there is a lack of a comprehensive and systematic study of the cross-dataset EM capabilities of these recent approaches. It is unclear which categories of language models are actually applicable in a cross-dataset EM setting, how well current EM approaches perform when they are evaluated systematically under a cross-dataset setting, and what the relationship between the prediction quality and deployment cost of various large language model-based EM approaches is. We address these open questions with the first comprehensive and systematic study on cross-dataset entity matching, where we evaluate eight matchers on 11 benchmark datasets, cover a wide variety of model sizes and transfer learning approaches, and also explore and quantify the relation between prediction quality and deployment cost of the matching approaches. We find that fine-tuned small models can perform on par with prompted large models, that data-centric approaches outperform model-centric approaches, and that approaches using well-performing small models can be deployed at orders of magnitude lower cost than comparably performing approaches with large commercial models.
  • Link: https://openproceedings.org/2025/conf/edbt/paper-224.pdf  
  • Experiments & Analysis Papers
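
The cross-dataset evaluation protocol described in the abstract can be sketched as a small loop: the matcher is developed without any labels from the target dataset and then scored on labeled pairs from that unseen dataset. Everything below is a hypothetical illustration, not the paper's setup; the string-similarity matcher is a trivial stand-in for wherever a language-model-based matcher would go.

```python
from difflib import SequenceMatcher

def similarity_matcher(a: str, b: str, threshold: float = 0.8) -> bool:
    """Trivial placeholder matcher; an LM-based matcher would slot in here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def evaluate_cross_dataset(matcher, pairs):
    """pairs: (record_a, record_b, is_match) triples from an UNSEEN target
    dataset; no labels from it were used to build or tune the matcher."""
    correct = sum(matcher(a, b) == y for a, b, y in pairs)
    return correct / len(pairs)

# Tiny hypothetical target dataset the matcher never saw during development.
target = [
    ("Apple iPhone 13 128GB", "apple iphone 13 128 gb", True),
    ("Dell XPS 13 Laptop", "Lenovo ThinkPad X1", False),
]
print(evaluate_cross_dataset(similarity_matcher, target))
```

Keeping the matcher interface this narrow (two records in, one boolean out) is what makes it possible to compare small fine-tuned models and prompted large models under the same protocol, which is the comparison the study quantifies.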