
EDBT/ICDT 2025 Conference Contributions

DIMA Researchers at the 28th International Conference on Extending Database Technology

DIMA researchers presented one short paper and two demos at the 28th International Conference on Extending Database Technology (EDBT), which took place from March 25th to March 28th in Barcelona, Spain. The conference promotes and supports research and progress in database and information systems technology and applications.

The short paper originating from the Database Systems and Information Management (DIMA) Group, headed by BIFOLD co-Director Prof. Dr. Volker Markl, is titled “Efficiently Indexing Large Data on GPUs with Fast Interconnects.” It is authored by Josef Schmeißer, Clemens Lutz, and Volker Markl. In the paper, the researchers investigate four classic and recent index structures (binary search, B+tree, Harmonia, and RadixSpline), which they access via a fast interconnect. They show that indexing data can reduce transfer volume. However, when embedded into an index-nested loop join, they find that all indexes fail to outperform a hash join in the most interesting case: a highly selective query on large data (over 100 GiB). Therefore, they propose windowed partitioning, an index lookup optimization that generalizes to any index. As a result, index-nested loop joins run up to 3–10× faster than a hash join. Overall, they show that out-of-core indexes are a feasible design choice to exploit selectivity when using a fast interconnect.
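The core idea of windowed partitioning can be sketched in plain Python: lookup keys are grouped into windows so that consecutive probes touch nearby index regions, which on a GPU with a fast interconnect translates into better transfer locality. The function name and window heuristic below are illustrative, not the paper's actual implementation:

```python
import bisect

def windowed_index_lookup(sorted_data, keys, window_size=4):
    """Illustrative sketch of windowed partitioning: sort the lookup keys,
    then probe the index window by window, so consecutive probes touch
    nearby regions of the indexed data (better transfer locality)."""
    results = []
    ordered = sorted(keys)
    for start in range(0, len(ordered), window_size):
        window = ordered[start:start + window_size]
        for k in window:
            # Binary search stands in for any of the four evaluated indexes.
            pos = bisect.bisect_left(sorted_data, k)
            found = pos < len(sorted_data) and sorted_data[pos] == k
            results.append((k, pos if found else None))
    return dict(results)
```

The paper applies the optimization to all four evaluated index structures; the sketch uses binary search as the simplest stand-in.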

The first demo, “CompoDB: A Demonstration of Modular Data Systems in Practice,” is authored by Haralampos Gavriilidis, Lennart Behme, Christian Munz, Varun Pandey, and Volker Markl. In this demonstration, the researchers present CompoDB, a framework for composing modular DMSes using standardized interfaces for query parsers, optimizers, and execution engines. CompoDB includes built-in benchmarking functionality to evaluate the performance of DMS compositions by systematically measuring their trade-offs.
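The composition idea can be illustrated with a minimal sketch, assuming hypothetical interfaces (CompoDB's real APIs may differ): a composition bundles swappable parser, optimizer, and engine components behind fixed function signatures, so any component can be exchanged without touching the others:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Composition:
    """Hypothetical stand-in for CompoDB's standardized interfaces."""
    parse: Callable[[str], list]           # query string -> logical plan
    optimize: Callable[[list], list]       # logical plan  -> optimized plan
    execute: Callable[[list, dict], list]  # plan + tables -> result rows

def run(comp: Composition, query: str, tables: dict) -> list:
    """Drive any composition through the same parse/optimize/execute pipeline."""
    return comp.execute(comp.optimize(comp.parse(query)), tables)

# A toy composition: whitespace "parser", no-op optimizer, scan-only engine.
toy = Composition(
    parse=lambda q: q.split(),
    optimize=lambda plan: plan,  # swappable per composition
    execute=lambda plan, tables: tables[plan[1]] if plan[0] == "SCAN" else [],
)
```

Benchmarking a composition then amounts to timing `run` for different component combinations.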

The second demo, “Enabling Complex Event Processing in NebulaStream,” is authored by Ariane Ziehn, Lily Seidl, Samira Akili, Steffen Zeuch, and Volker Markl. Complex Event Processing (CEP) and Analytical Stream Processing (ASP) are two dominant paradigms for extracting knowledge from unbounded data streams. While CEP functionality is essential for detecting interesting patterns in vast data volumes, traditional CEP systems often face scalability limitations. To address these limitations, state-of-the-art solutions piggyback on cloud-optimized ASP systems for enhanced scalability and performance. The most common solution embeds CEP functionality as a single unary operator within the ASP execution pipeline. However, this design introduces conceptual bottlenecks, hindering the full utilization of ASP optimizations. To tackle this, the researchers analyzed the synergies between both paradigms and proposed a general operator mapping in recent work. The demonstration showcases this mapping within the ASP system NebulaStream, where the audience can submit declarative CEP patterns and explore the resulting query plans.
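A minimal sketch of the operator-mapping idea, using toy event tuples rather than NebulaStream's actual operators: a CEP sequence pattern SEQ(A, B) within a time window can be expressed as an ASP-style windowed join over the stream (names below are illustrative assumptions):

```python
def seq_pattern_as_join(events, type_a, type_b, window):
    """Illustrative mapping of the CEP pattern SEQ(A, B) onto an ASP-style
    windowed join: pair every A-event with each later B-event that arrives
    within `window` time units. Events are (timestamp, type) tuples."""
    matches = []
    for ta, ka in events:
        if ka != type_a:
            continue
        for tb, kb in events:
            # Temporal order (ta < tb) encodes the sequence semantics;
            # the upper bound encodes the window constraint.
            if kb == type_b and ta < tb <= ta + window:
                matches.append((ta, tb))
    return matches
```

Expressing the pattern as a join, rather than as one opaque unary operator, is what exposes it to the ASP system's existing join and window optimizations.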

In addition, Volker Markl served as a panelist in the Special EDBT/ICDT Joint Event on the Theory and Practice of Query Processing: Insights from Theory, Systems, and Industry. At the event, he gave a talk on “Theory in Systems - Query Optimization, Data Stream Processing and Beyond.” The panel discussed how the theory and systems communities can work together to create principled and practical solutions to important challenges in data management.

DEEM Researchers at the 28th International Conference on Extending Database Technology

DEEM researchers presented one full paper at EDBT. The full paper, originating from the DEEM Lab headed by Prof. Dr.-Ing. Sebastian Schelter, is titled “A Deep Dive Into Zero-Shot Entity Matching with Large and Small Language Models.” It is authored by Zeyu Zhang, Paul Groth, Iacer Calixto, and Sebastian Schelter. This experimental paper tackles entity matching (EM), the problem of determining whether different data records represent the same real-world entity. The task becomes particularly challenging when matching entities across datasets without pre-existing labels. The study is the first in-depth analysis of how various language model-based techniques handle this task on unfamiliar datasets. The paper evaluates eight different matching systems on eleven benchmark datasets, covering a range of model sizes and transfer learning strategies and analyzing both the accuracy of the match predictions and the costs of deploying these technologies. The findings indicate that smaller, fine-tuned models can achieve results comparable to those of larger models, and that data-centric approaches outperform model-centric ones. Notably, smaller models also incur a significantly lower deployment cost than larger commercial models.

In addition, Zeyu Zhang from the DEEM Lab gave an invited talk in the “Next-Generation Data Management Systems” industry event at the conference, where he presented previous research on the efficient use of language models for table data preparation.
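The shape of a zero-shot EM pipeline can be sketched as follows. The record serialization mirrors common LM-based practice, while the token-overlap score is only a stand-in for a language-model matcher (the paper evaluates actual LMs; all names and the threshold here are illustrative):

```python
def serialize(record: dict) -> str:
    """Flatten a record into 'col: val' text, as LM-based matchers commonly do."""
    return " ".join(f"{k}: {v}" for k, v in sorted(record.items()))

def zero_shot_match(r1: dict, r2: dict, threshold=0.5) -> bool:
    """Stand-in for an LM-based matcher: Jaccard token overlap over the
    serialized records. Zero-shot means no labelled training examples
    from the target dataset are used anywhere in the pipeline."""
    t1, t2 = set(serialize(r1).split()), set(serialize(r2).split())
    return len(t1 & t2) / len(t1 | t2) >= threshold
```

In the paper's setting, the scoring function would be a prompted large model or a fine-tuned small model; the surrounding pipeline (serialize, score, threshold) stays the same.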

  • Title: Efficiently Indexing Large Data on GPUs with Fast Interconnects
  • Authors: Josef Schmeißer, Clemens Lutz, Volker Markl
  • Abstract: Modern GPUs have long been capable of processing queries at a high throughput. However, until recently, GPUs faced slow data transfers from CPU main memory, and thus did not reach high processing rates for large, out-of-core data. To cope, database management systems (DBMSs) restrict their data access path to bulk data transfers orchestrated by the CPU, i.e., table scans. When queries expose selectivity, a full table scan wastes bandwidth, leaving performance on the table. With the arrival of fast interconnects, this design choice must be reconsidered. GPUs can directly access data at up to 7× higher bandwidth, whereby bytes are loaded on-demand. We investigate four classic and recent index structures (binary search, B+tree, Harmonia, and RadixSpline), which we access via a fast interconnect. We show that indexing data can reduce transfer volume. However, when embedded into an index-nested loop join, we find that all indexes fail to outperform a hash join in the most interesting case: a highly selective query on large data (over 100 GiB). Therefore, we propose windowed partitioning, an index lookup optimization that generalizes to any index. As a result, index-nested loop joins run up to 3–10× faster than a hash join. Overall, we show that out-of-core indexes are a feasible design choice to exploit selectivity when using a fast interconnect.
  • DOI: https://doi.org/10.48786/EDBT.2025.53 
  • Research track
     
  • Title: CompoDB: A Demonstration of Modular Data Systems in Practice
  • Authors: Haralampos Gavriilidis, Lennart Behme, Christian Munz, Varun Pandey, Volker Markl
  • Abstract: The increasing demand for specialization in data management systems (DMSes) has driven a shift from monolithic architectures to modular, composable designs. This approach enables the reuse and integration of diverse components, providing flexibility to tailor systems for specific workloads. In this demonstration, we present CompoDB, a framework for composing modular DMSes using standardized interfaces for query parsers, optimizers, and execution engines. CompoDB includes built-in benchmarking functionality to evaluate the performance of DMS compositions by systematically measuring trade-offs.
  • Link: https://www.openproceedings.org/2025/conf/edbt/paper-316.pdf
  • Demo track
     
  • Title: Enabling Complex Event Processing in NebulaStream
  • Authors: Ariane Ziehn, Lily Seidl, Samira Akili, Steffen Zeuch, Volker Markl
  • Abstract: Complex Event Processing (CEP) and Analytical Stream Processing (ASP) are two dominant paradigms for extracting knowledge from unbounded data streams. While CEP functionality is essential for detecting interesting patterns in vast data volumes, traditional CEP systems often face scalability limitations. To address these limitations, state-of-the-art solutions piggyback on cloud-optimized ASP systems for enhanced scalability and performance. The most common solution embeds CEP functionality as a single unary operator within the ASP execution pipeline. However, this design introduces conceptual bottlenecks, hindering the full utilization of ASP optimizations. To tackle this, we analyzed the synergies between both paradigms and proposed a general operator mapping in recent work. Our mapping translates CEP operators into their ASP counterparts, overcoming the bottlenecks of the unary operator solution. In this demonstration, we integrate our mapping with a declarative pattern specification language tailored to the requirements of ASP systems. This integration automates the mapping process, seamlessly translating high-level pattern definitions into optimized ASP query plans. Our demonstration showcases this approach within the ASP system NebulaStream. It allows the audience to submit declarative CEP patterns via its UI and explore the corresponding query plans resulting from our mapping.
    Additionally, we guide the audience through key optimization opportunities enabled by our mapping, which are unattainable with the unary operator solution.
  • Link: https://openproceedings.org/2025/conf/edbt/paper-331.pdf 
  • Demo track
     

  • Title: MEMPHIS: Holistic Lineage-based Reuse and Memory Management for Multi-backend ML Systems
  • Authors: Arnab Phani, Matthias Böhm
  • Abstract: Modern machine learning (ML) systems leverage multiple backends, including CPUs, GPUs, and distributed execution platforms like Apache Spark or Ray. Depending on workload and cluster characteristics, these systems typically compile an ML pipeline into hybrid plans of in-memory CPU, GPU, and distributed operations. Prior work found that exploratory data science processes exhibit a high degree of redundancy, and accordingly applied tailor-made techniques for reusing intermediates in specific backend scenarios. However, achieving efficient holistic reuse in multi-backend ML systems remains a challenge due to its tight coupling with other aspects such as memory management, data exchange, and operator scheduling. In this paper, we introduce MEMPHIS, a principled framework for holistic, application-agnostic, multi-backend reuse and memory management. MEMPHIS’s core component is a hierarchical lineage-based reuse cache, which acts as a unified abstraction and manages the reuse, recycling, exchange, and cache eviction across different backends. To address challenges of different backends such as lazy evaluation, asynchronous execution, memory allocation overheads, small available memory, and different interconnect bandwidths, we devise a suite of cache management policies. Moreover, we extend an optimizing ML system compiler by special operators for asynchronous exchange, workload-aware speculative cache management, and related operator ordering for concurrent execution. Our experiments across diverse ML tasks and pipelines show improvements up to 9.6x compared to state-of-the-art ML systems.
  • Link: https://openproceedings.org/2025/conf/edbt/paper-82.pdf 
  • Research track

  • Title: MaTElDa: Multi-Table Error Detection
  • Authors: Fatemeh Ahmadi, Marc Speckmann, Malte Kuhlmann, Ziawasch Abedjan
  • Abstract: As data-driven applications gain popularity, ensuring high data quality is a growing concern. Yet, data cleaning techniques are limited to treating one table at a time. A table-by-table application of such methods is cumbersome, because these methods either require previous knowledge about constraints or often require labor-intensive configurations and manual labeling for each individual table. As a result, they hardly scale beyond a few tables and miss the chance for optimizing the cleaning process. To tackle these issues, we introduce a novel semi-supervised error detection approach, Matelda, that organizes a given set of tables by folding their cells with regard to domain and quality similarity to facilitate user supervision. We propose a unified feature embedding that makes cell values comparable across tables. Experimental evaluations demonstrate that Matelda outperforms various configurations of existing single-table cleaning methodologies in the multi-table scenario.
  • Link: https://openproceedings.org/2025/conf/edbt/paper-98.pdf 
  • Research track 
     
  • Title: How Green is AutoML for Tabular Data?
  • Authors: Felix Neutatz, Marius Lindauer, Ziawasch Abedjan
  • Abstract: AutoML has risen to one of the most commonly used tools for day-to-day data science pipeline development and several popular packages exist. While AutoML systems support data scientists during the tedious process of pipeline generation, they can lead to high computation costs that result from extensive search or pre-training. In light of concerns with regard to the environment and the need for Green IT, we holistically analyze the computational cost of pipelines generated through various AutoML systems by combining the cost of system development, execution, and the downstream inference cost. Our findings show the benefits and disadvantages of implementation designs and their potential for Green AutoML.
  • Link: https://openproceedings.org/2025/conf/edbt/paper-97.pdf 
  • Experiments & Analysis Papers 

  • Title: A Deep Dive Into Zero-Shot Entity Matching with Large and Small Language Models
  • Authors: Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter
  • Abstract: Entity matching (EM) is the problem of determining whether two data records refer to the same real-world entity. A particularly challenging scenario is cross-dataset entity matching, where the matcher has to work with an unseen target dataset for which no labelled examples are available. Cross-dataset EM is crucial in scenarios where a high level of automation is required, and where it is unlikely or impractical to force a domain expert to manually label training data. Recently, approaches based on language models have become popular for EM, and often promise impressive transfer capabilities. However, there is a lack of a comprehensive and systematic study of the cross-dataset EM capabilities of these recent approaches. It is unclear which categories of language models are actually applicable in a cross-dataset EM setting, how well current EM approaches perform when they are evaluated systematically under a cross-dataset setting, and what the relationship between the prediction quality and deployment cost of various large language model-based EM approaches is. We address these open questions with the first comprehensive and systematic study on cross-dataset entity matching, where we evaluate eight matchers on 11 benchmark datasets, cover a wide variety of model sizes and transfer learning approaches, and also explore and quantify the relation between prediction quality and deployment cost of the matching approaches. We find that fine-tuned small models can perform on par with prompted large models, that data-centric approaches outperform model-centric approaches, and that approaches using well-performing small models can be deployed at orders of magnitude lower cost than comparably performing approaches with large commercial models.
  • Link: https://openproceedings.org/2025/conf/edbt/paper-224.pdf  
  • Experiments & Analysis Papers