
ICDE 2022 Best Demo Award

A framework to efficiently create training data for optimizers

A demo paper co-authored by a group of BIFOLD researchers, “Farming Your ML-based Query Optimizer’s Food”, presented at the virtual conference ICDE 2022, has won the Best Demo Award. The award committee unanimously chose this demonstration for the relevance of the problem, the high potential of the proposed approach, and the excellent presentation.

As machine learning is becoming a core component in query optimizers, e.g., to estimate costs or cardinalities, it is critical to collect large amounts of labeled training data to build these machine learning models. The training data should consist of diverse query plans together with their labels (execution time or cardinality). However, collecting such a training dataset is a very tedious and time-consuming task: it requires both developing numerous plans and executing them to acquire ground-truth labels. The latter can take days, if not months, depending on the size of the data.
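To make the problem concrete, each training example couples a plan's features with a measured label. The following is a minimal, purely illustrative sketch; the field names are hypothetical and not DataFarm's actual schema:

```python
# Illustrative only: one labeled training example for a learned cost model.
# Field names are hypothetical, not taken from DataFarm.
from dataclasses import dataclass

@dataclass
class LabeledPlan:
    plan_id: str
    operators: list[str]      # e.g. ["Scan", "Filter", "Join", "GroupBy"]
    input_cardinality: int    # rows read by the plan
    runtime_ms: float         # ground-truth label: measured execution time

# Acquiring the label means actually executing the plan, which is what
# makes building a large, diverse training set so expensive.
example = LabeledPlan(
    plan_id="q42",
    operators=["Scan", "Filter", "Join"],
    input_cardinality=10_000_000,
    runtime_ms=87_350.0,
)
```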

In a research paper presented last year at SIGMOD 2021, the authors introduced DataFarm, a framework for efficiently creating training data for optimizers with learning-based components. This demo paper extends DataFarm with an intuitive graphical user interface that lets users inspect informative details of the generated plans and guides them through the generation process step by step. As output, users can download both the generated plans, to use as a benchmark, and the training data (jobs with their labels).

[Embedded YouTube video]

The publication in detail:

Robin van de Water, Francesco Ventura, Zoi Kaoudi, Jorge-Arnulfo Quiane-Ruiz, Volker Markl: Farming Your ML-based Query Optimizer’s Food (to appear)

Abstract

Machine learning (ML) is becoming a core component in query optimizers, e.g., to estimate costs or cardinalities. This means large heterogeneous sets of labeled query plans or jobs (i.e., plans with their runtime or cardinality output) are needed. However, collecting such a training dataset is a very tedious and time-consuming task: It requires both developing numerous jobs and executing them to acquire ground-truth labels. We demonstrate DATAFARM, a novel framework for efficiently generating and labeling training data for ML-based query optimizers to overcome these issues. DATAFARM enables generating training data tailored to users’ needs by learning from their existing workload patterns, input data, and computational resources. It uses an active learning approach to determine a subset of jobs to be executed and encloses the human into the loop, resulting in higher quality data. The graphical user interface of DATAFARM allows users to get informative details of the generated jobs and guides them through the generation process step-by-step. We show how users can intervene and provide feedback to the system in an iterative fashion. As an output, users can download both the generated jobs to use as a benchmark and the training data (jobs with their labels).
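The active learning step mentioned in the abstract can be pictured as a select-execute-retrain loop: train a cost model on the jobs labeled so far, then spend the execution budget only on the jobs the model is least sure about. The sketch below is a hypothetical illustration, not DataFarm's actual implementation; the uncertainty criterion (tree disagreement in a random forest), the `execute_and_label` callback, and all other names are assumptions for the example.

```python
# Hypothetical sketch of an active-learning labeling loop (not DataFarm's code):
# execute only the generated jobs whose predicted runtime the model is least
# certain about, instead of labeling the whole pool.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def tree_disagreement(forest, X):
    # Standard deviation across the forest's trees as a proxy for uncertainty.
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    return per_tree.std(axis=0)

def active_labeling_loop(features, execute_and_label, seed_idx, budget, batch=5):
    """features: (n_jobs x n_features) matrix for all generated jobs.
    execute_and_label(i): runs job i and returns its measured runtime."""
    labeled = {i: execute_and_label(i) for i in seed_idx}
    while len(labeled) < budget:
        X = features[list(labeled)]
        y = np.array(list(labeled.values()))
        model = RandomForestRegressor(n_estimators=50).fit(X, y)
        pool = [i for i in range(len(features)) if i not in labeled]
        if not pool:
            break
        scores = tree_disagreement(model, features[pool])
        # Execute the most uncertain jobs next; a human-in-the-loop step
        # could review or veto this selection before execution.
        for j in np.argsort(scores)[-batch:]:
            if len(labeled) >= budget:
                break
            labeled[pool[j]] = execute_and_label(pool[j])
    return labeled
```

Under this kind of strategy, only the most informative jobs pay the cost of real execution, which is the intuition behind executing just a subset of the generated jobs while still obtaining high-quality training data.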