Higher impact through reproducibility

Modern science is based on objectivity. Experimental results should be repeatable by any scientist, provided they use the same experimental setup. Since 2008, the SIGMOD conference, the leading international conference on data management, has awarded a reproducibility badge to signify that a scientific work has been successfully reproduced by a third-party reviewer. In 2021, the paper “Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects” by BIFOLD researcher Clemens Lutz and his co-authors was awarded this prestigious badge.

SIGMOD Reproducibility has three main goals: highlight the impact of database research papers, enable easy dissemination of research results, and enable easy sharing of code and experimental setups. In computer science, reproducing results is inherently complex because many factors can inadvertently influence a test bed. The goal of SIGMOD Reproducibility is therefore to help build a culture in which sharing the results, code, and scripts of database research is the norm rather than the exception. The challenge is to do this time-efficiently, which means building technical expertise in making research repeatable and shareable.
“Our paper explores the opportunities that a new technology, fast GPU interconnects, offers for database management systems. To reproduce our work, we faced the unique challenge that our results rely on very specific hardware. Fast GPU interconnects are not yet widely available, and thus a third-party reviewer is unlikely to have the appropriate equipment to repeat our measurements. Together with the reviewers and our system administrator, we overcame this hurdle by granting the reviewers access to our lab equipment”, explains first author Clemens Lutz.
As of 2021, only 13 of the 143 full papers published at SIGMOD 2020 had been awarded the reproducibility badge.

“Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects”

Authors:
Clemens Lutz, Sebastian Breß, Steffen Zeuch, Tilmann Rabl, Volker Markl

Abstract:
GPUs have long been discussed as accelerators for database query processing because of their high processing power and memory bandwidth. However, two main challenges limit the utility of GPUs for large-scale data processing: (1) the on-board memory capacity is too small to store large data sets, yet (2) the interconnect bandwidth to CPU main-memory is insufficient for ad hoc data transfers. As a result, GPU-based systems and algorithms run into a transfer bottleneck and do not scale to large data sets. In practice, CPUs process large-scale data faster than GPUs with current technology. In this paper, we investigate how a fast interconnect can resolve these scalability limitations using the example of NVLink 2.0. NVLink 2.0 is a new interconnect technology that links dedicated GPUs to a CPU. The high bandwidth of NVLink 2.0 enables us to overcome the transfer bottleneck and to efficiently process large data sets stored in main-memory on GPUs. We perform an in-depth analysis of NVLink 2.0 and show how we can scale a no-partitioning hash join beyond the limits of GPU memory. Our evaluation shows speed-ups of up to 18x over PCI-e 3.0 and up to 7.3x over an optimized CPU implementation. Fast GPU interconnects thus enable GPUs to efficiently accelerate query processing.
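The key idea can be illustrated with a small sketch. The following CUDA fragment is not the authors’ implementation; it only shows the shape of a no-partitioning hash join probe in which the hash table resides in GPU memory while the (potentially much larger) probe side stays in CPU main memory and is fetched over the interconnect on access. The bucket layout, hash function, and all names are illustrative assumptions.

```cuda
// Minimal sketch (assumptions, not the paper's code): probe phase of a
// no-partitioning hash join. Build side: hash table in GPU memory.
// Probe side: managed memory resident in CPU RAM, streamed over the
// interconnect (e.g., NVLink 2.0) as the kernel reads it.
#include <cstdio>
#include <cuda_runtime.h>

struct Bucket { long long key; long long payload; };   // open-addressing slot

__device__ unsigned long long mix(long long k) {
    return (unsigned long long)k * 0x9E3779B97F4A7C15ULL;  // Fibonacci hashing
}

__global__ void probe(const Bucket* table, unsigned long long mask,
                      const long long* keys, size_t n,
                      unsigned long long* matches) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {                          // grid-stride loop
        long long k = keys[i];
        for (unsigned long long s = mix(k) & mask;; s = (s + 1) & mask) {
            Bucket b = table[s];                        // linear probing
            if (b.key == k) { atomicAdd(matches, 1ULL); break; }
            if (b.key == -1) break;                     // empty slot: no match
        }
    }
}

int main() {
    const size_t n = 1 << 22;                           // 4M probe keys
    const unsigned long long mask = (1 << 21) - 1;      // 2^21 table slots
    Bucket* table; long long* keys; unsigned long long* matches;
    cudaMalloc(&table, (mask + 1) * sizeof(Bucket));
    cudaMemset(table, 0xFF, (mask + 1) * sizeof(Bucket));  // key = -1: empty
    cudaMallocManaged(&keys, n * sizeof(long long));    // pages live in CPU RAM
    cudaMallocManaged(&matches, sizeof(unsigned long long));
    for (size_t i = 0; i < n; ++i) keys[i] = (long long)i;
    *matches = 0;
    probe<<<256, 256>>>(table, mask, keys, n, matches);
    cudaDeviceSynchronize();
    printf("matches: %llu\n", *matches);                // 0: table left empty
    return 0;
}
```

The point of the sketch is the memory placement: the probe relation never has to fit into GPU memory, which is exactly the scalability limit the paper addresses; a fast interconnect makes streaming it from CPU main memory practical.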

Publication:
Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD 2020)

BTW 2021 Best Paper Award and Reproducibility Badge for TU Berlin Data Science Publication

The research paper “Fast CSV Loading Using GPUs and RDMA for In-Memory Data Processing” by Alexander Kumaigorodski, Clemens Lutz, and Volker Markl received the Best Paper Award of the 19th Symposium on Database Systems for Business, Technology and Web (BTW 2021). In addition, the paper received the Reproducibility Badge, awarded for the first time at BTW 2021, for the high reproducibility of its results.

TU Berlin Master’s graduate Alexander Kumaigorodski and his co-authors from Prof. Dr. Volker Markl’s Department of Database Systems and Information Management (DIMA) at TU Berlin and from the Intelligent Analytics for Massive Data (IAM) research area at the German Research Center for Artificial Intelligence (DFKI) present a new approach that speeds up the loading and processing of tabular CSV data by orders of magnitude.

CSV is a very frequently used format for the exchange of structured data. For example, the City of Berlin publishes its structured datasets in the CSV format in the Berlin Open Data Portal. Such datasets can be imported into databases for analysis. Accelerating this import allows users to handle the growing amount of data and to reduce the time required for data analysis. Each new generation of computer networks and storage media provides higher bandwidth and allows for faster reading times. However, current loading and processing approaches using main processors (CPUs) cannot keep up with these hardware technologies and unnecessarily throttle loading times.

The approach described in this paper instead reads and processes CSV data on graphics processors (GPUs). The advantage of GPUs lies primarily in their massive parallel computing power and fast memory access. With this approach, new hardware technologies such as NVLink 2.0 or InfiniBand with Remote Direct Memory Access (RDMA) can be fully exploited: CSV data can be read directly from main memory or the network and processed at multiple gigabytes per second.
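How such a loader gets off the ground can be sketched in a few lines. The following CUDA fragment is an illustrative assumption, not the authors’ code: it shows only the first step, locating record boundaries in parallel so that later passes can parse fields; quote handling is discussed with the abstract below.

```cuda
// Minimal sketch (not the paper's implementation): first pass of a GPU CSV
// loader. Each thread inspects one byte of the input buffer; newline
// positions are collected with an atomic counter into a record index.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void find_newlines(const char* buf, size_t n,
                              unsigned* out, unsigned* count) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && buf[i] == '\n')
        out[atomicAdd(count, 1u)] = (unsigned)i;   // unordered; sort afterwards
}

int main() {
    const char csv[] = "id,name\n1,Berlin\n2,Potsdam\n";
    size_t n = sizeof(csv) - 1;
    char* d_buf; unsigned *d_out, *d_count;
    cudaMalloc(&d_buf, n);
    cudaMalloc(&d_out, n * sizeof(unsigned));
    cudaMallocManaged(&d_count, sizeof(unsigned));
    *d_count = 0;
    cudaMemcpy(d_buf, csv, n, cudaMemcpyHostToDevice);
    find_newlines<<<(unsigned)((n + 255) / 256), 256>>>(d_buf, n, d_out, d_count);
    cudaDeviceSynchronize();
    printf("records: %u\n", *d_count);             // 3 newlines -> 3 records
    return 0;
}
```

Because every byte is examined independently, the work scales with the number of GPU threads rather than with a single CPU core’s control flow, which is what lets throughput track the bandwidth of the interconnect.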

The transparency of the tests performed and the independent confirmation of the results also earned the paper BTW’s first-ever Reproducibility Badge. In the data science community, the reproducibility of research results is becoming increasingly important. It serves to verify results and to compare them with existing work, and is thus an important aspect of scientific quality assurance. Leading international conferences already devote special attention to this topic.

To ensure high reproducibility, the authors provided the reproducibility committee with source code, additional test data, and instructions for running the benchmarks. The execution of the tests was demonstrated in a live session and could then also be successfully replicated by a member of the committee. The Reproducibility Badge recognizes above all the good scientific practice of the authors.

The paper in detail:
“Fast CSV Loading Using GPUs and RDMA for In-Memory Data Processing”

Authors:
Alexander Kumaigorodski, Clemens Lutz, Volker Markl

Abstract:
Comma-separated values (CSV) is a widely used format for data exchange. Due to the format’s prevalence, virtually all industrial-strength database systems and stream processing frameworks support importing CSV input. However, loading CSV input close to the speed of I/O hardware is challenging. Modern I/O devices such as InfiniBand NICs and NVMe SSDs are capable of sustaining high transfer rates of 100 Gbit/s and higher. At the same time, CSV parsing performance is limited by the complex control flows that its semi-structured and text-based layout incurs. In this paper, we propose to speed up loading CSV input using GPUs. We devise a new parsing approach that streamlines the control flow while correctly handling context-sensitive CSV features such as quotes. By offloading I/O and parsing to the GPU, our approach enables databases to load CSVs at high throughput from main memory with NVLink 2.0, as well as directly from the network with RDMA. In our evaluation, we show that GPUs parse real-world datasets at up to 60 GB/s, thereby saturating high-bandwidth I/O devices.
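The abstract’s “context-sensitive CSV features such as quotes” are what normally force serial parsing: whether a comma is a delimiter depends on all bytes before it. The paper’s exact method is not reproduced here, but one common way to make quote handling data-parallel is a prefix sum over quote counts: the parity of quotes preceding each byte tells whether the byte lies inside a quoted field. A minimal Thrust sketch, assuming unescaped quotes:

```cuda
// Hedged sketch of one data-parallel approach to CSV quote handling (the
// paper's method may differ). A prefix sum yields, for every byte, the
// number of quote characters before it; odd parity means "inside quotes".
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>

// 1 where the byte is a double quote, else 0. Escaped quotes ("") are a
// simplification this sketch ignores.
struct IsQuote {
    __host__ __device__ int operator()(char c) const { return c == '"'; }
};

int main() {
    const char csv[] = "1,\"a,b\",2\n";     // the comma inside "a,b" is quoted
    thrust::device_vector<char> buf(csv, csv + sizeof(csv) - 1);
    thrust::device_vector<int> q(buf.size());
    thrust::transform(buf.begin(), buf.end(), q.begin(), IsQuote{});
    // exclusive scan: q[i] = number of quote characters before position i
    thrust::exclusive_scan(q.begin(), q.end(), q.begin());
    thrust::host_vector<char> hb = buf;
    thrust::host_vector<int> hq = q;
    for (size_t i = 0; i < hb.size(); ++i)
        if (hb[i] == ',' && hq[i] % 2 == 0)  // even parity: a real delimiter
            printf("delimiter at byte %zu\n", i);  // prints bytes 1 and 7
    return 0;
}
```

The scan replaces the data-dependent control flow of a serial parser with two regular, bandwidth-bound passes, which is the general shape of “streamlined control flow” that GPU parsing needs.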

Publication:
K.-U. Sattler et al. (Eds.): Datenbanksysteme für Business, Technologie und Web (BTW 2021), Lecture Notes in Informatics (LNI), Gesellschaft für Informatik, Bonn 2021
https://doi.org/10.18420/btw2021-01