
Efficiently Indexing Large Data on GPUs with Fast Interconnects

Josef Schmeißer
Clemens Lutz
Volker Markl

March 28, 2025

Abstract

Modern GPUs have long been capable of processing queries at high throughput. However, until recently, GPUs faced slow data transfers from CPU main memory, and thus did not reach high processing rates for large, out-of-core data. To cope, database management systems (DBMSs) restrict their data access path to bulk data transfers orchestrated by the CPU, i.e., table scans. When queries are selective, a full table scan wastes bandwidth, leaving performance on the table. With the arrival of fast interconnects, this design choice must be reconsidered. GPUs can directly access data at up to 7× higher bandwidth, with bytes loaded on demand. We investigate four classic and recent index structures (binary search, B+tree, Harmonia, and RadixSpline), which we access via a fast interconnect. We show that indexing data can reduce transfer volume. However, when embedded into an index-nested loop join, we find that all indexes fail to outperform a hash join in the most interesting case: a highly selective query on large data (over 100 GiB). Therefore, we propose windowed partitioning, an index lookup optimization that generalizes to any index. As a result, index-nested loop joins run up to 3–10× faster than a hash join. Overall, we show that out-of-core indexes are a feasible design choice to exploit selectivity when using a fast interconnect.
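
To make the data access path concrete, the following is a minimal CUDA sketch, not the paper's implementation: each GPU thread binary-searches a sorted key column that stays resident in CPU memory, so only the cache lines the lookup touches cross the interconnect, rather than the whole table. The managed-memory allocation, data sizes, and key distribution are assumptions chosen for illustration.

// Hypothetical sketch: selective GPU index probes over host-resident data.
// With a fast interconnect or managed memory, the GPU fetches only the
// bytes that the binary search actually touches, instead of scanning the
// full column.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Each thread probes one key against the sorted column.
__global__ void index_probe(const int64_t *sorted_keys, size_t n,
                            const int64_t *probe_keys, size_t m,
                            int64_t *out_pos) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    int64_t key = probe_keys[i];
    size_t lo = 0, hi = n;                 // search the half-open range [lo, hi)
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted_keys[mid] < key) lo = mid + 1;
        else                        hi = mid;
    }
    out_pos[i] = (lo < n && sorted_keys[lo] == key) ? (int64_t)lo : -1;
}

int main() {
    const size_t n = 1u << 26, m = 1u << 20;   // illustrative sizes only
    int64_t *keys, *probes, *pos;
    // Managed allocations: pages migrate or are accessed on demand over the
    // interconnect, so untouched parts of the column need not be transferred.
    cudaMallocManaged(&keys, n * sizeof(int64_t));
    cudaMallocManaged(&probes, m * sizeof(int64_t));
    cudaMallocManaged(&pos, m * sizeof(int64_t));
    for (size_t i = 0; i < n; ++i) keys[i] = (int64_t)i * 2;      // sorted, even keys
    for (size_t i = 0; i < m; ++i) probes[i] = (int64_t)i * 128;  // sparse probes
    index_probe<<<(unsigned)((m + 255) / 256), 256>>>(keys, n, probes, m, pos);
    cudaDeviceSynchronize();
    printf("probe[0] -> position %lld\n", (long long)pos[0]);
    cudaFree(keys); cudaFree(probes); cudaFree(pos);
    return 0;
}

The sketch stands in for the simplest of the four evaluated indexes (binary search); the same access pattern applies when the probe walks a B+tree, Harmonia, or RadixSpline instead.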