
Fainder: A Fast and Accurate Index for Distribution-Aware Dataset Search

Lennart Behme
Sainyam Galhotra
Kaustubh Beedkar
Volker Markl

July 31, 2024

Efficient data discovery is crucial in the era of data-driven decision-making. However, current practices face significant challenges due to the intricacies of identifying datasets with specific distributional characteristics, such as percentiles, when data repositories are decentralized. Traditional keyword-based search methods are insufficient for these complex requirements and often produce suboptimal dataset search results. To address these challenges, this paper presents Fainder, a fast and accurate index for “percentile predicates” on histogram-based data summaries, which streamlines the search for datasets with specific distributional requirements. Fainder can be constructed on heterogeneous histogram collections and employs binary search in conjunction with multistep pruning techniques to efficiently identify search results for percentile predicates. It thereby simplifies data provisioning and improves the effectiveness of dataset discovery. Empirical evaluation of our solution on three large-scale data repositories shows that Fainder is effective for distribution-aware dataset search and provides order-of-magnitude efficiency gains over baselines.
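To make the notion of a percentile predicate concrete, the following is a minimal sketch of a naive linear scan over histogram summaries. It is not Fainder's index. It assumes, for illustration only, that a predicate asks whether at least a fraction p of a column's values lie below a reference value r, and that values are spread uniformly within each bin. The names `fraction_below` and `scan_search` and the histogram layout (bin edges plus counts) are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple

# Hypothetical summary layout: (bin_edges, counts), with len(edges) == len(counts) + 1.
Histogram = Tuple[Sequence[float], Sequence[int]]


def fraction_below(edges: Sequence[float], counts: Sequence[int], r: float) -> float:
    """Estimate the fraction of values below r, assuming uniform spread within bins."""
    total = sum(counts)
    if total == 0 or r <= edges[0]:
        return 0.0
    if r >= edges[-1]:
        return 1.0
    # Binary search for the bin that contains r.
    i = bisect_right(edges, r) - 1
    below = sum(counts[:i])
    # Add the interpolated share of the bin that r falls into.
    below += counts[i] * (r - edges[i]) / (edges[i + 1] - edges[i])
    return below / total


def scan_search(histograms: Dict[str, Histogram], p: float, r: float) -> List[str]:
    """Naive baseline: return ids of histograms where at least a fraction p lies below r."""
    return [
        name
        for name, (edges, counts) in histograms.items()
        if fraction_below(edges, counts, r) >= p
    ]


if __name__ == "__main__":
    collection = {
        "a": ([0, 10, 20, 30], [5, 3, 2]),
        "b": ([0, 50, 100], [1, 9]),
    }
    # Which columns have at least half of their values below 15?
    print(scan_search(collection, p=0.5, r=15))
```

A scan like this touches every histogram for every query, which is the cost an index over the histogram collection is meant to avoid.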