Software systems that learn from data via machine learning (ML) are being deployed in increasing numbers in real-world application scenarios. These ML applications contain complex data preparation pipelines, which take several raw inputs and integrate, filter, and encode them to produce the input data for model training. This is in stark contrast to academic studies and benchmarks, which typically work with static, already prepared datasets. It is a difficult and tedious task to ensure at development time that the data preparation pipelines of such ML applications adhere to sound experimentation practices and compliance requirements. Identifying potential correctness issues currently requires a high degree of discipline, knowledge, and time from data scientists, who often resort to one-off solutions based on specialised frameworks that are incompatible with the rest of the data science ecosystem.
We discuss how to model data preparation pipelines as dataflow computations from relational inputs to matrix outputs, and propose techniques that use record-level provenance to automatically screen these pipelines for many common correctness issues (e.g., data leakage between training and test data). We design a prototypical system to screen such data preparation pipelines, which furthermore enables the automatic computation of important metadata such as group fairness metrics. We discuss how to extract the semantics and the data provenance of common artifacts in supervised learning tasks, and evaluate our system on several example pipelines with real-world data.
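To illustrate the general idea of record-level provenance checks, the following is a minimal sketch, assuming a pandas/scikit-learn pipeline and synthetic data; the function names (annotate_provenance, check_train_test_leakage, group_fairness_stats) and column names are hypothetical and not taken from the system described above.

```python
# Sketch: tag each record with a provenance id, then use those ids to
# detect train/test leakage and to compute a simple group fairness metric.
import pandas as pd
from sklearn.model_selection import train_test_split


def annotate_provenance(df, source_name):
    """Attach a provenance identifier (source name + row index) to each record."""
    df = df.copy()
    df["__provenance__"] = [f"{source_name}_{i}" for i in range(len(df))]
    return df


def check_train_test_leakage(train_df, test_df):
    """Return the provenance identifiers that appear in both splits."""
    return set(train_df["__provenance__"]) & set(test_df["__provenance__"])


def group_fairness_stats(df, group_column, label_column):
    """Compute the rate of positive labels per demographic group."""
    return df.groupby(group_column)[label_column].mean()


if __name__ == "__main__":
    # Synthetic relational input, standing in for a raw data source.
    records = pd.DataFrame({
        "age_group": ["<40", ">=40", "<40", ">=40", "<40", ">=40"],
        "label": [1, 0, 1, 1, 0, 1],
    })
    records = annotate_provenance(records, "patients_csv")

    train, test = train_test_split(records, test_size=0.33, random_state=0)
    # Simulate an accidental leak: a training record re-enters the test set.
    test = pd.concat([test, train.iloc[[0]]])

    print("Leaked provenance ids:", check_train_test_leakage(train, test))
    print("Positive label rate per group:")
    print(group_fairness_stats(train, "age_group", "label"))
```

In a real pipeline, the provenance annotations would have to be propagated automatically through the dataflow operators (joins, filters, encoders) rather than attached manually as in this toy example.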