Expand your Training Limits! Generating and Labeling Jobs for ML-based Data Management

SIGMOD 2021
Publication Date: 30.03.2021

Abstract

Machine Learning (ML) is quickly becoming a prominent method in many data management components, including query optimizers, which have recently shown very promising results. However, the low availability of training data (i.e., large query workloads with execution time or output cardinality as labels) severely limits further research advances and hinders the transfer of this technology from research to industry. Collecting a labeled query workload is very costly in terms of time and money, as it requires developing and executing thousands of realistic queries/jobs. In this work, we address the problem of generating training data for data management components tailored to users' needs. We present DataFarm, an innovative framework for efficiently generating and labeling large query workloads. We follow a data-driven, white-box approach that learns from pre-existing small workload patterns, the input data, and the available computational resources. Our framework allows users to produce a large, heterogeneous set of realistic jobs together with their labels, which can be used by any ML-based data management component. We show that our framework outperforms the current state of the art in both query generation and label estimation on synthetic and real datasets. It achieves up to 9x better labeling performance in terms of R2 score. More importantly, it allows users to reduce the cost of obtaining labeled query workloads by 54x (and up to an estimated factor of 104x) compared to standard approaches.
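
For reference, the R2 score mentioned above is the standard coefficient of determination for regression; as a reminder of what the reported "9x better labeling performance" compares, it is defined as

\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\]

where \(y_i\) are the true labels (e.g., execution times), \(\hat{y}_i\) the estimated labels, and \(\bar{y}\) the mean of the true labels. This formula is the generic metric definition, not a description of the paper's specific evaluation setup.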