Expand your training limits! Generating Training Data for ML-based Data Management

Originally posted on Medium

Published: 10 June 2021

Machine Learning (ML) is quickly becoming a prominent method in many data management components, especially query optimizers, where it has recently shown very promising results. However, training these models requires labeled query workloads, and collecting one is very costly in both time and money because it means developing and executing thousands of realistic queries. In this post, we analyze the limits of current solutions and discuss DataFarm, an innovative framework for efficiently generating and labeling large query workloads. DataFarm enables users to reduce the cost of obtaining labeled query workloads by 54× (and up to an estimated factor of 10⁴×) compared to standard approaches.