====== Data Cleaning and Validation ======

===== Overviews =====
  * Data Lifecycle
    * [[https://sigmodrecord.org/publications/sigmodRecord/1806/pdfs/04_Surveys_Polyzotis.pdf|Polyzotis et al 2018 - Data Lifecycle Challenges in Production Machine Learning: A Survey]]

===== Data Cleaning =====
  * [[https://arxiv.org/pdf/1711.01299.pdf|Krishnan et al 2017 - BoostClean: Automated Error Detection and Repair for Machine Learning]] (searched "data cleaning ensembling machine learning" on Google Scholar)
  * [[https://www.researchgate.net/profile/Junhua-Ding/publication/318432363_Data_Quality_Considerations_for_Big_Data_and_Machine_Learning_Going_Beyond_Data_Cleaning_and_Transformations/links/59ded28b0f7e9bcfab244bdf/Data-Quality-Considerations-for-Big-Data-and-Machine-Learning-Going-Beyond-Data-Cleaning-and-Transformations.pdf|2017 - Data Quality Considerations for Big Data and Machine Learning: Going Beyond Data Cleaning and Transformations]]
  * [[http://proceedings.mlr.press/v119/liu20e/liu20e.pdf|Liu & Guo 2020 - Peer Loss Functions: Learning from Noisy Labels without Knowing Noise Rates]]
  * [[https://arxiv.org/pdf/2009.10795|Swayamdipta et al 2020 - Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics]]

===== Data Validation =====
  * Book chapter: [[https://www.oreilly.com/library/view/building-machine-learning/9781492053187/ch04.html|Ch 4 - Data Validation]] Talks about TensorFlow Data Validation (TFDV)
  * [[https://proceedings.mlsys.org/paper/2019/file/5878a7ab84fb43402106c575658472fa-Paper.pdf|Breck et al 2019 - Data Validation for Machine Learning]]

{{media:data-validation.png}}
(from [[https://www.oreilly.com/library/view/building-machine-learning/9781492053187/ch04.html|here]])

===== Related Pages =====
  * [[nlp:Data Preparation]]
  * [[nlp:Dataset Creation]]