Using Pandera for Cleaning Large Noisy Datasets

Real-world datasets come with noise. Identifying that noise and forming data quality expectations calls for a data quality tool such as pandera. The repository contains a notebook illustrating how such a tool is used to arrive at denoising rules for a dataset.

This process can be roughly summarized as follows:

  1. Inspect the dataset and form a preliminary estimate of the values you should expect for each attribute.

  2. Use basic features of pandas to assess the quality of each attribute.

  3. For each attribute, develop a schema that captures what is admissible for that attribute.

  4. Apply the schema, inspect the results.

  5. Repeat for each attribute and consolidate the rules.