Section 5 of Experimentation
Abstract
This section establishes the formal methodology for data cleaning, distinguishing it from data transformation while outlining a five-step procedural framework for dataset remediation. It argues that effective analysis relies on identifying and mitigating specific data pathologies including structural errors, outliers, and missing values through a systematic validation process. Furthermore, the section defines the five intrinsic characteristics required for data quality, framing them as the necessary validation criteria for the cleaning pipeline to ensure the resulting dataset supports valid theoretical conclusions.
Key Concepts
- Data Cleaning Process: Data cleaning is technically defined as the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. This process specifically targets the removal of data that does not belong in the final dataset, distinguishing it from other data management activities by its focus on correcting errors rather than altering structure.
- Data Transformation and Munging: Data transformation is the distinct process of converting data from one format or structure into another to facilitate analysis. These processes, often called data wrangling or data munging, map data from its raw form into a format suitable for warehousing and analysis without necessarily correcting inherent data errors (a minimal reshaping sketch follows this list).
- Identification of Irrelevant Observations: The first procedural step involves determining whether observations are irrelevant because they do not fit the specific problem being analyzed. Identifying and removing these observations keeps the dataset focused on the theoretical questions at hand and prevents noise from skewing the analytical results (see the first-step sketch after this list).
- Duplicate Observation Management: Duplicates frequently arise when data sources are combined from multiple places, scraped from the web, or received from various clients and departments. The cleaning protocol requires removing these duplicates so that the dataset reflects unique events or entities rather than redundant copies that would bias statistical calculations (covered in the same first-step sketch below).
- Structural Error Correction: Structural errors arise when data is measured or transferred, producing strange naming conventions, typos, or incorrect capitalization that obscure the true nature of the information. Resolving these errors is essential for maintaining dataset integrity, since inconsistent formatting can prevent proper aggregation or comparison during the analysis phase (see the normalization sketch after this list).
- Outlier Verification Protocols: The third step requires filtering unwanted outliers, though the mere existence of an outlier does not imply it is incorrect. The analyst must determine whether the value is valid, considering removal only if the outlier proves irrelevant to the analysis or is definitively established as a mistake rather than a valid extreme value (see the flagging sketch after this list).
- Missing Data Mitigation Strategies: Handling missing data involves a trade-off: observations with missing values can be dropped, but doing so discards information. Alternatively, practitioners may impute missing values based on other observations or alter the way the data is used so that null values are navigated without discarding the partial information available (both options are sketched after this list).
- Data Validation and Quality Assurance: The final step involves validating the data and performing quality assurance to ensure it makes sense and follows the appropriate rules for its field. This phase determines whether the data proves or disproves the working theory or brings any insight to light, revealing trends that help form the next theory; if no insights emerge, that absence may itself point to a data quality issue (see the rule-checking sketch after this list).
- Data Validity and Constraints: Validity represents the degree to which data conforms to defined business rules or constraints, serving as a primary check on the internal logic of the dataset. High validity ensures that the data elements adhere to the expected schemas and rules, making the dataset reliable for downstream analytical modeling.
- Data Quality Characterization: Quality data is characterized by five distinct attributes. Alongside validity, accuracy ensures data is close to the true values, completeness represents the degree to which all required data is known, consistency ensures data agrees within the same dataset and across multiple datasets, and uniformity measures the degree to which data is specified using the same unit of measure (the rule-checking sketch after this list touches each of these).
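To make the distinction between transformation and cleaning concrete, the following minimal sketch (assuming pandas; the sensor readings and column names are hypothetical) reshapes wide data into long form. The structure changes, but any erroneous values survive untouched, which is exactly what separates transformation from cleaning.

```python
import pandas as pd

# Hypothetical wide-format readings: one column per month.
wide = pd.DataFrame({
    "sensor_id": ["A", "B"],
    "jan": [3.1, 2.9],
    "feb": [3.4, 3.0],
})

# Transformation (munging): reshape wide -> long so each row is one reading.
# Only the structure changes; any errors in the values are carried along untouched.
long_form = wide.melt(id_vars="sensor_id", var_name="month", value_name="reading")
print(long_form)
```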
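The first step, removing irrelevant and duplicate observations, might look like the following sketch; the order data, column names, and the EU-only scope are assumptions introduced purely for illustration.

```python
import pandas as pd

# Hypothetical orders combined from several departments.
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "region":   ["EU", "EU", "EU", "US"],
    "amount":   [250.0, 120.0, 120.0, 90.0],
})

# Step 1a: drop observations irrelevant to the question (here, a study scoped to the EU).
eu_orders = orders[orders["region"] == "EU"]

# Step 1b: drop exact duplicate rows introduced when the sources were combined.
eu_orders = eu_orders.drop_duplicates()
print(eu_orders)
```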
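A minimal sketch of structural error correction, assuming pandas and hypothetical survey labels: whitespace, capitalization, and variant spellings are normalized so that equivalent categories aggregate together.

```python
import pandas as pd

# Hypothetical survey responses with inconsistent naming and capitalization.
df = pd.DataFrame({"status": ["N/A", "Not Applicable", " employed ", "EMPLOYED", "unemployed"]})

# Normalize whitespace and case so equivalent labels compare equal.
df["status"] = df["status"].str.strip().str.lower()

# Map known variants of the same category onto a single canonical label.
df["status"] = df["status"].replace({"n/a": "not applicable"})
print(df["status"].value_counts())
```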
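Outlier verification can be sketched as a flag-then-review workflow rather than automatic deletion; the interquartile-range rule and the response-time values below are assumptions chosen only to illustrate the protocol.

```python
import pandas as pd

# Hypothetical response times; the 480 may be a data-entry mistake or a real extreme value.
times = pd.DataFrame({"response_ms": [120, 135, 128, 142, 480, 131]})

# Flag candidates with the interquartile-range rule instead of deleting them outright.
q1, q3 = times["response_ms"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
times["outlier_candidate"] = ~times["response_ms"].between(lower, upper)

# Candidates are reviewed; a row is removed only if it is confirmed to be a mistake
# or irrelevant to the analysis, not merely because it is extreme.
print(times[times["outlier_candidate"]])
```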
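The missing-data trade-off can be seen by placing the two options side by side; this sketch uses a hypothetical score column and median imputation as one plausible choice among many.

```python
import pandas as pd
import numpy as np

# Hypothetical measurements with a missing value.
df = pd.DataFrame({"subject": [1, 2, 3, 4], "score": [88.0, np.nan, 75.0, 91.0]})

# Option A: drop observations with missing values (simple, but information is lost).
dropped = df.dropna(subset=["score"])

# Option B: impute the missing value from the other observations (here, the median).
imputed = df.copy()
imputed["score"] = imputed["score"].fillna(imputed["score"].median())

print(dropped)
print(imputed)
```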
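Validation and quality assurance can be expressed as explicit checks against the quality characteristics; the business rules below (an age range, required columns, unique identifiers, unit-suffixed column names) are assumptions standing in for whatever rules apply to a given field.

```python
import pandas as pd

# Hypothetical cleaned dataset to validate before analysis.
df = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age_years":  [34, 51, 29],
    "weight_kg":  [70.5, 82.0, 64.3],
})

# Validity: values conform to a defined business rule or constraint.
assert df["age_years"].between(0, 120).all(), "age outside valid range"

# Completeness: all required fields are known.
assert df[["patient_id", "age_years", "weight_kg"]].notna().all().all(), "missing required data"

# Consistency: one record per entity, with no conflicting copies.
assert df["patient_id"].is_unique, "duplicate patient records"

# Uniformity: a single unit of measure per column (enforced here by the *_kg / *_years
# naming convention, which is itself an assumption of this sketch).
print("quality checks passed")
```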
Key Equations and Algorithms
None
Key Claims and Findings
- Data cleaning serves as the foundational process for fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
- Irrelevant observations must be removed because they do not fit the specific problem the analyst is trying to resolve.
- Structural errors manifest as strange naming conventions, typos, or incorrect capitalization that emerge during measurement or transfer.
- Outliers are not automatically incorrect and require a validity determination step before considering them for removal from the dataset.
- Missing data handling involves dropping observations, imputing based on other observations, or altering usage strategies to navigate null values.
- The quality of the final dataset is determined by five characteristics: validity, accuracy, completeness, consistency, and uniformity.
Terminology
- Data Cleaning: The defined process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset to ensure it is free of errors.
- Data Transformation: The process of converting data from one format or structure into another, often synonymous with data wrangling or data munging for analysis.
- Irrelevant Observations: Observations identified as not fitting the specific problem the analyst is attempting to resolve within the scope of the research.
- Duplicate Observations: Occurrences where data points are copied multiple times due to combining datasets from multiple places, scraping, or receiving data from clients.
- Structural Errors: Errors detected during data transfer or measurement involving strange naming conventions, typos, or incorrect capitalization that affect data structure.
- Outliers: Observations or values in the dataset that may be correct or incorrect, requiring a validity determination before removal is considered.
- Imputation: The technique for handling missing data by estimating missing values based on other observations within the dataset.
- Data Validation: The process of checking if the data makes sense, follows rules, supports theory, and reveals trends, or if it suffers from a data quality issue.
- Validity: A characteristic of quality data describing the degree to which the data conforms to defined business rules or constraints.
- Accuracy: A characteristic of quality data ensuring the recorded values are close to the true values of the phenomenon being measured.
- Completeness: A characteristic of quality data referring to the degree to which all required data is known and present within the dataset.
- Consistency and Uniformity: Consistency ensures data is consistent within the same dataset or across multiple datasets, while uniformity ensures the same unit of measure is specified throughout.