Understanding Data Cleaning
Data cleaning is the process of identifying and rectifying problems in a dataset, such as missing, duplicate, or irrelevant data, to enhance its quality. Raw data, such as log files or transaction records, often contains noise, inconsistencies, and gaps that can undermine the accuracy of analytical models. The primary aim of data cleaning is to ensure datasets are accurate, consistent, and error-free. Clean data is crucial for Exploratory Data Analysis (EDA), as it improves data interpretability and helps derive actionable insights.
Benefits of Data Cleaning
Data cleaning offers numerous advantages, including improved model performance, increased accuracy, better data representation, enhanced data quality, and improved data security.
How to Perform Data Cleaning
Data cleaning involves several systematic steps to address errors in a dataset:
- Remove Unwanted Observations: Eliminate duplicates and irrelevant data entries that create noise.
- Fix Structural Errors: Standardize data formats and variable types for uniformity.
- Manage Outliers: Identify and address extreme values that can skew analyses.
- Handle Missing Data: Use imputation, deletion, or other techniques to fill in data gaps.
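The four steps above can be sketched end-to-end on a small synthetic DataFrame (the column names and values here are illustrative, not from the Titanic data):

```python
import pandas as pd
import numpy as np

# Synthetic data with a duplicate row, a string-typed numeric column,
# an implausible outlier, and a missing value.
df = pd.DataFrame({
    'age': ['25', '25', '31', '999', None],
    'city': ['NY', 'NY', 'LA', 'SF', 'LA'],
})

# 1. Remove unwanted observations: drop exact duplicate rows.
df = df.drop_duplicates()

# 2. Fix structural errors: convert 'age' to a numeric type.
df['age'] = pd.to_numeric(df['age'])

# 3. Manage outliers: treat implausible ages as missing.
df.loc[df['age'] > 120, 'age'] = np.nan

# 4. Handle missing data: impute with the column mean.
df['age'] = df['age'].fillna(df['age'].mean())

print(df)
```

Each later step in this tutorial expands one of these four operations on the real dataset.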
Implementation for Data Cleaning
To demonstrate data cleaning steps, let's use the Titanic dataset.
Step 1: Import Libraries and Load Dataset
import pandas as pd
import numpy as np
df = pd.read_csv('Titanic-Dataset.csv')
df.info()
df.head()
Step 2: Check for Duplicate Rows
df.duplicated().sum()
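If the count is nonzero, duplicates can be removed with drop_duplicates(). A quick sketch on a toy frame (illustrative values, not the Titanic data):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'fare': [7.25, 7.25, 8.05]})

# Rows identical to an earlier row are flagged as duplicates.
n_dupes = df.duplicated().sum()

# Keep only the first occurrence of each row.
df = df.drop_duplicates()
print(n_dupes, len(df))
```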
Step 3: Identify Column Data Types
cat_col = [col for col in df.columns if df[col].dtype == 'object']
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Categorical columns:', cat_col)
print('Numerical columns:', num_col)
Step 4: Count Unique Values in Categorical Columns
df[cat_col].nunique()
Step 5: Calculate Missing Values as Percentage
round((df.isnull().sum() / df.shape[0]) * 100, 2)
Step 6: Drop Irrelevant Columns and Handle Missing Values
df1 = df.drop(columns=['Name', 'Ticket', 'Cabin'])
df1 = df1.dropna(subset=['Embarked'])
df1['Age'] = df1['Age'].fillna(df1['Age'].mean())
Step 7: Detect Outliers with Box Plot
import matplotlib.pyplot as plt
plt.boxplot(df1['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()
Step 8: Calculate Outlier Boundaries and Remove Them
mean = df1['Age'].mean()
std = df1['Age'].std()
lower_bound = mean - 2 * std
upper_bound = mean + 2 * std
df2 = df1[(df1['Age'] >= lower_bound) & (df1['Age'] <= upper_bound)]
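The mean ± 2·std rule above works best when the data is roughly normally distributed. A common alternative, not used in this tutorial, is the interquartile range (IQR) rule, which is more robust to skewed data. A minimal sketch on illustrative values:

```python
import pandas as pd

# Illustrative ages; in this tutorial the series would be df1['Age'].
age = pd.Series([22, 25, 28, 30, 31, 33, 35, 38, 40, 95])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

cleaned = age[(age >= lower) & (age <= upper)]
print(cleaned.max())
```

Because quartiles are insensitive to extreme values, the 95 is flagged even though it would inflate a mean-based boundary.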
Step 9: Impute Any Remaining Missing Values
df3 = df2.fillna({'Age': df2['Age'].mean()})
df3.isnull().sum()
Step 10: Recalculate Outlier Bounds and Remove Outliers from the Updated Data
mean = df3['Age'].mean()
std = df3['Age'].std()
lower_bound = mean - 2 * std
upper_bound = mean + 2 * std
print('Lower Bound :', lower_bound)
print('Upper Bound :', upper_bound)
df4 = df3[(df3['Age'] >= lower_bound) & (df3['Age'] <= upper_bound)]
Step 11: Data Validation and Verification
Separate the independent features from the target variable before modeling.
X = df4[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
Y = df4['Survived']
Step 12: Data Formatting
Min-Max Scaling: Rescales feature values to a specific range, typically 0 to 1.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
num_col_ = [col for col in X.columns if X[col].dtype != 'object']
x1 = X.copy()
x1[num_col_] = scaler.fit_transform(x1[num_col_])
x1.head()
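Min-Max scaling applies the formula x' = (x − min) / (max − min) to each column independently. A small check of that formula against the scikit-learn output (toy fare values, not the Titanic columns):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A single toy feature column.
fare = np.array([[7.25], [71.28], [8.05], [512.33]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(fare)

# The same result computed by hand with the Min-Max formula.
manual = (fare - fare.min()) / (fare.max() - fare.min())

print(np.allclose(scaled, manual))  # True
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.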
Data Cleaning Tools
Several tools can assist in data cleansing:
- OpenRefine: A tool for cleaning and transforming messy data.
- Trifacta Wrangler: Helps automate data cleaning workflows.
- TIBCO Clarity: Ensures high-quality, standardized datasets.
- Cloudingo: Focuses on deduplication and data cleansing.
- IBM InfoSphere QualityStage: Manages complex data quality tasks.
Advantages
- Improved model performance: Better data quality enhances model learning.
- Increased accuracy: Ensures data is error-free and consistent.
- Better representation: Transforms data to better reflect underlying patterns.
- Improved data quality: Enhances reliability and accuracy.
- Improved data security: Removes sensitive information.
Disadvantages
- Time-consuming: Especially for large datasets.
- Error-prone: Risk of losing important data.
- Cost and resource-intensive: Requires significant resources and expertise.
- Over-cleaning: Removing too much data can discard genuine patterns and bias results.