Understanding Data Cleaning
Data cleaning is the process of identifying and rectifying problems in a dataset, such as missing, duplicate, or irrelevant data, to enhance its quality. Raw data, such as log files or transaction records, often contains noise, inconsistencies, and gaps that can undermine the accuracy of analytical models. The primary aim of data cleaning is to ensure datasets are accurate, consistent, and error-free. Clean data is crucial for Exploratory Data Analysis (EDA), as it improves data interpretability and helps derive actionable insights.
Benefits of Data Cleaning
Data cleaning offers numerous advantages, including improved model performance, increased accuracy, better data representation, enhanced data quality, and improved data security.
How to Perform Data Cleaning
Data cleaning involves several systematic steps to address errors in a dataset:
- Remove Unwanted Observations: Eliminate duplicates and irrelevant data entries that create noise.
- Fix Structural Errors: Standardize data formats and variable types for uniformity.
- Manage Outliers: Identify and address extreme values that can skew analyses.
- Handle Missing Data: Use imputation, deletion, or other techniques to fill in data gaps.
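The four steps above can be sketched end-to-end on a small synthetic DataFrame (the column names and values here are illustrative, not from the Titanic data):

```python
import pandas as pd
import numpy as np

# Synthetic data with a duplicate row, a string-typed numeric column,
# an implausible outlier, and a missing value.
df = pd.DataFrame({
    'age': ['25', '25', '31', '999', None],
    'city': ['NY', 'NY', 'LA', 'SF', 'LA'],
})

# 1. Remove unwanted observations: drop exact duplicate rows.
df = df.drop_duplicates()

# 2. Fix structural errors: convert 'age' to a numeric type.
df['age'] = pd.to_numeric(df['age'])

# 3. Manage outliers: treat implausible ages as missing.
df.loc[df['age'] > 120, 'age'] = np.nan

# 4. Handle missing data: impute with the column mean.
df['age'] = df['age'].fillna(df['age'].mean())

print(df)
```

Each later step in this tutorial expands one of these four operations on the real dataset.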
Implementation for Data Cleaning
To demonstrate data cleaning steps, let's use the Titanic dataset.
Step 1: Import Libraries and Load Dataset
import pandas as pd
import numpy as np
df = pd.read_csv('Titanic-Dataset.csv')
df.info()
df.head()
Step 2: Check for Duplicate Rows
df.duplicated().sum()
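If the count is nonzero, duplicates can be removed with drop_duplicates(). A quick sketch on a toy frame (illustrative values, not the Titanic data):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'fare': [7.25, 7.25, 8.05]})

# Rows identical to an earlier row are flagged as duplicates.
n_dupes = df.duplicated().sum()

# Keep only the first occurrence of each row.
df = df.drop_duplicates()
print(n_dupes, len(df))
```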
Step 3: Identify Column Data Types
cat_col = [col for col in df.columns if df[col].dtype == 'object']
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Categorical columns:', cat_col)
print('Numerical columns:', num_col)
Step 4: Count Unique Values in Categorical Columns
df[cat_col].nunique()
Step 5: Calculate Missing Values as Percentage
round((df.isnull().sum() / df.shape[0]) * 100, 2)
Step 6: Drop Irrelevant Columns and Handle Missing Values
df1 = df.drop(columns=['Name', 'Ticket', 'Cabin'])
df1 = df1.dropna(subset=['Embarked'])
df1['Age'] = df1['Age'].fillna(df1['Age'].mean())
Step 7: Detect Outliers with Box Plot
import matplotlib.pyplot as plt
plt.boxplot(df1['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()
Step 8: Calculate Outlier Boundaries and Remove Them
mean = df1['Age'].mean()
std = df1['Age'].std()
lower_bound = mean - 2 * std
upper_bound = mean + 2 * std
df2 = df1[(df1['Age'] >= lower_bound) & (df1['Age'] <= upper_bound)]
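The mean ± 2·std rule above works best when the data is roughly normally distributed. A common alternative, not used in this tutorial, is the interquartile range (IQR) rule, which is more robust to skewed data. A minimal sketch on illustrative values:

```python
import pandas as pd

# Illustrative ages; in this tutorial the series would be df1['Age'].
age = pd.Series([22, 25, 28, 30, 31, 33, 35, 38, 40, 95])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

cleaned = age[(age >= lower) & (age <= upper)]
print(cleaned.max())
```

Because quartiles are insensitive to extreme values, the 95 is flagged even though it would inflate a mean-based boundary.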
Step 9: Impute Any Remaining Missing Values
df3 = df2.fillna({'Age': df2['Age'].mean()})
df3.isnull().sum()
Step 10: Recalculate Outlier Bounds and Remove Outliers from the Updated Data
mean = df3['Age'].mean()
std = df3['Age'].std()
lower_bound = mean - 2 * std
upper_bound = mean + 2 * std
print('Lower Bound :', lower_bound)
print('Upper Bound :', upper_bound)
df4 = df3[(df3['Age'] >= lower_bound) & (df3['Age'] <= upper_bound)]
Step 11: Data Validation and Verification
Separate the independent features from the target variable before modeling.
X = df4[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
Y = df4['Survived']
Step 12: Data Formatting
Min-Max Scaling: Rescales feature values to a specific range, typically 0 to 1.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
num_col_ = [col for col in X.columns if X[col].dtype != 'object']
x1 = X.copy()
x1[num_col_] = scaler.fit_transform(x1[num_col_])
x1.head()
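Min-Max scaling applies the formula x' = (x − min) / (max − min) to each column independently. A small check of that formula against the scikit-learn output (toy fare values, not the Titanic columns):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A single toy feature column.
fare = np.array([[7.25], [71.28], [8.05], [512.33]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(fare)

# The same result computed by hand with the Min-Max formula.
manual = (fare - fare.min()) / (fare.max() - fare.min())

print(np.allclose(scaled, manual))  # True
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.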
Data Cleaning Tools
Several tools can assist in data cleansing:
- OpenRefine: A tool for cleaning and transforming messy data.
- Trifacta Wrangler: Helps automate data cleaning workflows.
- TIBCO Clarity: Ensures high-quality, standardized datasets.
- Cloudingo: Focuses on deduplication and data cleansing.
- IBM InfoSphere QualityStage: Manages complex data quality tasks.
Advantages
- Improved model performance: Better data quality enhances model learning.
- Increased accuracy: Ensures data is error-free and consistent.
- Better representation: Transforms data to better reflect underlying patterns.
- Improved data quality: Enhances reliability and accuracy.
- Improved data security: Removes sensitive information.
Disadvantages
- Time-consuming: Especially for large datasets.
- Error-prone: Risk of losing important data.
- Cost and resource-intensive: Requires significant resources and expertise.
- Over-cleaning: Removing too much data can discard genuine patterns and bias results.