Understanding Exploratory Data Analysis with Python
Exploratory Data Analysis (EDA) is a crucial phase in data processing aimed at uncovering patterns, trends, and relationships using statistical tools and visualizations. Python is equipped with a range of libraries such as pandas, NumPy, matplotlib, seaborn, and plotly that facilitate effective data exploration and insight extraction, aiding in subsequent modeling and analysis. Key EDA techniques include:
- Data Inspection: Evaluate the dataset's size, structure, data types, and basic summary statistics.
- Handling Missing and Duplicate Data: Identify and resolve missing values or duplicate entries to maintain data integrity.
- Univariate Analysis: Examine a single variable to discern its distribution, trends, and outliers.
- Bivariate Analysis: Analyze the relationship between two variables.
- Multivariate Analysis: Investigate three or more variables to understand complex relationships.
Key Steps for Exploratory Data Analysis (EDA)
Step 1: Importing Required Libraries
To begin, import essential libraries in Python.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings as wr
wr.filterwarnings('ignore')
Step 2: Reading the Dataset
Load the dataset using pandas.
df = pd.read_csv("/content/WineQT.csv")
print(df.head())
Step 3: Analyzing the Data
- Dataset Dimensions: Use df.shape to determine the number of rows and columns, providing an overview of the dataset's structure.
df.shape
- Data Information: The df.info() method reports each column's non-null count and data type, along with the DataFrame's memory usage. A non-null count below the total number of entries indicates missing values in that column.
df.info()
- Statistical Summary: The df.describe().T call offers a statistical summary (count, mean, std, min, quartiles, and max) for numerical columns, transposed so that each column becomes a row.
df.describe().T
- Column Names: Convert column names to a list with df.columns.tolist() for ease of access and manipulation.
df.columns.tolist()
Step 4: Checking Missing Values
Identify missing values using df.isnull().sum() to pinpoint data gaps.
df.isnull().sum()
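Counting gaps is only the first half of handling missing data; the gaps then need to be resolved. The sketch below shows two common options, dropping incomplete rows and filling with the column median, on a small hypothetical DataFrame (the toy values are illustrative and not taken from WineQT.csv). Which option suits a real dataset depends on how much data is missing and why.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; not taken from the wine dataset.
toy = pd.DataFrame({
    "alcohol": [9.4, np.nan, 10.5, 11.2],
    "pH": [3.51, 3.20, np.nan, np.nan],
    "quality": [5, 5, 6, 7],
})

missing_counts = toy.isnull().sum()        # number of gaps per column
missing_share = toy.isnull().mean() * 100  # same information as a percentage

# Option 1: drop every row that contains at least one gap.
dropped = toy.dropna()

# Option 2: fill numeric gaps with the median of each column.
filled = toy.fillna(toy.median(numeric_only=True))

print(missing_counts)
print(missing_share)
print(len(dropped), int(filled.isnull().sum().sum()))
```

Dropping rows is simple but can discard a large share of a small dataset, while median filling keeps every row at the cost of slightly flattening the distribution.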
Step 5: Checking for Duplicate Values
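Count the distinct values in each column with df.nunique(), which shows how varied each feature is. This does not detect duplicate rows; df.duplicated().sum() counts the rows that repeat an earlier row.
df.nunique()
df.duplicated().sum()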
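As a minimal sketch of finding and removing duplicate rows, the example below uses a small hypothetical DataFrame (the values are illustrative, not from WineQT.csv). The subset argument is shown on the assumption that a dataset may carry a unique identifier column, which would otherwise make every row look distinct.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 match once the 'Id' column is ignored.
toy = pd.DataFrame({
    "Id": [0, 1, 2, 3],
    "alcohol": [9.4, 9.8, 9.4, 10.5],
    "quality": [5, 5, 5, 6],
})

# With the identifier included, no row repeats an earlier one.
n_exact = int(toy.duplicated().sum())

# Compare only the feature columns to find repeated measurements.
feature_cols = ["alcohol", "quality"]
n_feature_dupes = int(toy.duplicated(subset=feature_cols).sum())

# Keep the first occurrence of each repeated row and renumber the index.
deduped = toy.drop_duplicates(subset=feature_cols).reset_index(drop=True)

print(n_exact, n_feature_dupes, len(deduped))
```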
Step 6: Univariate Analysis
Plotting one variable at a time shows its distribution and flags unusual values:
- Bar Plot: Analyze wine counts by quality.
quality_counts = df['quality'].value_counts()
plt.figure(figsize=(8, 6))
plt.bar(quality_counts.index, quality_counts, color='deeppink')
plt.title('Count Plot of Quality')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()
- Histogram with Kernel Density Estimate: Examine the distribution and skewness of each numerical column.
sns.set_style("darkgrid")
numerical_columns = df.select_dtypes(include=["int64", "float64"]).columns
plt.figure(figsize=(14, len(numerical_columns) * 3))
for idx, feature in enumerate(numerical_columns, 1):
    plt.subplot(len(numerical_columns), 2, idx)
    sns.histplot(df[feature], kde=True)
    plt.title(f"{feature} | Skewness: {round(df[feature].skew(),2)}")
plt.tight_layout()
plt.show()
- Swarm Plot: Visualize outliers and data distribution.
plt.figure(figsize=(10, 8))
sns.swarmplot(x="quality", y="alcohol", data=df, palette='viridis')
plt.title('Swarm Plot for Quality and Alcohol')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()
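The skewness values printed in the histogram titles can also be tabulated directly, which makes it easier to compare columns side by side. The sketch below does this on two hypothetical toy columns (not taken from WineQT.csv): a value near 0 suggests a roughly symmetric distribution, and a positive value suggests a longer right tail.

```python
import pandas as pd

# Hypothetical toy columns chosen to contrast a symmetric and a skewed shape.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 1.0, 2.0, 10.0],
})

# DataFrame.skew() returns one sample-skewness value per numeric column.
skewness = toy.skew()

# Rank columns by how far they sit from symmetry (0 means symmetric).
ranked = skewness.abs().sort_values(ascending=False)

print(skewness.round(2))
print(ranked.index.tolist())
```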
Step 7: Bivariate Analysis
Explore two variables together to identify interactions:
- Pair Plot: Displays distributions and relationships between pairs of variables.
sns.set_palette("Pastel1")
# pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(df)
plt.suptitle('Pair Plot for DataFrame', y=1.02)
plt.show()
- Violin Plot: Examine the relationship between alcohol content and wine quality.
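# Convert quality to strings on a copy so that df['quality'] stays numeric for later steps
plot_df = df.assign(quality=df['quality'].astype(str))
plt.figure(figsize=(10, 8))
sns.violinplot(x="quality", y="alcohol", data=plot_df, order=sorted(plot_df['quality'].unique()), palette={
    '3': 'lightcoral', '4': 'lightblue', '5': 'lightgreen', '6': 'gold', '7': 'lightskyblue', '8': 'lightpink'}, alpha=0.7)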
plt.title('Violin Plot for Quality and Alcohol')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()
- Box Plot: Compare the spread of alcohol content across quality levels and spot outliers.
sns.boxplot(x='quality', y='alcohol', data=df)
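The violin and box plots above can be backed by numbers. The sketch below, on a small hypothetical DataFrame (values are illustrative, not from WineQT.csv), computes a per-quality summary of alcohol and flags outliers with the 1.5 × IQR rule, the same multiplier seaborn's box plot uses by default for its whiskers. The helper name iqr_outliers is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 6, 6],
    "alcohol": [9.4, 9.5, 9.8, 10.0, 10.2, 10.5, 14.9],
})

# Per-group summary: the numbers behind each box or violin.
by_quality = toy.groupby("quality")["alcohol"].agg(["count", "median", "min", "max"])


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Applied to the whole column here; a box plot applies the rule per group.
outliers = iqr_outliers(toy["alcohol"])

print(by_quality)
print(outliers.tolist())
```

To mirror the box plot more closely, the same helper could be applied within each quality group, for example via groupby.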
Step 8: Multivariate Analysis
Assess interactions among multiple variables using a correlation matrix plot.
plt.figure(figsize=(15, 10))
# numeric_only=True restricts the correlation matrix to numeric columns
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='Pastel2', linewidths=2)
plt.title('Correlation Heatmap')
plt.show()
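A heatmap of many columns can be hard to scan, so it may help to pull out the correlations with a single column of interest and rank them by strength. The sketch below does this on a hypothetical toy DataFrame; the column names and values are assumptions for illustration, not taken from WineQT.csv.

```python
import pandas as pd

# Hypothetical toy data with a target-like 'quality' column.
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0, 13.0],
    "volatile_acidity": [0.9, 0.7, 0.6, 0.4, 0.3],
    "pH": [3.3, 3.1, 3.4, 3.2, 3.3],
    "quality": [4, 5, 6, 7, 8],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr(numeric_only=True)

# Correlation of each feature with 'quality', strongest first by magnitude.
target_corr = corr["quality"].drop("quality").sort_values(key=abs, ascending=False)

print(target_corr.round(2))
```

Pearson correlation only captures linear association between pairs of columns, so a low value does not rule out a nonlinear relationship.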
Understanding the insights derived from EDA paves the way for more advanced modeling techniques.