Exploratory Data Analysis (EDA)
EDA helps us to gain a deeper understanding of the data, identify patterns and trends, and generate hypotheses.
As a Data Scientist, I believe that Exploratory Data Analysis (EDA) is a crucial step in any data analysis project. EDA helps us to gain a deeper understanding of the data, identify patterns and trends, and generate hypotheses that can be tested through subsequent analyses.
Initial Data Check and Setup
Try to understand the data at a high level. Speak to leadership and product to try to gain as much context as possible to help inform where to focus your efforts.
If your dataset is too big, take a subsample of it. For instance, computing a correlation matrix on a large dataset can take quite a bit of time.
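As a rough illustration of subsampling, here is a minimal sketch using pandas. The DataFrame, its column names, and the 10 percent sampling fraction are all assumptions made up for this example; pick a fraction that suits your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset standing in for a large table.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "orders": rng.poisson(5, size=100_000),
    "revenue": rng.normal(100.0, 20.0, size=100_000),
})

# Draw a reproducible 10% random sample (without replacement).
sample = df.sample(frac=0.1, random_state=42)

# Run the expensive computation on the sample instead of the full data.
corr_sample = sample.corr()
```

Fixing random_state keeps the sample reproducible, so the numbers you see during EDA do not change from run to run.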
What is the unique identifier of each row in the data? - A unique identifier can be a column or set of columns that is guaranteed to be unique across rows in your dataset. This is key for distinguishing rows and referencing them in our EDA.
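One way to check whether a column, or a set of columns, really is a unique identifier is to look for duplicated rows on those columns. The sketch below assumes a small made-up table; the column names and the helper is_unique_key are hypothetical.

```python
import pandas as pd

# Hypothetical data: one row per shop per day.
df = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 3],
    "date": ["2023-01-01", "2023-01-02", "2023-01-01", "2023-01-02", "2023-01-01"],
    "orders": [10, 12, 7, 9, 4],
})

def is_unique_key(frame, columns):
    """Return True if no two rows share the same values in `columns`."""
    return not frame.duplicated(subset=columns).any()

# shop_id alone repeats across days, so it is not a unique identifier.
shop_id_is_key = is_unique_key(df, ["shop_id"])

# The pair (shop_id, date) identifies each row uniquely.
composite_is_key = is_unique_key(df, ["shop_id", "date"])
```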
Working on Dataset
The following are some generic steps involved in conducting EDA:
Check for missing data
Checking your data for missing values is usually a good place to start. For this analysis, and future analyses, I suggest analyzing features one at a time and ranking them with respect to your specific analysis.
Ex - Count the number of missing values for each feature, and then rank the features from the largest number of missing values to the smallest.
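The ranking described above could be sketched with pandas as follows. The table and its column names are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with some missing values.
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41, np.nan],
    "income": [np.nan, np.nan, np.nan, 52000, np.nan],
    "city": ["Ottawa", "Toronto", "Montreal", "Toronto", "Ottawa"],
})

# Count and percentage of missing values per feature,
# ranked from most missing to least missing.
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)
```

Looking at the percentage alongside the raw count makes it easier to compare features when you later work on a subsample of a different size.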
Suppose a feature has 70 percent of its values missing. With such a high amount of missing data, some may suggest simply removing this feature from the data.
Instead, we should first try to understand what the feature represents and why we’re seeing this behavior.
Ex - After further analysis, we may discover that this feature represents a response to a survey question, and in most cases, it was left blank. A possible hypothesis is that a large proportion of the population didn’t feel comfortable providing an answer. If we simply remove this feature, we introduce bias into our data. Therefore, this missing data is a feature in its own right and should be treated as such.
For each feature, try to understand why the data is missing and what it can mean.
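If you decide that missingness carries information, one possible way to keep it is to record it explicitly instead of dropping the feature. This is only a sketch: the column names and the "no_response" label are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical survey column where most respondents left the answer blank.
df = pd.DataFrame({"survey_answer": ["yes", np.nan, np.nan, "no", np.nan]})

# Keep the missingness as an explicit indicator feature.
df["survey_answer_missing"] = df["survey_answer"].isna()

# Optionally treat "no response" as its own category.
df["survey_answer_filled"] = df["survey_answer"].fillna("no_response")
```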
Provide Basic Descriptions of Your Sample and Features
Categorize your features:
Continuous - A continuous feature can assume an infinite number of values in a given range.
Discrete - A discrete feature can assume a countable number of values and is always numeric.
Categorical - A categorical feature can only assume a finite number of values.
For Continuous features,
- record any characteristics you feel are important, such as max and min.
For Categorical features, record
- the number of unique values
- the number of occurrences of each value
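The descriptions listed above can be collected in a few lines of pandas. The data here is made up, and which statistics matter will depend on your analysis.

```python
import pandas as pd

# Hypothetical data with one continuous and one categorical feature.
df = pd.DataFrame({
    "revenue": [120.5, 80.0, 310.2, 95.4, 150.0],
    "plan": ["basic", "pro", "basic", "basic", "pro"],
})

# Continuous feature: count, mean, std, min, quartiles, and max.
continuous_summary = df["revenue"].describe()

# Categorical feature: number of unique values and occurrences of each.
n_unique = df["plan"].nunique()
value_counts = df["plan"].value_counts()
```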
Identify the shape of your data
Check the distribution of your data, and how it can change over time.
Ex - Time Series Data Trend.
Calculate the mean and variance of each feature.
Does the feature hardly change at all?
Is it constantly changing?
Try to hypothesize about the behavior you see.
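A quick way to answer these questions is to compute the mean and variance of every numeric feature and flag the ones that barely change. In this sketch the data is invented and the variance threshold of 1e-8 is an arbitrary assumption; choose one that fits the scale of your features.

```python
import pandas as pd

# Hypothetical data: one feature that never changes, one that changes a lot.
df = pd.DataFrame({
    "constant_flag": [1, 1, 1, 1, 1, 1],
    "daily_orders": [10, 50, 5, 80, 20, 95],
})

# One row per feature, with its mean and (sample) variance.
stats = df.agg(["mean", "var"]).T

# Features whose variance is essentially zero hardly change at all.
near_constant = stats.index[stats["var"] < 1e-8].tolist()
```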
A feature that has a very low or very high variance may require additional investigation. Probability Density Functions (PDFs) and Probability Mass Functions (PMFs) are your friends. To understand the shape of your features, PMFs are used for discrete features and PDFs for continuous features.
Here are a few things that PMFs and PDFs can tell you about your data:
Skewness
Is the feature heterogeneous (multimodal)?
If the PDF has a gap in it, the feature may be disconnected.
Is it bounded?
- Skewness measures the asymmetry of your data. A skewed distribution might deter us from using the mean as a measure of central tendency. The median is more robust, but it comes with an additional computational cost.
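To make the ideas above concrete, here is a minimal sketch that estimates a PMF for a discrete feature, approximates a PDF for a continuous feature with a normalized histogram, and measures skewness. The simulated features, their names, and the choice of 30 bins are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical discrete feature (items per order) and
# hypothetical right-skewed continuous feature (order value).
items_per_order = pd.Series(rng.poisson(2, size=1_000))
order_value = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1_000))

# Empirical PMF: relative frequency of each distinct value.
pmf = items_per_order.value_counts(normalize=True).sort_index()

# Histogram-based approximation of the PDF (bar areas sum to 1).
density, bin_edges = np.histogram(order_value, bins=30, density=True)

# Sample skewness; a positive value indicates a longer right tail.
skewness = order_value.skew()

# For right-skewed data the mean is pulled above the median.
mean_value = order_value.mean()
median_value = order_value.median()
```

Comparing the mean with the median is a cheap first check for skew before plotting anything.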
Identify Significant Correlations
The methodology outlined below only applies to continuous and discrete features.
Correlation measures the strength and direction of the relationship between two features.
Ex - Delivered Orders and Fulfilled Orders. The easiest way to visualize correlation is by plotting a scatter plot with Delivered Orders on the y-axis and Fulfilled Orders on the x-axis. As expected, there’s a positive relationship between these two features.
If there is a large number of features in your dataset, you can use the Pearson correlation matrix. It measures the linear correlation between features in your dataset and assigns a value between -1 and 1 to each feature pair. A positive value indicates a positive relationship and a negative value indicates a negative relationship.
We can try to form hypotheses around why features might be correlated with each other.
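The Pearson correlation matrix can be computed directly with pandas. The sketch below uses simulated data, with hypothetical columns modeled loosely on the Delivered Orders and Fulfilled Orders example; the numbers themselves mean nothing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: delivered orders track fulfilled orders closely,
# while the third feature is unrelated noise.
fulfilled = rng.poisson(50, size=500)
delivered = fulfilled - rng.binomial(5, 0.5, size=500)
df = pd.DataFrame({
    "fulfilled_orders": fulfilled,
    "delivered_orders": delivered,
    "unrelated": rng.normal(size=500),
})

# Pearson correlation matrix: one value in [-1, 1] per feature pair.
corr = df.corr(method="pearson")

# Keep each pair once (upper triangle) and rank by absolute correlation.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
```

Ranking the pairs by absolute value surfaces strong negative relationships as well as strong positive ones.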
For time series data, we can also look at autocorrelation. Computing the autocorrelation reveals the relationship between the signal’s current value and its previous values.
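As a small sketch of this, pandas can compute the autocorrelation of a series at a given lag. The daily series below is simulated with a weekly cycle purely for illustration; the series name and the range of lags are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical daily signal with a 7-day cycle plus a little noise.
days = np.arange(140)
rng = np.random.default_rng(0)
daily_orders = pd.Series(
    100 + 20 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 2, size=days.size)
)

# Correlation between the series and a copy of itself shifted by `lag` days.
autocorr_by_lag = {lag: daily_orders.autocorr(lag=lag) for lag in range(1, 15)}
```

With a weekly pattern like this one, the autocorrelation peaks at lags 7 and 14, which is the kind of structure worth forming a hypothesis about.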
Compute the correlation between categorical features
Pearson chi-square test: This involves taking pairs of categorical features and computing their contingency table. Each cell in the contingency table shows the frequency of observations for one combination of values. In the Pearson chi-square test, the null hypothesis is that the categorical variables in question are independent and, therefore, not related.
Ex - We compute the contingency table for two categorical features from our dataset: Shop Cohort and Plan. After that, we perform a hypothesis test using the chi-square distribution with a specified alpha level of significance. We then determine whether the categorical features are independent or dependent.
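Here is a minimal sketch of that procedure using pandas and SciPy. The counts are invented, the column names only echo the Shop Cohort and Plan example, and alpha = 0.05 is an assumed significance level. Note that SciPy applies Yates' continuity correction by default for 2x2 tables.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: two categorical features for 80 shops.
df = pd.DataFrame({
    "shop_cohort": ["2021"] * 40 + ["2022"] * 40,
    "plan": ["basic"] * 30 + ["pro"] * 10 + ["basic"] * 10 + ["pro"] * 30,
})

# Contingency table: frequency of each (cohort, plan) combination.
contingency = pd.crosstab(df["shop_cohort"], df["plan"])

# Chi-square test of independence.
chi2, p_value, dof, expected = chi2_contingency(contingency)

# Reject the null hypothesis of independence if p_value is below alpha.
alpha = 0.05  # assumed significance level
reject_independence = p_value < alpha
```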
Spot outliers in the Dataset
Outliers are significantly different from other samples in your dataset and can lead to major problems when performing statistical tasks.
The box plot visualization is extremely useful for identifying outliers: it highlights data points that are distant from the majority of the data.
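A box plot typically draws its whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, and the same rule can be applied numerically. The sketch below assumes that convention and uses made-up values; the 1.5 multiplier is a common default, not a universal rule.

```python
import pandas as pd

# Hypothetical feature with one extreme value.
order_values = pd.Series([52, 48, 50, 47, 53, 49, 51, 50, 400])

# Quartiles and interquartile range.
q1 = order_values.quantile(0.25)
q3 = order_values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = order_values[(order_values < lower) | (order_values > upper)]
```

Flagged points still deserve a look before removal, since an outlier may be a data error or a genuinely unusual observation.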
The steps above are not an exhaustive list of tasks to follow, but they give you a good start on understanding and cleaning your data.