Exploratory Data Analysis
As Wikipedia defines it:
In statistics, exploratory data analysis is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
In this session we will briefly look at some ways to explore a given dataset.
Problem Statement
Bike-sharing systems let people rent a bicycle at one location and return it at a different one. The data here comes from the bike-sharing system in Washington, D.C.
You are provided with rental data spanning two years. Explore ways to plot meaningful graphs and gain insight from the provided dataset.
Here is the link to the provided dataset on Kaggle.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('datasets/train_bikes.csv', parse_dates=['datetime'])
data.head()
 | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |
Column Descriptions:
- datetime - hourly date + timestamp
- season
- 1 = spring, 2 = summer, 3 = fall, 4 = winter
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor holiday
- weather
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - “feels like” temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals
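Since count is described as the total number of rentals, one quick sanity check is that casual + registered equals count on every row. The following is a minimal sketch using only the five rows shown in the data.head() output above; the names sample and mismatch are illustrative and not part of the original notebook.

```python
import pandas as pd

# First five rows of the three rental columns, copied from the data.head() output.
sample = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "count": [16, 40, 32, 13, 1],
})

# Assumption being tested: count == casual + registered on every row.
mismatch = sample[sample["casual"] + sample["registered"] != sample["count"]]
print("rows where casual + registered != count:", len(mismatch))
```

The same comparison can be run on the full DataFrame by replacing sample with data; whether it holds for all rows is something to verify rather than assume.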
We can call the plot accessor directly on a pandas DataFrame to draw graphs. For example, we can plot the number of rented bikes against other columns such as season, holiday, etc. by calling plot.scatter() and specifying the x and y columns:
data.plot.scatter(x = 'season', y = 'count')
<AxesSubplot:xlabel='season', ylabel='count'>
data.plot.scatter(x = 'holiday', y = 'count')
<AxesSubplot:xlabel='holiday', ylabel='count'>
data.plot.scatter(x = 'workingday', y = 'count')
<AxesSubplot:xlabel='workingday', ylabel='count'>
data.plot.scatter(x = 'weather', y = 'count')
<AxesSubplot:xlabel='weather', ylabel='count'>
data.plot.scatter(x = 'temp', y = 'count')
<AxesSubplot:xlabel='temp', ylabel='count'>
data.plot.scatter(x = 'atemp', y = 'count')
<AxesSubplot:xlabel='atemp', ylabel='count'>
data.plot.scatter(x = 'humidity', y = 'count')
<AxesSubplot:xlabel='humidity', ylabel='count'>
data.plot.scatter(x = 'windspeed', y = 'count')
<AxesSubplot:xlabel='windspeed', ylabel='count'>
data.plot.scatter(x = 'casual', y = 'count')
<AxesSubplot:xlabel='casual', ylabel='count'>
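The repeated plot.scatter() calls above could also be written as a loop that draws one panel per column. This is only a sketch: the toy DataFrame below contains made-up values standing in for the real data, and the names toy, features, fig and axes are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the bike data (values are made up for illustration only).
toy = pd.DataFrame({
    "season": [1, 2, 3, 4, 1, 2],
    "temp": [9.8, 18.0, 28.7, 14.2, 8.2, 20.5],
    "humidity": [81, 60, 55, 70, 75, 62],
    "count": [16, 150, 320, 90, 13, 180],
})

# Every column except the target becomes one scatter panel.
features = [c for c in toy.columns if c != "count"]
fig, axes = plt.subplots(1, len(features), figsize=(4 * len(features), 3),
                         sharey=True, squeeze=False)
for ax, col in zip(axes.ravel(), features):
    toy.plot.scatter(x=col, y="count", ax=ax, title=col)
fig.tight_layout()
```

With the real data, replacing toy with data (and dropping the datetime column from features) should give the same panels side by side, which makes it easier to compare them at a glance.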
You can call info() on a DataFrame object to get some basic information about the data, such as the column types and non-null counts.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10886 non-null datetime64[ns]
1 season 10886 non-null int64
2 holiday 10886 non-null int64
3 workingday 10886 non-null int64
4 weather 10886 non-null int64
5 temp 10886 non-null float64
6 atemp 10886 non-null float64
7 humidity 10886 non-null int64
8 windspeed 10886 non-null float64
9 casual 10886 non-null int64
10 registered 10886 non-null int64
11 count 10886 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB
The describe() function generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution:
data.describe()
 | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.00000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 |
mean | 2.506614 | 0.028569 | 0.680875 | 1.418427 | 20.23086 | 23.655084 | 61.886460 | 12.799395 | 36.021955 | 155.552177 | 191.574132 |
std | 1.116174 | 0.166599 | 0.466159 | 0.633839 | 7.79159 | 8.474601 | 19.245033 | 8.164537 | 49.960477 | 151.039033 | 181.144454 |
min | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.82000 | 0.760000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 13.94000 | 16.665000 | 47.000000 | 7.001500 | 4.000000 | 36.000000 | 42.000000 |
50% | 3.000000 | 0.000000 | 1.000000 | 1.000000 | 20.50000 | 24.240000 | 62.000000 | 12.998000 | 17.000000 | 118.000000 | 145.000000 |
75% | 4.000000 | 0.000000 | 1.000000 | 2.000000 | 26.24000 | 31.060000 | 77.000000 | 16.997900 | 49.000000 | 222.000000 | 284.000000 |
max | 4.000000 | 1.000000 | 1.000000 | 4.000000 | 41.00000 | 45.455000 | 100.000000 | 56.996900 | 367.000000 | 886.000000 | 977.000000 |
pandas_profiling is a helpful module for generating detailed reports about a dataset. It has a very simple API: you can simply call profile_report() on the DataFrame.
This module does not ship with the pandas package; you can install it with: pip install -U pandas-profiling[notebook]
import pandas_profiling
data.profile_report()
print("count samples & features: ", data.shape)
print("Are there missing values: ", data.isnull().values.any())
count samples & features: (10886, 12)
Are there missing values: False
Now suppose we want to plot the number of rented bikes in a year against the hour of the day, grouped by whether it is a working day or not.
First, we add a new column holding the hour of each entry. Next, we group the data by this new column and the workingday column.
To keep only the number of rented bikes, we select the count column and aggregate the groups by summation. Finally, to present the resulting Series more clearly, we call the unstack() method, which returns a DataFrame object.
def plot_by_hour(data, year=None, agg='sum'):
    dd = data.copy()  # work on a copy so the caller's DataFrame is not modified
    if year:
        dd = dd[dd.datetime.dt.year == year].copy()  # keep only the requested year
    dd['hour'] = dd.datetime.dt.hour  # extract the hour of each entry
    by_hour = dd.groupby(['hour', 'workingday'])['count'].agg(agg).unstack()  # group by hour and working day
    return by_hour.plot(kind='bar', ylim=(0, 80000), figsize=(15,5), width=0.9, title=f"Year = {year}")

plot_by_hour(data, year=2011)
plot_by_hour(data, year=2012)
<AxesSubplot:title={'center':'Year = 2012'}, xlabel='hour'>
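To make the groupby and unstack() step more concrete, here is a small sketch on a toy DataFrame. The values are made up for illustration, and the names toy, by_hour and wide are not part of the original notebook.

```python
import pandas as pd

# Toy data: two hours, working and non-working days (made-up counts).
toy = pd.DataFrame({
    "hour": [0, 0, 0, 1, 1, 1],
    "workingday": [0, 1, 1, 0, 0, 1],
    "count": [5, 10, 20, 7, 3, 30],
})

# Step 1: group by both keys and sum the counts.
# The result is a Series with a two-level (hour, workingday) index.
by_hour = toy.groupby(["hour", "workingday"])["count"].sum()

# Step 2: unstack() moves the inner index level (workingday) into columns,
# giving one row per hour and one column per workingday value.
wide = by_hour.unstack()
print(wide)
```

Calling plot(kind='bar') on a frame shaped like wide draws one group of bars per row and one bar per column, which is why the plots above show two bars for each hour.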
With the same logic, we can create count plots based on each month or each hour of these two years.
def plot_by_year(agg_attr, title):
    dd = data.copy()
    dd['year'] = data.datetime.dt.year
    dd['month'] = data.datetime.dt.month
    dd['hour'] = data.datetime.dt.hour
    by_year = dd.groupby([agg_attr, 'year'])['count'].agg('sum').unstack()
    return by_year.plot(kind='bar', figsize=(15,5), width=0.9, title=title)

plot_by_year('month', "Rent bikes per month in 2011 and 2012")
plot_by_year('hour', "Rent bikes per hour in 2011 and 2012")
<AxesSubplot:title={'center':'Rent bikes per hour in 2011 and 2012'}, xlabel='hour'>
It’s also possible to use the boxplot function from the matplotlib library to draw box plots of the same hourly counts that we looked at before:
def plot_hours(data, message = ''):
    dd = data.copy()
    dd['hour'] = data.datetime.dt.hour
    hours = {}
    for hour in range(24):
        hours[hour] = dd[ dd.hour == hour ]['count'].values  # all counts recorded at this hour
    plt.figure(figsize=(20,10))
    plt.ylabel("Count rent")
    plt.xlabel("Hours")
    plt.title("count vs hours\n" + message)
    # positions=range(24) puts each box at its hour (0-23) instead of the default 1-24
    plt.boxplot( [hours[hour] for hour in range(24)], positions=range(24) )
    axis = plt.gca()
    axis.set_ylim([1, 1100])

plot_hours(data[data.datetime.dt.year == 2011], 'year 2011')
plot_hours(data[data.datetime.dt.year == 2012], 'year 2012')
data['hour'] = data.datetime.dt.hour
plot_hours(data[data.workingday == 1], 'working day') # hourly count of rented bikes on working days (both years)
plot_hours(data[data.workingday == 0], 'non working day') # hourly count of rented bikes on non-working days (both years)
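A box plot is drawn from the quartiles of each group, so the same information can also be inspected numerically. The sketch below computes per-hour quartiles on a toy DataFrame; the values are made up and the names toy and quartiles are illustrative.

```python
import pandas as pd

# Toy data: four made-up observations for each of two hours.
toy = pd.DataFrame({
    "hour": [0, 0, 0, 0, 1, 1, 1, 1],
    "count": [1, 2, 3, 4, 10, 20, 30, 40],
})

# Lower quartile, median and upper quartile of count for each hour.
# unstack() turns the quantile level into columns 0.25, 0.5 and 0.75.
quartiles = toy.groupby("hour")["count"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

On the real data, the 0.5 column of such a table corresponds to the line inside each box, and the 0.25 and 0.75 columns to the box edges.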
If we want less detail than the 24-hour plot, we can group the hour column into bins, e.g. 4 bins. The sklearn library has a class for this, KBinsDiscretizer, but here we will define a custom function that does a similar job:
def categorical_to_numeric(x):
    # map an hour of the day (0-23) to one of four bins
    if 0 <= x < 6:
        return 0
    elif 6 <= x < 13:
        return 1
    elif 13 <= x < 19:
        return 2
    elif 19 <= x < 24:
        return 3

data['hour'] = data['hour'].apply(categorical_to_numeric)  # replace each hour with its bin index (0-3)
data.head()
 | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 0 |
1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 0 |
2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 0 |
3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | 0 |
4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | 0 |
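As an alternative to the hand-written if/elif function, pandas offers pd.cut for binning with explicit edges. The sketch below is not the notebook's method; it shows how the same four ranges could be expressed, with hours and binned as illustrative names.

```python
import pandas as pd

# All possible hour values, 0 through 23.
hours = pd.Series(range(24), name="hour")

# right=False makes the bins left-closed: [0, 6), [6, 13), [13, 19), [19, 24),
# which mirrors the ranges used in the if/elif function.
# labels=False returns the integer bin index instead of an interval label.
binned = pd.cut(hours, bins=[0, 6, 13, 19, 24], right=False, labels=False)

print(binned.value_counts().sort_index())
```

One design difference worth noting: pd.cut takes the bin edges as data, so changing the number or width of the bins only requires editing the list of edges.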
# drop unnecessary columns
data = data.drop(['datetime'], axis=1)
Now we can group the data by the new binned hour column and plot the data:
figure,axes = plt.subplots(figsize = (10, 5))
hours = data.groupby(["hour"]).agg("mean")["count"]
hours.plot(kind="line", ax=axes)
plt.title('Hours VS Counts')
axes.set_xlabel('Time in Hours')
axes.set_ylabel('Average of the Bike Demand')
plt.show()
We can also plot the number of rented bikes against the temperature.
As we saw earlier, the atemp and temp columns are highly correlated, so we’ll plot just one of them:
# average count for each distinct temp value
a = data.groupby('temp')[['count']].mean()
a.plot()
plt.show()
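The correlation between temp and atemp mentioned above can also be checked directly with Series.corr(). The sketch below uses made-up temperature pairs rather than the real data, so the printed number is only illustrative; toy and corr are hypothetical names.

```python
import pandas as pd

# Made-up temperature / "feels like" pairs, for illustration only.
toy = pd.DataFrame({
    "temp": [5.0, 10.0, 15.0, 20.0, 25.0, 30.0],
    "atemp": [7.5, 12.0, 18.5, 24.0, 29.5, 33.0],
})

# Pearson correlation coefficient between the two columns (the default method).
corr = toy["temp"].corr(toy["atemp"])
print(f"Pearson correlation between temp and atemp: {corr:.3f}")
```

Running data["temp"].corr(data["atemp"]) on the DataFrame before the columns are dropped would give the value for the real dataset.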