Exploratory Data Analysis

As Wikipedia defines it:

In statistics, exploratory data analysis is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

In this session we will briefly look at some ways to explore a given dataset.

Problem Statement

Bike-sharing systems let people rent a bicycle at one location and return it at another. This dataset comes from such a system in Washington, DC. You are provided with hourly rental data spanning two years. Explore ways to plot meaningful graphs and gain insight from the provided dataset.
The dataset is provided on Kaggle.

import pandas as pd
import numpy as np


import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data = pd.read_csv('datasets/train_bikes.csv', parse_dates=['datetime'])

data.head()

	datetime	season	weather	temp	atemp	humidity	casual	registered	count
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32
3	2011-01-01 03:00:00	1	1	9.84	14.395	75	3	10	13
4	2011-01-01 04:00:00	1	1	9.84	14.395	75	0	1	1

Column Descriptions:

datetime - hourly date + timestamp
season
- 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - “feels like” temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals
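
The integer codes in season and weather are easier to read in tables and plots once they carry their labels. A minimal sketch, assuming a small hypothetical frame in place of the real dataset (the names SEASON_LABELS, sample and season_name are illustrative, not part of the original notebook):

```python
import pandas as pd

# Code-to-label mapping taken from the column descriptions above.
SEASON_LABELS = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}

# A tiny hypothetical frame standing in for the real dataset.
sample = pd.DataFrame({"season": [1, 2, 3, 4, 1], "count": [16, 40, 32, 13, 1]})

# map() replaces each integer code with its label; an ordered categorical
# keeps the seasons in the order of their codes when grouping or plotting.
sample["season_name"] = pd.Categorical(
    sample["season"].map(SEASON_LABELS),
    categories=list(SEASON_LABELS.values()),
    ordered=True,
)

totals = sample.groupby("season_name", observed=True)["count"].sum()
print(totals)
```

The same pattern would apply to the weather column with its own dictionary of labels.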

We can call the plot accessor directly on a pandas DataFrame to draw graphs. For example, we can plot the number of rented bikes against other columns such as season or holiday by calling plot.scatter() and specifying the x and y columns:

data.plot.scatter(x = 'season', y = 'count')

<AxesSubplot:xlabel='season', ylabel='count'>


data.plot.scatter(x = 'holiday', y = 'count')

<AxesSubplot:xlabel='holiday', ylabel='count'>


data.plot.scatter(x = 'workingday', y = 'count')

<AxesSubplot:xlabel='workingday', ylabel='count'>


data.plot.scatter(x = 'weather', y = 'count')

<AxesSubplot:xlabel='weather', ylabel='count'>


data.plot.scatter(x = 'temp', y = 'count')

<AxesSubplot:xlabel='temp', ylabel='count'>


data.plot.scatter(x = 'atemp', y = 'count')

<AxesSubplot:xlabel='atemp', ylabel='count'>


data.plot.scatter(x = 'humidity', y = 'count')

<AxesSubplot:xlabel='humidity', ylabel='count'>


data.plot.scatter(x = 'windspeed', y = 'count')

<AxesSubplot:xlabel='windspeed', ylabel='count'>


data.plot.scatter(x = 'casual', y = 'count')

<AxesSubplot:xlabel='casual', ylabel='count'>
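
Calling plot.scatter() once per column produces one figure per call. A possible alternative is to draw several columns into a single grid of subplots so they can be compared at a glance. The sketch below uses synthetic data with the same column names (all values are made up for illustration), and the demo and features names are assumptions of this example:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the bike data (hypothetical values, same column names).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "season": rng.integers(1, 5, n),
    "weather": rng.integers(1, 5, n),
    "temp": rng.uniform(0, 41, n),
    "humidity": rng.integers(0, 101, n),
    "count": rng.integers(1, 978, n),
})

features = ["season", "weather", "temp", "humidity"]

# One grid with a shared y axis instead of a separate figure per column.
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharey=True)
for ax, col in zip(axes.ravel(), features):
    demo.plot.scatter(x=col, y="count", ax=ax, s=5, alpha=0.5)
    ax.set_title(f"count vs {col}")
fig.tight_layout()
```

With the real data, replacing demo by data and extending features to the remaining columns should give the same set of plots as the individual calls above.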


You can call info() on a DataFrame object to get basic information about the data, such as the column types and the number of non-null values.

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB

The describe() function generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution:

data.describe()

	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	count
count	10886.000000	10886.000000	10886.000000	10886.000000	10886.00000	10886.000000	10886.000000	10886.000000	10886.000000	10886.000000	10886.000000
mean	2.506614	0.028569	0.680875	1.418427	20.23086	23.655084	61.886460	12.799395	36.021955	155.552177	191.574132
std	1.116174	0.166599	0.466159	0.633839	7.79159	8.474601	19.245033	8.164537	49.960477	151.039033	181.144454
min	1.000000	0.000000	0.000000	1.000000	0.82000	0.760000	0.000000	0.000000	0.000000	0.000000	1.000000
25%	2.000000	0.000000	0.000000	1.000000	13.94000	16.665000	47.000000	7.001500	4.000000	36.000000	42.000000
50%	3.000000	0.000000	1.000000	1.000000	20.50000	24.240000	62.000000	12.998000	17.000000	118.000000	145.000000
75%	4.000000	0.000000	1.000000	2.000000	26.24000	31.060000	77.000000	16.997900	49.000000	222.000000	284.000000
max	4.000000	1.000000	1.000000	4.000000	41.00000	45.455000	100.000000	56.996900	367.000000	886.000000	977.000000
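
The means above are consistent with count being the sum of the two user groups: 36.02 (casual) + 155.55 (registered) = 191.57 (count). A quick sketch of how that relation could be checked row by row, here on the five rows printed by data.head() earlier (the head and is_consistent names are illustrative):

```python
import pandas as pd

# casual, registered and count from the first five rows shown by data.head().
head = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "count": [16, 40, 32, 13, 1],
})

# count should equal casual + registered on every row.
is_consistent = (head["casual"] + head["registered"] == head["count"]).all()
print("casual + registered == count:", is_consistent)
```

On the full dataset the same expression would be evaluated on data instead of head.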

pandas_profiling is a helpful module for generating detailed reports about a dataset. Its API is very simple: importing the module adds a profile_report() method to DataFrame objects.
This module is not installed with pandas by default; you can install it with: pip install -U pandas-profiling[notebook] (newer releases of the package are published under the name ydata-profiling).

import pandas_profiling

data.profile_report()


print("count samples & features: ", data.shape)
print("Are there missing values: ", data.isnull().values.any())

count samples & features:  (10886, 12)
Are there missing values:  False
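
isnull().values.any() only answers yes or no. When the answer is yes, a per-column count is usually the next question. A small sketch on a hypothetical frame with two gaps (the real dataset has none, so demo and its values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing values, since the real data has none.
demo = pd.DataFrame({
    "temp": [9.84, np.nan, 9.02],
    "humidity": [81, 80, np.nan],
    "count": [16, 40, 32],
})

# Number of missing values in each column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Fully duplicated rows are another cheap sanity check.
print("duplicate rows:", demo.duplicated().sum())
```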

Now suppose we want to plot the number of rented bikes in a year against the hour of the day, grouped by whether it’s a working day or not.
First we add a new column holding the hour of each entry, then we group the data by that new column and the workingday column.
To keep only the number of rented bikes, we select the count column and aggregate each group by summation. Finally, because the result is a Series with a two-level index, we call unstack() to turn it into a DataFrame with one column per workingday value.

def plot_by_hour(data, year=None, agg='sum'):
    dd = data
    if year is not None:
        dd = dd[dd.datetime.dt.year == year] # keep only the rows of the requested year
    dd = dd.assign(hour=dd.datetime.dt.hour) # assign() returns a new frame, which avoids SettingWithCopyWarning
    
    by_hour = dd.groupby(['hour', 'workingday'])['count'].agg(agg).unstack() # group by hour and working day
    return by_hour.plot(kind='bar', ylim=(0, 80000), figsize=(15,5), width=0.9, title=f"Year = {year}")

plot_by_hour(data, year=2011)
plot_by_hour(data, year=2012)





<AxesSubplot:title={'center':'Year = 2012'}, xlabel='hour'>
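
To see what the groupby and unstack() steps do to the shape of the data, here is a minimal sketch on a handful of hypothetical records (the values in demo are made up for illustration):

```python
import pandas as pd

# Hypothetical hourly records: two hours, on working and non-working days.
demo = pd.DataFrame({
    "hour": [8, 8, 8, 17, 17, 17],
    "workingday": [1, 1, 0, 1, 0, 0],
    "count": [100, 120, 30, 150, 60, 40],
})

# Grouping on two keys gives a Series with a (hour, workingday) MultiIndex.
summed = demo.groupby(["hour", "workingday"])["count"].sum()

# unstack() moves the inner level (workingday) into the columns,
# leaving one row per hour and one column per workingday value.
by_hour = summed.unstack()
print(by_hour)
```

Each column of the resulting frame becomes one series of bars when plot(kind='bar') is called on it.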


With the same logic, we can plot the total count per month or per hour of the day, with the two years side by side.

def plot_by_year(agg_attr, title):
    dd = data.copy()
    dd['year'] = data.datetime.dt.year 
    dd['month'] = data.datetime.dt.month 
    dd['hour'] = data.datetime.dt.hour
    
    by_year = dd.groupby([agg_attr, 'year'])['count'].agg('sum').unstack() 
    return by_year.plot(kind='bar', figsize=(15,5), width=0.9, title=title)

plot_by_year('month', "Rent bikes per month in 2011 and 2012")
plot_by_year('hour', "Rent bikes per hour in 2011 and 2012")

<AxesSubplot:title={'center':'Rent bikes per hour in 2011 and 2012'}, xlabel='hour'>
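
The groupby followed by unstack() can also be written as a single pivot_table call, which some readers find easier to scan. A sketch on hypothetical timestamps and counts (demo and by_month are names chosen for this example):

```python
import pandas as pd

# Hypothetical timestamps and counts, only to show the reshaping.
demo = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2011-01-01 00:00", "2011-01-01 01:00", "2011-02-01 00:00",
         "2012-01-01 00:00", "2012-02-01 00:00", "2012-02-01 01:00"]
    ),
    "count": [16, 40, 10, 50, 20, 30],
})
demo["year"] = demo["datetime"].dt.year
demo["month"] = demo["datetime"].dt.month

# Rows are months, columns are years, cells are summed counts.
by_month = demo.pivot_table(index="month", columns="year", values="count", aggfunc="sum")
print(by_month)
```

Calling by_month.plot(kind='bar') on this frame would draw one group of bars per month with one bar per year.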


It’s also possible to use the boxplot function from the matplotlib library to draw box plots of the same hourly counts that we looked at before:

def plot_hours(data, message = ''):
    dd = data.copy()
    dd['hour'] = data.datetime.dt.hour
    
    hours = {}
    for hour in range(24):
        hours[hour] = dd[ dd.hour == hour ]['count'].values

    plt.figure(figsize=(20,10))
    plt.ylabel("Count rent")
    plt.xlabel("Hours")
    plt.title("count vs hours\n" + message)
    plt.boxplot( [hours[hour] for hour in range(24)] )
    
    axis = plt.gca()
    axis.set_ylim([1, 1100])

plot_hours(data[data.datetime.dt.year == 2011], 'year 2011')
plot_hours(data[data.datetime.dt.year == 2012], 'year 2012')


data['hour'] = data.datetime.dt.hour

plot_hours(data[data.workingday == 1], 'working day') # hourly rental counts on working days, both years combined
plot_hours(data[data.workingday == 0], 'non working day') # hourly rental counts on non-working days, both years combined
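
The numbers that a box plot draws (the quartiles) can also be computed directly, which helps when exact values are needed. A sketch on synthetic counts for three hours (all values are generated, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic counts for three hours of the day (hypothetical values).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "hour": np.repeat([0, 8, 17], 50),
    "count": rng.integers(1, 500, 150),
})

# Quartiles per hour: the box edges (25%, 75%) and the median line (50%).
quartiles = demo.groupby("hour")["count"].quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range, i.e. the height of each box.
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```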


If we want less detail than a 24-hour plot, we can pack the hour column into bins, e.g. 4 bins. The sklearn library has a class for this, KBinsDiscretizer, but here we will define a custom function with hand-picked bin edges instead:

def categorical_to_numeric(x):
    if 0 <=  x < 6:
        return 0
    elif 6 <= x < 13:
        return 1
    elif 13 <= x < 19:
        return 2
    elif 19 <= x < 24:
        return 3

data['hour'] = data['hour'].apply(categorical_to_numeric) # replace each hour of the day with its bin index (0 to 3)
data.head()

	datetime	season	weather	temp	atemp	humidity	casual	registered	count
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32
3	2011-01-01 03:00:00	1	1	9.84	14.395	75	3	10	13
4	2011-01-01 04:00:00	1	1	9.84	14.395	75	0	1	1
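
The same binning can be sketched with pandas alone using pd.cut, which avoids the chain of if/elif branches. The bin edges below are copied from the custom function; the hours and hour_bin names are illustrative:

```python
import pandas as pd

# Every hour of the day, to show where each one lands.
hours = pd.Series(range(24), name="hour")

# Same edges as the custom function: [0, 6), [6, 13), [13, 19), [19, 24).
# right=False makes each interval closed on the left and open on the right;
# labels=False returns the integer bin index instead of an Interval object.
hour_bin = pd.cut(hours, bins=[0, 6, 13, 19, 24], right=False, labels=False)

# Number of hours that fall into each bin.
print(hour_bin.value_counts().sort_index())
```

Unlike KBinsDiscretizer, which derives its edges from the data (uniform, quantile or k-means strategies), both pd.cut with explicit edges and the custom function use edges chosen by hand.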

# drop unnecessary columns
data = data.drop(['datetime'], axis=1)

Now we can group the data by the new binned hour column and plot the average count per bin:

figure,axes = plt.subplots(figsize = (10, 5))
hours = data.groupby(["hour"]).agg("mean")["count"]  
hours.plot(kind="line", ax=axes) 
plt.title('Hours VS Counts')
axes.set_xlabel('Time in Hours')
axes.set_ylabel('Average of the Bike Demand')
plt.show()


We can also plot the number of rented bikes against the temperature.
As we saw earlier, the atemp and temp columns are highly correlated, so we’ll plot just one of them:

# average count for each distinct temp value
a = data.groupby('temp')[['count']].mean()
a.plot()
plt.show()
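
The relation between temp and atemp can be quantified with a Pearson correlation. The sketch below only demonstrates the call, using the five rows printed by data.head() earlier; with so few rows the result says little by itself, and the meaningful number comes from running the same call on the full dataset:

```python
import pandas as pd

# temp and atemp from the first five rows shown by data.head().
head = pd.DataFrame({
    "temp": [9.84, 9.02, 9.02, 9.84, 9.84],
    "atemp": [14.395, 13.635, 13.635, 14.395, 14.395],
})

# Pearson correlation between the two columns; on the full dataset the
# equivalent call would be data["temp"].corr(data["atemp"]).
r = head["temp"].corr(head["atemp"])
print(round(r, 3))
```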

