Exploratory Data Analysis

As Wikipedia defines it:

In statistics, exploratory data analysis is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

In this session we will briefly look at some ways to explore a given dataset.

Problem Statement

Bike-sharing systems let people rent a bicycle at one location and return it at another; this dataset comes from such a system in Washington, D.C. You are provided with rental data spanning two years. Explore ways to plot meaningful graphs and gain insight from the provided dataset.
The dataset is available on Kaggle.

import pandas as pd
import numpy as np


import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
data = pd.read_csv('datasets/train_bikes.csv', parse_dates=['datetime'])

data.head()
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1
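
The parse_dates argument matters for the rest of the analysis, because it is what makes the .dt accessor available on the datetime column. A minimal sketch, assuming a tiny in-memory CSV in place of the real file (csv_text and demo are illustrative names; the rows are copied from the output above):

```python
import io

import pandas as pd

# Illustrative stand-in for datasets/train_bikes.csv (only three columns kept)
csv_text = """datetime,season,count
2011-01-01 00:00:00,1,16
2011-01-01 01:00:00,1,40
2011-01-01 02:00:00,1,32
"""

# parse_dates converts the column from plain strings to datetime64 values
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime"])

# With a datetime64 column, components such as hour or year are one attribute away
hours = demo["datetime"].dt.hour.tolist()
years = demo["datetime"].dt.year.unique().tolist()
print(demo.dtypes)
print(hours, years)
```

Without parse_dates the column would be read as strings, and the .dt calls used later in this session would raise an error.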

Column Descriptions:

  • datetime - hourly date + timestamp
  • season
    • 1 = spring, 2 = summer, 3 = fall, 4 = winter
  • holiday - whether the day is considered a holiday
  • workingday - whether the day is neither a weekend nor holiday
  • weather
    • 1: Clear, Few clouds, Partly cloudy
    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
  • temp - temperature in Celsius
  • atemp - “feels like” temperature in Celsius
  • humidity - relative humidity
  • windspeed - wind speed
  • casual - number of non-registered user rentals initiated
  • registered - number of registered user rentals initiated
  • count - number of total rentals

We can call the plot accessor directly on a pandas DataFrame to draw graphs. For example, we can plot the number of rented bikes against other columns, such as season or holiday, by calling plot.scatter() and specifying the x and y columns:

data.plot.scatter(x = 'season', y = 'count')
<AxesSubplot:xlabel='season', ylabel='count'>


data.plot.scatter(x = 'holiday', y = 'count')
<AxesSubplot:xlabel='holiday', ylabel='count'>


data.plot.scatter(x = 'workingday', y = 'count')
<AxesSubplot:xlabel='workingday', ylabel='count'>


data.plot.scatter(x = 'weather', y = 'count') 
<AxesSubplot:xlabel='weather', ylabel='count'>


data.plot.scatter(x = 'temp', y = 'count')
<AxesSubplot:xlabel='temp', ylabel='count'>


data.plot.scatter(x = 'atemp', y = 'count')
<AxesSubplot:xlabel='atemp', ylabel='count'>


data.plot.scatter(x = 'humidity', y = 'count')
<AxesSubplot:xlabel='humidity', ylabel='count'>


data.plot.scatter(x = 'windspeed', y = 'count')
<AxesSubplot:xlabel='windspeed', ylabel='count'>


data.plot.scatter(x = 'casual', y = 'count')
<AxesSubplot:xlabel='casual', ylabel='count'>

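
The one-call-per-column pattern above can also be written as a loop that draws all scatter plots side by side. This is only a sketch on synthetic data: demo, features and the random values are assumptions for illustration, and only the column names mirror the bike dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the bike data (random values, same column names)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "season": rng.integers(1, 5, size=200),
    "temp": rng.uniform(0, 40, size=200),
    "humidity": rng.integers(0, 101, size=200),
    "count": rng.integers(1, 500, size=200),
})

features = ["season", "temp", "humidity"]

# One subplot per feature, sharing the y axis so the panels are comparable
fig, axes = plt.subplots(1, len(features), figsize=(15, 4), sharey=True)
for ax, col in zip(axes, features):
    # alpha < 1 makes overlapping points visible as darker regions
    demo.plot.scatter(x=col, y="count", ax=ax, alpha=0.3)
fig.tight_layout()
```

On the real data, replacing demo with data and extending features to every column of interest would give the same plots as the individual calls.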

You can call info() on a DataFrame object to get some basic information about the data.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB

The describe() function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution.

data.describe()
season holiday workingday weather temp atemp humidity windspeed casual registered count
count 10886.000000 10886.000000 10886.000000 10886.000000 10886.00000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000
mean 2.506614 0.028569 0.680875 1.418427 20.23086 23.655084 61.886460 12.799395 36.021955 155.552177 191.574132
std 1.116174 0.166599 0.466159 0.633839 7.79159 8.474601 19.245033 8.164537 49.960477 151.039033 181.144454
min 1.000000 0.000000 0.000000 1.000000 0.82000 0.760000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 2.000000 0.000000 0.000000 1.000000 13.94000 16.665000 47.000000 7.001500 4.000000 36.000000 42.000000
50% 3.000000 0.000000 1.000000 1.000000 20.50000 24.240000 62.000000 12.998000 17.000000 118.000000 145.000000
75% 4.000000 0.000000 1.000000 2.000000 26.24000 31.060000 77.000000 16.997900 49.000000 222.000000 284.000000
max 4.000000 1.000000 1.000000 4.000000 41.00000 45.455000 100.000000 56.996900 367.000000 886.000000 977.000000
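
Note that season, holiday, workingday and weather are category codes stored as integers, so their mean and standard deviation in the table above are hard to interpret. One option is to treat them as categorical and look at frequencies and per-category averages instead. A minimal sketch on a small made-up frame (demo and season_names are illustrative names; the season labels come from the column descriptions):

```python
import pandas as pd

# Made-up rows; only the column names follow the bike dataset
demo = pd.DataFrame({
    "season": [1, 1, 2, 3, 4, 4],
    "weather": [1, 2, 1, 3, 1, 2],
    "count": [16, 40, 32, 13, 1, 20],
})

# Replace the integer codes with readable labels and mark the column as categorical
season_names = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
demo["season"] = demo["season"].map(season_names).astype("category")

# Frequency of each category, and the average rental count within each one
season_counts = demo["season"].value_counts()
mean_by_season = demo.groupby("season", observed=True)["count"].mean()
print(season_counts)
print(mean_by_season)
```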

pandas_profiling is a helpful module for generating detailed reports about a dataset. It also has a very simple API: you can simply call profile_report() on the DataFrame.
This module is not bundled with pandas; you can install it with: pip install -U pandas-profiling[notebook] (newer releases of the project are published under the name ydata-profiling).

import pandas_profiling
data.profile_report()
print("count samples & features: ", data.shape)
print("Are there missing values: ", data.isnull().values.any())
count samples & features:  (10886, 12)
Are there missing values:  False
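
Beyond missing values, the column descriptions suggest another quick sanity check: count should equal casual plus registered. The sketch below runs that check on the five rows shown by data.head() (the frame name head is illustrative); on the full dataset the same expression would be applied to data.

```python
import pandas as pd

# The casual / registered / count values from the data.head() output above
head = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "count": [16, 40, 32, 13, 1],
})

# True only if every row satisfies casual + registered == count
consistent = (head["casual"] + head["registered"] == head["count"]).all()

# Per-column missing-value counts give more detail than a single True/False
missing_per_column = head.isnull().sum()

print("count == casual + registered:", consistent)
print(missing_per_column)
```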

Now suppose we want to plot the number of bikes rented in a given year against the hour of the day, grouped by whether or not it is a working day.
The first step is to add a new column holding the hour of each entry; next, we group the data by the newly added column and the workingday column.
To keep only the number of rented bikes, we select the count column and aggregate the results by summation. Finally, to make the resulting Series easier to plot, we call the unstack() method, which returns a DataFrame.

def plot_by_hour(data, year=None, agg='sum'):
    # keep only the requested year (all rows if no year is given)
    dd = data if year is None else data[data.datetime.dt.year == year]
    dd = dd.copy() # work on a copy: avoids SettingWithCopyWarning and leaves the caller's data untouched
    dd['hour'] = dd.datetime.dt.hour # hour of the day for each record
    
    by_hour = dd.groupby(['hour', 'workingday'])['count'].agg(agg).unstack() # rows: hour, columns: workingday
    return by_hour.plot(kind='bar', ylim=(0, 80000), figsize=(15,5), width=0.9, title=f"Year = {year}")
plot_by_hour(data, year=2011)
plot_by_hour(data, year=2012)
<AxesSubplot:title={'center':'Year = 2012'}, xlabel='hour'>

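
To see what the groupby / aggregate / unstack steps described above produce, it can help to run them on a toy frame small enough to check by hand. The values below are made up for illustration (toy, summed and table are hypothetical names):

```python
import pandas as pd

# Five made-up hourly records
toy = pd.DataFrame({
    "hour": [0, 0, 1, 1, 1],
    "workingday": [0, 1, 0, 1, 1],
    "count": [5, 7, 2, 4, 6],
})

# Step 1: group by both keys and sum the counts.
# The result is a Series indexed by (hour, workingday) pairs.
summed = toy.groupby(["hour", "workingday"])["count"].sum()

# Step 2: unstack moves the inner index level (workingday) into columns,
# giving a DataFrame with one row per hour and one column per workingday value.
table = summed.unstack()
print(table)
```

Each column of table becomes one bar series when plot(kind='bar') is called, which is why the bar chart shows two bars per hour.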

Using the same logic, we can plot the total count for each month or each hour of the day, with the two years side by side.

def plot_by_year(agg_attr, title):
    dd = data.copy()
    dd['year'] = data.datetime.dt.year 
    dd['month'] = data.datetime.dt.month 
    dd['hour'] = data.datetime.dt.hour
    
    by_year = dd.groupby([agg_attr, 'year'])['count'].agg('sum').unstack() 
    return by_year.plot(kind='bar', figsize=(15,5), width=0.9, title=title) 
plot_by_year('month', "Rent bikes per month in 2011 and 2012")
plot_by_year('hour', "Rent bikes per hour in 2011 and 2012")
<AxesSubplot:title={'center':'Rent bikes per hour in 2011 and 2012'}, xlabel='hour'>

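
As an alternative to groupby followed by unstack, pandas also offers pivot_table, which builds the same kind of table in one call. The sketch below compares the two on a few made-up rows (toy, via_groupby and via_pivot are illustrative names):

```python
import pandas as pd

# Made-up records spanning two months of two years
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 00:00", "2011-01-01 01:00", "2011-02-01 00:00",
        "2012-01-01 00:00", "2012-02-01 00:00",
    ]),
    "count": [16, 40, 10, 30, 25],
})
toy["year"] = toy["datetime"].dt.year
toy["month"] = toy["datetime"].dt.month

# The approach used in plot_by_year: group, sum, then unstack the year level
via_groupby = toy.groupby(["month", "year"])["count"].sum().unstack()

# The same table built with pivot_table
via_pivot = toy.pivot_table(index="month", columns="year",
                            values="count", aggfunc="sum")
print(via_pivot)
```

Which form to use is mostly a matter of taste; pivot_table names the rows, columns and values explicitly, which some readers find easier to follow.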

It’s also possible to use the boxplot function from the matplotlib library to draw box plots of the hourly counts we looked at before.

def plot_hours(data, message = ''):
    dd = data.copy()
    dd['hour'] = data.datetime.dt.hour
    
    # collect the rental counts observed at each hour of the day
    hours = {}
    for hour in range(24):
        hours[hour] = dd[ dd.hour == hour ]['count'].values

    plt.figure(figsize=(20,10))
    plt.ylabel("Count rent")
    plt.xlabel("Hours")
    plt.title("count vs hours\n" + message)
    plt.boxplot( [hours[hour] for hour in range(24)] )
    plt.xticks(range(1, 25), range(24)) # boxes sit at positions 1..24; label them as hours 0..23
    
    axis = plt.gca()
    axis.set_ylim([1, 1100])

plot_hours(data[data.datetime.dt.year == 2011], 'year 2011')
plot_hours(data[data.datetime.dt.year == 2012], 'year 2012')


data['hour'] = data.datetime.dt.hour
plot_hours(data[data.workingday == 1], 'working day') # hourly rental counts on working days (both years)
plot_hours(data[data.workingday == 0], 'non working day') # hourly rental counts on non-working days (both years)

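
The numbers behind each box can also be computed directly, which is handy when exact values are needed rather than a picture. A minimal sketch, assuming a made-up frame with two hours only (toy and quartiles are illustrative names):

```python
import pandas as pd

# Made-up counts for two hours of the day
toy = pd.DataFrame({
    "hour": [3, 3, 3, 8, 8, 8, 8, 8],
    "count": [1, 2, 3, 10, 20, 30, 40, 50],
})

# Lower quartile, median and upper quartile of count for each hour;
# unstack turns the three quantile levels into columns
quartiles = toy.groupby("hour")["count"].quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range: the height of the box in a box plot
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```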

If we want less detail than a 24-hour plot, we can pack the hour column into bins, e.g. 4 bins. The sklearn library has a class for this, KBinsDiscretizer, but here we will define a custom function that does a similar job with hand-picked bin edges:

def categorical_to_numeric(x):
    if 0 <=  x < 6:
        return 0
    elif 6 <= x < 13:
        return 1
    elif 13 <= x < 19:
        return 2
    elif 19 <= x < 24:
        return 3
data['hour'] = data['hour'].apply(categorical_to_numeric) # map each hour (0-23) to its bin (0-3)
data.head()
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count hour
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16 0
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40 0
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32 0
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13 0
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1 0
# drop unnecessary columns
data = data.drop(['datetime'], axis=1)
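
The same binning can be expressed with pandas' own pd.cut, which takes the bin edges as a list. This is a sketch of an alternative, not the approach used above; hours and binned are illustrative names, and the edges are copied from the custom function:

```python
import pandas as pd

# Every possible hour of the day
hours = pd.Series(range(24))

# right=False makes each bin closed on the left and open on the right:
# [0, 6), [6, 13), [13, 19), [19, 24), matching the if/elif chain above
binned = pd.cut(hours, bins=[0, 6, 13, 19, 24], right=False,
                labels=[0, 1, 2, 3]).astype(int)

# Number of hours that fall into each bin
bin_sizes = binned.value_counts().sort_index().tolist()
print(bin_sizes)
```

Keeping the edges in one list makes it easy to see that the bins are not equally wide, and to change them later.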

Now we can group the data by the new binned hour column and plot the result:

figure, axes = plt.subplots(figsize = (10, 5))
hours = data.groupby("hour")["count"].mean() # average demand within each hour bin
hours.plot(kind="line", ax=axes)
plt.title('Hours VS Counts')
axes.set_xlabel('Hour bin (0: 0-5, 1: 6-12, 2: 13-18, 3: 19-23)')
axes.set_ylabel('Average of the Bike Demand')
plt.show()


We can also plot the number of rented bikes against the temperature.
As we saw earlier, the atemp and temp columns are highly correlated, so we’ll plot just one of them:

# mean rental count for each distinct temp value
a = data.groupby('temp')[['count']].mean()
a.plot()
plt.show()

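
The claim that temp and atemp move together can be checked numerically with the corr method. The sketch below uses only the five rows shown by data.head() (the frame name head is illustrative), so the value it prints says nothing about the full dataset; there, the same call would be made on data.

```python
import pandas as pd

# temp and atemp values from the data.head() output near the top of this session
head = pd.DataFrame({
    "temp": [9.84, 9.02, 9.02, 9.84, 9.84],
    "atemp": [14.395, 13.635, 13.635, 14.395, 14.395],
})

# Pearson correlation between the two columns (1.0 means perfectly linear)
corr = head["temp"].corr(head["atemp"])

# The full pairwise correlation matrix, useful when there are more columns
corr_matrix = head.corr()
print(round(corr, 3))
print(corr_matrix)
```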