Assignment 1 - Sample Answer

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

Loading the data

data = pd.read_csv('AB_NYC_2019.csv')
print(data.shape)
data.head()
(48895, 16)
|    |   id | name                                             |   host_id | host_name   | neighbourhood_group   | neighbourhood   |   latitude |   longitude | room_type       |   price |   minimum_nights |   number_of_reviews | last_review   |   reviews_per_month |   calculated_host_listings_count |   availability_365 |
|---:|-----:|:-------------------------------------------------|----------:|:------------|:----------------------|:----------------|-----------:|------------:|:----------------|--------:|-----------------:|--------------------:|:--------------|--------------------:|---------------------------------:|-------------------:|
|  0 | 2539 | Clean & quiet apt home by the park               |      2787 | John        | Brooklyn              | Kensington      |   40.64749 |   -73.97237 | Private room    |     149 |                1 |                   9 | 2018-10-19    |                0.21 |                                6 |                365 |
|  1 | 2595 | Skylit Midtown Castle                            |      2845 | Jennifer    | Manhattan             | Midtown         |   40.75362 |   -73.98377 | Entire home/apt |     225 |                1 |                  45 | 2019-05-21    |                0.38 |                                2 |                355 |
|  2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK !              |      4632 | Elisabeth   | Manhattan             | Harlem          |   40.80902 |   -73.94190 | Private room    |     150 |                3 |                   0 | NaN           |                 NaN |                                1 |                365 |
|  3 | 3831 | Cozy Entire Floor of Brownstone                  |      4869 | LisaRoxanne | Brooklyn              | Clinton Hill    |   40.68514 |   -73.95976 | Entire home/apt |      89 |                1 |                 270 | 2019-07-05    |                4.64 |                                1 |                194 |
|  4 | 5022 | Entire Apt: Spacious Studio/Loft by central park |      7192 | Laura       | Manhattan             | East Harlem     |   40.79851 |   -73.94399 | Entire home/apt |      80 |               10 |                   9 | 2018-11-19    |                0.10 |                                1 |                  0 |

Let’s find out whether there are any NaN or null values in the dataset.

data.isna().sum()
id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

(data.isnull() is just an alias of data.isna(), so data.isnull().sum() returns exactly the same counts.)

There are several ways to handle these missing values, such as:

  • replacing them with the mean or median of the field
  • replacing them with a random value drawn from the field’s range
  • deleting the entire row, which is an option when the dataset is relatively large

In this case we fill the blanks with the most frequent value of each field (the mode); a sketch of the alternatives follows the verification below.
Also, for the sake of simplicity and to respect the privacy of the hosts, we first drop these columns:

# drop the identifier and privacy-sensitive columns
data.drop(['name', 'id', 'host_name'], axis=1, inplace=True)

# fill missing values with the column mode; for numeric columns,
# also replace zeros (placeholder values) with the column mean
nan_cols = ['last_review', 'reviews_per_month']
for col in nan_cols:
    data[col] = data[col].fillna(data[col].mode()[0])
    if data[col].dtype != 'O':
        data[col] = data[col].replace(0, data[col].mean())

data.isnull().sum()
host_id                           0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
last_review                       0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64
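
Purely for illustration, the alternative strategies listed above would look roughly like this; a minimal sketch on a throwaway copy (df is a hypothetical stand-in, not applied to data, whose blanks are already filled):

df = data.copy()
# mean or median imputation for a numeric column
df['reviews_per_month'] = df['reviews_per_month'].fillna(df['reviews_per_month'].median())
# or, when the dataset is large enough, simply drop every row that still has a missing value
df = df.dropna()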
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 13 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   host_id                         48895 non-null  int64  
 1   neighbourhood_group             48895 non-null  object 
 2   neighbourhood                   48895 non-null  object 
 3   latitude                        48895 non-null  float64
 4   longitude                       48895 non-null  float64
 5   room_type                       48895 non-null  object 
 6   price                           48895 non-null  int64  
 7   minimum_nights                  48895 non-null  int64  
 8   number_of_reviews               48895 non-null  int64  
 9   last_review                     48895 non-null  object 
 10  reviews_per_month               48895 non-null  float64
 11  calculated_host_listings_count  48895 non-null  int64  
 12  availability_365                48895 non-null  int64  
dtypes: float64(3), int64(6), object(4)
memory usage: 4.8+ MB

With the describe() method we get a nice overview of the numerical columns of the dataset. Here we drop the columns whose values are not meaningful as numbers (the host id and the coordinates).

data.drop(['host_id', 'latitude', 'longitude'], axis=1).describe()
|       |        price |   minimum_nights |   number_of_reviews |   reviews_per_month |   calculated_host_listings_count |   availability_365 |
|:------|-------------:|-----------------:|--------------------:|--------------------:|---------------------------------:|-------------------:|
| count | 48895.000000 |     48895.000000 |        48895.000000 |        48895.000000 |                     48895.000000 |       48895.000000 |
| mean  |   152.720687 |         7.029962 |           23.274466 |            1.095022 |                         7.143982 |         112.781327 |
| std   |   240.154170 |        20.510550 |           44.550582 |            1.594493 |                        32.952519 |         131.622289 |
| min   |     0.000000 |         1.000000 |            0.000000 |            0.010000 |                         1.000000 |           0.000000 |
| 25%   |    69.000000 |         1.000000 |            1.000000 |            0.040000 |                         1.000000 |           0.000000 |
| 50%   |   106.000000 |         3.000000 |            5.000000 |            0.370000 |                         1.000000 |          45.000000 |
| 75%   |   175.000000 |         5.000000 |           24.000000 |            1.580000 |                         2.000000 |         227.000000 |
| max   | 10000.000000 |      1250.000000 |          629.000000 |           58.500000 |                       327.000000 |         365.000000 |

It is clear that the rent for any type of accommodation cannot equal zero, and listings that are not available to rent on any day of the year are of no interest, so we drop both first.
Have a look at the max value and the 75th percentile of the minimum_nights and price columns. These columns certainly contain values that would violate the assumptions of the statistical tests, so the next step is to remove outliers in order to obtain a more homogeneous dataset. We do so by removing entries whose price lies more than 3 standard deviations from the mean.

from scipy.stats import zscore

# remove price outliers: keep entries within 3 standard deviations of the mean
z_scores = zscore(data['price'])
data = data[np.abs(z_scores) < 3]

# drop free listings and listings that are never available
data = data[data['price'] > 0]
data = data[data['availability_365'] > 0]
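
As a quick illustration of how the 3-sigma filter behaves, consider a hypothetical toy array; note that an extreme value inflates the standard deviation itself, which is why the rule is fairly conservative:

toy = np.array([1, 2, 3, 1000])
print(zscore(toy))                    # approx. [-0.58, -0.58, -0.58, 1.73]
print(toy[np.abs(zscore(toy)) < 3])   # nothing is dropped here: the outlier inflates sigma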

Normality Test

As a preprocessing step, it is better to transform the values of a field that is not normally distributed, where possible. There are several methods that transform a distribution, e.g. a skewed one, towards normality, such as the following (a small worked sketch comes after the list):

  • Yeo-Johnson
\[x_i^{(\lambda)} =\begin{cases}[(x_i + 1)^\lambda - 1] / \lambda & \text{if } \lambda \neq 0, x_i \geq 0, \\[8pt]\ln{(x_i + 1)} & \text{if } \lambda = 0, x_i \geq 0 \\[8pt]-[(-x_i + 1)^{2 - \lambda} - 1] / (2 - \lambda) & \text{if } \lambda \neq 2, x_i < 0, \\[8pt]- \ln (- x_i + 1) & \text{if } \lambda = 2, x_i <0, \end{cases}\]
  • Box-Cox
\[x_i^{(\lambda)} =\begin{cases}\dfrac{x_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\[8pt]\ln{(x_i)} & \text{if } \lambda = 0,\end{cases}\]
  • Logarithm
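
As a minimal sketch of how these transforms behave, assuming a synthetic right-skewed sample (scipy's boxcox fits lambda by maximum likelihood; Box-Cox requires strictly positive input):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)   # right-skewed, strictly positive
bc, lam = stats.boxcox(skewed)                   # Box-Cox with lambda fitted by MLE
logged = np.log(skewed)                          # plain log transform
print(f'fitted lambda: {lam:.3f}')
print('skewness raw / Box-Cox / log:', stats.skew(skewed), stats.skew(bc), stats.skew(logged))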

Skewness

In statistics, skewness is a measure of the asymmetry of the probability distribution of a random variable about its mean. In other words, skewness tells you the amount and direction of skew (departure from horizontal symmetry). The skewness value can be positive, negative, or even undefined. If skewness is 0, the data are perfectly symmetrical, although that is quite unlikely for real-world data. As a general rule of thumb:

  • if skewness is less than -1 or greater than 1, the distribution is highly skewed
  • if skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed
  • if skewness is between -0.5 and 0.5, the distribution is approximately symmetric

\[\frac{\frac{1}{n}\Sigma^{n}_{i=1}(x_{i}-\bar{x})^{3}}{(\frac{1}{n}\Sigma^{n}_{i=1}(x_{i}-\bar{x})^{2})^{3/2}}\]


Here, x̄ is the sample mean.

Kurtosis

Kurtosis tells you the height and sharpness of the central peak, relative to that of a standard bell curve.

\[\frac{\frac{1}{n}\Sigma^{n}_{i=1}(x_{i}-\bar{x})^{4}}{(\frac{1}{n}\Sigma^{n}_{i=1}(x_{i}-\bar{x})^{2})^{2}}-3\]


Here, x̄ is the sample mean. The “minus 3” at the end of this formula is a correction that makes the kurtosis of the normal distribution equal to zero, since a normal distribution has a kurtosis of 3.
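
Both formulas can be checked directly against scipy, whose defaults (biased estimators, Fisher’s excess kurtosis) follow the same conventions; a small sketch on synthetic normal data:

import numpy as np
from scipy import stats

x = np.random.default_rng(1).normal(size=10_000)
d = x - x.mean()
skew_manual = (d**3).mean() / (d**2).mean()**1.5
kurt_manual = (d**4).mean() / (d**2).mean()**2 - 3   # the "minus 3" correction
print(skew_manual, stats.skew(x))        # both near 0 for normal data
print(kurt_manual, stats.kurtosis(x))    # both near 0 for normal data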

Note: the following normality test is SciPy’s implementation of D’Agostino and Pearson’s test, whose statistic is formulated as:
$s^2+k^2$
where s is the z-score returned by skewtest and k is the z-score returned by kurtosistest.
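
A quick sanity check on synthetic data that the statistic really is the sum of the two squared z-scores:

import numpy as np
from scipy import stats

x = np.random.default_rng(2).normal(size=5_000)
s = stats.skewtest(x).statistic
k = stats.kurtosistest(x).statistic
print(s**2 + k**2, stats.normaltest(x).statistic)   # the two values match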

from scipy import stats

def normal_report(data):
    """Run normaltest, skewtest and kurtosistest column-wise and tabulate the results."""
    normal   = stats.normaltest(data)
    skew     = stats.skewtest(data)
    kurtosis = stats.kurtosistest(data)
    res = pd.DataFrame({'score': normal[0], 'pvalue': normal[1]}, index=data.columns)
    for test in (skew, kurtosis):
        name = test.__class__.__name__   # e.g. 'SkewtestResult'
        temp = pd.DataFrame({f'{name}-score': test[0], f'{name}-pvalue': test[1]}, index=data.columns)
        res = res.join(temp)
    return res

# histograms (with KDE) of the six numeric columns, in a 2x3 grid
sns.set(rc={'figure.figsize': (15, 10)})
fig, axs = plt.subplots(2, 3)
plot_cols = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month',
             'calculated_host_listings_count', 'availability_365']
for i in [0, 1, 2]:
    for j in [0, 1]:
        sns.histplot(data=data[plot_cols], x=plot_cols[i + j*3], bins=20, ax=axs[j, i], kde=True)

print('Raw data:')
normal_report(data[plot_cols]).head(10)
Raw data:
|                                |         score |   pvalue |   SkewtestResult-score |   SkewtestResult-pvalue |   KurtosistestResult-score |   KurtosistestResult-pvalue |
|:-------------------------------|--------------:|---------:|-----------------------:|------------------------:|---------------------------:|----------------------------:|
| price                          |  15438.202472 |      0.0 |             104.375015 |            0.000000e+00 |                  67.409633 |                         0.0 |
| minimum_nights                 |  69228.991147 |      0.0 |             230.696645 |            0.000000e+00 |                 126.522920 |                         0.0 |
| number_of_reviews              |  20388.912837 |      0.0 |             118.836145 |            0.000000e+00 |                  79.163650 |                         0.0 |
| reviews_per_month              |  23466.994229 |      0.0 |             117.389137 |            0.000000e+00 |                  98.421464 |                         0.0 |
| calculated_host_listings_count |  35716.921354 |      0.0 |             161.055638 |            0.000000e+00 |                  98.883785 |                         0.0 |
| availability_365               | 204487.718312 |      0.0 |               9.459196 |            3.103188e-21 |                 452.104238 |                         0.0 |

[figure: histograms with KDE of the six numeric columns, raw data]

As the plots and, more importantly, the statistical tests show, none of these columns follows a normal distribution. We therefore apply the aforementioned transformations to bring them closer to normality.

from sklearn.preprocessing import PowerTransformer

# plain log transform; number_of_reviews contains zeros, so log(0) = -inf is reset to 0
log_transform = pd.DataFrame(np.log(data[plot_cols]), columns=plot_cols)
log_transform['number_of_reviews'] = log_transform['number_of_reviews'].replace(-np.inf, 0)

# power transforms; a tiny offset keeps every value strictly positive for Box-Cox
yeo_johnson = pd.DataFrame(PowerTransformer(method='yeo-johnson').fit_transform(data[plot_cols].add(0.00001)), columns=plot_cols)
box_cox = pd.DataFrame(PowerTransformer(method='box-cox').fit_transform(data[plot_cols].add(0.00001)), columns=plot_cols)



log_transform.hist(bins=20)
print('Log transformed data:')
print(normal_report(log_transform).head(10).to_markdown())

yeo_johnson.hist(bins=20)
print('Yeo-Johnson transformed data:')
print(normal_report(yeo_johnson).head(10).to_markdown())

box_cox.hist(bins=20)
print('Box-Cox transformed data:')
print(normal_report(box_cox).head(10).to_markdown())
Log transformed data:
|                                |     score |      pvalue |   SkewtestResult-score |   SkewtestResult-pvalue |   KurtosistestResult-score |   KurtosistestResult-pvalue |
|:-------------------------------|----------:|------------:|-----------------------:|------------------------:|---------------------------:|----------------------------:|
| price                          |   390.211 | 1.84789e-85 |               16.8914  |             5.20201e-64 |                   -10.2416 |                 1.29061e-24 |
| minimum_nights                 |  3995.06  | 0           |               62.2783  |             0           |                    10.7921 |                 3.75139e-27 |
| number_of_reviews              | 38615.2   | 0           |                4.73889 |             2.14893e-06 |                  -196.45   |                 0           |
| reviews_per_month              |  4415.47  | 0           |              -44.0505  |             0           |                   -49.7497 |                 0           |
| calculated_host_listings_count | 13228.3   | 0           |               99.9695  |             0           |                    56.8714 |                 0           |
| availability_365               |  6527.95  | 0           |              -75.1245  |             0           |                    29.7367 |                 2.5777e-194 |
Yeo-Johnson transformed data:
|                                |    score |     pvalue |   SkewtestResult-score |   SkewtestResult-pvalue |   KurtosistestResult-score |   KurtosistestResult-pvalue |
|:-------------------------------|---------:|-----------:|-----------------------:|------------------------:|---------------------------:|----------------------------:|
| price                          |   210.46 | 1.9913e-46 |               0.810043 |            0.417915     |                   -14.4846 |                 1.51568e-47 |
| minimum_nights                 |  6040.57 | 0          |              21.6506   |            6.00148e-104 |                   -74.6446 |                 0           |
| number_of_reviews              | 14188.8  | 0          |               0.517828 |            0.604578     |                  -119.115  |                 0           |
| reviews_per_month              | 77346.7  | 0          |               9.08888  |            1.0006e-19   |                  -277.964  |                 0           |
| calculated_host_listings_count |  8248.49 | 0          |              47.3544   |            0            |                   -77.4987 |                 0           |
| availability_365               | 36257.7  | 0          |             -21.6742   |            3.5904e-104  |                  -189.177  |                 0           |
Box-Cox transformed data:
|                                |     score |      pvalue |   SkewtestResult-score |   SkewtestResult-pvalue |   KurtosistestResult-score |   KurtosistestResult-pvalue |
|:-------------------------------|----------:|------------:|-----------------------:|------------------------:|---------------------------:|----------------------------:|
| price                          |   183.413 | 1.48747e-40 |                0.71624 |            0.473843     |                   -13.524  |                1.12807e-41  |
| minimum_nights                 |  5339.61  | 0           |               13.5258  |            1.1018e-41   |                   -71.8099 |                0            |
| number_of_reviews              |  1615.13  | 0           |              -32.1328  |            1.53524e-226 |                   -24.1373 |                1.01602e-128 |
| reviews_per_month              |  7531.92  | 0           |              -11.4642  |            1.99548e-30  |                   -86.0261 |                0            |
| calculated_host_listings_count | 31765.7   | 0           |               40.3967  |            0            |                  -173.591  |                0            |
| availability_365               | 28510.3   | 0           |              -22.7117  |            3.42959e-114 |                  -167.315  |                0            |

[figure: histograms of the log-transformed columns]

[figure: histograms of the Yeo-Johnson-transformed columns]

[figure: histograms of the Box-Cox-transformed columns]

The results and the distribution plots show that the columns are still not normal, but after the transformations they are considerably closer to a normal distribution. Further progress could be made by removing the specific values that distort each column.

As a baseline, the following compares those results and plots with the test scores and graph of a sample drawn from an actual normal distribution.

mu, sigma = 0, 0.1 # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)
print('normalTest:', stats.normaltest(s), sep='\t')
print('skewTest:', stats.skewtest(s), sep='\t')
print('kurtosisTest:', stats.kurtosistest(s), sep='\t')
sns.histplot(s, kde=True)
normalTest:	NormaltestResult(statistic=0.8164538735143994, pvalue=0.6648279877940435)
skewTest:	SkewtestResult(statistic=0.7027089559004347, pvalue=0.48223714524867656)
kurtosisTest:	KurtosistestResult(statistic=0.5680264050303651, pvalue=0.5700170385043334)





<AxesSubplot:ylabel='Count'>

[figure: histogram with KDE of the random normal sample]

Now it’s time to deal with the geographic information: latitude, longitude, and the neighbourhood details.
We can derive a new column, density, from the given geographical coordinates to draw a heatmap-like graph of the density of accommodations. This density column is a kernel-density estimate of the coordinates using Gaussian kernels, computed with scipy’s gaussian_kde class. The size of each circle is also proportional to that home’s price.

from scipy.stats import gaussian_kde
plt.figure(figsize=(20, 15))

# kernel-density estimate of the listing coordinates
xy = np.vstack([data.longitude, data.latitude])
z = gaussian_kde(xy)(xy)
data['density'] = z

# scatter a 10,000-listing sample over a map of NYC; colour = density, size = price
nyc_img = plt.imread('Neighbourhoods_New_York_City_Map.png')
plt.imshow(nyc_img, zorder=0, extent=[-74.258, -73.7, 40.49, 40.92])
sample = data.sample(10000)
sns.scatterplot(x=sample['longitude'], y=sample['latitude'], hue=sample.density, size=sample.price, sizes=(1, 200))
plt.legend()

<matplotlib.legend.Legend at 0x23d2167f910>

[figure: listing density and price over the NYC map]

Here we use a pie chart to show the distribution of Airbnb listings among the 5 major neighbourhood groups.

temp = data.groupby('neighbourhood_group')['host_id'].agg(['count'])
labels = temp.index.values
sizes = [i/temp['count'].sum() for i in temp['count']]
explode = (0.1, 0, 0, 0, 0.1)   # pull out the first and last wedges (Bronx, Staten Island)
fig1, ax1 = plt.subplots(figsize=(20, 15))
plt.title('Distribution of NYC Airbnb accommodations among \n 5 major neighbourhood groups\n')
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')   # equal aspect ratio so the pie is drawn as a circle
(-1.1212892274119892,
 1.1091298492131343,
 -1.1308905501493505,
 1.206175246923361)





[figure: pie chart of listings per neighbourhood group]

Next we group the dataset by neighbourhood to plot the average price in each one.

from collections import OrderedDict

# map each neighbourhood to its neighbourhood group, sorted by neighbourhood name
n_g = [(i, j) for i, j in zip(data.neighbourhood, data.neighbourhood_group)]
n_g = {i[0]: i[1] for i in set(n_g)}
n_g = OrderedDict(sorted(n_g.items(), key=lambda t: t[0]))

# per-neighbourhood averages; numeric_only avoids errors on the object columns in newer pandas
neighborhood_df = data.groupby('neighbourhood').mean(numeric_only=True)
neighborhood_df['group'] = list(n_g.values())
neighborhood_df.head()
| neighbourhood   |      host_id |   latitude |   longitude |      price |   minimum_nights |   number_of_reviews |   reviews_per_month |   calculated_host_listings_count |   availability_365 |   density | group         |
|:----------------|-------------:|-----------:|------------:|-----------:|-----------------:|--------------------:|--------------------:|---------------------------------:|-------------------:|----------:|:--------------|
| Allerton        | 1.056663e+08 |  40.865346 |  -73.858405 |  91.702703 |         2.189189 |           47.351351 |            1.798649 |                         2.864865 |         185.783784 |  3.299302 | Bronx         |
| Arden Heights   | 2.077379e+08 |  40.555507 |  -74.185043 |  76.000000 |        12.000000 |            8.000000 |            1.150000 |                         1.000000 |         125.666667 |  0.132183 | Staten Island |
| Arrochar        | 8.173055e+07 |  40.593900 |  -74.074600 | 115.000000 |         2.761905 |           14.619048 |            1.061905 |                         4.190476 |         255.809524 |  1.611362 | Staten Island |
| Arverne         | 8.476303e+07 |  40.592012 |  -73.794902 | 156.694444 |         1.625000 |           30.486111 |            1.590556 |                         1.694444 |         196.763889 |  3.475760 | Queens        |
| Astoria         | 8.078140e+07 |  40.764053 |  -73.919808 | 104.836996 |         5.846154 |           30.309524 |            1.466850 |                         3.428571 |         179.152015 | 26.636433 | Queens        |
fig, ax1 = plt.subplots(figsize=(15, 15))
nyc_img = plt.imread('Neighbourhoods_New_York_City_Map.png')
ax1.imshow(nyc_img, zorder=0, extent=[-74.258, -73.7, 40.49, 40.92])
sns.scatterplot(x=neighborhood_df['longitude'], y=neighborhood_df['latitude'], hue=neighborhood_df.group, size=neighborhood_df.price, sizes=(1, 300), ax=ax1)
plt.legend()
<matplotlib.legend.Legend at 0x23d21d751c0>





[figure: average price per neighbourhood over the NYC map]

We can find the busiest hosts by applying a threshold that keeps only the hosts with far more listings than the rest. First we need to group the dataset by host_id to get summarized information about each host.

hosts = data.drop(['latitude', 'longitude'], axis=1).groupby('host_id').mean(numeric_only=True).sort_values('calculated_host_listings_count', ascending=False)
hosts.head()
|   host_id |      price |   minimum_nights |   number_of_reviews |   reviews_per_month |   calculated_host_listings_count |   availability_365 |   density |
|----------:|-----------:|-----------------:|--------------------:|--------------------:|---------------------------------:|-------------------:|----------:|
| 219517861 | 253.195719 |        13.311927 |            3.917431 |            1.223119 |                            327.0 |         301.492355 | 59.818851 |
| 107434423 | 303.150862 |        32.198276 |            0.125000 |            0.043621 |                            232.0 |         253.810345 | 79.626719 |
|  30283594 | 257.771186 |        29.974576 |            0.550847 |            0.046102 |                            121.0 |         314.364407 | 82.078826 |
| 137358866 |  44.369565 |        30.000000 |            0.891304 |            0.223478 |                            103.0 |         219.967391 | 27.697180 |
|  12243051 | 213.031250 |        29.000000 |            0.447917 |            0.104062 |                             96.0 |         276.614583 | 67.112668 |
hosts.describe()
|       |        price |   minimum_nights |   number_of_reviews |   reviews_per_month |   calculated_host_listings_count |   availability_365 |      density |
|:------|-------------:|-----------------:|--------------------:|--------------------:|---------------------------------:|-------------------:|-------------:|
| count | 21613.000000 |     21613.000000 |        21613.000000 |        21613.000000 |                     21613.000000 |       21613.000000 | 21613.000000 |
| mean  |   149.662725 |         6.873514 |           33.843902 |            1.579087 |                         1.485125 |         152.727719 |    53.426646 |
| std   |   111.975614 |        25.016220 |           50.619667 |            1.688689 |                         3.612234 |         120.151525 |    31.392510 |
| min   |    10.000000 |         1.000000 |            0.000000 |            0.010000 |                         1.000000 |           1.000000 |     0.077169 |
| 25%   |    75.000000 |         2.000000 |            2.000000 |            0.270000 |                         1.000000 |          41.000000 |    27.232414 |
| 50%   |   120.000000 |         2.666667 |           13.000000 |            1.000000 |                         1.000000 |         128.333333 |    55.054562 |
| 75%   |   187.500000 |         4.000000 |           44.000000 |            2.420000 |                         1.000000 |         261.000000 |    79.062322 |
| max   |   860.000000 |      1250.000000 |          602.500000 |           20.940000 |                       327.000000 |         365.000000 |   113.931857 |
plt.figure(figsize=(10, 10))
nyc_img = plt.imread('Neighbourhoods_New_York_City_Map.png')
plt.imshow(nyc_img, zorder=0, extent=[-74.258, -73.7, 40.49, 40.92])
# threshold = half the range of listing counts (hosts is sorted in descending order)
thres  = int((hosts.calculated_host_listings_count.iloc[0] - hosts.calculated_host_listings_count.iloc[-1]) * 0.5)
sample = data[data.calculated_host_listings_count > thres]
sns.scatterplot(x=sample['longitude'], y=sample['latitude'], hue=sample.density, size=sample.price, sizes=(1, 300))
plt.legend()
<matplotlib.legend.Legend at 0x23d1c02fc10>

[figure: listings of the busiest hosts over the NYC map]

Grouping the dataset by room type shows that the price, minimum_nights, number_of_reviews, and calculated_host_listings_count columns differ clearly between the groups, so we plot those.

home_type = data.groupby('room_type').mean(numeric_only=True)
home_type.drop(['host_id', 'latitude', 'longitude'], axis=1, inplace=True)
home_type.index.name = None
home_type.head()
|                 |      price |   minimum_nights |   number_of_reviews |   reviews_per_month |   calculated_host_listings_count |   availability_365 |   density |
|:----------------|-----------:|-----------------:|--------------------:|--------------------:|---------------------------------:|-------------------:|----------:|
| Entire home/apt | 201.219373 |        10.329796 |           31.303785 |            1.455429 |                        15.902521 |         170.961413 | 57.970652 |
| Private room    |  86.415408 |         5.961157 |           33.729895 |            1.611134 |                         4.226946 |         177.426557 | 47.932338 |
| Shared room     |  62.753488 |         6.979070 |           19.201163 |            1.323558 |                         5.563953 |         217.658140 | 47.182557 |
fig, ax = plt.subplots(2, 2)
plot_cols = ['price', 'minimum_nights', 'number_of_reviews', 'calculated_host_listings_count']
ax = ax.flatten()
for num, col in enumerate(plot_cols):
    home_type[col].plot(kind='bar', ax=ax[num], sharex=True, title=col, color=['r', 'g', 'b'])

[figure: bar charts of the four columns per room type]

We can also plot the different types of properties within the major neighbourhood groups. Here we plot the calculated_host_listings_count, price, and density columns with sns.barplot. By passing a column name as the hue parameter you get more detailed, split bars, as we do here with the room_type column.

_, ax = plt.subplots(3,1, sharex=True)
sns.barplot(x='neighbourhood_group', y='calculated_host_listings_count', hue='room_type', data=data, ax=ax[0]).set(xlabel=None)
sns.barplot(x='neighbourhood_group', y='price', hue='room_type', data=data, ax=ax[1]).set(xlabel=None)
sns.barplot(x='neighbourhood_group', y='density', hue='room_type', data=data, ax=ax[2])
<AxesSubplot:xlabel='neighbourhood_group', ylabel='density'>

[figure: bar plots per neighbourhood group, split by room type]

There are features in this dataset that are likely to be correlated; reviews_per_month and number_of_reviews are an obvious example. Hence, by plotting the correlation matrix and applying a threshold, we gain a nice perspective on how these variables behave with respect to one another. For instance, it is apparent that as the density of a place grows, prices tend to rise with it.

# correlation matrix of the numeric columns, keeping only coefficients above 0.1
corr = data.drop(['latitude', 'longitude', 'host_id'], axis=1).corr(numeric_only=True)
sns.heatmap(corr[corr > 0.1], cmap="Blues", annot=True)
<AxesSubplot:>

[figure: correlation heatmap of the numeric columns]
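
To read a single coefficient off directly instead of from the heatmap shading, e.g. for the density–price relation mentioned above (the exact value depends on the filtering steps):

print(data['density'].corr(data['price']))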