Data Visualisation

The purpose of visualization is insight, not pictures. - Ben A. Shneiderman

Before you start visualising, ask yourself who the visualisation is for:

- yourself, to explore the data
- or someone else, to make a point

Visualization gives you answers to questions you didn’t know you had. - Ben Shneiderman

Either way, it helps with:

- analysing the data more effectively
- simplifying complicated data so that it makes sense
- and, eventually, speeding up decision making (including persuading someone else)

Data visualization is an aid from the beginning to the end of the typical data science pipeline. It improves the understanding and communication of the data for both data experts and end users. - Samara Vazquez Perez

However, the tools suited to exploration may differ from those suited to presentation.

Many tools are available. In this notebook we’ll use Matplotlib and Seaborn.

First Plot with Matplotlib

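import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Let Matplotlib handle pandas datetime types on plot axes.
pd.plotting.register_matplotlib_converters()
%matplotlib inline

# Set style and default figure size in one call: a later sns.set()
# would reset a style chosen earlier with sns.set_style().
sns.set_theme(style="dark", rc={'figure.figsize': (12, 8)})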

x = np.linspace(0, 2, 100)
plt.figure(figsize=(10, 8), dpi=80)
plt.plot(x, x, label='linear')
plt.plot(x, x**2, label='quadratic')
plt.plot(x, x**3, label='cubic')
plt.xlabel('x label')
plt.ylabel('y label')
plt.title("Simple Plot")
plt.legend()

[Figure: "Simple Plot" with linear, quadratic, and cubic curves]

In every figure, always label each axis and give the axes a title. When more than one data series is drawn, add a legend.

Matplotlib Object-Oriented

First we create a Figure, then we add axes to it.

x = np.linspace(0, 2, 100)

fig = plt.figure(figsize=(10, 8))
fig.suptitle('A figure with subplots', fontsize=16)

# add_axes takes [left, bottom, width, height] as fractions of the figure.
ax_1 = fig.add_axes([0, 0, 1, 0.4])
ax_2 = fig.add_axes([0, 0.5, 0.4, 0.4])
ax_3 = fig.add_axes([0.5, 0.5, 0.5, 0.4])

ax_1.plot(x, x, label='linear')
ax_2.plot(x, x**2, label='quadratic')
ax_3.plot(x, x**3, label='cubic')

# Label every axes; each one shows a single curve, so the legend names it.
for ax in (ax_1, ax_2, ax_3):
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend(loc='best')

[Figure: "A figure with subplots", three axes showing the linear, quadratic, and cubic curves]

Also, we can draw a plot within another plot.

x = np.linspace(0, 2, 100)

fig = plt.figure(figsize=(10, 8))

# The second rectangle lies inside the first, which creates the inset.
ax_1 = fig.add_axes([0, 0, 1, 1])
ax_2 = fig.add_axes([0.1, 0.5, 0.4, 0.4])

ax_1.plot(x, x**2, label='quadratic', color='orange', linestyle='dashed')
ax_1.text(1.3, 1.5, r"$y=x^2$", fontsize=20, color="orange")
ax_1.grid(True)

ax_2.plot(x, x, label='linear')
ax_2.plot(x, x**2, label='quadratic')
ax_2.plot(x, x**3, label='cubic')

ax_1.set_xlabel('x')
ax_1.set_ylabel('y')

ax_1.legend(loc='best')
ax_2.legend(loc='best')

[Figure: dashed orange quadratic curve with an inset axes comparing the linear, quadratic, and cubic curves]

For separate axes, we can easily use subplots().

fig, axes = plt.subplots(nrows=1,ncols=3, figsize=(10,8))

for i in range(3):
    axes[i].plot(x, x**(i+1))

[Figure: three side-by-side axes showing x, x**2, and x**3]

The figure above is clearly not intuitive: the axes sit side by side, each with a different scale on the y-axis, so the curves cannot be compared at a glance. It serves no purpose beyond the didactic one.
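One way to make the panels comparable is to share the y-axis between them. The following is a minimal sketch rather than the notebook's original cell: the sharey argument, the panel titles, and the smaller figure height are additions made here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 100)

# sharey=True (an addition, not in the original cell) makes all three
# panels use one y-axis range, so the curves can be compared by eye.
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(10, 4), sharey=True)

names = ['linear', 'quadratic', 'cubic']
for i, ax in enumerate(axes):
    ax.plot(x, x**(i + 1))
    ax.set_title(names[i])
    ax.set_xlabel('x')

# With a shared axis, one y label on the leftmost panel is enough.
axes[0].set_ylabel('y')
```

With a shared axis the linear curve looks flat next to the cubic one, which is the honest picture of their relative sizes on this range.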

Bad Visualisation

1. Inconsistent scale

x_1 = np.linspace(0, 1, 80, endpoint=False)
x_2 = np.linspace(1, 2, 20)
x = np.concatenate((x_1, x_2))
plt.figure(figsize=(10, 8), dpi=80)
plt.plot(x, label='linear')
plt.plot(x**2, label='quadratic')
plt.plot(x**3, label='cubic')
plt.xlabel('x label')
plt.xticks([0, 20, 40, 60, 80, 90, 100], [0, 0.25, 0.5, 0.75, 1, 1.5, 2])
plt.ylabel('y label')
plt.title("Simple Plot")
plt.grid(True)
plt.legend()

[Figure: "Simple Plot" with linear, quadratic, and cubic curves and unevenly spaced x-axis tick labels]

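This plot is misleading. The curves are drawn against the sample index, and the relabelled ticks hide the fact that the interval from 1 to 2 is squeezed into a quarter of the width given to the interval from 0 to 1. A distorted axis like this easily leads to misinterpretation.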
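A possible fix, sketched here under the assumption that the uneven sampling itself has to stay, is to pass the x values to plot explicitly instead of relabelling the ticks by hand. The axis labels and title in this sketch are placeholders chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same unevenly sampled data: 80 points on [0, 1), 20 points on [1, 2].
x_1 = np.linspace(0, 1, 80, endpoint=False)
x_2 = np.linspace(1, 2, 20)
x = np.concatenate((x_1, x_2))

fig, ax = plt.subplots(figsize=(10, 8))

# Passing x explicitly places every point at its true position, so the
# axis stays linear even though the sampling density changes at x = 1.
ax.plot(x, x, label='linear')
ax.plot(x, x**2, label='quadratic')
ax.plot(x, x**3, label='cubic')

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Consistent linear x-axis')  # placeholder title
ax.grid(True)
ax.legend()
```

No manual xticks call is needed: Matplotlib chooses evenly spaced ticks on the real x values.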

Don’t get this wrong: a non-linear scale is perfectly fine when it helps. Just apply it consistently and state clearly that the plot is not drawn on a linear scale.

x = np.linspace(0, 5, 100)
fig, axes = plt.subplots(1, 2, figsize=(10,5))
      
axes[0].plot(x, x, label='linear')
axes[0].plot(x, x**2, label='quadratic')
axes[0].plot(x, x**3, label='cubic')
axes[0].plot(x, np.exp(x), label='exponential')
axes[0].set_title("Normal scale")
axes[0].grid(True)

axes[1].plot(x, x, label='linear')
axes[1].plot(x, x**2, label='quadratic')
axes[1].plot(x, x**3, label='cubic')
axes[1].plot(x, np.exp(x), label='exponential')
axes[1].set_yscale("log")
axes[1].set_title("Logarithmic scale")
axes[1].grid(True)

axes[1].legend(loc='best')

[Figure: the same four curves on a normal scale (left) and on a logarithmic y scale (right)]

2. Wrong type of chart

- Is the data quantitative or qualitative?
- If it is quantitative, is it continuous or discrete?

flights = sns.load_dataset("flights")
flights_1957 = flights[flights.year == 1957]
flights_1957.head(12)

	year	month	passengers
96	1957	Jan	315
97	1957	Feb	301
98	1957	Mar	356
99	1957	Apr	348
100	1957	May	355
101	1957	Jun	422
102	1957	Jul	465
103	1957	Aug	467
104	1957	Sep	404
105	1957	Oct	347
106	1957	Nov	305
107	1957	Dec	336

fig, axe = plt.subplots(1, figsize=(10,5))
axe.scatter(list(flights_1957.month), list(flights_1957.passengers), color='orange')
axe.plot(list(flights_1957.month), list(flights_1957.passengers), linestyle='dashed')

[Figure: 1957 passengers per month, orange points joined by a dashed line]

The data is discrete, not continuous: there is one total per month, so a line joining two consecutive months suggests intermediate values that do not exist. We should not draw it.
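A bar chart is one alternative for discrete monthly totals. The sketch below is illustrative, not the notebook's own cell: it copies the 1957 values from the table above into plain lists so that it runs without loading the dataset, and the title is a placeholder.

```python
import matplotlib.pyplot as plt

# 1957 monthly passenger counts, copied from the table above.
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
passengers = [315, 301, 356, 348, 355, 422, 465, 467, 404, 347, 305, 336]

fig, ax = plt.subplots(figsize=(10, 5))

# One bar per month: nothing is drawn between the categories, so the
# chart does not imply values between two months.
bars = ax.bar(months, passengers, color='orange')

ax.set_xlabel('month')
ax.set_ylabel('passengers')
ax.set_title('Passengers per month, 1957')  # placeholder title
```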

3. Too many variables

diamonds = sns.load_dataset("diamonds")

fig = plt.figure(figsize=(10,8))
ax = diamonds.color.value_counts().plot(kind='pie')

[Figure: pie chart of diamond counts per color]

Too many categories, especially in a pie chart, make the chart hard to read. Merge them into fewer categories or use a bar chart instead.
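Both suggestions can be combined: merge the smallest categories into one and draw the result as a bar chart. This is a rough sketch under stated assumptions. The helper lump_small_categories is hypothetical (it is not part of pandas or Seaborn), and the counts are made up for illustration rather than taken from the diamonds data.

```python
import pandas as pd
import matplotlib.pyplot as plt


def lump_small_categories(counts, top_n=4, other_label='Other'):
    """Hypothetical helper: keep the top_n largest categories, sum the rest."""
    counts = counts.sort_values(ascending=False)
    top = counts.iloc[:top_n]
    rest = counts.iloc[top_n:]
    if rest.empty:
        return top
    return pd.concat([top, pd.Series({other_label: rest.sum()})])


# Made-up counts for illustration only; not taken from the diamonds data.
counts = pd.Series({'A': 50, 'B': 30, 'C': 10, 'D': 5, 'E': 3, 'F': 2})
lumped = lump_small_categories(counts, top_n=3)

fig, ax = plt.subplots(figsize=(10, 5))
# Bar lengths are easier to compare than pie-slice angles.
ax.bar(list(lumped.index), lumped.values)
ax.set_xlabel('category')
ax.set_ylabel('count')
```

With real data, the same helper could be applied to the result of value_counts() before plotting.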

4. Readability

Seaborn: Which Chart Type for Which Purpose?

Trends:
- Line charts: seaborn.lineplot
Relationship:
- Bar charts: seaborn.barplot
- Heatmaps: seaborn.heatmap
- Scatter plots: seaborn.scatterplot and seaborn.swarmplot
- Regression lines: seaborn.regplot and seaborn.lmplot
Distributions:
- Histograms: seaborn.displot and seaborn.histplot
- KDE plots: seaborn.kdeplot and seaborn.jointplot (with kind="kde")

tips = sns.load_dataset('tips')

Trends

sns.lineplot(data=flights, x="year", y="passengers", hue="month")

[Figure: line plot of passengers per year, one line per month]

Relationship

sns.barplot(x='sex', y='total_bill', data=tips)

[Figure: bar plot of total_bill by sex]

sns.boxplot(x='time', y='total_bill', data=tips)

[Figure: box plot of total_bill by time]

sns.violinplot(x='time', y='total_bill', data=tips, hue='sex', split=True)

[Figure: split violin plot of total_bill by time, coloured by sex]

sns.lmplot(x='total_bill', y='tip', data=tips, hue='sex', height=8, aspect=1)

[Figure: regression plot of tip against total_bill, coloured by sex]

sns.relplot(data=tips, x="total_bill", y="tip", hue="day", col="time", row="sex")

[Figure: grid of scatter plots of tip against total_bill, coloured by day, with columns by time and rows by sex]

flights = sns.load_dataset('flights')

flights_pv = flights.pivot_table(index='month', columns='year', values='passengers')
sns.heatmap(flights_pv)

[Figure: heatmap of passengers by month (rows) and year (columns)]

Distributions

tips = sns.load_dataset('tips')

tips_mean = tips.total_bill.mean()
tips_sd = tips.total_bill.std()

# displot is figure-level and returns a FacetGrid, not an Axes.
g = sns.displot(data=tips, x="total_bill", kde=True, height=8)

plt.axvline(x=tips_mean, color='black', linestyle='dashed')

plt.axvline(x=tips_mean + tips_sd, color='red', linestyle='dotted')
plt.axvline(x=tips_mean - tips_sd, color='red', linestyle='dotted')

# Raw string so that \m and \s are not treated as escape sequences.
plt.title(r'$\mu = {}$ | $\sigma = {}$'.format(round(tips_mean, 2), round(tips_sd, 2)))

[Figure: histogram of total_bill with a KDE curve; dashed line at the mean (19.79), dotted lines one standard deviation (8.9) either side]

sns.jointplot(x=tips['total_bill'], y=tips['tip'], height=10)

[Figure: joint plot of tip against total_bill with marginal histograms]

sns.pairplot(tips, hue='sex', height=3)

[Figure: pair plot of the numeric tips columns, coloured by sex]
