Data Visualisation

The purpose of visualization is insight, not pictures. - Ben A. Shneiderman

Before you start visualising, ask yourself for whom you are visualising the data:

  • yourself, to explore the data
  • or someone else, to make a point

Visualization gives you answers to questions you didn’t know you had. - Ben Shneiderman

Either way, visualisation helps with:

  • analysing the data more effectively
  • simplifying complicated data so that it makes sense
  • and eventually speeding up decision making (or persuading someone else).

Data visualization is an aid from the beginning to the end of the typical data science pipeline. It improves the understanding and communication of the data for both data experts and end users. - Samara Vazquez Perez

However, the tools suited to exploration may differ from those suited to presentation.

Of the many useful tools available, this notebook uses Matplotlib and Seaborn.

First Plot with Matplotlib

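import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Register pandas date converters with Matplotlib and render plots inline (Jupyter only).
pd.plotting.register_matplotlib_converters()
%matplotlib inline

# Set the style and the default figure size in one call: calling sns.set()
# after sns.set_style() would reset the style to seaborn's default.
sns.set_theme(style="dark", rc={'figure.figsize': (12, 8)})

x = np.linspace(0, 2, 100)
plt.figure(figsize=(10, 8), dpi=80)
plt.plot(x, x, label='linear')
plt.plot(x, x**2, label='quadratic')
plt.plot(x, x**3, label='cubic')
plt.xlabel('x label')
plt.ylabel('y label')
plt.title("Simple Plot")
plt.legend()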

For every Figure, always label each axis and give the axes a title. When more than one data series is drawn, add a legend.

Matplotlib Object-Oriented

First, we create a Figure; then we add axes to it.

x = np.linspace(0, 2, 100)

fig = plt.figure(figsize=(10, 8))
fig.suptitle('A figure with subplots', fontsize=16)

ax_1 = fig.add_axes([0, 0, 1, 0.4])
ax_2 = fig.add_axes([0, 0.5, 0.4, 0.4])
ax_3 = fig.add_axes([0.5, 0.5, 0.5, 0.4])

ax_1.plot(x, x, label='linear')
ax_2.plot(x, x**2, label='quadratic')
ax_3.plot(x, x**3, label='cubic')

ax_1.set_xlabel('x')
ax_2.set_ylabel('y')
ax_3.set_title('Comparison')

ax_2.legend(loc='best')

Also, we can draw a plot within another plot.

x = np.linspace(0, 2, 100)

fig = plt.figure(figsize=(10, 8))

ax_1 = fig.add_axes([0, 0, 1, 1])
ax_2 = fig.add_axes([0.1, 0.5, 0.4, 0.4])

ax_1.plot(x, x**2, label='quadratic', color='orange', linestyle='dashed')
ax_1.text(1.3, 1.5, r"$y=x^2$", fontsize=20, color="orange")
ax_1.grid(True)

ax_2.plot(x, x, label='linear')
ax_2.plot(x, x**2, label='quadratic')
ax_2.plot(x, x**3, label='cubic')

ax_1.set_xlabel('x')
ax_1.set_ylabel('y')

ax_1.legend(loc='best')
ax_2.legend(loc='best')

For separate axes, we can easily use subplots().

fig, axes = plt.subplots(nrows=1,ncols=3, figsize=(10,8))

for i in range(3):
    axes[i].plot(x, x**(i+1))

Clearly, the figure above is not intuitive: the axes sit side by side, each with a different scale on its y-axis, so the figure serves no purpose beyond a didactic one.
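
One possible remedy, sketched here under the assumption that the three curves are meant to be compared, is to pass sharey=True to subplots() so that every panel uses the same y-axis range. The panel titles and the figure size are choices made for this sketch, not part of the original example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 100)

# sharey=True puts all three panels on the same y-axis range,
# so the curves can be compared by eye.
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4), sharey=True)

names = ['linear', 'quadratic', 'cubic']
for i, ax in enumerate(axes):
    ax.plot(x, x**(i + 1))
    ax.set_title(names[i])
    ax.set_xlabel('x')

# With a shared y-axis, one y label on the leftmost panel is enough.
axes[0].set_ylabel('y')
```

If the curves differ by orders of magnitude, a shared axis flattens the smaller ones, in which case separate axes with clearly labelled ticks may still be the better choice.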

Bad Visualisation

1. Inconsistent scale

x_1 = np.linspace(0, 1, 80, endpoint=False)
x_2 = np.linspace(1, 2, 20)
x = np.concatenate((x_1, x_2))
plt.figure(figsize=(10, 8), dpi=80)
plt.plot(x, label='linear')
plt.plot(x**2, label='quadratic')
plt.plot(x**3, label='cubic')
plt.xlabel('x label')
plt.xticks([0, 20, 40, 60, 80, 90, 100], [0, 0.25, 0.5, 0.75, 1, 1.5, 2])
plt.ylabel('y label')
plt.title("Simple Plot")
plt.grid(True)
plt.legend()

This is broken. In this plot, the range from 1 to 2 on the x-axis is squeezed into roughly a quarter of the space given to the range from 0 to 1, which easily leads to misinterpretation.
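
A minimal sketch of one way to repair the plot, assuming the goal is simply a faithful x-axis: pass the x values to plot() explicitly instead of letting Matplotlib plot against the sample index. The data is the same unevenly sampled array as in the example above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same unevenly sampled data: 80 points in [0, 1) and 20 points in [1, 2].
x_1 = np.linspace(0, 1, 80, endpoint=False)
x_2 = np.linspace(1, 2, 20)
x = np.concatenate((x_1, x_2))

fig, ax = plt.subplots(figsize=(10, 8))

# Passing x explicitly places every point at its true position,
# so equal distances on the axis mean equal differences in x.
ax.plot(x, x, label='linear')
ax.plot(x, x**2, label='quadratic')
ax.plot(x, x**3, label='cubic')

ax.set_xlabel('x label')
ax.set_ylabel('y label')
ax.set_title("Simple Plot")
ax.grid(True)
ax.legend()
```

The sampling is still denser between 0 and 1, but that no longer distorts the axis; it only affects how smooth each part of the curve looks.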

Don’t get this wrong: using a scale other than linear is advisable when it is beneficial. Just be consistent and state that the plot is not drawn on a linear scale.

x = np.linspace(0, 5, 100)
fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# Left panel: linear y-axis.
axes[0].plot(x, x, label='linear')
axes[0].plot(x, x**2, label='quadratic')
axes[0].plot(x, x**3, label='cubic')
axes[0].plot(x, np.exp(x), label='exponential')
axes[0].set_title("Linear scale")
axes[0].grid(True)

# Right panel: the same curves on a logarithmic y-axis, stated in the title.
axes[1].plot(x, x, label='linear')
axes[1].plot(x, x**2, label='quadratic')
axes[1].plot(x, x**3, label='cubic')
axes[1].plot(x, np.exp(x), label='exponential')
axes[1].set_yscale("log")
axes[1].set_title("Logarithmic scale")
axes[1].grid(True)

axes[1].legend(loc='best')

2. Wrong type of chart

  • Is the data quantitative or qualitative?
  • If it is quantitative, is it continuous or discrete?
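
As a rough aid for answering these two questions, the column dtype can serve as a first hint. The sketch below is only a heuristic: describe_column is a hypothetical helper written for this example, the fare column is made up, and an integer column may still encode categories (an ID, for example), so the guess always needs a human check.

```python
import pandas as pd

def describe_column(series):
    """Rough guess at the kind of data in a pandas Series (heuristic only)."""
    # Booleans count as numeric in pandas, so test for them first.
    if pd.api.types.is_bool_dtype(series) or not pd.api.types.is_numeric_dtype(series):
        return 'qualitative'
    if pd.api.types.is_integer_dtype(series):
        return 'quantitative, discrete'
    return 'quantitative, continuous'

# Small illustrative frame; the 'fare' values are invented for this sketch.
df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'],
    'passengers': [315, 301, 356],
    'fare': [12.5, 13.0, 11.75],
})

kinds = {col: describe_column(df[col]) for col in df.columns}
print(kinds)
```
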
flights = sns.load_dataset("flights")
flights_1957 = flights[flights.year == 1957]
flights_1957.head(12)

     year month  passengers
96   1957   Jan         315
97   1957   Feb         301
98   1957   Mar         356
99   1957   Apr         348
100  1957   May         355
101  1957   Jun         422
102  1957   Jul         465
103  1957   Aug         467
104  1957   Sep         404
105  1957   Oct         347
106  1957   Nov         305
107  1957   Dec         336

fig, axe = plt.subplots(1, figsize=(10,5))
axe.scatter(list(flights_1957.month), list(flights_1957.passengers), color='orange')
axe.plot(list(flights_1957.month), list(flights_1957.passengers), linestyle='dashed')

The data is not continuous, so we should not draw a line between two consecutive months: the line suggests intermediate values that do not exist.
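
One alternative for discrete data like this, shown here as a sketch and not as the only valid choice, is a bar chart with one bar per month. The passenger counts are copied from the 1957 table above, so the example runs without downloading the dataset; the title is an addition made for this sketch.

```python
import matplotlib.pyplot as plt

# 1957 monthly passenger counts, copied from the table above.
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
passengers = [315, 301, 356, 348, 355, 422, 465, 467, 404, 347, 305, 336]

fig, ax = plt.subplots(figsize=(10, 5))

# One bar per month: nothing is drawn between two months,
# so no intermediate values are implied.
bars = ax.bar(months, passengers, color='orange')

ax.set_xlabel('month')
ax.set_ylabel('passengers')
ax.set_title('Passengers per month, 1957')
```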

3. Too many variables

diamonds = sns.load_dataset("diamonds")
fig = plt.figure(figsize=(10,8))
ax = diamonds.color.value_counts().plot(kind='pie')

Too many categories, especially in a pie chart, make the plot useless. Merge the small categories into an "other" group, or use a bar chart instead.
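
Both suggestions can be combined, as in the following sketch. The category counts are invented for illustration, and merge_small with its keep and other_label parameters is a hypothetical helper, not a pandas or seaborn function.

```python
import matplotlib.pyplot as plt

# Hypothetical category counts, invented for illustration only.
counts = {'A': 120, 'B': 95, 'C': 60, 'D': 12, 'E': 9, 'F': 7, 'G': 4, 'H': 3}

def merge_small(counts, keep=3, other_label='other'):
    """Keep the `keep` largest categories and sum the rest into one bucket."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    merged = dict(ordered[:keep])
    rest = sum(value for _, value in ordered[keep:])
    if rest:
        merged[other_label] = rest
    return merged

merged = merge_small(counts, keep=3)

# A bar chart of the merged counts: bar lengths are easier to compare
# than the angles of many thin pie slices.
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(list(merged.keys()), list(merged.values()))
ax.set_xlabel('category')
ax.set_ylabel('count')
ax.set_title('Counts per category (small categories merged)')
```

How many categories to keep is a judgment call; the value 3 here is arbitrary.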

4. Readability

Seaborn: Which Chart Type for What Purpose?

tips = sns.load_dataset('tips')
sns.lineplot(data=flights, x="year", y="passengers", hue="month")

Relationship

sns.barplot(x='sex', y='total_bill', data=tips)

sns.boxplot(x='time', y='total_bill', data=tips)

sns.violinplot(x='time', y='total_bill', data=tips, hue='sex', split=True)

sns.lmplot(x='total_bill', y='tip', data=tips, hue='sex', height=8, aspect=1)

sns.relplot(data=tips, x="total_bill", y="tip", hue="day", col="time", row="sex")

flights = sns.load_dataset('flights')
flights_pv = flights.pivot_table(index='month', columns='year', values='passengers')
sns.heatmap(flights_pv)

Distributions

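tips = sns.load_dataset('tips')
tips_mean = tips.total_bill.mean()
tips_sd = tips.total_bill.std()

# displot is a figure-level function and returns a FacetGrid, not an Axes.
g = sns.displot(data=tips, x="total_bill", kde=True, height=8)

# Mark the mean (dashed black line).
plt.axvline(x=tips_mean, color='black', linestyle='dashed')

# Mark one standard deviation on either side of the mean (dotted red lines).
plt.axvline(x=tips_mean + tips_sd, color='red', linestyle='dotted')
plt.axvline(x=tips_mean - tips_sd, color='red', linestyle='dotted')

# Raw string, so that \mu and \sigma are not treated as Python escape sequences.
plt.title(r'$\mu = {}$ | $\sigma = {}$'.format(round(tips_mean, 2), round(tips_sd, 2)))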

sns.jointplot(x=tips['total_bill'], y=tips['tip'], height=10)

sns.pairplot(tips, hue='sex', height=3)

Further reading

The first two are strongly recommended; trust me, you won’t regret it! :)

  1. Python Graph Gallery: a general guide to visualizing every type of data in Python
  2. Data-to-Viz
  3. Fundamentals of Data Visualization by Claus O. Wilke
  4. Storytelling with Data by Cole Nussbaumer Knaflic
  5. Matplotlib Cheatsheets