When we're working with Python, how do we make sure that our visualizations follow visualization best practices? Unfortunately, we can't simply rely on the defaults of stock visualization tools such as matplotlib. On the left is what a scatter plot made using the matplotlib defaults looks like. On the right is a plot I made of the same data, with some different aesthetic choices that follow visualization best practices more closely.

    import pandas as pd
    import matplotlib.pyplot as plt

    # earthquakes_df is assumed to be a pandas DataFrame loaded beforehand
    plt.figure(figsize=(4, 3))
    points = plt.scatter(x=earthquakes_df['properties.cdi'],
                         y=earthquakes_df['properties.mag'])
    plt.xlabel('CDI')
    plt.ylabel('Magnitude')

There are several visual elements we can improve on:
- Over-saturated colours. Overly saturated colours are jarring to the eye and make it more difficult to look at data at a glance. We can change this to a more neutral palette.
- Unnecessary and distracting point boundaries. There's no need to add a black outer circle to an already intensely coloured data point. It'll be easier to see the shape of the data if we remove these point boundaries.
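
As a rough illustration of the two fixes listed above, here is a minimal matplotlib-only sketch. The `cdi` and `mag` lists are made-up stand-in values (the post's `earthquakes_df` is not reproduced here), and the particular grey and transparency values are just one reasonable choice, not a recommendation from any style guide.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical stand-in data; replace with the real CDI and magnitude columns.
cdi = [2.0, 3.1, 3.8, 4.4, 5.0, 6.2]
mag = [2.5, 3.0, 4.1, 4.6, 5.2, 6.0]

fig, ax = plt.subplots(figsize=(4, 3))
points = ax.scatter(
    cdi, mag,
    color="0.35",        # a neutral mid-grey instead of a saturated colour
    edgecolors="none",   # no outline drawn around each point
    alpha=0.7,           # slight transparency so overlapping points stay readable
)
ax.set_xlabel("CDI")
ax.set_ylabel("Magnitude")
```

Passing `edgecolors="none"` is what drops the point boundaries; the `color` argument only changes the fill.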
Beyond that, though, things start to get tricky. Can we get rid of the axes entirely? Do we need grid lines? Do we need the numeric axis tick labels and the tick marks? And how do we actually implement these choices in Python?
We can certainly define some better defaults through matplotlib's `rcParams` settings, but they won't fit our needs in every situation. Instead, we need to look at each visualization, figure out which questions we hope it will answer, and then edit it down to meet those needs. To do this easily, we can use the handy seaborn library, which builds on matplotlib and provides a set of tools for producing plots that follow visualization best practices.
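
For reference, changing the defaults through `rcParams` might look something like the sketch below. The specific settings in `quieter_defaults` (a name introduced here for illustration) are assumptions about what "better defaults" could mean, not a prescribed list.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical set of defaults; adjust to taste.
quieter_defaults = {
    "axes.spines.top": False,    # hide the top border of every new plot
    "axes.spines.right": False,  # hide the right border of every new plot
    "axes.grid": False,          # no grid lines unless we ask for them
    "figure.figsize": (4, 3),    # same figure size used throughout this post
}
plt.rcParams.update(quieter_defaults)

# Every figure created after the update picks up the new defaults.
fig, ax = plt.subplots()
```

These settings apply to every plot made afterwards in the same session, which is exactly why they cannot replace a per-chart decision about what to keep.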
The first question you should ask yourself is, "Does my graph need to function as a lookup table?" In some cases, your audience needs to be able to identify the exact (or near-exact) value of the data elements in your visualization. In these cases, grid lines are not chart junk; they convey key visual information.
You can include these grid lines by using seaborn's 'darkgrid' style, which is also the style of seaborn's default theme:

    import seaborn as sns

    sns.set_context("paper", font_scale=1.1)
    sns.set_style("darkgrid")  # grey background with white grid lines

    plt.figure(figsize=(4, 3))
    ax = sns.regplot(x='properties.cdi', y='properties.mag',
                     fit_reg=False,  # scatter only, no regression line
                     data=earthquakes_df)
    ax.set(xlabel='CDI', ylabel='Magnitude')

However, if your graph does not need to function as a lookup table, it is visually preferable to eliminate unnecessary information by getting rid of the grid lines. You can do this by using the 'white' style and then removing the top and right borders of the chart with seaborn's `despine` function:

    sns.set_style("white")  # plain background, no grid lines

    plt.figure(figsize=(4, 3))
    sns.regplot(x='properties.cdi', y='properties.mag',
                fit_reg=False,
                data=earthquakes_df)
    sns.despine()  # remove the top and right spines

Whether the axis tick labels are necessary or chart junk depends on a similar, second question: "With what precision does my audience need to analyze my data?" If your audience only needs to see the general shape of your data, removing most of the numbers and leaving just a few to convey the general scale could be most appropriate. If, on the other hand, your chart will function as a lookup table, labeling your axis ticks more thoroughly could make it easier for someone to read values off it.
How do we actually eliminate these axis tick labels?

    plt.figure(figsize=(4, 3))
    plot = sns.regplot(x='properties.cdi', y='properties.mag',
                       fit_reg=False,
                       data=earthquakes_df)
    plot.set(xlabel='CDI', ylabel='Magnitude')
    # keep only two labelled ticks on each axis
    plot.get_xaxis().set_ticks([4, 8])
    plot.get_yaxis().set_ticks([4, 8])
    sns.despine()

As you can see, it's a matter of passing in lists of the values that we would like shown as our axis tick labels. There's a slight catch, though: since our plot originally had axes running over [-1, 8], once we eliminate most of the tick labels it's no longer readily apparent that our axes do not cross at 0. Since readers tend to assume that axes cross at 0, it is important that we either relabel our axes or rescale them so that they start at 0. Let's go ahead and rescale our axes so that they start at 0 by using the `xlim` and `ylim` functions:

    plt.figure(figsize=(4, 3))
    plot = sns.regplot(x='properties.cdi', y='properties.mag',
                       fit_reg=False,
                       data=earthquakes_df)
    plot.set(xlabel='CDI', ylabel='Magnitude')
    plot.get_xaxis().set_ticks([4, 8])
    plot.get_yaxis().set_ticks([4, 8])
    sns.despine()
    # newer seaborn releases no longer expose sns.plt, so call pyplot directly
    plt.xlim(0, None)  # rescale the x axis to start at 0
    plt.ylim(0, None)  # rescale the y axis to start at 0

And there we have three great-looking charts, any of which we can implement based on the type of use case we have in mind. Though none of them is one-size-fits-all, if we carefully edit our visualizations and choose which elements to get rid of or include, we can make great-looking visualizations that enable our audience to more easily understand the data they represent.
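
If you find yourself making these choices often, the two questions from this post could be folded into a small helper. What follows is only a sketch under stated assumptions: `edit_down` and its `lookup_table` and `sparse_ticks` parameters are hypothetical names made up for this example, it uses plain matplotlib rather than seaborn styles, and the data are stand-in values rather than the earthquake dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def edit_down(ax, lookup_table, sparse_ticks=(4, 8)):
    """Hypothetical helper: keep or strip grid lines and tick labels."""
    if lookup_table:
        # Readers need to read off values: keep a light grid and the default ticks.
        ax.grid(True, color="0.85")
        ax.set_axisbelow(True)  # draw the grid behind the points
    else:
        # Readers only need the overall shape: strip everything non-essential.
        ax.grid(False)
        for side in ("top", "right"):
            ax.spines[side].set_visible(False)
        ax.set_xticks(list(sparse_ticks))
        ax.set_yticks(list(sparse_ticks))
        ax.set_xlim(left=0)     # make both axes start at 0
        ax.set_ylim(bottom=0)
    return ax


# Hypothetical stand-in data.
cdi = [2.0, 3.1, 3.8, 4.4, 5.0, 6.2]
mag = [2.5, 3.0, 4.1, 4.6, 5.2, 6.0]

fig, (detailed, minimal) = plt.subplots(1, 2, figsize=(8, 3))
for ax, lookup in ((detailed, True), (minimal, False)):
    ax.scatter(cdi, mag, color="0.35", edgecolors="none")
    ax.set_xlabel("CDI")
    ax.set_ylabel("Magnitude")
    edit_down(ax, lookup_table=lookup)
```

The single boolean keeps the decision explicit at the call site, but a real project may well need finer-grained switches than this.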