Static Visualization with matplotlib#

Loading data#

import pandas as pd
import numpy as np
# Import example data - flight from NAAMES campaign
naames = pd.read_csv('./data/naames-mrg01-c130_merge_20151112_R5_thru20151114.csv', skiprows=223)
# Filter down to just 1 day
naames = naames[naames[' Fractional_Day'] < 317]

Context#

Viewing your data is an integral part of the science process and it happens at many steps along the way. Reasons you may want to plot include:

  1. view your data for a quick check

  2. answer a science question (often iterative)

  3. create final plots for publication

The reason why you are plotting influences how you might choose to plot. Are you trying to make a high-quality plot for publication? Or are you just trying to get an easy glance at your data to make sure nothing is amiss? You’ll likely have to do both.

Choosing a plotting technique: Ease vs. Complexity#

Your goal also influences which plotting tool you choose: how much time are you willing to trade for control over the detailed aspects of your plot? This tradeoff is summarized below.

Ease vs Complexity

We see that matplotlib, the library being used here, is able to make highly sophisticated but potentially time-consuming plots. On the opposite end of the spectrum from matplotlib are other plotting libraries, such as seaborn or altair. I also consider the plotting functions built into pandas to be located on the “Ease of use” end of the spectrum.

Plotting for a quick data check#

Here is an example of using the plotting libraries built into pandas to make a quick plot.

naames.plot.scatter(x=' BC_mass', y = ' ALTP')
<Axes: xlabel=' BC_mass', ylabel=' ALTP'>

What did we learn in this quick plot? The first thing we see is that there is likely a missing-data fill value that we didn’t account for. We can tell because all of the dots are located in exactly the same spot in three locations. Let’s replace those values with NaN and try again.

# Replace the -999999 fill value with NaN so it is treated as missing data
naames = naames.replace(-999999, np.nan)

# naames.plot.scatter(x=' Fractional_Day', y=' ALTP')
naames.plot.scatter(x=' BC_mass', y = ' ALTP')
<Axes: xlabel=' BC_mass', ylabel=' ALTP'>

Much better! What did we learn this time?

  • We can see a loose distribution of values: the altitude measurements appear to be spread fairly evenly, while the black carbon measurements seem to be clustered towards the smaller values.

  • We see the range of values: 0 -> 175 for Black Carbon and 0 -> 7 for Altitude

Making plots like this is a great thing when you first open a dataset.
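A quick numeric check pairs well with the quick plot. Here is a minimal sketch of the fill-value cleanup on a tiny synthetic table; the numbers are made up for illustration and only the -999999 fill value comes from the example above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the flight data (values are made up for illustration)
demo = pd.DataFrame({
    "BC_mass": [12.0, -999999, 40.5, 3.2],
    "ALTP": [0.5, 2.1, -999999, 6.8],
})

# Swap the fill value for NaN so pandas treats it as missing
cleaned = demo.replace(-999999, np.nan)

# min() and max() skip NaN, so the ranges now reflect real measurements
print(cleaned.min())
print(cleaned.max())

# Count how many values were flagged as missing in each column
print(cleaned.isna().sum())
```

If the minimum of a column is still a suspiciously large negative number after this step, there is probably a second fill value hiding in the data.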

Object Class#

When plotting in Python it is often really important to pay attention to the type, or class, of object that you are working with. As we will see, many pieces of a plot, from axes to colorbars to gridlines, can have their own type of object. You can’t do the same things on a colorbar object as on an axes object, so noticing the difference is important, especially when reading Stack Overflow.

If you are ever confused about what type of object you have, you can access the .__class__ attribute on almost any variable in Python to see what type of object it is. We call the type of object the class, hence the name of that attribute.

# `naames` is a DataFrame
naames.__class__
pandas.core.frame.DataFrame
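As a small sketch, here are two other common ways to ask the same question: the built-in type() function and isinstance(). The table below is synthetic and its values are made up.

```python
import pandas as pd

# A tiny synthetic table (values are made up for illustration)
df = pd.DataFrame({"ALTP": [0.5, 2.1, 6.8]})
column = df["ALTP"]  # selecting one column gives a different kind of object

# type() reports the same thing as the .__class__ attribute
print(type(df))
print(type(column))
same_answer = type(df) is df.__class__

# isinstance() answers a yes/no question about an object's class
column_is_series = isinstance(column, pd.Series)
column_is_dataframe = isinstance(column, pd.DataFrame)
```

Notice that pulling one column out of a DataFrame gives you a Series, not a smaller DataFrame. Plot objects behave the same way: a piece of a plot usually has a different class than the plot it came from.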

matplotlib Concepts#

1. Different parts of a graph are separate objects#

Plots in Matplotlib are organized as a collection of overlapping objects. Each object can be created and accessed independently, although it doesn’t have to be.

A few of the most important objects to start with are figure, axes and axis.

It isn’t always clear that the plot holds independent objects because lots of functions in matplotlib create several objects at once. But know that under the hood each part of a plot is its own entity.

Without looking too much at code yet, let’s look at the following example to see the difference between a figure and axes.

import matplotlib.pyplot as plt

Here we make a plot that is just a figure with a set of axes.

fig = plt.figure()
ax = plt.axes()

Now let’s change the color of the figure (fig) and the axes (ax) to see how they occupy different parts of the plot.

# Try commenting out either of the last two lines to see how the figure changes
fig = plt.figure()
ax = plt.axes()
fig.set_facecolor('yellow')  # changing the color of the figure
ax.set_facecolor('green')  # changing the color of the axes

Distinguishing between figure and axes seems redundant when you have a single dataset to show, but if you start adding multiple axes into a figure the distinction becomes clearer.
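Here is a minimal sketch of one figure holding two axes. It uses fig.add_axes(), which places an axes at a position given as fractions of the figure; the positions and colors below are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig = plt.figure()

# add_axes takes [left, bottom, width, height] as fractions of the figure
main_ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
inset_ax = fig.add_axes([0.6, 0.6, 0.25, 0.25])

# One figure color, but each axes gets its own color
fig.set_facecolor('yellow')
main_ax.set_facecolor('green')
inset_ax.set_facecolor('white')

# The figure keeps a list of every axes it contains
print(len(fig.axes))
```

The figure is the yellow page, and the two axes are separate rectangles sitting on it.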

2. Objects are stacked from the bottom up#

Every time you add something new to a matplotlib figure, it gets layered on top of whatever is already on the plot, in the order it was created. I think of matplotlib as being a large, blank canvas. You can add most anything you want anywhere you want, but after each new thing you have to stop and let the paint dry. Anything you paint after that gets added on top of your previous layer of paint.

Let’s look at an example of this, again, without getting too bogged down on the code.

plt.plot([0], [0], color='red', marker = 'o')  # create red dot at 0,0
plt.plot([-1, 0, 1], [0, 0, 0], color='blue') # create a blue horizontal line
plt.plot([-1, 0, 1],[-0.5, 0, 0.5], color='green') # create a green diagonal line
plt.plot([0], [0], color='black', marker='o', markersize=16) # make a large black dot at 0,0
[<matplotlib.lines.Line2D at 0x7fc89dc45710>]

Above we see different aspects of the plot being added, one on top of another. The black dot, which comes last, even totally covers up the red dot created in the first line.

Try moving around the different lines in the code cell above to see the different shapes get layered differently.
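Creation order is only the default. As a sketch, matplotlib artists also accept a zorder argument that overrides the layering: higher numbers are drawn later, so they end up on top. The specific zorder value below is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# The red dot is created first but given a higher zorder than the default
(red_dot,) = ax.plot([0], [0], color='red', marker='o', zorder=3)

# The large black dot is created last and keeps the default zorder for lines
(black_dot,) = ax.plot([0], [0], color='black', marker='o', markersize=16)

print(red_dot.get_zorder(), black_dot.get_zorder())
```

Here the small red dot shows up on top of the big black one, even though the black dot was painted last.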

Plotting APIs#

A strength and a weakness of matplotlib is that there are many ways to create exactly the same plot. The two major approaches, or APIs, are:

  1. pyplot API - uses plt.plot() to change the plot

  2. object-oriented (OOP) API - uses fig, ax = plt.subplots() to create fig and ax objects. Changes are then made by manipulating fig and ax directly.

Option 1 is less code and easier to approach, but it gives less control. Option 2 gives more control but it is more complicated. This is a classic programming tradeoff. We will look at both here.
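To make the comparison concrete, here is a small sketch that builds the same plot both ways. The data is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

xs = [0, 1, 2, 3]  # made-up data for illustration
ys = [0, 1, 4, 9]

# 1. pyplot API: every call acts on the "current" figure and axes
plt.figure()
(pyplot_line,) = plt.plot(xs, ys)
pyplot_title = plt.title('Squares')

# 2. object-oriented API: every call names the object it changes
fig, ax = plt.subplots()
(oo_line,) = ax.plot(xs, ys)
oo_title = ax.set_title('Squares')
```

Both halves produce the same picture. The difference is whether you hold on to fig and ax so you can refer to them by name later.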

📝 Check your understanding

Give 3 examples of matplotlib objects. What does each of the 3 objects represent?

Method 1: Using plt.plot(), “The pyplot Interface”#

The plt.plot() function is a one-line command to make a plot. The format is:

plt.plot(x, y)

where x and y are arrays of the data that you want on the x and y axes.

Step 1: Organize our data#

x = naames[' Fractional_Day']
y = naames[' ALTP']

Step 2: Make our graph#

import matplotlib.pyplot as plt
plt.plot(x,y)
[<matplotlib.lines.Line2D at 0x7fc89dc6c950>]

Even though we didn’t explicitly create a figure, axes, axis, line, axis labels, or tick marks, they were all created together with plt.plot(). If we want to change something about any of those elements, or add elements that don’t exist right now, we do that by calling other functions from plt.

plt.plot(x, y)
plt.xlim(316.55, 316.68)  # changing the existing x axis scale
plt.title('Subsection Altitude Profile')  # adding a title
plt.xlabel('Fractional Day')   # adding a label on the x axis
plt.ylabel('Altitude')  # adding a label on the y axis
Text(0, 0.5, 'Altitude')
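The objects that plt.plot() creates behind the scenes are still there, and pyplot can hand them to you. As a sketch, plt.gcf() ("get current figure") and plt.gca() ("get current axes") return whatever pyplot is currently drawing on. The data values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

plt.figure()
(line,) = plt.plot([316.55, 316.60, 316.65], [1.0, 3.0, 2.0])  # made-up values

# Ask pyplot which objects it is currently working on
current_fig = plt.gcf()
current_ax = plt.gca()

# plt.xlabel() changes the label on that same current axes
plt.xlabel('Fractional Day')
print(current_ax.get_xlabel())
```

This is why a plt call always seems to know which plot you mean: it acts on the most recently used figure and axes.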

How do you find out where all those options (ex. plt.xlabel()) came from? Realistically, you probably reference a code example you found online. There are truly thousands of commands, so googling is an important skill while plotting.

The other way you could have found the options is by looking at the documentation. The plt.plot() docs page describes several ways to customize your figure. Reading documentation has a learning curve and it takes practice, but it is often helpful once you get oriented.

plt.rcParams["figure.figsize"] = (15,3)  # Changing the size of the plot
# Get a smaller chunk of data so we can see the different marker types
naames_subset = naames.iloc[:100]
plt.plot(naames_subset[' Fractional_Day'], naames_subset[' ALTP'], 'g>')
[<matplotlib.lines.Line2D at 0x7fc89dc09850>]
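The 'g>' argument in the cell above is a format string, one of the shortcuts described in the plt.plot() docs: the letter picks the color and the symbol picks the marker. A minimal sketch, using made-up data, that writes the same request out with keyword arguments:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

xs = [0, 1, 2, 3]  # made-up data for illustration
ys = [2, 3, 5, 4]

fig, ax = plt.subplots()

# Shorthand: 'g' = green, '>' = right-pointing triangle marker, no connecting line
(shorthand,) = ax.plot(xs, ys, 'g>')

# Longhand: the same styling spelled out with keyword arguments
(longhand,) = ax.plot(xs, [v + 1 for v in ys], color='g', marker='>', linestyle='none')

print(shorthand.get_color(), shorthand.get_marker())
```

The shorthand is quick to type, while the keyword version is easier to read when you come back to the code later.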

Vocabulary is important while googling. The figure below shows the words matplotlib uses to refer to parts of a plot. It’s a nice reference when figuring out your phrasing in a Google search.

Image from the matplotlib usage guide

📝 Check your understanding

Create a line plot using either ' UTC', ' Fractional_Day' or ' INDEX' as the x axis, and another column as your y axis. Some possible y axis columns could be: ' SO4_LARGE', '  CCN_SS30_LARGE', ' Toluene_MixingRatio', or ' AOD-452nm_4STAR'.

More types of plots#

In addition to plt.plot(), which creates a line plot, there are a dozen or so other types of plots you can make that operate in the same way as plt.plot(). A list of all the options is here.

Example: scatter plot#

plt.scatter(naames_subset[' Fractional_Day'], naames_subset[' CO_MixingRatio_LARGE'])
plt.title('CO Mixing Ratio vs. Time')
Text(0.5, 1.0, 'CO Mixing Ratio vs. Time')

Example: histogram#

plt.hist(naames[' CO_MixingRatio_LARGE'])
plt.ylabel('Number of observations')  # a histogram's y axis counts observations per bin
plt.xlabel('Carbon Monoxide Mixing Ratio (LARGE)')
Text(0.5, 0, 'Carbon Monoxide Mixing Ratio (LARGE)')
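The height of each histogram bar is the number of observations that fall into that bin. As a sketch, plt.hist() also returns those counts and the bin edges, so you can inspect them directly. The data here is random synthetic data, not the NAAMES measurements, and bins=20 is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values standing in for a measured variable
rng = np.random.default_rng(0)
values = rng.normal(loc=100, scale=10, size=500)

plt.figure()
# hist returns the bar heights, the bin edges, and the drawn bar objects
counts, bin_edges, patches = plt.hist(values, bins=20)
plt.xlabel('Synthetic mixing ratio')
plt.ylabel('Number of observations')

print(len(counts), len(bin_edges))
```

There is always one more bin edge than there are bins, and the counts add up to the number of values you passed in.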

While you can read the API docs to see every possible function available to you, I often prefer to peruse the example gallery.

Method 2: Accessing the figure with plt.subplots(), “The object oriented API”#

We said earlier that this method is more complicated but gives more control. Let’s see what that means with an example.

To start let’s recreate our good ol’ figure and axes combo.

plt.subplots()
(<Figure size 1500x300 with 1 Axes>, <Axes: >)

Looks good. But what is so different about this method? In order to do anything with this figure/axes, we need to save the figure and axes objects that the command returns. Those objects are what we will build the rest of the plot around.

fig, ax  = plt.subplots()
print(fig.__class__)
print(ax.__class__)
<class 'matplotlib.figure.Figure'>
<class 'matplotlib.axes._axes.Axes'>

Now let’s add some data to the axes object.

fig, ax = plt.subplots()
ax.plot(x, y)  # Instead of plt.plot() we use ax.plot()
ax.set_title('Altitude')
Text(0.5, 1.0, 'Altitude')

And there we have a plot of altitude!

Why go through this longer method of creating a simple plot? One feature that subplots() has that plt.plot() doesn’t is the ability to have multiple axes on the same figure.

fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(15, 5)
ax2.plot(x, y)
ax2.set_title('Altitude')
ax1.scatter(x, y=naames[' CO_MixingRatio_LARGE'])
ax1.set_title('CO Mixing Ratio')
Text(0.5, 1.0, 'CO Mixing Ratio')

Notice here that we used ax.plot() to create a line plot for altitude, but we used ax.scatter() for CO Mixing Ratio to create individual points for each observation.
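When you ask for a grid with more than one row and more than one column, plt.subplots() hands back the axes as a 2D NumPy array instead of a tuple you unpack. A minimal sketch with made-up data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# A 2 x 2 grid: axes is an array indexed by [row, column]
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
print(axes.shape)

xs = [0, 1, 2, 3]  # made-up data for illustration

# Each axes object is drawn on independently
axes[0, 0].plot(xs, [0, 1, 4, 9])
axes[0, 0].set_title('Top left: line')

axes[1, 1].scatter(xs, [9, 4, 1, 0])
axes[1, 1].set_title('Bottom right: scatter')

# Axes you never draw on simply stay empty
fig.tight_layout()
```

Indexing by row and column keeps the code readable once a figure has more than two or three panels.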

📝 Check your understanding

Read the following block of code. Draw a plot that shows what you expect the output to be.

x = naames[' Fractional_Day']

fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.set_size_inches(15, 5)

ax1.plot(x, naames[' ALTP'])
ax1.set_title('Altitude')

ax2.scatter(x, y=naames[' CO_MixingRatio_LARGE'])
ax2.set_title('CO Mixing Ratio')

ax3.hist(naames[' BC_mass'])
ax3.set_title('Distribution of Black carbon')

Which method should I use?#

There is no single correct strategy. Some of the opinions I have encountered are:

  • Start with plt.plot() because it is simpler and switch to subplots() if it seems like you need something that plt.plot() can’t do

  • Always just start with subplots() because no matter what you won’t have to change methods

  • Google first for an example of the type of plot you want to make and follow whatever method the example uses.

It will take time to develop your favorite strategy. To start I’d pick something that sounds right to you. Trying something out and developing preferences is a fun part of programming - it’s a tangible sign of experience!

Closing concepts#

Concept 1: Building on top of matplotlib#

You may not need the specific control that matplotlib gives, in which case you may choose to use another library. Because matplotlib is the old standby of the Python plotting world, you are often still using matplotlib under the hood. This is described by saying that a given plotting tool, such as the plotting functions in pandas, is built on top of matplotlib.
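One practical consequence, sketched below with a tiny synthetic table: a pandas plotting call returns a regular matplotlib Axes object, so you can keep customizing the plot with matplotlib methods. The values and label text are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
from matplotlib.axes import Axes

# Tiny synthetic table (values are made up for illustration)
demo = pd.DataFrame({
    "BC_mass": [12.0, 40.5, 3.2, 25.1],
    "ALTP": [0.5, 2.1, 6.8, 4.4],
})

# The quick pandas plot hands back a matplotlib Axes
ax = demo.plot.scatter(x="BC_mass", y="ALTP")
print(type(ax))

# From here on, ordinary matplotlib methods apply
ax.set_xlabel('Black carbon mass')
ax.set_title('Quick plot, then matplotlib polish')
```

This lets you start on the “Ease of use” end of the spectrum and only reach for matplotlib’s finer controls when you need them.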

An analogy with legos#

Lego Libraries 1

Translating to Python#

Building on Libraries 2

Concept 2: Static vs. Interactive Visualization#

Static visualization means creating a visual that is a single image. You might choose to make many images and flash them together to make a movie, but in the end you can only view the output from a single perspective (the one you defined when creating the image). Historically, matplotlib is a library for static images, although it does have some interactive elements.
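As a sketch of what “a single image” means in practice, the code below renders a figure to PNG bytes with fig.savefig(). It writes to an in-memory buffer so that no file is created; passing a filename instead would save the image to disk. The data and the dpi value are arbitrary.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])  # made-up data for illustration
ax.set_title('A static image')

# Render the figure once, from one fixed perspective, into PNG bytes
buffer = io.BytesIO()
fig.savefig(buffer, format='png', dpi=100)
png_bytes = buffer.getvalue()

print(len(png_bytes), 'bytes of PNG data')
```

Once the image is written, nothing about it can change: to zoom in or relabel an axis you have to edit the code and render a new image.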

Interactive visualizations (also thought of as web visualizations) are visuals that you can click on or move around. You can zoom in, or hover over a point to see its value. bokeh is a library for interactive visualization.

Interactive viz example#