Data Visualization with Pandas—a brief introduction

Data visualization is the graphical representation of data in charts and graphs, which helps people understand data more easily than they would by reading tables of numbers. Data visualization is closely related to data analysis and often comes after it. In other words, a data scientist will often perform data analysis to process large data sets and then use data visualization techniques to make charts and graphs from the results.

The pandas module includes functions to draw different types of charts. In pandas, charts are called plots. Interestingly, the plots are not actually drawn by pandas but instead by another Python module named matplotlib.pyplot.
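One way to see this division of labor is to check what DataFrame.plot returns. The sketch below uses a small made-up data frame (the column names and values are illustrative only) and selects matplotlib's non-interactive Agg backend so that it can run without a display. Both choices are assumptions made for the demonstration.

```python
import matplotlib
matplotlib.use("Agg")  # Assumption: run without a display; omit to open a window.
import matplotlib.axes
import pandas as pd

# Hypothetical data used only for illustration.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "rainfall": [3.1, 2.4, 4.0],
})

# pandas defines the plot, but the object it returns comes from matplotlib.
ax = df.plot(kind="line", x="month", y="rainfall")

print(type(ax).__module__)                   # the name of a matplotlib module
print(isinstance(ax, matplotlib.axes.Axes))  # True
```

Because the returned object is a matplotlib Axes, matplotlib's own methods can be called on it after pandas has defined the plot.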

Drawing a plot with the pandas module can be as simple as these three steps:

  1. read the data
  2. define the plot
  3. draw (show) the plot

For example:


import pandas as pd
import matplotlib.pyplot as pyplot

# Step 1. Read a DataFrame from a CSV file.
df = pd.read_csv("filename.csv")

# Step 2. From the DataFrame, define a vertical bar plot.
df.plot(kind="bar", x="column_name_1", y="column_name_2")

# Step 3. Draw and show all defined plots.
pyplot.show()

In the above example, line 2 (the import of matplotlib.pyplot) and line 11 (the call to pyplot.show) are easy to forget, but without them the plots that your code defines will not be shown on the computer’s monitor.

Line 8 in the example code above contains a call to the pandas DataFrame.plot method. The kind named argument tells the DataFrame.plot method what type of plot to draw. DataFrame.plot can draw at least nine types of plots, selected by the value passed for the kind named argument: "line" (the default), "bar", "barh", "hist", "box", "kde" (also accepted as "density"), "area", "pie", "scatter", and "hexbin".
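As a rough sketch of how the kind argument changes what is drawn, the code below plots one small data frame three ways. The data, the column names, and the use of the non-interactive Agg backend are assumptions made so that the sketch runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # Assumption: run without a display; omit to open a window.
import matplotlib.pyplot as pyplot
import pandas as pd

# Hypothetical data used only for illustration.
df = pd.DataFrame({
    "fruit": ["apple", "banana", "cherry"],
    "quantity": [5, 3, 8],
})

# The same data drawn three ways; only the kind argument changes.
# Each call to DataFrame.plot here creates its own figure.
axes_by_kind = {}
for kind in ["line", "bar", "barh"]:
    axes_by_kind[kind] = df.plot(kind=kind, x="fruit", y="quantity",
            title=kind)

# Count the figures that have been defined so far.
print(len(pyplot.get_fignums()))
```

With an interactive backend, a single call to pyplot.show after the loop would display all three figures.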

In addition to the kind named argument, the plot method can take many other named arguments, including x, y, title, xlabel, ylabel, and legend, several of which appear in the examples in this section.

The x and y named arguments tell the DataFrame.plot method which columns in the data frame to use as the x and y axes in the plot. You can cause pandas to draw the data from multiple columns on the same plot by passing a list of strings for the y named argument as shown on line 3 in the next code example.


# Define a vertical bar plot from the DataFrame.
df.plot(kind="bar", x="column_name_1",
    y=["column_name_2", "column_name_3", "column_name_4"])
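The placeholder column names above can be replaced with real ones to try this out. The following sketch assumes a small, made-up data frame of quarterly figures (all names and values are hypothetical) and uses the non-interactive Agg backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # Assumption: run without a display; omit to open a window.
import pandas as pd

# Hypothetical quarterly data; the column names are illustrative.
df = pd.DataFrame({
    "quarter": ["Q1", "Q2", "Q3", "Q4"],
    "north": [10, 12, 9, 14],
    "south": [7, 8, 11, 10],
    "west": [5, 6, 6, 9],
})

# A list for y draws one group of bars per row,
# with one bar in each group per listed column.
ax = df.plot(kind="bar", x="quarter", y=["north", "south", "west"])

# The legend labels come from the column names passed in y.
handles, labels = ax.get_legend_handles_labels()
print(labels)           # ['north', 'south', 'west']
print(len(ax.patches))  # 12 bars: 4 rows x 3 columns
```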

Writing code to define a plot for a pandas DataFrame is usually simple. Unfortunately, writing code to define a plot for a pandas Series can be very confusing. Recall that a pandas DataFrame contains multiple rows and columns but that a pandas Series contains only a single column. When we call DataFrame.plot, we can use both the x and y named arguments to choose the columns for the x and y axes. However, a Series has only one column, so when we call Series.plot we shouldn’t use both the x and y named arguments; instead we should use only one of them. This problem is most often seen when we write code to group and aggregate a DataFrame, which often produces a Series. In that situation, when we call Series.plot we should use just the y named argument.
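Checking the type that an aggregation returns makes the difference easier to see. This sketch assumes a tiny data frame with made-up values; the accountType and usage column names are borrowed from the example program later in this section.

```python
import pandas as pd

# Hypothetical data used only for illustration.
df = pd.DataFrame({
    "accountType": ["Residence", "Business", "Residence", "School"],
    "usage": [12, 30, 8, 20],
})

# Selecting one column before aggregating produces a Series
# whose index holds the group names.
usage_series = df.groupby("accountType")["usage"].sum()

# Named aggregation followed by reset_index produces a DataFrame
# with ordinary columns, so both x and y can be named when plotting.
usage_df = df.groupby("accountType").aggregate(
        sumUsage=("usage", "sum")).reset_index()

print(type(usage_series).__name__)  # Series
print(type(usage_df).__name__)      # DataFrame
print(list(usage_df.columns))       # ['accountType', 'sumUsage']
```

Printing the type before defining a plot is a quick way to decide whether the code is about to call DataFrame.plot or Series.plot.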

Although drawing a plot can be very simple, a data scientist may have to write significant amounts of code to analyze and process the data before defining a plot.

There are many options that you can use in your code to modify the look and layout of a plot. Some of these options are not available through the pandas functions but instead are available through the matplotlib.pyplot functions.
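The sketch below illustrates the split between the two sets of options. It assumes a made-up data frame and the non-interactive Agg backend. The title and legend arguments belong to DataFrame.plot, while set_ylabel, grid, tick_params, and tight_layout belong to matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # Assumption: run without a display; omit to open a window.
import matplotlib.pyplot as pyplot
import pandas as pd

# Hypothetical data used only for illustration.
df = pd.DataFrame({
    "city": ["A", "B", "C"],
    "population": [1200, 4500, 800],
})

# pandas options: title and legend are arguments of DataFrame.plot.
ax = df.plot(kind="bar", x="city", y="population",
        title="Population by city", legend=False)

# matplotlib options: these are methods of the returned Axes object.
ax.set_ylabel("people")
ax.grid(axis="y", linestyle=":")
ax.tick_params(axis="x", labelrotation=0)

# A pyplot function that adjusts the spacing of the current figure.
pyplot.tight_layout()
```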

Examples

Below are two plots that were drawn by this Python program that uses pandas and matplotlib.pyplot.

import matplotlib.pyplot as pyplot
import matplotlib.ticker as ticker
import pandas as pd


def main():
    try:
        # Read the water.csv file and convert the
        # readDate column from a string to a datetime64.
        df = pd.read_csv("water.csv", parse_dates=["readDate"])

        combine_account_types(df)
        sum_df = sum_usage_by_account_type(df)

        # Call the show_usage_sum function which will define two plots.
        show_usage_sum(sum_df)

        # Show all defined plots.
        pyplot.show()

    except (FileNotFoundError, PermissionError, RuntimeError) as error:
        # pd.read_csv raises FileNotFoundError when water.csv is missing.
        print(type(error).__name__, error, sep=": ")


def combine_account_types(df):
    """The water.csv file contains too many account types to be
    shown neatly in a pie plot, so combine some of the account types.
    """
    categories = {
        "Airport Hanger" : "Other",
        "Apartment Complex" : "Apt. Complex",
        "Business" : "Business",
        "Business with Apartment" : "Business",
        "University" : "University",
        "Church" : "Church",
        "City" : "City",
        "Free" : "Other",
        "Out of Town" : "Out of Town",
        "Residence" : "Residence",
        "Residence with Apartment" : "Residence",
        "School" : "School",
        "Sprinklers" : "Other",
        "Town Homes" : "Town Homes",
        "Trailer Court" : "Trailer Court",
        "Vacant" : "Other",
    }

    # Use the pandas Series.map method
    # to combine some of the account types.
    df["accountType"] = df["accountType"].map(categories)


def sum_usage_by_account_type(df):
    """Create and return a new data frame that
    contains total water usage by account type.
    """
    group = df.groupby("accountType")
    sum_df = group.aggregate(sumUsage=("usage", "sum")).reset_index()
    return sum_df


def show_usage_sum(sum_df):
    """Define a pie plot and a vertical bar plot
    that both show total water usage by account type.
    """
    colors = {
        "City":"gold", "School":"purple", "University":"violet",
        "Apt. Complex":"pink",
        "Trailer Court":"green", "Town Homes":"lime",
        "Out of Town":"gray",
        "Residence":"yellow",
        "Business":"skyblue",
        "Other":"brown", "Church":"orange",
    }

    # Create a dictionary that contains the
    # desired order for the account types.
    order = colors.keys()
    order = dict(zip(order, range(len(order))))

    # Get the colors that will be used in both plots.
    colors = list(colors.values())

    # Add a column named order that contains the desired sort order.
    sum_df["order"] = sum_df["accountType"].map(order)

    # Sort the data by the values in the "order" column.
    sum_df.sort_values("order", inplace=True)

    # Keep only the accountType and sumUsage columns.
    columns = ["accountType", "sumUsage"]
    sum_df = sum_df[columns]

    # Make the accountType column be the index.
    sum_df.set_index("accountType", inplace=True)

    # Print the data frame so we can verify that it's correct.
    print(sum_df)

    # Define a pie plot.
    title = "Water Usage 2015 - 2019"
    sum_df.plot(kind="pie", y="sumUsage", colors=colors,
            title=title, label="", legend=None)

    # Call the pyplot.tight_layout function, which will format the
    # previously defined plot so that all of its parts are spaced
    # nicely. pyplot.tight_layout adjusts only the current figure,
    # so it must be called once for each defined plot, but
    # pyplot.show needs to be called only once.
    pyplot.tight_layout()

    # Define a vertical bar plot.
    bar_plot = sum_df.plot(kind="bar", y="sumUsage", color=colors,
            title=title, xlabel="", ylabel="x1000 gallons", legend=None)
    fmtr = ticker.FuncFormatter(lambda val, pos: f"{val:,.0f}")
    bar_plot.yaxis.set_major_formatter(fmtr)

    # Call pyplot.tight_layout again because it adjusts only the
    # current figure, which is now the bar plot.
    pyplot.tight_layout()


# If this file is executed like this:
# > python sample_plots.py
# then call the main function. However, if this file is simply
# imported (e.g. into a test file), then skip the call to main.
if __name__ == "__main__":
    main()

Both plots show exactly the same data, first as a pie plot and then as a vertical bar plot. Although pie plots are very popular, many data scientists avoid them because the sizes of pie slices are harder to compare than the heights of bars, so a pie plot conveys less information than a bar plot.

Figure 1: A pie plot that shows the amount of water used in the city during the years 2015 through 2019.
Figure 2: A vertical bar plot that shows the amount of water used in the city during the years 2015 through 2019.
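
The program above depends on the water.csv file, which is not reproduced here. The ordering technique used in show_usage_sum can still be tried on its own. The sketch below assumes a few made-up totals in place of the real data and a hypothetical list named desired that holds the display order.

```python
import pandas as pd

# Hypothetical totals standing in for the data from water.csv.
sum_df = pd.DataFrame({
    "accountType": ["Business", "City", "Residence", "School"],
    "sumUsage": [300, 120, 900, 80],
})

# The desired display order, expressed as a dictionary
# from account type to sort position.
desired = ["City", "School", "Residence", "Business"]
order = {name: position for position, name in enumerate(desired)}

# Map each account type to its position, sort by that
# position, and then drop the helper column.
sum_df["order"] = sum_df["accountType"].map(order)
sum_df = sum_df.sort_values("order").drop(columns="order")

# Make the accountType column be the index, as the plots expect.
sum_df = sum_df.set_index("accountType")

print(list(sum_df.index))  # ['City', 'School', 'Residence', 'Business']
```

Sorting by an explicit position column keeps the pie slices and the bars in the same order as the list of colors, which is why the program builds the order dictionary from the keys of the colors dictionary.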

Documentation