Data Visualization with Pandas—a brief introduction
Data visualization is the graphical representation of data in charts and graphs to help people visualize and understand data more easily than they would by looking at tables of numbers. Data visualization is closely related to and often comes after data analysis. In other words, a data scientist will often perform data analysis to process large data sets and then use data visualization techniques to make charts and graphs from the data.
The pandas module includes functions to draw different types of charts. In pandas, charts are called plots. Interestingly, the plots are not actually drawn by pandas but instead by another Python module named matplotlib.pyplot.
Drawing a plot with the pandas module can be as simple as these three steps:
- read the data
- define the plot
- draw (show) the plot
For example:
    import pandas as pd
    import matplotlib.pyplot as pyplot

    # Step 1. Read a DataFrame from a CSV file.
    df = pd.read_csv("filename.csv")

    # Step 2. From the DataFrame, define a vertical bar plot.
    df.plot(kind="bar", x="column_name_1", y="column_name_2")

    # Step 3. Draw and show all defined plots.
    pyplot.show()
In the above example, the import of matplotlib.pyplot on line 2 and the call to pyplot.show() on line 11 are easy to forget, but without them the plots that your code defines will not be shown on the computer's monitor.
Line 8 in the example code above contains a call to the pandas DataFrame.plot method. The kind named argument tells the DataFrame.plot method what type of plot to draw; if kind is omitted, pandas draws a line plot. DataFrame.plot can draw at least nine other types of plots when the value of the kind named argument is changed as follows:
- area: area plot
- bar: vertical bar plot
- barh: horizontal bar plot
- box: box plot
- density: density plot
- hexbin: hexagonal bin plot
- hist: histogram
- pie: pie plot
- scatter: scatter plot
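As a small sketch of how the kind argument changes the result, the code below builds a tiny DataFrame in memory and defines the same data as both a vertical and a horizontal bar plot. The column names fruit and count and their values are made up for illustration and do not come from any file in this lesson.

```python
import matplotlib.pyplot as pyplot
import pandas as pd

# Hypothetical sample data. Any DataFrame with one label column
# and one numeric column would work the same way.
df = pd.DataFrame({
    "fruit": ["apple", "banana", "cherry"],
    "count": [5, 3, 8],
})

# The only difference between these two calls is the kind argument.
bar_axes = df.plot(kind="bar", x="fruit", y="count")
barh_axes = df.plot(kind="barh", x="fruit", y="count")

# Each call to DataFrame.plot defines a plot in its own figure.
# Calling pyplot.show() at this point would display both of them;
# it is left out so the sketch also runs without a display.
```

Swapping "bar" for any other value in the list above (for example "area") follows the same pattern, although some kinds such as scatter and hexbin require both x and y to name numeric columns.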
In addition to the kind named argument, the plot method can take many other named arguments including these:
- x: string - The column to plot along the x-axis.
- y: string or list of strings - The column or columns to plot along the y-axis.
- title: string - The title to use for the plot.
- xlabel: string - The label for the x-axis.
- ylabel: string - The label for the y-axis.
- color: string - The name of the color that the plot will be drawn in.
- legend: boolean - True to draw the legend, False to hide the legend.
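The named arguments listed above can be combined in a single call. The following is a minimal sketch that uses a made-up DataFrame; the column names month and rainfall, the values, and the label text are all hypothetical.

```python
import matplotlib.pyplot as pyplot
import pandas as pd

# Hypothetical sample data used only for illustration.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "rainfall": [2.1, 1.4, 3.0],
})

# Define a vertical bar plot with a title, axis labels,
# a single color for all bars, and no legend.
axes = df.plot(
    kind="bar",
    x="month",
    y="rainfall",
    title="Monthly Rainfall",
    xlabel="Month",
    ylabel="Inches",
    color="steelblue",
    legend=False,
)

# Calling pyplot.show() at this point would display the plot.
```

DataFrame.plot returns the matplotlib Axes object that holds the plot, which is why the result is stored in the axes variable; the stored object can be inspected or adjusted before the plot is shown.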
The x and y named arguments tell the DataFrame.plot method which columns in the data frame to use as the x and y axes in the plot. You can cause pandas to draw the data from multiple columns on the same plot by passing a list of strings for the y named argument as shown on line 3 in the next code example.
    # Define a vertical bar plot from the DataFrame.
    df.plot(kind="bar", x="column_name_1",
            y=["column_name_2", "column_name_3", "column_name_4"])
Writing code to define a plot for a pandas DataFrame is usually simple. Unfortunately, writing code to define a plot for a pandas Series can be very confusing. Recall that a pandas DataFrame contains multiple rows and columns but that a pandas Series contains only a single column. When we call DataFrame.plot, we can use both the x and y named arguments to choose the columns for the x and y axes. However, a Series has only one column of values, so when we call Series.plot we shouldn't use both the x and y named arguments. This problem is most often seen when we write code to group and aggregate a DataFrame, which often produces a Series. In that situation, when we call Series.plot we should leave out the x named argument: the values of the Series are drawn along the y-axis, and the index of the Series supplies the x-axis.
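The grouping situation described above can be sketched as follows. The department and sales columns and their values are made up for illustration. The sketch passes neither x nor y to Series.plot, on the understanding that pandas plots the values of a Series against its index.

```python
import matplotlib.pyplot as pyplot
import pandas as pd

# Hypothetical sample data used only for illustration.
df = pd.DataFrame({
    "department": ["toys", "books", "toys", "books", "games"],
    "sales": [10, 20, 30, 40, 50],
})

# Grouping by one column and summing another produces a Series
# whose index holds the group names and whose values hold the sums.
totals = df.groupby("department")["sales"].sum()
print(type(totals).__name__)    # prints: Series

# No x argument is needed. The index (department) supplies the
# labels on the x-axis and the values supply the bar heights.
axes = totals.plot(kind="bar", ylabel="Total sales")

# Calling pyplot.show() at this point would display the plot.
```

If the result of an aggregation needs to be plotted with explicit x and y arguments instead, calling reset_index on it turns it back into a DataFrame with named columns.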
Although drawing a plot can be very simple, a data scientist may have to write significant amounts of code to analyze and process the data before defining a plot.
There are many options that you can use in your code to modify the look and layout of a plot. Some of these options are not available through the pandas functions but instead are available through the matplotlib.pyplot functions.
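One way to reach the matplotlib options is through the Axes object that DataFrame.plot returns. The sketch below uses made-up data (the region and orders columns are hypothetical) and applies a few adjustments that are matplotlib calls and not pandas arguments.

```python
import matplotlib.pyplot as pyplot
import pandas as pd

# Hypothetical sample data used only for illustration.
df = pd.DataFrame({
    "region": ["north", "south", "west"],
    "orders": [66, 235, 115],
})

# DataFrame.plot returns a matplotlib Axes object.
axes = df.plot(kind="bar", x="region", y="orders", legend=False)

# Draw dotted horizontal grid lines.
axes.grid(axis="y", linestyle=":")

# Fix the range of the y-axis.
axes.set_ylim(0, 250)

# Keep the region names on the x-axis horizontal.
axes.tick_params(axis="x", labelrotation=0)

# Space the parts of the current figure neatly.
pyplot.tight_layout()

# Calling pyplot.show() at this point would display the plot.
```

These are only a few of the available adjustments; the same Axes object also gives access to the tick formatters in matplotlib.ticker.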
Examples
Below are two plots that were drawn by this Python program that uses pandas and matplotlib.pyplot.
    import matplotlib.pyplot as pyplot
    import matplotlib.ticker as ticker
    import pandas as pd


    def main():
        try:
            # Read the water.csv file and convert the
            # readDate column from a string to a datetime64.
            df = pd.read_csv("water.csv", parse_dates=["readDate"])

            combine_account_types(df)
            sum_df = sum_usage_by_account_type(df)

            # Call the show_usage_sum function which will define two plots.
            show_usage_sum(sum_df)

            # Show all defined plots.
            pyplot.show()

        except RuntimeError as run_err:
            print(type(run_err).__name__, run_err, sep=": ")


    def combine_account_types(df):
        """The water.csv file contains too many account types to be
        shown neatly in a pie plot, so combine some of the account types.
        """
        categories = {
            "Airport Hanger" : "Other",
            "Apartment Complex" : "Apt. Complex",
            "Business" : "Business",
            "Business with Apartment" : "Business",
            "University" : "University",
            "Church" : "Church",
            "City" : "City",
            "Free" : "Other",
            "Out of Town" : "Out of Town",
            "Residence" : "Residence",
            "Residence with Apartment" : "Residence",
            "School" : "School",
            "Sprinklers" : "Other",
            "Town Homes" : "Town Homes",
            "Trailer Court" : "Trailer Court",
            "Vacant" : "Other",
        }

        # Use the pandas Series.map method
        # to combine some of the account types.
        df["accountType"] = df["accountType"].map(categories)


    def sum_usage_by_account_type(df):
        """Create and return a new data frame that
        contains total water usage by account type.
        """
        group = df.groupby("accountType")
        sum_df = group.aggregate(sumUsage=("usage", "sum")).reset_index()
        return sum_df


    def show_usage_sum(sum_df):
        """Define a pie plot and a vertical bar plot that
        both show total water usage by account type.
        """
        colors = {
            "City":"gold", "School":"purple", "University":"violet",
            "Apt. Complex":"pink", "Trailer Court":"green",
            "Town Homes":"lime", "Out of Town":"gray",
            "Residence":"yellow", "Business":"skyblue",
            "Other":"brown", "Church":"orange",
        }

        # Create a dictionary that contains the
        # desired order for the account types.
        order = colors.keys()
        order = dict(zip(order, range(len(order))))

        # Get a list of the colors that will be used in both plots.
        colors = list(colors.values())

        # Add a column named order that contains the desired sort order.
        sum_df["order"] = sum_df["accountType"].map(order)

        # Sort the data by the values in the "order" column.
        sum_df = sum_df.sort_values("order")

        # Keep only the accountType and sumUsage columns
        # and make the accountType column be the index.
        columns = ["accountType", "sumUsage"]
        sum_df = sum_df[columns].set_index("accountType")

        # Print the data frame so we can verify that it's correct.
        print(sum_df)

        # Define a pie plot.
        title = "Water Usage 2015 - 2019"
        sum_df.plot(kind="pie", y="sumUsage", colors=colors,
                title=title, label="", legend=False)

        # Call the pyplot.tight_layout function, which will format the
        # previously defined plot so that all of its parts are spaced
        # nicely. Strangely, pyplot.tight_layout must be called multiple
        # times, once for each defined plot, but pyplot.show needs to be
        # called only once.
        pyplot.tight_layout()

        # Define a vertical bar plot.
        bar_plot = sum_df.plot(kind="bar", y="sumUsage", color=colors,
                title=title, xlabel="", ylabel="x1000 gallons", legend=False)

        # Show the numbers on the y-axis with thousands separators.
        fmtr = ticker.FuncFormatter(lambda val, pos: f"{val:,.0f}")
        bar_plot.yaxis.set_major_formatter(fmtr)

        # Call pyplot.tight_layout again, this time for the bar plot.
        pyplot.tight_layout()


    # If this file is executed like this:
    # > python sample_plots.py
    # then call the main function. However, if this file is simply
    # imported (e.g. into a test file), then skip the call to main.
    if __name__ == "__main__":
        main()
Both plots show exactly the same data, first as a pie plot and then as a vertical bar plot. Although pie plots are very popular, many data scientists avoid them because they convey less information than a bar plot.


Documentation
- The pandas Getting Started Tutorials contain a helpful short section about creating plots.
- The pandas User Guide contains a long section about creating plots.