Data Analysis with Pandas—a brief introduction

Data analysis is a process of inspecting, transforming, and viewing data with the goal of discovering useful information.

The pandas module is a third-party Python module that enables data analysis and includes functionality to read and write CSV files, retrieve, modify, store, sort, filter, and aggregate data, and draw charts. In order to provide all of that functionality, pandas uses several other modules, including numpy and matplotlib.pyplot. Some of the functionality that seems to be part of pandas is actually part of these other modules.

Installing Pandas

In order to use the pandas module, you must first install it using the pip module. Before using pip to install anything, it is best practice to upgrade the pip, setuptools, and wheel modules. You can upgrade them by executing this command in a terminal window:

python -m pip install --user --upgrade pip setuptools wheel

After upgrading pip, you can install the numpy, matplotlib, and pandas modules by executing this command in a terminal window:

python -m pip install --user numpy matplotlib pandas

Using Pandas

The pandas module organizes data into DataFrames, Series, and Elements. A DataFrame is a two-dimensional structure like a table with rows and columns. Each row and each column in a DataFrame is a Series. A Series is a one-dimensional structure and is composed of Elements. An Element is a single piece of data.
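The relationship between these three structures can be seen in a small sketch. The city and population data below are hypothetical values chosen only for illustration:

```python
import pandas as pd

# A DataFrame built from a dictionary; each key becomes a column.
df = pd.DataFrame({"city": ["Provo", "Orem"], "population": [115000, 98000]})

column = df["population"]      # one column of the DataFrame is a Series
row = df.loc[0]                # one row is also a Series
element = df["population"][0]  # a single Element (here, 115000)
```

Notice that both a row and a column come back as a Series; only the index labels differ.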

  1. The pandas.read_csv function will read a CSV file and return a DataFrame. The read_csv function has a huge number of parameters, most of which are optional. The filepath_or_buffer and parse_dates parameters are the most useful for the assignments you will complete in this course.
    import pandas as pd
    df = pd.read_csv("filename.csv")
  2. DataFrame.dtypes is an attribute of a DataFrame and contains a Series with the data type of each column in the DataFrame. You may want to print the dtypes attribute after calling read_csv() so that you can verify that the DataFrame was read correctly. For example, in the following code, df is a DataFrame that contains all the data that read_csv read from filename.csv.
    import pandas as pd
    df = pd.read_csv("filename.csv")
    print(df.dtypes)
  3. DataFrame.describe() is a function that returns a new DataFrame of summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column in a DataFrame. You may want to print the results of describe() after calling read_csv() to help you understand the data that is in the DataFrame. For example:
    import pandas as pd
    df = pd.read_csv("filename.csv")
    print(df.describe())
  4. It is possible to print an entire DataFrame. However, by default pandas will print only some of the rows and columns of a large DataFrame.
    import pandas as pd
    df = pd.read_csv("filename.csv")
    print(df)
  5. To extract one column from a DataFrame write the column’s name in square brackets []. For example:
    import pandas as pd
    df = pd.read_csv("filename.csv")
    column = df["column_name"]
    It is also possible to extract multiple columns by writing a list of column names between the square brackets, like this:
    import pandas as pd
    original_df = pd.read_csv("filename.csv")
    column_names = ["col_name1", "col_name2", "col_name3"]
    smaller_df = original_df[column_names]
  6. To filter rows out of a DataFrame, use a comparison to create a filter object. Then use the filter object in square brackets to filter the DataFrame. For example, in the following code, original_df is a DataFrame that contains all the rows from the CSV file filename.csv, filt is a filter object (named filt instead of filter to avoid shadowing Python's built-in filter function), and filtered_df is a DataFrame that contains only some of the rows from the CSV file, because all the rows that don't satisfy the comparison original_df["column_name"] > threshold have been filtered out.
    import pandas as pd
    original_df = pd.read_csv("filename.csv")
    filt = (original_df["column_name"] > threshold)
    filtered_df = original_df[filt]
    It is possible to filter rows by two comparisons. However, because of operator precedence in Python, we must add parentheses around each comparison. In the next example, andfilt and orfilt are filter objects, filtered_and_df is a DataFrame that contains the rows that satisfy both of the comparisons, and filtered_or_df is a DataFrame that contains the rows that satisfy either or both comparisons.
    import pandas as pd
    original_df = pd.read_csv("filename.csv")
    andfilt = (original_df["col_name1"] > threshold) \
            & (original_df["col_name2"] == value)
    orfilt  = (original_df["col_name1"] > threshold) \
            | (original_df["col_name2"] == value)
    filtered_and_df = original_df[andfilt]
    filtered_or_df = original_df[orfilt]

    Unfortunately, it is easy to confuse the code that selects a column or columns (shown in the previous item) with the code that filters rows (shown in this item), because both use a DataFrame and square brackets, such as original_df[ ]. However, look closely at the example code in the previous item and this item to see the difference. The code to select a column or columns contains only a string or a list of strings between the square brackets. The code to filter rows contains a filter object between the square brackets.
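    As a small side-by-side illustration of the difference, assuming a hypothetical DataFrame with columns named "name" and "score":

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Ben", "Cal"], "score": [90, 72, 85]})

names = df["name"]               # a string between the brackets: selects a column (a Series)
passing = df[df["score"] >= 80]  # a filter object between the brackets: selects rows (a DataFrame)
```

The first line keeps every row but only one column; the second keeps every column but only the rows where the comparison is True.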

  7. To filter out duplicate rows, call the DataFrame.drop_duplicates() function. For example, in the following code, no_dups_df is a DataFrame that contains the same data as original_df but with all duplicate rows removed.
    import pandas as pd
    original_df = pd.read_csv("filename.csv")
    no_dups_df = original_df.drop_duplicates()
  8. The Series.unique() function returns an array of the unique values in a Series. This is very helpful for getting the unique values from a column. In the following code, unique_vals is an array (a numpy.ndarray) of the unique values that appear in the column named column_name.
    import pandas as pd
    original_df = pd.read_csv("filename.csv")
    column = original_df["column_name"]
    unique_vals = column.unique()
  9. The Series.value_counts() function counts the number of times each unique value occurs in a Series and returns the counts as a new Series.
    import pandas as pd
    original_df = pd.read_csv("filename.csv")
    column = original_df["column_name"]
    counts = column.value_counts()
  10. To create a new column derived from existing columns, use square brackets [] with the new column name on the left side of the assignment operator (=). For example:
    import pandas as pd
    original_df = pd.read_csv("filename.csv")
    original_df["new_column_name"] = \
            3 * original_df["existing_column_name"] + 1
  11. To extract parts of datetime values, use the .dt accessor and its .to_period method. Note that the column must have a datetime data type, which you can ensure by passing the parse_dates parameter to read_csv. For example:
    import pandas as pd
    original_df = pd.read_csv("filename.csv",
            parse_dates=["datetime_column"])
    column = original_df["datetime_column"].dt.to_period("M")
  12. The pandas module includes several functions that aggregate data, including the following:
    1. The Series.sum() function computes the sum of all the values in a Series.
    2. The Series.mean() function computes the average value of all the values in a Series.
    3. The Series.min() function finds the minimum value within a Series.
    4. The Series.max() function finds the maximum value within a Series.
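    These four functions can be demonstrated on a small Series of hypothetical prices:

```python
import pandas as pd

prices = pd.Series([2.5, 4.0, 3.5])

total = prices.sum()     # 10.0
average = prices.mean()  # 10.0 / 3, approximately 3.33
lowest = prices.min()    # 2.5
highest = prices.max()   # 4.0
```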
  13. Grouping and aggregating data is an intermediate data analysis operation that can summarize many rows into just a few rows. To group and aggregate data, use the .groupby and .aggregate methods as shown in this template, where "function" is the name of an aggregation function such as "sum", "mean", "min", or "max":
    import pandas as pd
    original_df = pd.read_csv("filename.csv")
    group = original_df.groupby("existing_column_to_group_by")
    grouped_df = group.aggregate(
        new_column_name=("existing_column_to_summarize", "function"))
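    Filling in the template with a small, hypothetical sales DataFrame might look like this:

```python
import pandas as pd

sales_df = pd.DataFrame({
    "store": ["east", "west", "east", "west"],
    "amount": [10.0, 20.0, 30.0, 40.0],
})

# Group the rows by store, then sum the amounts within each group.
group = sales_df.groupby("store")
grouped_df = group.aggregate(total_amount=("amount", "sum"))
# grouped_df has one row per store: east -> 40.0, west -> 60.0
```

The result has one row per unique value in the grouping column, and the grouping column becomes the index of the new DataFrame.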

Further Reading

The following articles give more information about using pandas.