Intro to Pandas

Pandas data structure

Pandas provides two main data structures. Let’s make an analogy with Excel.

  1. Series: A 1D labeled array. Think of a two-column Excel spreadsheet where the left column contains a label (e.g., the time of a measurement) and the right column contains a value (e.g., the value measured at the time given in the label, let’s say the temperature of a river).
  2. DataFrame: A 2D labeled table. This is the same as an Excel spreadsheet with more columns than a Series. You can think of each column as holding measurements of a different variable (e.g., the flow rate, the turbidity, etc.).

The keyword here is labeled. In Excel, you refer to columns using letters and to rows using numbers. In Pandas, you can instead use the column name (e.g., water_temperature) or the row label (e.g., 2021-06-15 14:19:14).
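To make the analogy concrete, here is a minimal sketch of a Series and a DataFrame that share the same row labels. The timestamps after the first one, the flow_rate column and all numeric values are illustrative assumptions, not data from the class.

```python
import pandas as pd

# Hypothetical measurement times used as row labels (illustrative values only)
times = pd.to_datetime(
    ["2021-06-15 14:19:14", "2021-06-15 14:29:14", "2021-06-15 14:39:14"]
)

# A Series: one set of values, each attached to a label
water_temperature = pd.Series(
    [12.1, 12.3, 12.2], index=times, name="water_temperature"
)

# A DataFrame: several labeled columns sharing the same row labels
df = pd.DataFrame(
    {"water_temperature": [12.1, 12.3, 12.2], "flow_rate": [3.4, 3.5, 3.3]},
    index=times,
)

# Get a column by its name (the result is a Series)
temps = df["water_temperature"]

# Get a row by its label (here a timestamp)
row = df.loc[pd.Timestamp("2021-06-15 14:19:14")]
```

Selecting by name or by label means the code keeps working even if the order of the columns or rows changes.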

DataFrame

Throughout this class we will focus on DataFrames rather than Series. Keep in mind that both behave almost identically.

Anatomy of a DataFrame

Figure 1 shows the basic anatomy of a DataFrame that contains four rows and four columns. We already see some data structuring emerging:

  • Rows tend to represent entries, which can be:
    • Different measurements at specific time steps
    • Different samples collected at different places/times
    • etc.
  • In contrast, columns represent attributes and store the properties of each entry:
    • The actual values of different measured parameters
    • The location and time of collected samples, along with associated analyses (e.g., geochemistry)
    • etc.
Figure 1: Basic anatomy of a Pandas DataFrame.

The first row, i.e. the row containing the column labels, is not considered an entry, because the top row of a DataFrame holds the labels of the columns. Similarly, we might want to set the first column as the labels for the rows (Figure 2). In a nutshell:

  • Index refers to the labels of the rows. Index values are usually unique, meaning that each entry has a different label.
  • Columns refers to the labels of the columns.
Figure 2: Index and columns of a DataFrame.
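The following sketch shows one way to set the first column as the row labels and then inspect the index and columns. The table, its column names (sample, latitude, longitude, SiO2) and its values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical sample table; names and values are illustrative only
df = pd.DataFrame(
    {
        "sample": ["A1", "A2", "A3", "A4"],
        "latitude": [46.2, 46.3, 46.1, 46.4],
        "longitude": [6.1, 6.2, 6.0, 6.3],
        "SiO2": [52.1, 61.4, 48.9, 70.2],
    }
)

# Use the first column as the row labels (the index)
df = df.set_index("sample")

index_labels = list(df.index)      # labels of the rows
column_labels = list(df.columns)   # labels of the columns
n_rows, n_cols = df.shape          # the header row is not counted as an entry
index_is_unique = df.index.is_unique
```

Note that once sample becomes the index, it no longer counts as a column, so df.shape reports four rows and three columns.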
Caution 1: Indexing in Python

Remember that in Python, positional indexing starts from 0, so the first row or column is at position 0. This position is distinct from the labels stored in the Index.
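As a small sketch of zero-based positions, the example below uses .iloc (selection by position) next to .loc (selection by label). The row labels and values are made up for illustration.

```python
import pandas as pd

# Hypothetical table with text row labels (illustrative values only)
df = pd.DataFrame(
    {"water_temperature": [12.1, 12.3, 12.2], "flow_rate": [3.4, 3.5, 3.3]},
    index=["first", "second", "third"],
)

first_row = df.iloc[0]       # position 0 is the first row
first_col = df.iloc[:, 0]    # position 0 is the first column
last_row = df.iloc[-1]       # negative positions count from the end

same_row = df.loc["first"]   # the same first row, selected by its label
```

Here df.iloc[0] and df.loc["first"] return the same row: the first is found by counting from 0, the second by looking up the label.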