Querying data

Querying data from a DataFrame

Let’s now review how we can access data contained in the DataFrame. This process, known as indexing, consists in specifying a row or a column (or ranges of rows and columns) where the data is stored. In pandas, there are two different ways to do that:

  • By label: data is queried using the actual index/column name (e.g., the VEI column in the DataFrame above)
  • By location: data is queried using the column location (e.g., the 3rd row)

Label-based indexing

Figure 1: Label-based queries using .loc.

Querying rows

When we know the exact label of the row or the column, we can use the .loc function to query the DataFrame (Figure 1). Let’s start by querying specific rows. We have previously defined the Name column as the index (i.e., row label), which means that we can simply pass the name of the volcano.

# Get the row for "Calbuco" volcano
df.loc['Calbuco']
Country                    Chile
Date         2015-04-22 00:00:00
VEI                            4
Latitude                -41.2972
Longitude               -72.6097
Name: Calbuco, dtype: object

Note that the result is a Series (i.e., a 1-dimensional DataFrame where the columns become the index), not a DataFrame. If we want to keep it as a DataFrame, we can use double brackets. Double brackets can also be used to query multiple rows.

df.loc[['Calbuco']] # Query one row and return it as a DataFrame
df.loc[['Calbuco', 'Taal']] # Query multiple rows
Country Date VEI Latitude Longitude
Name
Calbuco Chile 2015-04-22 4 -41.2972 -72.6097
Taal Philippines 2020-01-12 4 14.0020 120.9934
Question

What is the VEI recorded for Etna volcano?

What is the eruption date for Taal volcano?

Double brackets

In general, double brackets [[ ]] will return a DataFrame and not a Series. We will dominantly use this in the following examples.

Querying columns

Let’s now query specific columns. For example, querying the VEI column can be achieved in different ways. The simplest is to use the column name directly. We can also query multiple columns using double brackets

df[['VEI']] # Get the VEI column
df[['Country', 'VEI']] # Query multiple columns
Country VEI
Name
St. Helens USA 5
Pinatubo Philippines 6
El Chichón Mexico 5
Galunggung Indonesia 4
Nevado del Ruiz Colombia 3
Merapi Indonesia 2
Ontake Japan 2
Soufrière Hills Montserrat 3
Etna Italy 2
Nyiragongo DR Congo 1
Kīlauea USA 2
Agung Indonesia 3
Tavurvur Papua New Guinea 3
Sinabung Indonesia 3
Taal Philippines 4
La Soufrière Saint Vincent 4
Calbuco Chile 4
St. Augustine USA 3
Eyjafjallajökull Iceland 4
Cleveland USA 3

Note that until now, we have only retrieved either rows or columns (Figure 1). We can also retrieve specific values by specifying both the row and the column.

df.loc[['Calbuco', 'Taal']][['Country', 'VEI']]
Country VEI
Name
Calbuco Chile 4
Taal Philippines 4

Position-based indexing

Caution 1: Indexing in Python

Remember that in Python, indexing starts from 0 - so the first row or column has an index of 0.

Figure 2: Position-based queries using .iloc.

Some situations require querying data by location instead of label - let’s say for instance we need to retrieve rows 10-20. This is done using the .iloc function (instead of the .loc function previously used; Figure 2). Remember that Python uses zero-based indexing (Caution 1), meaning that the first element is at position 0, the second at position 1, and so on.

The next example queries the first row - using again double brackets to return a DataFrame.

df.iloc[[0]]
Country Date VEI Latitude Longitude
Name
St. Helens USA 1980-05-18 5 46.1914 -122.1956

Get ranges of rows

We can get a range of rows using what is called slicing. This is done using the colon (:) operator. The next example queries rows 3 to 6 of the DataFrame. Note that the end index is exclusive, meaning that the element at the end index is not included in the result.

df.iloc[2:6]
Country Date VEI Latitude Longitude
Name
El Chichón Mexico 1982-03-28 5 17.3559 -93.2233
Galunggung Indonesia 1982-04-05 4 -7.2567 108.0771
Nevado del Ruiz Colombia 1985-11-13 3 4.8950 -75.3220
Merapi Indonesia 2023-12-03 2 -7.5407 110.4457

To get rows 3 to 6 and columns 2-3:

df.iloc[0:5, 1:3]
Date VEI
Name
St. Helens 1980-05-18 5
Pinatubo 1991-04-02 6
El Chichón 1982-03-28 5
Galunggung 1982-04-05 4
Nevado del Ruiz 1985-11-13 3

Count rows from the last

To get the last 5 rows of the DataFrame:

df.iloc[-5:]
Country Date VEI Latitude Longitude
Name
La Soufrière Saint Vincent 2021-04-09 4 13.2833 -61.3875
Calbuco Chile 2015-04-22 4 -41.2972 -72.6097
St. Augustine USA 2006-03-27 3 57.8819 -155.5611
Eyjafjallajökull Iceland 2010-04-14 4 63.6333 -19.6111
Cleveland USA 2023-05-23 3 52.8250 -169.9444

Combining position-based and label-based queries

By experience, position-based queries is more used on rows than columns. For instance, we might want to access the first 10 rows because we don’t know their associated labels, yet it is less likely that we ignore their attributes. It is possible mix label-based and position-based indexing. For example, to get the first 5 rows and the Country and VEI columns:

df.iloc[0:5][['Country', 'VEI']]
Country VEI
Name
St. Helens USA 5
Pinatubo Philippines 6
El Chichón Mexico 5
Galunggung Indonesia 4
Nevado del Ruiz Colombia 3