Querying data

Querying data from a DataFrame

Let’s now review how we can access data contained in the DataFrame. This process, known as indexing, consists in specifying a row or a column (or ranges of rows and columns) where the data is stored. In pandas, there are two different ways to do that:

By label: data is queried using the actual index/column name (e.g., the VEI column in the DataFrame above)
By location: data is queried using the column location (e.g., the 3rd row)

Label-based indexing

Figure 1: Label-based queries using `.loc`.

Querying rows

When we know the exact label of the row or the column, we can use the .loc function to query the DataFrame (Figure 1). Let’s start by querying specific rows. We have previously defined the Name column as the index (i.e., row label), which means that we can simply pass the name of the volcano.

# Get the row for "Calbuco" volcano
df.loc['Calbuco']

Country                    Chile
Date         2015-04-22 00:00:00
VEI                            4
Latitude                -41.2972
Longitude               -72.6097
Name: Calbuco, dtype: object

Note that the result is a Series (i.e., a 1-dimensional DataFrame where the columns become the index), not a DataFrame. If we want to keep it as a DataFrame, we can use double brackets. Double brackets can also be used to query multiple rows.

df.loc[['Calbuco']] # Query one row and return it as a DataFrame
df.loc[['Calbuco', 'Taal']] # Query multiple rows

	Country	Date	VEI	Latitude	Longitude
Name
Calbuco	Chile	2015-04-22	4	-41.2972	-72.6097
Taal	Philippines	2020-01-12	4	14.0020	120.9934

Question

What is the VEI recorded for Etna volcano?

What is the eruption date for Taal volcano?

Double brackets

In general, double brackets [[ ]] will return a DataFrame and not a Series. We will dominantly use this in the following examples.

Querying columns

Let’s now query specific columns. For example, querying the VEI column can be achieved in different ways. The simplest is to use the column name directly. We can also query multiple columns using double brackets

df[['VEI']] # Get the VEI column
df[['Country', 'VEI']] # Query multiple columns

	Country	VEI
Name
St. Helens	USA	5
Pinatubo	Philippines	6
El Chichón	Mexico	5
Galunggung	Indonesia	4
Nevado del Ruiz	Colombia	3
Merapi	Indonesia	2
Ontake	Japan	2
Soufrière Hills	Montserrat	3
Etna	Italy	2
Nyiragongo	DR Congo	1
Kīlauea	USA	2
Agung	Indonesia	3
Tavurvur	Papua New Guinea	3
Sinabung	Indonesia	3
Taal	Philippines	4
La Soufrière	Saint Vincent	4
Calbuco	Chile	4
St. Augustine	USA	3
Eyjafjallajökull	Iceland	4
Cleveland	USA	3

Note that until now, we have only retrieved either rows or columns (Figure 1). We can also retrieve specific values by specifying both the row and the column.

df.loc[['Calbuco', 'Taal']][['Country', 'VEI']]

	Country	VEI
Name
Calbuco	Chile	4
Taal	Philippines	4

Position-based indexing

Caution 1: Indexing in Python

Remember that in Python, indexing starts from 0 - so the first row or column has an index of 0.

Figure 2: Position-based queries using `.iloc`.

Some situations require querying data by location instead of label - let’s say for instance we need to retrieve rows 10-20. This is done using the .iloc function (instead of the .loc function previously used; Figure 2). Remember that Python uses zero-based indexing (Caution 1), meaning that the first element is at position 0, the second at position 1, and so on.

The next example queries the first row - using again double brackets to return a DataFrame.

df.iloc[[0]]

	Country	Date	VEI	Latitude	Longitude
Name
St. Helens	USA	1980-05-18	5	46.1914	-122.1956

Get ranges of rows

We can get a range of rows using what is called slicing. This is done using the colon (:) operator. The next example queries rows 3 to 6 of the DataFrame. Note that the end index is exclusive, meaning that the element at the end index is not included in the result.

df.iloc[2:6]

	Country	Date	VEI	Latitude	Longitude
Name
El Chichón	Mexico	1982-03-28	5	17.3559	-93.2233
Galunggung	Indonesia	1982-04-05	4	-7.2567	108.0771
Nevado del Ruiz	Colombia	1985-11-13	3	4.8950	-75.3220
Merapi	Indonesia	2023-12-03	2	-7.5407	110.4457

To get rows 3 to 6 and columns 2-3:

df.iloc[0:5, 1:3]

	Date	VEI
Name
St. Helens	1980-05-18	5
Pinatubo	1991-04-02	6
El Chichón	1982-03-28	5
Galunggung	1982-04-05	4
Nevado del Ruiz	1985-11-13	3

Count rows from the last

To get the last 5 rows of the DataFrame:

df.iloc[-5:]

	Country	Date	VEI	Latitude	Longitude
Name
La Soufrière	Saint Vincent	2021-04-09	4	13.2833	-61.3875
Calbuco	Chile	2015-04-22	4	-41.2972	-72.6097
St. Augustine	USA	2006-03-27	3	57.8819	-155.5611
Eyjafjallajökull	Iceland	2010-04-14	4	63.6333	-19.6111
Cleveland	USA	2023-05-23	3	52.8250	-169.9444

Combining position-based and label-based queries

By experience, position-based queries is more used on rows than columns. For instance, we might want to access the first 10 rows because we don’t know their associated labels, yet it is less likely that we ignore their attributes. It is possible mix label-based and position-based indexing. For example, to get the first 5 rows and the Country and VEI columns:

df.iloc[0:5][['Country', 'VEI']]

	Country	VEI
Name
St. Helens	USA	5
Pinatubo	Philippines	6
El Chichón	Mexico	5
Galunggung	Indonesia	4
Nevado del Ruiz	Colombia	3