Filtering data

Now that we have reviewed how to access data, let’s now see how to filter data using boolean indexing. For this, we need to review what are comparison operators (Table 1).

Comparison operators

Let’s assume the following variables (Listing 1):

Listing 1: Variables used for illustrating logical operations
a = 1
b = 2

Applying the comparison operators in Table 1 will produce a variable of type bool - which can take only two values: True or False.

Table 1: Comparison operators in Python.
Operator Meaning Example Result
== Equal to a == b False
!= Not equal to a != b True
> Greater than a > b False
< Less than a < b True
>= Greater than or equal a >= b False
<= Less than or equal a <= b True

We can apply comparison operators to DataFrame. Let’s say we want to test what rows have a VEI of 4:

Listing 2: Create a boolean mask
df['VEI'] == 4
Name
St. Helens          False
Pinatubo            False
El Chichón          False
Galunggung           True
Nevado del Ruiz     False
Merapi              False
Ontake              False
Soufrière Hills     False
Etna                False
Nyiragongo          False
Kīlauea             False
Agung               False
Tavurvur            False
Sinabung            False
Taal                 True
La Soufrière         True
Calbuco              True
St. Augustine       False
Eyjafjallajökull     True
Cleveland           False
Name: VEI, dtype: bool

We can see that Galunggung, Taal, La Soufrière, Calbuco and Eyjafjallajökull return True to this condition. This is great, but what if we want to return the actual rows? We can use Listing 2 as a mask to then query the rows using .loc.

Listing 3: Query data using a boolean mask
mask = df['VEI'] == 4 # Create a mask
df.loc[mask] # Query the data

# Or, as a one line alternative:
df.loc[df['VEI'] == 4]
Country Date VEI Latitude Longitude
Name
Galunggung Indonesia 1982-04-05 4 -7.2567 108.0771
Taal Philippines 2020-01-12 4 14.0020 120.9934
La Soufrière Saint Vincent 2021-04-09 4 13.2833 -61.3875
Calbuco Chile 2015-04-22 4 -41.2972 -72.6097
Eyjafjallajökull Iceland 2010-04-14 4 63.6333 -19.6111
Question

What volcanoes have a VEI of 5?

How many volcanoes are in the southern hemisphere? (hint: use the df.shape function to count them).

String comparisons

We can also use comparison operators on columns containing strings (see Caution 1 for caveats). Listing 4 illustrates a basic string comparison using the = operator. Table 2 shows additional operators for strings.

Listing 4: Basic comparison operation on strings
df.loc[df['Country'] == 'Indonesia']
Country Date VEI Latitude Longitude
Name
Galunggung Indonesia 1982-04-05 4 -7.2567 108.0771
Merapi Indonesia 2023-12-03 2 -7.5407 110.4457
Agung Indonesia 2017-11-21 3 -8.3422 115.5083
Sinabung Indonesia 2023-02-13 3 3.1719 98.3925
Table 2: Common string comparison operations.
Operation Example Description
Contains df['Name'].str.contains('Soufrière') Checks if each string contains a substring
Startswith df['Name'].str.startswith('E') Checks if each string starts with a substring
Endswith df['Name'].str.endswith('o') Checks if each string ends with a substring

When using the comparison operators in Table 1, we need to make sure that we are comparing data that have the same type. In Listing 2, we are comparing the column VEI with an integer number. You can check the data type of a DataFrame using df.dtypes.

Not all comparison operators work with all data type. For instance, you can test if a column contains a specific string using the == or != operators, but the other won’t work as they are illogical.

Logical operators

But what if we want to create more complex filters based on different rules? We can use logical operators to combine several comparison operators. Going back to the example in Listing 1, Table 3 illustrates the use of logical operators.

Table 3: Logical operators in pandas for combining boolean conditions. Use parentheses around each condition.
Operator Meaning Example Result
& Logical AND (a > 1) & (b < 3) False
| Logical OR (a == 1) | (b == 1) True
~ Logical NOT ~(a == 1) False

Let’s gather all volcanoes that have a VEI of 3 and are in Indonesia:

Listing 5: Complex filtering using logical operators
mask = (df['VEI'] == 3) & (df['Country'] == 'Indonesia') # Create a mask - don't forget parentheses!
df.loc[mask] # Query the data
Country Date VEI Latitude Longitude
Name
Agung Indonesia 2017-11-21 3 -8.3422 115.5083
Sinabung Indonesia 2023-02-13 3 3.1719 98.3925
Question

How many volcanoes are either in Chile or in the USA?

How many volcanoes are in the southern hemisphere and have a VEI≥4?