a = 1
b = 2Filtering data
Now that we have reviewed how to access data, let’s now see how to filter data using boolean indexing. For this, we need to review what are comparison operators (Table 1).
Comparison operators
Let’s assume the following variables (Listing 1):
Applying the comparison operators in Table 1 will produce a variable of type bool - which can take only two values: True or False.
| Operator | Meaning | Example | Result |
|---|---|---|---|
== |
Equal to | a == b |
False |
!= |
Not equal to | a != b |
True |
> |
Greater than | a > b |
False |
< |
Less than | a < b |
True |
>= |
Greater than or equal | a >= b |
False |
<= |
Less than or equal | a <= b |
True |
We can apply comparison operators to DataFrame. Let’s say we want to test what rows have a VEI of 4:
df['VEI'] == 4Name
St. Helens False
Pinatubo False
El Chichón False
Galunggung True
Nevado del Ruiz False
Merapi False
Ontake False
Soufrière Hills False
Etna False
Nyiragongo False
Kīlauea False
Agung False
Tavurvur False
Sinabung False
Taal True
La Soufrière True
Calbuco True
St. Augustine False
Eyjafjallajökull True
Cleveland False
Name: VEI, dtype: bool
We can see that Galunggung, Taal, La Soufrière, Calbuco and Eyjafjallajökull return True to this condition. This is great, but what if we want to return the actual rows? We can use Listing 2 as a mask to then query the rows using .loc.
mask = df['VEI'] == 4 # Create a mask
df.loc[mask] # Query the data
# Or, as a one line alternative:
df.loc[df['VEI'] == 4]| Country | Date | VEI | Latitude | Longitude | |
|---|---|---|---|---|---|
| Name | |||||
| Galunggung | Indonesia | 1982-04-05 | 4 | -7.2567 | 108.0771 |
| Taal | Philippines | 2020-01-12 | 4 | 14.0020 | 120.9934 |
| La Soufrière | Saint Vincent | 2021-04-09 | 4 | 13.2833 | -61.3875 |
| Calbuco | Chile | 2015-04-22 | 4 | -41.2972 | -72.6097 |
| Eyjafjallajökull | Iceland | 2010-04-14 | 4 | 63.6333 | -19.6111 |
What volcanoes have a VEI of 5?
How many volcanoes are in the southern hemisphere? (hint: use the df.shape function to count them).
String comparisons
We can also use comparison operators on columns containing strings (see Caution 1 for caveats). Listing 4 illustrates a basic string comparison using the = operator. Table 2 shows additional operators for strings.
df.loc[df['Country'] == 'Indonesia']| Country | Date | VEI | Latitude | Longitude | |
|---|---|---|---|---|---|
| Name | |||||
| Galunggung | Indonesia | 1982-04-05 | 4 | -7.2567 | 108.0771 |
| Merapi | Indonesia | 2023-12-03 | 2 | -7.5407 | 110.4457 |
| Agung | Indonesia | 2017-11-21 | 3 | -8.3422 | 115.5083 |
| Sinabung | Indonesia | 2023-02-13 | 3 | 3.1719 | 98.3925 |
| Operation | Example | Description |
|---|---|---|
| Contains | df['Name'].str.contains('Soufrière') |
Checks if each string contains a substring |
| Startswith | df['Name'].str.startswith('E') |
Checks if each string starts with a substring |
| Endswith | df['Name'].str.endswith('o') |
Checks if each string ends with a substring |
When using the comparison operators in Table 1, we need to make sure that we are comparing data that have the same type. In Listing 2, we are comparing the column VEI with an integer number. You can check the data type of a DataFrame using df.dtypes.
Not all comparison operators work with all data type. For instance, you can test if a column contains a specific string using the == or != operators, but the other won’t work as they are illogical.
Logical operators
But what if we want to create more complex filters based on different rules? We can use logical operators to combine several comparison operators. Going back to the example in Listing 1, Table 3 illustrates the use of logical operators.
| Operator | Meaning | Example | Result |
|---|---|---|---|
& |
Logical AND | (a > 1) & (b < 3) |
False |
| |
Logical OR | (a == 1) | (b == 1) |
True |
~ |
Logical NOT | ~(a == 1) |
False |
Let’s gather all volcanoes that have a VEI of 3 and are in Indonesia:
mask = (df['VEI'] == 3) & (df['Country'] == 'Indonesia') # Create a mask - don't forget parentheses!
df.loc[mask] # Query the data| Country | Date | VEI | Latitude | Longitude | |
|---|---|---|---|---|---|
| Name | |||||
| Agung | Indonesia | 2017-11-21 | 3 | -8.3422 | 115.5083 |
| Sinabung | Indonesia | 2023-02-13 | 3 | 3.1719 | 98.3925 |
How many volcanoes are either in Chile or in the USA?
How many volcanoes are in the southern hemisphere and have a VEI≥4?