Data structure

Let’s get our hands dirty and start coding. Create a new Jupyter notebook and follow this guide. You can copy fragments of the code, but make sure each code block goes in a separate cell of your notebook. Also remember that you can add Markdown cells between code cells, which are really useful for documenting your code.

The data we will use here is a CSV file containing selected eruptions of the past 50 years. The first 5 rows of the data are shown in Table 1.

Table 1: First 5 rows of the dataset.
Name Country Date VEI Latitude Longitude
St. Helens USA 1980-05-18 00:00:00 5 46.1914 -122.196
Pinatubo Philippines 1991-04-02 00:00:00 6 15.1501 120.347
El Chichón Mexico 1982-03-28 00:00:00 5 17.3559 -93.2233
Galunggung Indonesia 1982-04-05 00:00:00 4 -7.2567 108.077
Nevado del Ruiz Colombia 1985-11-13 00:00:00 3 4.895 -75.322

The Volcanic Explosivity Index - or VEI - is a scale to measure the magnitude of explosive eruptions based on the volume of tephra ejected during an eruption. It is a logarithmic scale in base 10:

Table 2: VEI scale with minimum and maximum erupted volume and approximate frequency.
VEI Min Volume (km³) Max Volume (km³) Approx. Frequency
0 <0.00001 0.0001 Daily
1 0.0001 0.001 Weekly
2 0.001 0.01 Yearly
3 0.01 0.1 Few per year
4 0.1 1 ~10 per decade
5 1 10 ~1 per decade
6 10 100 ~1 per century
7 100 1000 ~1 per several centuries
8 >1000 - ~1 per 10,000 years
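To make the base-10 logic concrete, here is a minimal sketch that reproduces the volume bounds of Table 2 for VEI 1 to 7. The function name `vei_volume_bounds` is hypothetical, and the formula is simply read off the table as given here (minimum volume of 10^(VEI - 5) km³); it is not an official definition.

```python
def vei_volume_bounds(vei):
    """Return (min, max) erupted volume in km³ for VEI 1 to 7, read off Table 2."""
    if not 1 <= vei <= 7:
        # VEI 0 and 8 are open-ended in Table 2, so this sketch leaves them out
        raise ValueError("This sketch only covers VEI 1 to 7")
    return 10.0 ** (vei - 5), 10.0 ** (vei - 4)


for vei in (3, 5, 6):
    low, high = vei_volume_bounds(vei)
    print(f"VEI {vei}: {low:g} to {high:g} km³")
```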

Importing the library and the data

As always, we start by importing the pandas library as pd.

import pandas as pd

We load the dataset into a variable called df (for DataFrame) using the pd.read_csv function (doc). Remember that functions can take different arguments, which are extra keywords you can pass to make the behaviour of the function more specific to your needs. Here, we pass one argument to the read_csv() function: parse_dates=['Date'] specifies that the Date column should be treated as a date object.

Listing 1: Loading data from a csv file
df = pd.read_csv('https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/dummy_volcanoes.csv', parse_dates=['Date']) # Load data
df.head() # Show the first 5 rows
Name Country Date VEI Latitude Longitude
0 St. Helens USA 1980-05-18 5 46.1914 -122.1956
1 Pinatubo Philippines 1991-04-02 6 15.1501 120.3465
2 El Chichón Mexico 1982-03-28 5 17.3559 -93.2233
3 Galunggung Indonesia 1982-04-05 4 -7.2567 108.0771
4 Nevado del Ruiz Colombia 1985-11-13 3 4.8950 -75.3220
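To see what parse_dates changes, the following sketch reads the same small piece of text twice, with and without the argument. It assumes two rows copied from Table 1 and stored in a string (the variable names csv_text, raw and parsed are only for illustration), so it runs without downloading the file.

```python
import io

import pandas as pd

# Two rows copied from Table 1, stored as text to stand in for a CSV file
csv_text = """Name,Country,Date,VEI,Latitude,Longitude
St. Helens,USA,1980-05-18,5,46.1914,-122.1956
Pinatubo,Philippines,1991-04-02,6,15.1501,120.3465
"""

raw = pd.read_csv(io.StringIO(csv_text))                           # Date is read as text
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])  # Date is read as datetime

print(raw["Date"].dtype)     # a text dtype
print(parsed["Date"].dtype)  # a datetime dtype

# Datetime columns give access to date components through the .dt accessor
print(parsed["Date"].dt.year.tolist())
```

Once the column is a datetime, operations such as sorting by date or extracting the year behave as expected instead of treating the values as plain text.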

Setting up the index

The output of Listing 1 shows the first 5 rows in our DataFrame. As displayed here, the first column is the index - which is currently just integer numbers. That can be acceptable in some cases, but for the sake of the exercise we will choose one column to become the index - here Name.

Listing 2 illustrates the use of two useful functions:

  • .set_index(): Uses a column as the DataFrame’s index
  • .reset_index(): Moves the index back into a regular column and restores a sequential integer index, as in Listing 1.
Listing 2: Common functions to set the index of a DataFrame
df = df.set_index('VEI') # Set the 'VEI' column as an index
df = df.reset_index() # Oops, I meant to set the 'Name' column as the index
df = df.set_index('Name') # Here we go.
df.head()
VEI Country Date Latitude Longitude
Name
St. Helens 5 USA 1980-05-18 46.1914 -122.1956
Pinatubo 6 Philippines 1991-04-02 15.1501 120.3465
El Chichón 5 Mexico 1982-03-28 17.3559 -93.2233
Galunggung 4 Indonesia 1982-04-05 -7.2567 108.0771
Nevado del Ruiz 3 Colombia 1985-11-13 4.8950 -75.3220
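One practical benefit of a meaningful index is label-based lookup. The sketch below uses a small hypothetical DataFrame called demo (three rows taken from Table 1) rather than the full dataset, and shows that both functions return a new DataFrame instead of modifying the original.

```python
import pandas as pd

# Small stand-in for the full dataset (values taken from Table 1)
demo = pd.DataFrame({
    "Name": ["St. Helens", "Pinatubo", "El Chichón"],
    "Country": ["USA", "Philippines", "Mexico"],
    "VEI": [5, 6, 5],
})

by_name = demo.set_index("Name")       # returns a new DataFrame; demo itself is unchanged
print(by_name.loc["Pinatubo", "VEI"])  # look up a value by volcano name

restored = by_name.reset_index()       # 'Name' becomes a regular column again
print(restored.columns.tolist())
```

This is why Listing 2 assigns the result back to df each time: without the assignment, the change would be lost.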

Basic data exploration

Let’s now explore the structure of the dataset with the following functions:

Function Description
df.head() Prints the first 5 rows of the DataFrame (doc)
df.tail() Prints the last 5 rows of the DataFrame (doc)
df.info() Displays a summary of the DataFrame, including the number of rows (entries) and columns (doc). Note the Dtype column: this is the type of variable stored in each column, including strings (object), integers (int64) and floats (float64). The Date column is indeed stored as a datetime variable, as requested above.
df.shape Returns a tuple containing the number of rows and columns of the DataFrame.
df.index Returns an Index object containing the labels along the rows of the DataFrame.
df.columns Returns an Index object containing the labels along the columns of the DataFrame.
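As a quick illustration on a hypothetical two-row DataFrame named demo (not the course dataset), note that shape, index and columns are attributes, so they are written without parentheses, whereas head(), tail() and info() are called as functions.

```python
import pandas as pd

# Tiny stand-in DataFrame indexed by volcano name
demo = pd.DataFrame(
    {"Country": ["USA", "Philippines"], "VEI": [5, 6]},
    index=pd.Index(["St. Helens", "Pinatubo"], name="Name"),
)

n_rows, n_cols = demo.shape   # shape can be unpacked into two numbers
print(n_rows, n_cols)

print(demo.index.tolist())    # row labels, converted to a plain Python list
print(demo.columns.tolist())  # column labels, converted to a plain Python list
print(demo.dtypes)            # type of data stored in each column
```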
Your turn!

Try these functions and attributes on df and get familiar with the output.

Sorting data

The main function to sort data is .sort_values (doc). It is necessary to review how three arguments can alter the function’s behaviour:

  1. by: The first (required) argument is the label used to sort the data: a column name when sorting rows, or an index value when sorting columns. It is possible to sort by multiple columns by passing a list of labels.
  2. axis: Specifies whether to sort rows (axis=0, in which case by is a column name) or columns (axis=1, in which case by is an index value). The default is axis=0, which means that rows are sorted if axis is not specified.
  3. ascending: A bool (remember, this is a True/False value) that specifies whether values are sorted in ascending (ascending=True, the default if not specified) or descending (ascending=False) order.
Listing 3: Basic sorting operations
df.sort_values('VEI') # Sort volcanoes by VEI in ascending order
df.sort_values('Date', ascending=False) # Sort volcanoes by eruption dates from recent to old
df.sort_values('Country') # .sort_values also works on strings to sort alphabetically
df.sort_values(['Latitude', 'Longitude']) # Sorting using multiple columns
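When sorting by several columns, ascending can also take a list with one bool per column. The sketch below assumes a small hypothetical DataFrame called demo built from four rows of Table 1; it also shows that .sort_values returns a sorted copy and leaves the original untouched unless the result is assigned.

```python
import pandas as pd

# Small stand-in for the full dataset (values taken from Table 1)
demo = pd.DataFrame({
    "Name": ["St. Helens", "Pinatubo", "El Chichón", "Galunggung"],
    "Country": ["USA", "Philippines", "Mexico", "Indonesia"],
    "VEI": [5, 6, 5, 4],
}).set_index("Name")

# Sort by VEI from high to low, then break ties alphabetically by country
ranked = demo.sort_values(["VEI", "Country"], ascending=[False, True])

print(ranked.index.tolist())  # order after sorting
print(demo.index.tolist())    # demo keeps its original order
```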
Question

After sorting the data in descending order by VEI and Date, what are the first three volcanoes?