Data structure

Let’s get our hands dirty and start coding. Create a new Jupyter notebook following this guide. You can copy fragments of the code, but make sure each code block goes into a separate cell in your notebook. Also remember that you can add Markdown cells in between code cells, which are really useful for documenting your code.

The data we will use here is a csv file containing selected eruptions of the past 50 years. The first 5 rows of the data are illustrated in Table 1.

Table 1: First 5 rows of the dataset.

Name	Country	Date	VEI	Latitude	Longitude
St. Helens	USA	1980-05-18 00:00:00	5	46.1914	-122.196
Pinatubo	Philippines	1991-04-02 00:00:00	6	15.1501	120.347
El Chichón	Mexico	1982-03-28 00:00:00	5	17.3559	-93.2233
Galunggung	Indonesia	1982-04-05 00:00:00	4	-7.2567	108.077
Nevado del Ruiz	Colombia	1985-11-13 00:00:00	3	4.895	-75.322

What is the VEI?

The Volcanic Explosivity Index - or VEI - is a scale to measure the magnitude of explosive eruptions based on the volume of tephra ejected during an eruption. It is a logarithmic scale in base 10:

Table 2: VEI scale with minimum and maximum erupted volume and approximate frequency.

VEI	Min Volume (km³)	Max Volume (km³)	Approx. Frequency
0	<0.00001	0.0001	Daily
1	0.0001	0.001	Weekly
2	0.001	0.01	Yearly
3	0.01	0.1	Few per year
4	0.1	1	~10 per decade
5	1	10	~1 per decade
6	10	100	~1 per century
7	100	1000	~1 per several centuries
8	>1000	-	~1 per 10,000 years
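To make the base-10 pattern concrete, here is a minimal sketch that reproduces the minimum volumes of Table 2 for VEI 1 to 8. The function name `min_volume_km3` and the formula `10 ** (vei - 5)` are assumptions inferred from the table itself, not an official definition of the index.

```python
# Minimum erupted volume (km³) for VEI 1 to 8, following the pattern in Table 2.
# The formula 10 ** (vei - 5) is inferred from the table, not an official definition.
def min_volume_km3(vei):
    """Return the lower volume bound (km³) of a VEI class between 1 and 8."""
    if not 1 <= vei <= 8:
        raise ValueError("the pattern only matches Table 2 for VEI 1 to 8")
    return 10.0 ** (vei - 5)

# Print the lower bound of each class
for vei in range(1, 9):
    print(vei, min_volume_km3(vei))

# Each step up the scale multiplies the minimum volume by 10
ratio = min_volume_km3(6) / min_volume_km3(5)
print(ratio)
```

VEI 0 is left out because its lower bound in Table 2 is open-ended.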

Importing the library and the data

As always, we start by importing the pandas library as pd.

import pandas as pd

We load the dataset into a variable called df (for DataFrame) using the pd.read_csv function (doc). Remember that functions can take different arguments, which are extra keywords you can pass to make the behaviour of the function more specific to your needs. Here, we pass one argument to the read_csv() function: parse_dates=['Date'], which specifies that the Date column should be treated as a date object.

Listing 1: Loading data from a csv file

df = pd.read_csv('https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/dummy_volcanoes.csv', parse_dates=['Date']) # Load data
df.head() # Show the first 5 rows

	Name	Country	Date	VEI	Latitude	Longitude
0	St. Helens	USA	1980-05-18	5	46.1914	-122.1956
1	Pinatubo	Philippines	1991-04-02	6	15.1501	120.3465
2	El Chichón	Mexico	1982-03-28	5	17.3559	-93.2233
3	Galunggung	Indonesia	1982-04-05	4	-7.2567	108.0771
4	Nevado del Ruiz	Colombia	1985-11-13	3	4.8950	-75.3220
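To see what parse_dates changes, the sketch below reads the same small CSV twice, once without and once with the argument. It uses two rows from Table 1 written as inline text (`csv_text` is a stand-in for the real file, so the example runs without a network connection); the variable names `raw` and `parsed` are illustrative.

```python
import io
import pandas as pd

# Two rows copied from Table 1, written as inline CSV text (a stand-in for the real file)
csv_text = """Name,Country,Date,VEI,Latitude,Longitude
St. Helens,USA,1980-05-18,5,46.1914,-122.1956
Pinatubo,Philippines,1991-04-02,6,15.1501,120.3465
"""

# Without parse_dates, the Date column is read as plain text
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the Date column is converted to a datetime type
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(raw['Date'].dtype)
print(parsed['Date'].dtype)

# Datetime columns give access to date components, such as the year
print(parsed['Date'].dt.year.tolist())
```

Once the column is stored as a datetime, date components such as the year become available through the `.dt` accessor, which is not possible on plain text.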

Setting up the index

The output of Listing 1 shows the first 5 rows in our DataFrame. As displayed here, the first column is the index - which is currently just integer numbers. That can be acceptable in some cases, but for the sake of the exercise we will choose one column to become the index - here Name.

Listing 2 illustrates two useful functions:

.set_index(): Uses a column as the DataFrame’s index
.reset_index(): Resets the index to a sequential numbering as in Listing 1; the former index becomes a regular column again.

Listing 2: Common functions to set the index of a DataFrame

df = df.set_index('VEI') # Set the 'VEI' column as the index
df = df.reset_index() # Oops, I meant to set the 'Name' column as the index
df = df.set_index('Name') # Here we go.
df.head()

	VEI	Country	Date	Latitude	Longitude
Name
St. Helens	5	USA	1980-05-18	46.1914	-122.1956
Pinatubo	6	Philippines	1991-04-02	15.1501	120.3465
El Chichón	5	Mexico	1982-03-28	17.3559	-93.2233
Galunggung	4	Indonesia	1982-04-05	-7.2567	108.0771
Nevado del Ruiz	3	Colombia	1985-11-13	4.8950	-75.3220
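Note that Listing 2 reassigns the result to df each time. As a minimal sketch of why, the example below uses a small hypothetical DataFrame called `df_demo` (three rows from Table 1, reduced to two columns) to show that both functions return a new DataFrame and leave the original untouched.

```python
import pandas as pd

# Three rows from Table 1, reduced to two columns for brevity
df_demo = pd.DataFrame({
    'Name': ['St. Helens', 'Pinatubo', 'El Chichón'],
    'VEI': [5, 6, 5],
})

indexed = df_demo.set_index('Name')   # new DataFrame; df_demo is unchanged
restored = indexed.reset_index()      # 'Name' becomes a regular column again

print(df_demo.columns.tolist())    # 'Name' is still a column in the original
print(indexed.columns.tolist())    # 'Name' has moved to the index
print(indexed.index.name)
print(restored.columns.tolist())
```

Without the reassignment (df = ...), the call would have no lasting effect on df.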

Basic data exploration

Let’s now explore the structure of the dataset with the following functions:

Function	Description
`df.head()`	Returns the first 5 rows of the DataFrame (doc)
`df.tail()`	Returns the last 5 rows of the DataFrame (doc)
`df.info()`	Displays some info about the DataFrame, including the number of rows (entries) and columns (doc). Note the `Dtype` column: this is the type of variable stored in each column, including strings (`object`), integer (`int64`) and float (`float64`) numbers. See that the `Date` column is indeed stored as a `datetime` variable as requested above.
`df.shape`	Returns a tuple containing the number of rows and columns of the DataFrame.
`df.index`	Returns the index along the rows of the DataFrame (an `Index` object).
`df.columns`	Returns the index along the columns of the DataFrame (an `Index` object).
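As a quick illustration of the three attributes at the bottom of the table, here is a sketch on a small hypothetical DataFrame (`df_demo`, built from three rows of Table 1). Note that shape, index and columns are attributes, so they are written without parentheses.

```python
import pandas as pd

# Three rows from Table 1, with 'Name' used as the index
df_demo = pd.DataFrame({
    'Name': ['St. Helens', 'Pinatubo', 'El Chichón'],
    'Country': ['USA', 'Philippines', 'Mexico'],
    'VEI': [5, 6, 5],
}).set_index('Name')

# shape is a tuple (rows, columns), so it can be unpacked into two variables
n_rows, n_cols = df_demo.shape
print(n_rows, n_cols)

# index and columns are Index objects; list() turns them into plain lists
print(list(df_demo.index))
print(list(df_demo.columns))
```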

Your turn!

Try these functions on df and get familiar with the output.

Sorting data

The main function to sort data is .sort_values (doc). Three arguments are worth reviewing, as they alter the function’s behaviour:

by: First argument (required) is the label used to sort the data. It is possible to sort by multiple columns by passing a list of labels.
axis: Specifies whether to sort rows (axis = 0, in which case by is a column name) or columns (axis = 1, in which case by is an index label). The default is axis = 0, which means that rows will be sorted if axis is not specified.
ascending: A bool (remember, this is a True/False value) that specifies whether values are sorted in ascending (ascending = True, the default behaviour if not specified) or descending (ascending = False) order.

Listing 3: Basic sorting operations

df.sort_values('VEI') # Sort volcanoes by VEI in ascending order
df.sort_values('Date', ascending=False) # Sort volcanoes by eruption date from recent to old
df.sort_values('Country') # .sort_values also works on strings to sort alphabetically
df.sort_values(['Latitude', 'Longitude']) # Sorting using multiple columns
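Two details of .sort_values are easy to miss, and the sketch below illustrates them on a small hypothetical DataFrame (`df_demo`, four rows from Table 1): the function returns a new DataFrame rather than reordering the original, and ascending also accepts a list with one value per column in by.

```python
import pandas as pd

# Four rows from Table 1, with 'Name' used as the index
df_demo = pd.DataFrame({
    'Name': ['St. Helens', 'Pinatubo', 'El Chichón', 'Galunggung'],
    'Country': ['USA', 'Philippines', 'Mexico', 'Indonesia'],
    'VEI': [5, 6, 5, 4],
}).set_index('Name')

# sort_values returns a new DataFrame; df_demo keeps its original order
by_vei = df_demo.sort_values('VEI')

# One ascending flag per column: VEI from high to low,
# ties broken alphabetically by Country
mixed = df_demo.sort_values(['VEI', 'Country'], ascending=[False, True])

print(list(df_demo.index))   # original order, unchanged
print(list(by_vei.index))
print(list(mixed.index))
```

To keep a sorted version, assign the result to a variable, as done here with `by_vei` and `mixed`.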

Question

After sorting the data in descending order by VEI and date, which are the first three volcanoes?