Data structure
Let’s get our hands dirty and start coding. Create a new Jupyter notebook following this guide. You can copy fragments of the code, but make sure each code block goes in a separate cell in your notebook. Also remember that you can add Markdown cells between code cells, which are really useful for documenting your code.
The data we will use here is a CSV file containing selected eruptions of the past 50 years. The first 5 rows of the data are shown in Table 1.
| Name | Country | Date | VEI | Latitude | Longitude |
|---|---|---|---|---|---|
| St. Helens | USA | 1980-05-18 00:00:00 | 5 | 46.1914 | -122.196 |
| Pinatubo | Philippines | 1991-04-02 00:00:00 | 6 | 15.1501 | 120.347 |
| El Chichón | Mexico | 1982-03-28 00:00:00 | 5 | 17.3559 | -93.2233 |
| Galunggung | Indonesia | 1982-04-05 00:00:00 | 4 | -7.2567 | 108.077 |
| Nevado del Ruiz | Colombia | 1985-11-13 00:00:00 | 3 | 4.895 | -75.322 |
The Volcanic Explosivity Index (VEI) is a scale that measures the magnitude of explosive eruptions based on the volume of tephra ejected during an eruption. It is a logarithmic scale in base 10, so each step up in VEI corresponds to a tenfold increase in erupted volume:
| VEI | Min Volume (km³) | Max Volume (km³) | Approx. Frequency |
|---|---|---|---|
| 0 | <0.00001 | 0.0001 | Daily |
| 1 | 0.0001 | 0.001 | Weekly |
| 2 | 0.001 | 0.01 | Yearly |
| 3 | 0.01 | 0.1 | Few per year |
| 4 | 0.1 | 1 | ~10 per decade |
| 5 | 1 | 10 | ~1 per decade |
| 6 | 10 | 100 | ~1 per century |
| 7 | 100 | 1000 | ~1 per several centuries |
| 8 | >1000 | - | ~1 per 10,000 years |
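To make the base-10 logic concrete, here is a minimal sketch that converts a tephra volume into a VEI class. The function name vei_from_volume is hypothetical (it is not part of the dataset or of pandas), and the sketch simply mirrors the volume bounds in the table above, assuming each lower bound is inclusive.

```python
import math


def vei_from_volume(volume_km3: float) -> int:
    """Hypothetical helper: map a tephra volume (km³) to a VEI class.

    Mirrors the table above, assuming lower bounds are inclusive.
    """
    if volume_km3 <= 0:
        raise ValueError("volume must be positive")
    # Each VEI step spans one power of ten; 1 km³ (10^0) is where VEI 5 starts
    vei = math.floor(math.log10(volume_km3)) + 5
    # The scale is bounded, so clamp the result to the 0-8 range
    return max(0, min(8, vei))


for volume in (0.000001, 0.005, 0.5, 5, 50, 2000):
    print(f"{volume} km³ -> VEI {vei_from_volume(volume)}")
```

Multiplying the volume by 10 moves the result up by exactly one class, which is what a base-10 logarithmic scale means in practice.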
Importing the library and the data
As always, we start by importing the pandas library as pd.
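The import itself is not shown in this section. By the usual convention it would be the first cell of the notebook:

```python
import pandas as pd  # pd is the conventional alias, used in the rest of this guide
```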
We load the dataset into a variable called df (for DataFrame) using the pd.read_csv function (doc). Remember that functions can take different arguments, which are extra keywords you can pass to make the behaviour of the function more specific to your needs. Here, we pass one argument to the read_csv() function: parse_dates=['Date'] specifies that the Date column should be parsed as dates rather than left as text.
df = pd.read_csv('https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/dummy_volcanoes.csv', parse_dates=['Date']) # Load data
df.head() # Show the first 5 rows
| | Name | Country | Date | VEI | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | St. Helens | USA | 1980-05-18 | 5 | 46.1914 | -122.1956 |
| 1 | Pinatubo | Philippines | 1991-04-02 | 6 | 15.1501 | 120.3465 |
| 2 | El Chichón | Mexico | 1982-03-28 | 5 | 17.3559 | -93.2233 |
| 3 | Galunggung | Indonesia | 1982-04-05 | 4 | -7.2567 | 108.0771 |
| 4 | Nevado del Ruiz | Colombia | 1985-11-13 | 3 | 4.8950 | -75.3220 |
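To see what parse_dates changes, the sketch below reads the same small CSV twice, once without and once with the argument. The CSV text is typed inline from two rows of Table 1 (an assumption made so the cell runs without downloading the file), and the names csv_text, raw and parsed are only illustrative.

```python
import io

import pandas as pd

# Two rows copied from Table 1, stored as inline CSV text instead of a file
csv_text = """Name,Country,Date,VEI,Latitude,Longitude
St. Helens,USA,1980-05-18 00:00:00,5,46.1914,-122.1956
Pinatubo,Philippines,1991-04-02 00:00:00,6,15.1501,120.3465
"""

raw = pd.read_csv(io.StringIO(csv_text))                           # Date left as text
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])  # Date parsed as dates

print(raw['Date'].dtype)                 # a text dtype
print(parsed['Date'].dtype)              # a datetime dtype
print(parsed['Date'].dt.year.tolist())   # date parts are now accessible
```

Only the parsed version lets you work with the column as dates, for example extracting the year or sorting chronologically.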
Setting up the index
The output of Listing 1 shows the first 5 rows in our DataFrame. As displayed here, the first column is the index, which is currently just a sequence of integers. That can be acceptable in some cases, but for the sake of the exercise we will choose one column to become the index: here, Name.
Listing 2 illustrates the use of two useful functions:
- .set_index(): Uses a column as the DataFrame’s index.
- .reset_index(): Moves the index back into a regular column and restores sequential numbering as in Listing 1.
df = df.set_index('VEI') # Set the 'VEI' column as the index
df = df.reset_index() # Shoot, I meant to set the 'Name' column as the index
df = df.set_index('Name') # Here we go.
df.head()
| Name | VEI | Country | Date | Latitude | Longitude |
|---|---|---|---|---|---|
| St. Helens | 5 | USA | 1980-05-18 | 46.1914 | -122.1956 |
| Pinatubo | 6 | Philippines | 1991-04-02 | 15.1501 | 120.3465 |
| El Chichón | 5 | Mexico | 1982-03-28 | 17.3559 | -93.2233 |
| Galunggung | 4 | Indonesia | 1982-04-05 | -7.2567 | 108.0771 |
| Nevado del Ruiz | 3 | Colombia | 1985-11-13 | 4.8950 | -75.3220 |
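The listing above reassigns df after every call because both functions return a new DataFrame rather than modifying the original. A minimal sketch of that behaviour, using a hand-built stand-in called demo (an illustrative name, with three rows typed in from Table 1) instead of the downloaded file:

```python
import pandas as pd

# Hand-built stand-in for df, using three rows from Table 1
demo = pd.DataFrame({
    'Name': ['St. Helens', 'Pinatubo', 'El Chichón'],
    'Country': ['USA', 'Philippines', 'Mexico'],
    'VEI': [5, 6, 5],
})

by_name = demo.set_index('Name')   # returns a new DataFrame; demo is unchanged
print(demo.columns.tolist())       # 'Name' is still a regular column here
print(by_name.columns.tolist())    # 'Name' has moved out of the columns
print(by_name.index.name)          # and is now the name of the index

restored = by_name.reset_index()   # 'Name' becomes a regular column again
print(restored.columns.tolist())
print(restored.index.tolist())     # back to sequential integers
```

If the result is not assigned to a variable, the change is lost, which is a common source of confusion when starting with pandas.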
Basic data exploration
Let’s now explore the structure of the dataset with the following functions and attributes:
| Function | Description |
|---|---|
| df.head() | Prints the first 5 rows of the DataFrame (doc) |
| df.tail() | Prints the last 5 rows of the DataFrame (doc) |
| df.info() | Displays some info about the DataFrame, including the number of rows (entries) and columns (doc). Note the Dtype column: this is the type of variable stored in each column, including strings (object), integer (int64) and float (float64) numbers. See that the Date column is indeed stored as a datetime variable, as requested above. |
| df.shape | Returns a tuple containing the number of rows and columns of the DataFrame. |
| df.index | Returns the index (labels) along the rows of the DataFrame. |
| df.columns | Returns the index (labels) along the columns of the DataFrame. |
Try these functions on df and get familiar with the output.
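If you want to check your reading of the output, the sketch below runs the same inspection on a small hand-built table. The name demo and its three rows (typed in from Table 1) are only illustrative, so the values are easy to verify by eye.

```python
import pandas as pd

# Hand-built stand-in for df, indexed by volcano name
demo = pd.DataFrame(
    {
        'Country': ['USA', 'Philippines', 'Mexico'],
        'VEI': [5, 6, 5],
        'Latitude': [46.1914, 15.1501, 17.3559],
    },
    index=pd.Index(['St. Helens', 'Pinatubo', 'El Chichón'], name='Name'),
)

n_rows, n_cols = demo.shape     # shape is an attribute, so no parentheses
print(n_rows, n_cols)           # number of rows, then number of columns
print(demo.index.tolist())      # row labels
print(demo.columns.tolist())    # column labels
print(demo.dtypes)              # one data type per column
```

Note that head(), tail() and info() are called with parentheses, whereas shape, index and columns are accessed without them.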
Sorting data
The main function to sort data is .sort_values (doc). Three arguments alter the function’s behaviour:
- by: First argument (required) is the label, or list of labels, used to sort the data. It is possible to sort by multiple columns by passing a list of values.
- axis: Specifies whether sorting rows (axis = 0, in which case by is a column name) or sorting columns (axis = 1, in which case by is an index value). The default is axis = 0, which means that rows will be sorted if axis is not specified.
- ascending: Using a bool (remember, this is a True/False value), specifies if values are sorted in ascending (ascending = True, the default behaviour if not specified) or descending (ascending = False) order.
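The less common combinations of these arguments can be tried on a small hand-built table. In the sketch below, demo and coords are illustrative names and the three rows are typed in from Table 1. It shows a per-column list for ascending and a column sort with axis = 1.

```python
import pandas as pd

# Hand-built stand-in for df, indexed by volcano name
demo = pd.DataFrame(
    {
        'VEI': [5, 6, 5],
        'Latitude': [46.1914, 15.1501, 17.3559],
        'Longitude': [-122.1956, 120.3465, -93.2233],
    },
    index=pd.Index(['St. Helens', 'Pinatubo', 'El Chichón'], name='Name'),
)

# A list for `ascending` sets the direction per column:
# VEI from high to low, ties broken by Latitude from low to high
mixed = demo.sort_values(['VEI', 'Latitude'], ascending=[False, True])
print(mixed.index.tolist())      # ['Pinatubo', 'El Chichón', 'St. Helens']

# axis=1 reorders the columns, using the values found in one row
coords = demo[['Latitude', 'Longitude']]
by_row = coords.sort_values(by='St. Helens', axis=1)
print(by_row.columns.tolist())   # ['Longitude', 'Latitude']
```

As with .set_index(), .sort_values returns a sorted copy, so demo itself keeps its original order unless the result is assigned back to it.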
df.sort_values('VEI') # Sort volcanoes by VEI in ascending order
df.sort_values('Date', ascending=False) # Sort volcanoes by eruption date from recent to old
df.sort_values('Country') # .sort_values also works on strings to sort alphabetically
df.sort_values(['Latitude', 'Longitude']) # Sort using multiple columns
After sorting the data in descending order by VEI and time, what are the first three volcanoes?