DataFrames for Data Science
October 15, 2025
We assume that you all followed Guy Simpson’s Python crash course
pandas: a Python package for manipulating and analysing structured data
Understand what a pandas DataFrame is and its basic anatomy
Synthetic dataset of selected volcanic eruptions → first 5 rows:
| Name | Country | Date | VEI | Latitude | Longitude |
|---|---|---|---|---|---|
| St. Helens | USA | 1980-05-18 | 5 | 46.1914 | -122.196 |
| Pinatubo | Philippines | 1991-04-02 | 6 | 15.1501 | 120.347 |
| El Chichón | Mexico | 1982-03-28 | 5 | 17.3559 | -93.2233 |
| Galunggung | Indonesia | 1982-04-05 | 4 | -7.2567 | 108.077 |
| Nevado del Ruiz | Colombia | 1985-11-13 | 3 | 4.895 | -75.322 |
pandas library
pd.read_csv() function → df
df.head() on the df object
# Import the required packages
import pandas as pd
# Read the data
df = pd.read_csv('https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/dummy_volcanoes.csv', parse_dates=['Date']) # Load data
# Show the first 3 rows
df.head(3)

| | Name | Country | Date | VEI | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | St. Helens | USA | 1980-05-18 | 5 | 46.1914 | -122.1956 |
| 1 | Pinatubo | Philippines | 1991-04-02 | 6 | 15.1501 | 120.3465 |
| 2 | El Chichón | Mexico | 1982-03-28 | 5 | 17.3559 | -93.2233 |
set_index()
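As a minimal sketch of what set_index() does, the example below rebuilds three rows of the table above by hand (rather than downloading the CSV) and moves the Name column into the row index. The variable name df_named is only illustrative.

```python
import pandas as pd

# Three rows copied from the table above (not the full dataset)
df = pd.DataFrame({
    "Name": ["St. Helens", "Pinatubo", "El Chichón"],
    "Country": ["USA", "Philippines", "Mexico"],
    "VEI": [5, 6, 5],
})

# set_index returns a new DataFrame; the original keeps its default 0, 1, 2 index
df_named = df.set_index("Name")

print(df_named.index.tolist())    # row labels are now the volcano names
print(df_named.columns.tolist())  # 'Name' is no longer a regular column
```

Because set_index() returns a new object by default, the result has to be assigned (here to df_named, or back to df) for the change to persist.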
| Function | Description |
|---|---|
| df.head() | Returns the first 5 rows of the DataFrame (by default). |
| df.tail() | Returns the last 5 rows of the DataFrame (by default). |
| df.info() | Prints a summary of the DataFrame, including the number of rows (entries), the columns and their data types. |
| df.shape | Returns a tuple containing the number of rows and columns of the DataFrame. |
| df.index | Returns the Index object holding the row labels of the DataFrame. |
| df.columns | Returns the Index object holding the column labels of the DataFrame. |
Functions vs attributes
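The distinction can be illustrated with a short sketch on a hand-built three-row DataFrame (values taken from the table of eruptions above). Functions (methods) are called with parentheses and may take arguments, whereas attributes are read without parentheses.

```python
import pandas as pd

df = pd.DataFrame(
    {"Country": ["USA", "Philippines", "Mexico"], "VEI": [5, 6, 5]},
    index=pd.Index(["St. Helens", "Pinatubo", "El Chichón"], name="Name"),
)

# Methods (functions): parentheses are required, arguments are optional
first_two = df.head(2)

# Attributes: no parentheses
n_rows, n_cols = df.shape        # shape is a tuple (rows, columns)
column_names = list(df.columns)  # columns is an Index; list() makes a plain list

print(first_two)
print(n_rows, n_cols)
print(column_names)
```

Writing df.shape() raises a TypeError because a tuple is not callable, while writing df.head without parentheses returns the method itself instead of the rows.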
df.sort_values
df.sort_values('VEI').head() # Sort volcanoes by VEI in ascending order

| Name | Country | Date | VEI | Latitude | Longitude |
|---|---|---|---|---|---|
| Nyiragongo | DR Congo | 2021-05-22 | 1 | -1.5200 | 29.2500 |
| Ontake | Japan | 2014-09-27 | 2 | 35.5149 | 137.4781 |
| Etna | Italy | 2021-03-16 | 2 | 37.7510 | 15.0044 |
| Merapi | Indonesia | 2023-12-03 | 2 | -7.5407 | 110.4457 |
| Kīlauea | USA | 2018-05-03 | 2 | 19.4194 | -155.2811 |
df.sort_values('Date', ascending=False).head() # Sort volcanoes by eruption dates from recent to old

| Name | Country | Date | VEI | Latitude | Longitude |
|---|---|---|---|---|---|
| Merapi | Indonesia | 2023-12-03 | 2 | -7.5407 | 110.4457 |
| Cleveland | USA | 2023-05-23 | 3 | 52.8250 | -169.9444 |
| Sinabung | Indonesia | 2023-02-13 | 3 | 3.1719 | 98.3925 |
| Nyiragongo | DR Congo | 2021-05-22 | 1 | -1.5200 | 29.2500 |
| La Soufrière | Saint Vincent | 2021-04-09 | 4 | 13.2833 | -61.3875 |
df.sort_values('Country').head() # Also works on strings to sort alphabetically

| Name | Country | Date | VEI | Latitude | Longitude |
|---|---|---|---|---|---|
| Calbuco | Chile | 2015-04-22 | 4 | -41.2972 | -72.6097 |
| Nevado del Ruiz | Colombia | 1985-11-13 | 3 | 4.8950 | -75.3220 |
| Nyiragongo | DR Congo | 2021-05-22 | 1 | -1.5200 | 29.2500 |
| Eyjafjallajökull | Iceland | 2010-04-14 | 4 | 63.6333 | -19.6111 |
| Galunggung | Indonesia | 1982-04-05 | 4 | -7.2567 | 108.0771 |
df.sort_values(['Latitude', 'Longitude']).head() # Sorting using multiple columns

| Name | Country | Date | VEI | Latitude | Longitude |
|---|---|---|---|---|---|
| Calbuco | Chile | 2015-04-22 | 4 | -41.2972 | -72.6097 |
| Agung | Indonesia | 2017-11-21 | 3 | -8.3422 | 115.5083 |
| Merapi | Indonesia | 2023-12-03 | 2 | -7.5407 | 110.4457 |
| Galunggung | Indonesia | 1982-04-05 | 4 | -7.2567 | 108.0771 |
| Tavurvur | Papua New Guinea | 2014-08-29 | 3 | -4.3494 | 152.2847 |
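When sorting on several columns, each column can be given its own direction by passing a list to ascending. The sketch below uses four rows taken from the tables above and sorts by decreasing VEI, breaking ties with the oldest eruption first. The variable name by_vei_then_date is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "VEI": [2, 4, 1, 4],
        "Date": pd.to_datetime(
            ["2023-12-03", "2015-04-22", "2021-05-22", "1982-04-05"]
        ),
    },
    index=pd.Index(["Merapi", "Calbuco", "Nyiragongo", "Galunggung"], name="Name"),
)

# Largest VEI first; among equal VEI, oldest date first
by_vei_then_date = df.sort_values(["VEI", "Date"], ascending=[False, True])

print(by_vei_then_date)
print(df.index.tolist())  # the original DataFrame is left in its initial order
```

sort_values returns a sorted copy, so the result must be assigned to a variable if it is needed later.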
Option 1: label-based indexing
df.loc
Option 2: position-based indexing
df.iloc
.loc → Use square brackets [ ]
.loc to query a row: Calbuco
.loc to query several rows: Calbuco or Taal
.loc to query columns: Country or VEI
.iloc
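The two indexing options can be sketched on a small hand-built DataFrame that uses Name as the index (three rows taken from the dataset shown in these slides). The variable names on the left-hand side are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Country": ["Philippines", "Saint Vincent", "Chile"],
        "VEI": [4, 4, 4],
        "Latitude": [14.0020, 13.2833, -41.2972],
    },
    index=pd.Index(["Taal", "La Soufrière", "Calbuco"], name="Name"),
)

# Option 1: label-based indexing with .loc
one_row = df.loc["Calbuco"]               # single label -> Series
two_rows = df.loc[["Calbuco", "Taal"]]    # list of labels -> DataFrame
two_cols = df.loc[:, ["Country", "VEI"]]  # all rows, selected columns
cell = df.loc["Calbuco", "Country"]       # one row label and one column label

# Option 2: position-based indexing with .iloc (positions start at 0)
first_row = df.iloc[0]    # first row
last_two = df.iloc[-2:]   # last two rows
top_left = df.iloc[0, 0]  # first row, first column

print(cell, top_left)
```

Both accessors use square brackets rather than parentheses, and a colon selects everything along that axis.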
| Name | Country | Date | VEI | Latitude | Longitude |
|---|---|---|---|---|---|
| La Soufrière | Saint Vincent | 2021-04-09 | 4 | 13.2833 | -61.3875 |
| Calbuco | Chile | 2015-04-22 | 4 | -41.2972 | -72.6097 |
| St. Augustine | USA | 2006-03-27 | 3 | 57.8819 | -155.5611 |
| Eyjafjallajökull | Iceland | 2010-04-14 | 4 | 63.6333 | -19.6111 |
| Cleveland | USA | 2023-05-23 | 3 | 52.8250 | -169.9444 |
A condition such as VEI == 4 evaluates to True or False for each row, depending on whether it is satisfied
Name
St. Helens False
Pinatubo False
El Chichón False
Galunggung True
Nevado del Ruiz False
Merapi False
Ontake False
Soufrière Hills False
Etna False
Nyiragongo False
Kīlauea False
Agung False
Tavurvur False
Sinabung False
Taal True
La Soufrière True
Calbuco True
St. Augustine False
Eyjafjallajökull True
Cleveland False
Name: VEI, dtype: bool
| Name | Country | Date | VEI | Latitude | Longitude |
|---|---|---|---|---|---|
| Galunggung | Indonesia | 1982-04-05 | 4 | -7.2567 | 108.0771 |
| Taal | Philippines | 2020-01-12 | 4 | 14.0020 | 120.9934 |
| La Soufrière | Saint Vincent | 2021-04-09 | 4 | 13.2833 | -61.3875 |
| Calbuco | Chile | 2015-04-22 | 4 | -41.2972 | -72.6097 |
| Eyjafjallajökull | Iceland | 2010-04-14 | 4 | 63.6333 | -19.6111 |
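The two outputs above (a Boolean Series, then the filtered rows) can be reproduced with a sketch like the following, which uses four rows from the dataset. The names mask, vei4 and vei4_indonesia are illustrative, and the combined condition is an extra example not shown in the slides.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Country": ["Indonesia", "Indonesia", "Philippines", "Japan"],
        "VEI": [4, 2, 4, 2],
    },
    index=pd.Index(["Galunggung", "Merapi", "Taal", "Ontake"], name="Name"),
)

# Step 1: the comparison returns a Boolean Series (True where VEI equals 4)
mask = df["VEI"] == 4

# Step 2: passing the mask to .loc keeps only the rows where it is True
vei4 = df.loc[mask]

# Conditions can be combined with & (and) or | (or);
# each condition needs its own parentheses
vei4_indonesia = df.loc[(df["VEI"] == 4) & (df["Country"] == "Indonesia")]

print(mask)
print(vei4)
```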
| Operation | Example | Description |
|---|---|---|
| contains | df['Name'].str.contains('Soufrière') | Checks if each string contains a substring |
| startswith | df['Name'].str.startswith('E') | Checks if each string starts with a substring |
| endswith | df['Name'].str.endswith('o') | Checks if each string ends with a substring |
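Each of these string operations returns a Boolean Series, so it can be used directly as a filter. The sketch below assumes Name is a regular column (as in the table's examples) and uses five volcano names from the dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["La Soufrière", "Soufrière Hills", "Etna", "Eyjafjallajökull", "Pinatubo"],
})

# Each .str method returns one True/False value per row
has_soufriere = df["Name"].str.contains("Soufrière")
starts_with_e = df["Name"].str.startswith("E")
ends_with_o = df["Name"].str.endswith("o")

# The Boolean Series can then be used to filter the rows
print(df.loc[has_soufriere, "Name"].tolist())
print(df.loc[starts_with_e, "Name"].tolist())
print(df.loc[ends_with_o, "Name"].tolist())
```

By default str.contains interprets its argument as a regular expression; passing regex=False treats it as plain text. If Name has been moved to the index, the same methods are available through df.index.str.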
| Operation | Example | Description |
|---|---|---|
| Round | df['VEI'].round(1) | Rounds values to the specified number of decimals |
| Floor | df['VEI'].apply(np.floor) | Rounds values down to the nearest integer |
| Ceil | df['VEI'].apply(np.ceil) | Rounds values up to the nearest integer |
| Absolute value | df['VEI'].abs() | Returns the absolute value of each element |
| Fill missing | df['VEI'].fillna(0) | Replaces missing values with a specified value |
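These operations are easier to see on non-integer values, so the sketch below applies them to latitudes instead of VEI. Two latitudes are taken from the dataset; the third entry, labelled "Unknown", is a hypothetical missing value added only to illustrate fillna.

```python
import numpy as np
import pandas as pd

lat = pd.Series(
    [46.1914, -7.2567, np.nan],
    index=["St. Helens", "Galunggung", "Unknown"],  # "Unknown" is hypothetical
    name="Latitude",
)

rounded = lat.round(1)         # one decimal
floored = lat.apply(np.floor)  # down to the nearest integer (-7.2567 -> -8.0)
ceiled = lat.apply(np.ceil)    # up to the nearest integer (-7.2567 -> -7.0)
absolute = lat.abs()           # sign removed
filled = lat.fillna(0)         # missing value replaced by 0

print(pd.DataFrame({"round": rounded, "floor": floored, "ceil": ceiled,
                    "abs": absolute, "fillna": filled}))
```

None of these calls modifies lat; each returns a new Series, and missing values stay missing unless fillna is used.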
| Operation | Symbol | Example | Description |
|---|---|---|---|
| Addition | + | df['VEI'] + 1 | Adds a value to each element |
| Subtraction | - | df['VEI'] - 1 | Subtracts a value from each element |
| Multiplication | * | df['VEI'] * 2 | Multiplies each element by a value |
| Division | / | df['VEI'] / 2 | Divides each element by a value |
| Exponentiation | ** | df['VEI'] ** 2 | Raises each element to a power |
| Modulo | % | df['VEI'] % 2 | Remainder after division for each element |
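The operators in the table act on every element at once, without an explicit loop. A minimal sketch on three VEI values from the dataset follows; the column name VEI_squared is hypothetical and only shows how a result can be stored.

```python
import pandas as pd

df = pd.DataFrame(
    {"VEI": [5, 6, 4]},
    index=pd.Index(["St. Helens", "Pinatubo", "Galunggung"], name="Name"),
)

plus_one = df["VEI"] + 1  # 6, 7, 5
half = df["VEI"] / 2      # 2.5, 3.0, 2.0 (division returns floats)
squared = df["VEI"] ** 2  # 25, 36, 16
parity = df["VEI"] % 2    # 1, 0, 0

# Results are new Series; assign one to a column to keep it in the DataFrame
df["VEI_squared"] = squared  # hypothetical column name

print(df)
```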
numpy

| Operation | Function | Example | Description |
|---|---|---|---|
| Exponentiation | np.power | np.power(df['VEI'], 2) | Element-wise exponentiation |
| Square root | np.sqrt | np.sqrt(df['VEI']) | Element-wise square root |
| Logarithm (base e) | np.log | np.log(df['VEI']) | Element-wise natural logarithm |
| Logarithm (base 10) | np.log10 | np.log10(df['VEI']) | Element-wise base-10 logarithm |
| Exponential | np.exp | np.exp(df['VEI']) | Element-wise exponential (e^x) |
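numpy functions can be applied directly to a pandas Series, and the result is again a Series with the same index. The sketch below uses two VEI values from the dataset, chosen so that the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

vei = pd.Series([1, 4], index=["Nyiragongo", "Galunggung"], name="VEI")

powered = np.power(vei, 2)  # 1, 16
root = np.sqrt(vei)         # 1.0, 2.0
ln = np.log(vei)            # natural logarithm; log(1) is 0
log10 = np.log10(vei)       # base-10 logarithm; log10(1) is 0
expo = np.exp(vei)          # e**1 and e**4

print(root)
print(type(root))  # still a pandas Series, the volcano names are preserved
```

Note that np.log and np.log10 return -inf (with a runtime warning) for a value of 0, which matters if a dataset contains eruptions with VEI 0.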