Overview

We will start our data science journey by learning a bit about the most useful Python library for this class: Pandas. As a reminder, a library is a set of tools loaded on top of Python that provides new functionality for a specific problem or type of analysis. Pandas provides functions for data manipulation and analysis: it handles structured data such as tables and time series, and it simplifies many tasks you might encounter as a scientist. These include:

Reading/writing data from various commonly-used formats (CSV, Excel, SQL, JSON, etc.)
Handling missing data
Filtering, sorting, reshaping and grouping data
Aggregating data (sum, mean, count, etc.)
Time series support (date ranges, frequency conversions)
Statistical operations
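
To make a few of these tasks concrete, here is a minimal sketch. The CSV content, the column names (site, date, temperature) and the variable names are made up for illustration; the text is inlined so the example runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content (assumption: not real course data).
# The second row has an empty temperature field, i.e. a missing value.
csv_text = """site,date,temperature
A,2024-01-01,3.5
A,2024-01-02,
B,2024-01-01,7.0
B,2024-01-02,9.0
"""

# Reading data: parse the text as CSV and convert 'date' to datetimes.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Handling missing data: count the missing values, then drop incomplete rows.
n_missing = int(df["temperature"].isna().sum())
clean = df.dropna(subset=["temperature"])

# Grouping and aggregating: mean temperature per site.
mean_by_site = clean.groupby("site")["temperature"].mean()

# Sorting: warmest measurement first.
warmest_first = clean.sort_values("temperature", ascending=False)

print(n_missing)
print(mean_by_site)
print(warmest_first)
```

With a real file, the same call would typically take a path instead of the `io.StringIO` wrapper, for example `pd.read_csv("my_data.csv")`, where the filename is a placeholder.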

Today’s objectives

The goal of this class is by no means to make you an expert in Pandas or data science. Rather, it is to walk you through the most basic manipulations so that you gain the confidence to keep exploring scientific coding and to include it in your research pipeline. This module covers:

What a Pandas DataFrame is and its basic anatomy
How to load data into a DataFrame
How to access data (e.g., query by label/position)
How to filter data (e.g., comparison and logical operators)
How to rearrange data (e.g., sorting values)
How to operate on data (e.g., arithmetic and string operations)
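
As a rough preview of these topics, the sketch below builds a small DataFrame by hand and applies one operation per objective. The tree measurements, row labels and column names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical measurements (assumption: invented values, not course data).
df = pd.DataFrame(
    {
        "species": ["oak", "pine", "birch"],
        "height_m": [12.0, 20.5, 8.2],
        "age_yr": [40, 55, 15],
    },
    index=["t1", "t2", "t3"],  # row labels
)

# Anatomy: a DataFrame has a shape, column names and row labels (the index).
n_rows, n_cols = df.shape
column_names = list(df.columns)
row_labels = list(df.index)

# Access: .loc queries by label, .iloc queries by integer position.
by_label = df.loc["t2", "height_m"]
by_position = df.iloc[0, 0]

# Filter: combine comparisons with logical operators (& for "and", | for "or").
tall_and_old = df[(df["height_m"] > 10) & (df["age_yr"] > 50)]

# Rearrange: sort rows by the values of a column (ascending by default).
by_height = df.sort_values("height_m")

# Operate: arithmetic between columns, and string methods through .str.
df["growth_m_per_yr"] = df["height_m"] / df["age_yr"]
df["species_upper"] = df["species"].str.upper()

print(df)
```

Each comparison inside the filter is wrapped in parentheses because `&` binds more tightly than `>` in Python.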

We will start by reviewing the data structure behind Pandas, then move on to a few coding exercises to familiarize you with some basic functionality.
