Overview
We will start our data science journey by learning a bit about the most useful Python library for this class: Pandas. As a reminder, a library is a set of tools we load on top of Python that adds new functionality for a specific problem or type of analysis. Here, Pandas provides functions for data manipulation and analysis. It handles structured data such as tables or time series and simplifies many tasks you might encounter as a scientist. These include:
- Reading/writing data from various commonly-used formats (CSV, Excel, SQL, JSON, etc.)
- Handling missing data
- Filtering, sorting, reshaping and grouping data
- Aggregating data (sum, mean, count, etc.)
- Time series support (date ranges, frequency conversions)
- Statistical operations
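To make a few of the items above concrete, here is a minimal sketch that reads a small CSV, handles a missing value, filters and sorts rows, and aggregates by group. The data (sites, dates, temperatures) and the variable names are made up for illustration only, and the CSV is read from an in-memory string rather than a real file.

```python
import io

import pandas as pd

# Hypothetical measurements, invented purely for illustration.
csv_text = """site,date,temperature
A,2024-01-01,12.5
A,2024-01-02,
B,2024-01-01,9.0
B,2024-01-02,11.0
"""

# Read CSV data (here from an in-memory string rather than a file on disk).
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Handle missing data: count the gaps, then drop the incomplete rows.
n_missing = int(df["temperature"].isna().sum())
clean = df.dropna(subset=["temperature"])

# Filter and sort: keep readings above 10 degrees, warmest first.
warm = clean[clean["temperature"] > 10].sort_values("temperature", ascending=False)

# Group and aggregate: mean temperature per site.
site_means = clean.groupby("site")["temperature"].mean()

print(n_missing)
print(warm)
print(site_means)
```

Dropping incomplete rows is only one way to treat missing values; depending on the analysis, filling them in (for example with `fillna`) may be more appropriate.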
Today’s objectives
The objective of this class is by no means to make you an expert in Pandas and data science. Rather, the objective is to take you through the most basic manipulations so that you build the confidence to keep exploring scientific coding and to integrate it into your research pipeline. The objectives of this module are to review:
We will start by reviewing the data structure behind Pandas, then move on to a few coding exercises to familiarize you with some basic functionality.
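As a preview of the data structures in question, the sketch below builds the two core Pandas objects: a `Series` (a one-dimensional labelled array) and a `DataFrame` (a two-dimensional table). It assumes these are the structures the review refers to; the names and values are hypothetical and serve only as an illustration.

```python
import pandas as pd

# A Series is a one-dimensional array whose values carry index labels.
heights = pd.Series([1.62, 1.75, 1.80], index=["ana", "ben", "cho"], name="height_m")

# A DataFrame is a two-dimensional table; each column is a Series
# and all columns share the same row index.
people = pd.DataFrame(
    {"height_m": [1.62, 1.75, 1.80], "age": [34, 28, 41]},
    index=["ana", "ben", "cho"],
)

print(heights["ben"])        # look up one value by its label
print(people.shape)          # (number of rows, number of columns)
print(type(people["age"]))   # selecting one column gives back a Series
```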