Before We Start

Learn why you should take this course and what will be covered.

What is this course about?

Since the early 2000s, Python has been used more and more for data science work: gathering data, cleaning data, analysis, machine learning, and visualization. The pandas library has been widely adopted in this area.

According to its official website, “pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures, and data analysis tools for the Python programming language.”

Is this course for you?

This course is intended to introduce pandas and explore various patterns and techniques that adhere to best practices for maximizing its potential. If you work with tabular data and need capabilities beyond Excel, this course is for you. This course covers many (but not all) aspects of the library as well as some gotchas or details that may be counterintuitive or even non-Pythonic to longtime users of Python.

Note: This course assumes a basic knowledge of Python.

What is pandas?

pandas is an in-memory analysis tool. It has SQL-like constructs, essential statistical and analytic support, and graphing capability. Because pandas is built on top of Cython and NumPy, it generally has less memory overhead and runs faster than equivalent pure Python code.
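As a rough illustration of the SQL-like constructs and statistical support mentioned above, here is a minimal sketch. The `sales` table, its column names, and its values are all invented for this example; only the pandas calls themselves are real.

```python
import pandas as pd

# Hypothetical sales data, invented purely for illustration.
sales = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west", "east"],
        "units": [10, 4, 7, 12, 3],
        "price": [2.5, 3.0, 2.5, 3.0, 2.5],
    }
)

# Roughly: SELECT * FROM sales WHERE units > 5
big_orders = sales[sales["units"] > 5]

# Roughly: SELECT region, SUM(units) FROM sales GROUP BY region
units_by_region = sales.groupby("region")["units"].sum()

# Basic statistics on a single column.
mean_units = sales["units"].mean()
summary = sales["units"].describe()  # count, mean, std, min, quartiles, max

print(big_orders)
print(units_by_region)
print(summary)
```

All of the data lives in memory as a DataFrame, so the filter, the grouping, and the statistics each take a single expression.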

Many people use pandas to replace Excel, perform ETL (extract, transform, load: processing that moves data from one place to another), process tabular data, load CSV or JSON files, prep for machine learning, and more. Though it grew out of the financial sector (for time series analysis), it is now a general-purpose data manipulation library.
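To make the ETL idea concrete, the following sketch reads a small CSV, adds a computed column, and writes the result out as JSON. The CSV content and the column names (`item`, `qty`, `unit_price`, `total`) are hypothetical, and an in-memory string stands in for a real file.

```python
import io
import json

import pandas as pd

# Extract: an in-memory string stands in for a CSV file on disk (made-up data).
raw_csv = io.StringIO("item,qty,unit_price\nwidget,3,2.50\ngadget,1,10.00\n")
orders = pd.read_csv(raw_csv)

# Transform: derive a new column from two existing ones.
orders["total"] = orders["qty"] * orders["unit_price"]

# Load: serialize to JSON, one object per row.
as_json = orders.to_json(orient="records")

# Parse the JSON back only to show what was produced.
records = json.loads(as_json)
print(records)
```

In a real pipeline, the source and destination would more likely be files, databases, or an API, but the three steps keep the same shape.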

With its NumPy lineage, pandas adopts some NumPy-isms that regular Python programmers may not be aware of or familiar with. Yes, one could go out and use Cython to perform fast typed data analysis with a Python-like dialect, but with pandas, we don’t need to. This work is done for us. If we use pandas and the vectorized operations, we are getting close to C-level speeds for numeric work while writing Python.
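A small sketch of what "vectorized" means in practice, along with one NumPy-ism that often surprises Python programmers. The `prices` and `quantities` values are made up for illustration.

```python
import pandas as pd

# Made-up example data.
prices = pd.Series([2.5, 3.0, 4.0, 1.5])
quantities = pd.Series([10, 4, 7, 12])

# Pure-Python style: an explicit loop over individual elements.
loop_totals = [p * q for p, q in zip(prices, quantities)]

# Vectorized style: one expression over whole columns; the element-wise
# loop runs in compiled code rather than in the Python interpreter.
vector_totals = prices * quantities

# A NumPy-ism: element-wise boolean logic uses & and |, not `and`/`or`,
# and each comparison needs its own parentheses.
mask = (vector_totals > 15) & (quantities < 10)

print(vector_totals.tolist())
print(mask.tolist())
```

Both approaches produce the same totals here. On data this small the speed difference is negligible; the gap usually shows up only as the number of rows grows.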