Merge the Values of Two DataFrames

Understand how to merge pandas DataFrames.

Try it yourself

Try executing the code below to see the result.

import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Clark Kent', 'Diana Prince', 'Bruce Wayne'],
})
df2 = pd.DataFrame({
    'id': [2, 1, 4],
    'hero': ['Wonder Woman', 'Superman', 'Aquaman'],
})
df = pd.merge(df1, df2, on='id')
print(df)

Explanation

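The pandas.merge function takes two DataFrames, left and right, plus an optional on argument naming the column (or columns) to merge on. If on is not provided, and we don't ask for an index merge with left_index or right_index, pandas merges on the columns that the two DataFrames have in common.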
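To check which key pandas picks when on is omitted, here is a minimal sketch that reuses the frames from the teaser. The variable names implicit, explicit, same_result, and by_index are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Clark Kent', 'Diana Prince', 'Bruce Wayne'],
})
df2 = pd.DataFrame({
    'id': [2, 1, 4],
    'hero': ['Wonder Woman', 'Superman', 'Aquaman'],
})

# No `on`: pandas joins on the columns both frames share (here, 'id').
implicit = pd.merge(df1, df2)
explicit = pd.merge(df1, df2, on='id')
same_result = implicit.equals(explicit)
print(same_result)

# Merging on the index has to be requested explicitly. Because both
# frames have an 'id' column, pandas disambiguates it with suffixes.
by_index = pd.merge(df1, df2, left_index=True, right_index=True)
print(by_index.columns.tolist())
```

The index merge pairs rows by position (0, 1, 2) rather than by id, which is rarely what we want for these frames. That is why the teaser passes on='id'.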

The question is, what happens when one merge column has values that the other doesn’t? This old question is rooted in relational databases and their join operator. There are several types of joins, and each defines a different behavior. The pandas merge function mimics these join types. Looking at the pandas.merge documentation, we’ll see a how parameter, like this:

how {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’

Here are short definitions of the types of merges that can be performed.

  • The left merge uses only keys from the left frame, similar to a SQL left outer join. This merge preserves key order.
  • The right merge uses only keys from the right frame, similar to a SQL right outer join. This merge also preserves key order.
  • The outer merge uses a union of keys from both frames, similar to a SQL full outer join. This merge sorts keys lexicographically.
  • The inner merge uses the intersection of keys from both frames, similar to a SQL inner join. This merge preserves the order of the left keys.
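The four merge types above can be compared side by side on the teaser's frames. This is a small sketch, not part of the original lesson code. It collects the id column of each result so that both the keys kept and their order are easy to see. The name ids_by_how is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Clark Kent', 'Diana Prince', 'Bruce Wayne'],
})
df2 = pd.DataFrame({
    'id': [2, 1, 4],
    'hero': ['Wonder Woman', 'Superman', 'Aquaman'],
})

# Run the same merge with each `how` value and keep only the key column.
ids_by_how = {
    how: pd.merge(df1, df2, on='id', how=how)['id'].tolist()
    for how in ('left', 'right', 'outer', 'inner')
}

for how, ids in ids_by_how.items():
    print(f'{how:>5}: {ids}')
```

The left and right results follow the key order of df1 and df2 respectively, the outer result contains every key from either frame, and the inner result contains only the keys found in both.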

The default merge type is inner, which means only rows that have keys in both the left and right frames are included in the result. An inner merge keeps the rows in the order in which their keys appear in the left frame. The teaser’s output shows this behavior.

We see only Superman and Wonder Woman in the output, since they are the only rows whose keys appear in both frames. The rows follow the key order of the first frame.

If we switch the order of frames passed to merge, we’ll see a different ordering:

In [1]: pd.merge(df2, df1)
Out[1]:
   id          hero          name
0   2  Wonder Woman  Diana Prince
1   1      Superman    Clark Kent

If we want to include all rows from both frames, we need to use an outer merge. Pandas fills the missing values with NaN:

In [2]: pd.merge(df1, df2, on='id', how='outer')
Out[2]:
   id          name          hero
0   1    Clark Kent      Superman
1   2  Diana Prince  Wonder Woman
2   3   Bruce Wayne           NaN
3   4           NaN       Aquaman
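When an outer merge produces NaN values, it can help to see which frame each row came from. One option, not used in the lesson itself, is the indicator parameter of pandas.merge, which adds a _merge column to the result. The sketch below assumes the teaser's frames. The names merged, origin, and unmatched_ids are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Clark Kent', 'Diana Prince', 'Bruce Wayne'],
})
df2 = pd.DataFrame({
    'id': [2, 1, 4],
    'hero': ['Wonder Woman', 'Superman', 'Aquaman'],
})

# indicator=True adds a `_merge` column with one of
# 'left_only', 'right_only', or 'both' for every row.
merged = pd.merge(df1, df2, on='id', how='outer', indicator=True)
origin = dict(zip(merged['id'].tolist(), merged['_merge'].astype(str)))
print(origin)

# Rows found in only one frame are the ones that hold NaN.
unmatched_ids = merged.loc[merged['_merge'] != 'both', 'id'].tolist()
print(unmatched_ids)
```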

The pandas merge is very powerful and will let us connect different frames.

A common case from data marts is a star schema. In a star schema, we have one main frame, called the fact table, that holds the event data. We also have several other frames that provide auxiliary data. For example, the main frame holds sale events, and each event includes a customer ID. If we want to group sales by customer age, we first need to merge the main frame with the customers frame, which holds the age for every customer ID. In this case, we’ll use a left join, so every sale is kept even if its customer is missing from the customers frame.
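The star schema case can be sketched as follows. The sales and customers frames, their column names, and the values in them are hypothetical and made up for illustration only.

```python
import pandas as pd

# Hypothetical fact frame: one row per sale event.
sales = pd.DataFrame({
    'customer_id': [1, 2, 1, 3, 5],
    'amount': [10.0, 20.0, 5.0, 7.5, 12.0],
})
# Hypothetical auxiliary frame: one row per known customer.
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'age': [34, 27, 34],
})

# how='left' keeps every sale, including the one from customer 5,
# who has no row in `customers` and therefore gets NaN for age.
enriched = pd.merge(sales, customers, on='customer_id', how='left')

# groupby drops NaN keys by default; dropna=False keeps the
# sales with an unknown age as their own group.
by_age = enriched.groupby('age', dropna=False)['amount'].sum()
print(by_age)
```

With an inner merge, the sale from customer 5 would silently disappear from the totals, which is the main reason to prefer a left join here.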
