Merge the Values of Two DataFrames
Understand how to merge pandas DataFrames.
Try it yourself
Try executing the code below to see the result.
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Clark Kent', 'Diana Prince', 'Bruce Wayne'],
})
df2 = pd.DataFrame({
    'id': [2, 1, 4],
    'hero': ['Wonder Woman', 'Superman', 'Aquaman'],
})
df = pd.merge(df1, df2, on='id')
print(df)
Explanation
The pandas merge function takes two pandas.DataFrame objects, a left one and a right one, plus an optional column (or list of columns) to merge on. If no column is provided, pandas merges on the columns that the two frames have in common.
The question is, what happens when one merge column has values that the other doesn’t? This old question is rooted in relational databases and their join operators. There are several types of joins, and each defines a different behavior. The pandas merge function mimics these operators as well.
Looking at the pandas.merge documentation, we’ll see a how parameter, like this:
how {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
Here are short definitions of the types of merges that can be performed.
- The left merge uses only keys from the left frame, similar to a SQL left outer join. This merge preserves key order.
- The right merge uses only keys from the right frame, similar to a SQL right outer join. This merge also preserves key order.
- The outer merge uses a union of keys from both frames, similar to a SQL full outer join. This merge sorts keys lexicographically.
- The inner merge uses the intersection of keys from both frames, similar to a SQL inner join. This merge preserves the order of the left keys.
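The four definitions above can be checked against the teaser's frames. The following is a minimal sketch that runs the same merge once per how value and collects the resulting id column; the name keys_by_how is only an illustrative helper, not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Clark Kent', 'Diana Prince', 'Bruce Wayne'],
})
df2 = pd.DataFrame({
    'id': [2, 1, 4],
    'hero': ['Wonder Woman', 'Superman', 'Aquaman'],
})

# Run the same merge with each merge type and keep only the key column.
# keys_by_how is a hypothetical helper name used for this illustration.
keys_by_how = {
    how: pd.merge(df1, df2, on='id', how=how)['id'].tolist()
    for how in ('left', 'right', 'outer', 'inner')
}

for how, keys in keys_by_how.items():
    print(f'{how:>5}: {keys}')
```

The left and right results carry the keys of one frame only, the outer result carries the union, and the inner result carries only the keys shared by both frames.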
The default merge type is inner, which means only rows that have keys in both the left and right frames are included in the result. The merge function orders the rows by the order of the keys in the left frame. The teaser’s output shows this behavior. We see only Superman and Wonder Woman in the output, because only their keys (1 and 2) appear in both frames. The rows follow the key order of the first frame.
If we switch the order of frames passed to merge, we’ll see a different ordering:
In [1]: pd.merge(df2, df1)
Out[1]:
   id          hero          name
0   2  Wonder Woman  Diana Prince
1   1      Superman    Clark Kent
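The call above omits the on argument. In that case, pandas.merge joins on the columns that appear in both frames. The sketch below checks this on the teaser's frames by comparing the implicit call with one that spells out the shared columns; the variable names shared, implicit, and explicit are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Clark Kent', 'Diana Prince', 'Bruce Wayne'],
})
df2 = pd.DataFrame({
    'id': [2, 1, 4],
    'hero': ['Wonder Woman', 'Superman', 'Aquaman'],
})

# Columns present in both frames; for these frames that is just 'id'.
shared = [col for col in df2.columns if col in df1.columns]

implicit = pd.merge(df2, df1)             # no `on` given
explicit = pd.merge(df2, df1, on=shared)  # join columns spelled out

print(shared)
print(implicit.equals(explicit))
```

Passing on explicitly is usually the safer choice, since two frames may share a column name that is not meant to be a key.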
If we want to include all rows, we need to use an outer merge. pandas fills the missing values with NaN:
In [2]: pd.merge(df1, df2, on='id', how='outer')
Out[2]:
   id          name          hero
0   1    Clark Kent      Superman
1   2  Diana Prince  Wonder Woman
2   3   Bruce Wayne           NaN
3   4           NaN       Aquaman
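To see which frame each row of an outer merge came from, one option is the indicator parameter of pandas.merge, which is not used in the lesson itself and is shown here only as a possible aid. It adds a _merge column whose values are left_only, right_only, or both. A minimal sketch on the teaser's frames:

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Clark Kent', 'Diana Prince', 'Bruce Wayne'],
})
df2 = pd.DataFrame({
    'id': [2, 1, 4],
    'hero': ['Wonder Woman', 'Superman', 'Aquaman'],
})

# indicator=True adds a '_merge' column describing the origin of each row.
merged = pd.merge(df1, df2, on='id', how='outer', indicator=True)
print(merged)

# Rows whose key was found in only one frame; these are the rows with NaN.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched['id'].tolist())
```

The unmatched rows are exactly the ones where pandas had to fill in NaN for the columns of the other frame.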
The pandas merge function is very powerful and lets us connect different frames.
A common case from data marts is a star schema. In a star schema, we have one main frame, called the fact table, that holds the data. We also have many other frames that provide auxiliary data. For example, the main frame will have sale events that include the customer ID. If we want to group by customer age, we first need to merge the main frame with the customers frame, which includes a customer age for every customer ID. In this case, we’ll use a left join, which keeps every row of the main frame.
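The star schema case can be sketched as follows. The frames, column names (customer_id, amount, age), and values are all made up for illustration; only the left merge followed by a group-by comes from the description above.

```python
import pandas as pd

# Hypothetical fact frame: one row per sale event.
sales = pd.DataFrame({
    'customer_id': [101, 102, 101, 103, 104],
    'amount': [20.0, 35.5, 12.5, 50.0, 8.0],
})

# Hypothetical auxiliary frame: one row per customer.
# Customer 104 is deliberately missing.
customers = pd.DataFrame({
    'customer_id': [101, 102, 103],
    'age': [34, 27, 34],
})

# The left merge keeps every sale, even when the customer is unknown;
# the age for customer 104 becomes NaN.
enriched = pd.merge(sales, customers, on='customer_id', how='left')

# groupby drops NaN group keys by default, so the sale by the
# unknown customer is left out of the totals.
total_by_age = enriched.groupby('age')['amount'].sum()
print(total_by_age)
```

An inner merge would silently drop the sale by customer 104 before we ever saw it, while the left merge keeps it visible in enriched so the gap in the customers frame can be noticed.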