Guide
Merging and Joining Excel DataFrames with Pandas
Merge and join Excel DataFrames with pandas: normalize keys, pick the right join type, validate cardinality, track matches with indicator, and fix dtype and duplicate-key bugs.
When data lives in separate workbooks — sales in one, inventory in another — pandas.merge() joins them on a shared key the way a SQL join does. The hard parts are the keys: mismatched dtypes produce empty joins, duplicate keys multiply rows, and outer joins inject NaN. This guide covers the patterns that survive a scheduled run. Every block is runnable and shares one namespace, so paste them in order.
This is one stage of Advanced Data Transformation and Cleaning. For sanitizing headers and types first, see Cleaning Excel Data with Pandas.
Create sample workbooks
These examples need two source files, so create both up front — a sales table and an inventory lookup keyed on sku:
import pandas as pd
sales = pd.DataFrame({
"sku": ["a-100", "b-200 ", "C-300", "a-100"],
"region": ["North", "South", "West", "East"],
"units": [10, 5, 8, 3],
})
sales.to_excel("sales.xlsx", index=False)
inventory = pd.DataFrame({
"sku": ["A-100", "B-200", "C-300"],
"warehouse": ["W1", "W2", "W3"],
"stock_level": [200, 150, 90],
})
inventory.to_excel("inventory.xlsx", index=False)
The merge pipeline
A reliable merge follows the same order every time:
- Load each workbook into its own DataFrame.
- Normalize keys — strip whitespace, unify casing, match dtypes.
- Choose the join —
inner,left,right, orouter. - Validate cardinality with the
validateargument. - Audit row counts and unmatched keys.
- Export the result.
Pattern 1: Left join to enrich a master table
A left join keeps every row of the primary table and pulls matching attributes from the lookup. Normalize the key on both sides first, or rows that look equal ("a-100" vs "A-100") silently fail to match:
df_sales = pd.read_excel("sales.xlsx", engine="openpyxl", dtype={"sku": str})
df_inventory = pd.read_excel("inventory.xlsx", engine="openpyxl", dtype={"sku": str})
# Normalize the join key on both sides
df_sales["sku"] = df_sales["sku"].str.strip().str.upper()
df_inventory["sku"] = df_inventory["sku"].str.strip().str.upper()
merged = pd.merge(
df_sales,
df_inventory[["sku", "warehouse", "stock_level"]],
on="sku",
how="left",
validate="m:1", # many sales rows to one inventory row
)
print(merged)
validate="m:1" raises MergeError if the inventory side is not unique on sku, catching a duplicate-key bug before it doubles your rows. For the focused two-file recipe, see Merge Two Excel Files on Common Column Python.
Pattern 2: Reconcile divergent schemas
When two systems name the same field differently, rename to a canonical schema, then stack the rows with concat:
df_a = pd.DataFrame({"Client_ID": [1, 2], "Amount_USD": [100, 200]})
df_b = pd.DataFrame({"Acct_No": [3, 4], "Total_Value": [300, 400]})
column_mapping = {
"Client_ID": "customer_id", "Acct_No": "customer_id",
"Amount_USD": "amount", "Total_Value": "amount",
}
df_a = df_a.rename(columns={k: v for k, v in column_mapping.items() if k in df_a.columns})
df_b = df_b.rename(columns={k: v for k, v in column_mapping.items() if k in df_b.columns})
common = list(set(df_a.columns) & set(df_b.columns))
unified = pd.concat([df_a[common], df_b[common]], ignore_index=True)
print(unified)
Pattern 3: Track matches with the indicator
For audits you often need to know which rows matched. indicator=True adds a _merge column tagging each row as both, left_only, or right_only:
result = pd.merge(
df_sales, df_inventory,
on="sku", how="left", indicator=True,
)
result["match_status"] = result["_merge"].map({
"both": "matched",
"left_only": "unmatched_primary",
"right_only": "orphaned_secondary",
})
print(result[["sku", "region", "warehouse", "match_status"]])
Common errors and fixes
Dtype mismatch on the join key
Symptom: a zero-row or all-NaN merge despite values that look identical. Cause: one side stores the key as text, the other as a number. Fix — cast both to the same type:
left = pd.DataFrame({"order_id": ["1", "2", "3"], "qty": [4, 5, 6]})
right = pd.DataFrame({"order_id": [1, 2, 3], "price": [9.0, 8.0, 7.0]})
left["order_id"] = pd.to_numeric(left["order_id"], errors="coerce").astype("Int64")
right["order_id"] = right["order_id"].astype("Int64")
print(pd.merge(left, right, on="order_id", how="inner"))
Duplicate keys multiply rows
Symptom: the output is larger than either input. Cause: duplicate keys on both sides produce a many-to-many join. Fix — deduplicate or aggregate the lookup, or pass validate= to fail fast:
dupes = pd.DataFrame({"key_col": ["x", "x", "y"], "revenue": [10, 20, 30]})
deduped = dupes.drop_duplicates(subset=["key_col"], keep="first")
aggregated = dupes.groupby("key_col", as_index=False).agg({"revenue": "sum"})
print(aggregated)
NaN from outer/right joins
Symptom: numeric columns gain NaN for non-matching rows, breaking later math. Fix — fill the gaps deliberately after the merge:
numeric_cols = merged.select_dtypes(include="number").columns
merged[numeric_cols] = merged[numeric_cols].fillna(0)
print(merged.isna().sum().sum())
For a full treatment of post-merge gaps, see Handling Missing Data in Excel Reports.
Export the consolidated result
merged.to_excel("merged_report.xlsx", index=False, engine="openpyxl")
print("Wrote merged_report.xlsx")
Conclusion
A reliable merge rests on three decisions made before pd.merge() is called: normalize the join key on both sides, choose the join type that matches the business logic (most lookups are left), and pass validate= to catch cardinality problems early. Everything else — suffix resolution, NaN cleanup, indicator columns — flows from those three choices being right.
Where to go next
- For the focused common-column recipe, see Merge Two Excel Files on Common Column Python.
- To summarize a merged DataFrame, see Creating Pivot Tables from Excel Data.
- To resolve the
NaNjoins introduce, see Handling Missing Data in Excel Reports.