Guide

Advanced Data Transformation And CleaningDeep dive

Merging and Joining Excel DataFrames with Pandas

Merge and join Excel DataFrames with pandas: normalize keys, pick the right join type, validate cardinality, track matches with indicator, and fix dtype and duplicate-key bugs.

When data lives in separate workbooks — sales in one, inventory in another — pandas.merge() joins them on a shared key the way a SQL join does. The hard parts are the keys: mismatched dtypes produce empty joins, duplicate keys multiply rows, and outer joins inject NaN. This guide covers the patterns that survive a scheduled run. Every block is runnable and shares one namespace, so paste them in order.

This is one stage of Advanced Data Transformation and Cleaning. For sanitizing headers and types first, see Cleaning Excel Data with Pandas.

Create sample workbooks

These examples need two source files, so create both up front — a sales table and an inventory lookup keyed on sku:

Python
import pandas as pd

sales = pd.DataFrame({
    "sku": ["a-100", "b-200 ", "C-300", "a-100"],
    "region": ["North", "South", "West", "East"],
    "units": [10, 5, 8, 3],
})
sales.to_excel("sales.xlsx", index=False)

inventory = pd.DataFrame({
    "sku": ["A-100", "B-200", "C-300"],
    "warehouse": ["W1", "W2", "W3"],
    "stock_level": [200, 150, 90],
})
inventory.to_excel("inventory.xlsx", index=False)

The merge pipeline

A reliable merge follows the same order every time:

  1. Load each workbook into its own DataFrame.
  2. Normalize keys — strip whitespace, unify casing, match dtypes.
  3. Choose the joininner, left, right, or outer.
  4. Validate cardinality with the validate argument.
  5. Audit row counts and unmatched keys.
  6. Export the result.

Pattern 1: Left join to enrich a master table

A left join keeps every row of the primary table and pulls matching attributes from the lookup. Normalize the key on both sides first, or rows that look equal ("a-100" vs "A-100") silently fail to match:

Python
df_sales = pd.read_excel("sales.xlsx", engine="openpyxl", dtype={"sku": str})
df_inventory = pd.read_excel("inventory.xlsx", engine="openpyxl", dtype={"sku": str})

# Normalize the join key on both sides
df_sales["sku"] = df_sales["sku"].str.strip().str.upper()
df_inventory["sku"] = df_inventory["sku"].str.strip().str.upper()

merged = pd.merge(
    df_sales,
    df_inventory[["sku", "warehouse", "stock_level"]],
    on="sku",
    how="left",
    validate="m:1",   # many sales rows to one inventory row
)
print(merged)

validate="m:1" raises MergeError if the inventory side is not unique on sku, catching a duplicate-key bug before it doubles your rows. For the focused two-file recipe, see Merge Two Excel Files on Common Column Python.

Pattern 2: Reconcile divergent schemas

When two systems name the same field differently, rename to a canonical schema, then stack the rows with concat:

Python
df_a = pd.DataFrame({"Client_ID": [1, 2], "Amount_USD": [100, 200]})
df_b = pd.DataFrame({"Acct_No": [3, 4], "Total_Value": [300, 400]})

column_mapping = {
    "Client_ID": "customer_id", "Acct_No": "customer_id",
    "Amount_USD": "amount", "Total_Value": "amount",
}
df_a = df_a.rename(columns={k: v for k, v in column_mapping.items() if k in df_a.columns})
df_b = df_b.rename(columns={k: v for k, v in column_mapping.items() if k in df_b.columns})

common = list(set(df_a.columns) & set(df_b.columns))
unified = pd.concat([df_a[common], df_b[common]], ignore_index=True)
print(unified)

Pattern 3: Track matches with the indicator

For audits you often need to know which rows matched. indicator=True adds a _merge column tagging each row as both, left_only, or right_only:

Python
result = pd.merge(
    df_sales, df_inventory,
    on="sku", how="left", indicator=True,
)
result["match_status"] = result["_merge"].map({
    "both": "matched",
    "left_only": "unmatched_primary",
    "right_only": "orphaned_secondary",
})
print(result[["sku", "region", "warehouse", "match_status"]])

Common errors and fixes

Dtype mismatch on the join key

Symptom: a zero-row or all-NaN merge despite values that look identical. Cause: one side stores the key as text, the other as a number. Fix — cast both to the same type:

Python
left = pd.DataFrame({"order_id": ["1", "2", "3"], "qty": [4, 5, 6]})
right = pd.DataFrame({"order_id": [1, 2, 3], "price": [9.0, 8.0, 7.0]})
left["order_id"] = pd.to_numeric(left["order_id"], errors="coerce").astype("Int64")
right["order_id"] = right["order_id"].astype("Int64")
print(pd.merge(left, right, on="order_id", how="inner"))

Duplicate keys multiply rows

Symptom: the output is larger than either input. Cause: duplicate keys on both sides produce a many-to-many join. Fix — deduplicate or aggregate the lookup, or pass validate= to fail fast:

Python
dupes = pd.DataFrame({"key_col": ["x", "x", "y"], "revenue": [10, 20, 30]})
deduped = dupes.drop_duplicates(subset=["key_col"], keep="first")
aggregated = dupes.groupby("key_col", as_index=False).agg({"revenue": "sum"})
print(aggregated)

NaN from outer/right joins

Symptom: numeric columns gain NaN for non-matching rows, breaking later math. Fix — fill the gaps deliberately after the merge:

Python
numeric_cols = merged.select_dtypes(include="number").columns
merged[numeric_cols] = merged[numeric_cols].fillna(0)
print(merged.isna().sum().sum())

For a full treatment of post-merge gaps, see Handling Missing Data in Excel Reports.

Export the consolidated result

Python
merged.to_excel("merged_report.xlsx", index=False, engine="openpyxl")
print("Wrote merged_report.xlsx")

Conclusion

A reliable merge rests on three decisions made before pd.merge() is called: normalize the join key on both sides, choose the join type that matches the business logic (most lookups are left), and pass validate= to catch cardinality problems early. Everything else — suffix resolution, NaN cleanup, indicator columns — flows from those three choices being right.

Where to go next