Guide

Advanced Data Transformation And CleaningDeep dive

Merge Two Excel Files on a Common Column in Python

Merge two Excel files on a shared column in Python with pandas: load both workbooks, normalize the key, join with pd.merge, handle different names and duplicate keys, export.

To merge two Excel files on a shared column in Python, read each workbook into a DataFrame with pd.read_excel(), then join them with pd.merge() on the common key. The recipe below is fully runnable: the first block writes both source files so the reads have something to open.

For the wider set of join patterns, see Merging and Joining Excel DataFrames.

Create the two sample workbooks

Python
import pandas as pd

sales = pd.DataFrame({
    "product_sku": ["A-100", "B-200", "C-300", "A-100"],
    "region": ["North", "South", "West", "East"],
    "units": [10, 5, 8, 3],
})
sales.to_excel("sales_Q3.xlsx", index=False)

catalog = pd.DataFrame({
    "product_sku": ["A-100", "B-200", "C-300"],
    "product_name": ["Widget", "Gadget", "Gizmo"],
    "unit_price": [19.99, 49.50, 8.75],
})
catalog.to_excel("product_catalog.xlsx", index=False)

Merge on the common column

Load both files, then join. A left join keeps every row of the primary table and attaches matching catalog columns; suffixes disambiguates any non-key columns that share a name:

Python
df_primary = pd.read_excel("sales_Q3.xlsx", engine="openpyxl")
df_lookup = pd.read_excel("product_catalog.xlsx", engine="openpyxl")

merged = pd.merge(
    df_primary,
    df_lookup,
    on="product_sku",
    how="left",
    suffixes=("_sales", "_catalog"),
)
merged.to_excel("merged_sales_report.xlsx", index=False, engine="openpyxl")
print(merged)

Normalize the key before joining

merge matches keys exactly, so casing, trailing whitespace, or a dtype difference (object vs int64) silently produce empty matches. Normalize both sides first:

Python
for d in (df_primary, df_lookup):
    d["product_sku"] = d["product_sku"].astype(str).str.strip().str.upper()
print(df_primary["product_sku"].tolist())

Choosing the join type

howKeeps
innerOnly keys present in both files
leftAll rows from the first file
rightAll rows from the second file
outerAll keys from either file

When the column names differ

If the key has a different name in each file, use left_on/right_on, then drop the redundant column:

Python
alt_lookup = df_lookup.rename(columns={"product_sku": "ProductCode"})
merged_alt = pd.merge(
    df_primary, alt_lookup,
    left_on="product_sku", right_on="ProductCode", how="inner",
).drop(columns=["ProductCode"])
print(merged_alt.columns.tolist())

When duplicate keys multiply rows

If the lookup has more than one row per key, a join repeats the primary rows. Deduplicate the lookup before merging, or pass validate="m:1" to fail loudly if it isn't unique:

Python
df_lookup_unique = df_lookup.drop_duplicates(subset=["product_sku"], keep="last")
safe = pd.merge(df_primary, df_lookup_unique, on="product_sku",
                how="left", validate="m:1")
print(f"{len(safe)} rows (primary has {len(df_primary)})")

Validate the merge

After joining, confirm you didn't lose or gain rows unexpectedly and report unmatched keys. With a left join the row count should not drop below the primary's:

Python
assert len(merged) >= len(df_primary), "Unexpected row loss during merge"
missing = merged["unit_price"].isna().sum()
print(f"{missing} unmatched keys" if missing else "All keys matched")

Where to go next