Merge Two Excel Files on a Common Column in Python

Merge two Excel files on a shared column in Python with pandas: load both workbooks, normalize the key, join with pd.merge, handle different names and duplicate keys, export.

To merge two Excel files on a shared column in Python, read each workbook into a DataFrame with pd.read_excel(), then join them with pd.merge() on the common key. The recipe below is fully runnable: the first block writes both source files so the reads have something to open.

This is the everyday "VLOOKUP between two spreadsheets" task — a sales export and a product catalog, an order list and a customer table, two monthly extracts that share an account number. This page walks the whole flow end to end: load both files, line up the key so matches actually happen, pick the join that keeps the rows you need, and write a single combined workbook back out. It sits inside the broader set of merging and joining Excel DataFrames patterns.

Prerequisites

You need Python 3.9+ with pandas and the openpyxl engine installed, since pandas delegates .xlsx reading and writing to it:

Bash

pip install pandas openpyxl

You should also know:

Which column the two files share. It might be product_sku, an order ID, a customer number, or an email. It does not have to have the same header in both files — a mismatch is handled below.
Which file is your "main" table — the one whose rows you want to keep in full. The other file is the lookup you pull columns from.

If loading the workbooks is itself new to you, start with reading Excel files with pandas, and use reading only specific columns when the lookup workbook is wide and you only need a couple of fields.
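As a rough illustration of reading only specific columns, the sketch below writes a hypothetical five-column catalog to an in-memory buffer and reads back just the key and one field with usecols. The supplier and warehouse columns and the buffer itself are assumptions made for the demo; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical wide catalog: only two of its columns are needed for the merge.
wide_catalog = pd.DataFrame({
    "product_sku": ["A-100", "B-200"],
    "product_name": ["Widget", "Gadget"],
    "unit_price": [19.99, 49.50],
    "supplier": ["Acme", "Globex"],
    "warehouse": ["W1", "W2"],
})

# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.BytesIO()
wide_catalog.to_excel(buffer, index=False, engine="openpyxl")
buffer.seek(0)

# usecols limits the read to the key plus the one field we want to attach.
slim_lookup = pd.read_excel(
    buffer,
    engine="openpyxl",
    usecols=["product_sku", "unit_price"],
)
print(slim_lookup.columns.tolist())
```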

Create the two sample workbooks

So the rest of the page runs without any files of your own, this block writes a small sales export and a product catalog that share the product_sku column:

Python

import pandas as pd

sales = pd.DataFrame({
    "product_sku": ["A-100", "B-200", "C-300", "A-100"],
    "region": ["North", "South", "West", "East"],
    "units": [10, 5, 8, 3],
})
sales.to_excel("sales_Q3.xlsx", index=False)

catalog = pd.DataFrame({
    "product_sku": ["A-100", "B-200", "C-300"],
    "product_name": ["Widget", "Gadget", "Gizmo"],
    "unit_price": [19.99, 49.50, 8.75],
})
catalog.to_excel("product_catalog.xlsx", index=False)

Note that sales_Q3.xlsx has A-100 twice (two regions sold the same product) while the catalog has each SKU once. That is the common shape: many rows in the main file, one row per key in the lookup.

Merge on the common column

Load both files, then join. A left join keeps every row of the primary table and attaches the matching catalog columns; suffixes disambiguates any non-key columns that happen to share a name:

Python

df_primary = pd.read_excel("sales_Q3.xlsx", engine="openpyxl")
df_lookup = pd.read_excel("product_catalog.xlsx", engine="openpyxl")

merged = pd.merge(
    df_primary,
    df_lookup,
    on="product_sku",
    how="left",
    suffixes=("_sales", "_catalog"),
)
merged.to_excel("merged_sales_report.xlsx", index=False, engine="openpyxl")
print(merged)

Both A-100 rows in the sales file pick up the same product_name and unit_price from the catalog, so you get four rows out — the primary's row count is preserved and enriched with the lookup's columns.
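The sample workbooks have no clashing column names, so suffixes has nothing to do in the merge above. To see it act, here is a minimal sketch in which both frames carry a non-key column called notes; that column and its values are invented for the illustration.

```python
import pandas as pd

# Hypothetical frames: both carry a non-key column called "notes".
sales = pd.DataFrame({
    "product_sku": ["A-100", "B-200"],
    "units": [10, 5],
    "notes": ["rush order", "standard"],
})
catalog = pd.DataFrame({
    "product_sku": ["A-100", "B-200"],
    "unit_price": [19.99, 49.50],
    "notes": ["bestseller", "new item"],
})

# The clashing "notes" columns get a suffix each; the key is left alone.
merged = pd.merge(
    sales,
    catalog,
    on="product_sku",
    how="left",
    suffixes=("_sales", "_catalog"),
)
print(merged.columns.tolist())
```

The key column keeps its plain name, while the two notes columns come out as notes_sales and notes_catalog.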

Normalize the key before joining

merge matches keys exactly, so casing, trailing whitespace, or a dtype difference (object vs int64) silently produce empty matches. This is the single most common reason a merge "loses" rows that clearly exist in both files. Normalize both sides first:

Python

# Apply the same cleanup to the key column of both frames.
for d in (df_primary, df_lookup):
    d["product_sku"] = d["product_sku"].astype(str).str.strip().str.upper()
print(df_primary["product_sku"].tolist())

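.astype(str) forces a shared text type (so the integer 100 and the text "100" line up), .str.strip() removes leading and trailing whitespace left by spreadsheet exports, and .str.upper() makes a-100 and A-100 match. Apply the identical transform to both frames; normalizing only one side leaves the other side's variants unmatched. One caveat: depending on the pandas version, .astype(str) can turn a missing key into the literal text "nan", and a float key such as 100.0 becomes "100.0", so check for blank or float-typed key cells before relying on it.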
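To make the effect concrete, the sketch below merges deliberately messy keys before and after normalization and counts the unmatched rows each time. The messy values and the normalize_key helper are made up for the demonstration.

```python
import pandas as pd

# Hypothetical messy keys: lower case and stray spaces on the sales side.
sales = pd.DataFrame({
    "product_sku": ["a-100 ", "B-200", " c-300"],
    "units": [10, 5, 8],
})
catalog = pd.DataFrame({
    "product_sku": ["A-100", "B-200", "C-300"],
    "unit_price": [19.99, 49.50, 8.75],
})


def normalize_key(series):
    """Illustrative helper: shared text type, trimmed, upper-cased."""
    return series.astype(str).str.strip().str.upper()


# Merge on the raw keys and count rows that found no catalog match.
raw = pd.merge(sales, catalog, on="product_sku", how="left")
unmatched_before = int(raw["unit_price"].isna().sum())

# Normalize both sides identically, then merge again.
sales["product_sku"] = normalize_key(sales["product_sku"])
catalog["product_sku"] = normalize_key(catalog["product_sku"])
clean = pd.merge(sales, catalog, on="product_sku", how="left")
unmatched_after = int(clean["unit_price"].isna().sum())

print(f"unmatched before: {unmatched_before}, after: {unmatched_after}")
```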

Choosing the join type

The how argument decides which rows survive when a key is present in one file but not the other:

`how`	Keeps
`inner`	Only keys present in both files
`left`	All rows from the first file
`right`	All rows from the second file
`outer`	All keys from either file

For "enrich my main table with extra columns," how="left" is almost always right. Reach for inner when you only want rows that exist in both, and outer when you are reconciling two lists and need to see everything from either side.
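A quick way to build intuition is to run all four join types on the same pair of frames and compare row counts. The frames below are hypothetical: D-400 exists only on the left and C-300 only on the right.

```python
import pandas as pd

# D-400 has no catalog entry; C-300 has no sales rows.
left = pd.DataFrame({
    "product_sku": ["A-100", "B-200", "D-400"],
    "units": [10, 5, 2],
})
right = pd.DataFrame({
    "product_sku": ["A-100", "B-200", "C-300"],
    "unit_price": [19.99, 49.50, 8.75],
})

# Same inputs, four join types: only the surviving rows differ.
row_counts = {
    how: len(pd.merge(left, right, on="product_sku", how=how))
    for how in ("inner", "left", "right", "outer")
}
print(row_counts)
```

With these inputs the inner join returns the two shared keys, left and right return three rows each, and outer returns all four distinct keys.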

When the column names differ

If the key has a different name in each file — product_sku in one, ProductCode in the other — use left_on/right_on, then drop the now-redundant duplicate column:

Python

alt_lookup = df_lookup.rename(columns={"product_sku": "ProductCode"})
merged_alt = pd.merge(
    df_primary, alt_lookup,
    left_on="product_sku", right_on="ProductCode", how="inner",
).drop(columns=["ProductCode"])
print(merged_alt.columns.tolist())

pandas keeps both key columns after a left_on/right_on merge because they had different names; dropping ProductCode leaves a single clean key. Alternatively, rename one side to match before the merge so a plain on="product_sku" works.
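The rename-first alternative might look like the following sketch, which uses small stand-in frames rather than the workbooks loaded earlier:

```python
import pandas as pd

sales = pd.DataFrame({"product_sku": ["A-100", "B-200"], "units": [10, 5]})
# Stand-in lookup whose key uses a different header.
catalog = pd.DataFrame({
    "ProductCode": ["A-100", "B-200"],
    "unit_price": [19.99, 49.50],
})

# Align the header first, so a plain on= merge yields a single key column.
catalog_renamed = catalog.rename(columns={"ProductCode": "product_sku"})
merged = pd.merge(sales, catalog_renamed, on="product_sku", how="left")
print(merged.columns.tolist())
```

Renaming up front avoids the extra drop step, at the cost of changing the lookup's header for any code that runs afterward.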

Common pitfalls & gotchas

Duplicate keys in the lookup multiply your rows. If the lookup has more than one row per key, the join repeats the primary rows once per match — a four-row report can quietly become twelve. Deduplicate the lookup before merging, or pass validate="m:1" to make the merge fail loudly when the right side isn't unique:

Python

df_lookup_unique = df_lookup.drop_duplicates(subset=["product_sku"], keep="last")
safe = pd.merge(df_primary, df_lookup_unique, on="product_sku",
                how="left", validate="m:1")
print(f"{len(safe)} rows (primary has {len(df_primary)})")

If a genuine duplicate needs collapsing rather than dropping — summing amounts, taking the latest record — aggregate with groupby first. For the mechanics of stripping repeats out of a single column, see dropping duplicates from an Excel column.
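As one possible shape for collapsing rather than dropping, the sketch below assumes a hypothetical stock table with one row per SKU per warehouse, sums it to one row per SKU, and only then merges. The table, its columns, and the choice of sum are illustrative assumptions.

```python
import pandas as pd

sales = pd.DataFrame({"product_sku": ["A-100", "B-200"], "units": [10, 5]})

# Hypothetical lookup with one row per SKU per warehouse.
stock = pd.DataFrame({
    "product_sku": ["A-100", "A-100", "B-200"],
    "warehouse": ["W1", "W2", "W1"],
    "on_hand": [40, 15, 7],
})

# Collapse to one row per key so the lookup becomes unique.
stock_per_sku = stock.groupby("product_sku", as_index=False)["on_hand"].sum()

# validate="m:1" now passes because each SKU appears once on the right.
merged = pd.merge(sales, stock_per_sku, on="product_sku",
                  how="left", validate="m:1")
print(merged)
```

For "take the latest record" the same pattern applies with a sort followed by drop_duplicates(keep="last") in place of the sum.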

Unmatched keys leave NaN, not an error. A left or outer join fills the lookup columns with NaN wherever no match was found, which then breaks later arithmetic or formatting. Decide the fill deliberately with fillna on the missing columns rather than letting NaN propagate.
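A minimal sketch of a deliberate fill follows, passing a per-column dictionary to fillna. The placeholder values ("UNKNOWN" and 0.0) are assumptions chosen for the demo; the right fill depends on what the downstream report should show.

```python
import pandas as pd

# Z-999 has no catalog entry, so its lookup columns come back as NaN.
sales = pd.DataFrame({"product_sku": ["A-100", "Z-999"], "units": [10, 4]})
catalog = pd.DataFrame({
    "product_sku": ["A-100"],
    "product_name": ["Widget"],
    "unit_price": [19.99],
})

merged = pd.merge(sales, catalog, on="product_sku", how="left")

# Choose a fill per column instead of one blanket value.
filled = merged.fillna({"product_name": "UNKNOWN", "unit_price": 0.0})

# Arithmetic on the filled column no longer propagates NaN.
filled["revenue"] = filled["units"] * filled["unit_price"]
print(filled)
```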

You silently drop rows. An inner join throws away primary rows that have no lookup match without warning. If that surprises you, you probably wanted left. Validate the row count after joining:

Python

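# A left join should never return fewer rows than the primary table.
assert len(merged) >= len(df_primary), "Unexpected row loss during merge"
# Rows whose lookup column is NaN found no match in the catalog.
missing = merged["unit_price"].isna().sum()
print(f"{missing} unmatched rows" if missing else "All rows matched")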

To label which rows matched instead of just counting them, pass indicator=True to add a _merge column tagging each row both, left_only, or right_only.
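Here is a small sketch of indicator=True on an outer join, using hypothetical frames where one key exists only in the sales data and one only in the catalog:

```python
import pandas as pd

sales = pd.DataFrame({"product_sku": ["A-100", "Z-999"], "units": [10, 4]})
catalog = pd.DataFrame({
    "product_sku": ["A-100", "C-300"],
    "unit_price": [19.99, 8.75],
})

# indicator=True adds a _merge column describing where each row came from.
audited = pd.merge(sales, catalog, on="product_sku",
                   how="outer", indicator=True)
print(audited[["product_sku", "_merge"]])

# Filter on the tag to list the keys that need attention.
left_only_keys = audited.loc[
    audited["_merge"] == "left_only", "product_sku"
].tolist()
right_only_keys = audited.loc[
    audited["_merge"] == "right_only", "product_sku"
].tolist()
print(left_only_keys, right_only_keys)
```

Drop the _merge column before exporting if the report's readers do not need it.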

Performance and scale notes

pd.merge uses a hash join, so it scales close to linearly with row count and handles hundreds of thousands of rows comfortably. A few things keep it fast and memory-light on larger files:

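Set the key dtype once. Normalizing to str is safe, but if your key is truly numeric, casting both sides with .astype("int64") typically makes the join faster and the result smaller than text keys. .astype("category") can help too, provided both sides end up with identical categories; otherwise pandas falls back to object dtype for the key.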
Read only what you need. Pass usecols= to pd.read_excel so the lookup workbook loads just its key and the columns you actually merge in — reading a 40-column catalog to attach two fields wastes memory.
.xlsx reading is the bottleneck, not the merge. For repeated joins over large workbooks, convert the lookup to Parquet or CSV once and read that; the merge itself is rarely the slow part.
Watch many-to-many blow-ups. Duplicate keys on both sides produce every combination, so a merge can emit far more rows than either input. validate= catches this before it exhausts memory.
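The many-to-many case from the list above can be reproduced on a tiny scale. In this sketch (hypothetical data, with A-100 duplicated on both sides) the unchecked merge grows the row count, while validate="m:1" raises pd.errors.MergeError before any result is produced.

```python
import pandas as pd

# A-100 appears twice on each side, so its rows pair up 2 x 2.
sales = pd.DataFrame({
    "product_sku": ["A-100", "A-100", "B-200"],
    "units": [10, 3, 5],
})
dup_catalog = pd.DataFrame({
    "product_sku": ["A-100", "A-100", "B-200"],
    "unit_price": [19.99, 21.50, 49.50],
})

# Without validation the merge quietly returns more rows than either input.
unchecked = pd.merge(sales, dup_catalog, on="product_sku", how="left")
print(f"{len(sales)} primary rows became {len(unchecked)}")

# With validation the same merge stops with an error instead.
try:
    pd.merge(sales, dup_catalog, on="product_sku",
             how="left", validate="m:1")
    validation_failed = False
except pd.errors.MergeError as exc:
    validation_failed = True
    print(f"validate stopped the merge: {exc}")
```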

Verify the join
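One way to verify a join is to audit the keys on both sides and then confirm the row count after merging. The sketch below is only an illustration: audit_keys is a hypothetical helper, not a pandas function, and the sample frames are invented.

```python
import pandas as pd


def audit_keys(primary, lookup, key):
    """Hypothetical helper: report how two frames line up on `key`."""
    primary_keys = set(primary[key].dropna())
    lookup_keys = set(lookup[key].dropna())
    return {
        "lookup_is_unique": bool(lookup[key].is_unique),
        "missing_from_lookup": sorted(primary_keys - lookup_keys),
        "unused_in_lookup": sorted(lookup_keys - primary_keys),
    }


sales = pd.DataFrame({
    "product_sku": ["A-100", "B-200", "Z-999", "A-100"],
    "units": [10, 5, 4, 3],
})
catalog = pd.DataFrame({
    "product_sku": ["A-100", "B-200", "C-300"],
    "unit_price": [19.99, 49.50, 8.75],
})

# Check the keys first, then confirm the left join kept every primary row.
report = audit_keys(sales, catalog, "product_sku")
merged = pd.merge(sales, catalog, on="product_sku", how="left")
rows_preserved = len(merged) == len(sales)

print(report)
print(f"row count preserved: {rows_preserved}")
```

Keys listed under missing_from_lookup are the ones that will come back as NaN after a left join, so they are worth reviewing before the export.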

Conclusion

Merging two Excel files on a shared column comes down to four moves: read each workbook with pd.read_excel, normalize the key on both sides so matches actually land, join with pd.merge using the how that keeps the rows you need, and validate the row count before you export. Guard against duplicate keys with validate="m:1" and resolve any NaN from unmatched keys deliberately, and the same recipe scales from a two-file lookup to a repeatable reporting pipeline.

Frequently asked questions

What if the common column has a different name in each file? Use left_on and right_on instead of on, e.g. pd.merge(df_primary, alt_lookup, left_on="product_sku", right_on="ProductCode"), then drop the redundant duplicate column afterward.

Why do matching rows fail to join? merge matches keys exactly, so casing, trailing whitespace, or a dtype difference (object vs int64) silently produces empty matches. Normalize both sides with .astype(str).str.strip().str.upper() before joining.

Which how should I pick to keep all rows of my main file? Use how="left". It keeps every row of the first file and attaches matching columns from the lookup, leaving NaN where no match exists.

How do I stop a duplicate lookup from multiplying my rows? Deduplicate the lookup with drop_duplicates(subset=["product_sku"]) before merging, or pass validate="m:1" so the merge raises loudly if the lookup isn't unique on the key.

What does suffixes do? When both files have a non-key column with the same name, suffixes=("_sales", "_catalog") renames the clashing columns so neither is overwritten. It does not affect the join key itself.

Up to the parent guide: Merging and Joining Excel DataFrames — multi-column keys, schema reconciliation, match auditing, and cardinality validation.
Stacking many files instead of joining side by side? See Combine Multiple Excel Files into One in Python.
A left, right, or outer join can introduce NaN; resolve it with Fill Missing Values in Excel with pandas fillna.
Deduplicate a noisy lookup before joining: Drop Duplicates from an Excel Column with pandas.
Summarize the merged result with Create a Pivot Table from Excel with pandas.