Pandas: Drop Duplicates From an Excel Column

Remove duplicate rows by a single Excel column with pandas drop_duplicates and the subset parameter — including keep options, NaN behavior, and pre-cleaning.

To drop duplicates by a single Excel column with pandas, load the workbook with pd.read_excel(), call df.drop_duplicates(subset=["column"]), and write the result back. This removes whole rows where the target column repeats, keeping the first occurrence by default — the other columns of the surviving row stay intact.

This is one of the most common jobs in a pandas cleaning workflow: a workbook accumulates repeated order IDs, SKUs, or email addresses because it was exported twice, synced from two systems, or hand-edited. Below, every block runs in order against a small sample workbook built in the first step, so you can paste them into one script and watch the row count fall.

Prerequisites

Python 3.9 or newer with pandas and the openpyxl engine installed: pip install pandas openpyxl.
An .xlsx workbook containing at least one column whose repeated values you want collapsed — an order ID, SKU, customer email, or any natural key.
Comfort reading a sheet into a DataFrame with pd.read_excel(). If your file has stray empty rows, clear them first with the blank-row removal steps so they cannot masquerade as duplicate keys.

Create a sample workbook

Python

import pandas as pd

df = pd.DataFrame({
    "TargetColumn": ["A-100", "A-100", "B-200", "C-300", "B-200"],
    "MetricA": [10, 99, 20, 30, 21],
    "UpdatedAt": ["2024-01-01", "2024-03-01", "2024-01-02",
                  "2024-01-03", "2024-02-15"],
})
df.to_excel("report_input.xlsx", index=False, engine="openpyxl")
print(f"Wrote {len(df)} rows")

The core operation

Python

df = pd.read_excel("report_input.xlsx", engine="openpyxl")

df_clean = df.drop_duplicates(subset=["TargetColumn"], keep="first", ignore_index=True)

df_clean.to_excel("report_output.xlsx", index=False, engine="openpyxl")
print(df_clean)

Five rows collapse to three — one row per distinct TargetColumn.

The parameters that matter

subset — the column(s) checked for uniqueness. ["TargetColumn"] evaluates only that column while keeping every other column of the surviving rows.
keep — which duplicate survives: "first" (default), "last", or False (drop all rows that have any duplicate).
ignore_index — resets the index to 0, 1, 2, …. Set True for clean exports and predictable downstream joins.

Python

# keep="last" retains the final occurrence instead of the first
last = df.drop_duplicates(subset=["TargetColumn"], keep="last", ignore_index=True)
print(last)

# keep=False removes every row whose TargetColumn appears more than once
unique_only = df.drop_duplicates(subset=["TargetColumn"], keep=False)
print(unique_only)

Keep the highest-value record, not just the first

keep="first" is positional. To keep a meaningful survivor — say the row with the largest MetricA per key — sort first, then drop:

Python

best = (df.sort_values("MetricA", ascending=False)
        .drop_duplicates(subset=["TargetColumn"], keep="first")
        .sort_index())
print(best[["TargetColumn", "MetricA"]])

Resolve conflicting columns with groupby

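When duplicates carry different values in other columns and you want a deterministic single row per key, groupby is clearer than drop_duplicates. Note that .first() returns the first non-null value in each column, so the resulting row can combine fields from different source rows, and that groupby sorts the keys and skips NaN keys by default: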

Python

resolved = df.groupby("TargetColumn", as_index=False).first()
print(resolved)
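
A small sketch of how the two approaches differ when a column has gaps. The frame and its "Note" column are hypothetical, built only for this comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame: the first copy of the key has a missing MetricA
d = pd.DataFrame({
    "TargetColumn": ["A-100", "A-100"],
    "MetricA": [np.nan, 99.0],
    "Note": ["first copy", "second copy"],
})

# groupby(...).first() picks the first non-null value column by column
via_first = d.groupby("TargetColumn", as_index=False).first()

# drop_duplicates keeps one whole row, gaps included
via_drop = d.drop_duplicates(subset=["TargetColumn"], keep="first")

print(via_first)
print(via_drop)
```

Here the groupby result pairs MetricA from the second row with Note from the first, while drop_duplicates keeps the first row as it was. Which behavior is right depends on whether a blended row is acceptable in your report.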

Inspect before you drop

To review what would be removed, build a mask with duplicated() rather than dropping blind:

Python

mask = df.duplicated(subset=["TargetColumn"], keep="first")
removed = df[mask]
print(f"{len(removed)} rows would be removed:")
print(removed)
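
The mask above lists only the rows that would be dropped. To review each of them next to the copy that would survive, one option is keep=False, which flags every member of a duplicate group. The frame below repeats the sample values so the sketch runs on its own:

```python
import pandas as pd

# Same key and metric values as the sample workbook, rebuilt for a standalone run
df = pd.DataFrame({
    "TargetColumn": ["A-100", "A-100", "B-200", "C-300", "B-200"],
    "MetricA": [10, 99, 20, 30, 21],
})

# keep=False marks every row whose key repeats, survivors included
in_dup_group = df.duplicated(subset=["TargetColumn"], keep=False)

# A stable sort groups the copies together while preserving file order within a key
groups = df[in_dup_group].sort_values("TargetColumn", kind="stable")
print(groups)
```

Rows with a unique key (C-300 here) are left out, so the output contains only the records someone may need to adjudicate.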

Log removal counts in a pipeline

Tracking how many duplicates you remove surfaces upstream issues — repeated exports, sync errors, template drift. Wrap it so a missing column fails loudly:

Python

def dedupe_logged(df: pd.DataFrame, column: str) -> pd.DataFrame:
    if column not in df.columns:
        raise KeyError(f"Column not found: {column}")
    before = len(df)
    out = df.drop_duplicates(subset=[column], ignore_index=True)
    print(f"[INFO] Removed {before - len(out)} duplicate rows on '{column}'")
    return out

result = dedupe_logged(df, "TargetColumn")

Common pitfalls and gotchas

Casing and whitespace hide duplicates

Exact matching is literal: "A-100" and "a-100 " are different values. Manual Excel entry routinely introduces trailing whitespace and inconsistent casing, so normalize the key first or duplicates slip through untouched:

Python

df["TargetColumn"] = df["TargetColumn"].astype(str).str.strip().str.lower()
deduped = df.drop_duplicates(subset=["TargetColumn"], ignore_index=True)
print(deduped["TargetColumn"].tolist())
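
One caveat about this normalization, sketched on a hypothetical three-value Series: in pandas 2.x, astype(str) converts a missing key into the literal text "nan", so it no longer counts as missing afterward. Casting to the nullable "string" dtype instead keeps the gap as a missing value. Behavior of astype(str) may differ in other pandas releases, so treat this as something to check against your installed version:

```python
import numpy as np
import pandas as pd

# Hypothetical keys: two spellings of the same ID plus one blank cell
keys = pd.Series(["A-100", " a-100 ", np.nan])

# Plain str cast, as used in the normalization step above
as_str = keys.astype(str).str.strip().str.lower()

# Nullable string dtype: missing values stay missing through the .str methods
as_string = keys.astype("string").str.strip().str.lower()

print(as_str.tolist())
print(as_string.tolist())
print(as_string.isna().tolist())
```

If blank keys matter in your data, the "string" variant lets the NaN handling described in the next section still apply after normalization.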

NaN keys collapse to one

pandas treats NaN values as equal to each other for deduplication, so multiple null keys collapse to a single surviving row. That is usually not what you want when the blanks are genuinely different records. Filling them with one shared sentinel such as "__NULL__" does not help, because the identical sentinels collapse in exactly the same way. To preserve the rows, exempt null keys from the duplicate mask (or give each null a unique placeholder and revert it afterward). This is closely related to the wider question of filling missing values with fillna:

Python

import numpy as np

s = pd.DataFrame({"TargetColumn": ["x", np.nan, np.nan, "x"]})
# Flag repeats, then exempt rows whose key is missing so every blank survives
is_dup = s.duplicated(subset=["TargetColumn"], keep="first")
kept = s[~is_dup | s["TargetColumn"].isna()]
print(kept)

drop_duplicates returns a copy, not an in-place edit

Like most pandas methods, drop_duplicates returns a new DataFrame — the original is untouched unless you reassign (df = df.drop_duplicates(...)) or pass inplace=True. Forgetting this is the most common reason "the duplicates are still there" after the call.
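
A minimal illustration of the difference, using a throwaway three-row frame:

```python
import pandas as pd

df = pd.DataFrame({"TargetColumn": ["A-100", "A-100", "B-200"]})

# The result is discarded here, so df still has all three rows
df.drop_duplicates(subset=["TargetColumn"])
rows_after_bare_call = len(df)

# Reassigning is what actually replaces the frame
df = df.drop_duplicates(subset=["TargetColumn"], ignore_index=True)
rows_after_reassign = len(df)

print(rows_after_bare_call, rows_after_reassign)
```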

The first row is not always the row you want

keep="first" keeps whichever row happens to appear earliest in file order, which may be the stale one. When one duplicate is authoritative, sort deliberately (as shown above) or reach for groupby(...).agg(...) to combine values across the duplicated rows rather than discard them.
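
As a sketch of the groupby(...).agg(...) route, the block below collapses each key into one row using named aggregation. The rules chosen (largest MetricA, latest UpdatedAt) and the "Copies" column are illustrative assumptions, not rules from the original workflow:

```python
import pandas as pd

# Same values as the sample workbook, rebuilt for a standalone run
df = pd.DataFrame({
    "TargetColumn": ["A-100", "A-100", "B-200", "C-300", "B-200"],
    "MetricA": [10, 99, 20, 30, 21],
    "UpdatedAt": ["2024-01-01", "2024-03-01", "2024-01-02",
                  "2024-01-03", "2024-02-15"],
})

combined = df.groupby("TargetColumn", as_index=False).agg(
    MetricA=("MetricA", "max"),        # keep the largest metric per key
    UpdatedAt=("UpdatedAt", "max"),    # ISO date strings sort chronologically
    Copies=("MetricA", "count"),       # how many source rows fed each key
)
print(combined)
```

Unlike a sort followed by drop_duplicates, each output column follows its own rule, so the surviving row may not match any single input row.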

Performance and scale notes

drop_duplicates on a single column is fast — pandas hashes the key column, so it scales close to linearly with row count, and a few hundred thousand rows deduplicate in well under a second. Practical points for larger Excel exports:

Deduplicate before you widen. Drop duplicates while the frame is narrow, then merge in extra columns or run heavier transforms — you do less work on fewer rows. This pairs naturally with the join workflow for combining Excel files on a common column.
String normalization dominates the cost. The .astype(str).str.strip().str.lower() pass is slower than the drop itself. On very large columns, normalize once into a helper column, dedupe on it, then drop the helper.
Sorting first has a price. The "keep the highest-value record" pattern adds an O(n log n) sort. For millions of rows where you only need one field, groupby(key)["value"].idxmax() followed by a .loc[] selection is lighter than a full sort.
The I/O is the bottleneck, not the dedupe. Reading and writing .xlsx with openpyxl is far slower than the in-memory operation. Read once, do all cleaning in the DataFrame, and write once at the end rather than round-tripping the workbook between steps.
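
The idxmax alternative mentioned in the sorting note can be sketched as follows. It assumes unique index labels (the default after read_excel) and no group whose values are all missing:

```python
import pandas as pd

# Same key and metric values as the sample workbook, rebuilt for a standalone run
df = pd.DataFrame({
    "TargetColumn": ["A-100", "A-100", "B-200", "C-300", "B-200"],
    "MetricA": [10, 99, 20, 30, 21],
})

# For each key, idxmax returns the index label of the row with the largest MetricA
winners = df.groupby("TargetColumn")["MetricA"].idxmax()

# Select those rows directly, then restore the original file order
best = df.loc[winners].sort_index()
print(best)
```

The result matches the sort-then-drop pattern on this data, without reordering the whole frame first.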

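Deduplicate on a normalized key

Python

import pandas as pd

# Illustrative frame: "Region" is not a column in the sample workbook above
regions = pd.DataFrame({"Region": ["North", "north ", " NORTH", "South"]})
regions["_key"] = (
    regions["Region"].astype("string").str.replace(r"\s+", " ", regex=True).str.strip().str.casefold()
)
deduped = regions.drop_duplicates(subset="_key", keep="first").drop(columns="_key")
print(len(regions), "->", len(deduped))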

Keeping the original column untouched is the point of the separate key: the report should display North as the business writes it while the comparison uses the folded form. Dropping the helper column before writing keeps the output clean.

keep="first" only means something after a deliberate sort. Sorting by the column that decides which copy wins — most recent, largest, most complete — turns "first" from an accident of file order into a rule someone can explain.
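
For instance, a "most recent copy wins" rule might look like the sketch below. It reuses the sample values and assumes UpdatedAt is the column that decides the winner:

```python
import pandas as pd

# Same values as the sample workbook, rebuilt for a standalone run
df = pd.DataFrame({
    "TargetColumn": ["A-100", "A-100", "B-200", "C-300", "B-200"],
    "MetricA": [10, 99, 20, 30, 21],
    "UpdatedAt": ["2024-01-01", "2024-03-01", "2024-01-02",
                  "2024-01-03", "2024-02-15"],
})

# Parse the dates so the sort is chronological, whatever the text format was
df["UpdatedAt"] = pd.to_datetime(df["UpdatedAt"])

# Newest first, so keep="first" now means "the latest update per key"
latest = (df.sort_values("UpdatedAt", ascending=False, kind="stable")
            .drop_duplicates(subset=["TargetColumn"], keep="first")
            .sort_index())
print(latest[["TargetColumn", "UpdatedAt"]])
```

kind="stable" keeps rows with identical timestamps in file order, so ties are still resolved predictably.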

Report before you drop

Dropping duplicates silently makes a row count change that nobody can explain later. Counting them first, logging how many were removed and on which key, and writing the removed rows to a Removed sheet turns a mysterious discrepancy into a documented decision. It costs three lines, and it is what lets you answer "the source had 4,812 rows and the report shows 4,790" without re-running anything.
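
A minimal sketch of that reporting step, assuming openpyxl is installed. The output file name and the "Clean" sheet name are placeholders chosen for this example:

```python
import pandas as pd

# Same key and metric values as the sample workbook, rebuilt for a standalone run
df = pd.DataFrame({
    "TargetColumn": ["A-100", "A-100", "B-200", "C-300", "B-200"],
    "MetricA": [10, 99, 20, 30, 21],
})

# Split the frame with one mask so kept + removed always adds up to the input
mask = df.duplicated(subset=["TargetColumn"], keep="first")
kept, removed = df[~mask], df[mask]
print(f"{len(df)} rows in, {len(kept)} kept, {len(removed)} removed on 'TargetColumn'")

# Placeholder file and sheet names; the removed rows go to their own sheet
with pd.ExcelWriter("report_with_audit.xlsx", engine="openpyxl") as writer:
    kept.to_excel(writer, sheet_name="Clean", index=False)
    removed.to_excel(writer, sheet_name="Removed", index=False)
```

Because both sheets come from the same mask, the row counts reconcile by construction.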

Conclusion

Removing duplicates by one Excel column is a two-line operation — read_excel, drop_duplicates(subset=[...]) — but the details decide whether the result is correct. Pick the right keep (or groupby when you need to merge values), normalize casing and whitespace so near-duplicates actually match, handle NaN keys deliberately, and log the counts so silent data loss shows up in review. With those guards in place the step drops cleanly into any repeatable pandas cleaning pipeline.

Frequently asked questions

What does keep=False do compared to the default? The default keep="first" retains one row per key; keep="last" retains the final occurrence. keep=False is different — it drops every row whose key appears more than once, leaving only keys that were unique to begin with.

Why do "A-100" and "a-100 " count as separate values? Matching is exact and literal, so casing and trailing whitespace make otherwise-equal keys distinct. Normalize first with .astype(str).str.strip().str.lower() or duplicates slip through.

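How are NaN keys treated? pandas considers NaN values equal to each other for deduplication, so multiple null keys collapse to one. A single shared sentinel like "__NULL__" collapses the same way, so to keep the rows either exempt null keys from the duplicated() mask or give each null a unique placeholder before dropping and revert it afterward.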

How do I keep the highest-value row per key instead of the first? keep="first" is purely positional. Sort by the column you care about first, for example df.sort_values("MetricA", ascending=False), then call drop_duplicates, optionally followed by .sort_index() to restore the original order.

What's the point of ignore_index=True? It resets the index to 0, 1, 2, … after rows are removed, instead of leaving gaps. Set it for clean exports and predictable downstream joins.
