Guide
Pandas: Drop Duplicates From an Excel Column
Remove duplicate rows by a single Excel column with pandas drop_duplicates and the subset parameter — including keep options, NaN behavior, and pre-cleaning.
To drop duplicates by a single Excel column with pandas, load the workbook with pd.read_excel(), call df.drop_duplicates(subset=["column"]), and write the result back. This removes whole rows where the target column repeats, keeping the first occurrence by default — the other columns of the surviving row stay intact.
Every block below runs in order against a sample workbook built in the first step.
Create a sample workbook
import pandas as pd
df = pd.DataFrame({
"TargetColumn": ["A-100", "A-100", "B-200", "C-300", "B-200"],
"MetricA": [10, 99, 20, 30, 21],
"UpdatedAt": ["2024-01-01", "2024-03-01", "2024-01-02",
"2024-01-03", "2024-02-15"],
})
df.to_excel("report_input.xlsx", index=False, engine="openpyxl")
print(f"Wrote {len(df)} rows")
The core operation
df = pd.read_excel("report_input.xlsx", engine="openpyxl")
df_clean = df.drop_duplicates(subset=["TargetColumn"], keep="first", ignore_index=True)
df_clean.to_excel("report_output.xlsx", index=False, engine="openpyxl")
print(df_clean)
Five rows collapse to three — one row per distinct TargetColumn.
The parameters that matter
subset— the column(s) checked for uniqueness.["TargetColumn"]evaluates only that column while keeping every other column of the surviving rows.keep— which duplicate survives:"first"(default),"last", orFalse(drop all rows that have any duplicate).ignore_index— resets the index to0, 1, 2, …. SetTruefor clean exports and predictable downstream joins.
# keep="last" retains the final occurrence instead of the first
last = df.drop_duplicates(subset=["TargetColumn"], keep="last", ignore_index=True)
print(last)
# keep=False removes every row whose TargetColumn appears more than once
unique_only = df.drop_duplicates(subset=["TargetColumn"], keep=False)
print(unique_only)
Clean the column before deduplicating
Exact matching is literal: "A-100" and "a-100 " are different values. Manual Excel entry routinely introduces trailing whitespace and inconsistent casing, so normalize the key first or duplicates slip through:
df["TargetColumn"] = df["TargetColumn"].astype(str).str.strip().str.lower()
deduped = df.drop_duplicates(subset=["TargetColumn"], ignore_index=True)
print(deduped["TargetColumn"].tolist())
Keep the highest-value record, not just the first
keep="first" is positional. To keep a meaningful survivor — say the row with the largest MetricA per key — sort first, then drop:
best = (df.sort_values("MetricA", ascending=False)
.drop_duplicates(subset=["TargetColumn"], keep="first")
.sort_index())
print(best[["TargetColumn", "MetricA"]])
NaN behavior
pandas treats NaN values as equal to each other for deduplication, so multiple null keys collapse to one. To preserve them, fill with a sentinel before dropping and revert afterward:
import numpy as np
s = pd.DataFrame({"TargetColumn": ["x", np.nan, np.nan, "x"]})
filled = s.assign(TargetColumn=s["TargetColumn"].fillna("__NULL__"))
kept = filled.drop_duplicates(subset=["TargetColumn"])
kept["TargetColumn"] = kept["TargetColumn"].replace("__NULL__", np.nan)
print(kept)
Inspect before you drop
To review what would be removed, build a mask with duplicated() rather than dropping blind:
mask = df.duplicated(subset=["TargetColumn"], keep="first")
removed = df[mask]
print(f"{len(removed)} rows would be removed:")
print(removed)
Resolve conflicting columns with groupby
When duplicates carry different values in other columns and you want a deterministic single row per key, groupby is clearer than drop_duplicates:
resolved = df.groupby("TargetColumn", as_index=False).first()
print(resolved)
Log removal counts in a pipeline
Tracking how many duplicates you remove surfaces upstream issues — repeated exports, sync errors, template drift. Wrap it so a missing column fails loudly:
def dedupe_logged(df: pd.DataFrame, column: str) -> pd.DataFrame:
if column not in df.columns:
raise KeyError(f"Column not found: {column}")
before = len(df)
out = df.drop_duplicates(subset=[column], ignore_index=True)
print(f"[INFO] Removed {before - len(out)} duplicate rows on '{column}'")
return out
result = dedupe_logged(df, "TargetColumn")
This fits inside the broader Cleaning Excel Data with Pandas workflow. For multi-source joins and reshaping, see the Advanced Data Transformation and Cleaning overview.