Guide

Advanced Data Transformation And CleaningDeep dive

Pandas: Drop Duplicates From an Excel Column

Remove duplicate rows by a single Excel column with pandas drop_duplicates and the subset parameter — including keep options, NaN behavior, and pre-cleaning.

To drop duplicates by a single Excel column with pandas, load the workbook with pd.read_excel(), call df.drop_duplicates(subset=["column"]), and write the result back. This removes whole rows where the target column repeats, keeping the first occurrence by default — the other columns of the surviving row stay intact.

Every block below runs in order against a sample workbook built in the first step.

Create a sample workbook

Python
import pandas as pd

df = pd.DataFrame({
    "TargetColumn": ["A-100", "A-100", "B-200", "C-300", "B-200"],
    "MetricA": [10, 99, 20, 30, 21],
    "UpdatedAt": ["2024-01-01", "2024-03-01", "2024-01-02",
                  "2024-01-03", "2024-02-15"],
})
df.to_excel("report_input.xlsx", index=False, engine="openpyxl")
print(f"Wrote {len(df)} rows")

The core operation

Python
df = pd.read_excel("report_input.xlsx", engine="openpyxl")

df_clean = df.drop_duplicates(subset=["TargetColumn"], keep="first", ignore_index=True)

df_clean.to_excel("report_output.xlsx", index=False, engine="openpyxl")
print(df_clean)

Five rows collapse to three — one row per distinct TargetColumn.

The parameters that matter

  • subset — the column(s) checked for uniqueness. ["TargetColumn"] evaluates only that column while keeping every other column of the surviving rows.
  • keep — which duplicate survives: "first" (default), "last", or False (drop all rows that have any duplicate).
  • ignore_index — resets the index to 0, 1, 2, …. Set True for clean exports and predictable downstream joins.
Python
# keep="last" retains the final occurrence instead of the first
last = df.drop_duplicates(subset=["TargetColumn"], keep="last", ignore_index=True)
print(last)

# keep=False removes every row whose TargetColumn appears more than once
unique_only = df.drop_duplicates(subset=["TargetColumn"], keep=False)
print(unique_only)

Clean the column before deduplicating

Exact matching is literal: "A-100" and "a-100 " are different values. Manual Excel entry routinely introduces trailing whitespace and inconsistent casing, so normalize the key first or duplicates slip through:

Python
df["TargetColumn"] = df["TargetColumn"].astype(str).str.strip().str.lower()
deduped = df.drop_duplicates(subset=["TargetColumn"], ignore_index=True)
print(deduped["TargetColumn"].tolist())

Keep the highest-value record, not just the first

keep="first" is positional. To keep a meaningful survivor — say the row with the largest MetricA per key — sort first, then drop:

Python
best = (df.sort_values("MetricA", ascending=False)
        .drop_duplicates(subset=["TargetColumn"], keep="first")
        .sort_index())
print(best[["TargetColumn", "MetricA"]])

NaN behavior

pandas treats NaN values as equal to each other for deduplication, so multiple null keys collapse to one. To preserve them, fill with a sentinel before dropping and revert afterward:

Python
import numpy as np

s = pd.DataFrame({"TargetColumn": ["x", np.nan, np.nan, "x"]})
filled = s.assign(TargetColumn=s["TargetColumn"].fillna("__NULL__"))
kept = filled.drop_duplicates(subset=["TargetColumn"])
kept["TargetColumn"] = kept["TargetColumn"].replace("__NULL__", np.nan)
print(kept)

Inspect before you drop

To review what would be removed, build a mask with duplicated() rather than dropping blind:

Python
mask = df.duplicated(subset=["TargetColumn"], keep="first")
removed = df[mask]
print(f"{len(removed)} rows would be removed:")
print(removed)

Resolve conflicting columns with groupby

When duplicates carry different values in other columns and you want a deterministic single row per key, groupby is clearer than drop_duplicates:

Python
resolved = df.groupby("TargetColumn", as_index=False).first()
print(resolved)

Log removal counts in a pipeline

Tracking how many duplicates you remove surfaces upstream issues — repeated exports, sync errors, template drift. Wrap it so a missing column fails loudly:

Python
def dedupe_logged(df: pd.DataFrame, column: str) -> pd.DataFrame:
    if column not in df.columns:
        raise KeyError(f"Column not found: {column}")
    before = len(df)
    out = df.drop_duplicates(subset=[column], ignore_index=True)
    print(f"[INFO] Removed {before - len(out)} duplicate rows on '{column}'")
    return out

result = dedupe_logged(df, "TargetColumn")

This fits inside the broader Cleaning Excel Data with Pandas workflow. For multi-source joins and reshaping, see the Advanced Data Transformation and Cleaning overview.