Guide

Advanced Data Transformation And CleaningDeep dive

Cleaning Excel Data With Pandas: A Production Workflow

Turn messy Excel exports into reliable reporting inputs with pandas: load as text, normalize headers and types, drop noise, deduplicate, validate, and export.

Raw Excel exports rarely arrive analysis-ready: inconsistent headers, hidden whitespace, placeholder strings, duplicate rows, and numbers stored as text all break downstream reports. Cleaning that data in pandas gives you a scriptable, version-controlled, testable alternative to manual spreadsheet edits. This guide builds a linear cleaning pipeline — ingest, normalize, denoise, deduplicate, validate, export — one stage at a time, extending the Advanced Data Transformation and Cleaning overview.

Every block runs in order against a sample workbook created in the first step.

Install dependencies

Bash
pip install pandas openpyxl numpy

Create a sample workbook

This workbook deliberately contains the problems you meet in the wild: messy header casing, whitespace, a placeholder "N/A", a fully blank row, a duplicate order, a negative amount, and amounts stored as text:

Python
import pandas as pd

raw = pd.DataFrame({
    " Order ID ": ["1001", "1002", "1003", "1003", "1004", None],
    "Transaction Date": ["2024-01-05", "2024-01-06", "2024-01-07",
                         "2024-01-07", "2019-12-31", None],
    "Status": ["pending", "Complete", "PENDING", "PENDING", "complete", None],
    "Amount": ["$1,200.50", "980.00", "450.00", "450.00", "-75.00", None],
})
raw.to_excel("report_input.xlsx", sheet_name="Data", index=False)
print(f"Wrote {len(raw)} rows")

Step 1: Load everything as text

Load with dtype=str so pandas does not guess a type per cell — guessing is what produces object columns mixing strings, numbers, and dates. We coerce deliberately in Step 2.

Python
import logging
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

def load_excel_data(file_path: str, sheet_name=0) -> pd.DataFrame:
    df = pd.read_excel(
        file_path,
        sheet_name=sheet_name,
        header=0,
        engine="openpyxl",
        dtype=str,        # prevent premature, per-cell type coercion
    )
    logging.info(f"Loaded {len(df)} rows from {file_path}")
    return df

df = load_excel_data("report_input.xlsx", sheet_name="Data")
print(df.columns.tolist())

read_excel() has no skip_blank_lines argument (that belongs to read_csv()); remove blank rows with dropna(how="all") in Step 3 instead.

Step 2: Standardize headers and types

Normalize column names to predictable snake_case, then coerce each column to its intended type. Stripping currency symbols and separators before pd.to_numeric keeps the cast from collapsing to all-NaN.

Python
def standardize_schema(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df.columns = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )

    # amount: strip "$" and "," before numeric cast
    if "amount" in df.columns:
        df["amount"] = (df["amount"].str.replace(r"[$,]", "", regex=True)
                        .pipe(pd.to_numeric, errors="coerce"))

    if "transaction_date" in df.columns:
        df["transaction_date"] = pd.to_datetime(df["transaction_date"], errors="coerce")

    if "status" in df.columns:
        df["status"] = df["status"].astype("category")

    return df

df = standardize_schema(df)
print(df.dtypes)

Step 3: Drop structural noise

Excel exports carry empty rows from copy-paste, template padding, and footer notes. Drop fully blank rows, then drop rows missing a critical identifier, and strip residual whitespace from text columns.

Python
def purge_noise(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    initial = len(df)

    df = df.dropna(how="all")                       # fully blank rows
    df = df.dropna(subset=["order_id", "transaction_date"])  # missing keys

    text_cols = df.select_dtypes(include=["object", "string"]).columns
    for col in text_cols:
        df[col] = df[col].str.strip()

    logging.info(f"Purged {initial - len(df)} noisy/empty rows")
    return df

df = purge_noise(df)
print(f"{len(df)} rows remain")

Step 4: Deduplicate and normalize values

Duplicates come from repeated exports and overlapping date ranges. Sort first so the survivor is deterministic, then drop on the business key. For a focused treatment of single-column deduplication, see Pandas Drop Duplicates from Excel Column.

Python
def deduplicate_records(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Newest first, so keep="first" retains the most recent record per key
    df = df.sort_values("transaction_date", ascending=False)
    df = df.drop_duplicates(subset=["order_id"], keep="first")

    # Normalize the status vocabulary (string ops on a category need .astype(str))
    df["status"] = (df["status"].astype(str).str.upper()
                    .replace({"PENDING": "OPEN", "COMPLETE": "CLOSED"}))
    return df

df = deduplicate_records(df)
print(df[["order_id", "status"]])

Step 5: Validate and derive

Run business-rule checks and compute the columns the report needs. Logging the row counts you drop gives you an audit trail when a scheduled run produces unexpected totals.

Python
def validate_and_prepare(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    neg = df["amount"] < 0
    if neg.any():
        logging.warning(f"Dropping {int(neg.sum())} rows with negative amounts")
        df = df[~neg]

    # Keep only in-scope dates
    df = df[df["transaction_date"] >= pd.Timestamp("2020-01-01")]

    df["fiscal_quarter"] = df["transaction_date"].dt.quarter
    df["fiscal_year"] = df["transaction_date"].dt.year
    return df

df = validate_and_prepare(df)
print(df)

Step 6: Export the cleaned data

Write the result for the next stage. CSV is universally interoperable; for a styled workbook, use to_excel with the openpyxl engine.

Python
def export_clean_data(df: pd.DataFrame, output_path: str):
    df.to_csv(output_path, index=False)
    logging.info(f"Exported {len(df)} rows to {output_path}")

export_clean_data(df, "report_clean.csv")
df.to_excel("report_clean.xlsx", index=False, engine="openpyxl")
print("Export complete")

For large static datasets you read repeatedly, Parquet (via pip install pyarrow) compresses well and preserves dtypes — swap to_csv for df.to_parquet(path, index=False).

Conclusion

A reliable cleaning pipeline follows the same order every time: load as text, normalize headers and types, drop structural noise, deduplicate on the business key, validate business rules, and export. Keeping each stage in its own function means you can log row counts at every boundary, test each step in isolation, and add new rules without touching the rest of the pipeline. The result is a script you can re-run on the next month's export and trust to produce the same clean output.

Common errors and fixes

ValueError: could not convert string to float — currency symbols, thousands separators, or trailing spaces in a numeric column. Strip them first:

Python
s = pd.Series(["$1,200.50", "980.00"])
cleaned = pd.to_numeric(s.str.replace(r"[$,]", "", regex=True), errors="coerce")
print(cleaned.tolist())

Footer notes or merged cells inflate the column count — restrict parsing to the real data region with usecols and skipfooter:

Python
subset = pd.read_excel("report_input.xlsx", usecols="A:D", skipfooter=0,
                       engine="openpyxl")
print(subset.shape)

MemoryError on a large workbook — declare dtypes up front and drop columns you do not need; convert high-cardinality strings to category to shrink memory.

Ambiguous dates (MM/DD vs DD/MM) — coerce explicitly and set dayfirst to match the source:

Python
dates = pd.to_datetime(pd.Series(["05/01/2024", "31/12/2019"]),
                       dayfirst=True, errors="coerce")
print(dates.tolist())

Where to go next