Guide

Advanced Data Transformation And CleaningDeep dive

Handling Missing Data in Excel Reports with Pandas

Handle missing data in Excel reports with pandas: map blanks to NaN, profile gaps, choose type-aware fills for numeric, categorical, and date columns, then validate and export.

Excel exports rarely use a single, clean marker for "no value." One column has empty cells, another has the string "N/A", a third has a stray space. Left alone, these break aggregations or silently skew totals. This guide builds a deterministic workflow: normalize blanks to NaN, profile where the gaps are, fill each column type appropriately, then validate before export. Every block is runnable and shares one namespace, so paste them in order.

This is one stage of Advanced Data Transformation and Cleaning. For parsing raw workbooks, see Cleaning Excel Data with Pandas.

Create a sample workbook

The sample mixes the placeholder styles you meet in the wild — empty cells, "N/A", a dash, and a blank space:

Python
import pandas as pd

raw = pd.DataFrame({
    "region": ["North", "N/A", "South", " ", "West"],
    "sales_rep": ["Ana", "Ben", None, "Dan", "Eve"],
    "revenue": [12000, None, 9800, 7200, "-"],
    "units_sold": [120, 40, None, 36, 60],
    "transaction_date": ["2024-01-05", None, "2024-01-07", "2024-01-08", None],
})
raw.to_excel("monthly_report.xlsx", index=False)

Step 1: Load and normalize blanks to NaN

read_excel only recognizes a default set of NA tokens. Pass na_values so placeholders like "N/A" and "-" become real NaN, and convert whitespace-only strings explicitly:

Python
na_indicators = ["N/A", "NA", "-", "null", "NULL", "#N/A"]
df = pd.read_excel("monthly_report.xlsx", na_values=na_indicators).copy()

# Whitespace-only cells aren't caught by na_values — replace them
df = df.replace(r"^\s*$", pd.NA, regex=True)
print(df)

Step 2: Profile the missingness

Before filling anything, quantify the gaps so you know which columns need attention and how severe each is:

Python
missing_count = df.isna().sum()
missing_pct = (df.isna().mean() * 100).round(1)
profile = pd.DataFrame({"missing_count": missing_count, "missing_pct": missing_pct})
print(profile[profile["missing_count"] > 0])

Step 3: Coerce types before aggregating

Step 1 already mapped "-" to NaN, but coerce revenue and units_sold to numeric anyway as a guard — any stray non-numeric value becomes NaN and joins the gaps to fill:

Python
for col in ["revenue", "units_sold"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
print(df.dtypes)

Step 4: Fill each column by type

Different columns deserve different fills. Numeric gaps take a representative statistic (median resists outliers); categorical gaps take an explicit label; dates get parsed and then filled in chronological order:

Python
# Numeric: median per column
df["revenue"] = df["revenue"].fillna(df["revenue"].median())
df["units_sold"] = df["units_sold"].fillna(df["units_sold"].median())

# Categorical: explicit labels, not silent guesses
df["region"] = df["region"].fillna("Unassigned")
df["sales_rep"] = df["sales_rep"].fillna("Pending Assignment")

# Temporal: parse, sort, then forward/back fill
df["transaction_date"] = pd.to_datetime(df["transaction_date"], errors="coerce")
df = df.sort_values("transaction_date")
df["transaction_date"] = df["transaction_date"].ffill().bfill()

print(df)

The method= argument to fillna() was removed in pandas 3.0 — use the dedicated ffill() and bfill() methods for forward and backward fills. For the full range of fillna patterns, see Fill Missing Values in Excel with Pandas Fillna.

Step 5: Validate and export

Confirm no NaN survives in the columns your report depends on, then write the cleaned workbook:

Python
critical_cols = ["revenue", "transaction_date", "region"]
remaining = df[critical_cols].isna().sum().sum()
if remaining:
    raise ValueError(f"{remaining} NaN values remain in critical columns")

df.to_excel("cleaned_monthly_report.xlsx", index=False, engine="openpyxl")
print("Exported cleaned_monthly_report.xlsx")

A reusable cleaning function

The same steps, packaged for a pipeline. It only fills columns that exist, so it tolerates schema drift between months:

Python
import pandas as pd

def clean_reporting_excel(input_path, output_path):
    na_map = ["N/A", "NA", "-", "null", "NULL", "#N/A"]
    df = pd.read_excel(input_path, na_values=na_map).copy()
    df = df.replace(r"^\s*$", pd.NA, regex=True)

    for col in ["revenue", "units_sold"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    fill_strategy = {
        "revenue": df["revenue"].median() if "revenue" in df.columns else 0,
        "units_sold": df["units_sold"].median() if "units_sold" in df.columns else 0,
        "region": "Unassigned",
        "sales_rep": "Pending Assignment",
    }
    active = {k: v for k, v in fill_strategy.items() if k in df.columns}
    df = df.fillna(active)

    if "transaction_date" in df.columns:
        df["transaction_date"] = pd.to_datetime(df["transaction_date"], errors="coerce")
        df = df.sort_values("transaction_date")
        df["transaction_date"] = df["transaction_date"].ffill().bfill()

    df.to_excel(output_path, index=False, engine="openpyxl")
    return df

result = clean_reporting_excel("monthly_report.xlsx", "report_clean.xlsx")
print(f"Cleaned {len(result)} rows")

Common errors and fixes

fillna() ignores blank cells. Excel often exports blanks as empty strings, which pandas keeps as valid text. Run df.replace(r"^\s*$", pd.NA, regex=True) after loading so the blanks become NaN.

Filling a mixed column turns numbers into strings. A column holding numbers and "-" reads as object; a numeric fill then coerces everything to text. Run pd.to_numeric(df[col], errors="coerce") first.

sum() or mean() still returns NaN. Check df.dtypes — an unconverted object column or a remaining NaN propagates. Coerce types and confirm the fill covered every gap.

Dates appear as serial numbers like 45215. Excel stores dates as serials. Let read_excel parse them, or convert numeric date columns with pd.to_datetime(col, origin="1899-12-30", unit="D").

Conclusion

The key discipline is to treat every missing-value decision as an explicit choice, not a silent default. Normalize blanks to NaN first so all gaps are visible, profile the missingness to understand scale and pattern, fill by column type with a documented rationale, and validate before export so a scheduling job surfaces gaps rather than silently writing a broken workbook.

Where to go next