Guide

Getting Started With Python Excel AutomationDeep dive

Combine Multiple Excel Files into One in Python

Merge many .xlsx files into one workbook with pandas — glob the folder, concat with name-based column alignment, track the source file, and export the result.

When a folder fills up with monthly exports or per-region submissions, you usually want them stacked into one table. The standard approach reads each .xlsx into a DataFrame and stacks them with pd.concat(), which aligns columns by name so files in slightly different orders still line up. This guide builds the whole pipeline on sample files the first step generates, so every snippet runs in order.

Step 1: Install the toolchain

Bash
pip install pandas openpyxl

openpyxl is the engine pandas uses for .xlsx I/O. Convert any legacy .xls files to .xlsx first.

Step 2: Create a folder of sample files

To make the example self-contained, generate three monthly files in a monthly_reports/ directory. Note the third file lists its columns in a different order — pd.concat will still align them by name:

Python
import pandas as pd
from pathlib import Path

src = Path("monthly_reports")
src.mkdir(exist_ok=True)

pd.DataFrame({
    "date": ["2024-01-31", "2024-01-31"],
    "region": ["North", "South"],
    "revenue": [1200.0, 980.0],
}).to_excel(src / "jan.xlsx", index=False)

pd.DataFrame({
    "date": ["2024-02-29", "2024-02-29"],
    "region": ["North", "West"],
    "revenue": [1350.0, 720.0],
}).to_excel(src / "feb.xlsx", index=False)

# Same columns, different order
pd.DataFrame({
    "revenue": [1410.0, 1010.0],
    "region": ["East", "South"],
    "date": ["2024-03-31", "2024-03-31"],
}).to_excel(src / "mar.xlsx", index=False)

print("Created", len(list(src.glob('*.xlsx'))), "files")

Step 3: Find the files with glob

Path.glob("*.xlsx") lists matching files. Filter out Excel's lock files (their names start with ~$) so a workbook left open in Excel does not break the run:

Python
files = sorted(f for f in src.glob("*.xlsx") if not f.name.startswith("~$"))
print([f.name for f in files])

Step 4: Read and concatenate

Read each file into a DataFrame, collect them in a list, then pd.concat with ignore_index=True to renumber the rows. Column alignment is by name, so mar.xlsx's reordered columns slot into place:

Python
frames = []
for f in files:
    df = pd.read_excel(f, engine="openpyxl")
    if not df.empty:
        frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined)
print("Combined shape:", combined.shape)

Step 5: Track which file each row came from

A consolidated table is far more useful when each row remembers its origin. Add a source_file column as you read:

Python
frames = []
for f in files:
    df = pd.read_excel(f, engine="openpyxl")
    df["source_file"] = f.name          # tag every row with its origin
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined[["date", "region", "revenue", "source_file"]])

Step 6: Validate the row count

Before exporting, confirm the combined frame holds exactly as many rows as the source files did. Catching a mismatch here beats discovering missing data downstream:

Python
expected = sum(pd.read_excel(f, engine="openpyxl").shape[0] for f in files)
assert combined.shape[0] == expected, f"Expected {expected} rows, got {combined.shape[0]}"
print(f"Row count verified: {combined.shape[0]} rows")

Step 7: Export the consolidated workbook

Write the result to a single .xlsx. Use index=False so the DataFrame's row numbers do not become a stray column:

Python
combined.to_excel("consolidated_report.xlsx", index=False, engine="openpyxl")
print("Wrote consolidated_report.xlsx")

Handling schema drift

pd.concat aligns by column name, so a file missing a column gets NaN there, and an extra column appears for every other file as NaN. To pin the output to an exact schema, reindex after concatenating — this drops unexpected columns and guarantees column order:

Python
master_cols = ["date", "region", "revenue", "source_file"]
combined = combined.reindex(columns=master_cols)
print("Final columns:", list(combined.columns))

If different files use different names for the same field (rev vs revenue), rename each file's columns to a canonical set with a rename map before appending it to the list.

Fallback: append raw cells with openpyxl

When files carry merged cells or quirks that trip up pandas, you can copy raw values cell by cell. This keeps the first file's header and appends only data rows from the rest. It preserves raw values but skips pandas' type handling and name-based alignment, so it assumes a consistent column order:

Python
from openpyxl import load_workbook, Workbook

out = Workbook()
ws_out = out.active
first = True

for f in files:
    wb_in = load_workbook(f, read_only=True, data_only=True)
    ws_in = wb_in.active
    start = 1 if first else 2          # keep the header only from the first file
    for i, row in enumerate(ws_in.iter_rows(values_only=True), start=1):
        if i < start:
            continue
        if any(cell is not None for cell in row):
            ws_out.append(row)
    wb_in.close()
    first = False

out.save("consolidated_raw.xlsx")
print("Wrote consolidated_raw.xlsx via openpyxl")

Common errors and fixes

ErrorCauseFix
FileNotFoundError / empty resultglob pattern or directory wrongPrint the resolved path and list(src.glob("*.xlsx")) to confirm
Unexpected NaN columnsColumn names differ across filesRename to a canonical schema, then reindex(columns=...)
PermissionError on readA source file is open in ExcelSkip ~$ lock files (Step 3); close the workbook
Row count assertion failsAn empty or unreadable file was skippedLog skipped files; decide whether to fail or continue

Conclusion

The glob-read-concat pipeline is the idiomatic pandas approach: three lines in the common case, three failure modes to guard against (lock files, schema drift, and empty inputs), and a row-count assertion before export that catches mismatches early. If column-name inconsistency is common in your file set, build the reindex step with a master_cols list into the pipeline from the start rather than adding it after the first failure.

Where to go next