Guide
Combine Multiple Excel Files into One in Python
Merge many .xlsx files into one workbook with pandas — glob the folder, concat with name-based column alignment, track the source file, and export the result.
When a folder fills up with monthly exports or per-region submissions, you usually want them stacked into one table. The standard approach reads each .xlsx into a DataFrame and stacks them with pd.concat(), which aligns columns by name so files in slightly different orders still line up. This guide builds the whole pipeline on sample files the first step generates, so every snippet runs in order.
Step 1: Install the toolchain
pip install pandas openpyxl
openpyxl is the engine pandas uses for .xlsx I/O. Convert any legacy .xls files to .xlsx first.
Step 2: Create a folder of sample files
To make the example self-contained, generate three monthly files in a monthly_reports/ directory. Note the third file lists its columns in a different order — pd.concat will still align them by name:
import pandas as pd
from pathlib import Path
src = Path("monthly_reports")
src.mkdir(exist_ok=True)
pd.DataFrame({
"date": ["2024-01-31", "2024-01-31"],
"region": ["North", "South"],
"revenue": [1200.0, 980.0],
}).to_excel(src / "jan.xlsx", index=False)
pd.DataFrame({
"date": ["2024-02-29", "2024-02-29"],
"region": ["North", "West"],
"revenue": [1350.0, 720.0],
}).to_excel(src / "feb.xlsx", index=False)
# Same columns, different order
pd.DataFrame({
"revenue": [1410.0, 1010.0],
"region": ["East", "South"],
"date": ["2024-03-31", "2024-03-31"],
}).to_excel(src / "mar.xlsx", index=False)
print("Created", len(list(src.glob('*.xlsx'))), "files")
Step 3: Find the files with glob
Path.glob("*.xlsx") lists matching files. Filter out Excel's lock files (their names start with ~$) so a workbook left open in Excel does not break the run:
files = sorted(f for f in src.glob("*.xlsx") if not f.name.startswith("~$"))
print([f.name for f in files])
Step 4: Read and concatenate
Read each file into a DataFrame, collect them in a list, then pd.concat with ignore_index=True to renumber the rows. Column alignment is by name, so mar.xlsx's reordered columns slot into place:
frames = []
for f in files:
df = pd.read_excel(f, engine="openpyxl")
if not df.empty:
frames.append(df)
combined = pd.concat(frames, ignore_index=True)
print(combined)
print("Combined shape:", combined.shape)
Step 5: Track which file each row came from
A consolidated table is far more useful when each row remembers its origin. Add a source_file column as you read:
frames = []
for f in files:
df = pd.read_excel(f, engine="openpyxl")
df["source_file"] = f.name # tag every row with its origin
frames.append(df)
combined = pd.concat(frames, ignore_index=True)
print(combined[["date", "region", "revenue", "source_file"]])
Step 6: Validate the row count
Before exporting, confirm the combined frame holds exactly as many rows as the source files did. Catching a mismatch here beats discovering missing data downstream:
expected = sum(pd.read_excel(f, engine="openpyxl").shape[0] for f in files)
assert combined.shape[0] == expected, f"Expected {expected} rows, got {combined.shape[0]}"
print(f"Row count verified: {combined.shape[0]} rows")
Step 7: Export the consolidated workbook
Write the result to a single .xlsx. Use index=False so the DataFrame's row numbers do not become a stray column:
combined.to_excel("consolidated_report.xlsx", index=False, engine="openpyxl")
print("Wrote consolidated_report.xlsx")
Handling schema drift
pd.concat aligns by column name, so a file missing a column gets NaN there, and an extra column appears for every other file as NaN. To pin the output to an exact schema, reindex after concatenating — this drops unexpected columns and guarantees column order:
master_cols = ["date", "region", "revenue", "source_file"]
combined = combined.reindex(columns=master_cols)
print("Final columns:", list(combined.columns))
If different files use different names for the same field (rev vs revenue), rename each file's columns to a canonical set with a rename map before appending it to the list.
Fallback: append raw cells with openpyxl
When files carry merged cells or quirks that trip up pandas, you can copy raw values cell by cell. This keeps the first file's header and appends only data rows from the rest. It preserves raw values but skips pandas' type handling and name-based alignment, so it assumes a consistent column order:
from openpyxl import load_workbook, Workbook
out = Workbook()
ws_out = out.active
first = True
for f in files:
wb_in = load_workbook(f, read_only=True, data_only=True)
ws_in = wb_in.active
start = 1 if first else 2 # keep the header only from the first file
for i, row in enumerate(ws_in.iter_rows(values_only=True), start=1):
if i < start:
continue
if any(cell is not None for cell in row):
ws_out.append(row)
wb_in.close()
first = False
out.save("consolidated_raw.xlsx")
print("Wrote consolidated_raw.xlsx via openpyxl")
Common errors and fixes
| Error | Cause | Fix |
|---|---|---|
FileNotFoundError / empty result | glob pattern or directory wrong | Print the resolved path and list(src.glob("*.xlsx")) to confirm |
Unexpected NaN columns | Column names differ across files | Rename to a canonical schema, then reindex(columns=...) |
PermissionError on read | A source file is open in Excel | Skip ~$ lock files (Step 3); close the workbook |
| Row count assertion fails | An empty or unreadable file was skipped | Log skipped files; decide whether to fail or continue |
Conclusion
The glob-read-concat pipeline is the idiomatic pandas approach: three lines in the common case, three failure modes to guard against (lock files, schema drift, and empty inputs), and a row-count assertion before export that catches mismatches early. If column-name inconsistency is common in your file set, build the reindex step with a master_cols list into the pipeline from the start rather than adding it after the first failure.
Where to go next
- Working with Multiple Excel Sheets in Python — combine across tabs within one workbook.
- Merging and Joining Excel DataFrames — when files relate by key rather than stacking.
- Scheduling Python Excel Scripts with cron — run this consolidation on a schedule.