How to Combine Multiple Excel Files into One in Python
Use pandas.concat() with pathlib to batch-read .xlsx files, align mismatched columns automatically, and export a unified workbook. This approach is the standard for automated reporting pipelines because it handles schema drift, skips empty files, and executes in seconds without manual iteration.
Prerequisites
```bash
pip install pandas openpyxl
```
- Python: 3.8+
- pandas: 2.0+ (column alignment is name-based; the legacy `DataFrame.append` method is removed, so use `pd.concat`)
- Engine: `openpyxl` is required for `.xlsx` I/O. Convert legacy `.xls` files first.
Primary Method: DataFrame Concatenation
```python
import pandas as pd
from pathlib import Path

def combine_excel_files(source_dir: str, output_path: str) -> None:
    """Combine all .xlsx files in a directory into a single workbook."""
    source = Path(source_dir)
    # Exclude Excel temporary lock files (e.g. "~$report.xlsx")
    files = [f for f in source.glob("*.xlsx") if not f.name.startswith("~$")]
    if not files:
        raise FileNotFoundError(f"No .xlsx files found in {source_dir}")

    dfs = []
    for file in files:
        try:
            df = pd.read_excel(file, engine="openpyxl")
            if not df.empty:
                dfs.append(df)
        except Exception as e:
            print(f"Skipping {file.name}: {e}")

    if not dfs:
        raise ValueError("No valid data found across provided files.")

    # Aligns columns by name; ignores original row indices
    combined = pd.concat(dfs, ignore_index=True)
    combined.to_excel(output_path, index=False, engine="openpyxl")
    print(f"Combined {len(dfs)} files into {output_path}")

combine_excel_files("./monthly_reports", "./consolidated_report.xlsx")
```
Fallback: Raw Cell Append (openpyxl)
When source files contain merged cells or heavily inconsistent headers, pandas may not parse them correctly. Bypass DataFrame conversion and append rows directly. This preserves raw cell values but sacrifices automatic type casting and column alignment. (Note that password-protected workbooks cannot be opened by openpyxl either; decrypt those before processing.)
```python
from openpyxl import load_workbook
from pathlib import Path

def fallback_combine(source_dir: str, output_path: str) -> None:
    source = Path(source_dir)
    files = [f for f in source.glob("*.xlsx") if not f.name.startswith("~$")]
    if not files:
        return

    # Use the first workbook, including its header row, as the base
    wb_out = load_workbook(files[0])
    ws_out = wb_out.active
    for file in files[1:]:
        wb_in = load_workbook(file, data_only=True)
        ws_in = wb_in.active
        # Skip header row (min_row=2), append only non-empty rows
        for row in ws_in.iter_rows(min_row=2, values_only=True):
            if any(cell is not None for cell in row):
                ws_out.append(row)
    wb_out.save(output_path)
```
Pipeline Hardening & Troubleshooting
Schema Drift & Column Mismatch
`pd.concat()` aligns by column name, not position. Missing columns fill with `NaN`. Enforce a strict schema post-concatenation:
```python
master_cols = ["date", "region", "revenue", "cost"]
combined = combined.reindex(columns=master_cols)
```
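A minimal sketch of that alignment behavior, using two hypothetical monthly frames (`jan` lacks `cost`, `feb` lacks `date`):

```python
import pandas as pd

# Hypothetical frames with drifting schemas
jan = pd.DataFrame({"date": ["2024-01-31"], "revenue": [100.0]})
feb = pd.DataFrame({"revenue": [120.0], "cost": [80.0]})  # no "date" column

combined = pd.concat([jan, feb], ignore_index=True)  # aligns on column names
# Columns absent from a frame are filled with NaN; reindex then enforces the master schema
master_cols = ["date", "region", "revenue", "cost"]
combined = combined.reindex(columns=master_cols)
```

`reindex` also drops any unexpected extra columns and inserts missing ones (here `region`, all NaN), so every run produces the same layout.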
Data Type Conflicts
Mixed numeric/string columns end up as `object` dtype after concatenation, which silently breaks aggregations. Normalize with explicit casting:
```python
combined = combined.astype({"revenue": "float64", "region": "string"})
```
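When the offending column mixes numbers with placeholder strings, `astype` raises. A gentler sketch (the sample values are illustrative) uses `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN instead of failing:

```python
import pandas as pd

# Hypothetical mixed column as it often arrives from Excel
combined = pd.DataFrame({"revenue": ["100.5", "N/A", 200]})

# errors="coerce" converts unparseable entries to NaN rather than raising
combined["revenue"] = pd.to_numeric(combined["revenue"], errors="coerce")
```

Count the resulting NaNs afterwards so coerced placeholders are logged, not silently dropped.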
Performance & Memory Limits
pandas loads files entirely into RAM. For datasets over 2 GB or 50+ files, avoid repeated I/O: process in batches, write intermediate Parquet files, and export to Excel only at the final step. When building your first reporting pipeline, lock dependency versions in requirements.txt and review Getting Started with Python Excel Automation to configure virtual environments and structured logging that prevent silent data corruption.
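A sketch of that batching pattern (the function names and `staging_dir` are illustrative, and `to_parquet`/`read_parquet` additionally require `pyarrow` or `fastparquet`):

```python
from pathlib import Path
import pandas as pd

def batched(items, size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def combine_in_batches(files, staging_dir, batch_size=10):
    """Concatenate Excel files in batches, staging each batch as Parquet."""
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    parts = []
    for n, batch in enumerate(batched(files, batch_size)):
        part = pd.concat(
            (pd.read_excel(f, engine="openpyxl") for f in batch),
            ignore_index=True,
        )
        part_path = staging / f"part_{n}.parquet"
        part.to_parquet(part_path)  # needs pyarrow or fastparquet installed
        parts.append(part_path)
    # Final step: read the compact Parquet parts once, then write Excel
    return pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)
```

Only one batch of DataFrames is held in memory at a time; the staged Parquet parts are far cheaper to re-read than the source workbooks.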
Validation & Automation
Schedule runs via cron, GitHub Actions, or Windows Task Scheduler, and validate the output programmatically before deployment:
```python
expected_rows = sum(pd.read_excel(f, engine="openpyxl").shape[0] for f in files)
assert combined.shape[0] == expected_rows, "Row count mismatch detected"
```
Formatting & Template Preservation
pandas exports raw data only. To retain conditional formatting, pivot tables, or macros, generate the consolidated data first, then inject values into a pre-built template. This aligns with established patterns for Working with Multiple Excel Sheets in Python, where sheet-level operations require direct workbook manipulation rather than DataFrame abstraction.
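A minimal sketch of the injection step, assuming a pre-built template whose formatting lives outside the data region (`inject_into_template` and its arguments are illustrative, not a library API):

```python
import pandas as pd
from openpyxl import load_workbook

def inject_into_template(df: pd.DataFrame, template_path: str,
                         output_path: str, start_row: int = 2) -> None:
    """Write DataFrame values cell by cell into an existing workbook,
    leaving the template's formatting elsewhere untouched."""
    wb = load_workbook(template_path)  # add keep_vba=True for .xlsm templates
    ws = wb.active
    for r, row in enumerate(df.itertuples(index=False), start=start_row):
        for c, value in enumerate(row, start=1):
            ws.cell(row=r, column=c, value=value)
    wb.save(output_path)
```

Writing values into existing cells preserves their styles; inserting or deleting rows, by contrast, can shift conditional-formatting ranges, so size the template's data region generously.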
Deployment Checklist
- Verify all source files use the `.xlsx` extension
- Confirm the `openpyxl` version matches the pandas compatibility matrix
- Test with empty files, single-row files, and trailing whitespace
- Validate output against expected row count
- Add structured logging to track skipped files and dtype warnings
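The last checklist item can be as simple as routing the script's `print()` calls through the stdlib `logging` module (the logger name and format string are illustrative):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("excel_combine")

# Inside the read loop, replace print() with leveled, structured events:
log.warning("Skipping %s: %s", "april.xlsx", "file is corrupt")
log.info("Combined %d files", 3)
```

Leveled events let a scheduler alert on WARNING and above while INFO lines document each run, so skipped files leave a durable trail instead of scrolling past on stdout.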