Guide

Getting Started With Python Excel AutomationQuick guide

How to Combine Multiple Excel Files into One in Python

Use pandas.concat() with pathlib to batch-read .xlsx files, align mismatched columns automatically, and export a unified workbook. This approach is the standard for automated reporting pipelines because it handles schema drift, skips empty files, and executes in seconds without manual iteration.

How to Combine Multiple Excel Files into One in Python

Use pandas.concat() with pathlib to batch-read .xlsx files, align mismatched columns automatically, and export a unified workbook. This approach is the standard for automated reporting pipelines because it handles schema drift, skips empty files, and executes in seconds without manual iteration.

Prerequisites

Bash
      pip install pandas openpyxl

    
  • Python: 3.8+
  • pandas: 2.0+ (column alignment defaults to name-based; legacy sort=False parameter is removed)
  • Engine: openpyxl is mandatory for .xlsx I/O. Convert legacy .xls files first.

Primary Method: DataFrame Concatenation

Python
      import pandas as pd
from pathlib import Path

def combine_excel_files(source_dir: str, output_path: str) -> None:
 """Combine all .xlsx files in a directory into a single workbook."""
 source = Path(source_dir)
 # Exclude Excel temporary lock files
 files = [f for f in source.glob("*.xlsx") if not f.name.startswith("~$")]
 
 if not files:
 raise FileNotFoundError(f"No .xlsx files found in {source_dir}")

 dfs = []
 for file in files:
 try:
 df = pd.read_excel(file, engine="openpyxl")
 if not df.empty:
 dfs.append(df)
 except Exception as e:
 print(f"Skipping {file.name}: {e}")

 if not dfs:
 raise ValueError("No valid data found across provided files.")

 # Aligns columns by name; ignores original row indices
 combined = pd.concat(dfs, ignore_index=True)
 combined.to_excel(output_path, index=False, engine="openpyxl")
 print(f"Combined {len(dfs)} files into {output_path}")

combine_excel_files("./monthly_reports", "./consolidated_report.xlsx")

    

Fallback: Raw Cell Append (openpyxl)

When source files contain merged cells, password protection, or heavily inconsistent headers, pandas fails to parse correctly. Bypass DataFrame conversion and append rows directly. This preserves raw cell values but sacrifices automatic type casting and column alignment.

Python
      from openpyxl import load_workbook
from pathlib import Path

def fallback_combine(source_dir: str, output_path: str) -> None:
 source = Path(source_dir)
 files = [f for f in source.glob("*.xlsx") if not f.name.startswith("~$")]
 if not files:
 return

 wb_out = load_workbook(files[0])
 ws_out = wb_out.active
 
 for file in files[1:]:
 wb_in = load_workbook(file, data_only=True)
 ws_in = wb_in.active
 
 # Skip header row (min_row=2), append only non-empty rows
 for row in ws_in.iter_rows(min_row=2, values_only=True):
 if any(cell is not None for cell in row):
 ws_out.append(row)
 
 wb_out.save(output_path)

    

Pipeline Hardening & Troubleshooting

Schema Drift & Column Mismatchpd.concat() aligns by column name, not position. Missing columns fill with NaN. Enforce a strict schema post-concatenation:

Python
      master_cols = ["date", "region", "revenue", "cost"]
combined = combined.reindex(columns=master_cols)

    

Data Type Conflicts Mixed numeric/string columns trigger DtypeWarning. Suppress with explicit casting:

Python
      combined = combined.astype({"revenue": "float64", "region": "string"})

    

Performance & Memory Limitspandas loads files entirely into RAM. For datasets >2GB or 50+ files, avoid repeated I/O. Process in batches, write to intermediate Parquet files, and export to Excel only at the final step. When building your first reporting pipeline, lock dependency versions in requirements.txt and review Getting Started with Python Excel Automation to configure virtual environments and structured logging that prevents silent data corruption.

Validation & Automation Schedule via cron, GitHub Actions, or Windows Task Scheduler. Validate output programmatically before deployment:

Python
      expected_rows = sum(pd.read_excel(f, engine="openpyxl").shape[0] for f in files)
assert combined.shape[0] == expected_rows, "Row count mismatch detected"

    

Formatting & Template Preservationpandas exports raw data only. To retain conditional formatting, pivot tables, or macros, generate consolidated data first, then inject values into a pre-built template. This aligns with established patterns for Working with Multiple Excel Sheets in Python where sheet-level operations require direct workbook manipulation rather than DataFrame abstraction.

Deployment Checklist

  • Verify all source files use .xlsx extension
  • Confirm openpyxl version matches pandas compatibility matrix
  • Test with empty files, single-row files, and trailing whitespace
  • Validate output against expected row count
  • Add structured logging to track skipped files and dtype warnings