Getting Started with Python Excel Automation

Learn how to read, transform, and write Excel files with Python using pandas and openpyxl — the libraries, the pipeline pattern, and runnable starter code.

If you build the same spreadsheet by hand every week, Python can do it for you in seconds and produce the exact same result every time. This guide is the on-ramp for Python developers who are tired of manual copy-paste reporting: which library to reach for, how a small automation fits together, and a complete read-transform-write example you can paste and run today. Every code block below executes in order against a sample workbook created on this page, so you can follow along without any data of your own.

What you will learn

This page is the map for everything else on the site. It covers the whole journey at a high level, then hands off to focused guides for each step:

Picking a library — pandas, openpyxl, xlsxwriter, and xlwings, and the one rule that tells them apart.
Reading and writing workbooks, sheets, and ranges — loading data in and getting formatted results back out.
Cleaning and reshaping tabular data — the advanced data transformation and cleaning work that turns raw exports into report-ready tables.
Formatting and charting — bold headers, number formats, and native charts, covered in depth under formatting and charting Excel reports with Python.
Formulas, pivots, and charts from code — pushing calculations into the workbook itself.
Validation and safe in-place edits — failing loudly and never corrupting a file mid-write.
Scheduling and packaging — turning a script into an unattended job, the heart of automating reporting workflows.

Read it top to bottom once, then treat each heading as a jumping-off point into the deeper guide it links to.

The Python and Excel library landscape

Python has no single "Excel library." Four tools cover almost every task, and they overlap by design: most real scripts combine pandas with one of the others.

Library	Best for	Needs Excel installed?	Runs on Linux servers?
pandas	Reading, transforming, and writing tabular data	No	Yes
openpyxl	Cell-level control: styling, formulas, charts, editing existing files	No	Yes
xlsxwriter	Fast write-only export with rich formatting and charts	No	Yes
xlwings	Driving a live Excel app: running macros, refreshing data	Yes	No (Windows/macOS)

A useful rule of thumb:

Reach for pandas when you think in terms of rows, columns, and aggregations. It delegates the actual .xlsx parsing to an engine — usually openpyxl. See Reading Excel Files with Pandas and Writing DataFrames to Excel with Pandas.
Reach for openpyxl when you need to touch individual cells — apply formatting, write formulas, or append rows to a workbook that already exists. See Using openpyxl for Excel File Manipulation.
Reach for xlsxwriter when you are building a brand-new file and want the fastest write path plus the richest formatting API — though you cannot use it to reopen and edit an existing workbook. The trade-offs are laid out in openpyxl vs xlsxwriter vs pandas.ExcelWriter.
Reach for xlwings only when a real, installed copy of Excel must stay in the loop — for example to run an existing VBA macro. See Automating Excel with xlwings Basics and its run-a-macro-from-Python example.

For unattended jobs on a Linux server or CI runner, stick to pandas plus openpyxl or xlsxwriter: they are pure Python and never launch Excel.

Install the libraries

pandas does not bundle an Excel reader/writer; install an engine alongside it. openpyxl covers .xlsx reading and writing and is all you need to follow this guide. Add xlsxwriter when you want its faster write path later.

Bash

pip install pandas openpyxl

Pin these versions in a requirements.txt so a scheduled job behaves the same after every redeploy — more on that in the scheduling section below.
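To decide what to pin, it helps to see what is actually installed. The sketch below is one possible way to do that with the standard library's importlib.metadata. The helper name installed_versions is made up for this example, and the printed lines are only a starting point for a requirements.txt, not a replacement for pip freeze.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# xlsxwriter is optional for this guide, so it may well be reported as missing.
report = installed_versions(["pandas", "openpyxl", "xlsxwriter"])
for name, version in report.items():
    if version:
        print(f"{name}=={version}")
    else:
        print(f"# {name} is not installed")
```

Lines printed as name==version can be pasted into requirements.txt as they are.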

Reading and writing workbooks, sheets, and ranges

Nearly every Excel automation is the same three steps in a row:

Read a source workbook (or CSV, or database) into a pandas DataFrame.
Transform it — filter, compute new columns, aggregate.
Write the result back out to a formatted .xlsx file.

The rest of this page walks that pattern end to end. First, create a sample workbook so the later steps have something to read:

Python

import pandas as pd

orders = pd.DataFrame({
    "Order_ID": [1001, 1002, 1003, 1004, 1005, 1006],
    "Region": ["North", "South", "North", "West", "South", "West"],
    "Quantity": [3, 1, 5, 2, 4, 1],
    "Unit_Price": [19.99, 49.50, 19.99, 8.75, 12.00, 49.50],
})
orders.to_excel("orders.xlsx", sheet_name="Orders", index=False)
print("Wrote orders.xlsx with", len(orders), "rows")

Step 1: Read the source data

read_excel loads a sheet into a DataFrame. By default it reads the first sheet and treats the first row as the header; the example below names the sheet explicitly:

Python

df = pd.read_excel("orders.xlsx", sheet_name="Orders")
print(df.head())

That one line hides a lot of options — targeting a named sheet, choosing which columns to load, and controlling data types. Those are covered step by step in how to read Excel with pandas, and when you only need a few fields, reading specific columns from Excel shows how usecols keeps memory down.
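As a rough illustration of those options, the sketch below loads only two columns from a named sheet and forces Order_ID to be read as text. It builds its own small workbook in a temporary folder so it runs on its own; the file name and the two-row sample are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "orders.xlsx"

    # A two-row stand-in for the sample workbook used on this page.
    pd.DataFrame({
        "Order_ID": [1001, 1002],
        "Region": ["North", "South"],
        "Quantity": [3, 1],
        "Unit_Price": [19.99, 49.50],
    }).to_excel(path, sheet_name="Orders", index=False)

    subset = pd.read_excel(
        path,
        sheet_name="Orders",                 # target a sheet by name
        usecols=["Order_ID", "Quantity"],    # load only the columns you need
        dtype={"Order_ID": str},             # keep identifiers as text
    )

print(subset)
print(subset.dtypes)
```

Reading identifiers as text is a judgment call: it avoids accidental arithmetic on order numbers and keeps any leading zeros that are stored as text in the source.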

Step 2: Transform it

Compute a revenue column, then aggregate to one row per region. These are vectorized operations — no Python loops over rows:

Python

df["Revenue"] = df["Quantity"] * df["Unit_Price"]

summary = (
    df.groupby("Region", as_index=False)
      .agg(Orders=("Order_ID", "count"),
           Units=("Quantity", "sum"),
           Revenue=("Revenue", "sum"))
      .sort_values("Revenue", ascending=False)
)
print(summary)

Real-world data is rarely this clean. When your source has stray blank rows, duplicate keys, or missing values, the transform step grows — that is the subject of the advanced data transformation and cleaning guides.

Step 3: Write the report

Use an ExcelWriter context manager to put both the detail and the summary into one workbook, each on its own sheet. The context manager saves and closes the file for you:

Python

with pd.ExcelWriter("regional_report.xlsx", engine="openpyxl") as writer:
    summary.to_excel(writer, sheet_name="Summary", index=False)
    df.to_excel(writer, sheet_name="Detail", index=False)

print("Saved regional_report.xlsx")

That index=False is worth a habit: without it, pandas writes the DataFrame's row numbers as a leading column, which rarely belongs in a report. See Write Pandas DataFrame to Excel Without Index for the details. When a job produces many tabs — one per region, month, or client — working with multiple Excel sheets in Python covers splitting and combining multiple files into one workbook.
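A one-tab-per-region export can be sketched by looping over a groupby inside the same ExcelWriter. This is a minimal version under simple assumptions: the region names are short and contain no characters Excel rejects in sheet names, and the output file name by_region.xlsx is invented for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

orders = pd.DataFrame({
    "Order_ID": [1001, 1002, 1003, 1004, 1005, 1006],
    "Region": ["North", "South", "North", "West", "South", "West"],
    "Quantity": [3, 1, 5, 2, 4, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "by_region.xlsx"

    # One to_excel call per group, each with its own sheet_name.
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for region, group in orders.groupby("Region"):
            group.to_excel(writer, sheet_name=str(region), index=False)

    # sheet_name=None reads every sheet into a dict keyed by sheet name.
    sheets = pd.read_excel(path, sheet_name=None)

for name, frame in sheets.items():
    print(name, len(frame))
```

Excel limits sheet names to 31 characters, so real group labels may need shortening before they are used as tab names.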

Cleaning and transforming tabular data with pandas

The transform step is where most of the real work lives, because exported data is messy. Three fixes come up on almost every job. First, drop rows that are entirely empty — a common artifact of ranges copied out of a spreadsheet:

Python

df = df.dropna(how="all")

Second, coerce columns to the right type so numbers actually add up. A price column read as text will not sum correctly: depending on its contents, sum may concatenate the strings or raise a TypeError.

Python

df["Unit_Price"] = pd.to_numeric(df["Unit_Price"], errors="coerce")

Third, remove duplicate records before you aggregate, keeping the first occurrence of each order:

Python

df = df.drop_duplicates(subset="Order_ID", keep="first")

Each of these has edge cases worth understanding: the difference between dropping fully-blank versus partially-blank rows in removing blank rows with pandas, the subset and keep options in dropping duplicates from an Excel column, and filling gaps sensibly in handling missing data in Excel reports. When a report draws on more than one source, merging and joining Excel DataFrames shows how to line them up on a common key.
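The three fixes chain together naturally. The sketch below runs them on a small, deliberately messy frame (the values are invented for illustration) and shows the difference between a fully blank row, which dropna(how="all") removes, and partially blank rows, which it keeps.

```python
import pandas as pd

# Invented sample: one duplicate order, one fully blank row,
# one unparseable price, and one missing region.
raw = pd.DataFrame({
    "Order_ID": [1001, 1001, 1002, None, 1003],
    "Region": ["North", "North", "South", None, None],
    "Unit_Price": ["19.99", "19.99", "n/a", None, "8.75"],
})

cleaned = (
    raw.dropna(how="all")  # removes only the row where every column is empty
       .assign(Unit_Price=lambda d: pd.to_numeric(d["Unit_Price"], errors="coerce"))
       .drop_duplicates(subset="Order_ID", keep="first")
       .assign(
           Order_ID=lambda d: d["Order_ID"].astype("int64"),
           Region=lambda d: d["Region"].fillna("Unknown"),
       )
       .reset_index(drop=True)
)

print(cleaned)
print("Prices that failed to parse:", int(cleaned["Unit_Price"].isna().sum()))
```

Filling a missing region with "Unknown" is only one possible policy; whether to fill, drop, or fail on a gap depends on what the report is for.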

Formatting, styling, and number formats with openpyxl

pandas writes plain data. When you want a styled header or fitted column widths, reopen the file with openpyxl and edit cells directly. Here we bold the header row of the summary sheet and set sensible widths:

Python

from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill

wb = load_workbook("regional_report.xlsx")
ws = wb["Summary"]

header_font = Font(bold=True, color="FFFFFF")
header_fill = PatternFill(start_color="2F5496", end_color="2F5496", fill_type="solid")
for cell in ws[1]:
    cell.font = header_font
    cell.fill = header_fill

# Fit each column to its widest value
for column_cells in ws.columns:
    width = max((len(str(c.value)) for c in column_cells if c.value is not None), default=0)
    ws.column_dimensions[column_cells[0].column_letter].width = width + 2

wb.save("regional_report.xlsx")
print("Formatted Summary sheet")

Formatting is deep enough to fill its own section of the site. For the full toolkit — fonts, fills, borders, and alignment — see styling Excel cells with openpyxl, with focused walk-throughs for setting column width and row height and freezing the header row so it stays visible when scrolling.

A styled header is only half of a readable report — the numbers underneath need formats too. A revenue column should display as currency and a date column as a date, without changing the underlying values. That is a display concern handled by number-format codes, covered in applying number and date formats, with recipes for currency cells and date cells. To highlight outliers automatically — say, negative revenue in red — reach for conditional formatting with openpyxl.
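As a small illustration of that separation between value and display, the sketch below sets number_format on two cells and then reloads the file to show that the stored value is unchanged. The dollar sign in the format code and the sample figures are assumptions for the example; use whatever currency and layout your report needs.

```python
import tempfile
from datetime import date
from pathlib import Path

from openpyxl import Workbook, load_workbook

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "formats.xlsx"

    wb = Workbook()
    ws = wb.active
    ws.append(["Region", "Revenue", "Report_Date"])
    ws.append(["North", 159.92, date(2024, 1, 15)])

    # Format codes change how Excel displays the cell, not what it stores.
    ws["B2"].number_format = '"$"#,##0.00'
    ws["C2"].number_format = "yyyy-mm-dd"
    wb.save(path)

    reloaded = load_workbook(path).active
    stored_value = reloaded["B2"].value
    stored_format = reloaded["B2"].number_format
    date_format = reloaded["C2"].number_format

print(stored_value, stored_format, date_format)
```

Because the underlying number is untouched, formulas and pivots that reference the cell keep working on the raw value.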

Formulas, pivot tables, and charts from code

Sometimes a report should carry live calculations rather than pre-computed numbers, so a reader can change an input and watch totals update. openpyxl writes a formula as a string that begins with =; Excel evaluates it when the file opens:

Python

from openpyxl import load_workbook

wb = load_workbook("regional_report.xlsx")
ws = wb["Summary"]
last = ws.max_row + 1
ws.cell(row=last, column=1, value="Total")
ws.cell(row=last, column=4, value=f"=SUM(D2:D{last - 1})")
wb.save("regional_report.xlsx")

For summarising, compute the pivot in pandas and export the finished table; see creating pivot tables from Excel data and how to export a pandas pivot table to Excel, formatted. To turn the summary into a chart the reader can see at a glance, openpyxl builds native Excel charts that stay fully editable, with no screenshots involved. Start with creating charts in Excel with openpyxl, then the specifics of a bar chart or a line chart. Add a company logo or image and the output looks hand-built.
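One way the pivot-then-chart flow might look is sketched below: pandas computes revenue per region, writes it to a sheet, and openpyxl then adds a bar chart that points at those cells. The sheet name, chart title, and anchor cell are arbitrary choices made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd
from openpyxl import load_workbook
from openpyxl.chart import BarChart, Reference

orders = pd.DataFrame({
    "Region": ["North", "South", "North", "West", "South", "West"],
    "Quantity": [3, 1, 5, 2, 4, 1],
    "Unit_Price": [19.99, 49.50, 19.99, 8.75, 12.00, 49.50],
})
orders["Revenue"] = orders["Quantity"] * orders["Unit_Price"]

# The pivot is computed in pandas; Excel only receives the finished table.
pivot = orders.pivot_table(index="Region", values="Revenue", aggfunc="sum").reset_index()

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pivot_report.xlsx"
    pivot.to_excel(path, sheet_name="Pivot", index=False)

    wb = load_workbook(path)
    ws = wb["Pivot"]

    chart = BarChart()
    chart.title = "Revenue by region"
    # Data includes the header row so the series gets its name from it.
    data = Reference(ws, min_col=2, min_row=1, max_row=ws.max_row)
    categories = Reference(ws, min_col=1, min_row=2, max_row=ws.max_row)
    chart.add_data(data, titles_from_data=True)
    chart.set_categories(categories)
    ws.add_chart(chart, "D2")  # top-left corner of the chart

    wb.save(path)
    saved = path.exists()

print(pivot)
```

The ranges are derived from ws.max_row rather than hard-coded, so the chart still covers the whole table when the number of regions changes.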

Validation, error handling, and safe in-place edits

A script that runs unattended must fail loudly, not silently produce a wrong or empty file. Two habits matter most.

Check your inputs before you touch them. Confirm the source file exists and has the columns you expect, so a bad input raises a clear message instead of a confusing traceback deep in the transform:

Python

from pathlib import Path

source = Path("orders.xlsx")
if not source.exists():
    raise FileNotFoundError(f"Expected input at {source.resolve()}")

df = pd.read_excel(source, sheet_name="Orders")
required = {"Order_ID", "Region", "Quantity", "Unit_Price"}
missing = required - set(df.columns)
if missing:
    raise ValueError(f"Source is missing columns: {sorted(missing)}")

Never edit a file in place by overwriting the original mid-run. If the process dies halfway through a save, you can be left with a corrupt workbook and no original to fall back on. Write to a temporary file, then swap it into place atomically so readers only ever see a complete file:

Python

import os
from pathlib import Path

out = Path("regional_report.xlsx")
tmp = out.with_suffix(".xlsx.tmp")
wb.save(tmp)          # write the full file under a temp name first
os.replace(tmp, out)  # atomic on the same filesystem

When you must add to an existing workbook rather than regenerate it — appending today's rows to a running log, for instance — appending data to an existing Excel sheet with openpyxl shows the safe pattern, and reading a single cell value with openpyxl covers pulling one value out without loading the whole sheet.
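A minimal sketch of that append pattern, combined with the temp-file swap described above, could look like this. The run_log.xlsx file, its Log sheet, and the two columns are hypothetical; the example creates them first so it can run on its own.

```python
import os
import tempfile
from pathlib import Path

from openpyxl import Workbook, load_workbook

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "run_log.xlsx"

    # Stand-in for a workbook that already exists (hypothetical layout).
    seed = Workbook()
    seed_ws = seed.active
    seed_ws.title = "Log"
    seed_ws.append(["Run", "Rows_Written"])
    seed_ws.append(["run-1", 6])
    seed.save(log_path)

    # Append one row, save under a temporary name, then swap it into place.
    wb = load_workbook(log_path)
    ws = wb["Log"]
    ws.append(["run-2", 8])  # append() writes to the first empty row
    tmp_path = log_path.with_name(log_path.name + ".tmp")
    wb.save(tmp_path)
    os.replace(tmp_path, log_path)

    # Read one cell back to confirm the append landed where expected.
    check = load_workbook(log_path)["Log"]
    row_count = check.max_row
    last_value = check.cell(row=row_count, column=2).value

print(row_count, last_value)
```

If the save fails partway, only the .tmp file is affected and the original log is still intact.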

Scheduling and packaging automations

Once a script works by hand, the payoff is running it on a schedule so the report simply appears. A few habits keep scheduled jobs reliable:

Use absolute paths. A scheduler's working directory is rarely the script's folder, so "orders.xlsx" may not resolve. Build paths from a known base instead:

Python

from pathlib import Path

base_dir = Path("/srv/reports/data")  # a fixed, known location
source = base_dir / "orders.xlsx"
print("Will read:", source)

Log row counts at ingestion and output, so a run that reads zero rows or writes an empty file is obvious in the logs.
Pin your dependencies. Record pandas and openpyxl versions in a requirements.txt so the job behaves the same after a redeploy.
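The row-count habit needs very little code. The following is one possible shape for it using the standard logging module; the logger name regional_report and the helper check_rows are invented for this sketch, and raising on zero rows is a policy choice rather than a requirement.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("regional_report")


def check_rows(stage, row_count):
    """Log the row count for a pipeline stage and fail loudly if it is zero."""
    log.info("%s: %d rows", stage, row_count)
    if row_count == 0:
        raise ValueError(f"{stage} produced zero rows; refusing to continue")
    return row_count


# Typical use: call once after reading and once before writing,
# passing len(df) and len(summary) from the pipeline.
check_rows("ingest", 6)
check_rows("summary", 3)
```

With this in place, an empty source file stops the job with a clear message instead of producing an empty report.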

When you are ready to put a script on a timer, the automating reporting workflows section takes it from here: scheduling Python Excel scripts with cron on Linux, the equivalent on Windows Task Scheduler, or in-process with APScheduler. To deliver the finished file, email Excel reports with smtplib or convert the workbook to PDF for recipients who just want to read it. And so non-technical colleagues get a consistent layout every run, generate reports from a template and populate multi-sheet dashboards from your data.

Common first-time errors

ModuleNotFoundError: No module named 'openpyxl' — pandas found no Excel engine. Install one with pip install openpyxl.
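ValueError: Excel file format cannot be determined means the file is not really an .xlsx (often a renamed CSV or HTML export). Confirm the real format and read it with the matching function, such as pd.read_csv for a CSV; pass engine="openpyxl" explicitly only when you know the file is a genuine workbook.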
PermissionError: [Errno 13] Permission denied on write — the target file is open in Excel. Close it, or write to a new filename.
A leading unnamed column in your output — you forgot index=False on to_excel.
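When the file format error appears, a quick look at the first few bytes usually settles what the file really is. The sketch below relies on two well-known signatures: .xlsx files are ZIP archives and start with PK, while legacy .xls files are OLE2 compound documents. The function name sniff_container is made up for the example, and a ZIP signature only shows the container type, not that the contents are a valid workbook.

```python
import tempfile
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"                         # .xlsx / .xlsm container
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls container


def sniff_container(path):
    """Classify a file by its leading bytes: 'zip', 'ole2', or 'other'."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    return "other"


with tempfile.TemporaryDirectory() as tmp:
    # A CSV that someone renamed to .xlsx.
    fake = Path(tmp) / "export.xlsx"
    fake.write_text("Order_ID,Region\n1001,North\n", encoding="utf-8")
    kind = sniff_container(fake)

print(kind)  # prints "other"
```

A result of "other" for a file named .xlsx is a strong hint to open it in a text editor and try pd.read_csv or pd.read_html instead.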

Formulas, tables and validation: the next layer

Reading and writing values covers most first scripts. The layer above it is the one that makes a spreadsheet feel like a spreadsheet — formulas that recalculate, tables readers can filter, and rules that stop bad data being typed in the first place. All three are written from Python in much the same way as a cell value:

Python

from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation
from openpyxl.worksheet.table import Table, TableStyleInfo

wb = Workbook()
ws = wb.active
ws.title = "Orders"
ws.append(["Order_ID", "Region", "Quantity", "Unit_Price", "Line_Total"])

for order_id, region, qty, price in [
    (2001, "North", 4, 19.99),
    (2002, "South", 2, 49.50),
    (2003, "West", 7, 12.25),
]:
    ws.append([order_id, region, qty, price])
    row = ws.max_row
    ws[f"E{row}"] = f"=C{row}*D{row}"          # a formula, written as a string

table = Table(displayName="Orders", ref=f"A1:E{ws.max_row}")
table.tableStyleInfo = TableStyleInfo(name="TableStyleMedium9", showRowStripes=True)
ws.add_table(table)                             # filter + banding for the reader

regions = DataValidation(type="list", formula1='"North,South,West,Central"', showErrorMessage=True)
ws.add_data_validation(regions)
regions.add("B2:B500")                          # a dropdown for whoever types next

wb.save("starter_report.xlsx")

Three ideas are worth taking from that snippet. A formula is just a string beginning with =, and openpyxl never calculates it — the numbers appear when Excel opens the file, which is why a fresh workbook read back in Python shows formula text rather than results. A table is a named object with its own range, so =SUM(Orders[Line_Total]) keeps working as rows are added. And validation is metadata for Excel's interface: it constrains typing, not your own writes, so a Python job still has to check the values it produces.
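The first of those points is easy to check for yourself. In the sketch below a formula is written and the file is read back twice: once normally, and once with data_only=True, which asks openpyxl for the cached result instead of the formula. Since no spreadsheet program has opened the file, there is no cached result to return.

```python
import tempfile
from pathlib import Path

from openpyxl import Workbook, load_workbook

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "formula_check.xlsx"

    wb = Workbook()
    ws = wb.active
    ws.append(["Quantity", "Unit_Price", "Line_Total"])
    ws.append([4, 19.99, "=A2*B2"])
    wb.save(path)

    # Default load: formulas come back as the text that was written.
    as_written = load_workbook(path).active["C2"].value

    # data_only=True: returns the value Excel cached at its last save, if any.
    cached = load_workbook(path, data_only=True).active["C2"].value

print(repr(as_written))  # '=A2*B2'
print(repr(cached))      # None, because Excel has never calculated this file
```

Once the workbook has been opened and saved in Excel, the data_only read would return the calculated number instead of None.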

Each of those becomes its own topic as your scripts grow: working with Excel formulas in Python for the calculation layer, creating Excel tables and autofilters for the presentation layer, and validating Excel data with Python for the quality layer.

Where scripts go wrong first

The failures that catch people out early are rarely exotic, and the pattern behind them is the same: a spreadsheet is a file, not a live application. Nothing is written until you save, nothing is calculated until a spreadsheet program opens it, and every write replaces whatever it covers. Knowing that pattern shortens the learning curve considerably. Scripts that respect those three facts tend to keep working; scripts that assume Excel is somehow present tend to fail in ways that are hard to read.

Choosing where a job should run

A script that works on your machine has to run somewhere eventually, and the choice shapes which library you can use. openpyxl and pandas are pure Python and run anywhere — a laptop, a container, a scheduled server job — because they read and write the file format directly. xlwings drives a real copy of Excel, which unlocks macros, .xls conversion and live recalculation, and which also means the job needs Excel installed and a desktop session to run in.

For anything scheduled, that difference is decisive: choose the pure-Python route and the job can run at 6am on a headless server. Keep xlwings for interactive work on a desktop, or for the small number of tasks that genuinely need Excel's own engine. When a scheduled job must produce calculated values, compute them in pandas and write literals rather than hoping something will recalculate the file later.
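Writing literals instead of formulas can be as simple as adding a total row in pandas before export. The sketch below uses a summary frame shaped like the one built earlier on this page, with figures typed in by hand for the example; rounding the total to two decimals is an assumption about how the report should present money.

```python
import pandas as pd

summary = pd.DataFrame({
    "Region": ["North", "South", "West"],
    "Orders": [2, 2, 2],
    "Units": [8, 5, 3],
    "Revenue": [159.92, 97.50, 67.00],
})

# Compute the totals in Python so the file never depends on recalculation.
total_row = {
    "Region": "Total",
    "Orders": int(summary["Orders"].sum()),
    "Units": int(summary["Units"].sum()),
    "Revenue": round(float(summary["Revenue"].sum()), 2),
}
report = pd.concat([summary, pd.DataFrame([total_row])], ignore_index=True)

print(report)
```

The resulting frame can go straight to to_excel, and anything that reads the file later, including another Python job, sees real numbers in the total row.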

A first project worth building

The fastest way to consolidate all of this is a small end-to-end job rather than more isolated snippets. Read a source workbook, clean the columns you care about, aggregate them, and write a two-sheet report — a summary people read and a detail sheet they can filter. That single exercise touches reading, type coercion, aggregation, writing, formatting and file layout, which is most of what day-to-day spreadsheet automation consists of.

Keep it deliberately small at first: three columns, one grouping, one total. The habits that make it robust — deriving ranges from the data, saving once at the end, checking the output has rows before declaring success — are much easier to establish on a script you can hold in your head than to retrofit onto one that grew.
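For orientation, a first project of that size might be sketched as a single function like the one below. Everything specific here is an assumption made for the example: the function name build_report, the three columns, and the choice to refuse to write a file when no rows survive cleaning.

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_report(source, out):
    """Read, clean, aggregate, and write a two-sheet report; return summary row count."""
    df = pd.read_excel(source)

    # Clean: coerce the numeric column, then drop rows missing what we group on.
    df["Quantity"] = pd.to_numeric(df["Quantity"], errors="coerce")
    df = df.dropna(subset=["Region", "Quantity"])

    summary = df.groupby("Region", as_index=False).agg(Units=("Quantity", "sum"))
    if summary.empty:
        raise ValueError("No rows survived cleaning; not writing a report")

    # Save once, at the end, with both sheets in one workbook.
    with pd.ExcelWriter(out, engine="openpyxl") as writer:
        summary.to_excel(writer, sheet_name="Summary", index=False)
        df.to_excel(writer, sheet_name="Detail", index=False)
    return len(summary)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.xlsx"
    out = Path(tmp) / "report.xlsx"

    # Three columns, one bad value, one grouping.
    pd.DataFrame({
        "Order_ID": [1, 2, 3],
        "Region": ["North", "South", "North"],
        "Quantity": [3, "x", 5],
    }).to_excel(source, index=False)

    rows_written = build_report(source, out)
    sheets = pd.read_excel(out, sheet_name=None)

print(rows_written, list(sheets))
```

The unparseable quantity is dropped during cleaning, so only one region reaches the summary, which is exactly the kind of outcome worth logging on a real run.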

Key takeaways

Python has no single Excel library: use pandas for tabular data, openpyxl for cell-level control and editing existing files, xlsxwriter for fast write-only exports, and xlwings only when a live Excel install must run macros.
Almost every automation is the same read-transform-write pipeline: get that skeleton right once and every future report is a variation on it.
Keep display concerns (number formats, styling, charts) separate from data concerns (cleaning, aggregating) so each stays simple.
Make unattended jobs trustworthy: validate inputs, write to a temp file and swap atomically, use absolute paths, log row counts, and pin your dependencies.
Each heading above is a doorway — follow its links into the focused guides when you need the depth.

Frequently asked questions

Do I need Excel installed to run these scripts? No. pandas and openpyxl are pure Python and read and write .xlsx files directly, so they run on a headless Linux server or CI runner. Only xlwings requires an installed copy of Excel, and only on Windows or macOS.

Why do I get ModuleNotFoundError: No module named 'openpyxl' when I only imported pandas? pandas does not bundle an Excel reader/writer — it delegates parsing to an engine. Install one alongside pandas with pip install openpyxl.

Why is there an extra unnamed column in my output file? You omitted index=False on to_excel, so pandas wrote the DataFrame's row numbers as a leading column. Pass index=False whenever the row index isn't meaningful data.

Can I write multiple sheets into one workbook? Yes. Use an ExcelWriter context manager and call to_excel once per sheet with a different sheet_name; the context manager saves and closes the file for you.

When should I reach for openpyxl instead of pandas? Use pandas when you think in rows, columns, and aggregations. Switch to openpyxl when you need cell-level control — bold headers, fitted column widths, formulas, or editing an existing workbook — which pandas does not expose.

Which library is fastest for writing large workbooks? For write-only jobs, xlsxwriter is usually the quickest and has the richest formatting API; openpyxl is the only option that can edit an existing file in place. pandas.ExcelWriter wraps either engine, so you get the same speed with less code when you are exporting DataFrames.

Continue with the guides that build directly on this page:

Reading Excel Files with Pandas — sheet and column targeting, dtype control, and a defensive reader.
Writing DataFrames to Excel with Pandas — multi-sheet exports, number formats, and appending to existing files.
Using openpyxl for Excel File Manipulation — when you need cell-level control beyond what pandas offers.
Working with Multiple Excel Sheets in Python — splitting and combining workbooks.
Automating Excel with xlwings Basics — driving a live Excel install and running macros.

Then broaden out into the rest of the site:

Advanced Data Transformation and Cleaning — the messy middle of the pipeline.
Formatting and Charting Excel Reports with Python — make the output look hand-built.
Automating Reporting Workflows — schedule, email, and package the whole thing.