| name | xlsx-parsing |
| description | Read Microsoft Excel (.xlsx) files robustly with `openpyxl` (or `pandas`). Covers multi-sheet workbooks, header rows, empty cells, merged cells, comma-separated list cells, and converting a sheet to a list-of-dicts the rest of your code can consume. Use when a task input or reference document is an `.xlsx` file rather than JSON/CSV. |
xlsx-parsing
Excel workbooks are the lingua franca of operational documents that nobody bothered to put in a database — playbooks, rate cards, deviation policies, finance models, SLAs. They show up in tasks with three properties that trip up naive readers:
- Multiple sheets, only one of which is the data you actually want.
- Sparse cells — a row that uses a column may sit next to a row that doesn't, leaving
None cells. Empty is meaningful (the rule does not apply), not an error.
- Composite cells — a single cell that contains a comma-separated list, a JSON blob, or a sentence rather than an atomic value.
Treat the workbook as a typed table with declared columns, not a free-form spreadsheet. Read every sheet you need, normalise it to list[dict[str, Any]], then operate on that.
Reading with openpyxl (pure Python, no compiled dependencies)
import openpyxl
wb = openpyxl.load_workbook("workbook.xlsx", data_only=True, read_only=True)
print(wb.sheetnames)
ws = wb["Rules"]
rows = ws.iter_rows(values_only=True)
header = [str(c).strip() if c else "" for c in next(rows)]
records = [dict(zip(header, row)) for row in rows if any(cell is not None for cell in row)]
Notes:
data_only=True returns the cached value of formula cells instead of the formula expression. Without this you may get strings like "=A1+B2".
read_only=True is faster on big workbooks and avoids loading styles you don't need.
- The
any(cell is not None ...) filter drops entirely-blank rows that Excel preserves at the bottom of a sheet.
dict(zip(header, row)) handles trailing empty columns gracefully when a row is shorter than the header.
Reading with pandas (if it's installed)
import pandas as pd
sheets = pd.read_excel("workbook.xlsx", sheet_name=None, dtype=object)
rules_df = sheets["Rules"]
rules_df = rules_df.dropna(how="all")
records = rules_df.where(rules_df.notna(), None).to_dict(orient="records")
pandas is heavier but useful when you want grouping, joins, or numeric aggregation. Either library is fine; do not mix them in the same module.
Empty cells: None is the answer, not an error
A row that does not specify a numeric cap leaves that cell blank. The blank is part of the rule's shape — it means "no cap applies" or "this constraint is not engaged for this rule." Code defensively:
def get(rec, key, default=None):
val = rec.get(key)
return default if val is None or (isinstance(val, str) and not val.strip()) else val
Compare against is None or call .strip() rather than truthiness — 0 and False are valid values that fail truthy tests.
Composite cells
Authors often put list-valued data into a single cell. A cell containing "Delaware, New York, California" is one string, not three rows.
def split_list_cell(value):
if value is None:
return []
return [item.strip() for item in str(value).split(",") if item.strip()]
If you see a cell with curly-brace text, it is probably an embedded JSON document; parse with json.loads. Try the simple split first.
Merged cells
Merged cells appear once in the underlying data; only the top-left cell holds the value, and the rest are None. If a column is intentionally merged for a "section header" effect, fill the value down to recover row-wise records:
last = None
for row in records:
if row["section"] is None:
row["section"] = last
else:
last = row["section"]
If you need to know whether a cell is in a merged range, ws.merged_cells.ranges gives you the list.
Multiple sheets
Use the metadata sheet (often named Metadata, Info, or README) for workbook-level fields, and the data sheet(s) for per-record rows. Read all sheets you need before processing — do not assume the schema of one sheet is described inside another sheet you have not opened.
Putting it together for a configuration-style workbook
import openpyxl
def load_sheet_as_records(wb, sheet_name):
ws = wb[sheet_name]
rows = ws.iter_rows(values_only=True)
header = [str(c).strip() if c else "" for c in next(rows)]
return [
dict(zip(header, row))
for row in rows
if any(cell is not None for cell in row)
]
wb = openpyxl.load_workbook("workbook.xlsx", data_only=True, read_only=True)
metadata = {row[0]: row[1] for row in wb["Metadata"].iter_rows(min_row=2, values_only=True)}
defs = load_sheet_as_records(wb, "Definitions")
rules = load_sheet_as_records(wb, "Rules")
After this, rules[0]["key"], rules[0]["rule_type"], etc. are plain Python values you can branch on. The rest of your code does not need to know the input was Excel.
When not to use this skill
- The file is
.csv — use csv.DictReader directly.
- The file is
.json or .jsonl — use json.loads.
- The file is
.xls (legacy binary) — openpyxl will refuse; use xlrd<2 or convert to .xlsx first.