| name | policyengine-simulation-mechanics |
| description | ALWAYS LOAD THIS SKILL before writing any policyengine.py microsimulation code.
Contains correct import paths, environment setup, dataset loading, and analysis patterns.
Triggers: "write a script", "policyengine.py", "microsimulation script", "run a simulation",
"load the dataset", "FRS", "EFRS", "enhanced FRS", "CPS", "enhanced CPS",
"by income decile", "by tenure", "by region", "energy spending", "domestic energy",
"household net income", "output_dataset", "ensure_datasets", "uk_datasets", "us_datasets",
"import datasets", "from policyengine", "Simulation(dataset=", "uk_latest", "us_latest",
"plotly", "analysis script", "decile breakdown", "percentile", "groupby", "weighted",
"mean", "median", "p25", "p75", "tenure type", "income band", "policy reform script".
|
PolicyEngine Simulation Mechanics
This skill covers advanced patterns for working with policyengine.py simulations, including caching, result access, and entity mapping.
CRITICAL: Environment Setup
Before writing any code, check the environment. The policyengine.py package must be installed in the project's .venv.
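# Option 1: run the script through uv from the repo root
cd /path/to/policyengine.py
uv run python script.py
# Option 2: activate the virtual environment, then run Python directly
source .venv/bin/activate
python script.py
# Install the country extras if the package is missing
uv pip install -e ".[uk]"
uv pip install -e ".[us]"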
If from policyengine.core import Simulation fails:
cd /path/to/policyengine.py
uv pip install -e ".[uk]"
CRITICAL: Correct Import Paths
Only these imports exist — do not guess others:
from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import (
uk_latest,
uk_model,
PolicyEngineUKDataset,
UKYearData,
create_datasets,
load_datasets,
ensure_datasets,
)
from policyengine.tax_benefit_models.us import (
us_latest,
PolicyEngineUSDataset,
ensure_datasets,
)
from policyengine.outputs.aggregate import Aggregate, AggregateType
from policyengine.outputs.change_aggregate import ChangeAggregate, ChangeAggregateType
from policyengine.utils.plotting import COLORS, format_fig
There is NO:
policyengine.core.dataset_registry
policyengine.datasets
policyengine.core.dataset_version.DatasetVersion.list()
UK Datasets
Loading UK datasets
Use ensure_datasets() — it returns a dict[str, PolicyEngineUKDataset], building files in ./data/ on first run and loading from disk on subsequent runs.
WARNING: from policyengine.tax_benefit_models.uk import datasets gives you the Python submodule, not a dict. Never index it like a dict.
from policyengine.tax_benefit_models.uk import ensure_datasets
uk = ensure_datasets(
datasets=[
"hf://policyengine/policyengine-uk-data/frs_2023_24.h5",
"hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5",
],
years=[2026],
data_folder="./data",
)
efrs = uk["enhanced_frs_2023_24_2026"]
frs = uk["frs_2023_24_2026"]
Dict key format: "{stem}_{year}" e.g. "enhanced_frs_2023_24_2026"
To force regeneration: delete ./data/ and call ensure_datasets() again.
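The key format above can be sketched as a small helper. This is only an illustration of the "{stem}_{year}" rule stated in this section; `dataset_key` is a hypothetical name, not part of policyengine.py.

```python
from pathlib import PurePosixPath


def dataset_key(dataset_url: str, year: int) -> str:
    """Build the expected dict key from a dataset URL and a year.

    Assumes the key is the file stem followed by "_" and the year,
    as described in this section. Hypothetical helper for illustration.
    """
    stem = PurePosixPath(dataset_url).stem  # file name without ".h5"
    return f"{stem}_{year}"


key = dataset_key(
    "hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5", 2026
)
print(key)
```

A helper like this can make a failed lookup easier to diagnose: compare the computed key with the keys actually returned by ensure_datasets().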
Loading US datasets
from policyengine.tax_benefit_models.us import ensure_datasets
us = ensure_datasets(
datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
years=[2026],
data_folder="./data",
)
ecps = us["enhanced_cps_2024_2026"]
Default US dataset: enhanced_cps_2024.h5 (Enhanced CPS), years 2024–2028.
Inspecting available variables
Always inspect the dataset to find available variable names — never guess:
from policyengine.tax_benefit_models.uk import ensure_datasets
uk = ensure_datasets(years=[2026], data_folder="./data")
d = uk["enhanced_frs_2023_24_2026"]
print("household:", list(d.data.household.columns))
print("person: ", list(d.data.person.columns))
print("benunit: ", list(d.data.benunit.columns))
Input variables are what's in the raw survey data — demographics, reported incomes, consumption, wealth, flags.
Computed variables (household_net_income, income_tax, universal_credit, etc.) are not in the raw dataset — they are calculated by the simulation. To see what's available after running:
from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import uk_latest
sim = Simulation(dataset=d, tax_benefit_model_version=uk_latest)
sim.run()
print("household (post-sim):", list(sim.output_dataset.data.household.columns))
print("person (post-sim): ", list(sim.output_dataset.data.person.columns))
The computed variables available are defined by uk_latest.entity_variables — inspect this to see the full list without running a simulation:
from policyengine.tax_benefit_models.uk import uk_latest
print(uk_latest.entity_variables)
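To see which columns a simulation added, the pre- and post-simulation column lists can be compared. The sketch below uses short made-up column lists in place of real dataset columns, and `computed_columns` is a hypothetical helper name.

```python
def computed_columns(before: list[str], after: list[str]) -> list[str]:
    """Return columns present after the simulation but not in the raw input."""
    seen = set(before)
    return [name for name in after if name not in seen]


# Illustrative column lists only; real ones come from
# d.data.household.columns and sim.output_dataset.data.household.columns.
input_columns = ["household_id", "region", "rent"]
post_sim_columns = [
    "household_id",
    "region",
    "rent",
    "household_net_income",
    "household_tax",
]

added = computed_columns(input_columns, post_sim_columns)
print(added)
```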
For Analysts: Core Concepts
When running simulations with policyengine.py (the microsimulation package, not the API client), you work with three key components:
Simulation.ensure() - Smart caching to avoid redundant computation
simulation.output_dataset.data - Accessing calculated results
map_to_entity() - Converting data between entity levels (person ↔ household)
Note: This is for microsimulation with policyengine.py, not the policyengine Python API client (which uses Simulation(situation=...)).
Simulation Lifecycle
The Four Methods
from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import uk_latest
simulation = Simulation(
dataset=dataset,
tax_benefit_model_version=uk_latest,
)
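simulation.run()     # always recompute from scratch
simulation.ensure()  # reuse cached or saved results, run only if neither exists
simulation.save()    # persist results to disk
simulation.load()    # read previously saved results from disk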
When to Use Each
run(): Use when you need fresh results or parameters changed
ensure(): Use for iterative development (checks cache → disk → run)
save(): Use to persist large simulation results
load(): Use to resume from previous session
How ensure() Works
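def ensure(self):
    # 1. In-memory cache: reuse results from this session
    cached = _cache.get(self.id)
    if cached:
        self.output_dataset = cached.output_dataset
        return
    # 2. Disk: load saved results; 3. otherwise run and save
    try:
        self.tax_benefit_model_version.load(self)
    except Exception:
        self.run()
        self.save()
    _cache.add(self.id, self)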
Performance impact:
- First call: Full simulation runtime (seconds to minutes)
- Same session: Instant (in-memory cache)
- New session: Fast (disk load, no recomputation)
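The three cases above (first call, same session, new session) can be reproduced with a toy stand-in. This is a minimal sketch of the cache, then disk, then run order, assuming JSON files for persistence; `ToySimulation` and `_CACHE` are hypothetical and are not the policyengine.py classes.

```python
import json
import tempfile
from pathlib import Path

_CACHE = {}  # hypothetical in-memory cache keyed by simulation id


class ToySimulation:
    """Toy stand-in for a simulation, used only to show the lookup order."""

    def __init__(self, sim_id, folder):
        self.id = sim_id
        self.path = Path(folder) / f"{sim_id}.json"
        self.output = None
        self.run_count = 0

    def run(self):
        self.run_count += 1
        self.output = {"household_net_income": [30000, 45000]}

    def save(self):
        self.path.write_text(json.dumps(self.output))

    def load(self):
        # Raises FileNotFoundError when nothing has been saved yet.
        self.output = json.loads(self.path.read_text())

    def ensure(self):
        cached = _CACHE.get(self.id)
        if cached is not None:
            self.output = cached.output
            return "memory"
        try:
            self.load()
            source = "disk"
        except FileNotFoundError:
            self.run()
            self.save()
            source = "run"
        _CACHE[self.id] = self
        return source


with tempfile.TemporaryDirectory() as folder:
    first = ToySimulation("baseline", folder)
    sources = [first.ensure(), first.ensure()]
    _CACHE.clear()  # simulate starting a new Python session
    second = ToySimulation("baseline", folder)
    sources.append(second.ensure())
    total_runs = first.run_count + second.run_count

print(sources, total_runs)
```

The toy version returns where the result came from, which is convenient for demonstration; the ensure() shown above returns nothing.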
Example: Reusing Baseline Across Reforms
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure()
baseline.save()
for reform in [reform1, reform2, reform3]:
    baseline.ensure()
    reform_sim = Simulation(
        dataset=dataset,
        tax_benefit_model_version=uk_latest,
        policy=reform,
    )
    reform_sim.run()
Accessing Results: output_dataset.data
After running a simulation, all calculated variables are in simulation.output_dataset.data.
Structure (UK Example)
simulation.run()
output = simulation.output_dataset.data
output.person
output.benunit
output.household
US Entity Structure
output.person
output.tax_unit
output.spm_unit
output.family
output.marital_unit
output.household
Available Variables
Each dataframe contains input variables + calculated variables:
print(output.person.columns)
print(output.household.columns)
print(output.benunit.columns)
Direct Data Access
incomes = output.household[["household_id", "household_net_income"]]
high_earners = output.person[output.person["employment_income"] > 100000]
mean_income = output.household["household_net_income"].mean()
total_tax = output.household["household_tax"].sum()
first_hh_income = output.household["household_net_income"].iloc[0]
MicroDataFrame Automatic Weighting
All operations respect survey weights automatically:
total_population = output.person["person_weight"].sum()
mean_income = output.household["household_net_income"].mean()
poverty_rate = output.household["in_absolute_poverty_bhc"].mean()
by_region = output.household.groupby("region")["household_net_income"].mean()
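To make clear what "weighted" means in these calls, the sketch below computes a weighted sum, mean, and rate by hand on three made-up households. It is a library-free illustration of the arithmetic, not the microdf implementation; the helper names are hypothetical.

```python
def weighted_sum(values, weights):
    """Sum of value * weight across rows."""
    return sum(v * w for v, w in zip(values, weights))


def weighted_mean(values, weights):
    """Weighted sum divided by total weight."""
    return weighted_sum(values, weights) / sum(weights)


# Made-up data: the third household represents twice as many real households.
net_income = [10000.0, 20000.0, 30000.0]
household_weight = [1.0, 1.0, 2.0]
in_poverty = [True, False, False]

unweighted_mean = sum(net_income) / len(net_income)
mean_income = weighted_mean(net_income, household_weight)
# A rate is the weighted mean of a 0/1 flag.
poverty_rate = weighted_mean([float(flag) for flag in in_poverty], household_weight)

print(unweighted_mean, mean_income, poverty_rate)
```

The gap between the two means (20,000 versus 22,500 here) is why unweighted pandas or NumPy calls on the same columns can give different answers.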
Entity Mapping with map_to_entity()
Convert data between entity levels (e.g., sum person income to household, or broadcast household rent to persons).
Method Signature
output.map_to_entity(
source_entity: str,
target_entity: str,
columns: list[str] = None,
values: np.ndarray = None,
how: str = "sum"
)
Aggregation Methods
Person → Group (aggregation):
how="sum" (default): Sum values within each group
how="first": Take first value in each group
how="mean": Average values
how="max": Maximum value
how="min": Minimum value
Group → Person (expansion):
how="project" (default): Broadcast group value to all members
how="divide": Split group value equally among members
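The aggregation and expansion methods listed above can be illustrated without the library. The following is a plain-Python sketch of the semantics on a three-person, two-household example; `aggregate_to_group` and `expand_to_person` are hypothetical names and say nothing about how map_to_entity() is implemented.

```python
from collections import Counter, defaultdict


def aggregate_to_group(person_group_ids, person_values, group_ids, how="sum"):
    """Person -> group: combine member values within each group."""
    grouped = defaultdict(list)
    for gid, value in zip(person_group_ids, person_values):
        grouped[gid].append(value)
    reducers = {
        "sum": sum,
        "max": max,
        "min": min,
        "first": lambda vals: vals[0],
        "mean": lambda vals: sum(vals) / len(vals),
    }
    reduce = reducers[how]
    return [reduce(grouped[gid]) for gid in group_ids]


def expand_to_person(person_group_ids, group_ids, group_values, how="project"):
    """Group -> person: broadcast or split each group's value."""
    value_by_group = dict(zip(group_ids, group_values))
    if how == "project":
        return [value_by_group[gid] for gid in person_group_ids]
    if how == "divide":
        sizes = Counter(person_group_ids)
        return [value_by_group[gid] / sizes[gid] for gid in person_group_ids]
    raise ValueError(f"Unknown method: {how}")


# Made-up data: persons 0 and 1 live in household 1, person 2 in household 2.
person_household_id = [1, 1, 2]
employment_income = [100.0, 50.0, 70.0]
household_id = [1, 2]
rent = [600.0, 400.0]

hh_income = aggregate_to_group(person_household_id, employment_income, household_id)
hh_max = aggregate_to_group(person_household_id, employment_income, household_id, "max")
person_rent = expand_to_person(person_household_id, household_id, rent)
person_rent_share = expand_to_person(person_household_id, household_id, rent, "divide")
print(hh_income, hh_max, person_rent, person_rent_share)
```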
Example 1: Sum Person Income to Household
household_employment = output.map_to_entity(
source_entity="person",
target_entity="household",
columns=["employment_income"],
how="sum"
)
print(household_employment.columns)
Example 2: Broadcast Household Rent to Persons
person_rent = output.map_to_entity(
source_entity="household",
target_entity="person",
columns=["rent"],
how="project"
)
print(person_rent.columns)
Example 3: Divide Household Value Per Person
person_savings_share = output.map_to_entity(
source_entity="household",
target_entity="person",
columns=["total_savings"],
how="divide"
)
Example 4: Map Custom Values
import numpy as np
custom_tax = np.where(
output.person["employment_income"] > 50000,
output.person["income_tax"] * 1.1,
output.person["income_tax"]
)
household_custom_tax = output.map_to_entity(
source_entity="person",
target_entity="household",
values=custom_tax,
how="sum"
)
Example 5: Multi-Column Mapping
household_incomes = output.map_to_entity(
source_entity="person",
target_entity="household",
columns=[
"employment_income",
"self_employment_income",
"pension_income",
"savings_interest_income"
],
how="sum"
)
Example 6: Cross-Entity Mapping (Group to Group)
household_uc = output.map_to_entity(
source_entity="benunit",
target_entity="household",
columns=["universal_credit", "child_benefit"],
how="sum"
)
Automatic Mapping in Aggregate Classes
The Aggregate and ChangeAggregate classes automatically handle entity mapping when the variable and target entity don't match:
from policyengine.outputs.aggregate import Aggregate, AggregateType
total_tax = Aggregate(
simulation=simulation,
variable="income_tax",
entity="household",
aggregate_type=AggregateType.SUM,
)
total_tax.run()
Common Patterns
Pattern 1: Compare Baseline vs Reform
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure()
reform = Simulation(
dataset=dataset,
tax_benefit_model_version=uk_latest,
policy=reform_policy
)
reform.ensure()
baseline_out = baseline.output_dataset.data
reform_out = reform.output_dataset.data
baseline_income = baseline_out.household["household_net_income"]
reform_income = reform_out.household["household_net_income"]
difference = reform_income - baseline_income
winners = (difference > 0).sum()
losers = (difference < 0).sum()
unchanged = (difference == 0).sum()
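The winner and loser counts can also be written out by hand, which makes the weighting explicit and allows a small tolerance so that floating-point noise is not counted as a gain or loss. This is a sketch on made-up numbers; `weighted_outcomes` and the one-penny tolerance are assumptions, not part of policyengine.py.

```python
def weighted_outcomes(baseline, reform, weights, tolerance=0.01):
    """Weighted counts of households that gain, lose, or are unchanged.

    Changes smaller than `tolerance` (assumed here to be one penny)
    are treated as unchanged.
    """
    counts = {"winners": 0.0, "losers": 0.0, "unchanged": 0.0}
    for base, new, weight in zip(baseline, reform, weights):
        change = new - base
        if change > tolerance:
            counts["winners"] += weight
        elif change < -tolerance:
            counts["losers"] += weight
        else:
            counts["unchanged"] += weight
    return counts


# Made-up household incomes and weights for illustration.
baseline_income = [20000.0, 35000.0, 50000.0, 80000.0]
reform_income = [20500.0, 35000.004, 49000.0, 80000.0]
household_weight = [1000.0, 2000.0, 1500.0, 500.0]

outcomes = weighted_outcomes(baseline_income, reform_income, household_weight)
print(outcomes)
```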
Pattern 2: Calculate Custom Derived Variable
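person_data = output.person.copy()
# (income tax + NI) / employment income is an average rate on earnings, not a marginal rate
person_data["avg_tax_rate"] = (
    (person_data["income_tax"] + person_data["national_insurance"])
    / person_data["employment_income"].clip(lower=1)
) * 100
# Highest individual rate within each household
household_avg_tax_rate = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    values=person_data["avg_tax_rate"].values,
    how="max",
)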
Pattern 3: Extract Subset for Analysis
london_hh = output.household[output.household["region"] == "LONDON"]
households_with_children = output.person.groupby("person_household_id")["age"].apply(
lambda ages: (ages < 18).any()
)
london_ids = set(london_hh["household_id"])
hh_with_kids_ids = set(households_with_children[households_with_children].index)
target_ids = london_ids & hh_with_kids_ids
subset_hh = output.household[output.household["household_id"].isin(target_ids)]
subset_persons = output.person[output.person["person_household_id"].isin(target_ids)]
Pattern 4: Reuse Baseline Across Multiple Reforms
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure()
baseline.save()
from policyengine.outputs.change_aggregate import ChangeAggregate, ChangeAggregateType
reforms = [reform1, reform2, reform3]
results = {}
for reform in reforms:
    # Served from the in-memory cache after the first call
    baseline.ensure()
    reform_sim = Simulation(
        dataset=dataset,
        tax_benefit_model_version=uk_latest,
        policy=reform,
    )
    reform_sim.run()
    # Change in total household tax between baseline and reform
    revenue = ChangeAggregate(
        baseline_simulation=baseline,
        reform_simulation=reform_sim,
        variable="household_tax",
        aggregate_type=ChangeAggregateType.SUM,
    )
    revenue.run()
    results[reform.name] = revenue.result
Direct Data Analysis (without Aggregate)
For custom analyses (decile breakdowns, percentiles, groupby), work directly with output_dataset.data after running the simulation. This is often simpler than using Aggregate.
Full working example: energy spending by income decile and tenure type
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import PolicyEngineUKDataset, uk_latest
dataset = PolicyEngineUKDataset(
name="Enhanced FRS 2026",
description="EFRS 2026",
filepath="./data/enhanced_frs_2023_24_year_2026.h5",
year=2026,
)
dataset.load()
simulation = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
simulation.ensure()
hh = simulation.output_dataset.data.household
hh["income_decile"] = pd.qcut(
hh["household_net_income"],
q=10,
labels=[f"D{i}" for i in range(1, 11)],
)
stats = (
hh.groupby(["income_decile", "tenure_type"])["domestic_energy_consumption"]
.agg(
mean="mean",
p25=lambda x: np.percentile(x, 25),
p75=lambda x: np.percentile(x, 75),
)
.reset_index()
)
Key points:
- simulation.output_dataset.data.household is a MicroDataFrame with weights
- domestic_energy_consumption is household-level (annual £)
- tenure_type values: OWNED_OUTRIGHT, OWNED_WITH_MORTGAGE, RENT_FROM_COUNCIL, RENT_PRIVATELY, RENT_FROM_HA
- Income deciles must be computed from simulation output (not raw data)
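One caveat on the example above: pd.qcut and np.percentile, as called there, treat every row equally, so the deciles and percentiles they produce are unweighted. Where weighted versions are wanted, something like the following could be used. It is a library-free sketch under a simple "cumulative weight share" definition; `weighted_quantile` and `weighted_deciles` are hypothetical helpers, and other quantile conventions exist.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (0 < q <= 1)."""
    pairs = sorted(zip(values, weights))
    total = sum(weight for _, weight in pairs)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= q * total:
            return value
    return pairs[-1][0]


def weighted_deciles(values, weights):
    """Decile (1 to 10) per row, so each decile holds about 10% of total weight."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = sum(weights)
    deciles = [0] * len(values)
    running = 0.0
    for i in order:
        # Place the row by the midpoint of its span of cumulative weight.
        midpoint = running + weights[i] / 2
        deciles[i] = min(int(midpoint * 10 / total) + 1, 10)
        running += weights[i]
    return deciles


# Made-up data: the richest household carries 70% of the total weight.
income = [10.0, 20.0, 30.0, 40.0]
weight = [1.0, 1.0, 1.0, 7.0]

median = weighted_quantile(income, weight, 0.5)
p25 = weighted_quantile(income, weight, 0.25)
deciles = weighted_deciles(income, weight)
print(median, p25, deciles)
```

With equal weights the same four rows would have a median between 20 and 30; the heavy weight on the last row moves the weighted median to 40.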
Performance Tips
- Use ensure() for iterative work: Can save minutes when re-running analyses
- Filter before mapping: Reduces computation on large datasets
- Use Aggregate classes: Optimised implementations for common operations
- Batch similar calculations: Run multiple aggregates in sequence
- Cache intermediate results: Store derived calculations
# Approach A: filter persons first, then map only the filtered values
high_earners = output.person[output.person["employment_income"] > 100000]
high_earner_hh_income = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    values=high_earners["employment_income"].values,
    how="sum",
)
# Approach B: map every person, then filter at household level
all_hh_income = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    columns=["employment_income"],
    how="sum",
)
high_earner_hh = all_hh_income[all_hh_income["employment_income"] > 100000]
For Contributors: Implementation
Current implementation:
cat policyengine.py/src/policyengine/core/simulation.py
cat policyengine.py/src/policyengine/core/dataset.py
cat policyengine.py/src/policyengine/core/cache.py
Key patterns:
- Simulation caching: LRU cache with max 100 entries, keyed by UUID
- Entity mapping: Automatic detection of mapping direction (person→group or group→person)
- MicroDataFrame: All entity data uses weighted DataFrames from microdf package
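For readers unfamiliar with the caching pattern named above, a bounded LRU cache can be sketched in a few lines. This is a generic illustration built on OrderedDict, with get() and add() chosen to mirror the calls in the ensure() listing; `LRUCache` here is hypothetical and may differ from the code in core/cache.py.

```python
from collections import OrderedDict


class LRUCache:
    """Bounded cache that evicts the least recently used entry when full."""

    def __init__(self, max_size=100):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def add(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._items)


cache = LRUCache(max_size=2)
cache.add("a", 1)
cache.add("b", 2)
cache.get("a")       # "a" becomes the most recently used entry
cache.add("c", 3)    # cache is full, so "b" is evicted
print(cache.get("a"), cache.get("b"), cache.get("c"))
```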
Related skills:
policyengine-core-skill - Understanding simulation engine architecture
microdf-skill - Working with weighted DataFrames
policyengine-python-client-skill - Basic simulation usage
Debugging Tips
Verify Simulation Ran
assert simulation.output_dataset is not None, "Simulation hasn't run"
expected = ["household_net_income", "household_tax"]
actual = simulation.output_dataset.data.household.columns
assert all(v in actual for v in expected), "Missing variables"
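A bare assert reports only that something is missing. A small helper that names the missing variables can make the failure easier to act on. The sketch below is plain Python over column-name lists; `missing_variables` and `require_variables` are hypothetical helpers.

```python
def missing_variables(expected, actual):
    """Return the expected names that are absent from the actual columns."""
    available = set(actual)
    return [name for name in expected if name not in available]


def require_variables(expected, actual):
    """Raise a KeyError that lists every missing variable by name."""
    missing = missing_variables(expected, actual)
    if missing:
        raise KeyError(f"Missing variables: {', '.join(missing)}")


# Illustrative column list; in practice pass
# simulation.output_dataset.data.household.columns.
columns = ["household_id", "household_net_income"]
missing = missing_variables(["household_net_income", "household_tax"], columns)
print(missing)
```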
Check Entity Linkages
person_hh_ids = set(output.person["person_household_id"])
household_ids = set(output.household["household_id"])
assert person_hh_ids.issubset(household_ids), "Invalid linkage"
Verify Weights
total_persons = output.person["person_weight"].sum()
print(f"Weighted population: {total_persons:,.0f}")
assert not output.person["person_weight"].isna().any(), "Missing weights"
Related Documentation
In policyengine.py repo:
.claude/policyengine-guide.md - High-level patterns
.claude/quick-reference.md - Syntax cheat sheet
.claude/working-with-simulations.md - Detailed simulation guide
examples/ - Full working examples
docs/core-concepts.md - Architecture documentation