Skip to content

PolicyEngine has two different concepts of "input variable" that can diverge #417

@baogorek

Description

@baogorek

The Problem

The core issue: PolicyEngine has two different concepts of "input variable" that can diverge:

  1. Structural definition (Variable.is_input_variable()): A variable is a true input if it has no formulas AND no adds AND no subtracts. This is the
    "correct" definition.
  2. Runtime definition (sim.input_variables): Any variable that has values stored in the simulation (known periods). This is what gets used when saving
    H5 files.

The loophole: Variables with adds (like self_employed_pension_contribution_ald) are structurally NOT inputs - they're aggregations. But if the CPS
dataset contains pre-computed values for them, they appear in sim.input_variables at runtime.

Why it corrupts calculations:

self_employed_pension_contribution_ald
└── adds: [self_employed_pension_contribution_ald_person]
└── formula: min(self_employment_income, contributions)

The CPS has a stale pre-computed value (say $203K) that doesn't match the formula result (say $0 because SE income = 0). When you:

  1. Load the dataset and calculate fresh → formula runs → correct $0
  2. Save to H5 including the pseudo-input → stale $203K gets saved
  3. Reload and calculate → stored $203K overrides the formula → wrong AGI → wrong everything downstream

A Demo to reproduce:

Output:

(lt) baogorek@T14-Gen6:~/devl/long-term/policyengine-us-data/policyengine_us_data/datasets/cps/long_term$ python demo_pseudo_input_bug.py 
======================================================================
DEMONSTRATION: The Pseudo-Input Variable Bug
======================================================================

[STEP 1] Loading base dataset and finding a problematic case...
  Found 143 tax units with stale pseudo-input > $50k

  Example Tax Unit 535:
    Stored value:           $   58,172.36
    SE income:              $        0.00
    Contributions:          $        0.00
    Formula should give:    $        0.00
    --> Stale value doesn't match formula!

[STEP 2] Fresh calculation (correct behavior)...
  After clearing cache and recalculating:
    ALD (fresh):  $        0.00
    AGI (fresh):  $  342,647.16

[STEP 3] Saving H5 file WITH pseudo-input (simulating bug)...
  Saved to /tmp/tmp6i0o2ykx/buggy.h5
  Pseudo-input 'self_employed_pension_contribution_ald' was included: True

[STEP 4] Reloading H5 and demonstrating corruption...
  After reloading the buggy H5:
    ALD (reloaded): $   58,172.36
    AGI (reloaded): $  284,474.81

  COMPARISON:
                                             Fresh        Reloaded      Difference
    ---------------------------------------------------------------------------
    ALD                            $          0.00 $     58,172.36 $     58,172.36
    AGI                            $    342,647.16 $    284,474.81 $    -58,172.34

[STEP 5] Saving H5 WITHOUT pseudo-input (the fix)...
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
  Excluding 3 pseudo-input variables:
    - housing_assistance
    - self_employed_health_insurance_ald
    - self_employed_pension_contribution_ald

  Saved to /tmp/tmp6i0o2ykx/fixed.h5

[STEP 6] Verifying the fix...
  After reloading the FIXED H5:
    ALD (fixed):  $        0.00
    AGI (fixed):  $  342,647.16

  FINAL COMPARISON:
                                             Fresh        Buggy H5        Fixed H5
    ---------------------------------------------------------------------------
    ALD                            $          0.00 $     58,172.36 $          0.00
    AGI                            $    342,647.16 $    284,474.81 $    342,647.16

  Fix successful: True

======================================================================
END OF DEMONSTRATION
======================================================================

Code

"""
Demonstration: Why Pseudo-Input Variables Corrupt H5 Files

This script shows how variables that LOOK like inputs but AGGREGATE
calculated values can corrupt saved datasets.

The culprit: self_employed_pension_contribution_ald
- Has 'adds' attribute (sums person-level values)
- No explicit formula
- PolicyEngine treats it as an "input"
- BUT its component (self_employed_pension_contribution_ald_person) HAS a formula!

The formula: min(self_employment_income, self_employed_pension_contributions)

The bug manifests when:
1. Base data has stale pre-computed values
2. Fresh calculation gives correct values (formula runs)
3. We save to H5 including the pseudo-input (stale value gets stored)
4. We reload and calculate - the stored value overrides the formula
"""

import tempfile
import os
import h5py
import numpy as np
from policyengine_us import Microsimulation
from policyengine_core.enums import Enum

print("=" * 70)
print("DEMONSTRATION: The Pseudo-Input Variable Bug")
print("=" * 70)

YEAR = 2024
PSEUDO_INPUT = "self_employed_pension_contribution_ald"

# =============================================================================
# STEP 1: Load base dataset and find a problematic case
# =============================================================================
print(f"\n[STEP 1] Loading base dataset and finding a problematic case...")

sim = Microsimulation(
    dataset="hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"
)

# All variables at tax_unit level for consistent comparison
stale_values = sim.calculate(PSEUDO_INPUT, period=YEAR).values
se_income = sim.calculate(
    "self_employment_income_before_lsr", period=YEAR, map_to="tax_unit"
).values
contributions = sim.calculate(
    "self_employed_pension_contributions", period=YEAR, map_to="tax_unit"
).values

# Find tax units where stale value > 0 but formula would give 0
# Formula: min(SE_income, contributions) at person level, summed to tax unit
mask = (stale_values > 50000) & ((se_income == 0) | (contributions == 0))
problem_indices = np.where(mask)[0]

print(
    f"  Found {len(problem_indices)} tax units with stale pseudo-input > $50k"
)

tu_idx = problem_indices[0]
stale_value = stale_values[tu_idx]

print(f"\n  Example Tax Unit {tu_idx}:")
print(f"    Stored value:           ${stale_value:>12,.2f}")
print(f"    SE income:              ${se_income[tu_idx]:>12,.2f}")
print(f"    Contributions:          ${contributions[tu_idx]:>12,.2f}")
print(
    f"    Formula should give:    ${min(se_income[tu_idx], contributions[tu_idx]):>12,.2f}"
)
print(f"    --> Stale value doesn't match formula!")

# =============================================================================
# STEP 2: Calculate AGI with fresh formula (correct behavior)
# =============================================================================
print(f"\n[STEP 2] Fresh calculation (correct behavior)...")

# Delete the cached pseudo-input to force recalculation
sim.delete_arrays(PSEUDO_INPUT)

ald_fresh = sim.calculate(PSEUDO_INPUT, period=YEAR).values[tu_idx]
agi_fresh = sim.calculate("adjusted_gross_income", period=YEAR).values[tu_idx]

print(f"  After clearing cache and recalculating:")
print(f"    ALD (fresh):  ${ald_fresh:>12,.2f}")
print(f"    AGI (fresh):  ${agi_fresh:>12,.2f}")

# =============================================================================
# STEP 3: Save H5 WITH the pseudo-input (buggy behavior)
# =============================================================================
print(f"\n[STEP 3] Saving H5 file WITH pseudo-input (simulating bug)...")

# Reload to get stale values back
sim_buggy = Microsimulation(
    dataset="hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"
)

temp_dir = tempfile.mkdtemp()
buggy_h5_path = os.path.join(temp_dir, "buggy.h5")

# Save including the pseudo-input (this is the bug)
data = {}
input_vars = set(sim_buggy.input_variables)

for variable in sim_buggy.tax_benefit_system.variables:
    if variable not in input_vars:
        continue
    data[variable] = {}
    for period in sim_buggy.get_holder(variable).get_known_periods():
        values = sim_buggy.get_holder(variable).get_array(period)
        if sim_buggy.tax_benefit_system.variables.get(variable).value_type in (
            Enum,
            str,
        ):
            if hasattr(values, "decode_to_str"):
                values = values.decode_to_str().astype("S")
            else:
                values = values.astype("S")
        else:
            values = np.array(values)
        if values is not None:
            data[variable][period] = values
    if len(data[variable]) == 0:
        del data[variable]

with h5py.File(buggy_h5_path, "w") as f:
    for variable, periods in data.items():
        grp = f.create_group(variable)
        for period, values in periods.items():
            grp.create_dataset(str(period), data=values)

print(f"  Saved to {buggy_h5_path}")
print(f"  Pseudo-input '{PSEUDO_INPUT}' was included: {PSEUDO_INPUT in data}")

# =============================================================================
# STEP 4: Reload and show corruption
# =============================================================================
print(f"\n[STEP 4] Reloading H5 and demonstrating corruption...")

sim_reloaded = Microsimulation(dataset=buggy_h5_path)

# The stale value overrides the formula!
ald_reloaded = sim_reloaded.calculate(PSEUDO_INPUT, period=YEAR).values[tu_idx]
agi_reloaded = sim_reloaded.calculate(
    "adjusted_gross_income", period=YEAR
).values[tu_idx]

print(f"  After reloading the buggy H5:")
print(f"    ALD (reloaded): ${ald_reloaded:>12,.2f}")
print(f"    AGI (reloaded): ${agi_reloaded:>12,.2f}")

print(f"\n  COMPARISON:")
print(f"    {'':30} {'Fresh':>15} {'Reloaded':>15} {'Difference':>15}")
print(f"    {'-'*75}")
print(
    f"    {'ALD':30} ${ald_fresh:>14,.2f} ${ald_reloaded:>14,.2f} ${ald_reloaded - ald_fresh:>14,.2f}"
)
print(
    f"    {'AGI':30} ${agi_fresh:>14,.2f} ${agi_reloaded:>14,.2f} ${agi_reloaded - agi_fresh:>14,.2f}"
)

# =============================================================================
# STEP 5: Show the fix works
# =============================================================================
print(f"\n[STEP 5] Saving H5 WITHOUT pseudo-input (the fix)...")

sim_fixed = Microsimulation(
    dataset="hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"
)

fixed_h5_path = os.path.join(temp_dir, "fixed.h5")


def get_pseudo_input_variables(sim) -> set:
    """Identify pseudo-inputs that should NOT be saved."""
    tbs = sim.tax_benefit_system
    pseudo_inputs = set()
    for var_name in sim.input_variables:
        var = tbs.variables.get(var_name)
        if not var:
            continue
        adds = getattr(var, "adds", None)
        if adds and isinstance(adds, list):
            for component in adds:
                comp_var = tbs.variables.get(component)
                if comp_var and len(getattr(comp_var, "formulas", {})) > 0:
                    pseudo_inputs.add(var_name)
                    break
        subtracts = getattr(var, "subtracts", None)
        if subtracts and isinstance(subtracts, list):
            for component in subtracts:
                comp_var = tbs.variables.get(component)
                if comp_var and len(getattr(comp_var, "formulas", {})) > 0:
                    pseudo_inputs.add(var_name)
                    break
    return pseudo_inputs


# Save EXCLUDING pseudo-inputs
data_fixed = {}
input_vars_fixed = set(sim_fixed.input_variables)
pseudo_inputs = get_pseudo_input_variables(sim_fixed)
input_vars_fixed = input_vars_fixed - pseudo_inputs

print(f"  Excluding {len(pseudo_inputs)} pseudo-input variables:")
for var in sorted(pseudo_inputs):
    print(f"    - {var}")

for variable in sim_fixed.tax_benefit_system.variables:
    if variable not in input_vars_fixed:
        continue
    data_fixed[variable] = {}
    for period in sim_fixed.get_holder(variable).get_known_periods():
        values = sim_fixed.get_holder(variable).get_array(period)
        if sim_fixed.tax_benefit_system.variables.get(variable).value_type in (
            Enum,
            str,
        ):
            if hasattr(values, "decode_to_str"):
                values = values.decode_to_str().astype("S")
            else:
                values = values.astype("S")
        else:
            values = np.array(values)
        if values is not None:
            data_fixed[variable][period] = values
    if len(data_fixed[variable]) == 0:
        del data_fixed[variable]

with h5py.File(fixed_h5_path, "w") as f:
    for variable, periods in data_fixed.items():
        grp = f.create_group(variable)
        for period, values in periods.items():
            grp.create_dataset(str(period), data=values)

print(f"\n  Saved to {fixed_h5_path}")

# =============================================================================
# STEP 6: Verify the fix works
# =============================================================================
print(f"\n[STEP 6] Verifying the fix...")

sim_fixed_reloaded = Microsimulation(dataset=fixed_h5_path)

ald_fixed = sim_fixed_reloaded.calculate(PSEUDO_INPUT, period=YEAR).values[
    tu_idx
]
agi_fixed = sim_fixed_reloaded.calculate(
    "adjusted_gross_income", period=YEAR
).values[tu_idx]

print(f"  After reloading the FIXED H5:")
print(f"    ALD (fixed):  ${ald_fixed:>12,.2f}")
print(f"    AGI (fixed):  ${agi_fixed:>12,.2f}")

print(f"\n  FINAL COMPARISON:")
print(f"    {'':30} {'Fresh':>15} {'Buggy H5':>15} {'Fixed H5':>15}")
print(f"    {'-'*75}")
print(
    f"    {'ALD':30} ${ald_fresh:>14,.2f} ${ald_reloaded:>14,.2f} ${ald_fixed:>14,.2f}"
)
print(
    f"    {'AGI':30} ${agi_fresh:>14,.2f} ${agi_reloaded:>14,.2f} ${agi_fixed:>14,.2f}"
)

# Check if fix worked
ald_matches = abs(ald_fixed - ald_fresh) < 0.01
agi_matches = abs(agi_fixed - agi_fresh) < 0.01

print(f"\n  Fix successful: {ald_matches and agi_matches}")

# Cleanup
os.remove(buggy_h5_path)
os.remove(fixed_h5_path)
os.rmdir(temp_dir)

print("\n" + "=" * 70)
print("END OF DEMONSTRATION")
print("=" * 70)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions