ID variables produce fractional values when mapped between non-hierarchical entities #400

@baogorek

Problem Description

When ID variables (such as household_id or spm_unit_id) are mapped between non-hierarchical entities with calculate(..., map_to=...), PolicyEngine Core averages and sums them as if they were numeric quantities, producing meaningless fractional IDs.

Minimal Reproducible Example

from policyengine_us import Microsimulation

sim = Microsimulation(dataset='hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5')

# Map household_id to tax_unit level
household_ids_tax_unit = sim.calculate('household_id', map_to='tax_unit')

# Check for fractional values (IDs should always be integers)
fractional_ids = household_ids_tax_unit.values[household_ids_tax_unit.values % 1 != 0]
print(f"Found {len(fractional_ids)} fractional household IDs")
print(f"Examples: {fractional_ids[:5]}")
# Output: [218.  153.5 153.5 172.5 172.5]

Root Cause

In policyengine_core/simulations/simulation.py, the map_result method handles mapping between non-hierarchical group entities (e.g., household → tax_unit) by:

  1. First mapping from source to person using how="mean" (averaging)
  2. Then mapping from person to target using how="sum" (summing)

This is mathematically inappropriate for ID fields, which are categorical identifiers, not numeric quantities that should be averaged or summed.
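
The two steps can be imitated on toy data to see where the fractions come from. This is a minimal NumPy sketch, not the actual map_result code: it assumes that how="mean" spreads each household's value equally across its members before the per-tax-unit sum, and all arrays and IDs below are hypothetical.

```python
import numpy as np

# Hypothetical toy population: 4 people, 2 households, 3 tax units.
# Household 0 (ID 307) is split across two tax units;
# household 1 (ID 12) sits entirely inside one tax unit.
person_household = np.array([0, 0, 1, 1])  # household index of each person
person_tax_unit = np.array([0, 1, 2, 2])   # tax unit index of each person
household_id = np.array([307.0, 12.0])

# Step 1 (assumed behaviour of how="mean"): give each person an equal
# share of the household value.
household_size = np.bincount(person_household)
person_share = (household_id / household_size)[person_household]

# Step 2 (how="sum"): add up the person-level shares within each tax unit.
tax_unit_value = np.bincount(person_tax_unit, weights=person_share)

print(tax_unit_value)  # [153.5 153.5  12. ]
```

Under this assumption the ID survives only when a tax unit contains the whole household. Any household split across tax units yields a share of its ID instead.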

Impact

  • Produces invalid ID values that break referential integrity
  • Can cause silent bugs in downstream analysis
  • Affects any code that relies on ID mapping between non-hierarchical entities

Proposed Solutions

  1. Short-term: Add a warning when mapping variables with "_id" suffix between non-hierarchical entities
  2. Medium-term: Add a variable attribute to mark categorical/ID variables that should not be aggregated mathematically
  3. Long-term: Implement proper ID mapping logic that preserves the most common ID or uses a different strategy appropriate for categorical data
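
As a rough illustration of the long-term option, the sketch below projects the source ID onto persons without dividing it, then keeps the most common ID among each target group's members. The helper map_id_via_persons and its arguments are hypothetical and are not part of PolicyEngine Core. Breaking ties by taking the smallest ID is an arbitrary choice.

```python
import numpy as np
import pandas as pd


def map_id_via_persons(source_ids, person_source, person_target):
    """Hypothetical helper: most common source ID per target group.

    source_ids:    one ID per source entity (e.g. per household)
    person_source: source-entity index of each person
    person_target: target-entity index of each person
    """
    # Project the ID to person level as-is (no division by group size).
    person_ids = np.asarray(source_ids)[np.asarray(person_source)]
    # Within each target group keep the modal ID; ties resolve to the
    # smallest ID because Series.mode() returns sorted values.
    modal = pd.Series(person_ids).groupby(np.asarray(person_target)).agg(
        lambda ids: ids.mode().iloc[0]
    )
    return modal.to_numpy()


mapped = map_id_via_persons(
    source_ids=[307, 12],
    person_source=[0, 0, 1, 1],
    person_target=[0, 1, 2, 2],
)
print(mapped)  # [307 307  12]
```

Every returned value is an ID that exists in the source entity, so referential integrity is preserved even when a household spans several tax units.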

Affected Variables

Testing shows at least these ID variables produce fractional values when mapped to tax_unit:

  • household_id
  • spm_unit_id

Environment

  • policyengine-core version: 3.20.0
  • policyengine-us version: 1.399.1
  • Python version: 3.13
