🐛 Bug fix for exact entities. #80

Open
conrabeatriz wants to merge 1 commit into release/v1.5.0 from bug-fix/issue#75

Conversation

@conrabeatriz conrabeatriz commented May 20, 2026

Summary by Sourcery

Handle certain entity labels as exact identifiers during disambiguation and anonymization, and adjust the experimental notebook document selection accordingly.

Bug Fixes:

  • Prevent fuzzy clustering for entities that should be treated as exact identifiers by grouping them using normalized exact aliases.
  • Ensure anonymization postprocessing records a normalized subclass value for exact-identifier labels to support precise disambiguation.

Enhancements:

  • Introduce shared handling of exact-identifier labels by deriving and storing an exact alias from the entity label subclass metadata.

sourcery-ai Bot commented May 20, 2026

Reviewer's Guide

Introduces exact-match handling for specific entity labels in the entity disambiguation pipeline by propagating a normalized subclass key from anonymization postprocessing into canonical entity building, so that those labels cluster strictly by exact value instead of fuzzy similarity; also adjusts a notebook to point to a different example document.
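
To illustrate why fuzzy similarity is risky for identifier-like labels, consider two distinct DNI numbers that differ in a single digit. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for the project's distance-based clustering; the scorer and the `0.8` threshold are assumptions for illustration, not values taken from this PR.

```python
from difflib import SequenceMatcher

# Two different people: the DNI numbers differ only in the last digit.
dni_a = "30123456"
dni_b = "30123457"

# Stand-in similarity score in [0, 1]; the project's real scorer may differ.
similarity = SequenceMatcher(None, dni_a, dni_b).ratio()

# Hypothetical clustering threshold, chosen only for illustration.
THRESHOLD = 0.8

# Near-identical strings pass a similarity threshold and would be merged.
merged_by_fuzzy = similarity >= THRESHOLD

# Exact comparison keeps the two identifiers apart.
merged_by_exact = dni_a == dni_b
```

Under these assumptions the fuzzy path would merge two different identifiers into one canonical entity, which is the failure mode that exact grouping is meant to avoid.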

Sequence diagram for exact-match handling in entity disambiguation

sequenceDiagram
    participant AnonymizationPostprocess
    participant FuzzyDisambiguation
    participant CanonicalEntities

    AnonymizationPostprocess->>AnonymizationPostprocess: process(ent)
    AnonymizationPostprocess->>AnonymizationPostprocess: cleaned_text = pattern.sub("", ent.text)
    AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_label_subclass = []
    alt label in exact_labels
        AnonymizationPostprocess->>AnonymizationPostprocess: flattened_text = re.sub("[^a-zA-Z0-9]", "", cleaned_text)
        AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_label_subclass.append(flattened_text)
    end
    AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_alt_text = cleaned_text

    FuzzyDisambiguation->>FuzzyDisambiguation: build_canonical_entities(labels, target_labels, threshold)
    FuzzyDisambiguation->>FuzzyDisambiguation: grouped.setdefault(aymurai_label, []).append({text, aymurai_label, exact_alias})
    loop for each label_type, items in grouped.items()
        alt label_type in EXACT_LABELS
            FuzzyDisambiguation->>FuzzyDisambiguation: exact_groups.setdefault(exact_alias, []).append(item)
            FuzzyDisambiguation->>FuzzyDisambiguation: clusters = list(exact_groups.values())
        else
            FuzzyDisambiguation->>FuzzyDisambiguation: clusters = _cluster_aliases_with_cdist(items, threshold)
        end
        FuzzyDisambiguation->>CanonicalEntities: _clusters_to_canonical_entities(clusters)
    end
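
The normalization step in the diagram can be sketched in isolation. The regular expression is the one shown in the diff; the helper name `flatten_identifier` and the sample values are hypothetical.

```python
import re


def flatten_identifier(cleaned_text: str) -> str:
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^a-zA-Z0-9]", "", cleaned_text)


# Different written forms of the same (made-up) CUIT collapse to one key.
variants = ["20-12345678-9", "20 12345678 9", "20.12345678.9"]
keys = {flatten_identifier(v) for v in variants}

# The regex keeps letter case, so these two plate spellings stay distinct.
upper_key = flatten_identifier("AB 123 CD")
lower_key = flatten_identifier("ab 123 cd")
```

Because the pattern does not fold case, labels that may contain letters (for example `PATENTE_DOMINIO`) would still split on casing differences. Whether that is desired is a question for the authors.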

File-Level Changes

Change: Propagate an exact, normalized subclass alias per entity (for certain labels) from anonymization postprocessing into the entity disambiguation pipeline so those entities are clustered by exact value rather than fuzzy distance.
Details:
  • Define a shared set of labels that must be treated with exact matching semantics (e.g., DNI, CUIT_CUIL, TELEFONO, etc.)
  • In anonymization postprocess, derive a cleaned, alphanumeric-only value for exact-match labels and store it in the entity attrs under aymurai_label_subclass, initializing that field as a list and appending the flattened value
  • In canonical entity building, compute an exact_alias from aymurai_label_subclass (handling both list and scalar cases) alongside the existing alias text, and include it in the grouped item structure
  • For labels in the exact-match set, form clusters by grouping items with identical exact_alias instead of using the distance-based clustering; for other labels, keep the existing fuzzy clustering behavior
Files:
  • aymurai/utils/entity_disambiguation/fuzzy.py
  • aymurai/transforms/anonymization_postprocess/core.py

Change: Adjust the experiment notebook to run on a different sample document index.
Details:
  • Change the selected document index from 14 to 5 when choosing doc_path for processing in the entity disambiguation anonymization experiment notebook
Files:
  • notebooks/experiments/entity-disambiguation/10-anonymize-document-render-policy.ipynb
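
The clustering change described for `fuzzy.py` can be sketched as follows. This is a minimal illustration, not the PR's code: `derive_exact_alias` and `cluster_items` are hypothetical names, the fuzzy clusterer is passed in as a callable instead of calling `_cluster_aliases_with_cdist`, and falling back to the raw text when no subclass is stored is an assumption.

```python
from typing import Any, Callable

# Label set as listed in the PR diff.
EXACT_LABELS = {
    "DNI", "CUIT_CUIL", "TELEFONO", "PATENTE_DOMINIO",
    "IP", "NUM_CAJA_AHORRO", "CBU", "NUM_MATRICULA",
}

Item = dict[str, Any]


def derive_exact_alias(subclass: Any, text: str) -> str:
    """Return the normalized key stored in aymurai_label_subclass.

    Accepts a list or a scalar. Falling back to the raw text when no
    subclass is present is an assumption of this sketch.
    """
    if isinstance(subclass, list):
        subclass = subclass[0] if subclass else None
    return str(subclass) if subclass else text


def cluster_items(
    label_type: str,
    items: list[Item],
    fuzzy_cluster: Callable[[list[Item]], list[list[Item]]],
) -> list[list[Item]]:
    """Group exact-identifier labels by key; defer other labels to fuzzy_cluster."""
    if label_type in EXACT_LABELS:
        exact_groups: dict[str, list[Item]] = {}
        for item in items:
            exact_groups.setdefault(item["exact_alias"], []).append(item)
        return list(exact_groups.values())
    return fuzzy_cluster(items)


# Usage with made-up values: two spellings of one CUIT plus a different CUIT.
raw = [
    ("20-12345678-9", ["20123456789"]),  # subclass stored as a list
    ("20 12345678 9", "20123456789"),    # subclass stored as a scalar
    ("20-12345678-0", ["20123456780"]),
]
items = [
    {"text": t, "aymurai_label": "CUIT_CUIL", "exact_alias": derive_exact_alias(s, t)}
    for t, s in raw
]
clusters = cluster_items("CUIT_CUIL", items, fuzzy_cluster=lambda xs: [xs])
cluster_sizes = sorted(len(c) for c in clusters)
```

The two spellings of the same value end up in one cluster and the one-digit-different value gets its own, regardless of how similar the strings look.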

@conrabeatriz conrabeatriz changed the title from "🐛 Bug fix for exact entities." to "🐛 Bug fix for exact entities. PR" on May 20, 2026
@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 2 issues, and left some high level feedback:

  • The exact_labels set is duplicated in both fuzzy.py and core.py; consider centralizing this constant in a shared module to avoid divergence and make future updates easier.
  • In anonymization_postprocess/core.py, aymurai_label_subclass is always reset to an empty list; verify whether you should preserve any existing subclasses or guard against overwriting previously set values.
  • The notebook change from documents[14] to documents[5] looks like a local experiment tweak; confirm this is the intended default behavior and not a temporary debugging choice.
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="aymurai/utils/entity_disambiguation/fuzzy.py" line_range="10-18" />
<code_context>
 from aymurai.meta.api_interfaces import DocLabel
 from aymurai.meta.entities import CanonicalEntity

+EXACT_LABELS = {
+    "DNI",
+    "CUIT_CUIL",
+    "TELEFONO",
+    "PATENTE_DOMINIO",
+    "IP",
+    "NUM_CAJA_AHORRO",
+    "CBU",
+    "NUM_MATRICULA",
+}
+
</code_context>
<issue_to_address>
**suggestion:** Avoid duplicating the exact-label set in multiple modules by centralizing it

This set is defined here as `EXACT_LABELS` and again in `anonymization_postprocess/core.py` as `exact_labels`. Please move it to a shared constants module and import it in both places so there is a single source of truth and no risk of the two sets drifting out of sync.

Suggested implementation:

```python
from aymurai.meta.api_interfaces import DocLabel
from aymurai.meta.entities import CanonicalEntity
from aymurai.meta.constants import EXACT_LABELS

```

1. Create (or extend) a shared constants module, for example `aymurai/meta/constants.py`, and move the set definition there:

```python
EXACT_LABELS = {
    "DNI",
    "CUIT_CUIL",
    "TELEFONO",
    "PATENTE_DOMINIO",
    "IP",
    "NUM_CAJA_AHORRO",
    "CBU",
    "NUM_MATRICULA",
}
```

2. In `anonymization_postprocess/core.py`, replace the local `exact_labels` definition with an import from the same constants module, e.g.:

```python
from aymurai.meta.constants import EXACT_LABELS as exact_labels
```

(or adjust naming/import style to match existing conventions in that file).

3. Ensure `aymurai/meta/constants.py` is part of the package (has `__init__.py` as needed) and update any relevant `__all__` if your project uses it.
</issue_to_address>

### Comment 2
<location path="aymurai/transforms/anonymization_postprocess/core.py" line_range="60-64" />
<code_context>
+            "NUM_MATRICULA",
+        }
+
+        ent["attrs"]["aymurai_label_subclass"] = []
+
+        if label in exact_labels:
+            flattened_text = re.sub(r"[^a-zA-Z0-9]", "", cleaned_text)
+            ent["attrs"]["aymurai_label_subclass"].append(flattened_text)
+
         # Update the entity's alt text and indices
</code_context>
<issue_to_address>
**issue (bug_risk):** Re-initializing `aymurai_label_subclass` may unintentionally discard previous subclass information

Unconditionally assigning `ent["attrs"]["aymurai_label_subclass"] = []` clears any existing data in this field before you append the new value. If earlier steps in the pipeline set this attribute (now or in the future), this could cause data loss. Consider only initializing when absent (e.g., via `setdefault`/`get`) or otherwise making this logic additive rather than destructive.
</issue_to_address>
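
One possible shape for the additive variant suggested in Comment 2 is sketched below. The function name `record_exact_subclass` is hypothetical, `EXACT_LABELS` is abbreviated to two entries, and wrapping a pre-existing scalar value in a list is an assumption about how earlier pipeline steps might populate the field.

```python
import re

# Abbreviated for the example; the PR lists eight labels.
EXACT_LABELS = {"DNI", "CBU"}


def record_exact_subclass(ent: dict, label: str, cleaned_text: str) -> dict:
    """Append the flattened value without discarding existing subclasses."""
    attrs = ent.setdefault("attrs", {})
    existing = attrs.get("aymurai_label_subclass")

    # Keep whatever is already there; normalize it to a list.
    if existing is None:
        subclasses = []
    elif isinstance(existing, list):
        subclasses = existing
    else:
        subclasses = [existing]
    attrs["aymurai_label_subclass"] = subclasses

    if label in EXACT_LABELS:
        flattened_text = re.sub(r"[^a-zA-Z0-9]", "", cleaned_text)
        if flattened_text not in subclasses:  # avoid duplicates on re-runs
            subclasses.append(flattened_text)
    return ent


# Usage with made-up entities.
with_history = record_exact_subclass(
    {"attrs": {"aymurai_label_subclass": ["existing"]}}, "DNI", "30.123.456"
)
fresh = record_exact_subclass({}, "DNI", "30.123.456")
other_label = record_exact_subclass({}, "NOMBRE", "Juan Perez")
```

If existing subclasses are preserved this way, the downstream code that derives `exact_alias` from the list needs a rule for choosing the right element, since the flattened value is no longer guaranteed to be the only entry.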

@conrabeatriz conrabeatriz changed the title from "🐛 Bug fix for exact entities. PR" to "🐛 Bug fix for exact entities." on May 20, 2026
@jansaldo jansaldo requested a review from Copilot May 21, 2026 18:03

Copilot AI left a comment

Pull request overview

This PR adjusts entity disambiguation/anonymization so certain identifier-like labels (e.g., DNI/CBU/IP) are treated as exact identifiers (no fuzzy clustering), using a normalized “exact alias” derived from label subclass metadata.

Changes:

  • Add an EXACT_LABELS path in canonical-entity building to group exact-identifier labels by a normalized alias instead of fuzzy clustering.
  • Update anonymization postprocessing to store a normalized subclass value for exact-identifier labels to support exact grouping.
  • Update an experimental notebook to process a different sample document.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| notebooks/experiments/entity-disambiguation/10-anonymize-document-render-policy.ipynb | Changes which document sample index is processed in the experiment. |
| aymurai/utils/entity_disambiguation/fuzzy.py | Introduces exact-identifier grouping logic during canonical entity construction. |
| aymurai/transforms/anonymization_postprocess/core.py | Records a normalized subclass value for exact-identifier labels during entity cleaning. |

for item in items:
    exact_groups.setdefault(item["exact_alias"], []).append(item)

clusters = list(exact_groups.values())

Comment on lines +60 to +64
ent["attrs"]["aymurai_label_subclass"] = []

if label in exact_labels:
    flattened_text = re.sub(r"[^a-zA-Z0-9]", "", cleaned_text)
    ent["attrs"]["aymurai_label_subclass"].append(flattened_text)

Comment on lines +49 to +58
exact_labels = {
    "DNI",
    "CUIT_CUIL",
    "TELEFONO",
    "PATENTE_DOMINIO",
    "IP",
    "NUM_CAJA_AHORRO",
    "CBU",
    "NUM_MATRICULA",
}