🐛 Bug fix for exact entities. #80
…the fuzzy matching algorithm.
Reviewer's Guide
Introduces exact-match handling for specific entity labels in the entity disambiguation pipeline by propagating a normalized subclass key from anonymization postprocessing into canonical entity building, so that those labels cluster strictly by exact value instead of fuzzy similarity; also adjusts a notebook to point to a different example document.
Sequence diagram for exact-match handling in entity disambiguation
sequenceDiagram
participant AnonymizationPostprocess
participant FuzzyDisambiguation
participant CanonicalEntities
AnonymizationPostprocess->>AnonymizationPostprocess: process(ent)
AnonymizationPostprocess->>AnonymizationPostprocess: cleaned_text = pattern.sub("", ent.text)
AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_label_subclass = []
alt label in exact_labels
AnonymizationPostprocess->>AnonymizationPostprocess: flattened_text = re.sub("[^a-zA-Z0-9]", "", cleaned_text)
AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_label_subclass.append(flattened_text)
end
AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_alt_text = cleaned_text
FuzzyDisambiguation->>FuzzyDisambiguation: build_canonical_entities(labels, target_labels, threshold)
FuzzyDisambiguation->>FuzzyDisambiguation: grouped.setdefault(aymurai_label, []).append({text, aymurai_label, exact_alias})
loop for each label_type, items in grouped.items()
alt label_type in EXACT_LABELS
FuzzyDisambiguation->>FuzzyDisambiguation: exact_groups.setdefault(exact_alias, []).append(item)
FuzzyDisambiguation->>FuzzyDisambiguation: clusters = list(exact_groups.values())
else
FuzzyDisambiguation->>FuzzyDisambiguation: clusters = _cluster_aliases_with_cdist(items, threshold)
end
FuzzyDisambiguation->>CanonicalEntities: _clusters_to_canonical_entities(clusters)
end
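The branching in the diagram above can be sketched roughly as follows. This is a minimal illustration rather than the code from the PR: `cluster_items`, the `fuzzy_cluster` callable, the `PER` label, and the two-label `EXACT_LABELS` subset are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical subset of the exact-match labels, for illustration only.
EXACT_LABELS = {"DNI", "CBU"}


def cluster_items(grouped, fuzzy_cluster):
    """Return {label: clusters}, grouping EXACT_LABELS by exact alias.

    `grouped` maps a label to a list of dicts with "text" and "exact_alias".
    `fuzzy_cluster` is any callable standing in for the fuzzy clustering step.
    """
    result = {}
    for label_type, items in grouped.items():
        if label_type in EXACT_LABELS:
            # One cluster per distinct normalized alias, no similarity scoring.
            exact_groups = defaultdict(list)
            for item in items:
                exact_groups[item["exact_alias"]].append(item)
            result[label_type] = list(exact_groups.values())
        else:
            result[label_type] = fuzzy_cluster(items)
    return result


grouped = {
    "DNI": [
        {"text": "12.345.678", "exact_alias": "12345678"},
        {"text": "12345678", "exact_alias": "12345678"},
        {"text": "12.345.679", "exact_alias": "12345679"},
    ],
    # "PER" is a made-up non-exact label used only to exercise the else branch.
    "PER": [{"text": "Juan Perez", "exact_alias": None}],
}

# Stand-in for the fuzzy step: put every item of a label into one cluster.
clusters = cluster_items(grouped, fuzzy_cluster=lambda items: [items])
```

In this sketch the two spellings of the same DNI land in one cluster, while the DNI that differs by a single digit stays separate, which is the behavior a similarity threshold could not guarantee.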
File-Level Changes
Hey - I've found 2 issues, and left some high level feedback:
- The `exact_labels` set is duplicated in both `fuzzy.py` and `core.py`; consider centralizing this constant in a shared module to avoid divergence and make future updates easier.
- In `anonymization_postprocess/core.py`, `aymurai_label_subclass` is always reset to an empty list; verify whether you should preserve any existing subclasses or guard against overwriting previously set values.
- The notebook change from `documents[14]` to `documents[5]` looks like a local experiment tweak; confirm this is the intended default behavior and not a temporary debugging choice.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `exact_labels` set is duplicated in both `fuzzy.py` and `core.py`; consider centralizing this constant in a shared module to avoid divergence and make future updates easier.
- In `anonymization_postprocess/core.py`, `aymurai_label_subclass` is always reset to an empty list; verify whether you should preserve any existing subclasses or guard against overwriting previously set values.
- The notebook change from `documents[14]` to `documents[5]` looks like a local experiment tweak; confirm this is the intended default behavior and not a temporary debugging choice.
## Individual Comments
### Comment 1
<location path="aymurai/utils/entity_disambiguation/fuzzy.py" line_range="10-18" />
<code_context>
from aymurai.meta.api_interfaces import DocLabel
from aymurai.meta.entities import CanonicalEntity
+EXACT_LABELS = {
+ "DNI",
+ "CUIT_CUIL",
+ "TELEFONO",
+ "PATENTE_DOMINIO",
+ "IP",
+ "NUM_CAJA_AHORRO",
+ "CBU",
+ "NUM_MATRICULA",
+}
+
</code_context>
<issue_to_address>
**suggestion:** Avoid duplicating the exact-label set in multiple modules by centralizing it
This set is defined here as `EXACT_LABELS` and again in `anonymization_postprocess/core.py` as `exact_labels`. Please move it to a shared constants module and import it in both places, so there is a single source of truth and the two copies cannot drift out of sync.
Suggested implementation:
```python
from aymurai.meta.api_interfaces import DocLabel
from aymurai.meta.entities import CanonicalEntity
from aymurai.meta.constants import EXACT_LABELS
```
1. Create (or extend) a shared constants module, for example `aymurai/meta/constants.py`, and move the set definition there:
```python
EXACT_LABELS = {
    "DNI",
    "CUIT_CUIL",
    "TELEFONO",
    "PATENTE_DOMINIO",
    "IP",
    "NUM_CAJA_AHORRO",
    "CBU",
    "NUM_MATRICULA",
}
```
2. In `anonymization_postprocess/core.py`, replace the local `exact_labels` definition with an import from the same constants module, e.g.:
```python
from aymurai.meta.constants import EXACT_LABELS as exact_labels
```
(or adjust naming/import style to match existing conventions in that file).
3. Ensure the directory containing `aymurai/meta/constants.py` is importable as a package (it has an `__init__.py` where needed) and update any relevant `__all__` if your project uses it.
</issue_to_address>
### Comment 2
<location path="aymurai/transforms/anonymization_postprocess/core.py" line_range="60-64" />
<code_context>
+ "NUM_MATRICULA",
+ }
+
+ ent["attrs"]["aymurai_label_subclass"] = []
+
+ if label in exact_labels:
+ flattened_text = re.sub(r"[^a-zA-Z0-9]", "", cleaned_text)
+ ent["attrs"]["aymurai_label_subclass"].append(flattened_text)
+
# Update the entity's alt text and indices
</code_context>
<issue_to_address>
**issue (bug_risk):** Re-initializing `aymurai_label_subclass` may unintentionally discard previous subclass information
Unconditionally assigning `ent["attrs"]["aymurai_label_subclass"] = []` clears any existing data in this field before you append the new value. If earlier steps in the pipeline set this attribute (now or in the future), this could cause data loss. Consider only initializing when absent (e.g., via `setdefault`/`get`) or otherwise making this logic additive rather than destructive.
</issue_to_address>
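One way to make the assignment from Comment 2 additive is sketched below. It is a minimal example under assumptions, not the project's implementation: `add_exact_subclass` is a hypothetical helper, the entity is modelled as a plain dict, and `EXACT_LABELS` is a two-label subset chosen for brevity.

```python
import re

# Hypothetical subset of the exact-match labels, for illustration only.
EXACT_LABELS = {"DNI", "CBU"}


def add_exact_subclass(ent, label, cleaned_text):
    """Append the flattened value without discarding existing subclasses."""
    # setdefault creates the list only when the key is missing.
    subclasses = ent["attrs"].setdefault("aymurai_label_subclass", [])
    if label in EXACT_LABELS:
        flattened_text = re.sub(r"[^a-zA-Z0-9]", "", cleaned_text)
        # Skip duplicates so repeated processing does not grow the list.
        if flattened_text not in subclasses:
            subclasses.append(flattened_text)
    return ent


# An entity that already carries a subclass keeps it.
ent = {"attrs": {"aymurai_label_subclass": ["existing"]}}
add_exact_subclass(ent, "DNI", "12.345.678")

# An entity without the attribute gets a fresh list.
fresh = add_exact_subclass({"attrs": {}}, "DNI", "12.345.678")
```

Whether an additive list is safe depends on how the exact alias is later read back from `aymurai_label_subclass`, which is not shown in this review; if downstream code assumes the flattened value is the only or first element, that assumption would need checking.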
Pull request overview
This PR adjusts entity disambiguation/anonymization so certain identifier-like labels (e.g., DNI/CBU/IP) are treated as exact identifiers (no fuzzy clustering), using a normalized “exact alias” derived from label subclass metadata.
Changes:
- Add an `EXACT_LABELS` path in canonical-entity building to group exact-identifier labels by a normalized alias instead of fuzzy clustering.
- Update anonymization postprocessing to store a normalized subclass value for exact-identifier labels to support exact grouping.
- Update an experimental notebook to process a different sample document.
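To illustrate the normalization that makes exact grouping work, here is a small sketch using the same regular expression that appears in the diff. The `flatten` helper name and the sample values are assumptions for the example.

```python
import re


def flatten(text):
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^a-zA-Z0-9]", "", text)


# Three formattings of the same (made-up) identifier.
variants = ["12.345.678", "12 345 678", "12-345-678"]
keys = {flatten(v) for v in variants}
```

All three variants collapse to a single key, so they end up in one exact group. Note that the pattern as written preserves case, so values that differ only in letter case (for example in a licence plate) would still produce different keys.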
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| notebooks/experiments/entity-disambiguation/10-anonymize-document-render-policy.ipynb | Changes which document sample index is processed in the experiment. |
| aymurai/utils/entity_disambiguation/fuzzy.py | Introduces exact-identifier grouping logic during canonical entity construction. |
| aymurai/transforms/anonymization_postprocess/core.py | Records a normalized subclass value for exact-identifier labels during entity cleaning. |
for item in items:
    exact_groups.setdefault(item["exact_alias"], []).append(item)

clusters = list(exact_groups.values())
Summary by Sourcery
Handle certain entity labels as exact identifiers during disambiguation and anonymization, and adjust the experimental notebook document selection accordingly.
Bug Fixes:
Enhancements: