Normalize IPA strings to NFC for consistency #382

@lars76

Hi, I noticed that the dataset contains a mixture of NFC and NFD Unicode forms for IPA strings. For example:

  • Row 277, Col 7: ãː (NFD: a + COMBINING TILDE) vs. ãː (NFC: single precomposed ã).

Out of ~5.1 million cells, ~6,200 are not in NFC. This causes issues with string matching, e.g., "ã" != "ã" even though they look identical.
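The mismatch can be reproduced with a minimal sketch like the one below. The two literals are written with escapes so that the decomposed and precomposed forms stay distinguishable; the variable names are only illustrative.

```python
import unicodedata

# Decomposed form: "a" + COMBINING TILDE (U+0303) + length mark (U+02D0)
nfd = "a\u0303\u02d0"
# Precomposed form: U+00E3 + length mark (U+02D0)
nfc = "\u00e3\u02d0"

# The strings render identically but differ in code points and length.
print(nfd == nfc)            # False
print(len(nfd), len(nfc))    # 3 2

# After NFC normalization the decomposed string matches the precomposed one.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```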

To fix this, I applied NFC normalization across the CSV like this:

import csv
import unicodedata

# newline="" lets the csv module handle line endings itself.
with open("input.csv", "r", encoding="utf-8", newline="") as infile, \
     open("output.csv", "w", encoding="utf-8", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        # Normalize every cell to NFC; cells already in NFC are unchanged.
        writer.writerow([unicodedata.normalize("NFC", cell) for cell in row])
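To see which cells are affected before rewriting the file, something like the following sketch could be used. The helper name `find_non_nfc_cells` is hypothetical, and the 1-based row and column numbering (with the header counted as row 1) is an assumption that may not match the numbering used elsewhere in this issue. The sample data is an in-memory stand-in for the real CSV.

```python
import csv
import io
import unicodedata

def find_non_nfc_cells(rows):
    """Yield (row, col, cell) for each cell that NFC normalization would change.

    Row and column numbers are 1-based (an assumption for this sketch).
    """
    for r, row in enumerate(rows, start=1):
        for c, cell in enumerate(row, start=1):
            if unicodedata.normalize("NFC", cell) != cell:
                yield r, c, cell

# Tiny in-memory CSV: one decomposed cell, one precomposed cell.
sample = io.StringIO("id,ipa\n1,a\u0303\u02d0\n2,\u00e3\u02d0\n")
hits = list(find_non_nfc_cells(csv.reader(sample)))

for r, c, cell in hits:
    print(f"Row {r}, Col {c}: {cell!r}")
print(f"{len(hits)} cell(s) not in NFC")
```

For the real file, `csv.reader(open("input.csv", encoding="utf-8", newline=""))` could be passed in place of the sample reader.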

Example of a normalized cell:

Row 1747, Col 8
  Original   : o̞˞ o̞ õ̞ ɔ   [U+006F U+031E U+02DE U+0020 U+006F U+031E U+0020 U+006F U+031E U+0303 U+0020 U+0254]
  Normalized : o̞˞ o̞ õ̞ ɔ   [U+006F U+031E U+02DE U+0020 U+006F U+031E U+0020 U+00F5 U+031E U+0020 U+0254]
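The code point listing above can be reproduced with a short sketch like this one. The `codepoints` helper is a hypothetical name, and the input string is rebuilt from the code points listed for the original cell.

```python
import unicodedata

def codepoints(s):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# Rebuilt from the code points of the original cell (Row 1747, Col 8).
original = "o\u031e\u02de o\u031e o\u031e\u0303 \u0254"
normalized = unicodedata.normalize("NFC", original)

print("Original   :", codepoints(original))
print("Normalized :", codepoints(normalized))
```

Only the third syllable changes: the base "o" and the combining tilde (U+0303) compose into U+00F5 even though the lowering diacritic (U+031E) sits between them, because U+031E has a lower combining class and does not block the composition.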
