Accent-stripping example? #84

@ccleve

Would it be possible to add an example of stripping accents to the documentation? (This is commonly needed for search applications.)

As I understand it, the right way to do this is to check whether each character is a letter (IS_LETTER) and belongs to one of these Unicode blocks: LATIN_1_SUPPLEMENT, LATIN_EXTENDED_A, LATIN_EXTENDED_B, LATIN_EXTENDED_ADDITIONAL. If it does, decompose it, remove any NON_SPACING_MARK characters, and recompose.

I haven't been able to figure out how to tell whether a character is a non-spacing mark.
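
To make the steps above concrete, here is a minimal sketch in Python using only the standard `unicodedata` module. It does not use this library's API. The names `LATIN_BLOCKS`, `in_latin_blocks`, and `strip_accents` are hypothetical, and the block ranges are written out by hand from the Unicode block definitions. A non-spacing mark is taken to be any character whose general category is `Mn`.

```python
import unicodedata

# Code point ranges of the four Unicode blocks named above
# (hand-written here; an assumption of this sketch, not a library constant).
LATIN_BLOCKS = (
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x1E00, 0x1EFF),  # Latin Extended Additional
)


def in_latin_blocks(ch):
    """Return True if ch lies in one of the four Latin blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LATIN_BLOCKS)


def strip_accents(text):
    """Decompose qualifying letters, drop non-spacing marks, recompose."""
    out = []
    for ch in text:
        if ch.isalpha() and in_latin_blocks(ch):
            # Canonical decomposition: base letter followed by combining marks.
            decomposed = unicodedata.normalize("NFD", ch)
            # General category "Mn" is Mark, Nonspacing.
            kept = "".join(
                c for c in decomposed if unicodedata.category(c) != "Mn"
            )
            # Recompose whatever remains.
            out.append(unicodedata.normalize("NFC", kept))
        else:
            out.append(ch)
    return "".join(out)


print(strip_accents("café"))      # cafe
print(strip_accents("Ångström"))  # Angstrom
```

Two caveats with this approach: letters that have no canonical decomposition, such as `ß` or `ø`, pass through unchanged, and text that arrives already decomposed (a plain `e` followed by U+0301) is not touched because neither code point falls in the listed blocks. Normalizing the input to NFC first would be one way to cover the second case.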
