Build name-tagger automaton as a noncontiguous NFA#216

Open
akamick86 wants to merge 1 commit into opensanctions:main from akamick86:perf/tagger-noncontiguous-nfa

@akamick86

What

Needles::build currently lets AhoCorasickBuilder pick the automaton kind (the default AhoCorasickKind::Auto). This forces AhoCorasickKind::NoncontiguousNFA.

Why

The name tagger seeds the Needles automaton with ~470k aliases (mostly from person_names.txt). That automaton is built lazily on the first analyze_names call with symbols enabled and cached for the process lifetime. On a cold process the build is the dominant startup cost, and the Aho-Corasick construction is ~86% of it.
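The build-once, cache-for-the-process pattern described above can be sketched with `std::sync::OnceLock`. This is only an illustration: the PR does not show how the project caches the automaton, and the simplified `Needles` struct, `person_needles`, and `BUILD_COUNT` below are hypothetical stand-ins.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for the real automaton; only the caching pattern matters.
struct Needles {
    aliases: Vec<String>,
}

/// Counts how often the expensive build actually runs.
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

impl Needles {
    /// Placeholder for the expensive one-time construction.
    fn build(aliases: &[&str]) -> Needles {
        BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
        Needles {
            aliases: aliases.iter().map(|a| a.to_lowercase()).collect(),
        }
    }

    /// Case-insensitive exact lookup, just enough to exercise the cache.
    fn contains(&self, name: &str) -> bool {
        let wanted = name.to_lowercase();
        self.aliases.iter().any(|a| *a == wanted)
    }
}

/// Process-wide cache: the first caller pays for the build, later callers reuse it.
fn person_needles() -> &'static Needles {
    static CACHE: OnceLock<Needles> = OnceLock::new();
    CACHE.get_or_init(|| Needles::build(&["John Smith", "Jane Doe"]))
}
```

Calling `person_needles()` twice runs `Needles::build` once, which is why the build cost shows up only as first-use latency on a cold process.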

The default Auto kind does extra work selecting and constructing a contiguous NFA. The noncontiguous NFA is cheaper to build. For this workload build cost matters far more than search: the prefilter stays enabled and tagger haystacks are short (~30 chars), so search stays sub-microsecond either way.

kind only changes the internal representation, never which matches are returned, so output is unchanged.

Numbers

Release build, the person tagger's 470,083-needle set:

| Builder kind | Build time |
| --- | --- |
| Auto (current) | 1624 ms |
| ContiguousNFA | 1516 ms |
| NoncontiguousNFA (this PR) | 1443 ms |

Relative to Auto, NoncontiguousNFA takes about 180 ms (11%) off the automaton build, the largest single component of first-use latency. Overlapping-search results over a set of probe haystacks are byte-identical across all three kinds.
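The savings quoted here follow directly from the measured build times. A small sketch of the arithmetic, using only the numbers reported in this section (the constant and function names are illustrative):

```rust
/// Build times in milliseconds, as reported in the Numbers section.
const AUTO_MS: f64 = 1624.0;
const CONTIGUOUS_MS: f64 = 1516.0;
const NONCONTIGUOUS_MS: f64 = 1443.0;

/// Absolute (ms) and relative (%) saving of `candidate_ms` against `baseline_ms`.
fn saving(baseline_ms: f64, candidate_ms: f64) -> (f64, f64) {
    let abs = baseline_ms - candidate_ms;
    (abs, 100.0 * abs / baseline_ms)
}
```

`saving(AUTO_MS, NONCONTIGUOUS_MS)` gives 181 ms and roughly 11.1%, while `saving(AUTO_MS, CONTIGUOUS_MS)` gives 108 ms and roughly 6.7%.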

Risk

Low. match_kind stays Standard (required for the overlapping iteration the tagger relies on), the prefilter is untouched, and only the NFA representation changes. The full test suite passes (192 tests).

Needles::build let AhoCorasickBuilder choose the automaton kind (Auto).
Force NoncontiguousNFA: it is cheaper to build, which matters for the
large needle sets the name tagger seeds (~470k aliases), whose one-time
build dominates first-use latency. kind only changes the internal
representation, not which matches are returned, and the prefilter stays
enabled so search is unaffected for the short haystacks the tagger sees.

Release build, person tagger (470,083 needles):
Auto 1624 ms, ContiguousNFA 1516 ms, NoncontiguousNFA 1443 ms.
About 11% off the automaton build; overlapping-search results are
byte-identical across kinds. Existing test suite passes (192).
@akamick86

Author

Same origin as #217: a case where every bit of cold-start time matters.
