Build name-tagger automaton as a noncontiguous NFA #216
Open
akamick86 wants to merge 1 commit into
Needles::build let AhoCorasickBuilder choose the automaton kind (Auto). Force NoncontiguousNFA: it is cheaper to build, which matters for the large needle sets the name tagger seeds (~470k aliases), whose one-time build dominates first-use latency. kind only changes the internal representation, not which matches are returned, and the prefilter stays enabled so search is unaffected for the short haystacks the tagger sees. Release build, person tagger (470,083 needles): Auto 1624 ms, ContiguousNFA 1516 ms, NoncontiguousNFA 1443 ms. About 11% off the automaton build; overlapping-search results are byte-identical across kinds. Existing test suite passes (192).
Same origin as #217 when every bit of cold start matters.
What
Needles::build currently lets AhoCorasickBuilder pick the automaton kind (the default AhoCorasickKind::Auto). This forces AhoCorasickKind::NoncontiguousNFA.
Why
The name tagger seeds the Needles automaton with ~470k aliases (mostly from person_names.txt). That automaton is built lazily on the first analyze_names call with symbols enabled and cached for the process lifetime. On a cold process the build is the dominant startup cost, and the Aho-Corasick construction is ~86% of it.
The default Auto kind does extra work selecting and constructing a contiguous NFA. The noncontiguous NFA is cheaper to build. For this workload build cost matters far more than search: the prefilter stays enabled and tagger haystacks are short (~30 chars), so search stays sub-microsecond either way. kind only changes the internal representation, never which matches are returned, so output is unchanged.
Numbers
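Release build, the person tagger's 470,083-needle set, automaton build time per kind:
Auto: 1624 ms
ContiguousNFA: 1516 ms
NoncontiguousNFA: 1443 ms
About 180 ms / 11% off the largest single component of first-use latency, measured against Auto. Overlapping-search results over a set of probe haystacks are byte-identical across all three kinds.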
Risk
Low. match_kind stays Standard (required for the overlapping iteration the tagger relies on), the prefilter is untouched, and only the NFA representation changes. The full test suite passes (192 tests).