Skip to content

feat: support ngram token filter#10

Merged
silver-ymz merged 4 commits into
supervc-stack:mainfrom
silver-ymz:feat/ngram
May 11, 2025
Merged

feat: support ngram token filter#10
silver-ymz merged 4 commits into
supervc-stack:mainfrom
silver-ymz:feat/ngram

Conversation

@silver-ymz
Copy link
Copy Markdown
Collaborator

close #9

Update:

silver-ymz added 4 commits May 8, 2025 22:52
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>
@silver-ymz silver-ymz requested review from VoVAllen and Copilot May 8, 2025 15:35
@silver-ymz silver-ymz changed the title Feat/ngram feat: support ngram token filter May 8, 2025
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces an ngram token filter similar to Elastic Search’s implementation and updates some dependency versions.

  • Added a new ngram token filter implementation with configuration and validation.
  • Updated dependency versions in Cargo.toml and the setup script.
  • Extended documentation and tests to cover the new ngram token filter.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tools/setup.sh Bumped pgrx version to 0.14.1 for proper integration.
tests/sqllogictest/ngram.slt Added tests verifying the ngram token filter behavior.
src/token_filter/ngram.rs Introduced the ngram filter implementation and configuration.
src/token_filter/mod.rs Registered the new ngram filter in the token filter module.
src/lib.rs Modified _PG_init to use extern "C-unwind" as required by pgrx bump.
docs/00-reference.md Documented the options for the ngram token filter.
Cargo.toml Updated versions of lindera, pgrx, and removed pg12 support.
.github/workflows/check.yml Updated workflow to use the latest sccache action version.
Comments suppressed due to low confidence (1)

Cargo.toml:13

  • The removal of the pg12 feature is not mentioned in the PR description; please confirm if dropping support for pg12 is intentional.
pg12 = ["pgrx/pg12", "pgrx-tests/pg12"]

Comment thread src/token_filter/ngram.rs
impl TokenFilter for Ngram {
fn apply(&self, token: String) -> Vec<String> {
let mut results = Vec::new();
let len = token.len();
Copy link

Copilot AI May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the token length is less than min_gram, the loop 'for i in 0..=(len - self.config.min_gram)' will underflow and panic. Consider adding a guard that returns an empty vector when token.len() < self.config.min_gram.

Suggested change
let len = token.len();
let len = token.len();
if len < self.config.min_gram {
return results;
}

Copilot uses AI. Check for mistakes.
Comment thread src/token_filter/ngram.rs
Comment on lines +64 to +69
pub fn new(config: NgramConfig) -> Self {
if let Err(e) = config.validate() {
panic!("Invalid NgramConfig: {}", e);
}

Ngram { config }
Copy link

Copilot AI May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than panicking on an invalid configuration, consider returning a Result to allow clients to handle configuration errors gracefully.

Suggested change
pub fn new(config: NgramConfig) -> Self {
if let Err(e) = config.validate() {
panic!("Invalid NgramConfig: {}", e);
}
Ngram { config }
pub fn new(config: NgramConfig) -> Result<Self, ValidationError> {
if let Err(e) = config.validate() {
return Err(e);
}
Ok(Ngram { config })

Copilot uses AI. Check for mistakes.
@silver-ymz silver-ymz merged commit 504af4c into supervc-stack:main May 11, 2025
10 checks passed
@silver-ymz silver-ymz deleted the feat/ngram branch May 11, 2025 02:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: ngram tokenizer

2 participants