1875,pg_tiktoken_c,pg_tiktoken_c,RAG,https://github.com/relytcloud/pg_tiktoken_c,Apache-2.0,,1.1,NONE,C,f,t,t,t,f,f,f,,"{17,16,15,14,13}",,,,,,,,,,,,tiktoken tokenizer for PostgreSQL in pure C, 1700x faster than pg_tiktoken (Rust/pgrx),纯C实现的PostgreSQL tiktoken分词器,比Rust版本快1700倍,支持RAG文档切分与Token计数,
Summary
Add pg_tiktoken_c (id: 1875) to the RAG category, positioned after pg_tiktoken (1870).
What is pg_tiktoken_c?
A PostgreSQL extension that implements OpenAI's tiktoken BPE tokenizer in pure C, as a high-performance alternative to pg_tiktoken (Rust/pgrx).
Performance vs pg_tiktoken (Rust/pgrx)
Benchmark on Apple M-series · PostgreSQL 17 · cl100k_base · single connection:
Root cause of the gap: pg_tiktoken re-initialises the BPE encoder on every call (~220 ms overhead), whereas pg_tiktoken_c builds the encoder once per backend and caches it in TopMemoryContext.
Features
- tiktoken_count(encoding, text): token counting
- tiktoken_encode(encoding, text): returns token ID arrays
- chunk_text_table(text, size, overlap): document chunking for RAG pipelines
- Encodings: cl100k_base, o200k_base, r50k_base, p50k_base, p50k_edit
- Model names: gpt-4o, gpt-4, o1, etc.
- IMMUTABLE PARALLEL SAFE: works in indexes, generated columns, parallel queries
Entry added
GitHub: https://github.com/relytcloud/pg_tiktoken_c
License: Apache 2.0
PR: #18
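Illustration of the caching point under "Performance": the sketch below shows the general build-once, reuse-per-process pattern in plain C. It is not code from pg_tiktoken_c. The Encoder struct, build_encoder, get_encoder, and build_count are hypothetical names, and malloc stands in for an allocation that a PostgreSQL extension would typically make with MemoryContextAlloc(TopMemoryContext, ...) so that it outlives individual queries.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a BPE encoder; the real structure is not shown in this PR. */
typedef struct Encoder {
    char name[32];   /* encoding name, e.g. "cl100k_base" */
} Encoder;

static Encoder *cached_encoder = NULL;  /* survives across calls within one process */
static int build_count = 0;             /* number of times the expensive build ran */

/* Expensive in a real tokenizer (vocabulary and merge ranks); trivial here. */
static Encoder *build_encoder(const char *name)
{
    Encoder *enc = malloc(sizeof(*enc));
    if (enc == NULL)
        return NULL;
    strncpy(enc->name, name, sizeof(enc->name) - 1);
    enc->name[sizeof(enc->name) - 1] = '\0';
    build_count++;
    return enc;
}

/* Return the cached encoder; rebuild only when a different encoding is requested. */
static Encoder *get_encoder(const char *name)
{
    assert(name != NULL);
    if (cached_encoder != NULL && strcmp(cached_encoder->name, name) == 0)
        return cached_encoder;          /* fast path: no re-initialisation */

    Encoder *fresh = build_encoder(name);
    if (fresh == NULL)
        return NULL;                    /* keep the old cache entry on failure */
    free(cached_encoder);
    cached_encoder = fresh;
    return cached_encoder;
}
```

Repeated calls to get_encoder with the same encoding name return the same pointer and leave build_count at 1, which is the behaviour the benchmark gap is attributed to. This single-slot cache is a simplification; an extension that serves several encodings in one session would more plausibly keep one entry per encoding.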