[Extension Request] pg_tiktoken_c - pure C tiktoken, 1700x faster than pg_tiktoken #19

@fasdwcx

Summary

Add pg_tiktoken_c (id: 1875) to the RAG category, positioned after pg_tiktoken (1870).

What is pg_tiktoken_c?

A PostgreSQL extension that implements OpenAI's tiktoken BPE tokenizer in pure C, as a high-performance alternative to pg_tiktoken (Rust/pgrx).

Performance vs pg_tiktoken (Rust/pgrx)

Benchmark on Apple M-series · PostgreSQL 17 · cl100k_base · single connection:

| Text size | pg_tiktoken_c (C) | pg_tiktoken (Rust) | Speedup |
|---|---|---|---|
| Short (~3 tok) | 11,061 rows/s, 86 µs | 4 rows/s | 2,765× |
| Medium (~60 tok) | 6,779 rows/s, 141 µs | 4 rows/s | 1,695× |
| Long (~500 tok) | 1,202 rows/s, 810 µs | 4 rows/s | 301× |

Root cause of the gap: pg_tiktoken re-initialises the BPE encoder on every call (~220 ms of overhead per call), whereas pg_tiktoken_c builds the encoder once per backend and caches it in TopMemoryContext.
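
To illustrate the caching idea, here is a minimal, self-contained sketch of a build-once, reuse-afterwards lookup. It is not the extension's actual code: the `Encoder` struct, `build_encoder`, `get_encoder`, and `build_count` are hypothetical names, and plain `malloc` stands in for an allocation from TopMemoryContext so the example compiles without PostgreSQL headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a BPE encoder; the real structure is not shown here. */
typedef struct Encoder {
    char name[32];
} Encoder;

static Encoder *cached_encoder = NULL; /* survives across calls in this process */
static int build_count = 0;            /* counts how often the expensive build ran */

/* Simulates the expensive step (loading vocabulary and merge ranks). */
static Encoder *build_encoder(const char *name)
{
    /* Assumption: a real extension would allocate from TopMemoryContext instead. */
    Encoder *enc = malloc(sizeof *enc);
    if (enc == NULL)
        return NULL;
    strncpy(enc->name, name, sizeof enc->name - 1);
    enc->name[sizeof enc->name - 1] = '\0';
    build_count++;
    return enc;
}

/* Returns the cached encoder, rebuilding only when the encoding name changes. */
Encoder *get_encoder(const char *name)
{
    assert(name != NULL);
    if (cached_encoder != NULL && strcmp(cached_encoder->name, name) == 0)
        return cached_encoder; /* cache hit: no rebuild */

    Encoder *fresh = build_encoder(name);
    if (fresh == NULL)
        return NULL; /* keep the old cache entry on allocation failure */
    free(cached_encoder);
    cached_encoder = fresh;
    return cached_encoder;
}
```

With this shape, repeated calls for the same encoding pay the build cost only once per process. This sketch keeps a single cache slot; a real implementation might instead keep one entry per encoding.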

Features

  • tiktoken_count(encoding, text) — token counting
  • tiktoken_encode(encoding, text) — returns token ID arrays
  • chunk_text_table(text, size, overlap) — document chunking for RAG pipelines
  • All major OpenAI encodings: cl100k_base, o200k_base, r50k_base, p50k_base, p50k_edit
  • Model name aliases (gpt-4o, gpt-4, o1, etc.)
  • IMMUTABLE PARALLEL SAFE — works in indexes, generated columns, parallel queries
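
For readers unfamiliar with size/overlap chunking, the sketch below shows one common way to cut a token sequence into overlapping windows. It is an assumption about the general technique, not a description of how chunk_text_table is implemented; the `Chunk` type and `chunk_windows` function are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open token range [start, end) covered by one chunk. */
typedef struct Chunk {
    size_t start;
    size_t end;
} Chunk;

/*
 * Splits n_tokens tokens into windows of at most `size` tokens, where each
 * window shares `overlap` tokens with the previous one.
 * Writes up to max_out chunks into `out` and returns how many were written.
 * Returns 0 for invalid parameters (size == 0 or overlap >= size).
 */
size_t chunk_windows(size_t n_tokens, size_t size, size_t overlap,
                     Chunk *out, size_t max_out)
{
    assert(out != NULL || max_out == 0);
    if (size == 0 || overlap >= size)
        return 0;

    size_t step = size - overlap; /* distance between window starts */
    size_t count = 0;

    for (size_t start = 0; start < n_tokens && count < max_out; start += step) {
        size_t end = start + size;
        if (end > n_tokens)
            end = n_tokens; /* last window may be shorter */
        out[count].start = start;
        out[count].end = end;
        count++;
        if (end == n_tokens)
            break; /* the tail is covered; avoid a redundant trailing window */
    }
    return count;
}
```

For example, 10 tokens with size 4 and overlap 1 yield three windows: [0, 4), [3, 7), and [6, 10).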

Entry added

1875,pg_tiktoken_c,pg_tiktoken_c,RAG,https://github.com/relytcloud/pg_tiktoken_c,Apache-2.0,,1.1,NONE,C,f,t,t,t,f,f,f,,"{17,16,15,14,13}",,,,,,,,,,,,"tiktoken tokenizer for PostgreSQL in pure C, 1700x faster than pg_tiktoken (Rust/pgrx)","纯C实现的PostgreSQL tiktoken分词器,比Rust版本快1700倍,支持RAG文档切分与Token计数",
  • repo: NONE — source-only install for now; happy to update once packages are available
  • lang: C — pure C, no pgrx/Rust dependency
  • pg_ver: 13-17 — tested on all supported versions

GitHub: https://github.com/relytcloud/pg_tiktoken_c
License: Apache 2.0


PR: #18
