feat(openchat): Mistral-7B inference engine (GQA + RoPE + RMSNorm + SiLU)

OpenChat 3.5 / Mistral-7B architecture, fully distinct from GPT-2:
- GQA: 32 query heads share 8 KV heads (4:1 ratio, 75% KV cache savings)
- RoPE: rotary positional embeddings (no learned positions)
- RMSNorm: simpler normalization without mean subtraction (both in models::layers)
- SiLU: gated MLP (gate * up → down) with F32x16 element-wise SIMD
- GGUF weight loading via hpc::gguf (Q4_K_M + Q4_0 dequantization added)
- CausalEdge64 emission from attention patterns
- OpenChat chat template (GPT4 Correct User/Assistant markers)
- /v1/chat/completions API types

All ops go through crate::simd::F32x16 via models::layers. No weights are stored; they are loaded at runtime from a user-provided GGUF file. 15 tests passing, 77 total across the new modules.
https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7#45
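The two Mistral-specific layer ops can be sketched in plain scalar Rust (an illustrative stand-in for the crate's F32x16 path; `rms_norm`, `silu`, and `gated_mlp` are hypothetical names here, not the models::layers API):

```rust
// RMSNorm: x * w / sqrt(mean(x^2) + eps). Unlike LayerNorm, no mean
// is subtracted — only the root-mean-square is normalized away.
fn rms_norm(x: &[f32], w: &[f32], eps: f32) -> Vec<f32> {
    let ms: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (ms + eps).sqrt();
    x.iter().zip(w).map(|(v, g)| v * inv * g).collect()
}

// SiLU: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

// Gated MLP core: silu(gate(x)) * up(x), fed to the down projection.
// The projections are precomputed vectors here to keep the sketch small.
fn gated_mlp(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(g, u)| silu(*g) * u).collect()
}

fn main() {
    let y = rms_norm(&[1.0, -2.0, 3.0, -4.0], &[1.0; 4], 1e-5);
    // With unit weights, the mean square of the output is ~1.
    let ms: f32 = y.iter().map(|v| v * v).sum::<f32>() / y.len() as f32;
    println!("mean square after rms_norm: {ms}");
    println!("gated: {:?}", gated_mlp(&[0.0, 10.0], &[5.0, 1.0]));
}
```

In the real engine the element-wise loops above are the parts that map onto F32x16 lanes.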
Merged
…lette

Extracted from openai-community/gpt2 model.safetensors (522.7 MB):
- wte.weight: [50257, 768] f32 → Base17 golden-step projection
- gpt2_base17_50k.bin: 1,669 KB (50,257 tokens × 34 B = 320× compression)
- gpt2_palette_50k.bin: 58 KB (256 centroids + 50K assignments = 9,294×)

GPT-2 uses the SAME BPE tokenizer as Jina v4, so palette indices are DIRECTLY COMPATIBLE: CausalEdge64 edges from GPT-2 and Jina use the SAME S/P/O palette space. Zero mapping overhead.

Total weights in ndarray/src/hpc/jina/weights/: 3.4 MB
- Jina v4 (20K tokens): 694 KB
- GPT-2 (50K tokens): 1,727 KB
- COCA vocabulary: 997 KB

All three models share the same Base17 codec. Loaded via LazyLock at startup. No external deps. No GPU.
https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
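The stated sizes and ratios can be sanity-checked with a few lines of arithmetic, assuming the compression ratios are measured against the full 522.7 MB safetensors file (the palette figure comes out slightly under the stated 9,294×, so the palette binary is presumably a bit under 58 KB):

```rust
// Convert a byte count to KB.
fn kb(bytes: f64) -> f64 {
    bytes / 1024.0
}

fn main() {
    // 50,257 tokens × 34 bytes each ≈ 1,669 KB, matching the stated size.
    let base17_kb = kb(50_257.0 * 34.0);
    println!("base17 file: {base17_kb:.0} KB");

    let model_kb = 522.7 * 1024.0; // full model.safetensors
    // ≈ 320× for the Base17 file, ≈ 9,200× for the 58 KB palette.
    println!("base17 ratio:  {:.0}x", model_kb / base17_kb);
    println!("palette ratio: {:.0}x", model_kb / 58.0);
}
```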
…palette

Extracted from google-bert/bert-base-uncased (420 MB safetensors):
- bert.embeddings.word_embeddings.weight: [30522, 768] f32
- bert_base17_30k.bin: 1,013 KB (30,522 tokens × 34 B = 424× compression)
- bert_palette_30k.bin: 38 KB (256 centroids + 30K assignments = 11,225×)

BERT (WordPiece) uses a DIFFERENT tokenizer than GPT-2/Jina (BPE), so cross-model palette alignment needs a mapping table. BERT captures BIDIRECTIONAL context, complementing GPT-2's autoregressive view.

Total weights: 4.5 MB (3 models + COCA vocabulary)
- Jina v4: 694 KB (20K tokens, 2048D→17D)
- GPT-2: 1,727 KB (50K tokens, 768D→17D)
- BERT: 1,052 KB (30K tokens, 768D→17D)
- COCA: 997 KB (20K academic vocabulary)

All three models: same Base17 codec, same palette format, same CausalEdge64.
https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
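The mapping table the commit calls for could be as simple as keying on identical surface strings; this is a hypothetical sketch (none of these names are in the crate), and tokens with no exact match would still need subword splitting or a nearest-centroid fallback:

```rust
use std::collections::HashMap;

// Hypothetical alignment table: BERT (WordPiece) token id → GPT-2 palette
// index, for the subset of tokens whose surface strings match exactly.
fn build_alignment(
    bert_vocab: &[(&str, u32)],
    gpt2_palette: &HashMap<&str, u8>,
) -> HashMap<u32, u8> {
    bert_vocab
        .iter()
        .filter_map(|(tok, id)| gpt2_palette.get(tok).map(|p| (*id, *p)))
        .collect()
}

fn main() {
    let gpt2: HashMap<&str, u8> = [("cat", 3), ("dog", 7)].into_iter().collect();
    // "##s" is a WordPiece continuation token with no BPE counterpart.
    let bert = [("cat", 100), ("##s", 101), ("dog", 102)];
    let map = build_alignment(&bert, &gpt2);
    println!("aligned {} of {} BERT tokens", map.len(), bert.len());
}
```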
…odec
runtime.rs (300 lines, 11 tests):
- LazyLock<ModelRuntime> for Jina v4, GPT-2, BERT
- Weights embedded via include_bytes! (zero file I/O)
- Loaded once, used forever: 4.5 MB total in the binary

Full codec chain per model:
- HHTL cascade: heel_distance() → cascade_distance() (early exit)
- SimilarityTable: heel_similarity() → calibrated f32 in [0,1]
- CAM-PQ: cam_fingerprint() → 6-byte [palette, dim0..4]
- CausalEdge64: pack_spo_edge() → u64 with NARS + Pearl + temporal
- Base17 LEAF: leaf_distance() → full-resolution L1

The SimilarityTable is built from the EXACT 256×256 palette distance CDF. This IS the bgz17 pattern: empirical distribution → calibrated lookup.
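The CDF step can be sketched as follows (a minimal stand-in over a handful of values; the real table is built over all 256×256 palette distances):

```rust
// bgz17-style calibration: map each raw distance to 1 minus its empirical
// CDF rank, so the smallest observed distance → similarity 1.0 and the
// largest → 0.0. Assumes at least two distinct distances.
fn calibrate(distances: &[f32]) -> Vec<f32> {
    let mut sorted: Vec<f32> = distances.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    distances
        .iter()
        .map(|d| {
            // Number of observed distances strictly below d.
            let rank = sorted.partition_point(|x| x < d);
            1.0 - rank as f32 / (sorted.len() - 1) as f32
        })
        .collect()
}

fn main() {
    let sims = calibrate(&[0.0, 1.0, 2.0, 3.0, 4.0]);
    println!("{sims:?}"); // monotonically decreasing from 1.0 to 0.0
}
```

Because the output is a percentile, the resulting similarities are uniformly spread over [0,1] regardless of the raw distance scale, which is what makes the lookup "calibrated".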
Usage:
    use ndarray::hpc::jina::runtime::{JINA, GPT2, BERT};
    let sim = GPT2.heel_similarity(token_a, token_b);     // O(1), calibrated
    let edge = GPT2.pack_spo_edge(s, p, o, 0.8, 0.6, 42); // CausalEdge64
    let fp = BERT.cam_fingerprint(token);                 // 6-byte CAM-PQ
23 tests passing (12 codec + 11 runtime).
https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
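For intuition, here is one plausible way pack_spo_edge could lay out its u64; the widths below are assumptions for illustration, not the crate's actual CausalEdge64 field layout:

```rust
// Hypothetical CausalEdge64 layout (the real bit layout may differ):
//   bits 56..=63  subject palette index (u8)
//   bits 48..=55  predicate palette index (u8)
//   bits 40..=47  object palette index (u8)
//   bits 32..=39  NARS frequency, quantized to 0..=255
//   bits 24..=31  NARS confidence, quantized to 0..=255
//   bits  0..=23  temporal tag
fn pack_edge(s: u8, p: u8, o: u8, freq: f32, conf: f32, t: u32) -> u64 {
    let q = |x: f32| (x.clamp(0.0, 1.0) * 255.0).round() as u64;
    (s as u64) << 56
        | (p as u64) << 48
        | (o as u64) << 40
        | q(freq) << 32
        | q(conf) << 24
        | (t as u64 & 0x00FF_FFFF)
}

fn unpack_spo(edge: u64) -> (u8, u8, u8) {
    ((edge >> 56) as u8, (edge >> 48) as u8, (edge >> 40) as u8)
}

fn main() {
    let e = pack_edge(17, 4, 99, 0.8, 0.6, 42);
    println!("spo = {:?}, raw = {e:#018x}", unpack_spo(e));
}
```

Packing the whole SPO triple plus truth values into one u64 is what keeps edge emission allocation-free.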
GPT-2 small (124M) forward pass with KV cache, all transcendentals via crate::simd::F32x16 (LayerNorm, GELU, softmax, dot products).
- weights.rs: safetensors loader for the 12 transformer layers
- inference.rs: autoregressive generation with temperature sampling
- api.rs: OpenAI-compatible request/response types (/v1/completions, /v1/embeddings, /v1/models), transport-agnostic
- 9 tests passing (layer_norm, GELU, softmax, config, API types)
https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
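Temperature sampling as described above can be sketched as (function names are illustrative, not the inference.rs API):

```rust
// Softmax with temperature: divide logits by T before exponentiating.
// Subtracting the max first keeps exp() numerically stable; as T → 0 the
// distribution sharpens toward greedy argmax.
fn softmax_with_temperature(logits: &[f32], temp: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| ((l - max) / temp).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

// Invert the CDF: walk the cumulative sum until it exceeds a uniform
// random draw r in [0, 1).
fn sample(probs: &[f32], r: f32) -> usize {
    let mut acc = 0.0;
    for (i, p) in probs.iter().enumerate() {
        acc += p;
        if r < acc {
            return i;
        }
    }
    probs.len() - 1
}

fn main() {
    let p = softmax_with_temperature(&[2.0, 1.0, 0.0], 1.0);
    println!("T=1.0: {p:?}, picked {}", sample(&p, 0.5));
}
```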
Weight matrices pre-transposed from [in_dim, out_dim] to [out_dim, in_dim] during safetensors loading. matmul_vec_simd now reads contiguous rows via F32x16::from_slice + mul_add — full SIMD utilization (768D = 48 × F32x16). https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
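The layout change can be shown with a scalar sketch (a stand-in for matmul_vec_simd, processing the same contiguous chunks of 16 that F32x16 would load):

```rust
// Matrix-vector product with w stored row-major as [out_dim, in_dim]:
// each output element reads exactly one contiguous row, so a SIMD loader
// can stream full 16-lane slices (e.g. 768 = 48 × 16) with no gather.
fn matmul_vec(w: &[f32], x: &[f32], out_dim: usize) -> Vec<f32> {
    let in_dim = x.len();
    (0..out_dim)
        .map(|r| {
            let row = &w[r * in_dim..(r + 1) * in_dim];
            row.chunks(16)
                .zip(x.chunks(16))
                .map(|(a, b)| a.iter().zip(b).map(|(u, v)| u * v).sum::<f32>())
                .sum()
        })
        .collect()
}

fn main() {
    // 2x3 matrix (rows [1,0,0] and [0,1,1]) times [2,3,4].
    let w = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0];
    println!("{:?}", matmul_vec(&w, &[2.0, 3.0, 4.0], 2));
}
```

Had the matrix stayed [in_dim, out_dim], each output element would stride through memory with step out_dim, defeating from_slice-style vector loads; the one-time transpose at load moves that cost out of the hot loop.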
Full integration with the tensor codec pipeline:
- AttentionTable: palette-based O(1) approximate attention via jina::runtime::GPT2 (256×256 HEEL distance table)
- CausalEdge64 emission: attention patterns packed as SPO edges with NARS truth values (subject=query, predicate=head, object=key)
- HHTL cascade: token_similarity(), token_distance_leaf(), token_distance_cascade() methods on Gpt2Engine
- CAM-PQ: 6-byte token fingerprints via cam_fingerprint()

Both features are opt-in flags (use_attention_table, emit_causal_edges) to avoid overhead when not needed. 14 tests passing.
https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
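The AttentionTable idea reduces to a two-index lookup; this hypothetical struct (the real table lives behind jina::runtime::GPT2) shows why the cost is O(1) per query/key pair:

```rust
// Palette-based approximate attention: instead of a 768-dim dot product
// per (query, key) pair, look up a precomputed similarity for the two
// tokens' 8-bit palette indices — constant time, independent of hidden size.
struct AttentionTable {
    table: [[f32; 256]; 256],
}

impl AttentionTable {
    fn score(&self, q_palette: u8, k_palette: u8) -> f32 {
        self.table[q_palette as usize][k_palette as usize]
    }
}

fn main() {
    // Toy table: every palette entry matches only itself.
    let mut table = [[0.0f32; 256]; 256];
    for i in 0..256 {
        table[i][i] = 1.0;
    }
    let at = AttentionTable { table };
    println!("same: {}, different: {}", at.score(5, 5), at.score(5, 6));
}
```

At 256×256 f32 the whole table is 256 KB, small enough to stay cache-resident across a generation loop.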
Extract shared code into hpc/models/:
- safetensors.rs: generic file loader (used by GPT-2, SD, BERT)
- layers.rs: SIMD ops (layer_norm, gelu, silu, group_norm, softmax, matmul_vec, dot_product), all via crate::simd::F32x16
- api_types.rs: OpenAI-compatible envelope (Usage, FinishReason, etc.)

Add hpc/stable_diffusion/ scaffold (code only, no weights):
- clip.rs: CLIP text encoder (same transformer as GPT-2, shared layers)
- unet.rs: UNet denoiser with Conv2D, GroupNorm, SiLU, timestep embedding
- vae.rs: VAE decoder (latent→RGB)
- scheduler.rs: DDIM noise scheduler with precomputed alpha schedule
- weights.rs: safetensors loader for SD CLIP weights
- api.rs: /v1/images/generations with full pipeline

52 tests passing. Zero weight files: disk-space conscious.
https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
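The precomputed alpha schedule mentioned for scheduler.rs can be sketched as follows. This uses plain linear betas between two assumed bounds (the classic SD values 0.00085..0.012); Stable Diffusion's scheduler actually uses a scaled-linear variant that interpolates in sqrt space, so treat this as the shape of the computation, not the exact numbers:

```rust
// Precompute alpha_cumprod[t] = prod_{i <= t} (1 - beta_i) for a linear
// beta ramp from beta_start to beta_end. Assumes steps >= 2. The DDIM
// update at inference only ever reads these precomputed products.
fn alpha_cumprod(steps: usize, beta_start: f32, beta_end: f32) -> Vec<f32> {
    let mut acc = 1.0;
    (0..steps)
        .map(|t| {
            let beta = beta_start
                + (beta_end - beta_start) * t as f32 / (steps - 1) as f32;
            acc *= 1.0 - beta;
            acc
        })
        .collect()
}

fn main() {
    let a = alpha_cumprod(1000, 0.00085, 0.012);
    // Monotonically decreasing: later timesteps carry less signal.
    println!("first = {:.5}, last = {:.5}", a[0], a[999]);
}
```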