release(0.23.2): merge release branch + post-release smoke fixups into develop by michalharakal · Pull Request #132 · SKaiNET-developers/SKaiNET-transformers

michalharakal · 2026-05-05T07:58:22Z

Summary

Brings chore/release-0.23.2 (already tagged as 0.23.2 on commit 6eec93a) plus three post-release smoke-tooling fixes onto develop.

Release content (tagged 0.23.2)

edb366c fix(tool-calling): tolerate markdown code fences around Llama 3 JSON
5b34ee9 fix(kllama-cli): route Llama GGUF/SafeTensors back to eager LlamaRuntime — recovers Llama 3 tool-calling functionality at the cost of staying on the legacy path until the DSL gets first-class Q4/Q8 DTypes
5c3b9fa feat(kllama-cli): log prompts, raw responses, and tools list in ToolCallingDemo
74c1416 fix(llama): inject logical 2D shape and dequant token_embd in DSL converter (now Qwen-only)
1e7af50 test(smoke): add Llama-3.2-1B-Instruct entry with tool-calling assertion
cea3173 docs(tool-calling): end-to-end Llama 3 setup walkthrough for app integrators
40200da chore(api): refresh public API dumps for 0.23.2
6eec93a release: bump version to 0.23.2

Post-release smoke fixups (not in the 0.23.2 tag, no published-artifact change)

412d0b6 fix(kbert-cli): apply application plugin so :run task is wired
7abe110 fix(smoke): tolerate runners that don't emit tok/s (embedding models)
70e936d test(smoke): add MongoDB/mdbr-leaf-ir embedding entry

These three only affect tests/smoke/ and kbert-cli's build script — they don't change anything published to Maven Central.

Test plan

:llm-agent:jvmTest, :llm-inference:llama:jvmTest, :llm-runtime:kllama:jvmTest — green
Parser regression tests for fenced JSON (3 new cases) — green
apiCheck — green (dumps refreshed via apiDump)
Smoke: Llama-3.2-1B-Instruct chat (0.37 t/s) + tool calling ([Tool Call] calculator → 4.0)
Smoke: MongoDB/mdbr-leaf-ir embedding (cosine 0.78 between two MongoDB-related sentences, 384-dim, ~290 ms/encode)
CI on this PR

Known followups (called out in the 0.23.2 tag annotation)

Recover the previous ~2 t/s baseline on Llama Q8 — needs first-class Q4/Q8 DTypes in the DSL or per-call SIMD dispatch in ops.matmul.
Bisect the residual perf gap on the eager path (upstream skainet pinned at 0.23.1, so it's not an upstream backend bump).

🤖 Generated with Claude Code

Llama 3.2 1B Instruct sometimes wraps its tool-call JSON in a triple-backtick fence (```...``` or ```json...```) even though the system prompt instructs bare JSON. Llama31ToolCallParserStrategy required candidate.startsWith("{"), so the fenced form silently parsed as "no tool call" and the agent loop returned the raw JSON to the user instead of executing the tool.

Add a stripCodeFence step that peels one layer of opening/closing fence before the existing python-tag strip + balanced-brace extraction. Both parse() and containsToolCall() use it so dispatch and detection stay in lockstep.

Pinned by three new ToolCallParserTest cases (plain ```...```, ```json tagged fence, and containsToolCall on fenced input).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
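The fence-peeling step described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual stripCodeFence from Llama31ToolCallParserStrategy: the function body, the handling of the language tag, and the single-layer behaviour are all assumptions reconstructed from the commit message.

```kotlin
// Hypothetical sketch of a "peel one layer of code fence" helper.
// The real stripCodeFence in the repository may differ in details.
fun stripCodeFence(raw: String): String {
    val fence = "`".repeat(3)
    val text = raw.trim()
    // Only act when the whole candidate is wrapped in an opening and a closing fence.
    if (text.length < fence.length * 2 || !text.startsWith(fence) || !text.endsWith(fence)) {
        return text
    }
    var inner = text.substring(fence.length, text.length - fence.length)
    // Drop an optional language tag (e.g. "json") that sits on the opening fence line.
    val firstLineEnd = inner.indexOf('\n')
    if (firstLineEnd >= 0) {
        val tag = inner.substring(0, firstLineEnd).trim()
        if (tag.all { it.isLetterOrDigit() }) {
            inner = inner.substring(firstLineEnd + 1)
        }
    }
    return inner.trim()
}
```

With a helper along these lines, a fenced candidate passes the existing startsWith("{") check once the fence is removed, while bare JSON is returned unchanged apart from trimming.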


Reverts the Llama branch of d519eb2 ("swap Llama GGUF + SafeTensors to DSL path") in the JVM kllama-cli.

The DSL path (DecoderGgufWeightLoader → DecoderGgufMemSegConverter → LlamaNetworkLoader.fromWeights → OptimizedLLMRuntime DIRECT) is functionally correct but pays a per-linearProject ops.transpose tax on packed Q4/Q8 weights: the kernel still has to sniff the marker class on every call because the DSL doesn't have first-class Q4/Q8 DTypes yet. Measured 0.24 t/s on Llama-3.2-1B-Instruct-Q8 vs the legacy LlamaRuntime path that hits the SIMD quant matmul kernel directly.

Restore the pre-d519eb2 wiring: LlamaIngestion(NATIVE_OPTIMIZED, allowQuantized=true) → MemSegWeightConverter.convert → CpuAttentionBackend → LlamaRuntime<FP32> (with @Suppress("DEPRECATION")). Fold the BIN branch into the same `else` block since it was already on LlamaRuntime anyway. Re-add the LlamaIngestion + LlamaLoadConfig + MemSegWeightConverter imports; drop DecoderSafeTensorsLoader + LlamaNetworkLoader from imports.

Qwen GGUF stays on the DSL path (unchanged); its Q8 perf is acceptable on smaller batches and the parity test pins it.

Recovers tool calling functionality on Llama 3.2: the CLI smoke test now emits [Tool Call] calculator(...) → [Tool Result] 4.0 cleanly.

Followup: either give the DSL first-class Q4/Q8 DTypes so linearProject can dispatch directly, or push the SIMD kernel selection deeper into ops.matmul so the per-call transpose disappears. Tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…allingDemo

The smoke test for tool calling previously printed only [Tool Call] and [Tool Result] markers, which made it hard to debug a missed call (was the prompt malformed? did the model emit something the parser couldn't see?).

ToolCallingDemo.runSingleShot now also prints, before generation:
- [Tools] block with each tool's name, description, and JSON schema
- [Prompt → Round 1] with the full chat-template-rendered prompt the model will actually receive (system + tools + user)

…and during the agent loop the listener prints:
- [Raw Assistant → Round N] with the model's exact output per round
- [Tool Call Invalid] when a tool call fails JSON-Schema validation

…and after the loop completes:
- [Final Conversation] with every message in the accumulated history, so post-round-1 prompts can be reconstructed by the reader.

No behavior change to the agent loop itself; this is pure observability on the demo path. The smoke-test.sh greps still match [Tool Call] / [Tool Result] so PASS/FAIL accounting is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…verter

DecoderGgufMemSegConverter wrapped Q4_0/Q8_0 GGUF tensors in Q4MemorySegmentTensorData / Q8MemorySegmentTensorData using the loader's intermediate Int8 byte-count Shape (1D = bytes.size). The DSL's linearProject calls ops.transpose(weight) before matmul, and transpose needs the logical 2D shape [out, in] from metadata to dispatch the quant-aware kernel. Previously the kernel either rejected the 1D shape or silently fell back to a generic path.

Compute the logical [out, in] shape per tensor name (attn_q/k/v/output, ffn_gate/up/down, token_embd, output) from LlamaModelMetadata (embeddingLength, headCount, kvHeadCount, feedForwardLength, vocabSize) and pass it to the Q4/Q8 MemSeg wrappers. K-quants get the same logical shape on dequant.

Special-case token_embd.weight: the Embedding layer consumes it via gather (row indexing), not matmul. Packed Q4/Q8 bytes can't be gathered as floats, and the loader's Int8 byte-count shape is rejected by gather even after wrapping. Always dequantize token_embd to FP32 with the [vocab, dim] shape regardless of quant type. (Tied output.weight stays Q-packed because it's used by the LM head matmul, not gather.)

After the JVM kllama-cli Llama branch reverted to LlamaRuntime in the previous commit, this converter is exercised only by the Qwen GGUF branch and the Java facade KLlamaJava.loadGGUF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
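To make the per-tensor shape rule concrete, here is a minimal sketch of how a logical [out, in] shape could be derived from the metadata fields listed above. LlamaDims and logicalShape are hypothetical names invented for this illustration, not the converter's real API, and the sketch assumes the usual Llama layout where the head dimension equals embeddingLength / headCount.

```kotlin
// Hypothetical stand-in for the LlamaModelMetadata fields named in the commit message.
data class LlamaDims(
    val embeddingLength: Int,
    val headCount: Int,
    val kvHeadCount: Int,
    val feedForwardLength: Int,
    val vocabSize: Int,
)

// Returns the logical [out, in] shape for a GGUF tensor name,
// or null when the name is not one of the 2D projections handled here.
fun logicalShape(name: String, d: LlamaDims): List<Int>? {
    // Assumption: head dimension is embeddingLength / headCount.
    val headDim = d.embeddingLength / d.headCount
    val kvDim = headDim * d.kvHeadCount
    return when {
        // Embedding table and (possibly tied) LM head: [vocab, dim].
        name == "token_embd.weight" || name == "output.weight" ->
            listOf(d.vocabSize, d.embeddingLength)
        name.endsWith("attn_q.weight") || name.endsWith("attn_output.weight") ->
            listOf(d.embeddingLength, d.embeddingLength)
        // Grouped-query attention: K and V project to kvHeadCount heads only.
        name.endsWith("attn_k.weight") || name.endsWith("attn_v.weight") ->
            listOf(kvDim, d.embeddingLength)
        name.endsWith("ffn_gate.weight") || name.endsWith("ffn_up.weight") ->
            listOf(d.feedForwardLength, d.embeddingLength)
        name.endsWith("ffn_down.weight") ->
            listOf(d.embeddingLength, d.feedForwardLength)
        else -> null
    }
}
```

Norm weights and other 1D tensors fall through to null in this sketch, so a caller would leave their shapes untouched.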

Pin the Llama 3.2 chat + tool-calling path in the smoke suite so the JSON-mode parser fix and the eager-LlamaRuntime revert can't silently regress. Tool-calling prompt is "What is 2 + 2?" with 256 max steps; the smoke runner asserts a [Tool Call] line is emitted. Path resolves under MODELS_ROOT or the user's ~/.cache/standapp/models; absent locally it reports FAIL (model path not found) cleanly without masking other failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…integrators

Rewrites llama3-tool-calling.adoc into a getting-started for embedding Llama 3 / 3.1 / 3.2 tool calling into a consumer Kotlin app, on top of the existing format-internals reference.

New sections:
- Quick start: actual CLI output a user sees ([Tools], [Prompt → Round 1], [Raw Assistant], [Tool Call], [Tool Result]) so the goal is concrete.
- Use it from your own Kotlin app: 5 numbered steps:
  1. Add sk.ainet.transformers:llm-runtime-kllama + llm-agent (0.23.2) + --enable-preview --add-modules jdk.incubator.vector
  2. KLlamaJava.loadGGUF(Path.of(...)): runtime + tokenizer in one call
  3. Define a tool: full WeatherTool example with ToolDefinition + JSON schema + execute()
  4. ChatSession(runtime, tokenizer, ModelMetadata(family="llama")) → ToolRegistry → createAgentLoop → runWithEncoder
  5. AgentListener snippet for prompt/answer/tool-call/tool-result logs; how to render the prompt yourself via chat.chatTemplate.apply(...)
- Verify it's working: exact callback sequence, what to file if it breaks.
- NOTE: Llama31ToolCallParserStrategy peels markdown code fences (this release's parser fix); also added to the "Parser accepts" bullet list.

The pre-existing format reference (Llama3ToolFormat.JSON / FUNCTION_TAG, picking format programmatically, why two formats exist, model-size caveat, related files) is preserved verbatim below the new walkthrough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Regenerates the binary-compatibility-validator dumps via apiDump to match the actual public surface of every module. Catches up the dumps to the accumulated changes on the chore/release-0.23.2 branch:
- llm-runtime/kllama: drops GpuAttentionBackend, GpuTensorBridge, LlamaIngestionBlocking; adds JavaTools.definition; KLlamaSession constructor takes InferenceRuntime instead of LlamaRuntime.
- llm-providers: SkaiNetChatModel constructor takes a Set parameter where there used to be an Int (additional generation knobs).
- llm-inference/llama: CpuAttentionBackend constructor takes a RopeType (Llama vs Qwen RoPE selection).
- llm-inference/voxtral: new module, first API dump.
- llm-agent, llm-core, apertus, bert, gemma, qwen, kgemma: assorted additions/cleanups from the DSL swaps and shared decoder body refactors merged into this release branch.

No new source-level public API changes in this commit: it is purely the catch-up dump from prior commits + the new voxtral module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Final commit on the chore/release-0.23.2 branch. Highlights since 0.23.1:
- feat(kllama-cli/native): swap CLI to DSL path for Llama + Qwen (#125, #129, #130) on native and wasm targets.
- cleanup(gpu): delete placeholder GPU attention/tensor stubs that always fell back to CPU; rename native benchmark scenario to native-cpu-throughput (#131).
- feat(llm-core): wire SentencePiece decorator + GGUF tokenizer route through upstream sk.ainet.io.tokenizer; fix Qwen / GPT-2 BPE GGUFs (#52, #124).
- fix(qwen): NEOX (SPLIT_HALF) RoPE pairing for Qwen3 GGUFs.
- fix(transformer): thread metadata RMSNorm eps through QK-norm.
- cleanup: delete :llm-runtime:kqwen module; remove LlamaIngestionBlocking.kt.
- feat(kllama-java): swap KLlamaJava facade to DSL path.
- fix(tool-calling): tolerate markdown code fences around Llama 3 JSON tool calls.
- fix(kllama-cli): route Llama GGUF/SafeTensors back to eager LlamaRuntime for now. The DSL Q4/Q8 path is functionally correct but needs first-class Q4/Q8 DTypes to match the SIMD perf of the legacy path. Tracked as a followup.
- docs: end-to-end Llama 3 tool-calling walkthrough for app integrators.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Without the kotlin application plugin, the project only exposed shadowJar + shadowJarDemo and Gradle reported "task 'run' not found". The smoke test (tests/smoke/smoke-test.sh) dispatches every kbert entry via :llm-apps:kbert-cli:run, so the kbert smoke leg was a no-op masked by "model path not found" on the absent all-MiniLM-L6-v2.gguf fixture. Apply the application plugin and pin mainClass to sk.ainet.apps.bert.cli.MainKt — same pattern as :llm-apps:kllama-cli. The existing JavaExec configuration already adds --enable-preview + jdk.incubator.vector, so :run inherits the right JVM args. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
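For readers unfamiliar with the pattern, the change described here amounts to something like the following build-script fragment. This is a sketch assuming the module uses a Gradle Kotlin DSL build file (build.gradle.kts) with a JVM Kotlin plugin already applied; the actual kbert-cli build script is not shown in this PR text and may be organized differently.

```kotlin
// Sketch of a build.gradle.kts fragment (assumed layout, not copied from the repository).
plugins {
    // Gradle's built-in application plugin contributes the `run` task.
    application
}

application {
    // Entry point named in the commit message.
    mainClass.set("sk.ainet.apps.bert.cli.MainKt")
}
```

Because the `run` task created by the application plugin is a JavaExec task, any existing configuration applied to JavaExec tasks in the same script also applies to it, which matches the commit's remark about the preview and vector-module JVM args.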

`set -euo pipefail` made the no-match `grep -oE 'tok/s: [0-9.]+'` fatal, killing the script silently right after a successful kbert run because BertRuntime doesn't print throughput. Visible symptom: the script exited cleanly mid-test with no PASS/FAIL line and no summary table. Append `|| true` to the substitution and the secondary `sed | grep | sed` block so a missing tok/s falls back to "?" (the existing default) instead of aborting the whole run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pins the BERT embedding path against an actual model that ships with config.json + vocab.txt + tokenizer.json + 2_Dense projection, the same layout BertNumericalAccuracyTest validates.

Path resolves under the standard HuggingFace cache (~/.cache/huggingface/hub/models--MongoDB--mdbr-leaf-ir/snapshots/<rev>) so any developer who has snapshot-downloaded the revision via huggingface_hub can run it without extra setup.

Prompt + doc are intentionally on-topic ("MongoDB is a NoSQL database" vs "MongoDB stores data in BSON documents") so the cosine similarity shows up high (~0.78 on local validation), a sanity check that the embedding pipeline isn't producing garbage.

Snapshot revision is pinned to the SHA returned by the December 2024 HF index. To refresh:

    uv run --with huggingface_hub python -c \
      "from huggingface_hub import snapshot_download; \
       print(snapshot_download(repo_id='MongoDB/mdbr-leaf-ir'))"

The pre-existing all-MiniLM-L6-v2 entry is left in place; it'll report FAIL (model path not found) on machines that don't have it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
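The cosine sanity check mentioned above reduces to a dot product divided by the two vector norms. A minimal sketch follows; cosineSimilarity is a hypothetical helper written for illustration and is not taken from the smoke runner or BertRuntime.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper: cosine similarity between two embedding vectors of equal length.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size) { "dimension mismatch: ${a.size} vs ${b.size}" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        val x = a[i].toDouble()
        val y = b[i].toDouble()
        dot += x * y
        normA += x * x
        normB += y * y
    }
    // A zero vector has no direction, so the similarity is undefined.
    require(normA > 0.0 && normB > 0.0) { "zero-length vector" }
    return dot / (sqrt(normA) * sqrt(normB))
}
```

Two on-topic sentences should score well above unrelated text, which is why a value around 0.78 between the two MongoDB sentences is treated as evidence that the pipeline produces meaningful embeddings.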

michalharakal and others added 11 commits May 5, 2026 08:58

michalharakal merged commit 0fcd733 into develop May 5, 2026
4 checks passed

michalharakal deleted the chore/release-0.23.2 branch May 5, 2026 08:13

michalharakal merged 11 commits into develop from chore/release-0.23.2
