release(0.23.2): merge release branch + post-release smoke fixups into develop #132
Merged
Llama 3.2 1B Instruct sometimes wraps its tool-call JSON in a triple-
backtick fence (```...``` or ```json...```) even though the system
prompt instructs bare JSON. Llama31ToolCallParserStrategy required
candidate.startsWith("{"), so the fenced form silently parsed as
"no tool call" and the agent loop returned the raw JSON to the user
instead of executing the tool.
Add a stripCodeFence step that peels one layer of opening/closing
fence before the existing python-tag strip + balanced-brace extraction.
Both parse() and containsToolCall() use it so dispatch and detection
stay in lockstep.
Pinned by three new ToolCallParserTest cases (plain ```...```, ```json
tagged fence, and containsToolCall on fenced input).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
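The fence-peeling step described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's actual stripCodeFence: the function body, the `fence` helper, and the sample tool-call JSON are assumptions made for the example.

```kotlin
// Hypothetical sketch: NOT the project's stripCodeFence, only the idea behind it.
// Three backticks, built at runtime so this listing stays fence-safe.
val fence = "`".repeat(3)

fun stripCodeFence(raw: String): String {
    val text = raw.trim()
    val fenced = text.length >= 2 * fence.length &&
        text.startsWith(fence) && text.endsWith(fence)
    if (!fenced) return text

    // Peel exactly one layer: drop the opening and closing markers.
    val inner = text.substring(fence.length, text.length - fence.length)

    // An optional tag such as "json" occupies the rest of the opening line.
    val newline = inner.indexOf('\n')
    if (newline < 0) return inner.trim()
    val tag = inner.substring(0, newline).trim()
    val body = if (tag.all { it.isLetterOrDigit() }) inner.substring(newline + 1) else inner
    return body.trim()
}

fun main() {
    // Illustrative tool call; the exact JSON shape is an assumption.
    val call = """{"name": "calculator", "parameters": {"expression": "2 + 2"}}"""
    val fencedPlain = fence + "\n" + call + "\n" + fence
    val fencedJson = fence + "json\n" + call + "\n" + fence

    // After stripping, every form passes a startsWith("{") gate.
    for (candidate in listOf(call, fencedPlain, fencedJson)) {
        println(stripCodeFence(candidate).startsWith("{"))
    }
}
```

Running the same helper in both the detection and the dispatch path is what keeps the two in agreement about whether a fenced reply counts as a tool call.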
Reverts the Llama branch of d519eb2 ("swap Llama GGUF + SafeTensors to
DSL path") in the JVM kllama-cli.

The DSL path (DecoderGgufWeightLoader → DecoderGgufMemSegConverter →
LlamaNetworkLoader.fromWeights → OptimizedLLMRuntime DIRECT) is
functionally correct but pays a per-linearProject ops.transpose tax on
packed Q4/Q8 weights: the kernel still has to sniff the marker class on
every call because the DSL doesn't have first-class Q4/Q8 DTypes yet.
Measured 0.24 t/s on Llama-3.2-1B-Instruct-Q8 vs the legacy
LlamaRuntime path that hits the SIMD quant matmul kernel directly.

Restore the pre-d519eb2 wiring:
LlamaIngestion(NATIVE_OPTIMIZED, allowQuantized=true) →
MemSegWeightConverter.convert → CpuAttentionBackend →
LlamaRuntime<FP32> (with @Suppress("DEPRECATION")). Fold the BIN branch
into the same `else` block since it was already on LlamaRuntime anyway.
Re-add the LlamaIngestion + LlamaLoadConfig + MemSegWeightConverter
imports; drop DecoderSafeTensorsLoader + LlamaNetworkLoader from imports.

Qwen GGUF stays on the DSL path (unchanged): its Q8 perf is acceptable
on smaller batches and the parity test pins it.

Recovers tool calling functionality on Llama 3.2: the CLI smoke test
now emits [Tool Call] calculator(...) → [Tool Result] 4.0 cleanly.

Followup: either give the DSL first-class Q4/Q8 DTypes so linearProject
can dispatch directly, or push the SIMD kernel selection deeper into
ops.matmul so the per-call transpose disappears. Tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…allingDemo
The smoke test for tool calling previously printed only [Tool Call] and
[Tool Result] markers, which made it hard to debug a missed call (was the
prompt malformed? did the model emit something the parser couldn't see?).
ToolCallingDemo.runSingleShot now also prints, before generation:
- [Tools] block with each tool's name, description, and JSON schema
- [Prompt → Round 1] with the full chat-template-rendered prompt the
model will actually receive (system + tools + user)
…and during the agent loop the listener prints:
- [Raw Assistant → Round N] with the model's exact output per round
- [Tool Call Invalid] when a tool call fails JSON-Schema validation
…and after the loop completes:
- [Final Conversation] with every message in the accumulated history,
so post-round-1 prompts can be reconstructed by the reader.
No behavior change to the agent loop itself; this is pure observability
on the demo path. The smoke-test.sh greps still match [Tool Call] /
[Tool Result] so PASS/FAIL accounting is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…verter

DecoderGgufMemSegConverter wrapped Q4_0/Q8_0 GGUF tensors in
Q4MemorySegmentTensorData / Q8MemorySegmentTensorData using the
loader's intermediate Int8 byte-count Shape (1D = bytes.size). The
DSL's linearProject calls ops.transpose(weight) before matmul, and
transpose needs the logical 2D shape [out, in] from metadata to
dispatch the quant-aware kernel. Previously the kernel either rejected
the 1D shape or silently fell back to a generic path.

Compute the logical [out, in] shape per tensor name (attn_q/k/v/output,
ffn_gate/up/down, token_embd, output) from LlamaModelMetadata
(embeddingLength, headCount, kvHeadCount, feedForwardLength, vocabSize)
and pass it to the Q4/Q8 MemSeg wrappers. K-quants get the same logical
shape on dequant.

Special-case token_embd.weight: the Embedding layer consumes it via
gather (row indexing), not matmul. Packed Q4/Q8 bytes can't be gathered
as floats, and the loader's Int8 byte-count shape is rejected by gather
even after wrapping. Always dequantize token_embd to FP32 with the
[vocab, dim] shape regardless of quant type. (Tied output.weight stays
Q-packed because it's used by the LM head matmul, not gather.)

After the JVM kllama-cli Llama branch reverted to LlamaRuntime in the
previous commit, this converter is exercised only by the Qwen GGUF
branch and the Java facade KLlamaJava.loadGGUF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
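As a rough illustration of the per-tensor shape derivation described in this commit, the sketch below maps a GGUF weight name to its logical [out, in] shape. The `Meta` class and `logicalShape` function are hypothetical stand-ins, not the converter's real types, and the sketch assumes the usual Llama layout where the head dimension equals embeddingLength / headCount.

```kotlin
// Hypothetical stand-in for the metadata fields named in the commit message.
data class Meta(
    val embeddingLength: Int,
    val headCount: Int,
    val kvHeadCount: Int,
    val feedForwardLength: Int,
    val vocabSize: Int,
)

/**
 * Returns the logical [out, in] shape for a GGUF weight name,
 * or null when the name is not one of the known 2D projections.
 * Assumes headDim = embeddingLength / headCount (standard Llama layout).
 */
fun logicalShape(name: String, m: Meta): List<Int>? {
    val headDim = m.embeddingLength / m.headCount
    val kvDim = m.kvHeadCount * headDim
    return when {
        "attn_q." in name || "attn_output." in name ->
            listOf(m.embeddingLength, m.embeddingLength)
        "attn_k." in name || "attn_v." in name ->
            listOf(kvDim, m.embeddingLength)
        "ffn_gate." in name || "ffn_up." in name ->
            listOf(m.feedForwardLength, m.embeddingLength)
        "ffn_down." in name ->
            listOf(m.embeddingLength, m.feedForwardLength)
        name == "token_embd.weight" || name == "output.weight" ->
            listOf(m.vocabSize, m.embeddingLength)
        else -> null
    }
}

fun main() {
    // Illustrative values for a small grouped-query-attention model.
    val m = Meta(
        embeddingLength = 2048,
        headCount = 32,
        kvHeadCount = 8,
        feedForwardLength = 8192,
        vocabSize = 128256,
    )
    println(logicalShape("blk.0.attn_k.weight", m))   // [512, 2048]
    println(logicalShape("blk.0.ffn_down.weight", m)) // [2048, 8192]
    println(logicalShape("token_embd.weight", m))     // [128256, 2048]
}
```

With grouped-query attention the K and V projections are narrower than Q, which is why the shape has to come from metadata rather than from the packed byte count.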
Pin the Llama 3.2 chat + tool-calling path in the smoke suite so the
JSON-mode parser fix and the eager-LlamaRuntime revert can't silently
regress.

Tool-calling prompt is "What is 2 + 2?" with 256 max steps; the smoke
runner asserts a [Tool Call] line is emitted. Path resolves under
MODELS_ROOT or the user's ~/.cache/standapp/models; absent locally it
reports FAIL (model path not found) cleanly without masking other
failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…integrators
Rewrites llama3-tool-calling.adoc into a getting-started for embedding
Llama 3 / 3.1 / 3.2 tool calling into a consumer Kotlin app, on top of
the existing format-internals reference.
New sections:
- Quick start: actual CLI output a user sees ([Tools], [Prompt → Round 1],
[Raw Assistant], [Tool Call], [Tool Result]) so the goal is concrete.
- Use it from your own Kotlin app: 5 numbered steps —
1. Add sk.ainet.transformers:llm-runtime-kllama + llm-agent (0.23.2)
+ --enable-preview --add-modules jdk.incubator.vector
2. KLlamaJava.loadGGUF(Path.of(...)) — runtime + tokenizer in one call
3. Define a tool — full WeatherTool example with ToolDefinition + JSON
schema + execute()
4. ChatSession(runtime, tokenizer, ModelMetadata(family="llama")) →
ToolRegistry → createAgentLoop → runWithEncoder
5. AgentListener snippet for prompt/answer/tool-call/tool-result logs;
how to render the prompt yourself via chat.chatTemplate.apply(...)
- Verify it's working: exact callback sequence, what to file if it breaks.
- NOTE: Llama31ToolCallParserStrategy peels markdown code fences (this
release's parser fix); also added to the "Parser accepts" bullet list.
The pre-existing format reference (Llama3ToolFormat.JSON / FUNCTION_TAG,
picking format programmatically, why two formats exist, model-size caveat,
related files) is preserved verbatim below the new walkthrough.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
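Step 1 of the walkthrough above might look roughly like this in a consumer's build.gradle.kts. This is a sketch under assumptions: the Kotlin plugin version is a placeholder, and it assumes llm-agent is published under the same sk.ainet.transformers group as llm-runtime-kllama.

```kotlin
// build.gradle.kts (sketch; assumes a plain Kotlin/JVM application module)
plugins {
    kotlin("jvm") version "2.0.21" // placeholder: use whatever Kotlin version your project is on
    application
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("sk.ainet.transformers:llm-runtime-kllama:0.23.2")
    // Assumption: llm-agent shares the group and version of the runtime artifact.
    implementation("sk.ainet.transformers:llm-agent:0.23.2")
}

// jdk.incubator.vector is an incubating module, so it has to be added explicitly.
tasks.withType<JavaExec>().configureEach {
    jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector")
}
```

The same JVM flags are needed for any other launcher (tests, packaged start scripts) that ends up loading the runtime.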
Regenerates the binary-compatibility-validator dumps via apiDump to
match the actual public surface of every module. Catches up the dumps
to the accumulated changes on the chore/release-0.23.2 branch:
- llm-runtime/kllama: drops GpuAttentionBackend, GpuTensorBridge,
LlamaIngestionBlocking; adds JavaTools.definition; KLlamaSession
constructor takes InferenceRuntime instead of LlamaRuntime.
- llm-providers: SkaiNetChatModel constructor takes a Set parameter
where there used to be an Int (additional generation knobs).
- llm-inference/llama: CpuAttentionBackend constructor takes a
RopeType (Llama vs Qwen RoPE selection).
- llm-inference/voxtral: new module, first API dump.
- llm-agent, llm-core, apertus, bert, gemma, qwen, kgemma: assorted
additions/cleanups from the DSL swaps and shared decoder body
refactors merged into this release branch.
No new source-level public API changes in this commit — purely the
catch-up dump from prior commits + the new voxtral module.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final commit on the chore/release-0.23.2 branch. Highlights since 0.23.1:
- feat(kllama-cli/native): swap CLI to DSL path for Llama + Qwen
(#125, #129, #130) on native and wasm targets.
- cleanup(gpu): delete placeholder GPU attention/tensor stubs that
always fell back to CPU; rename native benchmark scenario to
native-cpu-throughput (#131).
- feat(llm-core): wire SentencePiece decorator + GGUF tokenizer route
through upstream sk.ainet.io.tokenizer; fix Qwen / GPT-2 BPE GGUFs
(#52, #124).
- fix(qwen): NEOX (SPLIT_HALF) RoPE pairing for Qwen3 GGUFs.
- fix(transformer): thread metadata RMSNorm eps through QK-norm.
- cleanup: delete :llm-runtime:kqwen module; remove
LlamaIngestionBlocking.kt.
- feat(kllama-java): swap KLlamaJava facade to DSL path.
- fix(tool-calling): tolerate markdown code fences around Llama 3 JSON
tool calls.
- fix(kllama-cli): route Llama GGUF/SafeTensors back to eager
LlamaRuntime for now — the DSL Q4/Q8 path is functionally correct
but needs first-class Q4/Q8 DTypes to match the SIMD perf of the
legacy path. Tracked as a followup.
- docs: end-to-end Llama 3 tool-calling walkthrough for app integrators.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without the kotlin application plugin, the project only exposed
shadowJar + shadowJarDemo and Gradle reported "task 'run' not found".
The smoke test (tests/smoke/smoke-test.sh) dispatches every kbert entry
via :llm-apps:kbert-cli:run, so the kbert smoke leg was a no-op masked
by "model path not found" on the absent all-MiniLM-L6-v2.gguf fixture.

Apply the application plugin and pin mainClass to
sk.ainet.apps.bert.cli.MainKt, the same pattern as :llm-apps:kllama-cli.
The existing JavaExec configuration already adds --enable-preview +
jdk.incubator.vector, so :run inherits the right JVM args.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
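In Gradle Kotlin DSL, the change described in this commit amounts to something like the fragment below. It is a sketch of the relevant lines only; the real kbert-cli build script contains more (shadow jar tasks, the JavaExec configuration) that is not reproduced here.

```kotlin
// build.gradle.kts for :llm-apps:kbert-cli (sketch of the relevant lines only)
plugins {
    application // contributes the `run` task the smoke test dispatches to
}

application {
    // Main class named in the commit message.
    mainClass.set("sk.ainet.apps.bert.cli.MainKt")
}
```

Because `run` is a JavaExec task, any JVM arguments already configured for JavaExec tasks in the script apply to it as well.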
`set -euo pipefail` made the no-match `grep -oE 'tok/s: [0-9.]+'` fatal,
killing the script silently right after a successful kbert run because
BertRuntime doesn't print throughput. Visible symptom: the script
exited cleanly mid-test with no PASS/FAIL line and no summary table.

Append `|| true` to the substitution and the secondary
`sed | grep | sed` block so a missing tok/s falls back to "?" (the
existing default) instead of aborting the whole run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins the BERT embedding path against an actual model that ships with
config.json + vocab.txt + tokenizer.json + 2_Dense projection — the
same layout BertNumericalAccuracyTest validates. Path resolves under
the standard HuggingFace cache (~/.cache/huggingface/hub/models--MongoDB--mdbr-leaf-ir/
snapshots/<rev>) so any developer who has snapshot-downloaded the
revision via huggingface_hub can run it without extra setup.
Prompt + doc are intentionally on-topic ("MongoDB is a NoSQL database"
vs "MongoDB stores data in BSON documents") so the cosine similarity
shows up high (~0.78 on local validation) — a sanity check that the
embedding pipeline isn't producing garbage.
Snapshot revision is pinned to the SHA returned by the December 2024
HF index. To refresh:
    uv run --with huggingface_hub python -c \
      "from huggingface_hub import snapshot_download; \
       print(snapshot_download(repo_id='MongoDB/mdbr-leaf-ir'))"
The pre-existing all-MiniLM-L6-v2 entry is left in place — it'll
report FAIL (model path not found) on machines that don't have it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
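For readers unfamiliar with the sanity metric used in this smoke entry, cosine similarity between two embedding vectors can be computed as in the sketch below. The function and the toy vectors are illustrative only; they are not the smoke runner's or BertRuntime's code, and real embeddings have far more dimensions.

```kotlin
import kotlin.math.sqrt

/** Cosine similarity of two equal-length vectors; returns 0.0 if either has zero norm. */
fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size) { "vectors must have the same length" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        val x = a[i].toDouble()
        val y = b[i].toDouble()
        dot += x * y
        normA += x * x
        normB += y * y
    }
    if (normA == 0.0 || normB == 0.0) return 0.0
    return dot / (sqrt(normA) * sqrt(normB))
}

fun main() {
    // Toy 3-dimensional vectors standing in for real embeddings.
    val query = floatArrayOf(1f, 2f, 3f)
    val doc = floatArrayOf(2f, 4f, 6f)   // same direction as query
    val other = floatArrayOf(3f, 0f, -1f) // orthogonal to query
    println(cosineSimilarity(query, doc))
    println(cosineSimilarity(query, other))
}
```

Vectors pointing in the same direction score close to 1 and unrelated ones score near 0, so a high score for an on-topic prompt and document pair is a quick signal that the pipeline produces meaningful embeddings.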
Summary
Brings chore/release-0.23.2 (already tagged as 0.23.2 on commit 6eec93a) plus three post-release smoke-tooling fixes onto develop.

Release content (tagged 0.23.2)
- edb366c fix(tool-calling): tolerate markdown code fences around Llama 3 JSON
- 5b34ee9 fix(kllama-cli): route Llama GGUF/SafeTensors back to eager LlamaRuntime; recovers Llama 3 tool-calling functionality at the cost of staying on the legacy path until the DSL gets first-class Q4/Q8 DTypes
- 5c3b9fa feat(kllama-cli): log prompts, raw responses, and tools list in ToolCallingDemo
- 74c1416 fix(llama): inject logical 2D shape and dequant token_embd in DSL converter (now Qwen-only)
- 1e7af50 test(smoke): add Llama-3.2-1B-Instruct entry with tool-calling assertion
- cea3173 docs(tool-calling): end-to-end Llama 3 setup walkthrough for app integrators
- 40200da chore(api): refresh public API dumps for 0.23.2
- 6eec93a release: bump version to 0.23.2

Post-release smoke fixups (not in the 0.23.2 tag, no published-artifact change)
- 412d0b6 fix(kbert-cli): apply application plugin so :run task is wired
- 7abe110 fix(smoke): tolerate runners that don't emit tok/s (embedding models)
- 70e936d test(smoke): add MongoDB/mdbr-leaf-ir embedding entry

These three only affect tests/smoke/ and kbert-cli's build script; they don't change anything published to Maven Central.

Test plan
- :llm-agent:jvmTest, :llm-inference:llama:jvmTest, :llm-runtime:kllama:jvmTest: green
- apiCheck: green (dumps refreshed via apiDump)
- … [Tool Call] calculator → 4.0)

Known followups (called out in the 0.23.2 tag annotation)
- … ops.matmul.
- … upstream skainet pinned at 0.23.1, so it's not an upstream backend bump).

🤖 Generated with Claude Code