Unified Model Pipeline with Decoupled Tool Calling #49

@michalharakal

Context

Currently SKaiNET-transformers has:

  • 5+ hand-coded runtimes (LlamaRuntime, Qwen35Runtime, Gemma3nRuntime, ApertusRuntime, VoxtralRuntimes) — each reimplements the forward pass, weight loading, and layer execution
  • Tool calling tightly coupled to kllama — the AgentLoop, ToolCallingDemo, and chat modes only exist in the kllama runner. Other models (Gemma, Apertus) cannot use tool calling without duplicating code
  • Two execution paths — legacy hand-coded runtimes AND the newer OptimizedLLMRuntime with DSL/compute-graph/AOT. LlamaRuntime and ApertusRuntime are already marked deprecated

The goal: converge on one unified pipeline where model definition, weight loading, tokenization, and tool calling are cleanly separated pipeline stages.

Architecture Overview

GGUF/SafeTensors File
    |
WeightLoader (parse metadata + tensors)
    |
DSL Network Definition (model-specific, declarative)
    |
ComputeGraph (DAG)
    |
Optimization Pipeline (TransposeElim -> WeightDedup -> LLMFusion -> DCE)
    |
ComputeGraphExecutor (fused kernels)
    |
InferenceRuntime (unified: forward + generate)
    |
TokenizationPipeline (encode/decode, special tokens, byte-level BPE)
    |
ChatPipeline (template formatting, tool calling, agent loop)
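
The optimization stage above is an ordered list of graph passes, each consuming the previous pass's output. A minimal sketch of that shape, assuming simplified `Node`/`Graph`/`GraphPass` types that are hypothetical stand-ins and not the actual ComputeGraph API; only a toy dead-code-elimination pass is shown:

```kotlin
// Hypothetical, simplified graph types (not the SKaiNET ComputeGraph API).
data class Node(val id: String, val op: String, val inputs: List<String> = emptyList())
data class Graph(val nodes: List<Node>, val outputs: Set<String>)

fun interface GraphPass {
    fun run(graph: Graph): Graph
}

// Toy dead-code elimination: keep only nodes reachable from the graph outputs.
val deadCodeElimination = GraphPass { graph ->
    val byId = graph.nodes.associateBy { it.id }
    val live = mutableSetOf<String>()
    val pending = ArrayDeque(graph.outputs)
    while (pending.isNotEmpty()) {
        val id = pending.removeLast()
        // Visit each node once, then queue its inputs.
        if (live.add(id)) byId[id]?.inputs?.forEach { pending.addLast(it) }
    }
    graph.copy(nodes = graph.nodes.filter { it.id in live })
}

// Passes run in list order, e.g. TransposeElim, WeightDedup, LLMFusion, DCE.
fun optimize(graph: Graph, passes: List<GraphPass>): Graph =
    passes.fold(graph) { current, pass -> pass.run(current) }
```

Keeping each pass a pure `Graph -> Graph` function would let passes be tested in isolation and reordered without touching the executor.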

Phase 1: Decouple Tool Calling from kllama (immediate value)

Problem: Tool calling lives in llm-agent (good) but is wired up only through the kllama CLI, so other runners can't use it.

Changes:

  1. Move AgentLoop's generation into InferenceRuntime interface (llm-core)

    • generateUntilStop() is currently an extension function in llm-agent — promote it or add a generate() method with stop-token support to the interface
    • File: llm-core/.../InferenceRuntime.kt
  2. Create ChatSession abstraction (llm-agent)

    • Bundles: InferenceRuntime + Tokenizer + ChatTemplate + ToolRegistry
    • Any runner creates a ChatSession to get chat/agent/demo modes for free
    • File: new llm-agent/.../ChatSession.kt
  3. Extract CLI chat/agent/demo dispatch from kllama Main.kt into shared module

    • Currently lines 532-551 of Main.kt dispatch to ToolCallingDemo / AgentCli
    • This logic should work for any runner that provides InferenceRuntime + Tokenizer
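
A rough sketch of changes 1 and 2, to make the intended shape concrete. All signatures here are assumptions for illustration (the real `InferenceRuntime`, `Tokenizer` and `ChatTemplate` in llm-core / llm-agent may look different), decoding is plain greedy argmax, and `ToolRegistry` is left out to keep it short:

```kotlin
// Hypothetical minimal interfaces; not the existing llm-core definitions.
interface InferenceRuntime {
    fun reset()
    /** Feeds one token and returns logits for the next position. */
    fun forward(token: Int): FloatArray

    /** Greedy generation that stops at maxTokens or at the first stop token. */
    fun generate(prompt: IntArray, maxTokens: Int, stopTokens: Set<Int>): IntArray {
        reset()
        var logits = FloatArray(0)
        for (token in prompt) logits = forward(token)
        val generated = ArrayList<Int>()
        while (generated.size < maxTokens) {
            val next = logits.indices.maxByOrNull { logits[it] } ?: break
            if (next in stopTokens) break // stop token is not emitted
            generated.add(next)
            logits = forward(next)
        }
        return generated.toIntArray()
    }
}

interface Tokenizer {
    fun encode(text: String): IntArray
    fun decode(tokens: IntArray): String
    val eosTokenId: Int
}

fun interface ChatTemplate {
    /** Renders (role, content) pairs into a single prompt string. */
    fun format(messages: List<Pair<String, String>>): String
}

// Bundles everything a runner needs for chat; model-agnostic by construction.
class ChatSession(
    private val runtime: InferenceRuntime,
    private val tokenizer: Tokenizer,
    private val template: ChatTemplate,
    private val maxTokens: Int = 256,
) {
    private val history = mutableListOf<Pair<String, String>>()

    fun send(userMessage: String): String {
        history += "user" to userMessage
        val prompt = tokenizer.encode(template.format(history))
        val output = runtime.generate(prompt, maxTokens, setOf(tokenizer.eosTokenId))
        val reply = tokenizer.decode(output)
        history += "assistant" to reply
        return reply
    }
}
```

With something like this, a Gemma or Apertus runner would only need to construct its runtime and tokenizer and hand them to `ChatSession`.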

Critical files:

  • llm-core/.../InferenceRuntime.kt — extend interface
  • llm-agent/.../AgentLoop.kt — already well-abstracted, keep as-is
  • llm-runtime/kllama/.../Main.kt — extract dispatch logic
  • llm-runtime/kllama/.../ToolCallingDemo.kt — move to llm-agent or llm-apps shared module

Phase 2: Unified DSL-Based Model Definition (converge on OptimizedLLMRuntime)

Problem: Each model has a hand-coded runtime. OptimizedLLMRuntime already supports DSL -> graph -> optimized execution, but only some models use it.

Changes:

  1. Define DSL networks for all model families:

    • llamaNetwork(config) — LLaMA/Mistral/Qwen2/3 (standard transformer)
    • qwen35Network(config) — Qwen3.5 (hybrid DeltaNet + full attention)
    • gemmaNetwork(config) — Gemma (GELU, MatFormer FFN, sliding window)
    • apertusNetwork(config) — Apertus (xIELU, ungated MLP, QK-norm)
    • Each is a pure function returning a Network<T> from the DSL
  2. Unified model loading flow:

    detectArchitecture(ggufMetadata) -> ModelFamily
    ModelFamily.createNetwork(config) -> Network<T>
    WeightLoader.loadAndMap(file, network) -> weights
    OptimizedLLMRuntime(network, weights, mode=HYBRID) -> InferenceRuntime
    
  3. Remove deprecated hand-coded runtimes once DSL equivalents are validated:

    • LlamaRuntime -> llamaNetwork() + OptimizedLLMRuntime
    • ApertusRuntime -> apertusNetwork() + OptimizedLLMRuntime
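
The first two steps of the loading flow could look roughly like this. `ModelConfig` and `Network` are placeholder types (the real DSL returns `Network<T>`), and the architecture strings matched below are assumptions about what appears under `general.architecture` in GGUF metadata; they should be checked against real model files:

```kotlin
// Placeholder types for illustration; not the real DSL types.
data class ModelConfig(val layers: Int, val dim: Int)
data class Network(val family: String, val config: ModelConfig)

// Closed set of model families, each with its network factory.
enum class ModelFamily(val createNetwork: (ModelConfig) -> Network) {
    LLAMA({ Network("llama", it) }),     // LLaMA/Mistral/Qwen2/3
    QWEN35({ Network("qwen35", it) }),   // Qwen3.5 hybrid
    GEMMA({ Network("gemma", it) }),
    APERTUS({ Network("apertus", it) }),
}

fun detectArchitecture(metadata: Map<String, String>): ModelFamily {
    val arch = metadata["general.architecture"]
        ?: error("GGUF metadata has no general.architecture key")
    return when {
        // Assumed architecture names; the specific check must come before the prefix check.
        arch == "qwen35" -> ModelFamily.QWEN35
        arch == "llama" || arch == "mistral" || arch.startsWith("qwen") -> ModelFamily.LLAMA
        arch.startsWith("gemma") -> ModelFamily.GEMMA
        arch == "apertus" -> ModelFamily.APERTUS
        else -> error("Unsupported architecture: $arch")
    }
}
```

Failing loudly on an unknown architecture seems preferable to falling back to `llamaNetwork`, since a wrong network definition would only show up later as garbage output.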

Critical files:

  • llm-core/.../OptimizedLLMRuntime.kt — already exists, extend
  • llm-core/.../dsl/TransformerDsl.kt — already has embedding, MHA, SwiGLU, RMSNorm
  • llm-core/.../weights/LLMWeightNameResolvers.kt — already maps DSL paths -> GGUF names
  • New: per-model DSL network definitions

Phase 3: Tokenization as Pipeline Stage

Problem: Tokenization is split between GGUFTokenizer (kllama module), QwenByteLevelBPETokenizer (llm-core), and model-specific code. The recent byte-level BPE fix shows how fragile this split is.

Changes:

  1. Enhance Tokenizer interface (llm-core):

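    interface Tokenizer {
        // Text -> token IDs
        fun encode(text: String): IntArray
        // Single token ID -> text (useful for streaming output)
        fun decode(token: Int): String
        // Token ID sequence -> text
        fun decode(tokens: IntArray): String
        val eosTokenId: Int
        val bosTokenId: Int
        val vocabSize: Int
        // Special tokens as they appear in text (e.g. chat template markers)
        val specialTokens: Set<String>
    }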
  2. Unified tokenizer factory:

    • TokenizerFactory.fromGGUF(source) — auto-detects BPE/SentencePiece/WordPiece
    • TokenizerFactory.fromTokenizerJson(json) — HuggingFace format
    • Returns the correct implementation (byte-level BPE for GPT-2/Qwen, SentencePiece for LLaMA, etc.)
  3. Move GGUFTokenizer to llm-core so all runners can use it without depending on kllama
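
For the auto-detection in `TokenizerFactory.fromGGUF`, one option is to switch on the `tokenizer.ggml.model` metadata key. The sketch below assumes metadata is exposed as a string map and that the values `gpt2`, `llama` and `bert` mark byte-level BPE, SentencePiece and WordPiece respectively (this follows the convention used by llama.cpp-produced GGUF files, but is worth verifying). `TokenizerKind` and `detectTokenizerKind` are hypothetical names:

```kotlin
enum class TokenizerKind { BYTE_LEVEL_BPE, SENTENCE_PIECE, WORD_PIECE }

// Picks the tokenizer implementation from GGUF metadata.
fun detectTokenizerKind(metadata: Map<String, String>): TokenizerKind =
    when (val model = metadata["tokenizer.ggml.model"]) {
        "gpt2" -> TokenizerKind.BYTE_LEVEL_BPE   // GPT-2 / Qwen style
        "llama" -> TokenizerKind.SENTENCE_PIECE  // LLaMA style
        "bert" -> TokenizerKind.WORD_PIECE
        null -> error("GGUF metadata has no tokenizer.ggml.model key")
        else -> error("Unsupported tokenizer model: $model")
    }
```

The factory would then map each `TokenizerKind` to the concrete implementation, so runners never pick a tokenizer class by hand.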

Phase 4: Unified Runner (single CLI entry point)

Problem: 6 separate CLI apps with duplicated argument parsing, model loading, and dispatch logic.

Changes:

  1. Single skainet CLI that auto-detects model architecture from GGUF metadata:

    skainet -m model.gguf "prompt"                    # auto-detect, generate
    skainet -m model.gguf --chat                      # auto-detect, chat mode
    skainet -m model.gguf --demo "What is 2+2?"       # auto-detect, tool calling
  2. Architecture registry:

    ModelRegistry.register("llama", ::llamaNetwork)
    ModelRegistry.register("qwen3", ::llamaNetwork)   // Qwen2/3 reuse the standard transformer network (Phase 2)
    ModelRegistry.register("gemma", ::gemmaNetwork)
  3. Auto-detection from GGUF metadata (already exists in peekGgufMetadata())
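
A minimal sketch of what the registry behind those `register` calls might look like. `ModelConfig`, `Network` and the `create` method are placeholders introduced for illustration, not existing APIs:

```kotlin
// Placeholder types for illustration; the real DSL returns Network<T>.
data class ModelConfig(val layers: Int, val dim: Int)
data class Network(val family: String, val config: ModelConfig)

object ModelRegistry {
    private val factories = mutableMapOf<String, (ModelConfig) -> Network>()

    // Later registrations for the same key replace earlier ones.
    fun register(architecture: String, factory: (ModelConfig) -> Network) {
        factories[architecture] = factory
    }

    // Looks up the factory by the architecture string read from GGUF metadata.
    fun create(architecture: String, config: ModelConfig): Network {
        val factory = factories[architecture]
            ?: error("No network registered for '$architecture'; known: ${factories.keys.sorted()}")
        return factory(config)
    }
}

// Stand-ins for the per-model DSL network definitions.
fun llamaNetwork(config: ModelConfig) = Network("llama", config)
fun gemmaNetwork(config: ModelConfig) = Network("gemma", config)
```

Compared with a fixed enum of model families, an open registry would let new architectures be added from a separate module without touching the CLI.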

Verification

  • All existing unit tests pass (llm-agent, llm-runtime:kllama, llm-core)
  • Smoke test suite passes (generation + tool calling)
  • Basic generation produces output identical to the pre-refactor output for every model family
  • Tool calling works for any model that supports ChatML/Qwen/Llama3 templates
  • OptimizedLLMRuntime in HYBRID mode matches hand-coded runtime output
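
For the last check, fused kernels can reorder floating-point operations, so bitwise-equal logits may be too strict a criterion. One possible comparison, sketched under the assumption that both runtimes expose next-token logits as a `FloatArray` (the helper names and the tolerance value are illustrative only):

```kotlin
import kotlin.math.abs

// Largest element-wise difference between two logit vectors.
fun maxAbsDiff(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "logit vectors differ in size: ${a.size} vs ${b.size}" }
    var worst = 0f
    for (i in a.indices) worst = maxOf(worst, abs(a[i] - b[i]))
    return worst
}

// Index of the largest logit, or null for an empty vector.
fun argmax(logits: FloatArray): Int? = logits.indices.maxByOrNull { logits[it] }

// Two runtimes "match" if greedy decoding picks the same token
// and no logit drifts by more than the tolerance.
fun logitsMatch(reference: FloatArray, candidate: FloatArray, tolerance: Float = 1e-4f): Boolean =
    argmax(reference) == argmax(candidate) && maxAbsDiff(reference, candidate) <= tolerance
```

Running this per generation step on a fixed prompt would show where the two runtimes first diverge, rather than only that the final text differs.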
