
Apertus: NATIVE_OPTIMIZED Q4_K end-to-end inference broken — needs block-major tensor-data wrappers #100

@michalharakal


Summary

ApertusNetworkLoader.fromGguf(quantPolicy = NATIVE_OPTIMIZED).load() succeeds (after PR #98), but a single forward pass through the loaded model fails because ApertusWeightLoader.streamingTensorToTensor stores quantized weights as raw byte-level rank-1 Int8 tensors. The standard transformer DSL forward path then hits two failures in sequence:

  1. Embedding gather on token_embd: fixed in PR #98 (fix(apertus): real-model loading — UInt metadata + quantized shape) by force-dequantizing the token embedding to FP32 in loadStreamingTensor / loadReaderTensor regardless of quantPolicy. (Embedding lookup needs the logical [vocab, dim] shape; the byte shape doesn't work for gather.)

  2. Attention Q/K/V/O and FFN projections: linearProject(ops, x, W) = ops.matmul(x, ops.transpose(W)) (llm-core/.../transformer/LinearProjection.kt:30) doesn't know about quantized weights. ops.transpose(byteShape Int8) fails with "Transpose requires at least 2 dimensions" because the byte tensor is rank 1.

java.lang.IllegalArgumentException: Transpose requires at least 2 dimensions
  at sk.ainet.exec.tensor.ops.DefaultCpuOpsBase.transpose(DefaultCpuOps.kt:455)
  at sk.ainet.lang.nn.transformer.LinearProjectionKt.linearProject(LinearProjection.kt:34)
  at sk.ainet.lang.nn.transformer.MultiHeadAttention.onForward(MultiHeadAttention.kt:185)
  at sk.ainet.apps.llm.HybridTransformerBlock.directForward(HybridTransformerBlock.kt:172)
  ...

This is the same problem Gemma solved. Gemma's loader stores Q4_K weights as Q4_KBlockTensorData(logicalShape = Shape(rows, cols), blockMajorBytes) — a quant-aware TensorData that retains the logical rank-2 shape. transpose(Q4_KTensorData) is overridden to be lazy, and matmul dispatches via JvmQuantizedVectorKernels.matmulQ4_KVec. See GemmaDslQ4KTest, relayoutQ4_KRowMajorToBlockMajor, and GemmaMemSegConverter for the pattern.
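
To make the contrast concrete, here is a minimal sketch of the two representations. Every type in it (Shape2, RawBytes, BlockQuantData, requireTransposable) is a hypothetical stand-in written for this issue, not the skainet-lang-core API. It only illustrates why a rank-1 byte view cannot be transposed while a wrapper that retains the logical shape can do it lazily.

```kotlin
// Hypothetical stand-ins for illustration; none of these are SKaiNET types.
data class Shape2(val rows: Int, val cols: Int)

/** Raw byte view: rank 1, so there is nothing to transpose. */
class RawBytes(val bytes: ByteArray) {
    val rank: Int get() = 1
}

/** Quant-aware view: same bytes, but the logical [rows, cols] shape is retained. */
class BlockQuantData(
    val logicalShape: Shape2,
    val blockBytes: ByteArray,
    val transposed: Boolean = false,
) {
    val rank: Int get() = 2

    /** Lazy transpose: swap the logical dims and flip a flag; the bytes are not moved. */
    fun transpose(): BlockQuantData =
        BlockQuantData(Shape2(logicalShape.cols, logicalShape.rows), blockBytes, !transposed)
}

/** Mimics the rank check that produces the error in the stack trace above. */
fun requireTransposable(rank: Int) {
    require(rank >= 2) { "Transpose requires at least 2 dimensions" }
}

fun main() {
    // 4 rows x 256 cols = 1024 weights = four 144-byte Q4_K super-blocks.
    val bytes = ByteArray(144 * 4)

    val raw = RawBytes(bytes)
    val failure = runCatching { requireTransposable(raw.rank) }.exceptionOrNull()
    println(failure?.message)

    val quant = BlockQuantData(Shape2(rows = 4, cols = 256), bytes)
    requireTransposable(quant.rank)
    val t = quant.transpose()
    println("${t.logicalShape} transposed=${t.transposed} sameBytes=${t.blockBytes === bytes}")
}
```

A matmul kernel that receives such a wrapper can read the transposed flag and walk the blocks accordingly, so no byte-level relayout is needed at projection time.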

Repro (after PR #98)

// SKaiNET-transformers/llm-inference/apertus/src/jvmTest/.../ApertusRealGgufLoadingTest.kt
// Run with -PapertusTestMaxHeap=12g and unsloth/Apertus-8B-Instruct-2509-GGUF Q4_K_S in HF cache.
val ctx = DirectCpuExecutionContext.create()
val model = ApertusNetworkLoader.fromGguf(
    randomAccessProvider = { JvmRandomAccessSource.open(file) },
    quantPolicy = QuantPolicy.NATIVE_OPTIMIZED
).load<FP32, Float>(ctx)

OptimizedLLMRuntime(model, ctx, OptimizedLLMMode.DIRECT, FP32::class).forward(bos)
//                  ^^^ throws IllegalArgumentException at first MHA Q-projection

Why this matters

After cleanup commit 8a7e0ff removed ApertusQuantizedRuntime, the canonical path for running Apertus models is OptimizedLLMRuntime + apertusNetwork(). Combined with this bug, there is currently no working path to actually run an Apertus-8B Q4_K_S model end-to-end on a normal-sized JVM:

  • DEQUANTIZE_TO_FP32 → ~32 GB heap for Apertus-8B (won't fit on a 16 GB box).
  • NATIVE_OPTIMIZED → fails at the first projection in the forward pass (this issue).
  • RAW_BYTES → identical byte-shape problem.
  • loadQuantized() → returns ApertusQuantizedWeights, but the runtime that consumed them was deleted.
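
The heap figure in the first bullet follows from simple arithmetic, sketched below. The 8e9 parameter count is a nominal value assumed for illustration, not read from the GGUF file, and the Q4_K line pretends every weight is Q4_K (a 144-byte super-block per 256 weights), which the real file only approximates.

```kotlin
// Rough weight-memory estimate; ignores activations, KV cache and JVM overhead.
const val Q4K_BLOCK_BYTES = 144L     // bytes per Q4_K super-block
const val Q4K_BLOCK_WEIGHTS = 256L   // weights per Q4_K super-block

/** FP32 stores 4 bytes per weight. */
fun fp32Bytes(params: Long): Long = params * 4

/** Q4_K stores 144 bytes per 256 weights, i.e. 4.5 bits per weight. */
fun q4kBytes(params: Long): Long = params / Q4K_BLOCK_WEIGHTS * Q4K_BLOCK_BYTES

fun main() {
    val params = 8_000_000_000L // assumed nominal size for an "8B" model
    println("FP32 bytes: ${fp32Bytes(params)}")
    println("Q4_K bytes: ${q4kBytes(params)}")
}
```

Under these assumptions the dequantized weights alone need about 32 GB, while the same weights kept in Q4_K need about 4.5 GB, which is why keeping the weights quantized through the forward pass matters on a 16 GB machine.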

PR #98 verifies loading, but inference is still blocked.

Proposed fix

Mirror Gemma's path. In ApertusWeightLoader.streamingTensorToTensor / readerTensorToTensor, when quantPolicy == NATIVE_OPTIMIZED and the tensor is a block-quantized type (Q4_K / Q5_K / Q6_K / Q8_0 / IQ4_NL / IQ4_XS / Q2_K / Q3_K / TQ1_0 / TQ2_0), wrap as the appropriate *BlockTensorData from skainet-lang-core with the logical [out, in] shape and pre-relayout to block-major. Each format needs:

  • Row-major → block-major relayout (relayoutQ4_KRowMajorToBlockMajor exists for Q4_K; the other formats need analogous helpers if they don't already exist)
  • A lazy transpose override on the TensorData (Gemma's Q4_KTensorData has this already)
  • matmul dispatch via the right native / Panama Vector kernel

Apertus-8B-Instruct-2509 Q4_K_S contains 185 Q4_K tensors, 8 Q5_K, 1 Q6_K, and 130 F32. Q4_K and F32 paths exist already in skainet-lang-core; Q5_K / Q6_K need parity work. (Or fall back to dequant for the Q5_K / Q6_K outliers — only 9 tensors total in this quant.)
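
The fallback option can be sketched as a small per-type policy. The enums and choosePath below are hypothetical names for illustration only, and the nativeTypes set encodes the assumption that only Q4_K has a block-major wrapper and kernel today.

```kotlin
// Hypothetical policy sketch; names are illustrative, not SKaiNET API.
enum class QuantType { F32, Q4_K, Q5_K, Q6_K }

enum class LoadPath { KEEP_FP32, WRAP_BLOCK_MAJOR, DEQUANT_FALLBACK }

/** Types assumed to have a block-major wrapper and a matmul kernel today. */
val nativeTypes = setOf(QuantType.Q4_K)

fun choosePath(type: QuantType): LoadPath = when {
    type == QuantType.F32 -> LoadPath.KEEP_FP32
    type in nativeTypes -> LoadPath.WRAP_BLOCK_MAJOR
    else -> LoadPath.DEQUANT_FALLBACK
}

fun main() {
    // Tensor counts quoted above for Apertus-8B-Instruct-2509 Q4_K_S.
    val counts = mapOf(
        QuantType.Q4_K to 185,
        QuantType.Q5_K to 8,
        QuantType.Q6_K to 1,
        QuantType.F32 to 130,
    )
    // Sum the tensor counts per chosen load path.
    val perPath = counts.entries
        .groupBy({ choosePath(it.key) }, { it.value })
        .mapValues { it.value.sum() }
    println(perPath)
}
```

With the counts above, 185 tensors take the block-major path and 9 are dequantized. Adding Q5_K or Q6_K to nativeTypes once their wrappers exist would move them over without touching the rest of the loader.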

Scope split

This is a multi-day chunk:

  1. ApertusWeightLoader gains a wrapAsBlockTensorData(tensorType, shape, bytes) switch that produces the right *BlockTensorData per quant type.
  2. Verify Q5_K / Q6_K wrappers exist in skainet-lang-core (or add them, mirroring Q4_K).
  3. The FFN down projection's output dim isn't a multiple of the K-quant block size in some Apertus quants. Check whether Gemma's relayout handles padded blocks or falls back to dequant.
  4. End-to-end smoke test that the same ApertusRealGgufLoadingTest.fromGguf path now produces finite logits, matching dequantized FP32 within Q4_K tolerance.
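
For step 4, the comparison itself could look roughly like the sketch below. The helper names are hypothetical and the tolerance is a placeholder: a real bound for Q4_K against dequantized FP32 would have to be measured, not assumed.

```kotlin
import kotlin.math.abs

/** True when every logit is neither NaN nor infinite. */
fun allFinite(logits: FloatArray): Boolean = logits.all { it.isFinite() }

/** Largest element-wise absolute difference between two equally sized arrays. */
fun maxAbsDiff(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "size mismatch: ${a.size} vs ${b.size}" }
    var worst = 0f
    for (i in a.indices) worst = maxOf(worst, abs(a[i] - b[i]))
    return worst
}

/** Passes only if both outputs are finite and agree within the given tolerance. */
fun logitsMatch(quant: FloatArray, reference: FloatArray, tolerance: Float): Boolean =
    allFinite(quant) && allFinite(reference) && maxAbsDiff(quant, reference) <= tolerance

fun main() {
    // Made-up values standing in for real logits.
    val reference = floatArrayOf(1.00f, -2.00f, 0.50f)
    val quant = floatArrayOf(1.02f, -1.97f, 0.49f)
    println(logitsMatch(quant, reference, tolerance = 0.05f))
    println(logitsMatch(floatArrayOf(Float.NaN, 0f, 0f), reference, tolerance = 0.05f))
}
```

Checking finiteness before the difference keeps a NaN from silently passing the comparison, since comparisons against NaN are always false in one direction only.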

Out of scope

  • Tool calling — the chat-template + parser are unit-tested and don't depend on this.
  • Numeric parity with llama.cpp — separate measurement task.
