This tutorial walks you through running text generation with a GGUF model using the unified skainet CLI.
Note: This tutorial is part of the canonical SKaiNET Transformers five-minute start path. See the "Start in 5 minutes" section of the repository README.
- JDK 21+ with preview features (Vector API)
- A GGUF model file. This tutorial does not download one for you. Use a small quantized model for the first run (e.g., tinyllama-1.1b-chat-v1.0.Q8_0.gguf).
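The JDK requirement above can be checked before the first run. The following is a minimal sketch, not part of the SKaiNET tooling: the helper names `java_major` and `jdk_ok` are hypothetical, and the parsing assumes the usual `java -version` first line of the form `openjdk version "21.0.2" ...`.

```shell
# Hypothetical helpers (not part of the skainet CLI).

# Extract the major version from the first line of `java -version` output,
# e.g. 'openjdk version "21.0.2" 2024-01-16' prints 21.
java_major() {
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  # Legacy scheme: "1.8.0_392" means Java 8.
  case "$ver" in
    1.*) ver=${ver#1.} ;;
  esac
  # Keep only the part before the first '.', '_' or '-'.
  printf '%s\n' "${ver%%[._-]*}"
}

# Return 0 when the given major version meets the JDK 21+ requirement.
jdk_ok() {
  [ "$1" -ge 21 ] 2>/dev/null
}

# Usage (java -version writes to stderr, hence 2>&1):
#   jdk_ok "$(java_major "$(java -version 2>&1 | head -n 1)")" \
#     || echo "JDK 21+ required" >&2
```

The check covers only the JDK version. It does not verify that the Vector API module is enabled for the run.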
./gradlew :llm-apps:skainet-cli:run \
--args="-m tinyllama-1.1b-chat-v1.0.Q8_0.gguf 'The capital of France is'"

Expected output:

    Architecture: llama, Family: LLaMA / Mistral
    Backend: CPU (SIMD)
    Loading GGUF model (LLaMA / Mistral, streaming)...
    Generating 64 tokens with temperature=0.8...
    ---
    The capital of France is Paris. It is also the largest city in France...
    ---
    tok/s: 3.4
The CLI auto-detects the model architecture from GGUF metadata — no need to specify which runner to use.
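Auto-detection depends on the file being valid GGUF. In the GGUF format, a file starts with the 4-byte ASCII magic `GGUF`, followed by a header and metadata key-value pairs, one of which (`general.architecture`) names the architecture. The sketch below checks only the magic, as a quick sanity test before handing a file to the CLI. The helper name `is_gguf` is hypothetical, and this is not how the skainet CLI itself reads the file.

```shell
# Hypothetical helper (not part of the skainet CLI).
# Return 0 when the file exists and starts with the 4-byte ASCII magic "GGUF".
is_gguf() {
  [ -f "$1" ] || return 1
  # Read exactly the first 4 bytes; dd's transfer statistics go to stderr.
  magic=$(dd if="$1" bs=4 count=1 2>/dev/null)
  [ "$magic" = "GGUF" ]
}

# Usage:
#   is_gguf tinyllama-1.1b-chat-v1.0.Q8_0.gguf || echo "not a GGUF file" >&2
```

A file that fails this check is most likely a truncated download or a model in a different format.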
./gradlew :llm-apps:skainet-cli:run \
--args="-m Qwen3-1.7B-Q8_0.gguf --chat"

This starts a multi-turn conversation with the model using the auto-detected chat template.
./gradlew :llm-apps:skainet-cli:run \
--args="-m Qwen3-1.7B-Q8_0.gguf --demo"

The demo provides calculator and list_files tools.
Type a question like "What is 2 + 2?" and the model will call the calculator tool.
| Problem | What to check |
|---|---|
| Model file not found | Use an absolute path to the |
| | The Vector API needs |
| Out of memory | Start with a smaller quantized model (e.g. a Q4/Q8 1B model) and close memory-heavy applications. |
| Gradle cannot resolve artifacts | Check that the version you use matches the one in the repository README. |
| Slow first run | The first run spends extra time resolving dependencies and loading the model. |
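For the "Model file not found" case, a relative path may be resolved against a different working directory than the one your shell is in when the CLI is launched through Gradle. The sketch below turns a relative path into an absolute one before passing it to `-m`. The helper name `abs_model_path` is hypothetical and not part of the skainet CLI.

```shell
# Hypothetical helper (not part of the skainet CLI).
# Print an absolute path for an existing file, or fail with a message.
abs_model_path() {
  if [ ! -f "$1" ]; then
    echo "model file not found: $1" >&2
    return 1
  fi
  # Resolve the containing directory, then re-attach the file name.
  dir=$(cd "$(dirname "$1")" && pwd) || return 1
  # Avoid a double slash when the file sits directly under the root directory.
  case "$dir" in
    /) dir="" ;;
  esac
  printf '%s/%s\n' "$dir" "$(basename "$1")"
}

# Usage:
#   MODEL=$(abs_model_path tinyllama-1.1b-chat-v1.0.Q8_0.gguf) &&
#     ./gradlew :llm-apps:skainet-cli:run \
#       --args="-m $MODEL 'The capital of France is'"
```

The usage line assumes the resolved path contains no spaces, because the whole argument string is passed inside a single `--args` value.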
- Tool calling in depth: integrate tool calling into your own application
- CLI reference: all available flags and options
- Architecture overview: understand the pipeline