Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
63 changes: 63 additions & 0 deletions openspec/changes/harden-gguf-contracts/design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
## Context

The repository is in final hardening mode. For this phase, honest failure is better than partial runtime optimism. The current GGUF host-side path violates that principle in three ways:

1. integer sizes from GGUF metadata/tensor descriptors are trusted too long
2. runtime tensor validation happens after CUDA allocation starts
3. schema mapping mixes valid configuration keys with unrelated metadata fallbacks

These problems are tightly coupled and live behind one shallow seam: `GGUFParser` + `ModelLoader::loadGGUF`.

## Goals / Non-Goals

**Goals:**
- Fail malformed or oversized GGUF inputs with `Result<T>` errors, not host exceptions.
- Validate runtime-required tensors before any CUDA allocation.
- Keep the GGUF host-side seam smaller and more honest.
- Add regression coverage for hostile and incomplete GGUF fixtures.

**Non-Goals:**
- Full GGUF runtime support for every quantization format.
- Reworking CUDA kernels or broader inference-engine execution flow.
- Changing binary model loading behavior.

## Decisions

### 1. Add explicit overflow guards in GGUF size math

`GGUFTensorInfo::numElements()`, `GGUFTensorInfo::calculateSize()`, and GGUF array readers will reject sizes that overflow `size_t` or exceed safe allocation math.

Why:
- This converts malicious/untrusted metadata into structured parse errors.
- It keeps the parser as deep module: callers do not need to duplicate size-validation logic.

### 2. Validate GGUF runtime completeness before CUDA allocation

`ModelLoader::loadGGUF()` will build a required-tensor checklist from the extracted `ModelConfig` and return an error if required runtime tensors are missing or unsupported.

Why:
- Missing tensors are input-validation failures, not CUDA failures.
- This concentrates loader policy in one place and removes fake-success zero placeholders from the public seam.

### 3. Remove schema-invalid config fallback behavior

`GGUFParser::extractModelConfig()` will stop treating unrelated metadata like `general.architecture` as numeric dimension fallbacks.

Why:
- `general.architecture` is architecture identity, not `hidden_dim`.
- Keeping schema mapping strict improves locality and makes malformed GGUF behavior predictable.

## Risks / Trade-offs

- Some GGUF files that previously limped through with zero-filled placeholders will now fail early. This is intentional hardening.
- Tests must use host-side fixtures that fail before CUDA allocation when the environment lacks a working GPU/runtime.

## Verification Plan

1. Add failing host-side tests for:
- tensor-size overflow guards
- oversized GGUF arrays
- invalid config fallback behavior
- incomplete GGUF runtime tensors returning `Result` errors without throwing
2. Implement minimal production changes to satisfy those tests.
3. Run the focused test target plus strongest available repository verification commands.
22 changes: 22 additions & 0 deletions openspec/changes/harden-gguf-contracts/proposal.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
## Why

Tiny-LLM's GGUF host-side path is currently too forgiving and too shallow. The parser accepts untrusted sizes without enough overflow protection, the loader validates required runtime tensors only after it starts allocating CUDA memory, and malformed metadata can influence `ModelConfig` in ways that do not reflect the GGUF schema. In practice this means crafted or incomplete GGUF files can trigger exceptions, oversized allocations, or fake-success paths instead of returning a clean `Result<T>` error.

## What Changes

- Harden GGUF parsing against integer-overflow and oversized-allocation paths.
- Make GGUF runtime loading validate required metadata and tensor presence before any CUDA allocation.
- Remove misleading fallback behavior in GGUF-to-`ModelConfig` mapping.
- Add regression tests for hostile/malformed GGUF inputs and incomplete runtime artifacts.
- Align the inference-engine capability delta with the stricter GGUF contract.

## Capabilities

### Modified Capabilities
- `inference-engine`: Tightens GGUF parsing and runtime-loading failure behavior so malformed or incomplete files fail explicitly instead of throwing or silently synthesizing invalid weights.

## Impact

- Affects `src/gguf_parser.cpp`, `src/model_loader.cpp`, and related tests.
- Adds stricter failure behavior for malformed/incomplete GGUF inputs.
- Improves alignment between `Result<T>`-based error handling and the actual GGUF host-side path.
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
## MODIFIED Requirements

### Requirement: Model Loading

The system SHALL parse and load model files with explicit validation of malformed, incomplete, or unsupported GGUF inputs before runtime allocation begins.

1. The system SHALL parse GGUF file headers and configuration metadata.
2. The system SHALL reject GGUF metadata, tensor dimensions, or array sizes that overflow safe allocation math.
3. The system SHALL validate that required runtime tensors are present before allocating CUDA memory for GGUF runtime loading.
4. The system SHALL return a structured error for corrupted, invalid, incomplete, or unsupported model files.
5. The system SHALL provide a structured model representation including `ModelConfig` and `QuantizedWeight` structures.

#### Scenario: Reject oversized GGUF tensor metadata
- **GIVEN** a GGUF tensor descriptor or metadata array whose size computation would overflow
- **WHEN** the parser reads the file
- **THEN** parsing SHALL fail with a structured error
- **AND** the parser SHALL NOT attempt an undersized allocation

#### Scenario: Reject incomplete GGUF runtime tensor sets
- **GIVEN** a GGUF file that parses successfully but omits tensors required for runtime inference
- **WHEN** `ModelLoader::loadGGUF()` is called
- **THEN** the loader SHALL return an error describing the missing or unsupported tensors
- **AND** it SHALL fail before starting CUDA allocations for model weights

#### Scenario: Ignore unrelated metadata fallbacks
- **GIVEN** GGUF metadata that includes unrelated keys such as `general.architecture`
- **WHEN** `extractModelConfig()` maps metadata into `ModelConfig`
- **THEN** unrelated keys SHALL NOT override numeric model dimensions
17 changes: 17 additions & 0 deletions openspec/changes/harden-gguf-contracts/tasks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
## 1. GGUF parser overflow hardening

- [ ] 1.1 Add failing tests for tensor-size overflow and oversized GGUF arrays
- [ ] 1.2 Guard `numElements()`, `calculateSize()`, and array byte-count math against overflow
- [ ] 1.3 Re-run the focused GGUF/parser tests

## 2. GGUF config and runtime contract hardening

- [ ] 2.1 Add failing tests for invalid metadata fallback and incomplete runtime tensor sets
- [ ] 2.2 Make `extractModelConfig()` ignore unrelated metadata fallbacks
- [ ] 2.3 Make `ModelLoader::loadGGUF()` validate required tensors before CUDA allocation and fail with `Result` errors
- [ ] 2.4 Re-run the focused GGUF/model-loader tests

## 3. Spec and verification alignment

- [ ] 3.1 Update the inference-engine change delta to describe the stricter GGUF contract
- [ ] 3.2 Run repository verification commands or strongest available substitutes and record environment limits
107 changes: 90 additions & 17 deletions src/gguf_parser.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3,9 +3,39 @@

#include <algorithm>
#include <cstring>
#include <limits>

namespace tiny_llm {

namespace {

bool safeMultiplySize(size_t lhs, size_t rhs, size_t &out) {
if (lhs != 0 && rhs > std::numeric_limits<size_t>::max() / lhs) {
out = 0;
return false;
}
out = lhs * rhs;
return true;
}

template <typename T>
bool canAllocateArray(uint64_t count) {
return count <= std::numeric_limits<size_t>::max() / sizeof(T);
}

size_t remainingBytes(std::ifstream &file) {
const std::streampos current = file.tellg();
file.seekg(0, std::ios::end);
const std::streampos end = file.tellg();
file.seekg(current);
if (current < 0 || end < current) {
return 0;
}
return static_cast<size_t>(end - current);
}

} // namespace

GGUFParser::GGUFParser(const std::string &path) : path_(path) {}

Result<void> GGUFParser::parse() {
Expand Down Expand Up @@ -273,34 +303,50 @@ Result<GGUFValue> GGUFParser::readArray(std::ifstream &file) {
}

TLLM_TRACE("Reading array of {} elements, type {}", count, static_cast<uint32_t>(elem_type));
const size_t remaining = remainingBytes(file);

// Handle different array types
switch (elem_type) {
case GGUFType::UINT32: {
if (!canAllocateArray<uint32_t>(count) || count > remaining / sizeof(uint32_t)) {
return Result<GGUFValue>::err("Array too large: " + std::to_string(count));
}
std::vector<uint32_t> arr(count);
file.read(reinterpret_cast<char *>(arr.data()), count * 4);
if (!file) return Result<GGUFValue>::err("Failed to read uint32 array");
return Result<GGUFValue>::ok(GGUFValue{arr});
}
case GGUFType::INT32: {
if (!canAllocateArray<int32_t>(count) || count > remaining / sizeof(int32_t)) {
return Result<GGUFValue>::err("Array too large: " + std::to_string(count));
}
std::vector<int32_t> arr(count);
file.read(reinterpret_cast<char *>(arr.data()), count * 4);
if (!file) return Result<GGUFValue>::err("Failed to read int32 array");
return Result<GGUFValue>::ok(GGUFValue{arr});
}
case GGUFType::FLOAT32: {
if (!canAllocateArray<float>(count) || count > remaining / sizeof(float)) {
return Result<GGUFValue>::err("Array too large: " + std::to_string(count));
}
std::vector<float> arr(count);
file.read(reinterpret_cast<char *>(arr.data()), count * 4);
if (!file) return Result<GGUFValue>::err("Failed to read float array");
return Result<GGUFValue>::ok(GGUFValue{arr});
}
case GGUFType::FLOAT64: {
if (!canAllocateArray<double>(count) || count > remaining / sizeof(double)) {
return Result<GGUFValue>::err("Array too large: " + std::to_string(count));
}
std::vector<double> arr(count);
file.read(reinterpret_cast<char *>(arr.data()), count * 8);
if (!file) return Result<GGUFValue>::err("Failed to read double array");
return Result<GGUFValue>::ok(GGUFValue{arr});
}
case GGUFType::STRING: {
if (count > remaining / sizeof(uint64_t)) {
return Result<GGUFValue>::err("Array too large: " + std::to_string(count));
}
std::vector<std::string> arr;
arr.reserve(count);
for (uint64_t i = 0; i < count; ++i) {
Expand Down Expand Up @@ -340,6 +386,10 @@ Result<std::string> GGUFParser::readString(std::ifstream &file) {
return Result<std::string>::err("String too long: " + std::to_string(length));
}

if (length == 0) {
return Result<std::string>::ok({});
}

std::string str(length, '\0');
file.read(&str[0], length);

Expand Down Expand Up @@ -403,13 +453,13 @@ Result<ModelConfig> GGUFParser::extractModelConfig() const {
get_int("llama.attention.head_count", config.num_heads);
get_int("llama.attention.head_count_kv", config.num_kv_heads);
get_int("llama.context_length", config.max_seq_len);
get_int("general.architecture", config.hidden_dim); // fallback

// Try alternative keys (some models use different naming)
get_int("llama.embedding_length", config.hidden_dim);
if (config.hidden_dim == 4096) { // default
get_int("llama.embedding_length", config.hidden_dim);
}
get_int("general.embedding_length", config.hidden_dim);
get_int("general.block_count", config.num_layers);
get_int("general.head_count", config.num_heads);
get_int("general.head_count_kv", config.num_kv_heads);
get_int("general.context_length", config.max_seq_len);

// Tokenizer metadata
get_int("tokenizer.ggml.model.vocab_size", config.vocab_size);
Expand Down Expand Up @@ -472,7 +522,10 @@ Result<std::vector<uint8_t>> GGUFParser::readTensorData(const GGUFTensorInfo &te
std::to_string(read_offset));
}

size_t size = tensor.calculateSize();
size_t size = tensor.calculateSize();
if (!tensor.dimensions.empty() && size == 0) {
return Result<std::vector<uint8_t>>::err("Tensor size overflow for: " + tensor.name);
}
std::vector<uint8_t> data(size);
file.read(reinterpret_cast<char *>(data.data()), size);

Expand All @@ -491,43 +544,63 @@ size_t GGUFTensorInfo::numElements() const {
if (dimensions.empty()) return 0;
size_t n = 1;
for (auto d : dimensions) {
n *= d;
size_t next = 0;
if (!safeMultiplySize(n, static_cast<size_t>(d), next)) {
return 0;
}
n = next;
}
return n;
}

size_t GGUFTensorInfo::calculateSize() const {
size_t num_elem = numElements();
if (!dimensions.empty() && num_elem == 0) {
return 0;
}

auto scaledSize = [&](size_t bytes_per_elem) -> size_t {
size_t size = 0;
return safeMultiplySize(num_elem, bytes_per_elem, size) ? size : 0;
};
auto blockScaledSize = [&](size_t block_size, size_t bytes_per_block) -> size_t {
if (block_size == 0) {
return 0;
}
const size_t blocks = (num_elem + block_size - 1) / block_size;
size_t size = 0;
return safeMultiplySize(blocks, bytes_per_block, size) ? size : 0;
};

// Bytes per element for each type
switch (type) {
case GGMLType::F32:
return num_elem * 4;
return scaledSize(4);
case GGMLType::F16:
return num_elem * 2;
return scaledSize(2);
case GGMLType::I8:
return num_elem;
case GGMLType::I16:
return num_elem * 2;
return scaledSize(2);
case GGMLType::I32:
return num_elem * 4;
return scaledSize(4);
case GGMLType::I64:
return num_elem * 8;
return scaledSize(8);
case GGMLType::F64:
return num_elem * 8;
return scaledSize(8);
case GGMLType::Q8_0:
// Q8_0: 32 values per block, each block has 32 int8 + 1 half scale
return (num_elem / 32) * (32 + 2);
return blockScaledSize(32, 32 + 2);
case GGMLType::Q4_0:
// Q4_0: 32 values per block, each block has 16 int8 + 1 half scale
return (num_elem / 32) * (16 + 2);
return blockScaledSize(32, 16 + 2);
case GGMLType::Q4_1:
// Q4_1: 32 values per block, each block has 16 int8 + 2 half (scale + min)
return (num_elem / 32) * (16 + 4);
return blockScaledSize(32, 16 + 4);
default:
// Default to 2 bytes per element (FP16)
TLLM_WARN("Unknown tensor type {}, assuming FP16 size", static_cast<uint32_t>(type));
return num_elem * 2;
return scaledSize(2);
}
}

Expand Down
7 changes: 4 additions & 3 deletions src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,8 @@ void printHelp(const char *program_name) {
std::cout << " " << program_name << " # Show CUDA readiness" << std::endl;
std::cout << " " << program_name << " --info # Show detailed device info"
<< std::endl;
std::cout << " " << program_name << " model.gguf # Load GGUF model (partial support)"
std::cout << " " << program_name
<< " model.gguf # Inspect GGUF support notes (runtime load unsupported)"
<< std::endl;
}

Expand Down Expand Up @@ -177,8 +178,8 @@ int main(int argc, char **argv) {
// Handle model path argument
if (!model_path.empty()) {
if (model_path.size() >= 5 && model_path.substr(model_path.size() - 5) == ".gguf") {
std::cout << "\nRuntime note: GGUF parsing is partial and runtime GGUF loading is not "
"supported yet."
std::cout << "\nRuntime note: GGUF parsing/validation is available, but runtime GGUF "
"loading is intentionally unsupported."
<< std::endl;
std::cout << "Use the test binary format consumed by ModelLoader::loadBin() for "
"end-to-end loading."
Expand Down
Loading
Loading