Go library that enumerates models served by a vLLM (or any OpenAI-compatible) endpoint and resolves rich metadata for each one from the HuggingFace Hub: features, license, lineage, weight quantization, and tags. Also resolves a single HuggingFace model id directly when you don't need an endpoint.
```sh
go get github.com/algonode/model-meta
```

Requires Go 1.24+.
```go
import modelmeta "github.com/algonode/model-meta"

e := &modelmeta.Enumerator{
	EndpointURL: "http://localhost:8000", // any OpenAI-compatible /v1
	APIKey:      os.Getenv("VLLM_API_KEY"),
	HFToken:     os.Getenv("HF_TOKEN"),
}

// Enumerate everything the endpoint serves.
models, err := e.Enumerate(ctx)

// Or resolve one HuggingFace model without contacting the endpoint.
m, err := e.Resolve(ctx, "meta-llama/Meta-Llama-3-8B")
```

A `resolve-model` binary ships with the module — install with:
```sh
go install github.com/algonode/model-meta/resolve-model@latest
```

It mirrors both library modes:
```sh
resolve-model                                 # default localhost:8000
resolve-model http://host:8000                # explicit vLLM endpoint
resolve-model -m meta-llama/Meta-Llama-3-8B   # single HF model, no endpoint
VLLM_API_KEY=... HF_TOKEN=... resolve-model
```

Each `Model` aggregates one set of weights and every endpoint-exposed alias
that resolves to it:
```json
{
  "root": "meta-llama/Meta-Llama-3-8B-Instruct",
  "aliases": ["default", "llama3"],
  "max_model_len": 8192,
  "owned_by": "meta-llama",
  "features": {
    "text_generation": true,
    "tool_use": true,
    "quantization": "bf16",
    "architectures": ["LlamaForCausalLM"],
    "pipeline": "text-generation"
  },
  "lineage": ["meta-llama/Meta-Llama-3-8B"],
  "tags": {
    "huggingface": ["transformers", "tool-use", "license:llama3"],
    "compliance": []
  },
  "license": { "id": "llama3" },
  "flags": {
    "compliant": false,
    "huggingface": true,
    "lineage": true,
    "quantized": false
  }
}
```

`/v1/models` entries are grouped by their `root` field. Anything whose `id` differs from `root` becomes an alias. The result is sorted by `Root`.
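A minimal sketch of that grouping, using hypothetical `entry` and `grouped` shapes (the field names here are illustrative, not the library's exported types):

```go
import "sort"

// Hypothetical shapes for illustration; the library's real types carry
// far more metadata than this.
type entry struct{ ID, Root string }

type grouped struct {
	Root    string
	Aliases []string
}

// groupByRoot collapses /v1/models entries onto their root weights: every
// id that differs from its root is recorded as an alias, sorted by Root.
func groupByRoot(entries []entry) []grouped {
	byRoot := map[string]*grouped{}
	for _, e := range entries {
		g, ok := byRoot[e.Root]
		if !ok {
			g = &grouped{Root: e.Root}
			byRoot[e.Root] = g
		}
		if e.ID != e.Root {
			g.Aliases = append(g.Aliases, e.ID)
		}
	}
	out := make([]grouped, 0, len(byRoot))
	for _, g := range byRoot {
		out = append(out, *g)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Root < out[j].Root })
	return out
}
```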
Detected with the following priority — the first source that fires wins:

1. `quantization_config.quant_method` from the API summary (NVFP4 recognized from either `quant_method == "nvfp4"` or a `format` containing `"nvfp4"`).
2. Compressed-tensors refine. When step 1 yields `"compressed-tensors"`, the API summary has truncated the real format. The library does a follow-up `GET /{id}/resolve/main/config.json` and inspects both the top-level `quantization_config.format` and per-group `config_groups.*.format` for an NVFP4 marker. Cached, with 404 short-circuit, so unaffected models pay no extra HTTP cost.
3. GGUF filename. If the repo is a GGUF repo (`library_name`, tag, or `.gguf` siblings), the canonical file is picked — id pin first (`Foo-GGUF-Q5_K_M`), then a llama.cpp preference order (`Q4_K_M` > `Q5_K_M` > `Q5_K_S` > …), then size, then lexicographic. Its tier is parsed from the filename. Multi-part shards (`-NNNNN-of-NNNNN.gguf`) are collapsed to part 1.
4. `torch_dtype` mapping: `bfloat16 → bf16`, `float16`/`half → fp16`, `float32 → fp32`, `float8* → fp8`, `float4* → fp4` (see the sketch after this list).
5. Vendor suffix on the vLLM id: `-AWQ`, `-GPTQ`, `-FP8`, `-NVFP4`, …
6. GGUF tier suffix on the id: `Q4_K_M`, `IQ3_XXS`, `BF16`, … This is the practical fallback for llama.cpp endpoints, whose `id` is typically the local filename rather than an HF path.
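A minimal sketch of the purely id-derived steps 4 and 5; the helper names are illustrative, not part of the library:

```go
import "strings"

// dtypeTier mirrors the torch_dtype mapping of step 4. Illustrative only.
func dtypeTier(dtype string) string {
	switch {
	case dtype == "bfloat16":
		return "bf16"
	case dtype == "float16", dtype == "half":
		return "fp16"
	case dtype == "float32":
		return "fp32"
	case strings.HasPrefix(dtype, "float8"):
		return "fp8"
	case strings.HasPrefix(dtype, "float4"):
		return "fp4"
	}
	return "" // unknown: fall through to the next detection step
}

// vendorSuffix checks vendor quant suffixes on the id (step 5).
func vendorSuffix(id string) string {
	for _, s := range []string{"-AWQ", "-GPTQ", "-FP8", "-NVFP4"} {
		if strings.HasSuffix(strings.ToUpper(id), s) {
			return strings.ToLower(strings.TrimPrefix(s, "-"))
		}
	}
	return ""
}
```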
Pipeline tag, HF tags, and architecture names feed boolean flags:
`TextGeneration`, `Embedding`, `Vision`, `Audio`, `ToolUse`, `Reasoning`,
`Code`. Best-effort — `false` means "not detected", not "definitely
unsupported".
`License.ID` comes from `cardData.license`, with a `license:<id>` HF tag as
fallback. `Name` and `Link` are filled when the model card declares an
`other`-style license with a custom title and URL. Set only when HF
resolution succeeded.
`Tags.HuggingFace` is the deduped union of `info.tags` and `cardData.tags`.
Set only when HF resolution succeeded.
`Tags.Compliance` is matched against a curated regex watchlist
(Uncensored, Abliterated, Dolphin, Hermes, OpenHermes, NSFW,
RP, ERP, Wizard, …). RP/ERP are word-boundary anchored so
identifiers like RPCS3/ERPNext don't trigger. Useful for routing or
policy decisions.

`Model.Tags` is omitted from the JSON entirely when both lists are empty.
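The word-boundary anchoring in sketch form, with an abbreviated, illustrative watchlist (the library's curated set is larger):

```go
import "regexp"

// Abbreviated, illustrative watchlist. The \b anchors keep RP/ERP from
// matching inside identifiers like RPCS3 or ERPNext.
var watchlist = []*regexp.Regexp{
	regexp.MustCompile(`(?i)uncensored`),
	regexp.MustCompile(`(?i)abliterated`),
	regexp.MustCompile(`(?i)\berp\b`),
	regexp.MustCompile(`(?i)\brp\b`),
}

// complianceHits returns every watchlist match found in the model name.
func complianceHits(name string) []string {
	var hits []string
	for _, re := range watchlist {
		if m := re.FindString(name); m != "" {
			hits = append(hits, m)
		}
	}
	return hits
}
```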
Walks the declared parent outward, depth-capped (default 8) and
cycle-safe. The parent at each step is read from `cardData.base_model`
first, then from authoritative HF `base_model:*` tags as a fallback
(those tags are present even when the API summary drops `cardData`).
`Lineage[0]` is the immediate parent; the last element is the deepest
declared ancestor.
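A minimal sketch of a depth-capped, cycle-safe walk; `parentOf` is a hypothetical stand-in for the `cardData.base_model` / tag lookup described above:

```go
// walkLineage is illustrative only: parentOf is a hypothetical lookup that
// returns a repo's declared base_model, or "" when none is declared.
func walkLineage(id string, parentOf func(string) string, maxDepth int) []string {
	seen := map[string]bool{id: true}
	var lineage []string
	for cur := id; len(lineage) < maxDepth; {
		parent := parentOf(cur)
		if parent == "" || seen[parent] { // end of chain, or a cycle
			break
		}
		seen[parent] = true
		lineage = append(lineage, parent)
		cur = parent
	}
	return lineage
}
```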
`Model.Ancestor` carries a fully-resolved (non-recursive) view of the
top-most upstream model we could identify:

- When `Lineage` is non-empty, `Ancestor` resolves the last entry (the deepest declared `base_model`).
- When the model is not on HF (e.g. llama.cpp ids like `qwen2.5-7b-instruct-q4_k_m`), or is on HF but declares no `base_model` and is quantized (e.g. an NVFP4 fork whose author didn't fill in the model card), the library searches HF (`/api/models?search=…&sort=downloads`) with a cleaned-up query and picks the best non-quantized candidate. Concretely:
  - The query and candidate ids are stripped of vendor/GGUF quant suffixes and fork-marker words (`uncensored`, `abliterated`, `dolphin`, `hermes`, `merge`, `dare`, `lora`, `dpo`, `ultra`, …), so HF's ranker doesn't bias toward other forks.
  - Candidates whose own tags include a quant marker (`gguf`, `compressed-tensors`, `awq`, `gptq`, `bitsandbytes`, `4-bit`, `8-bit`, `nf4`, `exl2`) are dropped — a quant of a quant isn't an upstream.
  - Two similarity gates (sketched after this list): at least 60% of normalized query tokens must appear in the candidate (query ⊆ candidate), and at least 80% of the candidate's tokens must appear in the query (candidate ⊆ query). The second gate kills sibling forks that introduce extra tokens beyond the query.

  Disable the search fallback with `Enumerator.SkipGuessParent`. Native-dtype HF models with no lineage (bf16/fp16/fp32) are treated as bases themselves and never searched, so true bases like `meta-llama/Meta-Llama-3-8B` aren't pointed at their own siblings.
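A minimal sketch of the two token-overlap gates, assuming the tokens have already been normalized and suffix-stripped; the 0.6/0.8 thresholds mirror the percentages above:

```go
// containment is illustrative only: the fraction of a's tokens found in b.
func containment(a, b []string) float64 {
	set := map[string]bool{}
	for _, t := range b {
		set[t] = true
	}
	hit := 0
	for _, t := range a {
		if set[t] {
			hit++
		}
	}
	if len(a) == 0 {
		return 0
	}
	return float64(hit) / float64(len(a))
}

// passesGates applies both gates: query ⊆ candidate and candidate ⊆ query.
func passesGates(query, candidate []string) bool {
	return containment(query, candidate) >= 0.6 &&
		containment(candidate, query) >= 0.8
}
```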
If the resolved `Ancestor` itself declares a `base_model`, the library
pivots to the tip of that chain and re-resolves — repeating up to
`MaxLineageDepth` times — so you land on the deepest reachable base
rather than stopping at the first hop. This matters mostly after a
search-guess: a guessed parent like `org/llama-3-8b-instruct` will
pivot to `org/llama-3-8b` when the latter is declared upstream.
The non-recursion guarantee: `Ancestor.Ancestor` is always `nil`. `Ancestor`
entries do walk their own `Lineage`, though, so you can still see the
ancestor's own declared upstream chain.
The endpoint-reported value wins (it reflects the configured serving limit,
e.g. `--max-model-len 32768` on a 128k-context model). If the endpoint
omits it, the library falls back to `config.max_position_embeddings` from HF.
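That precedence in sketch form, with hypothetical variable names:

```go
// Illustrative only: the endpoint's configured serving limit wins;
// the HF config value is the fallback.
maxModelLen := endpointMaxModelLen // from /v1/models, 0 when omitted
if maxModelLen == 0 {
	maxModelLen = hfConfig.MaxPositionEmbeddings
}
```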
Enumerator fields:

| Field | Purpose |
|---|---|
| `EndpointURL` | OpenAI-compatible base; `/v1` and `/v1/models` both accepted. |
| `APIKey` | Bearer token for the endpoint. |
| `HFBaseURL` | Override the Hub root (tests / private mirrors). |
| `HFToken` | Bearer token for HuggingFace. |
| `HTTPClient` | Custom client for endpoint requests (default: 30s timeout). |
| `HFHTTPClient` | Custom client for HF requests (default: 30s timeout). |
| `MaxLineageDepth` | Cap on `base_model` traversal (default 8). |
| `SkipHF` | Disable HF resolution; only id-derived signals are used. |
| `SkipGuessParent` | Disable the HF search fallback used to populate `Ancestor` when direct resolution fails. Lineage-tip `Ancestor` is unaffected. |
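A fuller configuration example built from the fields above (the values are illustrative):

```go
import (
	"net/http"
	"os"
	"time"

	modelmeta "github.com/algonode/model-meta"
)

e := &modelmeta.Enumerator{
	EndpointURL:     "http://localhost:8000/v1", // /v1 and /v1/models both accepted
	APIKey:          os.Getenv("VLLM_API_KEY"),
	HFToken:         os.Getenv("HF_TOKEN"),
	HTTPClient:      &http.Client{Timeout: 10 * time.Second}, // override the 30s default
	MaxLineageDepth: 4,    // shallower base_model walk
	SkipGuessParent: true, // never search HF to guess an Ancestor
}
```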
```sh
make test       # unit tests
make test-race  # race detector
make cover      # coverage report
make all        # fmt + vet + test
```

The repo follows a tight convention: every functional change ships in its own commit with a short rationale, and every new or modified feature has a test.