Skip to content

llama.cpp

Runtime provider id: llama-cpp. Runs GGUF models in-process via node-llama-cpp — no separate server required. This is Hooman’s default out-of-the-box provider: a fresh config.json ships three model presets — Qwen/Qwen3-1.7B-GGUF:Q8_0 (the default), unsloth/Qwen3.5-0.8B-MTP-GGUF:Q8_0, and unsloth/gemma-4-E2B-it-GGUF:Q8_0 — so the first turn works with no API keys or local runtime setup. Weights are downloaded from the Hugging Face Hub (via @huggingface/hub) into ~/.hooman/cache/huggingface on first use and reused afterwards.

Field Type Notes
hfToken string Optional. Hugging Face access token for gated/private repos. Falls back to the HF_TOKEN env var.
gpu string or false Optional. "auto" (the default when unset), "metal", "cuda", "vulkan", or false for CPU-only inference.
context number Optional. Context size in tokens. A per-LLM options.context overrides it; when both are absent, node-llama-cpp adapts it to the model’s training context and available memory.
reasoning object Optional. See Reasoning.

The LLM entry’s options.model accepts:

  • owner/repo — a Hugging Face GGUF repo. The repo’s GGUF file is auto-detected; when several quant variants exist, common quantizations are preferred (Q4_K_M first).
  • owner/repo:QUANT — pick the variant matching a quant tag, llama.cpp style (e.g. Qwen/Qwen3-1.7B-GGUF:Q8_0, the default).
  • owner/repo/path/to/file.gguf — pin an exact file. Sharded GGUFs are supported; point at the first shard and the siblings are fetched too.
  • A local .gguf path (absolute, ./relative, or ~/-prefixed).

An optional hf: prefix is accepted and stripped.

Providing reasoning enables thinking on reasoning-capable GGUFs: the chat template is configured to allow thought segments (Qwen3 thinking mode, Gemma 4 reasoning turns, gpt-oss/Harmony native reasoning-effort levels), and reasoning.effort caps thought tokens via node-llama-cpp’s thought budget (minimal→1024, low→2048, medium→4096, high→8192; default medium). Omit reasoning to disable thinking — templates are told to discourage thoughts and the thought budget is forced to 0, which also reins in models that always think (e.g. DeepSeek-R1 distills). summary/display are not used.

Default models from the Hub (matches the out-of-the-box config.json — Qwen3 is the active default, Qwen3.5 and Gemma 4 ship alongside it):

{
"name": "llama.cpp",
"provider": "llama-cpp",
"options": {}
}
[
{
"name": "Qwen3 1.7B",
"provider": "llama.cpp",
"options": {
"model": "Qwen/Qwen3-1.7B-GGUF:Q8_0",
"context": 32768
},
"default": true
},
{
"name": "Qwen3.5 0.8B",
"provider": "llama.cpp",
"options": {
"model": "unsloth/Qwen3.5-0.8B-MTP-GGUF:Q8_0",
"context": 262144
},
"default": false
},
{
"name": "Gemma 4 E2B",
"provider": "llama.cpp",
"options": {
"model": "unsloth/gemma-4-E2B-it-GGUF:Q8_0",
"context": 131072
},
"default": false
}
]

Each preset pins options.context to the model’s full training window (32K for Qwen3 1.7B, 256K for Qwen3.5 0.8B, 128K for Gemma 4 E2B). The per-LLM context is specific to this provider and overrides the provider-level context.

Gated repo with a pinned quant file and CPU-only inference:

{
"name": "llama.cpp CPU",
"provider": "llama-cpp",
"options": {
"hfToken": "hf_...",
"gpu": false,
"context": 8192
}
}
{
"name": "Qwen3 8B Q4",
"provider": "llama.cpp CPU",
"options": {
"model": "Qwen/Qwen3-8B-GGUF/Qwen3-8B-Q4_K_M.gguf",
"temperature": 0.7
}
}

First use of a Hub model downloads the weights, which can take a while for large files; subsequent runs load from the cache. Downloads report live progress — percent, transferred/total size, speed, and ETA — on every surface: a progress bar above the composer in chat, a progress line on stderr in exec and daemon, and a download strip in the VS Code extension. Sharded GGUFs report per-shard progress.