A tour of llama.cpp's llama-server HTTP endpoints
llama-server is the HTTP front-end that ships with llama.cpp. One binary gives you a native API, an OpenAI-compatible API, an Ollama-compatible shim, plus a built-in web UI — all backed by the same GGUF model and the same KV cache slots.
For this post I poked at a local instance running build b8681-Debian with unsloth/Qwen3.6-35B-A3B-GGUF loaded (4 slots, n_ctx=65536, vision enabled). Every example below is a real request/response from that server.
Server introspection
GET /health
Trivial liveness probe. Returns 200 OK once the model is loaded.
$ curl -s http://localhost:8080/health
{"status":"ok"}
GET /props
Everything the server is willing to tell you about itself: default sampler params, model alias, model path, modalities, the chat template, build info, sleep state. Useful for clients that want to auto-detect chat-template capabilities or enabled features.
$ curl -s http://localhost:8080/props | jq '{model_alias, total_slots, modalities, build_info, eos_token,
n_ctx: .default_generation_settings.n_ctx}'
{
"model_alias": "unsloth/Qwen3.6-35B-A3B-GGUF",
"total_slots": 4,
"modalities": { "vision": true, "audio": false },
"build_info": "b8681-Debian",
"eos_token": "<|im_end|>",
"n_ctx": 65536
}
The full payload also includes chat_template_caps (does the template support tools? parallel tool calls? typed content?), which is what well-behaved clients should branch on instead of hard-coding model names.
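For example, a client can branch on the caps block directly before deciding whether to send tools (the exact keys inside chat_template_caps may vary by build):
$ curl -s http://localhost:8080/props | jq '.chat_template_caps'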
GET /slots
Each parallel decoding slot the server owns. Handy for capacity planning and for watching what’s currently being processed.
$ curl -s http://localhost:8080/slots | jq '.[] | {id, n_ctx, is_processing, speculative}'
{ "id": 0, "n_ctx": 65536, "is_processing": false, "speculative": false }
{ "id": 1, "n_ctx": 65536, "is_processing": false, "speculative": false }
{ "id": 2, "n_ctx": 65536, "is_processing": false, "speculative": false }
{ "id": 3, "n_ctx": 65536, "is_processing": false, "speculative": false }
Slots that are mid-generation also expose the active sampler params and a next_token block with n_remain and n_decoded.
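A quick filter for catching slots mid-generation, using the fields shown above:
$ curl -s http://localhost:8080/slots | jq '.[] | select(.is_processing) | {id, next_token}'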
GET /metrics
Prometheus-format metrics — but only when the server was launched with --metrics. Otherwise:
$ curl -s -w '%{http_code}\n' http://localhost:8080/metrics
501
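If you want the metrics, relaunch with the flag and scrape as usual (model path illustrative):
$ llama-server -m ./model.gguf --metrics
$ curl -s http://localhost:8080/metrics | head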
GET /lora-adapters / POST /lora-adapters
List loaded LoRA adapters and (with POST) adjust their scales at runtime. Empty array on a vanilla server:
$ curl -s http://localhost:8080/lora-adapters
[]
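On a server started with --lora, the POST side takes a list of {id, scale} objects; going by the server docs, something like this should drop adapter 0 to half strength (untested here, since no adapter is loaded):
$ curl -s -X POST http://localhost:8080/lora-adapters \
  -H 'Content-Type: application/json' \
  -d '[{"id":0,"scale":0.5}]'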
Native generation API
The native endpoints predate the OpenAI shim. They expose more knobs (DRY, XTC, top_n_sigma, mirostat, slot pinning via id_slot, custom grammar, …) and return llama.cpp-shaped JSON.
POST /completion
The workhorse. Give it a prompt (string or token array), get back generated text plus rich timing/sampler metadata.
$ curl -s -X POST http://localhost:8080/completion \
-H 'Content-Type: application/json' \
-d '{"prompt":"The capital of France is","n_predict":8,"temperature":0,"stream":false}'
{
"index": 0,
"content": " Paris.\nThe capital of France is",
"id_slot": 2,
"stop": true,
"model": "unsloth/Qwen3.6-35B-A3B-GGUF",
"tokens_predicted": 8,
"tokens_evaluated": 5,
"stop_type": "limit",
"tokens_cached": 12,
"timings": {
"prompt_n": 5, "prompt_ms": 118.6, "prompt_per_second": 42.2,
"predicted_n": 8, "predicted_ms": 319.8, "predicted_per_second": 25.0
}
}
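To get a feel for the native-only knobs, here's the same endpoint pinned to slot 0 and constrained by a GBNF grammar (prompt and grammar string are illustrative):
$ curl -s -X POST http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"Pick a color:","n_predict":4,"id_slot":0,
      "grammar":"root ::= \"red\" | \"green\" | \"blue\""}'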
Setting "stream": true switches the response to Server-Sent Events with one chunk per token (data: {...}\n\n), terminated by a final chunk that has "stop": true.
POST /tokenize and POST /detokenize
Round-trip text ↔ token IDs against the loaded model’s tokenizer.
$ curl -s -X POST http://localhost:8080/tokenize \
-H 'Content-Type: application/json' \
-d '{"content":"Hello, world!"}'
{"tokens":[9419,11,1814,0]}
$ curl -s -X POST http://localhost:8080/detokenize \
-H 'Content-Type: application/json' \
-d '{"tokens":[9419,11,1814,0]}'
{"content":"Hello, world!"}
/tokenize also accepts "add_special": true to prepend BOS, and "with_pieces": true to get string pieces alongside the IDs.
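Both options together (request shown only; with_pieces pairs each ID with its string piece):
$ curl -s -X POST http://localhost:8080/tokenize \
  -H 'Content-Type: application/json' \
  -d '{"content":"Hello, world!","add_special":true,"with_pieces":true}'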
POST /apply-template
Renders a chat-style messages array through the model’s Jinja chat template, without running inference. Great for building proxies that need the exact prompt the model would otherwise see.
$ curl -s -X POST http://localhost:8080/apply-template \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":"Hi"}]}'
{"prompt":"<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n<think>\n"}
Note how Qwen3.6 opens a <think> block automatically — this server has reasoning support baked into the template.
POST /infill
Fill-in-the-middle for code models. Pass input_prefix and input_suffix; the server stitches them with the model’s FIM tokens before decoding.
$ curl -s -X POST http://localhost:8080/infill \
-H 'Content-Type: application/json' \
-d '{"input_prefix":"def add(a,b):\n return ",
"input_suffix":"\n",
"prompt":"",
"n_predict":8,
"temperature":0}'
{"index":0,"content":"a+b","stop":true,"tokens_predicted":3, ...}
OpenAI-compatible API
Drop-in for anything that already speaks the OpenAI REST shape — point your OPENAI_BASE_URL at the server and most SDKs Just Work.
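In practice that's two environment variables for the official SDKs; the key can be any placeholder unless the server was started with --api-key:
$ export OPENAI_BASE_URL=http://localhost:8080/v1
$ export OPENAI_API_KEY=unused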
GET /v1/models
$ curl -s http://localhost:8080/v1/models | jq '.data[0] | {id, owned_by, meta}'
{
"id": "unsloth/Qwen3.6-35B-A3B-GGUF",
"owned_by": "llamacpp",
"meta": {
"vocab_type": 2, "n_vocab": 248320,
"n_ctx_train": 262144, "n_embd": 2048,
"n_params": 34660610688, "size": 22123538944
}
}
POST /v1/chat/completions
$ curl -s -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":"Reply with exactly the word: pong"}],
"max_tokens":4,"temperature":0}'
{
"object": "chat.completion",
"model": "unsloth/Qwen3.6-35B-A3B-GGUF",
"choices": [{
"index": 0,
"finish_reason": "length",
"message": {
"role": "assistant",
"content": "",
"reasoning_content": "Here's a thinking"
}
}],
"usage": { "prompt_tokens": 17, "completion_tokens": 4, "total_tokens": 21 },
"system_fingerprint": "b8681-Debian"
}
Two things to notice that aren’t strictly OpenAI:
- reasoning_content is split out from content because the chat template emits a <think> block. Set "reasoning_format": "none" to keep everything in content.
- system_fingerprint is the llama.cpp build tag, useful for reproducibility logs.
The endpoint also accepts the OpenAI tool-calling shape (tools, tool_choice, tool_calls in responses), JSON-mode (response_format: { "type": "json_object" }), and llama.cpp’s grammar extensions (grammar, json_schema).
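For example, JSON mode in the standard OpenAI shape:
$ curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Give me a JSON object with keys a and b"}],
      "response_format":{"type":"json_object"},"max_tokens":64,"temperature":0}'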
POST /v1/completions
Legacy text completion. Same generation engine, different response wrapper.
$ curl -s -X POST http://localhost:8080/v1/completions \
-H 'Content-Type: application/json' \
-d '{"prompt":"2+2=","max_tokens":4,"temperature":0}'
{
"object": "text_completion",
"choices": [{ "text": "5\n\n<think>\n", "index": 0, "finish_reason": "length" }],
"usage": { "prompt_tokens": 4, "completion_tokens": 4, "total_tokens": 8 }
}
POST /v1/embeddings and POST /reranking
Both exist, but they're gated behind launch flags, and this server wasn't started with either:
$ curl -s -X POST http://localhost:8080/v1/embeddings \
-H 'Content-Type: application/json' -d '{"input":"hello"}'
{"error":{"code":501,
"message":"This server does not support embeddings. Start it with `--embeddings`",
"type":"not_supported_error"}}
$ curl -s -X POST http://localhost:8080/reranking \
-H 'Content-Type: application/json' -d '{"query":"q","documents":["a","b"]}'
{"error":{"code":501,
"message":"This server does not support reranking. Start it with `--reranking`",
"type":"not_supported_error"}}
Pooling-based models (BGE, E5, etc.) only run as embedders, so embeddings and chat completion are mutually exclusive on a single process. Run a second llama-server if you need both.
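A minimal sketch of that split (model path and port are illustrative):
$ llama-server -m ./bge-m3-Q8_0.gguf --embeddings --port 8081
$ curl -s -X POST http://localhost:8081/v1/embeddings \
  -H 'Content-Type: application/json' -d '{"input":"hello"}'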
Ollama-compatible shim
Just enough surface for clients that hard-code Ollama’s API.
GET /api/tags
$ curl -s http://localhost:8080/api/tags | jq '.models[0] | {name, capabilities, details}'
{
"name": "unsloth/Qwen3.6-35B-A3B-GGUF",
"capabilities": ["completion", "multimodal"],
"details": { "format": "gguf", "family": "", "families": [""] }
}
There’s also /api/show (model details) on builds that include it; this server returned 404, so YMMV by version.
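A quick probe confirms it on this build:
$ curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/api/show
404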
The web UI
Hitting GET / with a browser serves a static SimpleChat page — a tiny vanilla-JS client that drives /v1/chat/completions and /v1/completions. It’s perfect for sanity-checking that the model loaded and the chat template renders correctly, without pulling in any external UI.
Cheat sheet
| Endpoint | Method | Purpose | Needs flag |
|---|---|---|---|
| /health | GET | Liveness probe | — |
| /props | GET | Server config + chat template | — |
| /slots | GET | Per-slot decode state | (default on) |
| /metrics | GET | Prometheus metrics | --metrics |
| /lora-adapters | GET/POST | List / re-scale LoRA adapters | --lora … |
| /completion | POST | Native completion (full sampler set) | — |
| /tokenize, /detokenize | POST | Tokenizer round-trip | — |
| /apply-template | POST | Render chat template, no inference | — |
| /infill | POST | Fill-in-the-middle for code | model-dependent |
| /v1/models | GET | OpenAI model list | — |
| /v1/chat/completions | POST | OpenAI chat (tools, JSON mode, grammar) | — |
| /v1/completions | POST | OpenAI legacy text completion | — |
| /v1/embeddings, /embedding | POST | Embeddings | --embeddings |
| /rerank, /v1/rerank | POST | Reranking | --reranking |
| /api/tags, /api/show | GET | Ollama-compatible model listing | — |
| / | GET | Built-in SimpleChat web UI | — |
Closing thoughts
The thing that keeps surprising me about llama-server is how much of a complete product it is for a “reference” server. You get OpenAI compatibility for free, an Ollama shim for free, deep introspection (/props, /slots, timings on every response), and a UI for free — all backed by the same KV cache and parallel decoding slots, in a single static binary. For local development and small-scale serving it’s hard to beat, and when you outgrow it, the OpenAI endpoint means most of your client code ports straight to a hosted API without changes.