Local Providers
Local providers run models on your own hardware. No API key is required and no data is sent externally, which makes them well suited to private deployments and offline use.
Ollama
Model prefix: none (bare model name)
Ollama is the simplest way to run local models. Install it from ollama.ai and pull models:
ollama pull mistral
ollama pull llama3
ollama pull qwen2.5:14b
ollama pull phi4
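Once a model is pulled, you can sanity-check it from the terminal with a one-shot prompt (the prompt text here is only illustrative):
ollama run mistral "Explain what a context window is in one sentence."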
Configuration:
{
"mode": "local",
"model": "mistral",
"ollama_base_url": "http://127.0.0.1:11434"
}
Or set it via an environment variable: OLLAMA_BASE_URL=http://127.0.0.1:11434
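To verify that the configured base URL is reachable, list the models the server currently has (a quick check, assuming the default port above):
curl http://127.0.0.1:11434/api/tags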
Popular models:
| Model | Size | Best for |
|---|---|---|
| mistral | 7B | General purpose, fast |
| llama3 | 8B | Strong reasoning |
| qwen2.5:14b | 14B | High quality on consumer hardware |
| phi4 | 14B | Strong at coding and math |
| deepseek-r1:14b | 14B | Deep reasoning |
| nomic-embed-text | — | Embeddings for memory (example below) |
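To try the embeddings row above, pull the model and call Ollama's embeddings endpoint (the input text is only an example):
ollama pull nomic-embed-text
curl http://127.0.0.1:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "hello world"}'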
Local OpenAI-Compatible (vLLM, LM Studio, etc.)
For locally hosted OpenAI-compatible servers.
Model prefix: locv1.<model_name>
{
"local_compatible_base_url": "http://localhost:8080/v1",
"local_compatible_key": "any-string",
"local_compatible_models": "mistral-7b,llama3-8b"
}
Use the model: locv1.mistral-7b
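Before wiring the server into the config, you can exercise it directly over the standard OpenAI chat-completions route; the base URL, key, and model name below mirror the example configuration (note the server itself sees the bare model name, as the locv1. prefix appears to be resolved client-side):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer any-string" \
  -d '{"model": "mistral-7b", "messages": [{"role": "user", "content": "Say hello."}]}'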
Compatible servers
| Server | Description |
|---|---|
| vLLM | High-throughput inference, GPU-optimised (launch example below) |
| LM Studio | Desktop app with a UI, macOS/Windows |
| Ollama /v1 | Ollama's OpenAI-compatible endpoint at :11434/v1 |
| llama.cpp server | Lightweight, CPU-friendly |
| Jan | Desktop app, cross-platform |
| Text Generation WebUI | Feature-rich, multi-backend |
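As a sketch of standing one of these up, a recent vLLM release can expose the OpenAI-compatible API directly; the model name and port here are illustrative:
pip install vllm
vllm serve mistralai/Mistral-7B-Instruct-v0.3 --port 8080
This serves http://localhost:8080/v1, matching the local_compatible_base_url example above.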
Embedding models
{
"local_compatible_embed_models": "bge-m3,nomic-embed-text"
}
These models are used for code embeddings when embed_code: true is set.
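These servers typically expose embeddings on the standard route as well, so a configured embedding model can be checked the same way (the input text is only an example; add the Authorization header if your server enforces a key):
curl http://localhost:8080/v1/embeddings -H "Content-Type: application/json" -d '{"model": "bge-m3", "input": "def main(): pass"}'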
Hardware requirements
| Model size | Min VRAM | Recommended VRAM |
|---|---|---|
| 7B | 6 GB | 8 GB (RTX 3070) |
| 13B | 10 GB | 16 GB (RTX 4080) |
| 14B | 10 GB | 12 GB |
| 30B | 20 GB | 24 GB |
| 70B | 40 GB | 48 GB (2×RTX 3090) |
CPU-only inference works for 7B models at ~5-10 tokens/second with 16 GB RAM.
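A likely basis for these figures: at 4-bit quantization, weights take roughly 0.5 bytes per parameter, so a 7B model needs about 3.5 GB for weights alone, with the remainder of the minimum going to KV cache and runtime overhead. For a CPU-only run, llama.cpp's bundled server is a common choice (the model path and thread count are illustrative):
llama-server -m ./mistral-7b-instruct-q4_k_m.gguf -t 8 --port 8080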