Local Providers

Local providers run models on your own hardware. No API key required, no data sent externally. Perfect for private deployments or offline use.

Ollama

Model prefix: none (bare model name)

Ollama is the simplest way to run local models. Install it from ollama.ai and pull models:

ollama pull mistral
ollama pull llama3
ollama pull qwen2.5:14b
ollama pull phi4
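
To confirm the Ollama server is running and see which models you have pulled, you can list them with either of these commands (127.0.0.1:11434 is Ollama's default address):

ollama list
curl http://127.0.0.1:11434/api/tags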

Configuration:

{
"mode": "local",
"model": "mistral",
"ollama_base_url": "http://127.0.0.1:11434"
}

Or set via environment: OLLAMA_BASE_URL=http://127.0.0.1:11434
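
For example, to point the client at an Ollama instance on another machine, export the variable before launching (the host below is purely illustrative):

export OLLAMA_BASE_URL=http://192.168.1.50:11434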

Popular models:

Model              Size   Best for
mistral            7B     General purpose, fast
llama3             8B     Strong reasoning
qwen2.5:14b        14B    High quality on consumer hardware
phi4               14B    Strong at coding and math
deepseek-r1:14b    14B    Deep reasoning
nomic-embed-text   —      Embeddings for memory
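
Once a model is pulled, a quick smoke test from the shell confirms it loads and generates (the prompt is arbitrary):

ollama run mistral "Explain in one sentence what a local model provider is."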

Local OpenAI-Compatible (vLLM, LM Studio, etc.)

Use this provider for locally hosted servers that expose an OpenAI-compatible API.

Model prefix: locv1.<model_name>

{
"local_compatible_base_url": "http://localhost:8080/v1",
"local_compatible_key": "any-string",
"local_compatible_models": "mistral-7b,llama3-8b"
}

Use the model: locv1.mistral-7b
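
You can sanity-check the server outside the tool with a plain OpenAI-style request; note that the server itself sees the bare model name, without the locv1. prefix (the URL, key, and model name follow the example config above):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer any-string" \
  -d '{"model": "mistral-7b", "messages": [{"role": "user", "content": "Hello"}]}'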

Compatible servers

Server                  Description
vLLM                    High-throughput inference, GPU-optimised
LM Studio               Desktop app with a UI, macOS/Windows
Ollama /v1              Ollama's OpenAI-compatible endpoint at :11434/v1
llama.cpp server        Lightweight, CPU-friendly
Jan                     Desktop app, cross-platform
Text Generation WebUI   Feature-rich, multi-backend
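
As an illustration, vLLM can serve an OpenAI-compatible endpoint on the port used in the config above (the Hugging Face model name here is only an example):

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8080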

Embedding models

{
"local_compatible_embed_models": "bge-m3,nomic-embed-text"
}

These models are used for code embeddings when embed_code: true is set.
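
To verify that the server exposes an embedding model, you can call the standard /v1/embeddings route directly (the model name is taken from the example config above):

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": "def hello(): pass"}'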


Hardware requirements

Model size   Min VRAM   Recommended
7B           6 GB       8 GB (RTX 3070)
13B          10 GB      16 GB (RTX 4080)
14B          10 GB      12 GB
30B          20 GB      24 GB
70B          40 GB      48 GB (2×RTX 3090)

CPU-only inference works for 7B models at ~5-10 tokens/second with 16 GB RAM.
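
On NVIDIA hardware, you can check how much VRAM is available before choosing a model size:

nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv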