Local Providers

Local providers run models on your own hardware. No API key required, no data sent externally. Perfect for private deployments or offline use.

Ollama

Model prefix: none (bare model name)

Ollama is the simplest way to run local models. Install it from ollama.ai and pull models:

ollama pull mistral
ollama pull llama3
ollama pull qwen2.5:14b
ollama pull phi4
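
To confirm the Ollama server is running and see which models you have pulled, you can list them with either of these commands (127.0.0.1:11434 is Ollama's default address):

ollama list
curl http://127.0.0.1:11434/api/tags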

Configuration:

{
"mode": "local",
"model": "mistral",
"ollama_base_url": "http://127.0.0.1:11434"
}

Or set via environment: OLLAMA_BASE_URL=http://127.0.0.1:11434
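
For example, to point the client at an Ollama instance on another machine, export the variable before launching (the host below is purely illustrative):

export OLLAMA_BASE_URL=http://192.168.1.50:11434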

Popular models:

Model              Size   Best for
mistral            7B     General purpose, fast
llama3             8B     Strong reasoning
qwen2.5:14b        14B    High quality on consumer hardware
phi4               14B    Strong at coding and math
deepseek-r1:14b    14B    Deep reasoning
nomic-embed-text   —      Embeddings for memory
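
Once a model is pulled, a quick smoke test from the shell confirms it loads and generates (the prompt is arbitrary):

ollama run mistral "Explain in one sentence what a local model provider is."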

Local OpenAI-Compatible (vLLM, LM Studio, etc.)

Use this provider for locally hosted servers that expose an OpenAI-compatible API.

Model prefix: locv1.<model_name>

{
"local_compatible_base_url": "http://localhost:8080/v1",
"local_compatible_key": "any-string",
"local_compatible_models": "mistral-7b,llama3-8b"
}

Use the model: locv1.mistral-7b
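
You can sanity-check the server outside the tool with a plain OpenAI-style request; note that the server itself sees the bare model name, without the locv1. prefix (the URL, key, and model name follow the example config above):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer any-string" \
  -d '{"model": "mistral-7b", "messages": [{"role": "user", "content": "Hello"}]}'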

Compatible servers

Server                  Description
vLLM                    High-throughput inference, GPU-optimised
LM Studio               Desktop app with a UI, macOS/Windows
Ollama /v1              Ollama's OpenAI-compatible endpoint at :11434/v1
llama.cpp server        Lightweight, CPU-friendly
Jan                     Desktop app, cross-platform
Text Generation WebUI   Feature-rich, multi-backend
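
As an illustration, vLLM can serve an OpenAI-compatible endpoint on the port used in the config above (the Hugging Face model name here is only an example):

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8080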

Embedding models

{
"local_compatible_embed_models": "bge-m3,nomic-embed-text"
}

These models are used for code embeddings when embed_code: true is set.
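
To verify that the server exposes an embedding model, you can call the standard /v1/embeddings route directly (the model name is taken from the example config above):

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": "def hello(): pass"}'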


Hardware requirements

Model size   Min VRAM   Recommended
7B           6 GB       8 GB (RTX 3070)
13B          10 GB      16 GB (RTX 4080)
14B          10 GB      12 GB
30B          20 GB      24 GB
70B          40 GB      48 GB (2×RTX 3090)

CPU-only inference works for 7B models at ~5-10 tokens/second with 16 GB RAM.
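
On NVIDIA hardware, you can check how much VRAM is available before choosing a model size:

nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv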