Local Providers
Local providers run models on your own hardware. No API key required, no data sent externally. Perfect for private deployments or offline use. Two cards in Settings → Models cover this:
- Ollama (Local) — for Ollama's native API
- Hugging Face (Local) — for running Hugging Face Hub models on your own hardware via
transformers+torch - Local V1 Compatible — for any locally-hosted OpenAI-compatible server (vLLM, LM Studio, llama.cpp server, Jan, Text Generation WebUI, Ollama's
/v1endpoint, etc.)
Ollama
Model prefix: none (bare model name)
Install Ollama from ollama.ai and pull the models you want:
ollama pull mistral
ollama pull llama3
ollama pull qwen2.5:14b
ollama pull phi4
Then in Synapse open Settings → Models and click the Ollama (Local) card. By default Synapse looks at http://127.0.0.1:11434; if your Ollama is running elsewhere, override the Base URL field in the card. Click Save — the card's status dot turns green once Synapse can reach Ollama and the pulled models appear in the selector.
| Model | Size | Best for |
|---|---|---|
mistral | 7B | General purpose, fast |
llama3 | 8B | Strong reasoning |
qwen2.5:14b | 14B | High quality on consumer hardware |
phi4 | 14B | Strong at coding and math |
deepseek-r1:14b | 14B | Deep reasoning |
nomic-embed-text | — | Embeddings for memory |
Advanced: direct settings.json edit (or OLLAMA_BASE_URL env var)
{
"mode": "local",
"model": "mistral",
"ollama_base_url": "http://127.0.0.1:11434"
}
Hugging Face (Local)
Model prefix: hf.<huggingface_model_id>
Loads models directly from the Hugging Face Hub and runs inference locally via transformers + torch. No API call leaves your machine, but the model is downloaded the first time it is used.
Prerequisites on the host: torch and transformers installed and importable from the backend's Python environment. A CUDA-capable GPU is strongly recommended — 7B-class models need roughly 16–40 GB of VRAM.
In Settings → Models click the Hugging Face (Local) card and fill:
| Field | Example | Notes |
|---|---|---|
| Access Token (optional) | hf_... | Only required for gated models (Llama, Gemma, etc.). Generate at huggingface.co/settings/tokens. |
| Model IDs (comma- or newline-separated) | Qwen/Qwen2.5-7B-Instruct, meta-llama/Llama-3.1-8B-Instruct | Use the full Hub repo IDs. |
| Max New Tokens | 1024 | Maximum tokens generated per call. |
Reference models as hf.<the full repo id>, e.g. hf.Qwen/Qwen2.5-7B-Instruct.
The first call to a model pays a 20–60s load cost; afterwards the model stays resident in memory for fast subsequent calls.
- No native tool calling — tools are injected into the prompt for HF models.
- No vision support.
Advanced: direct settings.json edit
{
"huggingface_token": "hf_...",
"huggingface_models": "Qwen/Qwen2.5-7B-Instruct,meta-llama/Llama-3.1-8B-Instruct",
"huggingface_max_new_tokens": 1024
}
Local OpenAI-Compatible (vLLM, LM Studio, etc.)
Model prefix: locv1.<model_name>
In Settings → Models click the Local V1 Compatible card and fill:
| Field | Example |
|---|---|
| API Key | Any non-empty string (most local servers don't validate it) |
| Base URL | http://localhost:8080/v1 (your server's v1 endpoint) |
| Model Names (comma-separated) | mistral-7b, llama3-8b |
| Embedding Model Names (comma-separated, optional) | bge-m3, nomic-embed-text |
Reference models as locv1.mistral-7b (prefix + the name you listed).
Compatible servers
| Server | Description |
|---|---|
| vLLM | High-throughput inference, GPU-optimised |
| LM Studio | Desktop app with a UI, macOS/Windows |
| Ollama /v1 | Ollama's OpenAI-compatible endpoint at :11434/v1 |
| llama.cpp server | Lightweight, CPU-friendly |
| Jan | Desktop app, cross-platform |
| Text Generation WebUI | Feature-rich, multi-backend |
Embedding models
The Embedding Model Names field is how you point code search (embed_code enabled) and long-term memory at your local embedder. The first model listed is used by default.
Advanced: direct settings.json edit
{
"local_compatible_base_url": "http://localhost:8080/v1",
"local_compatible_key": "any-string",
"local_compatible_models": "mistral-7b,llama3-8b",
"local_compatible_embed_models": "bge-m3,nomic-embed-text"
}
Hardware requirements
| Model size | Min VRAM | Recommended |
|---|---|---|
| 7B | 6 GB | 8 GB (RTX 3070) |
| 13B | 10 GB | 16 GB (RTX 4080) |
| 14B | 10 GB | 12 GB |
| 30B | 20 GB | 24 GB |
| 70B | 40 GB | 48 GB (2×RTX 3090) |
CPU-only inference works for 7B models at ~5–10 tokens/second with 16 GB RAM.