Skip to main content

Local Providers

Local providers run models on your own hardware. No API key required, no data sent externally. Perfect for private deployments or offline use. Two cards in Settings → Models cover this:

  • Ollama (Local) — for Ollama's native API
  • Hugging Face (Local) — for running Hugging Face Hub models on your own hardware via transformers + torch
  • Local V1 Compatible — for any locally-hosted OpenAI-compatible server (vLLM, LM Studio, llama.cpp server, Jan, Text Generation WebUI, Ollama's /v1 endpoint, etc.)

Ollama

Model prefix: none (bare model name)

Install Ollama from ollama.ai and pull the models you want:

ollama pull mistral
ollama pull llama3
ollama pull qwen2.5:14b
ollama pull phi4

Then in Synapse open Settings → Models and click the Ollama (Local) card. By default Synapse looks at http://127.0.0.1:11434; if your Ollama is running elsewhere, override the Base URL field in the card. Click Save — the card's status dot turns green once Synapse can reach Ollama and the pulled models appear in the selector.

ModelSizeBest for
mistral7BGeneral purpose, fast
llama38BStrong reasoning
qwen2.5:14b14BHigh quality on consumer hardware
phi414BStrong at coding and math
deepseek-r1:14b14BDeep reasoning
nomic-embed-textEmbeddings for memory
Advanced: direct settings.json edit (or OLLAMA_BASE_URL env var)
{
"mode": "local",
"model": "mistral",
"ollama_base_url": "http://127.0.0.1:11434"
}

Hugging Face (Local)

Model prefix: hf.<huggingface_model_id>

Loads models directly from the Hugging Face Hub and runs inference locally via transformers + torch. No API call leaves your machine, but the model is downloaded the first time it is used.

Prerequisites on the host: torch and transformers installed and importable from the backend's Python environment. A CUDA-capable GPU is strongly recommended — 7B-class models need roughly 16–40 GB of VRAM.

In Settings → Models click the Hugging Face (Local) card and fill:

FieldExampleNotes
Access Token (optional)hf_...Only required for gated models (Llama, Gemma, etc.). Generate at huggingface.co/settings/tokens.
Model IDs (comma- or newline-separated)Qwen/Qwen2.5-7B-Instruct, meta-llama/Llama-3.1-8B-InstructUse the full Hub repo IDs.
Max New Tokens1024Maximum tokens generated per call.

Reference models as hf.<the full repo id>, e.g. hf.Qwen/Qwen2.5-7B-Instruct.

The first call to a model pays a 20–60s load cost; afterwards the model stays resident in memory for fast subsequent calls.

Limitations
  • No native tool calling — tools are injected into the prompt for HF models.
  • No vision support.
Advanced: direct settings.json edit
{
"huggingface_token": "hf_...",
"huggingface_models": "Qwen/Qwen2.5-7B-Instruct,meta-llama/Llama-3.1-8B-Instruct",
"huggingface_max_new_tokens": 1024
}

Local OpenAI-Compatible (vLLM, LM Studio, etc.)

Model prefix: locv1.<model_name>

In Settings → Models click the Local V1 Compatible card and fill:

FieldExample
API KeyAny non-empty string (most local servers don't validate it)
Base URLhttp://localhost:8080/v1 (your server's v1 endpoint)
Model Names (comma-separated)mistral-7b, llama3-8b
Embedding Model Names (comma-separated, optional)bge-m3, nomic-embed-text

Reference models as locv1.mistral-7b (prefix + the name you listed).

Compatible servers

ServerDescription
vLLMHigh-throughput inference, GPU-optimised
LM StudioDesktop app with a UI, macOS/Windows
Ollama /v1Ollama's OpenAI-compatible endpoint at :11434/v1
llama.cpp serverLightweight, CPU-friendly
JanDesktop app, cross-platform
Text Generation WebUIFeature-rich, multi-backend

Embedding models

The Embedding Model Names field is how you point code search (embed_code enabled) and long-term memory at your local embedder. The first model listed is used by default.

Advanced: direct settings.json edit
{
"local_compatible_base_url": "http://localhost:8080/v1",
"local_compatible_key": "any-string",
"local_compatible_models": "mistral-7b,llama3-8b",
"local_compatible_embed_models": "bge-m3,nomic-embed-text"
}

Hardware requirements

Model sizeMin VRAMRecommended
7B6 GB8 GB (RTX 3070)
13B10 GB16 GB (RTX 4080)
14B10 GB12 GB
30B20 GB24 GB
70B40 GB48 GB (2×RTX 3090)

CPU-only inference works for 7B models at ~5–10 tokens/second with 16 GB RAM.