Caching
Synapse has three independent caches that together cut LLM cost and tool latency without changing what your agents do. Most of this runs automatically: you only need to know where to look in the UI to see the savings, and which option to enable when you want stronger caching on a specific orchestration step.
| Cache | Scope | Default | Where to view |
|---|---|---|---|
| Prompt cache | Provider-level (Anthropic, OpenAI, DeepSeek, Gemini, Bedrock) | On | Settings → Usage |
| Response cache | Per orchestration step (exact + semantic) | Off (opt-in per step) | Settings → Usage |
| Tool cache | Deterministic tools (code_search, pdf_parser, …) | On for eligible tools | Settings → Usage |
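The response and tool caches persist to disk under DATA_DIR/cache/, and their entries have a 1-hour TTL by default. The prompt cache is held by the LLM provider, so its lifetime follows the provider's own expiry rules.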
Prompt cache
The prompt cache piggybacks on each LLM provider's native caching mechanism. Synapse injects cache_control markers (Anthropic, Bedrock) or relies on automatic prefix caching (OpenAI, DeepSeek, Gemini) so that the long, stable parts of your system prompt (agent instructions, tool definitions, RAG context) are served from the provider's cache across turns instead of being reprocessed at full input price on every call.
What gets cached: the stable prefix of the system prompt plus the tools block. Synapse splits the system prompt at an internal volatile separator so turn-changing values (current time, turn budget, recent RAG matches) sit after the cache point — they update freely without invalidating the cached prefix.
Minimum prompt length: about 4000 characters (~1000 tokens). Below that, providers ignore the cache marker, so Synapse skips it to avoid paying the write surcharge for ineligible writes.
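The split-and-threshold behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Synapse's actual code: the separator string, the constant names, and both function names are hypothetical, and the length check is applied to the stable prefix only. The block shape with cache_control set to "ephemeral" follows Anthropic's documented request format.

```python
# Hypothetical values: Synapse's real separator is internal and not documented here.
VOLATILE_SEPARATOR = "\n<<VOLATILE>>\n"
MIN_CACHEABLE_CHARS = 4000  # roughly 1000 tokens, per the text above


def split_system_prompt(system_prompt):
    """Return (stable_prefix, volatile_suffix), split at the first separator."""
    stable, _sep, volatile = system_prompt.partition(VOLATILE_SEPARATOR)
    return stable, volatile


def build_system_blocks(system_prompt):
    """Build Anthropic-style system blocks, marking only the stable prefix."""
    stable, volatile = split_system_prompt(system_prompt)
    stable_block = {"type": "text", "text": stable}
    # Skip the marker when the prefix is too short to be cache-eligible.
    if len(stable) >= MIN_CACHEABLE_CHARS:
        stable_block["cache_control"] = {"type": "ephemeral"}
    blocks = [stable_block]
    if volatile:
        # Turn-changing values sit after the cache point.
        blocks.append({"type": "text", "text": volatile})
    return blocks
```

Because the volatile values live in their own block after the marked one, changing them between turns leaves the cached prefix byte-identical.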
Cost shape:
| Provider | Read cost | Write cost |
|---|---|---|
| Anthropic | ~0.1× base input | ~1.25× base input (first call only) |
| OpenAI | ~0.5× base input | No extra charge — automatic prefix caching |
| DeepSeek | ~0.1× base input | No extra charge |
| Gemini | ~0.25× base input | Requires a 5-min minimum TTL |
| Bedrock (Claude) | ~0.1× base input | ~1.25× base input |
In practice this lands at 50–80% cost reduction on long, repeated conversations, after a one-time write surcharge on the first turn.
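To see how the multipliers in the table combine over a conversation, here is a small back-of-the-envelope calculation. It is a simplified model, not Synapse's billing logic: it assumes every call sends the same amount of input, that a fixed fraction of it is the cached prefix, that the cache never expires mid-conversation, and it ignores output tokens. The function name and parameters are made up for this example.

```python
def relative_input_cost(turns, cached_fraction, read_mult, write_mult):
    """Input cost with caching divided by input cost without it.

    The cached prefix is written once (first call) and read on every
    later call; the remaining input is always billed at the base rate.
    """
    uncached = 1.0 - cached_fraction
    first_call = cached_fraction * write_mult + uncached
    later_call = cached_fraction * read_mult + uncached
    with_cache = first_call + (turns - 1) * later_call
    return with_cache / float(turns)


# Anthropic-style multipliers from the table: 0.1x read, 1.25x write.
ratio = relative_input_cost(turns=10, cached_fraction=0.9,
                            read_mult=0.1, write_mult=1.25)
print(f"input cost saved: {1 - ratio:.0%}")  # input cost saved: 71%
```

With only one turn the same call returns a ratio above 1.0, which is the one-time write surcharge mentioned above.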
Viewing cache savings
Go to Settings → Usage. The cache panel surfaces:
- Total Estimated Savings: estimated dollar amount saved, i.e. what the same calls would have cost without the cache minus what they actually cost.
- Response Cache Hit Rate: fraction of LLM calls served from the response cache (see below).
- Total Cache Read Tokens / Total Cache Write Tokens: raw read/write totals.
- By Model: per-model breakdown so you can see which models are actually using the cache.
- By Run: per-orchestration-run breakdown.
Disabling the prompt cache
There's no UI toggle — the cache is on by default and almost always worth it. If you need to disable it (e.g. for cost auditing or A/B comparison), edit settings.json directly:
Advanced: direct settings.json edit
{ "prompt_cache_enabled": false }
Response cache
The response cache short-circuits an LLM call entirely when an identical (or near-identical) request has been made before. There are two layers:
Exact match: a SHA-256 hash of (model, system_prompt, messages, tools), looked up in O(1). If the hash matches, the cached completion is returned without contacting the provider.
Semantic match: embeds the last user message and compares it against prior cached entries for the same step. A high similarity threshold (0.95) keeps hits limited to near-identical prompts. Requires ChromaDB to be available (it is, by default).
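The two layers can be pictured with a minimal in-memory sketch. Everything here is illustrative: the ResponseCache class, exact_key, and the embed callable are hypothetical names, and a plain Python list with cosine similarity stands in for the ChromaDB collection Synapse uses. The sketch also assumes the last entry in messages is the user's message and that one instance serves a single step.

```python
import hashlib
import json
import math


def exact_key(model, system_prompt, messages, tools):
    """SHA-256 over a canonical JSON encoding of the request."""
    payload = json.dumps(
        {"model": model, "system_prompt": system_prompt,
         "messages": messages, "tools": tools},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ResponseCache:
    """Stand-in for one step's cache; `embed` maps text to a vector."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.exact = {}      # hash -> completion
        self.semantic = []   # list of (embedding, completion)

    def get(self, model, system_prompt, messages, tools):
        # Layer 1: exact hash match.
        key = exact_key(model, system_prompt, messages, tools)
        if key in self.exact:
            return self.exact[key]
        # Layer 2: nearest prior user message above the threshold.
        query = self.embed(messages[-1]["content"])
        best = max(self.semantic,
                   key=lambda entry: cosine_similarity(query, entry[0]),
                   default=None)
        if best is not None and cosine_similarity(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, model, system_prompt, messages, tools, completion):
        self.exact[exact_key(model, system_prompt, messages, tools)] = completion
        self.semantic.append((self.embed(messages[-1]["content"]), completion))
```

Canonical JSON (sorted keys, fixed separators) matters for the exact layer: two requests that differ only in dictionary key order should hash to the same value.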
Important: opt-in per step
The response cache is off by default and only available on certain orchestration step types:
| Step type | Eligible? | Why |
|---|---|---|
| LLM | ✅ | Pure prompt-in / response-out, safe to cache |
| Evaluator | ✅ | Routing decision is deterministic for the same state |
| Extract JSON | ✅ | Pure parsing of a previous step's output |
| Agent | ❌ | Skipping the agent would also skip its tool-call side effects, which would silently desync your shared state |
To enable response caching on an eligible step, open the orchestration editor, click the step, and toggle the Cache responses option in the Step Config panel.
Viewing response cache activity
Settings → Usage shows total_response_cache_hits and response_cache_hit_rate. Per-model and per-run breakdowns include cache hits so you can tell which steps are benefiting.
Tool cache
Synapse memoizes the result of tool calls that are pure functions of their arguments — running the same tool with the same args twice should not pay twice.
Cacheable tools (always on):
| Tool | Scope | Reason |
|---|---|---|
| code_search | global | Same query against the same index returns the same chunks |
| pdf_parser | global | Same PDF parses identically |
| xlsx_parser | global | Same file → same rows |
| time | global | parse_time("tomorrow 5pm") is deterministic |
| code_indexer | global | Indexing a directory twice is idempotent |
| collect_data | global | Form schema is static |
| personal_details | session | Per-user lookup, keyed by session id |
bash, sql_query, web_scraper, browser_*, and sandbox_execute are deliberately not cached: their output reflects live external state, and a cached result would mask reality.
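As a rough picture of how such memoization could work, the sketch below keys each result on the tool name, its arguments, and (for session-scoped tools) the session id, with a TTL on every entry. It is an assumption-laden illustration: ToolCache, run_tool, and the in-memory dictionary are hypothetical, whereas Synapse persists entries to disk. The tool-to-scope mapping and the 1-hour default are taken from this page.

```python
import hashlib
import json
import time

# Tool name -> scope, copied from the table above.
CACHEABLE_TOOLS = {
    "code_search": "global", "pdf_parser": "global", "xlsx_parser": "global",
    "time": "global", "code_indexer": "global", "collect_data": "global",
    "personal_details": "session",
}
DEFAULT_TTL_SECONDS = 3600  # 1-hour default TTL


class ToolCache:
    def __init__(self, ttl=DEFAULT_TTL_SECONDS, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.entries = {}  # key -> (expires_at, result)

    def _key(self, tool, args, session_id):
        # Session-scoped tools get a separate entry per session.
        owner = session_id if CACHEABLE_TOOLS[tool] == "session" else "global"
        blob = json.dumps([tool, owner, args], sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def call(self, tool, args, run_tool, session_id=None):
        # Tools outside the allow-list (bash, sql_query, ...) always run live.
        if tool not in CACHEABLE_TOOLS:
            return run_tool(tool, args)
        key = self._key(tool, args, session_id)
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        result = run_tool(tool, args)
        self.entries[key] = (now + self.ttl, result)
        return result
```

Using an allow-list rather than a deny-list means a newly added tool is uncached until someone decides it is safe, which matches the cautious stance described above.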
Invalidating the tool cache
The tool cache is invalidated automatically on TTL expiry. If you re-index a code repo, Synapse clears the relevant code_search cache entries so subsequent searches see the new chunks. There is no manual UI button — TTL plus the auto-invalidation hooks are sufficient in practice.
On-disk layout
The on-disk caches live under DATA_DIR/cache/, organised by namespace:
DATA_DIR/cache/
├── responses_exact/ # Exact-match LLM responses
├── responses_semantic_*/ # Semantic cache per step (ChromaDB collection)
└── tool_results/ # Memoized tool results
You can clear a cache entirely by deleting its namespace directory and restarting the backend. Settings → Usage also shows per-namespace disk usage under disk_stats.
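If you prefer to script the manual clean-up, something along these lines would do it. This is a generic filesystem sketch, not a Synapse command: the function names are invented, and you pass in the path that DATA_DIR/cache/ resolves to on your machine. As noted above, restart the backend after deleting a namespace.

```python
import shutil
from pathlib import Path


def namespace_sizes(cache_root):
    """Bytes used by each namespace directory directly under cache_root."""
    sizes = {}
    for ns in sorted(p for p in Path(cache_root).iterdir() if p.is_dir()):
        sizes[ns.name] = sum(f.stat().st_size for f in ns.rglob("*") if f.is_file())
    return sizes


def clear_namespace(cache_root, namespace):
    """Delete one namespace directory; return True if something was removed."""
    root = Path(cache_root).resolve()
    target = (root / namespace).resolve()
    # Refuse anything that is not a direct child of the cache root.
    if target.parent != root or not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

The parent check is a guard against a mistyped namespace such as "../something" deleting a directory outside the cache root.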
When caching does NOT help
- Streaming tool-heavy chats where the user message changes every turn — exact match won't hit, and the prompt cache only helps on the stable system prefix.
- Very short system prompts (under ~1000 tokens) — below the provider minimum, so prompt cache is skipped.
- Agent steps in orchestration — these intentionally bypass the response cache.
If prompt cache savings are unexpectedly low, check Settings → Usage → By Model. If cache read tokens are zero for a model you're using heavily, your system prompt is likely too short, or the model's provider doesn't support cache markers in the way Synapse expects (file a bug if so).