Skip to main content

Caching

Synapse has three independent caches that together cut LLM cost and tool latency without changing what your agents do. Most of it runs automatically — you only need to know where to look in the UI to see savings, and which knob to flip when you want stronger caching on a specific orchestration step.

CacheScopeDefaultWhere to view
Prompt cacheProvider-level (Anthropic, OpenAI, DeepSeek, Gemini, Bedrock)OnSettings → Usage
Response cachePer orchestration step (exact + semantic)Off (opt-in per step)Settings → Usage
Tool cacheDeterministic tools (code_search, pdf_parser, …)On for eligible toolsSettings → Usage

All three persist to disk under DATA_DIR/cache/. Entries have a 1-hour TTL by default.


Prompt cache

The prompt cache piggybacks on each LLM provider's native caching mechanism. Synapse injects cache_control markers (Anthropic, Bedrock) or relies on automatic prefix caching (OpenAI, DeepSeek, Gemini) so that the long, stable parts of your system prompt — agent instructions, tool definitions, RAG context — are reused across turns instead of being re-tokenised every call.

What gets cached: the stable prefix of the system prompt plus the tools block. Synapse splits the system prompt at an internal volatile separator so turn-changing values (current time, turn budget, recent RAG matches) sit after the cache point — they update freely without invalidating the cached prefix.

Minimum prompt length: about 4000 characters (~1000 tokens). Below that, providers ignore the cache marker, so Synapse skips it to avoid paying the write surcharge for ineligible writes.

Cost shape:

ProviderRead costWrite cost
Anthropic~0.1× base input~1.25× base input (first call only)
OpenAI~0.5× base inputNo extra charge — automatic prefix caching
DeepSeek~0.1× base inputNo extra charge
Gemini~0.25× base inputRequires a 5-min minimum TTL
Bedrock (Claude)~0.1× base input~1.25× base input

In practice this lands at 50–80% cost reduction on long, repeated conversations, after a one-time write surcharge on the first turn.

Viewing cache savings

Go to Settings → Usage. The cache panel surfaces:

  • Total Estimated Savings — dollar amount Synapse would have paid without the cache.
  • Response Cache Hit Rate — fraction of LLM calls served from the response cache (see below).
  • Total Cache Read Tokens / Total Cache Write Tokens — raw read/write totals.
  • By Model — per-model breakdown so you can see which models are actually using the cache.
  • By Run — per-orchestration-run breakdown.

Disabling the prompt cache

There's no UI toggle — the cache is on by default and almost always worth it. If you need to disable it (e.g. for cost auditing or A/B comparison), edit settings.json directly:

Advanced: direct settings.json edit
{ "prompt_cache_enabled": false }

Response cache

The response cache short-circuits an LLM call entirely when an identical (or near-identical) request has been made before. There are two layers:

Exact match — SHA256 of (model, system_prompt, messages, tools). O(1) lookup. If the hash matches, the cached completion is returned without contacting the provider.

Semantic match — embeds the last user message and compares it against prior cached entries for the same step. A high similarity threshold (0.95) keeps hits limited to near-identical prompts. Requires ChromaDB to be available (it is, by default).

Important: opt-in per step

The response cache is off by default and only available on certain orchestration step types:

Step typeEligible?Why
LLMPure prompt-in / response-out, safe to cache
EvaluatorRouting decision is deterministic for the same state
Extract JSONPure parsing of a previous step's output
AgentSkipping the agent would also skip its tool-call side effects, which would silently desync your shared state

To enable response caching on an eligible step, open the orchestration editor, click the step, and toggle the Cache responses option in the Step Config panel.

Viewing response cache activity

Settings → Usage shows total_response_cache_hits and response_cache_hit_rate. Per-model and per-run breakdowns include cache hits so you can tell which steps are benefiting.


Tool cache

Synapse memoizes the result of tool calls that are pure functions of their arguments — running the same tool with the same args twice should not pay twice.

Cacheable tools (always on):

ToolScopeReason
code_searchglobalSame query against the same index returns the same chunks
pdf_parserglobalSame PDF parses identically
xlsx_parserglobalSame file → same rows
timeglobalparse_time("tomorrow 5pm") is deterministic
code_indexerglobalIndexing a directory twice is idempotent
collect_dataglobalForm schema is static
personal_detailssessionPer-user lookup — keyed by session id

bash, sql_query, web_scraper, browser_*, and sandbox_execute are deliberately not cached: their output reflects live external state, and a cached result would mask reality.

Invalidating the tool cache

The tool cache is invalidated automatically on TTL expiry. If you re-index a code repo, Synapse clears the relevant code_search cache entries so subsequent searches see the new chunks. There is no manual UI button — TTL plus the auto-invalidation hooks are sufficient in practice.


On-disk layout

All three caches live under DATA_DIR/cache/, organised by namespace:

DATA_DIR/cache/
├── responses_exact/ # Exact-match LLM responses
├── responses_semantic_*/ # Semantic cache per step (ChromaDB collection)
└── tool_results/ # Memoized tool results

You can clear a cache entirely by deleting its namespace directory and restarting the backend. Settings → Usage also shows per-namespace disk usage under disk_stats.


When caching does NOT help

  • Streaming tool-heavy chats where the user message changes every turn — exact match won't hit, and the prompt cache only helps on the stable system prefix.
  • Very short system prompts (under ~1000 tokens) — below the provider minimum, so prompt cache is skipped.
  • Agent steps in orchestration — these intentionally bypass the response cache.

If your hit rate is unexpectedly low, check Settings → Usage → By Model: if cache read tokens are zero for a model you're using heavily, your system prompt is likely too short, or the model's provider doesn't support cache markers in the way Synapse expects (file a bug if so).