Engineering

43% Token Savings: Inside the LLM Gateway

System prompt deduplication, semantic caching, and intelligent model routing cut token consumption dramatically. A detailed breakdown of every optimization.

10 min read
April 2026

When we deployed QuilrAI's LLM Gateway across a 200-seat enterprise customer last quarter, we measured a 43% reduction in billed tokens over their baseline. The savings were not from a single technique but from three compounding optimizations that work independently at different points in the request lifecycle. This post explains each optimization in detail, with the numbers behind the headline figure.

What Is System Prompt Deduplication?

Enterprise deployments routinely send the same multi-kilobyte system prompt with every request. That repetition adds up fast: a 3,200-token system prompt sent 8,000 times per day was costing over 25 million input tokens daily before any user message was counted. The gateway intercepts each request, hashes the system prompt, and replaces it with a reference token on repeat occurrences within a configurable time window, reconstructing the full prompt at the API boundary. This alone accounted for 18 percentage points of the 43% total.
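To make the flow concrete, here is a minimal Python sketch of the deduplication path. The class and method names, the reference-token format, and the TTL default are illustrative assumptions, not the gateway's actual internals; eviction details are elided.

```python
import hashlib
import time


class PromptDeduplicator:
    """Illustrative sketch; names and the reference format are hypothetical."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[str, float]] = {}  # hash -> (full prompt, last seen)

    def dedupe(self, system_prompt: str) -> dict:
        """Replace a repeated system prompt with a short reference token."""
        key = hashlib.sha256(system_prompt.encode()).hexdigest()[:16]
        now = time.time()
        seen = self.store.get(key)
        self.store[key] = (system_prompt, now)
        if seen and now - seen[1] < self.ttl:
            # Repeat within the window: forward only the reference.
            return {"role": "system", "content": f"<<ref:{key}>>"}
        # First occurrence (or window expired): forward the full prompt.
        return {"role": "system", "content": system_prompt}

    def rehydrate(self, message: dict) -> dict:
        """At the API boundary, expand a reference back into the full prompt."""
        content = message["content"]
        if content.startswith("<<ref:") and content.endswith(">>"):
            key = content[len("<<ref:"):-len(">>")]
            full_prompt, _ = self.store[key]
            return {"role": "system", "content": full_prompt}
        return message
```

The point of the sketch is that a repeated prompt travels through the pipeline as a short reference and is only expanded at the boundary; how the window and storage are implemented in the gateway itself is not shown here.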

What Is Semantic Caching?

Exact-match caching misses the majority of real-world cache opportunities because users phrase the same question differently. Semantic caching embeds each incoming query, performs a nearest-neighbor search against a cache index of recent (query, response) pairs, and returns the cached response when cosine similarity exceeds a configurable threshold, typically 0.97 for high-precision domains like internal knowledge bases. Cache hit rates of 12–18% are common in enterprise deployments, and semantic caching contributed 15 percentage points of the savings in our measurement.
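The lookup path can be sketched as below. This assumes an embed callable that maps a query string to a vector and uses brute-force cosine search for clarity; a production gateway would use an approximate nearest-neighbor index instead of scanning every entry.

```python
import numpy as np


class SemanticCache:
    """Illustrative sketch; a real index would be approximate, not brute force."""

    def __init__(self, embed, threshold: float = 0.97, max_entries: int = 10_000):
        self.embed = embed              # assumed callable: str -> np.ndarray
        self.threshold = threshold
        self.max_entries = max_entries
        self.vectors: list[np.ndarray] = []
        self.responses: list[str] = []

    def lookup(self, query: str):
        """Return a cached response if the nearest neighbor clears the threshold."""
        if not self.vectors:
            return None
        q = self._unit(self.embed(query))
        sims = np.stack(self.vectors) @ q   # cosine similarity on unit vectors
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def insert(self, query: str, response: str) -> None:
        """Add a (query, response) pair, evicting the oldest entry when full."""
        if len(self.vectors) >= self.max_entries:
            self.vectors.pop(0)
            self.responses.pop(0)
        self.vectors.append(self._unit(self.embed(query)))
        self.responses.append(response)

    @staticmethod
    def _unit(v: np.ndarray) -> np.ndarray:
        return v / np.linalg.norm(v)
```

The high 0.97 threshold is the design lever: it trades hit rate for precision, which is why knowledge-base style workloads tolerate caching that open-ended generation does not.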

How Does Model Routing by Task Type Work?

Not every request requires a frontier model. Classification, extraction, summarization, and FAQ-style queries consistently perform at parity on smaller, faster, cheaper models. The gateway classifies each incoming request by task type using a lightweight classifier (under 1 ms of latency overhead) and routes simple tasks to a cost-optimized tier while sending complex reasoning, code generation, and multi-step agentic tasks to the full model. This routing layer contributed the remaining 10 percentage points of the savings and also delivered a 3× improvement in p50 latency for routed requests.
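A sketch of the routing decision is below. The task labels, tier names, and model names are placeholders, and classify_task stands in for the gateway's lightweight classifier; the actual taxonomy and tier mapping are configuration, not fixed code.

```python
# Placeholder task labels and model tiers for illustration only.
SIMPLE_TASKS = {"classification", "extraction", "summarization", "faq"}

MODEL_TIERS = {
    "cost_optimized": "small-fast-model",
    "frontier": "large-reasoning-model",
}


def route(request: dict, classify_task) -> str:
    """Pick a model for the request based on a task-type label.

    `classify_task` is any callable mapping text to a task label,
    standing in for the gateway's sub-millisecond classifier.
    """
    label = classify_task(request["messages"][-1]["content"])
    tier = "cost_optimized" if label in SIMPLE_TASKS else "frontier"
    return MODEL_TIERS[tier]
```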


How QuilrAI addresses this: The LLM Gateway is a drop-in OpenAI-compatible proxy that applies all three optimizations transparently. No SDK changes, no prompt rewrites, no model-selection logic in application code; just a base URL swap and a configuration file.
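For example, an application already on the OpenAI Python SDK would only repoint the client; the gateway URL and key below are placeholders for your own deployment.

```python
from openai import OpenAI

# Placeholder endpoint and key; substitute your gateway deployment's values.
client = OpenAI(
    base_url="https://llm-gateway.example.com/v1",
    api_key="YOUR_GATEWAY_KEY",
)

response = client.chat.completions.create(
    model="gpt-4o",  # the gateway may reroute this per its routing config
    messages=[{"role": "user", "content": "Summarize this support ticket ..."}],
)
print(response.choices[0].message.content)
```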
