Engineering

43% Token Savings: Inside the LLM Gateway

System prompt deduplication, semantic caching, and intelligent model routing cut token consumption dramatically. A detailed breakdown of every optimization.

10 min read
April 2026

When we deployed QuilrAI's LLM Gateway across a 200-seat enterprise customer last quarter, we measured a 43% reduction in billed tokens over their baseline. The savings were not from a single technique but from three compounding optimizations that work independently at different points in the request lifecycle. This post explains each optimization in detail, with the numbers behind the headline figure.

What Is System Prompt Deduplication?

Enterprise deployments routinely send the same multi-kilobyte system prompt with every request. That repetition adds up fast: a 3,200-token system prompt sent 8,000 times per day was costing over 25 million input tokens daily before any user message was counted. The gateway intercepts each request, hashes the system prompt, and replaces it with a reference token on repeat occurrences within a configurable time window, reconstructing the full prompt at the API boundary. This alone accounted for 18 percentage points of the 43% total.
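To make the flow concrete, here is a minimal Python sketch of the deduplication path. The class and method names, the reference-token format, and the TTL default are illustrative assumptions, not the gateway's actual internals; eviction details are elided.

```python
import hashlib
import time


class PromptDeduplicator:
    """Illustrative sketch; names and the reference format are hypothetical."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[str, float]] = {}  # hash -> (full prompt, last seen)

    def dedupe(self, system_prompt: str) -> dict:
        """Replace a repeated system prompt with a short reference token."""
        key = hashlib.sha256(system_prompt.encode()).hexdigest()[:16]
        now = time.time()
        seen = self.store.get(key)
        self.store[key] = (system_prompt, now)
        if seen and now - seen[1] < self.ttl:
            # Repeat within the window: forward only the reference.
            return {"role": "system", "content": f"<<ref:{key}>>"}
        # First occurrence (or window expired): forward the full prompt.
        return {"role": "system", "content": system_prompt}

    def rehydrate(self, message: dict) -> dict:
        """At the API boundary, expand a reference back into the full prompt."""
        content = message["content"]
        if content.startswith("<<ref:") and content.endswith(">>"):
            key = content[len("<<ref:"):-len(">>")]
            full_prompt, _ = self.store[key]
            return {"role": "system", "content": full_prompt}
        return message
```

The point of the sketch is that a repeated prompt travels through the pipeline as a short reference and is only expanded at the boundary; how the window and storage are implemented in the gateway itself is not shown here.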

What Is Semantic Caching?

Exact-match caching misses the majority of real-world cache opportunities because users phrase the same question differently. Semantic caching embeds each incoming query, performs a nearest-neighbor search against a cache index of recent (query, response) pairs, and returns the cached response when cosine similarity exceeds a configurable threshold, typically 0.97 for high-precision domains like internal knowledge bases. Cache hit rates of 12–18% are common in enterprise deployments, and semantic caching contributed 15 percentage points of the savings in our measurement.
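The lookup path can be sketched as below. This assumes an embed callable that maps a query string to a vector and uses brute-force cosine search for clarity; a production gateway would use an approximate nearest-neighbor index instead of scanning every entry.

```python
import numpy as np


class SemanticCache:
    """Illustrative sketch; a real index would be approximate, not brute force."""

    def __init__(self, embed, threshold: float = 0.97, max_entries: int = 10_000):
        self.embed = embed              # assumed callable: str -> np.ndarray
        self.threshold = threshold
        self.max_entries = max_entries
        self.vectors: list[np.ndarray] = []
        self.responses: list[str] = []

    def lookup(self, query: str):
        """Return a cached response if the nearest neighbor clears the threshold."""
        if not self.vectors:
            return None
        q = self._unit(self.embed(query))
        sims = np.stack(self.vectors) @ q   # cosine similarity on unit vectors
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def insert(self, query: str, response: str) -> None:
        """Add a (query, response) pair, evicting the oldest entry when full."""
        if len(self.vectors) >= self.max_entries:
            self.vectors.pop(0)
            self.responses.pop(0)
        self.vectors.append(self._unit(self.embed(query)))
        self.responses.append(response)

    @staticmethod
    def _unit(v: np.ndarray) -> np.ndarray:
        return v / np.linalg.norm(v)
```

The high 0.97 threshold is the design lever: it trades hit rate for precision, which is why knowledge-base style workloads tolerate caching that open-ended generation does not.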

How Does Model Routing by Task Type Work?

Not every request requires a frontier model. Classification, extraction, summarization, and FAQ-style queries consistently perform at parity on smaller, faster, cheaper models. The gateway classifies each incoming request by task type using a lightweight classifier (under 1 ms of latency overhead) and routes simple tasks to a cost-optimized tier while sending complex reasoning, code generation, and multi-step agentic tasks to the full model. This routing layer contributed the remaining 10 percentage points of the savings and also delivered a 3× improvement in p50 latency for routed requests.
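A sketch of the routing decision is below. The task labels, tier names, and model names are placeholders, and classify_task stands in for the gateway's lightweight classifier; the actual taxonomy and tier mapping are configuration, not fixed code.

```python
# Placeholder task labels and model tiers for illustration only.
SIMPLE_TASKS = {"classification", "extraction", "summarization", "faq"}

MODEL_TIERS = {
    "cost_optimized": "small-fast-model",
    "frontier": "large-reasoning-model",
}


def route(request: dict, classify_task) -> str:
    """Pick a model for the request based on a task-type label.

    `classify_task` is any callable mapping text to a task label,
    standing in for the gateway's sub-millisecond classifier.
    """
    label = classify_task(request["messages"][-1]["content"])
    tier = "cost_optimized" if label in SIMPLE_TASKS else "frontier"
    return MODEL_TIERS[tier]
```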


How QuilrAI addresses this: The LLM Gateway is a drop-in OpenAI-compatible proxy that applies all three optimizations transparently. No SDK changes, no prompt rewrites, no model-selection logic in application code; just a base URL swap and a configuration file.
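For example, an application already on the OpenAI Python SDK would only repoint the client; the gateway URL and key below are placeholders for your own deployment.

```python
from openai import OpenAI

# Placeholder endpoint and key; substitute your gateway deployment's values.
client = OpenAI(
    base_url="https://llm-gateway.example.com/v1",
    api_key="YOUR_GATEWAY_KEY",
)

response = client.chat.completions.create(
    model="gpt-4o",  # the gateway may reroute this per its routing config
    messages=[{"role": "user", "content": "Summarize this support ticket ..."}],
)
print(response.choices[0].message.content)
```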
