Semantic caching uses vector similarity search to intelligently cache AI responses, serving cached results for semantically similar requests even when the exact wording differs. This dramatically reduces API costs and latency for repeated or similar queries.

Key Benefits:
Cost Reduction: Avoid expensive LLM API calls for similar requests
Improved Performance: Sub-millisecond cache retrieval vs multi-second API calls
Intelligent Matching: Semantic similarity beyond exact text matching
Streaming Support: Full streaming response caching with proper chunk ordering
UI Note: The current Web UI flow configures provider-backed semantic caching. If you want direct-only mode (dimension: 1 with no provider), configure it through config.json.
Go SDK
Web UI
config.json
```go
import (
	"context"
	"log"
	"time"

	bifrost "github.com/maximhq/bifrost/core"
	"github.com/maximhq/bifrost/core/schemas"
	"github.com/maximhq/bifrost/plugins/semanticcache"
)

// Configure semantic cache plugin
cacheConfig := &semanticcache.Config{
	// Embedding model configuration (required)
	Provider:       schemas.OpenAI,
	Keys:           []schemas.Key{{Value: "sk-..."}},
	EmbeddingModel: "text-embedding-3-small",
	Dimension:      1536,

	// Cache behavior
	TTL:               5 * time.Minute, // Time to live for cached responses (default: 5 minutes)
	Threshold:         0.8,             // Similarity threshold for cache lookup (default: 0.8)
	CleanUpOnShutdown: true,            // Clean up cache on shutdown (default: false)

	// Conversation behavior
	ConversationHistoryThreshold: 5,                  // Skip caching if conversation has > N messages (default: 3)
	ExcludeSystemPrompt:          bifrost.Ptr(false), // Exclude system messages from cache key (default: false)

	// Advanced options
	CacheByModel:    bifrost.Ptr(true), // Include model in cache key (default: true)
	CacheByProvider: bifrost.Ptr(true), // Include provider in cache key (default: true)
}

// Create plugin
plugin, err := semanticcache.Init(context.Background(), cacheConfig, logger, store)
if err != nil {
	log.Fatal("Failed to create semantic cache plugin:", err)
}

// Add to Bifrost config
bifrostConfig := schemas.BifrostConfig{
	LLMPlugins: []schemas.LLMPlugin{plugin},
	// ... other config
}
```
Note: Make sure you have a vector store set up (via config.json) before configuring the semantic cache plugin.
Navigate to Settings
Open Bifrost UI at http://localhost:8080
Go to Settings.
Configure Semantic Cache Plugin
Toggle the plugin switch to enable it, and fill in the required fields.
Required Fields:
Provider: The provider to use for caching.
Embedding Model: The embedding model to use for caching.
Dimension: The embedding dimension for the configured embedding model.
Note: Changes require a restart of the Bifrost server to take effect, because the plugin is loaded only at startup.
Note: In config.json setups, provider keys are taken from the provider config on initialization, so you do not need to duplicate keys inside the plugin config. Any updates to the provider keys will not be reflected until the next restart.
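For config.json setups, the plugin configuration might look like the sketch below. The key names here are assumptions inferred from the Go config fields and UI labels above, not a verified schema; consult the config reference for your Bifrost version before using it.

```json
{
  "plugins": [
    {
      "name": "semanticcache",
      "enabled": true,
      "config": {
        "provider": "openai",
        "embedding_model": "text-embedding-3-small",
        "dimension": 1536,
        "ttl": "5m",
        "threshold": 0.8,
        "cleanup_on_shutdown": false,
        "conversation_history_threshold": 3,
        "exclude_system_prompt": false,
        "cache_by_model": true,
        "cache_by_provider": true
      }
    }
  ]
}
```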
Direct hash mode provides exact-match caching without requiring an embedding provider. Each request is hashed deterministically based on its normalized input, parameters, and stream flag. Identical requests produce cache hits; different wording is a cache miss.

Exact-match direct entries are stored and retrieved using a deterministic cache ID. This keeps repeated direct cache lookups fast and consistent across retries, streaming responses, and restarts.

When to use direct hash mode:
You only need exact-match deduplication (no fuzzy/semantic matching)
You cannot or do not want to call an external embedding API
You want the lowest possible latency with zero embedding overhead
Cost-sensitive environments where embedding API calls add up
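The deterministic-hashing idea described above can be sketched as follows. This is an illustrative stand-in, not the plugin's actual hashing scheme: the normalization steps and payload layout here are assumptions chosen to show why identical requests hit and reworded requests miss.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// directCacheID sketches a deterministic cache ID: hash the normalized
// input together with the request parameters and stream flag.
// (Illustrative only -- not the plugin's real implementation.)
func directCacheID(input, params string, stream bool) string {
	normalized := strings.ToLower(strings.TrimSpace(input))
	payload := fmt.Sprintf("%s|%s|%t", normalized, params, stream)
	sum := sha256.Sum256([]byte(payload))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := directCacheID("What is Bifrost?", `{"temperature":0}`, false)
	b := directCacheID("  what is bifrost?  ", `{"temperature":0}`, false) // same after normalization
	c := directCacheID("Explain Bifrost", `{"temperature":0}`, false)      // different wording

	fmt.Println(a == b) // identical normalized requests share a cache ID
	fmt.Println(a == c) // different wording yields a different ID (cache miss)
}
```

Because the ID depends only on the request itself, repeated lookups stay stable across retries and restarts, which is exactly the property direct hash mode relies on.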
To enable direct-only mode globally, set dimension: 1 and omit the provider and keys fields from the plugin config. The plugin will automatically fall back to direct search only.
Important: If you specify dimension: 1 and also provide a provider, Bifrost treats the config as provider-backed semantic mode, not direct-only mode. To use direct-only mode, omit the provider field entirely.
A vector store is still required as the storage backend, even in direct hash mode. See Recommended Vector Store below for the best choice.
Go SDK
Helm
config.json
```go
import (
	"time"

	bifrost "github.com/maximhq/bifrost/core"
	"github.com/maximhq/bifrost/plugins/semanticcache"
)

cacheConfig := &semanticcache.Config{
	// No Provider, Keys, or EmbeddingModel -- direct hash mode only.
	// Dimension is a placeholder; entries are stored as metadata-only (no embedding vectors).
	// Change the dimension before switching to dual-layer mode to avoid mixed-dimension issues.
	Dimension:         1,
	TTL:               5 * time.Minute,
	CleanUpOnShutdown: true,
	CacheByModel:      bifrost.Ptr(true),
	CacheByProvider:   bifrost.Ptr(true),
}

plugin, err := semanticcache.Init(ctx, cacheConfig, logger, store)
```
When initialized this way, all requests automatically use direct hash matching regardless of the x-bf-cache-type header. No embeddings are generated, and no embedding provider credentials are needed.
Redis/Valkey-compatible stores are recommended for direct hash mode. They do not require vectors for metadata-only entries, and all cache fields are indexed as TAG fields for fast exact-match lookups.
Qdrant and Pinecone are not compatible with direct hash mode when no embedding provider is configured. These stores require a vector for every entry, and the plugin's zero-vector placeholder codepath requires an initialized embedding client, so storage will fail if no provider is set. Weaviate also requires a vector per entry and is therefore not recommended for direct-only mode either.
When the plugin is initialized without an embedding provider (direct-only mode), all requests use direct hash matching automatically. The x-bf-cache-type header has no effect.

When the plugin is initialized with an embedding provider (dual-layer mode), you can force direct-only matching on specific requests using the x-bf-cache-type: direct header. See Cache Type Control for details.
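In dual-layer mode, forcing an exact-match lookup for a single request looks like the sketch below (endpoint and request body elided, as in the other curl examples in this document):

```shell
# Dual-layer mode: skip semantic matching and use exact-hash lookup for this request
curl -H "x-bf-cache-key: session-123" \
     -H "x-bf-cache-type: direct" ...
```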
Cache Key is mandatory: Semantic caching only activates when a cache key is provided. Without a cache key, requests bypass caching entirely.
Go SDK
HTTP API
Must set cache key in request context:
```go
// This request WILL be cached
ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
response, err := client.ChatCompletionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), request)

// This request will NOT be cached (no context value)
response, err = client.ChatCompletionRequest(schemas.NewBifrostContext(context.Background(), schemas.NoDeadline), request)
```
Must set cache key in request header x-bf-cache-key:
```shell
# This request WILL be cached
curl -H "x-bf-cache-key: session-123" ...

# This request will NOT be cached (no header)
curl ...
```
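To illustrate the semantic-matching behavior, a differently worded but similar query under the same cache key can be served from the cache (endpoint and full payload elided as above; the request bodies are illustrative):

```shell
# First request populates the cache
curl -H "x-bf-cache-key: session-123" ... \
  -d '{"messages":[{"role":"user","content":"What is the capital of France?"}]}'

# A semantically similar rephrasing can hit the cache despite different wording
curl -H "x-bf-cache-key: session-123" ... \
  -d '{"messages":[{"role":"user","content":"Tell me the capital city of France"}]}'
```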
Disable response caching while still allowing cache reads:
Go SDK
HTTP API
```go
// Read from cache but don't store the response
ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
ctx = context.WithValue(ctx, semanticcache.CacheNoStoreKey, true)
```
```shell
# Read from cache but don't store response
curl -H "x-bf-cache-key: session-123" \
     -H "x-bf-cache-no-store: true" ...
```
When a response is served from the semantic cache, cache debug information is automatically added to the response.

Location: response.ExtraFields.CacheDebug (as a JSON object)

Fields:
CacheHit (boolean): true if the response was served from the cache, false on a cache miss.
HitType (string): "semantic" for similarity match, "direct" for hash match
CacheID (string): Unique cache entry ID for management operations (present only for cache hits)
Semantic Cache Only:
ProviderUsed (string): Provider used to calculate the semantic-match embedding. (present for both cache hits and misses)
ModelUsed (string): Model used to calculate the semantic-match embedding. (present for both cache hits and misses)
InputTokens (number): Number of tokens extracted from the request for the semantic-match embedding calculation. (present for both cache hits and misses)
Threshold (number): Similarity threshold used for the match. (present only for cache hits)
Similarity (number): Similarity score for the match. (present only for cache hits)
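Putting the fields together, a cache-hit response's debug object might look like the following sketch. The field casing and example values here are assumptions for illustration; inspect response.ExtraFields.CacheDebug in your own responses for the exact shape.

```json
{
  "cache_hit": true,
  "hit_type": "semantic",
  "cache_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider_used": "openai",
  "model_used": "text-embedding-3-small",
  "input_tokens": 12,
  "threshold": 0.8,
  "similarity": 0.93
}
```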
Use the request ID from cached responses to clear specific entries:
Go SDK
HTTP API
```go
// Clear a specific entry by request ID
err := plugin.ClearCacheForRequestID("550e8400-e29b-41d4-a716-446655440000")

// Clear all entries for a cache key
err = plugin.ClearCacheForKey("support-session-456")
```
```shell
# Clear specific cached entry by request ID
curl -X DELETE http://localhost:8080/api/cache/clear/550e8400-e29b-41d4-a716-446655440000

# Clear all entries for a cache key
curl -X DELETE http://localhost:8080/api/cache/clear-by-key/support-session-456
```
The semantic cache automatically handles cleanup to prevent storage bloat.

Automatic Cleanup:
TTL Expiration: Entries are automatically removed when TTL expires
Shutdown Cleanup: When the Bifrost client shuts down, all cache entries and the namespace itself are cleared from the vector store (only if cleanup is enabled; see the note below)
Namespace Isolation: Each Bifrost instance uses isolated vector store namespaces to prevent conflicts
Manual Cleanup Options:
Clear specific entries by request ID (see examples above)
Clear all entries for a cache key
Restart Bifrost with cleanup_on_shutdown: true to clear all cache data
The semantic cache namespace and all its cache entries are deleted when the Bifrost client shuts down only if cleanup_on_shutdown is set to true. By default (cleanup_on_shutdown: false), cache data persists between restarts. DO NOT use the plugin's namespace for external purposes.
Dimension Changes: If you update the dimension config, the existing namespace will contain data with mixed dimensions, causing retrieval issues. To avoid this, either use a different vector_store_namespace or set cleanup_on_shutdown: true before restarting.
Vector Store Requirement: Semantic caching requires a configured vector store. Bifrost supports Weaviate, Redis/Valkey-compatible endpoints, Qdrant, and Pinecone. See the Vector Store documentation for setup details.