Overview
Ollama is a local-first, OpenAI-compatible inference engine for running large language models on personal computers or servers. Bifrost delegates to the OpenAI implementation while supporting Ollama’s unique configuration requirements. Key characteristics:- Local-first deployment - Run models locally or on private infrastructure
- OpenAI API compatibility - Identical request/response format
- Full feature support - Chat, text, embeddings, and streaming
- Tool calling - Complete function definition and execution
- Self-hosted - No external API dependency required
Supported Operations
| Operation | Non-Streaming | Streaming | Endpoint |
|---|---|---|---|
| Chat Completions | ✅ | ✅ | /v1/chat/completions |
| Responses API | ✅ | ✅ | /v1/chat/completions |
| Text Completions | ✅ | ✅ | /v1/completions |
| Embeddings | ✅ | - | /v1/embeddings |
| List Models | ✅ | - | /v1/models |
| Image Generation | ❌ | ❌ | - |
| Speech (TTS) | ❌ | ❌ | - |
| Transcriptions (STT) | ❌ | ❌ | - |
| Files | ❌ | ❌ | - |
| Batch | ❌ | ❌ | - |
Unsupported Operations (❌): Speech, Transcriptions, Files, and Batch are not supported by the upstream Ollama API. These return
UnsupportedOperationError.Ollama is self-hosted. Ensure you have an Ollama instance running and configured with the correct BaseURL (e.g., http://localhost:11434).1. Chat Completions
Request Parameters
Ollama supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see OpenAI Chat Completions.Filtered Parameters
Removed for Ollama compatibility:prompt_cache_key- Not supportedverbosity- Anthropic-specificstore- Not supportedservice_tier- Not supported
2. Responses API
Converted internally to Chat Completions:3. Text Completions
Ollama supports legacy text completion format:| Parameter | Mapping |
|---|---|
prompt | Direct pass-through |
max_tokens | max_tokens |
temperature, top_p | Direct pass-through |
stop | Stop sequences |
4. Embeddings
Ollama supports text embeddings:| Parameter | Notes |
|---|---|
input | Text or array of texts |
model | Embedding model name |
encoding_format | ”float” or “base64” |
dimensions | Custom output dimensions (optional) |
5. List Models
Lists models currently loaded in Ollama with capabilities and context information.Unsupported Features
| Feature | Reason |
|---|---|
| Speech/TTS | Not offered by Ollama API |
| Transcription/STT | Not offered by Ollama API |
| Batch Operations | Not offered by Ollama API |
| File Management | Not offered by Ollama API |
Ollama follows the OpenAI API specification for request format and error handling. Authentication is optional and depends on deployment (no authentication required for local access, optional Bearer token for protected instances).Critical: BaseURL must be explicitly configured pointing to your Ollama instance (e.g.,
http://localhost:11434 for local, https://ollama.example.com for remote).Setup & Configuration
Configure Ollama as a provider.- Web UI
- config.json
- API
- Go SDK

- Navigate to Models > Model Providers. Look for Ollama under Configured Providers. If it is missing, click on Add New Provider and select Ollama.
- Click Add New Server or edit an existing key.
- Set a name for your key.
- Leave API Key blank for local servers. If your endpoint requires auth, paste a bearer token directly or use an environment variable.
- Set Ollama URL to
http://localhost:11434or your remote Ollama endpoint. - Set Allowed Models to All Models (default) or the specific model allowlist you want this key to serve.
- Save the provider configuration.
- Install Ollama from https://ollama.ai
- Pull a model:
- Start Ollama server:
- Verify it is running:
Performance Considerations
Streaming for Large Models: For better user experience with large models, use streaming:- Llama 3.1 70B: 128K tokens
- Mistral 7B: 32K tokens
- Neural Chat 7B: 8K tokens
Popular Models
| Model | Size | Context | Speed |
|---|---|---|---|
| llama3.1:latest | Varies | 128K | Fast |
| mistral:latest | 7B | 32K | Very Fast |
| neural-chat:latest | 7B | 8K | Very Fast |
| orca-mini:latest | 3B | 3K | Very Fast |
| openchat:latest | 7B | 8K | Very Fast |
Caveats
BaseURL Configuration Required
BaseURL Configuration Required
Severity: High
Behavior: BaseURL must be explicitly configured through
ollama_key_config.url or network_config.base_url - no default
Impact: Requests fail without proper configuration
Code: Requests call baseURLOrError before contacting OllamaCache Control Stripped
Cache Control Stripped
Severity: Low
Behavior: Cache control directives are removed from messages
Impact: Prompt caching features don’t work
Code: Stripped during JSON marshaling
Parameter Filtering
Parameter Filtering
Severity: Low
Behavior: OpenAI-specific parameters filtered out
Impact: prompt_cache_key, verbosity, store removed
Code: filterOpenAISpecificParameters
User Field Size Limit
User Field Size Limit
Severity: Low
Behavior: User field > 64 characters silently dropped
Impact: Longer user identifiers are lost
Code: SanitizeUserField enforces 64-char max

