Usage Guide
Runtime setup, gateway execution, protocol bridging, budget tracking, replay sanitization, and operational patterns.
This guide explains how to use OmniLLM as:
- a runtime gateway for generation requests
- a protocol transcoding layer between supported generation APIs
- a typed multi-endpoint conversion layer for embeddings, images, audio, and rerank
- a replay sanitization helper for test fixtures
- a Rust project that ships a first-party OmniLLM Skill
If you want Skill installation details, see skill.md. If you want architecture and implementation details, see architecture.md and implementation.md.
What This Crate Does
OmniLLM has three major surfaces:
- Gateway: use this when you want to send generation requests at runtime with:
  - provider-neutral request/response types
  - multi-key load balancing
  - per-key RPM and TPM controls
  - circuit breaking
  - budget tracking
  - canonical streaming events
- Provider primitive runtime APIs: use these when you want to send provider-native requests directly without converting through `LlmRequest`, `LlmResponse`, `ApiRequest`, or `ApiResponse`. Primitive call, stream, and realtime entry points reuse the same `Gateway` key pool, RPM guard, timeout, and `BudgetTracker`.
- API and protocol conversion helpers: use these when you want to:
  - parse raw upstream payloads into canonical types
  - emit canonical types back into provider wire formats
  - transcode between supported protocols
  - inspect bridge and loss metadata
  - sanitize request/response fixtures for tests
Important: Gateway::call and Gateway::stream remain the OpenAI Responses-centered canonical generation path. Provider primitive APIs are explicit additive entry points for raw provider-native payloads, including non-generation APIs where the primitive support registry enables them.
Installation
Add the crate to your Cargo.toml and choose one TLS backend:
- default: `rustls`
- optional: `native-tls`
Examples:
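A hedged Cargo.toml sketch; the crate name `omnillm`, the version, and the exact feature flag spellings beyond the rustls/native-tls split named above are assumptions:

```toml
# Default TLS backend (rustls). Crate name and version are illustrative assumptions.
[dependencies]
omnillm = "0.1"

# Alternatively, opt into native-tls instead of the default backend.
# Exact feature names are assumptions:
# omnillm = { version = "0.1", default-features = false, features = ["native-tls"] }
```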
OmniLLM Skill
OmniLLM ships with a first-party agent skill in the repository's
skill/ directory. Use
the Skill Guide when you want to install it in Claude Code,
Codex, or OpenCode with the Vercel Labs skills workflow.
Core Concepts
The crate normalizes generation around LlmRequest and LlmResponse.
- `LlmRequest` is the canonical generation request.
- `LlmResponse` is the canonical generation response.
- `LlmStreamEvent` is the canonical stream event model.
- `CapabilitySet` holds cross-provider features like tools, structured output, reasoning, and builtin tools.
- `EndpointProtocol` identifies a runtime endpoint profile, including `*_compat` modes.
- `ProviderProtocol` identifies a low-level generation wire protocol used by codecs and transcoding.
- `ProviderEndpoint` identifies where and how to send a canonical request.
- `PrimitiveProviderEndpoint` and `PrimitiveRequest` identify where and how to send a provider-native primitive request.
For multi-endpoint work:
- `ApiRequest` and `ApiResponse` are canonical typed wrappers across endpoint families.
- `WireFormat` identifies a specific upstream wire format.
- `ConversionReport<T>` tells you whether a conversion was bridged and whether data was lost.
Quick Start
This is the smallest useful runtime setup:
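A minimal sketch of that setup. The crate path (`omnillm::...`), the builder and request constructor names, and the response accessor are assumptions built from the types named in this guide, not verified signatures:

```rust
use omnillm::{GatewayBuilder, KeyConfig, LlmRequest, ProviderEndpoint};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed builder shape: one endpoint, one key, a budget cap.
    let gateway = GatewayBuilder::new(ProviderEndpoint::openai_responses())
        .add_key(KeyConfig::new(std::env::var("OPENAI_API_KEY")?, "primary"))
        .budget_limit_usd(5.0)
        .build()?;

    // Assumed constructor; `input` is the canonical execution input.
    // The model id is illustrative.
    let request = LlmRequest::new("gpt-4o-mini")
        .with_text_input("Say hello in one sentence.");

    let response = gateway.call(request).await?;
    // Assumed convenience accessor for the concatenated output text.
    println!("{}", response.output_text());
    Ok(())
}
```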
Provider Primitive Runtime
Use primitive runtime APIs when you need provider-native payloads without canonical conversion. The raw body is the source of truth; OmniLLM only adds transport metadata such as auth, default headers, query parameters, timeout, and base URL/path resolution.
Primitive usage telemetry is stored on PrimitiveResponse::usage when providers
return known usage fields such as OpenAI usage, Anthropic usage, or Gemini
usageMetadata. Budget settlement uses that telemetry when available and falls
back to the reserved estimate when no token usage is reported.
Primitive provider support is scoped to model-workload gateway use cases, not full provider SDK parity:
- P1 HTTP gaps include OpenAI Files/Uploads/Models/Audio Translations/Image edits/variations, Anthropic Models/Files hardening, and Gemini Models/Operations/Files/Caches hardening.
- P2 covers explicit async job lifecycle requests for batch-style APIs.
- P3 covers OpenAI Audio Speech binary chunk streaming plus WebSocket realtime sessions for OpenAI Realtime and Gemini Live; WebRTC remains planned, not implemented.

Metadata and read-only operations settle as zero-cost unless provider usage appears; uploads settle as upload/storage; media calls and realtime sessions use provider telemetry or the reserved-estimate fallback.
Deferred surfaces include admin, billing, webhooks, fine-tuning, evals, tunings, managed-agent platforms, hosted RAG/vector-store administration, and SDK helper layers.
Building a Gateway
GatewayBuilder controls the runtime client:
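A hedged sketch exercising the options described under Builder Options below; the method names mirror those options, but the exact signatures and the `KeyConfig` limit setters are assumptions:

```rust
use std::time::Duration;
use omnillm::{Gateway, GatewayBuilder, GatewayError, KeyConfig, PoolConfig, ProviderEndpoint};

// Sketch only: builder method names follow the options listed below;
// signatures and KeyConfig setter names are assumptions.
fn build_gateway(key_a: String, key_b: String) -> Result<Gateway, GatewayError> {
    GatewayBuilder::new(ProviderEndpoint::claude_messages())
        .add_keys(vec![
            KeyConfig::new(key_a, "team-a").with_rpm_limit(60).with_tpm_limit(200_000),
            KeyConfig::new(key_b, "team-b"),
        ])
        .budget_limit_usd(25.0)
        .pool_config(PoolConfig::default())
        .request_timeout(Duration::from_secs(60))
        .build()
}
```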
Builder Options
- `add_key` / `add_keys`: Registers one or more API keys for the same upstream endpoint.
- `budget_limit_usd`: Sets a process-local budget cap. Requests reserve estimated cost before dispatch and settle to actual cost after completion.
- `pool_config`: Configures acquire retries and circuit breaker thresholds.
- `request_timeout`: Sets the HTTP client timeout used by the dispatcher.
Key Configuration
Each KeyConfig contains:
- raw key string
- human-readable label
- `tpm_limit`
- `rpm_limit`
Use labels for observability. Labels are surfaced by gateway.pool_status().
Choosing a Provider Endpoint
Built-in generation endpoints:
- `ProviderEndpoint::openai_responses()`
- `ProviderEndpoint::openai_chat_completions()`
- `ProviderEndpoint::claude_messages()`
- `ProviderEndpoint::gemini_generate_content()`
You can also construct a custom endpoint:
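A hedged sketch of a custom endpoint pointing at an OpenAI-compatible wrapper. `ProviderEndpoint::new(...)`, `EndpointProtocol`, and `AuthScheme` are named in this guide; the exact constructor arguments, the compat variant spelling, and the `with_auth` setter are assumptions:

```rust
use omnillm::{AuthScheme, EndpointProtocol, ProviderEndpoint};

// Compat mode: base_url is already the full request URL of the wrapper,
// so OmniLLM does not derive a standard path. Shapes here are assumptions.
fn custom_endpoint() -> ProviderEndpoint {
    ProviderEndpoint::new(
        EndpointProtocol::OpenAiChatCompletionsCompat, // assumed variant name
        "https://llm-proxy.internal.example.com/v1/chat/completions",
    )
    .with_auth(AuthScheme::Header { name: "x-api-key".into() })
}
```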
Use a non-compat protocol when base_url is a host or prefix and OmniLLM should derive the standard path.
Use a *_compat protocol when base_url is already the full request URL exposed by a wrapper or OpenAI-compatible gateway.
That includes wrappers that expect strict messages[].content[] arrays for OpenAI Chat Completions payloads.
EndpointProtocol is the runtime configuration surface; names such as ClaudeMessages and GeminiGenerateContent live on ProviderProtocol because they mirror upstream wire APIs used by the parse, emit, and transcode helpers.
Authentication Modes
AuthScheme supports:
- `Bearer`
- `Header { name }`
- `Query { name }`
If you do not set an auth scheme explicitly, ProviderEndpoint uses a protocol-specific default.
Building Requests
input vs messages
LlmRequest supports both:
- `input`: canonical execution input
- `messages`: compatibility chat-style view
If input is non-empty, it is treated as the source of truth. If input is empty, messages is used.
For new code, prefer input.
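A hedged sketch of a request that treats input as the source of truth; the constructor and setter names are assumptions, and the model id is illustrative:

```rust
use omnillm::{GenerationConfig, LlmRequest};

// Sketch: `input` carries the canonical items, `instructions` carries system guidance.
// Constructor and setter names are assumptions.
fn build_request() -> LlmRequest {
    LlmRequest::new("claude-sonnet-4-5")
        .with_instructions("You are a terse assistant.")
        .with_text_input("Summarize the gateway error taxonomy in two sentences.")
        .with_generation(GenerationConfig {
            max_output_tokens: Some(256),
            temperature: Some(0.2),
            ..GenerationConfig::default()
        })
}
```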
Message.parts is the content model behind the compatibility view. When
OmniLLM emits OpenAI Chat Completions payloads, plain-text chat messages stay
array-shaped: MessagePart::Text { text: "hi?".into() } becomes
content: [{ "type": "text", "text": "hi?" }]. This is useful for compat
wrappers that reject bare string content.
Provider-Specific Top-Level Fields
Use LlmRequest.vendor_extensions for request fields that OmniLLM does not
normalize.
For OpenAI responses and chat_completions, OmniLLM preserves top-level
request vendor extensions across parse/emit and transport emission. This is
the right place for wrapper-specific flags such as enable_thinking.
Keep normalized controls in generation, capabilities, and metadata.
Reach for vendor_extensions only when a wrapper needs extra fields that
OmniLLM does not model directly.
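A hedged sketch, assuming vendor_extensions is a JSON-style map keyed by field name (the exact type is not shown in this guide):

```rust
use omnillm::LlmRequest;
use serde_json::json;

// Assumption: vendor_extensions maps a top-level field name to a serde_json::Value
// that is passed through to the wire payload unchanged.
fn add_wrapper_flag(mut request: LlmRequest) -> LlmRequest {
    request
        .vendor_extensions
        .insert("enable_thinking".to_string(), json!(true));
    request
}
```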
Instructions
instructions is the canonical top-level place for system/developer guidance.
If instructions is absent, the crate can derive normalized instructions from system/developer messages in the chat-style view.
Generation Controls
GenerationConfig includes:
- `max_output_tokens`
- `temperature`
- `top_p`
- `top_k`
- `stop_sequences`
- `presence_penalty`
- `frequency_penalty`
- `seed`
These are normalized controls. When transcoding to narrower protocols, some fields may be dropped and reported through ConversionReport.loss_reasons.
Capabilities
CapabilitySet is the cross-provider capability layer.
Custom Tools
Structured Output
Reasoning and Builtin Tools
CapabilitySet also includes:
- `reasoning`
- `builtin_tools`
- `modalities`
- `safety`
- `cache`
These are canonical abstractions. Support depends on the target protocol. If a target cannot represent part of the capability set, conversion reports mark that as bridged and possibly lossy.
Prompt Cache
Prompt cache support lives at LlmRequest.capabilities.prompt_cache and keeps the older CapabilitySet.cache as a compatibility hint. Use PromptCachePolicy::BestEffort when cache support is an optimization, and PromptCachePolicy::Required when losing cache semantics should fail before transport.
Provider mapping:
- OpenAI Responses and Chat Completions emit `prompt_cache_key` plus `prompt_cache_retention` (`in_memory` for short retention, `24h` for long retention). Explicit breakpoints are not representable; BestEffort reports loss during typed bridging, while Required returns an error.
- Claude Messages emits `cache_control` with `type: ephemeral`; `Short` maps to `ttl: 5m`, `Long` maps to `ttl: 1h`, and supported breakpoints can target tools, system instructions, messages, or content blocks.
- Gemini GenerateContent has no typed prompt cache mapping; BestEffort is dropped with `ConversionReport.loss_reasons`, and Required fails before transport.
Provider usage is preserved in TokenUsage.prompt_cache:
- OpenAI cached prefixes appear as `cached_input_tokens`.
- Claude cache hits and writes appear as `cache_read_input_tokens`, `cache_creation_input_tokens`, `cache_creation_short_input_tokens`, and `cache_creation_long_input_tokens` when the provider reports them.
PromptLayoutBuilder helps keep a stable cacheable prefix before dynamic user/RAG suffixes and can generate deterministic tenant-scoped prefix keys:
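A heavily hedged sketch: only the `PromptLayoutBuilder` type itself is named by this guide, so every method below (`stable_prefix`, `prefix_key`, `dynamic_suffix`, `into_request`) is hypothetical and only illustrates the stable-prefix-plus-dynamic-suffix idea:

```rust
use omnillm::{LlmRequest, PromptLayoutBuilder};

// Hypothetical method names throughout; the layout keeps the cacheable
// prefix stable, scopes the prefix key to a tenant, and appends the
// per-request material as a dynamic suffix.
fn cached_layout(tenant_id: &str, user_turn: &str) -> LlmRequest {
    PromptLayoutBuilder::new()
        .stable_prefix("You answer support questions about the billing API.") // cacheable
        .prefix_key(format!("support-bot/{tenant_id}"))                       // deterministic, tenant-scoped
        .dynamic_suffix(user_turn)                                            // changes per request
        .into_request("claude-sonnet-4-5")
}
```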
Non-Streaming Calls
Use Gateway::call for one-shot generation:
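A hedged one-shot sketch; the request comes from the earlier builders, and the `usage` field and `output_text()` accessor on the response are assumptions:

```rust
use omnillm::{Gateway, GatewayError, LlmRequest};

async fn one_shot(gateway: &Gateway, request: LlmRequest) -> Result<String, GatewayError> {
    let response = gateway.call(request).await?;
    // Assumed accessors: settled token usage and concatenated output text.
    println!("usage: {:?}", response.usage);
    Ok(response.output_text())
}
```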
The gateway:
- estimates tokens and cost
- acquires a healthy key with enough TPM capacity
- checks local budget
- checks the local RPM window
- dispatches the upstream HTTP request
- settles cost to actual usage
- updates key health based on success or failure
Streaming Calls
Use Gateway::stream when you want canonical stream events:
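A hedged streaming sketch, assuming the returned stream implements futures::Stream and that LlmStreamEvent has text-delta and completed variants with roughly these names (only Completed and ResponseStarted are named in this guide):

```rust
use futures::StreamExt;
use omnillm::{Gateway, GatewayError, LlmRequest, LlmStreamEvent};

async fn stream_text(gateway: &Gateway, request: LlmRequest) -> Result<(), GatewayError> {
    let mut stream = gateway.stream(request).await?;
    while let Some(event) = stream.next().await {
        match event? {
            // Variant names are assumptions based on the event model described here.
            LlmStreamEvent::TextDelta { text, .. } => print!("{text}"),
            LlmStreamEvent::Completed { response } => {
                println!("\nfinal usage: {:?}", response.usage);
            }
            _ => {}
        }
    }
    Ok(())
}
```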
Stream Semantics
- The stream yields `Result<LlmStreamEvent, GatewayError>`.
- Some upstreams send a terminal `Completed` event; others end with `[DONE]` or protocol-specific stop markers.
- OpenAI Chat Completions compat wrappers sometimes coalesce `delta.role = "assistant"` and the first `delta.content` into one SSE frame. The gateway preserves that initial text delta instead of dropping it behind `ResponseStarted`.
- The gateway synthesizes a terminal `Completed` event when needed so callers still get a final canonical response.
- If a stream ends or fails before usage metadata is available, the gateway falls back to internal usage estimation to settle budget instead of refunding the whole reservation.
Cancellation
Use CancellationToken to stop an in-flight request:
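A hedged sketch, assuming CancellationToken is tokio_util's `CancellationToken` and that the gateway exposes a token-aware call variant; the entry point name below is an assumption:

```rust
use std::time::Duration;
use omnillm::{Gateway, GatewayError, LlmRequest};
use tokio_util::sync::CancellationToken;

async fn call_with_deadline(gateway: &Gateway, request: LlmRequest) -> Result<(), GatewayError> {
    let token = CancellationToken::new();
    let child = token.child_token();

    // Cancel the request if it is still in flight after five seconds.
    tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(5)).await;
        token.cancel();
    });

    // Assumed entry point name; a cancelled request surfaces GatewayError::Cancelled.
    match gateway.call_with_cancellation(request, child).await {
        Err(GatewayError::Cancelled) => println!("request cancelled"),
        Err(e) => return Err(e),
        Ok(response) => println!("usage: {:?}", response.usage),
    }
    Ok(())
}
```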
Cancellation becomes GatewayError::Cancelled.
Budget Tracking
Budget tracking is process-local and lock-free.
Key points:
- requests reserve estimated cost before dispatch
- final cost is settled against actual usage
- successful requests can settle up or down
- prompt cache discounts or write costs are applied only after provider usage telemetry is available and the model has known cache rates
- failed or truncated streams do not automatically refund everything; the gateway uses known or estimated partial usage when possible
Observability methods:
- `gateway.budget_used_usd()`
- `gateway.budget_remaining_usd()`
Use BudgetTracker directly if you need the low-level primitive outside of Gateway.
Key Pooling, Rate Limits, and Circuit Breaking
Each key is tracked independently.
What the Pool Enforces
- TPM reservation using atomic in-flight counters
- RPM admission via a sliding window
- randomized selection to reduce contention
- cooldown on provider rate-limit responses
- permanent death on unauthorized responses
- circuit breaking on repeated provider failures
Observability
KeyStatus includes:
- `label`
- `available`
- `tpm_inflight`
- `tpm_limit`
- `cool_down_until`
- `failure_cool_down_until`
- `consecutive_failures`
The cooldown fields are Unix epoch milliseconds.
Error Handling
Public runtime errors are normalized as GatewayError:
- `NoAvailableKey`
- `BudgetExceeded`
- `RateLimited`
- `Unauthorized`
- `Cancelled`
- `Provider(ProviderError)`
- `Protocol(String)`
- `Http(reqwest::Error)`
General guidance:
- `NoAvailableKey`: No currently healthy key had enough local capacity.
- `BudgetExceeded`: Your configured budget cap rejected the request before dispatch.
- `RateLimited`: The local RPM window denied the request, or an upstream 429 was normalized.
- `Unauthorized`: The upstream returned 401/403. The affected key is marked dead.
- `Provider`: The transport completed but the provider failed, or a network failure was normalized as a provider-side failure.
- `Protocol`: The crate could not parse or emit the expected protocol payload.
Protocol Parsing and Emission
Use these helpers when you want to work directly with supported generation protocols:
- `parse_request`
- `emit_request`
- `parse_response`
- `emit_response`
- `parse_stream_event`
- `emit_stream_event`
- `transcode_request`
- `transcode_response`
- `transcode_stream_event`
- `transcode_error`
These are low-level frame-oriented helpers. For production runtime streaming,
prefer Gateway::stream, especially when an upstream compat wrapper may pack
multiple stream semantics into one provider frame.
Example:
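A hedged sketch of transcoding a raw Chat Completions request body into a Claude Messages body. The helper name comes from the list above; the signature, the `ProviderProtocol` variant spellings, and the `value` field on the report are assumptions:

```rust
use omnillm::{transcode_request, ProviderProtocol};

fn chat_to_claude(raw_chat_body: &[u8]) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    // Assumed signature: (source protocol, target protocol, raw body)
    // -> ConversionReport wrapping the emitted body.
    let report = transcode_request(
        ProviderProtocol::OpenAiChatCompletions, // assumed variant name
        ProviderProtocol::ClaudeMessages,
        raw_chat_body,
    )?;

    if report.lossy {
        eprintln!("dropped fields: {:?}", report.loss_reasons);
    }
    Ok(report.value) // assumed field carrying the converted payload
}
```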
Multi-Endpoint API Layer
The multi-endpoint API layer is typed and canonical. It is useful when you want to build converters or request emitters for non-generation endpoint families.
Supported Canonical Endpoint Families
- generation: `ApiRequest::Responses`
- embeddings: `ApiRequest::Embeddings`
- image generation: `ApiRequest::ImageGenerations`
- audio transcription: `ApiRequest::AudioTranscriptions`
- audio speech: `ApiRequest::AudioSpeech`
- rerank: `ApiRequest::Rerank`
Emitting a Transport Request
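A hedged sketch using emit_transport_request (named in this guide). The embeddings constructor, the WireFormat variant spelling, and the shape of the emitted transport request are assumptions:

```rust
use omnillm::{emit_transport_request, ApiRequest, WireFormat};

fn embeddings_transport() -> Result<(), Box<dyn std::error::Error>> {
    // Payload constructor is an assumption; only the ApiRequest::Embeddings
    // family name comes from this guide. The model id is illustrative.
    let request = ApiRequest::embeddings("text-embedding-3-small", vec!["hello world".to_string()]);

    // Assumed signature: canonical request + target wire format
    // -> ConversionReport wrapping a ready-to-send transport request.
    let report = emit_transport_request(&request, WireFormat::OpenAiEmbeddings)?;
    println!("{:?}", report.value); // assumed field carrying the transport request
    Ok(())
}
```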
Bridge and Loss Reporting
ConversionReport<T> tells you:
- `bridged`: The target wire format did not natively match the canonical endpoint model.
- `lossy`: One or more fields could not be represented.
- `loss_reasons`: A specific explanation of what was dropped or degraded.
Example:
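A hedged sketch of inspecting a report; only the field names bridged, lossy, and loss_reasons come from this guide, and loss_reasons is assumed to be an iterable of displayable reasons:

```rust
use omnillm::ConversionReport;

// Log what a conversion could not carry across.
fn check<T>(report: &ConversionReport<T>) {
    if report.bridged {
        println!("target format did not natively match the canonical model");
    }
    if report.lossy {
        for reason in &report.loss_reasons {
            println!("lost: {reason}");
        }
    }
}
```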
Embedded Provider Registry
Use the embedded registry to inspect which providers and endpoint families are currently modeled:
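A hypothetical sketch; this guide only says the registry is embedded metadata about providers and endpoint families, so the entry point and item field names below are invented for illustration:

```rust
// Hypothetical entry point and field names.
use omnillm::provider_registry;

fn list_providers() {
    for provider in provider_registry().providers() {
        println!("{}: {:?}", provider.name, provider.endpoint_families);
    }
}
```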
This registry is metadata, not a runtime dispatcher. It helps with capability discovery, configuration UIs, and validation.
Replay Sanitization
For record/replay style testing, use:
- `ReplayFixture`
- `sanitize_transport_request`
- `sanitize_transport_response`
- `sanitize_json_value`
These helpers redact common secrets such as:
- authorization headers
- query tokens
- JSON key-like secrets
- large binary/base64 blobs
Example:
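A hedged sketch using sanitize_json_value on a captured response body; the helper is named above, but its exact in-place signature is an assumption:

```rust
use omnillm::sanitize_json_value;
use serde_json::json;

fn sanitize_fixture() {
    let mut body = json!({
        "api_key": "sk-live-not-a-real-secret",
        "output": [{ "type": "output_text", "text": "hello" }]
    });

    // Assumed in-place signature; key-like secret fields are redacted
    // before the fixture is written to disk.
    sanitize_json_value(&mut body);
    println!("{body}");
}
```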
Examples Included in This Repository
Run these from the repository root:
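The example names below come from this guide; the runtime examples make live calls, so they also need credentials configured:

```bash
cargo run --example basic_usage
cargo run --example multi_endpoint_demo
cargo run --example responses_live_demo
```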
What each one shows:
- `basic_usage`: Concurrent runtime generation calls with budget and pool status printing.
- `multi_endpoint_demo`: Typed request emission, transcoding, provider registry lookup, and replay sanitization without making network calls.
- `responses_live_demo`: A live image-capable runtime request configured entirely from environment variables.
Live Demo and Live Tests
The repository includes .env.example for the live runtime demo and ignored live tests.
Typical flow:
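A plausible flow; whether the demo loads .env itself or you export the variables yourself is an assumption:

```bash
# Copy the template and fill in your endpoint, model, and API key values.
cp .env.example .env

# Run the live demo once the variables are set.
cargo run --example responses_live_demo
```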
Optional ignored tests:
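Assuming the live tests use the standard #[ignore] attribute, they can be opted into explicitly:

```bash
# Ignored live tests only run when requested.
cargo test -- --ignored
```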
Practical Patterns
1. OpenAI-compatible Runtime Gateway
Use ProviderEndpoint::new(...) with EndpointProtocol for runtime configuration.
Use official variants when OmniLLM should derive standard upstream paths, and *_compat variants when you need to hit a wrapper-specific full URL while reusing the same wire protocol.
2. Conversion-Only Service
If you are writing a proxy, SDK adapter, or test harness, you may never need Gateway. Use the emit_*, parse_*, and transcode_* helpers directly.
3. Safe Fixture Capture
If you store request/response fixtures in a repository, sanitize them before writing to disk.
Troubleshooting
I get NoAvailableKey
Possible causes:
- all keys are cooling down
- all keys are dead
- all keys are locally saturated on TPM
- local circuit breaker has opened on all keys
Inspect gateway.pool_status().
I get BudgetExceeded earlier than expected
Remember that the gateway reserves estimated cost before dispatch. The reservation settles later. During spikes, current usage can temporarily look higher until requests settle.
I get Protocol(...)
This usually means one of:
- the upstream payload shape changed
- the selected protocol does not match the upstream
- a feature was requested that the target protocol cannot encode
If you are transcoding, inspect loss_reasons.
Stream ended without provider usage metadata
This is expected for some upstream streaming shapes. The gateway falls back to partial usage estimation for budget settlement and can synthesize a terminal completed response when necessary.
API Surface Reference
The most commonly used items are:
- runtime generation: `Gateway`, `GatewayBuilder`, `KeyConfig`, `PoolConfig`, `ProviderEndpoint`, `EndpointProtocol`
- canonical generation types: `LlmRequest`, `LlmResponse`, `LlmStreamEvent`, `Message`, `RequestItem`, `CapabilitySet`
- conversion helpers: `parse_request`, `emit_request`, `parse_response`, `emit_response`, `transcode_request`, `transcode_response`
- multi-endpoint API: `ApiRequest`, `ApiResponse`, `WireFormat`, `ConversionReport`, `emit_transport_request`, `parse_transport_response`
- replay sanitization: `ReplayFixture`, `sanitize_transport_request`, `sanitize_transport_response`, `sanitize_json_value`
Recommended Reading Order
If you are new to the crate:
- read the main README.md
- run `cargo run --example basic_usage`
- read this usage guide
- read architecture.md if you need design context
- read implementation.md if you need internals