# Architecture Notes

Production-grade Rust scheduling kernel for LLM API access. Handles multi-key load balancing, per-key rate limiting, cost tracking, and error-driven state management — with all critical paths concurrency-safe.

Random-start acquisition, explicit cooldown semantics, and fixed-point budget settlement across the runtime.
Note: the public request/response model has since moved to a canonical Responses + Capability Layer hybrid. The concurrency, pool, and budget architecture below still applies, but some request/response examples are from the pre-migration Chat Completions model.
## Design Principles
**Keys are leases, not config.** A key is acquired before use and released after — unconditionally, via `Drop`. There is no code path where a key can be consumed without being returned.

**Accounting is a first-class citizen.** Every request pre-occupies quota (tokens, budget). On response, actual usage settles the pre-occupation. Errors trigger key state transitions immediately.

**Algorithm choice follows provider semantics.** OpenAI's RPM is a sliding window — not a token bucket, not GCRA. The limiter models the actual provider behavior, not a convenient approximation.
⚠️ **select + reserve must be atomic.** Any system that separates key selection from quota reservation has a TOCTOU window. This design merges both into a single CAS loop inside `KeyPool::acquire`.

⚠️ **Precision and lock-freedom are not opposites.** Per-key `parking_lot::Mutex` has no cross-key contention, microsecond hold times, and near-zero cost when uncontended. Timestamp-based cooldown is both precise (millisecond granularity) and fully lock-free. There is no trade-off here — you get both.
## Module Layout
### Request flow

Every call follows this path — no exceptions.
## `KeyLease` — RAII Lease
The central insight of this design. A `KeyLease` holds a reservation of TPM quota against a specific `KeyInner`. When it drops — whether the call succeeded, panicked, or was cancelled — the quota is returned via `fetch_sub`. There is no way to forget.
Why not a guard pattern with explicit release? Because explicit release is forgettable. Any early-return, `?` propagation, or future cancellation would skip it. `Drop` is the only guarantee that survives async cancellation in Tokio.
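A minimal sketch of the lease mechanics, assuming a pared-down `KeyInner` that carries only the inflight counter (field and method names here are illustrative, not the actual implementation):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical, pared-down KeyInner: only the inflight counter matters here.
struct KeyInner {
    tpm_inflight: AtomicU64,
}

// RAII lease: quota is reserved at construction and returned on Drop,
// across early returns, `?`, panics, and async cancellation alike.
struct KeyLease {
    inner: Arc<KeyInner>,
    reserved_tokens: u64,
}

impl KeyLease {
    fn new(inner: Arc<KeyInner>, reserved_tokens: u64) -> Self {
        inner.tpm_inflight.fetch_add(reserved_tokens, Ordering::AcqRel);
        KeyLease { inner, reserved_tokens }
    }
}

impl Drop for KeyLease {
    fn drop(&mut self) {
        // Unconditional return of the reservation.
        self.inner
            .tpm_inflight
            .fetch_sub(self.reserved_tokens, Ordering::AcqRel);
    }
}
```

Dropping the lease in any scope — including a cancelled future's — runs `drop` and restores the counter.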
## `KeyPool` — Acquire and Error Reporting
The pool uses a random-start first-fit strategy: each request begins scanning from a random index, wraps around, and takes the first healthy key with available capacity. This avoids the thundering herd problem of `min_by_key` — where N concurrent requests all see the same "least-loaded" key and pile onto the same CAS — and naturally distributes requests across keys with O(1) amortised cost.

The scan is capped at `MAX_CAS_ATTEMPTS` (default 5). Under extreme contention with very few keys, unbounded scanning degenerates into a long-tail spin. Capping attempts converts this into a fast, bounded failure — the caller gets `None` and can retry at a higher level (e.g. `FallbackScheduler`) rather than burning CPU.
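The scan can be sketched as follows, assuming a simplified key record with only a TPM counter and limit (the real `acquire` also checks cooldowns and RPM):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const MAX_CAS_ATTEMPTS: usize = 5;

// Simplified key: just the TPM inflight counter and its limit.
struct Key {
    tpm_inflight: AtomicU64,
    tpm_limit: u64,
}

// Random-start first-fit: scan from `start`, wrap around, CAS-reserve the
// first key with room for `tokens`. Bounded at MAX_CAS_ATTEMPTS total.
fn acquire(keys: &[Key], start: usize, tokens: u64) -> Option<usize> {
    let mut attempts = 0usize;
    for offset in 0..keys.len() {
        let i = (start + offset) % keys.len();
        let key = &keys[i];
        let mut current = key.tpm_inflight.load(Ordering::Acquire);
        while current + tokens <= key.tpm_limit {
            if attempts == MAX_CAS_ATTEMPTS {
                return None; // bounded failure: caller retries at a higher level
            }
            attempts += 1;
            // select + reserve in one step: no TOCTOU window
            match key.tpm_inflight.compare_exchange(
                current,
                current + tokens,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(i),
                Err(observed) => current = observed, // lost the race; retry
            }
        }
        // key is full for this request: skip on fail, scan the next key
    }
    None
}
```

In production `start` would come from a cheap thread-local RNG; a fixed start is used here for determinism.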
Why timestamps instead of `tokio::spawn`? The spawn-based approach creates a zombie timer problem: the spawned task holds an `Arc<KeyInner>` and a timer handle that outlives the request. If the pool is dropped or keys are reconfigured, these timers keep running against stale state. The timestamp approach is checked lazily during `acquire()` — zero async overhead, zero memory overhead, and the key is automatically eligible again once `current_millis()` passes the deadline.
**Decoupled cooldowns.** Rate-limit cooldown (`cool_down_until`) and circuit-breaker cooldown (`failure_cool_down_until`) are separate timestamps, checked independently in `acquire()`. This prevents a subtle failure mode: if a key accumulates 4 failures, then gets a 429 and cools down for 60s, and the first request after recovery hits a 5xx, a shared counter would immediately trip the breaker (failure #5) even though the errors were spread over a minute. Separate timestamps, separate counters, separate semantics.
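The two-timestamp scheme can be sketched like this (a minimal illustration; the method names besides the two timestamp fields are assumptions):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// Wall-clock millis, as the lazy acquire() check would compute it.
fn current_millis() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64
}

// Two independent cooldowns: rate-limit (429) vs. circuit breaker (5xx).
// 0 means "not cooling down"; values are epoch milliseconds.
struct KeyState {
    cool_down_until: AtomicU64,
    failure_cool_down_until: AtomicU64,
}

impl KeyState {
    // Checked lazily during acquire(): no timers, no spawned tasks.
    fn is_available(&self, now_millis: u64) -> bool {
        now_millis >= self.cool_down_until.load(Ordering::Acquire)
            && now_millis >= self.failure_cool_down_until.load(Ordering::Acquire)
    }

    // 429 path: touches only the rate-limit timestamp.
    fn on_rate_limited(&self, now_millis: u64, cool_down_ms: u64) {
        self.cool_down_until
            .store(now_millis + cool_down_ms, Ordering::Release);
    }

    // Circuit-breaker path: independent timestamp, independent counter semantics.
    fn on_circuit_break(&self, now_millis: u64, cool_down_ms: u64) {
        self.failure_cool_down_until
            .store(now_millis + cool_down_ms, Ordering::Release);
    }
}
```

No background task ever needs to "wake up" a key: the key becomes eligible the moment `is_available` observes a timestamp in the past.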
⚠️ **CAS contention.** With random-start + skip-on-fail + `MAX_CAS_ATTEMPTS` cap, CAS contention is bounded both probabilistically and absolutely. Each acquire makes at most 5 CAS attempts, then returns `None` for the caller to handle at a higher level. If you have >1000 concurrent callers with very few keys, consider sharding the pool.
## `SlidingWindow` — RPM and TPM Rate Control
Each key carries two `SlidingWindow` instances — one for RPM, one for TPM. The sliding window accurately models OpenAI's actual rate limit behavior, unlike GCRA (which is more conservative) or fixed-window (which allows boundary bursts).
Why not `governor`? The `governor` crate uses GCRA, which enforces uniform inter-arrival spacing. OpenAI's RPM allows bursts within a window — 60 requests can arrive in the first second of a minute window, which GCRA would reject. The sliding window here allows that burst naturally.
**Mutex choice.** The `Mutex` hold time here is microsecond-scale (a `pop_front` loop), so the real concern is not the lock itself but the implementation. Use `parking_lot::Mutex` instead of `std::sync::Mutex` — it is ~3-5× faster under contention, does not poison on panic, and in an async context avoids blocking the Tokio worker thread for meaningful durations. Since each `SlidingWindow` is per-key, there is no cross-key lock contention — multiple keys are fully independent.
⚠️ **Memory bound.** `VecDeque<Instant>` holds at most `limit` entries. For RPM=10000, that is on the order of 160KB per key (an `Instant` is typically 16 bytes). Acceptable for tens of keys; if you have thousands of keys, consider a fixed-capacity ring buffer instead.
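A dependency-free sketch of the window (using `std::sync::Mutex` to stay self-contained; the notes above recommend `parking_lot::Mutex` in production):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Sliding-window limiter: allows bursts within the window (matching
// OpenAI's RPM behavior) rather than GCRA's uniform spacing.
struct SlidingWindow {
    window: Duration,
    limit: usize,
    events: Mutex<VecDeque<Instant>>,
}

impl SlidingWindow {
    fn new(limit: usize, window: Duration) -> Self {
        SlidingWindow {
            window,
            limit,
            events: Mutex::new(VecDeque::with_capacity(limit)),
        }
    }

    // Record one event at `now` if the window has room; microsecond-scale
    // hold time: one lock, a pop_front eviction loop, one push_back.
    fn try_acquire(&self, now: Instant) -> bool {
        let mut events = self.events.lock().unwrap();
        // Evict timestamps that have slid out of the window.
        while let Some(&front) = events.front() {
            if now.duration_since(front) >= self.window {
                events.pop_front();
            } else {
                break;
            }
        }
        if events.len() < self.limit {
            events.push_back(now);
            true
        } else {
            false
        }
    }
}
```

Note that all `limit` events may land in the first second of the window — the burst GCRA would reject is admitted here.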
## `BudgetTracker` — Fixed-Point, Lock-Free

Costs are stored as `u64` micro-dollars (1 USD = 1,000,000 units). This avoids floating-point precision loss and enables atomic CAS operations. A two-phase settle corrects the delta between pre-estimated and actual usage.
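The two-phase pattern can be sketched as follows (method names are illustrative; the real tracker may expose a richer API):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const MICRO_DOLLARS_PER_USD: u64 = 1_000_000;

// Budget in u64 micro-dollars: exact integer arithmetic, atomic CAS.
struct BudgetTracker {
    spent_micros: AtomicU64,
    limit_micros: u64,
}

impl BudgetTracker {
    // Phase 1: pre-occupy the estimated cost; fails if it would exceed the limit.
    fn reserve(&self, estimated_micros: u64) -> bool {
        let mut current = self.spent_micros.load(Ordering::Acquire);
        loop {
            let next = current + estimated_micros;
            if next > self.limit_micros {
                return false;
            }
            match self.spent_micros.compare_exchange(
                current,
                next,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                Err(observed) => current = observed, // lost the race; retry
            }
        }
    }

    // Phase 2: settle the delta between estimate and actual usage.
    fn settle(&self, estimated_micros: u64, actual_micros: u64) {
        if actual_micros >= estimated_micros {
            self.spent_micros
                .fetch_add(actual_micros - estimated_micros, Ordering::AcqRel);
        } else {
            self.spent_micros
                .fetch_sub(estimated_micros - actual_micros, Ordering::AcqRel);
        }
    }
}
```

Because both phases work on the same integer unit, there is no rounding drift to accumulate across millions of requests.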
## `Gateway::call` — The Main Path

The gateway wires all components together. The order of operations is significant: acquire key before budget (so a failed acquire doesn't consume budget), settle before lease drop (so accounting runs with the key still reserved).
**Dispatcher injects the key as a header, never clones the client.** `Dispatcher` holds a single `reqwest::Client` (shared connection pool). The API key is injected per-request via `Authorization: Bearer {lease.inner.key}`. The provider is stateless.
⚠️ **RPM tradeoff.** RPM is checked after TPM reservation. Under burst load (e.g. 1000 concurrent requests), requests that pass TPM but fail RPM briefly inflate `tpm_inflight` before the lease drops. This transient saturation may cause other requests to see false "full" states and return `None`. The window is microsecond-scale and self-healing, but it can amplify tail latency under extreme bursts. We accept this tradeoff to keep the acquire path lock-free — merging RPM into the CAS loop would require holding a Mutex inside a CAS, which is worse.
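The ordering constraint can be sketched with deliberately stubbed components — every type below is a placeholder, and the real dispatch, retry, and budget logic is elided:

```rust
use std::cell::Cell;

struct Lease; // stands in for the RAII KeyLease
struct Pool;
struct Budget {
    reserved_micros: Cell<u64>,
}

impl Pool {
    fn acquire(&self) -> Option<Lease> {
        Some(Lease) // stub: always succeeds
    }
}

impl Budget {
    fn reserve(&self, micros: u64) -> bool {
        self.reserved_micros.set(micros);
        true // stub: always within budget
    }
    fn settle(&self, actual_micros: u64) {
        self.reserved_micros.set(actual_micros);
    }
}

fn call(pool: &Pool, budget: &Budget, estimated_micros: u64) -> Result<u64, &'static str> {
    // 1. Key first: a failed acquire must not consume budget.
    let _lease = pool.acquire().ok_or("no available key")?;
    // 2. Budget second; on failure, _lease drops here and returns key quota.
    if !budget.reserve(estimated_micros) {
        return Err("budget exhausted");
    }
    // 3. Dispatch (stubbed): pretend the provider reports this actual cost.
    let actual_micros = 2_500_000;
    // 4. Settle while _lease is still held, so accounting sees a reserved key.
    budget.settle(actual_micros);
    Ok(actual_micros)
    // 5. _lease drops last, releasing TPM quota.
}
```

The point is purely sequencing: key → budget → dispatch → settle → lease drop, with every early exit unwinding in reverse via `Drop`.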
## Dispatcher — Retry and Fallback

The dispatcher implements a three-tier retry strategy. This is intentionally inside the dispatcher, not at the gateway level — the gateway sees a single call that either succeeds or returns a final error after all retry options are exhausted.
Why not retry with a different key inside the dispatcher? Because key selection is the pool's responsibility. The dispatcher only knows about the current lease. Cross-key retry belongs at the `FallbackScheduler` level (see Natural Next Steps), where the full pool topology is visible.
**Cancellation propagation.** When the upstream caller cancels (timeout, user disconnect), `tokio::select!` drops the dispatcher future, which drops the in-flight `reqwest` response future. reqwest's `Client` uses hyper under the hood — dropping the response future sends a `RST` on the TCP connection, so the provider stops processing. Without explicit cancellation, a dropped future may leave the TCP connection alive in the pool, causing phantom inflight: the provider is still working, consuming your TPM quota, but your `tpm_inflight` counter has already been decremented by the lease `Drop`.
## Decision Log
## `PoolRegistry` — Provider → Model → `KeyPool`
Keys are not a flat list. Different models under the same provider have independent rate limits — a GPT-4o key's TPM quota is separate from its GPT-4o-mini quota. The `PoolRegistry` enforces this hierarchy.

Why not just tag keys with a model? Because the same API key string may appear in multiple pools with different limits. A single OpenAI key has separate RPM/TPM limits for GPT-4o vs. GPT-4o-mini. Flattening them into one pool would cause cross-model quota pollution — a burst of cheap mini requests could starve the GPT-4o quota, or vice versa.
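A minimal sketch of the hierarchy, assuming a `(provider, model)` composite key and a heavily simplified `KeyPool` (both are illustrative, not the actual structs):

```rust
use std::collections::HashMap;

// Simplified stand-in for the real pool: just keys and a per-model TPM limit.
struct KeyPool {
    keys: Vec<String>,
    tpm_limit: u64,
}

// Hypothetical registry: (provider, model) -> independent key pool.
// The same key string may appear in several pools with different limits.
#[derive(Default)]
struct PoolRegistry {
    pools: HashMap<(String, String), KeyPool>,
}

impl PoolRegistry {
    fn register(&mut self, provider: &str, model: &str, pool: KeyPool) {
        self.pools
            .insert((provider.to_string(), model.to_string()), pool);
    }

    fn pool(&self, provider: &str, model: &str) -> Option<&KeyPool> {
        self.pools.get(&(provider.to_string(), model.to_string()))
    }
}
```

Because each `(provider, model)` pair maps to its own pool, a mini-model burst exhausts only the mini pool's counters — the GPT-4o pool never sees it.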
`Gateway::call` changes accordingly: it resolves the pool for the requested (provider, model) pair from the registry before acquiring a key.
## Observability

A scheduling system without observability is a black box. Every `KeyInner` exposes the following metrics, readable without acquiring any lock:

The gateway should expose a `/health` or equivalent endpoint that returns:

Without these metrics you cannot distinguish "system is healthy but idle" from "all keys are cooling down and every request fails instantly." Both look the same from the outside.
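An illustrative lock-free snapshot, with field names assumed to mirror the counters described above (the actual metric set may differ):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Per-key counters, readable at any time without taking a lock.
struct KeyMetrics {
    tpm_inflight: AtomicU64,
    total_requests: AtomicU64,
    total_errors: AtomicU64,
    cool_down_until_millis: AtomicU64,
}

// Point-in-time view for a /health handler to serialize.
struct Snapshot {
    tpm_inflight: u64,
    total_requests: u64,
    total_errors: u64,
    cooling_down: bool,
}

impl KeyMetrics {
    fn snapshot(&self, now_millis: u64) -> Snapshot {
        Snapshot {
            tpm_inflight: self.tpm_inflight.load(Ordering::Relaxed),
            total_requests: self.total_requests.load(Ordering::Relaxed),
            total_errors: self.total_errors.load(Ordering::Relaxed),
            // "cooling down" is derived, not stored: compare against now.
            cooling_down: now_millis < self.cool_down_until_millis.load(Ordering::Relaxed),
        }
    }
}
```

A snapshot where every key reports `cooling_down: true` is exactly the "every request fails instantly" state that is invisible from the outside.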
## Known Tradeoffs

**RPM transient saturation.** RPM is checked after TPM reservation. Under burst load, requests that pass TPM but fail RPM briefly inflate `tpm_inflight`, potentially causing other requests to see false "full" states. The window is microsecond-scale and self-healing. We accept this to keep the acquire path lock-free. See `Gateway::call` for detailed analysis.
**Token estimation and P99 latency.** `estimated_tokens` is inherently inaccurate — streaming responses, function calls, and reasoning tokens can be 2-10× the estimate. This doesn't break correctness (settle corrects the delta, the provider enforces real limits), but it degrades scheduling precision: the TPM inflight counter understates real load, causing the pool to over-admit requests. The provider responds with elevated latency (soft throttling) rather than a clean 429. This is the primary driver of P99 latency degradation under load. Mitigation: reserve `estimated * OVERBOOK_FACTOR` (e.g. 1.3×) for the inflight counter; settle corrects the delta, so the only cost is slightly reduced theoretical throughput.
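Since the counters are integers, the overbook mitigation is naturally expressed in fixed-point as well — a hypothetical helper, with 1.3 written as 13/10:

```rust
// OVERBOOK_FACTOR = 1.3, kept in integer arithmetic to match the u64 counters.
const OVERBOOK_NUM: u64 = 13;
const OVERBOOK_DEN: u64 = 10;

// Reserve 1.3x the estimate against the inflight counter; the later
// settle against actual usage corrects the delta either way.
fn overbooked_reservation(estimated_tokens: u64) -> u64 {
    estimated_tokens * OVERBOOK_NUM / OVERBOOK_DEN
}
```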
**Phantom inflight on cancellation.** When a future is dropped without explicit cancellation, the underlying TCP connection may remain alive in reqwest's pool. The provider continues processing the request, consuming real TPM quota, but the gateway has already decremented `tpm_inflight` via lease `Drop`. This causes under-counting. Mitigation: `CancellationToken` + `tokio::select!` in `Gateway::call`.
## Natural Next Steps

**Multi-provider fallback.** Add a `FallbackScheduler` that wraps the `PoolRegistry` and implements cross-provider retry: when the primary provider returns `NoAvailableKey` or is circuit-broken, transparently retry on a fallback provider with model mapping (e.g. gpt-4o → claude-3.5-sonnet).

**Per-tenant budget isolation.** Replace the single `BudgetTracker` with a `HashMap<TenantId, BudgetTracker>`. The gateway takes a tenant ID on each call and routes to the appropriate tracker.

**EWMA latency-aware scoring.** Track a per-key exponentially weighted moving average of response time. Use it as a secondary signal in the scan: prefer keys with lower latency when multiple keys have available capacity. This naturally routes away from degraded backends before the circuit breaker trips — critical for detecting provider soft-throttling that doesn't produce 429s.

**Cost-based model downgrade.** Add a `ModelRouter` that, when budget is <20% remaining, substitutes a cheaper model (e.g. gpt-4o → gpt-4o-mini) before calling `acquire`.

**Adaptive feedback loop.** The current system uses static strategies. A production control plane would dynamically adjust `OVERBOOK_FACTOR` based on observed estimation error, tune `CIRCUIT_BREAKER_THRESHOLD` based on per-key error rates, and auto-scale the key pool based on sustained saturation signals.