Splendor protects shared backends — full-text search, the vector store, and the embedding and reranking providers — by shedding load when they are saturated, rather than letting requests pile up.

Capacity shedding

When a backend is temporarily at capacity, the API responds with HTTP 503 (Service Unavailable), a JSON body carrying a structured error code, and a Retry-After header:
{
  "detail": {
    "code": "semantic_search_capacity_exhausted",
    "message": "Semantic search capacity is temporarily exhausted; retry shortly."
  }
}
The codes you may see under load:
  • semantic_search_capacity_exhausted — semantic search is saturated.
  • semantic_provider_capacity_exhausted — the embedding or reranking provider is saturated.
  • cold_search_capacity_exhausted — cold-tier (archival) search is saturated.
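A client can tell these cases apart by reading `detail.code` from the JSON body. The sketch below assumes the body shape shown above. The function name `capacity_error_code` and the `CAPACITY_CODES` constant are hypothetical, not part of any Splendor SDK.

```python
import json
from typing import Optional

# The capacity codes listed above; any other code is not a capacity error.
CAPACITY_CODES = frozenset({
    "semantic_search_capacity_exhausted",
    "semantic_provider_capacity_exhausted",
    "cold_search_capacity_exhausted",
})


def capacity_error_code(status: int, body: str) -> Optional[str]:
    """Return the capacity error code from a 503 body, or None if it isn't one."""
    if status != 503:
        return None
    try:
        detail = json.loads(body).get("detail")
    except (json.JSONDecodeError, AttributeError):
        return None  # not JSON, or JSON whose top level isn't an object
    if not isinstance(detail, dict):
        return None
    code = detail.get("code")
    # isinstance check avoids a TypeError if "code" is an unhashable value.
    return code if isinstance(code, str) and code in CAPACITY_CODES else None


example = '{"detail": {"code": "cold_search_capacity_exhausted", "message": "retry shortly"}}'
print(capacity_error_code(503, example))  # cold_search_capacity_exhausted
```

Returning the specific code, rather than a boolean, lets a caller log or alert on which backend is saturated while still treating all three the same way for retries.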

Backing off

When you receive a 503 with Retry-After, wait at least that long before retrying. If the 503s persist, apply exponential backoff with jitter.
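import random
import time

import httpx


def search_with_retry(client: httpx.Client, body: dict, attempts: int = 5) -> httpx.Response:
    delay = 1.0  # backoff in seconds; doubles after every 503
    for attempt in range(attempts):
        r = client.post("/v1/search", json=body)  # client needs a base_url
        if r.status_code != 503:
            return r  # success, or a non-capacity error for the caller to handle
        if attempt == attempts - 1:
            break  # out of attempts: don't sleep before giving up
        try:
            # Retry-After is treated as a floor; this assumes delta-seconds
            # and falls back to plain backoff if it is missing or an HTTP-date.
            retry_after = float(r.headers.get("Retry-After", 0))
        except ValueError:
            retry_after = 0.0
        time.sleep(max(retry_after, delay) + random.uniform(0, delay))  # add jitter
        delay *= 2
    r.raise_for_status()  # raises httpx.HTTPStatusError for the final 503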
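The retry loop above parses Retry-After as a number of seconds, but HTTP (RFC 9110) also allows the header to carry an HTTP-date. A minimal sketch of a delay calculation that accepts both forms, assuming a hypothetical `next_delay` helper with an illustrative 30-second cap on the jitter window (neither is a Splendor setting):

```python
import random
import time
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str], now: float) -> float:
    """Seconds to wait per a Retry-After value; 0.0 if absent, past, or unparseable."""
    if not value:
        return 0.0
    value = value.strip()
    if value.isascii() and value.isdigit():  # delay-seconds form, e.g. "120"
        return float(value)
    try:  # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):  # Pythons before 3.10 raise TypeError on bad input
        return 0.0
    if when.tzinfo is None:  # "-0000" parses as naive; HTTP-dates are always UTC
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, when.timestamp() - now)


def next_delay(retry_after: Optional[str], attempt: int, base: float = 1.0,
               cap: float = 30.0, now: Optional[float] = None) -> float:
    """Server-requested floor plus jitter drawn from an exponentially growing window."""
    now = time.time() if now is None else now
    floor = retry_after_seconds(retry_after, now)
    window = min(cap, base * 2.0 ** min(attempt, 32))  # clamp exponent to avoid overflow
    return floor + random.uniform(0.0, window)
```

For example, with `Retry-After: 5` on the first retry (`attempt=0`, `base=1.0`), `next_delay` returns a value between 5 and 6 seconds. Adding the jitter on top of the server's floor, rather than taking the larger of the two, keeps clients that received the same header from retrying in lockstep.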
Capacity errors are transient: retrying after the indicated delay almost always succeeds. Handle them differently from 4xx errors, which will keep failing until you change the request.
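The distinction between transient capacity errors and 4xx errors can be expressed as a small decision function. This is a sketch: the `Action` enum and `classify` function are hypothetical, and because this page only describes 503 and 4xx responses, every other non-2xx status is simply surfaced to the caller.

```python
from enum import Enum


class Action(Enum):
    SUCCEED = "succeed"          # 2xx: use the response
    RETRY = "retry"              # 503 capacity shedding: back off, then retry
    FIX_REQUEST = "fix_request"  # 4xx: resending the same request won't help
    SURFACE = "surface"          # anything else: not covered here, report it


def classify(status: int) -> Action:
    """Map an HTTP status code to what the client should do next."""
    if 200 <= status < 300:
        return Action.SUCCEED
    if status == 503:
        return Action.RETRY
    if 400 <= status < 500:
        return Action.FIX_REQUEST
    return Action.SURFACE
```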