OpenAlchemy Docs

API reference

OpenAI-compatible inference over an open network of providers. Point any OpenAI SDK at our base URL, keep your existing code, and pay per-token in credits.

Introduction

Every endpoint matches the OpenAI /v1/* shape one-for-one. Swap the base_url on the OpenAI Python or Node SDK and the rest of your application stays unchanged.

Three things differ from OpenAI:

  • You pay in credits (1 credit ≈ USD 0.001). Top up in /billing; per-model pricing is on the same page.
  • Every JSON response has an x_openalchemy block with the request id, worker id, engine latency, and usage breakdown — useful for debugging and cost attribution.
  • Inference is served by independent providers running real GPUs against your traffic. The model catalog at /models shows the live capacity per model.

Authentication

The API uses bearer tokens. Provision a key in /api-keys — keys are shown exactly once at creation; we store only a salted hash. Treat them like passwords.

Each request must include an Authorization header. Keys can be scoped to a subset of models; requests for models outside that scope return 403 model_not_allowed.

Never ship a live key to a browser bundle, a mobile app, or a public Git repo. If a key leaks, revoke it immediately from /api-keys.

Header
Authorization: Bearer $OPENALCHEMY_API_KEY

Base URL

The only network endpoint you need is:

HTTPS
https://api.openalchemy.io

Append /v1/<path> for the OpenAI-compatible surface. The server selects a worker, dispatches your request, returns the result, and bills you on completion, all within the same HTTP request.

Python SDK
# Python: works for chat, vision, embeddings, audio.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.openalchemy.io/v1",
    api_key=os.environ["OPENALCHEMY_API_KEY"],
)
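
For clients that do not use an OpenAI SDK, the base URL and Authorization header described above can be combined by hand. The following is a minimal sketch using only the Python standard library. build_request is a hypothetical helper name, not part of any OpenAlchemy client, and it only prepares the request without sending it.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://api.openalchemy.io"  # from the Base URL section


def build_request(path: str, payload: Optional[dict], api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a request to /v1/<path>.

    Hypothetical helper: GET when there is no payload, POST with a JSON body otherwise.
    """
    url = f"{BASE_URL}/v1/{path.lstrip('/')}"
    headers = {"Authorization": f"Bearer {api_key}"}
    data = None
    method = "GET"
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
        method = "POST"
    return urllib.request.Request(url, data=data, headers=headers, method=method)
```

Passing the returned object to urllib.request.urlopen would perform the actual call. Reading the key from an environment variable, as the SDK snippet does, keeps it out of source control.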

Response metadata

Every JSON response carries an x_openalchemy block. Standard SDKs ignore unknown fields, so this is invisible to existing code — but it's the first place to look when debugging.

  • request_id — quote this when filing a ticket.
  • worker_id — which provider served you.
  • engine_latency_ms — pure GPU time, excludes network.
  • usage.cost — credits debited, as a fixed-point decimal string.

The /logs page indexes the same fields so you can grep across requests.

Response envelope
{
  "id": "cmpl-…",
  "object": "chat.completion",
  "model": "llama-3.1-70b-instruct",
  "choices": [],
  "usage": { "prompt_tokens": 24, "completion_tokens": 19, "total_tokens": 43 },
  "x_openalchemy": {
    "request_id": "req_01HG…",
    "tier": "m",
    "worker_id": "wrk_4f…",
    "engine_request_id": "vllm-…",
    "engine_latency_ms": 412,
    "upstream_latency_ms": 438,
    "usage": {
      "input_tokens": 24,
      "output_tokens": 19,
      "total_tokens": 43,
      "cost": "0.000086"
    }
  }
}
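
Because usage.cost is a decimal string, it is safest to parse it with a decimal type rather than a float. The sketch below pulls the debugging fields out of a parsed response and converts the cost using the approximate rate of USD 0.001 per credit quoted in the introduction. summarize_metadata is a hypothetical name, and the USD figure is only as accurate as that approximate rate.

```python
from decimal import Decimal

USD_PER_CREDIT = Decimal("0.001")  # approximate rate from the introduction


def summarize_metadata(response: dict) -> dict:
    """Extract the x_openalchemy fields most useful for debugging and cost attribution."""
    meta = response.get("x_openalchemy", {})
    usage = meta.get("usage", {})
    cost = Decimal(usage.get("cost", "0"))  # fixed-point string, so no float rounding
    return {
        "request_id": meta.get("request_id"),
        "worker_id": meta.get("worker_id"),
        "engine_latency_ms": meta.get("engine_latency_ms"),
        "cost_credits": cost,
        "cost_usd_approx": cost * USD_PER_CREDIT,
    }
```

With the OpenAI Python SDK, unknown fields such as x_openalchemy are typically reachable through the response object's model_extra attribute or model_dump(); treat that as an assumption to verify against your SDK version.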

List models

GET /v1/models

Returns every model the network currently serves, with capacity, tier, and capability metadata. Use endpoint_type to filter (chat / embedding / rerank / stt / tts), and live_workers to gate fallbacks.

Request
curl https://api.openalchemy.io/v1/models \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY"
Response
{
  "object": "list",
  "data": [
    {
      "id": "llama-3.1-70b-instruct",
      "object": "model",
      "endpoint_type": "chat",
      "tier": "m",
      "family": "llama-3.1",
      "context_window": 131072,
      "params_b": 70,
      "live_workers": 4,
      "online": true
    },
  ]
}
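
One way to use endpoint_type and live_workers for fallbacks is to walk an ordered preference list and take the first model that is currently being served. This is a sketch under the assumption that a model is usable when online is true and live_workers is greater than zero. pick_model is a hypothetical helper.

```python
from typing import Optional


def pick_model(models: list, preferred: list, endpoint_type: str = "chat") -> Optional[str]:
    """Return the first preferred model id that is live for the given endpoint type."""
    by_id = {m["id"]: m for m in models}
    for model_id in preferred:
        entry = by_id.get(model_id)
        if entry is None:
            continue  # not in the catalog at all
        if entry.get("endpoint_type") != endpoint_type:
            continue  # wrong kind of model (e.g. embedding instead of chat)
        if entry.get("online") and entry.get("live_workers", 0) > 0:
            return model_id
    return None  # nothing in the preference list is currently served
```

Here models is the data array from the response above, and the preference list is whatever ordering your application considers acceptable.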

Chat completions

POST /v1/chat/completions

Generate a model response for a conversation. The bread-and-butter endpoint — OpenAI-compatible.

OpenAI v1 compatible — drop in `base_url` + key in any OpenAI SDK.

Request body
  • model (string, required): Model id from /v1/models.
  • messages (array<{role, content}>, required): Conversation turns. role is system / user / assistant.
  • temperature (number 0–2, default 1): Sampling temperature.
  • top_p (number 0–1, default 1): Nucleus sampling cumulative probability.
  • max_tokens (integer, default 2048): Maximum tokens to generate.
  • presence_penalty (number -2…2, default 0): Penalise tokens already present in the text.
  • frequency_penalty (number -2…2, default 0): Penalise tokens proportional to their frequency.
  • stop (string | string[]): Sequences that halt generation.
  • response_format ({ type: 'text' | 'json_object' }): Force JSON output.
  • stream (boolean, default false): SSE streaming. Currently returns 501.
Response
{
  "id": "cmpl-…",
  "object": "chat.completion",
  "model": "<model>",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "…" },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": N, "completion_tokens": N, "total_tokens": N },
  "x_openalchemy": {
    "request_id": "req_…",
    "tier": "s",
    "worker_id": "…",
    "engine_latency_ms": N,
    "usage": { "input_tokens": N, "output_tokens": N, "cost": "0.0008" }
  }
}
Request
curl https://api.openalchemy.io/v1/chat/completions \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "llama-3.1-70b-instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Say hi in one sentence."
    }
  ],
  "temperature": 0.7,
  "max_tokens": 256
}'
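
Checking a request body locally against the documented ranges can save a round trip. The sketch below validates the parameters listed in the table and returns a JSON-ready dict. chat_payload is a hypothetical helper, and the checks for non-empty messages and positive max_tokens are assumptions rather than documented server rules.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def chat_payload(model, messages, *, temperature=1.0, top_p=1.0,
                 max_tokens=2048, stream=False):
    """Validate arguments against the documented ranges and build the request body."""
    if not messages:
        raise ValueError("messages must contain at least one turn")  # assumption
    for turn in messages:
        if turn.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {turn.get('role')!r}")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be within 0-2")
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be within 0-1")
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")  # assumption
    if stream:
        # The parameter table notes that stream=true currently returns 501.
        raise ValueError("streaming is not available yet; use stream=False")
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "stream": False,
    }
```

The returned dict can be serialised with json.dumps and sent as the POST body, or unpacked into an SDK call.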

Vision

POST /v1/chat/completions

Same endpoint as chat completions, with image content parts. Use a vision-capable model.

OpenAI Vision-style — pick a vision-capable model (qwen2.5-vl, etc.).

Request body
  • model (string, required): A vision-capable model id.
  • messages (array<{role, content: (string | ContentPart[])}>, required): Each content part is {type:'text', text} or {type:'image_url', image_url:{url, detail}}.
  • image_url.url (string): https://… URL or data:image/...;base64,… (max ~6 MB per image after base64 encoding).
  • image_url.detail ('low' | 'auto' | 'high', default 'auto'): low costs a flat 85 tokens per image; high costs 85 + 170 × tiles (~765 for 768×768).
  • temperature (number 0–2, default 0.2): Same as chat.
  • max_tokens (integer, default 1024): Same as chat.
Response
// Same shape as /v1/chat/completions.
// x_openalchemy.usage.image_tokens is the per-request image token total.
Request
curl https://api.openalchemy.io/v1/chat/completions \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "qwen2.5-vl-72b",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What'\''s in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://images.example.com/cat.jpg",
            "detail": "auto"
          }
        }
      ]
    }
  ],
  "max_tokens": 512
}'
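
To send a local image instead of a URL, the bytes can be wrapped in a data: URL, and the token cost can be estimated from the formula in the table. Both helpers below are hypothetical sketches. The 6 MiB limit is one reading of "~6 MB", and the tile count assumes 512-pixel tiles with no server-side resizing, which matches the 768×768 figure above but is otherwise an assumption.

```python
import base64
import math

MAX_ENCODED_BYTES = 6 * 1024 * 1024  # assumption: "~6 MB" read as 6 MiB of base64 text


def image_part(data: bytes, mime: str = "image/jpeg", detail: str = "auto") -> dict:
    """Build an image_url content part carrying the image inline as a data: URL."""
    if detail not in {"low", "auto", "high"}:
        raise ValueError(f"unknown detail level: {detail!r}")
    encoded = base64.b64encode(data).decode("ascii")
    if len(encoded) > MAX_ENCODED_BYTES:
        raise ValueError("image exceeds the size limit after base64 encoding")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}", "detail": detail},
    }


def estimate_image_tokens(width: int, height: int, detail: str) -> int:
    """Estimate image tokens: 85 flat for low, 85 + 170 per tile for high."""
    if detail == "low":
        return 85
    if detail == "high":
        tiles = math.ceil(width / 512) * math.ceil(height / 512)  # assumed tiling
        return 85 + 170 * tiles
    # How 'auto' resolves is not documented, so no estimate is attempted.
    raise ValueError("estimate is only defined for 'low' and 'high'")
```

The authoritative count for a finished request is x_openalchemy.usage.image_tokens in the response.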

Embeddings

POST /v1/embeddings

Turn one or many strings into dense vectors for search, clustering, or RAG.

OpenAI Embeddings v1 compatible.

Request body
  • model (string, required): Embedding model id.
  • input (string | string[], required): One or many strings to embed.
  • dimensions (integer): For Matryoshka-trained models, truncate the vector to this many dims. Ignored otherwise.
  • encoding_format ('float' | 'base64', default 'float'): Wire format for the returned vectors.
Response
{
  "object": "list",
  "data": [
    { "object": "embedding", "embedding": [0.0123, -0.045, …], "index": 0 },
  ],
  "model": "<model>",
  "usage": { "prompt_tokens": N, "total_tokens": N },
  "x_openalchemy": { "usage": { "input_tokens": N, "cost": "…" } }
}
Request
curl https://api.openalchemy.io/v1/embeddings \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "nomic-embed-text-v1.5",
  "input": [
    "The quick brown fox",
    "jumps over the lazy dog"
  ]
}'
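
A typical next step for search or RAG is to compare the returned vectors. The sketch below assumes the default 'float' encoding, orders the vectors by their index field so they line up with the input strings, and computes cosine similarity in plain Python. Both function names are hypothetical.

```python
import math


def vectors_in_order(response: dict) -> list:
    """Return the embedding vectors sorted by index, matching the order of `input`."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

For large collections a vector library or database would be the practical choice. This version only illustrates the shape of the data.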

Reranking

POST /v1/rerank

Rerank N candidate documents against a query. Cohere-compatible request shape.

Cohere /v2/rerank compatible (`model`, `query`, `documents`, `top_n`).

Request body
  • model (string, required): Reranker model id.
  • query (string, required): The search query.
  • documents (string[] | { text: string }[], required): Candidate documents to rerank.
  • top_n (integer): Return only the top-N results. Omit for all.
  • return_documents (boolean, default false): Include each document's text in the response.
Response
{
  "results": [
    { "index": 2, "relevance_score": 0.97 },
    { "index": 0, "relevance_score": 0.81 },
  ],
  "x_openalchemy": {
    "usage": {
      "query_tokens": Q,
      "document_tokens": Σ,
      "total_tokens": Q + Σ,
      "cost": "…"
    }
  }
}
Request
curl https://api.openalchemy.io/v1/rerank \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "bge-reranker-v2-m3",
  "query": "What is OpenAlchemy?",
  "documents": [
    "OpenAlchemy is a distributed inference network where independent providers run open models for credits.",
    "Cats are popular household pets.",
    "Workers connect to the grid and serve OpenAI-compatible traffic."
  ],
  "top_n": 2
}'
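
With return_documents left at its default of false, each result carries only an index into the documents array that was sent. The sketch below joins the results back to the original strings. ranked_documents is a hypothetical helper and assumes documents were sent as plain strings.

```python
def ranked_documents(documents: list, response: dict) -> list:
    """Pair each result's relevance_score with the document it refers to, best first."""
    results = sorted(
        response["results"],
        key=lambda result: result["relevance_score"],
        reverse=True,  # sort defensively rather than relying on server ordering
    )
    return [(r["relevance_score"], documents[r["index"]]) for r in results]
```

The explicit sort is defensive: the sample response already lists results from highest to lowest score, but the code does not depend on that.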

Audio transcriptions

POST /v1/audio/transcriptions

Speech-to-text. Multipart upload of an audio file, Whisper-compatible response.

OpenAI Whisper v1 compatible. multipart/form-data.

Request body
  • file (binary, required): mp3 / wav / m4a / webm / flac. ≤ 25 MB per request.
  • model (string, required): STT model id.
  • language (string): ISO-639-1 (e.g. 'en', 'ja'). Omit for auto-detect.
  • prompt (string): Optional bias text. Useful for technical vocabulary.
  • response_format ('json' | 'text' | 'srt' | 'vtt' | 'verbose_json', default 'json'): Output format.
  • temperature (number 0–1, default 0): Sampling temperature (rarely needed).
Response
// json:
{ "text": "…transcript…" }

// verbose_json (adds duration + segments):
{ "text": "…", "duration": 12.34, "language": "en", "segments": [] }

// srt / vtt: plain text subtitle file in the matching format.
Request
curl https://api.openalchemy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -F file=@/path/to/audio.mp3 \
  -F model=whisper-large-v3 \
  -F response_format=json
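
Since uploads are limited by format and size, a local pre-flight check can fail fast before any bytes are sent. The sketch below is hypothetical. It infers the format from the file extension only, and it reads "25 MB" conservatively as 25,000,000 bytes, which is an assumption about how the server counts.

```python
from pathlib import PurePath

ALLOWED_SUFFIXES = {".mp3", ".wav", ".m4a", ".webm", ".flac"}
MAX_UPLOAD_BYTES = 25 * 1000 * 1000  # assumption: decimal megabytes


def check_audio_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file looks unsuitable for /v1/audio/transcriptions."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported audio format: {suffix or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError("file is larger than 25 MB; split it before uploading")
```

In practice size_bytes would come from os.path.getsize or an equivalent call on the file about to be uploaded.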

Audio speech

POST /v1/audio/speech

Text-to-speech. Returns binary audio bytes with usage metadata in response headers.

OpenAI TTS v1 compatible. WAV today; MP3 transcoder is post-launch.

Request body
  • model (string, required): TTS model id.
  • input (string, required): Text to synthesise.
  • voice (string): Voice id, varies per model (e.g. 'af_bella').
  • response_format ('wav' | 'mp3' | 'opus', default 'wav'): Audio container. mp3/opus may return 501 if engine transcoding isn't ready.
  • speed (number 0.5–2.0, default 1): Playback speed multiplier.
Response
// Body is binary audio (Content-Type: audio/wav).
// Inspect the X-Openalchemy-* response headers:
//   X-Openalchemy-Request-Id
//   X-Openalchemy-Audio-Seconds
//   X-Openalchemy-Cost
Request
curl https://api.openalchemy.io/v1/audio/speech \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "kokoro-tts-v1",
  "input": "Hello from OpenAlchemy.",
  "voice": "af_bella",
  "response_format": "wav"
}' \
  --output speech.wav
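
Because the body is binary audio, usage metadata has to be read from the response headers instead of an x_openalchemy JSON block. The sketch below collects the three documented headers from any mapping of header names to values. speech_usage is a hypothetical helper, and the value formats (seconds as a decimal number, cost as a decimal string) are assumptions.

```python
from decimal import Decimal


def speech_usage(headers) -> dict:
    """Read the X-Openalchemy-* headers; header names are matched case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}
    seconds = lowered.get("x-openalchemy-audio-seconds")
    cost = lowered.get("x-openalchemy-cost")
    return {
        "request_id": lowered.get("x-openalchemy-request-id"),
        # Assumed formats: a decimal number of seconds and a decimal credit amount.
        "audio_seconds": float(seconds) if seconds is not None else None,
        "cost_credits": Decimal(cost) if cost is not None else None,
    }
```

Any object with an items() method works as input, including the headers attribute of a urllib response.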

Errors

Errors follow the OpenAI shape: a JSON body with an error object carrying message, type, and code. Branch on code, not on the HTTP status — the same status can carry several codes.

  • 401 invalid_api_key: Bearer token missing, malformed, or revoked.
  • 402 insufficient_balance: Estimated cost exceeds spendable balance.
  • 403 model_not_allowed: This API key isn't authorised for the requested model.
  • 404 model_not_found: Model id is not registered.
  • 429 rate_limited: Per-key RPM or TPM limit exceeded.
  • 503 no_workers_for_model: No grid worker is currently serving this model.
  • 503 model_not_priced: Operator hasn't published a credit_pricing row for this model's tier.
  • 501 stream_not_implemented: Set stream=false until M2 lands SSE.
Error body
{
  "error": {
    "message": "Insufficient credit balance for this request.",
    "type": "billing_error",
    "code": "insufficient_balance"
  }
}
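
Branching on code rather than on the HTTP status might look like the sketch below. The action names on the right-hand side are illustrative choices for this example, not values defined by the API, and the suggested reaction to model_not_priced in particular is an assumption.

```python
ACTIONS = {
    "invalid_api_key": "check_key",
    "insufficient_balance": "top_up",
    "model_not_allowed": "check_key_scope",
    "model_not_found": "check_model_id",
    "rate_limited": "retry_later",
    "no_workers_for_model": "switch_model",
    "model_not_priced": "contact_support",  # assumption: nothing the caller can fix
    "stream_not_implemented": "disable_stream",
}


def next_action(body: dict) -> str:
    """Map an error body to an illustrative follow-up action, keyed on error.code."""
    code = body.get("error", {}).get("code")
    # Both 503 codes share a status but need different handling, so only the
    # code is consulted. Unknown codes fall through to a generic 'raise'.
    return ACTIONS.get(code, "raise")
```

Keeping the mapping in one table makes it easy to extend when new codes appear.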

Rate limits

Rate limits are enforced per API key and expressed in two dimensions: RPM (requests per minute) and TPM (tokens per minute, summed across input + output). Both default to a free-tier ceiling; tiers expand automatically as you top up credits.

When you exceed a limit we return 429 rate_limited with a Retry-After header (seconds). The OpenAI SDKs honour this automatically; for custom clients, back off and retry.
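
For a custom client, the back-off described above can be sketched as a small retry loop. call_with_retry is a hypothetical helper. It expects a send callable that returns a (status, headers, body) tuple, and the exponential fallback used when Retry-After is absent is an assumption, not documented server behaviour.

```python
import time


def call_with_retry(send, max_attempts: int = 3, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            return status, headers, body
        lowered = {name.lower(): value for name, value in headers.items()}
        retry_after = lowered.get("retry-after")
        # Retry-After is given in seconds; fall back to 2, 4, 8, ... if it is missing.
        delay = float(retry_after) if retry_after is not None else 2.0 ** attempt
        sleep(delay)
```

The sleep parameter is injectable so the loop can be exercised in tests without actually waiting.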

Sustained 503 no_workers_for_model on a specific model is a capacity signal, not a rate-limit signal — pick a peer model from /models or open an issue so we can route additional providers to it.

Something missing? Drop notes in /logs when a request misbehaves — the x_openalchemy.request_id on the response is what we'll ask for first.