API reference

OpenAI-compatible inference over an open network of providers. Point any OpenAI SDK at our base URL, keep your existing code, and pay per-token in credits.

Introduction

Every endpoint follows the OpenAI /v1/* shape, with one exception: /v1/rerank uses the Cohere rerank request shape. Swap the base_url on the OpenAI Python or Node SDK and the rest of your application stays unchanged.

Three things differ from OpenAI:

You pay in credits (1 credit ≈ USD 0.001). Top up in /billing; per-model pricing is on the same page.
Every JSON response has an x_openalchemy block with the request id, worker id, engine latency, and usage breakdown — useful for debugging and cost attribution.
Inference is served by independent providers running real GPUs against your traffic. The model catalog at /models shows the live capacity per model.

Authentication

The API uses bearer tokens. Provision a key in /api-keys — keys are shown exactly once at creation; we store only a salted hash. Treat them like passwords.

Each request must include an Authorization header. Keys can be scoped to a subset of models; requests for models outside that scope return 403 model_not_allowed.

Never ship a live key to a browser bundle, a mobile app, or a public Git repo. If a key leaks, revoke it immediately from /api-keys.

Header

Authorization: Bearer $OPENALCHEMY_API_KEY

Base URL

The only network endpoint you need is:

HTTPS: https://api.openalchemy.io

Append /v1/<path> for the OpenAI-compatible surface. The server selects a worker, dispatches your request, returns the result, and bills you on completion, all within a single request. The credits debited are reported in the same response.

Python SDK

# Python: works for chat, vision, embeddings, audio.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.openalchemy.io/v1",
    api_key=os.environ["OPENALCHEMY_API_KEY"],
)

Response metadata

Every JSON response carries an x_openalchemy block. Standard SDKs ignore unknown fields, so this is invisible to existing code — but it's the first place to look when debugging.

request_id — quote this when filing a ticket.
worker_id — which provider served you.
engine_latency_ms — pure GPU time, excludes network.
usage.cost — credits debited, as a fixed-point decimal string.

The /logs page indexes the same fields so you can grep across requests.

Response envelope

{
  "id": "cmpl-…",
  "object": "chat.completion",
  "model": "llama-3.1-70b-instruct",
  "choices": [ … ],
  "usage": { "prompt_tokens": 24, "completion_tokens": 19, "total_tokens": 43 },
  "x_openalchemy": {
    "request_id": "req_01HG…",
    "tier": "m",
    "worker_id": "wrk_4f…",
    "engine_request_id": "vllm-…",
    "engine_latency_ms": 412,
    "upstream_latency_ms": 438,
    "usage": {
      "input_tokens": 24,
      "output_tokens": 19,
      "total_tokens": 43,
      "cost": "0.000086"
    }
  }
}
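
As a minimal sketch of reading this block from a parsed response, assuming the envelope shown above: the helper name `cost_summary` and the flat conversion at USD 0.001 per credit (the approximate rate from the introduction) are illustrative assumptions, not part of the API.

```python
from decimal import Decimal

# Approximate rate quoted in the introduction; treat it as an assumption.
USD_PER_CREDIT = Decimal("0.001")


def cost_summary(response: dict) -> dict:
    """Pull the debugging and billing fields out of a parsed JSON response."""
    meta = response.get("x_openalchemy", {})
    usage = meta.get("usage", {})
    # usage.cost is a decimal string, so parse it with Decimal rather than float.
    credits = Decimal(usage.get("cost", "0"))
    return {
        "request_id": meta.get("request_id"),
        "worker_id": meta.get("worker_id"),
        "engine_latency_ms": meta.get("engine_latency_ms"),
        "credits": credits,
        "approx_usd": credits * USD_PER_CREDIT,
    }


# Hypothetical response fragment with placeholder ids.
sample = {
    "x_openalchemy": {
        "request_id": "req_example",
        "worker_id": "wrk_example",
        "engine_latency_ms": 412,
        "usage": {"input_tokens": 24, "output_tokens": 19, "cost": "0.000086"},
    }
}
summary = cost_summary(sample)
print(summary["request_id"], summary["credits"], summary["approx_usd"])
```

Keeping the cost as a `Decimal` avoids float rounding when you sum many small per-request charges.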

All endpoints

The full OpenAI-compatible surface. Append any path to the base URL. Jump to a documented endpoint below, or open the live model catalog to see which models serve each one.

Method	Path	Tag
POST	/v1/chat/completions	streaming
POST	/v1/chat/completions	vision
POST	/v1/embeddings	vector
POST	/v1/rerank	rerank
POST	/v1/audio/transcriptions	whisper
POST	/v1/audio/translations	whisper
POST	/v1/audio/speech	tts
POST	/v1/images/generations	soon
POST	/v1/videos/generations	soon
GET	/v1/models	list

List models

GET /v1/models

Returns every model the network currently serves, with capacity, tier, and capability metadata. Use endpoint_type to filter (chat / embedding / rerank / stt / tts), and live_workers to gate fallbacks.

Request

curl https://api.openalchemy.io/v1/models \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY"

Response

{
  "object": "list",
  "data": [
    {
      "id": "llama-3.1-70b-instruct",
      "object": "model",
      "endpoint_type": "chat",
      "tier": "m",
      "family": "llama-3.1",
      "context_window": 131072,
      "params_b": 70,
      "live_workers": 4,
      "online": true
    },
    …
  ]
}
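
One way to use endpoint_type and live_workers on the client side is sketched below. It works on an already-parsed catalog; the function name `live_models`, the `min_workers` threshold, and the placeholder model id `example-chat-offline` are assumptions made for illustration.

```python
def live_models(catalog: dict, endpoint_type: str, min_workers: int = 1) -> list[str]:
    """Return ids of online models of one endpoint type with enough live workers."""
    return [
        m["id"]
        for m in catalog.get("data", [])
        if m.get("endpoint_type") == endpoint_type
        and m.get("online")
        and m.get("live_workers", 0) >= min_workers
    ]


# Hypothetical catalog trimmed to the fields the filter reads.
catalog = {
    "object": "list",
    "data": [
        {"id": "llama-3.1-70b-instruct", "endpoint_type": "chat", "live_workers": 4, "online": True},
        {"id": "example-chat-offline", "endpoint_type": "chat", "live_workers": 0, "online": False},
        {"id": "nomic-embed-text-v1.5", "endpoint_type": "embedding", "live_workers": 2, "online": True},
    ],
}
print(live_models(catalog, "chat"))
```

A caller could walk the returned list in order and fall back to the next id when the first choice has no capacity.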

Chat completions

POST /v1/chat/completions

Generate a model response for a conversation. The bread-and-butter endpoint — OpenAI-compatible.

OpenAI v1 compatible — drop in `base_url` + key in any OpenAI SDK.

Request body

Name	Type	Description
`model` (required)	`string`	Model id from /v1/models.
`messages` (required)	`array<{role, content}>`	Conversation turns. role is system / user / assistant.
`temperature`	`number 0–2` default `1`	Sampling temperature.
`top_p`	`number 0–1` default `1`	Nucleus sampling cumulative probability.
`max_tokens`	`integer` default `2048`	Maximum tokens to generate.
`presence_penalty`	`number -2…2` default `0`	Penalise tokens already present in the text.
`frequency_penalty`	`number -2…2` default `0`	Penalise tokens proportional to their frequency.
`stop`	`string \| string[]`	Sequences that halt generation.
`response_format`	`{ type: 'text' \| 'json_object' }`	Force JSON output.
`stream`	`boolean` default `false`	SSE streaming — currently returns 501.

Response

{
  "id": "cmpl-…",
  "object": "chat.completion",
  "model": "<model>",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "…" },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": N, "completion_tokens": N, "total_tokens": N },
  "x_openalchemy": {
    "request_id": "req_…",
    "tier": "s",
    "worker_id": "…",
    "engine_latency_ms": …,
    "usage": { "input_tokens": N, "output_tokens": N, "cost": "0.0008" }
  }
}

curl https://api.openalchemy.io/v1/chat/completions \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "llama-3.1-70b-instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Say hi in one sentence."
    }
  ],
  "temperature": 0.7,
  "max_tokens": 256
}'
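
For clients that do not use an OpenAI SDK, the same call can be sketched with the Python standard library. This is an illustration under assumptions: `build_chat_request` is a hypothetical helper, it only checks the documented temperature range, and it builds the request without sending it.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.openalchemy.io/v1"


def build_chat_request(model: str, messages: list, api_key: str, **params) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/chat/completions request."""
    temperature = params.get("temperature", 1)
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    # stream is left out on purpose: the parameter table says it returns 501 today.
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "llama-3.1-70b-instruct",
    [{"role": "user", "content": "Say hi in one sentence."}],
    api_key=os.environ.get("OPENALCHEMY_API_KEY", "placeholder-key"),
    temperature=0.7,
    max_tokens=256,
)
# To send it: json.load(urllib.request.urlopen(req))
print(req.get_method(), req.full_url)
```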

Try it in the playground→

Tweak the request live and copy the resulting code back here.

Vision

POST /v1/chat/completions

Same endpoint as chat completions, with image content parts. Use a vision-capable model.

OpenAI Vision-style — pick a vision-capable model (qwen2.5-vl, etc.).

Request body

Name	Type	Description
`model` (required)	`string`	A vision-capable model id.
`messages` (required)	`array<{role, content: (string \| ContentPart[])}>`	Each content part is {type:'text', text} or {type:'image_url', image_url:{url, detail}}.
`image_url.url`	`string`	https://… URL or data:image/...;base64,… (max ~6 MB per image after base64 encoding).
`image_url.detail`	`'low' \| 'auto' \| 'high'` default `'auto'`	low: flat 85 tokens per image; high: 85 + 170 × tiles (~765 for 768×768).
`temperature`	`number 0–2` default `0.2`	Same as chat.
`max_tokens`	`integer` default `1024`	Same as chat.

Response

// Same shape as /v1/chat/completions.
// x_openalchemy.usage.image_tokens is the per-request image token total.
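
The detail formula from the request table can be turned into a rough pre-flight estimate. The sketch below assumes 512×512 tiles with no resizing before tiling, which is consistent with the 768×768 example (4 tiles) but is not confirmed by this page; `estimate_image_tokens` is a hypothetical helper, and the billed figure is whatever x_openalchemy.usage.image_tokens reports.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high", tile: int = 512) -> int:
    """Estimate image tokens from the documented low/high formulas."""
    if detail == "low":
        return 85  # flat cost per image
    if detail != "high":
        # 'auto' is resolved by the server, so there is nothing to compute here.
        raise ValueError("only 'low' and 'high' have a documented formula")
    # Assumed tiling: count the 512-pixel tiles needed to cover the image.
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return 85 + 170 * tiles


print(estimate_image_tokens(768, 768))
```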

curl https://api.openalchemy.io/v1/chat/completions \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "qwen2.5-vl-72b",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What'\''s in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://images.example.com/cat.jpg",
            "detail": "auto"
          }
        }
      ]
    }
  ],
  "max_tokens": 512
}'
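
To send a local image instead of a hosted URL, the data: form can be built as in the sketch below. The helper `image_part` is hypothetical, and the exact byte limit is an assumption: the table only says roughly 6 MB after base64 encoding.

```python
import base64

# "~6 MB after base64" per the table above; the precise cutoff is an assumption.
MAX_DATA_URL_BYTES = 6 * 1024 * 1024


def image_part(image_bytes: bytes, mime: str = "image/jpeg", detail: str = "auto") -> dict:
    """Wrap raw image bytes as an image_url content part using a data: URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    if len(encoded) > MAX_DATA_URL_BYTES:
        raise ValueError("image exceeds the ~6 MB base64 limit; resize it or host it at a URL")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}", "detail": detail},
    }


# Placeholder bytes stand in for a real file read with open(path, "rb").read().
part = image_part(b"placeholder image bytes")
message = {
    "role": "user",
    "content": [{"type": "text", "text": "What's in this image?"}, part],
}
print(part["image_url"]["url"][:30])
```

Base64 grows the payload by about a third, so the check runs on the encoded length rather than on the file size.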

Embeddings

POST /v1/embeddings

Turn one or many strings into dense vectors for search, clustering, or RAG.

OpenAI Embeddings v1 compatible.

Request body

Name	Type	Description
`model` (required)	`string`	Embedding model id.
`input` (required)	`string \| string[]`	One or many strings to embed.
`dimensions`	`integer`	For Matryoshka-trained models, truncate the vector to this many dims. Ignored otherwise.
`encoding_format`	`'float' \| 'base64'` default `'float'`	Wire format for the returned vectors.

Response

{
  "object": "list",
  "data": [
    { "object": "embedding", "embedding": [0.0123, -0.045, …], "index": 0 },
    …
  ],
  "model": "<model>",
  "usage": { "prompt_tokens": N, "total_tokens": N },
  "x_openalchemy": { "usage": { "input_tokens": N, "cost": "…" } }
}

curl https://api.openalchemy.io/v1/embeddings \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "nomic-embed-text-v1.5",
  "input": [
    "The quick brown fox",
    "jumps over the lazy dog"
  ]
}'
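
A typical next step is comparing the returned vectors. The sketch below assumes the response shape shown above, with each item's index referring to the position of the string in input; both function names are illustrative.

```python
import math


def vectors_in_input_order(response: dict) -> list[list[float]]:
    """Sort by index so the vectors line up with the request's input array."""
    return [d["embedding"] for d in sorted(response["data"], key=lambda d: d["index"])]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical two-dimensional vectors, deliberately listed out of order.
response = {
    "data": [
        {"object": "embedding", "embedding": [0.0, 1.0], "index": 1},
        {"object": "embedding", "embedding": [1.0, 0.0], "index": 0},
    ]
}
first, second = vectors_in_input_order(response)
print(cosine_similarity(first, second))
```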

Reranking

POST /v1/rerank

Rerank N candidate documents against a query. Cohere-compatible request shape.

Cohere /v2/rerank compatible (`model`, `query`, `documents`, `top_n`).

Request body

Name	Type	Description
`model` (required)	`string`	Reranker model id.
`query` (required)	`string`	The search query.
`documents` (required)	`string[] \| { text: string }[]`	Candidate documents to rerank.
`top_n`	`integer`	Return only the top-N results. Omit for all.
`return_documents`	`boolean` default `false`	Include each document's text in the response.

Response

{
  "results": [
    { "index": 2, "relevance_score": 0.97 },
    { "index": 0, "relevance_score": 0.81 },
    …
  ],
  "x_openalchemy": {
    "usage": {
      "query_tokens": Q,
      "document_tokens": Σ,
      "total_tokens": Q + Σ,
      "cost": "…"
    }
  }
}

curl https://api.openalchemy.io/v1/rerank \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "bge-reranker-v2-m3",
  "query": "What is OpenAlchemy?",
  "documents": [
    "OpenAlchemy is a distributed inference network where independent providers run open models for credits.",
    "Cats are popular household pets.",
    "Workers connect to the grid and serve OpenAI-compatible traffic."
  ],
  "top_n": 2
}'
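
When return_documents is left at false, the response carries only indices and scores. A small sketch of joining them back to the request's documents follows; it assumes each index is a zero-based position in the documents array (as in Cohere's API), and `ranked_documents` is a hypothetical helper.

```python
def ranked_documents(documents: list[str], response: dict) -> list[tuple[float, str]]:
    """Pair each result's relevance_score with the document its index points at."""
    return [(r["relevance_score"], documents[r["index"]]) for r in response["results"]]


documents = [
    "OpenAlchemy is a distributed inference network where independent providers run open models for credits.",
    "Cats are popular household pets.",
    "Workers connect to the grid and serve OpenAI-compatible traffic.",
]
# Hypothetical response using the scores from the example above.
response = {
    "results": [
        {"index": 2, "relevance_score": 0.97},
        {"index": 0, "relevance_score": 0.81},
    ]
}
for score, text in ranked_documents(documents, response):
    print(f"{score:.2f}  {text}")
```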

Audio transcriptions

POST /v1/audio/transcriptions

Speech-to-text. Multipart upload of an audio file, Whisper-compatible response.

OpenAI Whisper v1 compatible. multipart/form-data.

Request body

Name	Type	Description
`file` (required)	`binary`	mp3 / wav / m4a / webm / flac. ≤ 25 MB per request.
`model` (required)	`string`	STT model id.
`language`	`string`	ISO-639-1 (e.g. 'en', 'ja'). Omit for auto-detect.
`prompt`	`string`	Optional bias text. Useful for technical vocabulary.
`response_format`	`'json' \| 'text' \| 'srt' \| 'vtt' \| 'verbose_json'` default `'json'`	Output format.
`temperature`	`number 0–1` default `0`	Sampling temperature (rarely needed).

Response

// json:
{ "text": "…transcript…" }

// verbose_json (adds duration + segments):
{ "text": "…", "duration": 12.34, "language": "en", "segments": [ … ] }

// srt / vtt: plain text subtitle file in the matching format.

curl https://api.openalchemy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -F file=@/path/to/audio.mp3 \
  -F model=whisper-large-v3 \
  -F response_format=json
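
Checking the file before upload avoids paying for a round trip that will be rejected. The sketch below is an assumption-laden pre-flight check: `check_audio_upload` is hypothetical, and it reads "25 MB" conservatively as 25,000,000 bytes because the page does not say whether the limit is decimal or binary.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".mp3", ".wav", ".m4a", ".webm", ".flac"}
# Conservative reading of "25 MB"; the server's exact cutoff is not documented here.
MAX_UPLOAD_BYTES = 25 * 1000 * 1000


def check_audio_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file is an unsupported type or too large."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported audio format: {suffix or 'no extension'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 25 MB per-request limit; split it first")


check_audio_upload("meeting.mp3", 3_000_000)  # passes silently
```

In real use, `size_bytes` would come from `os.path.getsize(path)` or `Path(path).stat().st_size`.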

Audio speech

POST /v1/audio/speech

Text-to-speech. Returns binary audio bytes with usage metadata in response headers.

OpenAI TTS v1 compatible. WAV today; MP3 transcoder is post-launch.

Request body

Name	Type	Description
`model` (required)	`string`	TTS model id.
`input` (required)	`string`	Text to synthesise.
`voice`	`string`	Voice id — varies per model (e.g. 'af_bella').
`response_format`	`'wav' \| 'mp3' \| 'opus'` default `'wav'`	Audio container. mp3/opus may return 501 if engine transcoding isn't ready.
`speed`	`number 0.5–2.0` default `1`	Playback speed multiplier.

Response

// Body is binary audio (Content-Type: audio/wav).
// Inspect the X-Openalchemy-* response headers:
//   X-Openalchemy-Request-Id
//   X-Openalchemy-Audio-Seconds
//   X-Openalchemy-Cost

curl https://api.openalchemy.io/v1/audio/speech \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "kokoro-tts-v1",
  "input": "Hello from OpenAlchemy.",
  "voice": "af_bella",
  "response_format": "wav"
}' \
  --output speech.wav
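
Because the body is binary, usage has to be read from the headers. The sketch below assumes the header values are plain decimal strings, which this page does not state; `speech_usage` and the sample values are hypothetical.

```python
from decimal import Decimal


def speech_usage(headers: dict) -> dict:
    """Extract the X-Openalchemy-* usage headers from a response header mapping."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    return {
        "request_id": lowered.get("x-openalchemy-request-id"),
        "audio_seconds": float(lowered.get("x-openalchemy-audio-seconds", "0")),
        "cost": Decimal(lowered.get("x-openalchemy-cost", "0")),
    }


# Hypothetical header values for illustration only.
usage = speech_usage({
    "Content-Type": "audio/wav",
    "X-Openalchemy-Request-Id": "req_example",
    "X-Openalchemy-Audio-Seconds": "1.5",
    "X-Openalchemy-Cost": "0.0004",
})
print(usage)
```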

Errors

Errors follow the OpenAI shape: a JSON body with an error object carrying message, type, and code. Branch on code, not on the HTTP status — the same status can carry several codes.

Status	Code	Description
401	invalid_api_key	Bearer token missing, malformed, or revoked.
402	insufficient_balance	Estimated cost exceeds spendable balance.
403	model_not_allowed	This API key isn't authorised for the requested model.
404	model_not_found	Model id is not registered.
429	rate_limited	Per-key RPM or TPM limit exceeded.
503	no_workers_for_model	No grid worker is currently serving this model.
503	model_not_priced	Operator hasn't published a credit_pricing row for this model's tier.
501	stream_not_implemented	Set stream=false until M2 lands SSE.

Error body

{
  "error": {
    "message": "Insufficient credit balance for this request.",
    "type": "billing_error",
    "code": "insufficient_balance"
  }
}
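
Branching on code rather than status might look like the sketch below. The error codes come from the list above; the action labels and the `next_action` helper are hypothetical and only illustrate that the two 503 codes call for different handling.

```python
# Map each documented error code to a client-side action (labels are illustrative).
ACTIONS = {
    "invalid_api_key": "check_key",
    "model_not_allowed": "check_key",
    "insufficient_balance": "top_up",
    "rate_limited": "retry_after_backoff",
    "no_workers_for_model": "try_peer_model",
    "stream_not_implemented": "resend_without_stream",
    "model_not_found": "fail",
    "model_not_priced": "fail",
}


def next_action(error_body: dict) -> str:
    """Choose an action from error.code; unknown codes fail fast."""
    code = error_body.get("error", {}).get("code")
    return ACTIONS.get(code, "fail")


# Both of these arrive with HTTP 503 but need different handling.
print(next_action({"error": {"code": "no_workers_for_model"}}))
print(next_action({"error": {"code": "model_not_priced"}}))
```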

Rate limits

Rate limits are enforced per API key and expressed in two dimensions: RPM (requests per minute) and TPM (tokens per minute, summed across input + output). Both default to a free-tier ceiling; tiers expand automatically as you top up credits.

When you exceed a limit we return 429 rate_limited with a Retry-After header (seconds). The OpenAI SDKs honour this automatically; for custom clients, back off and retry.
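
For a custom client, the back-off can be sketched as follows. This is one possible approach, not the service's prescribed algorithm: `call_with_retry` is hypothetical, `send` stands in for whatever function performs the HTTP call and returns (status, headers, body), and the exponential fallback when Retry-After is absent is an assumption.

```python
import time


def call_with_retry(send, max_attempts: int = 3, sleep=time.sleep):
    """Call send() and retry on 429, waiting Retry-After seconds between attempts."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            return status, body
        # Retry-After is given in seconds; fall back to exponential backoff if absent.
        delay = float(headers.get("Retry-After", 2 ** attempt))
        sleep(delay)


# Simulated transport: one 429, then success. No real waiting or network traffic.
responses = iter([(429, {"Retry-After": "2"}, {}), (200, {}, {"ok": True})])
waits = []
status, body = call_with_retry(lambda: next(responses), sleep=waits.append)
print(status, body, waits)
```

Injecting `sleep` keeps the helper easy to test; production code would leave the default `time.sleep`.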

Sustained 503 no_workers_for_model on a specific model is a capacity signal, not a rate-limit signal — pick a peer model from /models or open an issue so we can route additional providers to it.

Something missing? Drop notes in /logs when a request misbehaves — the x_openalchemy.request_id on the response is what we'll ask for first.