API reference
OpenAI-compatible inference over an open network of providers. Point any OpenAI SDK at our base URL, keep your existing code, and pay per-token in credits.
Introduction
Every endpoint matches the OpenAI /v1/* shape one-for-one. Swap the base_url on the OpenAI Python or Node SDK and the rest of your application stays unchanged.
Three things differ from OpenAI:
- You pay in credits (1 credit ≈ USD 0.001). Top up in /billing; per-model pricing is on the same page.
- Every JSON response has an x_openalchemy block with the request id, worker id, engine latency, and usage breakdown, which is useful for debugging and cost attribution.
- Inference is served by independent providers running real GPUs against your traffic. The model catalog at /models shows the live capacity per model.
Authentication
The API uses bearer tokens. Provision a key in /api-keys. Keys are shown exactly once at creation, and we store only a salted hash. Treat them like passwords.
Each request must include an Authorization header. Keys can be scoped to a subset of models; requests for models outside that scope return 403 model_not_allowed.
Never ship a live key to a browser bundle, a mobile app, or a public Git repo. If a key leaks, revoke it immediately from /api-keys.
Authorization: Bearer $OPENALCHEMY_API_KEY
Base URL
The only network endpoint you need is:
https://api.openalchemy.io
Append /v1/<path> for the OpenAI-compatible surface. The server selects a worker, dispatches your request, streams the result back, and bills you on completion, all in the same response.
# Python: works for chat, vision, embeddings, audio.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.openalchemy.io/v1",
    api_key=os.environ["OPENALCHEMY_API_KEY"],
)
Response metadata
Every JSON response carries an x_openalchemy block. Standard SDKs ignore unknown fields, so this is invisible to existing code, but it's the first place to look when debugging.
- request_id: quote this when filing a ticket.
- worker_id: which provider served you.
- engine_latency_ms: pure GPU time, excludes network.
- usage.cost: credits debited, as a fixed-point decimal string.
The /logs page indexes the same fields so you can grep across requests.
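As a rough illustration of reading these fields, the sketch below parses a raw JSON response body and pulls out the x_openalchemy values. The helper name `summarize_metadata` and the sample values are hypothetical, and the USD conversion uses the approximate rate from the introduction (1 credit ≈ USD 0.001), so treat the dollar figure as an estimate only.

```python
import json
from decimal import Decimal

# Approximate rate from the introduction: 1 credit is roughly USD 0.001.
USD_PER_CREDIT = Decimal("0.001")


def summarize_metadata(body: str) -> dict:
    """Extract debugging and billing fields from a raw JSON response body."""
    meta = json.loads(body).get("x_openalchemy", {})
    # cost is a fixed-point decimal string, so parse it with Decimal, not float.
    cost = Decimal(meta.get("usage", {}).get("cost", "0"))
    return {
        "request_id": meta.get("request_id"),
        "worker_id": meta.get("worker_id"),
        "engine_latency_ms": meta.get("engine_latency_ms"),
        "cost_credits": cost,
        "cost_usd_approx": cost * USD_PER_CREDIT,
    }


# Hypothetical sample body; real ids will differ.
sample = json.dumps({
    "x_openalchemy": {
        "request_id": "req_example",
        "worker_id": "wrk_example",
        "engine_latency_ms": 412,
        "usage": {"cost": "0.000086"},
    }
})
summary = summarize_metadata(sample)
print(summary["request_id"], summary["cost_credits"])
```

Keeping the cost as a Decimal avoids the rounding drift that floats would introduce when many small per-request costs are summed.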
{
  "id": "cmpl-…",
  "object": "chat.completion",
  "model": "llama-3.1-70b-instruct",
  "choices": [ … ],
  "usage": { "prompt_tokens": 24, "completion_tokens": 19, "total_tokens": 43 },
  "x_openalchemy": {
    "request_id": "req_01HG…",
    "tier": "m",
    "worker_id": "wrk_4f…",
    "engine_request_id": "vllm-…",
    "engine_latency_ms": 412,
    "upstream_latency_ms": 438,
    "usage": {
      "input_tokens": 24,
      "output_tokens": 19,
      "total_tokens": 43,
      "cost": "0.000086"
    }
  }
}
List models
/v1/models
Returns every model the network currently serves, with capacity, tier, and capability metadata. Use endpoint_type to filter (chat / embedding / rerank / stt / tts), and live_workers to gate fallbacks.
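One possible way to use endpoint_type and live_workers together is sketched below. It works on the already-parsed `data` array of a /v1/models response. The function name `pick_model`, the fallback rule (most live workers wins), and the second model id in the sample are assumptions made for illustration, not behaviour the API prescribes.

```python
from __future__ import annotations

from typing import Optional


def pick_model(models: list[dict], preferred: str, endpoint_type: str = "chat") -> Optional[str]:
    """Return the preferred model if it has live workers, else a peer that does."""
    candidates = [
        m for m in models
        if m.get("endpoint_type") == endpoint_type and m.get("live_workers", 0) > 0
    ]
    if any(m["id"] == preferred for m in candidates):
        return preferred
    if not candidates:
        return None
    # Assumed fallback policy: choose the peer with the most live workers.
    return max(candidates, key=lambda m: m["live_workers"])["id"]


# Sample catalog; "example-chat-small" is a made-up id.
catalog = [
    {"id": "llama-3.1-70b-instruct", "endpoint_type": "chat", "live_workers": 0},
    {"id": "example-chat-small", "endpoint_type": "chat", "live_workers": 3},
    {"id": "nomic-embed-text-v1.5", "endpoint_type": "embedding", "live_workers": 5},
]
print(pick_model(catalog, "llama-3.1-70b-instruct"))
```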
curl https://api.openalchemy.io/v1/models \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY"
{
  "object": "list",
  "data": [
    {
      "id": "llama-3.1-70b-instruct",
      "object": "model",
      "endpoint_type": "chat",
      "tier": "m",
      "family": "llama-3.1",
      "context_window": 131072,
      "params_b": 70,
      "live_workers": 4,
      "online": true
    },
    …
  ]
}
Chat completions
/v1/chat/completions
Generate a model response for a conversation. This is the core endpoint and is OpenAI v1 compatible: set `base_url` and your key in any OpenAI SDK.
| Name | Type | Description |
|---|---|---|
| model (required) | string | Model id from /v1/models. |
| messages (required) | array<{role, content}> | Conversation turns. role is system / user / assistant. |
| temperature | number 0–2, default 1 | Sampling temperature. |
| top_p | number 0–1, default 1 | Nucleus sampling cumulative probability. |
| max_tokens | integer, default 2048 | Maximum tokens to generate. |
| presence_penalty | number -2…2, default 0 | Penalise tokens already present in the text. |
| frequency_penalty | number -2…2, default 0 | Penalise tokens proportional to their frequency. |
| stop | string \| string[] | Sequences that halt generation. |
| response_format | { type: 'text' \| 'json_object' } | Set type to 'json_object' to force JSON output. |
| stream | boolean, default false | SSE streaming. Currently returns 501. |
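The parameter ranges in the table can be checked on the client before a request is sent. The sketch below is one way to do that; `build_chat_payload` is a hypothetical helper, and whether the server rejects or clamps out-of-range values is not stated here, so the client-side errors are an assumption about what is convenient rather than a mirror of server behaviour.

```python
from __future__ import annotations

# Allowed ranges copied from the parameter table above.
_RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "presence_penalty": (-2.0, 2.0),
    "frequency_penalty": (-2.0, 2.0),
}


def build_chat_payload(model: str, messages: list[dict], **params) -> dict:
    """Validate sampling parameters client-side and return a request body."""
    for name, (low, high) in _RANGES.items():
        if name in params and not low <= params[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    if params.get("stream"):
        # The table notes that streaming currently returns 501.
        raise ValueError("stream=true is not supported yet")
    return {"model": model, "messages": messages, **params}


payload = build_chat_payload(
    "llama-3.1-70b-instruct",
    [{"role": "user", "content": "Say hi in one sentence."}],
    temperature=0.7,
    max_tokens=256,
)
print(sorted(payload))
```

The resulting dict can be passed as the JSON body of a POST to /v1/chat/completions, or unpacked into an OpenAI SDK call.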
{
  "id": "cmpl-…",
  "object": "chat.completion",
  "model": "<model>",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "…" },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": N, "completion_tokens": N, "total_tokens": N },
  "x_openalchemy": {
    "request_id": "req_…",
    "tier": "s",
    "worker_id": "…",
    "engine_latency_ms": …,
    "usage": { "input_tokens": N, "output_tokens": N, "cost": "0.0008" }
  }
}
curl https://api.openalchemy.io/v1/chat/completions \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b-instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Say hi in one sentence."
      }
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'
Vision
/v1/chat/completions
Same endpoint as chat completions, with image content parts in the OpenAI Vision style. Use a vision-capable model (qwen2.5-vl, etc.).
| Name | Type | Description |
|---|---|---|
| model (required) | string | A vision-capable model id. |
| messages (required) | array<{role, content: (string \| ContentPart[])}> | Each content part is {type:'text', text} or {type:'image_url', image_url:{url, detail}}. |
| image_url.url | string | https://… URL or data:image/...;base64,… (max ~6 MB per image after base64 encoding). |
| image_url.detail | 'low' \| 'auto' \| 'high', default 'auto' | low: 85 tokens per image, flat; high: 85 + 170 × tiles (~765 for 768×768). |
| temperature | number 0–2, default 0.2 | Same as chat. |
| max_tokens | integer, default 1024 | Same as chat. |
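To make the image token formula and the data URL form concrete, here is a small sketch. Several things in it are assumptions rather than documented behaviour: tiles are taken to be 512×512 pixels with no rescaling step, 'auto' is treated like 'high', and the size cap reads "~6 MB" as 6 × 1024 × 1024 encoded bytes. Both helper names are hypothetical.

```python
import base64
import math

MAX_ENCODED_BYTES = 6 * 1024 * 1024  # assumed reading of "~6 MB after base64"


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Apply the table's formula: 85 flat for low, 85 + 170 per tile otherwise."""
    if detail == "low":
        return 85
    # Assumption: 512x512 px tiles, image not rescaled before tiling.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data: URL suitable for image_url.url."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    if len(encoded) > MAX_ENCODED_BYTES:
        raise ValueError("image exceeds the ~6 MB limit after base64 encoding")
    return f"data:{mime};base64,{encoded}"


print(estimate_image_tokens(768, 768))  # 765, matching the table's example
print(to_data_url(b"abc"))              # data:image/jpeg;base64,YWJj
```

For billing, the x_openalchemy.usage.image_tokens value on the response is authoritative; an estimate like this is only useful for budgeting before the call.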
// Same shape as /v1/chat/completions.
// x_openalchemy.usage.image_tokens is the per-request image token total.
curl https://api.openalchemy.io/v1/chat/completions \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-vl-72b",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What'\''s in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://images.example.com/cat.jpg",
              "detail": "auto"
            }
          }
        ]
      }
    ],
    "max_tokens": 512
  }'
Embeddings
/v1/embeddings
Turn one or many strings into dense vectors for search, clustering, or RAG. OpenAI Embeddings v1 compatible.
| Name | Type | Description |
|---|---|---|
| model (required) | string | Embedding model id. |
| input (required) | string \| string[] | One or many strings to embed. |
| dimensions | integer | For Matryoshka-trained models, truncate the vector to this many dims. Ignored otherwise. |
| encoding_format | 'float' \| 'base64', default 'float' | Wire format for the returned vectors. |
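When encoding_format is 'base64', the vector arrives as a string instead of a JSON array. The sketch below decodes it under the assumption that the payload follows the OpenAI convention of packed little-endian float32 values; that layout is not confirmed for this API here, so verify it against a 'float' response before relying on it. `decode_embedding` is a hypothetical helper.

```python
import base64
import struct


def decode_embedding(b64_vector: str) -> list:
    """Decode a base64 embedding, assuming packed little-endian float32 values."""
    raw = base64.b64decode(b64_vector)
    if len(raw) % 4:
        raise ValueError("payload length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Round-trip check with values that float32 represents exactly.
packed = base64.b64encode(struct.pack("<3f", 0.5, -0.25, 1.0)).decode("ascii")
print(decode_embedding(packed))  # [0.5, -0.25, 1.0]
```

The base64 form is mainly worth using for large batches, where it is noticeably smaller on the wire than a JSON array of decimal floats.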
{
  "object": "list",
  "data": [
    { "object": "embedding", "embedding": [0.0123, -0.045, …], "index": 0 },
    …
  ],
  "model": "<model>",
  "usage": { "prompt_tokens": N, "total_tokens": N },
  "x_openalchemy": { "usage": { "input_tokens": N, "cost": "…" } }
}
curl https://api.openalchemy.io/v1/embeddings \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text-v1.5",
    "input": [
      "The quick brown fox",
      "jumps over the lazy dog"
    ]
  }'
Reranking
/v1/rerank
Rerank N candidate documents against a query. The request shape is Cohere /v2/rerank compatible (`model`, `query`, `documents`, `top_n`).
| Name | Type | Description |
|---|---|---|
| model (required) | string | Reranker model id. |
| query (required) | string | The search query. |
| documents (required) | string[] \| { text: string }[] | Candidate documents to rerank. |
| top_n | integer | Return only the top-N results. Omit for all. |
| return_documents | boolean, default false | Include each document's text in the response. |
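Because return_documents defaults to false, results carry only an index and a relevance_score, so the caller has to map them back to the original list. A minimal sketch follows, assuming index is the zero-based position in the documents array that was sent (the Cohere convention). `attach_documents` is a hypothetical helper.

```python
from __future__ import annotations


def attach_documents(documents: list[str], results: list[dict]) -> list[tuple[float, str]]:
    """Pair each result's relevance_score with the document its index points to."""
    # Sort defensively; the sample response already lists results best-first.
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [(r["relevance_score"], documents[r["index"]]) for r in ranked]


docs = [
    "OpenAlchemy is a distributed inference network.",
    "Cats are popular household pets.",
    "Workers connect to the grid and serve OpenAI-compatible traffic.",
]
results = [
    {"index": 0, "relevance_score": 0.81},
    {"index": 2, "relevance_score": 0.97},
]
for score, text in attach_documents(docs, results):
    print(score, text)
```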
{
  "results": [
    { "index": 2, "relevance_score": 0.97 },
    { "index": 0, "relevance_score": 0.81 },
    …
  ],
  "x_openalchemy": {
    "usage": {
      "query_tokens": Q,
      "document_tokens": Σ,
      "total_tokens": Q + Σ,
      "cost": "…"
    }
  }
}
curl https://api.openalchemy.io/v1/rerank \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-reranker-v2-m3",
    "query": "What is OpenAlchemy?",
    "documents": [
      "OpenAlchemy is a distributed inference network where independent providers run open models for credits.",
      "Cats are popular household pets.",
      "Workers connect to the grid and serve OpenAI-compatible traffic."
    ],
    "top_n": 2
  }'
Audio transcriptions
/v1/audio/transcriptions
Speech-to-text. Upload an audio file as multipart/form-data; the response is OpenAI Whisper v1 compatible.
| Name | Type | Description |
|---|---|---|
| file (required) | binary | mp3 / wav / m4a / webm / flac. ≤ 25 MB per request. |
| model (required) | string | STT model id. |
| language | string | ISO-639-1 (e.g. 'en', 'ja'). Omit for auto-detect. |
| prompt | string | Optional bias text. Useful for technical vocabulary. |
| response_format | 'json' \| 'text' \| 'srt' \| 'vtt' \| 'verbose_json', default 'json' | Output format. |
| temperature | number 0–1, default 0 | Sampling temperature (rarely needed). |
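A quick local check against the format and size limits in the table can save a wasted upload. The sketch below is illustrative only: `check_audio_file` is a hypothetical helper, the format is inferred from the file extension rather than the file contents, and "25 MB" is read as 25 × 1024 × 1024 bytes, which is an assumption.

```python
import os
import tempfile

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm", ".flac"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumed interpretation of "25 MB"


def check_audio_file(path: str) -> None:
    """Raise ValueError if the file violates the limits listed in the table."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or '(none)'}")
    if os.path.getsize(path) > MAX_UPLOAD_BYTES:
        raise ValueError("file is larger than 25 MB")


# Demo with a throwaway file: allowed extension, well under the size limit.
with tempfile.NamedTemporaryFile(suffix=".mp3") as tmp:
    tmp.write(b"\x00" * 1024)
    tmp.flush()
    check_audio_file(tmp.name)
    print("ok to upload")
```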
// json:
{ "text": "…transcript…" }
// verbose_json (adds duration + segments):
{ "text": "…", "duration": 12.34, "language": "en", "segments": [ … ] }
// srt / vtt: plain text subtitle file in the matching format.
curl https://api.openalchemy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -F file=@/path/to/audio.mp3 \
  -F model=whisper-large-v3 \
  -F response_format=json
Audio speech
/v1/audio/speech
Text-to-speech. Returns binary audio bytes with usage metadata in response headers. OpenAI TTS v1 compatible. Output is WAV today; the MP3 transcoder is planned for after launch.
| Name | Type | Description |
|---|---|---|
| model (required) | string | TTS model id. |
| input (required) | string | Text to synthesise. |
| voice | string | Voice id, which varies per model (e.g. 'af_bella'). |
| response_format | 'wav' \| 'mp3' \| 'opus', default 'wav' | Audio container. mp3 and opus may return 501 if the engine transcoder isn't ready. |
| speed | number 0.5–2.0, default 1 | Playback speed multiplier. |
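Since the body of this response is raw audio, usage metadata comes back in the X-Openalchemy-* response headers listed in this section instead of an x_openalchemy JSON block. The sketch below collects them from a plain header mapping. `parse_speech_headers` is a hypothetical helper, and it assumes the audio-seconds value is numeric and the cost is a decimal string like the JSON cost field.

```python
from decimal import Decimal


def parse_speech_headers(headers: dict) -> dict:
    """Collect the X-Openalchemy-* usage headers from a speech response."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    return {
        "request_id": lowered.get("x-openalchemy-request-id"),
        "audio_seconds": float(lowered.get("x-openalchemy-audio-seconds", 0)),
        "cost": Decimal(lowered.get("x-openalchemy-cost", "0")),
    }


# Hypothetical header values for illustration.
sample_headers = {
    "Content-Type": "audio/wav",
    "X-Openalchemy-Request-Id": "req_example",
    "X-Openalchemy-Audio-Seconds": "1.5",
    "X-Openalchemy-Cost": "0.0125",
}
print(parse_speech_headers(sample_headers))
```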
// Body is binary audio (Content-Type: audio/wav).
// Inspect the X-Openalchemy-* response headers:
//   X-Openalchemy-Request-Id
//   X-Openalchemy-Audio-Seconds
//   X-Openalchemy-Cost
curl https://api.openalchemy.io/v1/audio/speech \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro-tts-v1",
    "input": "Hello from OpenAlchemy.",
    "voice": "af_bella",
    "response_format": "wav"
  }' \
  --output speech.wav
Errors
Errors follow the OpenAI shape: a JSON body with an error object carrying message, type, and code. Branch on code, not on the HTTP status, because the same status can carry several codes.
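Branching on code might look like the sketch below, which maps an error response body to a handling strategy using the error codes documented in this section. The `ACTIONS` table and the strategy names are a hypothetical client-side policy, not something the API defines.

```python
import json

# Hypothetical client-side policy keyed on error.code values.
ACTIONS = {
    "rate_limited": "retry_later",
    "no_workers_for_model": "try_other_model",
    "insufficient_balance": "top_up",
    "invalid_api_key": "fix_credentials",
    "model_not_allowed": "fix_credentials",
}


def action_for(body: str) -> str:
    """Map an error response body to a handling strategy, defaulting to 'fail'."""
    try:
        code = json.loads(body)["error"]["code"]
    except (ValueError, KeyError, TypeError):
        # Not JSON, or not in the documented error shape.
        return "fail"
    return ACTIONS.get(code, "fail")


error_body = json.dumps({
    "error": {
        "message": "Insufficient credit balance for this request.",
        "type": "billing_error",
        "code": "insufficient_balance",
    }
})
print(action_for(error_body))  # top_up
```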
- invalid_api_key: Bearer token missing, malformed, or revoked.
- insufficient_balance: Estimated cost exceeds spendable balance.
- model_not_allowed: This API key isn't authorised for the requested model.
- model_not_found: Model id is not registered.
- rate_limited: Per-key RPM or TPM limit exceeded.
- no_workers_for_model: No grid worker is currently serving this model.
- model_not_priced: Operator hasn't published a credit_pricing row for this model's tier.
- stream_not_implemented: Set stream=false until M2 lands SSE.
{
  "error": {
    "message": "Insufficient credit balance for this request.",
    "type": "billing_error",
    "code": "insufficient_balance"
  }
}
Rate limits
Rate limits are enforced per API key and expressed in two dimensions: RPM (requests per minute) and TPM (tokens per minute, summed across input + output). Both default to a free-tier ceiling; tiers expand automatically as you top up credits.
When you exceed a limit we return 429 rate_limited with a Retry-After header (seconds). The OpenAI SDKs honour this automatically; for custom clients, back off and retry.
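For a custom client, the back-off step could be sketched as follows. `retry_delay` is a hypothetical helper: it prefers the Retry-After value when the header is present and parseable as seconds, and otherwise falls back to exponential backoff with full jitter, which is a common convention and an assumption here rather than documented behaviour.

```python
import random


def retry_delay(headers: dict, attempt: int, cap: float = 60.0) -> float:
    """Seconds to wait before retrying a 429, preferring the Retry-After header."""
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        # The server's value is used as-is, not capped.
        return float(lowered["retry-after"])
    except (KeyError, ValueError):
        # No usable header: exponential backoff with full jitter (an assumption).
        return random.uniform(0, min(cap, 2.0 ** attempt))


print(retry_delay({"Retry-After": "3"}, attempt=0))  # 3.0
```

A caller would sleep for the returned number of seconds and then resend the same request, giving up after a bounded number of attempts.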
Sustained 503 no_workers_for_model on a specific model is a capacity signal, not a rate-limit signal. Pick a peer model from /models, or open an issue so we can route additional providers to it.
If you open an issue, the x_openalchemy.request_id on the response is what we'll ask for first.