Intelligence for agents.
On-demand LLM inference. GPT-4, Claude, Llama, and more behind a single x402-gated API.
# Run chat completion
$ curl -X POST https://infer.prim.sh/v1/chat/completions \
    -H "X-402-Payment: $TOKEN" \
    -d '{"model": "llama-3-70b", "messages": [{"role": "user", "content": "Hello"}]}'

# → 402 → pay → 200 OK
{
  "choices": [{ "message": { "content": "Hello! How can I help?" } }],
  "usage": { "total_tokens": 42, "cost": 0.000042 }
}

# Embeddings
$ curl -X POST https://infer.prim.sh/v1/embeddings \
    -H "X-402-Payment: $TOKEN" \
    -d '{"model": "text-embedding-3-small", "input": "hello world"}'
x402 auth
OpenAI compatible
Multi-provider
Token metering
Part of agentstack

What agents use it for

Brain rental

Keep your orchestration on cheap CPUs. Call infer.sh only when you need expensive reasoning.

Model routing

Route easy tasks to small models and hard ones to frontier models. Same API, different cost envelopes.
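A routing policy can be as simple as a heuristic in front of the API. A minimal sketch, assuming a word-count threshold and keyword list that are purely illustrative (model ids taken from the pricing table below):

```python
# Route a prompt to a cheap or frontier model by a rough complexity
# heuristic. Threshold and keyword list are illustrative assumptions,
# not part of the infer.sh API.
CHEAP_MODEL = "llama-3-70b"
FRONTIER_MODEL = "gpt-4o"

HARD_KEYWORDS = {"prove", "analyze", "refactor", "plan", "debug"}

def pick_model(prompt: str) -> str:
    """Return a small model for easy prompts, a frontier model for hard ones."""
    words = prompt.lower().split()
    looks_hard = len(words) > 200 or any(w in HARD_KEYWORDS for w in words)
    return FRONTIER_MODEL if looks_hard else CHEAP_MODEL

print(pick_model("Hello"))                       # easy → llama-3-70b
print(pick_model("Prove this invariant holds"))  # hard → gpt-4o
```

Because the API surface is the same for every model, swapping the heuristic for a learned router later changes nothing downstream.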

Anonymous access

Use proprietary models without creating human-owned accounts. You pay x402, infer.sh handles fiat.

Multimodal I/O

Text, images, and audio through a single interface. Agents treat perception as just another endpoint.

API reference

POST   /v1/chat/completions     # Chat (OpenAI spec)
POST   /v1/completions          # Text completion
POST   /v1/embeddings           # Vector embeddings
GET    /v1/models               # List available models
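The request lifecycle is: call, receive 402 with payment details, pay, retry with the payment token. A minimal sketch of that loop, with the transport and payment functions injected so it runs offline; the exact header and response field names here are assumptions, not the documented wire format:

```python
# Sketch of the x402 request flow. `post` and `pay` are injected
# callables (hypothetical signatures) so the flow can run without a
# network; real field names may differ.
def chat(post, pay, model, messages):
    body = {"model": model, "messages": messages}
    status, resp = post("/v1/chat/completions", body, headers={})
    if status == 402:
        token = pay(resp)  # settle the invoice, obtain a payment token
        status, resp = post("/v1/chat/completions", body,
                            headers={"X-402-Payment": token})
    if status != 200:
        raise RuntimeError(f"inference failed: {status}")
    return resp

# Offline demo with a fake transport and a fake payer:
def fake_post(path, body, headers):
    if "X-402-Payment" not in headers:
        return 402, {"invoice": "inv_123"}
    return 200, {"choices": [{"message": {"content": "Hello!"}}]}

out = chat(fake_post, lambda invoice: "tok_abc", "llama-3-70b",
           [{"role": "user", "content": "Hello"}])
print(out["choices"][0]["message"]["content"])  # → Hello!
```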
  

Pricing

Model               Input / 1M tokens   Output / 1M tokens
Llama 3 70B         $0.70               $0.90
GPT-4o              $5.00               $15.00
Claude 3.5 Sonnet   $3.00               $15.00
Mistral Large       $2.00               $6.00

Pass-through pricing plus a thin margin for x402 handling.
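For budgeting, per-request cost follows directly from the table. A minimal sketch (pass-through rates only; the x402 margin is not modeled, and the model-id strings are assumptions):

```python
# Estimate request cost from the pricing table: USD per 1M tokens,
# (input, output). Model-id strings are assumed, not documented.
PRICES = {
    "llama-3-70b":       (0.70, 0.90),
    "gpt-4o":            (5.00, 15.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "mistral-large":     (2.00, 6.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request, excluding the x402 handling margin."""
    inp, outp = PRICES[model]
    return (input_tokens * inp + output_tokens * outp) / 1_000_000

# 1,000 prompt tokens + 500 completion tokens on Llama 3 70B:
print(f"{estimate_cost('llama-3-70b', 1000, 500):.6f}")  # → 0.001150
```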

Intelligence is an API call.

Agents don't have brains. They have endpoints.

Read the docs →