Intelligently classify and route to Claude, Gemini, ChatGPT, DeepSeek, and Kimi. Engineered in Rust for 908 RPS at mere megabytes of overhead.
Note: Currently only local OLLAMA models are supported in beta.
// capabilities
Multi-factor heuristics + embedding analysis classify every query in under 50ms, then route to the right model automatically.
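The multi-factor routing step can be pictured as a set of cheap lexical checks feeding a model choice. This is a hypothetical sketch, not the platform's actual classifier: the keyword lists, length cutoff, and model assignments are illustrative assumptions.

```rust
// Hypothetical multi-factor heuristic router. Keywords, the 400-char
// length cutoff, and the model names chosen per branch are illustrative
// assumptions, not the platform's real rules.
fn route(query: &str) -> &'static str {
    let lower = query.to_lowercase();
    let long = query.len() > 400; // length heuristic for long-context needs
    let code = ["fn ", "def ", "class ", "```"]
        .iter()
        .any(|k| lower.contains(k)); // crude code-detection heuristic
    let math = ["prove", "integral", "theorem"]
        .iter()
        .any(|k| lower.contains(k)); // crude math/reasoning heuristic
    match (code, math, long) {
        (true, _, _) => "deepseek", // code-heavy query
        (_, true, _) => "claude",   // reasoning-heavy query
        (_, _, true) => "gemini",   // long-context query
        _ => "chatgpt",             // general default
    }
}

fn main() {
    println!("{}", route("Explain quantum computing"));
}
```

In a real deployment, the heuristic verdict would be combined with the embedding analysis mentioned above before the final route is picked.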
Local Moka LRU for microsecond hits. Redis for distributed, persistent storage with cosine-similarity matching across sessions.
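The cosine-similarity matching can be sketched as a scan over cached query embeddings, returning the stored response whose embedding is closest to the incoming one. This is a minimal illustration of the idea, not the Redis-backed implementation; the `Cache` type and the 0.95 threshold are assumptions.

```rust
// Sketch of a semantic cache lookup: the cached response is reused when
// the incoming query's embedding is cosine-similar enough to a stored one.
// The Cache struct and threshold are illustrative, not the platform's API.
struct Cache {
    entries: Vec<(Vec<f32>, String)>, // (query embedding, cached response)
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

impl Cache {
    /// Return the best cached response at or above `threshold`, if any.
    fn lookup(&self, query: &[f32], threshold: f32) -> Option<&str> {
        self.entries
            .iter()
            .map(|(emb, resp)| (cosine(query, emb), resp))
            .filter(|(score, _)| *score >= threshold)
            .max_by(|a, b| a.0.partial_cmp(&b.0).unwrap())
            .map(|(_, resp)| resp.as_str())
    }
}

fn main() {
    let cache = Cache {
        entries: vec![(vec![1.0, 0.0], "cached answer".to_string())],
    };
    // A near-identical embedding scores ~0.995, above the 0.95 threshold.
    println!("{:?}", cache.lookup(&[0.99, 0.1], 0.95));
}
```

On a miss, the query would fall through to the router and the fresh response would be written back with its embedding.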
Pre-wired dashboards, request histograms, cache hit ratios, and model health gauges out of the box.
// quick start
Spin up the full stack — platform, Redis, Prometheus, and Grafana — in a single command. Bring your own API key.
Full setup guide →

# Run standalone
docker run \
-p 8080:8080 \
himanshu806/ai-inference-platform
# Full stack (Redis + monitoring)
docker-compose up -d
# Test it
curl -X POST http://localhost:8080/api/v1/infer \
-H 'Content-Type: application/json' \
-d '{"query":"Explain quantum computing"}'

This is a super early release designed for unified routing to Claude, Gemini, ChatGPT, DeepSeek, and Kimi.
We are working toward a stable v1.0.0 for production use. Current limitations include rudimentary query classification, basic cache management, and early-stage telemetry.
Need a feature such as more robust classification, smarter cache management, richer telemetry, or additional configuration options?
Mail me at: hyattherate2005@gmail.com
Open source · MIT license · Built in Rust