Intelligently classify and route to Claude, Gemini, ChatGPT, DeepSeek, and Kimi. Engineered in Rust for 908 RPS at mere megabytes of overhead.
Note: Currently only local OLLAMA models are supported in beta.
// capabilities
Multi-factor heuristics + embedding analysis classify every query in under 50ms, then route to the right model automatically.
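The multi-factor routing step can be pictured as a set of cheap lexical checks feeding a model choice. This is a hypothetical sketch, not the platform's actual classifier: the keyword lists, length cutoff, and model assignments are illustrative assumptions.

```rust
// Hypothetical multi-factor heuristic router. Keywords, the 400-char
// length cutoff, and the model names chosen per branch are illustrative
// assumptions, not the platform's real rules.
fn route(query: &str) -> &'static str {
    let lower = query.to_lowercase();
    let long = query.len() > 400; // length heuristic for long-context needs
    let code = ["fn ", "def ", "class ", "```"]
        .iter()
        .any(|k| lower.contains(k)); // crude code-detection heuristic
    let math = ["prove", "integral", "theorem"]
        .iter()
        .any(|k| lower.contains(k)); // crude math/reasoning heuristic
    match (code, math, long) {
        (true, _, _) => "deepseek", // code-heavy query
        (_, true, _) => "claude",   // reasoning-heavy query
        (_, _, true) => "gemini",   // long-context query
        _ => "chatgpt",             // general default
    }
}

fn main() {
    println!("{}", route("Explain quantum computing"));
}
```

In a real deployment, the heuristic verdict would be combined with the embedding analysis mentioned above before the final route is picked.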
Local Moka LRU for microsecond hits. Redis for distributed, persistent storage with cosine-similarity matching across sessions.
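The cosine-similarity matching can be sketched as a scan over cached query embeddings, returning the stored response whose embedding is closest to the incoming one. This is a minimal illustration of the idea, not the Redis-backed implementation; the `Cache` type and the 0.95 threshold are assumptions.

```rust
// Sketch of a semantic cache lookup: the cached response is reused when
// the incoming query's embedding is cosine-similar enough to a stored one.
// The Cache struct and threshold are illustrative, not the platform's API.
struct Cache {
    entries: Vec<(Vec<f32>, String)>, // (query embedding, cached response)
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

impl Cache {
    /// Return the best cached response at or above `threshold`, if any.
    fn lookup(&self, query: &[f32], threshold: f32) -> Option<&str> {
        self.entries
            .iter()
            .map(|(emb, resp)| (cosine(query, emb), resp))
            .filter(|(score, _)| *score >= threshold)
            .max_by(|a, b| a.0.partial_cmp(&b.0).unwrap())
            .map(|(_, resp)| resp.as_str())
    }
}

fn main() {
    let cache = Cache {
        entries: vec![(vec![1.0, 0.0], "cached answer".to_string())],
    };
    // A near-identical embedding scores ~0.995, above the 0.95 threshold.
    println!("{:?}", cache.lookup(&[0.99, 0.1], 0.95));
}
```

On a miss, the query would fall through to the router and the fresh response would be written back with its embedding.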
Pre-wired dashboards, request histograms, cache hit ratios, and model health gauges out of the box.
// quick start
Spin up the full stack — platform, Redis, Prometheus, and Grafana — in a single command. Bring your own API key.
Full setup guide →

# Run standalone
docker run \
-p 8080:8080 \
himanshu806/ai-inference-platform
# Full stack (Redis + monitoring)
docker-compose up -d
# Test it
curl -X POST http://localhost:8080/api/v1/infer \
-H 'Content-Type: application/json' \
-d '{"query":"Explain quantum computing"}'

This is a super early release designed for unified routing to Claude, Gemini, ChatGPT, DeepSeek, and Kimi.
We are working toward a stable v1.0.0 for production use. Current limitations include rudimentary query classification, basic cache management, and early-stage telemetry.
Need a feature such as more robust classification, smarter cache management, richer telemetry, or additional configuration options?
Mail me at: hyattherate2005@gmail.com
Open source · MIT license · Built in Rust