# Architecture Documentation

## System Overview
The AI Inference Platform is designed as a high-performance, intelligent gateway for AI model inference. It sits between client applications and various AI model providers (e.g., OpenAI, Ollama), adding value through:
- Intelligent Routing: Directing queries to the most cost-effective model that can handle the complexity.
- Caching: Reducing latency and costs by serving repeated or semantically similar queries from cache.
- Observability: Providing deep insights into usage, performance, and model behavior.
## Component Architecture
```mermaid
graph TD
    Client[Client Application] -->|HTTP/JSON| API[Axum HTTP API]

    subgraph "Core Services"
        API --> Classifier[Query Classifier]
        API --> Cache[Cache Manager]
        API --> Router[Model Router]
        Classifier -->|Heuristics + Embeddings| Router
        Cache -->|L1: Moka| LocalCache[(In-Memory)]
        Cache -->|L2: Redis| Redis[(Redis Cluster)]
    end

    subgraph "Model Layer"
        Router -->|Simple Queries| Simple["Small Model<br/>e.g., Llama3-8b, GPT-3.5"]
        Router -->|Complex Queries| Complex["Large Model<br/>e.g., GPT-4, Claude 3 Opus"]
    end

    subgraph "Observability"
        Metrics[Prometheus] -.-> API
        Metrics -.-> Router
        Metrics -.-> Cache
    end
```

### 1. API Layer (`src/api/`)
- Framework: Built on `axum`, providing a robust asynchronous HTTP server.
- Middleware: Handles cross-cutting concerns such as timeouts, compression, CORS, and request tracking (IDs).
- Endpoints: Exposes RESTful endpoints for inference (`/infer`), classification (`/classify`), and management (a minimal sketch of the `/infer` wiring follows this list).
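
For orientation, here is a minimal sketch of how the `/infer` endpoint could be wired up with axum 0.7. The request/response field names, the port, and the stub handler body are illustrative assumptions, not the project's actual types; the real handler would invoke the classifier, cache, and router.

```rust
use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};

// Hypothetical request/response shapes, for illustration only.
#[derive(Deserialize)]
struct InferRequest {
    query: String,
}

#[derive(Serialize)]
struct InferResponse {
    answer: String,
    model: String,
}

// Stub handler: in the real service, classification, cache lookup,
// and routing would happen here before calling a model provider.
async fn infer(Json(req): Json<InferRequest>) -> Json<InferResponse> {
    Json(InferResponse {
        answer: format!("echo: {}", req.query),
        model: "stub".to_string(),
    })
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/infer", post(infer));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```

Middleware such as timeouts, compression, and CORS would be layered onto the `Router` via `tower` layers rather than written into each handler.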
### 2. Query Classifier (`src/classifier/`)

- Purpose: Determines the "complexity" of a user query to decide which model should handle it.
- Mechanism:
  - Heuristics: Analyzes length, vocabulary diversity, and sentence structure.
  - Embeddings: (Optional) Uses lightweight embeddings to analyze the semantic complexity of the query's intent.
  - Scoring: Aggregates these factors into a unified complexity score (0.0–1.0); a sketch follows this list.
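
A heuristics-only sketch of how such a score could be computed. The factor weights and normalization constants below are illustrative assumptions, not the project's tuned values.

```rust
use std::collections::HashSet;

/// Combine simple heuristics into a complexity score in [0.0, 1.0].
/// The 0.5 / 0.2 / 0.3 weights and the caps are assumed for illustration.
fn complexity_score(query: &str) -> f64 {
    let words: Vec<&str> = query.split_whitespace().collect();
    if words.is_empty() {
        return 0.0;
    }

    // Length: longer prompts tend to warrant a larger model.
    let length = (words.len() as f64 / 100.0).min(1.0);

    // Vocabulary diversity: unique words / total words.
    let unique: HashSet<&str> = words.iter().copied().collect();
    let diversity = unique.len() as f64 / words.len() as f64;

    // Sentence structure: average sentence length, capped around 30 words.
    let sentences = query
        .split(|c: char| c == '.' || c == '!' || c == '?')
        .filter(|s| !s.trim().is_empty())
        .count()
        .max(1);
    let structure = (words.len() as f64 / sentences as f64 / 30.0).min(1.0);

    0.5 * length + 0.2 * diversity + 0.3 * structure
}
```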
### 3. Cache Manager (`src/cache/`)

- L1 Cache (Local): High-speed, in-memory cache backed by `moka`. Stores the most frequently accessed entries to minimize network calls.
- L2 Cache (Distributed): Redis-backed cache for persistence and for sharing state across multiple API instances.
- Semantic Matching: Uses cosine similarity on query embeddings to find "close enough" matches rather than only exact string matches (see the sketch after this list).
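
The semantic match reduces to a cosine-similarity check between the incoming query's embedding and the embeddings of cached entries. The 0.95 threshold below is an assumed example value, not the project's configured setting.

```rust
/// Cosine similarity between two embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// A cached entry counts as a hit when similarity clears the threshold.
fn is_semantic_hit(query_emb: &[f32], cached_emb: &[f32]) -> bool {
    const SIMILARITY_THRESHOLD: f32 = 0.95; // assumed example value
    cosine_similarity(query_emb, cached_emb) >= SIMILARITY_THRESHOLD
}
```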
### 4. Model Router (`src/models/`)

- Configuration: Manages a registry of available models (configured in `config.toml` or via environment variables).
- Routing Logic (see the sketch after this list):
  - If `complexity_score < threshold`: Route to the "Simple" model provider.
  - If `complexity_score >= threshold`: Route to the "Complex" model provider.
- Fallback: Can be configured to fall back to a different provider if the primary one fails.
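
The threshold comparison itself is straightforward. A sketch under assumed names follows; the `ModelTier` enum, the `RouterConfig` struct, and any default threshold are illustrative, not the project's actual types or settings.

```rust
/// Which provider tier a query should be sent to.
enum ModelTier {
    Simple,  // e.g., Llama3-8b, GPT-3.5
    Complex, // e.g., GPT-4, Claude 3 Opus
}

struct RouterConfig {
    /// Scores at or above this value go to the large model.
    complexity_threshold: f64,
}

fn route(complexity_score: f64, cfg: &RouterConfig) -> ModelTier {
    if complexity_score < cfg.complexity_threshold {
        ModelTier::Simple
    } else {
        ModelTier::Complex
    }
}
```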
## Data Flow

1. Request Ingestion: The client sends a POST request containing the query.
2. Classification: The system analyzes the query text and produces a complexity score.
3. Cache Lookup:
   - Check L1 (local).
   - Check L2 (Redis).
   - If an exact or semantic match is found, return it immediately.
4. Routing: On a cache miss, the router selects a model based on the classification result.
5. Inference: The request is forwarded to the selected model provider (e.g., via HTTP to Ollama or OpenAI).
6. Response & Caching: The result is returned to the client and asynchronously stored in the L1 and L2 caches for future use (an end-to-end sketch follows this list).
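
Put together, the happy path looks roughly like the following. The trait names and signatures are placeholders standing in for the real components, not the project's API, and the 0.6 threshold is assumed for illustration.

```rust
// Hypothetical interfaces standing in for the real components.
trait Classifier {
    fn score(&self, query: &str) -> f64;
}

trait CacheManager {
    /// L1, then L2, then semantic match.
    fn lookup(&self, query: &str) -> Option<String>;
    fn store(&self, query: &str, answer: &str);
}

trait ModelClient {
    fn infer(&self, query: &str, complex: bool) -> String;
}

fn handle_request(
    query: &str,
    classifier: &dyn Classifier,
    cache: &dyn CacheManager,
    models: &dyn ModelClient,
) -> String {
    // 2. Classification
    let score = classifier.score(query);

    // 3. Cache lookup (exact or semantic hit returns immediately)
    if let Some(hit) = cache.lookup(query) {
        return hit;
    }

    // 4-5. Routing and inference
    let answer = models.infer(query, score >= 0.6);

    // 6. Response & caching (the real system writes the cache asynchronously)
    cache.store(query, &answer);
    answer
}
```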
## Design Principles

- Async-First: Built on `tokio` to handle high concurrency.
- Fail-Fast: Timeouts and circuit breakers at every integration point (see the timeout sketch below).
- Observability: "Metrics-first" development ensures every component emits useful telemetry.
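
As one concrete illustration of fail-fast behavior, an upstream model call can be bounded with `tokio::time::timeout`. The helper name and the 10-second budget are assumptions, not the project's configuration.

```rust
use std::time::Duration;
use tokio::time::timeout;

/// Abort an upstream model call that exceeds its time budget instead of
/// letting the client request hang. The budget here is illustrative.
async fn call_with_deadline<F, T>(call: F) -> Result<T, &'static str>
where
    F: std::future::Future<Output = T>,
{
    timeout(Duration::from_secs(10), call)
        .await
        .map_err(|_| "upstream model provider timed out")
}
```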