Architecture Documentation

System Overview

The AI Inference Platform is designed as a high-performance, intelligent gateway for AI model inference. It sits between client applications and various AI model providers (e.g., OpenAI, Ollama), adding value through:

  1. Intelligent Routing: Directing queries to the most cost-effective model that can handle the complexity.
  2. Caching: Reducing latency and costs by serving repeated or semantically similar queries from cache.
  3. Observability: Providing deep insights into usage, performance, and model behavior.

Component Architecture

```mermaid
graph TD
    Client[Client Application] -->|HTTP/JSON| API[Axum HTTP API]
    
    subgraph "Core Services"
        API --> Classifier[Query Classifier]
        API --> Cache[Cache Manager]
        API --> Router[Model Router]
        
        Classifier -->|Heuristics + Embeddings| Router
        
        Cache -->|L1: Moka| LocalCache[(In-Memory)]
        Cache -->|L2: Redis| Redis[(Redis Cluster)]
    end
    
    subgraph "Model Layer"
        Router -->|Simple Queries| Simple["Small Model<br/>e.g., Llama3-8b, GPT-3.5"]
        Router -->|Complex Queries| Complex["Large Model<br/>e.g., GPT-4, Claude 3 Opus"]
    end
    
    subgraph "Observability"
        Metrics[Prometheus] -.-> API
        Metrics -.-> Router
        Metrics -.-> Cache
    end
```

1. API Layer (src/api/)

  • Framework: Built on axum, providing a robust asynchronous HTTP server.
  • Middleware: Handles cross-cutting concerns like timeouts, compression, CORS, and request tracking (IDs).
  • Endpoints: Exposes RESTful endpoints for inference (/infer), classification (/classify), and management.
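
As a rough illustration, the sketch below wires an axum Router to the /infer and /classify endpoints and attaches the cross-cutting tower-http layers mentioned above. The handler bodies, port, and exact middleware configuration are placeholders rather than the repository's actual code.

```rust
use std::time::Duration;

use axum::{routing::post, Json, Router};
use serde_json::{json, Value};
use tower_http::{compression::CompressionLayer, cors::CorsLayer, timeout::TimeoutLayer};

// Placeholder handlers; the real ones live in src/api/.
async fn infer(Json(body): Json<Value>) -> Json<Value> {
    Json(json!({ "echo": body }))
}

async fn classify(Json(body): Json<Value>) -> Json<Value> {
    Json(json!({ "complexity": 0.0, "query": body }))
}

fn app() -> Router {
    Router::new()
        .route("/infer", post(infer))
        .route("/classify", post(classify))
        // Cross-cutting middleware: timeouts, compression, CORS.
        // (Request-ID tracking is omitted here for brevity.)
        .layer(TimeoutLayer::new(Duration::from_secs(30)))
        .layer(CompressionLayer::new())
        .layer(CorsLayer::permissive())
}

#[tokio::main]
async fn main() {
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app()).await.unwrap();
}
```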

2. Query Classifier (src/classifier/)

  • Purpose: Determines the "complexity" of a user query to decide which model should handle it.
  • Mechanism:
    • Heuristics: Analyzes length, vocabulary diversity, and sentence structure.
    • Embeddings: (Optional) Uses lightweight embeddings to semantically analyze the intent complexity.
    • Scoring: Aggregates factors into a unified complexity score (0.0 - 1.0).
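
To make the scoring concrete, here is a minimal heuristic sketch. The individual factors, caps, and weights are illustrative assumptions; the actual classifier in src/classifier/ may combine these signals (and the optional embeddings) differently.

```rust
use std::collections::HashSet;

/// Rough heuristic complexity score in the range 0.0..=1.0.
fn complexity_score(query: &str) -> f32 {
    let words: Vec<&str> = query.split_whitespace().collect();
    if words.is_empty() {
        return 0.0;
    }

    // Length: longer queries tend to be more complex (capped at ~100 words).
    let length = (words.len() as f32 / 100.0).min(1.0);

    // Vocabulary diversity: unique words / total words.
    let unique: HashSet<&str> = words.iter().copied().collect();
    let diversity = unique.len() as f32 / words.len() as f32;

    // Sentence structure: more sentences suggest multi-part requests.
    let sentences = query
        .chars()
        .filter(|c| matches!(c, '.' | '?' | '!'))
        .count()
        .max(1);
    let structure = (sentences as f32 / 5.0).min(1.0);

    // Aggregate with illustrative weights.
    (0.5 * length + 0.3 * diversity + 0.2 * structure).clamp(0.0, 1.0)
}
```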

3. Cache Manager (src/cache/)

  • L1 Cache (Local): High-speed, in-memory cache built on moka. Stores the most frequently accessed data to minimize network calls.
  • L2 Cache (Distributed): Redis-backed cache for persistence and sharing state across multiple API instances.
  • Semantic Matching: Uses cosine similarity on query embeddings to find "close enough" matches, not just exact string matches.
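
The semantic matching step can be pictured as a cosine-similarity comparison over query embeddings, as in the simplified in-memory sketch below. The real cache manager works against moka and Redis, and the similarity threshold is an assumption.

```rust
/// Cosine similarity between two embedding vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Hypothetical semantic lookup: scan cached (embedding, response) pairs and
/// return the first entry whose similarity clears the threshold.
fn semantic_lookup<'a>(
    query_embedding: &[f32],
    entries: &'a [(Vec<f32>, String)],
    threshold: f32,
) -> Option<&'a str> {
    entries
        .iter()
        .find(|(emb, _)| cosine_similarity(query_embedding, emb) >= threshold)
        .map(|(_, response)| response.as_str())
}
```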

4. Model Router (src/models/)

  • Configuration: Manages a registry of available models (configured in config.toml or env vars).
  • Routing Logic (sketched below):
    • If complexity_score < threshold: Route to "Simple" model provider.
    • If complexity_score >= threshold: Route to "Complex" model provider.
  • Fallback: Can be configured to fall back to a different provider if the primary one fails.
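
A minimal sketch of the threshold check, with a hypothetical Provider enum standing in for the registry loaded from config.toml (fallback handling is omitted):

```rust
/// Hypothetical provider tiers; the real registry is loaded from
/// config.toml or environment variables.
enum Provider {
    Simple,  // e.g. Llama3-8b / GPT-3.5 class models
    Complex, // e.g. GPT-4 / Claude 3 Opus class models
}

/// Route below the threshold to the cheaper model, at or above it to the
/// larger one.
fn route(complexity_score: f32, threshold: f32) -> Provider {
    if complexity_score < threshold {
        Provider::Simple
    } else {
        Provider::Complex
    }
}
```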

Data Flow

  1. Request Ingestion: Client sends a POST request with the query.
  2. Classification: The system analyzes the query text.
  3. Cache Lookup:
    1. Check L1 (Local).
    2. Check L2 (Redis).
    3. If an exact or semantic match is found, return it immediately.
  4. Routing: If no cache hit, the router selects a model based on the classification result.
  5. Inference: The request is forwarded to the selected model provider (e.g., via HTTP to Ollama or OpenAI).
  6. Response & Caching: The result is sent back to the client and asynchronously stored in L1 and L2 caches for future use.
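
The whole flow can be summarized in one hypothetical handler. The Cache and Provider traits and the handle_infer signature below are illustrative stand-ins for the real types in src/cache/ and src/models/, not the project's actual API.

```rust
/// Hypothetical interfaces standing in for the real cache and model clients.
trait Cache {
    async fn lookup(&self, query: &str) -> Option<String>; // L1, then L2 (incl. semantic)
    async fn store(&self, query: &str, response: &str);
}

trait Provider {
    async fn infer(&self, query: &str) -> Result<String, String>;
}

// Stand-in for the classifier's heuristic score (see the earlier sketch).
fn complexity_score(query: &str) -> f32 {
    (query.split_whitespace().count() as f32 / 100.0).min(1.0)
}

/// End-to-end flow for a single /infer request, mirroring the steps above.
async fn handle_infer(
    query: &str,
    cache: &impl Cache,
    simple: &impl Provider,
    complex: &impl Provider,
    threshold: f32,
) -> Result<String, String> {
    // 2. Classification.
    let score = complexity_score(query);

    // 3. Cache lookup: an exact or semantic hit short-circuits the pipeline.
    if let Some(hit) = cache.lookup(query).await {
        return Ok(hit);
    }

    // 4.-5. Route on the complexity score and call the selected provider.
    let response = if score < threshold {
        simple.infer(query).await?
    } else {
        complex.infer(query).await?
    };

    // 6. Store the result for future requests (the real implementation does
    //    this asynchronously so the client is not blocked on cache writes).
    cache.store(query, &response).await;

    Ok(response)
}
```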

Design Principles

  • Async-First: Built on tokio for handling high concurrency.
  • Fail-Fast: Timeouts and circuit breakers at every integration point (see the timeout sketch below).
  • Observability: "Metrics-first" development ensures every component emits useful telemetry.
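
As an example of the fail-fast principle, a provider call can be wrapped in an explicit deadline. The snippet assumes reqwest as the outbound HTTP client and omits the circuit-breaker half; the function name and error handling are illustrative.

```rust
use std::time::Duration;

/// Fail-fast wrapper around an upstream provider call: if the provider does
/// not respond within the deadline, return an error instead of hanging.
async fn infer_with_timeout(
    client: &reqwest::Client,
    url: &str,
    body: serde_json::Value,
    deadline: Duration,
) -> Result<serde_json::Value, String> {
    let request = client.post(url).json(&body).send();

    match tokio::time::timeout(deadline, request).await {
        Ok(Ok(resp)) => resp.json().await.map_err(|e| e.to_string()),
        Ok(Err(e)) => Err(format!("upstream error: {e}")),
        Err(_) => Err(format!("timed out after {deadline:?}")),
    }
}
```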