Architecture Documentation

System Overview

The AI Inference Platform is designed as a high-performance, intelligent gateway for AI model inference. It sits between client applications and various AI model providers (e.g., OpenAI, Ollama), adding value through:

  1. Intelligent Routing: Directing queries to the most cost-effective model that can handle the complexity.
  2. Caching: Reducing latency and costs by serving repeated or semantically similar queries from cache.
  3. Observability: Providing deep insights into usage, performance, and model behavior.

Component Architecture

```mermaid
graph TD
    Client[Client Application] -->|HTTP/JSON| API[Axum HTTP API]
    
    subgraph "Core Services"
        API --> Classifier[Query Classifier]
        API --> Cache[Cache Manager]
        API --> Router[Model Router]
        
        Classifier -->|Heuristics + Embeddings| Router
        
        Cache -->|L1: Moka| LocalCache[(In-Memory)]
        Cache -->|L2: Redis| Redis[(Redis Cluster)]
    end
    
    subgraph "Model Layer"
        Router -->|Simple Queries| Simple["Small Model<br/>e.g., Llama3-8b, GPT-3.5"]
        Router -->|Complex Queries| Complex["Large Model<br/>e.g., GPT-4, Claude 3 Opus"]
    end
    
    subgraph "Observability"
        Metrics[Prometheus] -.-> API
        Metrics -.-> Router
        Metrics -.-> Cache
    end
```

1. API Layer (src/api/)

  • Framework: Built on axum, providing a robust asynchronous HTTP server.
  • Middleware: Handles cross-cutting concerns like timeouts, compression, CORS, and request tracking (IDs).
  • Endpoints: Exposes RESTful endpoints for inference (/infer), classification (/classify), and management.
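
As a rough illustration, the sketch below wires an axum Router to the /infer and /classify endpoints and attaches the cross-cutting tower-http layers mentioned above. The handler bodies, port, and exact middleware configuration are placeholders rather than the repository's actual code.

```rust
use std::time::Duration;

use axum::{routing::post, Json, Router};
use serde_json::{json, Value};
use tower_http::{compression::CompressionLayer, cors::CorsLayer, timeout::TimeoutLayer};

// Placeholder handlers; the real ones live in src/api/.
async fn infer(Json(body): Json<Value>) -> Json<Value> {
    Json(json!({ "echo": body }))
}

async fn classify(Json(body): Json<Value>) -> Json<Value> {
    Json(json!({ "complexity": 0.0, "query": body }))
}

fn app() -> Router {
    Router::new()
        .route("/infer", post(infer))
        .route("/classify", post(classify))
        // Cross-cutting middleware: timeouts, compression, CORS.
        // (Request-ID tracking is omitted here for brevity.)
        .layer(TimeoutLayer::new(Duration::from_secs(30)))
        .layer(CompressionLayer::new())
        .layer(CorsLayer::permissive())
}

#[tokio::main]
async fn main() {
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app()).await.unwrap();
}
```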

2. Query Classifier (src/classifier/)

  • Purpose: Determines the "complexity" of a user query to decide which model should handle it.
  • Mechanism:
    • Heuristics: Analyzes length, vocabulary diversity, and sentence structure.
    • Embeddings: (Optional) Uses lightweight embeddings to semantically analyze the intent complexity.
    • Scoring: Aggregates factors into a unified complexity score (0.0 - 1.0).
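
To make the scoring concrete, here is a minimal heuristic sketch. The individual factors, caps, and weights are illustrative assumptions; the actual classifier in src/classifier/ may combine these signals (and the optional embeddings) differently.

```rust
use std::collections::HashSet;

/// Rough heuristic complexity score in the range 0.0..=1.0.
fn complexity_score(query: &str) -> f32 {
    let words: Vec<&str> = query.split_whitespace().collect();
    if words.is_empty() {
        return 0.0;
    }

    // Length: longer queries tend to be more complex (capped at ~100 words).
    let length = (words.len() as f32 / 100.0).min(1.0);

    // Vocabulary diversity: unique words / total words.
    let unique: HashSet<&str> = words.iter().copied().collect();
    let diversity = unique.len() as f32 / words.len() as f32;

    // Sentence structure: more sentences suggest multi-part requests.
    let sentences = query
        .chars()
        .filter(|c| matches!(c, '.' | '?' | '!'))
        .count()
        .max(1);
    let structure = (sentences as f32 / 5.0).min(1.0);

    // Aggregate with illustrative weights.
    (0.5 * length + 0.3 * diversity + 0.2 * structure).clamp(0.0, 1.0)
}
```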

3. Cache Manager (src/cache/)

  • L1 Cache (Local): High-speed, in-memory cache built on moka. Stores the most frequently accessed data to minimize network calls.
  • L2 Cache (Distributed): Redis-backed cache for persistence and sharing state across multiple API instances.
  • Semantic Matching: Uses cosine similarity on query embeddings to find "close enough" matches, not just exact string matches.
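
The semantic matching step can be pictured as a cosine-similarity comparison over query embeddings, as in the simplified in-memory sketch below. The real cache manager works against moka and Redis, and the similarity threshold is an assumption.

```rust
/// Cosine similarity between two embedding vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Hypothetical semantic lookup: scan cached (embedding, response) pairs and
/// return the first entry whose similarity clears the threshold.
fn semantic_lookup<'a>(
    query_embedding: &[f32],
    entries: &'a [(Vec<f32>, String)],
    threshold: f32,
) -> Option<&'a str> {
    entries
        .iter()
        .find(|(emb, _)| cosine_similarity(query_embedding, emb) >= threshold)
        .map(|(_, response)| response.as_str())
}
```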

4. Model Router (src/models/)

  • Configuration: Manages a registry of available models (configured in config.toml or env vars).
  • Routing Logic (sketched below):
    • If complexity_score < threshold: Route to "Simple" model provider.
    • If complexity_score >= threshold: Route to "Complex" model provider.
  • Fallback: Can be configured to fall back to a different provider if the primary one fails.
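
A minimal sketch of the threshold check, with a hypothetical Provider enum standing in for the registry loaded from config.toml (fallback handling is omitted):

```rust
/// Hypothetical provider tiers; the real registry is loaded from
/// config.toml or environment variables.
enum Provider {
    Simple,  // e.g. Llama3-8b / GPT-3.5 class models
    Complex, // e.g. GPT-4 / Claude 3 Opus class models
}

/// Route below the threshold to the cheaper model, at or above it to the
/// larger one.
fn route(complexity_score: f32, threshold: f32) -> Provider {
    if complexity_score < threshold {
        Provider::Simple
    } else {
        Provider::Complex
    }
}
```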

Data Flow

  1. Request Ingestion: Client sends a POST request with the query.
  2. Classification: The system analyzes the query text.
  3. Cache Lookup:
    1. Check L1 (Local).
    2. Check L2 (Redis).
    3. If an exact or semantic match is found, return it immediately.
  4. Routing: If no cache hit, the router selects a model based on the classification result.
  5. Inference: The request is forwarded to the selected model provider (e.g., via HTTP to Ollama or OpenAI).
  6. Response & Caching: The result is sent back to the client and asynchronously stored in L1 and L2 caches for future use.
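
The whole flow can be summarized in one hypothetical handler. The Cache and Provider traits and the handle_infer signature below are illustrative stand-ins for the real types in src/cache/ and src/models/, not the project's actual API.

```rust
/// Hypothetical interfaces standing in for the real cache and model clients.
trait Cache {
    async fn lookup(&self, query: &str) -> Option<String>; // L1, then L2 (incl. semantic)
    async fn store(&self, query: &str, response: &str);
}

trait Provider {
    async fn infer(&self, query: &str) -> Result<String, String>;
}

// Stand-in for the classifier's heuristic score (see the earlier sketch).
fn complexity_score(query: &str) -> f32 {
    (query.split_whitespace().count() as f32 / 100.0).min(1.0)
}

/// End-to-end flow for a single /infer request, mirroring the steps above.
async fn handle_infer(
    query: &str,
    cache: &impl Cache,
    simple: &impl Provider,
    complex: &impl Provider,
    threshold: f32,
) -> Result<String, String> {
    // 2. Classification.
    let score = complexity_score(query);

    // 3. Cache lookup: an exact or semantic hit short-circuits the pipeline.
    if let Some(hit) = cache.lookup(query).await {
        return Ok(hit);
    }

    // 4.-5. Route on the complexity score and call the selected provider.
    let response = if score < threshold {
        simple.infer(query).await?
    } else {
        complex.infer(query).await?
    };

    // 6. Store the result for future requests (the real implementation does
    //    this asynchronously so the client is not blocked on cache writes).
    cache.store(query, &response).await;

    Ok(response)
}
```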

Design Principles

  • Async-First: Built on tokio for handling high concurrency.
  • Fail-Fast: Timeouts and circuit breakers at every integration point (see the timeout sketch below).
  • Observability: "Metrics-first" development ensures every component emits useful telemetry.
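
As an example of the fail-fast principle, a provider call can be wrapped in an explicit deadline. The snippet assumes reqwest as the outbound HTTP client and omits the circuit-breaker half; the function name and error handling are illustrative.

```rust
use std::time::Duration;

/// Fail-fast wrapper around an upstream provider call: if the provider does
/// not respond within the deadline, return an error instead of hanging.
async fn infer_with_timeout(
    client: &reqwest::Client,
    url: &str,
    body: serde_json::Value,
    deadline: Duration,
) -> Result<serde_json::Value, String> {
    let request = client.post(url).json(&body).send();

    match tokio::time::timeout(deadline, request).await {
        Ok(Ok(resp)) => resp.json().await.map_err(|e| e.to_string()),
        Ok(Err(e)) => Err(format!("upstream error: {e}")),
        Err(_) => Err(format!("timed out after {deadline:?}")),
    }
}
```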