Architecture Documentation

System Overview

The AI Inference Platform is designed as a high-performance, intelligent gateway for AI model inference. It sits between client applications and various AI model providers (e.g., OpenAI, Ollama), adding value through:

  1. Intelligent Routing: Directing queries to the most cost-effective model that can handle the complexity.
  2. Caching: Reducing latency and costs by serving repeated or semantically similar queries from cache.
  3. Observability: Providing deep insights into usage, performance, and model behavior.

Component Architecture

1. API Layer (src/api/)

  • Framework: Built on axum, providing a robust asynchronous HTTP server.
  • Middleware: Handles cross-cutting concerns like timeouts, compression, CORS, and request tracking (IDs).
  • Endpoints: Exposes RESTful endpoints for inference (/infer), classification (/classify), and management.

2. Query Classifier (src/classifier/)

  • Purpose: Determines the "complexity" of a user query to decide which model should handle it.
  • Mechanism:
    • Heuristics: Analyzes length, vocabulary diversity, and sentence structure.
    • Embeddings (optional): Uses lightweight embeddings to gauge the semantic complexity of the query's intent.
    • Scoring: Aggregates factors into a unified complexity score (0.0 - 1.0).
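The scoring step can be sketched as a weighted aggregate of the heuristics above. This is a minimal illustration, not the platform's actual implementation: the function name, the weights, and the normalization caps are all assumptions.

```rust
use std::collections::HashSet;

/// Hypothetical heuristic scorer combining length, vocabulary
/// diversity, and sentence structure into a score in [0.0, 1.0].
fn complexity_score(query: &str) -> f64 {
    let words: Vec<&str> = query.split_whitespace().collect();
    if words.is_empty() {
        return 0.0;
    }
    // Length factor: longer queries tend to be more complex (capped at 100 words).
    let length = (words.len() as f64 / 100.0).min(1.0);
    // Vocabulary diversity: ratio of unique words to total words.
    let unique: HashSet<&str> = words.iter().copied().collect();
    let diversity = unique.len() as f64 / words.len() as f64;
    // Sentence structure: more sentences suggest a multi-part request.
    let sentences = query
        .chars()
        .filter(|c| matches!(c, '.' | '?' | '!'))
        .count()
        .max(1);
    let structure = (sentences as f64 / 5.0).min(1.0);
    // Weights are illustrative; the real aggregation may differ.
    0.5 * length + 0.3 * diversity + 0.2 * structure
}
```

Because every factor is normalized before weighting, the aggregate stays within the 0.0 - 1.0 range the router expects.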

3. Cache Manager (src/cache/)

  • L1 Cache (Local): High-speed, in-memory LRU cache using moka. Stores the most frequently accessed data to minimize network calls.
  • L2 Cache (Distributed): Redis-backed cache for persistence and sharing state across multiple API instances.
  • Semantic Matching: Uses cosine similarity on query embeddings to find "close enough" matches, not just exact string matches.
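Semantic matching boils down to comparing embedding vectors. The sketch below shows the cosine-similarity test in isolation; the helper names and the idea of a configurable threshold are assumptions, and in the real system the embeddings would come from the classifier's embedding model.

```rust
/// Cosine similarity between two embedding vectors: the dot product
/// divided by the product of the vectors' magnitudes.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0; // Degenerate vectors never match.
    }
    dot / (norm_a * norm_b)
}

/// A cached entry is "close enough" when its similarity to the query's
/// embedding clears the configured threshold (hypothetical helper).
fn is_semantic_hit(query: &[f32], cached: &[f32], threshold: f32) -> bool {
    cosine_similarity(query, cached) >= threshold
}
```

A high threshold (e.g. 0.95) keeps false hits rare at the cost of fewer cache hits; tuning it trades correctness risk against cost savings.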

4. Model Router (src/models/)

  • Configuration: Manages a registry of available models (configured in config.toml or env vars).
  • Routing Logic:
    • If complexity_score < threshold: Route to "Simple" model provider.
    • If complexity_score >= threshold: Route to "Complex" model provider.
  • Fallback: Can be configured to fall back to a different provider if the primary one fails.
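The threshold rule above can be expressed directly. This is a sketch of the routing decision only; the `Provider` variants and the example threshold are illustrative stand-ins for whatever is configured in config.toml.

```rust
/// Illustrative provider tiers; the real registry is configured
/// in config.toml or via environment variables.
#[derive(Debug, PartialEq)]
enum Provider {
    Simple,  // e.g. a small local Ollama model
    Complex, // e.g. a larger hosted model such as OpenAI's
}

/// Route by comparing the classifier's score against the threshold:
/// below the threshold goes to Simple, at or above goes to Complex.
fn route(complexity_score: f64, threshold: f64) -> Provider {
    if complexity_score < threshold {
        Provider::Simple
    } else {
        Provider::Complex
    }
}
```

Note that a score exactly at the threshold routes to the Complex provider, matching the `>=` rule above.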

Data Flow

  1. Request Ingestion: Client sends a POST request with the query.
  2. Classification: The system analyzes the query text.
  3. Cache Lookup:
    1. Check L1 (Local).
    2. Check L2 (Redis).
    3. If an exact or semantic match is found in either tier, return the cached response immediately.
  4. Routing: If no cache hit, the router selects a model based on the classification result.
  5. Inference: The request is forwarded to the selected model provider (e.g., via HTTP to Ollama or OpenAI).
  6. Response & Caching: The result is sent back to the client and asynchronously stored in L1 and L2 caches for future use.
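The flow above can be condensed into a single synchronous sketch. Plain `HashMap`s stand in for the moka and Redis tiers, and inference is stubbed out; in the real system every step is asynchronous and the caches are keyed by embeddings as well as exact text.

```rust
use std::collections::HashMap;

/// Simplified request path: L1 lookup, L2 lookup with promotion,
/// then inference plus write-back to both cache tiers.
fn handle_query(
    query: &str,
    l1: &mut HashMap<String, String>, // stand-in for the moka L1 cache
    l2: &mut HashMap<String, String>, // stand-in for the Redis L2 cache
) -> String {
    // Step 3a: check L1 (local).
    if let Some(hit) = l1.get(query) {
        return hit.clone();
    }
    // Step 3b: check L2 (distributed); promote hits into L1.
    if let Some(hit) = l2.get(query).cloned() {
        l1.insert(query.to_string(), hit.clone());
        return hit;
    }
    // Steps 4-5: no cache hit, so route and run inference (stubbed here).
    let response = format!("model response to: {query}");
    // Step 6: store in both tiers for future requests.
    l1.insert(query.to_string(), response.clone());
    l2.insert(query.to_string(), response.clone());
    response
}
```

Promoting L2 hits into L1 means a query that is popular across instances eventually gets served without any network call at all.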

Design Principles

  • Async-First: Built on tokio for handling high concurrency.
  • Fail-Fast: Timeouts and circuit breakers at every integration point.
  • Observability: "Metrics-first" development ensures every component emits useful telemetry.