Load Benchmark
Sustained 100k routing benchmark and memory results under high async load.
Load Benchmark Results (100k Requests)
To understand the core routing performance, caching layer, and baseline footprint of the AI Inference Platform independent of external LLM providers (e.g., Ollama HTTP network latency or model limits), we simulated sustained high-concurrency traffic using a mock LLM setup.
Test Environment
- Platform Base Context: Release-mode compilation (`cargo build --release`).
- Endpoint tested: POST http://127.0.0.1:8080/api/v1/infer
- Mock Configuration: Simulated constant 5 ms model processing time; cache lookups skipped.
- Concurrency Setup: 200 concurrent asynchronous HTTP connections.
- Total Requests: 100,000
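The mock configuration above can be sketched as a tiny stand-in for the real model call. This is an illustrative sketch only: `mock_infer` is a hypothetical name, and the real mock lives behind the platform's /api/v1/infer handler; the key property is a constant 5 ms simulated model cost with no cache lookup.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the benchmark's mock LLM: it performs no
/// cache lookup and simulates a constant 5 ms of model processing
/// before returning a canned completion.
fn mock_infer(prompt: &str) -> String {
    thread::sleep(Duration::from_millis(5)); // simulated model cost
    format!("mock completion for: {}", prompt)
}

fn main() {
    let start = Instant::now();
    let reply = mock_infer("hello");
    let elapsed = start.elapsed();
    // The call should take at least the injected 5 ms.
    assert!(elapsed >= Duration::from_millis(5));
    println!("reply = {:?}, took {:?}", reply, elapsed);
}
```

Because the simulated cost is constant, any latency spread observed in the results below comes from queuing and scheduling, not from the model itself.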
Target Hardware
- MacBook Air (Apple Silicon M-Series)
- macOS 14+
Summary Results
The platform easily managed sustained 100,000 asynchronous inferences.
| Metric | Result |
|---|---|
| Total Requests Processed | 100,000 |
| Success Rate | 100% (0 Failed, 0 Error) |
| Total Test Duration | 110.10s |
| Actual RPS (Requests per Second) | 908.27 req/s |
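The throughput figure follows directly from the table: requests per second is total requests divided by total duration. A quick sanity check:

```rust
fn main() {
    // Figures taken from the results table above.
    let total_requests = 100_000f64;
    let duration_s = 110.10f64;

    let rps = total_requests / duration_s;
    // 100,000 / 110.10 ≈ 908.27 req/s, matching the reported RPS.
    assert!((rps - 908.27).abs() < 0.01);
    println!("actual RPS = {:.2}", rps);
}
```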
Latency Distribution
With a 5 ms injected baseline LLM cost and 200 concurrent tasks generating heavy asynchronous queuing (backpressure), the platform scheduled requests fairly:
- 50th %ile: 181.20 ms
- 90th %ile: 342.37 ms
- 95th %ile: 463.76 ms
- 99th %ile: 913.97 ms
- Max Latency: 2526.12 ms
(Note: Maximum latencies reflect TCP batch scheduling and the artificial queuing inherent to the mock. No requests were dropped.)
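Percentile figures like the ones above are typically computed from the sorted per-request latencies. As a sketch, here is the common nearest-rank convention; the actual benchmark harness may use linear interpolation instead, so treat this as one plausible implementation rather than the platform's own:

```rust
/// Nearest-rank percentile over a pre-sorted slice of latencies (ms).
fn percentile(sorted_ms: &[f64], p: f64) -> f64 {
    assert!(!sorted_ms.is_empty() && (0.0..=100.0).contains(&p));
    // rank = ceil(p/100 * n), converted to a zero-based index.
    let rank = ((p / 100.0) * sorted_ms.len() as f64).ceil() as usize;
    sorted_ms[rank.saturating_sub(1).min(sorted_ms.len() - 1)]
}

fn main() {
    // Illustrative sample, not the real 100k-request dataset.
    let mut samples = vec![120.0, 181.2, 95.5, 342.4, 463.8, 913.9, 2526.1, 150.0];
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    println!("p50 = {}", percentile(&samples, 50.0)); // p50 = 181.2
    println!("p99 = {}", percentile(&samples, 99.0)); // p99 = 2526.1
}
```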
Memory Footprint (M-Profile)
The core ai-inference-platform Rust binary is highly memory-efficient. Because it carries no local LLM weights (the llama.cpp process runs entirely inside Ollama), the HTTP router and orchestrator maintain a tiny footprint:
- Idle Resident Set Size: ~12-15 MB RAM
- During 200 Concurrent Load Burst: ~30-38 MB RAM
- The memory footprint never exceeds a few dozen megabytes.
Analysis
This benchmark demonstrates high availability and predictable request queuing, with no degradation or crashes at the proxy layer. The Rust application amortizes cost predictably across concurrent incoming connections, confirming that end-to-end latency is strictly bounded by the configured LLM latency rather than by network socket exhaustion or internal processing overhead.