Load Benchmark
Sustained 100k routing benchmark and memory results under high async load.
Load Benchmark Results (100k Requests)
To understand the core routing performance, caching layer, and baseline footprint of the AI Inference Platform independent of external LLM providers (e.g., Ollama HTTP network latency or model limits), we simulated sustained high-concurrency traffic using a mock LLM setup.
Test Environment
- Platform Base Context: Release-mode compilation (`cargo build --release`).
- Endpoint tested: POST http://127.0.0.1:8080/api/v1/infer
- Mock Configuration: Simulated constant 5 ms model processing time; cache lookups skipped.
- Concurrency Setup: 200 concurrent asynchronous HTTP connections.
- Total Requests: 100,000
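The mock configuration above can be sketched as a tiny stand-in for the real model call. This is an illustrative sketch only: `mock_infer` is a hypothetical name, and the real mock lives behind the platform's /api/v1/infer handler; the key property is a constant 5 ms simulated model cost with no cache lookup.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the benchmark's mock LLM: it performs no
/// cache lookup and simulates a constant 5 ms of model processing
/// before returning a canned completion.
fn mock_infer(prompt: &str) -> String {
    thread::sleep(Duration::from_millis(5)); // simulated model cost
    format!("mock completion for: {}", prompt)
}

fn main() {
    let start = Instant::now();
    let reply = mock_infer("hello");
    let elapsed = start.elapsed();
    // The call should take at least the injected 5 ms.
    assert!(elapsed >= Duration::from_millis(5));
    println!("reply = {:?}, took {:?}", reply, elapsed);
}
```

Because the simulated cost is constant, any latency spread observed in the results below comes from queuing and scheduling, not from the model itself.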
Target Hardware
- MacBook Air (Apple Silicon M-Series)
- macOS 14+
Summary Results
The platform easily managed sustained 100,000 asynchronous inferences.
| Metric | Result |
|---|---|
| Total Requests Processed | 100,000 |
| Success Rate | 100% (0 Failed, 0 Error) |
| Total Test Duration | 110.10s |
| Actual RPS (Requests per Second) | 908.27 req/s |
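The throughput figure follows directly from the table: requests per second is total requests divided by total duration. A quick sanity check:

```rust
fn main() {
    // Figures taken from the results table above.
    let total_requests = 100_000f64;
    let duration_s = 110.10f64;

    let rps = total_requests / duration_s;
    // 100,000 / 110.10 ≈ 908.27 req/s, matching the reported RPS.
    assert!((rps - 908.27).abs() < 0.01);
    println!("actual RPS = {:.2}", rps);
}
```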
Latency Distribution
With a 5 ms injected baseline LLM cost and 200 concurrent tasks generating heavy asynchronous queuing (backpressure), the platform scheduled requests fairly:
- 50th %ile: 181.20 ms
- 90th %ile: 342.37 ms
- 95th %ile: 463.76 ms
- 99th %ile: 913.97 ms
- Max Latency: 2526.12 ms
(Note: Maximum latencies reflect TCP batch scheduling and the artificial queuing inherent to the mock. No requests were dropped.)
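Percentile figures like the ones above are typically computed from the sorted per-request latencies. As a sketch, here is the common nearest-rank convention; the actual benchmark harness may use linear interpolation instead, so treat this as one plausible implementation rather than the platform's own:

```rust
/// Nearest-rank percentile over a pre-sorted slice of latencies (ms).
fn percentile(sorted_ms: &[f64], p: f64) -> f64 {
    assert!(!sorted_ms.is_empty() && (0.0..=100.0).contains(&p));
    // rank = ceil(p/100 * n), converted to a zero-based index.
    let rank = ((p / 100.0) * sorted_ms.len() as f64).ceil() as usize;
    sorted_ms[rank.saturating_sub(1).min(sorted_ms.len() - 1)]
}

fn main() {
    // Illustrative sample, not the real 100k-request dataset.
    let mut samples = vec![120.0, 181.2, 95.5, 342.4, 463.8, 913.9, 2526.1, 150.0];
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    println!("p50 = {}", percentile(&samples, 50.0)); // p50 = 181.2
    println!("p99 = {}", percentile(&samples, 99.0)); // p99 = 2526.1
}
```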
Memory Footprint (M-Profile)
The core ai-inference-platform Rust binary is highly memory-efficient. Because it carries no local LLM weights (the llama.cpp process runs entirely inside Ollama), the HTTP router and orchestrator maintain a tiny footprint:
- Idle Resident Set Size: ~12-15 MB RAM
- During 200 Concurrent Load Burst: ~30-38 MB RAM
- The memory footprint never exceeds a few dozen megabytes.
Analysis
This benchmark demonstrates high availability and predictable request queuing, with no degradation or crashes at the proxy layer. The Rust application amortizes cost predictably across concurrent incoming connections, confirming that end-to-end latency is strictly bounded by the configured LLM latency rather than by network socket exhaustion or internal processing overhead.