# Benchmarking Results & Analysis
This document explains the performance benchmarks for the AI Inference Platform. Validating the performance of "hot paths" (code executed frequently) is critical for high-throughput API servers.
## Overview
The benchmarks measure the execution time of critical components in the inference pipeline:
- API Layer: JSON serialization and deserialization of requests.
- Inference Logic: Query classification heuristics.
- Caching: Cache key generation and vector similarity calculations.
- Embeddings: Generation of embedding vectors (mocked for this benchmark).
## Method
Benchmarks are written using Criterion.rs, a statistics-driven benchmarking library for Rust. It provides precise measurements by running thousands of iterations and statistically analyzing the results to filter out noise (outliers).
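Conceptually, Criterion's approach can be illustrated with a stdlib-only timing loop. This is a simplified sketch, not Criterion's actual API: Criterion additionally warms up the cache, resamples, and statistically rejects outliers.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Run `f` for `iters` iterations and return the mean time per call in
/// nanoseconds. A toy version of what Criterion does under the hood.
fn mean_ns<F: FnMut() -> R, R>(mut f: F, iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the compiler from optimizing the call away.
        black_box(f());
    }
    start.elapsed().as_nanos() as f64 / iters as f64
}

fn main() {
    let t = mean_ns(|| "hello".to_uppercase(), 10_000);
    println!("~{t:.1} ns per call");
}
```

Criterion wraps this idea in a statistical harness, which is why its reported numbers are stable across runs.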
To run the benchmarks yourself, use the following command:
```bash
cargo bench
```

(Note: the repository is currently private; integrating the benchmark suite into a Docker image is planned.)

## Detailed Results Analysis
The following analysis is based on the benchmark run provided.
### 1. API Hot Path (`api_benchmark.rs`)
This group measures the overhead of handling HTTP request data.
| Benchmark Case | Time (approx.) | Description |
|---|---|---|
| `serialize_infer_request` | 317 ns | Converts the internal `InferRequest` struct into a JSON string. This happens when the server sends a response or forwards a request. |
| `deserialize_infer_request` | 287 ns | Parses a raw JSON string into an `InferRequest` struct. This happens for every incoming API call. |
Interpretation: The serialization/deserialization overhead is extremely low (less than 1 microsecond). This means the framework introduces negligible latency compared to the actual model inference (which usually takes milliseconds or seconds).
We plan to evaluate biv as an alternative to see whether this performance can be improved further.
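To give a feel for the work that the ~317 ns covers, here is a hand-rolled sketch of serializing such a request: walking the fields, escaping strings, and building the output buffer. The `InferRequest` fields shown here are assumptions, and the real service would presumably use `serde_json` rather than manual string building.

```rust
/// Hypothetical request shape; the actual `InferRequest` fields are
/// assumptions made for illustration.
struct InferRequest {
    model: String,
    prompt: String,
    max_tokens: u32,
}

/// Manual JSON serialization, shown only to make the per-request work
/// concrete. A production service would use serde_json instead.
fn to_json(r: &InferRequest) -> String {
    // Escape quotes and backslashes so the output is valid JSON.
    fn esc(s: &str) -> String {
        s.chars()
            .map(|c| match c {
                '"' => "\\\"".to_string(),
                '\\' => "\\\\".to_string(),
                c => c.to_string(),
            })
            .collect()
    }
    format!(
        "{{\"model\":\"{}\",\"prompt\":\"{}\",\"max_tokens\":{}}}",
        esc(&r.model),
        esc(&r.prompt),
        r.max_tokens
    )
}
```

Even with string allocation and escaping, this kind of field-by-field walk is cheap relative to model inference, which is consistent with the sub-microsecond numbers above.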
### 2. Query Classification (`inference_benchmark.rs`)
This group measures the heuristic logic used to route queries (e.g., deciding if a query is "simple" or "complex").
| Benchmark Case | Time (approx.) | Description |
|---|---|---|
| `simple_query` | 48 ns | Analyzing a short string ("What is the weather today?"). |
| `complex_query` | 189 ns | Analyzing a longer, multi-sentence prompt. |
Interpretation: The classification logic is virtually instantaneous. The difference in time (~140ns) is due to the O(N) nature of iterating over characters and words in longer strings.
The current heuristic is still crude; we are investigating more efficient approaches, which will require further research.
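A heuristic of this kind can be sketched as follows. The thresholds and signals here are illustrative assumptions, not the platform's actual rules; the point is that the cost grows linearly with input length, matching the ~48 ns vs. ~189 ns gap above.

```rust
/// Illustrative O(N) routing heuristic (thresholds are assumptions):
/// long prompts or multi-sentence prompts are routed as "complex".
fn classify(query: &str) -> &'static str {
    // Both passes walk the string once, so cost scales with length.
    let words = query.split_whitespace().count();
    let sentences = query.matches(|c| c == '.' || c == '?' || c == '!').count();
    if words > 20 || sentences > 1 {
        "complex"
    } else {
        "simple"
    }
}
```

Because the heuristic is a pair of linear scans with no allocation, even "slow" cases stay in the hundreds of nanoseconds.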
### 3. Caching Operations (`inference_benchmark.rs`)
This group measures operations required for the semantic cache (vector database lookups).
| Benchmark Case | Time (approx.) | Description |
|---|---|---|
| `cache_key_gen` | 276 ns | Normalizing the input text (lowercase, trim) and hashing it to create a lookup key. |
| `cosine_similarity` | 1.19 µs | Calculating the similarity between two 384-dimensional vectors. This is the core mathematical operation for finding "similar" past queries. |
Interpretation:
- Key Gen: Fast and efficient.
- Cosine Similarity: Takes ~1.2 microseconds per comparison. If the cache has to scan 1,000 vectors linearly, it would take ~1.2ms. This highlights why indexing (like HNSW or FAISS) is necessary for large datasets, though for small batches, linear scan is acceptable.
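Both operations can be sketched in a few lines of stdlib Rust. The exact normalization and hash function used by the cache are assumptions; only "lowercase, trim, hash" and the cosine formula come from the descriptions above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Normalize the query text and hash it into a 64-bit lookup key
/// (the real key scheme is assumed to be lowercase + trim + hash).
fn cache_key(text: &str) -> u64 {
    let normalized = text.trim().to_lowercase();
    let mut h = DefaultHasher::new();
    normalized.hash(&mut h);
    h.finish()
}

/// Cosine similarity between two equal-length vectors:
/// dot(a, b) / (|a| * |b|).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}
```

For 384-dimensional vectors, `cosine_similarity` performs roughly 3 × 384 multiply-adds plus two square roots, which is why each comparison lands around a microsecond and why a linear scan over thousands of entries calls for an index.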
### 4. Embedding Generation (`inference_benchmark.rs`)
| Benchmark Case | Time (approx.) | Description |
|---|---|---|
| `mock_embedding_384d` | 1.30 µs | Generates a synthetic 384-dimensional vector and normalizes it. |
Interpretation:
This benchmarks a mock implementation. In a real production environment, generating an embedding using a model like BERT or OpenAI's text-embedding-3 would take significantly longer (10ms - 500ms) and would likely be offloaded to a GPU or external API.
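A mock of this shape can be sketched as follows. The value formula below is an assumption made for illustration; only the dimension (384) and the normalization step come from the benchmark description.

```rust
/// Generate a deterministic synthetic vector and L2-normalize it,
/// mirroring what a mock embedding benchmark might do. The value
/// formula is a stand-in, not the platform's actual mock.
fn mock_embedding(dim: usize, seed: u64) -> Vec<f32> {
    // Cheap deterministic pseudo-values in (0, 1]; no RNG crate needed.
    let mut v: Vec<f32> = (0..dim)
        .map(|i| {
            let x = (i as u64)
                .wrapping_mul(6364136223846793005)
                .wrapping_add(seed);
            (x % 1000) as f32 / 1000.0 + 0.001
        })
        .collect();
    // L2-normalize so the vector has unit length, as real embedding
    // pipelines commonly do before cosine-similarity lookups.
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    for x in &mut v {
        *x /= norm;
    }
    v
}
```

Generating and normalizing 384 floats is a microsecond-scale task; the real cost in production comes from running the embedding model itself.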
## Conclusion
The core infrastructure code (routing, parsing, caching math) is highly optimized, with most operations taking less than 1.5 microseconds. This confirms that the Rust-based service layer will not be a bottleneck; the primary latency driver will be the actual AI model inference time.