Benchmarking Results & Analysis

This document explains the performance benchmarks for the AI Inference Platform. Validating the performance of "hot paths" (code executed frequently) is critical for high-throughput API servers.

Overview

The benchmarks measure the execution time of critical components in the inference pipeline:

  1. API Layer: JSON serialization and deserialization of requests.
  2. Inference Logic: Query classification heuristics.
  3. Caching: Cache key generation and vector similarity calculations.
  4. Embeddings: Generation of embedding vectors (mocked for this benchmark).

Method

Benchmarks are written using Criterion.rs, a statistics-driven benchmarking library for Rust. It provides precise measurements by running thousands of iterations and statistically analyzing the results to filter out noise (outliers).
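
Conceptually, Criterion automates what a hand-rolled timing loop does, and adds warm-up, many samples, and outlier-robust statistics on top. The following stdlib-only sketch illustrates the idea; it is not the actual benchmark harness, and `mean_ns` is an invented helper:

```rust
use std::hint::black_box;
use std::time::Instant;

/// Time `f` over `iters` iterations and return the mean nanoseconds per call.
/// A simplified sketch of what Criterion.rs does; Criterion additionally
/// warms up, collects many samples, and filters statistical outliers.
fn mean_ns<F: FnMut()>(iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed().as_nanos() as f64 / iters as f64
}

fn main() {
    let ns = mean_ns(100_000, || {
        // black_box keeps the optimizer from deleting the work under test
        black_box("What is the weather today?".to_lowercase());
    });
    println!("~{ns:.0} ns/iter");
}
```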

To run the benchmarks yourself, use the following command:

```bash
cargo bench
```

Note: the repository is currently private; the plan is to bundle the benchmarks into the Docker image so they can be run without source access.

Detailed Results Analysis

The following analysis is based on the benchmark run provided.

1. API Hot Path (api_benchmark.rs)

This group measures the overhead of handling HTTP request data.

| Benchmark Case | Time (approx.) | Description |
| --- | --- | --- |
| `serialize_infer_request` | 317 ns | Converts the internal `InferRequest` struct into a JSON string. This happens when the server sends a response or forwards a request. |
| `deserialize_infer_request` | 287 ns | Parses a raw JSON string into the `InferRequest` struct. This happens for every incoming API call. |

Interpretation: The serialization/deserialization overhead is extremely low (less than 1 microsecond). This means the framework introduces negligible latency compared to the actual model inference (which usually takes milliseconds or seconds).

Future work: benchmark this path with biv to see whether serialization performance can be improved further.

2. Query Classification (inference_benchmark.rs)

This group measures the heuristic logic used to route queries (e.g., deciding if a query is "simple" or "complex").

| Benchmark Case | Time (approx.) | Description |
| --- | --- | --- |
| `simple_query` | 48 ns | Analyzing a short string ("What is the weather today?"). |
| `complex_query` | 189 ns | Analyzing a longer, multi-sentence prompt. |

Interpretation: The classification logic is virtually instantaneous. The difference between the two cases (~140 ns) reflects the O(N) cost of iterating over the characters and words of longer strings.
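
A heuristic of this shape can be sketched as follows. The `classify` function, the word-count threshold, and the sentence check are all hypothetical, assuming only "an O(N) scan that routes short queries to the simple path":

```rust
/// Hypothetical routing heuristic: a single O(N) pass over the query.
/// The real platform's rules are not public; these thresholds are invented.
fn classify(query: &str) -> &'static str {
    // Word count requires scanning the whole string: O(N).
    let words = query.split_whitespace().count();
    // More than one sentence terminator suggests a multi-sentence prompt.
    let multi_sentence = query
        .matches(|c| c == '.' || c == '?' || c == '!')
        .count()
        > 1;
    if words <= 8 && !multi_sentence {
        "simple"
    } else {
        "complex"
    }
}

fn main() {
    assert_eq!(classify("What is the weather today?"), "simple");
    assert_eq!(
        classify("Summarize the document. Then compare it with last week's version and list open questions."),
        "complex"
    );
}
```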

The current heuristic is naive; finding a more efficient approach will require further investigation.

3. Caching Operations (inference_benchmark.rs)

This group measures operations required for the semantic cache (vector database lookups).

| Benchmark Case | Time (approx.) | Description |
| --- | --- | --- |
| `cache_key_gen` | 276 ns | Normalizing the input text (lowercase, trim) and hashing it to create a lookup key. |
| `cosine_similarity` | 1.19 µs | Calculating the similarity between two 384-dimensional vectors. This is the core mathematical operation for finding "similar" past queries. |

Interpretation:

  • Key Gen: Fast and efficient.
  • Cosine Similarity: Takes ~1.2 µs per comparison. A linear scan over 1,000 cached vectors would therefore cost ~1.2 ms. This highlights why approximate-nearest-neighbor indexing (e.g. HNSW, or a library like FAISS) is necessary for large datasets, though for small batches a linear scan is acceptable.
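
Both cached operations can be sketched in plain Rust. This is a minimal illustration, not the platform's actual implementation; the function names and the choice of `DefaultHasher` are assumptions (the real service may use a different hash):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical cache-key generation: normalize (trim + lowercase), then hash.
fn cache_key(input: &str) -> u64 {
    let normalized = input.trim().to_lowercase();
    let mut hasher = DefaultHasher::new();
    normalized.hash(&mut hasher);
    hasher.finish()
}

/// Cosine similarity between two equal-length vectors:
/// dot(a, b) / (|a| * |b|).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Differently-formatted but equivalent queries map to the same key.
    assert_eq!(cache_key("  Hello World "), cache_key("hello world"));

    // Identical 384-d vectors have similarity 1.0.
    let v = vec![0.5f32; 384];
    assert!((cosine_similarity(&v, &v) - 1.0).abs() < 1e-4);
}
```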

4. Embedding Generation (inference_benchmark.rs)

| Benchmark Case | Time (approx.) | Description |
| --- | --- | --- |
| `mock_embedding_384d` | 1.30 µs | Generates a synthetic 384-dimensional vector and normalizes it. |

Interpretation: This benchmarks a mock implementation. In a real production environment, generating an embedding using a model like BERT or OpenAI's text-embedding-3 would take significantly longer (10ms - 500ms) and would likely be offloaded to a GPU or external API.
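
For reference, a mock embedding of this kind can be sketched as below. The function name and the LCG constants are illustrative, assuming only "generate a deterministic synthetic 384-d vector and L2-normalize it" (a dependency-free stand-in for a real random source):

```rust
/// Hypothetical mock embedding: deterministic pseudo-random values in
/// [-1, 1) from a simple LCG (no external RNG crate), then L2-normalized.
fn mock_embedding(dim: usize, seed: u64) -> Vec<f32> {
    let mut state = seed;
    let mut v: Vec<f32> = (0..dim)
        .map(|_| {
            // Constants from a common 64-bit LCG; any decent mixer would do.
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            // Take the high 31 bits and rescale to [-1, 1).
            (state >> 33) as f32 / (1u64 << 30) as f32 - 1.0
        })
        .collect();
    // L2-normalize so the vector has unit length.
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    for x in &mut v {
        *x /= norm;
    }
    v
}

fn main() {
    let v = mock_embedding(384, 42);
    assert_eq!(v.len(), 384);
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    assert!((norm - 1.0).abs() < 1e-3);
}
```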

Conclusion

The core infrastructure code (routing, parsing, caching math) is highly optimized, with most operations taking less than 1.5 microseconds. This confirms that the Rust-based service layer will not be a bottleneck; the primary latency driver will be the actual AI model inference time.