MCP Server Performance Benchmark v2: 15 Implementations, I/O-Bound Workloads

An expanded benchmark covering 15 MCP server implementations across Rust, Java (Spring MVC, WebFlux, Virtual Threads), Quarkus, Micronaut (JVM and GraalVM native images), Go, Bun, Node.js, and Python. Three independent runs totaling 39.9 million requests with real Redis and HTTP API I/O workloads and 0% error rate across all implementations.

Research Context and Scope

This benchmark evaluates a specific question: how do different language runtimes and frameworks perform when implementing MCP servers under I/O-bound workloads at 50 concurrent virtual users? The results reflect this test scenario and should be interpreted within that context. They do not constitute a general ranking of programming languages or frameworks.

MCP is a nascent protocol and its ecosystem (SDKs, tooling, server frameworks) is evolving rapidly. Several SDKs used here are pre-release or early-access (Micronaut MCP SDK 0.0.19). Results may differ with future SDK versions, different concurrency levels, or alternative workload patterns.

The intent of this research is constructive contribution to the MCP ecosystem. No technology evaluated here is unsuitable by nature: each serves real use cases and the communities behind them continue to evolve their implementations. A lower benchmark rank reflects performance under this specific workload, not a judgment of quality or fitness for purpose. We welcome corrections, alternative configurations, and contributions via the benchmark repository.

Abstract

This experiment presents a comprehensive performance analysis of 15 Model Context Protocol (MCP) server implementations spanning Rust, Java (Spring MVC, WebFlux, Virtual Threads), Quarkus, Micronaut (JVM and GraalVM native images), Go, Bun, Node.js, and Python. Three independent runs totaling 39.9 million requests were executed with I/O-bound workloads (Redis + HTTP API), achieving a 0% error rate across all 15 servers in all 3 runs.

Key Findings: Rust leads throughput at 4,845 RPS with only 10.9 MB RAM and a CV of 0.04%. Quarkus leads latency at 4.04ms average and 8.13ms P95. Go and Java remain competitive at 3,616 and 3,540 RPS respectively. Classic blocking I/O (Spring MVC) outperforms reactive WebFlux at 50 VUs. GraalVM native images uniformly cut memory (27-81%) at the cost of a 20-36% throughput regression, with Quarkus-native as the best native trade-off. Bun delivers 2.1x the RPS of Node.js on identical application code. Python with 4 workers and uvloop reaches 259 RPS; the bottleneck is FastMCP session overhead, not the ASGI server.

Recommendations: For high-load production deployments, Rust offers unmatched throughput and resource efficiency. Quarkus is the optimal choice when latency SLAs are primary. Go provides an excellent balance of performance, memory, and operational simplicity. Java Spring MVC (blocking) remains a strong Tier 2 choice. Native images are justified where memory is constrained and throughput requirements are moderate (below 3,500 RPS). JavaScript and Python runtimes are better suited for low-to-moderate traffic MCP deployments.

Keywords: MCP, Model Context Protocol, Performance Benchmark, Rust, Quarkus, GraalVM Native Image, Virtual Threads, Spring WebFlux, Micronaut, Bun, Go, Java, Node.js, Python, k6, Redis, Load Testing, Streamable HTTP

1. Introduction and Motivation

The MCP ecosystem is growing rapidly. Organizations adopting MCP servers face a widening set of implementation trade-offs: native compilation via GraalVM, reactive vs blocking I/O models within the JVM, alternative JavaScript runtimes, and the impact of real external I/O workloads on framework-level decisions. v2 was designed to surface these trade-offs empirically.

The v1 benchmark drew valid community feedback: no Quarkus, no GraalVM native images, no Virtual Threads, no reactive frameworks, no Micronaut, no Rust, and Python running in a single-worker default configuration. Version 2 is the direct experimental response to those criticisms, expanding from 4 to 15 implementations and replacing synthetic CPU tools with real Redis and HTTP API workloads.

Research Questions:
  • How do JVM-based implementations compare across blocking, reactive, and virtual thread concurrency models?
  • What are the throughput and latency trade-offs between JVM and GraalVM native images under I/O-bound load?
  • Does Rust belong in the MCP server performance conversation?
  • Can Bun's JavaScriptCore runtime change the JavaScript performance story for MCP servers?
  • What is the realistic production ceiling for optimized Python (multi-worker + uvloop)?

2. Experimental Setup

2.1 From v1 to v2: What Changed

v1 vs v2 Scope Comparison
| Dimension | v1 | v2 |
|---|---|---|
| Servers | 4 | 15 |
| Java variants | 1 (Spring Boot + Spring AI) | 6 (Spring MVC, WebFlux, VT + 3 native images) |
| New runtimes | None | Rust, Quarkus, Micronaut, Bun |
| Tools | 4 synthetic (fibonacci, HTTP, JSON, sleep) | 3 I/O-bound (Redis + HTTP API) |
| Total requests | ~3.9M | ~39.9M |
| Runs | 3 rounds | 3 full independent runs |
| CPU per container | 1.0 vCPU | 2.0 vCPUs |
| Memory per container | 1 GB | 2 GB |
Workload Design Rationale: Synthetic tools stress isolated performance dimensions (pure CPU, artificial delay). I/O-bound tools are representative of how MCP servers operate in production: fetching data from external services and reading/writing persistent state. The v2 workload makes framework-level I/O handling decisions visible in the results in a way that synthetic tools cannot.

2.2 Test Environment

Test Environment Specification
| Component | Specification |
|---|---|
| Host | Microsoft Azure VM, 8 vCPUs, 32 GB RAM |
| OS | Ubuntu 24.04 LTS |
| Container Runtime | Docker with Docker Compose |
| CPU Limit (per MCP server) | 2.0 vCPUs |
| Memory Limit (per MCP server) | 2 GB |
| Infrastructure | Redis 7 Alpine (0.5 vCPU / 512 MB) + Go API service (2 vCPUs / 2 GB) |
| Network | Docker bridge (inter-container, localhost) |
| Test Runs | 3 independent runs (February 27-28, 2026) |

2.3 Server Implementations

Server Implementations
| Server | Framework / SDK | Runtime |
|---|---|---|
| rust | rmcp 0.17.0 | Rust / Tokio |
| quarkus | Quarkus 3.31.4 MCP Server SDK | Java 21 / Vert.x |
| go | mcp-go | Go 1.23 |
| java | Spring Boot 4 MVC + Spring AI | Java 21 |
| java-vt | Spring Boot 4 + Project Loom | Java 25 |
| java-webflux | Spring Boot 4 WebFlux + Spring AI | Java 21 / Netty |
| micronaut | Micronaut 4.10.8 / MCP SDK 0.0.19 | Java 21 |
| quarkus-native | Quarkus native image | GraalVM 23 / native |
| java-native | Spring Boot native image | GraalVM 25 / native |
| java-vt-native | Spring Boot VT native image | GraalVM 25 / native |
| java-webflux-native | Spring Boot WebFlux native | GraalVM 25 / native |
| micronaut-native | Micronaut native image | GraalVM 23 / native |
| bun | Express + MCP SDK | Bun 1 / JavaScriptCore |
| nodejs | Express + MCP SDK | Node.js 22 / V8 |
| python | FastMCP + Starlette | Python 3.11 / CPython |
Production-Representative Configurations: Each server uses production-appropriate settings: connection pools tuned, workers configured, json_response enabled where supported. This follows the same "Standard DX" principle as v1, extended to include configurations that would survive real production load rather than default out-of-box settings.

2.4 Benchmark Tools and Workload

Each server implements three identical tools performing I/O-bound operations against a Redis instance and an HTTP API service (100,000 products in-memory):

Benchmark Tools and I/O Operations
| Tool | I/O Operations | Performance Dimension |
|---|---|---|
| search_products | HTTP GET /products/search + Redis ZRANGE (parallel) | Parallel async I/O, HTTP client pool |
| get_user_cart | Redis HGETALL, then HTTP GET /products/{id} + Redis LRANGE (parallel) | Sequential + parallel I/O, Redis read patterns |
| checkout | HTTP POST /cart/calculate + Redis pipeline INCR+RPUSH+ZADD (parallel) | Write throughput, Redis pipeline efficiency |
JavaScript
// k6 Load Profile — v2
export const options = {
    stages: [
        { duration: '10s', target: 50 },  // Ramp-up to 50 VUs
        { duration: '5m', target: 50 },   // Sustained load
        { duration: '10s', target: 0 },   // Ramp-down
    ],
    thresholds: {
        'http_req_failed': ['rate<0.05'],
    },
};
// First 60 seconds excluded from metrics (WARMUP_SECONDS=60)
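
The warmup cutoff noted above can be sketched as a simple filter over timestamped samples (an illustrative sketch only; the actual exclusion is applied by the benchmark harness via WARMUP_SECONDS):

```python
WARMUP_SECONDS = 60  # matches the k6 profile above

def exclude_warmup(samples, warmup=WARMUP_SECONDS):
    """Drop metric samples recorded during the warmup window.

    `samples` is a list of (elapsed_seconds, value) tuples; only samples
    at or after the cutoff contribute to the reported statistics.
    """
    return [(t, v) for t, v in samples if t >= warmup]
```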

2.5 Test Methodology

graph LR
    A[Redis Flush] --> B[Redis Seed]
    B --> C[Stop MCP Servers]
    C --> D[Start Target Server]
    D --> E[Health Check]
    E --> F[Warmup 60s]
    F --> G[k6 Run 5min]
    G --> H[Stats Collection]
    H --> I[Consolidate Results]

    style D fill:#1e293b,stroke:#6366f1
    style G fill:#1e293b,stroke:#10b981
                    

Figure 1: Per-server test cycle ensuring isolation and reproducibility

  1. Redis Isolation: Flush and re-seed before each server to ensure identical data state across all 15 servers.
  2. Server Isolation: Only one MCP server running during each test period, eliminating resource contention.
  3. Warmup: 60 seconds of real tool calls excluded from metrics to allow JIT compilation on all code paths (5 init sessions + 9 tool call sessions per server).
  4. Sustained Load: 50 VUs, 5 minutes, all 3 tools called in rotation. VU N uses user-(N%1000), providing 50 distinct users.
  5. Parallel Stats Collection: Docker stats sampled alongside k6 metrics for CPU and memory.
Run Timestamps:
  • First run: February 27, 2026 at 21:08:47 UTC
  • Second run: February 27, 2026 at 23:34:47 UTC
  • Third run: February 28, 2026 at 00:00:47 UTC

3. Implementation Details

3.1 Rust (rmcp SDK)

SDK Bug Discovered During Benchmarking: SSE Hardcoded in rmcp v0.16

During our benchmarking, we identified an anomalous fixed latency on every tool that performed HTTP calls. In an isolated run with rmcp v0.16 defaults, the search_products tool (pure Redis path) ran at 1.11ms avg, while get_user_cart and checkout were stuck at 40.84ms regardless of actual I/O time. We traced the cause to rmcp v0.16, which hardcodes text/event-stream for all responses, even stateless request-response exchanges. Every response carries chunked transfer-encoding, SSE framing, and keep-alive pings: approximately 40ms of pure transport overhead per request on any tool that returns an HTTP-originated payload. The MCP spec explicitly permits application/json for stateless responses. rmcp was simply not implementing that path.
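
To make the transport difference concrete, the sketch below contrasts the two response shapes for the same JSON-RPC payload. This is illustrative only: the exact SSE framing bytes are an assumption, and the measured ~40ms came from chunked delivery and keep-alive pings rather than payload size.

```python
import json

def json_body(message: dict) -> bytes:
    # Stateless path (permitted by the MCP spec): the JSON-RPC response is
    # the entire HTTP body, Content-Type: application/json, one plain write.
    return json.dumps(message).encode()

def sse_frame(message: dict) -> bytes:
    # SSE path: the same payload wrapped in text/event-stream framing and
    # delivered over a chunked, kept-alive stream (framing bytes illustrative).
    return b"event: message\ndata: " + json.dumps(message).encode() + b"\n\n"

msg = {"jsonrpc": "2.0", "id": 1, "result": {"content": []}}
assert len(sse_frame(msg)) > len(json_body(msg))  # SSE always adds framing bytes
```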

This issue is fixed in rmcp v0.17.0, released on February 27, 2026, which ships the json_response option as an official feature. The benchmark was run with the equivalent patch applied to v0.16. The server implementation has since been updated to use rmcp 0.17.0 from crates.io directly, with no local patch required.

Rust: Impact of the json_response Patch (identical setup, only the patch differs)
| Configuration | RPS | Avg Latency | Tool Breakdown |
|---|---|---|---|
| Without json_response patch (SSE default) | 1,283 | 27.59 ms | search_products: 1.11 ms / get_user_cart: 40.84 ms / checkout: 40.84 ms |
| With json_response patch | 4,845 | 5.09 ms | search_products: 6.12 ms / get_user_cart: 5.63 ms / checkout: 3.51 ms |
Rust
// StreamableHttpServerConfig with json_response patch
StreamableHttpServerConfig {
    stateful_mode: false,
    json_response: true,   // returns application/json directly instead of SSE
    ..Default::default()
}

// New branch in tower.rs (simplified)
if self.config.json_response {
    let cancel = self.config.cancellation_token.child_token();
    match tokio::select! {
        res = receiver.recv() => res,
        _ = cancel.cancelled() => None,
    } {
        Some(message) => {
            let body = serde_json::to_vec(&message)?;
            Ok(Response::builder()
                .header(CONTENT_TYPE, JSON_MIME_TYPE)
                .body(Full::new(Bytes::from(body)).boxed()))
        }
        None => Err(internal_error_response("empty response")(...))
    }
}
Fix Merged Upstream: PR #683 shipped in rmcp v0.17.0

We submitted the fix to the official modelcontextprotocol/rust-sdk repository as PR #683. The patch adds json_response: bool to StreamableHttpServerConfig, backwards-compatible by default (false preserves the original SSE path unchanged). It was refined during code review (maintainers suggested tokio::select! with cancellation safety and tracing::info! logging), merged, and shipped as part of rmcp v0.17.0 on February 27, 2026.

Key Characteristics:

  • rmcp 0.17.0 (official release) with json_response: true in StreamableHttpServerConfig
  • Tokio async runtime with deadpool-redis connection pool (pool size 100)
  • Parallel tool handlers via tokio::spawn
  • Redis pipeline combines 3 write operations in checkout into a single RTT
  • 10.9 MB average RAM (lowest of all 15 servers)
  • CV 0.04% (most stable across all 3 runs)

3.2 Quarkus (Vert.x / Mutiny)

Default Pool Limits Caused Catastrophic Failure (near-zero RPS): Quarkus REST Client Reactive ships with a conservative connection pool (default approximately 50 connections per host). Under 50 VUs with parallel I/O, the pool exhausted instantly and all requests queued until the 30-second timeout, yielding less than 1 effective RPS. Other frameworks passed the same test at 3,000+ RPS.
Properties
quarkus.rest-client.api-service.connection-pool-size=1000
quarkus.rest-client.api-service.keep-alive-enabled=true
quarkus.redis.max-pool-size=100
quarkus.redis.max-pool-waiting=1000

Key Characteristics:

  • Reactive Vert.x/Mutiny event loop model, non-blocking I/O throughout
  • Lowest CPU usage among all Java frameworks (161.7% avg, vs 200%+ for others)
  • 194.5 MB RAM (lowest footprint among the JVM-based Java variants)
  • Best latency of all 15 servers: 4.04ms avg, 8.13ms P95, 11.16ms P99

3.3 Go (mcp-go)

Go
var httpClient = &http.Client{
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 100,
        IdleConnTimeout:     90 * time.Second,
    },
    Timeout: 10 * time.Second,
}
Why Pool Tuning Mattered: http.DefaultClient uses MaxIdleConnsPerHost=2. Under 50 VUs, goroutines contended for two idle keep-alive connections, causing TCP connection churn and P95 latency spikes of 61ms. Replacing it with a tuned transport reduced P95 to 17.62ms.

Key Characteristics:

  • Goroutine-per-request concurrency model with net/http standard library
  • 23.9 MB average RAM (second lowest after Rust)
  • Static binary, no runtime dependency
  • CV 0.19% (third most stable overall)

3.4 Java (Spring MVC, Blocking I/O)

Blocking Outperforms Reactive at 50 VUs: With 50 VUs and fast I/O (approximately 6ms avg), the Tomcat thread pool never exhausts. Sequential blocking code eliminates reactive scheduling overhead. Java MVC (3,540 RPS) outperforms Java WebFlux (3,032 RPS) in this workload. The blocking model is appropriate when the concurrency level is lower than the thread pool capacity.
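
The claim that the pool never exhausts can be checked with Little's law (L = λ·W): at the measured throughput and latency, the average number of in-flight requests stays far below Tomcat's default pool of 200 worker threads.

```python
def concurrent_requests(rps: float, avg_latency_s: float) -> float:
    """Little's law: average in-flight requests L = arrival rate λ × latency W."""
    return rps * avg_latency_s

# Java MVC measured averages from this benchmark: 3,540 RPS at 6.13 ms
in_flight = concurrent_requests(3540, 0.00613)  # ≈ 21.7 requests in flight
TOMCAT_DEFAULT_MAX_THREADS = 200                # Tomcat's default worker pool size
assert in_flight < TOMCAT_DEFAULT_MAX_THREADS   # roughly 9x headroom at 50 VUs
```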

Key Characteristics:

  • Spring Boot 4 MVC with Tomcat thread pool executor
  • Sequential blocking I/O via RestClient and StringRedisTemplate per request
  • No CompletableFuture or reactive primitives
  • 368.1 MB average RAM (standard Spring Boot JVM footprint)

3.5 Java Virtual Threads (Project Loom)

Virtual Threads Design Intent: Virtual Threads (Project Loom, stable since Java 21) are designed for I/O-bound workloads where threads spend most of their time waiting on external resources. When a Virtual Thread blocks on I/O, the JVM unmounts it from its carrier platform thread, freeing that thread to run another Virtual Thread. This allows high concurrency without requiring explicit async programming. In CPU-bound workloads, Virtual Threads add scheduler overhead without throughput benefit. In this benchmark's I/O-bound workload (Redis + HTTP API calls), Virtual Threads operate as designed and deliver competitive 3,482 RPS.
java-vt-native: Worst Native Image Configuration: The java-vt-native image is the worst-performing native image (19.06ms avg latency, 2,447 RPS). GraalVM's AOT-compiled VT continuation mechanism adds overhead that the JIT would otherwise optimize at runtime. Combining VT scheduling overhead with AOT compilation is the least efficient native image configuration in this benchmark.

Key Characteristics:

  • spring.threads.virtual.enabled=true (canonical Spring Boot 4 configuration)
  • Java 25 selected to include JEP 491 (delivered in Java 24): Virtual Threads can now acquire and release synchronized monitors without pinning the carrier thread. Before this fix, any synchronized block inside the call stack (including Spring internals) would pin the Virtual Thread to its OS carrier thread for the duration, eliminating the concurrency benefit for I/O-bound code.
  • Same RestClient and StringRedisTemplate as Java MVC
  • 349.7 MB RAM (slightly higher than MVC due to Virtual Thread scheduler metadata)
  • Competitive throughput at 3,482 RPS

3.6 Java WebFlux (Reactor / Netty)

Netty Off-Heap Memory: WebFlux uses Netty as its HTTP server instead of Tomcat. Netty allocates I/O buffers outside the JVM heap (off-heap / direct memory) via PooledByteBufAllocator. These buffers are invisible to -Xmx and persist in GraalVM native images. This explains the high memory peak (663 MB max) and why java-webflux-native also has elevated memory (351 MB avg) despite native compilation.

Key Characteristics:

  • Reactor event loop model, non-blocking I/O throughout
  • Competitive average latency (8.89ms avg) but high P99 tail in checkout (47ms)
  • 484.6 MB average RAM (highest among JVM variants, due to Netty off-heap buffers)

3.7 Micronaut

GraalVM 23 Required for micronaut-native: The Dockerfile uses graalvm/native-image-community:23. The previous GraalVM version could not fully inline Micronaut's annotation-driven dispatch at compile time, causing a -49% RPS regression vs JVM in earlier testing (worst of any stack). GraalVM 23 improves closed-world analysis for annotation metadata, reducing the regression significantly.
micronaut-native CPU Throttling: CPU consistently exceeds 200% (avg 233.2%), above the 2-vCPU container limit. Docker CFS burst behavior allows short-window overages before throttling. The reported throughput (2,161 RPS) may be suppressed by CFS quota enforcement. On dedicated hardware, performance could differ.

Key Characteristics:

  • Micronaut MCP Server SDK 0.0.19 (pre-release)
  • Compile-time dependency injection and annotation processing
  • Netty server (same off-heap memory pattern as WebFlux)
  • JVM variant competitive at 3,382 RPS

3.8 GraalVM Native Images: Cross-Stack Analysis

Five server stacks were benchmarked in both JVM and GraalVM native image configurations, revealing consistent trade-off patterns across all stacks.

GraalVM Native Image: The Core Trade-off: 27-81% memory reduction at the cost of a 20-36% throughput regression.
JVM vs Native Image: Throughput and Memory Trade-offs
| Stack | JVM RPS | Native RPS | RPS Regression | JVM RAM avg | Native RAM avg | RAM Saving |
|---|---|---|---|---|---|---|
| Quarkus | 4,739 | 3,449 | -27% | 194 MB | 36 MB | -81% |
| Java MVC | 3,540 | 2,316 | -35% | 368 MB | 178 MB | -52% |
| Java VT | 3,482 | 2,447 | -30% | 350 MB | 194 MB | -44% |
| WebFlux | 3,032 | 2,413 | -20% | 485 MB | 351 MB | -28% |
| Micronaut | 3,382 | 2,161 | -36% | 216 MB | 63 MB | -71% |
The Trade-off Pattern: In I/O-bound workloads under sustained load, JIT-compiled code consistently outperforms AOT. The JIT observes actual runtime call patterns and optimizes hot paths adaptively at runtime. AOT compiles conservatively at build time without that observability. The throughput regression from native images (20-36%) becomes measurable only when the server is under significant and sustained load. At low or intermittent request rates, the difference is negligible and the memory savings dominate. The benefits of native images (fast startup, predictable memory footprint) do not translate to throughput advantages in sustained high-load I/O scenarios. Quarkus-native is the strongest exception: it maintains competitive P95 latency (15.92ms, 4th best overall) despite the RPS regression.
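
The regression and saving percentages in the table above reduce to a one-line calculation over the measured JVM and native averages:

```python
def pct_change(jvm: float, native: float) -> int:
    """Signed percentage change when moving from the JVM build to the native image."""
    return round((native - jvm) / jvm * 100)

# Quarkus row: throughput regression and RAM saving
assert pct_change(4739, 3449) == -27   # RPS: -27%
assert pct_change(194, 36) == -81      # RAM: -81%
```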

3.9 Node.js and Bun

SDK Per-Request Instantiation: An Intentional Design Constraint

Creating a new McpServer and StreamableHTTPServerTransport per HTTP request is the intentional design for stateless MCP servers. The SDK throws "Stateless transport cannot be reused across requests" on any reuse attempt. This sets a fixed 5-10ms overhead floor per request that no framework swap can eliminate.

Experiments attempted and reverted: undici pool with connections: 100 regressed RPS by 28%. Hono + WebStandardStreamableHTTPServerTransport cost 23% RPS due to IncomingMessage-to-Request adaptation overhead on Node.js. Express with StreamableHTTPServerTransport is retained.

JavaScript
// Cluster entry (WEB_CONCURRENCY=4 worker processes)
import cluster from 'cluster';
import express from 'express';
import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamableHttp.js';

const WORKERS = parseInt(process.env.WEB_CONCURRENCY || '1');
if (cluster.isPrimary) {
    for (let i = 0; i < WORKERS; i++) cluster.fork();
} else {
    const app = express();
    app.use(express.json());
    // Per-request McpServer (SDK design constraint — cannot be reused)
    app.post('/mcp', async (req, res) => {
        const server = createMcpServer();   // local factory registering the 3 tools
        const transport = new StreamableHTTPServerTransport({ sessionIdGenerator: undefined });
        await server.connect(transport);
        await transport.handleRequest(req, res, req.body);
    });
    app.listen(3000);
}
Bun vs Node.js: Same Code, 2.1x Throughput: Both containers run the exact same index.js with WEB_CONCURRENCY=4. Bun's JavaScriptCore JIT and native fetch() deliver 876 RPS vs Node.js's 423 RPS (a 2.1x ratio). Memory cost: Bun uses 540.8 MB vs Node.js's 389.2 MB (+152 MB). Both runtimes saturate 2 vCPUs at approximately 200% with 4 workers.

3.10 Python (FastMCP + uvloop)

v1 Criticism Addressed: v1 ran Python with a single-worker uvicorn. v2 runs 4 workers with uvloop, directly responding to community feedback. The json_response=True option was already present in v1.
Python
# Launch command
# uvicorn main:app --host 0.0.0.0 --port 8082 --workers 4 --loop uvloop

from contextlib import asynccontextmanager

import httpx
from mcp.server.fastmcp import FastMCP  # official MCP Python SDK

# FastMCP configuration: json_response=True eliminates SSE framing overhead
mcp = FastMCP("BenchmarkPythonServer", stateless_http=True, json_response=True)

# Shared HTTP client (avoids per-request TCP pool creation)
@asynccontextmanager
async def lifespan(app):
    global _http_client
    _http_client = httpx.AsyncClient(timeout=10.0)
    async with mcp.session_manager.run():
        yield
    await _http_client.aclose()
Granian Experiment: ASGI Server is Not the Bottleneck: A Rust-based ASGI server (Granian with Tokio I/O) was tested as a replacement for uvicorn. Result: -12% RPS. The bottleneck is not the ASGI server layer. It is FastMCP's per-request MCP session processing in CPython. The ASGI server choice is irrelevant to Python MCP performance.

Key Characteristics:

  • GIL limits true parallelism within each worker process
  • 4 workers x approximately 65 RPS per worker = 259 RPS total
  • 258.6 MB average RAM
  • The bottleneck is FastMCP session overhead in CPython, not uvicorn or network I/O

4. Results and Analysis

4.1 Overall Performance Metrics

Perfect Reliability: 39,958,616 total requests across 15 servers and 3 independent runs. Error rate: 0% for every server in every run. All 15 implementations demonstrated robust MCP protocol compliance under sustained load.
Table 1: Performance Metrics Summary, All 15 Servers (Average of 3 Runs, sorted by RPS descending)
| Server | RPS (avg) ▼ | Avg Latency | P95 Latency | Requests Served (3 runs) |
|---|---|---|---|---|
| rust | 4,845 | 5.09 ms | 10.99 ms | 4,724,624 |
| quarkus | 4,739 | 4.04 ms | 8.13 ms | 4,620,520 |
| go | 3,616 | 6.87 ms | 17.62 ms | 3,525,424 |
| java | 3,540 | 6.13 ms | 13.71 ms | 3,452,064 |
| java-vt | 3,482 | 9.03 ms | 18.43 ms | 3,395,384 |
| quarkus-native | 3,449 | 10.36 ms | 15.92 ms | 3,362,784 |
| micronaut | 3,382 | 9.75 ms | 17.00 ms | 3,297,208 |
| java-webflux | 3,032 | 8.89 ms | 27.48 ms | 2,956,424 |
| java-vt-native | 2,447 | 19.06 ms | 36.82 ms | 2,385,880 |
| java-webflux-native | 2,413 | 14.43 ms | 44.17 ms | 2,353,056 |
| java-native | 2,316 | 16.20 ms | 42.44 ms | 2,258,592 |
| micronaut-native | 2,161 | 20.75 ms | 36.94 ms | 2,107,080 |
| bun | 876 | 48.46 ms | 98.50 ms | 853,736 |
| nodejs | 423 | 123.50 ms | 200.07 ms | 412,888 |
| python | 259 | 251.62 ms | 342.41 ms | 252,952 |
Reading CPU Percentages: Docker stats reports CPU usage as a percentage of a single logical CPU core. Each MCP server container was allocated 2.0 vCPUs, so 200% represents full utilization of both allocated cores. Values above 200% (most notably micronaut-native at 233.2%) occur because Docker CFS (Completely Fair Scheduler) allows short burst windows that temporarily exceed the configured quota before throttling is applied. These bursts are real CPU consumption, not a measurement artifact.
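
Normalizing the docker stats figure against the container quota makes the burst behavior explicit (a small sketch using the measured averages from this benchmark):

```python
def quota_utilization(docker_cpu_pct: float, vcpus: float = 2.0) -> float:
    """Fraction of the container's CPU quota in use.

    Docker stats reports percent of one logical core, so a 2-vCPU
    container is saturated at 200%; values above 1.0 indicate CFS
    burst windows that temporarily exceed the quota.
    """
    return docker_cpu_pct / (vcpus * 100)

assert quota_utilization(200.0) == 1.0   # both allocated cores fully used
assert quota_utilization(233.2) > 1.0    # micronaut-native bursts past its quota
assert quota_utilization(117.9) < 0.6    # rust leaves significant headroom
```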
Table 2: Resource Utilization, All 15 Servers (sorted by RAM ascending)
| Server | CPU avg (%) | RAM avg (MB) | Error Rate |
|---|---|---|---|
| rust | 117.9% | 10.9 MB | 0% |
| quarkus | 161.7% | 194.5 MB | 0% |
| java-vt | 184.9% | 349.7 MB | 0% |
| micronaut | 190.5% | 216.3 MB | 0% |
| go | 209.9% | 23.9 MB | 0% |
| java | 206.0% | 368.1 MB | 0% |
| quarkus-native | 204.6% | 36.1 MB | 0% |
| java-webflux | 207.2% | 484.6 MB | 0% |
| java-vt-native | 211.7% | 193.8 MB | 0% |
| java-webflux-native | 204.4% | 351.2 MB | 0% |
| java-native | 202.8% | 178.2 MB | 0% |
| micronaut-native | 233.2% | 63.0 MB | 0% |
| bun | 205.8% | 540.8 MB | 0% |
| nodejs | 202.2% | 389.2 MB | 0% |
| python | 206.6% | 258.6 MB | 0% |

4.2 Latency Analysis

Average latency measurements reveal four distinct performance tiers. The top tier (Rust, Quarkus, Go, Java) operates in the 4-7ms range with I/O-bound workloads. This contrasts sharply with the v1 sub-millisecond averages, which reflected synthetic CPU-bound tools. The v2 numbers represent realistic production latency with actual Redis and HTTP network round-trips.

Average Latency — All 15 Servers (ms, sorted ascending within tier)

  • Tier 1 (Quarkus, Rust, Go): quarkus 4.04 ms, rust 5.09 ms, go 6.87 ms
  • Tier 2 (Java JVM variants): java 6.13 ms, java-webflux 8.89 ms, java-vt 9.03 ms, micronaut 9.75 ms, quarkus-native 10.36 ms
  • Tier 3 (Native images): java-webflux-native 14.43 ms, java-native 16.20 ms, java-vt-native 19.06 ms, micronaut-native 20.75 ms
  • Tier 4 (JS / Python): bun 48.46 ms, nodejs 123.50 ms, python 251.62 ms
Latency Context: The v1 top performers (Java, Go) reached sub-millisecond averages on CPU-bound synthetic tools. In v2, the same servers operate at 5-7ms average because tools now perform I/O-bound network operations (2 Redis operations + 1 HTTP call per request). This is intentional: the v2 numbers reflect realistic production latency, not synthetic minimums.

4.3 Throughput Comparison

Throughput measurements reveal three clear clusters: Rust and Quarkus at 4,700-4,850 RPS, the Java/Go cluster at 3,000-3,620 RPS, and the JS/Python group at 250-880 RPS. The gap between Tier 1 and Tier 4 is approximately 19x (Rust vs Python). Within the Java ecosystem, the spread from micronaut-native (2,161 RPS) to quarkus (4,739 RPS) illustrates the significant impact of framework, concurrency model, and compilation strategy.

Throughput Stability: Rust achieved a CV (Coefficient of Variation) of 0.04% (most consistent across runs). Node.js achieved 1.61% (least consistent, but still excellent). All 15 servers achieved CV below 2% across 3 independent runs, confirming that the rankings are reliable for technology selection decisions.

4.4 Resource Efficiency

Table 4: CPU and Memory Efficiency
| Server | RPS / CPU% | RPS / MB RAM | CPU Efficiency Rank |
|---|---|---|---|
| rust | 41.1 | 444.5 | 1st |
| quarkus | 29.3 | 24.4 | 2nd |
| java-vt | 18.8 | 10.0 | 3rd |
| micronaut | 17.7 | 15.6 | 4th |
| go | 17.2 | 151.3 | 5th |
| java | 17.2 | 9.6 | 5th |
| quarkus-native | 16.9 | 95.5 | 7th |
| java-webflux | 14.6 | 6.3 | 8th |
| java-webflux-native | 11.8 | 6.9 | 9th |
| java-vt-native | 11.6 | 12.6 | 10th |
| java-native | 11.4 | 13.0 | 11th |
| micronaut-native | 9.3 | 34.3 | 12th |
| bun | 4.3 | 1.6 | 13th |
| nodejs | 2.1 | 1.1 | 14th |
| python | 1.3 | 1.0 | 15th |
Rust's Resource Advantage: At 4,845 RPS on 117.9% CPU and 10.9 MB RAM, Rust uses approximately 44x less memory than Java WebFlux (484.6 MB) while delivering 60% more throughput. In a 100-instance deployment, this translates to roughly 47 GB less RAM consumed. Go is the second-best option for memory-constrained deployments at 23.9 MB and 151.3 RPS per MB.
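
The efficiency figures in Table 4 derive directly from the measured averages in Tables 1 and 2:

```python
def efficiency(rps: float, cpu_pct: float, ram_mb: float) -> tuple[float, float]:
    """Return (RPS per CPU percentage point, RPS per MB of RAM), rounded."""
    return round(rps / cpu_pct, 1), round(rps / ram_mb, 1)

# Measured averages: rust (4,845 RPS, 117.9% CPU, 10.9 MB)
#                    go   (3,616 RPS, 209.9% CPU, 23.9 MB)
assert efficiency(4845, 117.9, 10.9) == (41.1, 444.5)
assert efficiency(3616, 209.9, 23.9) == (17.2, 151.3)
```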

4.5 Tool-Specific Performance

Table 5 breaks down average latency (ms, consolidated across 3 runs) per tool, with each panel sorted independently to show which server leads for that specific operation.

search_products

HTTP GET + Redis ZRANGE (parallel)

| Server | Avg (ms) |
|---|---|
| quarkus | 4.37 |
| rust | 6.12 |
| java | 6.41 |
| go | 8.41 |
| java-vt | 9.00 |
| quarkus-native | 9.71 |
| java-webflux | 9.76 |
| micronaut | 9.82 |
| java-native | 13.97 |
| java-webflux-native | 15.12 |
| java-vt-native | 15.86 |
| micronaut-native | 20.87 |
| bun | 41.47 |
| nodejs | 119.79 |
| python | 244.50 |

get_user_cart

Redis HGETALL + HTTP + LRANGE (parallel)

| Server | Avg (ms) |
|---|---|
| quarkus | 4.35 |
| rust | 5.63 |
| java | 5.63 |
| go | 6.65 |
| java-vt | 8.24 |
| java-webflux | 8.56 |
| micronaut | 10.36 |
| quarkus-native | 12.45 |
| java-native | 12.74 |
| java-webflux-native | 14.63 |
| java-vt-native | 18.14 |
| micronaut-native | 22.03 |
| bun | 59.31 |
| nodejs | 141.88 |
| python | 260.39 |

checkout

HTTP POST + Redis pipeline (parallel)

| Server | Avg (ms) |
|---|---|
| quarkus | 3.38 |
| rust | 3.51 |
| go | 5.57 |
| java | 6.35 |
| java-webflux | 8.34 |
| quarkus-native | 8.93 |
| micronaut | 9.06 |
| java-vt | 9.83 |
| java-webflux-native | 13.54 |
| micronaut-native | 19.34 |
| java-native | 21.88 |
| java-vt-native | 23.18 |
| bun | 44.61 |
| nodejs | 108.84 |
| python | 249.97 |
Key Tool Observations:
  • checkout is consistently the fastest tool for top performers (Quarkus 3.38ms, Rust 3.51ms). Redis pipeline combines 3 write operations into 1 RTT, eliminating the per-operation network overhead.
  • search_products is the slowest tool for most servers. It requires a parallel HTTP GET + Redis ZRANGE, and the HTTP call to the API service dominates the latency.
  • Java MVC's sequential I/O is visible in get_user_cart, where the server must wait for HGETALL to complete before firing the HTTP call, unlike reactive implementations that parallelize immediately.
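
The checkout pipeline pattern can be sketched with a redis-py-style client. This is a hypothetical sketch: the key names and the `checkout_writes` helper are illustrative, not taken from any benchmarked server.

```python
def checkout_writes(r, user_id: str, product_id: str):
    """Batch checkout's three Redis writes into a single round trip.

    `r` is any redis.Redis-compatible client. Instead of three sequential
    network RTTs (INCR, RPUSH, ZADD), the pipeline queues the commands
    client-side and flushes them in one request/response exchange.
    """
    pipe = r.pipeline(transaction=False)            # plain pipelining, no MULTI/EXEC
    pipe.incr(f"orders:count:{user_id}")            # hypothetical order counter key
    pipe.rpush(f"orders:{user_id}", product_id)     # hypothetical order-history list
    pipe.zadd("products:popular", {product_id: 1})  # hypothetical popularity zset
    return pipe.execute()                           # one RTT, three replies
```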

4.6 Stability and Reproducibility

Table 6: Throughput Stability Across 3 Runs — sorted by CV (Coefficient of Variation: std dev / mean, expressed as %; lower = more consistent)
| Server | CV ▲ | Mean RPS | Std Dev | Stability |
|---|---|---|---|---|
| rust | 0.04% | 4,845 | 2 | Excellent |
| java-webflux-native | 0.10% | 2,413 | 2 | Excellent |
| go | 0.19% | 3,616 | 7 | Excellent |
| java-vt-native | 0.36% | 2,447 | 9 | Excellent |
| java-native | 0.44% | 2,316 | 10 | Excellent |
| quarkus | 0.50% | 4,739 | 24 | Excellent |
| bun | 0.52% | 876 | 5 | Excellent |
| micronaut-native | 0.57% | 2,161 | 12 | Excellent |
| java-webflux | 0.62% | 3,032 | 19 | Excellent |
| java | 0.64% | 3,540 | 23 | Excellent |
| micronaut | 0.70% | 3,382 | 24 | Excellent |
| java-vt | 0.92% | 3,482 | 32 | Excellent |
| quarkus-native | 1.13% | 3,449 | 39 | Excellent |
| python | 1.58% | 259 | 4 | Excellent |
| nodejs | 1.61% | 423 | 7 | Excellent |

All 15 servers achieved a CV (Coefficient of Variation: the ratio of standard deviation to mean, expressed as a percentage) below 2% across the 3 independent runs. A CV below 5% is generally considered excellent for load tests on shared infrastructure. The Redis flush-and-reseed methodology eliminates state drift between servers. The 60-second warmup exclusion eliminates JIT cold-start noise. The result is stable, reproducible rankings suitable for technology selection decisions. P95 latency showed equivalent stability: 13 of 15 servers had a stable P95 trend across runs. Only quarkus-native (+1.06ms) and bun (+2.86ms) showed slight increases. Python was the sole variable entry (range of approximately 30ms across runs), consistent with CPython GIL scheduling variability.
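
The CV metric used throughout this section is straightforward to compute. The per-run values below are hypothetical, chosen to be consistent with rust's reported mean of 4,845 RPS and standard deviation of 2 in Table 6:

```python
from statistics import mean, stdev

def coefficient_of_variation(rps_runs: list[float]) -> float:
    """CV = sample standard deviation / mean, expressed as a percentage."""
    return stdev(rps_runs) / mean(rps_runs) * 100

# Hypothetical per-run throughputs (mean 4,845 RPS, std dev 2)
cv = coefficient_of_variation([4843.0, 4845.0, 4847.0])
assert round(cv, 2) == 0.04
```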

5. Discussion

5.1 Performance Tiers

Tier 1: High-Performance (Rust, Quarkus, Go)
  • Rust: 4,845 RPS, 10.9 MB RAM, CV (Coefficient of Variation) 0.04%. Maximum throughput and minimum resource usage. The json_response fix (PR #683) was merged and shipped in rmcp v0.17.0.
  • Quarkus: 4,739 RPS, 4.04ms avg latency, 8.13ms P95. Best latency of all 15 servers. Requires explicit connection pool tuning.
  • Go: 3,616 RPS, 23.9 MB RAM. Third in throughput, second in memory efficiency, highly stable. Operational simplicity with no JVM dependency.
Tier 2: Good Performance (Java Ecosystem, JVM Variants)
  • Java MVC: 3,540 RPS, 6.13ms avg. Outperforms reactive WebFlux at 50 VUs due to lower scheduling overhead.
  • Java-VT: 3,482 RPS, 9.03ms avg. Virtual Threads operate as designed in I/O-bound workloads.
  • Quarkus-native: 3,449 RPS, 15.92ms P95 (4th best overall), 36 MB RAM. Best native image option.
  • Micronaut: 3,382 RPS. Competitive across all 3 runs.
  • Java-WebFlux: 3,032 RPS. Competitive throughput but high P99 tail (47ms) in checkout.
Tier 3: Native Images with Trade-offs
  • Java-VT-native: 2,447 RPS, 19.06ms avg. Worst native image (VT continuation overhead in AOT).
  • Java-WebFlux-native: 2,413 RPS, 44.17ms P95. High tail latency under sustained write load, compounded by Netty off-heap buffer pressure.
  • Java-native: 2,316 RPS. Stable but high tail latency.
  • Micronaut-native: 2,161 RPS, 233.2% CPU. Likely CPU-throttled by CFS.
These native images remain suitable where startup time and memory footprint outweigh throughput requirements (serverless, edge).
Tier 4: Low Throughput (JavaScript and Python)
  • Bun: 876 RPS. Best JavaScript option, delivering 2.2x the throughput of Node.js on identical code.
  • Node.js: 423 RPS. Appropriate for low-traffic deployments.
  • Python: 259 RPS. Ceiling set by FastMCP session overhead, not the ASGI server.

5.2 Trade-offs Analysis

Table 8: Implementation Trade-offs Matrix
| Dimension | Rust | Quarkus | Go | Java MVC | Java-VT | WebFlux | Bun | Node.js | Python |
|---|---|---|---|---|---|---|---|---|---|
| Peak Throughput | Highest | Very High | High | High | High | Moderate | Low | Very Low | Lowest |
| Latency (avg) | Very Low | Lowest | Low | Low | Low | Low | High | Very High | Highest |
| Latency Tail (P99) | Low | Lowest | Moderate | Low | Moderate | High | High | Very High | Highest |
| Memory Footprint | Lowest | Moderate | Very Low | High | High | Very High | High | High | Moderate |
| CPU Efficiency | Highest | High | Moderate | Moderate | Moderate | Moderate | Low | Very Low | Lowest |
| Ecosystem Maturity | Early | High | High | Highest | High | High | Moderate | High | High |
| SDK Overhead | Patched | Tunable | Standard | Standard | Standard | Standard | Fixed floor | Fixed floor | Fixed floor |

5.3 Consistency and Reliability

The CV below 2% for all 15 servers is exceptional for a benchmark running on a shared cloud VM. The Redis reset methodology eliminates state drift between servers, and the warmup exclusion eliminates JIT noise, so the rankings are stable and can be used with confidence. The only notable anomaly is Python's P95 variability across runs (335-387ms), attributable to CPython runtime variability (GC pressure and GIL scheduling) rather than to network or Redis inconsistency.

6. Recommendations

6.1 Production Deployment Guidance

Use Rust When:
  • Maximum throughput is the primary SLA (4,845 RPS, 41.1 RPS per CPU%)
  • Memory footprint must be minimal (10.9 MB average)
  • Resource cost efficiency matters at scale
  • Team has Rust proficiency
  • rmcp v0.17.0 or later is available on crates.io
Use Quarkus When:
  • Latency SLAs are strict (P95 below 10ms required, achieved 8.13ms)
  • JVM ecosystem tooling and library access are needed
  • Reactive non-blocking I/O is preferred
  • Memory-constrained deployments favor Quarkus-native (36 MB)
  • Team is Java-proficient and comfortable with reactive programming
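The connection pool tuning that the Quarkus entry depends on is ordinary application.properties configuration. The sketch below is illustrative only: the property names are taken from the Quarkus Redis client and REST client configuration guides, and the pool sizes are assumptions, not the values used in this benchmark. Verify names and defaults against your Quarkus version.

```properties
# Redis client pool: defaults are small; size to the expected concurrency (illustrative values)
quarkus.redis.max-pool-size=50
quarkus.redis.max-pool-waiting=100

# REST client connection pool for outbound HTTP API calls (illustrative value)
quarkus.rest-client.connection-pool-size=50
```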
Use Go When:
  • Cloud-native deployment on Kubernetes (23.9 MB RAM, static binary)
  • Operational simplicity is preferred (no JVM, minimal configuration)
  • Resource cost matters (151 RPS per MB RAM)
  • Team uses Go
  • No JVM dependency or startup time constraints exist
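The "static binary, no JVM" point can be made concrete with a typical multi-stage Dockerfile. This is a generic sketch under common Go deployment conventions (paths, Go version, and image names are placeholders), not the benchmark's actual go-server/ Dockerfile:

```dockerfile
# Build stage: compile a fully static binary (CGO disabled)
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /mcp-server .

# Runtime stage: no OS layer, no runtime, just the binary
# (note: scratch has no CA certificates; copy them in if the server makes HTTPS calls)
FROM scratch
COPY --from=build /mcp-server /mcp-server
ENTRYPOINT ["/mcp-server"]
```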
Use Java Spring MVC When:
  • Existing Spring ecosystem and team expertise in Java/Spring
  • Moderate-to-high throughput requirements within the JVM ecosystem
  • Reactive model overhead is not desired
  • Blocking I/O matches the concurrency level

Consider instead: Java-VT for future-proofing at higher concurrency levels (above 100 VUs), where Virtual Threads show greater advantage.
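Moving from the MVC thread-pool model to Virtual Threads requires no code changes on Spring Boot 3.2+; it is a single configuration property, which is why the two variants share the same Spring toolchain:

```properties
# Run Tomcat request handling on virtual threads (Spring Boot 3.2+)
spring.threads.virtual.enabled=true
```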

Use Bun or Node.js When:
  • Team is JavaScript-native
  • Low-to-moderate traffic scenarios where JavaScript development speed is the priority
  • Rapid development cycle is valued
  • Prefer Bun over Node.js when JavaScript is required (2.2x throughput advantage)

Not recommended for: Latency-sensitive or high-load production MCP deployments.

Use Python When:
  • Team is Python-native and Python ML/AI library integration is needed
  • Low-traffic deployments where Python ecosystem integration outweighs performance
  • Development, testing, or prototyping scenarios
  • Integration with existing Python data science tooling outweighs performance requirements

Not recommended for: High-throughput production MCP deployments. The bottleneck is FastMCP session overhead in CPython, not the ASGI server or network I/O.
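The "4 workers with uvloop" configuration referenced throughout corresponds to a standard multi-process ASGI launch. A hedged sketch using uvicorn's real CLI flags; the module path server:app is a placeholder, not the benchmark's actual entry point:

```shell
# Run the ASGI app on 4 worker processes with the uvloop event loop
uvicorn server:app --host 0.0.0.0 --port 8000 --workers 4 --loop uvloop
```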

6.2 Use Case Decision Matrix

Table 10: Deployment Recommendations by Use Case
| Use Case | Recommended | Alternative | Avoid |
|---|---|---|---|
| High-load production deployments | Rust | Quarkus, Go | Python, Node.js |
| Latency SLA P95 < 10ms | Quarkus | Rust, Go | Python, Node.js |
| Kubernetes / cloud-native | Go | Rust, Quarkus-native | Java WebFlux |
| Memory-constrained (< 50 MB) | Rust | Go, Quarkus-native | Java JVM variants |
| Memory-constrained (< 200 MB) | Quarkus-native | Java-native, Micronaut-native | Java-WebFlux |
| Native image preferred | Quarkus-native | Java-native, Micronaut-native | Java-VT-native |
| Java ecosystem, moderate-to-high load | Java MVC | Java-VT, Micronaut | Python |
| Dev / Testing / low-traffic | Python | Node.js, Bun | (none) |
| JavaScript ecosystem required | Bun | Node.js | (none) |
| Java ecosystem, reactive preferred | Java-VT | Java WebFlux | Java-VT-native |
High-Load Production (sustained traffic, latency SLAs critical):
  • Default choice: Rust (highest throughput, lowest memory, CV 0.04%)
  • P95 latency SLA (< 10ms): Quarkus (best latency profile: 4.04ms avg, 8.13ms P95)
  • No JVM dependency: Go (static binary, 23.9 MB RAM, 3,616 RPS)
Moderate Load (Java / JVM ecosystem preferred):
  • Spring ecosystem: Java MVC (thread-pool blocking I/O, 3,540 RPS)
  • Higher concurrency headroom: Java Virtual Threads (Project Loom, same Spring toolchain, 3,482 RPS)
  • Memory-constrained (< 50 MB): Quarkus-native (best native trade-off: 36 MB, 3,449 RPS)
Low Traffic / Dev (rapid iteration, existing team skills):
  • JavaScript runtime: Bun (2.2x faster than Node.js on identical code)
  • JavaScript (max compatibility): Node.js (widest ecosystem, proven tooling)
  • Python / AI-ML integration: Python (FastMCP + uvloop, 259 RPS with 4 workers)
Figure 2: MCP Server Selection Guide — primary choice and alternatives by deployment scenario. See Table 10 for the full use case matrix.

7. Conclusion

This experimental analysis expanded the MCP server benchmark from 4 to 15 implementations, replacing synthetic CPU tools with real Redis and HTTP API workloads. The expansion revealed performance characteristics that were invisible in v1: the critical impact of connection pool configuration (Quarkus 0 RPS without tuning), the JVM vs native image throughput-memory trade-off under I/O load, the significance of runtime choice within the JavaScript ecosystem (Bun 2.2x Node.js), and the realistic production ceiling of optimized Python (259 RPS with 4 workers and uvloop).

The 39.9 million requests processed with 0% errors across all 15 servers validate the methodology's reproducibility. The CV below 2% for every server confirms that the rankings are stable. The data provides a reliable empirical basis for MCP server technology selection decisions.

Key Finding: In I/O-bound workloads representative of production MCP deployments, Rust and Quarkus lead the field at 4,845 and 4,739 RPS respectively, with Quarkus holding the best latency at 4.04ms average and 8.13ms P95. Go remains the optimal choice for teams prioritizing operational simplicity and resource efficiency. The study confirms that GraalVM native images reduce memory at the cost of throughput in sustained I/O workloads, with Quarkus-native as the best-positioned exception.

Summary of Findings:

  • Performance tiers are clearly separated: Rust/Quarkus at 4,700-4,850 RPS, Go/Java cluster at 3,000-3,620 RPS, JS/Python at 250-880 RPS.
  • Native images consistently reduce memory (27-81%) at a 20-36% throughput cost under sustained high load. At low request rates, this throughput regression is not observable. Quarkus-native offers the best trade-off at high load.
  • Classic blocking I/O (Spring MVC) outperforms reactive (WebFlux) at 50 VUs in this I/O-bound workload.
  • Bun delivers 2.2x the throughput of Node.js on identical code, making it the clear choice when the JavaScript ecosystem is required.
  • All 15 servers achieved 0% errors and CV below 2% across 39.9 million requests, validating the methodology's reproducibility.
Recommendations Summary:
  • Production choice (throughput): Rust at 4,845 RPS with 10.9 MB RAM
  • Production choice (latency): Quarkus at 4.04ms avg, 8.13ms P95
  • Resource and operational choice: Go at 23.9 MB RAM and 3,616 RPS
  • Java ecosystem: Spring MVC (blocking) at 3,540 RPS for strong throughput with operational simplicity. Java-VT for future-proofing at higher concurrency levels.
  • JavaScript ecosystem: Bun over Node.js (2.2x throughput advantage)
  • Python: Appropriate for low-traffic deployments and Python-native teams. The ceiling is FastMCP session overhead in CPython, not the ASGI server.

Future Work: Higher concurrency levels (100-200 VUs) to identify saturation points. Persistent session benchmarks. Multi-instance Kubernetes deployments with session affinity. Rust with native compilation using rmcp v0.17.0.

8. References and Resources

  1. MCP Streamable HTTP Specification (2025). Model Context Protocol: Streamable HTTP Transport. https://modelcontextprotocol.io/specification/2025-06-18/basic/transports
  2. Mendes, T. (2026). rmcp SDK PR #683: json_response support for stateless HTTP transport. https://github.com/modelcontextprotocol/rust-sdk/pull/683
  3. Quarkus Team. (2025). Quarkus REST Client Reactive: Configuration Reference. https://quarkus.io/guides/rest-client-reactive
  4. OpenJDK. (2023). JEP 444: Virtual Threads. https://openjdk.org/jeps/444
  5. Oracle. (2025). GraalVM Native Image Documentation. https://www.graalvm.org/latest/reference-manual/native-image/
  6. FastMCP Contributors. (2025). FastMCP: Running a FastMCP Server. https://gofastmcp.com/deployment/running-server
  7. Grafana Labs. (2025). k6 Load Testing Documentation. https://k6.io/docs/
  8. deadpool-redis contributors. (2025). deadpool-redis crate documentation. https://docs.rs/deadpool-redis
  9. Bun Team. (2025). Bun JavaScript Runtime. https://bun.sh

9. Appendix

9.1 Raw Data and Complete Results

All raw benchmark data, including detailed results from all three runs, per-tool latency breakdowns, Docker stats logs, and k6 output files are available in the project repository:

The benchmark/results/ directory contains timestamped result sets:

  • summary.json: aggregated metrics across all servers
  • [server]/k6.json: detailed k6 metrics for each server
  • [server]/stats.json: Docker resource usage statistics

9.2 Server Implementations

Complete source code for all 15 MCP server implementations:

  • rust-server/: rmcp 0.17.0 implementation with json_response: true
  • quarkus-server/: Quarkus MCP server (JVM and native Dockerfiles)
  • go-server/: mcp-go implementation
  • java-server/: Spring Boot MVC implementation
  • java-vt-server/: Spring Boot Virtual Threads implementation
  • java-webflux-server/: Spring Boot WebFlux implementation
  • micronaut-server/: Micronaut MCP server (JVM and native Dockerfiles)
  • nodejs-server/: Node.js Express implementation
  • python-server/: FastMCP + Starlette implementation

9.3 Benchmark Suite

  • benchmark/benchmark.js: k6 load testing script
  • benchmark/run_benchmark.sh: automated benchmark orchestration
  • benchmark/collect_stats.py: Docker stats collection
  • benchmark/consolidate.py: results aggregation