MCP Server Performance Benchmark v2: 15 Implementations, I/O-Bound Workloads

An expanded benchmark covering 15 MCP server implementations across Rust, Java (Spring MVC, WebFlux, Virtual Threads), Quarkus, Micronaut (JVM and GraalVM native images), Go, Bun, Node.js, and Python. Three independent runs totaling 39.9 million requests with real Redis and HTTP API I/O workloads and 0% error rate across all implementations.

Research Context and Scope

This benchmark evaluates a specific question: how do different language runtimes and frameworks perform when implementing MCP servers under I/O-bound workloads at 50 concurrent virtual users? The results reflect this test scenario and should be interpreted within that context. They do not constitute a general ranking of programming languages or frameworks.

MCP is a nascent protocol and its ecosystem (SDKs, tooling, server frameworks) is evolving rapidly. Several SDKs used here are pre-release or early-access (Micronaut MCP SDK 0.0.19). Results may differ with future SDK versions, different concurrency levels, or alternative workload patterns.

The intent of this research is constructive contribution to the MCP ecosystem. No technology evaluated here is unsuitable by nature: each serves real use cases and the communities behind them continue to evolve their implementations. A lower benchmark rank reflects performance under this specific workload, not a judgment of quality or fitness for purpose. We welcome corrections, alternative configurations, and contributions via the benchmark repository.

Abstract

This experiment presents a comprehensive performance analysis of 15 Model Context Protocol (MCP) server implementations spanning Rust, Java (Spring MVC, WebFlux, Virtual Threads), Quarkus, Micronaut (JVM and GraalVM native images), Go, Bun, Node.js, and Python. Three independent runs totaling 39.9 million requests were executed with I/O-bound workloads (Redis + HTTP API), achieving a 0% error rate across all 15 servers in all 3 runs.

Key Findings: Rust leads throughput at 4,845 RPS with only 10.9 MB RAM and a CV of 0.04%. Quarkus leads latency at 4.04ms average and 8.13ms P95. Go and Java remain competitive at 3,616 and 3,540 RPS respectively. Classic blocking I/O (Spring MVC) outperforms reactive WebFlux at 50 VUs. GraalVM native images uniformly cut memory (27-81%) at the cost of a 20-36% throughput regression, with Quarkus-native as the best native trade-off. Bun delivers 2.1x the RPS of Node.js on identical application code. Python with 4 workers and uvloop reaches 259 RPS; the bottleneck is FastMCP session overhead, not the ASGI server.

Recommendations: For high-load production deployments, Rust offers unmatched throughput and resource efficiency. Quarkus is the optimal choice when latency SLAs are primary. Go provides an excellent balance of performance, memory, and operational simplicity. Java Spring MVC (blocking) remains a strong Tier 2 choice. Native images are justified where memory is constrained and throughput requirements are moderate (below 3,500 RPS). JavaScript and Python runtimes are better suited for low-to-moderate traffic MCP deployments.

Keywords: MCP, Model Context Protocol, Performance Benchmark, Rust, Quarkus, GraalVM Native Image, Virtual Threads, Spring WebFlux, Micronaut, Bun, Go, Java, Node.js, Python, k6, Redis, Load Testing, Streamable HTTP

1. Introduction and Motivation

The MCP ecosystem is growing rapidly. Organizations adopting MCP servers face a widening set of implementation trade-offs: native compilation via GraalVM, reactive vs blocking I/O models within the JVM, alternative JavaScript runtimes, and the impact of real external I/O workloads on framework-level decisions. v2 was designed to surface these trade-offs empirically.

The v1 benchmark drew valid community feedback: no Quarkus, no GraalVM native images, no Virtual Threads, no reactive frameworks, no Micronaut, no Rust, and Python running in a single-worker default configuration. Version 2 is the direct experimental response to those criticisms, expanding from 4 to 15 implementations and replacing synthetic CPU tools with real Redis and HTTP API workloads.

Research Questions:
  • How do JVM-based implementations compare across blocking, reactive, and virtual thread concurrency models?
  • What are the throughput and latency trade-offs between JVM and GraalVM native images under I/O-bound load?
  • Does Rust belong in the MCP server performance conversation?
  • Can Bun's JavaScriptCore runtime change the JavaScript performance story for MCP servers?
  • What is the realistic production ceiling for optimized Python (multi-worker + uvloop)?

2. Experimental Setup

2.1 From v1 to v2: What Changed

v1 vs v2 Scope Comparison
| Dimension | v1 | v2 |
|---|---|---|
| Servers | 4 | 15 |
| Java variants | 1 (Spring Boot + Spring AI) | 6 (Spring MVC, WebFlux, VT + 3 native images) |
| New runtimes | None | Rust, Quarkus, Micronaut, Bun |
| Tools | 4 synthetic (fibonacci, HTTP, JSON, sleep) | 3 I/O-bound (Redis + HTTP API) |
| Total requests | ~3.9M | ~39.9M |
| Runs | 3 rounds | 3 full independent runs |
| CPU per container | 1.0 vCPU | 2.0 vCPUs |
| Memory per container | 1 GB | 2 GB |
Workload Design Rationale: Synthetic tools stress isolated performance dimensions (pure CPU, artificial delay). I/O-bound tools are representative of how MCP servers operate in production: fetching data from external services and reading/writing persistent state. The v2 workload makes framework-level I/O handling decisions visible in the results in a way that synthetic tools cannot.

2.2 Test Environment

Test Environment Specification
| Component | Specification |
|---|---|
| Host | Microsoft Azure VM, 8 vCPUs, 32 GB RAM |
| OS | Ubuntu 24.04 LTS |
| Container Runtime | Docker with Docker Compose |
| CPU Limit (per MCP server) | 2.0 vCPUs |
| Memory Limit (per MCP server) | 2 GB |
| Infrastructure | Redis 7 Alpine (0.5 vCPU / 512 MB) + Go API service (2 vCPUs / 2 GB) |
| Network | Docker bridge (inter-container, localhost) |
| Test Runs | 3 independent runs (February 27-28, 2026) |

2.3 Server Implementations

Server Implementations
| Server | Framework / SDK | Runtime |
|---|---|---|
| rust | rmcp 0.17.0 | Rust / Tokio |
| quarkus | Quarkus 3.31.4 MCP Server SDK | Java 21 / Vert.x |
| go | mcp-go | Go 1.23 |
| java | Spring Boot 4 MVC + Spring AI | Java 21 |
| java-vt | Spring Boot 4 + Project Loom | Java 25 |
| java-webflux | Spring Boot 4 WebFlux + Spring AI | Java 21 / Netty |
| micronaut | Micronaut 4.10.8 / MCP SDK 0.0.19 | Java 21 |
| quarkus-native | Quarkus native image | GraalVM 23 / native |
| java-native | Spring Boot native image | GraalVM 25 / native |
| java-vt-native | Spring Boot VT native image | GraalVM 25 / native |
| java-webflux-native | Spring Boot WebFlux native | GraalVM 25 / native |
| micronaut-native | Micronaut native image | GraalVM 23 / native |
| bun | Express + MCP SDK | Bun 1 / JavaScriptCore |
| nodejs | Express + MCP SDK | Node.js 22 / V8 |
| python | FastMCP + Starlette | Python 3.11 / CPython |
Production-Representative Configurations: Each server uses production-appropriate settings: connection pools tuned, workers configured, json_response enabled where supported. This follows the same "Standard DX" principle as v1, extended to include configurations that would survive real production load rather than default out-of-box settings.

2.4 Benchmark Tools and Workload

Each server implements three identical tools performing I/O-bound operations against a Redis instance and an HTTP API service (100,000 products in-memory):

Benchmark Tools and I/O Operations
| Tool | I/O Operations | Performance Dimension |
|---|---|---|
| search_products | HTTP GET /products/search + Redis ZRANGE (parallel) | Parallel async I/O, HTTP client pool |
| get_user_cart | Redis HGETALL, then HTTP GET /products/{id} + Redis LRANGE (parallel) | Sequential + parallel I/O, Redis read patterns |
| checkout | HTTP POST /cart/calculate + Redis pipeline INCR+RPUSH+ZADD (parallel) | Write throughput, Redis pipeline efficiency |
JavaScript
// k6 Load Profile — v2
export const options = {
    stages: [
        { duration: '10s', target: 50 },  // Ramp-up to 50 VUs
        { duration: '5m', target: 50 },   // Sustained load
        { duration: '10s', target: 0 },   // Ramp-down
    ],
    thresholds: {
        'http_req_failed': ['rate<0.05'],
    },
};
// First 60 seconds excluded from metrics (WARMUP_SECONDS=60)
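
The warmup cutoff noted above can be sketched as a simple filter over timestamped samples (an illustrative sketch only; the actual exclusion is applied by the benchmark harness via WARMUP_SECONDS):

```python
WARMUP_SECONDS = 60  # matches the k6 profile above

def exclude_warmup(samples, warmup=WARMUP_SECONDS):
    """Drop metric samples recorded during the warmup window.

    `samples` is a list of (elapsed_seconds, value) tuples; only samples
    at or after the cutoff contribute to the reported statistics.
    """
    return [(t, v) for t, v in samples if t >= warmup]
```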

2.5 Test Methodology

graph LR
    A[Redis Flush] --> B[Redis Seed]
    B --> C[Stop MCP Servers]
    C --> D[Start Target Server]
    D --> E[Health Check]
    E --> F[Warmup 60s]
    F --> G[k6 Run 5min]
    G --> H[Stats Collection]
    H --> I[Consolidate Results]

    style D fill:#1e293b,stroke:#6366f1
    style G fill:#1e293b,stroke:#10b981
                    

Figure 1: Per-server test cycle ensuring isolation and reproducibility

  1. Redis Isolation: Flush and re-seed before each server to ensure identical data state across all 15 servers.
  2. Server Isolation: Only one MCP server running during each test period, eliminating resource contention.
  3. Warmup: 60 seconds of real tool calls excluded from metrics to allow JIT compilation on all code paths (5 init sessions + 9 tool call sessions per server).
  4. Sustained Load: 50 VUs, 5 minutes, all 3 tools called in rotation. VU N uses user-(N%1000), providing 50 distinct users.
  5. Parallel Stats Collection: Docker stats sampled alongside k6 metrics for CPU and memory.
Run Timestamps:
  • First run: February 27, 2026 at 21:08:47 UTC
  • Second run: February 27, 2026 at 23:34:47 UTC
  • Third run: February 28, 2026 at 00:00:47 UTC

3. Implementation Details

3.1 Rust (rmcp SDK)

SDK Bug Discovered During Benchmarking: SSE Hardcoded in rmcp v0.16

During our benchmarking, we identified an anomalous fixed latency on every tool that performed HTTP calls. In an isolated run with rmcp v0.16 defaults, the search_products tool (pure Redis path) ran at 1.11ms avg, while get_user_cart and checkout were stuck at 40.84ms regardless of actual I/O time. We traced the cause to rmcp v0.16, which hardcodes text/event-stream for all responses, even stateless request-response exchanges. Every response carries chunked transfer-encoding, SSE framing, and keep-alive pings: approximately 40ms of pure transport overhead per request on any tool that returns an HTTP-originated payload. The MCP spec explicitly permits application/json for stateless responses. rmcp was simply not implementing that path.
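
To make the transport difference concrete, the sketch below contrasts the two response shapes for the same JSON-RPC payload. This is illustrative only: the exact SSE framing bytes are an assumption, and the measured ~40ms came from chunked delivery and keep-alive pings rather than payload size.

```python
import json

def json_body(message: dict) -> bytes:
    # Stateless path (permitted by the MCP spec): the JSON-RPC response is
    # the entire HTTP body, Content-Type: application/json, one plain write.
    return json.dumps(message).encode()

def sse_frame(message: dict) -> bytes:
    # SSE path: the same payload wrapped in text/event-stream framing and
    # delivered over a chunked, kept-alive stream (framing bytes illustrative).
    return b"event: message\ndata: " + json.dumps(message).encode() + b"\n\n"

msg = {"jsonrpc": "2.0", "id": 1, "result": {"content": []}}
assert len(sse_frame(msg)) > len(json_body(msg))  # SSE always adds framing bytes
```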

This issue is fixed in rmcp v0.17.0, released on February 27, 2026, which ships the json_response option as an official feature. The benchmark was run with the equivalent patch applied to v0.16. The server implementation has since been updated to use rmcp 0.17.0 from crates.io directly, with no local patch required.

Rust: Impact of the json_response Patch (identical setup, only the patch differs)
| Configuration | RPS | Avg Latency | Tool Breakdown |
|---|---|---|---|
| Without json_response patch (SSE default) | 1,283 | 27.59 ms | search_products: 1.11 ms / get_user_cart: 40.84 ms / checkout: 40.84 ms |
| With json_response patch | 4,845 | 5.09 ms | search_products: 6.12 ms / get_user_cart: 5.63 ms / checkout: 3.51 ms |
Rust
// StreamableHttpServerConfig with json_response patch
StreamableHttpServerConfig {
    stateful_mode: false,
    json_response: true,   // returns application/json directly instead of SSE
    ..Default::default()
}

// New branch in tower.rs (simplified)
if self.config.json_response {
    let cancel = self.config.cancellation_token.child_token();
    match tokio::select! {
        res = receiver.recv() => res,
        _ = cancel.cancelled() => None,
    } {
        Some(message) => {
            let body = serde_json::to_vec(&message)?;
            Ok(Response::builder()
                .header(CONTENT_TYPE, JSON_MIME_TYPE)
                .body(Full::new(Bytes::from(body)).boxed()))
        }
        None => Err(internal_error_response("empty response")(...))
    }
}
Fix Merged Upstream: PR #683 shipped in rmcp v0.17.0

We submitted the fix to the official modelcontextprotocol/rust-sdk repository as PR #683. The patch adds json_response: bool to StreamableHttpServerConfig, backwards-compatible by default (false preserves the original SSE path unchanged). It was refined during code review (maintainers suggested tokio::select! with cancellation safety and tracing::info! logging), merged, and shipped as part of rmcp v0.17.0 on February 27, 2026.

Key Characteristics:

  • rmcp 0.17.0 (official release) with json_response: true in StreamableHttpServerConfig
  • Tokio async runtime with deadpool-redis connection pool (pool size 100)
  • Parallel tool handlers via tokio::spawn
  • Redis pipeline combines 3 write operations in checkout into a single RTT
  • 10.9 MB average RAM (lowest of all 15 servers)
  • CV 0.04% (most stable across all 3 runs)

3.2 Quarkus (Vert.x / Mutiny)

Default Pool Limits Caused Catastrophic Failure (near-zero RPS): Quarkus REST Client Reactive ships with a conservative connection pool (default approximately 50 connections per host). Under 50 VUs with parallel I/O, the pool exhausted instantly and all requests queued until the 30-second timeout, yielding less than 1 effective RPS. Other frameworks passed the same test at 3,000+ RPS.
Properties
quarkus.rest-client.api-service.connection-pool-size=1000
quarkus.rest-client.api-service.keep-alive-enabled=true
quarkus.redis.max-pool-size=100
quarkus.redis.max-pool-waiting=1000

Key Characteristics:

  • Reactive Vert.x/Mutiny event loop model, non-blocking I/O throughout
  • Lowest CPU usage among all Java frameworks (161.7% avg, vs 200%+ for others)
  • 194.5 MB RAM (lowest footprint among the JVM-based Java variants)
  • Best latency of all 15 servers: 4.04ms avg, 8.13ms P95, 11.16ms P99

3.3 Go (mcp-go)

Go
var httpClient = &http.Client{
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 100,
        IdleConnTimeout:     90 * time.Second,
    },
    Timeout: 10 * time.Second,
}
Why Pool Tuning Mattered: http.DefaultClient uses MaxIdleConnsPerHost=2. Under 50 VUs, goroutines contended for two idle keep-alive connections, causing TCP connection churn and P95 latency spikes of 61ms. Replacing it with a tuned transport reduced P95 to 17.62ms.

Key Characteristics:

  • Goroutine-per-request concurrency model with net/http standard library
  • 23.9 MB average RAM (second lowest after Rust)
  • Static binary, no runtime dependency
  • CV 0.19% (third most stable overall)

3.4 Java (Spring MVC, Blocking I/O)

Blocking Outperforms Reactive at 50 VUs: With 50 VUs and fast I/O (approximately 6ms avg), the Tomcat thread pool never exhausts. Sequential blocking code eliminates reactive scheduling overhead. Java MVC (3,540 RPS) outperforms Java WebFlux (3,032 RPS) in this workload. The blocking model is appropriate when the concurrency level is lower than the thread pool capacity.
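
The claim that the pool never exhausts can be checked with Little's law (L = λ·W): at the measured throughput and latency, the average number of in-flight requests stays far below Tomcat's default pool of 200 worker threads.

```python
def concurrent_requests(rps: float, avg_latency_s: float) -> float:
    """Little's law: average in-flight requests L = arrival rate λ × latency W."""
    return rps * avg_latency_s

# Java MVC measured averages from this benchmark: 3,540 RPS at 6.13 ms
in_flight = concurrent_requests(3540, 0.00613)  # ≈ 21.7 requests in flight
TOMCAT_DEFAULT_MAX_THREADS = 200                # Tomcat's default worker pool size
assert in_flight < TOMCAT_DEFAULT_MAX_THREADS   # roughly 9x headroom at 50 VUs
```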

Key Characteristics:

  • Spring Boot 4 MVC with Tomcat thread pool executor
  • Sequential blocking I/O via RestClient and StringRedisTemplate per request
  • No CompletableFuture or reactive primitives
  • 368.1 MB average RAM (standard Spring Boot JVM footprint)

3.5 Java Virtual Threads (Project Loom)

Virtual Threads Design Intent: Virtual Threads (Project Loom, stable since Java 21) are designed for I/O-bound workloads where threads spend most of their time waiting on external resources. When a Virtual Thread blocks on I/O, the JVM unmounts it from its carrier platform thread, freeing that thread to run another Virtual Thread. This allows high concurrency without requiring explicit async programming. In CPU-bound workloads, Virtual Threads add scheduler overhead without throughput benefit. In this benchmark's I/O-bound workload (Redis + HTTP API calls), Virtual Threads operate as designed and deliver competitive 3,482 RPS.
java-vt-native: Worst Native Image Configuration: The java-vt-native image is the worst-performing native image (19.06ms avg latency, 2,447 RPS). GraalVM's AOT-compiled VT continuation mechanism adds overhead that the JIT would otherwise optimize at runtime. Combining VT scheduling overhead with AOT compilation is the least efficient native image configuration in this benchmark.

Key Characteristics:

  • spring.threads.virtual.enabled=true (canonical Spring Boot 4 configuration)
  • Java 25 selected to include JEP 491 (delivered in Java 24): Virtual Threads can now acquire and release synchronized monitors without pinning the carrier thread. Before this fix, any synchronized block inside the call stack (including Spring internals) would pin the Virtual Thread to its OS carrier thread for the duration, eliminating the concurrency benefit for I/O-bound code.
  • Same RestClient and StringRedisTemplate as Java MVC
  • 349.7 MB RAM (slightly higher than MVC due to Virtual Thread scheduler metadata)
  • Competitive throughput at 3,482 RPS

3.6 Java WebFlux (Reactor / Netty)

Netty Off-Heap Memory: WebFlux uses Netty as its HTTP server instead of Tomcat. Netty allocates I/O buffers outside the JVM heap (off-heap / direct memory) via PooledByteBufAllocator. These buffers are invisible to -Xmx and persist in GraalVM native images. This explains the high memory peak (663 MB max) and why java-webflux-native also has elevated memory (351 MB avg) despite native compilation.

Key Characteristics:

  • Reactor event loop model, non-blocking I/O throughout
  • Competitive average latency (8.89ms avg) but high P99 tail in checkout (47ms)
  • 484.6 MB average RAM (highest among JVM variants, due to Netty off-heap buffers)

3.7 Micronaut

GraalVM 23 Required for micronaut-native: The Dockerfile uses graalvm/native-image-community:23. The previous GraalVM version could not fully inline Micronaut's annotation-driven dispatch at compile time, causing a -49% RPS regression vs JVM in earlier testing (worst of any stack). GraalVM 23 improves closed-world analysis for annotation metadata, reducing the regression significantly.
micronaut-native CPU Throttling: CPU consistently exceeds 200% (avg 233.2%), above the 2-vCPU container limit. Docker CFS burst behavior allows short-window overages before throttling. The reported throughput (2,161 RPS) may be suppressed by CFS quota enforcement. On dedicated hardware, performance could differ.

Key Characteristics:

  • Micronaut MCP Server SDK 0.0.19 (pre-release)
  • Compile-time dependency injection and annotation processing
  • Netty server (same off-heap memory pattern as WebFlux)
  • JVM variant competitive at 3,382 RPS

3.8 GraalVM Native Images: Cross-Stack Analysis

Five server stacks were benchmarked in both JVM and GraalVM native image configurations, revealing consistent trade-off patterns across all stacks.

GraalVM Native Image: The Core Trade-off: 27-81% memory reduction at the cost of a 20-36% throughput regression.
JVM vs Native Image: Throughput and Memory Trade-offs
| Stack | JVM RPS | Native RPS | RPS Regression | JVM RAM avg | Native RAM avg | RAM Saving |
|---|---|---|---|---|---|---|
| Quarkus | 4,739 | 3,449 | -27% | 194 MB | 36 MB | -81% |
| Java MVC | 3,540 | 2,316 | -35% | 368 MB | 178 MB | -52% |
| Java VT | 3,482 | 2,447 | -30% | 350 MB | 194 MB | -44% |
| WebFlux | 3,032 | 2,413 | -20% | 485 MB | 351 MB | -28% |
| Micronaut | 3,382 | 2,161 | -36% | 216 MB | 63 MB | -71% |
The Trade-off Pattern: In I/O-bound workloads under sustained load, JIT-compiled code consistently outperforms AOT. The JIT observes actual runtime call patterns and optimizes hot paths adaptively at runtime. AOT compiles conservatively at build time without that observability. The throughput regression from native images (20-36%) becomes measurable only when the server is under significant and sustained load. At low or intermittent request rates, the difference is negligible and the memory savings dominate. The benefits of native images (fast startup, predictable memory footprint) do not translate to throughput advantages in sustained high-load I/O scenarios. Quarkus-native is the strongest exception: it maintains competitive P95 latency (15.92ms, 4th best overall) despite the RPS regression.
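
The regression and saving percentages in the table above reduce to a one-line calculation over the measured JVM and native averages:

```python
def pct_change(jvm: float, native: float) -> int:
    """Signed percentage change when moving from the JVM build to the native image."""
    return round((native - jvm) / jvm * 100)

# Quarkus row: throughput regression and RAM saving
assert pct_change(4739, 3449) == -27   # RPS: -27%
assert pct_change(194, 36) == -81      # RAM: -81%
```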

3.9 Node.js and Bun

SDK Per-Request Instantiation: An Intentional Design Constraint

Creating a new McpServer and StreamableHTTPServerTransport per HTTP request is the intentional design for stateless MCP servers. The SDK throws "Stateless transport cannot be reused across requests" on any reuse attempt. This sets a fixed 5-10ms overhead floor per request that no framework swap can eliminate.

Experiments attempted and reverted: undici pool with connections: 100 regressed RPS by 28%. Hono + WebStandardStreamableHTTPServerTransport cost 23% RPS due to IncomingMessage-to-Request adaptation overhead on Node.js. Express with StreamableHTTPServerTransport is retained.

JavaScript
// Cluster entry (WEB_CONCURRENCY=4 worker processes)
import cluster from 'cluster';
import express from 'express';
import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamableHttp.js';

const WORKERS = parseInt(process.env.WEB_CONCURRENCY || '1');
if (cluster.isPrimary) {
    for (let i = 0; i < WORKERS; i++) cluster.fork();
} else {
    const app = express();
    app.use(express.json());
    // Per-request McpServer (SDK design constraint — cannot be reused)
    app.post('/mcp', async (req, res) => {
        const server = createMcpServer();   // local factory registering the 3 tools
        const transport = new StreamableHTTPServerTransport({ sessionIdGenerator: undefined });
        await server.connect(transport);
        await transport.handleRequest(req, res, req.body);
    });
    app.listen(3000);
}
Bun vs Node.js: Same Code, 2.1x Throughput: Both containers run the exact same index.js with WEB_CONCURRENCY=4. Bun's JavaScriptCore JIT and native fetch() deliver 876 RPS vs Node.js's 423 RPS (a 2.1x ratio). Memory cost: Bun uses 540.8 MB vs Node.js's 389.2 MB (+152 MB). Both runtimes saturate 2 vCPUs at approximately 200% with 4 workers.

3.10 Python (FastMCP + uvloop)

v1 Criticism Addressed: v1 ran Python with a single-worker uvicorn. v2 runs 4 workers with uvloop, directly responding to community feedback. The json_response=True option was already present in v1.
Python
# Launch command
# uvicorn main:app --host 0.0.0.0 --port 8082 --workers 4 --loop uvloop

from contextlib import asynccontextmanager

import httpx
from mcp.server.fastmcp import FastMCP  # official MCP Python SDK

# FastMCP configuration: json_response=True eliminates SSE framing overhead
mcp = FastMCP("BenchmarkPythonServer", stateless_http=True, json_response=True)

# Shared HTTP client (avoids per-request TCP pool creation)
@asynccontextmanager
async def lifespan(app):
    global _http_client
    _http_client = httpx.AsyncClient(timeout=10.0)
    async with mcp.session_manager.run():
        yield
    await _http_client.aclose()
Granian Experiment: ASGI Server is Not the Bottleneck: A Rust-based ASGI server (Granian with Tokio I/O) was tested as a replacement for uvicorn. Result: -12% RPS. The bottleneck is not the ASGI server layer. It is FastMCP's per-request MCP session processing in CPython. The ASGI server choice is irrelevant to Python MCP performance.

Key Characteristics:

  • GIL limits true parallelism within each worker process
  • 4 workers x approximately 65 RPS per worker = 259 RPS total
  • 258.6 MB average RAM
  • The bottleneck is FastMCP session overhead in CPython, not uvicorn or network I/O

4. Results and Analysis

4.1 Overall Performance Metrics

Perfect Reliability: 39,958,616 total requests across 15 servers and 3 independent runs. Error rate: 0% for every server in every run. All 15 implementations demonstrated robust MCP protocol compliance under sustained load.
Table 1: Performance Metrics Summary, All 15 Servers (Average of 3 Runs, sorted by RPS descending)
| Server | RPS (avg) ▼ | Avg Latency | P95 Latency | Requests Served (3 runs) |
|---|---|---|---|---|
| rust | 4,845 | 5.09 ms | 10.99 ms | 4,724,624 |
| quarkus | 4,739 | 4.04 ms | 8.13 ms | 4,620,520 |
| go | 3,616 | 6.87 ms | 17.62 ms | 3,525,424 |
| java | 3,540 | 6.13 ms | 13.71 ms | 3,452,064 |
| java-vt | 3,482 | 9.03 ms | 18.43 ms | 3,395,384 |
| quarkus-native | 3,449 | 10.36 ms | 15.92 ms | 3,362,784 |
| micronaut | 3,382 | 9.75 ms | 17.00 ms | 3,297,208 |
| java-webflux | 3,032 | 8.89 ms | 27.48 ms | 2,956,424 |
| java-vt-native | 2,447 | 19.06 ms | 36.82 ms | 2,385,880 |
| java-webflux-native | 2,413 | 14.43 ms | 44.17 ms | 2,353,056 |
| java-native | 2,316 | 16.20 ms | 42.44 ms | 2,258,592 |
| micronaut-native | 2,161 | 20.75 ms | 36.94 ms | 2,107,080 |
| bun | 876 | 48.46 ms | 98.50 ms | 853,736 |
| nodejs | 423 | 123.50 ms | 200.07 ms | 412,888 |
| python | 259 | 251.62 ms | 342.41 ms | 252,952 |
Reading CPU Percentages: Docker stats reports CPU usage as a percentage of a single logical CPU core. Each MCP server container was allocated 2.0 vCPUs, so 200% represents full utilization of both allocated cores. Values above 200% (most notably micronaut-native at 233.2%) occur because Docker CFS (Completely Fair Scheduler) allows short burst windows that temporarily exceed the configured quota before throttling is applied. These bursts are real CPU consumption, not a measurement artifact.
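
Normalizing the docker stats figure against the container quota makes the burst behavior explicit (a small sketch using the measured averages from this benchmark):

```python
def quota_utilization(docker_cpu_pct: float, vcpus: float = 2.0) -> float:
    """Fraction of the container's CPU quota in use.

    Docker stats reports percent of one logical core, so a 2-vCPU
    container is saturated at 200%; values above 1.0 indicate CFS
    burst windows that temporarily exceed the quota.
    """
    return docker_cpu_pct / (vcpus * 100)

assert quota_utilization(200.0) == 1.0   # both allocated cores fully used
assert quota_utilization(233.2) > 1.0    # micronaut-native bursts past its quota
assert quota_utilization(117.9) < 0.6    # rust leaves significant headroom
```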
Table 2: Resource Utilization, All 15 Servers (sorted by RAM ascending)
| Server | CPU avg (%) | RAM avg (MB) | Error Rate |
|---|---|---|---|
| rust | 117.9% | 10.9 MB | 0% |
| quarkus | 161.7% | 194.5 MB | 0% |
| java-vt | 184.9% | 349.7 MB | 0% |
| micronaut | 190.5% | 216.3 MB | 0% |
| go | 209.9% | 23.9 MB | 0% |
| java | 206.0% | 368.1 MB | 0% |
| quarkus-native | 204.6% | 36.1 MB | 0% |
| java-webflux | 207.2% | 484.6 MB | 0% |
| java-vt-native | 211.7% | 193.8 MB | 0% |
| java-webflux-native | 204.4% | 351.2 MB | 0% |
| java-native | 202.8% | 178.2 MB | 0% |
| micronaut-native | 233.2% | 63.0 MB | 0% |
| bun | 205.8% | 540.8 MB | 0% |
| nodejs | 202.2% | 389.2 MB | 0% |
| python | 206.6% | 258.6 MB | 0% |

4.2 Latency Analysis

Average latency measurements reveal four distinct performance tiers. The top tier (Rust, Quarkus, Go, Java) operates in the 4-7ms range with I/O-bound workloads. This contrasts sharply with the v1 sub-millisecond averages, which reflected synthetic CPU-bound tools. The v2 numbers represent realistic production latency with actual Redis and HTTP network round-trips.

Average Latency — All 15 Servers (ms, sorted ascending within tier)

  • Tier 1 (Quarkus, Rust, Go): quarkus 4.04 ms, rust 5.09 ms, go 6.87 ms
  • Tier 2 (Java JVM variants): java 6.13 ms, java-webflux 8.89 ms, java-vt 9.03 ms, micronaut 9.75 ms, quarkus-native 10.36 ms
  • Tier 3 (Native images): java-webflux-native 14.43 ms, java-native 16.20 ms, java-vt-native 19.06 ms, micronaut-native 20.75 ms
  • Tier 4 (JS / Python): bun 48.46 ms, nodejs 123.50 ms, python 251.62 ms
Latency Context: The v1 top performers (Java, Go) reached sub-millisecond averages on CPU-bound synthetic tools. In v2, the same servers operate at 5-7ms average because tools now perform I/O-bound network operations (2 Redis operations + 1 HTTP call per request). This is intentional: the v2 numbers reflect realistic production latency, not synthetic minimums.

4.3 Throughput Comparison

Throughput measurements reveal three clear clusters: Rust and Quarkus at 4,700-4,850 RPS, the Java/Go cluster at 3,000-3,620 RPS, and the JS/Python group at 250-880 RPS. The gap between Tier 1 and Tier 4 is approximately 19x (Rust vs Python). Within the Java ecosystem, the spread from micronaut-native (2,161 RPS) to quarkus (4,739 RPS) illustrates the significant impact of framework, concurrency model, and compilation strategy.

Throughput Stability: Rust achieved a CV (Coefficient of Variation) of 0.04% (most consistent across runs). Node.js achieved 1.61% (least consistent, but still excellent). All 15 servers achieved CV below 2% across 3 independent runs, confirming that the rankings are reliable for technology selection decisions.

4.4 Resource Efficiency

Table 4: CPU and Memory Efficiency
| Server | RPS / CPU% | RPS / MB RAM | CPU Efficiency Rank |
|---|---|---|---|
| rust | 41.1 | 444.5 | 1st |
| quarkus | 29.3 | 24.4 | 2nd |
| java-vt | 18.8 | 10.0 | 3rd |
| micronaut | 17.7 | 15.6 | 4th |
| go | 17.2 | 151.3 | 5th |
| java | 17.2 | 9.6 | 5th |
| quarkus-native | 16.9 | 95.5 | 7th |
| java-webflux | 14.6 | 6.3 | 8th |
| java-webflux-native | 11.8 | 6.9 | 9th |
| java-vt-native | 11.6 | 12.6 | 10th |
| java-native | 11.4 | 13.0 | 11th |
| micronaut-native | 9.3 | 34.3 | 12th |
| bun | 4.3 | 1.6 | 13th |
| nodejs | 2.1 | 1.1 | 14th |
| python | 1.3 | 1.0 | 15th |
Rust's Resource Advantage: At 4,845 RPS on 117.9% CPU and 10.9 MB RAM, Rust uses approximately 44x less memory than Java WebFlux (484.6 MB) while delivering 60% more throughput. In a 100-instance deployment, this translates to roughly 47 GB less RAM consumed. Go is the second-best option for memory-constrained deployments at 23.9 MB and 151.3 RPS per MB.
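
The efficiency figures in Table 4 derive directly from the measured averages in Tables 1 and 2:

```python
def efficiency(rps: float, cpu_pct: float, ram_mb: float) -> tuple[float, float]:
    """Return (RPS per CPU percentage point, RPS per MB of RAM), rounded."""
    return round(rps / cpu_pct, 1), round(rps / ram_mb, 1)

# Measured averages: rust (4,845 RPS, 117.9% CPU, 10.9 MB)
#                    go   (3,616 RPS, 209.9% CPU, 23.9 MB)
assert efficiency(4845, 117.9, 10.9) == (41.1, 444.5)
assert efficiency(3616, 209.9, 23.9) == (17.2, 151.3)
```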

4.5 Tool-Specific Performance

Table 5 breaks down average latency (ms, consolidated across 3 runs) per tool, with each panel sorted independently to show which server leads for that specific operation.

search_products

HTTP GET + Redis ZRANGE (parallel)

| Server | Avg (ms) |
|---|---|
| quarkus | 4.37 |
| rust | 6.12 |
| java | 6.41 |
| go | 8.41 |
| java-vt | 9.00 |
| quarkus-native | 9.71 |
| java-webflux | 9.76 |
| micronaut | 9.82 |
| java-native | 13.97 |
| java-webflux-native | 15.12 |
| java-vt-native | 15.86 |
| micronaut-native | 20.87 |
| bun | 41.47 |
| nodejs | 119.79 |
| python | 244.50 |

get_user_cart

Redis HGETALL + HTTP + LRANGE (parallel)

| Server | Avg (ms) |
|---|---|
| quarkus | 4.35 |
| rust | 5.63 |
| java | 5.63 |
| go | 6.65 |
| java-vt | 8.24 |
| java-webflux | 8.56 |
| micronaut | 10.36 |
| quarkus-native | 12.45 |
| java-native | 12.74 |
| java-webflux-native | 14.63 |
| java-vt-native | 18.14 |
| micronaut-native | 22.03 |
| bun | 59.31 |
| nodejs | 141.88 |
| python | 260.39 |

checkout

HTTP POST + Redis pipeline (parallel)

| Server | Avg (ms) |
|---|---|
| quarkus | 3.38 |
| rust | 3.51 |
| go | 5.57 |
| java | 6.35 |
| java-webflux | 8.34 |
| quarkus-native | 8.93 |
| micronaut | 9.06 |
| java-vt | 9.83 |
| java-webflux-native | 13.54 |
| micronaut-native | 19.34 |
| java-native | 21.88 |
| java-vt-native | 23.18 |
| bun | 44.61 |
| nodejs | 108.84 |
| python | 249.97 |
Key Tool Observations:
  • checkout is consistently the fastest tool for top performers (Quarkus 3.38ms, Rust 3.51ms). Redis pipeline combines 3 write operations into 1 RTT, eliminating the per-operation network overhead.
  • search_products is the slowest tool for most servers. It requires a parallel HTTP GET + Redis ZRANGE, and the HTTP call to the API service dominates the latency.
  • Java MVC's sequential I/O is visible in get_user_cart, where the server must wait for HGETALL to complete before firing the HTTP call, unlike reactive implementations that parallelize immediately.
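
The checkout pipeline pattern can be sketched with a redis-py-style client. This is a hypothetical sketch: the key names and the `checkout_writes` helper are illustrative, not taken from any benchmarked server.

```python
def checkout_writes(r, user_id: str, product_id: str):
    """Batch checkout's three Redis writes into a single round trip.

    `r` is any redis.Redis-compatible client. Instead of three sequential
    network RTTs (INCR, RPUSH, ZADD), the pipeline queues the commands
    client-side and flushes them in one request/response exchange.
    """
    pipe = r.pipeline(transaction=False)            # plain pipelining, no MULTI/EXEC
    pipe.incr(f"orders:count:{user_id}")            # hypothetical order counter key
    pipe.rpush(f"orders:{user_id}", product_id)     # hypothetical order-history list
    pipe.zadd("products:popular", {product_id: 1})  # hypothetical popularity zset
    return pipe.execute()                           # one RTT, three replies
```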

4.6 Stability and Reproducibility

Table 6: Throughput Stability Across 3 Runs — sorted by CV (Coefficient of Variation: std dev / mean, expressed as %; lower = more consistent)
| Server | CV ▲ | Mean RPS | Std Dev | Stability |
|---|---|---|---|---|
| rust | 0.04% | 4,845 | 2 | Excellent |
| java-webflux-native | 0.10% | 2,413 | 2 | Excellent |
| go | 0.19% | 3,616 | 7 | Excellent |
| java-vt-native | 0.36% | 2,447 | 9 | Excellent |
| java-native | 0.44% | 2,316 | 10 | Excellent |
| quarkus | 0.50% | 4,739 | 24 | Excellent |
| bun | 0.52% | 876 | 5 | Excellent |
| micronaut-native | 0.57% | 2,161 | 12 | Excellent |
| java-webflux | 0.62% | 3,032 | 19 | Excellent |
| java | 0.64% | 3,540 | 23 | Excellent |
| micronaut | 0.70% | 3,382 | 24 | Excellent |
| java-vt | 0.92% | 3,482 | 32 | Excellent |
| quarkus-native | 1.13% | 3,449 | 39 | Excellent |
| python | 1.58% | 259 | 4 | Excellent |
| nodejs | 1.61% | 423 | 7 | Excellent |

All 15 servers achieved a CV (Coefficient of Variation: the ratio of standard deviation to mean, expressed as a percentage) below 2% across the 3 independent runs. A CV below 5% is generally considered excellent for load tests on shared infrastructure. The Redis flush-and-reseed methodology eliminates state drift between servers. The 60-second warmup exclusion eliminates JIT cold-start noise. The result is stable, reproducible rankings suitable for technology selection decisions. P95 latency showed equivalent stability: 13 of 15 servers had a stable P95 trend across runs. Only quarkus-native (+1.06ms) and bun (+2.86ms) showed slight increases. Python was the sole variable entry (range of approximately 30ms across runs), consistent with CPython GIL scheduling variability.
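
The CV metric used throughout this section is straightforward to compute. The per-run values below are hypothetical, chosen to be consistent with rust's reported mean of 4,845 RPS and standard deviation of 2 in Table 6:

```python
from statistics import mean, stdev

def coefficient_of_variation(rps_runs: list[float]) -> float:
    """CV = sample standard deviation / mean, expressed as a percentage."""
    return stdev(rps_runs) / mean(rps_runs) * 100

# Hypothetical per-run throughputs (mean 4,845 RPS, std dev 2)
cv = coefficient_of_variation([4843.0, 4845.0, 4847.0])
assert round(cv, 2) == 0.04
```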

5. Discussion

5.1 Performance Tiers

Tier 1: High-Performance (Rust, Quarkus, Go)
  • Rust: 4,845 RPS, 10.9 MB RAM, CV (Coefficient of Variation) 0.04%. Maximum throughput and minimum resource usage. The json_response fix (PR #683) was merged and shipped in rmcp v0.17.0.
  • Quarkus: 4,739 RPS, 4.04ms avg latency, 8.13ms P95. Best latency of all 15 servers. Requires explicit connection pool tuning.
  • Go: 3,616 RPS, 23.9 MB RAM. Third in throughput, second in memory efficiency, highly stable. Operational simplicity with no JVM dependency.
Tier 2: Good Performance (Java Ecosystem, JVM Variants)
  • Java MVC: 3,540 RPS, 6.13ms avg. Outperforms reactive WebFlux at 50 VUs due to lower scheduling overhead.
  • Java-VT: 3,482 RPS, 9.03ms avg. Virtual Threads operate as designed in I/O-bound workloads.
  • Quarkus-native: 3,449 RPS, 15.92ms P95 (4th best overall), 36 MB RAM. Best native image option.
  • Micronaut: 3,382 RPS. Competitive across all 3 runs.
  • Java-WebFlux: 3,032 RPS. Competitive throughput but high P99 tail (47ms) in checkout.
Tier 3: Native Images with Trade-offs
  • Java-VT-native: 2,447 RPS, 19.06ms avg. Worst native image (VT continuation overhead in AOT).
  • Java-WebFlux-native: 2,413 RPS, 44.17ms P95. High tail latency under sustained write load, compounded by Netty off-heap buffer pressure.
  • Java-native: 2,316 RPS. Stable but high tail latency.
  • Micronaut-native: 2,161 RPS, 233.2% CPU. Likely CPU-throttled by CFS.
These native images remain suitable where startup time and memory footprint outweigh throughput requirements (serverless, edge).
Tier 4: Low Throughput (JavaScript and Python)
  • Bun: 876 RPS. Best JavaScript option, delivering 2.2x the throughput of Node.js on identical code.
  • Node.js: 423 RPS. Appropriate for low-traffic deployments.
  • Python: 259 RPS. Ceiling set by FastMCP session overhead, not the ASGI server.

5.2 Trade-offs Analysis

Table 8: Implementation Trade-offs Matrix
| Dimension | Rust | Quarkus | Go | Java MVC | Java-VT | WebFlux | Bun | Node.js | Python |
|---|---|---|---|---|---|---|---|---|---|
| Peak Throughput | Highest | Very High | High | High | High | Moderate | Low | Very Low | Lowest |
| Latency (avg) | Very Low | Lowest | Low | Low | Low | Low | High | Very High | Highest |
| Latency Tail (P99) | Low | Lowest | Moderate | Low | Moderate | High | High | Very High | Highest |
| Memory Footprint | Lowest | Moderate | Very Low | High | High | Very High | High | High | Moderate |
| CPU Efficiency | Highest | High | Moderate | Moderate | Moderate | Moderate | Low | Very Low | Lowest |
| Ecosystem Maturity | Early | High | High | Highest | High | High | Moderate | High | High |
| SDK Overhead | Patched | Tunable | Standard | Standard | Standard | Standard | Fixed floor | Fixed floor | Fixed floor |

5.3 Consistency and Reliability

The CV below 2% for all 15 servers is exceptional for a benchmark running on a shared cloud VM. The Redis reset methodology eliminates state drift between servers, and the warmup exclusion eliminates JIT noise, so the rankings are stable and can be used with confidence. The only notable anomaly is Python's P95 variability across runs (335-387ms), attributable to CPython runtime variability (GC pressure and GIL scheduling) rather than to network or Redis inconsistency.

6. Recommendations

6.1 Production Deployment Guidance

Use Rust When:
  • Maximum throughput is the primary SLA (4,845 RPS, 41.1 RPS per CPU%)
  • Memory footprint must be minimal (10.9 MB average)
  • Resource cost efficiency matters at scale
  • Team has Rust proficiency
  • rmcp v0.17.0 or later is available on crates.io
Use Quarkus When:
  • Latency SLAs are strict (P95 below 10ms required, achieved 8.13ms)
  • JVM ecosystem tooling and library access are needed
  • Reactive non-blocking I/O is preferred
  • Memory-constrained deployments favor Quarkus-native (36 MB)
  • Team is Java-proficient and comfortable with reactive programming
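The connection pool tuning that the Quarkus entry depends on is ordinary application.properties configuration. The sketch below is illustrative only: the property names are taken from the Quarkus Redis client and REST client configuration guides, and the pool sizes are assumptions, not the values used in this benchmark. Verify names and defaults against your Quarkus version.

```properties
# Redis client pool: defaults are small; size to the expected concurrency (illustrative values)
quarkus.redis.max-pool-size=50
quarkus.redis.max-pool-waiting=100

# REST client connection pool for outbound HTTP API calls (illustrative value)
quarkus.rest-client.connection-pool-size=50
```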
Use Go When:
  • Cloud-native deployment on Kubernetes (23.9 MB RAM, static binary)
  • Operational simplicity is preferred (no JVM, minimal configuration)
  • Resource cost matters (151 RPS per MB RAM)
  • Team uses Go
  • No JVM dependency or startup time constraints exist
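The "static binary, no JVM" point can be made concrete with a typical multi-stage Dockerfile. This is a generic sketch under common Go deployment conventions (paths, Go version, and image names are placeholders), not the benchmark's actual go-server/ Dockerfile:

```dockerfile
# Build stage: compile a fully static binary (CGO disabled)
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /mcp-server .

# Runtime stage: no OS layer, no runtime, just the binary
# (note: scratch has no CA certificates; copy them in if the server makes HTTPS calls)
FROM scratch
COPY --from=build /mcp-server /mcp-server
ENTRYPOINT ["/mcp-server"]
```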
Use Java Spring MVC When:
  • Existing Spring ecosystem and team expertise in Java/Spring
  • Moderate-to-high throughput requirements within the JVM ecosystem
  • Reactive model overhead is not desired
  • Blocking I/O matches the concurrency level

Consider instead: Java-VT for future-proofing at higher concurrency levels (above 100 VUs), where Virtual Threads show greater advantage.
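Moving from the MVC thread-pool model to Virtual Threads requires no code changes on Spring Boot 3.2+; it is a single configuration property, which is why the two variants share the same Spring toolchain:

```properties
# Run Tomcat request handling on virtual threads (Spring Boot 3.2+)
spring.threads.virtual.enabled=true
```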

Use Bun or Node.js When:
  • Team is JavaScript-native
  • Low-to-moderate traffic scenarios where JavaScript development speed is the priority
  • Rapid development cycle is valued
  • Prefer Bun over Node.js when JavaScript is required (2.2x throughput advantage)

Not recommended for: Latency-sensitive or high-load production MCP deployments.

Use Python When:
  • Team is Python-native and Python ML/AI library integration is needed
  • Low-traffic deployments where Python ecosystem integration outweighs performance
  • Development, testing, or prototyping scenarios
  • Integration with existing Python data science tooling outweighs performance requirements

Not recommended for: High-throughput production MCP deployments. The bottleneck is FastMCP session overhead in CPython, not the ASGI server or network I/O.
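The "4 workers with uvloop" configuration referenced throughout corresponds to a standard multi-process ASGI launch. A hedged sketch using uvicorn's real CLI flags; the module path server:app is a placeholder, not the benchmark's actual entry point:

```shell
# Run the ASGI app on 4 worker processes with the uvloop event loop
uvicorn server:app --host 0.0.0.0 --port 8000 --workers 4 --loop uvloop
```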

6.2 Use Case Decision Matrix

Table 10: Deployment Recommendations by Use Case
| Use Case | Recommended | Alternative | Avoid |
|---|---|---|---|
| High-load production deployments | Rust | Quarkus, Go | Python, Node.js |
| Latency SLA P95 < 10ms | Quarkus | Rust, Go | Python, Node.js |
| Kubernetes / cloud-native | Go | Rust, Quarkus-native | Java WebFlux |
| Memory-constrained (< 50 MB) | Rust | Go, Quarkus-native | Java JVM variants |
| Memory-constrained (< 200 MB) | Quarkus-native | Java-native, Micronaut-native | Java-WebFlux |
| Native image preferred | Quarkus-native | Java-native, Micronaut-native | Java-VT-native |
| Java ecosystem, moderate-to-high load | Java MVC | Java-VT, Micronaut | Python |
| Dev / Testing / low-traffic | Python | Node.js, Bun | (none) |
| JavaScript ecosystem required | Bun | Node.js | (none) |
| Java ecosystem, reactive preferred | Java-VT | Java WebFlux | Java-VT-native |
High-Load Production (sustained traffic, latency SLAs critical):
  • Default choice: Rust (highest throughput, lowest memory, CV 0.04%)
  • P95 latency SLA (< 10ms): Quarkus (best latency profile: 4.04ms avg, 8.13ms P95)
  • No JVM dependency: Go (static binary, 23.9 MB RAM, 3,616 RPS)
Moderate Load (Java / JVM ecosystem preferred):
  • Spring ecosystem: Java MVC (thread-pool blocking I/O, 3,540 RPS)
  • Higher concurrency headroom: Java Virtual Threads (Project Loom, same Spring toolchain, 3,482 RPS)
  • Memory-constrained (< 50 MB): Quarkus-native (best native trade-off: 36 MB, 3,449 RPS)
Low Traffic / Dev (rapid iteration, existing team skills):
  • JavaScript runtime: Bun (2.2x faster than Node.js on identical code)
  • JavaScript (max compatibility): Node.js (widest ecosystem, proven tooling)
  • Python / AI-ML integration: Python (FastMCP + uvloop, 259 RPS with 4 workers)
Figure 2: MCP Server Selection Guide — primary choice and alternatives by deployment scenario. See Table 10 for the full use case matrix.

7. Conclusion

This experimental analysis expanded the MCP server benchmark from 4 to 15 implementations, replacing synthetic CPU tools with real Redis and HTTP API workloads. The expansion revealed performance characteristics that were invisible in v1: the critical impact of connection pool configuration (Quarkus 0 RPS without tuning), the JVM vs native image throughput-memory trade-off under I/O load, the significance of runtime choice within the JavaScript ecosystem (Bun 2.2x Node.js), and the realistic production ceiling of optimized Python (259 RPS with 4 workers and uvloop).

The 39.9 million requests processed with 0% errors across all 15 servers validate the methodology's reproducibility. The CV below 2% for every server confirms that the rankings are stable. The data provides a reliable empirical basis for MCP server technology selection decisions.

Key Finding: In I/O-bound workloads representative of production MCP deployments, Rust and Quarkus lead the field at 4,845 and 4,739 RPS respectively, with Quarkus holding the best latency at 4.04ms average and 8.13ms P95. Go remains the optimal choice for teams prioritizing operational simplicity and resource efficiency. The study confirms that GraalVM native images reduce memory at the cost of throughput in sustained I/O workloads, with Quarkus-native as the best-positioned exception.

Summary of Findings:

  • Performance tiers are clearly separated: Rust/Quarkus at 4,700-4,850 RPS, Go/Java cluster at 3,000-3,620 RPS, JS/Python at 250-880 RPS.
  • Native images consistently reduce memory (27-81%) at a 20-36% throughput cost under sustained high load. At low request rates, this throughput regression is not observable. Quarkus-native offers the best trade-off at high load.
  • Classic blocking I/O (Spring MVC) outperforms reactive (WebFlux) at 50 VUs in this I/O-bound workload.
  • Bun delivers 2.2x the throughput of Node.js on identical code, making it the clear choice when the JavaScript ecosystem is required.
  • All 15 servers achieved 0% errors and CV below 2% across 39.9 million requests, validating the methodology's reproducibility.
Recommendations Summary:
  • Production choice (throughput): Rust at 4,845 RPS with 10.9 MB RAM
  • Production choice (latency): Quarkus at 4.04ms avg, 8.13ms P95
  • Resource and operational choice: Go at 23.9 MB RAM and 3,616 RPS
  • Java ecosystem: Spring MVC (blocking) at 3,540 RPS for strong throughput with operational simplicity. Java-VT for future-proofing at higher concurrency levels.
  • JavaScript ecosystem: Bun over Node.js (2.2x throughput advantage)
  • Python: Appropriate for low-traffic deployments and Python-native teams. The ceiling is FastMCP session overhead in CPython, not the ASGI server.

Future Work: Higher concurrency levels (100-200 VUs) to identify saturation points. Persistent session benchmarks. Multi-instance Kubernetes deployments with session affinity. Rust with native compilation using rmcp v0.17.0.

8. References and Resources

  1. MCP Streamable HTTP Specification (2025). Model Context Protocol: Streamable HTTP Transport. https://modelcontextprotocol.io/specification/2025-06-18/basic/transports
  2. Mendes, T. (2026). rmcp SDK PR #683: json_response support for stateless HTTP transport. https://github.com/modelcontextprotocol/rust-sdk/pull/683
  3. Quarkus Team. (2025). Quarkus REST Client Reactive: Configuration Reference. https://quarkus.io/guides/rest-client-reactive
  4. OpenJDK. (2023). JEP 444: Virtual Threads. https://openjdk.org/jeps/444
  5. Oracle. (2025). GraalVM Native Image Documentation. https://www.graalvm.org/latest/reference-manual/native-image/
  6. FastMCP Contributors. (2025). FastMCP: Running a FastMCP Server. https://gofastmcp.com/deployment/running-server
  7. Grafana Labs. (2025). k6 Load Testing Documentation. https://k6.io/docs/
  8. deadpool-redis contributors. (2025). deadpool-redis crate documentation. https://docs.rs/deadpool-redis
  9. Bun Team. (2025). Bun JavaScript Runtime. https://bun.sh

9. Appendix

9.1 Raw Data and Complete Results

All raw benchmark data, including detailed results from all three runs, per-tool latency breakdowns, Docker stats logs, and k6 output files are available in the project repository:

The benchmark/results/ directory contains timestamped result sets:

  • summary.json: aggregated metrics across all servers
  • [server]/k6.json: detailed k6 metrics for each server
  • [server]/stats.json: Docker resource usage statistics

9.2 Server Implementations

Complete source code for all 15 MCP server implementations:

  • rust-server/: rmcp 0.17.0 implementation with json_response: true
  • quarkus-server/: Quarkus MCP server (JVM and native Dockerfiles)
  • go-server/: mcp-go implementation
  • java-server/: Spring Boot MVC implementation
  • java-vt-server/: Spring Boot Virtual Threads implementation
  • java-webflux-server/: Spring Boot WebFlux implementation
  • micronaut-server/: Micronaut MCP server (JVM and native Dockerfiles)
  • nodejs-server/: Node.js Express implementation
  • python-server/: FastMCP + Starlette implementation

9.3 Benchmark Suite

  • benchmark/benchmark.js: k6 load testing script
  • benchmark/run_benchmark.sh: automated benchmark orchestration
  • benchmark/collect_stats.py: Docker stats collection
  • benchmark/consolidate.py: results aggregation