Benchmarking Methodology

Overview

This document describes the systematic approach used to benchmark REST service implementations, aiming for results that are reproducible, comparable, and transparent.

Where details differ between documentation and code/config, the repository source (Docker/compose/service implementations) is the source of truth.

At-a-glance results (06/03/2026)

The table below is a curated summary (RPS rounded to the nearest thousand) for CPU-limited service containers (2 vCPUs).

Framework    Runtime   Mode        RPS    Peak Mem (MB)   Image Size (MB)
Spring       JVM       Platform    21k    552             246
Spring       JVM       Virtual     17k    439             246
Spring       JVM       Reactive    14k    427             277
Spring       Native    Platform    10k    237             388
Spring       Native    Virtual     11k    163             388
Spring       Native    Reactive     7k    176             447
Quarkus      JVM       Platform    37k    540             235
Quarkus      JVM       Virtual     45k    540             235
Quarkus      JVM       Reactive    49k    540             235
Quarkus      Native    Platform    21k    270             636
Quarkus      Native    Virtual     27k    270             636
Quarkus      Native    Reactive    22k    270             636
Micronaut    JVM       Platform    31k    441             193
Micronaut    JVM       Virtual     38k    441             193
Micronaut    JVM       Reactive    33k    441             193
Micronaut    Native    Platform    17k    165             349
Micronaut    Native    Virtual     17k    165             349
Micronaut    Native    Reactive    15k    165             349
Helidon SE   JVM       Virtual     65k    430             169
Helidon SE   Native    Virtual     37k    195             253
Helidon MP   JVM       Virtual     15k    463             189
Helidon MP   Native    Virtual     10k    202             356
Spark        JVM       Platform    35k    559             216
Spark        JVM       Virtual     25k    395             216
Javalin      JVM       Platform    29k    754             219
Javalin      JVM       Virtual     26k    510             219
Dropwizard   JVM       Platform    17k    613             246
Dropwizard   JVM       Virtual     16k    529             246
Vert.x       JVM       Reactive    52k    541             220
Pekko        JVM       Reactive    30k    693             266
Go           Native    Goroutines  24k    120              36
Django       CPython   Platform     1k    161             306
Django       CPython   Reactive   0.7k    200             309

Fairness Notes

Benchmarking Philosophy

Goals

  1. Fair Comparison: Create equivalent test conditions for all implementations
  2. Reproducibility: Enable others to reproduce results
  3. Practical Relevance: Test realistic scenarios while maintaining simplicity
  4. Transparency: Document all assumptions and limitations

Non-Goals

Test Environment

Hardware Configuration

Host System:

Note: Results vary significantly with hardware. Always benchmark on target hardware.

Container Configuration

Resource Limits:

cpus: 2.0          # 2 virtual CPUs
memory: 2GB        # Maximum memory
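
The same caps can be applied to an ad-hoc container directly; a minimal sketch with docker run (the image name is a placeholder):

# Run a service with identical CPU and memory limits applied on the command line
docker run --cpus="2.0" --memory="2g" -p 8080:8080 example-service:latest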

Why CPU Limiting?

Software Versions

Java:

Native:

Frameworks:

Third-party license note (native-image)

This repository is Apache-2.0 licensed.

However, native builds may use Oracle GraalVM container images (for example: container-registry.oracle.com/graalvm/native-image:25.0.2-ol9). If you build or run those images, you are responsible for reviewing and complying with Oracle’s license terms.

Workload Design

Service Implementation

Endpoint: GET /hello/platform

Logic:

// `cache` is the shared Caffeine Cache<String, String>; `key` is a constant lookup key.
@GetMapping("/hello/platform")
public ResponseEntity<String> getFromCache() {
    // Atomic compute-if-absent: returns the cached value, creating "value-" + key on a miss.
    String value = cache.get(key, k -> "value-" + k);
    return ResponseEntity.ok(value);
}

Cache: Caffeine (high-performance, non-blocking)
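
As a quick sanity check, the endpoint can be exercised directly; the response body is the cached value produced by the handler above (the localhost:8080 address is assumed from the load-test commands later in this document):

curl -s http://localhost:8080/hello/platform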

Why This Workload?

Load Generation

Tool: wrk2 (constant throughput load generator)

Configuration:

# -t 8       8 threads
# -c 200     200 connections
# -d 180s    180-second duration
# -R 80000   80,000 requests/sec target rate
# --latency  print the latency distribution
wrk2 -t 8 -c 200 -d 180s -R 80000 --latency \
     http://service:8080/hello/platform

Key Parameters:

Why wrk2?

Benchmarking Process

To maximize repeatability:

Native-image build time & resource notes

Native-image builds are CPU-intensive and can take up to ~10 minutes per service. A first-time build of the full set can take 30+ minutes.

Building multiple native images in parallel can overwhelm Docker Desktop/WSL2. The repository therefore defaults to serial image builds.
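
A minimal sketch of such a serial loop (the service names are placeholders; the repository's build scripts are authoritative):

# Build native images one at a time to keep Docker Desktop/WSL2 responsive
for svc in spring-native quarkus-native micronaut-native; do
    docker compose --project-directory compose --profile=SERVICES build "$svc"
done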

1. Preparation Phase

Environment Setup:

# Start observability stack
docker compose --project-directory compose --profile=OBS up -d

# Wait for all services to be healthy (60 seconds minimum)
sleep 60

Windows PowerShell alternative:

Start-Sleep -Seconds 60
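
Rather than a fixed sleep, readiness can also be polled; a small sketch (Grafana's /api/health endpoint on localhost:3000 is an assumption about the stack, so adjust per component):

# Poll a component until it reports healthy instead of sleeping a fixed interval
until curl -sf http://localhost:3000/api/health > /dev/null; do
    sleep 5
done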

Service Deployment:

# Start specific service
docker compose --project-directory compose --profile=SERVICES up -d service-name

# Wait for service warmup
sleep 30

Windows PowerShell alternative:

Start-Sleep -Seconds 30

Health checks can be verified with curl (or a browser):
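
For example, the service endpoint itself can double as a liveness probe (framework-specific health endpoints such as Spring Boot's /actuator/health may also be available, depending on the service):

# Fails with a non-zero exit code on connection errors or HTTP error statuses
curl -f http://localhost:8080/hello/platform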

2. Warmup Phase

Purpose: Allow JVM to reach steady-state performance

Procedure:

# Low-rate warmup (30 seconds)
wrk2 -t 4 -c 50 -d 30s -R 10000 http://localhost:8080/hello/platform

# Wait for GC to settle
sleep 10

# Medium-rate warmup (30 seconds)
wrk2 -t 6 -c 100 -d 30s -R 30000 http://localhost:8080/hello/platform

# Wait for stabilization
sleep 10

Native Images: A shorter warmup is acceptable, since they start near-instantly and have no JIT compilation to settle.

3. Measurement Phase

Primary Benchmark:

# Full load test
wrk2 -t 8 -c 200 -d 180s -R 100000 --latency \
     http://localhost:8080/hello/platform > results.txt
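
The headline numbers can be pulled straight out of the saved report; a small sketch against the results.txt captured above, matching wrk2's standard output format:

# Extract throughput and key latency percentiles from the wrk2 report
grep -E 'Requests/sec|50\.000%|99\.000%|99\.999%' results.txt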

What to Capture:

Observability Data:

Example: request rate drilldown (Grafana)

The examples below show how the request rate can be inspected via the hello_request_count_total counter and then broken down by labels such as service_name and endpoint.

(Screenshots: Grafana request count broken down by endpoint, by service_name, and by multiple labels.)
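
Outside Grafana, the same breakdowns can be queried from Prometheus directly; a sketch assuming the observability stack exposes Prometheus on localhost:9090:

# Per-second request rate over the last minute, broken down by service_name
curl -s 'http://localhost:9090/api/v1/query' \
     --data-urlencode 'query=sum by (service_name) (rate(hello_request_count_total[1m]))'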

4. Cooldown Phase

# Stop load generator
docker compose --project-directory compose --profile=RAIN_FIRE down

# Wait for queues to drain
sleep 30

# Capture final metrics
docker stats --no-stream

Windows PowerShell alternative:

Start-Sleep -Seconds 30

docker stats --no-stream

Example: capturing peak memory in Docker Desktop

When summarizing peak memory, it’s useful to capture a “post-benchmark” snapshot of container memory usage (either via docker stats or Docker Desktop).

(Screenshot: Docker Desktop container list showing memory usage after a benchmark run.)
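
A formatted docker stats snapshot is convenient when filling in the summary table; note that docker stats reports point-in-time usage rather than a true peak, so take the snapshot while (or immediately after) load is applied:

# One-shot, formatted snapshot of per-container memory usage
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"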

5. Data Collection

Automated:

Manual:

Example output artifacts

The repository stores benchmark artifacts under results/ (see results/README.md).

(Screenshot: benchmark output location under results/.)

Result Interpretation

Primary Metrics

Requests Per Second (RPS):

Latency Percentiles:

CPU Utilization:

Memory Usage:

Secondary Metrics

Startup Time:

Memory Footprint:

Error Rate:

Comparing Results

Fair Comparison Checklist

Same hardware: All tests on same machine

Same resource limits: CPU and memory constraints identical

Same workload: Identical request pattern

Same warmup: Adequate warmup time for each

Multiple runs: At least 3 runs, report median

Same observability: Instrumentation overhead consistent

Common Pitfalls

Cold start bias: Insufficient warmup

Thermal throttling: CPU temperature limiting performance

Background processes: Other workloads affecting results

Network saturation: Localhost loopback as bottleneck

Observer effect: Observability overhead not accounted for

Statistical Rigor

Multiple Runs

Minimum: 3 runs per configuration

Report: Median RPS, range

Discard: Outliers with clear explanation

Variance Analysis

Acceptable: ±5% between runs

Investigate: >10% variance suggests instability
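
A quick way to compute the median and run-to-run spread from raw per-run RPS numbers (the three values below are hypothetical):

# Median and min-to-max spread of per-run RPS values
printf '%s\n' 45210 44890 45510 | sort -n | awk '
    { v[NR] = $1 }
    END {
        median = v[int((NR + 1) / 2)]          # lower middle for an even run count
        spread = (v[NR] - v[1]) / median * 100
        printf "median=%d  spread=%.1f%%\n", median, spread
    }'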

Significance

Results presented are indicative, not scientific proof

Known Limitations

Workload Simplicity

Local Testing

Tool Limitations

Recommendations for Reproducibility

Before Benchmarking

  1. Close unnecessary applications: Minimize interference
  2. Disable power management: Maximum performance mode
  3. Fix CPU frequency: Avoid turbo boost variations (see the sketch after this list)
  4. Warm up system: Run a test benchmark first
  5. Check thermals: Ensure adequate cooling
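
On Linux, items 2 and 3 can be enforced from the shell; a sketch (tool availability and sysfs paths vary by distribution and CPU vendor):

# Pin the CPU frequency governor to "performance" (requires cpupower from linux-tools)
sudo cpupower frequency-set -g performance

# Disable turbo boost on Intel CPUs using the intel_pstate driver
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo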

During Benchmarking

  1. Monitor system: Watch for anomalies
  2. Consistent time of day: Avoid thermal variations
  3. Multiple iterations: Don’t trust a single run
  4. Document everything: Configuration, versions, observations

After Benchmarking

  1. Review observability data: Correlate with results
  2. Check for errors: Validate test validity
  3. Compare with baseline: Detect regression
  4. Archive results: Include metadata

Advanced Benchmarking

Latency Profiling

Use flamegraphs to identify hot paths:

# Pyroscope captures automatically during test
# View in Grafana: Explore → Pyroscope

Concurrency Scaling

Test different connection counts:

for conn in 50 100 200 400; do
    wrk2 -t 8 -c "$conn" -d 60s -R 100000 http://localhost:8080/hello/platform
    sleep 30   # let queues drain between runs (see Cooldown Phase)
done

Stress Testing

Find the breaking point:

for rate in 50000 100000 150000 200000; do
    wrk2 -t 8 -c 200 -d 60s -R "$rate" http://localhost:8080/hello/platform
    sleep 30   # cooldown between load levels
done

References

Continuous Improvement

This methodology evolves over time. Contributions and suggestions are welcome via GitHub issues!