Benchmarking Methodology

Overview

This document describes the systematic approach used to benchmark REST service implementations, aiming for results that are reproducible, comparable, and transparent.

Where details differ between documentation and code/config, the repository source (Docker/compose/service implementations) is the source of truth.

At-a-glance results (22/01/2026)

The table below is a curated summary (RPS rounded to the nearest thousand) for CPU-limited service containers (4 vCPUs).

Implementation             | Mode     | RPS
---------------------------|----------|------
Spring JVM                 | Platform | 32k
Spring JVM                 | Virtual  | 29k
Spring JVM                 | Reactive | 22k
Spring Native              | Platform | 20k
Spring Native              | Virtual  | 20k
Spring Native              | Reactive | 16k
Quarkus JVM                | Platform | 70k
Quarkus JVM                | Virtual  | 90k
Quarkus JVM                | Reactive | 104k
Quarkus Native             | Platform | 45k
Quarkus Native             | Virtual  | 54k
Quarkus Native             | Reactive | 51k
Go (observability-aligned) | n/a      | 52k

Fairness note (Go vs go-simple)

A simpler Go variant in this repository can reach ~120k RPS, but it is intentionally kept out of the headline comparison because its observability setup is not equivalent to that of the Java services.

The “observability-aligned” Go implementation is intended to match the same OpenTelemetry + LGTM pipeline, making the comparison more apples-to-apples.

Benchmarking Philosophy

Goals

  1. Fair Comparison: Create equivalent test conditions for all implementations
  2. Reproducibility: Enable others to reproduce results
  3. Practical Relevance: Test realistic scenarios while maintaining simplicity
  4. Transparency: Document all assumptions and limitations

Non-Goals

Test Environment

Hardware Configuration

Host System:

Note: Results vary significantly with hardware. Always benchmark on target hardware.

Container Configuration

Resource Limits:

cpus: 4.0          # 4 virtual CPUs
memory: 2GB        # Maximum memory
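
To confirm the limits actually applied to a running service container, the effective values can be read back with docker inspect (the container name below is a placeholder; use the name shown by docker ps):

# Read back the effective CPU and memory limits of a running container
docker inspect --format 'CPUs (NanoCpus): {{.HostConfig.NanoCpus}}, Memory (bytes): {{.HostConfig.Memory}}' service-name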

Why CPU Limiting?

Software Versions

Java:

Native:

Frameworks:

Third-party license note (native-image)

This repository is Apache-2.0 licensed.

However, native builds may use Oracle GraalVM container images (for example: container-registry.oracle.com/graalvm/native-image:25.0.1-ol10). If you build or run those images, you are responsible for reviewing and complying with Oracle’s license terms.

Workload Design

Service Implementation

Endpoint: GET /api/cache/{key}

Logic:

@GetMapping("/api/cache/{key}")
public ResponseEntity<String> getFromCache(@PathVariable String key) {
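    // Caffeine's get(key, mappingFunction) computes "value-<key>" on a cache miss and serves the cached value on later hits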
    String value = cache.get(key, k -> "value-" + k);
    return ResponseEntity.ok(value);
}

Cache: Caffeine (high-performance, non-blocking)

Why This Workload?

Load Generation

Tool: wrk2 (constant throughput load generator)

Configuration:

# 8 threads, 200 connections, 180-second duration,
# 80,000 requests/sec target rate, with latency distribution reporting
wrk2 -t 8 -c 200 -d 180s -R 80000 --latency \
     http://service:8080/api/cache/key1

Key Parameters:

Why wrk2?

Benchmarking Process

To maximize repeatability:

Native-image build time & resource notes

Native-image builds are CPU intensive and can take up to ~10 minutes per service. First-time builds of the full set can take 30+ minutes.

Building multiple native images in parallel can overwhelm Docker Desktop/WSL2. The repository therefore defaults to serial image builds using:

1. Preparation Phase

Environment Setup:

# Start observability stack
docker compose --project-directory compose --profile=OBS up -d

# Wait for all services to be healthy (60 seconds minimum)
sleep 60

Windows PowerShell alternative:

Start-Sleep -Seconds 60
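
Instead of a fixed sleep, container health can be polled. A minimal sketch, assuming the containers define Docker health checks (the container name is a placeholder; use the name shown by docker ps):

# Poll until the container reports healthy
until [ "$(docker inspect --format '{{.State.Health.Status}}' container-name 2>/dev/null)" = "healthy" ]; do
    sleep 5
done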

Service Deployment:

# Start specific service
docker compose --project-directory compose --profile=SERVICES up -d service-name

# Wait for service warmup
sleep 30

Windows PowerShell alternative:

Start-Sleep -Seconds 30

Health checks can be verified with curl (or a browser):
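
The endpoints below are framework defaults and may differ per service:

# Spring Boot Actuator default health endpoint
curl -f http://localhost:8080/actuator/health

# Quarkus (SmallRye Health) default health endpoint
curl -f http://localhost:8080/q/health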

2. Warmup Phase

Purpose: Allow JVM to reach steady-state performance

Procedure:

# Low-rate warmup (30 seconds)
wrk2 -t 4 -c 50 -d 30s -R 10000 http://localhost:8080/api/cache/key1

# Wait for GC to settle
sleep 10

# Medium-rate warmup (30 seconds)
wrk2 -t 6 -c 100 -d 30s -R 30000 http://localhost:8080/api/cache/key1

# Wait for stabilization
sleep 10

Native Images: Shorter warmup acceptable (near-instant startup, no JIT warmup phase)

3. Measurement Phase

Primary Benchmark:

# Full load test
wrk2 -t 8 -c 200 -d 180s -R 100000 --latency \
     http://localhost:8080/api/cache/key1 > results.txt
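
A quick way to pull the headline numbers out of the saved report, assuming the standard wrk2 text output produced with --latency:

# Extract throughput and tail-latency summary lines from the report
grep -E "Requests/sec|99\.000%|99\.900%" results.txt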

What to Capture:

Observability Data:

4. Cooldown Phase

# Stop load generator
docker compose --project-directory compose --profile=RAIN_FIRE down

# Wait for queues to drain
sleep 30

# Capture final metrics
docker stats --no-stream

Windows PowerShell alternative:

Start-Sleep -Seconds 30

docker stats --no-stream

5. Data Collection

Automated:

Manual:

Example output artifacts

The repository stores benchmark artifacts under results/ (see results/README.md).

Benchmark output location

Result Interpretation

Primary Metrics

Requests Per Second (RPS):

Latency Percentiles:

CPU Utilization:

Memory Usage:

Secondary Metrics

Startup Time:

Memory Footprint:

Error Rate:

Comparing Results

Fair Comparison Checklist

Same hardware: All tests on same machine

Same resource limits: CPU and memory constraints identical

Same workload: Identical request pattern

Same warmup: Adequate warmup time for each

Multiple runs: At least 3 runs, report median

Same observability: Instrumentation overhead consistent

Common Pitfalls

Cold start bias: Insufficient warmup

Thermal throttling: CPU temperature limiting performance

Background processes: Other workloads affecting results

Network saturation: Localhost loopback as bottleneck

Observer effect: Observability overhead not accounted for

Statistical Rigor

Multiple Runs

Minimum: 3 runs per configuration

Report: Median RPS, range (see the sketch below)

Discard: Outliers with clear explanation
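
A minimal sketch for reporting the median RPS of three runs, assuming each run's wrk2 report was saved to its own file (file names are illustrative, and the middle-value trick only works for exactly three runs):

# Take the middle Requests/sec value out of three saved reports
grep -h "Requests/sec" run1.txt run2.txt run3.txt | awk '{print $2}' | sort -n | awk 'NR==2'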

Variance Analysis

Acceptable: ±5% between runs

Investigate: >10% variance suggests instability

Significance

Results presented are indicative, not scientific proof

Known Limitations

Workload Simplicity

Local Testing

Tool Limitations

Recommendations for Reproducibility

Before Benchmarking

  1. Close unnecessary applications: Minimize interference
  2. Disable power management: Maximum performance mode
  3. Fix CPU frequency: Avoid turbo boost variations (see the sketch after this list)
  4. Warm up system: Run a test benchmark first
  5. Check thermals: Ensure adequate cooling
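
For items 2 and 3, on a Linux host the CPU frequency governor can be pinned to reduce frequency-scaling variance. A sketch, assuming root access and the cpupower tool (not applicable to Docker Desktop on Windows/macOS, where the VM controls frequency scaling):

# Pin all CPUs to the performance governor
sudo cpupower frequency-set -g performance

# Alternative via sysfs, if cpupower is not installed
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor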

During Benchmarking

  1. Monitor system: Watch for anomalies
  2. Consistent time of day: Avoid thermal variations
  3. Multiple iterations: Don’t trust single run
  4. Document everything: Configuration, versions, observations

After Benchmarking

  1. Review observability data: Correlate with results
  2. Check for errors: Validate test validity
  3. Compare with baseline: Detect regression
  4. Archive results: Include metadata

Advanced Benchmarking

Latency Profiling

Use flamegraphs to identify hot paths:

# Pyroscope captures automatically during test
# View in Grafana: Explore → Pyroscope

Concurrency Scaling

Test different connection counts:

for conn in 50 100 200 400; do
    wrk2 -t 8 -c $conn -d 60s -R 100000 http://localhost:8080/api/cache/key1
done

Stress Testing

Find breaking point:

for rate in 50000 100000 150000 200000; do
    wrk2 -t 8 -c 200 -d 60s -R $rate http://localhost:8080/api/cache/key1
done

References

Continuous Improvement

This methodology evolves based on:

Contributions and suggestions welcome via GitHub issues!