Running Benchmarks (MCP Server)

The MCP benchmark servers provide a programmatic interface for running Juliet and real-world benchmarks. All results are stored in data/benchmarks.db (SQLite, WAL mode).

Benchmark Infrastructure

bench/
  __init__.py      Package marker
  __main__.py      CLI: python -m bench juliet [--full] [--jobs N]
  config.py        Paths, constants, defaults
  db.py            SQLite schema, WAL mode, CRUD + query API
  analyzer.py      TP/FP classifier (Juliet ground truth)
  runner.py        Parallel CWE runner
  machine.py       Machine metadata (CPU, RAM, hostname)

SQLite Schema

Table               Purpose
runs                One row per benchmark (version, SHA, mode, status, machine)
cwe_scans           One row per CWE per run (file count, violations, duration)
violations          Every individual sqc finding with TP/FP classification
cwe_metrics         Pre-computed aggregates per CWE (TP/FP rates)
rule_cwe_breakdown  Per-rule per-CWE counts
realworld_runs      Real-world benchmark runs (sqc version, machine)
realworld_results   Per-project per-tool violation counts

Historical data from JULIET_RESULTS.md and REALWORLD_RESULTS.md has been backfilled into the database.
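
The database can also be read directly with Python's sqlite3 module. The sketch below is illustrative only: the column names (run_id, cwe_id, tp, fp) are assumptions, not the actual schema in bench/db.py, and it builds a throwaway database rather than touching data/benchmarks.db.

```python
import os
import sqlite3
import tempfile

# Assumption: simplified, hypothetical columns; the real schema lives in bench/db.py.
SCHEMA = """
CREATE TABLE runs (run_id TEXT PRIMARY KEY, version TEXT, sha TEXT, mode TEXT, status TEXT);
CREATE TABLE cwe_metrics (run_id TEXT, cwe_id TEXT, tp INTEGER, fp INTEGER);
"""


def open_db(path):
    """Open a SQLite database file and switch it to WAL journaling."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def fp_share_by_cwe(conn, run_id):
    """Return {cwe_id: fp / (tp + fp)} for one run, skipping CWEs with no findings."""
    rows = conn.execute(
        "SELECT cwe_id, tp, fp FROM cwe_metrics WHERE run_id = ? AND tp + fp > 0",
        (run_id,),
    )
    return {cwe: fp / (tp + fp) for cwe, tp, fp in rows}


def demo():
    """Build a temporary database, insert two sample rows, and query them."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_db(os.path.join(tmp, "benchmarks.db"))
        conn.executescript(SCHEMA)
        run_id = "sqc-0.3.20-abc1234"
        conn.execute(
            "INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
            (run_id, "0.3.20", "abc1234", "fast", "done"),
        )
        conn.executemany(
            "INSERT INTO cwe_metrics VALUES (?, ?, ?, ?)",
            [(run_id, "476", 90, 10), (run_id, "190", 30, 10)],
        )
        conn.commit()
        shares = fp_share_by_cwe(conn, run_id)
        conn.close()
        return shares
```

WAL mode lets readers like this run while a benchmark is still writing, which is why ad-hoc queries are safe mid-run.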

Benchmark Workflow Protocol

Important

  1. Version bump + commit BEFORE benchmark: Always bump the version in Cargo.toml, commit, and then rebuild (cargo build --release) before starting, in that order. The run_id is sqc-{version}-{sha}, so the release binary must be built from the committed version.

  2. NEVER modify code while a benchmark is running: The benchmark uses target/release/sqc. Rebuilding while running corrupts results.

  3. Wait for completion: Fast-mode ~8-10 min (4-core), ~3-5 min (24-core). Full-suite ~40-50 min. Check status no more than once every 5 minutes.

  4. Compare runs after completion.

  5. Sequence: implement -> bump version -> commit -> build release -> run benchmark -> wait -> analyze
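
As a rough illustration of how the run_id ties a result to a version and commit, the sketch below derives it from Cargo.toml text and a commit SHA. Both helper names are hypothetical; the real logic in bench/ may differ.

```python
import re


def cargo_version(cargo_toml_text):
    """Return the version declared in the [package] section of a Cargo.toml string."""
    in_package = False
    for line in cargo_toml_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track whether we are inside [package]; other sections are ignored.
            in_package = stripped == "[package]"
            continue
        match = re.match(r'version\s*=\s*"([^"]+)"', stripped)
        if in_package and match:
            return match.group(1)
    raise ValueError("no [package] version found")


def make_run_id(version, sha):
    """Build a run identifier in the documented sqc-{version}-{sha} form."""
    return f"sqc-{version}-{sha}"
```

The SHA would typically come from `git rev-parse --short HEAD`. If the version is not bumped or the change is not committed, two different builds can map to the same run_id.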

Pre-Benchmark Checklist

  • All code changes committed

  • Version bumped in Cargo.toml (for Juliet)

  • cargo build --release successful

  • No other benchmark currently running (get_status())

  • Previous results compared if needed (compare_runs())

Juliet Benchmark Tools

Tool                               Purpose
run_benchmark(mode)                Start benchmark ("fast" default or "full")
get_status                         Check progress (%, ETA, recent CWEs)
get_results(sort_by, run)          Aggregated TP/FP across completed CWEs
get_cwe_detail(cwe_id, run)        TP/FP breakdown for a specific CWE
list_runs                          List all benchmark runs
compare_runs(base, target)         Compare two runs (TP/FP deltas)
compare_cwe(cwe_id, base, target)  Compare a CWE between two runs
cancel_benchmark                   Kill a running benchmark
clear_results                      Remove old result directories
reanalyze_run(run)                 Re-run analysis on existing CSVs

Typical Juliet workflow:

1. run_benchmark()                          # Start (fast mode)
2. get_status()                             # Check progress (every 5 min)
3. get_results()                            # After completion: summary
4. get_results(sort_by="fp_count")          # Top FP rules
5. get_cwe_detail(cwe_id="476")             # Deep dive
6. compare_runs(base="sqc-0.3.17-historical", target="latest")
7. list_runs()                              # All available runs

Run identifiers accepted by query tools:

  • "latest" – most recent run (default)

  • Full run name: "sqc-0.3.20-abc1234"

  • Commit SHA: "abc1234"

  • Historical runs: "sqc-0.3.17-historical"
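
One plausible way to resolve these identifier forms is sketched below. The function and its matching rules are assumptions for illustration, not the server's actual lookup code.

```python
def resolve_run(identifier, run_ids):
    """Resolve a run identifier against run_ids, ordered oldest to newest.

    Hypothetical rules mirroring the accepted forms: "latest", a full run
    name, or a commit SHA matched against the run-name suffix.
    """
    if not run_ids:
        raise LookupError("no runs recorded")
    if identifier == "latest":
        return run_ids[-1]
    if identifier in run_ids:
        return identifier
    # Anything else is treated as a commit SHA; prefer the newest match.
    matches = [r for r in run_ids if r.endswith("-" + identifier)]
    if matches:
        return matches[-1]
    raise LookupError(f"run not found: {identifier}")
```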

Notes:

  • run_benchmark() returns immediately – use get_status() to monitor

  • If a benchmark is already running, run_benchmark() returns the existing PID

  • Fast mode (default): per-CWE manifests, CWE-matched rules only. ~10x faster

  • Full mode: all 283 rules against every CWE. Higher noise ratio

  • Results from get_results() only include completed CWEs

  • Resume: interrupted runs skip already-completed CWEs on re-run
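
Because run_benchmark() returns immediately, a caller has to poll. A minimal polling sketch follows, assuming a get_status callable that returns a dict with a "running" key; that response shape is hypothetical and the real tool output may differ.

```python
import time

# The workflow protocol asks for at most one status check every 5 minutes.
POLL_INTERVAL_S = 5 * 60


def wait_for_completion(get_status, sleep=time.sleep, interval=POLL_INTERVAL_S, max_polls=24):
    """Poll get_status() until it reports that the run is no longer running.

    get_status is a hypothetical callable returning a dict with a "running"
    key. sleep is injectable so the loop can be exercised without waiting.
    """
    for _ in range(max_polls):
        status = get_status()
        if not status.get("running", False):
            return status
        sleep(interval)
    raise TimeoutError("benchmark still running after max_polls checks")
```

With the defaults this gives up after two hours, which comfortably covers the full-suite estimate.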

CLI Alternative

python -m bench juliet [--full] [--jobs N] [--keep-csv]
python -m bench status [RUN_ID]
python -m bench compare BASE TARGET
python -m bench runs

Real-World Benchmark Tools

Tool           Purpose
run_analysis   Run one tool against one codebase
run_all        Run all tool x codebase combinations (or filter)
get_status     Show status of all tracked runs
get_results    Parse and display results
compare_runs   Compare results between two versions
list_runs      List all version directories
cancel_run     Cancel a specific or all active runs
purge_run      Remove stale/zombie runs
clear_results  Remove old result directories
deploy_sqc     Deploy sqc binary + manifest to remote hosts

Supported tools: sqc, cppcheck, clang-tidy

Supported codebases: libcrc, sqlite, mosquitto, curl, hostap

Typical real-world workflow:

1. run_all(tool="sqc")           # Run sqc against all 5 codebases
2. get_status()                  # Monitor progress
3. get_results()                 # View all results
4. compare_runs(base_version="0.2.6", target_version="0.2.7")

Comparing Across Runs

Juliet

compare_runs(base="sqc-0.3.17-historical", target="latest")
compare_cwe(cwe_id="476", base="sqc-0.3.14-historical", target="latest")

Positive FP delta = regression. Negative = improvement.
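
The sign convention can be made concrete with a small sketch. It assumes per-CWE FP counts are available as plain dicts; the helper names are illustrative and not part of the compare_runs tool.

```python
def fp_deltas(base, target):
    """Return {cwe_id: target_fp - base_fp} for CWEs present in both runs."""
    return {cwe: target[cwe] - base[cwe] for cwe in base.keys() & target.keys()}


def regressions(deltas):
    """List CWEs whose FP count went up (positive delta), worst first."""
    return sorted((c for c, d in deltas.items() if d > 0), key=lambda c: -deltas[c])
```

For example, a CWE going from 5 to 9 false positives has a delta of +4 and counts as a regression, while one going from 10 to 7 has a delta of -3 and is an improvement.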

Real-World

compare_runs(base_version="0.2.6", target_version="0.2.7")
compare_runs(base_version="0.2.6", target_version="0.2.7", tool="sqc", codebase="sqlite")

Competitor Benchmarks (Infer / Frama-C)

The bench/competitors.py module runs Facebook Infer and Frama-C EVA on Juliet test cases and classifies findings as TP/FP using the same ground truth as the sqc benchmark (OMITBAD/OMITGOOD guards and procedure names).

Results are written to data/competitor_results/<tool>_<timestamp>.json.
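
Since each run writes a new timestamped file, it can help to pick the most recent result per tool before comparing. The helper below is a hypothetical sketch; it orders by file modification time because the timestamp format in the filename is not specified here.

```python
from pathlib import Path


def latest_result(results_dir, tool):
    """Return the most recently modified <tool>_*.json file, or None if none exist."""
    candidates = sorted(
        Path(results_dir).glob(f"{tool}_*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None
```

Usage would look like latest_result("data/competitor_results", "infer").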

Infrastructure

bench/
  competitors.py   Infer + Frama-C runners, TP/FP classification, comparison

Default CWE sets:

Tool     CWEs
Infer    476, 690, 416, 401, 415, 761, 762, 121, 122, 124, 127
Frama-C  190, 191, 476, 369, 197, 680

Running

# Run Infer on default CWEs (~80 min on 24-core)
python3 -m bench.competitors infer --jobs 8

# Run Frama-C on default CWEs (~7-9 hours)
eval $(opam env) && python3 -m bench.competitors framac --jobs 8

# Run a specific subset
python3 -m bench.competitors infer --cwes CWE476,CWE690

# Compare results
python3 -m bench.competitors compare \
  data/competitor_results/infer_*.json \
  data/competitor_results/framac_*.json

Timing Estimates

Tool     CWEs  Files   Estimated Time
Infer    11    17,232  ~80 min
Frama-C  6     11,628  ~7–9 hours

Infer uses incremental capture (infer capture --continue) per file then a single infer analyze pass per CWE. Frama-C runs EVA per-function per-file (-main <func>), which is the main bottleneck.

Classification Logic

Infer: Findings include a procedure field (e.g. CWE476_..._01_bad). A procedure name containing _bad or Bad is classified as TP; one containing good is classified as FP. Findings that match neither pattern fall back to line-level classification using parse_c_file_sections().
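
The name-based rule can be sketched as follows. This is a minimal illustration of the rule as described, with the bad check applied first; the function name is hypothetical and the actual code in bench/competitors.py may order or match differently.

```python
def classify_procedure(procedure):
    """Classify an Infer finding by procedure name.

    Returns "TP", "FP", or None when the name matches neither pattern, in
    which case the caller would fall back to line-level classification.
    """
    if "_bad" in procedure or "Bad" in procedure:
        return "TP"
    if "good" in procedure:
        return "FP"
    return None
```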

Frama-C: Each file is analyzed once per entry point (_bad function and _good/goodN functions). Alarms found when the entry point is a bad function are TP; alarms under a good entry point are FP.

Key Frama-C flags:

  • -machdep gcc_x86_64 — enables GCC extensions (required for Juliet headers)

  • -lib-entry — incomplete application analysis (no main)

  • -warn-signed-overflow -warn-signed-downcast — needed for CWE-190/191

  • -eva-precision 1 — reasonable precision/speed tradeoff
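
Put together, a per-entry-point invocation might be assembled roughly as below. This is a sketch under assumptions: the -eva flag (which enables the EVA plugin) and the argument order are not taken from bench/competitors.py, and both function names are hypothetical.

```python
def framac_argv(source_file, entry_point, precision=1):
    """Assemble a Frama-C command line for one file and one entry point."""
    return [
        "frama-c",
        "-machdep", "gcc_x86_64",     # GCC extensions, required for Juliet headers
        "-lib-entry",                 # incomplete application analysis (no main)
        "-main", entry_point,         # analyze starting from this function
        "-eva",                       # assumption: enable the EVA plugin explicitly
        "-eva-precision", str(precision),
        "-warn-signed-overflow",      # needed for CWE-190/191
        "-warn-signed-downcast",
        source_file,
    ]


def verdict_for_entry(entry_point):
    """Alarms under a bad entry point are TP; under a good entry point, FP."""
    if entry_point.endswith("_bad"):
        return "TP"
    if "good" in entry_point:
        return "FP"
    raise ValueError(f"unrecognized entry point: {entry_point}")
```

Running one such command per entry point per file is what makes Frama-C the slow path in the timing table.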

Troubleshooting

Issue                        Solution
“Benchmark already running”  get_status(), then cancel_benchmark()
Old results consuming disk   clear_results()
Results show wrong version   Ensure version bump + commit before build
SQLite locked                WAL handles concurrent reads; check for zombies
Historical run not found     Data predates SQLite migration; not available

Resolved Issues

  • DCL02-C Stack Overflow (Fixed 2026-01-07): Unbounded recursive AST traversal in DCL02-C caused stack overflow on large files (SQLite). Converted to iterative with depth limit.

  • STR31-C detect_manual_string_loop Runaway (Fixed 2026-02-25): Caused 36–49% of all violations on 3 of 5 real-world projects. File-wide fallback removed; pattern matching restricted to loop condition and body.

  • Output Buffer Saturation: sqc emits one status line per rule per file (~100 rules × N files). Always suppress or redirect output during scans:

    ./target/release/sqc directory/ --export results.csv 2>/dev/null
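
When sqc is launched from a script rather than a shell, the same redirection can be done with subprocess. The wrapper below is a minimal sketch; run_quiet is a hypothetical helper name.

```python
import subprocess


def run_quiet(argv):
    """Run a command with stderr discarded, like appending 2>/dev/null in a shell."""
    return subprocess.run(argv, stderr=subprocess.DEVNULL, check=False).returncode
```

For the scan above this would be called as run_quiet(["./target/release/sqc", "directory/", "--export", "results.csv"]). Discarding the stream rather than capturing it keeps the status lines from filling a pipe buffer.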