Running Benchmarks (MCP Server)

The MCP benchmark servers provide a programmatic interface for running Juliet and real-world benchmarks. All results are stored in data/benchmarks.db (SQLite, WAL mode).

Benchmark Infrastructure

bench/
  __init__.py      Package marker
  __main__.py      CLI: python -m bench juliet [--full] [--jobs N]
  config.py        Paths, constants, defaults
  db.py            SQLite schema, WAL mode, CRUD + query API
  analyzer.py      TP/FP classifier (Juliet ground truth)
  runner.py        Parallel CWE runner
  machine.py       Machine metadata (CPU, RAM, hostname)

SQLite Schema

Table	Purpose
`runs`	One row per benchmark run (version, SHA, mode, status, machine)
`cwe_scans`	One row per CWE per run (file count, violations, duration)
`violations`	Every individual sqc finding with TP/FP classification
`cwe_metrics`	Pre-computed aggregates per CWE (TP/FP rates)
`rule_cwe_breakdown`	Per-rule per-CWE counts
`realworld_runs`	Real-world benchmark runs (sqc version, machine)
`realworld_results`	Per-project per-tool violation counts (+ codebase_commit)
`realworld_violations`	Every individual real-world sqc finding (file, line, rule)
`ground_truth`	Adjudicated TP/FP oracle keyed on (project, commit, file, line, rule)

Historical data from JULIET_RESULTS.md and REALWORLD_RESULTS.md has been backfilled into the database.
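
For ad-hoc queries outside the MCP tools, the database can be opened directly with Python's sqlite3 module. The sketch below is a minimal illustration on a throwaway file, not on data/benchmarks.db. The cut-down runs table and its column names (run_id, mode, status) are assumptions made for the example; check bench/db.py for the real schema.

```python
import os
import sqlite3
import tempfile

def open_benchmark_db(path):
    """Open a database file and switch it to WAL journaling."""
    conn = sqlite3.connect(path)
    # The PRAGMA returns the journal mode that is now active.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.row_factory = sqlite3.Row
    return conn, mode

# Demo against a temporary file so the real database is never touched.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn, journal_mode = open_benchmark_db(demo_path)

# Hypothetical, cut-down `runs` table: the real columns may differ.
conn.execute("CREATE TABLE runs (run_id TEXT PRIMARY KEY, mode TEXT, status TEXT)")
conn.execute("INSERT INTO runs VALUES ('sqc-0.3.20-abc1234', 'fast', 'completed')")
conn.commit()

completed = [row["run_id"] for row in
             conn.execute("SELECT run_id FROM runs WHERE status = 'completed'")]
conn.close()
```

WAL mode is stored in the database file itself, so a reader that opens the file later does not need to set it again.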

Benchmark Workflow Protocol

Important

Version bump + commit BEFORE benchmark: Always bump the version in Cargo.toml, commit, and rebuild (cargo build --release) before starting, in that order. The run_id is sqc-{version}-{sha}.
NEVER modify code while a benchmark is running: The benchmark uses target/release/sqc. Rebuilding while running corrupts results.
Wait for completion: Fast-mode ~8-10 min (4-core), ~3-5 min (24-core). Full-suite ~40-50 min. Check status no more than once every 5 minutes.
Compare runs after completion.
Sequence: implement -> bump version -> commit -> build release -> run benchmark -> wait -> analyze
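
To make the run_id format concrete, here is a small sketch that derives sqc-{version}-{sha} from the text of a Cargo.toml and a commit SHA. The helper names are hypothetical, and the SHA is passed in as a plain string; in practice it would presumably come from something like git rev-parse --short HEAD, which is an assumption rather than a description of what bench/ actually does.

```python
import re

def read_cargo_version(cargo_toml_text):
    """Return the version from the [package] table of a Cargo.toml string."""
    in_package = False
    for line in cargo_toml_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track which table we are in so dependency versions are ignored.
            in_package = stripped == "[package]"
            continue
        if in_package:
            match = re.match(r'version\s*=\s*"([^"]+)"', stripped)
            if match:
                return match.group(1)
    raise ValueError("no version found in [package]")

def make_run_id(version, sha):
    # Format from the protocol above: sqc-{version}-{sha}.
    return f"sqc-{version}-{sha}"

cargo_toml = '''
[package]
name = "sqc"
version = "0.3.20"

[dependencies]
serde = { version = "1" }
'''
run_id = make_run_id(read_cargo_version(cargo_toml), "abc1234")
```

Because the SHA is part of the identifier, uncommitted changes would produce a run whose name points at the wrong tree, which is why the commit comes before the build.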

Pre-Benchmark Checklist

All code changes committed
Version bumped in Cargo.toml (for Juliet)
cargo build --release successful
No other benchmark currently running (get_status())
Previous results compared if needed (compare_runs())

Juliet Benchmark Tools

Tool	Purpose
`run_benchmark(mode)`	Start benchmark (`"fast"` default or `"full"`)
`get_status`	Check progress (%, ETA, recent CWEs)
`get_results(sort_by, run)`	Aggregated TP/FP across completed CWEs
`get_cwe_detail(cwe_id, run)`	TP/FP breakdown for a specific CWE
`list_runs`	List all benchmark runs
`compare_runs(base, target)`	Compare two runs (TP/FP deltas)
`compare_cwe(cwe_id, base, target)`	Compare a CWE between two runs
`cancel_benchmark`	Kill a running benchmark
`clear_results`	Remove old result directories
`reanalyze_run(run)`	Re-run analysis on existing CSVs

Typical Juliet workflow:

run_benchmark()                          # Start (fast mode)
get_status()                             # Check progress (at most once every 5 min)
get_results()                            # After completion: summary
get_results(sort_by="fp_count")          # Top FP rules
get_cwe_detail(cwe_id="476")             # Deep dive
compare_runs(base="sqc-0.3.17-historical", target="latest")
list_runs()                              # All available runs

Run identifiers accepted by query tools:

"latest" – most recent run (default)
Full run name: "sqc-0.3.20-abc1234"
Commit SHA: "abc1234"
Historical runs: "sqc-0.3.17-historical"
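
One way to picture how these identifier forms could be resolved is sketched below. It is an illustration only: resolve_run is a hypothetical helper, the list of run names is made up, and the real query tools may resolve identifiers differently (for example by querying the runs table).

```python
def resolve_run(identifier, run_ids):
    """Map an identifier to a run name; run_ids is ordered oldest to newest."""
    if not run_ids:
        raise LookupError("no runs recorded")
    if identifier == "latest":
        return run_ids[-1]
    if identifier in run_ids:
        # Full run name, which also covers "...-historical" runs.
        return identifier
    # Otherwise treat it as a commit SHA: the last field of sqc-{version}-{sha}.
    matches = [r for r in run_ids if r.rsplit("-", 1)[-1] == identifier]
    if len(matches) == 1:
        return matches[0]
    raise LookupError(f"cannot resolve run {identifier!r}")

# Illustrative run names, oldest first.
runs = ["sqc-0.3.17-historical", "sqc-0.3.19-9f8e7d6", "sqc-0.3.20-abc1234"]
by_sha = resolve_run("abc1234", runs)
by_default = resolve_run("latest", runs)
```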

Notes:

run_benchmark() returns immediately – use get_status() to monitor
If a benchmark is already running, run_benchmark() returns the existing PID
Fast mode (default): per-CWE manifests, CWE-matched rules only. ~10x faster
Full mode: all 283 rules against every CWE. Higher noise ratio
Results from get_results() only include completed CWEs
Resume: interrupted runs skip already-completed CWEs on re-run
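
The resume behaviour in the last note amounts to a set difference between the planned CWEs and those already completed. A minimal sketch, assuming completed CWEs can be listed for a run (the function names and the CWE list are hypothetical, not taken from bench/runner.py):

```python
def pending_cwes(all_cwes, completed_cwes):
    """Return the CWEs still to scan, preserving the planned order."""
    done = set(completed_cwes)
    return [cwe for cwe in all_cwes if cwe not in done]

def progress_percent(all_cwes, completed_cwes):
    """Share of planned CWEs that are finished, as a percentage."""
    if not all_cwes:
        return 100.0
    done = set(completed_cwes) & set(all_cwes)
    return round(100.0 * len(done) / len(all_cwes), 1)

planned = ["CWE121", "CWE190", "CWE416", "CWE476"]
finished = ["CWE190", "CWE476"]   # completed before the interruption
todo = pending_cwes(planned, finished)
```

The same completed set explains why get_results() can report partial aggregates while a run is still in progress.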

CLI Alternative

python -m bench juliet [--full] [--jobs N] [--keep-csv]
python -m bench status [RUN_ID]
python -m bench compare BASE TARGET
python -m bench runs

Real-World Benchmark Tools

Tool	Purpose
`run_analysis`	Run one tool against one codebase
`run_all`	Run all tool × codebase combinations (or filter)
`get_status`	Show status of all tracked runs
`get_results`	Parse and display results
`compare_runs`	Compare results between two versions
`list_runs`	List all version directories
`cancel_run`	Cancel a specific or all active runs
`purge_run`	Remove stale/zombie runs
`clear_results`	Remove old result directories
`deploy_sqc`	Deploy sqc binary + manifest to remote hosts

Supported tools: sqc, cppcheck, clang-tidy

Supported codebases: libcrc, sqlite, mosquitto, curl, hostap

Per-Codebase Rule Configs

Each codebase may carry its own sqc rules manifest in conf/realworld/ (the real-world analog of a project shipping its own sqc-rules.toml). The MCP server reuses it for every run of that codebase via the CODEBASES[<name>]["sqc"]["manifest"] registry entry, so rules that do not apply are ignored consistently. A codebase with no entry falls back to the shared base rules_templates/rules-benchmark.toml. The config is the categorical filter (disable a whole rule only when it is inapplicable); per-finding false positives among enabled rules are recorded in the ground_truth oracle instead, so analyzer misfires stay measured rather than hidden. See conf/realworld/README.md for the per-codebase audit workflow. libcrc is fully audited (every enabled-rule finding labelled); the four large codebases grow their labels incrementally.
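
The fallback described above can be sketched as a dictionary lookup. The registry shape follows the CODEBASES[<name>]["sqc"]["manifest"] path named in the text, but the entries and the manifest file name below are made up for illustration, and manifest_for is a hypothetical helper.

```python
BASE_MANIFEST = "rules_templates/rules-benchmark.toml"

# Illustrative registry: the file name under conf/realworld/ is invented.
CODEBASES = {
    "libcrc": {"sqc": {"manifest": "conf/realworld/libcrc.toml"}},
    "sqlite": {"sqc": {}},   # registered, but without its own manifest
}

def manifest_for(codebase, registry=CODEBASES):
    """Return the per-codebase manifest, or the shared base manifest."""
    entry = registry.get(codebase, {}).get("sqc", {})
    return entry.get("manifest") or BASE_MANIFEST

libcrc_manifest = manifest_for("libcrc")
sqlite_manifest = manifest_for("sqlite")
```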

Auto-Scoring

When a real-world run completes, get_status() ingests it and auto-scores it against the oracle: it writes a <run-dir>.score.json sidecar and returns a one-line measured precision/recall in the status payload ("measured"). Scoring only joins findings to existing labels — it never adjudicates new findings. Re-run any time with python -m bench realworld-score <RUN>.

Typical real-world workflow:

run_all(tool="sqc")           # Run sqc against all 5 codebases
get_status()                  # Monitor progress
get_results()                 # View all results
compare_runs(base_version="0.2.6", target_version="0.2.7")

Real-World Ground-Truth Oracle (measured precision/recall)

Volume deltas and CWE-aware Juliet rates do not predict real-world precision (the v0.4.22 audit measured ~2–34% precision for the noisiest rules). The ground_truth table is a growing, manually/AI-adjudicated TP/FP oracle for the real-world codebases — the real-world analog of Juliet’s OMITGOOD/OMITBAD. Because each benchmark checkout is pinned to a fixed git SHA, a label keyed on (project, codebase_commit, file_path, line, rule_id) stays valid across sqc versions: only the tool changes, never the code. Labels are appended over time, never tied to a single run.

CLI:

python -m bench ground-truth                       # label inventory
python -m bench realworld-score [RUN]              # measured precision/recall
python -m bench realworld-unlabeled [RUN] --rule R --project P --limit N --seed S
python -m bench realworld-import-labels CSV --run RUN [--source TAG] [--update]

realworld-score joins a run’s findings to labels for each project’s own codebase_commit and reports, per rule and overall:

precision = labeled-TP / (labeled-TP + labeled-FP), over the labeled subset of the run’s findings (a sampled estimate; “Label coverage” shows how much of the run is labeled);
recall = known-TPs flagged / known-TPs — a known true bug that stops being flagged drops recall, seeding regression detection.

A run whose codebase_commit has no labels is warned about, not scored.
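
The two definitions above can be worked through on a toy example. This is a sketch of the arithmetic only, not the realworld-score implementation: score_run is a hypothetical function, and the keys (project, codebase_commit, file_path, line, rule_id) use invented commit and file names.

```python
def score_run(findings, labels):
    """findings: set of keys flagged by the run; labels: dict key -> "TP" or "FP"."""
    labeled = [key for key in findings if key in labels]
    tp = sum(1 for key in labeled if labels[key] == "TP")
    fp = len(labeled) - tp
    known_tps = {key for key, verdict in labels.items() if verdict == "TP"}
    return {
        # Precision is computed over the labeled subset of the findings only.
        "precision": tp / (tp + fp) if labeled else None,
        # Recall drops when a known true bug is no longer flagged.
        "recall": len(known_tps & findings) / len(known_tps) if known_tps else None,
        "label_coverage": len(labeled) / len(findings) if findings else None,
    }

k1 = ("libcrc", "c0ffee1", "src/a.c", 10, "STR31-C")
k2 = ("libcrc", "c0ffee1", "src/a.c", 42, "STR31-C")
k3 = ("libcrc", "c0ffee1", "src/b.c", 7, "DCL02-C")
k4 = ("libcrc", "c0ffee1", "src/b.c", 99, "STR31-C")   # known TP, not flagged
k5 = ("libcrc", "c0ffee1", "src/c.c", 3, "STR31-C")    # flagged, no label yet

labels = {k1: "TP", k2: "FP", k3: "TP", k4: "TP"}
findings = {k1, k2, k3, k5}
score = score_run(findings, labels)
```

Here three of the four findings are labeled (coverage 0.75), two of those three are TPs (precision 2/3), and two of the three known TPs are flagged (recall 2/3).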

Incremental adjudication loop (need not be one-shot):

realworld-unlabeled RUN --rule X --seed S --limit N — pull findings with no label yet (reproducible sample);
adjudicate them (Claude or manual) into a CSV (rule,idx,project,file,line,verdict,reason);
realworld-import-labels CSV --run RUN — append (existing labels are skipped unless --update re-adjudicates them).
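
The loop above relies on two properties: sampling is reproducible for a given seed, and importing never overwrites an existing label unless asked to. A minimal sketch of both, with hypothetical function names and simplified keys (project, file, line, rule) that omit the commit for brevity:

```python
import random

def sample_unlabeled(findings, labels, limit, seed):
    """Pick up to `limit` unlabeled findings, reproducibly for a given seed."""
    # Sort first so the sample does not depend on input order.
    unlabeled = sorted(f for f in findings if f not in labels)
    rng = random.Random(seed)
    return rng.sample(unlabeled, min(limit, len(unlabeled)))

def import_labels(labels, new_rows, update=False):
    """Append verdicts; existing keys are skipped unless update is True."""
    added = skipped = 0
    for key, verdict in new_rows:
        if key in labels and not update:
            skipped += 1
            continue
        labels[key] = verdict
        added += 1
    return added, skipped

# Illustrative findings; the file names are made up.
findings = [
    ("libcrc", "src/a.c", 10, "STR31-C"),
    ("libcrc", "src/a.c", 42, "STR31-C"),
    ("libcrc", "src/b.c", 7, "DCL02-C"),
    ("libcrc", "src/b.c", 99, "STR31-C"),
]
labels = {("libcrc", "src/a.c", 10, "STR31-C"): "TP"}

batch = sample_unlabeled(findings, labels, limit=2, seed=7)
same_batch = sample_unlabeled(findings, labels, limit=2, seed=7)
result = import_labels(labels, [
    (("libcrc", "src/a.c", 10, "STR31-C"), "FP"),   # already labeled: skipped
    (("libcrc", "src/b.c", 7, "DCL02-C"), "FP"),    # new: appended
])
```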

The first 200 labels were seeded from data/precision_audit/adjudication_0.4.22.csv.

Comparing Across Runs

Juliet

compare_runs(base="sqc-0.3.17-historical", target="latest")
compare_cwe(cwe_id="476", base="sqc-0.3.14-historical", target="latest")

Positive FP delta = regression. Negative = improvement.
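
As a sketch of how the sign convention reads in practice, the snippet below computes per-CWE FP deltas as target minus base. The function name and the counts are invented for illustration; the actual output format of compare_runs is not shown here.

```python
def fp_deltas(base, target):
    """base/target: dict CWE id -> FP count. Returns CWE id -> (delta, verdict)."""
    out = {}
    for cwe in sorted(set(base) | set(target)):
        delta = target.get(cwe, 0) - base.get(cwe, 0)
        if delta > 0:
            verdict = "regression"      # more false positives than before
        elif delta < 0:
            verdict = "improvement"     # fewer false positives than before
        else:
            verdict = "unchanged"
        out[cwe] = (delta, verdict)
    return out

# Made-up FP counts for two runs.
deltas = fp_deltas(base={"476": 12, "190": 5}, target={"476": 9, "190": 8})
```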

Real-World

compare_runs(base_version="0.2.6", target_version="0.2.7")
compare_runs(base_version="0.2.6", target_version="0.2.7", tool="sqc", codebase="sqlite")

Competitor Benchmarks (Infer / Frama-C)

The bench/competitors.py module runs Facebook Infer and Frama-C EVA on Juliet test cases and classifies findings as TP/FP using the same ground truth as the sqc benchmark (OMITBAD/OMITGOOD guards and procedure names).

Results are written to data/competitor_results/<tool>_<timestamp>.json.

Infrastructure

bench/
  competitors.py   Infer + Frama-C runners, TP/FP classification, comparison

Default CWE sets:

Tool	CWEs
Infer	476, 690, 416, 401, 415, 761, 762, 121, 122, 124, 127
Frama-C	190, 191, 476, 369, 197, 680

Running

# Run Infer on default CWEs (~80 min on 24-core)
python3 -m bench.competitors infer --jobs 8

# Run Frama-C on default CWEs (~7-9 hours)
eval $(opam env) && python3 -m bench.competitors framac --jobs 8

# Run a specific subset
python3 -m bench.competitors infer --cwes CWE476,CWE690

# Compare results
python3 -m bench.competitors compare \
  data/competitor_results/infer_*.json \
  data/competitor_results/framac_*.json

Timing Estimates

Tool	CWEs	Files	Estimated Time
Infer	11	17,232	~80 min
Frama-C	6	11,628	~7–9 hours

Infer uses incremental capture (infer capture --continue) per file, then a single infer analyze pass per CWE. Frama-C runs EVA per function per file (-main <func>), which is the main bottleneck.

Classification Logic

Infer: Findings include a procedure field (e.g. CWE476_..._01_bad). If the procedure name contains _bad or Bad, the finding is classified as TP; if it contains good, it is classified as FP. Findings that neither rule resolves fall back to line-level classification using parse_c_file_sections().

Frama-C: Each file is analyzed once per entry point (_bad function and _good/goodN functions). Alarms found when the entry point is a bad function are TP; alarms under a good entry point are FP.
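
The name-based rules for both tools can be sketched as two small functions. These are illustrations of the rules as stated, not the code in bench/competitors.py: the function names are hypothetical, the procedure names in the example are made up, and testing for a trailing _bad in the Frama-C case is an assumption about how entry points are named.

```python
def classify_infer(procedure):
    """Return "TP", "FP", or None when the procedure name does not decide."""
    if "_bad" in procedure or "Bad" in procedure:
        return "TP"
    if "good" in procedure:
        return "FP"
    return None  # caller falls back to line-level classification

def classify_framac(entry_point):
    """Alarms inherit the verdict of the entry point they were found under."""
    # Assumption: bad entry points end in "_bad"; everything else is a good variant.
    return "TP" if entry_point.endswith("_bad") else "FP"

# Made-up names following the _bad / good pattern.
infer_verdicts = [classify_infer(p) for p in
                  ("CWE000_demo_01_bad", "goodG2B", "helper")]
framac_verdicts = [classify_framac(e) for e in
                   ("CWE000_demo_01_bad", "CWE000_demo_01_good")]
```

The bad check runs first, so a name that matched both patterns would be counted as TP under this sketch.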

Key Frama-C flags:

-machdep gcc_x86_64 — enables GCC extensions (required for Juliet headers)
-lib-entry — incomplete application analysis (no main)
-warn-signed-overflow -warn-signed-downcast — needed for CWE-190/191
-eva-precision 1 — reasonable precision/speed tradeoff
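
Put together, the flags above might be assembled into a command line like the one built below. This is a sketch under assumptions: framac_eva_args is a hypothetical helper, the -eva switch is added here to invoke the EVA plugin, the file and function names are invented, and include paths or preprocessor options that Juliet needs are left out.

```python
def framac_eva_args(source_file, entry_point):
    """Build an argv list for one EVA run on one entry point."""
    return [
        "frama-c",
        "-machdep", "gcc_x86_64",      # GCC extensions for Juliet headers
        "-lib-entry",                   # analyze without a real main
        "-main", entry_point,           # one run per _bad / good entry point
        "-warn-signed-overflow",
        "-warn-signed-downcast",
        "-eva",                         # assumption: run the EVA plugin explicitly
        "-eva-precision", "1",
        source_file,
    ]

# Invented file and function names.
args = framac_eva_args("CWE000_demo_01.c", "CWE000_demo_01_bad")
```

Building the list separately from executing it keeps the per-entry-point loop easy to test without Frama-C installed; the list can be handed to subprocess.run when needed.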

Troubleshooting

Issue	Solution
“Benchmark already running”	`get_status()`, then `cancel_benchmark()`
Old results consuming disk	`clear_results()`
Results show wrong version	Ensure version bump + commit before build
SQLite locked	WAL mode allows concurrent readers; check for zombie processes
Historical run not found	Data predates SQLite migration; not available

Resolved Issues

DCL02-C Stack Overflow (Fixed 2026-01-07): Unbounded recursive AST traversal in DCL02-C caused stack overflow on large files (SQLite). Converted to iterative with depth limit.
STR31-C detect_manual_string_loop Runaway (Fixed 2026-02-25): Caused 36–49% of all violations on 3 of 5 real-world projects. File-wide fallback removed; pattern matching restricted to loop condition and body.
Output Buffer Saturation: sqc emits one status line per rule per file (~100 rules × N files). Always suppress or redirect output during scans:
```
./target/release/sqc directory/ --export results.csv 2>/dev/null
```
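
When a scan is launched from Python rather than a shell, the same redirection can be expressed with subprocess. The sketch below mirrors 2>/dev/null by discarding stderr only; run_quiet is a hypothetical helper, and a noisy child Python process stands in for sqc so the example runs anywhere.

```python
import subprocess
import sys

def run_quiet(cmd):
    """Run cmd with stderr discarded (like 2>/dev/null); return the exit code."""
    completed = subprocess.run(cmd, stderr=subprocess.DEVNULL)
    return completed.returncode

# Stand-in for a noisy scan: a child process writing 200,000 lines to stderr.
noisy = [sys.executable, "-c",
         "import sys\nfor i in range(200000): sys.stderr.write('status line\\n')"]
exit_code = run_quiet(noisy)
```

For a real scan the argument list would be the command shown above, for example run_quiet(["./target/release/sqc", "directory/", "--export", "results.csv"]).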