Running Benchmarks (MCP Server)
The MCP benchmark servers provide a programmatic interface for running Juliet and
real-world benchmarks. All results are stored in data/benchmarks.db (SQLite,
WAL mode).
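As a rough illustration of what WAL mode means in practice, the sketch below opens a SQLite file with Python's standard sqlite3 module and switches it to write-ahead logging. It uses a throwaway temporary file rather than data/benchmarks.db, and the helper name open_wal is hypothetical; the real setup lives in bench/db.py and may differ.

```python
import os
import sqlite3
import tempfile


def open_wal(path):
    """Open a SQLite database and request WAL journaling.

    Returns the connection and the journal mode SQLite reports back.
    Hypothetical helper for illustration; not the bench/db.py API.
    """
    conn = sqlite3.connect(path)
    # PRAGMA journal_mode returns the mode that is actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
    conn, mode = open_wal(demo_path)
    print("journal mode:", mode)
    conn.close()
```

WAL lets readers keep querying while a single writer appends, which is why status queries can run alongside an active benchmark.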
Benchmark Infrastructure
bench/
    __init__.py    Package marker
    __main__.py    CLI: python -m bench juliet [--full] [--jobs N]
    config.py      Paths, constants, defaults
    db.py          SQLite schema, WAL mode, CRUD + query API
    analyzer.py    TP/FP classifier (Juliet ground truth)
    runner.py      Parallel CWE runner
    machine.py     Machine metadata (CPU, RAM, hostname)
SQLite Schema
| Table | Purpose |
|---|---|
| | One row per benchmark (version, SHA, mode, status, machine) |
| | One row per CWE per run (file count, violations, duration) |
| | Every individual sqc finding with TP/FP classification |
| | Pre-computed aggregates per CWE (TP/FP rates) |
| | Per-rule per-CWE counts |
| | Real-world benchmark runs (sqc version, machine) |
| | Per-project per-tool violation counts |
Historical data from JULIET_RESULTS.md and REALWORLD_RESULTS.md has been
backfilled into the database.
Benchmark Workflow Protocol
Important
1. Version bump + commit BEFORE benchmark: Always bump the version in Cargo.toml, rebuild (cargo build --release), and commit before starting. The run_id is sqc-{version}-{sha}.
2. NEVER modify code while a benchmark is running: The benchmark uses target/release/sqc. Rebuilding while running corrupts results.
3. Wait for completion: Fast mode takes ~8-10 min (4-core) or ~3-5 min (24-core). The full suite takes ~40-50 min. Check status no more than once every 5 minutes.
4. Compare runs after completion.
Sequence:
implement -> bump version -> commit -> build release -> run benchmark -> wait -> analyze
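The sqc-{version}-{sha} naming can be sketched as a pair of small helpers. The function names make_run_id and parse_run_id are hypothetical and only illustrate the documented format; how the benchmark actually derives the version and SHA is not shown here.

```python
def make_run_id(version: str, sha: str) -> str:
    """Build a run identifier in the documented sqc-{version}-{sha} form."""
    return f"sqc-{version}-{sha}"


def parse_run_id(run_id: str) -> tuple[str, str]:
    """Split a run identifier into (version, sha).

    The last component is whatever follows the version, so a name such as
    sqc-0.3.17-historical yields ("0.3.17", "historical").
    """
    prefix, version, sha = run_id.split("-", 2)
    if prefix != "sqc":
        raise ValueError(f"not an sqc run id: {run_id!r}")
    return version, sha


if __name__ == "__main__":
    print(make_run_id("0.3.20", "abc1234"))
    print(parse_run_id("sqc-0.3.17-historical"))
```

Because the identifier is derived from the version and the commit, skipping the bump or the commit means a new run would reuse the name of an older one.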
Pre-Benchmark Checklist
- All code changes committed
- Version bumped in Cargo.toml (for Juliet)
- cargo build --release successful
- No other benchmark currently running (get_status())
- Previous results compared if needed (compare_runs())
Juliet Benchmark Tools
| Tool | Purpose |
|---|---|
| | Start benchmark ( |
| | Check progress (%, ETA, recent CWEs) |
| | Aggregated TP/FP across completed CWEs |
| | TP/FP breakdown for a specific CWE |
| | List all benchmark runs |
| | Compare two runs (TP/FP deltas) |
| | Compare a CWE between two runs |
| | Kill a running benchmark |
| | Remove old result directories |
| | Re-run analysis on existing CSVs |
Typical Juliet workflow:
1. run_benchmark() # Start (fast mode)
2. get_status() # Check progress (every 5 min)
3. get_results() # After completion: summary
4. get_results(sort_by="fp_count") # Top FP rules
5. get_cwe_detail(cwe_id="476") # Deep dive
6. compare_runs(base="sqc-0.3.17-historical", target="latest")
7. list_runs() # All available runs
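Steps 1 to 3 of the workflow above amount to a start, poll, then read loop. The sketch below shows one way to poll at the recommended five-minute interval. It takes the status function as a parameter, and the "running" key in the returned dictionary is an assumption made for illustration; the actual shape of the get_status() response is not documented here.

```python
import time


def wait_for_completion(get_status, interval_s=300, sleep=time.sleep):
    """Poll get_status() until the run is no longer reported as running.

    get_status is any callable returning a dict with a boolean "running"
    key (hypothetical shape). Returns the final status and the poll count.
    """
    polls = 0
    while True:
        status = get_status()
        polls += 1
        if not status["running"]:
            return status, polls
        # Respect the "no more than once every 5 minutes" guideline.
        sleep(interval_s)


if __name__ == "__main__":
    # Stand-in status source: reports running twice, then finished.
    fake = iter([{"running": True}, {"running": True}, {"running": False}])
    final, polls = wait_for_completion(lambda: next(fake), sleep=lambda s: None)
    print(final, polls)
```

Injecting the sleep function keeps the loop testable without waiting in real time.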
Run identifiers accepted by query tools:
- "latest" – most recent run (default)
- Full run name: "sqc-0.3.20-abc1234"
- Commit SHA: "abc1234"
- Historical runs: "sqc-0.3.17-historical"
Notes:
- run_benchmark() returns immediately; use get_status() to monitor
- If a benchmark is already running, run_benchmark() returns the existing PID
- Fast mode (default): per-CWE manifests, CWE-matched rules only. ~10x faster
- Full mode: all 283 rules against every CWE. Higher noise ratio
- Results from get_results() only include completed CWEs
- Resume: interrupted runs skip already-completed CWEs on re-run
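The run identifier forms accepted by the query tools ("latest", a full run name, or a commit SHA) could be resolved along the following lines. This is a sketch under assumptions: resolve_run is a hypothetical name, the list of run IDs is assumed to be ordered oldest to newest, and the real lookup presumably queries the database instead of a Python list.

```python
def resolve_run(identifier, run_ids):
    """Resolve a user-supplied identifier to a full run ID.

    run_ids must be ordered oldest to newest (assumption for this sketch).
    """
    if not run_ids:
        raise LookupError("no benchmark runs recorded")
    if identifier == "latest":
        return run_ids[-1]
    if identifier in run_ids:
        # Full run name, including historical names.
        return identifier
    # Otherwise treat the identifier as a commit SHA: the last
    # dash-separated component of sqc-{version}-{sha}.
    matches = [r for r in run_ids if r.rsplit("-", 1)[-1] == identifier]
    if matches:
        return matches[-1]
    raise LookupError(f"no run matches {identifier!r}")


if __name__ == "__main__":
    runs = ["sqc-0.3.17-historical", "sqc-0.3.20-abc1234"]
    for ident in ("latest", "abc1234", "sqc-0.3.17-historical"):
        print(ident, "->", resolve_run(ident, runs))
```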
CLI Alternative
python -m bench juliet [--full] [--jobs N] [--keep-csv]
python -m bench status [RUN_ID]
python -m bench compare BASE TARGET
python -m bench runs
Real-World Benchmark Tools
| Tool | Purpose |
|---|---|
| | Run one tool against one codebase |
| | Run all tool x codebase combinations (or filter) |
| | Show status of all tracked runs |
| | Parse and display results |
| | Compare results between two versions |
| | List all version directories |
| | Cancel a specific or all active runs |
| | Remove stale/zombie runs |
| | Remove old result directories |
| | Deploy sqc binary + manifest to remote hosts |
Supported tools: sqc, cppcheck, clang-tidy
Supported codebases: libcrc, sqlite, mosquitto, curl, hostap
Typical real-world workflow:
1. run_all(tool="sqc") # Run sqc against all 5 codebases
2. get_status() # Monitor progress
3. get_results() # View all results
4. compare_runs(base_version="0.2.6", target_version="0.2.7")
Comparing Across Runs
Juliet
compare_runs(base="sqc-0.3.17-historical", target="latest")
compare_cwe(cwe_id="476", base="sqc-0.3.14-historical", target="latest")
Positive FP delta = regression. Negative = improvement.
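The sign convention can be made concrete with a small sketch that computes per-CWE FP deltas (target minus base) and lists the regressions. The function names and the counts in the usage example are purely illustrative and are not real benchmark results; the actual output format of compare_runs() is not shown here.

```python
def fp_deltas(base, target):
    """Per-CWE FP delta (target minus base) for CWEs present in both runs."""
    shared = sorted(base.keys() & target.keys())
    return {cwe: target[cwe] - base[cwe] for cwe in shared}


def regressions(deltas):
    """CWEs whose FP count went up, i.e. a positive delta."""
    return [cwe for cwe, delta in deltas.items() if delta > 0]


if __name__ == "__main__":
    # Made-up FP counts, for illustration only.
    base = {"476": 10, "690": 4, "416": 7}
    target = {"476": 12, "690": 4, "416": 3}
    deltas = fp_deltas(base, target)
    print(deltas)
    print("regressed:", regressions(deltas))
```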
Real-World
compare_runs(base_version="0.2.6", target_version="0.2.7")
compare_runs(base_version="0.2.6", target_version="0.2.7", tool="sqc", codebase="sqlite")
Competitor Benchmarks (Infer / Frama-C)
The bench/competitors.py module runs Facebook Infer and Frama-C EVA on
Juliet test cases and classifies findings as TP/FP using the same ground truth
as the sqc benchmark (OMITBAD/OMITGOOD guards and procedure names).
Results are written to data/competitor_results/<tool>_<timestamp>.json.
Infrastructure
bench/
    competitors.py    Infer + Frama-C runners, TP/FP classification, comparison
Default CWE sets:
| Tool | CWEs |
|---|---|
| Infer | 476, 690, 416, 401, 415, 761, 762, 121, 122, 124, 127 |
| Frama-C | 190, 191, 476, 369, 197, 680 |
Running
# Run Infer on default CWEs (~80 min on 24-core)
python3 -m bench.competitors infer --jobs 8
# Run Frama-C on default CWEs (~7-9 hours)
eval $(opam env) && python3 -m bench.competitors framac --jobs 8
# Run a specific subset
python3 -m bench.competitors infer --cwes CWE476,CWE690
# Compare results
python3 -m bench.competitors compare \
data/competitor_results/infer_*.json \
data/competitor_results/framac_*.json
Timing Estimates
| Tool | CWEs | Files | Estimated Time |
|---|---|---|---|
| Infer | 11 | 17,232 | ~80 min |
| Frama-C | 6 | 11,628 | ~7–9 hours |
Infer uses incremental capture (infer capture --continue) per file then a
single infer analyze pass per CWE. Frama-C runs EVA per-function per-file
(-main <func>), which is the main bottleneck.
Classification Logic
Infer: Findings include a procedure field (e.g.
CWE476_..._01_bad). If the procedure contains _bad or Bad it is
classified as TP; if it contains good it is FP. Unresolved findings fall
back to line-level classification using parse_c_file_sections().
Frama-C: Each file is analyzed once per entry point (_bad function and
_good/goodN functions). Alarms found when the entry point is a bad
function are TP; alarms under a good entry point are FP.
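The name-based rules above can be sketched as two small functions. This is an illustration of the described logic, not the code in bench/competitors.py: the function names are hypothetical, the procedure names in the usage example are made up in the Juliet naming style, and the line-level fallback via parse_c_file_sections() is represented only by returning None.

```python
def classify_infer_procedure(procedure):
    """Classify an Infer finding by the name of its procedure.

    Returns "TP", "FP", or None when the name alone does not decide
    (the caller would then fall back to line-level classification).
    """
    if "_bad" in procedure or "Bad" in procedure:
        return "TP"
    if "good" in procedure:
        return "FP"
    return None


def classify_framac_entry_point(entry_point):
    """Classify Frama-C alarms by the entry point they were found under."""
    # Entry points are the _bad function or one of the _good/goodN functions.
    return "TP" if entry_point.endswith("_bad") else "FP"


if __name__ == "__main__":
    for name in ("CWE476_example_01_bad", "goodG2B", "helper"):
        print(name, "->", classify_infer_procedure(name))
```

Checking for the bad markers first matters because a finding is only counted as FP when nothing marks its procedure as bad.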
Key Frama-C flags:
- -machdep gcc_x86_64: enables GCC extensions (required for Juliet headers)
- -lib-entry: incomplete application analysis (no main)
- -warn-signed-overflow -warn-signed-downcast: needed for CWE-190/191
- -eva-precision 1: reasonable precision/speed tradeoff
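Putting the flags together, a per-entry-point invocation might be assembled as in the sketch below. This is an assumption about how the pieces combine, not the command line used by bench/competitors.py: the function name is hypothetical, the -eva flag that switches on the EVA plugin is added here, and include paths for the Juliet support headers are left out.

```python
def framac_args(source_file, entry_point):
    """Build an argument list for one Frama-C EVA run on one entry point."""
    return [
        "frama-c",
        "-machdep", "gcc_x86_64",   # GCC extensions for Juliet headers
        "-lib-entry",               # analyze without a real main
        "-main", entry_point,       # one run per _bad / _good function
        "-warn-signed-overflow",
        "-warn-signed-downcast",
        "-eva",                     # assumption: enables the EVA plugin
        "-eva-precision", "1",
        source_file,
    ]


if __name__ == "__main__":
    # Hypothetical file and function names in the Juliet style.
    print(" ".join(framac_args("CWE190_example_01.c", "CWE190_example_01_bad")))
```

Since the argument list is rebuilt for every entry point of every file, the number of Frama-C processes grows with the number of functions, which matches the bottleneck noted under Timing Estimates.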
Troubleshooting
| Issue | Solution |
|---|---|
| “Benchmark already running” | |
| Old results consuming disk | |
| Results show wrong version | Ensure version bump + commit before build |
| SQLite locked | WAL handles concurrent reads; check for zombies |
| Historical run not found | Data predates SQLite migration; not available |
Resolved Issues
DCL02-C Stack Overflow (Fixed 2026-01-07): Unbounded recursive AST traversal in DCL02-C caused stack overflow on large files (SQLite). Converted to iterative with depth limit.
STR31-C `detect_manual_string_loop` Runaway (Fixed 2026-02-25): Caused 36–49% of all violations on 3 of 5 real-world projects. File-wide fallback removed; pattern matching restricted to loop condition and body.
Output Buffer Saturation: sqc emits one status line per rule per file (~100 rules × N files). Always suppress or redirect output during scans:
./target/release/sqc directory/ --export results.csv 2>/dev/null
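When sqc is launched from a script instead of a shell, the same redirection can be expressed with Python's subprocess module. The sketch below mirrors the command above and discards stderr; the helper names are hypothetical, and only the arguments shown in the shell example are used.

```python
import subprocess


def sqc_scan_command(target_dir, export_csv, binary="./target/release/sqc"):
    """Argument list equivalent to: sqc <dir> --export <csv>."""
    return [binary, target_dir, "--export", export_csv]


def run_scan(target_dir, export_csv):
    """Run a scan with stderr discarded, like 2>/dev/null in the shell."""
    completed = subprocess.run(
        sqc_scan_command(target_dir, export_csv),
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return completed.returncode


if __name__ == "__main__":
    # Prints the command only; call run_scan() once the release binary exists.
    print(sqc_scan_command("directory/", "results.csv"))
```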