Running Benchmarks (MCP Server)
===============================

The MCP benchmark servers provide a programmatic interface for running Juliet
and real-world benchmarks. All results are stored in ``data/benchmarks.db``
(SQLite, WAL mode).

Benchmark Infrastructure
------------------------

::

    bench/
      __init__.py         Package marker
      __main__.py         CLI: python -m bench juliet [--full] [--jobs N]
      config.py           Paths, constants, defaults
      db.py               SQLite schema, WAL mode, CRUD + query API
      analyzer.py         TP/FP classifier (Juliet ground truth)
      runner.py           Parallel CWE runner
      machine.py          Machine metadata (CPU, RAM, hostname)

SQLite Schema
~~~~~~~~~~~~~

======================= =============================================================
Table                   Purpose
======================= =============================================================
``runs``                One row per benchmark (version, SHA, mode, status, machine)
``cwe_scans``           One row per CWE per run (file count, violations, duration)
``violations``          Every individual sqc finding with TP/FP classification
``cwe_metrics``         Pre-computed aggregates per CWE (TP/FP rates)
``rule_cwe_breakdown``  Per-rule per-CWE counts
``realworld_runs``      Real-world benchmark runs (sqc version, machine)
``realworld_results``   Per-project per-tool violation counts
======================= =============================================================

Historical data from ``JULIET_RESULTS.md`` and ``REALWORLD_RESULTS.md`` has been
backfilled into the database.

Benchmark Workflow Protocol
---------------------------

.. important::

   1. **Version bump + commit BEFORE benchmark**: Always bump the version in
      ``Cargo.toml``, rebuild (``cargo build --release``), and commit before
      starting. The run_id is ``sqc-{version}-{sha}``.
   2. **NEVER modify code while a benchmark is running**: The benchmark uses
      ``target/release/sqc``. Rebuilding while running corrupts results.
   3. **Wait for completion**: Fast-mode ~8-10 min (4-core), ~3-5 min (24-core).
      Full-suite ~40-50 min. Check status no more than once every 5 minutes.
   4. **Compare runs after completion**.
   5. **Sequence**:
      ``implement -> bump version -> commit -> build release -> run benchmark -> wait -> analyze``

Pre-Benchmark Checklist
~~~~~~~~~~~~~~~~~~~~~~~~

- All code changes committed
- Version bumped in ``Cargo.toml`` (for Juliet)
- ``cargo build --release`` successful
- No other benchmark currently running (``get_status()``)
- Previous results compared if needed (``compare_runs()``)

Juliet Benchmark Tools
----------------------

========================================== ==================================================
Tool                                       Purpose
========================================== ==================================================
``run_benchmark(mode)``                    Start benchmark (``"fast"`` default or ``"full"``)
``get_status``                             Check progress (%, ETA, recent CWEs)
``get_results(sort_by, run)``              Aggregated TP/FP across completed CWEs
``get_cwe_detail(cwe_id, run)``            TP/FP breakdown for a specific CWE
``list_runs``                              List all benchmark runs
``compare_runs(base, target)``             Compare two runs (TP/FP deltas)
``compare_cwe(cwe_id, base, target)``      Compare a CWE between two runs
``cancel_benchmark``                       Kill a running benchmark
``clear_results``                          Remove old result directories
``reanalyze_run(run)``                     Re-run analysis on existing CSVs
========================================== ==================================================

Typical Juliet workflow::

    1. run_benchmark()                      # Start (fast mode)
    2. get_status()                         # Check progress (every 5 min)
    3. get_results()                        # After completion: summary
    4. get_results(sort_by="fp_count")      # Top FP rules
    5. get_cwe_detail(cwe_id="476")         # Deep dive
    6. compare_runs(base="sqc-0.3.17-historical", target="latest")
    7. list_runs()                          # All available runs

Run identifiers accepted by query tools:

- ``"latest"`` -- most recent run (default)
- Full run name: ``"sqc-0.3.20-abc1234"``
- Commit SHA: ``"abc1234"``
- Historical runs: ``"sqc-0.3.17-historical"``

**Notes**:

- ``run_benchmark()`` returns immediately -- use ``get_status()`` to monitor
- If a benchmark is already running, ``run_benchmark()`` returns the existing PID
- **Fast mode** (default): per-CWE manifests, CWE-matched rules only. ~10x faster
- **Full mode**: all 283 rules against every CWE. Higher noise ratio
- Results from ``get_results()`` only include completed CWEs
- Resume: interrupted runs skip already-completed CWEs on re-run

CLI Alternative
~~~~~~~~~~~~~~~

.. code-block:: bash

    python -m bench juliet [--full] [--jobs N] [--keep-csv]
    python -m bench status [RUN_ID]
    python -m bench compare BASE TARGET
    python -m bench runs

Real-World Benchmark Tools
--------------------------

========================================== ==================================================
Tool                                       Purpose
========================================== ==================================================
``run_analysis``                           Run one tool against one codebase
``run_all``                                Run all tool x codebase combinations (or filter)
``get_status``                             Show status of all tracked runs
``get_results``                            Parse and display results
``compare_runs``                           Compare results between two versions
``list_runs``                              List all version directories
``cancel_run``                             Cancel a specific or all active runs
``purge_run``                              Remove stale/zombie runs
``clear_results``                          Remove old result directories
``deploy_sqc``                             Deploy sqc binary + manifest to remote hosts
========================================== ==================================================

Supported tools: ``sqc``, ``cppcheck``, ``clang-tidy``

Supported codebases: ``libcrc``, ``sqlite``, ``mosquitto``, ``curl``, ``hostap``

Typical real-world workflow::

    1. run_all(tool="sqc")                  # Run sqc against all 5 codebases
    2. get_status()                         # Monitor progress
    3. get_results()                        # View all results
    4. compare_runs(base="0.2.6", target="0.2.7")

Comparing Across Runs
---------------------

Juliet
~~~~~~

::

    compare_runs(base="sqc-0.3.17-historical", target="latest")
    compare_cwe(cwe_id="476", base="sqc-0.3.14-historical", target="latest")

Positive FP delta = regression. Negative = improvement.

Real-World
~~~~~~~~~~

::

    compare_runs(base_version="0.2.6", target_version="0.2.7")
    compare_runs(base_version="0.2.6", target_version="0.2.7",
                 tool="sqc", codebase="sqlite")

Competitor Benchmarks (Infer / Frama-C)
----------------------------------------

The ``bench/competitors.py`` module runs Facebook Infer and Frama-C EVA on
Juliet test cases and classifies findings as TP/FP using the same ground truth
as the sqc benchmark (``OMITBAD``/``OMITGOOD`` guards and procedure names).
Results are written to ``data/competitor_results/_.json``.

Infrastructure
~~~~~~~~~~~~~~

::

    bench/
      competitors.py      Infer + Frama-C runners, TP/FP classification, comparison

Default CWE sets:

=========== ==================================================================
Tool        CWEs
=========== ==================================================================
Infer       476, 690, 416, 401, 415, 761, 762, 121, 122, 124, 127
Frama-C     190, 191, 476, 369, 197, 680
=========== ==================================================================

Running
~~~~~~~

.. code-block:: bash

    # Run Infer on default CWEs (~80 min on 24-core)
    python3 -m bench.competitors infer --jobs 8

    # Run Frama-C on default CWEs (~7-9 hours)
    eval $(opam env) && python3 -m bench.competitors framac --jobs 8

    # Run a specific subset
    python3 -m bench.competitors infer --cwes CWE476,CWE690

    # Compare results
    python3 -m bench.competitors compare \
        data/competitor_results/infer_*.json \
        data/competitor_results/framac_*.json

Timing Estimates
~~~~~~~~~~~~~~~~

=========== ============ =============== ==============
Tool        CWEs         Files           Estimated Time
=========== ============ =============== ==============
Infer       11           17,232          ~80 min
Frama-C     6            11,628          ~7--9 hours
=========== ============ =============== ==============

Infer uses incremental capture (``infer capture --continue``) per file then a
single ``infer analyze`` pass per CWE. Frama-C runs EVA per-function per-file
(``-main``), which is the main bottleneck.

Classification Logic
~~~~~~~~~~~~~~~~~~~~

**Infer**: Findings include a ``procedure`` field (e.g. ``CWE476_..._01_bad``).
If the procedure contains ``_bad`` or ``Bad`` it is classified as TP; if it
contains ``good`` it is FP. Unresolved findings fall back to line-level
classification using ``parse_c_file_sections()``.

**Frama-C**: Each file is analyzed once per entry point (``_bad`` function and
``_good``/``goodN`` functions). Alarms found when the entry point is a bad
function are TP; alarms under a good entry point are FP.
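The name-based rules above can be sketched in a few lines of Python. This is a
minimal illustration, not the code in ``bench/competitors.py``: the function
names ``classify_infer_procedure`` and ``classify_framac_entry`` are
hypothetical, the exact substring and suffix checks are assumptions based on the
description above, and the line-level fallback via ``parse_c_file_sections()``
is left out (the sketch returns ``None`` where that fallback would apply).

.. code-block:: python

    def classify_infer_procedure(procedure):
        """Classify an Infer finding from its ``procedure`` field.

        Returns "TP", "FP", or None when the name alone is inconclusive.
        """
        # The bad-function check runs first, matching the order in the text.
        if "_bad" in procedure or "Bad" in procedure:
            return "TP"
        if "good" in procedure:
            return "FP"
        # Unresolved: the real runner falls back to line-level classification.
        return None


    def classify_framac_entry(entry_point):
        """Classify Frama-C alarms by the entry point they were found under.

        Assumes bad entry points end in ``_bad`` and good ones contain ``good``.
        """
        if entry_point.endswith("_bad"):
            return "TP"
        if "good" in entry_point:
            return "FP"
        return None


    if __name__ == "__main__":
        for name in ("CWE476_NULL_Pointer_Dereference__char_01_bad",
                     "goodG2B", "main"):
            print(name, classify_infer_procedure(name))

Checking the bad-function pattern before the good one mirrors the order in which
the rules are stated; whether the actual module resolves names that match both
patterns the same way is not confirmed here.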
Key Frama-C flags:

- ``-machdep gcc_x86_64`` -- enables GCC extensions (required for Juliet headers)
- ``-lib-entry`` -- incomplete application analysis (no ``main``)
- ``-warn-signed-overflow -warn-signed-downcast`` -- needed for CWE-190/191
- ``-eva-precision 1`` -- reasonable precision/speed tradeoff

Troubleshooting
---------------

======================================= ==================================================
Issue                                   Solution
======================================= ==================================================
"Benchmark already running"             ``get_status()``, then ``cancel_benchmark()``
Old results consuming disk              ``clear_results()``
Results show wrong version              Ensure version bump + commit before build
SQLite locked                           WAL handles concurrent reads; check for zombies
Historical run not found                Data predates SQLite migration; not available
======================================= ==================================================

Resolved Issues
~~~~~~~~~~~~~~~

- **DCL02-C Stack Overflow** (Fixed 2026-01-07): Unbounded recursive AST
  traversal in DCL02-C caused stack overflow on large files (SQLite). Converted
  to iterative with depth limit.
- **STR31-C** ``detect_manual_string_loop`` **Runaway** (Fixed 2026-02-25):
  Caused 36--49% of all violations on 3 of 5 real-world projects. File-wide
  fallback removed; pattern matching restricted to loop condition and body.
- **Output Buffer Saturation**: SqC emits one status line per rule per file
  (~100 rules × N files). Always suppress or redirect output during scans::

      ./target/release/sqc directory/ --export results.csv 2>/dev/null
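When a scan is launched from a script rather than a shell, the same redirection
can be expressed with Python's standard ``subprocess`` module. The sketch below
is an illustration only: ``build_scan_command`` and ``run_scan`` are
hypothetical helper names, and the binary path, positional directory argument,
and ``--export`` flag are taken from the shell command above rather than from
any other documentation of the CLI.

.. code-block:: python

    import subprocess


    def build_scan_command(target_dir, csv_path, binary="./target/release/sqc"):
        """Return the argv list for one scan, mirroring the shell example."""
        return [binary, target_dir, "--export", csv_path]


    def run_scan(target_dir, csv_path, binary="./target/release/sqc"):
        """Run one scan with stderr discarded; return the process exit code."""
        result = subprocess.run(
            build_scan_command(target_dir, csv_path, binary),
            stderr=subprocess.DEVNULL,  # equivalent of ``2>/dev/null``
            check=False,                # leave exit-code handling to the caller
        )
        return result.returncode

Calling ``run_scan("directory/", "results.csv")`` would then behave like the
shell command above. If the status lines are written to standard output rather
than standard error, ``stdout=subprocess.DEVNULL`` can be added in the same way.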