Running Benchmarks (MCP Server)
===============================

The MCP benchmark servers provide a programmatic interface for running Juliet and
real-world benchmarks. All results are stored in ``data/benchmarks.db`` (SQLite,
WAL mode).

Benchmark Infrastructure
------------------------

::

    bench/
      __init__.py      Package marker
      __main__.py      CLI: python -m bench juliet [--full] [--jobs N]
      config.py        Paths, constants, defaults
      db.py            SQLite schema, WAL mode, CRUD + query API
      analyzer.py      TP/FP classifier (Juliet ground truth)
      runner.py        Parallel CWE runner
      machine.py       Machine metadata (CPU, RAM, hostname)

SQLite Schema
~~~~~~~~~~~~~

=======================  =============================================================
Table                    Purpose
=======================  =============================================================
``runs``                 One row per run (version, SHA, mode, status, machine)
``cwe_scans``            One row per CWE per run (file count, violations, duration)
``violations``           Every individual sqc finding with TP/FP classification
``cwe_metrics``          Pre-computed aggregates per CWE (TP/FP rates)
``rule_cwe_breakdown``   Per-rule per-CWE counts
``realworld_runs``       Real-world benchmark runs (sqc version, machine)
``realworld_results``    Per-project per-tool violation counts
=======================  =============================================================

Historical data from ``JULIET_RESULTS.md`` and ``REALWORLD_RESULTS.md`` has been
backfilled into the database.
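
For a quick sanity check, the database can be inspected with Python's standard
``sqlite3`` module. The sketch below is not part of ``bench/``; the helper name
``describe_db`` is hypothetical, and it reads only the journal mode and the table
names, so it makes no assumptions about column layouts.

.. code-block:: python

    import sqlite3


    def describe_db(db_path="data/benchmarks.db"):
        """Return (journal_mode, sorted table names) for a SQLite file.

        Note: sqlite3.connect() creates an empty file if db_path is missing.
        """
        conn = sqlite3.connect(db_path)
        try:
            # PRAGMA journal_mode reports "wal" for a WAL-mode database.
            mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            ).fetchall()
        finally:
            conn.close()
        return mode, [name for (name,) in rows]

Calling ``describe_db()`` from the repository root should list the tables in the
schema table above if the database has been initialized.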

Benchmark Workflow Protocol
---------------------------

.. important::

    1. **Version bump + commit BEFORE benchmark**: Always bump the version in
       ``Cargo.toml``, commit, and rebuild (``cargo build --release``) before
       starting. The run ID has the form ``sqc-{version}-{sha}``.

    2. **NEVER modify code while a benchmark is running**: The benchmark uses
       ``target/release/sqc`` for the whole run, so rebuilding mid-run swaps the
       binary and corrupts results.

    3. **Wait for completion**: Fast mode takes about 8--10 min on 4 cores and
       3--5 min on 24 cores; the full suite takes about 40--50 min. Check status
       no more than once every 5 minutes.

    4. **Compare runs after completion**.

    5. **Sequence**: ``implement -> bump version -> commit -> build release ->
       run benchmark -> wait -> analyze``
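
The ``sqc-{version}-{sha}`` run ID format can be illustrated with a small
sketch. The helper names ``make_run_id`` and ``split_run_id`` are hypothetical,
and the 7-character short SHA is an assumption based on the example identifiers
in this document; versions containing a hyphen are not handled.

.. code-block:: python

    def make_run_id(version, sha, short_len=7):
        """Build a run ID of the form sqc-{version}-{sha}."""
        # Assumption: the SHA is truncated to a 7-character short form.
        return f"sqc-{version}-{sha[:short_len]}"


    def split_run_id(run_id):
        """Split 'sqc-0.3.20-abc1234' into ('0.3.20', 'abc1234')."""
        # Split on the first two hyphens only; the rest is the suffix.
        prefix, version, suffix = run_id.split("-", 2)
        if prefix != "sqc":
            raise ValueError(f"not an sqc run ID: {run_id!r}")
        return version, suffix

Because both the version and the commit SHA are part of the ID, a run started
before the bump or the commit would be recorded under a stale identifier.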

Pre-Benchmark Checklist
~~~~~~~~~~~~~~~~~~~~~~~~

- All code changes committed
- Version bumped in ``Cargo.toml`` (for Juliet)
- ``cargo build --release`` successful
- No other benchmark currently running (``get_status()``)
- Previous results compared if needed (``compare_runs()``)

Juliet Benchmark Tools
----------------------

==========================================  =================================================
Tool                                        Purpose
==========================================  =================================================
``run_benchmark(mode)``                     Start benchmark (``"fast"`` default or ``"full"``)
``get_status``                              Check progress (%, ETA, recent CWEs)
``get_results(sort_by, run)``               Aggregated TP/FP across completed CWEs
``get_cwe_detail(cwe_id, run)``             TP/FP breakdown for a specific CWE
``list_runs``                               List all benchmark runs
``compare_runs(base, target)``              Compare two runs (TP/FP deltas)
``compare_cwe(cwe_id, base, target)``       Compare a CWE between two runs
``cancel_benchmark``                        Kill a running benchmark
``clear_results``                           Remove old result directories
``reanalyze_run(run)``                      Re-run analysis on existing CSVs
==========================================  =================================================

Typical Juliet workflow::

    1. run_benchmark()                          # Start (fast mode)
    2. get_status()                             # Check progress (every 5 min)
    3. get_results()                            # After completion: summary
    4. get_results(sort_by="fp_count")          # Top FP rules
    5. get_cwe_detail(cwe_id="476")             # Deep dive
    6. compare_runs(base="sqc-0.3.17-historical", target="latest")
    7. list_runs()                              # All available runs

Run identifiers accepted by query tools:

- ``"latest"`` -- most recent run (default)
- Full run name: ``"sqc-0.3.20-abc1234"``
- Commit SHA: ``"abc1234"``
- Historical runs: ``"sqc-0.3.17-historical"``
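
One way to picture how these identifier forms could be resolved is sketched
below. This is an illustration, not the server's actual logic: ``resolve_run``
is a hypothetical helper, and it assumes run names are available as a list
ordered oldest to newest.

.. code-block:: python

    def resolve_run(identifier, run_names):
        """Resolve an identifier against run names ordered oldest to newest."""
        if not run_names:
            raise LookupError("no runs recorded")
        if identifier == "latest":
            return run_names[-1]
        if identifier in run_names:
            # Full run name, including historical names.
            return identifier
        # Otherwise treat the identifier as a commit SHA and match the
        # trailing component of each run name.
        matches = [n for n in run_names if n.rsplit("-", 1)[-1] == identifier]
        if len(matches) == 1:
            return matches[0]
        raise LookupError(f"cannot resolve run {identifier!r}")

The sketch rejects a SHA that matches more than one run instead of guessing.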

**Notes**:

- ``run_benchmark()`` returns immediately -- use ``get_status()`` to monitor
- If a benchmark is already running, ``run_benchmark()`` returns the existing PID
- **Fast mode** (default): per-CWE manifests, CWE-matched rules only. ~10x faster
- **Full mode**: all 283 rules against every CWE. Higher noise ratio
- Results from ``get_results()`` only include completed CWEs
- Resume: interrupted runs skip already-completed CWEs on re-run
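
Since ``run_benchmark()`` returns immediately, a caller has to poll for
completion. The loop below is a minimal sketch of that pattern. The shape of the
status returned by ``get_status()`` is not specified here, so the caller passes
in both a ``get_status`` callable and an ``is_done`` predicate; both parameter
names, and the helper itself, are hypothetical.

.. code-block:: python

    import time


    def wait_for_completion(get_status, is_done, interval_s=300, sleep=time.sleep):
        """Poll get_status() until is_done(status) is true; return that status.

        The default interval of 300 s follows the guidance to check status
        no more than once every 5 minutes.
        """
        while True:
            status = get_status()
            if is_done(status):
                return status
            sleep(interval_s)

The ``sleep`` parameter exists only so the loop can be exercised without
actually waiting.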

CLI Alternative
~~~~~~~~~~~~~~~

.. code-block:: bash

    python -m bench juliet [--full] [--jobs N] [--keep-csv]
    python -m bench status [RUN_ID]
    python -m bench compare BASE TARGET
    python -m bench runs

Real-World Benchmark Tools
--------------------------

==========================================  =================================================
Tool                                        Purpose
==========================================  =================================================
``run_analysis``                            Run one tool against one codebase
``run_all``                                 Run all tool x codebase combinations (or filter)
``get_status``                              Show status of all tracked runs
``get_results``                             Parse and display results
``compare_runs``                            Compare results between two versions
``list_runs``                               List all version directories
``cancel_run``                              Cancel a specific or all active runs
``purge_run``                               Remove stale/zombie runs
``clear_results``                           Remove old result directories
``deploy_sqc``                              Deploy sqc binary + manifest to remote hosts
==========================================  =================================================

Supported tools: ``sqc``, ``cppcheck``, ``clang-tidy``

Supported codebases: ``libcrc``, ``sqlite``, ``mosquitto``, ``curl``, ``hostap``
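
The set of jobs that ``run_all`` covers is the cross product of the tools and
codebases listed above, optionally narrowed by a filter. The sketch below
enumerates that set; ``select_jobs`` is a hypothetical helper and does not
reflect how the server schedules work.

.. code-block:: python

    from itertools import product

    TOOLS = ("sqc", "cppcheck", "clang-tidy")
    CODEBASES = ("libcrc", "sqlite", "mosquitto", "curl", "hostap")


    def select_jobs(tool=None, codebase=None):
        """Return (tool, codebase) pairs, optionally filtered on either axis."""
        tools = [t for t in TOOLS if tool in (None, t)]
        codebases = [c for c in CODEBASES if codebase in (None, c)]
        return list(product(tools, codebases))

With no filter this yields 15 pairs; filtering on one tool yields 5.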

Typical real-world workflow::

    1. run_all(tool="sqc")           # Run sqc against all 5 codebases
    2. get_status()                  # Monitor progress
    3. get_results()                 # View all results
    4. compare_runs(base_version="0.2.6", target_version="0.2.7")

Comparing Across Runs
---------------------

Juliet
~~~~~~

::

    compare_runs(base="sqc-0.3.17-historical", target="latest")
    compare_cwe(cwe_id="476", base="sqc-0.3.14-historical", target="latest")

A positive FP delta is a regression (the target run reports more false
positives than the base); a negative delta is an improvement.
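
The sign convention can be made concrete with a short sketch. It assumes the
delta is computed as target minus base, which is consistent with the rule above
but is not confirmed against the implementation; ``fp_delta`` is a hypothetical
helper.

.. code-block:: python

    def fp_delta(base_fp, target_fp):
        """Return (delta, verdict) for the false-positive counts of two runs."""
        # Assumption: delta = target - base, so more FPs in the target is positive.
        delta = target_fp - base_fp
        if delta > 0:
            verdict = "regression"
        elif delta < 0:
            verdict = "improvement"
        else:
            verdict = "unchanged"
        return delta, verdict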

Real-World
~~~~~~~~~~

::

    compare_runs(base_version="0.2.6", target_version="0.2.7")
    compare_runs(base_version="0.2.6", target_version="0.2.7", tool="sqc", codebase="sqlite")

Competitor Benchmarks (Infer / Frama-C)
----------------------------------------

The ``bench/competitors.py`` module runs Facebook Infer and Frama-C EVA on
Juliet test cases and classifies findings as TP/FP using the same ground truth
as the sqc benchmark (``OMITBAD``/``OMITGOOD`` guards and procedure names).

Results are written to ``data/competitor_results/<tool>_<timestamp>.json``.

Infrastructure
~~~~~~~~~~~~~~

::

    bench/
      competitors.py   Infer + Frama-C runners, TP/FP classification, comparison

Default CWE sets:

===========  ==================================================================
Tool         CWEs
===========  ==================================================================
Infer        476, 690, 416, 401, 415, 761, 762, 121, 122, 124, 127
Frama-C      190, 191, 476, 369, 197, 680
===========  ==================================================================

Running
~~~~~~~

.. code-block:: bash

    # Run Infer on default CWEs (~80 min on 24-core)
    python3 -m bench.competitors infer --jobs 8

    # Run Frama-C on default CWEs (~7-9 hours)
    eval $(opam env) && python3 -m bench.competitors framac --jobs 8

    # Run a specific subset
    python3 -m bench.competitors infer --cwes CWE476,CWE690

    # Compare results
    python3 -m bench.competitors compare \
      data/competitor_results/infer_*.json \
      data/competitor_results/framac_*.json

Timing Estimates
~~~~~~~~~~~~~~~~

===========  ============  ===============  ==============
Tool         CWEs          Files            Estimated Time
===========  ============  ===============  ==============
Infer        11            17,232           ~80 min
Frama-C      6             11,628           ~7--9 hours
===========  ============  ===============  ==============

Infer uses incremental capture (``infer capture --continue``) per file then a
single ``infer analyze`` pass per CWE.  Frama-C runs EVA per-function per-file
(``-main <func>``), which is the main bottleneck.

Classification Logic
~~~~~~~~~~~~~~~~~~~~

**Infer**: Findings include a ``procedure`` field (e.g.
``CWE476_..._01_bad``).  If the procedure contains ``_bad`` or ``Bad`` it is
classified as TP; if it contains ``good`` it is FP.  Unresolved findings fall
back to line-level classification using ``parse_c_file_sections()``.
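
The procedure-name rule for Infer can be sketched as follows. The function name
``classify_infer_procedure`` is hypothetical; the check order (bad before good)
follows the description above, and ``None`` stands in for the unresolved case
that falls through to line-level classification.

.. code-block:: python

    def classify_infer_procedure(procedure):
        """Classify an Infer finding by its procedure name.

        Returns "TP", "FP", or None when the name alone is inconclusive.
        """
        if "_bad" in procedure or "Bad" in procedure:
            return "TP"
        if "good" in procedure:
            return "FP"
        # Unresolved: the caller falls back to line-level classification.
        return None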

**Frama-C**: Each file is analyzed once per entry point (``_bad`` function and
``_good``/``goodN`` functions).  Alarms found when the entry point is a bad
function are TP; alarms under a good entry point are FP.

Key Frama-C flags:

- ``-machdep gcc_x86_64`` -- enables GCC extensions (required for Juliet headers)
- ``-lib-entry`` -- incomplete application analysis (no ``main``)
- ``-warn-signed-overflow -warn-signed-downcast`` -- needed for CWE-190/191
- ``-eva-precision 1`` -- reasonable precision/speed tradeoff
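
Put together, a single EVA invocation for one entry point might look like the
sketch below. The exact command line used by ``bench/competitors.py`` is not
shown in this document, so the argument order, the ``-eva`` switch, and the
helper name ``framac_command`` are assumptions.

.. code-block:: python

    def framac_command(source_file, entry_point):
        """Build an argv list for one Frama-C EVA run on one entry point."""
        return [
            "frama-c",
            "-eva",                        # assumption: EVA enabled explicitly
            "-machdep", "gcc_x86_64",      # GCC extensions for Juliet headers
            "-lib-entry",                  # no complete application / main
            "-main", entry_point,          # analyze from this function
            "-warn-signed-overflow",
            "-warn-signed-downcast",
            "-eva-precision", "1",
            source_file,
        ]

One such command per entry point per file is what makes Frama-C the slower of
the two tools.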

Troubleshooting
---------------

=======================================  =============================================
Issue                                    Solution
=======================================  =============================================
"Benchmark already running"              ``get_status()``, then ``cancel_benchmark()``
Old results consuming disk               ``clear_results()``
Results show wrong version               Ensure version bump + commit before build
SQLite locked                            WAL handles concurrent reads; check for zombies
Historical run not found                 Data predates SQLite migration; not available
=======================================  =============================================

Resolved Issues
~~~~~~~~~~~~~~~

- **DCL02-C Stack Overflow** (Fixed 2026-01-07): Unbounded recursive AST traversal
  in DCL02-C caused stack overflow on large files (SQLite). Converted to iterative
  with depth limit.

- **STR31-C ``detect_manual_string_loop`` Runaway** (Fixed 2026-02-25): Caused
  36--49% of all violations on 3 of 5 real-world projects. File-wide fallback
  removed; pattern matching restricted to loop condition and body.

- **Output Buffer Saturation**: sqc emits one status line per rule per file
  (~100 rules × N files). Always suppress or redirect output during scans::

      ./target/release/sqc directory/ --export results.csv 2>/dev/null
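
When sqc is launched from a script instead of a shell, the same effect can be
had by discarding the child's output streams. The sketch below uses only the
standard ``subprocess`` module; ``run_quiet`` is a hypothetical helper, and it
discards both stdout and stderr, which is broader than the shell redirect above.

.. code-block:: python

    import subprocess


    def run_quiet(argv):
        """Run a command with stdout and stderr discarded; return its exit code."""
        completed = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,  # report the exit code instead of raising
        )
        return completed.returncode

For example, ``run_quiet(["./target/release/sqc", "directory/", "--export",
"results.csv"])`` mirrors the command shown above.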