============
Benchmarking
============

Cocoa includes a `Google Benchmark <https://github.com/google/benchmark>`_
suite for tracking performance regressions in individual computational
components. Unlike the end-to-end simulation benchmarks in
:doc:`../user_guide/performance`, these micro-benchmarks isolate specific
kernels and solver stages to identify the source of any regression.

Building
--------

Benchmarks are gated behind the ``cocoa_ENABLE_BENCHMARKS`` CMake option:

.. code-block:: bash

   cmake .. -Dcocoa_ENABLE_BENCHMARKS=ON
   make -j8 cocoa_benchmarks

This produces a single executable at ``bench/cocoa_benchmarks``, relative to
the build directory.

Running Benchmarks
------------------

.. code-block:: bash

   # Run all benchmarks
   ./bench/cocoa_benchmarks

   # Filter by component name (regex)
   ./bench/cocoa_benchmarks --benchmark_filter="GwceLumped"
   ./bench/cocoa_benchmarks --benchmark_filter="Momentum"
   ./bench/cocoa_benchmarks --benchmark_filter="WetDry"
   ./bench/cocoa_benchmarks --benchmark_filter="Friction"

   # Filter by mesh size (the Args parameter in the benchmark name)
   ./bench/cocoa_benchmarks --benchmark_filter=".*2000$"         # ~2k nodes
   ./bench/cocoa_benchmarks --benchmark_filter=".*20000$"        # ~20k nodes
   ./bench/cocoa_benchmarks --benchmark_filter=".*200000$"       # ~200k nodes
   ./bench/cocoa_benchmarks --benchmark_filter=".*2000000$"      # ~2M nodes
   ./bench/cocoa_benchmarks --benchmark_filter=".*20000000$"     # ~20M nodes

   # Combine component and mesh size filters
   ./bench/cocoa_benchmarks --benchmark_filter="GwceLumped.*20000$"

   # Control minimum measurement time (default is 0.5s; older Google Benchmark
   # releases take a bare number of seconds, e.g. 5)
   ./bench/cocoa_benchmarks --benchmark_min_time=5s

   # JSON output for regression tracking or CI
   ./bench/cocoa_benchmarks --benchmark_out=results.json --benchmark_out_format=json

   # List available benchmarks without running them
   ./bench/cocoa_benchmarks --benchmark_list_tests

Available Benchmarks
--------------------

Each benchmark is registered at five mesh sizes: ``2000`` (~2k nodes),
``20000`` (~20k nodes), ``200000`` (~200k nodes), ``2000000`` (~2M nodes),
and ``20000000`` (~20M nodes). The mesh size is passed through ``Args`` and
appears as the final component of the benchmark name, for example
``GwceLumped/Solve/20000``.
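
Because the size is the last component of the name, filters should anchor it
with ``$``. The following is a minimal sketch of how such filters select names.
It uses Python's ``re.search`` as a stand-in for Google Benchmark's own regex
engine (an assumption that should hold for simple patterns like these), and the
name list is hypothetical, built from two entries of the table below:

.. code-block:: python

   import re

   SIZES = (2000, 20000, 200000, 2000000, 20000000)

   # Hypothetical benchmark names following the Fixture/Method/size pattern.
   NAMES = [f"{bench}/{size}"
            for bench in ("GwceLumped/Solve", "WetDry/Compute")
            for size in SIZES]

   def select(pattern, names=NAMES):
       """Return the names containing a match for pattern (search, not full match)."""
       rx = re.compile(pattern)
       return [name for name in names if rx.search(name)]

   print(select("2000"))                # unanchored: every mesh size matches
   print(select("/2000$"))              # anchored: only the ~2k-node runs
   print(select("GwceLumped.*20000$"))  # one component at one mesh size

Without the ``$`` anchor, ``2000`` is a substring of every larger size, so the
filter would select all five mesh sizes.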

.. list-table::
   :header-rows: 1
   :widths: 40 40 20

   * - Benchmark
     - Component
     - Counters
   * - ``Gwce/RhsAssembly``
     - GWCE RHS vector assembly (element scatter + gather)
     - elements/s
   * - ``GwceLumped/MatrixAssembly``
     - GWCE lumped (diagonal) matrix assembly
     - elements/s
   * - ``GwceLumped/Solve``
     - GWCE lumped diagonal solve phase
     - DOF/s
   * - ``GwceConsistent/MatrixAssembly``
     - GWCE consistent sparse matrix assembly
     - elements/s
   * - ``GwceSolver_Belos/BelosCG``
     - CG solver (Belos)
     - DOF/s
   * - ``GwceSolver_SingleReduce/TpetraSingleReduce``
     - CG solver (Tpetra single reduction)
     - DOF/s
   * - ``GwceSolver_Pipeline/TpetraCgPipeline``
     - CG solver (Tpetra pipelined)
     - DOF/s
   * - ``Momentum/ElementAssembly``
     - Momentum element-parallel RHS scatter + normalize
     - elements/s
   * - ``Momentum/NodalAssembly``
     - Momentum per-node: normalize, wind stress, velocity contribution
     - DOF/s
   * - ``Momentum/Solve2x2``
     - Momentum per-node 2x2 Cramer's rule solve + land BC
     - DOF/s
   * - ``Momentum/FullSolve``
     - Full momentum solve (element + nodal assembly + 2x2 + flux)
     - DOF/s
   * - ``Friction/ManningTKM``
     - Manning bottom friction kernel
     - nodes/s
   * - ``TimeStep/FullTimeStep``
     - Full time step (friction + GWCE + momentum + wet/dry)
     - DOF/s
   * - ``WetDry/Compute``
     - Wet/dry algorithm with partially wet domain
     - elements/s

Scaling Analysis
----------------

The ``bench/analyze_scaling.py`` script classifies benchmarks as
linear-scaling or cache/memory-bandwidth limited based on how throughput
changes across mesh sizes.

.. code-block:: bash

   # Run benchmarks with JSON output
   ./bench/cocoa_benchmarks --benchmark_out=results.json --benchmark_out_format=json

   # Print scaling table
   python3 bench/analyze_scaling.py results.json

   # Print table + save scaling plot
   python3 bench/analyze_scaling.py results.json --plot scaling.png

The table reports throughput at each mesh size along with a classification and
an efficiency ratio. The plot has two panels: absolute throughput on log-log
axes, and normalized scaling efficiency with threshold lines at 85% (linear)
and 50% (moderate falloff).

**Classification criteria** (ratio of throughput at the largest mesh size to
peak throughput across all sizes):

- **linear** (>= 85%): Throughput stays near its peak as the mesh grows, so
  run time scales roughly linearly with problem size
- **moderate falloff** (>= 50% and < 85%): Some cache/bandwidth pressure
- **cache/BW limited** (< 50%): Throughput degrades significantly
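
The thresholds above can be expressed as a small function. This is a sketch of
the rule as described here, not the code in ``bench/analyze_scaling.py``; the
function name ``classify`` and the dictionary input are assumptions made for
illustration:

.. code-block:: python

   def classify(throughput_by_size):
       """Classify scaling from a {mesh_size: throughput} mapping.

       Returns (ratio, label), where ratio is the throughput at the largest
       mesh size divided by the peak throughput over all sizes.
       """
       largest = max(throughput_by_size)          # largest mesh size (dict key)
       peak = max(throughput_by_size.values())    # best throughput at any size
       ratio = throughput_by_size[largest] / peak
       if ratio >= 0.85:
           label = "linear"
       elif ratio >= 0.50:
           label = "moderate falloff"
       else:
           label = "cache/BW limited"
       return ratio, label

   # Made-up throughput numbers (elements/s), for illustration only.
   example = {2000: 1.0e8, 20000: 1.2e8, 200000: 1.1e8,
              2000000: 0.9e8, 20000000: 0.5e8}
   ratio, label = classify(example)
   print(f"{ratio:.2f} {label}")

With these made-up numbers the largest mesh reaches about 42% of the peak, so
the benchmark would be classified as cache/BW limited.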

Benchmark Mesh Generation
-------------------------

Benchmarks use programmatically generated rectangular channel meshes, requiring
no file I/O or external dependencies. The ``BenchMeshFactory`` in
``bench/BenchMeshFactory.hpp`` creates triangulated rectangular grids by
splitting each quad cell into two triangles with alternating diagonals.
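
To make the triangulation concrete, here is a minimal Python sketch of
splitting a grid of quad cells with alternating diagonals. It is not the
``BenchMeshFactory`` code (which is C++): the row-major node numbering and the
``(i + j) % 2`` alternation rule are assumptions chosen for illustration.

.. code-block:: python

   def triangulate(nx, ny):
       """Split an nx-by-ny grid of quad cells into triangles.

       Nodes are numbered row-major on an (nx+1)-by-(ny+1) lattice. The
       diagonal direction flips between neighboring cells (checkerboard).
       Each triangle is a tuple of three node indices, counterclockwise.
       """
       def node(i, j):
           return j * (nx + 1) + i

       triangles = []
       for j in range(ny):
           for i in range(nx):
               a, b = node(i, j), node(i + 1, j)          # bottom edge
               d, c = node(i, j + 1), node(i + 1, j + 1)  # top edge
               if (i + j) % 2 == 0:
                   # Diagonal from bottom-left (a) to top-right (c)
                   triangles += [(a, b, c), (a, c, d)]
               else:
                   # Diagonal from bottom-right (b) to top-left (d)
                   triangles += [(a, b, d), (b, c, d)]
       return triangles

   tris = triangulate(4, 2)
   print(len(tris))  # two triangles per quad cell

Every quad cell contributes exactly two triangles regardless of the diagonal
direction, so the element count depends only on the number of cells.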

Grid dimensions are computed from the target node count with an approximately
2:1 aspect ratio. For a grid of ``nx`` by ``ny`` quad cells, the number of
nodes is ``(nx+1) * (ny+1)`` and the number of elements is ``nx * ny * 2``.
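
As a worked example of these formulas, the sketch below picks grid dimensions
for a target node count. The rounding rule in ``grid_dims`` is an assumption;
``BenchMeshFactory`` may round differently and therefore produce slightly
different counts.

.. code-block:: python

   import math

   def grid_dims(target_nodes, aspect=2.0):
       """Pick (nx, ny) cell counts so (nx+1)*(ny+1) is close to target_nodes.

       Assumed rule: choose the number of node rows as
       round(sqrt(target/aspect)), then make the grid `aspect` times as many
       cells wide as it is tall.
       """
       ny = max(1, round(math.sqrt(target_nodes / aspect)) - 1)
       nx = max(1, round(aspect * ny))
       return nx, ny

   def node_count(nx, ny):
       return (nx + 1) * (ny + 1)

   def element_count(nx, ny):
       return nx * ny * 2

   for target in (2000, 20000):
       nx, ny = grid_dims(target)
       print(target, nx, ny, node_count(nx, ny), element_count(nx, ny))

Under this rule a target of 2000 nodes gives a 62 x 31 cell grid with 2016
nodes and 3844 elements, which is why the mesh sizes are quoted as approximate.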

**Mesh properties:**

- **Coordinates:** Gulf of Mexico region (approximately longitude -89,
  latitude 29) with Mercator projection
- **Bathymetry:** Uniform flat bottom at 10 m depth (the wet/dry benchmark
  overrides this with a sloped plane to create a partially wet domain)
- **Boundaries:** All four edges are land boundaries (no open or flow boundaries)
- **Caching:** Meshes are built once per size and cached for the duration of the
  process, so all benchmarks sharing the same mesh size reuse the same data

Regression Tracking
-------------------

Use JSON output to compare results across commits:

.. code-block:: bash

   # Before changes
   ./bench/cocoa_benchmarks --benchmark_out=before.json --benchmark_out_format=json

   # After changes
   ./bench/cocoa_benchmarks --benchmark_out=after.json --benchmark_out_format=json

   # Compare (requires google-benchmark's compare.py tool)
   python3 <benchmark-src>/tools/compare.py benchmarks before.json after.json
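
When ``compare.py`` is not available, a rough comparison can be scripted
directly against the JSON reports. The sketch below is a minimal alternative,
not a replacement for ``compare.py``: it relies on the ``benchmarks`` array and
the ``name`` and ``real_time`` fields of Google Benchmark's JSON output, assumes
both runs use the same time unit, and performs no statistical testing. The
function names and the 5% threshold are arbitrary choices for illustration.

.. code-block:: python

   import json

   def load_times(path):
       """Map benchmark name -> real_time from a Google Benchmark JSON report."""
       with open(path) as f:
           data = json.load(f)
       # Skip aggregate rows (mean/median/stddev) if repetitions were used.
       return {b["name"]: b["real_time"]
               for b in data["benchmarks"]
               if b.get("run_type", "iteration") == "iteration"}

   def relative_change(before, after):
       """Return {name: (after - before) / before} for names in both runs."""
       common = sorted(before.keys() & after.keys())
       return {name: (after[name] - before[name]) / before[name]
               for name in common}

   def regressions(before, after, threshold=0.05):
       """Return names whose time grew by more than `threshold` (a fraction)."""
       changes = relative_change(before, after)
       return [name for name, change in changes.items() if change > threshold]

   # Typical use with the files produced above:
   #   slow = regressions(load_times("before.json"), load_times("after.json"))

   # In-memory example with made-up timings:
   before = {"GwceLumped/Solve/2000": 10.0, "WetDry/Compute/2000": 4.0}
   after = {"GwceLumped/Solve/2000": 12.0, "WetDry/Compute/2000": 3.0}
   print(regressions(before, after))

Single runs are noisy, so small differences should be confirmed with repeated
measurements before being treated as regressions.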

Adding New Benchmarks
---------------------

To add a new benchmark:

1. Create a new ``.cpp`` file in ``bench/`` (e.g., ``BenchMyKernel.cpp``)
2. Include ``BenchFixtures.hpp`` for the base fixture
3. Define a fixture class inheriting from ``CocoaBench``
4. Use ``BENCHMARK_DEFINE_F`` and ``BENCHMARK_REGISTER_F`` macros
5. Add the file to ``bench/CMakeLists.txt``

Example:

.. code-block:: cpp

   #include <benchmark/benchmark.h>
   #include "BenchFixtures.hpp"

   namespace Cocoa::Bench {

   class MyBench : public CocoaBench {};

   BENCHMARK_DEFINE_F(MyBench, MyKernel)(benchmark::State& state) {
     for (auto _ : state) {
       // Call the kernel being benchmarked (my_kernel is a placeholder name)
       my_kernel(fields(), config());
     }
     set_element_counters(state);
   }

   BENCHMARK_REGISTER_F(MyBench, MyKernel)
       ->Apply(apply_mesh_sizes)
       ->Unit(benchmark::kMillisecond);

   }  // namespace Cocoa::Bench

If your benchmark needs custom setup (e.g., specialized solver configuration or
modified bathymetry), override ``SetUp`` in your fixture class. Add the
``using`` declarations so that the override does not hide the remaining
base-class overloads:

.. code-block:: cpp

   class MyBench : public CocoaBench {
    public:
     using CocoaBench::SetUp;
     using CocoaBench::TearDown;

     void SetUp(const benchmark::State& state) override {
       CocoaBench::SetUp(state);
       // Custom setup here
     }
   };

Source Organization
-------------------

.. code-block:: text

   bench/
   +-- CMakeLists.txt              Build rules
   +-- BenchMain.cpp               Custom main (Tpetra + Google Benchmark init)
   +-- BenchMeshFactory.hpp        Programmatic mesh generation
   +-- BenchFixtures.hpp           Shared base fixture (mesh/fields/config pipeline)
   +-- BenchGwceLhsAssembly.cpp    GWCE RHS/matrix assembly and lumped solve
   +-- BenchGwceSolver.cpp         CG linear solver variants
   +-- BenchMomentumSolver.cpp     Momentum solver phases (element, nodal, 2x2, full)
   +-- BenchWetDry.cpp             Wet/dry algorithm
   +-- BenchPhysics.cpp            Friction kernels
   +-- BenchTimeStep.cpp           Full time step
   +-- analyze_scaling.py          Scaling analysis and plotting script