============
Benchmarking
============

Cocoa includes a Google Benchmark suite for tracking performance regressions in
individual computational components. Unlike the end-to-end simulation benchmarks in
:doc:`../user_guide/performance`, these micro-benchmarks isolate specific kernels and
solver stages so that the source of a regression can be identified.

Building
--------

Benchmarks are gated behind the ``cocoa_ENABLE_BENCHMARKS`` CMake option:

.. code-block:: bash

   cmake .. -Dcocoa_ENABLE_BENCHMARKS=ON
   make -j8 cocoa_benchmarks

This produces a single executable at ``bench/cocoa_benchmarks``.

Running Benchmarks
------------------

.. code-block:: bash

   # Run all benchmarks
   ./bench/cocoa_benchmarks

   # Filter by component name (regex, matched against the benchmark name)
   ./bench/cocoa_benchmarks --benchmark_filter="GwceLumped"
   ./bench/cocoa_benchmarks --benchmark_filter="Momentum"
   ./bench/cocoa_benchmarks --benchmark_filter="WetDry"
   ./bench/cocoa_benchmarks --benchmark_filter="Friction"

   # Filter by mesh size (the Args parameter in the benchmark name)
   ./bench/cocoa_benchmarks --benchmark_filter=".*2000$"       # ~2k nodes
   ./bench/cocoa_benchmarks --benchmark_filter=".*20000$"      # ~20k nodes
   ./bench/cocoa_benchmarks --benchmark_filter=".*200000$"     # ~200k nodes
   ./bench/cocoa_benchmarks --benchmark_filter=".*2000000$"    # ~2M nodes
   ./bench/cocoa_benchmarks --benchmark_filter=".*20000000$"   # ~20M nodes

   # Combine component and mesh size filters
   ./bench/cocoa_benchmarks --benchmark_filter="GwceLumped.*20000$"

   # Control minimum measurement time (default is 0.5s)
   ./bench/cocoa_benchmarks --benchmark_min_time=5s

   # JSON output for regression tracking or CI
   ./bench/cocoa_benchmarks --benchmark_out=results.json --benchmark_out_format=json

   # List available benchmarks without running them
   ./bench/cocoa_benchmarks --benchmark_list_tests

Available Benchmarks
--------------------

Each benchmark is registered at five mesh sizes: ``2000`` (~2k nodes), ``20000``
(~20k nodes), ``200000`` (~200k nodes), ``2000000`` (~2M nodes), and
``20000000`` (~20M nodes). The mesh size appears as the ``Args`` parameter in the
benchmark name.

.. list-table::
   :header-rows: 1
   :widths: 40 40 20

   * - Benchmark
     - Component
     - Counters
   * - ``Gwce/RhsAssembly``
     - GWCE RHS vector assembly (element scatter + gather)
     - elements/s
   * - ``GwceLumped/MatrixAssembly``
     - GWCE lumped (diagonal) matrix assembly
     - elements/s
   * - ``GwceLumped/Solve``
     - GWCE lumped diagonal solve phase
     - DOF/s
   * - ``GwceConsistent/MatrixAssembly``
     - GWCE consistent sparse matrix assembly
     - elements/s
   * - ``GwceSolver_Belos/BelosCG``
     - CG solver (Belos)
     - DOF/s
   * - ``GwceSolver_SingleReduce/TpetraSingleReduce``
     - CG solver (Tpetra single reduction)
     - DOF/s
   * - ``GwceSolver_Pipeline/TpetraCgPipeline``
     - CG solver (Tpetra pipelined)
     - DOF/s
   * - ``Momentum/ElementAssembly``
     - Momentum element-parallel RHS scatter + normalize
     - elements/s
   * - ``Momentum/NodalAssembly``
     - Momentum per-node: normalize, wind stress, velocity contribution
     - DOF/s
   * - ``Momentum/Solve2x2``
     - Momentum per-node 2x2 Cramer's rule solve + land BC
     - DOF/s
   * - ``Momentum/FullSolve``
     - Full momentum solve (element + nodal assembly + 2x2 + flux)
     - DOF/s
   * - ``Friction/ManningTKM``
     - Manning bottom friction kernel
     - nodes/s
   * - ``TimeStep/FullTimeStep``
     - Full time step (friction + GWCE + momentum + wet/dry)
     - DOF/s
   * - ``WetDry/Compute``
     - Wet/dry algorithm with partially wet domain
     - elements/s

Scaling Analysis
----------------

The ``bench/analyze_scaling.py`` script classifies benchmarks as linear-scaling or
cache/memory-bandwidth limited based on how throughput changes across mesh sizes.

.. code-block:: bash

   # Run benchmarks with JSON output
   ./bench/cocoa_benchmarks --benchmark_out=results.json --benchmark_out_format=json

   # Print scaling table
   python3 bench/analyze_scaling.py results.json

   # Print table + save scaling plot
   python3 bench/analyze_scaling.py results.json --plot scaling.png

The table shows throughput at each mesh size with a classification and efficiency ratio.
The plot has two panels: absolute throughput (log-log) and normalized scaling
efficiency, with threshold lines at 85% (linear) and 50% (moderate falloff).

**Classification criteria** (throughput at largest size / peak throughput):

- **linear** (>= 85%): Throughput holds up as the mesh grows, so runtime scales
  roughly linearly with problem size
- **moderate falloff** (>= 50% and < 85%): Some cache/bandwidth pressure
- **cache/BW limited** (< 50%): Throughput degrades significantly

Benchmark Mesh Generation
-------------------------

Benchmarks use programmatically generated rectangular channel meshes, requiring no
file I/O or external dependencies. The ``BenchMeshFactory`` in
``bench/BenchMeshFactory.hpp`` creates triangulated rectangular grids by splitting each
quad cell into two triangles with alternating diagonals.

Grid dimensions are computed from the target node count with an approximately 2:1
aspect ratio. For a grid of ``nx`` by ``ny`` quad cells, the number of nodes is
``(nx+1) * (ny+1)`` and the number of elements is ``nx * ny * 2``.

**Mesh properties:**

- **Coordinates:** Gulf of Mexico region (approximately -89, 29 in longitude/latitude)
  with Mercator projection
- **Bathymetry:** 10m uniform flat bottom (the wet/dry benchmark overrides this with a
  sloped plane to create a partially wet domain)
- **Boundaries:** All four edges are land boundaries (no open or flow boundaries)
- **Caching:** Meshes are built once per size and cached for the duration of the
  process, so all benchmarks sharing the same mesh size reuse the same data

Regression Tracking
-------------------

Use JSON output to compare results across commits:

.. code-block:: bash

   # Before changes
   ./bench/cocoa_benchmarks --benchmark_out=before.json --benchmark_out_format=json

   # After changes
   ./bench/cocoa_benchmarks --benchmark_out=after.json --benchmark_out_format=json

   # Compare (requires Google Benchmark's compare.py tool, found under tools/ in its source tree)
   python3 /tools/compare.py benchmarks before.json after.json

Adding New Benchmarks
---------------------

To add a new benchmark:

1. Create a new ``.cpp`` file in ``bench/`` (e.g., ``BenchMyKernel.cpp``)
2. Include ``BenchFixtures.hpp`` for the base fixture
3. Define a fixture class inheriting from ``CocoaBench``
4. Use the ``BENCHMARK_DEFINE_F`` and ``BENCHMARK_REGISTER_F`` macros
5. Add the file to ``bench/CMakeLists.txt``

Example:

.. code-block:: cpp

   #include <benchmark/benchmark.h>
   #include "BenchFixtures.hpp"

   namespace Cocoa::Bench {

   class MyBench : public CocoaBench {};

   BENCHMARK_DEFINE_F(MyBench, MyKernel)(benchmark::State& state) {
       for (auto _ : state) {
           // Call the kernel being benchmarked
           my_kernel(fields(), config());
       }
       set_element_counters(state);
   }

   BENCHMARK_REGISTER_F(MyBench, MyKernel)
       ->Apply(apply_mesh_sizes)
       ->Unit(benchmark::kMillisecond);

   }  // namespace Cocoa::Bench

If your benchmark needs custom setup (e.g., specialized solver configuration or
modified bathymetry), override ``SetUp`` in your fixture class. Remember to add the
``using`` declarations to avoid hiding the base class overloads:

.. code-block:: cpp

   class MyBench : public CocoaBench {
   public:
       using CocoaBench::SetUp;
       using CocoaBench::TearDown;

       void SetUp(const benchmark::State& state) override {
           CocoaBench::SetUp(state);
           // Custom setup here
       }
   };

Source Organization
-------------------

.. code-block:: text

   bench/
   +-- CMakeLists.txt              Build rules
   +-- BenchMain.cpp               Custom main (Tpetra + Google Benchmark init)
   +-- BenchMeshFactory.hpp        Programmatic mesh generation
   +-- BenchFixtures.hpp           Shared base fixture (mesh/fields/config pipeline)
   +-- BenchGwceLhsAssembly.cpp    GWCE RHS/matrix assembly and lumped solve
   +-- BenchGwceSolver.cpp         CG linear solver variants
   +-- BenchMomentumSolver.cpp     Momentum solver phases (element, nodal, 2x2, full)
   +-- BenchWetDry.cpp             Wet/dry algorithm
   +-- BenchPhysics.cpp            Friction kernels
   +-- BenchTimeStep.cpp           Full time step
   +-- analyze_scaling.py          Scaling analysis and plotting script
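
As a supplement to the classification criteria under Scaling Analysis, the following
Python sketch shows one way the ratio and thresholds could be applied. It is an
illustration only, not the code in ``bench/analyze_scaling.py``: the function name
``classify_scaling``, its input format (a mapping from mesh size to throughput), and the
sample numbers are all assumptions made for this example.

.. code-block:: python

   def classify_scaling(throughputs):
       """Classify a benchmark from its throughput at each mesh size.

       ``throughputs`` is a hypothetical input format: a dict mapping mesh size
       (node count) to measured throughput (e.g. elements/s).
       Returns ``(efficiency, label)``.
       """
       if not throughputs:
           raise ValueError("need at least one measurement")
       peak = max(throughputs.values())
       if peak <= 0:
           raise ValueError("peak throughput must be positive")

       # Efficiency = throughput at the largest mesh size / peak throughput
       largest_size = max(throughputs)
       efficiency = throughputs[largest_size] / peak

       # Thresholds taken from the classification criteria: 85% and 50%
       if efficiency >= 0.85:
           label = "linear"
       elif efficiency >= 0.50:
           label = "moderate falloff"
       else:
           label = "cache/BW limited"
       return efficiency, label


   # Made-up throughput values, for illustration only
   sample = {2000: 4.0e8, 20000: 5.0e8, 200000: 4.5e8,
             2000000: 3.0e8, 20000000: 2.0e8}
   efficiency, label = classify_scaling(sample)
   print(f"{efficiency:.2f} {label}")  # 0.40 cache/BW limited

Because the ratio is taken against the peak rather than against the smallest mesh, a
benchmark whose throughput first rises (for example, while fixed overheads are
amortized) and then stays flat is still classified as linear.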