Benchmarking

Cocoa includes a Google Benchmark suite for tracking performance regressions in individual computational components. Unlike the end-to-end simulation benchmarks in Performance, these micro-benchmarks isolate specific kernels and solver stages to identify the source of any regression.

Building

Benchmarks are gated behind the cocoa_ENABLE_BENCHMARKS CMake option:

cmake .. -Dcocoa_ENABLE_BENCHMARKS=ON
make -j8 cocoa_benchmarks

This produces a single executable at bench/cocoa_benchmarks.

Running Benchmarks

# Run all benchmarks
./bench/cocoa_benchmarks

# Filter by component name (regex)
./bench/cocoa_benchmarks --benchmark_filter="GwceLumped"
./bench/cocoa_benchmarks --benchmark_filter="MomentumSolver"
./bench/cocoa_benchmarks --benchmark_filter="WetDry"
./bench/cocoa_benchmarks --benchmark_filter="Friction"

# Filter by mesh size (the Args parameter in the benchmark name)
./bench/cocoa_benchmarks --benchmark_filter=".*2000$"         # ~2k nodes
./bench/cocoa_benchmarks --benchmark_filter=".*20000$"        # ~20k nodes
./bench/cocoa_benchmarks --benchmark_filter=".*200000$"       # ~200k nodes
./bench/cocoa_benchmarks --benchmark_filter=".*2000000$"      # ~2M nodes
./bench/cocoa_benchmarks --benchmark_filter=".*20000000$"     # ~20M nodes

# Combine component and mesh size filters
./bench/cocoa_benchmarks --benchmark_filter="GwceLumped.*20000$"

# Control minimum measurement time (default is 0.5s)
./bench/cocoa_benchmarks --benchmark_min_time=5s

# JSON output for regression tracking or CI
./bench/cocoa_benchmarks --benchmark_out=results.json --benchmark_out_format=json

# List available benchmarks without running them
./bench/cocoa_benchmarks --benchmark_list_tests

Available Benchmarks

Each benchmark is registered at five mesh sizes: 2000 (~2k nodes), 20000 (~20k nodes), 200000 (~200k nodes), 2000000 (~2M nodes), and 20000000 (~20M nodes). The mesh size is the Args parameter and appears as the trailing component of the benchmark name (for example, Gwce/RhsAssembly/2000).

Benchmark	Component	Counters
`Gwce/RhsAssembly`	GWCE RHS vector assembly (element scatter + gather)	elements/s
`GwceLumped/MatrixAssembly`	GWCE lumped (diagonal) matrix assembly	elements/s
`GwceLumped/Solve`	GWCE lumped diagonal solve phase	DOF/s
`GwceConsistent/MatrixAssembly`	GWCE consistent sparse matrix assembly	elements/s
`GwceSolver_Belos/BelosCG`	CG solver (Belos)	DOF/s
`GwceSolver_SingleReduce/TpetraSingleReduce`	CG solver (Tpetra single reduction)	DOF/s
`GwceSolver_Pipeline/TpetraCgPipeline`	CG solver (Tpetra pipelined)	DOF/s
`Momentum/ElementAssembly`	Momentum element-parallel RHS scatter + normalize	elements/s
`Momentum/NodalAssembly`	Momentum per-node: normalize, wind stress, velocity contribution	DOF/s
`Momentum/Solve2x2`	Momentum per-node 2x2 Cramer’s rule solve + land BC	DOF/s
`Momentum/FullSolve`	Full momentum solve (element + nodal assembly + 2x2 + flux)	DOF/s
`Friction/ManningTKM`	Manning bottom friction kernel	nodes/s
`TimeStep/FullTimeStep`	Full time step (friction + GWCE + momentum + wet/dry)	DOF/s
`WetDry/Compute`	Wet/dry algorithm with partially wet domain	elements/s
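
The filter patterns shown under Running Benchmarks are matched against these names with the mesh size appended. The sketch below imitates that selection with Python's re.search to show why the trailing $ matters. It is an approximation: Python's regex engine stands in for the one Google Benchmark uses, and the select helper and the shortened name list are illustrative only.

```python
import re

# Mesh sizes from this page; names follow "<Fixture>/<Benchmark>/<size>".
MESH_SIZES = [2000, 20000, 200000, 2000000, 20000000]
BENCHMARKS = ["GwceLumped/MatrixAssembly", "GwceLumped/Solve", "WetDry/Compute"]


def select(pattern, names):
    """Return the names a filter would select, using partial (search) matching."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


names = [f"{bench}/{size}" for bench in BENCHMARKS for size in MESH_SIZES]

# The trailing "$" pins the size, so 20000 does not also pick up 200000.
print(select("GwceLumped.*20000$", names))

# Without the anchor, every size that contains "20000" is selected.
print(select("WetDry.*20000", names))
```

The first call selects only the two GwceLumped benchmarks at 20000 nodes, while the unanchored pattern also selects the 200000, 2000000, and 20000000 variants.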

Scaling Analysis

The bench/analyze_scaling.py script classifies each benchmark as linear-scaling, moderate falloff, or cache/memory-bandwidth limited, based on how throughput changes across mesh sizes.

# Run benchmarks with JSON output
./bench/cocoa_benchmarks --benchmark_out=results.json --benchmark_out_format=json

# Print scaling table
python3 bench/analyze_scaling.py results.json

# Print table + save scaling plot
python3 bench/analyze_scaling.py results.json --plot scaling.png

The table shows the throughput at each mesh size together with a classification and an efficiency ratio. The plot has two panels: absolute throughput (log-log) and normalized scaling efficiency, with threshold lines at 85% (linear) and 50% (moderate falloff).

Classification criteria (throughput at largest size / peak throughput):

linear (>= 85%): Throughput stays near its peak, so run time grows roughly linearly with problem size
moderate falloff (>= 50%): Some cache/bandwidth pressure
cache/BW limited (< 50%): Throughput degrades significantly
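
The criteria above can be sketched as a small function. This is a minimal illustration of the stated thresholds, not the actual code in analyze_scaling.py; the function name classify_scaling and the {mesh_size: throughput} input shape are assumptions, and the sample numbers are made up.

```python
def classify_scaling(throughput_by_size):
    """Classify a benchmark from a {mesh_size: throughput} mapping.

    The efficiency ratio is throughput at the largest size / peak throughput.
    """
    largest = max(throughput_by_size)
    peak = max(throughput_by_size.values())
    ratio = throughput_by_size[largest] / peak
    if ratio >= 0.85:
        label = "linear"
    elif ratio >= 0.50:
        label = "moderate falloff"
    else:
        label = "cache/BW limited"
    return label, ratio


# Made-up throughputs (elements/s), for illustration only.
sample = {2000: 4.0e8, 20000: 5.0e8, 200000: 4.5e8, 2000000: 3.0e8, 20000000: 2.0e8}
label, ratio = classify_scaling(sample)
print(f"{label} ({ratio:.0%})")
```

In this sample the peak occurs at 20000 nodes and throughput at 20000000 nodes is 40% of that peak, so the benchmark lands in the cache/BW limited class.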

Benchmark Mesh Generation

Benchmarks use programmatically generated rectangular channel meshes, requiring no file I/O or external dependencies. The BenchMeshFactory in bench/BenchMeshFactory.hpp creates triangulated rectangular grids by splitting each quad cell into two triangles with alternating diagonals.
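
As a rough illustration of the triangulation, the sketch below splits an nx-by-ny grid of quads into triangles and flips the diagonal in a checkerboard pattern. It is written in Python for brevity and is not the BenchMeshFactory code: the row-major node numbering, the checkerboard rule for "alternating", and the function name triangulate_grid are all assumptions.

```python
def triangulate_grid(nx, ny):
    """Split an nx-by-ny grid of quads into 2 * nx * ny triangles.

    Nodes are numbered row by row: node(i, j) = j * (nx + 1) + i.
    The diagonal direction alternates in a checkerboard pattern (an assumption).
    Triangles are listed counterclockwise.
    """
    def node(i, j):
        return j * (nx + 1) + i

    triangles = []
    for j in range(ny):
        for i in range(nx):
            sw, se = node(i, j), node(i + 1, j)
            nw, ne = node(i, j + 1), node(i + 1, j + 1)
            if (i + j) % 2 == 0:
                # Diagonal from south-west to north-east.
                triangles += [(sw, se, ne), (sw, ne, nw)]
            else:
                # Diagonal from south-east to north-west.
                triangles += [(sw, se, nw), (se, ne, nw)]
    return triangles


tris = triangulate_grid(3, 2)
print(len(tris))  # 3 * 2 quads, two triangles each
```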

Grid dimensions are computed from the target node count with an approximately 2:1 aspect ratio. For a grid of nx by ny quad cells, the number of nodes is (nx+1) * (ny+1) and the number of elements is nx * ny * 2.
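
One way to derive such dimensions is sketched below. The rounding rule and the function name grid_dimensions are assumptions made for illustration; the factory may pick nx and ny differently, so the resulting counts are only indicative.

```python
import math


def grid_dimensions(target_nodes, aspect=2.0):
    """Pick (nx, ny) so that (nx+1)*(ny+1) is close to target_nodes with nx ~ aspect * ny."""
    # With (nx+1) ~ aspect * (ny+1), the node count is ~ aspect * (ny+1)**2,
    # so (ny+1) ~ sqrt(target_nodes / aspect).
    ny = max(1, round(math.sqrt(target_nodes / aspect)) - 1)
    nx = max(1, round(aspect * (ny + 1)) - 1)
    return nx, ny


for target in (2000, 20000, 200000):
    nx, ny = grid_dimensions(target)
    nodes = (nx + 1) * (ny + 1)
    elements = nx * ny * 2
    print(f"target={target}: nx={nx}, ny={ny}, nodes={nodes}, elements={elements}")
```

Because nx and ny must be integers, the actual node count only approximates the target, which is why the mesh sizes are described as ~2k, ~20k, and so on.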

Mesh properties:

Coordinates: Gulf of Mexico region (around longitude -89, latitude 29) using a Mercator projection
Bathymetry: 10m uniform flat bottom (the wet/dry benchmark overrides this with a sloped plane to create a partially wet domain)
Boundaries: All four edges are land boundaries (no open or flow boundaries)
Caching: Meshes are built once per size and cached for the duration of the process, so all benchmarks sharing the same mesh size reuse the same data

Regression Tracking

Use JSON output to compare results across commits:

# Before changes
./bench/cocoa_benchmarks --benchmark_out=before.json --benchmark_out_format=json

# After changes
./bench/cocoa_benchmarks --benchmark_out=after.json --benchmark_out_format=json

# Compare (requires Google Benchmark's compare.py tool and its Python dependencies)
python3 <benchmark-src>/tools/compare.py benchmarks before.json after.json
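
If compare.py is not at hand, a quick comparison can be sketched directly from the JSON reports. The snippet below assumes the usual Google Benchmark report layout (a top-level "benchmarks" list whose entries carry "name" and "real_time"); the helper names load_times and relative_changes are hypothetical, and the numbers in the demo are made up.

```python
import json


def load_times(path):
    """Map benchmark name -> real_time from a Google Benchmark JSON report."""
    with open(path) as handle:
        report = json.load(handle)
    return {
        entry["name"]: entry["real_time"]
        for entry in report["benchmarks"]
        # Skip aggregate rows (mean/median/stddev) emitted when repetitions are used.
        if entry.get("run_type", "iteration") == "iteration"
    }


def relative_changes(before, after):
    """Return {name: (after - before) / before} for names present in both runs."""
    return {
        name: (after[name] - before[name]) / before[name]
        for name in sorted(before.keys() & after.keys())
    }


# Made-up timings; in practice use load_times("before.json") and load_times("after.json").
before = {"Gwce/RhsAssembly/2000": 1.00, "WetDry/Compute/2000": 2.00}
after = {"Gwce/RhsAssembly/2000": 1.10, "WetDry/Compute/2000": 1.90}
for name, change in relative_changes(before, after).items():
    print(f"{name:30s} {change:+.1%}")
```

A positive change means the benchmark became slower. Unlike compare.py, this sketch applies no statistical test, so small differences should be read with caution.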

Adding New Benchmarks

To add a new benchmark:

Create a new .cpp file in bench/ (e.g., BenchMyKernel.cpp)
Include BenchFixtures.hpp for the base fixture
Define a fixture class inheriting from CocoaBench
Use BENCHMARK_DEFINE_F and BENCHMARK_REGISTER_F macros
Add the file to bench/CMakeLists.txt

Example:

#include <benchmark/benchmark.h>
#include "BenchFixtures.hpp"

namespace Cocoa::Bench {

class MyBench : public CocoaBench {};

BENCHMARK_DEFINE_F(MyBench, MyKernel)(benchmark::State& state) {
  for (auto _ : state) {
    // Call the kernel being benchmarked
    my_kernel(fields(), config());
  }
  set_element_counters(state);
}

BENCHMARK_REGISTER_F(MyBench, MyKernel)
    ->Apply(apply_mesh_sizes)
    ->Unit(benchmark::kMillisecond);

}  // namespace Cocoa::Bench

If your benchmark needs custom setup (e.g., a specialized solver configuration or modified bathymetry), override SetUp in your fixture class. Add using declarations for SetUp and TearDown so that the override does not hide the other base-class overloads:

class MyBench : public CocoaBench {
 public:
  using CocoaBench::SetUp;
  using CocoaBench::TearDown;

  void SetUp(const benchmark::State& state) override {
    CocoaBench::SetUp(state);
    // Custom setup here
  }
};

Source Organization

bench/
+-- CMakeLists.txt              Build rules
+-- BenchMain.cpp               Custom main (Tpetra + Google Benchmark init)
+-- BenchMeshFactory.hpp        Programmatic mesh generation
+-- BenchFixtures.hpp           Shared base fixture (mesh/fields/config pipeline)
+-- BenchGwceLhsAssembly.cpp    GWCE RHS/matrix assembly and lumped solve
+-- BenchGwceSolver.cpp         CG linear solver variants
+-- BenchMomentumSolver.cpp     Momentum solver phases (element, nodal, 2x2, full)
+-- BenchWetDry.cpp             Wet/dry algorithm
+-- BenchPhysics.cpp            Friction kernels
+-- BenchTimeStep.cpp           Full time step
+-- analyze_scaling.py          Scaling analysis and plotting script