Benchmarking

Cocoa includes a Google Benchmark suite for tracking performance regressions in individual computational components. Unlike the end-to-end simulation benchmarks in Performance, these micro-benchmarks isolate specific kernels and solver stages to identify the source of any regression.

Building

Benchmarks are gated behind the cocoa_ENABLE_BENCHMARKS CMake option:

cmake .. -Dcocoa_ENABLE_BENCHMARKS=ON
make -j8 cocoa_benchmarks

This produces a single executable at bench/cocoa_benchmarks.

Running Benchmarks

# Run all benchmarks
./bench/cocoa_benchmarks

# Filter by component name (regex)
./bench/cocoa_benchmarks --benchmark_filter="GwceLumped"
./bench/cocoa_benchmarks --benchmark_filter="MomentumSolver"
./bench/cocoa_benchmarks --benchmark_filter="WetDry"
./bench/cocoa_benchmarks --benchmark_filter="Friction"

# Filter by mesh size (the Args parameter in the benchmark name)
./bench/cocoa_benchmarks --benchmark_filter=".*2000$"         # ~2k nodes
./bench/cocoa_benchmarks --benchmark_filter=".*20000$"        # ~20k nodes
./bench/cocoa_benchmarks --benchmark_filter=".*200000$"       # ~200k nodes
./bench/cocoa_benchmarks --benchmark_filter=".*2000000$"      # ~2M nodes
./bench/cocoa_benchmarks --benchmark_filter=".*20000000$"     # ~20M nodes

# Combine component and mesh size filters
./bench/cocoa_benchmarks --benchmark_filter="GwceLumped.*20000$"

# Control minimum measurement time (default is 0.5s)
./bench/cocoa_benchmarks --benchmark_min_time=5s

# JSON output for regression tracking or CI
./bench/cocoa_benchmarks --benchmark_out=results.json --benchmark_out_format=json

# List available benchmarks without running them
./bench/cocoa_benchmarks --benchmark_list_tests
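
The mesh-size filters above depend on the `$` anchor to keep `2000` from also selecting `20000` or `200000`. The following is a minimal sketch of that behaviour using Python's `re` module rather than the benchmark binary itself. It assumes the filter is applied as a regex search over names of the form `Fixture/Method/Size`. The names listed are illustrative stand-ins, not output copied from `--benchmark_list_tests`.

```python
import re

# Hypothetical benchmark names in an assumed "Fixture/Method/Size" layout.
# The authoritative list comes from --benchmark_list_tests.
names = [
    "GwceLumped/Solve/2000",
    "GwceLumped/Solve/20000",
    "GwceLumped/Solve/200000",
    "Friction/ManningTKM/2000",
    "Friction/ManningTKM/20000",
]


def select(pattern, candidates):
    """Return the names that contain a match for pattern (search semantics)."""
    rx = re.compile(pattern)
    return [n for n in candidates if rx.search(n)]


# "$" pins the size to the end of the name, so 2000 does not match 20000.
size_2k = select(r".*2000$", names)

# Component and size combined, as in the GwceLumped example above.
lumped_20k = select(r"GwceLumped.*20000$", names)

print(size_2k)
print(lumped_20k)
```

Trying a pattern this way before a long run is a cheap check that it selects only the intended sizes.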

Available Benchmarks

Each benchmark is registered at five mesh sizes: 2000 (~2k nodes), 20000 (~20k nodes), 200000 (~200k nodes), 2000000 (~2M nodes), and 20000000 (~20M nodes). The mesh size appears as the Args parameter in the benchmark name.

Benchmark                                   Component                                                         Counters
------------------------------------------  ----------------------------------------------------------------  ----------
Gwce/RhsAssembly                            GWCE RHS vector assembly (element scatter + gather)               elements/s
GwceLumped/MatrixAssembly                   GWCE lumped (diagonal) matrix assembly                            elements/s
GwceLumped/Solve                            GWCE lumped diagonal solve phase                                  DOF/s
GwceConsistent/MatrixAssembly               GWCE consistent sparse matrix assembly                            elements/s
GwceSolver_Belos/BelosCG                    CG solver (Belos)                                                 DOF/s
GwceSolver_SingleReduce/TpetraSingleReduce  CG solver (Tpetra single reduction)                               DOF/s
GwceSolver_Pipeline/TpetraCgPipeline        CG solver (Tpetra pipelined)                                      DOF/s
Momentum/ElementAssembly                    Momentum element-parallel RHS scatter + normalize                 elements/s
Momentum/NodalAssembly                      Momentum per-node: normalize, wind stress, velocity contribution  DOF/s
Momentum/Solve2x2                           Momentum per-node 2x2 Cramer’s rule solve + land BC               DOF/s
Momentum/FullSolve                          Full momentum solve (element + nodal assembly + 2x2 + flux)       DOF/s
Friction/ManningTKM                         Manning bottom friction kernel                                    nodes/s
TimeStep/FullTimeStep                       Full time step (friction + GWCE + momentum + wet/dry)             DOF/s
WetDry/Compute                              Wet/dry algorithm with partially wet domain                       elements/s

Scaling Analysis

The bench/analyze_scaling.py script classifies benchmarks as linear-scaling or cache/memory-bandwidth limited based on how throughput changes across mesh sizes.

# Run benchmarks with JSON output
./bench/cocoa_benchmarks --benchmark_out=results.json --benchmark_out_format=json

# Print scaling table
python3 bench/analyze_scaling.py results.json

# Print table + save scaling plot
python3 bench/analyze_scaling.py results.json --plot scaling.png

The table lists throughput at each mesh size, together with a classification and an efficiency ratio. The plot has two panels: absolute throughput (log-log) and normalized scaling efficiency, with threshold lines at 85% (linear) and 50% (moderate falloff).

Classification criteria (throughput at largest size / peak throughput):

  • linear (>= 85%): Throughput stays near its peak as the mesh grows, so runtime scales linearly with problem size

  • moderate falloff (>= 50% and < 85%): Some cache/bandwidth pressure

  • cache/BW limited (< 50%): Throughput degrades significantly
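
The criteria above reduce to one ratio and two thresholds. The following is a minimal sketch of that rule, not the code in bench/analyze_scaling.py. The function name `classify_scaling` and the sample throughput values are made up for illustration.

```python
def classify_scaling(throughputs):
    """Classify a benchmark from a {mesh_size: throughput} mapping.

    Returns (label, ratio), where ratio is the throughput at the largest
    mesh size divided by the peak throughput over all sizes.
    """
    largest = max(throughputs)         # largest mesh size (dict key)
    peak = max(throughputs.values())   # best throughput at any size
    ratio = throughputs[largest] / peak
    if ratio >= 0.85:
        label = "linear"
    elif ratio >= 0.50:
        label = "moderate falloff"
    else:
        label = "cache/BW limited"
    return label, ratio


# Made-up throughput values (elements/s), for illustration only.
sample = {
    2000: 4.0e8,
    20000: 5.0e8,
    200000: 4.5e8,
    2000000: 3.0e8,
    20000000: 2.0e8,
}
label, ratio = classify_scaling(sample)
print(label, round(ratio, 2))
```

In this sample the peak is at 20000 nodes and the largest mesh reaches 40% of it, so the benchmark falls in the cache/BW limited class.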

Benchmark Mesh Generation

Benchmarks use programmatically generated rectangular channel meshes, requiring no file I/O or external dependencies. The BenchMeshFactory in bench/BenchMeshFactory.hpp creates triangulated rectangular grids by splitting each quad cell into two triangles with alternating diagonals.

Grid dimensions are computed from the target node count with approximately 2:1 aspect ratio. The number of nodes is (nx+1) * (ny+1) and the number of elements is nx * ny * 2.
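
The node and element counts can be checked with a short sketch. It is written in Python for brevity and is not a port of BenchMeshFactory.hpp. Two details are assumptions: the rounding used to turn a target node count into nx and ny, and the checkerboard rule used to alternate the diagonals. The function names are hypothetical.

```python
import math


def grid_dims(target_nodes, aspect=2.0):
    """Return (nx, ny) cell counts with nx + 1 roughly aspect * (ny + 1).

    Assumed rounding: ny + 1 ~ sqrt(target_nodes / aspect), and then
    nx + 1 = aspect * (ny + 1). The real factory may round differently.
    """
    rows = max(2, round(math.sqrt(target_nodes / aspect)))  # ny + 1
    cols = max(2, round(aspect * rows))                     # nx + 1
    return cols - 1, rows - 1


def triangulate(nx, ny):
    """Split every quad cell into two triangles with alternating diagonals."""

    def node(i, j):
        # Row-major node index on an (nx + 1) by (ny + 1) grid.
        return j * (nx + 1) + i

    triangles = []
    for j in range(ny):
        for i in range(nx):
            a, b = node(i, j), node(i + 1, j)
            c, d = node(i + 1, j + 1), node(i, j + 1)
            if (i + j) % 2 == 0:
                triangles += [(a, b, c), (a, c, d)]  # diagonal a-c
            else:
                triangles += [(a, b, d), (b, c, d)]  # diagonal b-d
    return triangles


nx, ny = grid_dims(2000)
triangles = triangulate(nx, ny)
num_nodes = (nx + 1) * (ny + 1)
print(nx, ny, num_nodes, len(triangles))
```

For a target of 2000 nodes this sketch picks a 63 x 31 cell grid, which gives 2048 nodes and 3906 elements. That is consistent with the "~2k nodes" label and with the nx * ny * 2 element count.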

Mesh properties:

  • Coordinates: Gulf of Mexico region (approximately 89°W, 29°N) with Mercator projection

  • Bathymetry: 10 m uniform flat bottom (the wet/dry benchmark overrides this with a sloped plane to create a partially wet domain)

  • Boundaries: All four edges are land boundaries (no open or flow boundaries)

  • Caching: Meshes are built once per size and cached for the duration of the process, so all benchmarks sharing the same mesh size reuse the same data

Regression Tracking

Use JSON output to compare results across commits:

# Before changes
./bench/cocoa_benchmarks --benchmark_out=before.json --benchmark_out_format=json

# After changes
./bench/cocoa_benchmarks --benchmark_out=after.json --benchmark_out_format=json

# Compare (requires google-benchmark's compare.py tool)
python3 <benchmark-src>/tools/compare.py benchmarks before.json after.json
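
For a quick check without compare.py, the same comparison can be approximated by hand. The sketch below assumes the usual Google Benchmark JSON layout: a top-level "benchmarks" list whose entries carry "name" and "real_time". It also assumes both runs report the same time unit. The helper name `relative_changes` is hypothetical, and the two small dictionaries stand in for the contents of before.json and after.json.

```python
def relative_changes(before, after, key="real_time"):
    """Map benchmark name to (after - before) / before for the given key.

    Benchmarks missing from either run are skipped.
    """
    base = {b["name"]: b[key] for b in before["benchmarks"]}
    changes = {}
    for entry in after["benchmarks"]:
        name = entry["name"]
        if name in base and base[name] > 0:
            changes[name] = (entry[key] - base[name]) / base[name]
    return changes


# Hand-written stand-ins; real data would come from json.load() on the
# before.json and after.json files written by --benchmark_out.
before = {"benchmarks": [
    {"name": "WetDry/Compute/2000", "real_time": 2.0, "time_unit": "ms"},
]}
after = {"benchmarks": [
    {"name": "WetDry/Compute/2000", "real_time": 2.5, "time_unit": "ms"},
]}

changes = relative_changes(before, after)
for name, delta in sorted(changes.items()):
    print(f"{name}: {delta:+.1%}")
```

A positive value means the benchmark got slower. compare.py remains the better choice for real regression checks, since this sketch ignores run-to-run noise entirely.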

Adding New Benchmarks

To add a new benchmark:

  1. Create a new .cpp file in bench/ (e.g., BenchMyKernel.cpp)

  2. Include BenchFixtures.hpp for the base fixture

  3. Define a fixture class inheriting from CocoaBench

  4. Use BENCHMARK_DEFINE_F and BENCHMARK_REGISTER_F macros

  5. Add the file to bench/CMakeLists.txt

Example:

#include <benchmark/benchmark.h>
#include "BenchFixtures.hpp"

namespace Cocoa::Bench {

class MyBench : public CocoaBench {};

BENCHMARK_DEFINE_F(MyBench, MyKernel)(benchmark::State& state) {
  for (auto _ : state) {
    // Call the kernel being benchmarked
    my_kernel(fields(), config());
  }
  set_element_counters(state);
}

BENCHMARK_REGISTER_F(MyBench, MyKernel)
    ->Apply(apply_mesh_sizes)
    ->Unit(benchmark::kMillisecond);

}  // namespace Cocoa::Bench

If your benchmark needs custom setup (e.g., specialized solver configuration or modified bathymetry), override SetUp in your fixture class. Remember to add the using declarations to avoid hiding the base class overloads:

class MyBench : public CocoaBench {
 public:
  using CocoaBench::SetUp;
  using CocoaBench::TearDown;

  void SetUp(const benchmark::State& state) override {
    CocoaBench::SetUp(state);
    // Custom setup here
  }
};

Source Organization

bench/
+-- CMakeLists.txt              Build rules
+-- BenchMain.cpp               Custom main (Tpetra + Google Benchmark init)
+-- BenchMeshFactory.hpp        Programmatic mesh generation
+-- BenchFixtures.hpp           Shared base fixture (mesh/fields/config pipeline)
+-- BenchGwceLhsAssembly.cpp    GWCE RHS/matrix assembly and lumped solve
+-- BenchGwceSolver.cpp         CG linear solver variants
+-- BenchMomentumSolver.cpp     Momentum solver phases (element, nodal, 2x2, full)
+-- BenchWetDry.cpp             Wet/dry algorithm
+-- BenchPhysics.cpp            Friction kernels
+-- BenchTimeStep.cpp           Full time step
+-- analyze_scaling.py          Scaling analysis and plotting script