===========
Performance
===========

Cocoa is designed for high performance on modern GPU and CPU architectures. This page presents benchmark results comparing Cocoa with ADCIRC on a representative large-scale simulation, including strong-scaling behavior in CPU-MPI mode and single-GPU performance across three GPU generations.

Benchmark Configuration
-----------------------

The benchmark is a Hurricane Katrina (2005) hindcast with the following characteristics:

- **Mesh size:** 1,568,749 nodes
- **Forcing:** Tidal boundary, tidal potential, and hurricane wind and pressure fields
- **Time step:** 2 s for both the implicit and explicit (lumped-mass) solvers, for both models
- **Simulated duration:** 25 days

ADCIRC runs use double precision throughout. Cocoa computes in double precision with single-precision storage for bandwidth-bound fields.

**Hardware:**

- **CPU (PSC Bridges-2):** dual-socket AMD EPYC 7742 64-core nodes (128 cores per node), 128 to 1024 MPI ranks (1 to 8 nodes)
- **GPU (Lambda Labs):** NVIDIA H100 (SXM5, host Intel Platinum 8480+) and NVIDIA B200 (SXM6, host Intel Platinum 8592+), single device
- **GPU (workstation):** NVIDIA V100 (host AMD Ryzen 9 9950X), single device

Both solvers use the same time step (2 s) and run for the same simulated duration, so the implicit and explicit results are a direct, step-for-step comparison. Results are still presented separately per solver because they exercise different work per step.

Implicit Solver Results
-----------------------

.. figure:: ../_static/images/performance_comparison_implicit.png
   :alt: Cocoa vs ADCIRC implicit solver performance comparison
   :width: 90%
   :align: center

   Wall-clock time for the implicit solver (dt=2s) on the 1.57M-node Hurricane Katrina hindcast.

.. list-table::
   :header-rows: 1
   :widths: 18 42 20

   * - Model
     - Hardware
     - Wall Time
   * - ADCIRC
     - 128 cores (1 node)
     - 607m (10.1h)
   * - Cocoa
     - 128 cores (1 node)
     - 492m (8.2h)
   * - ADCIRC
     - 1024 cores (8 nodes)
     - 124m (2.1h)
   * - Cocoa
     - 1024 cores (8 nodes)
     - 142m (2.4h)
   * - Cocoa
     - 1x V100 GPU
     - 207m (3.5h)
   * - Cocoa
     - 1x H100 GPU
     - 82m (1.4h)
   * - Cocoa
     - 1x B200 GPU
     - 62m (1.0h)

Explicit Solver Results
-----------------------

.. figure:: ../_static/images/performance_comparison_explicit.png
   :alt: Cocoa vs ADCIRC explicit solver performance comparison
   :width: 90%
   :align: center

   Wall-clock time for the explicit lumped-mass solver (dt=2s) on the 1.57M-node Hurricane Katrina hindcast.

.. list-table::
   :header-rows: 1
   :widths: 18 42 20

   * - Model
     - Hardware
     - Wall Time
   * - ADCIRC
     - 128 cores (1 node)
     - 450m (7.5h)
   * - Cocoa
     - 128 cores (1 node)
     - 317m (5.3h)
   * - ADCIRC
     - 1024 cores (8 nodes)
     - 99m (1.6h)
   * - Cocoa
     - 1024 cores (8 nodes)
     - 78m (1.3h)
   * - Cocoa
     - 1x V100 GPU
     - 72m (1.2h)
   * - Cocoa
     - 1x H100 GPU
     - 35m (0.6h)
   * - Cocoa
     - 1x B200 GPU
     - 34m (0.6h)

CPU-MPI Scaling
----------------------

.. figure:: ../_static/images/scaling_comparison.png
   :alt: CPU-MPI scaling of Cocoa and ADCIRC
   :width: 100%
   :align: center

   Top: log-log scaling of wall-clock time vs MPI ranks on PSC Bridges-2, one panel per solver, with horizontal reference lines marking Cocoa's single-GPU wall times for the same solver. Bottom: parallel efficiency relative to each configuration's own 128-core run.

Wall-clock time in minutes by MPI rank count:

.. list-table::
   :header-rows: 1
   :widths: 14 18 18 18 18

   * - Ranks
     - Cocoa Implicit
     - ADCIRC Implicit
     - Cocoa Explicit
     - ADCIRC Explicit
   * - 128
     - 492
     - 607
     - 317
     - 450
   * - 256
     - 266
     - 319
     - 148
     - 239
   * - 512
     - 171
     - 160
     - 94
     - 142
   * - 1024
     - 142
     - 124
     - 78
     - 99
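
The parallel efficiency shown in the scaling figure can be recomputed from the wall-clock table above. The following is a minimal sketch, assuming the conventional strong-scaling definition: efficiency at ``N`` ranks is ``(T_128 * 128) / (T_N * N)``, with each configuration's own 128-rank run as the baseline. The names ``WALL_MINUTES``, ``speedup``, and ``parallel_efficiency`` are illustrative only and are not part of Cocoa or ADCIRC.

```python
# Wall-clock minutes copied from the CPU-MPI scaling table on this page.
WALL_MINUTES = {
    "Cocoa Implicit": {128: 492, 256: 266, 512: 171, 1024: 142},
    "ADCIRC Implicit": {128: 607, 256: 319, 512: 160, 1024: 124},
    "Cocoa Explicit": {128: 317, 256: 148, 512: 94, 1024: 78},
    "ADCIRC Explicit": {128: 450, 256: 239, 512: 142, 1024: 99},
}


def speedup(times, base_ranks=128):
    """Speedup of each run relative to the run at base_ranks."""
    base = times[base_ranks]
    return {ranks: base / t for ranks, t in sorted(times.items())}


def parallel_efficiency(times, base_ranks=128):
    """Assumed definition: speedup divided by the increase in rank count."""
    return {
        ranks: s / (ranks / base_ranks)
        for ranks, s in speedup(times, base_ranks).items()
    }


if __name__ == "__main__":
    for config, times in WALL_MINUTES.items():
        eff = parallel_efficiency(times)
        row = "  ".join(f"{ranks}: {e:.2f}" for ranks, e in eff.items())
        print(f"{config:16s} {row}")
```

Under this definition the 128-rank entry is always 1.0. Values above 1.0 indicate superlinear scaling relative to the 128-rank baseline, and values well below 1.0 indicate that added ranks are yielding diminishing returns.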