Numerical Methods
This section describes the numerical methods used in Cocoa for spatial and temporal discretization.
Spatial Discretization
Finite Element Mesh
Cocoa uses unstructured triangular meshes with linear (P1) elements [Hughes2000] [Zienkiewicz2013]:
Nodes located at triangle vertices
Linear interpolation within elements
Continuous across element boundaries
Fig. 27 A P1 linear triangular element. Each shape function Ni equals 1 at its own node and 0 at the other two. Gradients are constant within the element.
Shape Functions
For a triangle with vertices \((x_1, y_1), (x_2, y_2), (x_3, y_3)\), the linear shape functions are:
where \(A\) is the element area and coefficients are computed from vertex coordinates.
Shape function gradients:
Mass Matrix
Cocoa supports both consistent and lumped mass matrices. The consistent element mass matrix for linear triangular elements is:
The factor \(A/12\) arises from integrating the product of linear shape functions over the element. The off-diagonal entries couple neighboring nodes, providing better accuracy than a lumped formulation.
The lumped mass matrix diagonalizes the mass matrix by row-summing:
The lumped formulation enables an explicit direct solve (no iterative solver needed) at the cost of some spatial accuracy. The GWCE-specific mass matrix formulation including \(\tau_0\) weighting is given in Equation (7). See Configuration for solver selection.
Temporal Discretization
Time Integration Scheme
Cocoa uses a three-level scheme for the GWCE:
This provides second-order accuracy in time.
Momentum equations use a semi-implicit Crank-Nicolson scheme for the pressure gradient and bottom friction terms, with explicit treatment of advection and lateral stress.
Linear Solver
Assembly Pattern: Scatter-to-Gather
Cocoa assembles the finite element right-hand side with an element-parallel
atomic scatter to node vectors (Kokkos::atomic_add); the GWCE LHS matrix
uses a node-based gather. The diagram below illustrates the scatter and gather
phases of this assembly:
Fig. 28 Two-phase assembly: elements write to private buffers (scatter), then nodes accumulate contributions via CSR lookup (gather). Ghost contributions are exported to owning ranks via Tpetra.
System Assembly
The GWCE produces a sparse linear system:
The matrix \(\mathbf{A}\) has the sparsity pattern of the mesh connectivity (see the GWCE discretization in Equation (7) and (8)).
Iterative Solver
Cocoa uses the Conjugate Gradient (CG) method via the Belos solver package from Trilinos [Heroux2005] [Bavier2012]:
Preconditioning
Jacobi preconditioning is applied:
This improves convergence while maintaining GPU efficiency.
Convergence Criteria
The solver terminates when the scaled residual norm satisfies the tolerance.
By default, Cocoa scales the residual by the initial residual norm
(NormOfInitialResidual):
with default tolerance \(\epsilon = 10^{-5}\).
The convergence scaling is selectable; the three available options are:
NormOfInitialResidual (default): \(\|\mathbf{r}_k\| / \|\mathbf{r}_0\| < \epsilon\)
NormOfPrecInitRes: \(\|\mathbf{M}^{-1}\mathbf{r}_k\| / \|\mathbf{M}^{-1}\mathbf{r}_0\| < \epsilon\)
None: \(\|\mathbf{r}_k\| < \epsilon\) (absolute tolerance)
Numerical Stability
CFL Condition
For a stable explicit time step, the Courant-Friedrichs-Lewy (CFL) condition provides useful guidance for choosing \(\Delta t\):
where \(h_e\) is the element size and \(C \approx 0.5\) is the Courant number. This is descriptive guidance, not an enforced constraint: Cocoa only validates that the configured time step is positive and does not compute a CFL-limited step internally. The user is responsible for selecting a stable \(\Delta t\).
Numerical Diffusion
Explicit advection schemes introduce numerical diffusion. This is partially offset by the GWCE formulation.
Floating-Point Precision
Cocoa performs all arithmetic in double precision (FP64). Both the lumped (explicit) and consistent (implicit) GWCE solvers, the momentum solver, and all reductions accumulate in double; the consistent mass matrix in particular requires it to avoid overflow in the iterative solve.
Mixed-precision storage. To cut memory traffic without changing the numerics, a subset of bandwidth-sensitive fields is stored as single precision (float) and promoted to double the moment it is read into a kernel:
the prognostic velocity and flux fields,
the bottom-friction and lateral-stress tensors,
a subset of static mesh geometry (element areas, scale factors, Coriolis,
tan_phi, and the element-averaged bathymetry), the slope-limiter coefficients, and nodal attributes.
Bathymetry itself and the spatial gradient coefficients (the a and b
shape-function derivative arrays) remain double precision, as do
water-surface elevation, the GWCE/momentum right-hand sides, the assembled
system matrices, and all accumulators.
The float/double conversions are confined to a single sanctioned boundary at
the gather/scatter step (Cocoa::Types::promote reads float storage up to
double; Cocoa::Types::narrow writes a double result back to float). For
double-typed fields both helpers are exact no-ops, so kernels are written
uniformly regardless of a field’s storage precision. Because the working
arithmetic is unchanged, the only numerical effect is the ~1e-7 relative
round-off incurred when a float-stored field is written and later re-read.
Performance considerations. This scheme targets GPU architectures, where memory bandwidth — not double-precision throughput — is usually the limiting factor; storing the heavily-streamed fields as float reduces the bytes moved per timestep. CPU architectures see little benefit, as they already hide the extra bandwidth of full double-precision storage, but they are unaffected because compute precision is identical in both cases.