Numerical Methods

This section describes the numerical methods used in Cocoa for spatial and temporal discretization.

Spatial Discretization

Finite Element Mesh

Cocoa uses unstructured triangular meshes with linear (P1) elements [Hughes2000] [Zienkiewicz2013]:

Nodes located at triangle vertices
Linear interpolation within elements
Continuous across element boundaries

Shape Functions

For a triangle with vertices \((x_1, y_1), (x_2, y_2), (x_3, y_3)\), the linear shape functions are:

(20)\[N_i = \frac{1}{2A}(a_i + b_i x + c_i y)\]

where \(A\) is the element area and coefficients are computed from vertex coordinates.

Shape function gradients:

\[\frac{\partial N_i}{\partial x} = \frac{b_i}{2A}, \quad \frac{\partial N_i}{\partial y} = \frac{c_i}{2A}\]

Mass Matrix

Cocoa supports both consistent and lumped mass matrices. The consistent element mass matrix for linear triangular elements is:

(21)\[\begin{split}M^e_{ij} = \int_{\Omega^e} N_i N_j \, d\Omega = \frac{A}{12} \begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}\end{split}\]

The factor \(A/12\) arises from integrating the product of linear shape functions over the element. The off-diagonal entries couple neighboring nodes, providing better accuracy than a lumped formulation.

The lumped mass matrix diagonalizes the mass matrix by row-summing:

\[M^e_{ii} = \frac{A}{3}, \quad M^e_{ij} = 0 \text{ for } i \neq j\]

The lumped formulation enables an explicit direct solve (no iterative solver needed) at the cost of some spatial accuracy. The GWCE-specific mass matrix formulation including \(\tau_0\) weighting is given in Equation (7). See Configuration for solver selection.

Temporal Discretization

Time Integration Scheme

Cocoa uses a three-level scheme for the GWCE:

\[\frac{\partial^2 \zeta}{\partial t^2} \approx \frac{\zeta^{n+1} - 2\zeta^n + \zeta^{n-1}}{\Delta t^2}\]

\[\frac{\partial \zeta}{\partial t} \approx \frac{\zeta^{n+1} - \zeta^{n-1}}{2\Delta t}\]

This provides second-order accuracy in time.

Momentum equations use a semi-implicit Crank-Nicolson scheme for the pressure gradient and bottom friction terms, with explicit treatment of advection and lateral stress.

Linear Solver

Assembly Pattern: Scatter-to-Gather

Cocoa assembles the finite element right-hand side with an element-parallel atomic scatter to node vectors (Kokkos::atomic_add); the GWCE LHS matrix uses a node-based gather. The diagram below illustrates the scatter and gather phases of this assembly:

Scatter-to-gather FEM assembly pattern — Fig. 28 Two-phase assembly: elements write to private buffers (scatter), then nodes accumulate contributions via CSR lookup (gather). Ghost contributions are exported to owning ranks via Tpetra.

System Assembly

The GWCE produces a sparse linear system:

(22)\[\mathbf{A} \boldsymbol{\zeta}^{n+1} = \mathbf{b}\]

The matrix \(\mathbf{A}\) has the sparsity pattern of the mesh connectivity (see the GWCE discretization in Equation (7) and (8)).

Iterative Solver

Cocoa uses the Conjugate Gradient (CG) method via the Belos solver package from Trilinos [Heroux2005] [Bavier2012]:

\[\begin{split}&\mathbf{r}_0 = \mathbf{b} - \mathbf{A}\mathbf{x}_0 \\ &\mathbf{p}_0 = \mathbf{r}_0 \\ &\text{for } k = 0, 1, 2, \ldots \\ &\quad \alpha_k = (\mathbf{r}_k, \mathbf{r}_k) / (\mathbf{p}_k, \mathbf{A}\mathbf{p}_k) \\ &\quad \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k \\ &\quad \mathbf{r}_{k+1} = \mathbf{r}_k - \alpha_k \mathbf{A}\mathbf{p}_k \\ &\quad \beta_k = (\mathbf{r}_{k+1}, \mathbf{r}_{k+1}) / (\mathbf{r}_k, \mathbf{r}_k) \\ &\quad \mathbf{p}_{k+1} = \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k\end{split}\]

Preconditioning

Jacobi preconditioning is applied:

\[\mathbf{M}^{-1} = \text{diag}(\mathbf{A})^{-1}\]

This improves convergence while maintaining GPU efficiency.

Convergence Criteria

The solver terminates when the scaled residual norm satisfies the tolerance. By default, Cocoa scales the residual by the initial residual norm (NormOfInitialResidual):

\[\frac{\|\mathbf{r}_k\|}{\|\mathbf{r}_0\|} < \epsilon\]

with default tolerance \(\epsilon = 10^{-5}\).

The convergence scaling is selectable; the three available options are:

NormOfInitialResidual (default): \(\|\mathbf{r}_k\| / \|\mathbf{r}_0\| < \epsilon\)
NormOfPrecInitRes: \(\|\mathbf{M}^{-1}\mathbf{r}_k\| / \|\mathbf{M}^{-1}\mathbf{r}_0\| < \epsilon\)
None: \(\|\mathbf{r}_k\| < \epsilon\) (absolute tolerance)

Numerical Stability

CFL Condition

For a stable explicit time step, the Courant-Friedrichs-Lewy (CFL) condition provides useful guidance for choosing \(\Delta t\):

(23)\[\Delta t \lesssim C \cdot \min_e \frac{h_e}{\sqrt{gH_e}}\]

where \(h_e\) is the element size and \(C \approx 0.5\) is the Courant number. This is descriptive guidance, not an enforced constraint: Cocoa only validates that the configured time step is positive and does not compute a CFL-limited step internally. The user is responsible for selecting a stable \(\Delta t\).

Numerical Diffusion

Explicit advection schemes introduce numerical diffusion. This is partially offset by the GWCE formulation.

Floating-Point Precision

Cocoa performs all arithmetic in double precision (FP64). Both the lumped (explicit) and consistent (implicit) GWCE solvers, the momentum solver, and all reductions accumulate in double; the consistent mass matrix in particular requires it to avoid overflow in the iterative solve.

Mixed-precision storage. To cut memory traffic without changing the numerics, a subset of bandwidth-sensitive fields is stored as single precision (float) and promoted to double the moment it is read into a kernel:

the prognostic velocity and flux fields,
the bottom-friction and lateral-stress tensors,
a subset of static mesh geometry (element areas, scale factors, Coriolis, tan_phi, and the element-averaged bathymetry), the slope-limiter coefficients, and nodal attributes.

Bathymetry itself and the spatial gradient coefficients (the a and b shape-function derivative arrays) remain double precision, as do water-surface elevation, the GWCE/momentum right-hand sides, the assembled system matrices, and all accumulators.

The float/double conversions are confined to a single sanctioned boundary at the gather/scatter step (Cocoa::Types::promote reads float storage up to double; Cocoa::Types::narrow writes a double result back to float). For double-typed fields both helpers are exact no-ops, so kernels are written uniformly regardless of a field’s storage precision. Because the working arithmetic is unchanged, the only numerical effect is the ~1e-7 relative round-off incurred when a float-stored field is written and later re-read.

Performance considerations. This scheme targets GPU architectures, where memory bandwidth — not double-precision throughput — is usually the limiting factor; storing the heavily-streamed fields as float reduces the bytes moved per timestep. CPU architectures see little benefit, as they already hide the extra bandwidth of full double-precision storage, but they are unaffected because compute precision is identical in both cases.