Numerical Methods

This section describes the numerical methods used in Cocoa for spatial and temporal discretization.

Spatial Discretization

Finite Element Mesh

Cocoa uses unstructured triangular meshes with linear (P1) elements [Hughes2000] [Zienkiewicz2013]:

  • Nodes located at triangle vertices

  • Linear interpolation within elements

  • Continuous across element boundaries

P1 linear triangular element with labeled nodes and shape functions

Fig. 27 A P1 linear triangular element. Each shape function Ni equals 1 at its own node and 0 at the other two. Gradients are constant within the element.

Shape Functions

For a triangle with vertices \((x_1, y_1), (x_2, y_2), (x_3, y_3)\), the linear shape functions are:

(20)\[N_i = \frac{1}{2A}(a_i + b_i x + c_i y)\]

where \(A\) is the element area and coefficients are computed from vertex coordinates.

Shape function gradients:

\[\frac{\partial N_i}{\partial x} = \frac{b_i}{2A}, \quad \frac{\partial N_i}{\partial y} = \frac{c_i}{2A}\]

Mass Matrix

Cocoa supports both consistent and lumped mass matrices. The consistent element mass matrix for linear triangular elements is:

(21)\[\begin{split}M^e_{ij} = \int_{\Omega^e} N_i N_j \, d\Omega = \frac{A}{12} \begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}\end{split}\]

The factor \(A/12\) arises from integrating the product of linear shape functions over the element. The off-diagonal entries couple neighboring nodes, providing better accuracy than a lumped formulation.

The lumped mass matrix diagonalizes the mass matrix by row-summing:

\[M^e_{ii} = \frac{A}{3}, \quad M^e_{ij} = 0 \text{ for } i \neq j\]

The lumped formulation enables an explicit direct solve (no iterative solver needed) at the cost of some spatial accuracy. The GWCE-specific mass matrix formulation including \(\tau_0\) weighting is given in Equation (7). See Configuration for solver selection.

Temporal Discretization

Time Integration Scheme

Cocoa uses a three-level scheme for the GWCE:

\[\frac{\partial^2 \zeta}{\partial t^2} \approx \frac{\zeta^{n+1} - 2\zeta^n + \zeta^{n-1}}{\Delta t^2}\]
\[\frac{\partial \zeta}{\partial t} \approx \frac{\zeta^{n+1} - \zeta^{n-1}}{2\Delta t}\]

This provides second-order accuracy in time.

Momentum equations use a semi-implicit Crank-Nicolson scheme for the pressure gradient and bottom friction terms, with explicit treatment of advection and lateral stress.

Linear Solver

Assembly Pattern: Scatter-to-Gather

Cocoa assembles the finite element right-hand side with an element-parallel atomic scatter to node vectors (Kokkos::atomic_add); the GWCE LHS matrix uses a node-based gather. The diagram below illustrates the scatter and gather phases of this assembly:

Scatter-to-gather FEM assembly pattern

Fig. 28 Two-phase assembly: elements write to private buffers (scatter), then nodes accumulate contributions via CSR lookup (gather). Ghost contributions are exported to owning ranks via Tpetra.

System Assembly

The GWCE produces a sparse linear system:

(22)\[\mathbf{A} \boldsymbol{\zeta}^{n+1} = \mathbf{b}\]

The matrix \(\mathbf{A}\) has the sparsity pattern of the mesh connectivity (see the GWCE discretization in Equation (7) and (8)).

Iterative Solver

Cocoa uses the Conjugate Gradient (CG) method via the Belos solver package from Trilinos [Heroux2005] [Bavier2012]:

\[\begin{split}&\mathbf{r}_0 = \mathbf{b} - \mathbf{A}\mathbf{x}_0 \\ &\mathbf{p}_0 = \mathbf{r}_0 \\ &\text{for } k = 0, 1, 2, \ldots \\ &\quad \alpha_k = (\mathbf{r}_k, \mathbf{r}_k) / (\mathbf{p}_k, \mathbf{A}\mathbf{p}_k) \\ &\quad \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k \\ &\quad \mathbf{r}_{k+1} = \mathbf{r}_k - \alpha_k \mathbf{A}\mathbf{p}_k \\ &\quad \beta_k = (\mathbf{r}_{k+1}, \mathbf{r}_{k+1}) / (\mathbf{r}_k, \mathbf{r}_k) \\ &\quad \mathbf{p}_{k+1} = \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k\end{split}\]

Preconditioning

Jacobi preconditioning is applied:

\[\mathbf{M}^{-1} = \text{diag}(\mathbf{A})^{-1}\]

This improves convergence while maintaining GPU efficiency.

Convergence Criteria

The solver terminates when the scaled residual norm satisfies the tolerance. By default, Cocoa scales the residual by the initial residual norm (NormOfInitialResidual):

\[\frac{\|\mathbf{r}_k\|}{\|\mathbf{r}_0\|} < \epsilon\]

with default tolerance \(\epsilon = 10^{-5}\).

The convergence scaling is selectable; the three available options are:

  • NormOfInitialResidual (default): \(\|\mathbf{r}_k\| / \|\mathbf{r}_0\| < \epsilon\)

  • NormOfPrecInitRes: \(\|\mathbf{M}^{-1}\mathbf{r}_k\| / \|\mathbf{M}^{-1}\mathbf{r}_0\| < \epsilon\)

  • None: \(\|\mathbf{r}_k\| < \epsilon\) (absolute tolerance)

Numerical Stability

CFL Condition

For a stable explicit time step, the Courant-Friedrichs-Lewy (CFL) condition provides useful guidance for choosing \(\Delta t\):

(23)\[\Delta t \lesssim C \cdot \min_e \frac{h_e}{\sqrt{gH_e}}\]

where \(h_e\) is the element size and \(C \approx 0.5\) is the Courant number. This is descriptive guidance, not an enforced constraint: Cocoa only validates that the configured time step is positive and does not compute a CFL-limited step internally. The user is responsible for selecting a stable \(\Delta t\).

Numerical Diffusion

Explicit advection schemes introduce numerical diffusion. This is partially offset by the GWCE formulation.

Floating-Point Precision

Cocoa performs all arithmetic in double precision (FP64). Both the lumped (explicit) and consistent (implicit) GWCE solvers, the momentum solver, and all reductions accumulate in double; the consistent mass matrix in particular requires it to avoid overflow in the iterative solve.

Mixed-precision storage. To cut memory traffic without changing the numerics, a subset of bandwidth-sensitive fields is stored as single precision (float) and promoted to double the moment it is read into a kernel:

  • the prognostic velocity and flux fields,

  • the bottom-friction and lateral-stress tensors,

  • a subset of static mesh geometry (element areas, scale factors, Coriolis, tan_phi, and the element-averaged bathymetry), the slope-limiter coefficients, and nodal attributes.

Bathymetry itself and the spatial gradient coefficients (the a and b shape-function derivative arrays) remain double precision, as do water-surface elevation, the GWCE/momentum right-hand sides, the assembled system matrices, and all accumulators.

The float/double conversions are confined to a single sanctioned boundary at the gather/scatter step (Cocoa::Types::promote reads float storage up to double; Cocoa::Types::narrow writes a double result back to float). For double-typed fields both helpers are exact no-ops, so kernels are written uniformly regardless of a field’s storage precision. Because the working arithmetic is unchanged, the only numerical effect is the ~1e-7 relative round-off incurred when a float-stored field is written and later re-read.

Performance considerations. This scheme targets GPU architectures, where memory bandwidth — not double-precision throughput — is usually the limiting factor; storing the heavily-streamed fields as float reduces the bytes moved per timestep. CPU architectures see little benefit, as they already hide the extra bandwidth of full double-precision storage, but they are unaffected because compute precision is identical in both cases.