=================
Numerical Methods
=================

This section describes the numerical methods used in Cocoa for spatial
and temporal discretization.

Spatial Discretization
----------------------

Finite Element Mesh
^^^^^^^^^^^^^^^^^^^

Cocoa uses unstructured triangular meshes with linear (P1) elements
[Hughes2000]_ [Zienkiewicz2013]_:

- Nodes located at triangle vertices
- Linear interpolation within elements
- Continuous across element boundaries

.. figure:: ../_static/images/diagrams/linear_triangle.svg
   :alt: P1 linear triangular element with labeled nodes and shape functions
   :width: 400px
   :align: center

   A P1 linear triangular element. Each shape function Ni equals 1 at its own
   node and 0 at the other two. Gradients are constant within the element.

Shape Functions
^^^^^^^^^^^^^^^

For a triangle with vertices :math:`(x_1, y_1), (x_2, y_2), (x_3, y_3)`,
the linear shape functions are:

.. math::
   :label: shape-functions

   N_i = \frac{1}{2A}(a_i + b_i x + c_i y)

where :math:`A` is the element area and coefficients are computed from
vertex coordinates.

**Shape function gradients**:

.. math::

   \frac{\partial N_i}{\partial x} = \frac{b_i}{2A}, \quad
   \frac{\partial N_i}{\partial y} = \frac{c_i}{2A}

Mass Matrix
^^^^^^^^^^^

Cocoa supports both consistent and lumped mass matrices. The consistent element
mass matrix for linear triangular elements is:

.. math::
   :label: consistent-mass-matrix

   M^e_{ij} = \int_{\Omega^e} N_i N_j \, d\Omega = \frac{A}{12}
   \begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}

The factor :math:`A/12` arises from integrating the product of linear shape
functions over the element. The off-diagonal entries couple neighboring nodes,
providing better accuracy than a lumped formulation.

The lumped mass matrix diagonalizes the mass matrix by row-summing:

.. math::

   M^e_{ii} = \frac{A}{3}, \quad M^e_{ij} = 0 \text{ for } i \neq j

The lumped formulation enables an explicit direct solve (no iterative solver
needed) at the cost of some spatial accuracy. The GWCE-specific mass matrix
formulation including :math:`\tau_0` weighting is given in
Equation :eq:`gwce-mass-matrix`. See :doc:`/getting_started/configuration`
for solver selection.

Temporal Discretization
-----------------------

Time Integration Scheme
^^^^^^^^^^^^^^^^^^^^^^^

Cocoa uses a three-level scheme for the GWCE:

.. math::

   \frac{\partial^2 \zeta}{\partial t^2} \approx \frac{\zeta^{n+1} - 2\zeta^n + \zeta^{n-1}}{\Delta t^2}

.. math::

   \frac{\partial \zeta}{\partial t} \approx \frac{\zeta^{n+1} - \zeta^{n-1}}{2\Delta t}

This provides second-order accuracy in time.

Momentum equations use a semi-implicit Crank-Nicolson scheme for the
pressure gradient and bottom friction terms, with explicit treatment of
advection and lateral stress.

Linear Solver
-------------

Assembly Pattern: Scatter-to-Gather
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Cocoa assembles the finite element right-hand side with an element-parallel
atomic scatter to node vectors (``Kokkos::atomic_add``); the GWCE LHS matrix
uses a node-based gather. The diagram below illustrates the scatter and gather
phases of this assembly:

.. figure:: ../_static/images/diagrams/scatter_gather.svg
   :alt: Scatter-to-gather FEM assembly pattern
   :width: 600px
   :align: center

   Two-phase assembly: elements write to private buffers (scatter), then nodes
   accumulate contributions via CSR lookup (gather). Ghost contributions are
   exported to owning ranks via Tpetra.

System Assembly
^^^^^^^^^^^^^^^

The GWCE produces a sparse linear system:

.. math::
   :label: gwce-linear-system

   \mathbf{A} \boldsymbol{\zeta}^{n+1} = \mathbf{b}

The matrix :math:`\mathbf{A}` has the sparsity pattern of the mesh connectivity
(see the GWCE discretization in Equation :eq:`gwce-mass-matrix` and
:eq:`gwce-stiffness-matrix`).

Iterative Solver
^^^^^^^^^^^^^^^^

Cocoa uses the Conjugate Gradient (CG) method via the Belos solver package
from Trilinos [Heroux2005]_ [Bavier2012]_:

.. math::

   &\mathbf{r}_0 = \mathbf{b} - \mathbf{A}\mathbf{x}_0 \\
   &\mathbf{p}_0 = \mathbf{r}_0 \\
   &\text{for } k = 0, 1, 2, \ldots \\
   &\quad \alpha_k = (\mathbf{r}_k, \mathbf{r}_k) / (\mathbf{p}_k, \mathbf{A}\mathbf{p}_k) \\
   &\quad \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k \\
   &\quad \mathbf{r}_{k+1} = \mathbf{r}_k - \alpha_k \mathbf{A}\mathbf{p}_k \\
   &\quad \beta_k = (\mathbf{r}_{k+1}, \mathbf{r}_{k+1}) / (\mathbf{r}_k, \mathbf{r}_k) \\
   &\quad \mathbf{p}_{k+1} = \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k

Preconditioning
^^^^^^^^^^^^^^^

Jacobi preconditioning is applied:

.. math::

   \mathbf{M}^{-1} = \text{diag}(\mathbf{A})^{-1}

This improves convergence while maintaining GPU efficiency.

Convergence Criteria
^^^^^^^^^^^^^^^^^^^^

The solver terminates when the scaled residual norm satisfies the tolerance.
By default, Cocoa scales the residual by the initial residual norm
(``NormOfInitialResidual``):

.. math::

   \frac{\|\mathbf{r}_k\|}{\|\mathbf{r}_0\|} < \epsilon

with default tolerance :math:`\epsilon = 10^{-5}`.

The convergence scaling is selectable; the three available options are:

- **NormOfInitialResidual** (default): :math:`\|\mathbf{r}_k\| / \|\mathbf{r}_0\| < \epsilon`
- **NormOfPrecInitRes**: :math:`\|\mathbf{M}^{-1}\mathbf{r}_k\| / \|\mathbf{M}^{-1}\mathbf{r}_0\| < \epsilon`
- **None**: :math:`\|\mathbf{r}_k\| < \epsilon` (absolute tolerance)

Numerical Stability
-------------------

CFL Condition
^^^^^^^^^^^^^

For a stable explicit time step, the Courant-Friedrichs-Lewy (CFL) condition
provides useful guidance for choosing :math:`\Delta t`:

.. math::
   :label: cfl-condition

   \Delta t \lesssim C \cdot \min_e \frac{h_e}{\sqrt{gH_e}}

where :math:`h_e` is the element size and :math:`C \approx 0.5` is the
Courant number. This is descriptive guidance, not an enforced constraint:
Cocoa only validates that the configured time step is positive and does not
compute a CFL-limited step internally. The user is responsible for selecting a
stable :math:`\Delta t`.

Numerical Diffusion
^^^^^^^^^^^^^^^^^^^

Explicit advection schemes introduce numerical diffusion. This is partially
offset by the GWCE formulation.

Floating-Point Precision
------------------------

Cocoa performs all arithmetic in double precision (FP64). Both the lumped
(explicit) and consistent (implicit) GWCE solvers, the momentum solver, and
all reductions accumulate in double; the consistent mass matrix in particular
requires it to avoid overflow in the iterative solve.

**Mixed-precision storage.**
To cut memory traffic without changing the numerics, a subset of
bandwidth-sensitive fields is *stored* as single precision (float) and
promoted to double the moment it is read into a kernel:

- the prognostic velocity and flux fields,
- the bottom-friction and lateral-stress tensors,
- a subset of static mesh geometry (element areas, scale factors, Coriolis,
  ``tan_phi``, and the element-averaged bathymetry), the slope-limiter
  coefficients, and nodal attributes.

Bathymetry itself and the spatial gradient coefficients (the ``a`` and ``b``
shape-function derivative arrays) remain double precision, as do
water-surface elevation, the GWCE/momentum right-hand sides, the assembled
system matrices, and all accumulators.

The float/double conversions are confined to a single sanctioned boundary at
the gather/scatter step (``Cocoa::Types::promote`` reads float storage up to
double; ``Cocoa::Types::narrow`` writes a double result back to float). For
double-typed fields both helpers are exact no-ops, so kernels are written
uniformly regardless of a field's storage precision. Because the working
arithmetic is unchanged, the only numerical effect is the ~1e-7 relative
round-off incurred when a float-stored field is written and later re-read.

**Performance considerations.**
This scheme targets GPU architectures, where memory bandwidth --- not
double-precision throughput --- is usually the limiting factor; storing the
heavily-streamed fields as float reduces the bytes moved per timestep. CPU
architectures see little benefit, as they already hide the extra bandwidth of
full double-precision storage, but they are unaffected because compute
precision is identical in both cases.