================= Numerical Methods ================= This section describes the numerical methods used in Cocoa for spatial and temporal discretization. Spatial Discretization ---------------------- Finite Element Mesh ^^^^^^^^^^^^^^^^^^^ Cocoa uses unstructured triangular meshes with linear (P1) elements [Hughes2000]_ [Zienkiewicz2013]_: - Nodes located at triangle vertices - Linear interpolation within elements - Continuous across element boundaries .. figure:: ../_static/images/diagrams/linear_triangle.svg :alt: P1 linear triangular element with labeled nodes and shape functions :width: 400px :align: center A P1 linear triangular element. Each shape function Ni equals 1 at its own node and 0 at the other two. Gradients are constant within the element. Shape Functions ^^^^^^^^^^^^^^^ For a triangle with vertices :math:`(x_1, y_1), (x_2, y_2), (x_3, y_3)`, the linear shape functions are: .. math:: :label: shape-functions N_i = \frac{1}{2A}(a_i + b_i x + c_i y) where :math:`A` is the element area and coefficients are computed from vertex coordinates. **Shape function gradients**: .. math:: \frac{\partial N_i}{\partial x} = \frac{b_i}{2A}, \quad \frac{\partial N_i}{\partial y} = \frac{c_i}{2A} Mass Matrix ^^^^^^^^^^^ Cocoa supports both consistent and lumped mass matrices. The consistent element mass matrix for linear triangular elements is: .. math:: :label: consistent-mass-matrix M^e_{ij} = \int_{\Omega^e} N_i N_j \, d\Omega = \frac{A}{12} \begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix} The factor :math:`A/12` arises from integrating the product of linear shape functions over the element. The off-diagonal entries couple neighboring nodes, providing better accuracy than a lumped formulation. The lumped mass matrix diagonalizes the mass matrix by row-summing: .. math:: M^e_{ii} = \frac{A}{3}, \quad M^e_{ij} = 0 \text{ for } i \neq j The lumped formulation enables an explicit direct solve (no iterative solver needed) at the cost of some spatial accuracy. The GWCE-specific mass matrix formulation including :math:`\tau_0` weighting is given in Equation :eq:`gwce-mass-matrix`. See :doc:`/getting_started/configuration` for solver selection. Temporal Discretization ----------------------- Time Integration Scheme ^^^^^^^^^^^^^^^^^^^^^^^ Cocoa uses a three-level scheme for the GWCE: .. math:: \frac{\partial^2 \zeta}{\partial t^2} \approx \frac{\zeta^{n+1} - 2\zeta^n + \zeta^{n-1}}{\Delta t^2} .. math:: \frac{\partial \zeta}{\partial t} \approx \frac{\zeta^{n+1} - \zeta^{n-1}}{2\Delta t} This provides second-order accuracy in time. Momentum equations use a semi-implicit Crank-Nicolson scheme for the pressure gradient and bottom friction terms, with explicit treatment of advection and lateral stress. Linear Solver ------------- Assembly Pattern: Scatter-to-Gather ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Cocoa assembles the finite element right-hand side with an element-parallel atomic scatter to node vectors (``Kokkos::atomic_add``); the GWCE LHS matrix uses a node-based gather. The diagram below illustrates the scatter and gather phases of this assembly: .. figure:: ../_static/images/diagrams/scatter_gather.svg :alt: Scatter-to-gather FEM assembly pattern :width: 600px :align: center Two-phase assembly: elements write to private buffers (scatter), then nodes accumulate contributions via CSR lookup (gather). Ghost contributions are exported to owning ranks via Tpetra. System Assembly ^^^^^^^^^^^^^^^ The GWCE produces a sparse linear system: .. math:: :label: gwce-linear-system \mathbf{A} \boldsymbol{\zeta}^{n+1} = \mathbf{b} The matrix :math:`\mathbf{A}` has the sparsity pattern of the mesh connectivity (see the GWCE discretization in Equation :eq:`gwce-mass-matrix` and :eq:`gwce-stiffness-matrix`). Iterative Solver ^^^^^^^^^^^^^^^^ Cocoa uses the Conjugate Gradient (CG) method via the Belos solver package from Trilinos [Heroux2005]_ [Bavier2012]_: .. math:: &\mathbf{r}_0 = \mathbf{b} - \mathbf{A}\mathbf{x}_0 \\ &\mathbf{p}_0 = \mathbf{r}_0 \\ &\text{for } k = 0, 1, 2, \ldots \\ &\quad \alpha_k = (\mathbf{r}_k, \mathbf{r}_k) / (\mathbf{p}_k, \mathbf{A}\mathbf{p}_k) \\ &\quad \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k \\ &\quad \mathbf{r}_{k+1} = \mathbf{r}_k - \alpha_k \mathbf{A}\mathbf{p}_k \\ &\quad \beta_k = (\mathbf{r}_{k+1}, \mathbf{r}_{k+1}) / (\mathbf{r}_k, \mathbf{r}_k) \\ &\quad \mathbf{p}_{k+1} = \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k Preconditioning ^^^^^^^^^^^^^^^ Jacobi preconditioning is applied: .. math:: \mathbf{M}^{-1} = \text{diag}(\mathbf{A})^{-1} This improves convergence while maintaining GPU efficiency. Convergence Criteria ^^^^^^^^^^^^^^^^^^^^ The solver terminates when the scaled residual norm satisfies the tolerance. By default, Cocoa scales the residual by the initial residual norm (``NormOfInitialResidual``): .. math:: \frac{\|\mathbf{r}_k\|}{\|\mathbf{r}_0\|} < \epsilon with default tolerance :math:`\epsilon = 10^{-5}`. The convergence scaling is selectable; the three available options are: - **NormOfInitialResidual** (default): :math:`\|\mathbf{r}_k\| / \|\mathbf{r}_0\| < \epsilon` - **NormOfPrecInitRes**: :math:`\|\mathbf{M}^{-1}\mathbf{r}_k\| / \|\mathbf{M}^{-1}\mathbf{r}_0\| < \epsilon` - **None**: :math:`\|\mathbf{r}_k\| < \epsilon` (absolute tolerance) Numerical Stability ------------------- CFL Condition ^^^^^^^^^^^^^ For a stable explicit time step, the Courant-Friedrichs-Lewy (CFL) condition provides useful guidance for choosing :math:`\Delta t`: .. math:: :label: cfl-condition \Delta t \lesssim C \cdot \min_e \frac{h_e}{\sqrt{gH_e}} where :math:`h_e` is the element size and :math:`C \approx 0.5` is the Courant number. This is descriptive guidance, not an enforced constraint: Cocoa only validates that the configured time step is positive and does not compute a CFL-limited step internally. The user is responsible for selecting a stable :math:`\Delta t`. Numerical Diffusion ^^^^^^^^^^^^^^^^^^^ Explicit advection schemes introduce numerical diffusion. This is partially offset by the GWCE formulation. Floating-Point Precision ------------------------ Cocoa performs all arithmetic in double precision (FP64). Both the lumped (explicit) and consistent (implicit) GWCE solvers, the momentum solver, and all reductions accumulate in double; the consistent mass matrix in particular requires it to avoid overflow in the iterative solve. **Mixed-precision storage.** To cut memory traffic without changing the numerics, a subset of bandwidth-sensitive fields is *stored* as single precision (float) and promoted to double the moment it is read into a kernel: - the prognostic velocity and flux fields, - the bottom-friction and lateral-stress tensors, - a subset of static mesh geometry (element areas, scale factors, Coriolis, ``tan_phi``, and the element-averaged bathymetry), the slope-limiter coefficients, and nodal attributes. Bathymetry itself and the spatial gradient coefficients (the ``a`` and ``b`` shape-function derivative arrays) remain double precision, as do water-surface elevation, the GWCE/momentum right-hand sides, the assembled system matrices, and all accumulators. The float/double conversions are confined to a single sanctioned boundary at the gather/scatter step (``Cocoa::Types::promote`` reads float storage up to double; ``Cocoa::Types::narrow`` writes a double result back to float). For double-typed fields both helpers are exact no-ops, so kernels are written uniformly regardless of a field's storage precision. Because the working arithmetic is unchanged, the only numerical effect is the ~1e-7 relative round-off incurred when a float-stored field is written and later re-read. **Performance considerations.** This scheme targets GPU architectures, where memory bandwidth --- not double-precision throughput --- is usually the limiting factor; storing the heavily-streamed fields as float reduces the bytes moved per timestep. CPU architectures see little benefit, as they already hide the extra bandwidth of full double-precision storage, but they are unaffected because compute precision is identical in both cases.