==================
Parallel Execution
==================

Cocoa supports multiple levels of parallelism for efficient execution on modern
computing systems. This guide covers serial execution, shared-memory parallelism
via OpenMP/CUDA/HIP, and distributed-memory parallelism via MPI.

Execution Modes Overview
------------------------

Cocoa can be run in several configurations:

.. list-table:: Execution Modes
   :header-rows: 1
   :widths: 25 75

   * - Mode
     - Description
   * - Serial
     - Single CPU thread, useful for debugging and small meshes
   * - OpenMP
     - Shared-memory parallelism using CPU threads
   * - CUDA
     - NVIDIA GPU acceleration
   * - HIP
     - AMD GPU acceleration
   * - MPI
     - Distributed-memory parallelism across multiple compute nodes
   * - MPI + GPU
     - Distributed parallelism with GPU acceleration on each rank

Serial Execution
----------------

For serial execution, simply run the ``cocoa`` executable directly:

.. code-block:: bash

   ./cocoa -i simulation.yaml

This is useful for:

- Debugging numerical issues
- Small test cases
- Systems without MPI or GPU support

OpenMP Parallelism
------------------

When Cocoa is compiled with OpenMP support via Kokkos, you can control the number
of threads using the ``--kokkos-num-threads`` option:

.. code-block:: bash

   ./cocoa -i simulation.yaml --kokkos-num-threads=8

Alternatively, use the ``OMP_NUM_THREADS`` environment variable:

.. code-block:: bash

   export OMP_NUM_THREADS=8
   ./cocoa -i simulation.yaml

Note that the OpenMP backend currently relies on atomic updates, which generally
make it slower than the same simulation run with the serial execution space and
MPI, where no atomics are needed.

GPU Execution
-------------

For GPU builds (CUDA or HIP), Cocoa automatically uses the default GPU device.
To select a specific GPU:

.. code-block:: bash

   # Using Kokkos option
   ./cocoa -i simulation.yaml --kokkos-device-id=0

   # Or using environment variable (CUDA)
   export CUDA_VISIBLE_DEVICES=0
   ./cocoa -i simulation.yaml

MPI Distributed Parallelism
---------------------------

For large-scale simulations, Cocoa supports distributed-memory parallelism via
MPI. This allows the mesh to be partitioned across multiple compute nodes, with
each MPI rank responsible for a subdomain of the mesh.

.. graphviz::
   :align: center
   :caption: Distributed execution model with ghost exchange between two MPI ranks

   digraph distributed {
     rankdir=LR;
     node [shape=box, style="filled,rounded", fontname="Helvetica", fontsize=10];
     edge [color="#555555"];
     bgcolor="transparent";
     compound=true;

     subgraph cluster_rank0 {
       label="MPI Rank 0";
       style="filled,rounded";
       fillcolor="#E3F2FD";
       color="#1E88E5";
       fontname="Helvetica";
       fontsize=11;

       r0_owned [label="Owned Nodes\n(authoritative)", fillcolor="#BBDEFB", color="#1565C0"];
       r0_ghost [label="Ghost Nodes\n(copies from Rank 1)", fillcolor="#E0E0E0", color="#9E9E9E", style="filled,rounded,dashed"];
       r0_solver [label="Local Solver\n(GWCE + Momentum)", fillcolor="#BBDEFB", color="#1565C0"];

       r0_owned -> r0_solver;
       r0_ghost -> r0_solver;
     }

     subgraph cluster_rank1 {
       label="MPI Rank 1";
       style="filled,rounded";
       fillcolor="#FFF3E0";
       color="#FB8C00";
       fontname="Helvetica";
       fontsize=11;

       r1_owned [label="Owned Nodes\n(authoritative)", fillcolor="#FFE0B2", color="#E65100"];
       r1_ghost [label="Ghost Nodes\n(copies from Rank 0)", fillcolor="#E0E0E0", color="#9E9E9E", style="filled,rounded,dashed"];
       r1_solver [label="Local Solver\n(GWCE + Momentum)", fillcolor="#FFE0B2", color="#E65100"];

       r1_owned -> r1_solver;
       r1_ghost -> r1_solver;
     }

     // Ghost exchange arrows
     r0_owned -> r1_ghost [label="update_ghosts", fontsize=9, color="#E53935", fontcolor="#E53935", constraint=false];
     r1_owned -> r0_ghost [label="update_ghosts", fontsize=9, color="#E53935", fontcolor="#E53935", constraint=false];
     r0_ghost -> r0_owned [label="sum_into_owned\n(FEM assembly)", fontsize=8, color="#43A047", fontcolor="#43A047", style=dashed];
     r1_ghost -> r1_owned [label="sum_into_owned\n(FEM assembly)", fontsize=8, color="#43A047", fontcolor="#43A047", style=dashed];
   }
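
The two exchange directions in the diagram can be illustrated with a toy,
single-process model. This is a sketch, not Cocoa's implementation: the ``Rank``
class, the ``owner_of`` helper, and the dictionary storage are hypothetical, and
only the names ``update_ghosts`` and ``sum_into_owned`` are taken from the
diagram labels.

.. code-block:: python

   class Rank:
       """One MPI rank's view of the mesh in this toy model."""

       def __init__(self, owned, ghosts):
           self.owned = list(owned)    # global node IDs this rank is authoritative for
           self.ghosts = list(ghosts)  # global node IDs copied from other ranks
           self.values = {gid: 0.0 for gid in self.owned + self.ghosts}


   def owner_of(ranks, gid):
       """Return the rank that owns global node ``gid``."""
       for rank in ranks:
           if gid in rank.owned:
               return rank
       raise KeyError(gid)


   def sum_into_owned(ranks):
       """Ghost -> owner: add partial sums from ghost slots into the owner."""
       for rank in ranks:
           for gid in rank.ghosts:
               owner_of(ranks, gid).values[gid] += rank.values[gid]
               rank.values[gid] = 0.0  # the ghost slot has been consumed


   def update_ghosts(ranks):
       """Owner -> ghost: overwrite each ghost copy with the owner's value."""
       for rank in ranks:
           for gid in rank.ghosts:
               rank.values[gid] = owner_of(ranks, gid).values[gid]


   # Rank 0 owns nodes 0 and 1; rank 1 owns nodes 2 and 3.
   rank0 = Rank(owned=[0, 1], ghosts=[2])
   rank1 = Rank(owned=[2, 3], ghosts=[1])
   ranks = [rank0, rank1]

   # Each rank assembles a contribution of 1.0 from one local element.
   for gid in (0, 1, 2):
       rank0.values[gid] += 1.0
   for gid in (1, 2, 3):
       rank1.values[gid] += 1.0

   sum_into_owned(ranks)  # nodes 1 and 2 now hold the full sum on their owners
   update_ghosts(ranks)   # ghost copies are refreshed from the owners
   print(rank0.values, rank1.values)

After both calls, the shared nodes 1 and 2 hold the same assembled value (2.0)
on both ranks, while the interior nodes 0 and 3 keep their local value (1.0).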

Requirements
^^^^^^^^^^^^

- Cocoa must be built with MPI support (Trilinos compiled with MPI enabled)
- A partition file must be created or provided

Basic MPI Execution
^^^^^^^^^^^^^^^^^^^

To run with MPI:

.. code-block:: bash

   mpirun -np 4 ./cocoa -i simulation.yaml --partition partition.nc

The ``--partition`` option specifies the partition file that defines how the mesh
is distributed across MPI ranks. If the file exists, it will be loaded. If it does
not exist, Cocoa will create it automatically using ParMETIS.

.. important::

   When using MPI with GPU acceleration, set ``--kokkos-num-threads=1`` to avoid
   oversubscription:

   .. code-block:: bash

      mpirun -np 4 ./cocoa -i simulation.yaml --partition partition.nc --kokkos-num-threads=1

Mesh Partitioning
-----------------

.. graphviz::
   :align: center
   :caption: Mesh partitioning pipeline from input mesh to distributed execution

   digraph partitioning {
     rankdir=LR;
     node [shape=box, style="filled,rounded", fontname="Helvetica", fontsize=10];
     edge [color="#555555"];
     bgcolor="transparent";

     read [label="Read Full Mesh\n(Rank 0)", fillcolor="#FFF3E0", color="#FB8C00"];
     check [shape=diamond, label="Partition\nCached?", fillcolor="#E0E0E0", color="#616161", fontsize=9];
     partition [label="Zoltan2 /\nParMETIS\nPartition", fillcolor="#E3F2FD", color="#1E88E5"];
     cache [label="Cache to\nNetCDF", fillcolor="#F5F5F5", color="#9E9E9E"];
     load [label="Load Cached\nPartition", fillcolor="#F5F5F5", color="#9E9E9E"];
     distribute [label="Distribute to\nMPI Ranks", fillcolor="#E8F5E9", color="#43A047"];
     build [label="Build Local Mesh\n+ Ghost Layers", fillcolor="#E8F5E9", color="#43A047"];
     maps [label="Create Tpetra\nMaps & Importers", fillcolor="#F3E5F5", color="#8E24AA"];

     read -> check;
     check -> load [label="yes", fontsize=9];
     check -> partition [label="no", fontsize=9];
     partition -> cache -> distribute;
     load -> distribute;
     distribute -> build -> maps;
   }

Cocoa uses `ParMETIS <http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview>`_
(via Trilinos/Zoltan2) for mesh partitioning. The partitioner uses graph-based
algorithms to minimize communication while balancing the computational load across
MPI ranks.

Partition File Format
^^^^^^^^^^^^^^^^^^^^^

Partition files are stored in NetCDF format and contain:

- **Node coordinates** (``node_x``, ``node_y``): Geographic coordinates of all nodes
- **Element connectivity** (``element_nodes``): Triangular element node indices
- **Node ownership** (``node_owner``): MPI rank that owns each node
- **Element ownership** (``element_owner``): MPI rank that owns each element
- **Mesh checksum**: Hash for validating mesh consistency

The partition file is automatically validated against the mesh to ensure
consistency. If the mesh changes, a new partition file must be generated.
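
As an illustration of how the ownership arrays can be used, the sketch below
computes the number of nodes per rank and a simple imbalance ratio from a
``node_owner`` array. The function name and the sample data are hypothetical; in
practice the array would be read from the partition file with a NetCDF library.

.. code-block:: python

   from collections import Counter


   def load_imbalance(node_owner, num_ranks):
       """Return (nodes per rank, max/mean ratio) for an ownership array."""
       counts = Counter(node_owner)
       per_rank = [counts.get(rank, 0) for rank in range(num_ranks)]
       mean = len(node_owner) / num_ranks
       return per_rank, max(per_rank) / mean


   # Hypothetical ownership of 8 nodes across 3 ranks.
   node_owner = [0, 0, 0, 1, 1, 2, 2, 2]
   per_rank, ratio = load_imbalance(node_owner, 3)
   print(per_rank, round(ratio, 3))

A ratio of 1.0 corresponds to a perfectly even node count; larger values mean
the busiest rank owns proportionally more nodes than the average.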

Automatic Partition Naming
^^^^^^^^^^^^^^^^^^^^^^^^^^

When no explicit partition file is provided, Cocoa uses the naming convention:

.. code-block:: text

   <mesh_filename>.partition_<N>.nc

For example, with ``mesh.nc`` and 8 MPI ranks, the partition file would be:

.. code-block:: text

   mesh.nc.partition_8.nc
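
For scripts that need to predict the filename, the convention can be reproduced
with a one-line helper. The function name is hypothetical and the sketch only
mirrors the pattern shown above.

.. code-block:: python

   def default_partition_name(mesh_filename, num_ranks):
       """Build the default partition filename for a mesh and rank count."""
       return f"{mesh_filename}.partition_{num_ranks}.nc"


   print(default_partition_name("mesh.nc", 8))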

Creating Partition Files
------------------------

Partition files can be created in two ways:

1. **Automatic creation**: When running with MPI and the partition file doesn't exist
2. **Pre-computation**: Using the ``--create-partition`` option

Pre-computing Partitions
^^^^^^^^^^^^^^^^^^^^^^^^

For large meshes, it's recommended to pre-compute partition files before running
simulations. This avoids the partitioning overhead during production runs:

.. code-block:: bash

   # Create a partition file for 8 subdomains
   ./cocoa -i simulation.yaml --create-partition 8

This will:

1. Read the mesh file specified in ``simulation.yaml``
2. Partition the mesh into 8 subdomains using ParMETIS
3. Save the partition to ``<mesh_filename>.partition_8.nc``
4. Exit without running the simulation

To specify a custom output filename:

.. code-block:: bash

   ./cocoa -i simulation.yaml --create-partition 8 --partition my_partition.nc

.. note::

   Creating partition files only requires serial execution. The partitioner will
   use MPI if available but works correctly with a single rank.
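
When several rank counts are needed, the commands can be generated from a small
script. The sketch below only builds and prints the command lines, using the
options documented above; the helper name and the chosen rank counts are
hypothetical.

.. code-block:: python

   import shlex


   def create_partition_command(input_file, num_subdomains, output=None):
       """Return the argv list for a partition-only Cocoa run."""
       argv = ["./cocoa", "-i", input_file,
               "--create-partition", str(num_subdomains)]
       if output is not None:
           argv += ["--partition", output]
       return argv


   # Print one command per rank count.
   for n in (8, 64, 256):
       print(shlex.join(create_partition_command("simulation.yaml", n)))

Each returned list can be passed to ``subprocess.run`` to execute the command.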

Partition Caching
^^^^^^^^^^^^^^^^^

Cocoa caches partition files to avoid re-partitioning on subsequent runs. The
cache includes a mesh checksum that validates the partition against the current
mesh. If the mesh changes (nodes added/removed, connectivity modified), the
cache is invalidated and a new partition must be generated.

To force regeneration of a partition file, delete the existing file:

.. code-block:: bash

   rm mesh.nc.partition_8.nc
   mpirun -np 8 ./cocoa -i simulation.yaml
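
The checksum-based validation can be sketched as follows. This is an
illustration of the idea only: the use of SHA-256, the serialization of the
mesh, and both function names are assumptions, not Cocoa's actual checksum.

.. code-block:: python

   import hashlib


   def mesh_checksum(node_xy, element_nodes):
       """Hash node coordinates and connectivity into a hex digest."""
       digest = hashlib.sha256()
       for x, y in node_xy:
           digest.update(f"{x!r},{y!r};".encode())
       for element in element_nodes:
           digest.update(("-".join(map(str, element)) + ";").encode())
       return digest.hexdigest()


   def partition_is_current(cached_checksum, node_xy, element_nodes):
       """True if the cached partition was built from this exact mesh."""
       return cached_checksum == mesh_checksum(node_xy, element_nodes)


   nodes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
   elements = [(0, 1, 2), (1, 3, 2)]
   cached = mesh_checksum(nodes, elements)

   # Unchanged mesh: the cached partition can be reused.
   print(partition_is_current(cached, nodes, elements))
   # Edited connectivity: the cached partition is stale.
   print(partition_is_current(cached, nodes, [(0, 1, 2), (1, 2, 3)]))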

Example Partition Visualization
-------------------------------

The following image shows an example ParMETIS partition of a global ocean mesh
(GSTOFS domain) with 256 subdomains. Each color represents a different MPI rank's
subdomain:

.. figure:: ../_static/images/partition.png
   :alt: Example ParMETIS partition with 256 subdomains
   :width: 600px
   :align: center

   Example mesh partition for a global ocean model. The mesh is divided into
   256 subdomains.

Internal Weir Pairs Across Partitions
-------------------------------------

ADCIRC requires that both sides of an internal weir boundary be assigned to the
same MPI subdomain. This constraint simplifies the overflow computation (each
rank has direct access to both pair elevations) but restricts the partitioner,
potentially leading to load imbalance near weirs.

Cocoa removes this constraint entirely. Internal weir pair nodes may be owned
by different MPI ranks because the ghost exchange mechanism already provides the
necessary data.

.. figure:: ../_static/images/diagrams/weir_mpi_ghost.svg
   :alt: Internal weir pair nodes across MPI partitions with ghost exchange
   :width: 550px
   :align: center

   Weir pair nodes on opposite sides of a partition boundary. Each rank holds
   ghost copies of the other rank's nodes. The ghost exchange before the
   overflow computation ensures both ranks have up-to-date elevation and
   wet/dry status for their pair nodes.

**How it works:**

1. **Ghost layer inclusion**: Pair nodes on the opposite side of a weir share
   elements near the crest, so they are automatically included in the ghost
   layer during mesh partitioning.

2. **Ghost exchange timing**: Each timestep, ``exchange_wetdry_ghosts()`` runs
   before the overflow computation. This exchanges elevation (:math:`\zeta^{n+1}`)
   and wet/dry status for all ghost nodes, including weir pair nodes on other
   ranks.

3. **Independent computation**: After the ghost exchange, each rank has the
   elevation at both its owned boundary nodes and their pair nodes (as ghost
   copies). The overflow formula runs independently on each rank using local
   data. Since both sides of the weir are listed as boundary nodes in the mesh,
   both ranks compute their own QFORCE contributions.

4. **Bounds-checked RHS application**: When applying QFORCE to the GWCE RHS,
   boundary segments that span partition boundaries may have one endpoint that
   is a ghost node. The kernel skips writes to ghost nodes (via a bounds check
   against the owned RHS size), so each contribution is applied only by the rank
   that owns the node.
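
Step 4 can be sketched as follows, assuming local node indices are ordered with
owned nodes first, so that any index at or beyond the owned RHS length refers to
a ghost. The function name, the even split of each segment's contribution
between its two endpoints, and the sample values are hypothetical and do not
reproduce Cocoa's actual kernel.

.. code-block:: python

   def apply_qforce(rhs_owned, segments, qforce):
       """Add boundary-segment contributions to the owned part of the RHS."""
       n_owned = len(rhs_owned)
       for (a, b), q in zip(segments, qforce):
           for node in (a, b):
               if node < n_owned:  # ghost endpoints fail this check and are skipped
                   rhs_owned[node] += 0.5 * q


   # Three owned nodes (local IDs 0-2); local ID 3 is a ghost.
   rhs = [0.0, 0.0, 0.0]
   apply_qforce(rhs, segments=[(1, 2), (2, 3)], qforce=[2.0, 4.0])
   print(rhs)

The second segment touches ghost node 3, so only its owned endpoint (node 2)
receives a contribution on this rank.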

**Runtime validation:**

At startup, ``BoundaryProcessor`` verifies that all internal weir pair node IDs
are valid (not -1) in the local partition's global-to-local map. If a pair node
is missing from the ghost layer, an error is logged. This would indicate a
partitioning bug since weir pair nodes should always be in the ghost set.
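
The check amounts to a lookup of each pair node in the global-to-local map. In
the sketch below, a plain dictionary stands in for that map and the function
name is hypothetical; only the ``-1`` sentinel for a missing node comes from the
description above.

.. code-block:: python

   def missing_weir_pairs(weir_pairs, global_to_local):
       """Return the global IDs of pair nodes absent from the local partition."""
       missing = []
       for _owned_gid, pair_gid in weir_pairs:
           if global_to_local.get(pair_gid, -1) == -1:
               missing.append(pair_gid)
       return missing


   # Local map covers owned nodes 10 and 11 and ghost node 20; node 21 is absent.
   global_to_local = {10: 0, 11: 1, 20: 2}
   print(missing_weir_pairs([(10, 20), (11, 21)], global_to_local))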

Command Line Reference
----------------------

Cocoa Options
^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Option
     - Description
   * - ``-i <file>``
     - Input configuration file (YAML format)
   * - ``--partition <file>``
     - Use the specified partition file: load it if it exists, otherwise create it
   * - ``--create-partition <N>``
     - Create partition for N subdomains and exit (no simulation)
   * - ``-v, --verbose``
     - Enable verbose logging (debug level)
   * - ``-V, --version``
     - Show version information and exit
   * - ``-h, --help``
     - Show help message

Kokkos Options
^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Option
     - Description
   * - ``--kokkos-num-threads=N``
     - Number of threads for OpenMP execution
   * - ``--kokkos-device-id=N``
     - GPU device ID to use (0-indexed)
   * - ``--kokkos-map-device-id-by``
     - Map device ID by ``mpi_rank`` or ``random``
   * - ``--kokkos-help``
     - Show all Kokkos command line options
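
The ``mpi_rank`` mapping strategy assigns devices round-robin based on each
rank's local index on its node. The sketch below illustrates that idea only; it
is not Kokkos code, and the function name is hypothetical.

.. code-block:: python

   def device_for_rank(local_rank, num_devices):
       """Round-robin GPU selection from the rank's index on its node."""
       return local_rank % num_devices


   # Eight ranks on a node with four GPUs: two ranks share each device.
   print([device_for_rank(r, 4) for r in range(8)])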