==================
Parallel Execution
==================

Cocoa supports multiple levels of parallelism for efficient execution on modern computing systems. This guide covers serial execution, shared-memory parallelism via OpenMP/CUDA/HIP, and distributed-memory parallelism via MPI.

Execution Modes Overview
------------------------

Cocoa can be run in several configurations:

.. list-table:: Execution Modes
   :header-rows: 1
   :widths: 25 75

   * - Mode
     - Description
   * - Serial
     - Single CPU thread, useful for debugging and small meshes
   * - OpenMP
     - Shared-memory parallelism using CPU threads
   * - CUDA
     - NVIDIA GPU acceleration
   * - HIP
     - AMD GPU acceleration
   * - MPI
     - Distributed-memory parallelism across multiple compute nodes
   * - MPI + GPU
     - Distributed parallelism with GPU acceleration on each rank

Serial Execution
----------------

For serial execution, run the ``cocoa`` executable directly:

.. code-block:: bash

   ./cocoa -i simulation.yaml

This is useful for:

- Debugging numerical issues
- Small test cases
- Systems without MPI or GPU support

OpenMP Parallelism
------------------

When Cocoa is compiled with OpenMP support via Kokkos, you can control the number of threads using the ``--kokkos-num-threads`` option:

.. code-block:: bash

   ./cocoa -i simulation.yaml --kokkos-num-threads=8

Alternatively, use the ``OMP_NUM_THREADS`` environment variable:

.. code-block:: bash

   export OMP_NUM_THREADS=8
   ./cocoa -i simulation.yaml

Note that, at present, atomic operations make the OpenMP backend generally slower than an equivalent simulation that uses the serial execution space with MPI, which avoids the atomics.

GPU Execution
-------------

For GPU builds (CUDA or HIP), Cocoa automatically uses the default GPU device. To select a specific GPU:

.. code-block:: bash

   # Using Kokkos option
   ./cocoa -i simulation.yaml --kokkos-device-id=0

   # Or using environment variable (CUDA)
   export CUDA_VISIBLE_DEVICES=0
   ./cocoa -i simulation.yaml

MPI Distributed Parallelism
---------------------------

For large-scale simulations, Cocoa supports distributed-memory parallelism via MPI. This allows the mesh to be partitioned across multiple compute nodes, with each MPI rank responsible for a subdomain of the mesh.

.. graphviz::
   :align: center
   :caption: Distributed execution model with ghost exchange between two MPI ranks

   digraph distributed {
       rankdir=LR;
       node [shape=box, style="filled,rounded", fontname="Helvetica", fontsize=10];
       edge [color="#555555"];
       bgcolor="transparent";
       compound=true;

       subgraph cluster_rank0 {
           label="MPI Rank 0";
           style="filled,rounded";
           fillcolor="#E3F2FD";
           color="#1E88E5";
           fontname="Helvetica";
           fontsize=11;
           r0_owned [label="Owned Nodes\n(authoritative)", fillcolor="#BBDEFB", color="#1565C0"];
           r0_ghost [label="Ghost Nodes\n(copies from Rank 1)", fillcolor="#E0E0E0", color="#9E9E9E", style="filled,rounded,dashed"];
           r0_solver [label="Local Solver\n(GWCE + Momentum)", fillcolor="#BBDEFB", color="#1565C0"];
           r0_owned -> r0_solver;
           r0_ghost -> r0_solver;
       }

       subgraph cluster_rank1 {
           label="MPI Rank 1";
           style="filled,rounded";
           fillcolor="#FFF3E0";
           color="#FB8C00";
           fontname="Helvetica";
           fontsize=11;
           r1_owned [label="Owned Nodes\n(authoritative)", fillcolor="#FFE0B2", color="#E65100"];
           r1_ghost [label="Ghost Nodes\n(copies from Rank 0)", fillcolor="#E0E0E0", color="#9E9E9E", style="filled,rounded,dashed"];
           r1_solver [label="Local Solver\n(GWCE + Momentum)", fillcolor="#FFE0B2", color="#E65100"];
           r1_owned -> r1_solver;
           r1_ghost -> r1_solver;
       }

       // Ghost exchange arrows
       r0_owned -> r1_ghost [label="update_ghosts", fontsize=9, color="#E53935", fontcolor="#E53935", constraint=false];
       r1_owned -> r0_ghost [label="update_ghosts", fontsize=9, color="#E53935", fontcolor="#E53935", constraint=false];
       r0_ghost -> r0_owned [label="sum_into_owned\n(FEM assembly)", fontsize=8, color="#43A047", fontcolor="#43A047", style=dashed];
       r1_ghost -> r1_owned [label="sum_into_owned\n(FEM assembly)", fontsize=8, color="#43A047", fontcolor="#43A047", style=dashed];
   }

Requirements
^^^^^^^^^^^^

- Cocoa must be built with MPI support (Trilinos compiled with MPI enabled)
- A partition file must be created or provided

Basic MPI Execution
^^^^^^^^^^^^^^^^^^^

To run with MPI:

.. code-block:: bash

   mpirun -np 4 ./cocoa -i simulation.yaml --partition partition.nc

The ``--partition`` option specifies the partition file that defines how the mesh is distributed across MPI ranks. If the file exists, it is loaded. If it does not exist, Cocoa creates it automatically using ParMETIS.

.. important::

   When using MPI with GPU acceleration, set ``--kokkos-num-threads=1`` to avoid oversubscription:

   .. code-block:: bash

      mpirun -np 4 ./cocoa -i simulation.yaml --partition partition.nc --kokkos-num-threads=1

Mesh Partitioning
-----------------

.. graphviz::
   :align: center
   :caption: Mesh partitioning pipeline from input mesh to distributed execution

   digraph partitioning {
       rankdir=LR;
       node [shape=box, style="filled,rounded", fontname="Helvetica", fontsize=10];
       edge [color="#555555"];
       bgcolor="transparent";

       read [label="Read Full Mesh\n(Rank 0)", fillcolor="#FFF3E0", color="#FB8C00"];
       check [shape=diamond, label="Partition\nCached?", fillcolor="#E0E0E0", color="#616161", fontsize=9];
       partition [label="Zoltan2 /\nParMETIS\nPartition", fillcolor="#E3F2FD", color="#1E88E5"];
       cache [label="Cache to\nNetCDF", fillcolor="#F5F5F5", color="#9E9E9E"];
       load [label="Load Cached\nPartition", fillcolor="#F5F5F5", color="#9E9E9E"];
       distribute [label="Distribute to\nMPI Ranks", fillcolor="#E8F5E9", color="#43A047"];
       build [label="Build Local Mesh\n+ Ghost Layers", fillcolor="#E8F5E9", color="#43A047"];
       maps [label="Create Tpetra\nMaps & Importers", fillcolor="#F3E5F5", color="#8E24AA"];

       read -> check;
       check -> load [label="yes", fontsize=9];
       check -> partition [label="no", fontsize=9];
       partition -> cache -> distribute;
       load -> distribute;
       distribute -> build -> maps;
   }

Cocoa uses ParMETIS (via Trilinos/Zoltan2) for mesh partitioning. The partitioner uses graph-based algorithms to minimize communication while balancing the computational load across MPI ranks.

Partition File Format
^^^^^^^^^^^^^^^^^^^^^

Partition files are stored in NetCDF format and contain:

- **Node coordinates** (``node_x``, ``node_y``): Geographic coordinates of all nodes
- **Element connectivity** (``element_nodes``): Triangular element node indices
- **Node ownership** (``node_owner``): MPI rank that owns each node
- **Element ownership** (``element_owner``): MPI rank that owns each element
- **Mesh checksum**: Hash for validating mesh consistency

The partition file is automatically validated against the mesh to ensure consistency. If the mesh changes, a new partition file must be generated.

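
As an illustration of how the ``node_owner`` field can be used, the following minimal sketch counts the nodes owned by each rank and reports a simple load-imbalance ratio. It assumes the ownership array has already been read from the partition file into a plain Python list of rank indices; the helper names ``owner_counts`` and ``imbalance`` are hypothetical and are not part of Cocoa.

.. code-block:: python

   from collections import Counter


   def owner_counts(node_owner, num_ranks):
       """Return the number of owned nodes per rank (list index = MPI rank)."""
       counts = Counter(node_owner)
       # Ranks that own no nodes still appear in the result with a count of 0.
       return [counts.get(rank, 0) for rank in range(num_ranks)]


   def imbalance(counts):
       """Ratio of the largest subdomain to the mean subdomain size (1.0 is ideal)."""
       mean = sum(counts) / len(counts)
       return max(counts) / mean


   # Toy ownership array for a 10-node mesh split across 3 ranks (made-up data).
   node_owner = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
   counts = owner_counts(node_owner, 3)
   print(counts)  # [3, 4, 3]
   print(f"imbalance ratio: {imbalance(counts):.2f}")

A ratio close to 1.0 suggests the owned nodes are evenly spread; it says nothing about communication volume, which the partitioner also tries to minimize.
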
Automatic Partition Naming
^^^^^^^^^^^^^^^^^^^^^^^^^^

When no explicit partition file is provided, Cocoa uses the naming convention:

.. code-block:: text

   <mesh_file>.partition_<num_ranks>.nc

For example, with ``mesh.nc`` and 8 MPI ranks, the partition file would be:

.. code-block:: text

   mesh.nc.partition_8.nc

Creating Partition Files
------------------------

Partition files can be created in two ways:

1. **Automatic creation**: When running with MPI and the partition file doesn't exist
2. **Pre-computation**: Using the ``--create-partition`` option

Pre-computing Partitions
^^^^^^^^^^^^^^^^^^^^^^^^

For large meshes, it's recommended to pre-compute partition files before running simulations. This avoids the partitioning overhead during production runs:

.. code-block:: bash

   # Create a partition file for 8 subdomains
   ./cocoa -i simulation.yaml --create-partition 8

This will:

1. Read the mesh file specified in ``simulation.yaml``
2. Partition the mesh into 8 subdomains using ParMETIS
3. Save the partition to ``<mesh_file>.partition_8.nc``
4. Exit without running the simulation

To specify a custom output filename:

.. code-block:: bash

   ./cocoa -i simulation.yaml --create-partition 8 --partition my_partition.nc

.. note::

   Creating partition files only requires serial execution. The partitioner will use MPI if available but works correctly with a single rank.

Partition Caching
^^^^^^^^^^^^^^^^^

Cocoa caches partition files to avoid re-partitioning on subsequent runs. The cache includes a mesh checksum that validates the partition against the current mesh. If the mesh changes (nodes added/removed, connectivity modified), the cache is invalidated and a new partition must be generated.

To force regeneration of a partition file, delete the existing file:

.. code-block:: bash

   rm mesh.nc.partition_8.nc
   mpirun -np 8 ./cocoa -i simulation.yaml

Example Partition Visualization
-------------------------------

The following image shows an example ParMETIS partition of a global ocean mesh (GSTOFS domain) with 256 subdomains. Each color represents a different MPI rank's subdomain:

.. figure:: ../_static/images/partition.png
   :alt: Example ParMETIS partition with 256 subdomains
   :width: 600px
   :align: center

   Example mesh partition for a global ocean model. The mesh is divided into 256 subdomains.

Internal Weir Pairs Across Partitions
-------------------------------------

ADCIRC requires that both sides of an internal weir boundary be assigned to the same MPI subdomain. This constraint simplifies the overflow computation (each rank has direct access to both pair elevations) but restricts the partitioner, potentially leading to load imbalance near weirs.

Cocoa removes this constraint entirely. Internal weir pair nodes may be owned by different MPI ranks because the ghost exchange mechanism already provides the necessary data.

.. figure:: ../_static/images/diagrams/weir_mpi_ghost.svg
   :alt: Internal weir pair nodes across MPI partitions with ghost exchange
   :width: 550px
   :align: center

   Weir pair nodes on opposite sides of a partition boundary. Each rank holds ghost copies of the other rank's nodes. The ghost exchange before the overflow computation ensures both ranks have up-to-date elevation and wet/dry status for their pair nodes.

**How it works:**

1. **Ghost layer inclusion**: Pair nodes on the opposite side of a weir share elements near the crest, so they are automatically included in the ghost layer during mesh partitioning.

2. **Ghost exchange timing**: Each timestep, ``exchange_wetdry_ghosts()`` runs before the overflow computation. This exchanges elevation (:math:`\zeta^{n+1}`) and wet/dry status for all ghost nodes, including weir pair nodes on other ranks.

3. **Independent computation**: After the ghost exchange, each rank has the elevation at both its owned boundary nodes and their pair nodes (as ghost copies). The overflow formula runs independently on each rank using local data. Since both sides of the weir are listed as boundary nodes in the mesh, both ranks compute their own QFORCE contributions.

4. **Bounds-checked RHS application**: When applying QFORCE to the GWCE RHS, boundary segments that span partition boundaries may have one endpoint that is a ghost node. The kernel skips writes to ghost nodes (via a bounds check against the owned RHS size), so each contribution is applied only by the rank that owns the node.

**Runtime validation:**

At startup, ``BoundaryProcessor`` verifies that all internal weir pair node IDs are valid (not -1) in the local partition's global-to-local map. If a pair node is missing from the ghost layer, an error is logged. This would indicate a partitioning bug, since weir pair nodes should always be in the ghost set.

Command Line Reference
----------------------

Cocoa Options
^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Option
     - Description
   * - ``-i <file>``
     - Input configuration file (YAML format)
   * - ``--partition <file>``
     - Use specified partition file. If file exists, load it; otherwise create it
   * - ``--create-partition <N>``
     - Create partition for N subdomains and exit (no simulation)
   * - ``-v, --verbose``
     - Enable verbose logging (debug level)
   * - ``-V, --version``
     - Show version information and exit
   * - ``-h, --help``
     - Show help message

Kokkos Options
^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Option
     - Description
   * - ``--kokkos-num-threads=N``
     - Number of threads for OpenMP execution
   * - ``--kokkos-device-id=N``
     - GPU device ID to use (0-indexed)
   * - ``--kokkos-map-device-id-by``
     - Map device ID by ``mpi_rank`` or ``random``
   * - ``--kokkos-help``
     - Show all Kokkos command line options
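
The options above can be combined in a small launch helper. The following is a minimal sketch, not part of Cocoa: the function names ``default_partition_name`` and ``mpi_command`` are hypothetical, the mesh file name is passed in explicitly because the sketch does not parse the YAML input, and ``mpirun`` is assumed to be the MPI launcher on the system. It only builds and prints the command; it does not run it.

.. code-block:: python

   import shlex


   def default_partition_name(mesh_file, num_ranks):
       """Partition file name following the automatic naming convention."""
       return f"{mesh_file}.partition_{num_ranks}.nc"


   def mpi_command(input_file, mesh_file, num_ranks, kokkos_threads=1, launcher="mpirun"):
       """Assemble an MPI launch command as an argument list (hypothetical helper)."""
       return [
           launcher, "-np", str(num_ranks),
           "./cocoa", "-i", input_file,
           "--partition", default_partition_name(mesh_file, num_ranks),
           f"--kokkos-num-threads={kokkos_threads}",
       ]


   cmd = mpi_command("simulation.yaml", "mesh.nc", 8)
   print(shlex.join(cmd))
   # mpirun -np 8 ./cocoa -i simulation.yaml --partition mesh.nc.partition_8.nc --kokkos-num-threads=1

Keeping the command as an argument list means it can be passed directly to ``subprocess.run`` without shell quoting issues if the helper is later extended to launch the job.
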