Parallel Execution

Cocoa supports several levels of parallelism. This guide covers serial execution, shared-memory parallelism on CPUs (OpenMP) and GPUs (CUDA, HIP), and distributed-memory parallelism via MPI.

Execution Modes Overview

Cocoa can be run in several configurations:

Table 3 Execution Modes

Mode       Description
---------  ------------------------------------------------------------
Serial     Single CPU thread, useful for debugging and small meshes
OpenMP     Shared-memory parallelism using CPU threads
CUDA       NVIDIA GPU acceleration
HIP        AMD GPU acceleration
MPI        Distributed-memory parallelism across multiple compute nodes
MPI + GPU  Distributed parallelism with GPU acceleration on each rank

Serial Execution

For serial execution, simply run the cocoa executable directly:

./cocoa -i simulation.yaml

This is useful for:

  • Debugging numerical issues

  • Small test cases

  • Systems without MPI or GPU support

OpenMP Parallelism

When Cocoa is compiled with OpenMP support via Kokkos, you can control the number of threads using the --kokkos-num-threads option:

./cocoa -i simulation.yaml --kokkos-num-threads=8

Alternatively, use the OMP_NUM_THREADS environment variable:

export OMP_NUM_THREADS=8
./cocoa -i simulation.yaml

Note that, at present, atomic operations generally make the OpenMP build slower than an equivalent run that uses the serial execution space with MPI, which avoids atomics.

GPU Execution

For GPU builds (CUDA or HIP), Cocoa automatically uses the default GPU device. To select a specific GPU:

# Using Kokkos option
./cocoa -i simulation.yaml --kokkos-device-id=0

# Or using environment variable (CUDA)
export CUDA_VISIBLE_DEVICES=0
./cocoa -i simulation.yaml

MPI Distributed Parallelism

For large-scale simulations, Cocoa supports distributed-memory parallelism via MPI. This allows the mesh to be partitioned across multiple compute nodes, with each MPI rank responsible for a subdomain of the mesh.

[Figure: two MPI ranks (Rank 0 and Rank 1). Each rank holds owned nodes (authoritative), ghost nodes (copies from the other rank), and a local solver (GWCE + Momentum) fed by both. update_ghosts arrows run from each rank's owned nodes to the other rank's ghost nodes; sum_into_owned (FEM assembly) arrows run from each rank's ghost nodes to its owned nodes.]

Fig. 5 Distributed execution model with ghost exchange between two MPI ranks

Requirements

  • Cocoa must be built with MPI support (Trilinos compiled with MPI enabled)

  • A partition file must be created or provided

Basic MPI Execution

To run with MPI:

mpirun -np 4 ./cocoa -i simulation.yaml --partition partition.nc

The --partition option specifies the partition file that defines how the mesh is distributed across MPI ranks. If the file exists, it will be loaded. If it does not exist, Cocoa will create it automatically using ParMETIS.

Important

When using MPI with GPU acceleration, set --kokkos-num-threads=1 so that each rank runs a single host thread and the CPU cores are not oversubscribed:

mpirun -np 4 ./cocoa -i simulation.yaml --partition partition.nc --kokkos-num-threads=1
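When several ranks share a node with several GPUs, each rank also needs its own device. The sketch below shows one possible approach and is not part of Cocoa: the function names local_rank and device_flag are hypothetical, and it assumes the MPI launcher exports one of the node-local rank variables set by Open MPI, Slurm, or MVAPICH2. The GPU count per node is supplied by the caller.

```shell
# Hypothetical helpers (not part of Cocoa).

# Print the node-local rank exported by the launcher, or 0 if none is set.
# Checks Open MPI, then Slurm, then MVAPICH2.
local_rank() {
  echo "${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-${MV2_COMM_WORLD_LOCAL_RANK:-0}}}"
}

# Print a --kokkos-device-id flag that assigns devices round-robin
# within a node. $1 = number of GPUs per node (supplied by the caller).
device_flag() {
  echo "--kokkos-device-id=$(( $(local_rank) % $1 ))"
}
```

A wrapper script started by mpirun could end with a line such as exec ./cocoa "$@" $(device_flag 4). The --kokkos-map-device-id-by option listed under Kokkos Options may make such a wrapper unnecessary.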

Mesh Partitioning

[Figure: Read Full Mesh (Rank 0) -> Partition Cached? If yes: Load Cached Partition. If no: Zoltan2 / ParMETIS Partition -> Cache to NetCDF. Both paths -> Distribute to MPI Ranks -> Build Local Mesh + Ghost Layers -> Create Tpetra Maps & Importers.]

Fig. 6 Mesh partitioning pipeline from input mesh to distributed execution

Cocoa uses ParMETIS (via Trilinos/Zoltan2) for mesh partitioning. The partitioner uses graph-based algorithms to minimize communication while balancing the computational load across MPI ranks.

Partition File Format

Partition files are stored in NetCDF format and contain:

  • Node coordinates (node_x, node_y): Geographic coordinates of all nodes

  • Element connectivity (element_nodes): Triangular element node indices

  • Node ownership (node_owner): MPI rank that owns each node

  • Element ownership (element_owner): MPI rank that owns each element

  • Mesh checksum: Hash for validating mesh consistency

The partition file is automatically validated against the mesh to ensure consistency. If the mesh changes, a new partition file must be generated.
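To check by hand that a file carries the variables listed above, its NetCDF header can be inspected. The helper below is a hypothetical sketch, not a Cocoa tool: it reads CDL header text on standard input (for example from ncdump -h) and prints any of the listed variable names that are not declared. It assumes each variable is declared as an array, so its name is followed by an opening parenthesis. It does not look at the mesh checksum, because the source does not say how that is stored.

```shell
# Hypothetical helper (not part of Cocoa): read a NetCDF header in CDL
# form on stdin and print each expected variable that is not declared.
missing_partition_vars() {
  header=$(cat)
  for v in node_x node_y element_nodes node_owner element_owner; do
    # A CDL array declaration looks like: "<type> <name>(<dims>) ;"
    printf '%s\n' "$header" | grep -Eq "[[:space:]]${v}\(" || echo "$v"
  done
}
```

For example, ncdump -h partition.nc | missing_partition_vars prints nothing when all five variables are present.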

Automatic Partition Naming

When no explicit partition file is provided, Cocoa uses the naming convention:

<mesh_filename>.partition_<N>.nc

For example, with mesh.nc and 8 MPI ranks, the partition file would be:

mesh.nc.partition_8.nc
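Job scripts sometimes need the same name, for instance to check whether a partition already exists. A minimal sketch of the convention follows. The function name partition_name is hypothetical, and whether Cocoa keeps the directory part of the mesh path is an assumption here.

```shell
# Hypothetical helper: build <mesh_filename>.partition_<N>.nc
# from a mesh file name ($1) and a rank count ($2).
partition_name() {
  printf '%s.partition_%s.nc\n' "$1" "$2"
}
```

A script could then test for the file with [ -f "$(partition_name mesh.nc 8)" ] before launching.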

Creating Partition Files

Partition files can be created in two ways:

  1. Automatic creation: When running with MPI and the partition file doesn’t exist

  2. Pre-computation: Using the --create-partition option

Pre-computing Partitions

For large meshes, it’s recommended to pre-compute partition files before running simulations. This avoids the partitioning overhead during production runs:

# Create a partition file for 8 subdomains
./cocoa -i simulation.yaml --create-partition 8

This will:

  1. Read the mesh file specified in simulation.yaml

  2. Partition the mesh into 8 subdomains using ParMETIS

  3. Save the partition to <mesh_filename>.partition_8.nc

  4. Exit without running the simulation

To specify a custom output filename:

./cocoa -i simulation.yaml --create-partition 8 --partition my_partition.nc

Note

Creating partition files only requires serial execution. The partitioner will use MPI if available but works correctly with a single rank.
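When the same mesh will be run at several rank counts, the pre-computation step can be looped. The sketch below is illustrative only: precompute_partitions and the COCOA variable are hypothetical names, and the loop uses only the -i and --create-partition options described above.

```shell
# Hypothetical helper: pre-compute partitions for several rank counts.
# $1 = configuration file, remaining arguments = subdomain counts.
# COCOA may point at the executable; it defaults to ./cocoa.
precompute_partitions() {
  config=$1
  shift
  for n in "$@"; do
    # Stop at the first failure so a broken mesh is not retried.
    "${COCOA:-./cocoa}" -i "$config" --create-partition "$n" || return 1
  done
}
```

For example, precompute_partitions simulation.yaml 8 16 32 would run the partitioner three times in sequence.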

Partition Caching

Cocoa caches partition files to avoid re-partitioning on subsequent runs. The cache includes a mesh checksum that validates the partition against the current mesh. If the mesh changes (nodes added/removed, connectivity modified), the cache is invalidated and a new partition must be generated.

To force regeneration of a partition file, delete the existing file:

rm mesh.nc.partition_8.nc
mpirun -np 8 ./cocoa -i simulation.yaml

Example Partition Visualization

The following image shows an example ParMETIS partition of a global ocean mesh (GSTOFS domain) with 256 subdomains. Each color represents a different MPI rank’s subdomain:

Example ParMETIS partition with 256 subdomains

Fig. 7 Example mesh partition for a global ocean model. The mesh is divided into 256 subdomains

Internal Weir Pairs Across Partitions

ADCIRC requires that both sides of an internal weir boundary be assigned to the same MPI subdomain. This constraint simplifies the overflow computation (each rank has direct access to both pair elevations) but restricts the partitioner, potentially leading to load imbalance near weirs.

Cocoa removes this constraint entirely. Internal weir pair nodes may be owned by different MPI ranks because the ghost exchange mechanism already provides the necessary data.

Internal weir pair nodes across MPI partitions with ghost exchange

Fig. 8 Weir pair nodes on opposite sides of a partition boundary. Each rank holds ghost copies of the other rank’s nodes. The ghost exchange before the overflow computation ensures both ranks have up-to-date elevation and wet/dry status for their pair nodes.

How it works:

  1. Ghost layer inclusion: Pair nodes on the opposite side of a weir share elements near the crest, so they are automatically included in the ghost layer during mesh partitioning.

  2. Ghost exchange timing: Each timestep, exchange_wetdry_ghosts() runs before the overflow computation. This exchanges elevation (\(\zeta^{n+1}\)) and wet/dry status for all ghost nodes, including weir pair nodes on other ranks.

  3. Independent computation: After the ghost exchange, each rank has the elevation at both its owned boundary nodes and their pair nodes (as ghost copies). The overflow formula runs independently on each rank using local data. Since both sides of the weir are listed as boundary nodes in the mesh, both ranks compute their own QFORCE contributions.

  4. Bounds-checked RHS application: When applying QFORCE to the GWCE RHS, boundary segments that span partition boundaries may have one endpoint that is a ghost node. The kernel skips writes to ghost nodes (via a bounds check against the owned RHS size), so each contribution is applied only by the rank that owns the node.

Runtime validation:

At startup, BoundaryProcessor verifies that all internal weir pair node IDs are valid (not -1) in the local partition’s global-to-local map. If a pair node is missing from the ghost layer, an error is logged. This would indicate a partitioning bug since weir pair nodes should always be in the ghost set.

Command Line Reference

Cocoa Options

Option                  Description
----------------------  ----------------------------------------------------------
-i <file>               Input configuration file (YAML format)
--partition <file>      Use specified partition file. If file exists, load it; otherwise create it
--create-partition <N>  Create partition for N subdomains and exit (no simulation)
-v, --verbose           Enable verbose logging (debug level)
-V, --version           Show version information and exit
-h, --help              Show help message

Kokkos Options

Option                         Description
-----------------------------  ------------------------------------------------
--kokkos-num-threads=N         Number of threads for OpenMP execution
--kokkos-device-id=N           GPU device ID to use (0-indexed)
--kokkos-map-device-id-by=...  Strategy for choosing the device ID automatically: random or mpi_rank
--kokkos-help                  Show all Kokkos command line options