Parallel Execution

Cocoa supports several levels of parallelism. This guide covers serial execution, on-node parallelism via OpenMP (CPU threads) or CUDA/HIP (GPUs), and distributed-memory parallelism via MPI.

Execution Modes Overview

Cocoa can be run in several configurations:

Table 3 Execution Modes
Mode	Description
Serial	Single CPU thread, useful for debugging and small meshes
OpenMP	Shared-memory parallelism using CPU threads
CUDA	NVIDIA GPU acceleration
HIP	AMD GPU acceleration
MPI	Distributed-memory parallelism across multiple compute nodes
MPI + GPU	Distributed parallelism with GPU acceleration on each rank

Serial Execution

For serial execution, simply run the cocoa executable directly:

./cocoa -i simulation.yaml

This is useful for:

Debugging numerical issues
Small test cases
Systems without MPI or GPU support

OpenMP Parallelism

When Cocoa is compiled with OpenMP support via Kokkos, you can control the number of threads using the --kokkos-num-threads option:

./cocoa -i simulation.yaml --kokkos-num-threads=8

Alternatively, use the OMP_NUM_THREADS environment variable:

export OMP_NUM_THREADS=8
./cocoa -i simulation.yaml

Note that the OpenMP backend currently relies on atomic operations, which generally make it slower than an equivalent run that combines the serial execution space with MPI and therefore needs no atomics.

GPU Execution

For GPU builds (CUDA or HIP), Cocoa automatically uses the default GPU device. To select a specific GPU:

# Using Kokkos option
./cocoa -i simulation.yaml --kokkos-device-id=0

# Or using an environment variable (CUDA; HIP builds read HIP_VISIBLE_DEVICES instead)
export CUDA_VISIBLE_DEVICES=0
./cocoa -i simulation.yaml
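When several Cocoa processes share a node, each one usually needs a different device ID. The sketch below shows one common convention, assigning devices round-robin by node-local rank. The helper names `pick_device` and `cocoa_argv` are hypothetical, and the environment variables are the ones set by Open MPI and Slurm; whether Cocoa or Kokkos applies exactly this rule internally is not stated in this guide.

```python
# Environment variables that common launchers set to the node-local rank.
LOCAL_RANK_VARS = ("OMPI_COMM_WORLD_LOCAL_RANK", "SLURM_LOCALID")


def pick_device(environ, gpus_per_node):
    """Return a device ID for this process: node-local rank modulo GPU count."""
    for name in LOCAL_RANK_VARS:
        if name in environ:
            return int(environ[name]) % gpus_per_node
    return 0  # no launcher information: fall back to device 0


def cocoa_argv(config, device_id):
    """Build the argument list for one Cocoa process pinned to device_id."""
    return ["./cocoa", "-i", config, f"--kokkos-device-id={device_id}"]


# Node-local rank 5 on a node with 4 GPUs wraps around to device 1.
env = {"OMPI_COMM_WORLD_LOCAL_RANK": "5"}
print(cocoa_argv("simulation.yaml", pick_device(env, gpus_per_node=4)))
```

In a real wrapper script, `os.environ` would be passed in place of the hand-built `env` dictionary.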

MPI Distributed Parallelism

For large-scale simulations, Cocoa supports distributed-memory parallelism via MPI. This allows the mesh to be partitioned across multiple compute nodes, with each MPI rank responsible for a subdomain of the mesh.

Fig. 5 Distributed execution model with ghost exchange between two MPI ranks

Requirements

Cocoa must be built with MPI support (Trilinos compiled with MPI enabled)
A partition file must be created or provided

Basic MPI Execution

To run with MPI:

mpirun -np 4 ./cocoa -i simulation.yaml --partition partition.nc

The --partition option specifies the partition file that defines how the mesh is distributed across MPI ranks. If the file exists, it will be loaded. If it does not exist, Cocoa will create it automatically using ParMETIS.

Important

When using MPI with GPU acceleration, set --kokkos-num-threads=1 so that each rank runs a single host thread and the ranks on a node do not oversubscribe its CPU cores:

mpirun -np 4 ./cocoa -i simulation.yaml --partition partition.nc --kokkos-num-threads=1
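The arithmetic behind oversubscription can be checked before launching a job. This is a minimal sketch that assumes the concern is host CPU threads: the total thread count on a node is ranks per node times threads per rank, and it should not exceed the core count. The function names and the 16-core node are illustrative only.

```python
def host_threads_per_node(ranks_per_node, threads_per_rank):
    """Total host threads that the MPI ranks on one node will start."""
    return ranks_per_node * threads_per_rank


def is_oversubscribed(ranks_per_node, threads_per_rank, cores_per_node):
    """True when the ranks on a node request more threads than it has cores."""
    return host_threads_per_node(ranks_per_node, threads_per_rank) > cores_per_node


# 4 ranks with 8 threads each on a hypothetical 16-core node: 32 threads requested.
print(is_oversubscribed(4, 8, 16))
# The same 4 ranks with --kokkos-num-threads=1: 4 threads requested.
print(is_oversubscribed(4, 1, 16))
```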

Mesh Partitioning

Fig. 6 Mesh partitioning pipeline from input mesh to distributed execution

Cocoa uses ParMETIS (via Trilinos/Zoltan2) for mesh partitioning. The partitioner uses graph-based algorithms to minimize communication between MPI ranks while balancing the computational load across them.
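The two goals of a graph partitioner can be made concrete with two simple metrics: the edge cut (edges whose endpoints live on different ranks, a proxy for communication) and the load imbalance (largest per-rank count divided by the mean). The sketch below computes both for a toy chain of six nodes. It is not ParMETIS itself, and the function names are hypothetical.

```python
from collections import Counter


def edge_cut(edges, node_owner):
    """Count edges whose two endpoints are owned by different ranks."""
    return sum(1 for a, b in edges if node_owner[a] != node_owner[b])


def load_imbalance(owner, num_ranks):
    """Largest per-rank count divided by the mean count; 1.0 is perfectly balanced."""
    counts = Counter(owner)
    largest = max(counts.get(rank, 0) for rank in range(num_ranks))
    return largest / (len(owner) / num_ranks)


# Six nodes connected in a chain: 0-1-2-3-4-5.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
contiguous = [0, 0, 0, 1, 1, 1]   # each rank owns one half of the chain
interleaved = [0, 1, 0, 1, 0, 1]  # ownership alternates node by node

for name, owner in (("contiguous", contiguous), ("interleaved", interleaved)):
    print(name, edge_cut(edges, owner), load_imbalance(owner, 2))
```

Both assignments are perfectly balanced, but the contiguous one cuts a single edge while the interleaved one cuts all five, which is why a partitioner prefers compact subdomains.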

Partition File Format

Partition files are stored in NetCDF format and contain:

Node coordinates (node_x, node_y): Geographic coordinates of all nodes
Element connectivity (element_nodes): Triangular element node indices
Node ownership (node_owner): MPI rank that owns each node
Element ownership (element_owner): MPI rank that owns each element
Mesh checksum: Hash for validating mesh consistency

The partition file is automatically validated against the mesh to ensure consistency. If the mesh changes, a new partition file must be generated.
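The validation step can be pictured as recomputing a hash of the mesh and comparing it with the one stored in the partition file. The sketch below is an assumption-heavy illustration: this guide does not say which hash Cocoa uses or which arrays it covers, so SHA-256 over coordinates and connectivity is chosen here purely for demonstration, and both function names are hypothetical.

```python
import hashlib
import struct


def mesh_checksum(node_x, node_y, element_nodes):
    """Hash node coordinates and triangle connectivity (illustrative scheme only)."""
    digest = hashlib.sha256()
    digest.update(struct.pack("<q", len(node_x)))
    for x, y in zip(node_x, node_y):
        digest.update(struct.pack("<dd", x, y))
    digest.update(struct.pack("<q", len(element_nodes)))
    for triangle in element_nodes:
        digest.update(struct.pack("<3q", *triangle))
    return digest.hexdigest()


def partition_matches_mesh(stored_checksum, node_x, node_y, element_nodes):
    """True when the checksum stored with a partition matches the current mesh."""
    return stored_checksum == mesh_checksum(node_x, node_y, element_nodes)


# A two-triangle mesh on the unit square.
xs, ys = [0.0, 1.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0]
triangles = [(0, 1, 2), (0, 2, 3)]
stored = mesh_checksum(xs, ys, triangles)

print(partition_matches_mesh(stored, xs, ys, triangles))
# Flipping the diagonal changes the connectivity, so the stored checksum is stale.
print(partition_matches_mesh(stored, xs, ys, [(0, 1, 3), (1, 2, 3)]))
```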

Automatic Partition Naming

When no explicit partition file is provided, Cocoa derives the file name from the mesh filename and the number of MPI ranks N:

<mesh_filename>.partition_<N>.nc

For example, with mesh.nc and 8 MPI ranks, the partition file would be:

mesh.nc.partition_8.nc
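For scripts that need to locate or clean up partition files, the convention is easy to reproduce. The helper below is a hypothetical convenience function, not part of Cocoa. It appends the suffix to the mesh path exactly as given, which is an assumption about how directories are handled.

```python
def default_partition_name(mesh_path, num_ranks):
    """Return '<mesh_filename>.partition_<N>.nc' for a mesh and rank count."""
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    # The suffix follows the full mesh filename, extension included.
    return f"{mesh_path}.partition_{num_ranks}.nc"


print(default_partition_name("mesh.nc", 8))
```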

Creating Partition Files

Partition files can be created in two ways:

Automatic creation: When running with MPI and the partition file doesn’t exist
Pre-computation: Using the --create-partition option

Pre-computing Partitions

For large meshes, it’s recommended to pre-compute partition files before running simulations. This avoids the partitioning overhead during production runs:

# Create a partition file for 8 subdomains
./cocoa -i simulation.yaml --create-partition 8

This will:

Read the mesh file specified in simulation.yaml
Partition the mesh into 8 subdomains using ParMETIS
Save the partition to <mesh_filename>.partition_8.nc
Exit without running the simulation

To specify a custom output filename:

./cocoa -i simulation.yaml --create-partition 8 --partition my_partition.nc

Note

Creating a partition file does not require an MPI launch: the partitioner uses MPI when available but also works correctly with a single rank.
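When a study uses several rank counts, the pre-computation commands can be generated in a loop. The sketch below only builds and prints the command lines, using the options documented here. The function name is hypothetical and nothing is executed.

```python
import shlex


def create_partition_command(config, num_subdomains, output=None, exe="./cocoa"):
    """Build the argument list that pre-computes one partition file."""
    argv = [exe, "-i", config, "--create-partition", str(num_subdomains)]
    if output is not None:
        # Optional custom output filename.
        argv += ["--partition", output]
    return argv


for n in (8, 16, 32):
    print(shlex.join(create_partition_command("simulation.yaml", n)))
```

To actually run a command, each list can be passed to `subprocess.run(argv, check=True)`.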

Partition Caching

Cocoa caches partition files to avoid re-partitioning on subsequent runs. The cache includes a mesh checksum that validates the partition against the current mesh. If the mesh changes (nodes added/removed, connectivity modified), the cache is invalidated and a new partition must be generated.

To force regeneration of a partition file, delete the existing file:

rm mesh.nc.partition_8.nc
mpirun -np 8 ./cocoa -i simulation.yaml

Example Partition Visualization

The following image shows an example ParMETIS partition of a global ocean mesh (GSTOFS domain) with 256 subdomains. Each color represents a different MPI rank’s subdomain:

Fig. 7 Example mesh partition for a global ocean model. The mesh is divided into 256 subdomains.

Internal Weir Pairs Across Partitions

ADCIRC requires that both sides of an internal weir boundary be assigned to the same MPI subdomain. This constraint simplifies the overflow computation (each rank has direct access to both pair elevations) but restricts the partitioner, potentially leading to load imbalance near weirs.

Cocoa removes this constraint entirely. Internal weir pair nodes may be owned by different MPI ranks because the ghost exchange mechanism already provides the necessary data.

Fig. 8 Weir pair nodes on opposite sides of a partition boundary. Each rank holds ghost copies of the other rank’s nodes. The ghost exchange before the overflow computation ensures both ranks have up-to-date elevation and wet/dry status for their pair nodes.

How it works:

Ghost layer inclusion: Pair nodes on the opposite side of a weir share elements near the crest, so they are automatically included in the ghost layer during mesh partitioning.
Ghost exchange timing: Each timestep, exchange_wetdry_ghosts() runs before the overflow computation. This exchanges elevation (\(\zeta^{n+1}\)) and wet/dry status for all ghost nodes, including weir pair nodes on other ranks.
Independent computation: After the ghost exchange, each rank has the elevation at both its owned boundary nodes and their pair nodes (as ghost copies). The overflow formula runs independently on each rank using local data. Since both sides of the weir are listed as boundary nodes in the mesh, both ranks compute their own QFORCE contributions.
Bounds-checked RHS application: When applying QFORCE to the GWCE RHS, boundary segments that span partition boundaries may have one endpoint that is a ghost node. The kernel skips writes to ghost nodes (via a bounds check against the owned RHS size), so each contribution is applied only by the rank that owns the node.
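The bounds-checked application in the last step can be illustrated with a toy two-rank layout. This sketch assumes that each rank numbers its owned nodes first and its ghost nodes after them, so that any local index at or beyond the owned RHS size is a ghost. The equal split of each segment's flux between its endpoints, the data layout, and the function name are illustrative assumptions, not Cocoa's actual kernel.

```python
def apply_qforce(segments, qforce, n_owned):
    """Accumulate segment fluxes into an RHS that has entries for owned nodes only."""
    rhs = [0.0] * n_owned
    for (a, b), q in zip(segments, qforce):
        for node in (a, b):
            if node < n_owned:  # bounds check: ghost indices fall outside the RHS
                rhs[node] += 0.5 * q
    return rhs


# Global boundary chain 0-1-2-3, split between two ranks.
# l2g maps local index -> global ID; owned nodes come first, ghosts last.
ranks = [
    {"l2g": [0, 1, 2], "n_owned": 2, "segments": [(0, 1), (1, 2)]},  # ghost: global 2
    {"l2g": [2, 3, 1], "n_owned": 2, "segments": [(2, 0), (0, 1)]},  # ghost: global 1
]

global_rhs = [0.0] * 4
for rank in ranks:
    qforce = [1.0] * len(rank["segments"])
    local_rhs = apply_qforce(rank["segments"], qforce, rank["n_owned"])
    for local, value in enumerate(local_rhs):
        global_rhs[rank["l2g"][local]] += value

print(global_rhs)
```

Both ranks process the segment between global nodes 1 and 2, yet each endpoint receives that segment's contribution exactly once, from the rank that owns it.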

Runtime validation:

At startup, BoundaryProcessor verifies that all internal weir pair node IDs are valid (not -1) in the local partition’s global-to-local map. If a pair node is missing from the ghost layer, an error is logged. This would indicate a partitioning bug since weir pair nodes should always be in the ghost set.
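A minimal sketch of such a startup check follows, assuming the global-to-local map behaves like a dictionary whose lookup yields -1 for nodes that are neither owned nor ghosted on the rank. The function name and data layout are hypothetical and do not reflect the actual BoundaryProcessor interface.

```python
def missing_pair_nodes(weir_pairs, global_to_local):
    """Return global IDs of weir pair nodes that have no local index on this rank."""
    missing = []
    for _owned_gid, pair_gid in weir_pairs:
        if global_to_local.get(pair_gid, -1) == -1:
            missing.append(pair_gid)
    return missing


# This rank knows global nodes 10, 11 and 20; node 21 is absent from its ghost layer.
global_to_local = {10: 0, 11: 1, 20: 2}
weir_pairs = [(10, 20), (11, 21)]  # (owned node, pair node) as global IDs

for gid in missing_pair_nodes(weir_pairs, global_to_local):
    print(f"error: weir pair node {gid} is missing from the ghost layer")
```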

Command Line Reference

Cocoa Options

Option	Description
`-i <file>`	Input configuration file (YAML format)
`--partition <file>`	Use specified partition file. If file exists, load it; otherwise create it
`--create-partition <N>`	Create partition for N subdomains and exit (no simulation)
`-v, --verbose`	Enable verbose logging (debug level)
`-V, --version`	Show version information and exit
`-h, --help`	Show help message

Kokkos Options

Option	Description
`--kokkos-num-threads=N`	Number of threads for OpenMP execution
`--kokkos-device-id=N`	GPU device ID to use (0-indexed)
`--kokkos-map-device-id-by=<strategy>`	Strategy for choosing the device ID automatically: `mpi_rank` or `random`
`--kokkos-help`	Show all Kokkos command line options