Hot Start

Cocoa supports checkpoint/restart (hot start) for long-running simulations. A simulation can write periodic checkpoint files, and a later simulation can resume from a retained checkpoint instead of starting again from initial conditions.

Overview

The hot start system works as follows:

  1. Checkpoint writing: During a simulation, Cocoa periodically writes checkpoint files containing the full hydrodynamic state.

  2. Restart: A new simulation reads a checkpoint file and resumes from the saved state, continuing to the desired end time.

Checkpoint files use a round-robin two-file system: writes alternate between {prefix}_a.nc and {prefix}_b.nc. If a crash occurs mid-write, the previous checkpoint remains intact in the other file.
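The alternation can be pictured with a short Python sketch. It assumes, purely for illustration, that writes are numbered from 0 and that the first write goes to the _a file; the text above does not say which file is written first. Both helper names are hypothetical and are not part of Cocoa.

```python
from typing import Optional


def checkpoint_path(prefix: str, write_seq: int) -> str:
    """Return the file targeted by checkpoint write number write_seq.

    Assumption: writes are numbered from 0 and write 0 goes to the _a file.
    """
    suffix = "a" if write_seq % 2 == 0 else "b"
    return f"{prefix}_{suffix}.nc"


def surviving_checkpoint(prefix: str, failed_write_seq: int) -> Optional[str]:
    """Return the file left intact if the given write is interrupted."""
    if failed_write_seq < 1:
        return None  # no earlier checkpoint exists yet
    # The interrupted write targets one file; the previous write went to the other.
    return checkpoint_path(prefix, failed_write_seq - 1)
```

Under these assumptions, an interruption during write 3 (which targets the _b file) leaves the checkpoint from write 2 intact in the _a file.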

Configuration

The hot_start section in the YAML configuration file controls checkpoint behavior.

Writing Checkpoints

To enable periodic checkpoint writing:

hot_start:
  enabled: true
  write_interval: 12h       # Write every 12 hours (or integer step count)
  file_prefix: "cocoa_hotstart"  # Output: cocoa_hotstart_a.nc, cocoa_hotstart_b.nc

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable hot start checkpoint writing |
| write_interval | int or duration | 0 | Interval between checkpoint writes. Accepts a plain integer (number of time steps) or a duration string (e.g., 6h, 1d). Set to 0 to disable. See Duration Syntax. |
| file_prefix | string | cocoa_hotstart | Filename prefix for checkpoint files. Two files are created: {prefix}_a.nc and {prefix}_b.nc. |
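To relate the two accepted forms of write_interval, the following Python sketch converts a duration string into a step count for a given time_step. It is not Cocoa's parser. It recognizes only the units that appear in this page's examples (s, h, d), and rejecting an interval that is not a whole number of time steps is an assumption of the sketch. The function names are hypothetical.

```python
import re

# Only the units used in this page's examples; Duration Syntax may define more.
_UNIT_SECONDS = {"s": 1, "h": 3600, "d": 86400}


def duration_seconds(text: str) -> int:
    """Parse a duration such as '12h' or '10s' into seconds."""
    match = re.fullmatch(r"(\d+)([shd])", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def write_interval_steps(write_interval, time_step: str) -> int:
    """Express a write_interval value as a number of time steps."""
    if isinstance(write_interval, int):
        return write_interval  # a plain integer is already a step count
    interval_s = duration_seconds(write_interval)
    step_s = duration_seconds(time_step)
    if interval_s % step_s != 0:
        # Assumption of this sketch: partial steps are treated as an error.
        raise ValueError("interval is not a whole number of time steps")
    return interval_s // step_s
```

With a 10s time step, a 12h interval corresponds to 4320 steps, so write_interval: 12h and write_interval: 4320 would describe the same spacing under these assumptions.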

Restarting from a Checkpoint

To restart a simulation from a checkpoint file:

hot_start:
  enabled: true
  read_file: "cocoa_hotstart_a.nc"

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| read_file | string | (none) | Path to checkpoint file to read. When set, the simulation resumes from this checkpoint instead of starting from initial conditions. |
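Because writes alternate between two files, the file to pass as read_file may be either the _a or the _b file. The Python sketch below lists whichever of the two exist, most recently modified first. Modification time is only a proxy for write order. The checkpoint metadata also records a write sequence number, but this sketch does not read it. The helper name is hypothetical.

```python
import os


def checkpoint_candidates(prefix: str, directory: str = ".") -> list[str]:
    """List the existing {prefix}_a.nc / {prefix}_b.nc files, newest first."""
    paths = [os.path.join(directory, f"{prefix}_{s}.nc") for s in ("a", "b")]
    existing = [p for p in paths if os.path.isfile(p)]
    # Assumption: file modification time reflects the order of checkpoint writes.
    return sorted(existing, key=os.path.getmtime, reverse=True)
```

The first entry is the natural choice for read_file. If the run crashed while writing it, the second entry holds the previous checkpoint.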

When read_file is specified, Cocoa:

  • Recovers the original simulation start time from the checkpoint file

  • Restores all hydrodynamic state (elevation, velocity, flux at all time levels)

  • Restores wet/dry status and element active flags

  • Restores slope limiters and boundary forcing state

  • Resumes time stepping from the checkpoint step

  • Continues output numbering from the checkpoint offset

Workflow

A typical hot start workflow uses three configuration files:

1. Full continuous run (reference or production):

simulation:
  start_time: 2025-01-01
  end_time: 2025-01-15
  time_step: 10s

output:
  filename: "cocoa_output.nc"
  step_interval: 1h

2. Cold start with checkpoint writing (first segment):

simulation:
  start_time: 2025-01-01
  end_time: 2025-01-08        # Run first half
  time_step: 10s

hot_start:
  enabled: true
  write_interval: 12h         # Checkpoint every 12 hours
  file_prefix: "cocoa_hotstart"

output:
  filename: "cocoa_output_coldstart.nc"
  step_interval: 1h

3. Restart from checkpoint (second segment):

simulation:
  end_time: 2025-01-15        # Run to final end time
  time_step: 10s             # Must match original time step

hot_start:
  enabled: true
  read_file: "cocoa_hotstart_a.nc"   # Or cocoa_hotstart_b.nc, whichever holds the checkpoint to resume from
  write_interval: 12h         # Optionally continue writing checkpoints

output:
  filename: "cocoa_output_restart.nc"
  step_interval: 1h

Important

The restart configuration must use the same time_step as the original simulation. The start_time does not need to be specified; it is recovered automatically from the checkpoint file. The end_time can differ and is typically extended to the desired final time.
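These rules can be checked before launching a restart. The Python sketch below compares two configurations that have already been parsed into nested dictionaries with the layout shown in the examples above. The function is hypothetical and is not part of Cocoa. It compares time_step values as written, so equivalent durations spelled differently would be reported as a mismatch.

```python
def check_restart_config(original: dict, restart: dict) -> list[str]:
    """Return a list of problems found in a restart configuration.

    Both arguments are parsed YAML documents represented as nested dicts.
    """
    problems = []
    orig_sim = original.get("simulation", {})
    new_sim = restart.get("simulation", {})
    # The restart must use the same time step as the original run.
    if new_sim.get("time_step") != orig_sim.get("time_step"):
        problems.append("time_step differs from the original run")
    # Without read_file the run would start from initial conditions.
    if not restart.get("hot_start", {}).get("read_file"):
        problems.append("hot_start.read_file is not set")
    return problems
```

An empty list means neither problem was found. start_time is deliberately not checked, since it is recovered from the checkpoint file.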

Checkpoint Contents

Each checkpoint file (NetCDF format) stores:

  • Hydrodynamic state: Water surface elevation (zeta), velocity components (u, v), and volume flux (qx, qy) at all three time levels (n+1, n, n-1)

  • Derived fields: Rate of change of elevation (del_zeta)

  • Wet/dry state: Node wet/dry status, element active status, slope limiters

  • Boundary state: Normal flux (qn) and, for radiation boundaries, boundary elevation (en) at all three time levels (n+1, n, n-1)

  • Meteorological state (if enabled): Wind stress components (current and previous) and atmospheric pressure (current and previous)

  • Tidal potential state (if enabled): Potential values (current and previous)

  • Mesh geometry: Total element area per node

  • Metadata: Steps completed, time step, simulation start time, output time index, write sequence number

Considerations

Choosing a write interval:

Checkpoint writing involves file I/O and (in MPI mode) gathering data to rank 0. Choose an interval that balances restart granularity against I/O overhead. Common choices:

  • write_interval: 6h to 12h for storm surge runs

  • write_interval: 1d to 2d for tidal spinup
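To get a feel for the I/O cost of a given interval, the number of checkpoint writes in a run can be estimated with a few lines of Python. The sketch assumes a checkpoint is written each time a whole interval has elapsed since the start of the run. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def checkpoint_write_count(start: datetime, end: datetime, interval: timedelta) -> int:
    """Number of whole checkpoint intervals that fit between start and end."""
    # Dividing one timedelta by another with // yields an integer count.
    return (end - start) // interval


first_segment = (datetime(2025, 1, 1), datetime(2025, 1, 8))
writes_12h = checkpoint_write_count(*first_segment, timedelta(hours=12))
writes_6h = checkpoint_write_count(*first_segment, timedelta(hours=6))
```

For the seven-day first segment in the workflow above, a 12h interval gives 14 writes and a 6h interval gives 28.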

Peak values are not preserved:

Peak elevation (zeta_max) and other tracked extrema reset on restart because they are not stored in the checkpoint file. If you need continuous peak tracking, use a single continuous run.

Implicit solver restart precision:

When using the implicit (consistent) solver, restarted simulations may show small differences (within the iterative solver’s convergence tolerance) compared to a continuous run. On the first step after a restart, the iterative solver starts from a zero initial guess, whereas a continuous run starts from the previous step’s solution. These differences are inherent to iterative solvers and do not indicate a problem.
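One way to confirm that a restarted run agrees with a continuous reference run is to compare the same field from both outputs against the solver tolerance. The Python sketch below works on plain sequences of values that have already been read from the two output files. Reading the files is left out, and the function names and any tolerance value passed in are hypothetical.

```python
def max_abs_difference(reference, restarted) -> float:
    """Largest absolute difference between two equal-length value sequences."""
    if len(reference) != len(restarted):
        raise ValueError("sequences must have the same length")
    return max((abs(a - b) for a, b in zip(reference, restarted)), default=0.0)


def within_tolerance(reference, restarted, tolerance: float) -> bool:
    """True if the two runs agree to within the given tolerance."""
    return max_abs_difference(reference, restarted) <= tolerance
```

A difference larger than the solver’s convergence tolerance would point to something other than the initial-guess effect described above, such as a mismatched time_step.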

MPI compatibility:

Checkpoint files are written in global (non-partitioned) format. A restart simulation can use a different number of MPI ranks than the original run.