====================
Checkpoint / Restart
====================

Cocoa supports checkpoint/restart for long-running simulations. A simulation
can write checkpoint files, and a subsequent simulation can resume (restart)
from a checkpoint without restarting from the beginning.

Overview
--------

The checkpoint/restart system works as follows:

1. **Checkpoint writing**: When checkpointing is enabled, Cocoa writes a
   checkpoint file containing the full hydrodynamic state. By default a single
   checkpoint is written at the **end of the run**; it can optionally write at
   a regular interval as well.

2. **Restart**: A new simulation reads a checkpoint file and resumes from the
   saved state, continuing to the desired end time.

Each write produces a **separate, timestamped file** named
``{prefix}.{simulation_time}.nc`` (for example
``cocoa_checkpoint.20250101T120000.nc``). The timestamp is the *simulation*
time of the checkpoint in ``YYYYMMDDTHHMMSS`` form, so files are
self-describing and sort chronologically by name. Because every write is a new
file, a crash mid-write can only damage the file being written --- previously
completed checkpoints are never overwritten.

.. _checkpoint-config:

Configuration
-------------

The ``checkpoint`` section in the YAML configuration file controls checkpoint
behavior.

Writing Checkpoints
^^^^^^^^^^^^^^^^^^^

The simplest configuration enables checkpointing and writes a single
checkpoint at the end of the run:

.. code-block:: yaml

   checkpoint:
     enabled: true
     file_prefix: "cocoa_checkpoint"   # Output: cocoa_checkpoint.{simulation_time}.nc

To write checkpoints periodically as well, set ``write_interval``:

.. code-block:: yaml

   checkpoint:
     enabled: true
     write_interval: 12h               # Write every 12 hours (plus one at the end)
     file_prefix: "cocoa_checkpoint"

**Parameters:**

.. list-table::
   :header-rows: 1
   :widths: 25 15 20 40

   * - Parameter
     - Type
     - Default
     - Description
   * - ``enabled``
     - bool
     - false
     - Enable checkpoint writing
   * - ``write_interval``
     - int or duration
     - 0
     - How often to write checkpoints. ``0`` (the default) writes a single
       checkpoint at the end of the run. A positive value writes every ``N``
       time steps **and** one at the final step; it accepts a plain integer
       (number of time steps) or a duration string (e.g., ``6h``, ``1d``).
       See :ref:`duration-intervals`.
   * - ``file_prefix``
     - string
     - ``cocoa_checkpoint``
     - Filename prefix. Each write produces
       ``{prefix}.{simulation_time}.nc`` (e.g.
       ``cocoa_checkpoint.20250101T120000.nc``).

.. note::

   Each checkpoint is a full snapshot of the model state and is not small.
   When ``write_interval`` is set such that more than ten checkpoint files
   would be produced, Cocoa logs a warning at startup suggesting a larger
   interval or the end-of-run default (``write_interval: 0``).

Restarting from a Checkpoint
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To restart a simulation from a checkpoint file:

.. code-block:: yaml

   checkpoint:
     enabled: true
     restart_file: "cocoa_checkpoint.20250108T000000.nc"

**Parameters:**

.. list-table::
   :header-rows: 1
   :widths: 25 15 20 40

   * - Parameter
     - Type
     - Default
     - Description
   * - ``restart_file``
     - string
     - (none)
     - Explicit path to the checkpoint file to restart from. When set, the
       simulation resumes from this checkpoint instead of starting from
       initial conditions. Because filenames encode the simulation time, the
       file to use is deterministic --- for an end-of-run checkpoint it is
       ``{prefix}.{end_time}.nc``.
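
The naming rule above can be reproduced outside Cocoa, for example in a job
script that needs to fill in ``restart_file``. The following is a minimal
sketch, not part of Cocoa: the helper names ``checkpoint_filename`` and
``checkpoint_time`` are hypothetical, and the only thing taken from this page
is the ``{prefix}.{simulation_time}.nc`` pattern with a ``YYYYMMDDTHHMMSS``
timestamp.

```python
from datetime import datetime

# YYYYMMDDTHHMMSS, the timestamp form described on this page
TIME_FORMAT = "%Y%m%dT%H%M%S"


def checkpoint_filename(prefix: str, sim_time: datetime) -> str:
    """Build '{prefix}.{simulation_time}.nc' for a given simulation time."""
    return f"{prefix}.{sim_time.strftime(TIME_FORMAT)}.nc"


def checkpoint_time(filename: str, prefix: str) -> datetime:
    """Recover the simulation time encoded in a checkpoint filename."""
    head, tail = prefix + ".", ".nc"
    if not (filename.startswith(head) and filename.endswith(tail)):
        raise ValueError(f"not a checkpoint file for prefix {prefix!r}: {filename}")
    stamp = filename[len(head):-len(tail)]
    return datetime.strptime(stamp, TIME_FORMAT)


# End-of-run checkpoint for a segment that ends on 2025-01-08 00:00:00
print(checkpoint_filename("cocoa_checkpoint", datetime(2025, 1, 8)))
# prints: cocoa_checkpoint.20250108T000000.nc
```

Parsing the name back with ``checkpoint_time`` is a cheap way to check that a
restart file corresponds to the simulation time you expect before submitting
the next segment.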

When ``restart_file`` is specified, Cocoa:

- Recovers the original simulation start time from the checkpoint file
- Restores all hydrodynamic state (elevation, velocity, flux at all time
  levels)
- Restores wet/dry status and element active flags
- Restores slope limiters and boundary forcing state
- Resumes time stepping from the checkpoint step
- Continues output numbering from the checkpoint offset

Workflow
--------

A typical checkpoint/restart workflow uses three configuration files:

**1. Full continuous run** (reference or production):

.. code-block:: yaml

   simulation:
     start_time: 2025-01-01
     end_time: 2025-01-15
     time_step: 10s

   output:
     filename: "cocoa_output.nc"
     step_interval: 1h

**2. Cold start with checkpoint writing** (first segment):

.. code-block:: yaml

   simulation:
     start_time: 2025-01-01
     end_time: 2025-01-08              # Run first half
     time_step: 10s

   checkpoint:
     enabled: true
     write_interval: 12h               # Checkpoint every 12 hours
     file_prefix: "cocoa_checkpoint"

   output:
     filename: "cocoa_output_coldstart.nc"
     step_interval: 1h

**3. Restart from checkpoint** (second segment):

.. code-block:: yaml

   simulation:
     end_time: 2025-01-15              # Run to final end time
     time_step: 10s                    # Must match original time step

   checkpoint:
     enabled: true
     # The cold-start segment ends at 2025-01-08, so its final checkpoint is
     # cocoa_checkpoint.20250108T000000.nc
     restart_file: "cocoa_checkpoint.20250108T000000.nc"
     write_interval: 12h               # Optionally continue writing checkpoints

   output:
     filename: "cocoa_output_restart.nc"
     step_interval: 1h

.. important::

   The restart configuration must use the same ``time_step`` as the original
   simulation. The ``start_time`` does not need to be specified --- it is
   automatically recovered from the checkpoint file. The ``end_time`` can be
   different (typically extended to the desired final time). The checkpoint to
   restart from is named after its simulation time, so the cold-start
   segment's final checkpoint is ``{prefix}.{cold_start_end_time}.nc``.
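
When a segment has written several checkpoints, the most recent one can be
picked by name alone, because the fixed-width timestamps sort chronologically.
The sketch below illustrates this with a hypothetical helper,
``latest_checkpoint``, operating on a plain list of filenames; it is an
illustration for workflow scripts and not a Cocoa feature.

```python
def latest_checkpoint(filenames, prefix="cocoa_checkpoint"):
    """Return the newest checkpoint name in `filenames`, or None if there is none."""
    head, tail = prefix + ".", ".nc"
    candidates = [
        name for name in filenames
        if name.startswith(head) and name.endswith(tail)
    ]
    # Fixed-width YYYYMMDDTHHMMSS stamps compare chronologically as strings.
    return max(candidates, default=None)


names = [
    "cocoa_checkpoint.20250101T120000.nc",
    "cocoa_checkpoint.20250108T000000.nc",
    "cocoa_checkpoint.20250107T120000.nc",
    "cocoa_output_coldstart.nc",  # regular output, ignored by the prefix filter
]
print(latest_checkpoint(names))
# prints: cocoa_checkpoint.20250108T000000.nc
```

In a real script the list would typically come from ``os.listdir`` or
``glob.glob`` on the run directory, and the result would be written into
``restart_file`` in the restart configuration.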

Checkpoint Contents
-------------------

Each checkpoint file (NetCDF format) stores:

- **Hydrodynamic state**: Water surface elevation (zeta), velocity components
  (u, v), and volume flux (qx, qy) at all three time levels (n+1, n, n-1)
- **Derived fields**: Rate of change of elevation (del_zeta)
- **Wet/dry state**: Node wet/dry status, element active status, slope
  limiters
- **Boundary state**: Normal flux (qn) and, for radiation boundaries, boundary
  elevation (en) at all three time levels (n+1, n, n-1)
- **Meteorological state** (if enabled): Wind stress components (current and
  previous) and atmospheric pressure (current and previous)
- **Tidal potential state** (if enabled): Potential values (current and
  previous)
- **Mesh geometry**: Total element area per node
- **Metadata**: Steps completed, time step, simulation start time, output time
  index, write sequence number

Considerations
--------------

**Choosing a write interval:** For most workflows the end-of-run default
(``write_interval: 0``) is sufficient: it leaves one checkpoint you can resume
from. Set a positive ``write_interval`` only when you need intermediate
restart points. Each checkpoint is a full state snapshot, so writing involves
file I/O and (in MPI mode) gathering data to rank 0; choose an interval that
balances restart granularity against I/O overhead and disk usage. Common
choices:

- ``write_interval: 6h`` to ``12h`` for storm surge runs
- ``write_interval: 1d`` to ``2d`` for tidal spinup

If a chosen interval would produce more than ten checkpoint files over the
run, Cocoa warns at startup --- prefer a larger interval or the end-of-run
default in that case.

**Peak values are not preserved:** Peak elevation (``zeta_max``) and other
tracked extrema reset on restart because they are not stored in the checkpoint
file. If you need continuous peak tracking, use a single continuous run.
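
To judge a candidate ``write_interval`` against the ten-file warning mentioned
under *Choosing a write interval*, the number of checkpoints can be estimated
before the run. This is a rough sketch under stated assumptions: the helper
``count_checkpoints`` is hypothetical, it assumes no checkpoint is written at
the start time, and it assumes the end-of-run checkpoint is not duplicated when
the run length is an exact multiple of the interval. Cocoa's own counting may
differ in these details.

```python
from datetime import datetime, timedelta

WARN_THRESHOLD = 10  # this page states the warning applies above ten files


def count_checkpoints(start: datetime, end: datetime, interval: timedelta) -> int:
    """Estimate how many checkpoint files a run from `start` to `end` writes."""
    if interval <= timedelta(0):
        return 1  # write_interval: 0 -> a single end-of-run checkpoint
    span = end - start
    periodic = span // interval  # whole intervals that fit in the run
    lands_on_end = (span % interval) == timedelta(0)
    # Assumption: the final checkpoint coincides with the last periodic one
    # when the run length is an exact multiple of the interval.
    if lands_on_end and periodic > 0:
        return periodic
    return periodic + 1


start, end = datetime(2025, 1, 1), datetime(2025, 1, 8)
for hours in (12, 24):
    n = count_checkpoints(start, end, timedelta(hours=hours))
    print(hours, n, n > WARN_THRESHOLD)
# prints:
# 12 14 True
# 24 7 False
```

Under these assumptions, a 12-hour interval over a seven-day segment yields 14
files, which is above the threshold, while a one-day interval yields 7.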

**Implicit solver restart precision:** When using the implicit (consistent)
solver, restarted simulations may show small differences (within the iterative
solver's convergence tolerance) compared to a continuous run. This is because
the iterative solver's initial guess differs on the first restart step (zero
vs. previous solution). These differences are inherent to iterative solvers
and do not indicate a problem.

**MPI compatibility:** Checkpoint files are written in global
(non-partitioned) format. A restart simulation can use a different number of
MPI ranks than the original run.
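
When validating a restart against the continuous reference run, as in the
implicit-solver note above, the comparison should allow for a tolerance rather
than demand bit-for-bit equality. The following is a minimal sketch with
hypothetical helper names; the sample values and the tolerance ``1e-6`` are
placeholders chosen for illustration, not Cocoa output or a Cocoa default.

```python
def max_abs_difference(reference, restarted):
    """Largest absolute pointwise difference between two equal-length sequences."""
    if len(reference) != len(restarted):
        raise ValueError("fields must have the same length")
    return max((abs(a - b) for a, b in zip(reference, restarted)), default=0.0)


def matches_within(reference, restarted, tolerance):
    """True when the two fields agree to within `tolerance` everywhere."""
    return max_abs_difference(reference, restarted) <= tolerance


# Placeholder elevation values at three nodes (illustrative only)
continuous = [0.512, -0.104, 1.250]
restarted = [0.512, -0.104, 1.2500004]

tolerance = 1e-6  # hypothetical; use your solver's convergence tolerance
print(matches_within(continuous, restarted, tolerance))
# prints: True
```

In practice the two sequences would be the same variable read at the same
output time from the continuous and restarted output files.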