Trainer Control

Forgather supports external control of running training jobs. From a separate terminal (or script), you can query job status, trigger checkpoint saves, and gracefully stop training -- all without interrupting or restarting the training process.

This is useful for:

  • Saving a checkpoint at an interesting point during training (e.g., a loss plateau)
  • Gracefully stopping training early when results look good (or bad)
  • Aborting failed hyperparameter experiments without saving
  • Scripting control decisions based on training metrics

Quick start

1. Enable control in your training job by adding TrainerControlCallback:

from forgather.ml.trainer.callbacks import TrainerControlCallback

callbacks = [
    TrainerControlCallback(
        job_id="my_experiment",   # Optional: auto-generated if not provided
    ),
]

Or in a configuration template:

[callback_list]
    == super()
    trainer_control: !singleton:forgather.ml.trainer.callbacks:TrainerControlCallback

2. Start training as usual:

forgather -t config.yaml train

3. Control from another terminal:

forgather control list                    # Find running jobs
forgather control status JOB_ID          # Check training progress
forgather control save JOB_ID            # Save a checkpoint now
forgather control stop JOB_ID            # Gracefully stop after current step
forgather control save-stop JOB_ID       # Save checkpoint, then stop
forgather control abort JOB_ID           # Stop immediately without saving

CLI reference

| Command | Description |
| --- | --- |
| forgather control list | List all discoverable jobs (shows status, host, port, PID, start time) |
| forgather control status JOB_ID | Show current step, epoch, max_steps, and latest logged metrics |
| forgather control save JOB_ID | Trigger a checkpoint save (runs evaluation first if configured) |
| forgather control stop JOB_ID | Graceful stop -- training finishes the current step, then exits |
| forgather control save-stop JOB_ID | Save a checkpoint, then gracefully stop |
| forgather control abort JOB_ID | Abort immediately without saving (prompts for confirmation) |
| forgather control cleanup | Remove endpoint files for dead jobs |
| forgather control cleanup --force | Remove dead job files without confirmation |

How it works

Architecture

The control system has two sides:

  • Server side (TrainerControlCallback): An HTTP server running in a background thread on rank 0 of the training process. It accepts commands and queues them for the training loop to process.
  • Client side (forgather control CLI): Discovers jobs via endpoint files and sends HTTP requests.

Job discovery

When TrainerControlCallback starts, rank 0 writes an endpoint file to ~/.config/forgather/jobs/<job_id>/endpoint.json containing the host, port, and PID. The forgather control list command scans this directory to find running jobs and checks whether each process is still alive.

When training ends (or is stopped), the endpoint file is automatically cleaned up. If a job crashes without cleanup, use forgather control cleanup to remove stale entries.
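The discovery step described above can be sketched roughly as follows. This is an illustration, not Forgather's implementation: the function names scan_jobs and pid_alive are hypothetical, the JSON key names ("host", "port", "pid") are assumed from the description of the file's contents, and the liveness check uses the POSIX convention of sending signal 0.

```python
import json
import os
from pathlib import Path

# Directory taken from the text above. The JSON key names used below
# ("host", "port", "pid") are assumptions made for illustration.
JOBS_DIR = Path.home() / ".config" / "forgather" / "jobs"


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def scan_jobs(jobs_dir: Path = JOBS_DIR) -> list[dict]:
    """Read every <job_id>/endpoint.json and flag whether its process is alive."""
    jobs: list[dict] = []
    if not jobs_dir.is_dir():
        return jobs
    for endpoint in sorted(jobs_dir.glob("*/endpoint.json")):
        try:
            info = json.loads(endpoint.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or partially written files
        if not isinstance(info, dict):
            continue
        pid = info.get("pid")
        jobs.append({
            "job_id": endpoint.parent.name,
            "host": info.get("host"),
            "port": info.get("port"),
            "pid": pid,
            "alive": isinstance(pid, int) and pid_alive(pid),
        })
    return jobs
```

Entries reported with alive set to False correspond to the stale endpoint files that forgather control cleanup is meant to remove.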

Distributed coordination

Only rank 0 runs the HTTP server. When a command arrives, rank 0 broadcasts it to all other ranks via torch.distributed.broadcast at the next log step, so every rank applies the command to its TrainerControl state at the same step.

Commands are checked on each on_log callback event, whose frequency is set by the logging_steps training argument. A command can therefore take up to logging_steps training steps to take effect. With logging_steps=100, for example, a stop request may wait up to 100 steps before it is processed.
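The queue-then-poll pattern behind this latency can be illustrated with a small single-process simulation. It is a sketch under stated assumptions: ControlQueue and run_training are hypothetical names, the HTTP server is replaced by a direct submit call, and the multi-rank broadcast is omitted. Only the command name graceful_stop comes from this page.

```python
import queue


class ControlQueue:
    """Commands are enqueued by a server thread and drained by the training loop."""

    def __init__(self) -> None:
        self._pending: "queue.Queue[str]" = queue.Queue()

    def submit(self, command: str) -> None:
        # Would be called from the HTTP server thread; queue.Queue is thread-safe.
        self._pending.put(command)

    def poll(self) -> list[str]:
        # Called from the training loop; drains everything without blocking.
        drained: list[str] = []
        while True:
            try:
                drained.append(self._pending.get_nowait())
            except queue.Empty:
                return drained


def run_training(max_steps: int, logging_steps: int, control: ControlQueue,
                 submit_at: dict[int, str]) -> int:
    """Simulate a loop that only looks at pending commands on log steps.

    submit_at maps a step number to a command that "arrives" during that step.
    Returns the step at which the loop stopped.
    """
    for step in range(1, max_steps + 1):
        if step in submit_at:
            control.submit(submit_at[step])  # stands in for an incoming HTTP request
        if step % logging_steps == 0:  # stands in for the on_log event
            if "graceful_stop" in control.poll():
                return step
    return max_steps
```

In this simulation, a stop submitted during step 101 with logging_steps=50 is not seen until step 150, while one submitted during step 100 is handled at step 100.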

Command behavior

| Command | Effect |
| --- | --- |
| graceful_stop | Training finishes the current step and exits. If save_strategy is not "no", a final checkpoint is saved automatically on exit. |
| save_checkpoint | Triggers a checkpoint save regardless of save_strategy. If load_best_model_at_end is configured, also triggers evaluation. |
| save_and_stop | Forces a checkpoint save and then stops. See note below. |
| abort | Stops training immediately without saving. |

stop vs save-stop: When save_strategy="steps" or "epoch", stop already saves a final checkpoint on exit (this is the trainer's normal exit behavior), so stop and save-stop produce the same result. The difference matters when save_strategy="no": stop exits without saving, while save-stop forces a save before exiting. This is useful when you have disabled periodic checkpointing but want to keep the option of saving on demand.
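The rules above reduce to a small decision function. The helper below is purely illustrative (saves_checkpoint is a hypothetical name, not part of Forgather). It encodes only what the command table and the note state.

```python
def saves_checkpoint(command: str, save_strategy: str) -> bool:
    """Return True if the command leaves a checkpoint behind, per the rules above."""
    if command in ("save_checkpoint", "save_and_stop"):
        return True  # forced save, regardless of save_strategy
    if command == "graceful_stop":
        return save_strategy != "no"  # relies on the trainer's normal save on exit
    if command == "abort":
        return False
    raise ValueError(f"unknown command: {command!r}")
```

For example, saves_checkpoint("graceful_stop", "no") is False while saves_checkpoint("save_and_stop", "no") is True, which is the one case where stop and save-stop differ.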

TrainerControlCallback parameters

TrainerControlCallback(
    job_id: str = None,         # Auto-generated if not provided
    port: int = None,           # Auto-selected starting from 8900
    enable_http: bool = None,   # Auto-detected based on aiohttp availability
)

| Parameter | Default | Description |
| --- | --- | --- |
| job_id | Auto-generated | Unique identifier for this job. Format: job_{timestamp}_{hostname}_{pid} |
| port | Auto-selected | HTTP server port. Scans from 8900 upward for an available port |
| enable_http | Auto-detected | Enabled if aiohttp is installed; falls back to file-based control otherwise |
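Scanning upward for an available port is commonly done by attempting to bind each candidate in turn. The sketch below shows that general technique. Only the starting value 8900 comes from this page; the function name find_free_port, the loopback host, and the attempts limit are assumptions, and Forgather's actual selection logic may differ.

```python
import socket


def find_free_port(start: int = 8900, host: str = "127.0.0.1", attempts: int = 100) -> int:
    """Return the first port >= start that can currently be bound on host."""
    stop = min(start + attempts, 65536)  # never probe past the valid port range
    for port in range(start, stop):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port in use or not permitted; try the next one
            return port
    raise RuntimeError(f"no free port found in {start}..{stop - 1}")
```

A probe like this is inherently racy: another process can claim the port between the check and the server's own bind, so a robust server would bind directly and retry on failure.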

Programmatic API

The control system can also be used from Python:

from forgather.trainer_control import list_jobs, get_job_status, graceful_stop, save_checkpoint

# List running jobs
jobs = list_jobs()
for job in jobs:
    print(f"{job.job_id} on {job.host}:{job.port}")

# Check status
status = get_job_status("my_experiment")
print(f"Step {status['global_step']} / {status['max_steps']}")

# Send commands
save_checkpoint("my_experiment")
graceful_stop("my_experiment")
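Building on the functions above, a script can poll a job and stop it once some condition on its status holds. The helper below is a hypothetical sketch (stop_when is not part of Forgather). It takes the status and stop functions as arguments, so it assumes only what the example above shows: get_job_status returns a dict containing global_step, and graceful_stop accepts a job ID.

```python
import time
from typing import Callable


def stop_when(
    job_id: str,
    should_stop: Callable[[dict], bool],
    get_status: Callable[[str], dict],
    stop: Callable[[str], object],
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a job's status until should_stop(status) is true, then stop the job.

    Returns the status dict that triggered the stop.
    """
    while True:
        status = get_status(job_id)
        if should_stop(status):
            stop(job_id)
            return status
        sleep(poll_seconds)


# Intended usage with the API shown above (requires a running job):
#   from forgather.trainer_control import get_job_status, graceful_stop
#   stop_when("my_experiment", lambda s: s["global_step"] >= 5000,
#             get_job_status, graceful_stop)
```

How get_job_status behaves once a job has already exited is not covered here, so a production script would likely need its own error handling around the polling call.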

Example

See examples/trainer_control/trainer_control_demo.py for a complete working example showing how to set up and use the control system.