TIL: Multi-node GPU training with SkyPilot and PyTorch Lightning

References:

  • https://docs.skypilot.co/en/latest/examples/training/distributed-pytorch.html#using-normal-torchrun
  • https://lightning.ai/docs/pytorch/stable/common/trainer.html

1. Configure the PyTorch Lightning Trainer for multi-node training

E.g. on 8 nodes with 8 GPUs each (64 GPUs total):

trainer = Trainer(accelerator="gpu", devices=8, num_nodes=8, strategy="ddp")
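
For reference, a minimal train.py that this Trainer could live in might look like the sketch below. The LitRegressor module and the random-tensor dataset are placeholders, and the import path assumes the lightning>=2.0 package (use pytorch_lightning on older installs). When the script is launched with torchrun as in step 2, Lightning picks up the node and process ranks from the environment variables torchrun sets, so num_nodes and devices here must match --nnodes and --nproc_per_node.

# train.py -- minimal sketch; the model and random data are placeholders
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from lightning.pytorch import LightningModule, Trainer

class LitRegressor(LightningModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.model(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

if __name__ == "__main__":
    # Random tensors stand in for a real dataset
    dataset = TensorDataset(torch.randn(4096, 32), torch.randn(4096, 1))
    loader = DataLoader(dataset, batch_size=64, num_workers=4)

    trainer = Trainer(accelerator="gpu", devices=8, num_nodes=8, strategy="ddp")
    trainer.fit(LitRegressor(), loader)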

2. Launch a SkyPilot cluster with multiple GPU nodes

sky launch -c train train.yaml

SkyPilot config file (train.yaml):

resources:
    accelerators: H100:8
    disk_size: 1024  # 1 TB, specified in GB

num_nodes: 8

setup: |
    # Download data
    # Install dependencies
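    # For example (assumption: the job only needs PyTorch and Lightning;
    # torchrun is installed as part of the torch package):
    # pip install torch lightning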

run: |
    MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
    echo "Starting distributed training, head node: $MASTER_ADDR"

    # Explicit check for torchrun
    if ! command -v torchrun >/dev/null 2>&1; then
        echo "ERROR: torchrun command not found" >&2
        exit 1
    fi

    torchrun \
        --nnodes=$SKYPILOT_NUM_NODES \
        --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
        --master_addr=$MASTER_ADDR \
        --master_port=8008 \
        --node_rank=$SKYPILOT_NODE_RANK \
        train.py
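
Once the job is launched, logs from the nodes can be streamed and the cluster torn down with the usual SkyPilot commands, using the cluster name passed to -c above:

sky logs train
sky down train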


