guide_to_running_pytorch_training

Author

Saideep Gona

Published

July 10, 2023

suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG="guide_to_running_pytorch_training" ## copy the slug from the header
bDATE='2023-07-10' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
## Create the data directory (and any missing parents) if it does not exist yet
if(!dir.exists(DATA)) dir.create(DATA, recursive = TRUE)
WORK=DATA

Context

This is a guide to running distributed training with PyTorch DistributedDataParallel (DDP) on ALCF Polaris.

Set up

Clone GitHub repo

Prepare training data

Decide model starting point

Create PBS submission script

Run training session
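
Each DDP process needs to know its global rank, the total number of processes, and its local rank on the node. Below is a minimal sketch of how a training script might read these from the environment. It assumes the job is launched either with torchrun (which sets RANK, WORLD_SIZE and LOCAL_RANK) or with an MPI launcher that exports PMI_RANK, PMI_SIZE and PMI_LOCAL_RANK; the PMI names are an assumption about the launcher on Polaris, and `DistEnv` and `resolve_dist_env` are hypothetical names, not part of the training repo.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistEnv(NamedTuple):
    rank: int        # global rank of this process
    world_size: int  # total number of processes in the job
    local_rank: int  # rank of this process on its node (usually the GPU index)


def resolve_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read rank information from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    # torchrun-style variables are checked first; the PMI_* names are an
    # assumption about what the MPI launcher exports.
    for keys in (("RANK", "WORLD_SIZE", "LOCAL_RANK"),
                 ("PMI_RANK", "PMI_SIZE", "PMI_LOCAL_RANK")):
        if all(k in env for k in keys):
            return DistEnv(*(int(env[k]) for k in keys))
    # No launcher variables found: fall back to a single-process run.
    return DistEnv(rank=0, world_size=1, local_rank=0)


if __name__ == "__main__":
    dist_env = resolve_dist_env()
    print(dist_env)
    # In a training script these values would typically be passed on, e.g.
    #   torch.distributed.init_process_group("nccl", rank=dist_env.rank,
    #                                        world_size=dist_env.world_size)
    #   torch.cuda.set_device(dist_env.local_rank)
```

With the default `env://` initialization, MASTER_ADDR and MASTER_PORT also have to be set so the processes can find each other. The single-process fallback keeps the same script usable for quick tests on one GPU.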

wandb log directory

Checkpoint storage

Results from training

Restarting training run
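
Restarting usually starts with locating the most recent checkpoint. The sketch below shows one way to do that, assuming checkpoints are written to a single directory as `checkpoint_epoch_<N>.pt`. That file naming scheme and the helper name `latest_checkpoint` are hypothetical and should be adapted to whatever the training script actually writes.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: checkpoint_epoch_<N>.pt
_CKPT_PATTERN = re.compile(r"checkpoint_epoch_(\d+)\.pt$")


def latest_checkpoint(ckpt_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest epoch number, or None."""
    directory = Path(ckpt_dir)
    if not directory.is_dir():
        return None
    best_epoch, best_path = -1, None
    for path in directory.iterdir():
        match = _CKPT_PATTERN.match(path.name)
        # Compare epochs as integers so that epoch 10 ranks above epoch 9.
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

The returned path could then be loaded with `torch.load(path, map_location="cpu")` before restoring model and optimizer state. A return value of `None` means there is nothing to resume from and training should start from the chosen model starting point.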

Running inference using trained models