```r
suppressMessages(library(tidyverse))
suppressMessages(library(glue))

PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG = "guide_to_running_pytorch_training" ## copy the slug from the header
bDATE = '2023-07-10' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if (!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK = DATA
```
# Context

This post is a guide to running distributed training with PyTorch DDP (DistributedDataParallel) on ALCF Polaris.
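As a rough illustration of what a DDP setup involves, here is a minimal sketch. It is not taken from the repo used in this guide. It assumes each process is started by a launcher that exports either torchrun-style variables (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`) or PMI-style ones (`PMI_RANK`, `PMI_SIZE`, `PMI_LOCAL_RANK`), that `MASTER_ADDR` and `MASTER_PORT` are already set, and that every rank has its own CUDA GPU. The names `resolve_dist_env` and `train_ddp`, and the toy linear model, are hypothetical.

```python
import os


def resolve_dist_env(env=None):
    """Return (rank, world_size, local_rank) from launcher-provided variables.

    Assumption: torchrun-style names are tried first, then PMI-style names;
    a single-process run is the fallback.
    """
    env = os.environ if env is None else env

    def first_int(names, default):
        for name in names:
            if name in env:
                return int(env[name])
        return default

    rank = first_int(("RANK", "PMI_RANK"), 0)
    world_size = first_int(("WORLD_SIZE", "PMI_SIZE"), 1)
    local_rank = first_int(("LOCAL_RANK", "PMI_LOCAL_RANK"), 0)
    return rank, world_size, local_rank


def train_ddp(num_epochs=2):
    """Toy DDP loop; needs PyTorch, one CUDA GPU per rank, MASTER_ADDR/MASTER_PORT."""
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    rank, world_size, local_rank = resolve_dist_env()
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # Hypothetical stand-ins for the real model and dataset.
    model = DDP(torch.nn.Linear(8, 1).cuda(local_rank), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP all-reduces gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()
```

Each rank would call `train_ddp()` once. The `DistributedSampler` gives every rank a different shard of the data, and the gradient all-reduce during `backward()` keeps the model replicas in sync.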
## Set up

### Clone GitHub repo

### Prepare training data

### Decide model starting point

### Create PBS submission script
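One way to keep submission scripts consistent is to generate them from a small template. The sketch below is only an illustration: `render_pbs_script` is a hypothetical helper, and the project name, queue, walltime, filesystem list, and training command are placeholders, not values from this guide. It assumes the cluster's `mpiexec` accepts `-n` and `--ppn`, and it defaults to 4 ranks per node on the assumption of one rank per GPU on a 4-GPU node.

```python
def render_pbs_script(project, queue, nodes, walltime, train_cmd,
                      ranks_per_node=4, filesystems="home"):
    """Return the text of a PBS job script that launches one rank per GPU.

    All values are placeholders; check the site's documentation for the
    queues, filesystems, and launcher flags that actually apply.
    """
    total_ranks = nodes * ranks_per_node
    lines = [
        "#!/bin/bash",
        f"#PBS -A {project}",
        f"#PBS -q {queue}",
        f"#PBS -l select={nodes}",
        f"#PBS -l walltime={walltime}",
        f"#PBS -l filesystems={filesystems}",
        "#PBS -l place=scatter",
        "",
        "cd ${PBS_O_WORKDIR}",
        "# MASTER_ADDR/MASTER_PORT let the ranks rendezvous on the first node.",
        "export MASTER_ADDR=$(head -n 1 ${PBS_NODEFILE})",
        "export MASTER_PORT=29500",
        f"mpiexec -n {total_ranks} --ppn {ranks_per_node} {train_cmd}",
    ]
    return "\n".join(lines) + "\n"
```

For example, `render_pbs_script("MyProject", "debug", 2, "01:00:00", "python train.py")` produces a script requesting 2 nodes and launching 8 ranks. The result can be written to a file and submitted with `qsub`.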
## Run training session

### wandb log directory
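A minimal sketch of pointing wandb at a chosen log directory, assuming the run is configured through the `WANDB_DIR` and `WANDB_MODE` environment variables that the wandb client reads. The helper name `configure_wandb_dir` and the directory passed to it are hypothetical. The variables need to be set before `wandb.init()` is called in the training script.

```python
import os


def configure_wandb_dir(log_root, offline=False):
    """Create the log directory and export it for wandb to pick up.

    Assumption: the training script calls wandb.init() after this runs,
    so the environment variables are already in place.
    """
    os.makedirs(log_root, exist_ok=True)
    os.environ["WANDB_DIR"] = log_root
    if offline:
        # Offline mode writes runs locally; they can be uploaded later
        # with `wandb sync`.
        os.environ["WANDB_MODE"] = "offline"
    return log_root
```

Offline mode may be useful when the nodes running the job cannot reach the wandb servers directly.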
### Checkpoint storage
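With DDP every rank holds the same weights, so a common pattern is to write checkpoints from rank 0 only. The sketch below illustrates that pattern under assumed conventions: the `epoch_NNNN.pt` naming, the dictionary keys, and the helper names `checkpoint_path` and `save_checkpoint` are all hypothetical and not taken from the repo used here.

```python
import os


def checkpoint_path(ckpt_dir, epoch):
    """Zero-padded names keep checkpoints in epoch order when sorted."""
    return os.path.join(ckpt_dir, f"epoch_{epoch:04d}.pt")


def save_checkpoint(ckpt_dir, epoch, model, optimizer, rank):
    """Write model and optimizer state from rank 0 only; return the path or None."""
    if rank != 0:
        return None
    import torch  # imported lazily so the path helper works without PyTorch

    os.makedirs(ckpt_dir, exist_ok=True)
    # A DDP-wrapped model keeps the underlying network under `.module`.
    net = getattr(model, "module", model)
    state = {
        "epoch": epoch,
        "model": net.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    path = checkpoint_path(ckpt_dir, epoch)
    tmp_path = path + ".tmp"
    torch.save(state, tmp_path)
    # Rename last, so an interrupted write does not leave a truncated
    # file under the final name.
    os.replace(tmp_path, path)
    return path
```

Saving the unwrapped module's `state_dict` means the checkpoint can later be loaded into a plain, non-DDP model.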
## Results from training

## Restarting training run
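Restarting usually means finding the newest checkpoint and restoring model and optimizer state before the loop continues. The sketch below assumes a hypothetical layout: files named `epoch_NNNN.pt`, each holding a dictionary with `"epoch"`, `"model"`, and `"optimizer"` keys. The helper names `latest_checkpoint` and `resume` are also hypothetical.

```python
import os
import re

_CKPT_RE = re.compile(r"^epoch_(\d+)\.pt$")


def latest_checkpoint(ckpt_dir):
    """Return (path, epoch) of the newest checkpoint, or (None, -1) if none exist."""
    best_path, best_epoch = None, -1
    if os.path.isdir(ckpt_dir):
        for name in os.listdir(ckpt_dir):
            match = _CKPT_RE.match(name)
            if match and int(match.group(1)) > best_epoch:
                best_path = os.path.join(ckpt_dir, name)
                best_epoch = int(match.group(1))
    return best_path, best_epoch


def resume(ckpt_dir, model, optimizer, device="cpu"):
    """Load the newest checkpoint if present and return the epoch to start from."""
    path, _ = latest_checkpoint(ckpt_dir)
    if path is None:
        return 0  # nothing saved yet, so start from scratch
    import torch  # imported lazily so the directory scan works without PyTorch

    state = torch.load(path, map_location=device)
    # Load into the underlying network whether or not the model is DDP-wrapped.
    getattr(model, "module", model).load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1
```

The training loop would then run over `range(start_epoch, num_epochs)`, where `start_epoch` is the value returned by `resume`.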
## Running inference using trained models
Title: "guide_to_running_pytorch_training"; Author: Saideep Gona; Date: 2023-07-10