---
title: "consolidate_aracena_training"
author: "Saideep Gona"
date: "2023-09-25"
format:
  html:
    code-fold: true
    code-summary: "Show the code"
execute:
  freeze: true
  warning: false
---

```{r}
#| label: Set up box storage directory
suppressMessages(library(tidyverse))
suppressMessages(library(glue))

PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG = "consolidate_aracena_training" ## copy the slug from the header
bDATE = '2023-09-25'                  ## copy the date from the blog's header here

DATA = glue("{PRE}/{bDATE}-{SLUG}")
if (!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK = DATA
```
# Context
Regarding the training of the reference epigenome -> Aracena track prediction model, my progress can be summarized as follows:
* Generated training data and a processing pipeline for converting alignment files into HDF5 tracks
* Constructed and benchmarked a PyTorch dataloading process for model training (non-lazy loader; a minimal sketch follows this list)
* Designed a few simple PyTorch models (linear, CNN variations) relating the reference epigenome to Aracena tracks
* Tested the models in a simple training loop using small training subsets and a single GPU on Beagle with limited batching
* Evaluated model performance primarily through visualization of the loss curves and inspection of the output tracks
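For reference, here is a minimal sketch of the kind of non-lazy, HDF5-backed Dataset described above. The file layout (datasets named `reference` and `aracena`) and the file name are placeholders for illustration, not the actual format.

```python
# Minimal sketch of a non-lazy HDF5-backed Dataset.
# The dataset keys "reference"/"aracena" and the path are illustrative assumptions.
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class AracenaSubsetDataset(Dataset):
    """Loads one training subset fully into memory (non-lazy)."""
    def __init__(self, h5_path):
        with h5py.File(h5_path, "r") as f:
            # Read the entire arrays into RAM up front: fast iteration,
            # but memory-bound for large subsets.
            self.inputs = torch.from_numpy(f["reference"][:]).float()
            self.targets = torch.from_numpy(f["aracena"][:]).float()

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# Example usage (hypothetical path):
# ds = AracenaSubsetDataset("train_subset_0.h5")
# loader = DataLoader(ds, batch_size=8, shuffle=True)
```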
From doing this, I have made a few observations:
* It is very important to have robust checks for the correctness of the training data, so that good or poor model performance can be attributed to the model rather than to data issues.
* The dataset is very large, which places constraints on memory and load times. Straight runs through the entire training set require repeated reloading of training subsets, which can take on the order of several minutes each time.
* Model specification in PyTorch has some implicit assumptions. For example, a linear layer is not fully connected across a multi-dimensional input by default: nn.Linear applies its weight matrix only along the last dimension of the input. To create a layer that is fully connected across all input elements, one must add a flattening layer before the linear layer (see the sketch after this list).
* Gradient descent was able to decrease the loss until near convergence even for simple models. On evaluation, however, the models appear to be predicting mostly background noise for many of the epigenetic tracks rather than distinct peaks. For RNAseq, the predictions
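To make the nn.Linear observation concrete, the following toy sketch (dimensions are arbitrary) contrasts a linear layer applied directly to a multi-dimensional input with a flatten-then-linear stack that is fully connected across all input positions.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 5, 100)  # (batch, tracks, positions) -- toy dimensions

# nn.Linear only mixes the last dimension: each of the 5 tracks is
# transformed independently by the same 100x10 weight matrix.
per_position = nn.Linear(100, 10)
print(per_position(x).shape)  # torch.Size([4, 5, 10])

# Flattening first gives a layer that is fully connected across
# all 5 * 100 input values.
fully_connected = nn.Sequential(
    nn.Flatten(),            # -> (4, 500)
    nn.Linear(5 * 100, 10),  # -> (4, 10)
)
print(fully_connected(x).shape)  # torch.Size([4, 10])
```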
Given these observations and what I have finished so far, my next steps are the following:
* Consolidate the training code and make it modular. This will allow for more rapid testing and more organized recording of results.
* Implement support for DDP batched training across 4 GPUs. This will speed up training and make SU usage on the cluster more efficient (the current scheme of a single GPU with large memory is not efficient).
* Add support for model visualization so it is easier to understand the model used for each run.
* Store model training results in a common place like Box so others can access them.
* Expand the training loop to include all training subsets, as well as validation steps. This will allow for more complete training and better feedback on overfitting.
* Reduce the complexity of the training by reducing the dimensionality of the predictions, for example by following a similar process to Temi's, where only central features are predicted on (a rough sketch follows this list).
* Implement learning rate scheduling for training warmup.
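As a rough illustration of the dimensionality-reduction idea above (the window length, number of central bins, and MSE loss below are placeholders, not the actual setup), the loss can be restricted to the central bins of each prediction window:

```python
import torch
import torch.nn.functional as F

def central_crop(tracks, n_center):
    """Keep only the central n_center bins along the last (position) axis."""
    length = tracks.shape[-1]
    start = (length - n_center) // 2
    return tracks[..., start:start + n_center]

# Toy example: 896 bins per window, train on the central 128 only.
pred = torch.randn(8, 10, 896)    # (batch, tracks, bins) -- placeholder shapes
target = torch.randn(8, 10, 896)
loss = F.mse_loss(central_crop(pred, 128), central_crop(target, 128))
```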
## New git repo for all Aracena training

Created a new repo at https://github.com/hakyimlab/aracena_model_training for common code related to training the Aracena model. It is not meant to be general purpose, but with some effort it can be used for any HDF5 -> HDF5 prediction task of this nature.

## Implementing DDP support for multiple GPUs
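As a starting point for this, a minimal single-node DDP skeleton in the usual torchrun style is sketched below; the model, data, and hyperparameters are placeholders rather than the actual training setup.

```python
# Minimal single-node DDP skeleton (sketch; model/data are placeholders).
# Launch with: torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.nn.functional as F
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; substitute the real ones.
    model = torch.nn.Linear(100, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(1024, 100), torch.randn(1024, 10))

    sampler = DistributedSampler(dataset)  # shards the data across the 4 ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = F.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()   # gradients are all-reduced across GPUs here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```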