consolidate_aracena_training

Author

Saideep Gona

Published

September 25, 2023

Code
suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG="consolidate_aracena_training" ## copy the slug from the header
bDATE='2023-09-25' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if(!dir.exists(DATA)) dir.create(DATA)  # create the post's data directory if it doesn't exist
WORK=DATA

Context

Regarding the training of the Aracena model (reference epigenome -> Aracena track prediction), my progress can be summarized as follows:

  • Generated training data and built a processing pipeline for converting alignment files into HDF5 tracks
  • Constructed and benchmarked a PyTorch dataloading process for model training (non-lazy loader; a minimal sketch follows this list)
  • Designed a few simple PyTorch models (linear, CNN variations) relating the reference epigenome to Aracena tracks
  • Tested models on a simple training loop using small training subsets and a single GPU on Beagle with limited batching
  • Evaluated model performance primarily through visualization of the loss curve and inspection of output tracks
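
As a reference for the non-lazy dataloading approach mentioned above, below is a minimal sketch of an in-memory (non-lazy) HDF5-backed PyTorch Dataset. The file path, HDF5 key names (reference, aracena), and array shapes are hypothetical placeholders rather than the actual pipeline.

Code
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class NonLazyTrackDataset(Dataset):
    """Loads an entire HDF5 training subset into memory up front (non-lazy),
    so __getitem__ only indexes tensors that already live in RAM."""

    def __init__(self, h5_path, input_key="reference", target_key="aracena"):
        # Hypothetical keys: 'reference' = reference epigenome tracks,
        # 'aracena' = Aracena target tracks; shape (N, channels, length).
        with h5py.File(h5_path, "r") as f:
            self.inputs = torch.from_numpy(f[input_key][:]).float()
            self.targets = torch.from_numpy(f[target_key][:]).float()

    def __len__(self):
        return self.inputs.shape[0]

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# Usage: one subset at a time; moving to the next subset means constructing
# a new Dataset, which is where the repeated reload cost comes from.
# dataset = NonLazyTrackDataset("training_subset_0.h5")
# loader = DataLoader(dataset, batch_size=8, shuffle=True)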

From doing this, I have made a few observations:

  • It is very important to have robust checks on the correctness of the training data. I need confidence that good or poor model performance is not due to data issues.
  • The dataset is very large, which places constraints on memory and load times. A straight run through the entire training set requires repeatedly reloading training subsets, which can take on the order of several minutes each time.
  • Model specification in PyTorch has some implicit assumptions. For example, an nn.Linear layer is not fully connected across a multi-dimensional input by default: given the input/output feature dimensions, it constructs a weight matrix A that is applied by matrix multiplication to the last dimension only and is reused independently across all other dimensions. To make a layer fully connected over the whole input, one must first add a flattening layer before the linear layer (see the sketch after this list).
  • Gradient descent was able to drive the loss to near convergence even for simple models. On model evaluation, however, it seems that the models primarily predict background noise for many of the epigenetic tracks rather than distinct peaks. For RNAseq, the predictions
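
To make the nn.Linear point above concrete, here is a small sketch (with arbitrary toy dimensions, not those of the actual model) contrasting a linear layer applied directly to a multi-dimensional input with a flatten-then-linear, fully connected mapping.

Code
import torch
import torch.nn as nn

batch, channels, length = 2, 4, 100
x = torch.randn(batch, channels, length)

# Applied directly, nn.Linear only mixes the last dimension (length here);
# the same weight matrix is reused independently for every channel.
per_position = nn.Linear(length, 10)
print(per_position(x).shape)           # torch.Size([2, 4, 10])

# Flattening first makes the layer fully connected across channels AND positions.
fully_connected = nn.Sequential(
    nn.Flatten(),                      # -> (batch, channels * length)
    nn.Linear(channels * length, 10),  # every input element feeds every output
)
print(fully_connected(x).shape)        # torch.Size([2, 10])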

Given these observations and my current progress, my next steps are the following:

  • Consolidate the training code and make it modular. This will allow for more rapid testing and more organized recording of results.
  • Implement support for DDP (DistributedDataParallel) batched training across 4 GPUs (a minimal sketch follows this list). This will speed up training and make SU usage on the cluster more efficient; the current scheme of a single GPU with large memory is not efficient.
  • Expand the training loop to include all training subsets, as well as validation steps. This will allow for more complete training and better feedback on overfitting.
  • Reduce the complexity of the training by reducing the dimensionality of the predictions. For example, follow a process similar to Temi's, where predictions are made only on central features.
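
For the planned DDP work, the following is a minimal sketch of the standard torch.distributed / DistributedDataParallel setup for 4 GPUs. The model, dataset, and hyperparameters are toy placeholders, not the actual Aracena training code.

Code
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each of the 4 processes.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and data standing in for the actual tracks.
    model = torch.nn.Linear(100, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 100), torch.randn(1024, 10))
    sampler = DistributedSampler(dataset)   # splits each epoch across the 4 GPUs
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(2):
        sampler.set_epoch(epoch)             # reshuffle consistently across processes
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()                  # gradients are all-reduced across GPUs here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=4 ddp_training_sketch.py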