```r
suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"
## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG = "compiling_DL_training_materials"  ## copy the slug from the header
bDATE = '2023-05-18'                      ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if (!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK = DATA
```
# Context
At the ALCF hackathon we had a very productive session learning how to train our DL models. However, we never consolidated all of the code, lessons learned, etc., and we still need to finish the process in general. I will be doing that here.
## Training data
During the hackathon, we converged on storing the training data in a condensed HDF5 format where, for example, the training HDF5 would contain the following datasets:

```bash
h5ls train_pop_seq.hdf5
pop_sequence     Dataset {34021, 131072, 4}
query_regions    Dataset {34021, 3}
sequence         Dataset {34021, 131072, 4}
target           Dataset {34021, 896, 5313}
```
In this case, "pop_sequence" and "sequence" are the inputs to the Enformer model for the regions stored in "query_regions", where chromosome X is stored as 0 and the other chromosomes keep their chromosome numbers. "target" holds the true output tracks Enformer is being trained to predict.
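For reference, here is a minimal sketch of reading these datasets back in Python with h5py. The file name, the example index, and the (chrom, start, end) ordering of "query_regions" are assumptions for illustration, not pipeline code:

```python
import h5py

# Assumed file name; point this at the actual training HDF5.
with h5py.File("train_pop_seq.hdf5", "r") as f:
    # One-hot reference and population sequences: (N, 131072, 4)
    seq = f["sequence"][0]        # first example, shape (131072, 4)
    pop_seq = f["pop_sequence"][0]
    # Region metadata: (N, 3); assumed ordering (chrom, start, end), chrom X stored as 0
    chrom, start, end = f["query_regions"][0]
    # Targets: (N, 896, 5313) -> 896 bins of 128 bp across 5313 tracks
    target = f["target"][0]

print(seq.shape, target.shape)  # (131072, 4) (896, 5313)
```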
This particular dataset comes from the Basenji dataset, where the input sequence length is 131,072 bp (\(1024 \times 128\)) and the output is 896 values corresponding to 128 bp bins. Enformer as described in the original paper takes an input of length 196,608 bp (\(1536 \times 128\)), while the Enformer from their Colab notebook takes an input of length 393,216 bp (\(3072 \times 128\)), twice that described in the paper.
Therefore, the datasets we created above have the wrong sequence length for the model we actually want to train (the Colab notebook model). Because the original sequences are drawn from the Basenji dataset, we need to requery the sequences from the reference FASTA files, as sketched below.
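One way to do the requerying would be to extend each stored region symmetrically from 131,072 bp to 393,216 bp around its center and pull the sequence from the reference FASTA. The sketch below is only illustrative, under assumed names (pyfaidx for FASTA access, an `hg38.fa` reference, and regions stored as (chrom, start, end)); it is not the actual pipeline code:

```python
from pyfaidx import Fasta

OLD_LEN = 1024 * 128   # 131,072 bp stored in the current HDF5
NEW_LEN = 3072 * 128   # 393,216 bp expected by the Colab-notebook Enformer
FLANK = (NEW_LEN - OLD_LEN) // 2  # 131,072 bp added on each side

genome = Fasta("hg38.fa")  # assumed reference FASTA

def requery(chrom, start, end):
    """Return the NEW_LEN sequence centered on the original OLD_LEN region.

    The integer chromosome codes in query_regions (X stored as 0) would need
    to be mapped back to FASTA contig names first; edge clipping/padding at
    chromosome ends is not handled in this sketch.
    """
    new_start = start - FLANK
    new_end = end + FLANK
    # pyfaidx slicing is 0-based and end-exclusive
    return str(genome[chrom][new_start:new_end])

# Example: re-extract a region originally stored as chr1:1,000,000-1,131,072
seq = requery("chr1", 1_000_000, 1_131_072)
assert len(seq) == NEW_LEN
```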
We should also revisit the data preparation code, as it was ad hoc and messy. The current process (split across many more scripts than necessary) is as follows:
```r
library(DiagrammeR)
grViz("digraph flowchart {
  # node definitions with substituted label text
  node [fontname = Helvetica, shape = rectangle]
  tab1 [label = '@@1']
  tab2 [label = '@@2']
  tab3 [label = '@@3']
  tab4 [label = '@@4']

  # edge definitions with the node IDs
  tab1 -> tab2 -> tab3 -> tab4;
}

[1]: 'Download Basenji TF record (tfr) files'
[2]: 'Convert tfr to pytorch (pt) files'
[3]: 'Perform data transforms (i.e. pop-seq, resizing) as needed on original pt files to generate new pt files'
[4]: 'Convert pt files with transforms to single HDF5'
")
```
This process has multiple intermediates which aren't actually necessary. For example, there is no need for the pytorch (pt) files: they are essentially just numpy arrays stored in python dictionaries, generated by first converting the TF records to numpy arrays, and converting them to HDF5 means converting them back to numpy arrays anyway. A revised workflow is just a 2-step process like this:
```r
library(DiagrammeR)
grViz("digraph flowchart {
  # node definitions with substituted label text
  node [fontname = Helvetica, shape = rectangle]
  tab1 [label = '@@1']
  tab2 [label = '@@2']
  tab3 [label = '@@3']

  # edge definitions with the node IDs
  tab1 -> tab2;
  tab3 -> tab2
}

[1]: 'Download Basenji TF record (tfr) files'
[2]: 'Perform data transforms (i.e. pop-seq, resizing) and conversion to HDF5 at once'
[3]: 'Python modules for data transforms'
")
```
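As a rough sketch of what the single-pass step could look like: read the TFRecords, apply any transforms in memory, and append directly to the HDF5. The feature names, dtypes, compression, and file paths below are assumptions about the Basenji TFRecord layout, not the actual converter:

```python
import glob
import h5py
import tensorflow as tf

SEQ_LEN, N_BINS, N_TRACKS = 131072, 896, 5313  # dimensions of the current dataset

def parse_record(raw):
    # Assumed feature names and raw encodings; the real Basenji spec may differ.
    feats = tf.io.parse_single_example(raw, {
        "sequence": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([], tf.string),
    })
    seq = tf.reshape(tf.io.decode_raw(feats["sequence"], tf.uint8), (SEQ_LEN, 4))
    tgt = tf.reshape(tf.io.decode_raw(feats["target"], tf.float16), (N_BINS, N_TRACKS))
    return seq, tgt

tfr_files = sorted(glob.glob("tfrecords/train-*.tfr"))          # assumed layout
ds = tf.data.TFRecordDataset(tfr_files, compression_type="ZLIB")  # assumed compression

with h5py.File("train.hdf5", "w") as out:
    seq_d = out.create_dataset("sequence", (0, SEQ_LEN, 4), maxshape=(None, SEQ_LEN, 4),
                               chunks=(1, SEQ_LEN, 4), dtype="uint8")
    tgt_d = out.create_dataset("target", (0, N_BINS, N_TRACKS), maxshape=(None, N_BINS, N_TRACKS),
                               chunks=(1, N_BINS, N_TRACKS), dtype="float16")
    for i, raw in enumerate(ds):
        seq, tgt = parse_record(raw)
        seq_d.resize(i + 1, axis=0)
        tgt_d.resize(i + 1, axis=0)
        # Any transforms (pop-seq substitution, resizing to 393,216 bp, etc.) would go here,
        # writing additional datasets alongside these two.
        seq_d[i] = seq.numpy()
        tgt_d[i] = tgt.numpy()
```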
Before revising everything, it would be good to list and discuss the "data transforms". Because HDF5 is flexible, we can store many different datasets in our final train, test, and val HDF5 files. As mentioned at the beginning of this post, these can include:

* Basenji targets
* One-hot sequences
* Population sequences
* Query region metadata

More broadly, we can store any *inputs* and *outputs* of the models we want to train. If, for example, we want to add additional prediction heads, we can also store their targets here. Ultimately, being able to generate the various input/output sets flexibly will improve our ability to test different modeling strategies.
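Adding another input or output set to an existing training HDF5 is just one more named dataset. A minimal sketch with h5py (the file name, dataset name, and shape below are hypothetical):

```python
import h5py
import numpy as np

with h5py.File("train_pop_seq.hdf5", "a") as f:   # open in append mode
    n = f["target"].shape[0]                      # number of examples already in the file
    # Hypothetical targets for an extra prediction head, one 100-dim vector per example
    extra = np.zeros((n, 100), dtype="float32")
    f.create_dataset("extra_head_targets", data=extra, dtype="float32")
```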
## Python Environment for Training

We should definitely meet and decide on a really stable environment for doing the training. Sam had many suggestions from the hackathon, but it was confusing.

## Training pipeline

## Pytorch inference pipeline modifications

Our current Enformer pipeline does not yet generalize to pytorch models, nor does it allow for variation in model input length. One way to handle the input-length part is sketched below.
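As a starting point, a minimal sketch of a wrapper that center-crops or zero-pads a one-hot sequence to whatever length a given PyTorch model expects. The wrapper class, model interface, and the 393,216 bp default are assumptions, not the current pipeline:

```python
import torch
import torch.nn.functional as F

def fit_to_length(seq_1hot: torch.Tensor, target_len: int) -> torch.Tensor:
    """Center-crop or zero-pad a (batch, length, 4) one-hot tensor to target_len."""
    length = seq_1hot.shape[1]
    if length > target_len:
        start = (length - target_len) // 2
        return seq_1hot[:, start:start + target_len, :]
    if length < target_len:
        pad = target_len - length
        left, right = pad // 2, pad - pad // 2
        # F.pad pads from the last dim inward: (chan_left, chan_right, len_left, len_right)
        return F.pad(seq_1hot, (0, 0, left, right))
    return seq_1hot

class LengthAdaptingWrapper(torch.nn.Module):
    """Hypothetical wrapper so one inference pipeline can serve models with different input lengths."""
    def __init__(self, model: torch.nn.Module, input_len: int = 3072 * 128):
        super().__init__()
        self.model = model
        self.input_len = input_len

    def forward(self, seq_1hot: torch.Tensor) -> torch.Tensor:
        return self.model(fit_to_length(seq_1hot, self.input_len))
```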