compiling_DL_training_materials

Author

Saideep Gona

Published

May 18, 2023

Code
suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG="compiling_DL_training_materials" ## copy the slug from the header
bDATE='2023-05-18' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK=DATA

Context

At the ALCF hackathon we had a very productive session learning how to train our DL models. However, we never really consolidated all the code, lessons learned, etc., and still need to finish the process in general. I will be doing that here.

Training data

During the hackathon, we converged on storing the training data in a condensed HDF5 format where, for example, the training HDF5 contains the following datasets:

Code

h5ls train_pop_seq.hdf5 

pop_sequence             Dataset {34021, 131072, 4}
query_regions            Dataset {34021, 3}
sequence                 Dataset {34021, 131072, 4}
target                   Dataset {34021, 896, 5313}

In this case, “pop_sequence” and “sequence” are the inputs to the Enformer model, corresponding to a region stored in “query_regions”, where chromosome X is stored as 0 and the rest are stored as their corresponding chromosome numbers. The targets are the true output tracks Enformer is being trained to predict.
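As a sanity check on that layout, here is a minimal sketch of loading one example from this file with h5py; the dataset names match the h5ls output above, but the assumption that “query_regions” stores (chrom, start, end) in that order is mine and should be verified.

Code
import h5py

with h5py.File("train_pop_seq.hdf5", "r") as f:
    pop_seq = f["pop_sequence"][0]   # (131072, 4) one-hot population sequence
    seq = f["sequence"][0]           # (131072, 4) one-hot reference sequence
    region = f["query_regions"][0]   # assumed (chrom, start, end); chrX coded as 0
    target = f["target"][0]          # (896, 5313) Enformer target tracks
    print(pop_seq.shape, seq.shape, region, target.shape)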

This particular dataset comes from the Basenji dataset, where the input sequence length is 131,072 bp (\(1024*128\)) and the output is 896 values corresponding to 128 bp bins. Enformer as described in the original paper takes an input length of 196,608 bp (\(1536*128\)), and Enformer from their colab notebook takes an input sequence of length 393,216 bp (\(3072*128\)), which is twice that described in the paper.

Therefore, the datasets we created above have the wrong sequence length for the model we actually want to train (the colab notebook model). The original sequences are drawn from the original Basenji dataset, so we need to re-query the sequences from the fasta files, as sketched below.
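A rough sketch of what that re-querying could look like, assuming pyfaidx, a (chrom, start, end) layout for the query regions, and that the numeric chromosome codes have already been mapped back to fasta contig names; the fasta path, the requery() helper, and the choice to keep the original midpoint are placeholders, not the actual pipeline code.

Code
from pyfaidx import Fasta
import numpy as np

NEW_LEN = 3072 * 128  # 393,216 bp, the input length expected by the colab-notebook Enformer

def one_hot(seq):
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in lookup:          # Ns stay all-zero
            out[i, lookup[base]] = 1.0
    return out

fasta = Fasta("hg38.fa")  # placeholder reference fasta path

def requery(chrom, start, end):
    # Expand the original window to NEW_LEN while keeping the same midpoint
    center = (start + end) // 2
    new_start = center - NEW_LEN // 2
    return one_hot(fasta[chrom][new_start:new_start + NEW_LEN].seq)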

We should also revisit the data preparation code, as it was ad hoc and messy. The current process (split across many more scripts than necessary) is as follows:

Code
library(DiagrammeR)
grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']

      # edge definitions with the node IDs
      tab1 -> tab2 -> tab3 -> tab4;
      }

      [1]: 'Download Basenji TF record (tfr) files'
      [2]: 'Convert tfr to pytorch (pt) files'
      [3]: 'Perform data transforms (i.e. pop-seq) as needed on original pt files to generate new pt files'
      [4]: 'Convert pt files with transforms to single HDF5'
      ")

This process has multiple intermediates which aren’t actually necessary. A revised workflow is just a 2-step process like this:

Code
library(DiagrammeR)
grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']

      # edge definitions with the node IDs
      tab1 -> tab2;
      }

      [1]: 'Download Basenji TF record (tfr) files'
      [2]: 'Perform data transforms (i.e. pop-seq) and conversion to HDF5 at once'
      ")