```r
suppressMessages(library(tidyverse))
suppressMessages(library(glue))

PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG = "harmonizing_training_data"   ## copy the slug from the header
bDATE = '2023-09-11'                 ## copy the date from the blog's header here

DATA = glue("{PRE}/{bDATE}-{SLUG}")
if (!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK = DATA
```
Context
Through my prior training runs, I was able to start training a predicted epigenome-to-aracena track model on a data subset. This is promising, but one issue for scaling up is that there are still problems with the available training data. In particular, due to edge cases and other errors in the data generation process, certain input windows may have missing data across the aracena individuals and/or the reference epigenome.
These inconsistencies misalign data during the training loop. That is not a huge issue if you provide a proper mapping, but this mapping, along with the set of regions that have valid entries across all required datasets, needs to be recorded ahead of time so as to not waste time during training, to reduce the memory required for data loading, and to avoid interfering with batching.
Here, I go through a process of checking the input data and creating a proper alignment.
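To make this concrete, below is a minimal sketch of the kind of pre-computed alignment I have in mind. It assumes, purely for illustration, that each dataset is an HDF5 file whose valid windows appear as top-level entries, and it uses the Bioconductor rhdf5 package; the file names and the output schema are placeholders rather than the actual pipeline code.

```r
# Hedged sketch: find the windows present in every dataset and record that
# alignment ahead of training. File names, the rhdf5 dependency, and the idea
# that regions appear as top-level HDF5 entries are assumptions for illustration.
suppressMessages(library(rhdf5))
suppressMessages(library(tidyverse))

# Hypothetical inputs: one reference-epigenome HDF5 plus per-individual HDF5s
hdf5_files <- c(
  reference = "reference_epigenome.h5",
  ind1      = "aracena_ind1.h5",
  ind2      = "aracena_ind2.h5"
)

# Collect the region identifiers stored at the top level of each file
regions_per_file <- map(hdf5_files, function(f) {
  h5ls(f) %>%
    filter(group == "/") %>%
    pull(name)
})

# Keep only windows that have valid entries in every dataset
common_regions <- reduce(regions_per_file, intersect)

# Record the alignment (region -> row index) so the training loop can look it
# up instead of re-checking the raw data every epoch
alignment <- tibble(region = common_regions) %>%
  mutate(row_index = row_number() - 1)   # 0-based index for a Python data loader

write_csv(alignment, "index_files/remapped_table_filt.csv")
```

In practice the same intersection just has to be computed once and written to disk, which is what keeps the data-loading memory footprint down and the batches aligned during training.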
Source Code
````markdown
---
title: "harmonizing_training_data"
author: "Saideep Gona"
date: "2023-09-11"
format:
  html:
    code-fold: true
    code-summary: "Show the code"
execute:
  freeze: true
  warning: false
---

```{r}
#| label: Set up box storage directory

suppressMessages(library(tidyverse))
suppressMessages(library(glue))

PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG = "harmonizing_training_data"   ## copy the slug from the header
bDATE = '2023-09-11'                 ## copy the date from the blog's header here

DATA = glue("{PRE}/{bDATE}-{SLUG}")
if (!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK = DATA
```

# Context

Through my prior training runs, I was able to start training a predicted epigenome-to-aracena track model on a data subset. This is promising, but one issue for scaling up is that there are still problems with the available training data. In particular, due to edge cases and other errors in the data generation process, certain input windows may have missing data across the aracena individuals and/or the reference epigenome.

These inconsistencies misalign data during the training loop. That is not a huge issue if you provide a proper mapping, but this mapping, along with the set of regions that have valid entries across all required datasets, needs to be recorded ahead of time so as to not waste time during training, to reduce the memory required for data loading, and to avoid interfering with batching.

Here, I go through a process of checking the input data and creating a proper alignment.

/beagle3/haky/users/saideep/projects/aracena_modeling/hdf5_training/index_files/remapped_table_filt.csv

```{r}
```
````
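The path in the source above points to the filtered index table produced by this harmonization step. As a quick sketch, it could be loaded and sanity-checked before being handed to the training code; the post doesn't show the table's columns, so the snippet only inspects the file rather than assuming a schema.

```r
# Hedged sketch: read back the precomputed alignment/index table. Only the
# path comes from the post; no assumptions are made about its column names.
suppressMessages(library(tidyverse))

index_path <- "/beagle3/haky/users/saideep/projects/aracena_modeling/hdf5_training/index_files/remapped_table_filt.csv"

remapped_table_filt <- read_csv(index_path, show_col_types = FALSE)

glimpse(remapped_table_filt)   # inspect the recorded columns
nrow(remapped_table_filt)      # number of windows valid across all datasets
```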