---
title: "harmonizing_training_data"
author: "Saideep Gona"
date: "2023-09-11"
format:
  html:
    code-fold: true
    code-summary: "Show the code"
execute:
  freeze: true
  warning: false
---

```{r}
#| label: Set up box storage directory

suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG="harmonizing_training_data" ## copy the slug from the header
bDATE='2023-09-11' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
## create the output directory from R itself (portable, and safe if the path contains spaces)
if(!dir.exists(DATA)) dir.create(DATA, recursive = TRUE)
WORK=DATA

```

# Context

Through my prior training runs, I was able to start training a predicted epigenome-to-aracena track model on a subset of the data. This is promising, but scaling up is blocked by remaining problems with the available training data. In particular, because of edge cases and other errors in the data generation process, some input windows are missing data for one or more aracena individuals or for the reference epigenome.

These inconsistencies misalign the datasets during the training loop. That is not a huge issue as long as a proper mapping is provided. However, the mapping, along with the set of regions that have valid entries across all required datasets, needs to be recorded ahead of time so as to not waste time during training, to reduce the memory required for data loading, and to avoid interfering with batching.

Here, I go through a process of checking the input data and creating a proper alignment.
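
As a rough illustration of what such an alignment could look like, here is a minimal sketch on toy data. Everything in it is an assumption made for illustration: the dataset names (`ref_epigenome`, `ind_A`, `ind_B`), the `region` and `idx` columns, and the output file name are hypothetical and do not reflect the layout of the actual index files. The idea is to stack the per-dataset indices, pivot to one row per region, and keep only regions present in every dataset.

```r
suppressMessages(library(tidyverse))

# Hypothetical per-dataset index tables: one row per stored window, where
# `idx` is the position of that window inside the dataset's own file.
datasets <- list(
  ref_epigenome = tibble(
    region = c("chr1_0_100", "chr1_100_200", "chr1_200_300", "chr1_300_400"),
    idx = 1:4
  ),
  # ind_A is missing one window entirely
  ind_A = tibble(
    region = c("chr1_0_100", "chr1_200_300", "chr1_300_400"),
    idx = 1:3
  ),
  # ind_B has every window, but stored in a different order
  ind_B = tibble(
    region = c("chr1_100_200", "chr1_200_300", "chr1_300_400", "chr1_0_100"),
    idx = 1:4
  )
)

# Stack all indices, then spread to one row per region and one column per
# dataset. A region absent from a dataset gets NA in that dataset's column.
mapping <- bind_rows(datasets, .id = "dataset") %>%
  pivot_wider(names_from = dataset, values_from = idx) %>%
  drop_na() %>%        # keep only regions valid in every dataset
  arrange(region)

# Record the mapping ahead of time so the training loop can simply look it up.
# (Written to a temporary directory here; the file name is hypothetical.)
write_csv(mapping, file.path(tempdir(), "toy_region_mapping.csv"))

print(mapping)
```

In this toy case, `chr1_100_200` is dropped because `ind_A` has no entry for it, and each remaining row gives the per-dataset positions needed to pull the same window from every file. Precomputing the table this way keeps the filtering out of the training loop itself.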


/beagle3/haky/users/saideep/projects/aracena_modeling/hdf5_training/index_files/remapped_table_filt.csv

```{r}
```