prepare_aracena_training_data

Author

Saideep Gona

Published

August 2, 2023

Modified

September 5, 2023

Code
suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG="prepare_aracena_training_data" ## copy the slug from the header
bDATE='2023-08-02' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK=DATA

Context

As per my updated thesis proposal, I intend to use the Aracena et al. dataset to train personalized, context-specific DL models for downstream prediction of epigenetic marks in macrophages under control and flu-exposure conditions. Here, I outline the preparation of the training datasets from the raw data shared with me by Luis’ lab.

Summary of data

Data size

The whole dataset consists of about 11 TB; however, whole-genome sequencing accounts for the majority of that storage.

The actual data I need are:

  • RNAseq: 896.3 GB
  • H3K4me1: 629.1 GB
  • ATACseq: 626.0 GB
  • H3K27me3: 545.6 GB
  • H3K4me3: 331.6 GB
  • H3K27ac: 246 GB
  • Total: 3274.6 GB

This is only ~3 TB in total, which is much more manageable. Still, our GPFS storage doesn’t have this much permanent space. Fortunately, I can copy one dataset at a time, convert it to BigWig, and then delete the copies.

Files

Each dataset contains a different number of files. Downstream, therefore, I will need to restrict to the subset of samples for which alignments are present across all measurements.

Converting BAM files to BigWig

The first step in generating the training data is converting the raw BAM files into BigWig signal tracks. Fortunately, the Basenji authors provide utility scripts for this conversion (https://github.com/calico/basenji/blob/master/docs/preprocess.md#bam_cov).

I have installed these scripts on Midway2 and made a small pipeline to convert the BAM files to BigWigs. The script is as follows:

Code
import os,sys
import time
import glob

conversion_script = "/project2/haky/saideep/aracena_data_preprocessing/basenji/bin/bam_cov.py"

# Load in list of files

# file_list = "/project2/haky/saideep/aracena_data_preprocessing/test_conversion/file_list.txt"
# bam_files = []
# with open(file_list,"r") as fl:
#     for line in fl:
#         bam_files.append(line.strip())

files_dir = "/cds2/lbarreiro/users/katie/EU_AF_ancestry_flu/data/H3K27ac_alignment"

bam_files = glob.glob(files_dir+"/*.bam")
print(bam_files)

# Start main conversion loop

temp_folder = "/project2/haky/saideep/aracena_data_preprocessing/test_conversion/bam_cov"
output_dir = "/project2/haky/saideep/aracena_data_preprocessing/bigwigs"

for bam_file in bam_files:

    # Skip BAMs that have already been converted
    if os.path.exists(os.path.join(output_dir,os.path.basename(bam_file).replace(".bam",".bw"))):
        print(os.path.join(output_dir,os.path.basename(bam_file).replace(".bam",".bw")), " already exists")
        continue

    start_time = time.time()

    # Copy the BAM to local scratch space before conversion
    cp_com = [
        "cp",
        bam_file,
        os.path.join(temp_folder, os.path.basename(bam_file))
    ]
    print(" ".join(cp_com))
    os.system(" ".join(cp_com))

    # Copy the accompanying BAM index; the source must be the .bai file, not the BAM itself
    cp_com_i = [
        "cp",
        bam_file + ".bai",
        os.path.join(temp_folder, os.path.basename(bam_file)+".bai")
    ]
    print(" ".join(cp_com_i))
    os.system(" ".join(cp_com_i))

    # Convert the BAM to a BigWig coverage track with Basenji's bam_cov.py
    bigwig_convert_com = [
        "python",
        conversion_script,
        os.path.join(temp_folder,os.path.basename(bam_file)),
        os.path.join(output_dir,os.path.basename(bam_file).replace(".bam",".bw"))
    ]
    print(" ".join(bigwig_convert_com))
    os.system(" ".join(bigwig_convert_com))

    # Remove the temporary BAM copy
    rm_com = [
        "rm",
        os.path.join(temp_folder,os.path.basename(bam_file))
    ]
    print(" ".join(rm_com))
    os.system(" ".join(rm_com))

    rm_com_i = [
        "rm",
        os.path.join(temp_folder,os.path.basename(bam_file)+".bai")
    ]
    print(" ".join(rm_com_i))
    os.system(" ".join(rm_com_i))

    print("--- %s seconds ---" % (time.time() - start_time))

NOTE: This step requires a good amount of memory (24 GB seems stable).
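
For reference, here is a rough sketch of how this conversion could be submitted as a SLURM job on Midway2. The partition, module name, and script filename are assumptions for illustration, not necessarily the exact setup I used:

Code
#!/bin/bash
#SBATCH --job-name=bam_to_bigwig
#SBATCH --partition=broadwl        # assumed Midway2 partition
#SBATCH --mem=24G                  # bam_cov.py seems stable with 24 GB
#SBATCH --time=24:00:00
#SBATCH --output=bam_to_bigwig_%j.log

# Load Python (module name assumed) and run the conversion loop above,
# saved here under a hypothetical filename
module load python
python /project2/haky/saideep/aracena_data_preprocessing/convert_bams_to_bigwigs.py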

Lifting over to GRCh38 reference genome

The current alignments are to the hg19 reference genome, but I intend to use the GRCh38 reference genome downstream. I therefore need to perform a liftover in order to use the BigWigs I have generated. Here I will try using CrossMap to perform the liftover. First, I downloaded the following chain file: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz.

Then I installed CrossMap into a local directory and ran an example liftover. The process looks like this:

Code
# Download and unpack the hg19 -> hg38 chain file
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz
gunzip hg19ToHg38.over.chain.gz

# Install CrossMap into a local directory
pip install CrossMap -t CrossMap_install_directory

# Example liftover of a BigWig
python ./CrossMap_install_directory/bin/CrossMap.py bigwig hg19ToHg38.over.chain test.bw test_lifted.bw

# Run on one of the Aracena BigWigs
./CrossMap.py bigwig ../test_conversion/hg19ToHg38.over.chain /project2/haky/saideep/aracena_data_preprocessing/bigwigs/AF08_Flu_H3K27ac.sorted.dup.bw /project2/haky/saideep/aracena_data_preprocessing/hg38_bigwigs/AF08_Flu_H3K27ac.sorted.dup

CrossMap consistently gives a bus error, and I think it is an older piece of software. I will instead convert the BigWig to bedGraph, perform the liftover, and then convert back to BigWig. This follows the logic described here (https://bioinfocore.com/blogs/liftover-bigwig-files/) and uses the UCSC liftOver tools:

Code
# BigWig -> bedGraph
bigWigToBedGraph input.bw input.bedGraph

# Lift intervals from hg19 to hg38
liftOver input.bedGraph hg19ToHg38.over.chain input_hg38.bedgraph unMapped

# Chromosome sizes for hg38, needed for the final conversion
fetchChromSizes hg38 > hg38.chrom.sizes

# Sort, then merge overlapping intervals (averaging the signal)
LC_COLLATE=C sort -k1,1 -k2,2n input_hg38.bedgraph > input_hg38.sorted.bedgraph
bedtools merge -o mean -c 4 -i input_hg38.sorted.bedgraph > input_hg38.sorted.merged.bedgraph

# bedGraph -> BigWig on hg38
bedGraphToBigWig input_hg38.sorted.merged.bedgraph hg38.chrom.sizes output.bw

Luckily, this seems to be working. I had to add the bedtools merge step to average out some cases where repeat regions leave overlapping intervals after the liftover (it might be worth better understanding the cause).
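
Since this works for a single file, the same steps can be wrapped in a loop over all of the converted BigWigs. A rough sketch is below; the intermediate filenames are made up for illustration, and the UCSC tools and bedtools are assumed to be on the PATH:

Code
#!/bin/bash
# Sketch: lift all hg19 BigWigs over to hg38 (filenames assumed, tools assumed on PATH)
BW_DIR=/project2/haky/saideep/aracena_data_preprocessing/bigwigs
OUT_DIR=/project2/haky/saideep/aracena_data_preprocessing/hg38_bigwigs
CHAIN=hg19ToHg38.over.chain

fetchChromSizes hg38 > hg38.chrom.sizes

for bw in "$BW_DIR"/*.bw; do
    name=$(basename "$bw" .bw)

    bigWigToBedGraph "$bw" "$name.bedGraph"
    liftOver "$name.bedGraph" "$CHAIN" "$name.hg38.bedgraph" "$name.unMapped"
    LC_COLLATE=C sort -k1,1 -k2,2n "$name.hg38.bedgraph" > "$name.hg38.sorted.bedgraph"
    bedtools merge -o mean -c 4 -i "$name.hg38.sorted.bedgraph" > "$name.hg38.sorted.merged.bedgraph"
    bedGraphToBigWig "$name.hg38.sorted.merged.bedgraph" hg38.chrom.sizes "$OUT_DIR/$name.bw"

    # Clean up intermediates to keep storage usage down
    rm "$name.bedGraph" "$name.hg38.bedgraph" "$name.hg38.sorted.bedgraph" "$name.hg38.sorted.merged.bedgraph" "$name.unMapped"
done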

Determining which samples are in common between datasets

The following script was used to determine which samples are common across all 6 readouts:

Code
import os, sys
import glob
import numpy as np

root = "/cds2/lbarreiro/users/katie/EU_AF_ancestry_flu/data"
data_dirs = [
    "H3K27ac_alignment",
    "ATACseq_alignment",
    "H3K27me3_alignment",
    "H3K4me1_alignment",
    "H3K4me3_alignment",
    "RNAseq_bams"
]

data_dirs_full = [os.path.join(root,x) for x in data_dirs]

samples_dict = {}

for d_dir in data_dirs_full:
    # Collect BAMs for this readout and reduce each filename to its sample ID (e.g. AF08_Flu)
    all_bams_cur = glob.glob(d_dir+"/*.bam")
    print(d_dir+"/*.bam")
    samples = ["_".join(os.path.basename(x).split(".")[0].split("_")[0:2]) for x in all_bams_cur]
    samples_dict[os.path.basename(d_dir)] = set(samples)

print(samples_dict)

# Intersect the per-readout sample sets to find samples present in all 6 readouts
set_list = [samples_dict[x] for x in samples_dict.keys()]

common_set = set.intersection(*set_list)

print(len(common_set))

print(common_set)

The result is that 54 samples (27 individuals) are present in common between the readouts. They look like this:

{'EU03_NI', 'AF30_NI', 'AF36_NI', 'AF20_Flu', 'EU13_NI', 'EU47_NI', 'EU41_NI', 'EU27_NI', 'AF18_Flu', 'AF22_NI', 'EU21_NI', 'EU15_Flu', 'EU25_Flu', 'AF22_Flu', 'EU25_NI', 'EU05_NI', 'EU15_NI', 'AF34_Flu', 'EU05_Flu', 'EU03_Flu', 'EU07_NI', 'EU07_Flu', 'AF24_NI', 'EU41_Flu', 'EU19_NI', 'EU19_Flu', 'AF16_Flu', 'EU27_Flu', 'EU21_Flu', 'EU09_NI', 'AF24_Flu', 'AF08_NI', 'AF26_Flu', 'AF38_Flu', 'AF34_NI', 'AF08_Flu', 'EU33_NI', 'EU13_Flu', 'AF26_NI', 'EU39_NI', 'AF20_NI', 'AF28_NI', 'AF12_Flu', 'AF18_NI', 'AF30_Flu', 'AF38_NI', 'AF12_NI', 'AF16_NI', 'EU47_Flu', 'EU33_Flu', 'AF36_Flu', 'EU09_Flu', 'AF28_Flu', 'EU39_Flu'}
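
As a possible next step, the lifted BigWigs can be restricted to these common samples. Below is a rough sketch, assuming the common sample IDs from the script above have been written one per line to a hypothetical common_samples.txt and that filenames begin with the sample ID (e.g. AF08_Flu_H3K27ac.sorted.dup.bw):

Code
#!/bin/bash
# Sketch: keep only hg38 BigWigs for samples present in all 6 readouts
# (common_samples.txt and the output directory name are assumptions)
BW_DIR=/project2/haky/saideep/aracena_data_preprocessing/hg38_bigwigs
OUT_DIR=/project2/haky/saideep/aracena_data_preprocessing/hg38_bigwigs_common
mkdir -p "$OUT_DIR"

while read -r sample; do
    # Filenames begin with the sample ID, e.g. AF08_Flu_H3K27ac.sorted.dup.bw
    cp "$BW_DIR/${sample}"_*.bw "$OUT_DIR"/ || echo "No BigWigs found for $sample"
done < common_samples.txt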