creating_population_sequences

Author

Saideep Gona

Published

July 7, 2023

Code
suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG="creating_population_sequences" ## copy the slug from the header
bDATE='2023-07-07' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK=DATA

Context

Here I will describe in detail the process for generating population sequences (pop-seq) for input into Enformer training. This is important because it needs to be clear that the process is not buggy.

Note that this description is based off the training data generation process used for the ALCF hackathon, which is not necessarily optimal. I started a more formal process of generating training data, and outlined the differences in the 2023-05-18 blog post.

Mutating sequence

Assuming we already have the “sequences.bed” and “.pt” training dataset files, the following steps will create a set of “pop-seq” containing “.pt” files by computing the allele frequencies of variants in that sequence window and mutating the corresponding sites in the sequence.

Functions

The following function is fairly well explained here:

Code
def get_allele_freqs(vcf_file, query, samples_of_interest):

    '''
    Given a query region, vcf file, and a set of samples, returns the allele frequencies at variable sites 
    for that region

    Inputs:
        vcf_file: path to vcf file
        query: query region in the format "chr:start-end"
        samples_of_interest: list of samples to query from VCF
    Outputs:
        af_vector_mat: matrix of allele frequencies at variable sites
        ref_vector_mat: matrix of reference allele at variable sites
        positions: positions of variable sites adjusted to query start coordinates
    '''

    nucleotides = {"A":0, "C":1, "G":2, "T":3}

    positions = []

    af_vectors = []
    ref_vectors = []

    vcf_reader = cyvcf2.VCF(vcf_file, samples=samples_of_interest)
    queried_vcf = vcf_reader(query)

    for variant in queried_vcf:
        positions.append(variant.POS)

        variant_genotypes = np.array(variant.genotypes)
        var_sum = np.sum(variant_genotypes, axis=0)
        allele_frequency = (var_sum[0]+var_sum[1])/(2*var_sum[2])

        af_vector = np.zeros(4)
        af_vector[nucleotides[variant.REF]] = 1-allele_frequency
        af_vector[nucleotides[variant.ALT[0]]] = allele_frequency
        ref_vector = np.zeros(4)
        ref_vector[nucleotides[variant.REF]] = 1

        ref_vectors.append(ref_vector)
        af_vectors.append(af_vector)

    af_vector_mat = np.vstack(af_vectors)
    ref_vector_mat = np.vstack(ref_vectors)
    
    positions = np.array(positions) - int(query.split(":")[1].split("-")[0]) - 1
    
    return af_vector_mat, ref_vector_mat, positions

To elaborate further, the function takes in only a VCF file, a region in the sequence, and set of samples of interest which should exist in the VCF file. From that it produces 3 outputs. The first is a matrix of allele frequencies. The dimensions of this are (number of variable SNPs in region x 4). The first dimension will tend to be fewer with fewer numbers of individuals since it’s based on the number of sites with at least some variation. The second dimension is the nucleotide encodings (similar to one-hot). The second output is a reference matrix. This has the same dimensions as the allele frequency matrix because it stores the reference allele values for those same variants. Importantly, this reference matrix is generated using only the reference alleles from the VCF file and is useful for checking that the VCF reference alleles match the reference sequence which will be mutated downstream. The final output is a positions vector, which gives the indeces of the variant sites relative to the sequence window. It has the same length as the row count fo the allele frequency matrix.

These outputs then feed into the next function, where the “mutation” actually happens:

Code
def mutate_sequence(one_hot_reference, af_mat, ref_mat, positions_mut):
    """
    Mutate the reference sequence at the given interval. Also check that reference sequence at
    each position matches the reference allele from the VCF file.
    
    Inputs:
        one_hot_reference: one-hot encoded reference sequence
        af_mat: matrix of allele frequencies at variable sites
        ref_mat: matrix of reference allele at variable sites
        positions_mut: positions of variable sites adjusted to query start coordinates
    Outputs:
        one_hot_mutated: one-hot encoded mutated sequence with allele frequencies at variable sites
    """
    
    print(one_hot_reference.shape, af_mat.shape, ref_mat.shape, positions_mut.shape)
    # Check that reference sequence matches reference from VCF
    for i,position in enumerate(positions_mut):
        print(i,position)
        print(one_hot_reference[0,position,:],ref_mat[i,:])
        if np.array_equal(one_hot_reference[0,position,:], ref_mat[i,:]):
            continue
        else:
            print("ERROR: REFERENCE DOES NOT MATCH REFERENCE VECTOR AT POSITION", pos)

    one_hot_mutated = one_hot_reference.deepcopy()
    one_hot_mutated[0,positions_mut,:] = af_mat

    return one_hot_mutated

This function now takes a one hot reference sequence from any source and creates a mutated version of it using the three outputs discussed previously. For safety, the first thing the function does is check if the reference matrix generated from the VCF matches the reference sequence to be mutated. This ensures that all the relevant files match as they should. Finally, using numpy indexing, the reference sequence is deep copied, and the positions are mutated to create a mutated “pop-seq”.

Creation of pop-seq .pt file

These functions were tested independently and then used in the following code to actually create the .pt files.

The larger process occurs in the following function:

Code
def modify_single_pt_file(pt_file_path, query_regions, vcf_files_path, samples_of_interest, output_folder):
    # load pt file
    pt_data = torch.load(pt_file_path) # 256, 131702, 4
    # get the sequence
    seqs = pt_data['sequence'] # 256, 131702, 4

    pop_seqs = []
    for i,seq in enumerate(seqs):
        # print(query_regions[i])
        chrom = query_regions[i].split(':')[0]
        vcf_file = os.path.join(vcf_files_path, f'ALL.{chrom}.shapeit2_integrated_SNPs_v2a_27022019.GRCh38.phased.vcf.gz')

        af_mat_test, ref_mat_test, pos = mutate_seq.get_allele_freqs(vcf_file, query_regions[i],samples_of_interest)

        if af_mat_test == "No variants in region":
            mutated_seq = copy.deepcopy(seq)
        else:
            mutated_seq = mutate_seq.mutate_sequence(copy.deepcopy(seq),af_mat_test,ref_mat_test,pos)
            
        pop_seqs.append(mutated_seq)

    pop_seqs = torch.stack(pop_seqs)

    
    pt_data['query_regions'] = convert_queries_to_tensor(query_regions)
    pt_data['pop_sequence'] = pop_seqs

    # save the pt file
    torch.save(pt_data, os.path.join(output_folder, os.path.basename(pt_file_path).replace('.pt', '_pop.pt')))

The first thing this function does is load a given pt file to be converted. These pt files already contain a sequence database in them, but lack the corresponding interval data. These are extracted from the “seqeunces.bed” file assuming the order is identical. The intervals are passed into the function above as “query regions”. Code for collecting these are below:

Code
def extract_regions(file_count, region_type, raw_regions):

    chromosome_counts = {}

    if region_type == "train":
        start = 0
        if (start+256*(file_count+1))>34021:
            end = 34021
        else:
            end = start+256*(file_count+1)
        start = start+256*file_count

    elif region_type == "valid":
        start = 34021
        if (start+256*(file_count+1))>36234:
            end = 36234
        else:
            end = start+256*(file_count+1)
        start = start+256*file_count

    else:
        start = 36234
        if (start+256*(file_count+1))>len(raw_regions):
            end = len(raw_regions)
        else:
            end = start+256*(file_count+1)
        start = start+256*file_count
        
    regions = []
    for i in range(start,end):
        chrom = raw_regions[i].split('\t')[0]
        if chrom not in chromosome_counts:
            chromosome_counts[chrom] = 1
        else:
            chromosome_counts[chrom] += 1

        s = raw_regions[i].split('\t')[1]
        e = raw_regions[i].split('\t')[2]

        split_type = raw_regions[i].split('\t')[3].strip()

        if not split_type == region_type:
            print(split_type , " should be ", region_type)
            sys.exit()
            

        regions.append(chrom+":{}-{}".format(s,e))
    
    return regions, chromosome_counts

I won’t spend too much time explaining this, since if the alignment of the sequences.bed file and the sequences stored in the pt files is not correct, then it will be clear from mismatch of underlying reference sequence.

Returning back to the main function, it then iterates through the sequences from the pt file, and iterates along the relative indices of these sequences in the pt files sequences. At each iteration, the interval from the query regions corresponding to that index is used to gather the allele frequencies of interest and the corresponding output matrices. Then, the sequence from the pt files is passed into the mutate sequence function (note that this sequence is not generated from querying a fasta file using the interval used to collect allele frequencies). Each of these mutated sequences are collected and stacked into a single matrix which is then stored alongside the original sequence matrix in a .pt file.

Conversion to hdf5