suppressMessages(library(tidyverse))suppressMessages(library(glue))PRE ="/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"## COPY THE DATE AND SLUG fields FROM THE HEADERSLUG="creating_population_sequences"## copy the slug from the headerbDATE='2023-07-07'## copy the date from the blog's header hereDATA =glue("{PRE}/{bDATE}-{SLUG}")if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))WORK=DATA
Context
Here I will describe in detail the process for generating population sequences (pop-seq) for input into Enformer training. This is important because it needs to be clear that the process is not buggy.
Note that this description is based off the training data generation process used for the ALCF hackathon, which is not necessarily optimal. I started a more formal process of generating training data, and outlined the differences in the 2023-05-18 blog post.
Mutating sequence
Assuming we already have the “sequences.bed” and “.pt” training dataset files, the following steps will create a set of “pop-seq” containing “.pt” files by computing the allele frequencies of variants in that sequence window and mutating the corresponding sites in the sequence.
Functions
The following function is fairly well explained here:
Code
def get_allele_freqs(vcf_file, query, samples_of_interest):''' Given a query region, vcf file, and a set of samples, returns the allele frequencies at variable sites for that region Inputs: vcf_file: path to vcf file query: query region in the format "chr:start-end" samples_of_interest: list of samples to query from VCF Outputs: af_vector_mat: matrix of allele frequencies at variable sites ref_vector_mat: matrix of reference allele at variable sites positions: positions of variable sites adjusted to query start coordinates ''' nucleotides = {"A":0, "C":1, "G":2, "T":3} positions = [] af_vectors = [] ref_vectors = [] vcf_reader = cyvcf2.VCF(vcf_file, samples=samples_of_interest) queried_vcf = vcf_reader(query)for variant in queried_vcf: positions.append(variant.POS) variant_genotypes = np.array(variant.genotypes) var_sum = np.sum(variant_genotypes, axis=0) allele_frequency = (var_sum[0]+var_sum[1])/(2*var_sum[2]) af_vector = np.zeros(4) af_vector[nucleotides[variant.REF]] =1-allele_frequency af_vector[nucleotides[variant.ALT[0]]] = allele_frequency ref_vector = np.zeros(4) ref_vector[nucleotides[variant.REF]] =1 ref_vectors.append(ref_vector) af_vectors.append(af_vector) af_vector_mat = np.vstack(af_vectors) ref_vector_mat = np.vstack(ref_vectors) positions = np.array(positions) -int(query.split(":")[1].split("-")[0]) -1return af_vector_mat, ref_vector_mat, positions
To elaborate further, the function takes in only a VCF file, a region in the sequence, and set of samples of interest which should exist in the VCF file. From that it produces 3 outputs. The first is a matrix of allele frequencies. The dimensions of this are (number of variable SNPs in region x 4). The first dimension will tend to be fewer with fewer numbers of individuals since it’s based on the number of sites with at least some variation. The second dimension is the nucleotide encodings (similar to one-hot). The second output is a reference matrix. This has the same dimensions as the allele frequency matrix because it stores the reference allele values for those same variants. Importantly, this reference matrix is generated using only the reference alleles from the VCF file and is useful for checking that the VCF reference alleles match the reference sequence which will be mutated downstream. The final output is a positions vector, which gives the indeces of the variant sites relative to the sequence window. It has the same length as the row count fo the allele frequency matrix.
These outputs then feed into the next function, where the “mutation” actually happens:
Code
def mutate_sequence(one_hot_reference, af_mat, ref_mat, positions_mut):""" Mutate the reference sequence at the given interval. Also check that reference sequence at each position matches the reference allele from the VCF file. Inputs: one_hot_reference: one-hot encoded reference sequence af_mat: matrix of allele frequencies at variable sites ref_mat: matrix of reference allele at variable sites positions_mut: positions of variable sites adjusted to query start coordinates Outputs: one_hot_mutated: one-hot encoded mutated sequence with allele frequencies at variable sites """print(one_hot_reference.shape, af_mat.shape, ref_mat.shape, positions_mut.shape)# Check that reference sequence matches reference from VCFfor i,position inenumerate(positions_mut):print(i,position)print(one_hot_reference[0,position,:],ref_mat[i,:])if np.array_equal(one_hot_reference[0,position,:], ref_mat[i,:]):continueelse:print("ERROR: REFERENCE DOES NOT MATCH REFERENCE VECTOR AT POSITION", pos) one_hot_mutated = one_hot_reference.deepcopy() one_hot_mutated[0,positions_mut,:] = af_matreturn one_hot_mutated
This function now takes a one hot reference sequence from any source and creates a mutated version of it using the three outputs discussed previously. For safety, the first thing the function does is check if the reference matrix generated from the VCF matches the reference sequence to be mutated. This ensures that all the relevant files match as they should. Finally, using numpy indexing, the reference sequence is deep copied, and the positions are mutated to create a mutated “pop-seq”.
Creation of pop-seq .pt file
These functions were tested independently and then used in the following code to actually create the .pt files.
The larger process occurs in the following function:
The first thing this function does is load a given pt file to be converted. These pt files already contain a sequence database in them, but lack the corresponding interval data. These are extracted from the “seqeunces.bed” file assuming the order is identical. The intervals are passed into the function above as “query regions”. Code for collecting these are below:
Code
def extract_regions(file_count, region_type, raw_regions): chromosome_counts = {}if region_type =="train": start =0if (start+256*(file_count+1))>34021: end =34021else: end = start+256*(file_count+1) start = start+256*file_countelif region_type =="valid": start =34021if (start+256*(file_count+1))>36234: end =36234else: end = start+256*(file_count+1) start = start+256*file_countelse: start =36234if (start+256*(file_count+1))>len(raw_regions): end =len(raw_regions)else: end = start+256*(file_count+1) start = start+256*file_count regions = []for i inrange(start,end): chrom = raw_regions[i].split('\t')[0]if chrom notin chromosome_counts: chromosome_counts[chrom] =1else: chromosome_counts[chrom] +=1 s = raw_regions[i].split('\t')[1] e = raw_regions[i].split('\t')[2] split_type = raw_regions[i].split('\t')[3].strip()ifnot split_type == region_type:print(split_type , " should be ", region_type) sys.exit() regions.append(chrom+":{}-{}".format(s,e))return regions, chromosome_counts
I won’t spend too much time explaining this, since if the alignment of the sequences.bed file and the sequences stored in the pt files is not correct, then it will be clear from mismatch of underlying reference sequence.
Returning back to the main function, it then iterates through the sequences from the pt file, and iterates along the relative indices of these sequences in the pt files sequences. At each iteration, the interval from the query regions corresponding to that index is used to gather the allele frequencies of interest and the corresponding output matrices. Then, the sequence from the pt files is passed into the mutate sequence function (note that this sequence is not generated from querying a fasta file using the interval used to collect allele frequencies). Each of these mutated sequences are collected and stacked into a single matrix which is then stored alongside the original sequence matrix in a .pt file.
Conversion to hdf5
Source Code
---title: "creating_population_sequences"author: "Saideep Gona"date: "2023-07-07"format: html: code-fold: true code-summary: "Show the code"execute: freeze: true warning: false---```{r}#| label: Set up box storage directorysuppressMessages(library(tidyverse))suppressMessages(library(glue))PRE ="/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"## COPY THE DATE AND SLUG fields FROM THE HEADERSLUG="creating_population_sequences"## copy the slug from the headerbDATE='2023-07-07'## copy the date from the blog's header hereDATA =glue("{PRE}/{bDATE}-{SLUG}")if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))WORK=DATA```# ContextHere I will describe in detail the process for generating population sequences (pop-seq) for input into Enformer training. This is important because it needs to be clear that the process is not buggy. Note that this description is based off the training data generation process used for the ALCF hackathon, which is not necessarily optimal. I started a more formal process of generating training data, and outlined the differences in the 2023-05-18 blog post. ## Mutating sequenceAssuming we already have the "sequences.bed" and ".pt" training dataset files, the following steps will create a set of "pop-seq" containing ".pt" files by computing the allele frequencies of variants in that sequence window and mutating the corresponding sites in the sequence. ### FunctionsThe following function is fairly well explained here: ```{python eval=FALSE}def get_allele_freqs(vcf_file, query, samples_of_interest):''' Given a query region, vcf file, and a set of samples, returns the allele frequencies at variable sites for that region Inputs: vcf_file: path to vcf file query: query region in the format "chr:start-end" samples_of_interest: list of samples to query from VCF Outputs: af_vector_mat: matrix of allele frequencies at variable sites ref_vector_mat: matrix of reference allele at variable sites positions: positions of variable sites adjusted to query start coordinates ''' nucleotides = {"A":0, "C":1, "G":2, "T":3} positions = [] af_vectors = [] ref_vectors = [] vcf_reader = cyvcf2.VCF(vcf_file, samples=samples_of_interest) queried_vcf = vcf_reader(query)for variant in queried_vcf: positions.append(variant.POS) variant_genotypes = np.array(variant.genotypes) var_sum = np.sum(variant_genotypes, axis=0) allele_frequency = (var_sum[0]+var_sum[1])/(2*var_sum[2]) af_vector = np.zeros(4) af_vector[nucleotides[variant.REF]] =1-allele_frequency af_vector[nucleotides[variant.ALT[0]]] = allele_frequency ref_vector = np.zeros(4) ref_vector[nucleotides[variant.REF]] =1 ref_vectors.append(ref_vector) af_vectors.append(af_vector) af_vector_mat = np.vstack(af_vectors) ref_vector_mat = np.vstack(ref_vectors) positions = np.array(positions) -int(query.split(":")[1].split("-")[0]) -1return af_vector_mat, ref_vector_mat, positions```To elaborate further, the function takes in only a VCF file, a region in the sequence, and set of samples of interest which should exist in the VCF file. From that it produces 3 outputs. The first is a matrix of allele frequencies. The dimensions of this are (number of variable SNPs in region x 4). The first dimension will tend to be fewer with fewer numbers of individuals since it's based on the number of sites with at least some variation. The second dimension is the nucleotide encodings (similar to one-hot). The second output is a reference matrix. This has the same dimensions as the allele frequency matrix because it stores the reference allele values for those same variants. Importantly, this reference matrix is generated using only the reference alleles from the VCF file and is useful for checking that the VCF reference alleles match the reference sequence which will be mutated downstream. The final output is a positions vector, which gives the indeces of the variant sites relative to the sequence window. It has the same length as the row count fo the allele frequency matrix.These outputs then feed into the next function, where the "mutation" actually happens:```{python eval=FALSE}def mutate_sequence(one_hot_reference, af_mat, ref_mat, positions_mut):""" Mutate the reference sequence at the given interval. Also check that reference sequence at each position matches the reference allele from the VCF file. Inputs: one_hot_reference: one-hot encoded reference sequence af_mat: matrix of allele frequencies at variable sites ref_mat: matrix of reference allele at variable sites positions_mut: positions of variable sites adjusted to query start coordinates Outputs: one_hot_mutated: one-hot encoded mutated sequence with allele frequencies at variable sites """print(one_hot_reference.shape, af_mat.shape, ref_mat.shape, positions_mut.shape)# Check that reference sequence matches reference from VCFfor i,position inenumerate(positions_mut):print(i,position)print(one_hot_reference[0,position,:],ref_mat[i,:])if np.array_equal(one_hot_reference[0,position,:], ref_mat[i,:]):continueelse:print("ERROR: REFERENCE DOES NOT MATCH REFERENCE VECTOR AT POSITION", pos) one_hot_mutated = one_hot_reference.deepcopy() one_hot_mutated[0,positions_mut,:] = af_matreturn one_hot_mutated```This function now takes a one hot reference sequence from any source and creates a mutated version of it using the three outputs discussed previously. For safety, the first thing the function does is check if the reference matrix generated from the VCF matches the reference sequence to be mutated. This ensures that all the relevant files match as they should. Finally, using numpy indexing, the reference sequence is deep copied, and the positions are mutated to create a mutated "pop-seq".### Creation of pop-seq .pt fileThese functions were tested independently and then used in the following code to actually create the .pt files.The larger process occurs in the following function:```{python eval=FALSE}def modify_single_pt_file(pt_file_path, query_regions, vcf_files_path, samples_of_interest, output_folder):# load pt file pt_data = torch.load(pt_file_path) # 256, 131702, 4# get the sequence seqs = pt_data['sequence'] # 256, 131702, 4 pop_seqs = []for i,seq inenumerate(seqs):# print(query_regions[i]) chrom = query_regions[i].split(':')[0] vcf_file = os.path.join(vcf_files_path, f'ALL.{chrom}.shapeit2_integrated_SNPs_v2a_27022019.GRCh38.phased.vcf.gz') af_mat_test, ref_mat_test, pos = mutate_seq.get_allele_freqs(vcf_file, query_regions[i],samples_of_interest)if af_mat_test =="No variants in region": mutated_seq = copy.deepcopy(seq)else: mutated_seq = mutate_seq.mutate_sequence(copy.deepcopy(seq),af_mat_test,ref_mat_test,pos) pop_seqs.append(mutated_seq) pop_seqs = torch.stack(pop_seqs) pt_data['query_regions'] = convert_queries_to_tensor(query_regions) pt_data['pop_sequence'] = pop_seqs# save the pt file torch.save(pt_data, os.path.join(output_folder, os.path.basename(pt_file_path).replace('.pt', '_pop.pt')))```The first thing this function does is load a given pt file to be converted. These pt files already contain a sequence database in them, but lack the corresponding interval data. These are extracted from the "seqeunces.bed" file assuming the order is identical. The intervals are passed into the function above as "query regions". Code for collecting these are below:```{python eval=FALSE}def extract_regions(file_count, region_type, raw_regions): chromosome_counts = {}if region_type =="train": start =0if (start+256*(file_count+1))>34021: end =34021else: end = start+256*(file_count+1) start = start+256*file_countelif region_type =="valid": start =34021if (start+256*(file_count+1))>36234: end =36234else: end = start+256*(file_count+1) start = start+256*file_countelse: start =36234if (start+256*(file_count+1))>len(raw_regions): end =len(raw_regions)else: end = start+256*(file_count+1) start = start+256*file_count regions = []for i inrange(start,end): chrom = raw_regions[i].split('\t')[0]if chrom notin chromosome_counts: chromosome_counts[chrom] =1else: chromosome_counts[chrom] +=1 s = raw_regions[i].split('\t')[1] e = raw_regions[i].split('\t')[2] split_type = raw_regions[i].split('\t')[3].strip()ifnot split_type == region_type:print(split_type , " should be ", region_type) sys.exit() regions.append(chrom+":{}-{}".format(s,e))return regions, chromosome_counts```I won't spend too much time explaining this, since if the alignment of the sequences.bed file and the sequences stored in the pt files is not correct, then it will be clear from mismatch of underlying reference sequence.Returning back to the main function, it then iterates through the sequences from the pt file, and iterates along the relative indices of these sequences in the pt files sequences. At each iteration, the interval from the query regions corresponding to that index is used to gather the allele frequencies of interest and the corresponding output matrices. Then, the sequence from the pt files is passed into the mutate sequence function (note that this sequence is not generated from querying a fasta file using the interval used to collect allele frequencies). Each of these mutated sequences are collected and stacked into a single matrix which is then stored alongside the original sequence matrix in a .pt file.### Conversion to hdf5