Author

Saideep Gona

Published

2023-03-07

How to Build a Quarto Blog System Synced with the Lab’s Data

Note: This is an example centered around setting up a blog post system. You can add other Quarto analyses following these principles as well.

Set up a Quarto Project

1.) Go to the “File” menu in RStudio

2.) Click “New Project”

3.) Choose the project type you want, along with the root directory the project should live in.

4.) Choose “Quarto Website”

This will initialize the files for your Quarto project.
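After initialization, the project root will contain files such as “_quarto.yml”, “index.qmd”, “about.qmd”, and “styles.css”. The generated “_quarto.yml” looks roughly like the following (the exact contents vary by Quarto version, and the title will be whatever you entered; this is a sketch):

Code
project:
  type: website

website:
  title: "Daily-Blog-Sai"
  navbar:
    left:
      - href: index.qmd
        text: Home
      - about.qmd

format:
  html:
    theme: cosmo
    css: styles.css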

Set up a folder for blog posts

1.) Make a sub-directory called “posts” within the Quarto project directory

2.) For each blog post, make a nested sub-directory with the following naming convention:

“project_root/posts/YEAR-MONTH-DAY-SLUG/”

For example, the document for the current post is in: “project_root/posts/2023-03-07-first-blog-post/”

3.) Within this directory, you can then initialize a new Quarto document and name it “index.qmd”

There should now be a Quarto file at this path: “project_root/posts/2023-03-07-first-blog-post/index.qmd”

Note: Because the file is named “index.qmd”, rendering it generates “project_root/_site/posts/YEAR-MONTH-DAY-SLUG/index.html”. When the website is generated, index.html is the default HTML page served for that path, which is what we want.

4.) Add a header to the document like the following, making sure the date matches the post. (The header below is a reconstruction; adjust the title and author for your post.)
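Code
---
title: "First Blog Post"
author: "Saideep Gona"
date: "2023-03-07"
---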

5.) You will also want pretty much all of your Quarto analyses to include the following code block (or similar):

Code
suppressMessages(library(tidyverse))
suppressMessages(library(glue))

## Local path to the mounted lab Box directory where post data will live
PRE = "/Users/sgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG = "first-blog-test" ## copy the slug from the header
bDATE = '2023-03-07'     ## copy the date from the blog's header here

## Create a per-post data directory on Box, mirroring the post's name
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if(!dir.exists(DATA)) dir.create(DATA, recursive = TRUE)
WORK = DATA

   “PRE” should specify a local path to the mounted lab Box directory where you want to store data. This is most relevant for analyses in which the files involved are too large to be stored as part of the git repository.

   “SLUG” and “bDATE” should match the name of the post directory.

   The code block as a whole creates a mirrored directory on Box where your data lives. This is where the Quarto script should (generally) read and write data.

It is probably a good idea to have the lab Box directory mounted locally ahead of time.

Set up a Github Repo

1.) Run “git init” in the root directory of the Quarto project

2.) Set the remote to a repository under the lab’s GitHub organization (https://github.com/hakyimlab/)

3.) “git add” all the project files and blog posts

4.) Make sure to add and push the daily blog to GitHub daily (don’t forget to render it first!). A sketch of the routine is below.
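For reference, the daily routine from the project root looks something like this (the commit message and branch name here are placeholders):

Code
quarto render
git add .
git commit -m "daily blog post"
git push origin main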

Update 3-28

I added a script to autogenerate blog posts, located at “../../initialize_daily_blog.py”.
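I have not reproduced the script here, but a minimal sketch of the idea (create the dated post directory and an “index.qmd” with a pre-filled header) might look like the following. The function and argument names are hypothetical, not the script’s actual contents:

Code
#| eval: false

# Hypothetical sketch (not the actual initialize_daily_blog.py):
# create posts/YEAR-MONTH-DAY-SLUG/index.qmd with a pre-filled header.
import datetime
import pathlib
import sys

def initialize_post(slug, title, posts_root="posts"):
    date = datetime.date.today().isoformat()
    post_dir = pathlib.Path(posts_root) / f"{date}-{slug}"
    post_dir.mkdir(parents=True, exist_ok=True)
    header = "\n".join([
        "---",
        f'title: "{title}"',
        'author: "Saideep Gona"',
        f'date: "{date}"',
        "---",
        "",
    ])
    (post_dir / "index.qmd").write_text(header)
    return post_dir

if __name__ == "__main__":
    # Usage: python initialize_daily_blog.py SLUG "Post Title"
    initialize_post(sys.argv[1], sys.argv[2])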

Progress for Today

Prior context

Previously, Temi and I merged our work into a common pipeline framework. Using this, I ran 462 GEUVADIS individuals for ~19k genes through Enformer, saving several entire Enformer output tracks across this set. This is the full set of protein-coding genes across GEUVADIS individuals. The issue I’ve been having is that these results don’t seem to match the results we have observed in the past. For example, ERAP2 is not showing a negative expression correlation, the portability results seem strange, etc. Furthermore, the merging and subsequent steps were done in a rush, so some double-checking is required.

The saved tracks (and bins) for these Enformer runs are specified in the pipeline’s config JSON file via the following fields:

"tracks_to_save": "4766,5110,5109,4799,4828,956",

"bins_to_save": -1,

This is parsed by the following function from the batch_utils/collectUtils.py module in our pipeline, with the test code showing the output from parsing:

Code

#| eval: false

def parse_bins_and_tracks(unparsed_bins, unparsed_tracks):
    # Expand a range string like "1-500" into the full list of indices
    def split_range(range_str):
        if "-" in range_str:
            spl_range_str = [int(x) for x in range_str.split("-")]
            return [x for x in range(spl_range_str[0], spl_range_str[1] + 1)]
        else:
            return [int(range_str)]

    # -1 means "keep everything"; None signals this downstream
    if unparsed_bins == -1:
        bins_indices = None
    else:
        bins_indices = []
        bins_split = unparsed_bins.split(",")
        for b in bins_split:
            bins_indices += split_range(b)
        bins_indices.sort()

    if unparsed_tracks == -1:
        tracks_indices = None
    else:
        tracks_indices = []
        tracks_split = unparsed_tracks.split(",")
        for t in tracks_split:
            tracks_indices += split_range(t)
        tracks_indices.sort()

    return bins_indices, tracks_indices

bins_to_save, tracks_to_save = parse_bins_and_tracks(-1, "4766,5110,5109,4799,4828,956")
print(bins_to_save)
print(tracks_to_save)

None

[956, 4766, 4799, 4828, 5109, 5110]

This looks correct. Notice that it sorts the track indices; this sorted order is also the order of the track columns in the output tables.
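In other words, the column-to-track mapping for the saved output tables is just the enumeration of the sorted list:

Code
#| eval: false

# Column index in the saved output table -> original Enformer track index
print(dict(enumerate([956, 4766, 4799, 4828, 5109, 5110])))
# {0: 956, 1: 4766, 2: 4799, 3: 4828, 4: 5109, 5: 5110}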

This information is important, as it is subsequently used to extract the subset of bins and tracks from Enformer’s output that we intend to save. As a reminder, saving all of Enformer’s outputs takes too much storage. For Temi’s work, he needs a small number of bins for all output tracks; for mine, I need all bins for a small number of tracks. We therefore set up a flexible system that supports any combination of the two, including ranges (e.g. 1-500).

Below I test a single output file for correctness. It comes from individual HG00133 and the gene ERAP2. Grepping for ERAP2 in the output table from my current summed-haplotype quantification of the LCL cell line track yields:

5284,ENSG00000164308.17,33.80387878417969,31.365461349487305,31.386741638183597,30.118547439575195,30.08721160888672,28.509597778320312,31.80799102783203,32.5515251159668,30.151880264282227,29.66426086425781,29.612451553344727,30.323780059814453,32.736141204833984,28.742660522460938,30.81435203552246,30.148103713989254,29.80011940002441,31.095882415771484,33.89436340332031,31.719518661499023,30.68146133422852,29.886323928833008,30.87120819091797,34.55549621582031,31.963397979736328,,31.96174049377441,30.99546813964844,30.07516860961914,28.698373794555664,28.166400909423828,28.52982711791992,31.141475677490234,28.236064910888672,33.06477737426758,31.967294692993164,28.425626754760746

Contrast that with the quantification from my original pipeline:

ERAP2,282.89473390579224,262.44551365077496,241.1964790970087,254.7325918674469,252.64414037764072,264.53468346595764,240.1670859158039,246.64309330284595,251.87666906416416,252.65430112183094,249.8701060935855,257.03185357898474,255.67644740641117,268.48601527512074,256.53784973174334,254.54583774507046,253.5440254509449,267.44521643966436,235.52161806076765,268.33800335228443,254.642160885036,257.72059690207243,236.5985001027584,280.03858971595764,257.02142008394003,NA,256.8592912852764,NA,238.57657745480537

These values just don’t seem right, so it’s time to dissect the quantification. First of all, to quantify CAGE expression, we need to know where the TSS sites are located. This is handled by a script I use to parse GFF annotation files. In addition to identifying genes and their locations, it creates protein-coding subsets, as well as a file of TSS sites based on annotated transcripts. It is located at https://github.com/hakyimlab/enformerxcan/blob/main/saideep/enformerxcan/prep_pipeline_input_files/parse_gff.py.
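The sketch below is not the actual parse_gff.py; it just illustrates the core idea, assuming the gencode GFF3 column layout: a transcript’s TSS is its 5′ end, i.e. the start coordinate on the + strand and the end coordinate on the − strand.

Code
#| eval: false

# Hypothetical sketch of TSS extraction from a GFF3 file (not the actual
# parse_gff.py). A transcript's TSS is its 5' end: "start" on the + strand,
# "end" on the - strand.
def extract_tss(gff_path):
    tss_by_gene = {}
    with open(gff_path) as gff:
        for line in gff:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if fields[2] != "transcript":
                continue
            chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
            attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
            tss = start if strand == "+" else end
            tss_by_gene.setdefault((attrs.get("gene_id"), chrom, strand), []).append(tss)
    return tss_by_gene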

From the outputs of this script on the gencode.v42 GFF, the following TSS sites were identified for ERAP2:

   “ENSG00000164308.17 96875986,96876500,96876500,96876506,96879564,96880933,96889217,96892418,96896403,96896572,96913338 +”

with the gene itself being annotated as:

   “chr5 96875986 96919703 ENSG00000164308.17 ERAP2”

Indeed, all the TSS sites fall within the gene.

Next, we need to convert these TSS sites into the bins they fall into within the full Enformer output. This is handled by ____
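Whatever that handler is, the underlying arithmetic should reduce to the following: Enformer’s human head outputs 896 bins of 128 bp covering a 114,688 bp window, so a TSS maps to its offset from the (resized) window start divided by 128. A minimal sketch, assuming the window is the gene interval resized to 114,688 bp with kipoiseq (as in the plotting code below):

Code
#| eval: false

import kipoiseq

BIN_SIZE = 128      # Enformer output resolution (bp per bin)
SEQ_LEN = 114688    # 896 bins * 128 bp

# ERAP2 gene interval, resized to the Enformer prediction window
interval = kipoiseq.Interval("chr5", 96875986, 96919703).resize(SEQ_LEN)

# TSS sites identified above for ERAP2
tss_sites = [96875986, 96876500, 96876500, 96876506, 96879564, 96880933,
             96889217, 96892418, 96896403, 96896572, 96913338]

# Bin index = offset of the TSS from the window start, in 128 bp units
tss_bins = [(tss - interval.start) // BIN_SIZE for tss in tss_sites]
print(tss_bins)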

Let’s load in the old and new results directly from the model:

Code
#| eval: false

import h5py
import numpy as np

# Predictions from the merged pipeline for HG00133, haplotype 1, ERAP2 locus
with h5py.File("/projects/covid-ct/imlab/users/saideep/enformer_all_geuvadis/predictions_folder/saideepDataset_Enformer_geuvadis_expression_all/predictions_2023-02-24/predictions/HG00133/haplotype1/chr5_96875986_96919703_predictions.h5",'r') as h:
    f = h.get("chr5_96875986_96919703")[:]    # Extract the table
    # print(h.keys())
    # print(f[tss_bins,1])
    # print(f)
    # print(np.max(f,axis=0))

    # Column 5 = track 5110 in the sorted track order [956, 4766, 4799, 4828, 5109, 5110]
    lcl_merged = f[:,5]
    print(lcl_merged.shape)

(896,)

What did my old quantification look like?

Code

#| eval: false

import os
import h5py

# Predictions from my original pipeline, stored per individual/region
def read_single_enformer_run(hdf5_file_path, individual, target_region):
    h5_path = os.path.join("individuals", individual, target_region)
    with h5py.File(hdf5_file_path, 'r') as h:
        return {"haplo_1_prediction": h.get(os.path.join(h5_path, "haplo_1_prediction"))[:],
                "haplo_2_prediction": h.get(os.path.join(h5_path, "haplo_2_prediction"))[:]}

hdf5_path = "/grand/covid-ct/imlab/users/saideep/enformer_all_geuvadis/outs_old/enformer_predictions.h5"
individual = "HG00133"
region = "ERAP2"

single_enformer_run = read_single_enformer_run(hdf5_path, individual, region)["haplo_1_prediction"]

lcl_mine = single_enformer_run[:,1]
print(lcl_mine.shape)

(896,)

We can plot the tracks side by side to get a sense of how they compare:

Code
#| eval: false

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import kipoiseq

# Plot each track as a filled curve over the genomic interval
def plot_tracks(tracks, chr, interval_start, interval_end, height=1.5):
    fig, axes = plt.subplots(len(tracks), 1, figsize=(20, height * len(tracks)), sharex=True)
    for ax, (title, y) in zip(axes, tracks.items()):
        ax.fill_between(np.linspace(interval_start, interval_end, num=len(y)), y)
        ax.set_title(title)
        sns.despine(top=True, right=True, bottom=True)
    ax.set_xlabel(chr + ":" + str(interval_start) + "-" + str(interval_end))
    plt.tight_layout()

tracks = {"Merged Pipeline": lcl_merged,
          "My Pipeline": lcl_mine}
ERAP2_interval = kipoiseq.Interval("chr5",  # kipoiseq interval to select the right sequences
                                   96875986,
                                   96919703)
ERAP2_interval_resized = ERAP2_interval.resize(114688)
plot_tracks(tracks, "chr5", ERAP2_interval.start, ERAP2_interval.end)

The predictions look similar, but they are not equivalent. This is definitely strange. A quick numeric check below makes the comparison concrete.
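As a sketch of that check (assuming lcl_merged and lcl_mine from above cover the same window), the correlation and largest absolute difference would quantify how far apart the two tracks are:

Code
#| eval: false

import numpy as np

# How similar are the two 896-bin LCL tracks?
r = np.corrcoef(lcl_merged, lcl_mine)[0, 1]
max_diff = np.max(np.abs(lcl_merged - lcl_mine))
print(f"Pearson r = {r:.4f}, max |difference| = {max_diff:.4f}")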