About me

I am currently a PhD candidate at the University of Chicago studying statistical/computational genetics. I have almost 10 years of academic and work experience at the interface of computer science, data science, and genetics.

My current work revolves around developing new generations of statistical methods for uncovering the genetic influences on disease.

Check out my Quarto site showcasing my progress: Research Blog.

headshot

What have I been up to lately?

One of the events associated with my Genetic Mechanism and Evolution training grant is an annual hackathon. As the training grant supports students with varying levels of computational experience, one of the aims is to improve computational literacy. Given my computational experience, I volunteered to be on the organizing committee. The theme we chose for the hackathon was interactive plotting.

The value of interactive plots is obvious, but the "activation energy" of actually learning to build and deploy them is a real barrier to adoption. We therefore slotted the event for a period of two days, starting with a few short tutorial lectures followed by 10+ hours of "hacking" sessions.

The tutorials we made can be found in this repo. They cover Plotly + Dash in both R and Python, as well as RShiny for R; these are the major interactive plotting libraries practical for researchers.

This was something of an experimental event, and we were very pleased with how it went! People ended up splitting into 4 groups based on their programming language preference. Although we provided a curated dataset to practice with, it turned out to be more natural for people to use their own datasets from personal or research interests. People ended up working with several approaches:

  • Python: Plotly + Dash
  • R: Plotly + Dash
  • R: Rshiny + ggplot
  • R: Rshiny + plotly + ggplot

Interestingly, we found that:

  1. Plotly + Dash are not as well developed in R as in Python, so it is better to use Python when favoring these libraries.
  2. Converting ggplot plots to plotly and then embedding them in RShiny can work, but it adds many layers of abstraction and should be avoided if possible.

Overall, it was an enriching experience, well worth the time off from the daily research grind. On my end, I used some of my spare time to build an interactive RShiny app showcasing some of my recent research results. It is hosted here.

The Argonne Leadership Computing Facility (ALCF) hosts regular workshops and trainings to improve researchers' capacity to adopt AI into their work and to make use of ALCF's vast supercomputing resources. I attended the AI for Science Training Series, a multi-lecture series with homework assignments. Here is the repo.

I do have a pretty solid machine learning/AI background, but I still found the series very useful. For one, ALCF has one of the most powerful GPU clusters in the world. Even though we have been able to utilize these resources ourselves for inference and training, there is still a great deal we don't know about the infrastructure. For example, ALCF also experiments with AI testbeds housing up-and-coming specialized hardware. Some of this hardware, like Groq chips, could proliferate thanks to its efficient inference potential relative to standard NVIDIA A100 GPUs. It was cool to get some exposure to these technologies.

Another nice benefit of this series was that it gave me a chance to revisit the mathematics of transformers, specifically self-attention. I've been pretty busy, so I never really had a chance to sit down and properly derive the math at the core of the hubbub surrounding LLMs. During my Master's coursework at CMU, the standard was to build every neural network architecture from scratch, which at the time primarily meant:

  1. Fully connected layers
  2. Convolutional filters
  3. Recurrent neural nets
It therefore bothered me that I didn't really have the "best" intuitive understanding of self-attention.

Fortunately, self-attention isn't as complicated as it seems. What helped me was to think of attention as analogous to other neural network layers, such as those mentioned above. In essence, they all operate as functions that map inputs to outputs via linear transformations with learned weights. From that perspective, self-attention is just a different way of performing such transformations. What it does offer is a number of desirable properties over other architectures for language-based tasks. I made a little diagram/explainer here!
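To make the "it's just another layer" view concrete, here is a minimal single-head self-attention sketch. This is my own toy NumPy code (variable names and dimensions are mine, not from the series materials): each token's output is a weighted average of all tokens' value vectors, with the weights given by softmax(QKᵀ/√d).

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) learned projections."""
    # Linear projections -- the same kind of weighted map as any other layer.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Pairwise token similarities, scaled to keep gradients well-behaved.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax (shifted by the row max for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted average of all value vectors.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                    # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```

Nothing exotic: two matrix products for the attention weights, one more to mix the values. The interesting part is that the mixing weights depend on the input itself.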

Finally, after completing all the assignments, the ALCF people were so kind as to give me a digital badge for my efforts :).

The UChicago program in Computational Biology (PCB) is an umbrella organization bringing together computational biology research groups from different departments across campus. One of the goals of PCB is to help improve computational literacy across UChicago's huge biomedical research ecosystem.

As part of this, PCB engages with the Software Carpentry organization to run workshops. I have been interested in becoming a certified Software Carpentry instructor, so I decided to volunteer as a helper for the kickstart workshop after COVID (the workshops had been discontinued during the pandemic). Shoutout to Vivaswat (Viv), my friend and colleague, who is a certified instructor and was the primary organizer of this workshop.

The theme was to help beginners refresh their knowledge of R (the most common programming language among biomedical researchers) and to walk through plotting with the ggplot package. These are workhorse tools that let researchers analyze and present their experimental results.

The event itself took place over a single day. Viv is a natural presenter and really smooth at live-coding to an audience. As volunteers, our main goal was to resolve issues along the way so everyone could keep up with the lesson. The experience definitely solidified my interest in becoming a certified instructor. For one, it was obvious how useful the single day was for the participants, most of whom were highly motivated by their own immediate research goals. Interestingly, even for us advanced programmers, so many little tricks were shared that most of us weren't aware of them all! It goes to show that reviewing the basics is never a waste of time, and that teaching is the best way to achieve mastery.

The ASHG annual conference is the first large conference I've been to, and it sure is large!

Despite being in the academic setting for many years and developing extensive research experience, this was somehow the first *actual* conference I attended. There is certainly a duality here. I have plenty of experience with the technical requirements of research: data analysis, writing code, reading papers, and so on. However, attending the conference made me realize how difficult it has always been for me to understand the softer aspects of the research space: how projects get funded, how research directions are prioritized, and, my favorite, who exactly are those people whose names I hear so often?

Skills

Python

My go-to language for nearly everything.


R

This is where I do my statistical analysis and especially data viz!

High Performance Computing

Genetics is full of big data and heavy computation. I have experience on 5+ supercomputing clusters running schedulers such as Slurm and PBS, as well as frameworks for scaling compute such as Parsl.
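A typical day-to-day unit of this work is a scheduler submission script. Here is a generic Slurm sketch; the job name, partition, module, and script names are all placeholders that vary by cluster, not taken from any specific system I use:

```shell
#!/bin/bash
#SBATCH --job-name=genetics-chunk    # placeholder job name
#SBATCH --partition=standard         # partition/account are site-specific
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00

# Module names vary by cluster; this assumes an environment-modules setup.
module load python

# Hypothetical analysis script; pass the allotted CPU count through to it.
python run_analysis.py --threads "$SLURM_CPUS_PER_TASK"
```

Frameworks like Parsl sit one level above this, generating and managing many such jobs programmatically.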

HTML, CSS, JS

Web literacy is critical for deploying outward-facing, responsive data visualizations and web applications. I've built many, many of these over the years with technologies such as RShiny, Plotly, Dash, Flask, Django, and more!


Statistics

Statistical analysis is a regular part of biomedical research and was a major part of my PhD coursework. As a member of a statistical genetics lab and a working bioinformatician, I make statistical rigor part of everything I do.

Machine Learning

From prior coursework, internship experience, and my current thesis project, I have extensive machine learning experience and an in-depth understanding of the mathematics underlying specialty architectures.

Algorithms

I've taken and succeeded in multiple algorithms courses, and subsequently TA'ed the subject. Algorithms training has helped me write effective code and keep a keen eye on the scalability of my pipelines.


Bioinformatics

I have two years of direct work experience as a staff bioinformatician serving labs with 70+ TB of next-generation sequencing data, plus an additional five years of coursework and experience handling such data. I am very comfortable in this setting! The underlying algorithms and probabilistic inference that make next-generation sequencing possible remain, by my estimation anyway, the coolest technology around.

Continuous Learning

I am a fast learner who is constantly picking up new skills. I can quickly adapt to different computational and research settings. You could say it is baked into my DNA ;D.


Contact Info

919-449-6784

gona.saideep1@gmail.com

Design: Tooplate