Yale Institute for Foundations of Data Science, Kline Tower 13th Floor, Room 1327, New Haven, CT 06511
An opportunity for students to match with data science research opportunities presented by Yale faculty.
Opening Remarks & Introduction
by Daniel Spielman
Sterling Professor of Computer Science; Professor of Statistics & Data Science, and of Mathematics
James A. Attwood Director of the Institute for Foundations of Data Science at Yale (FDS)
“Innovating Cardiovascular Care with Multimodality Data Science”
The Cardiovascular Data Science (CarDS) Lab at Yale leverages advances in deep learning and AI to enhance and automate care. The work uses numerous data streams in the electronic health record and focuses on natural language processing, federated learning, signal processing, and computer vision for enhanced inference, and develops and deploys novel convolutional neural networks and transformer models to address care challenges. The experience is ideal for students interested in health tech and/or medicine who are looking for a longitudinal research experience.
Senior Research Scientist, School of the Environment
Director of Data Science, Yale Program on Climate Change Communication
Lecturer, Department of Molecular, Cellular and Developmental Biology
email@example.com | https://environment.yale.edu/profile/jennifer-marlon
“Using paleofire records and global fire simulations to understand wildfire responses to climate change and human activities”
Jennifer Marlon, Nicholas O’Mara, Carla Staver
Over the last several years unusually large and severe wildfires have devastated communities and wildlife and transformed ecosystems around the globe. This project reconstructs and analyzes long-term fire and vegetation records from ice and lake sediment cores for comparison with dynamic global fire model simulations. We seek a data analyst/database engineer to help develop the paleofire records and the SQL database that will house them. The research assistant (RA) will use R and SQL to generate composite records of regional to global wildfire activity spanning thousands of years of Earth’s history. The RA will have the opportunity to participate in bi-weekly project meetings, to present scientific results to a team of international, interdisciplinary collaborators, and to co-author peer-reviewed publications.
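The composite-record step described above can be sketched in miniature. The following is an illustrative Python sketch only, with SQLite standing in for the project's R/SQL workflow: the site names, ages, and charcoal-influx values are invented, and the project's actual schema and standardization choices may differ. The idea shown is the common paleofire approach of standardizing each site's record (z-scores) and then averaging across sites within each age bin.

```python
import sqlite3
from statistics import mean, stdev

# Toy charcoal-influx records (site, age in years BP, influx value);
# schema and values are illustrative, not the project's actual database.
records = [
    ("lake_a", 1000, 2.1), ("lake_a", 2000, 3.5), ("lake_a", 3000, 1.2),
    ("lake_b", 1000, 0.4), ("lake_b", 2000, 0.9), ("lake_b", 3000, 0.2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE charcoal (site TEXT, age_bp INTEGER, influx REAL)")
conn.executemany("INSERT INTO charcoal VALUES (?, ?, ?)", records)

# Standardize each site's record (z-scores) so sites with different
# absolute influx levels can be composited on a common scale.
zscores = {}  # (site, age) -> z-score
for (site,) in conn.execute("SELECT DISTINCT site FROM charcoal"):
    rows = conn.execute(
        "SELECT age_bp, influx FROM charcoal WHERE site = ?", (site,)
    ).fetchall()
    vals = [v for _, v in rows]
    mu, sd = mean(vals), stdev(vals)
    for age, v in rows:
        zscores[(site, age)] = (v - mu) / sd

# Composite: average the z-scores across sites within each age bin.
ages = sorted({age for _, age in zscores})
composite = {
    age: mean(z for (s, a), z in zscores.items() if a == age) for age in ages
}
for age in ages:
    print(age, round(composite[age], 3))
```

In the real workflow the age bins would come from the cores' age-depth models and the composite would be smoothed, but the standardize-then-average structure is the same.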
“MCMC methods for pooled testing”
In pooled or group testing, which rose to prominence during the recent COVID-19 pandemic, one tests subsets of a population with the goal of detecting the subset of infected individuals using as few tests as possible. One of the simplest yet information-theoretically optimal (in terms of the total number of tests used) testing procedures is to choose the individuals participating in each test independently at random; this is a simple implication of the so-called probabilistic method. Yet, despite the simplicity of this procedure, several natural computationally efficient recovery procedures have been mathematically proven to require a larger number of tests.
Interestingly, MCMC methods have shown intriguing success in (small-scale) simulations, yet they have never been mathematically analyzed for this setting. This project, as part of a broader goal of building tools to analyze MCMC methods for statistical tasks, aims to understand the performance of natural MCMC methods for this important group testing scheme, empirically at large scale and, ideally, with mathematical guarantees.
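A minimal sketch of the setup may help fix ideas; this is an illustration written for this program, not the project's code, and the population size, pool probability, and sampler parameters are all invented. It draws the random Bernoulli pooling design described above and then runs a simple Metropolis sampler over candidate infected sets, penalizing candidates by the number of test outcomes they violate.

```python
import math
import random

random.seed(0)

n, k, T = 30, 3, 20  # individuals, infected, tests (illustrative sizes)
truth = set(random.sample(range(n), k))

# Random pooling design: each individual joins each test independently
# at random (here with probability roughly ln(2)/k).
p = 0.693 / k
pools = [[i for i in range(n) if random.random() < p] for _ in range(T)]
outcomes = [any(i in truth for i in pool) for pool in pools]  # noiseless tests

def violations(s):
    """Number of tests whose outcome disagrees with candidate infected set s."""
    return sum(any(i in s for i in pool) != out
               for pool, out in zip(pools, outcomes))

# Metropolis over k-subsets: propose swapping one member in/out, accept
# with probability exp(-beta * increase in violated tests).
beta = 2.0
state = set(random.sample(range(n), k))
for _ in range(20000):
    out_i = random.choice(sorted(state))
    in_i = random.choice([i for i in range(n) if i not in state])
    proposal = (state - {out_i}) | {in_i}
    delta = violations(proposal) - violations(state)
    if delta <= 0 or random.random() < math.exp(-beta * delta):
        state = proposal
    if violations(state) == 0:
        break

print("recovered:", sorted(state), "violations:", violations(state))
```

At these toy sizes the chain quickly reaches a candidate set consistent with all test outcomes; the project's questions concern what happens at large scale and whether such behavior can be established rigorously.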
“Training Large Language Models for Price Negotiation”
Price negotiation in academia is mostly examined within the field of economics, in environments in which each party to the negotiation has a simple set of moves available: accept or reject the offer made, or counter-offer a price. In this study, we aim to take a step further and train models for negotiation in environments in which each party's move entails generating text that not only contains an offer but also supports it with information and reasoning. An important aspect of our objectives in training LLMs for this task is that they learn the game-theoretic aspects. To illustrate, a seller LLM that has information indicating its product is of high value is expected to share that information as part of its offer, while a seller that knows its product is of lower quality is expected to remain silent about quality. In the initial stages of the project, we will train LLMs for simpler tasks and build toward the ultimate goal of price negotiation over time.
“Neural representation of threat”
In this project, we have recorded from large numbers of neurons in the mouse prefrontal cortex as a mouse navigates through the environment. These optical recordings of neurons can be used to infer the animal's level of threat perception in virtual environments with differing levels of safety. The neural representation can then be used to predict behavior, while accounting for other variables such as arousal, locomotion, and other task-related measures. Thus, a student interested in working on this project can apply nonlinear dimensionality reduction and ML approaches to understand how neurons encode information about emotion-related variables in the world.
“Physics-informed neural operators for fast prediction of multiscale systems”
High-fidelity simulations like direct numerical simulation (DNS) of turbulence and molecular dynamics (MD) of atomistic systems are computationally very expensive and data-intensive. Furthermore, for multiscale problems, the microscale component is so expensive that it has stalled progress in simulating time-dependent atomistic-continuum systems. These open issues, in turn, have delayed progress in forecasting real-time dynamics in critical applications such as autonomy, extreme weather, and the efficient design of new functional materials. Scientific machine learning (SciML) has the potential to reverse this inefficient paradigm and significantly accelerate scientific discovery, with direct impact on technology in the coming decades. We propose to develop a new generation of neural operators (universal approximators for operators) that can learn explicit and implicit operators from data alone. To this end, we need to extend the predictability of neural operators to unseen out-of-distribution inputs and to speed up the training process via high-performance, multi-GPU computing. We will endow neural operators with physics, multifidelity data, and equivariance principles (e.g., geometric equivariance and conservation laws) for continuum systems, and with seamless coupling for hybrid continuum-molecular systems, where neural operators will replace the expensive molecular component.
Anthony N. Brady Professor of Pathology; Department of Pathology, Yale School of Medicine; Department of Immunobiology
Project presented by Gisela Gabernet, Associate Research Scientist at the Kleinstein Lab
firstname.lastname@example.org | https://medicine.yale.edu/lab/kleinstein/
“Identifying convergent antibody responses across infections and auto-immune diseases”
The development of antibodies that target and neutralize pathogens is an important facet of the adaptive immune response to foreign pathogens. Antibodies are generated through the recombination of Variable, Diversity and Joining gene segments at the DNA level, with additional targeted mutations that generate a theoretical antibody diversity of 10^14 unique sequences. Despite this high diversity, biases in the usage of these gene segments, and even antibodies with high overall sequence similarity (termed convergent antibodies), have been observed across cohorts of patients after an immune challenge such as vaccination, infection, or auto-immune disease. Convergent antibodies have been described to target conserved epitopes of rapidly mutating pathogens such as HIV and influenza, showing potential for the development of broadly protective vaccines. They have also been observed in auto-immune diseases, where they could potentially serve as diagnostic and monitoring markers. In our lab, we have developed a high-throughput analysis pipeline that enables the efficient processing of antibody repertoires of individual cohorts (https://nf-co.re/airrflow). This project will aim at benchmarking and improving current convergent-antibody detection methods as well as visualizations. One potential approach will involve modelling the antibody sequences as a network of sequence similarity and identifying regions in the network shared across multiple subjects.
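The sequence-similarity network idea mentioned above can be sketched as follows. This is a toy illustration, not the lab's pipeline: the junction sequences and subjects are invented, the similarity rule (normalized Hamming distance between same-length sequences, with an arbitrary threshold) is one common choice among several, and real repertoires would also condition on V/J gene usage.

```python
# Toy antibody junction (CDR3-like) sequences per subject; illustrative only.
repertoires = {
    "subject1": ["CARDYW", "CARDFW", "CTRGGW"],
    "subject2": ["CARDYW", "CSSSSW"],
    "subject3": ["CARDFW", "CTRGGF"],
}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Nodes are (subject, sequence); connect same-length sequences whose
# normalized Hamming distance is at most a similarity threshold.
nodes = [(s, seq) for s, seqs in repertoires.items() for seq in seqs]
threshold = 0.2

# Union-find to group connected nodes into clusters.
parent = {n: n for n in nodes}

def find(n):
    while parent[n] != n:
        parent[n] = parent[parent[n]]  # path halving
        n = parent[n]
    return n

def union(a, b):
    parent[find(a)] = find(b)

for i, (s1, a) in enumerate(nodes):
    for s2, b in nodes[i + 1:]:
        if len(a) == len(b) and hamming(a, b) / len(a) <= threshold:
            union((s1, a), (s2, b))

# A cluster spanning several subjects is a candidate convergent group.
clusters = {}
for n in nodes:
    clusters.setdefault(find(n), []).append(n)
convergent = [c for c in clusters.values() if len({s for s, _ in c}) > 1]
for c in convergent:
    print(sorted(c))
```

Benchmarking, as the abstract notes, would then ask how detection of such shared clusters behaves as thresholds, distance measures, and cohort sizes vary.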
“Predict the progression of Parkinson’s Disease”
Parkinson’s Disease (PD) is the fastest growing neurodegenerative disease in the world. PD is also heterogeneous: different patients progress at different rates along different trajectories. Predicting patient-specific progression of PD is critical in treating the disease and in shortening the length of clinical trials for new PD therapies. Currently, there are no reliable methods to predict PD progression. The goal of this research is to use a large dataset of PD patients to predict progression from baseline data. The dataset has images, clinical scores, wearables data, lab reports, and genetic information. The challenge is to use this heterogeneous data to create an accurate prediction model. All methods (frequentist, Bayesian, deep learning) are welcome.
“Using Machine Learning to understand the language of biology”
Recent advances in large language models provide new opportunities for decoding biology. Single-cell omics data encodes complex cellular behaviors and processes into high-dimensional molecular profiles. By treating these data as textual representations, we can apply and fine-tune neural language models to uncover the underlying grammatical rules governing biological systems. We have demonstrated that these models can learn to translate between species, matching cell types and gene expression programs between mice and humans in a completely unsupervised fashion. This cross-species translation highlights how fundamental aspects of biology form a universal language translatable across organisms. More broadly, interpreting single cell data as “biological text” enables leveraging powerful natural language processing approaches to find patterns, generate hypotheses, and gain conceptual understanding of biology.
“What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization”
Large language models demonstrate an in-context learning (ICL) ability, i.e., they can learn from a few examples provided in the prompt without updating their parameters. In this project, we conduct a comprehensive study of ICL, addressing several open questions:
(a) What type of ICL estimator is learned within language models?
(b) What are the suitable performance metrics to evaluate ICL accurately, and what are their associated error rates?
(c) How does the transformer architecture facilitate ICL?
To address (a), we adopt a Bayesian perspective and demonstrate that ICL implicitly implements the Bayesian model averaging algorithm. This Bayesian model averaging algorithm is shown to be approximated by the attention mechanism. For (b), we analyze ICL performance from an online learning standpoint and establish a sublinear regret bound. This shows that the error diminishes as the number of examples in the prompt increases. Regarding (c), beyond the encoded Bayesian model averaging algorithm in the attention mechanism, we reveal that during pretraining, the total variation distance between the learned model and the nominal model is bounded by the sum of an approximation error and a generalization error.
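The Bayesian model averaging view in (a) and the regret statement in (b) can be written schematically; the notation below is a sketch written for this program, not the presenters' exact formulation. With latent models $m$, prompt examples $z_{1:t}$, and a new query $x_{t+1}$, the ICL prediction averages over models weighted by their posterior given the prompt:

```latex
\[
  \mathbb{P}\bigl(y_{t+1} \mid z_{1:t}, x_{t+1}\bigr)
  = \sum_{m} \mathbb{P}\bigl(y_{t+1} \mid m, z_{1:t}, x_{t+1}\bigr)\,
    \mathbb{P}\bigl(m \mid z_{1:t}\bigr),
\]
% and a sublinear regret bound over T prompt examples, as in (b), means
\[
  \mathrm{Regret}(T) = o(T),
  \qquad \text{so} \quad
  \frac{\mathrm{Regret}(T)}{T} \longrightarrow 0
  \quad \text{as } T \to \infty,
\]
```

i.e., the average per-example error diminishes as the number of in-context examples grows.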
Our findings aim to offer a unified understanding of the transformer and its ICL capability, with bounds on ICL regret, approximation, and generalization. This deepens our comprehension of these crucial facets of modern language models and illuminates advanced prompt methodologies for tackling more complex reasoning tasks.
“Exploring Post Hoc Interpretation of Representations for Unstructured Data”
In recent years, deep learning has become the prevailing solution for decision-making tasks involving unstructured data such as images and text. The effectiveness of any predictive task on unstructured data hinges on the quality of its representation in the latent space, often referred to as embeddings. In essence, the central question is whether an insightful representation of unstructured data can be obtained, one that captures the information relevant to downstream tasks. Our objective is to explore post hoc interpretation of these representations across various domains, including business and medical data. We seek to uncover the information encoded in latent representations and trace it back to the informational cues present in the training data. Part of this work is sponsored by the NSF and is conducted in close collaboration with the Mayo Clinic.
Refreshments will be served