FDS Data Science Project Match

Tuesday, August 27, 2024    
2:30PM – 4:00PM
Luce Hall Auditorium
34 Hillhouse Avenue, Room 101
New Haven, CT 06511

The FDS Data Science Project Match, hosted by the Yale Institute for Foundations of Data Science (FDS), is an opportunity for Yale faculty from any department or school within the university to connect with talented students from the departments of Statistics and Data Science, Applied Mathematics, and Computer Science. In a series of lightning-round talks, faculty will have exactly five minutes to pitch a current research problem, aiming to team up with students interested in tackling complex data challenges. This event facilitates collaboration on current research projects, offering a platform for faculty to present their data-driven initiatives and find skilled undergraduate and/or graduate students eager to contribute. It’s also a wonderful way to learn about the research of many Yale faculty.

Photo: Jennifer Marlon presenting at the 2023 FDS Project Match

Presenters:

Hoon Cho
Assistant Professor of Biomedical Informatics & Data Science and of Computer Science
hoon.cho@yale.edu | https://hcholab.org

“Secure and federated estimation of genetic risk models”

Predicting a person’s genetic risk for various health conditions is an essential task in personalized medicine. Bayesian models have been developed to tease apart causal signals from millions of genetic variants across the genome for risk prediction. However, accurate estimation of these models requires access to large datasets representing diverse populations, which are often challenging to put together due to privacy concerns. This project’s goal is to develop a privacy-preserving algorithm for estimating genetic risk models over a distributed collection of datasets in a federated manner. A key challenge is designing an efficient distributed method that minimizes the cost of cryptographic operations introduced for privacy protection. This work is expected to integrate ideas from several domains, such as MCMC methods, distributed sampling, secure computation, and population genetic models. In addition, the student will gain experience in genomic data analysis.

Phillip Atiba Solomon f.k.a. Goff 
Chair and Carl I. Hovland Professor of African American Studies and Professor of Psychology 
phillip.solomon@yale.edu | https://policingequity.org/about/team/executive-leadership/staff/dr-phillip-atiba-goff

  • Project presented by Justin Feldman, Principal Research Scientist
“A Model-Based National Estimate of Police Use-of-Force”

In the United States, police regularly use force against civilians. Prior research in this area has focused largely on lethal force; far less is known about the vast majority of incidents, which do not result in death. This project aims to generate the first national estimate of the annual number of use-of-force incidents in the US, overall and by race/ethnicity. Leveraging use-of-force data from thousands of agencies and a national dataset of predictors, we will use machine learning methods to predict incident counts for the entire US.

Nils Rudi
Professor of Operations Management, Yale School of Management
nils.rudi@yale.edu

“Probability Modeling & Inference”

TBD

Shreya Saxena
Assistant Professor of Biomedical Engineering, Yale School of Engineering and Applied Science
shreya.saxena@yale.edu | https://www.saxenalab.org

“Quantifying Across-Subject and Social Behavior”

Nicholas A. Christakis
Sterling Professor of Social and Natural Science
nicholas.christakis@yale.edu | https://nicholaschristakis.net
Lab website: https://humannaturelab.net/

“Social Networks, Microbiome and Mental Health in Rural Honduras”

The Human Nature Laboratory (led by Professor Nicholas Christakis) has a rich longitudinal dataset of socio-centric signed graphs (networks of friends and foes) and the characteristics of the people embedded in them. Our team is currently working on a project examining the sociobiological correlates of mental health and human flourishing in 19 rural Honduran villages. We have collected biological data (shotgun sequenced gut microbiome), socio-centric data (signed graphs), and self-reported health information. We aim to describe the socio-biological correlates of tie formation (relationships) and how these are informed by homophily in mental health/flourishing, microbiome sharing, multiplex ties (kin, friendship, neighbors), and the structural characteristics of the networks. We are a cross-disciplinary team that mixes creativity, data science, and empirical work.

Rohini Pande
Henry J. Heinz II Professor of Economics and Director of the Economic Growth Center
rohini.pande@yale.edu | https://campuspress.yale.edu/rpande/

  • Presented by Jenna Allard, Assistant Director of Research and Policy, Inclusion Economics at Yale University
    jennifer.allard@yale.edu
“Predicting Additionality in Payments for Ecosystem Services”

Payments for ecosystem services (PES) help conserve biodiversity and store carbon in ecosystems worldwide. They follow a simple economic logic: compensate landowners, often in low-income countries, for conserving ecosystems that provide global benefits. Central to designing PES programs is ensuring that they yield “additionality,” or conservation that would not have occurred without payments. We are partnering with the government of Meghalaya, India to optimize a PES program that pays landowners to preserve Meghalaya’s extensive tropical forests. We are looking for someone with advanced Python and machine-learning knowledge to support our project, in which we are helping the government identify the forest land most likely to be deforested in the absence of PES payments. Targeting recruitment to this vulnerable land would maximize additionality, and thus program effectiveness. Currently, we are working to implement an existing codebase for a convolutional neural network that predicts deforestation risk, and we seek support in implementing and then improving on this codebase. We will then test in a randomized controlled trial whether embedding these predictions into PES recruitment can make Meghalaya’s program more effective at conserving forests. Ideally, skills should include advanced knowledge of machine learning and neural networks in Python, and some familiarity with spatial prediction and remote-sensing data.

María P. Angel, PhD
Resident Fellow | Information Society Project (ISP), Yale Law School 
maria.angel@yale.edu | https://mariapangel.com

“Understanding the concept of ‘commercial surveillance’ in public policy discussions about consumer privacy”

Looking to eventually issue a Trade Regulation Rule on consumer privacy, in 2022 the Federal Trade Commission (FTC) published an Advance Notice of Proposed Rulemaking on Commercial Surveillance and Data Security. There, the FTC invited the public to submit comments on a wide range of topics, including personalized or targeted advertising, algorithmic discrimination, biometric technologies and persistent identifiers, and algorithmic decision-making. In this study, we want to analyze the information submitted by multiple stakeholders (e.g., tech companies, trade associations, non-profit organizations, academics, individuals) to figure out what “commercial surveillance” means in public policy discussions about consumer privacy. We seek a data analyst to help us extract valuable insights from the 1255 public comments received by the FTC by employing sentiment analysis, topic identification, keyword extraction, content categorization, document summarization, and other natural language processing techniques. This experience is ideal for students interested in tech policy and seeking to apply their knowledge of machine learning for document analysis to a real-world case study.

John Sous
Assistant Professor, Department of Applied Physics and the Energy Sciences Institute
john.sous@yale.edu | https://www.johnsous.com/

“How do transformers reason about math?”

Transformers are the basis for large language models (LLMs). Despite intense research activity, the mechanics (and physics) of how transformers operate are not well understood. This project aims to contribute to the emerging field of mechanistic interpretability and address the problem of how transformers learn composition in arithmetic tasks. The goal is to identify how the interplay of memorization via induction heads and generalization via grokking, if operative at all, results in the composition of mathematical operations.

Quanquan C. Liu
Assistant Professor, Department of Computer Science
quanquan.liu@yale.edu | https://quanquancliu.com/

“Temporal Unlinking Predictions: Are Our Current Datasets Sufficient?”

Link prediction involves predicting the emergence of new connections in a network (like a social network) over time. Such predictions have various applications, including identifying protein interactions in bioinformatics or targeting customers in ad marketing. A related task is link deletion prediction, which predicts the disappearance of existing links in temporal networks. While link prediction benefits from an abundance of available datasets, there are currently few temporal graph datasets that contain link deletions; hence, it is a challenge to evaluate link deletion prediction algorithms empirically. 

Current methods for link deletion prediction rely on complex techniques like random walks and graph neural networks (GNNs). This project aims to explore simpler approaches using linear regression and related techniques like LASSO and ridge regression on existing datasets. We’ll compare these simpler methods to sophisticated ones. Preliminary findings suggest that simpler methods may be comparable in terms of accuracy (and sometimes more accurate). The first goal of this project is to formulate and comprehensively evaluate simple approaches using linear regression (and related techniques) on existing temporal graph datasets. Then, the next step involves acquiring richer datasets that better capture the nuances of real-world network dynamics. Future directions include obtaining more comprehensive temporal graph datasets that offer a more representative depiction of how relationships and interactions evolve in real-world networks.

David van Dijk
Assistant Professor of Medicine, Yale School of Medicine
Assistant Professor of Computer Science 
david.vandijk@yale.edu | https://vandijklab.org

“Exploring (Artificial) Intelligence through Cellular Automata”

Hattie Chung
Assistant Professor of Medicine and of Molecular, Cellular and Developmental Biology, Yale School of Medicine
hattie.chung@yale.edu | https://www.hattiechunglab.bio/

“Understanding the principles of tissue organization”

Faidra Monachou
Assistant Professor of Operations Management, Yale School of Management
faidra.monachou@yale.edu | https://faidramonachou.github.io/

“Designing Recommendation Systems for College Applications”

TBA

Steven Kleinstein
Anthony N Brady Professor of Pathology, Yale School of Medicine
Co-Director of Graduate Studies, Computational Biology and Bioinformatics
steven.kleinstein@yale.edu | https://medicine.yale.edu/lab/kleinstein/ 

  • Presented by Visiting Research Scientist Gur Yaari, Professor in the Faculty of Engineering of Bar Ilan University, Israel. gur.yaari@yale.edu
“Developing a Statistical Framework to Quantify Gene Set Enrichment in Single-Cell Sequencing Data”

High-throughput RNA/DNA sequencing, especially single-cell sequencing, has transformed biomedical research. This advancement creates a need for advanced statistical methods to interpret these often sparse and noisy data sets. One key aspect researchers analyze is the enrichment score for predefined sets of features, like genes. These gene set enrichment scores are crucial for understanding the biological mechanisms behind various clinical conditions.

We previously developed a method called qusage to quantify gene set differential expression including gene-gene correlations. While widely used, qusage was originally designed with assumptions that don’t fit single-cell data well. In this project, we will adapt qusage for single-cell count data, focusing on more accurate and efficient ways to estimate gene-gene correlations for different cell types. Our goal is to publish an improved method that the research community will enthusiastically adopt.

Bhramar Mukherjee
Senior Associate Dean for Data Science and Data Equity 
Anna MR Lauder Professor of Biostatistics
Professor of Epidemiology (Chronic Disease)
Professor of Statistics and Data Science (Secondary)
Yale School of Public Health
bhramar.mukherjee@yale.edu | https://ysph.yale.edu/profile/bhramar-mukherjee/

“The data struggle of the unseen: Unveiling selection bias in scientific studies and prediction algorithms”

TBD

Mark Gerstein 
Albert L Williams Professor of Biomedical Informatics 
Professor of Molecular Biophysics & Biochemistry, of Computer Science, and of Statistics & Data Science 
mark@gersteinlab.org | http://gersteinlab.org

  • Project presented by Joel Rozowsky, Research Scientist in Molecular Biophysics and Biochemistry, Gerstein Lab. joel.rozowsky@yale.edu
“Genomics & Bioinformatics Research in the Gerstein Lab”

The Gerstein lab conducts bioinformatics research in the biomedical and genomic fields. We use various computational analytics methods, including machine learning techniques, to analyze large biomedical datasets. The lab focuses in particular on the following areas of research: genomic privacy, personal genomes, genome annotation, and neurogenomics.


Not presented, but projects available:

Arianna Salazar Miranda
Assistant Professor of Urban Planning and Data Science, Yale School of the Environment
arianna.salazarmiranda@yale.edu | https://environment.yale.edu/directory/faculty/arianna-salazar-miranda

“Measuring Grey Spaces to Assess the Greenspace Potential of Cities”

With the increasing threats from climate change, cities need to rethink how they use space to protect people from rising environmental risks. This project focuses on identifying and assessing the potential of underutilized gray spaces, such as parking lots, to be transformed into climate-resilient areas. By applying computer vision techniques to satellite and street-view imagery from cities across the globe, the goal is to map these gray spaces, assess their potential for transformation, and quantify the resulting benefits, such as reduced urban heat and flood risks.

Students will gain hands-on experience in spatial data analysis, computer vision, and machine learning while contributing to solutions for climate resilience and sustainable urban development.

Requisite Skills and Qualifications:

Ideal candidates should have a good understanding of GIS and be proficient in handling large datasets. Experience with Python and computer vision techniques would be beneficial.

Winnie van Dijk
Assistant Professor, Department of Economics
Winnie.vandijk@yale.edu | www.winnievandijk.com

“Using Newspaper Articles to Track Local Legislative Changes”

Municipal laws address issues that are salient to local communities. They typically regulate matters such as zoning, local traffic rules, noise ordinances, building codes, and the use of public spaces, directly impacting residents’ day-to-day lives. However, unlike for federal and state-level laws, no database exists to study changes to municipal legislation. It is difficult to create a snapshot of current municipal codes because the relevant information is dispersed, and this problem is compounded when trying to reconstruct their history further back in time. In this project, we will try to reconstruct the history of certain types of municipal ordinances by analyzing historical newspaper articles. We will extract and analyze the text from a collection of digitized newspaper articles to determine whether the articles describe specific types of municipal ordinances that were in place, were being proposed, or were being repealed at the time. Based on this information, we will try to construct a history of local ordinances for U.S. cities.

Karen C. Seto
Frederick C. Hixon Professor of Geography & Urbanization Science
Faculty Director, Hixon Center for Urban Sustainability 
Co-Director, Yale Center for Geospatial Solutions
U.S. National Academy of Sciences
Council on Foreign Relations
Yale University, Yale School of the Environment
Karen.seto@yale.edu | http://urbanization.yale.edu

“Central Park Climate Lab”

This project uses medium- to high-resolution satellite imagery to map urban natural areas and urban parks in the U.S., and to assess the impact of climate change on them. The RA will conduct research by collecting and processing time-series medium- to high-resolution satellite data (such as Landsat, Sentinel-2, and PlanetScope) and applying deep learning models for information extraction.

Preferred experience: geospatial data and methods, including satellite and GIS data (or an interest in learning them), spatial statistics, machine learning, and coding (Python and R preferred).

Here is a brief news report about the project.


Event Details:

  • Format: Faculty will present their research projects in a series of lightning-round talks, outlining the complex data problems they are tackling.
  • Audience: The event is targeted at students from Applied Mathematics, Computer Science, Mathematics, and Statistics & Data Science who are interested in gaining hands-on research experience.
  • Purpose: The primary goal is to match faculty with students who have the expertise and interest in working on these data-intensive projects, fostering academic collaboration and mentorship.

Benefits for Participants:

  • Faculty: Gain access to a pool of motivated and skilled students who can bring fresh perspectives and technical expertise to your research projects.
  • Students: Discover exciting research opportunities, apply your data science skills to real-world problems, and build valuable relationships with faculty mentors.
  • Attendees: Learn about research projects from all corners of Yale University in a fast, fun format.

How to Participate:

  • Faculty: If you have a project that could benefit from student support, please reach out to FDS Associate Director Emily Hau.
  • Students: Attend the event to hear about various research opportunities and express your interest in projects that align with your skills and academic goals.

Past Events:

For more information on the structure and outcomes of previous Project Match events, including abstracts and presentation plans, please visit:


Join us at the FDS Project Match to forge new collaborations, push the boundaries of data science research, and make meaningful contributions to the academic community. Please join our mailing list for future announcements.
