BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//wp-events-plugin.com//7.2.3.1//EN
TZID:America/New_York
X-WR-TIMEZONE:America/New_York
BEGIN:VEVENT
UID:555@fds.yale.edu
DTSTART;TZID=America/New_York:20230829T140000
DTEND;TZID=America/New_York:20230829T150000
DTSTAMP:20250916T142124Z
URL:https://fds.yale.edu/events/data-science-project-match-2/
SUMMARY:Data Science Project Match
DESCRIPTION:An opportunity for students to match with data science research
  opportunities presented by Yale faculty.\n\n\nOpening Remarks & Introduct
 ion \n\n\n\nby Daniel Spielman\nSterling Professor of Computer Science\;
  Professor of Statistics & Data Science\, and of Mathematics\nJames A.
  Attwood Director of the Institute for Foundations of Data Science at
  Yale (FDS)\n\n
 \n\nProject Presentations\n\n\n\nRohan Khera\, MD\, MS\nDirector\,
  Cardiovascular Data Science (CarDS) Lab\nAssistant Professor\,
  Cardiovascular Medicine\, Yale School of Medicine\n
 rohan.khera@yale.edu | CarDS-Lab.org\n\n\n\n"Innovating Cardiovascular
  Care with Multimodality Data Science"\nThe Cardiovas
 cular Data Science (CarDS) Lab at Yale leverages advances in deep learning
  and AI to enhance and automate care. The work uses numerous data streams 
 in the electronic health record and focuses on natural language processing
 \, federated learning\, signal processing\, and computer vision for enhanc
 ed inference\, and develops and deploys novel convolutional neural network
 s and transformer models to address care challenges. The experience is ide
 al for students interested in health tech and/or medicine and looking to g
 ain from a longitudinal research experience.\n\n\n\nJennifer Marlon\n
 Senior Research Scientist\, School of the Environment\n
 Director of Data Science\, Yale Program on Climate Change Communication
 \nLecturer\, Department of Molecular\, Cellular and Developmental Biology
 \njennifer.marlon@yale.edu |
  https://environment.yale.edu/profile/jennifer-marlon\n\n\n\n
 “Using paleofire records and global fire simulations to understand
  wildfire responses to climate change and human activities”\n
 Jennifer Marlon\, Nicholas O'Mara\, Carla Staver\nOver the last several
  years unusually large and severe wildfires h
 ave devastated communities and wildlife and transformed ecosystems around 
 the globe. This project reconstructs and analyzes long-term fire and veget
 ation records from ice and lake sediment cores for comparison with dynamic
  global fire model simulations. We seek a data analyst/database engineer t
 o help develop the paleofire records and the SQL database that will house 
 them. The research assistant (RA) will use R and SQL to generate composite
  records of regional to global wildfire activity spanning thousands of yea
 rs of Earth’s history. The RA will have the opportunity to participate i
 n bi-weekly project meetings\, to present scientific results to a team of 
 international\, interdisciplinary collaborators\, and to co-author peer-re
 viewed publications.\n\n\n\nIlias Zadik\nAssistant Professor\,
  Department of Statistics and Data Science\nIlias.zadik@yale.edu |
  https://iliaszadik.github.io/\n\n\n\n"MCMC methods for pooled testing"\n
 In pooled or group testin
 g\, which was of high importance during the recent COVID-19 pandemic\,
  one tests subsets of a population of individuals with the goal of
  detecting the subset of infected ones using as few tests as possible.
  One of the simplest yet information-theoretically optimal (in terms of
  the total number of tests used) such testing procedures is to choose
  the individuals participating in each test independently at random.
  This is a simple implication of the so-called probabilistic method.
  Yet\, besides th
 e simplicity of its procedure\, multiple natural computationally efficient
  procedures have been mathematically proven to require a larger numbe
 r of tests. Interestingly\, MCMC methods have never been mathematically an
 alyzed for this setting and have shown intriguing success in (small scale)
  simulations. This project\, as part of a general goal of building
  tools to a
 nalyze MCMC methods for statistical tasks\, aims to understand (empiricall
 y in large scale and ideally mathematically establish) the performance of 
 natural MCMC methods for this important group testing scheme.\n\n\n\n
 Soheil Ghili\nAssistant Professor of Marketing\, School of Management\n
 soheil.ghili@yale.edu | https://sites.google.com/view/soheil-ghili/
 \n\n\n\n“Training Large Language Models for Price Negotiation”\n
 Price negotiation in academ
 ia is mostly examined within the field of economics and in environments in
  which each party to the negotiation has a simple set of moves available: 
 accept/reject the offer made\, or counter-offer a price. In this study\, w
 e aim to take a step further and train models for negotiation in environme
 nts in which each party’s moves entail generating text that not only co
 ntains an offer\, but also supports it with information and reasoning. An 
 important aspect of our objectives in training LLMs for this task is that 
 they learn the game theoretical aspects. To illustrate\, a seller LLM that
  has info indicating its product is of high value is expected to share tha
 t info as part of its offer\, while a seller that knows its product has lo
 wer quality is expected to remain silent about the quality aspect. In the 
 initial stages of the project\, we will try to train LLMs for simpler task
 s\; and we will build toward the ultimate goal of price negotiation over t
 ime.\n\n\n\nAlfred P. Kaye\, MD PhD\nAssistant Professor\, Department of
  Psychiatry\, Yale University School of Medicine\nalfred.kaye@yale.edu |
  https://www.kayelab.com/\n\n\n\n"Neural representation of threat"\nIn
  this project
 \, we have recorded from large numbers of neurons in the mouse prefrontal 
 cortex as a mouse navigates through the environment. These optical recordi
 ngs of neurons can be used to infer the animal's level of threat perceptio
 n in virtual environments with differing levels of safety. The neural repr
 esentation can then be used to predict behavior\, while accounting for oth
 er variables such as arousal\, locomotion\, and other task-related measure
 s. Thus\, a student interested in working on this project can apply nonlin
 ear dimensionality reduction and ML approaches to understand how neurons e
 ncode information about emotionally related variables in the world.\n\n\n\
 nLu Lu\nAssistant Professor of Statistics and Data Science\n
 Lu.lu@yale.edu | https://lu.seas.upenn.edu\n\n\n\n
 “Physics-informed neural operators for fast prediction of multiscale
  systems”\nHigh-fidelity simulations like dire
 ct numerical simulation (DNS) of turbulence and molecular dynamics (MD) of
  atomistic systems are computationally very expensive and data-intensive. 
 Furthermore\, for multiscale problems\, the microscale component is so exp
 ensive that it has stalled progress in simulating time-dependent atomistic
 -continuum systems. These open issues\, in turn\, have delayed progress in
  forecasting of real-time dynamics in critical applications such as autono
 my\, extreme weather patterns\, and efficiently designing new functional m
 aterials. Scientific machine learning (SciML) has the potential to totally
  reverse this rather inefficient paradigm and significantly accelerate sci
 entific discovery with direct impact on technology in the next few decades
 . We propose to develop a new generation of neural operators\, universal a
 pproximators for operators\, that can learn explicit and implicit operator
 s from data only. To this end\, we need to extend the predictability of ne
 ural operators for unseen out-of-distribution inputs and to speed up the t
 raining process via high performance and multi-GPU computing. We will endo
 w neural operators with physics\, multifidelity data\, and equivariant pri
 nciples (e.g.\, geometric equivariance and conservation laws) for continuu
 m systems and with seamless coupling for hybrid continuum-molecular system
 s\, where neural operators will replace the expensive molecular component.
 \n\n\n\nSteven Kleinstein\nAnthony N Brady Professor of Pathology.
  Department of Pathology\, Yale School of Medicine. Department of
  Immunobiology.\nsteven.kleinstein@yale.edu\nProject presented by
  Gisela Gabernet\, Associate Research Scientist at the Kleinstein Lab\n
 gisela.gabernet@yale.edu | https://medicine.yale.edu/lab/kleinstein/
 \n\n\n\n“Identifying convergent antibody responses across infections
  and auto-immune diseases”\nThe development of antibodies that target
  and neutralize pathogens is an important facet of t
 he adaptive immune response to foreign pathogens. Antibodies are generated
  through the recombination of Variable\, Diversity and Joining gene segmen
 ts at the DNA level\, with additional targeted mutations that generate a t
 heoretical antibody diversity of 10^14 unique sequences. Despite this
  high diversity\, a bias in the usage of these gene segments or even
  antibodies with overall high sequence similarity – termed convergent
  antibodie
 s – have been observed across cohorts of patients after an immune challe
 nge such as vaccination\, infection or auto-immune diseases. Convergent an
 tibodies have been described to target conserved epitopes across mutagenic
  pathogens such as HIV and influenza\, showing a potential towards the dev
 elopment of broadly protective vaccines. They have also been observed in a
 uto-immune diseases\, potentially serving as diagnostics and monitoring ma
 rkers. In our lab\, we have developed a high-throughput analysis pipeline 
 that enables the efficient processing of antibody repertoires of individua
 l cohorts (https://nf-co.re/airrflow). This project will aim at benchmarki
 ng and improving current convergent antibody detection methods as well as 
 visualizations. One potential approach will involve modelling the antibody
  sequences as a network of sequence similarity and identifying regions in 
 the network shared across multiple subjects.\n\n\n\nHemant Tagare\n
 Professor of Radiology and Biomedical Imaging and of Biomedical
  Engineering\nhemant.tagare@yale.edu |
  https://medicine.yale.edu/profile/hemant-tagare/\n\n\n\n
 “Predict the progression of Parkinson’s Disease”\n
 Parkinson’s Diseas
 e (PD) is the fastest growing neurodegenerative disease in the world. PD i
 s also heterogeneous – different patients progress at different rates al
 ong different trajectories. Predicting the patient-specific progress of PD
  is critical in treating the disease and in shortening the length of clini
 cal trials for new PD therapies. Currently\, there are no reliable methods
  to predict PD progression. The goal of this research is to use a large da
 taset of PD patients to predict PD progress from baseline data. The datase
 t has images\, clinical scores\, wearables data\, lab reports\, and geneti
 c information. The challenge is to use this heterogeneous data to create a
 n accurate prediction model. All methods (frequentist\, Bayesian\, deep le
 arning) are welcome.\n\n\n\nDavid van Dijk\, Ph.D.\nAssistant Professor
  of Medicine\, Yale School of Medicine\nAssistant Professor of Computer
  Science\ndavid.vandijk@yale.edu | vandijklab.org\n\n\n\n
 "Using Machine Learning to understand the language of biology"\n
 Recent advances in large language models provide new opportunities for
  decoding biology. Single-cell omics data enc
 odes complex cellular behaviors and processes into high-dimensional molecu
 lar profiles. By treating these data as textual representations\, we can a
 pply and fine-tune neural language models to uncover the underlying gramma
 tical rules governing biological systems. We have demonstrated that these 
 models can learn to translate between species\, matching cell types and ge
 ne expression programs between mice and humans in a completely unsupervise
 d fashion. This cross-species translation highlights how fundamental aspec
 ts of biology form a universal language translatable across organisms. Mor
 e broadly\, interpreting single cell data as “biological text” enables
  leveraging powerful natural language processing approaches to find patter
 ns\, generate hypotheses\, and gain conceptual understanding of biology.\n
 \n\n\nZhuoran Yang\nAssistant Professor\, Department of Statistics &
  Data Science\nzhuoran.yang@yale.edu |
  https://statistics.yale.edu/people/zhuoran-yang\n\n\n\n
 "What and How does In-Context Learning Learn? Bayesian Model Averaging\,
  Parameterization\, and Generalization"\nLarge language models demon
 strate an in-context learning (ICL) ability\, i.e.\, they can learn from a
  few examples provided in the prompt without updating their parameters. In
  this project\, we conduct a comprehensive study of ICL\, addressing sever
 al open questions:\n(a) What type of ICL estimator is learned within
  language models?\n(b) What are the suitable performance metrics to
  evaluate ICL accurately\, and what are their associated error rates?\n
 (c) How does the transformer architecture facilitate ICL?\nTo address
  (a)\, we adopt a Bayesian pe
 rspective and demonstrate that ICL implicitly implements the Bayesian mode
 l averaging algorithm. This Bayesian model averaging algorithm is shown to
  be approximated by the attention mechanism. For (b)\, we analyze ICL perf
 ormance from an online learning standpoint and establish a sublinear regre
 t bound. This shows that the error diminishes as the number of examples in
  the prompt increases. Regarding (c)\, beyond the encoded Bayesian model a
 veraging algorithm in the attention mechanism\, we reveal that during pret
 raining\, the total variation distance between the learned model and the n
 ominal model is bounded by the sum of an approximation error and a general
 ization error.\nOur findings aim to offer a unified understanding of
  the tra
 nsformer and its ICL capability\, with bounds on ICL regret\, approximatio
 n\, and generalization. This deepens our comprehension of these crucial fa
 cets of modern language models and illuminates advanced prompt methodologi
 es for tackling more complex reasoning tasks.\n\n\n\nTong Wang\n
 Assistant Professor of Marketing\, School of Management\, Yale
  University\ntong.wang.tw687@yale.edu | https://tongwang-ai.github.io/
 \n\n\n\n"Exploring Post Hoc Interpretation of Representations for
  Unstructured Data"\nIn recent years\, de
 ep learning has emerged as the prevailing solution for tackling decision-m
 aking tasks involving unstructured data\, such as images and texts. The ef
 ficacy of any predictive undertaking related to unstructured data hinges u
 pon the caliber of their representation in the latent space—often referr
 ed to as embeddings. In essence\, the pivotal question revolves around whe
 ther an insightful portrayal of unstructured data can be attained\, one th
 at encapsulates pertinent information for downstream tasks. Our objective 
 is to delve into the realm of post hoc interpretation concerning these rep
 resentations\, contextualizing our exploration within various domains\, in
 cluding business and medical data. Through an analytical lens\, we seek to
  unveil the concealed insights nestled within latent representations\, the
 reby discerning the origins of the informational cues present in the train
 ing data. It is noteworthy that a portion of this endeavor enjoys sponsors
 hip from NSF and is executed in close collaboration with the esteemed Mayo
  Clinic.\n\n\n\n\n\n\n\nRefreshments will be served\n
CATEGORIES:FDS Events,Project Match,Training
END:VEVENT
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
DTSTART:20230312T030000
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
END:DAYLIGHT
END:VTIMEZONE
END:VCALENDAR