
Mark Gerstein is the Albert L. Williams Professor of Biomedical Informatics at Yale. He is associated with the Departments of Molecular Biophysics & Biochemistry, Computer Science, and Statistics & Data Science. He is Co-Director (and founder) of the Computational Biology & Bioinformatics PhD Program. He has chaired the analysis groups of numerous national and international projects, including ENCODE, modENCODE, PsychENCODE, 1000 Genomes, PCAWG, ERCC, and SCORCH. Prof. Gerstein completed his PhD training in Computational Chemistry and Biophysics at Cambridge University, followed by postdoctoral training at Stanford. Since then, he has published >600 manuscripts, including several in prominent venues such as Science, Nature, and Cell, and has an H-index of >175. He has also written popular science pieces for venues such as Scientific American and the Wall Street Journal. He is a specialist in bioinformatics with a particular interest in large-scale data science, especially as it pertains to personal genome analyses. Current research foci in his lab include disease genomics (particularly neurogenomics and cancer genomics), human genome annotation, genomic privacy, network science (especially gene regulatory networks), wearable and molecular imaging data analysis, text mining of the biological science literature, and macromolecular simulation. Prof. Gerstein has received honors including election as a fellow of AAAS and of the International Society for Computational Biology. His lab currently comprises >35 students and trainees, and he has placed >35 of his past alumni/ae in academic faculty positions and an equivalent number in industry positions. He has also mentored >200 Yale undergraduates and has taught undergraduate and graduate courses in bioinformatics at Yale for >20 years.
How do you use data science?
We do research in bioinformatics-oriented data science, applying computational approaches to problems in molecular biology. Broadly, we are interested in large-scale analyses of genome sequences and macromolecular structures. We also work on the analysis of images and of large-scale text and sensor data. We are especially focused on the human genome and proteome, and on the phenotype descriptions associated with them. Our research involves a number of quantitative techniques, including database design, systematic data mining and deep learning, visualization of high-dimensional data, and molecular simulation. More specifically, we focus on three questions. First, we are interested in annotating the raw human genome sequence, especially in characterizing the vast intergenic regions. Next, we are trying to understand the function of all the genes encoded by the genome. Here, we try to characterize function on a large scale through the use of molecular networks. Finally, for the group of protein-coding genes whose products have known 3D structures, we are trying to see how their function is carried out through motion and how that motion can be predicted from packing geometry.