
HSPH Biostatistics and DFCI Data Science Seminar
Tuesday, April 29, from 11:00 am to 12:00 pm
Zoom only (Link to be posted shortly)
Yu Li, PhD
Assistant Professor, CSE
The Chinese University of Hong Kong
Large language models can integrate and process large amounts of biomedical data, which gives them great potential for modeling complex diseases and discovering functional biomolecules for therapeutics. To model complex diseases and identify potential drug targets for them, we built a language model trained on the insurance claims of around 123 million people in the US. The model gives a unified representation of all common complex diseases, which enables us to predict the genetic parameters of these diseases and efficiently discover unique genetic loci related to them. Next, we developed models based on protein language models to efficiently discover remote homologs and functional biomolecules from nature, such as signal peptides and antimicrobial peptides. With these models, we can identify remote homologs 22 times faster than PSI-BLAST and discover diverse functional peptides with less than 20% sequence similarity to known ones. Finally, we developed an RNA language model that captures the relationship between RNA sequence and structure, which enables effective RNA structure prediction and reverse design. Within two months, we designed and experimentally validated 19 RNA aptamers that are structurally similar, yet dissimilar in sequence, to known light-up aptamers. More importantly, 10 of the designed aptamers show higher fluorescence than the native Mango-I. Together, these projects demonstrate the great potential of large language models for advancing fundamental computational biology research and driving potentially transformative developments.