
HSPH Biostatistics & DFCI Data Science Colloquium Series
Thursday, October 2, 2025
4:00pm ET
HSPH FXB-301
Xihong Lin, PhD, Department of Biostatistics and Department of Statistics, Harvard University
Integrating statistics with generative AI provides unprecedented opportunities to empower statistical science and accelerate trustworthy scientific discovery by leveraging the potential of generative AI models alongside rigorous statistical principles that account for uncertainty and enhance interpretability. In this talk, I will discuss the challenges and opportunities as we navigate the crossroads of statistics, generative AI, and genomic health science. I will highlight how synthetic data from generative models, such as diffusion models and transformers, can be used to enable robust and powerful statistical analyses while ensuring valid inference even when generative AI models are misspecified and treated as black-box tools. I will illustrate such synthetic-data-powered statistical inference with generative ML/AI through large-scale analyses of the UK Biobank in the presence of missing data, and discuss its connection with prediction-powered inference (PPI). I will also discuss how to build an end-to-end autonomous, scalable, and interpretable large-scale whole-genome sequencing (WGS) analysis ecosystem. These efforts will be illustrated using analyses of 200,000 TOPMed WGS samples, the UK Biobank data of 500,000 subjects on the cloud platform RAP, and the All of Us data of 400,000 subjects on the NIH cloud platform AnVIL.