
HSPH Biostatistics and DFCI Data Science Colloquium
Thursday, March 27, 2025
4:00pm
Harvard T.H. Chan School of Public Health, FXB G13
Nancy Zhang, PhD
Ge Li and Ning Zhao Professor, Professor of Statistics and Data Science, Vice Dean of Wharton Doctoral Programs, The Wharton School, University of Pennsylvania
In single-cell and spatial biology, data integration refers to the alignment of cells across samples and modalities, and is a ubiquitous challenge affecting all downstream analyses. The goal of cell integration is to find cells across data sets that share the same biological state, even when that state is obscured by technical differences.
In this talk, I will cast the cell integration problem on a continuum of weak to strong linkage, depending on the strength of feature sharing between experiments. First, I will examine integration across data modalities with weak linkage. This arises when the data sets being integrated share few features, for example, single-cell RNA sequencing data and spatial proteomics data. For this, I will present MaxFuse, a method that leverages higher-order relationships between all features, including unshared features, to achieve accurate integration. Next, I will consider data alignment within a single modality in clinical-scale studies. For this setting, I will show that existing paradigms are overly aggressive, erasing disease and treatment effects and introducing severe data distortion. I will introduce a “pool-of-controls” experimental design concept to disentangle biological variation from unwanted variation. Based on this, I will describe CellANOVA, a novel statistical model and scalable algorithm that recovers biological signals lost during batch integration and corrects integration-related data distortion. Through these two contrasting paradigms, I will share the key lessons learned and the remaining challenges in this field.