Data Scientist, ZephyrAI
Abstract: Adapting machine learning algorithms to better handle clustering or other partition structure within training data sets is important across a wide variety of biological applications. We first consider the task of learning prediction models when multiple training studies are available. We present a novel weighting approach for constructing tree-based ensemble learners in this setting, showing that incorporating multiple layers of ensembling in the training process by weighting trees increases the robustness of the resulting predictor and achieves superior performance to Random Forest. Next, we broaden the scope of the problem to consider the effect of ensembling forest-based learners trained on clusters within a single data set whose features are heterogeneously distributed. We show that constructing ensembles of forests trained on estimated clusters, determined by algorithms such as k-means, results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm. We call our novel approach the Cross-Cluster Weighted Forest, and demonstrate its robustness and accuracy across simulations and on cancer molecular profiling and gene expression data sets that are naturally divisible into clusters. Finally, we provide theoretical support for these empirical observations by asymptotically analyzing linear least squares and random forest regressions under a linear model. In particular, for random forest regression under fixed-dimensional linear models, our bounds imply a strict benefit of our ensembling strategy over classic Random Forest.
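As a rough illustration of the cluster-then-ensemble idea described in the abstract, the following Python sketch partitions a training set with k-means, fits one random forest per estimated cluster, and combines the forests' predictions with learned weights. It is not the speaker's implementation: the use of scikit-learn, the ridge-regression stacking step used to obtain the weights, the function names, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge


def fit_cluster_forests(X, y, n_clusters=3, n_trees=50, seed=0):
    """Fit one forest per k-means cluster, then learn combination weights.

    Hypothetical helper: the weighting scheme (ridge stacking) is an
    assumption, not necessarily the one used in the talk.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    forests = []
    for k in range(n_clusters):
        mask = labels == k
        forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        forest.fit(X[mask], y[mask])  # each forest sees only its own cluster
        forests.append(forest)
    # Stacking step: regress y on every forest's predictions for all rows.
    preds = np.column_stack([f.predict(X) for f in forests])
    stacker = Ridge(alpha=1.0).fit(preds, y)
    return forests, stacker


def predict_cluster_forests(forests, stacker, X):
    """Weighted combination of the per-cluster forests' predictions."""
    preds = np.column_stack([f.predict(X) for f in forests])
    return stacker.predict(preds)


# Synthetic data: three feature clusters, one shared linear outcome model.
rng = np.random.default_rng(0)
centers = np.array([[-4.0, 0.0], [0.0, 4.0], [4.0, 0.0]])
X = np.vstack([c + rng.normal(size=(60, 2)) for c in centers])
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=len(X))

forests, stacker = fit_cluster_forests(X, y)
y_hat = predict_cluster_forests(forests, stacker, X)
```

In this sketch every forest predicts for every observation, so the final prediction borrows from forests trained on other clusters; the stacking weights decide how much each one contributes.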
Maya Ramchandran recently completed her PhD in the Harvard Biostatistics department under the supervision of Dr. Giovanni Parmigiani, where she developed machine learning ensembling strategies with applications to cancer prediction problems. She holds a BS in Applied Mathematics-Biology from Brown University and a Master of Music in Violin Performance from the New England Conservatory. She currently works as a data scientist at ZephyrAI, a biotechnology startup that develops novel drug combination and repurposing treatments for oncology.