Graduation Date

Summer 8-15-2025

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Programs

Biostatistics

First Advisor

Ran Dai

Second Advisor

Cheng Zheng

Third Advisor

Ying Zhang

Fourth Advisor

Hongying (Daisy) Dai

Abstract

The increasing availability of high-dimensional (HD) observational data offers great opportunities for scientific discovery through feature selection. However, HD biomedical data also present significant statistical challenges: signals are often rare and weak, and the relationship between predictors and outcomes is frequently unknown. We develop a knockoff-based framework that identifies features truly associated with biomedical outcomes in HD data while controlling the false discovery rate (FDR) to ensure the quality of selection. This framework allows for the integration of machine learning methods, such as penalized regression and random forests. In particular, we demonstrate that our methods address the following challenges: (1) handling measurement errors and missingness in HD metabolomics data; (2) selecting reproducible features by integrating information from multiple electronic health record (EHR) sources with heterogeneity and privacy concerns; and (3) conducting mediator selection in HD causal models involving nonlinearities and interactions. We validate the FDR control and improved power of our methods through extensive simulation studies. Finally, we apply the proposed framework to the Women’s Health Initiative data, the National COVID Cohort Collaborative (N3C), and Alzheimer’s research datasets. These applications provide new insights into FDR control in HD feature selection and lead to novel scientific findings in nutritional epidemiology, infectious disease research, and neuroscience.
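As a rough illustration of how a knockoff filter turns per-feature statistics into an FDR-controlled selection, the sketch below implements the generic knockoff+ threshold rule from the knockoff literature. It is not taken from the dissertation: the function names and the toy statistics are hypothetical, and the statistics W are assumed to have been computed already (for example, as the difference in absolute penalized-regression coefficients between each feature and its knockoff copy), so that large positive values are evidence for a true signal.

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """Return the smallest t > 0 such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q,
    or infinity if no such t exists (hypothetical helper)."""
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Negative statistics estimate the number of false discoveries.
        false_est = 1 + np.sum(W <= -t)
        n_selected = max(1, np.sum(W >= t))
        if false_est / n_selected <= q:
            return float(t)
    return float("inf")

def knockoff_select(W, q):
    """Indices of features whose statistic meets the knockoff+ threshold."""
    W = np.asarray(W, dtype=float)
    tau = knockoff_plus_threshold(W, q)
    return np.flatnonzero(W >= tau)

# Toy statistics (made up for illustration), target FDR level q = 0.2.
W_toy = [5, 4, 3, 2.5, -1, 0.5, 2, -0.2, 1.5, 3.5]
selected = knockoff_select(W_toy, q=0.2)
```

The "+1" in the numerator is what distinguishes knockoff+ from the plain knockoff filter; it makes the rule more conservative when few features are selected.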

Comments

2025 Copyright, the authors

Available for download on Sunday, July 18, 2027
