Surrogate-powered Inference

We develop surrogate-powered inference (SPI) methods to improve statistical efficiency when primary outcomes are sparsely observed or costly to collect. By leveraging auxiliary and surrogate outcomes that are more widely available, SPI combines information from labeled and unlabeled data while maintaining valid inference for the primary estimand. This line of work enables reliable estimation and hypothesis testing in real-world settings characterized by missing outcomes, irregular follow-up, and limited labeling.

Selected papers:

  • Chen, J., Wang, H., Lumley, T., Dai, X., & Chen, Y. (2025). Surrogate-powered inference: Regularization and adaptivity. arXiv preprint, arXiv:2512.21826.
  • Marks-Anglin, A., Chen, J., Luo, C., Hubbard, R. A., & Chen, Y. (2025). Optimal surrogate-assisted sampling for cost-efficient validation of electronic health record outcomes. Statistics in Medicine, 44(10–12), e70095.
  • Lu, Y., Tong, J., Chubak, J., Lumley, T., Hubbard, R. A., Xu, H., & Chen, Y. (2024). Leveraging error-prone algorithm-derived phenotypes: Enhancing association studies for risk factors in EHR data. Journal of Biomedical Informatics, 157, 104690.
  • Tong, J., Huang, J., Chubak, J., Wang, X., Moore, J. H., Hubbard, R. A., & Chen, Y. (2020). An augmented estimation procedure for EHR-based association studies accounting for differential misclassification. Journal of the American Medical Informatics Association, 27(2), 244–253.