Yong Chen

Dr. Yong Chen

Dr. Yong Chen is a Professor of Biostatistics and Founding Director of the Center for Health AI and Synthesis of Evidence (CHASE) at the University of Pennsylvania. He also directs the Penn Computing, Inference, and Learning (PennCIL) lab, focusing on evidence synthesis, machine learning/AI, and clinical evidence generation.

Dr. Chen is one of twenty Commissioners serving internationally on the Lancet Commission on Rare Diseases. He is also a Statistical Editor for the Annals of Internal Medicine, a Statistical Consultant for New England Journal of Medicine-AI, and an Associate Editor for both the JASA-ACS and The Annals of Applied Statistics. Dr. Chen has authored over 260 peer-reviewed papers in statistics and medical informatics. He is an elected Fellow of the American Statistical Association and the American College of Medical Informatics.

Dr. Yong Chen has extensive experience leading multi-site studies across major national research networks, particularly within the PCORnet infrastructure. He maintains close collaborative relationships with several PCORnet Clinical Research Networks, including STAR Network, OneFlorida+, INSIGHT CRN, REACHnet, and PEDSnet. As contact Principal Investigator of an $8 million U01 award funded by NCATS, Dr. Chen leads a study involving over 20 health systems across OneFlorida+ and STAR, encompassing more than 35 million patients, to develop AI-driven predictive models for early diagnosis of rare diseases. He also co-leads the IMPACT-MH initiative, a $13 million NIH-funded U24 project that supports a $150 million mental health research ecosystem, integrating over thirteen national consortia to build data infrastructure and deploy federated, multimodal AI analytics. Additionally, Dr. Chen has recently been awarded funding for the ReCARDO initiative, supported by a $27.2 million U24 grant from the NIA, which aims to leverage electronic health records, claims, mobile apps, and wearable devices to accelerate discoveries in Alzheimer’s disease and related dementias. Further, Dr. Chen serves as the Biostatistics Lead for PEDSnet, where he has led and contributed to over 30 peer-reviewed studies using pediatric real-world data.

 

News and Media

Study on Long COVID was featured by Penn Medicine News October 1, 2025.

Dr. Yong Chen was interviewed about Long COVID risk for a feature in TIME Magazine published October 1, 2025.

The Lancet Infectious Diseases published commentary on recent Long COVID study September 30, 2025.

Dr. Yong Chen was interviewed by The New York Times for a feature on Long COVID published September 30, 2025.

Dr. Yong Chen was interviewed by The Microbiologist for a feature on Long COVID published September 30, 2025.

 

Research Areas

Real-world data; clinical evidence generation; learning health system; healthcare delivery.

 

Education

Awards

Selected Publications

Selected Manuscripts Under Revision

  1. Gao, Y., Zhou, J., Zhou, H., Chen, Y. and Dai, X. Learn then decide: A learning approach for designing data marketplaces. Journal of the American Statistical Association-Theory and Methods, (under revision).
  2. Zhang, D., Liu, X., Ning, Y., Carroll, R., Chen, Y. Bias reduction in distributed inference for rare events. Biometrika, (under revision).
  3. Duan, J., Ning, Y., Chen, X. and Chen, Y. Two-stage Hypothesis Tests for Variable Interactions with FDR Control. The Annals of Statistics, (under revision).
  4. Hu, J., Tong, J., Ning, Y., Tang, C. Y., and Chen, Y. (Dec, 2024). Federated feature selection via FDR control. Journal of the Royal Statistical Society Series B, (minor revision).
  5. Hu, J., Wang, Y., Wang, T., Qiu, Y., Ning, Y. and Chen, Y. Targeted Learning of Heterogeneous Sources by Informative Feature Sharing. Journal of the Royal Statistical Society Series B, (under revision).
  6. Liu, X., Hu, J., Jing, N., Ning, Y., Tang, CY., Li, R and Chen, Y. Targeted learning via probabilistic subpopulation matching. Journal of the American Statistical Association-Theory and Methods, (under revision).
  7. Wang, Y., Liu, X., Ning, Y., Carroll, R. J., and Chen, Y. (2025+) Integration of Heterogeneous Data via a One-shot Distributed EM Algorithm, Journal of the American Statistical Association-Theory and Methods, (under revision).
  8. Chen, Y., Wang, H., Lumley, T., Dai, X., and Chen, Y. (2025+) Surrogate-Powered Inference: Regularization and Adaptivity, Journal of the American Statistical Association-Theory and Methods, (under revision).
  9. Wang, Y, Zhu, Z, Chen, Y. (2025+) Distributed inference for heterogeneous distributions of response. Biometrika, (under revision).
  10. Liu, X, Yang, Y, Sun, Y, Bian, J, Ma, Y, Carroll, R, Chen Y. (2025+) Distributed inference for heterogeneous mixture models using multi-site data, Journal of Machine Learning Research, (under revision).
  11. Duan, R, Ning, Y, Shi, J, Carroll, R, Cai, T and Chen, Y. Global identifiability of logistic regression models with misclassified outcomes. Bernoulli (invited revision).

Teaching

Courses Taught at the University of Pennsylvania

Course instructors:

Yong Chen

 Description:

This graduate-level Biostatistics course will introduce the fundamentals of statistical methods for meta-analyses. It will cover key principles of meta-analysis and the statistical rationales behind the analytic models, including univariate meta-analysis, multivariate meta-analysis, meta-analysis of diagnostic test accuracy, network meta-analysis, and multivariate network meta-analysis. Beyond these commonly used models, the course will cover statistical methods and software that investigate and correct for biases in systematic reviews, such as publication bias and outcome reporting bias. Advanced statistical inferential tools such as composite likelihood, pseudolikelihood, integrated likelihood methods, and EM algorithms will be introduced.
In addition, the course will also cover some practical steps in systematic review, including search strategies, data abstraction methods, quality assessment, and writing a meta-analysis report.
The course is composed of a series of weekly lectures and small group discussions. Students will be expected to attend weekly lectures, participate in class discussions, review assigned readings, complete homework assignments, and conduct a real-world meta-analysis with a clinically meaningful problem.
The students will be evaluated based on 2 homework assignments and a final in-class presentation of their final projects.

 Textbooks:

1. [Primary textbook] Schwarzer, Guido, Carpenter, James R., Rücker, Gerta. Meta-Analysis with R. Springer 2015.
2. [Primary textbook] Egger, Matthias, George D. Smith, and Douglas G. Altman, eds. Systematic Reviews in Health Care: Meta-analysis in Context. London: BMJ Publishing Group, 2001.
3. [Optional textbook] Borenstein, Michael, Larry V. Hedges, Julian P. T. Higgins, Hannah R. Rothstein. Introduction to Meta-Analysis. Wiley, 2009.
4. [Optional textbook] Rothstein, Hannah R., Alexander J. Sutton, Michael Borenstein. Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments. Wiley, 2005.
Course format:
This course will have a hybrid lecture/seminar format, with Dr. Yong Chen presenting lectures on standard and advanced statistical methods for meta-analysis, and several guests who will describe important aspects of systematic review from their perspectives as clinicians, epidemiologists, medical librarians, and systematic reviewers. The guest speakers include Drs. Jesse Berlin (Johnson & Johnson Ltd), Robert J. DeRubeis (UPenn), Eileen Erinoff (Emergency Care Research Institute, ECRI), and Tianjing Li (Johns Hopkins University School of Public Health). All of them gave guest lectures in the course co-directed by Dr. Yong Chen two years ago.

 Expectation:

This course is expected to attract students from the first year and above in their PhD program, and will likely include students in GGEB (Biostatistics and Epidemiology programs) as well as perhaps students in other groups, such as MSCE students, who meet the prerequisites.

Course instructors:

Justine Shults (part I: Linear models) and Yong Chen (Part II: Generalized linear models)

 Description:

This is a course on methods for generalized linear models (GLMs), rather than a course on using software for data analysis with GLMs. This course is designed to provide students with a fundamental understanding of theory and applications of the GLMs. Emphasis will be placed on statistical modeling, building from standard normal linear models, extending to GLMs, and going beyond GLMs. The main subjects are logit models for nominal and ordinal data, log-linear models, models for repeated categorical data, generalized linear mixed models and other mixture models for categorical data. Methods of maximum likelihood, weighted least squares, and generalized estimating equations will be used for estimation and inference.

 Textbooks:

1. Agresti, A. (2002). Categorical Data Analysis (Second Edition). Wiley. ISBN-10: 0471360937.

2. McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models (Second Edition). Chapman and Hall. ISBN-10: 0412317605.

 Learning objectives:

Regression analysis has been developed for many years and remains one of the most commonly used statistical tools to help scientists address their scientific questions. Generalized linear models (GLMs) were formulated by John Nelder and Robert Wedderburn as a way of unifying various other statistical models, including ANCOVA, linear regression, logistic regression, and log-linear models for contingency tables and count data. This lecture will introduce GLMs and some recent developments of regression techniques, with a focus on generalized linear models, quasi-likelihood methods, and estimating function approaches.

 List of topics:

  • Generalized linear models and maximum likelihood method

  • Quasi-likelihood method and estimating equation

  • Model selection

  • Analysis of binary data

  • Analysis of polytomous responses

  • Analysis of count data: log linear models

  • Analysis of contingency tables

  • Generalized linear mixed effect models

  • Analysis of matched data

  • Inference for correlated responses: marginal models and random effect models

 Expectation:

By the end of the course, the students are expected to: 1) understand the main components of GLMs; 2) build and apply appropriate models to binary, nominal, ordinal or count data; 3) build and apply appropriate models to correlated outcomes; 4) make inference for a given model and interpret the results in the scientific context.

Course directors:

Craig Umscheid and Yong Chen

Objective:

This 1.0 unit graduate-level course will provide an introduction to the fundamentals of systematic reviews and meta-analyses.  It will cover introductory principles of meta-analysis; protocol development; search strategies; data abstraction methods; quality assessment; meta-analytic methods; and applications of meta-analysis.  The course is composed of a series of weekly small group lectures and discussions. Students will be expected to attend weekly didactics, participate in class discussions, review assigned readings, complete homework assignments, and draft a systematic review protocol of their choosing suitable for IRB submission.

Assignments:

Students will be required to complete readings in the textbook and articles referenced for each session. In addition, each student will complete homework assignments assigned by the instructors, including a data analysis project using a meta-analysis dataset provided by the instructors: download Stata meta-analysis modules from the Stata website, review dataset variables, complete an analysis, and write up their findings. Finally, students will draft a systematic review protocol of their choosing and present their protocol at the conclusion of the class. There are no examinations.

Course instructors:

Yong Chen (Part I) and Jinbo Chen (Part II)

Outline of topics:

Parametric Inference:

        Unbiased estimation and unbiased estimating functions

        Maximum likelihood estimation: Consistency, asymptotic normality, and efficiency

        Hypothesis testing: Wald test, Likelihood ratio test, Score test

        Influence functions

        EM algorithm

        Model checking, Model mis-specification, and model selection

        Examples of Non-regular maximum likelihood estimation

        Marginal likelihood, Conditional likelihood, (modified) profile likelihood, composite likelihood, and pseudolikelihood

        U-statistics theory

        Contiguity theory       

        Bayes and Empirical Bayes estimators, Bayesian tests

    

Semiparametric Inference:

         Semiparametric maximum likelihood estimation (Case-control study; Cox proportional hazards regression)

         Z-estimation/M-estimation 

         Generalized score test, with Pearson’s chi-squared test as an example

         Semiparametric inference with incomplete data

Course instructors:

Yong Chen

Description:

This course presents extensions of general and generalized linear models to longitudinal and correlated outcome data with special emphasis on clinical, epidemiologic, and public health applications. Major topics include generalized linear mixed models (GLMM) for continuous, binomial, and count data; maximum likelihood estimation; generalized estimating equations (GEE); current general and specialized software applicable to these methods; and readings from current statistical literature. Each student will be required to participate in 4 labs and complete associated problem sets. Software will include Stata.

 Textbooks:

1. Diggle, P, Heagerty, P, Liang, K-Y and Zeger, S. (2013). Analysis of Longitudinal Data (Second Edition). Oxford University Press. ISBN-10: 0198524846.

2. Fitzmaurice GM, Laird NM, Ware JH. Applied Longitudinal Analysis. Second Edition. New York: Wiley; 2011. ISBN: 978-0-470-38027-7.

3. Singer JD, Willett JB. Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. New York: Oxford University Press; 2003.

 Graphics texts: 

Mitchell MN.   A Visual Guide to Stata Graphics.  3rd Edition.  College Station, TX: Stata Press; 2012.

Courses Taught at the University of Texas School of Public Health

Course instructor:

Yong Chen

Description:

This is a course on methods for generalized linear models (GLMs), rather than a course on using software for data analysis with GLMs. This course is designed to provide students with a fundamental understanding of theory and applications of the GLMs. Emphasis will be placed on statistical modeling, building from standard normal linear models, extending to GLMs, and going beyond GLMs. The main subjects are logit models for nominal and ordinal data, log-linear models, models for repeated categorical data, generalized linear mixed models and other mixture models for categorical data. Methods of maximum likelihood, weighted least squares, and generalized estimating equations will be used for estimation and inference.

Textbooks:

1. Agresti, A. (2002). Categorical Data Analysis (Second Edition). Wiley. ISBN-10: 0471360937.

2. McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models (Second Edition). Chapman and Hall. ISBN-10: 0412317605.

Learning objectives:

Regression analysis has been developed for many years and remains one of the most commonly used statistical tools to help scientists address their scientific questions. Generalized linear models (GLMs) were formulated by John Nelder and Robert Wedderburn as a way of unifying various other statistical models, including ANCOVA, linear regression, logistic regression, and log-linear models for contingency tables and count data. This lecture will introduce GLMs and some recent developments of regression techniques, with a focus on generalized linear models, quasi-likelihood methods, and estimating function approaches.

List of topics:

  • Generalized linear models and maximum likelihood method

  • Quasi-likelihood method and estimating equation

  • Model selection

  • Analysis of binary data

  • Analysis of polytomous responses

  • Analysis of count data: log linear models

  • Analysis of contingency tables

  • Generalized linear mixed effect models

  • Analysis of matched data

  • Inference for correlated responses: marginal models and random effect models

Expectation:

By the end of the course, the students are expected to: 1) understand the main components of GLMs; 2) build and apply appropriate models to binary, nominal, ordinal or count data; 3) build and apply appropriate models to correlated outcomes; 4) make inference for a given model and interpret the results in the scientific context.

Course instructors:

Yong Chen

Description:

This course presents extensions of general and generalized linear models to longitudinal and correlated outcome data with special emphasis on clinical, epidemiologic, and public health applications. Major topics include generalized linear mixed models (GLMM) for continuous, binomial, and count data; maximum likelihood estimation; generalized estimating equations (GEE); current general and specialized software applicable to these methods; and readings from current statistical literature. Each student will be required to participate in 4 labs and complete associated problem sets. Software will include Stata.

 Textbooks:

1. Diggle, P, Heagerty, P, Liang, K-Y and Zeger, S. (2013). Analysis of Longitudinal Data (Second Edition). Oxford University Press. ISBN-10: 0198524846.

2. Fitzmaurice GM, Laird NM, Ware JH. Applied Longitudinal Analysis. Second Edition. New York: Wiley; 2011. ISBN: 978-0-470-38027-7.

3. Singer JD, Willett JB. Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. New York: Oxford University Press; 2003.

 Graphics texts: 

Mitchell MN.   A Visual Guide to Stata Graphics.  3rd Edition.  College Station, TX: Stata Press; 2012.

Joint Appointments