Yong Chen
Dr. Yong Chen, Professor of Biostatistics, founded and directs the Computing, Inference and Learning Lab (PENNCIL) at the University of Pennsylvania. The mission of the PENNCIL lab is to develop computational methods and software that transform real-world data into insights, to disseminate these methods and knowledge to research communities, and to bridge the gap from data to actionable health care.
Research areas:
 Real-world data; clinical evidence generation; learning health system; healthcare delivery.
Education:
 Ph.D. in Biostatistics at the Bloomberg School of Public Health, the Johns Hopkins University
 Thesis Advisors: Professor Kung-Yee Liang and Professor Charles Rohde.
 M.A. in Mathematics at the Department of Mathematics, the Johns Hopkins University
 B.S. in Mathematics at the University of Science and Technology of China
Awards:
 2023. Best of Annals of Applied Statistics (to be presented at JSM 2023): PALM: Patient-centered Treatment Ranking via Large-scale Multivariate Network Meta-analysis.
 2022. Winner of the Best Paper in Biometrics by an IBS Member Award: Testing small study effects in multivariate meta-analysis. Biometrics, 76(4), 1240-1250. 2020.
 2022. Best of Annals of Applied Statistics: Monitoring vaccine safety by studying temporal variation of adverse events using vaccine adverse event reporting system.
 2021. The Observational Health Data Sciences and Informatics (OHDSI) Titan Award for Methodological Research to recognize extraordinary contributions by an individual, organization, or team in development or evaluation in analytical methods for clinical characterization, population-level effect estimation, or patient-level prediction
 2021. Best paper award by the Translational Bioinformatics Year-in-Review by American Medical Informatics Association (top 25 papers among 206 papers published in Jan 2020 – March 2021)
 2020. Elected Fellow, American Statistical Association
 2019. Distinguished Faculty member at the Department of Biostatistics, Epidemiology and Informatics, the Perelman School of Medicine, University of Pennsylvania
 2018. Best paper award by the International Medical Informatics Association (IMIA) Yearbook Section on Clinical Research Informatics
 2018. Elected Member, International Statistical Institute
 2018. Elected Member, Society for Research Synthesis Methodology
 2015. Institute of Mathematical Statistics IMS Travel Award
 2010. Margaret Merrell Award for excellence in research, Department of Biostatistics, The Johns Hopkins University.
 2005-2010. Sommer Scholar, the Bloomberg School of Public Health, the Johns Hopkins University: Dean Alfred Sommer’s leadership training program for the next generation of public health leaders
 The inaugural class of Hopkins Sommer Scholars in 2005
Teaching
Courses Taught at the University of Pennsylvania
Course instructors:
Yong Chen
Description:
This graduate-level Biostatistics course will introduce the fundamentals of statistical methods for meta-analyses. It will cover key principles of meta-analysis and the statistical rationales behind the analytic models, including univariate meta-analysis, multivariate meta-analysis, meta-analysis of diagnostic test accuracy, network meta-analysis, and multivariate network meta-analysis. Beyond these commonly used models, the course will cover statistical methods and software that investigate and correct for biases in systematic reviews, such as publication bias and outcome reporting bias. Advanced statistical inferential tools such as composite likelihood, pseudolikelihood, integrated likelihood methods, and EM algorithms will be introduced.
In addition, the course will also cover some practical steps in systematic review, including search strategies, data abstraction methods, quality assessment, and writing a meta-analysis report.
The course is composed of a series of weekly lectures and small group discussions. Students will be expected to attend weekly lectures, participate in class discussions, review assigned readings, complete homework assignments, and conduct a real-world meta-analysis on a clinically meaningful problem.
The students will be evaluated based on two homework assignments and a final in-class presentation of their projects.
Textbooks:
1. [Primary textbook] Schwarzer, Guido, Carpenter, James R., Rücker, Gerta. Meta-Analysis with R. Springer 2015.
2. [Primary textbook] Egger, Matthias, George D. Smith, and Douglas G. Altman, eds. Systematic Reviews in Health Care: Meta-analysis in Context. London: BMJ Publishing Group, 2001.
3. [Optional textbook] Borenstein, Michael, Larry V. Hedges, Julian P. T. Higgins, Hannah R. Rothstein. Introduction to Meta-Analysis. Wiley, 2009.
4. [Optional textbook] Rothstein, Hannah R., Alexander J. Sutton, Michael Borenstein. Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments. Wiley, 2005.
Course format:
This course will have a hybrid lecture/seminar format, with Dr. Yong Chen presenting lectures on standard and advanced statistical methods for meta-analysis, and several guests who will describe important aspects of systematic review from their perspectives as clinicians, epidemiologists, medical librarians, and systematic reviewers. The guest speakers include Drs. Jesse Berlin (Johnson & Johnson Ltd), Robert J. DeRubeis (UPenn), Eileen Erinoff (Emergency Care Research Institute, ECRI), and Tianjing Li (the Johns Hopkins University School of Public Health). All of them gave guest lectures in a course co-directed by Dr. Yong Chen two years ago.
Expectation:
This course is expected to attract students from the first year and above in their PhD program, and will likely include students in GGEB (Biostatistics and Epidemiology programs) as well as perhaps students in other groups, such as MSCE students, who meet the prerequisites.
Course instructors:
Justine Shults (Part I: Linear models) and Yong Chen (Part II: Generalized linear models)
Description:
This is a course on methods for generalized linear models (GLMs), rather than a course on using software for data analysis with GLMs. This course is designed to provide students with a fundamental understanding of theory and applications of the GLMs. Emphasis will be placed on statistical modeling, building from standard normal linear models, extending to GLMs, and going beyond GLMs. The main subjects are logit models for nominal and ordinal data, log-linear models, models for repeated categorical data, generalized linear mixed models and other mixture models for categorical data. Methods of maximum likelihood, weighted least squares, and generalized estimating equations will be used for estimation and inference.
Textbooks:
1. Agresti, A. (2002). Categorical Data Analysis (Second Edition). Wiley. ISBN-10: 0471360937.
2. McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models (Second Edition). Chapman and Hall. ISBN-10: 0412317605.
Learning objectives:
Regression analysis has been developed for many years and remains one of the most commonly used statistical tools to help scientists address their scientific questions. Generalized linear models (GLMs) were formulated by John Nelder and Robert Wedderburn as a way of unifying various other statistical models, including ANCOVA, linear regression, logistic regression, and log-linear models for contingency tables and count data. This course will introduce GLMs and some recent developments of regression techniques, with a focus on generalized linear models, quasi-likelihood methods, and estimating function approaches.
List of topics:

Generalized linear models and maximum likelihood method
Quasi-likelihood method and estimating equations
Model selection
Analysis of binary data
Analysis of polytomous responses
Analysis of count data: log-linear models
Analysis of contingency tables
Generalized linear mixed effect models
Analysis of matched data
Inference for correlated responses: marginal models and random effect models
Expectation:
By the end of the course, the students are expected to: 1) understand the main components of GLMs; 2) build and apply appropriate models to binary, nominal, ordinal or count data; 3) build and apply appropriate models to correlated outcomes; 4) make inference for a given model and interpret the results in the scientific context.
Course directors:
Craig Umscheid and Yong Chen
Objective:
This 1.0-unit graduate-level course will provide an introduction to the fundamentals of systematic reviews and meta-analyses. It will cover introductory principles of meta-analysis; protocol development; search strategies; data abstraction methods; quality assessment; meta-analytic methods; and applications of meta-analysis. The course is composed of a series of weekly small group lectures and discussions. Students will be expected to attend weekly didactics, participate in class discussions, review assigned readings, complete homework assignments, and draft a systematic review protocol of their choosing suitable for IRB submission.
Assignments:
Students will be required to complete readings in the textbook and articles referenced for each session. In addition, each student will complete homework assignments assigned by the instructors, including a data analysis project using a meta-analysis dataset provided by the instructors: download Stata meta-analysis modules from the Stata website, review dataset variables, complete an analysis, and write up their findings. Finally, students will draft a systematic review protocol of their choosing and present their protocol at the conclusion of the class. There are no examinations.
Course instructors:
Yong Chen (Part I) and Jinbo Chen (Part II)
Outline of topics:
Parametric Inference:
Unbiased estimation and unbiased estimating functions
Maximum likelihood estimation: Consistency, asymptotic normality, and efficiency
Hypothesis testing: Wald test, Likelihood ratio test, Score test
Influence functions
EM algorithm
Model checking, Model misspecification, and model selection
Examples of non-regular maximum likelihood estimation
Marginal likelihood, Conditional likelihood, (modified) profile likelihood, composite likelihood, and pseudolikelihood
U-statistics theory
Contiguity theory
Bayes and Empirical Bayes estimators, Bayesian tests
Semiparametric Inference:
Semiparametric maximum likelihood estimation (Case-control study; Cox proportional hazards regression)
Z-estimation/M-estimation
Generalized score test, with Pearson’s chi-squared test as an example
Semiparametric inference with incomplete data
Course instructors:
Yong Chen
Description:
This course presents extensions of general and generalized linear models to longitudinal and correlated outcome data with special emphasis on clinical, epidemiologic, and public health applications. Major topics include generalized linear mixed models (GLMMs) for continuous, binomial, and count data; maximum likelihood estimation; generalized estimating equations (GEE); current general and specialized software applicable to these methods; and readings from current statistical literature. Each student will be required to participate in 4 labs and complete associated problem sets. Software will include Stata.
Textbooks:
1. Diggle, P, Heagerty, P, Liang, KY and Zeger, S. (2013). Analysis of Longitudinal Data (Second Edition). Oxford University Press. ISBN-10: 0198524846.
2. Fitzmaurice GM, Laird NM, Ware JH. Applied Longitudinal Analysis. Second Edition. New York: Wiley; 2011. ISBN: 9780470380277. Hardcover 740 pages; August 2011
3. Singer JD, Willett JB. Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. New York: Oxford University Press; 2003.
Graphics texts:
Mitchell MN. A Visual Guide to Stata Graphics. 3rd Edition. College Station, TX: Stata Press; 2012.
Courses Taught at the University of Texas School of Public Health
Course instructor:
Yong Chen
Description:
This is a course on methods for generalized linear models (GLMs), rather than a course on using software for data analysis with GLMs. This course is designed to provide students with a fundamental understanding of theory and applications of the GLMs. Emphasis will be placed on statistical modeling, building from standard normal linear models, extending to GLMs, and going beyond GLMs. The main subjects are logit models for nominal and ordinal data, log-linear models, models for repeated categorical data, generalized linear mixed models and other mixture models for categorical data. Methods of maximum likelihood, weighted least squares, and generalized estimating equations will be used for estimation and inference.
Textbooks:
1. Agresti, A. (2002). Categorical Data Analysis (Second Edition). Wiley. ISBN-10: 0471360937.
2. McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models (Second Edition). Chapman and Hall. ISBN-10: 0412317605.
Learning objectives:
Regression analysis has been developed for many years and remains one of the most commonly used statistical tools to help scientists address their scientific questions. Generalized linear models (GLMs) were formulated by John Nelder and Robert Wedderburn as a way of unifying various other statistical models, including ANCOVA, linear regression, logistic regression, and log-linear models for contingency tables and count data. This course will introduce GLMs and some recent developments of regression techniques, with a focus on generalized linear models, quasi-likelihood methods, and estimating function approaches.
List of topics:

Generalized linear models and maximum likelihood method
Quasi-likelihood method and estimating equations
Model selection
Analysis of binary data
Analysis of polytomous responses
Analysis of count data: log-linear models
Analysis of contingency tables
Generalized linear mixed effect models
Analysis of matched data
Inference for correlated responses: marginal models and random effect models
Expectation:
By the end of the course, the students are expected to: 1) understand the main components of GLMs; 2) build and apply appropriate models to binary, nominal, ordinal or count data; 3) build and apply appropriate models to correlated outcomes; 4) make inference for a given model and interpret the results in the scientific context.
Course instructors:
Yong Chen
Description:
This course presents extensions of general and generalized linear models to longitudinal and correlated outcome data with special emphasis on clinical, epidemiologic, and public health applications. Major topics include generalized linear mixed models (GLMMs) for continuous, binomial, and count data; maximum likelihood estimation; generalized estimating equations (GEE); current general and specialized software applicable to these methods; and readings from current statistical literature. Each student will be required to participate in 4 labs and complete associated problem sets. Software will include Stata.
Textbooks:
1. Diggle, P, Heagerty, P, Liang, KY and Zeger, S. (2013). Analysis of Longitudinal Data (Second Edition). Oxford University Press. ISBN-10: 0198524846.
2. Fitzmaurice GM, Laird NM, Ware JH. Applied Longitudinal Analysis. Second Edition. New York: Wiley; 2011. ISBN: 9780470380277. Hardcover 740 pages; August 2011
3. Singer JD, Willett JB. Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. New York: Oxford University Press; 2003.
Graphics texts:
Mitchell MN. A Visual Guide to Stata Graphics. 3rd Edition. College Station, TX: Stata Press; 2012.
Joint appointments:
 Senior Fellow at the Institute of Biomedical Informatics, University of Pennsylvania
 Senior Scholar at the Center for Evidence-based Practice at Penn School of Medicine, University of Pennsylvania
 Faculty member at the Applied Mathematics & Computational Science Program, Penn Arts & Sciences, University of Pennsylvania