Dr. Yong Chen

Vision and Research Focus

I lead the Penn Computing, Inference, and Learning (PennCIL) Lab and am the founding Director of the Center for Health AI and Synthesis of Evidence (CHASE), with a long-term vision to build medical AI systems that translate real-world data into reliable, actionable evidence for clinical and public health decision-making within learning health systems. My work spans statistics, machine learning, and biomedical informatics, with an emphasis on causal reasoning, robustness, and deployment in real clinical environments.

Biomedical data are inherently complex—heterogeneous, decentralized, and shaped by clinical workflows, institutional constraints, and societal expectations around privacy and trust. I view these realities not as limitations, but as the core scientific challenges that medical AI must confront to be credible, useful, and responsible in practice.

My research integrates methodological innovation with large-scale translational studies. On the methods side, this includes causal and federated learning, multi-agent AI, and AI-assisted evidence synthesis. On the translational side, I lead national and international investigations in pediatric COVID-19, vaccine safety, neurodegenerative disease, and metabolic disease, including GLP-1-based therapies. A central theme of this work is generating rigorous evidence on the real-world impact of health system interventions, such as telemedicine, to inform clinical practice, policy, and system-level decision-making. Through PennCIL and CHASE, I aim to build not just algorithms but durable scientific infrastructure: methods that travel across institutions, disciplines, and diseases, and people (students, collaborators, and partners) who will carry this research philosophy forward long after any single project ends.

National and International Efforts and Leadership

Dr. Yong Chen provides national and international leadership in advancing large-scale, collaborative clinical research and medical AI, with a focus on building sustainable infrastructure for real-world evidence generation. He serves as one of twenty-seven international Commissioners representing six continents on the Lancet Commission on Rare Diseases, contributing to global efforts to shape research, diagnostic frameworks, and policy agendas for rare diseases by integrating scientific rigor with practical, multi-stakeholder perspectives.

A central pillar of Dr. Chen’s leadership is his long-standing engagement with the PCORnet ecosystem. He has extensive experience leading and coordinating multi-site studies across PCORnet Clinical Research Networks, including the STAR Network, OneFlorida+, INSIGHT CRN, REACHnet, and PEDSnet. Through these collaborations, he has led national studies spanning tens of millions of patients, advancing privacy-preserving, federated, and AI-enabled analytics to address high-impact clinical and public health questions at scale.

Dr. Chen currently serves as a senior leader on multiple large-scale NIH- and PCORI-funded national initiatives that establish shared infrastructure for translational research, regulatory-grade evidence generation, and AI-enabled discovery. He is the Contact Principal Investigator of PANDA, a U01 project funded by the National Center for Advancing Translational Sciences (NCATS), which brings together more than 20 health systems and tens of millions of patients to develop and deploy AI-driven models for early identification and diagnosis of rare diseases using real-world clinical data. He also co-leads the only Data Coordinating Center (DCC) for the IMPACT-MH program, supported by the National Institute of Mental Health, overseeing data governance, coordination, and advanced analytics across 13 national consortia within a $260 million mental health research ecosystem, with a focus on deploying federated, multimodal medical AI methods at scale. In addition, Dr. Chen is a Multi-Principal Investigator of ReCARDO, a $27.2 million U24 initiative funded by the National Institute on Aging, which integrates electronic health records, claims data, mobile applications, and wearable technologies to accelerate discoveries in Alzheimer’s disease and related dementias while enabling scalable AI-driven evidence generation relevant to both clinical care and drug development.

In parallel, Dr. Chen has led and supported multiple PCORI-funded studies within the PCORnet infrastructure, emphasizing patient-centered governance, data-partner trust, and rigorous, reproducible analytics. Across these efforts, his leadership focuses on translating advanced AI and data science into credible evidence that informs clinical practice, health system strategy, drug development, and regulatory decision-making.

Impact on Medicine, Public Health, and Policy

Dr. Chen’s work has had direct and sustained impact on medicine, public health, and health policy, particularly through the translation of large-scale real-world data into actionable evidence. A defining example is his leadership during the COVID-19 pandemic, when he served as Biostatistics Core Director for the pediatric EHR cohort of the NIH RECOVER Initiative, leading data analytics for one of the largest national efforts to understand long COVID in children and adolescents. This work engaged more than 200 researchers and stakeholders across 40 health systems and covered the healthcare experiences of over 12 million children—representing more than 10% of the U.S. pediatric population.

Under Dr. Chen’s leadership, analytic teams delivered weekly, actionable evidence to the NIH, CDC, FDA, and the White House, informing national understanding of pediatric COVID-19 and long COVID. In parallel, his team led one of the largest real-world studies of COVID-19 vaccine effectiveness, generating critical evidence on long-term protection against infection and post-acute sequelae that directly informed public health policy discussions.

Beyond the pandemic response, Dr. Chen’s work has generated high-impact evidence across a range of clinical domains, including rare diseases, pediatric outcomes, chronic conditions, and aging-related disorders, often by enabling multi-site collaborations that produce robust, real-world insights. His methodological contributions in causal inference, federated learning, and distributed analytics have been widely adopted by research consortia, advancing how complex health data are analyzed and informing evidence frameworks used by clinicians, health systems, and regulators. Collectively, these contributions demonstrate how rigorous real-world evidence can improve decision-making far beyond any single disease area, shaping both clinical practice and policy at scale.

Research Program and Methodological Contributions

Statistical Inference

We develop statistical inference theory for complex data structures that arise in modern biomedical and health studies, where standard regularity conditions often fail. Our work focuses on likelihood- and pseudolikelihood-based inference in nonstandard settings, including boundary constraints, latent variable models, nonconvex parameter spaces, and irregular or informative observation processes. We study identifiability, the asymptotic behavior of test statistics, and valid hypothesis testing in settings where classical likelihood theory breaks down. These theoretical advances provide a principled foundation for reliable inference in diagnostic accuracy studies, longitudinal data with informative visit times, and semiparametric models with boundary problems.

Federated Learning

We develop communication-efficient and statistically rigorous methods for distributed and federated inference across multiple data partners, where individual-level data cannot be centrally pooled. Our work emphasizes lossless and one-shot algorithms that achieve the same statistical efficiency as pooled analyses while preserving data ownership and privacy. We study distributed inference for generalized linear models, linear mixed models, time-to-event data, and high-dimensional causal analyses under data heterogeneity and covariate shift. These methods enable large-scale, multi-site real-world evidence generation across healthcare systems.
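To make the idea of a lossless, one-shot distributed analysis concrete, the sketch below fits a simple linear regression across several sites using only per-site aggregate statistics. It is an illustration under stated assumptions, not one of the lab's published algorithms: the function names, the single-covariate model, and the simulated sites are all hypothetical, and least squares is chosen only because its sufficient statistics add up exactly across sites.

```python
import random

def local_summary(xs, ys):
    """Run inside one site: reduce patient-level data to five aggregates."""
    return (
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(x * y for x, y in zip(xs, ys)),
    )

def fit_from_summaries(summaries):
    """Run at the coordinating center: add up the aggregates and solve once."""
    n, sx, sy, sxx, sxy = (sum(column) for column in zip(*summaries))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

def pooled_fit(xs, ys):
    """Reference fit on centrally pooled data, used here only as a check."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def simulate_site(rng, n, shift):
    """Hypothetical site whose covariate distribution differs from the others."""
    xs = [rng.gauss(shift, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys

rng = random.Random(0)
sites = [simulate_site(rng, n, shift)
         for n, shift in [(200, -1.0), (350, 0.0), (500, 1.5)]]

# One round of communication: each site sends five numbers, nothing else.
federated = fit_from_summaries([local_summary(xs, ys) for xs, ys in sites])

# Pooled analysis for comparison (would require sharing patient-level data).
pooled = pooled_fit([x for xs, _ in sites for x in xs],
                    [y for _, ys in sites for y in ys])
print("federated:", federated)
print("pooled:   ", pooled)
```

In this toy setting the federated and pooled estimates agree up to floating-point error, which is what "lossless" means here. Models without additive sufficient statistics, such as logistic or Cox regression, need more careful constructions than this sketch shows.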

Causal ML/AI

We develop causal machine learning and AI methods that integrate modern AI models with principled causal reasoning. Our work focuses on estimating heterogeneous treatment effects, addressing systematic bias in observational data, and improving robustness and interpretability of causal models in real-world settings. We study how causal structure, negative controls, and representation learning can be incorporated into machine learning pipelines to support reliable decision making, particularly in high-dimensional and complex healthcare data.

SPI

We develop surrogate-powered inference (SPI) methods to improve statistical efficiency when primary outcomes are sparsely observed or costly to collect. By leveraging auxiliary and surrogate outcomes that are more widely available, SPI combines information across labeled and unlabeled data while maintaining valid inference for the primary estimand. This line of work enables reliable estimation and hypothesis testing in real-world settings characterized by missing outcomes, irregular follow-up, and limited labeling.
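The general idea of borrowing strength from a widely available surrogate can be sketched with a classical regression-adjusted (control-variate) estimator of a mean. This is a stand-in chosen for illustration, not the SPI methodology itself. It assumes the labeled records are a random subset of all records, and the function and variable names are hypothetical.

```python
import random
from statistics import fmean

def surrogate_assisted_mean(y_labeled, s_labeled, s_all):
    """Estimate the mean of a primary outcome Y using a surrogate S.

    y_labeled, s_labeled: paired values for the records where Y was observed.
    s_all: surrogate values for every record, labeled or not.
    Assumption: labeling is completely at random.
    """
    y_bar = fmean(y_labeled)
    s_bar_labeled = fmean(s_labeled)
    s_bar_all = fmean(s_all)
    # Slope of Y on S in the labeled data; it scales the surrogate correction.
    cov = sum((y - y_bar) * (s - s_bar_labeled)
              for y, s in zip(y_labeled, s_labeled))
    var = sum((s - s_bar_labeled) ** 2 for s in s_labeled)
    gamma = cov / var
    # Shift the labeled-only mean by how unrepresentative the labeled
    # surrogates are relative to the full cohort.
    return y_bar + gamma * (s_bar_all - s_bar_labeled)

# Simulated cohort: the surrogate is known for everyone, Y for the first 100.
rng = random.Random(1)
s_all = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y_all = [0.8 * s + rng.gauss(0.0, 0.3) for s in s_all]
n_labeled = 100
estimate = surrogate_assisted_mean(y_all[:n_labeled], s_all[:n_labeled], s_all)
naive = fmean(y_all[:n_labeled])
print("surrogate-assisted:", estimate, "labeled-only:", naive)
```

When the surrogate is uncorrelated with the primary outcome, the estimated slope is near zero and the estimator falls back to the labeled-only mean, so a poor surrogate costs little. Valid standard errors and non-random labeling are outside the scope of this sketch.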

Meta-analysis

We develop methodological advances in meta-analysis and evidence synthesis to support large-scale comparative effectiveness research and safety monitoring. Our work addresses multivariate and network meta-analysis, publication bias, small-study effects, and dynamic monitoring of adverse events. These methods enable patient-centered treatment ranking, real-time pharmacovigilance, and more reliable aggregation of evidence across heterogeneous studies and data sources.
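As a baseline for the kind of aggregation these methods extend, the sketch below pools study-level effect estimates with the standard DerSimonian-Laird random-effects estimator. It is a univariate textbook method shown for orientation only, not the multivariate or network approaches described above. The function name and the example inputs are hypothetical.

```python
import math

def random_effects_pool(estimates, variances):
    """DerSimonian-Laird random-effects pooling of study-level estimates.

    estimates: per-study effect estimates (for example, log odds ratios).
    variances: their within-study sampling variances.
    Returns (pooled estimate, standard error, between-study variance tau^2).
    """
    k = len(estimates)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")
    # Fixed-effect (inverse-variance) pooled estimate.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Re-weight with the between-study variance added to each study.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2

# Hypothetical inputs: three studies reporting log odds ratios.
pooled, se, tau2 = random_effects_pool([0.10, 0.35, 0.60], [0.02, 0.03, 0.05])
print(f"pooled={pooled:.3f}, 95% CI half-width={1.96 * se:.3f}, tau^2={tau2:.3f}")
```

When the studies agree closely, the estimated between-study variance is truncated at zero and the result coincides with the fixed-effect analysis.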

Pediatric COVID-19

We conduct large-scale real-world studies to understand the effectiveness, safety, and long-term consequences of SARS-CoV-2 infection and vaccination in children and adolescents. Our work leverages multi-site electronic health record data to study infection risk, severe outcomes, long COVID, reinfection, and disparities across populations. Methodologically, these studies integrate causal inference, mediation analysis, and bias calibration to address confounding and data limitations in observational pediatric research.

Metabolic Therapies

We study the real-world effectiveness and safety of metabolic therapies using causal inference and target trial emulation in large-scale healthcare data. Our work focuses on contemporary antihyperglycemic treatments, including GLP-1 receptor agonists and related drug classes, and evaluates their cardiovascular, renal, and psychiatric outcomes across diverse patient populations. By leveraging longitudinal electronic health records and rigorous causal designs, these studies aim to inform clinical decision making beyond glycemic control and support evidence generation in routine care settings.

Pharmacovigilance

We develop and apply statistical and causal methods for pharmacovigilance and post-marketing vaccine safety surveillance. Our work focuses on detecting adverse events, characterizing temporal risk patterns, and addressing confounding and reporting biases in large observational and surveillance data sources. These methods support regulatory decision making and real-time safety monitoring at population scale.

Telemedicine

We study telemedicine as a core component of learning health systems, focusing on how virtual care reshapes access, utilization, and downstream clinical outcomes. Our work combines causal inference, digital twins, and real-world data to evaluate when telemedicine substitutes for in-person care and when it expands overall utilization. These studies aim to inform policy, reimbursement, and system-level design of hybrid care delivery models.

Education

Dr. Yong Chen earned a PhD in Biostatistics and an MA in Pure Mathematics from Johns Hopkins University. Prior to his graduate work, he received a BS in Mathematics from the University of Science and Technology of China.

Mentoring Philosophy

My approach to mentoring centers on developing good scientific judgment. Beyond technical skills in mathematics, statistics, and programming, I encourage trainees and collaborators to cultivate independence, critical thinking, and ambition grounded in substance rather than trends. I emphasize spending time understanding and formulating the right problem—thinking qualitatively about context, assumptions, and consequences—before optimizing solutions. This habit, developed through practice and reflection, builds the kind of common sense that transfers across domains.

My goal is to train the next generation of leaders in biomedical data science—across academia, industry, policy, and entrepreneurship—who will carry forward a principled approach to medical AI: rigorous, transparent, human-centered, and grounded in real-world impact. I view mentoring as a long-term investment in people, helping them develop the taste, confidence, and responsibility needed to contribute meaningful and trustworthy science in diverse professional settings.

We are actively recruiting self-motivated students and trainees who are excited about medical AI, causal machine learning, and real-world evidence generation. Prospective students and collaborators interested in working with me are encouraged to email me directly with a brief description of their background and interests.

Editorial and Professional Service

Dr. Yong Chen serves as a Statistical Editor for the Annals of Internal Medicine, a Statistical Consultant for NEJM AI, and an Associate Editor for both the Journal of the American Statistical Association: Applications and Case Studies (JASA-ACS) and The Annals of Applied Statistics.
