How Scoring Decisions Can Bias Your Study’s Results: A Trip Through the IRT Looking Glass

Feb 21, 2022 10:30 AM to 11:45 AM

This free event is brought to you by EdPolicy Works of UVA's School of Education & Human Development and open to the public.

Jim Soland
Assistant Professor of Education
University of Virginia


Jim Soland is an Assistant Professor of Quantitative Methods at the University of Virginia and an Affiliated Research Fellow at NWEA, an assessment nonprofit. His research is situated at the intersection of educational and psychological measurement, practice, and policy. Particular areas of emphasis include understanding how measurement decisions impact estimates of treatment effects and psychological/social-emotional growth, as well as detecting and quantifying test/survey disengagement. His work has been featured by the Collaborative for Academic, Social, and Emotional Learning (CASEL), the Brookings Institution, and the New York Times. Prior to joining NWEA, Jim completed a doctorate in Educational Psychology at Stanford University with a concentration in measurement. Jim has also served as a classroom teacher, a policy analyst at the RAND Corporation, and a Senior Fiscal Analyst at the Legislative Analyst's Office (LAO), a nonpartisan organization that provides policy analysis to support the California Legislature and general public.


Though much effort often goes into designing educational studies, the measurement model and scoring approach employed are frequently an afterthought, especially when short survey scales are used (Flake & Fried, 2020). One possible reason measurement gets downplayed is that there is generally little understanding of how calibration/scoring approaches could impact common estimates of interest, including treatment effect estimates, beyond random noise due to measurement error. Another possible reason is that the process of scoring is complicated: it involves selecting a suitable measurement model, calibrating its parameters, and then deciding how to generate a score, all steps that occur before the score is even used to examine the phenomenon of interest. In this study, we provide three motivating examples in which surveys are used to understand individuals' underlying social-emotional constructs, demonstrating the potential consequences of measurement/scoring decisions. These examples also allow us to walk through the different stages of measurement decision-making and, hopefully, begin to demystify them. As our analyses show, the decisions researchers make about how to calibrate and score a survey have consequences that are often overlooked, with likely implications both for conclusions drawn from individual psychological studies and for replications of those studies.
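To make the scoring stage concrete, the sketch below scores response patterns under a two-parameter logistic (2PL) IRT model using expected a posteriori (EAP) estimation. This is a minimal, hypothetical illustration (the item parameters and the choice of a 2PL model are assumptions, not taken from the talk): it shows how two respondents with the *same* sum score can receive different model-based scores once item discriminations differ, which is one way scoring decisions can shift downstream estimates.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT: probability of endorsing an item given latent trait theta,
    discrimination a, and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_score(responses, a, b, n_quad=61):
    """Expected a posteriori (EAP) trait estimate under a standard-normal
    prior, computed on a simple quadrature grid."""
    theta = np.linspace(-4, 4, n_quad)        # quadrature points
    prior = np.exp(-0.5 * theta**2)           # N(0, 1) density, up to a constant
    like = np.ones_like(theta)
    for r, ai, bi in zip(responses, a, b):
        p = p_correct(theta, ai, bi)
        like *= p if r == 1 else (1.0 - p)
    post = like * prior                       # unnormalized posterior
    return float(np.sum(theta * post) / np.sum(post))

# Hypothetical item parameters: three items of equal difficulty but
# very different discrimination.
a = np.array([0.5, 1.0, 2.0])
b = np.array([0.0, 0.0, 0.0])

# Two response patterns with the SAME sum score (2 of 3 items endorsed):
s1 = eap_score([1, 1, 0], a, b)  # missed the most discriminating item
s2 = eap_score([0, 1, 1], a, b)  # missed the least discriminating item
print(s1, s2)
```

Under a sum-score approach these two respondents would be treated as identical, while the model-based EAP scores differ because missing the highly discriminating item is more informative than missing the weakly discriminating one; this gap is exactly the kind of overlooked scoring consequence the abstract describes.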