Earlier this summer, the Quantitative Methods area in the Department of Educational Psychology at UW-Madison had five students present at the National Council on Measurement in Education (NCME) Conference.

Kylie Gorney presented three papers at the NCME conference. The first, “When to Use Synthetic Linking Functions in Small-Sample Equating,” aimed to expand understanding of the specific contexts in which synthetic equating may improve upon traditional equating. The results demonstrated that synthetic equating may be considered when the sample size is 25 or smaller, provided that the test forms being equated do not differ markedly in difficulty; for all other contexts studied, traditional equating appeared preferable.
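For readers unfamiliar with the technique, a synthetic linking function is, in general form, a weighted combination of an estimated equating function and the identity function (the particular weighting scheme examined in the paper is not reproduced here):

$$e_{\mathrm{syn}}(x) = w\,\hat{e}_Y(x) + (1 - w)\,x, \qquad 0 \le w \le 1,$$

where $\hat{e}_Y(x)$ is the traditional equating of score $x$ onto the reference form and $w$ is the synthetic weight. With very small samples, shrinking the link toward the identity function trades some bias for a reduction in sampling error, which is why the approach can pay off only when the forms are similar in difficulty.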
In her second study, “A Comparison of Anchor Lengths and Item Selection Methods in Small-Sample Equating,” Kylie compared three anchor selection methods (minitest, semi-miditest, and miditest) across four equating methods, five anchor lengths, and four small-sample conditions (N = 10–100). Among anchor selection methods, the minitest produced the most accurate (least biased) results, while among equating methods, nominal weights mean equating and circle-arc equating generally performed well across conditions.
In her third paper, “Does Item Format Affect Test Security?,” Kylie conducted an experiment with 150 students to evaluate the relative susceptibility of discrete option multiple choice (DOMC) items and traditional multiple-choice (MC) items to test compromise. Although the DOMC items were harder than their MC counterparts, the magnitudes of score gain (on the logit metric) were comparable for the two formats. However, the DOMC format offered other security advantages, both with respect to the speed of responding to compromised items and in showing a stronger relationship between performance on compromised items and the examinee’s underlying latent trait.

Merve Sarac presented two papers at the NCME conference. Her first study, “Preknowledge Detection in Multiple-Format Testing,” examined the efficacy of borrowing information from multiple-choice items to detect examinees with preknowledge of task-based simulation items that were known to be compromised. The study concluded that a method based on differential person functioning (DPF) was more successful at identifying examinees with preknowledge (EWP) than was a regression method. Performance of the DPF method improved as the percentage of EWP increased and the number of contaminated items decreased.
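As a rough illustration of the DPF idea (a minimal sketch under simplifying assumptions, not the authors’ procedure; the function name and the per-examinee two-proportion z-test are chosen purely for exposition), one can contrast each examinee’s performance on the known-compromised items with performance on the secure items and flag examinees who do implausibly better on the compromised set:

```python
import numpy as np
from scipy import stats

def dpf_flag(responses, compromised_mask, alpha=0.05):
    """Flag examinees who perform markedly better on known-compromised items
    than on secure items (one-sided two-proportion z-test per examinee).

    responses        : (n_examinees, n_items) matrix of 0/1 scored responses
    compromised_mask : boolean array of length n_items (True = compromised)
    """
    comp = responses[:, compromised_mask]      # scores on compromised items
    sec = responses[:, ~compromised_mask]      # scores on secure items
    flags = []
    for x_c, x_s in zip(comp, sec):
        n_c, n_s = x_c.size, x_s.size
        p_c, p_s = x_c.mean(), x_s.mean()
        p_pool = (x_c.sum() + x_s.sum()) / (n_c + n_s)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_s))
        z = (p_c - p_s) / se if se > 0 else 0.0
        flags.append(stats.norm.sf(z) < alpha)  # flag only "too good" on compromised items
    return np.array(flags)
```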
The second paper, “Reducing Score Bias Through Real-Time Rerouting of Examinees with Anomalous Responses,” developed a methodology for identifying, during an exam, examinees who had received prior information about certain test questions, and demonstrated that rerouting these examinees toward highly secure items mid-exam significantly reduced bias in test scores, improved the accuracy of pass/fail decisions, and reduced the need for post-exam investigations and sanctions.

Qi (Helen) Huang presented an NCME paper, “Relative Robustness of CDMs and (M)IRT in Measuring Change in Latent Skills,” that compared different psychometric approaches to studying student growth. Although there is considerable current interest in describing test performance using discrete categories (e.g., student mastery/nonmastery of specific skills), Helen showed that such approaches unfortunately provide overly coarse ways of understanding student growth and may miss actual student gains that are better captured using continua.

Sinan Yavuz had three studies at NCME this year. The first, “Imputations for Large-Scale Assessment Contextual Data: Is Recreation of Plausible Values Necessary?,” tested two multiple imputation (MI) techniques for NAEP contextual data: (1) simple MI using the existing plausible values and (2) nested MI, in which plausible values are re-created conditional on the imputed contextual data. Estimates from the simple MI were not as accurate as those from the nested MI method, although the nested method also yielded slightly larger standard errors. Nevertheless, estimates from both MI methods fell within the confidence intervals.
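To make the contrast between the two techniques concrete, here is a heavily simplified sketch of the nested structure (a hypothetical illustration, not the operational NAEP procedure or the study’s setup; sklearn’s IterativeImputer stands in for the chained-equations imputation, and a linear latent regression with normal residual draws stands in for the plausible-value machinery):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def nested_mi_plausible_values(X_context, theta_hat, M=5, N=5):
    """Illustrative nested-MI sketch: impute contextual data M times, then,
    within each imputed dataset, draw N plausible values from a normal
    posterior based on a latent regression of a provisional ability estimate
    on the imputed context.

    X_context : (n_examinees, p) contextual data with np.nan for missing entries
    theta_hat : (n_examinees,) provisional ability estimates (a stand-in for
                the latent posterior used operationally)
    """
    pv_sets = []
    for m in range(M):
        # Stage 1: impute the contextual covariates (chained-equations analog)
        imputer = IterativeImputer(sample_posterior=True, random_state=m)
        X_imp = imputer.fit_transform(X_context)

        # Stage 2: latent regression on the imputed context, then draw PVs
        reg = LinearRegression().fit(X_imp, theta_hat)
        mu = reg.predict(X_imp)
        sigma = np.std(theta_hat - mu)
        for _ in range(N):
            pv_sets.append(mu + rng.normal(scale=sigma, size=mu.shape))
    return np.array(pv_sets)  # (M*N, n_examinees), nested within imputations
```

The simple-MI comparison condition would skip the inner stage and reuse the existing plausible values unchanged across imputations.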
The second study, “Multiple Imputation Approach to Missingness at the School Level: A Comparison Study,” addressed missingness in a school-level variable using two multiple imputation approaches: chained equations (MICE) as implemented in R and in Blimp. Both methods performed well with 25% missingness and, for the most part, with 50% missingness.
The third study, “Using Multiple Imputations to Handle Non-Random Missing in Multidimensional Multistage Testing,” mimicked the 2015 NAEP multistage testing (MST) design and compared three imputation methods for item parameter recovery. Due to the large amount of missingness, however, none of the methods performed satisfactorily; item discrimination parameters, in particular, were estimated with considerable bias.

Yiqin Pan presented four papers at the 2021 NCME conference. The first, “A Support-Vector-Machine-Based Approach for Detecting Item Preknowledge in CAT,” uses a support-vector machine (SVM) classification algorithm to simultaneously identify compromised items and examinees with preknowledge in a computerized adaptive test, based on a combination of response accuracy and response time data. A detailed simulation study demonstrated that, provided less than half of the test was compromised and fewer than half of the examinees had preknowledge, the SVM approach had uniformly strong power to detect compromised items and examinees, while false positive rates remained well controlled, below .05, across all conditions studied.
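The sketch below conveys the general flavor of an SVM operating on joint accuracy and response-time features (an illustration only: the feature construction, the supervised training on simulated labels, and the function name are assumptions, and the paper’s method identifies compromised items and examinees simultaneously rather than from pre-labeled events):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def fit_preknowledge_svm(features, labels):
    """Illustrative SVM for flagging examinee-item events as preknowledge-like.

    features : (n_events, 2) array, e.g. [score residual, log response-time residual]
    labels   : (n_events,) 0/1 training labels from simulated data
    """
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    model.fit(features, labels)
    return model

# Usage sketch with simulated residuals: unusually accurate yet unusually fast
# responses are the signature the classifier is meant to pick up.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (500, 2)),              # ordinary responses
               rng.normal([1.5, -1.5], 0.5, (100, 2))])     # accuracy up, RT down
y = np.r_[np.zeros(500), np.ones(100)]
clf = fit_preknowledge_svm(X, y)
flags = clf.predict(X) == 1
```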
In the paper “Understanding Different Ways to Compute Measurement Errors and Score Reliability for Adaptive Tests,” Yiqin evaluated the differences among four approaches for computing conditional and unconditional standard errors of measurement, as well as a measure of test reliability, as a function of test length and the ability-generating model for a multistage test. Results demonstrated noteworthy differences in the patterns of conditional standard error estimates, but not in the unconditional standard errors. Also, while the ability-generating model did not have a significant effect on either type of standard error, it produced a pronounced effect on the test reliabilities.
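For context, one common way (of several, which is part of why different computational approaches can diverge) of relating conditional standard errors to a marginal reliability estimate is

$$\bar{\sigma}_E^{2} = \frac{1}{N}\sum_{i=1}^{N}\sigma_E^{2}(\hat{\theta}_i), \qquad \rho \approx 1 - \frac{\bar{\sigma}_E^{2}}{\hat{\sigma}_{\hat{\theta}}^{2}},$$

where $\sigma_E(\hat{\theta}_i)$ is the conditional standard error for examinee $i$ (e.g., the reciprocal square root of the test information at $\hat{\theta}_i$), the unconditional standard error is the square root of the average conditional error variance, and $\hat{\sigma}_{\hat{\theta}}^{2}$ is the variance of the ability estimates. Choices of which variance to place in the denominator and how to average the conditional errors are exactly where such approaches can differ.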
In her paper “An Autoencoder-Based Approach to Modeling Response Times,” Yiqin proposes an autoencoder neural network for modeling response times and demonstrates the utility of this model in detecting compromised items and examinees during a computerized adaptive test. The approach demonstrated strong power for detecting items provided the proportion of compromised items did not exceed 20%, while the detection of examinees was strong to moderate across all conditions. False positive rates were uniformly well controlled.
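As a toy version of the idea (a minimal sketch, not the proposed architecture; here a one-hidden-layer MLP trained to reproduce its input plays the role of the autoencoder), one can flag examinee-item response times that are much shorter than the examinee’s overall speed profile predicts:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def rt_autoencoder_flags(log_rt, hidden=4, threshold=2.0):
    """Illustrative autoencoder-style screen for aberrantly fast responses.

    log_rt : (n_examinees, n_items) matrix of log response times
    Returns a boolean matrix flagging cells whose observed log RT falls far
    below the value reconstructed from the compressed speed profile.
    """
    ae = MLPRegressor(hidden_layer_sizes=(hidden,), activation="tanh",
                      max_iter=2000, random_state=0)
    ae.fit(log_rt, log_rt)                     # train the network to reproduce its input
    resid = log_rt - ae.predict(log_rt)        # reconstruction residuals
    z = (resid - resid.mean()) / resid.std()   # standardize across all cells
    return z < -threshold                      # negative = faster than reconstructed
```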
Her fourth paper, “A Cross-Validated Unsupervised-Learning-Based Approach for the Simultaneous Detection of Preknowledge in Examinees and Items when Both are Unknown,” extends a previous machine learning model she developed for detecting compromised items so that it also identifies examinees with preknowledge (EWP). False positive rates for both items and examinees were very small, and the simulated factors had little effect on them. False negative rates were affected by the proportion of compromised items, the proportion of compromised items to which EWP had access, and the interaction of these two factors with the proportion of EWP.