Thursday, November 20, 2014

Analyzing Assessment Items

By Dr. Pooja Shivraj, RME Educational Assessment Researcher

Much of the work we do at Research in Mathematics Education involves the development of assessments used by educators to identify students who may be struggling with algebra-readiness knowledge and skills, so that teachers can provide additional instructional support. The research process we use is rigorous and begins with an assessment blueprint, then item writing, internal reviews and external expert reviews, followed by a pilot test and finally the development of the test forms. The pilot test is given to a large number of students in order to determine the validity of the assessment items. Our researchers receive the results of the pilot and perform an extensive statistical analysis to determine if an item is good, psychometrically speaking.

The point of obtaining item statistics is to develop a pool of items that function well from which future tests can be designed. There are two kinds of analyses that can be performed: a Classical Test Theory (CTT) analysis, which is sample-dependent and non-model based, or an Item Response Theory (IRT) analysis, which is sample-independent and model-based. Regardless of the type of analysis performed, three primary statistics are used to determine if an item is psychometrically good. The ranges listed below are the acceptable norms found in the literature.

(1) Each item score should correlate strongly with the total score. In other words, test-takers who answer the item correctly should tend to earn a higher total score. This statistic is measured by the point-biserial correlation (CTT) or the point-measure correlation (IRT). A good item has a point-biserial correlation of >0.2 or a point-measure correlation of >0.25.
(2) The difficulty of the item, measured in CTT by the proportion of test-takers answering the item correctly, should be between 30% and 80%. In IRT, the difficulty parameter, b, should be between -4 and +4.
Figure: An item characteristic curve depicting the discrimination parameter (a) and the difficulty parameter (b) in an IRT model
(3) The discrimination of the item, also measured by the point-biserial correlation in CTT, should be higher for the correct response than for the distractors. In IRT, the discrimination parameter, a, should be between 0.5 and 1.5. The greater the discrimination, the better the item distinguishes between lower-ability and higher-ability students.
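
As a rough illustration of the three criteria, the sketch below computes the CTT difficulty and point-biserial correlation for each item in a small scored response matrix and checks them against the ranges listed above. The function names, the made-up response data, and the choice to correlate each item with the uncorrected total score (which includes the item itself) are assumptions for illustration, not a description of RME's actual analysis. The `icc_2pl` function assumes a two-parameter logistic form for the item characteristic curve, which is one common IRT model using the a and b parameters.

```python
import math
from statistics import mean, pstdev


def item_difficulty(item_scores):
    """CTT difficulty: proportion of test-takers answering correctly (0/1 scores)."""
    return mean(item_scores)


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total score."""
    sx, sy = pstdev(item_scores), pstdev(total_scores)
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when there is no variation
    mx, my = mean(item_scores), mean(total_scores)
    cov = mean([(x - mx) * (y - my) for x, y in zip(item_scores, total_scores)])
    return cov / (sx * sy)


def icc_2pl(theta, a, b):
    """Probability of a correct response at ability theta (assumed 2PL form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def screen_items(responses):
    """responses: one list of 0/1 item scores per student (hypothetical layout)."""
    totals = [sum(row) for row in responses]
    report = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        p = item_difficulty(item)
        r = point_biserial(item, totals)
        # Ranges taken from the criteria above: 30%-80% correct, r_pb > 0.2.
        ok = 0.30 <= p <= 0.80 and r > 0.2
        report.append({"item": j, "p": p, "r_pb": r, "ok": ok})
    return report


# Made-up pilot data: 6 students, 4 items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]
report = screen_items(responses)
for row in report:
    print(row)
```

In this toy data the first item is answered correctly by everyone, so its difficulty falls outside the 30% to 80% range and its correlation with the total score is undefined. A real pilot would use far more students, and an IRT analysis would estimate a and b with dedicated software rather than by hand.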

What can you do with items that don't function well? The first step is to review the data. Are the items functioning poorly because the majority of students are choosing the correct answer? Is one distractor not being chosen at all? Is a majority of students choosing a single distractor over the other options? Any of these patterns would be a red flag. The next step is to review the content of all the items that don't function well, especially the items that were flagged in the previous step. What about the content led students to choose or not choose a particular response choice?
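
The data review described above can be sketched as a simple tally of how often each response option was chosen. The function name, the four-option A-D layout, and the 80% and 50% cutoffs are assumptions made for this example; the cutoffs echo the difficulty range given earlier but are not prescribed thresholds.

```python
from collections import Counter


def distractor_flags(choices, key, options=("A", "B", "C", "D"), easy_cutoff=0.80):
    """Return each option's share of responses and a list of red flags.

    choices: the option each student selected for one item
    key: the correct option
    """
    if not choices:
        raise ValueError("no responses to analyze")
    counts = Counter(choices)
    n = len(choices)
    share = {opt: counts.get(opt, 0) / n for opt in options}

    flags = []
    if share[key] > easy_cutoff:
        flags.append("most students chose the correct answer")
    for opt in options:
        if opt == key:
            continue
        if share[opt] == 0:
            flags.append(f"distractor {opt} was never chosen")
        elif share[opt] > 0.5:
            flags.append(f"a majority of students chose distractor {opt}")
    return share, flags


# Made-up responses from 10 students; the correct answer is B.
choices = list("AAAAAABBBC")
share, flags = distractor_flags(choices, key="B")
print(share)
print(flags)
```

Flags like these only point to items worth a closer look. Deciding why students were drawn to, or ignored, a particular option still requires reading the item content.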

Using this process of analyzing data, reviewing items, and adjusting the content of the items, a pool of items that function well can be developed for use in the future.

Note: In addition to the statistics described above, many others (e.g., IRT fit statistics such as chi-square, infit, and outfit) can be used to determine whether an item functions well, and some of them also provide information at the test level. Please feel free to email me if you would like more information at