Validation of Assessment Instrument

After performing the item analysis and revising the items that need it, the next step is to validate the instrument. The purpose of validation is to determine the characteristics of the test as a whole, namely, its validity and reliability. Validation is the process of collecting and analyzing evidence to support the meaningfulness and usefulness of the test.


Validity is the extent to which a test measures what it purports to measure. It may also be defined as the appropriateness, correctness, meaningfulness and usefulness of the specific decisions a teacher makes based on the test results. The two definitions differ in that the first refers to the test itself, while the second refers to the decisions the teacher makes based on the test.

A teacher who conducts test validation might want to gather different kinds of evidence. There are essentially three main types of evidence that may be collected: content-related evidence of validity, criterion-related evidence of validity and construct-related evidence of validity. Content-related evidence of validity refers to the content and format of the instrument. How appropriate is the content? How comprehensive? Does it logically get at the intended variable? How adequately does the sample of items or questions represent the content to be assessed?

Criterion-related evidence of validity refers to the relationship between scores obtained using the instrument and scores obtained using one or more other tests (often called the criterion). How strong is this relationship? How well do such scores estimate present performance or predict future performance of a certain type?

Construct-related evidence of validity refers to the nature of the psychological construct or characteristic being measured by the test. How well does a measure of the construct explain differences in the behavior of the individuals or their performance on a certain task?

The usual procedure for determining content validity may be described as follows: The teacher writes out the objectives of the test based on the table of specifications and then gives these, together with the test and a description of the intended test takers, to at least two (2) experts. The experts look at the objectives, read over the items in the test and place a check mark in front of each question or item that they feel does not measure one or more objectives. They also place a check mark in front of each objective not assessed by any item in the test. The teacher then rewrites every item so checked and resubmits it to the experts, and/or writes new items to cover the objectives not yet covered by the existing test. This continues until the experts approve all items and agree that all of the objectives are sufficiently covered by the test.
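The bookkeeping for one round of this expert review can be sketched in Python. This is only an illustration under stated assumptions: the function name `summarize_review`, the dictionary keys, and the item and objective labels are all hypothetical, and the sketch assumes that a check mark from any one expert is enough to send an item back for rewriting.

```python
def summarize_review(expert_reviews):
    """Combine the check marks from all experts for one review round.

    Each review is a dict with two lists (hypothetical layout):
      "flagged_items"        - items the expert feels measure no objective
      "uncovered_objectives" - objectives the expert finds no item for
    """
    items_to_rewrite = set()
    objectives_to_cover = set()
    for review in expert_reviews:
        items_to_rewrite |= set(review["flagged_items"])
        objectives_to_cover |= set(review["uncovered_objectives"])
    # The process stops only when nothing is checked by any expert.
    approved = not items_to_rewrite and not objectives_to_cover
    return sorted(items_to_rewrite), sorted(objectives_to_cover), approved


# Hypothetical first round with two experts (illustrative labels only).
round_one = [
    {"flagged_items": ["Item 4", "Item 9"], "uncovered_objectives": ["Objective 3"]},
    {"flagged_items": ["Item 9"], "uncovered_objectives": []},
]
items, objectives, approved = summarize_review(round_one)
print("Rewrite:", items)
print("Write new items for:", objectives)
print("Approved:", approved)
```

The teacher would repeat the round with the revised test until `approved` comes back true, which mirrors the stopping rule described above.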

In order to obtain evidence of criterion-related validity, the teacher usually compares scores on the test in question with scores on some other independent criterion test which presumably already has high validity. For example, if a test is designed to measure the mathematics ability of students and it correlates highly with a standardized mathematics achievement test (external criterion), then we say we have high criterion-related evidence of validity. In particular, this type of criterion-related validity is called concurrent validity. Another type of criterion-related validity is called predictive validity, wherein the scores on the instrument are correlated with scores on a later performance (criterion measure) of the students. For example, the scores on the mathematics ability test constructed by the teacher may be correlated with the students' later performance in a Division-wide mathematics achievement test.
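As a minimal sketch of how such a correlation might be computed, the Python example below calculates the Pearson product-moment correlation coefficient between scores on a teacher-made test and scores on a criterion test. The choice of Pearson's r is an assumption (the text does not name a specific coefficient), the function name `pearson_r` is hypothetical, and the scores are invented purely for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for the correlation to be defined")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for eight students (not real data).
teacher_test = [12, 15, 9, 20, 17, 11, 14, 18]
criterion_test = [30, 36, 25, 48, 41, 29, 33, 44]

r = pearson_r(teacher_test, criterion_test)
print(f"r = {r:.2f}")
```

A coefficient close to 1 would count as strong criterion-related evidence. If the criterion scores are collected at about the same time as the test, the evidence is concurrent; if they come from a later performance, it is predictive.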

Apart from the use of the correlation coefficient in measuring criterion-related validity, Gronlund suggested using the so-called expectancy table. This table is easy to construct and consists of the test (predictor) categories listed on the left-hand side and the criterion categories listed horizontally along the top of the chart. For example, suppose that a mathematics achievement test is constructed and the scores are categorized as high, average, and low. The criterion measure used is the final average grades of the students in high school: Very Good, Good, and Needs Improvement. The two-way table lists the number of students falling under each of the possible pairs of (test, grade) as shown below:

                      Grade Point Average (GPA)
Test Score     Very Good     Good     Needs Improvement
High              20          ...           ...
Average           ...         25            ...
Low               ...         ...           14

The expectancy table shows that 20 students obtained high test scores and were subsequently rated Very Good in their final grades; 25 students obtained average scores and were subsequently rated Good; and 14 students obtained low test scores and were later graded as Needs Improvement. The evidence for this particular test thus suggests that students with high scores on it tend to be graded Very Good, students with average scores tend to be rated Good, and students with low scores tend to be graded as Needs Improvement.
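An expectancy table of this kind can be tallied directly from paired records. The Python sketch below is illustrative only: the function names `expectancy_table` and `format_table` are hypothetical, and the nine student records are invented, so the counts do not reproduce the table discussed above.

```python
from collections import Counter

TEST_LEVELS = ["High", "Average", "Low"]                  # predictor categories (rows)
GRADE_LEVELS = ["Very Good", "Good", "Needs Improvement"]  # criterion categories (columns)


def expectancy_table(pairs):
    """Count the students falling in each (test level, grade level) cell."""
    counts = Counter(pairs)
    return {t: {g: counts[(t, g)] for g in GRADE_LEVELS} for t in TEST_LEVELS}


def format_table(table):
    """Lay the counts out with test levels down the side and grades across the top."""
    header = f"{'Test score':<12}" + "".join(f"{g:>20}" for g in GRADE_LEVELS)
    rows = [header]
    for t in TEST_LEVELS:
        rows.append(f"{t:<12}" + "".join(f"{table[t][g]:>20}" for g in GRADE_LEVELS))
    return "\n".join(rows)


# Hypothetical (test level, final grade) records for nine students.
records = [
    ("High", "Very Good"), ("High", "Very Good"), ("High", "Good"),
    ("Average", "Good"), ("Average", "Good"), ("Average", "Very Good"),
    ("Low", "Needs Improvement"), ("Low", "Needs Improvement"), ("Low", "Good"),
]

table = expectancy_table(records)
print(format_table(table))
```

When most of the counts fall on the diagonal (High with Very Good, Average with Good, Low with Needs Improvement), the test scores are doing a reasonable job of anticipating the criterion.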

We will not be able to discuss the measurement of construct-related validity in this book, since the methods used require sophisticated statistical techniques falling under the category of factor analysis.