Educational Technology

Characteristics of evaluative instruments

The characteristics of an effective evaluative instrument are, in general, its adequacy, efficiency, and consistency. These three characteristics depend upon the qualities called validity, reliability, objectivity, norms, and practicability. Item analysis of the test items reveals their difficulty and discrimination levels, and distracter analysis provides information about the incorrect options (distracters) in multiple-choice questions.

Validity

The basic idea involved in validity is how well an evaluative instrument (such as a test) evaluates the students in the light of selected objectives. Validity is that characteristic which indicates the degree to which the instrument measures what it purports to measure.
For example, a test in an elementary nutrition course should measure the elementary or basic knowledge of nutrition that the course is designed to develop, and not skills in cooking. If such is not the case, the test should be considered invalid for measuring basic knowledge of nutrition.
It is difficult to establish the validity of an evaluative instrument on the basis of the marks, or grades, obtained by the students. These grades reflect not only the students' knowledge of the subject matter but quite often also their effort, oral or written fluency, work habits, and other aspects of their behaviour. Using marks or grades in a test, therefore, may not give a correct estimate of the test's validity. Since validity is not an absolute characteristic of an evaluative instrument, several types of validity, such as predictive, concurrent, content, and construct validity, have been identified.

Predictive Validity: Predictive validity is determined by the degree of relationship between a measure and subsequent measures taken over a period of time. This type of validity is required in tests, such as intelligence and aptitude tests, which attempt to predict a person's intelligence or aptitude for later use.

Concurrent Validity: Concurrent validity indicates the relationship between the evaluative instrument and the more or less immediate performance of the persons. The difference between predictive and concurrent validity is merely one of time. In concurrent validity, the degree of concurrence can be judged right at the time of testing. For instance, an interview schedule is judged for its concurrent validity simultaneously with the interview.

Content Validity: Content validity is judged by the degree of relationship between the evaluative instrument and achievement in the subject that the students are taking. To establish content validity, the test, or any other evaluative instrument, should be checked against the course content, textbooks, and syllabi.

Construct Validity: Construct validity can be determined by the relationship between the results of the evaluative instrument and other indicators that measure the required characteristics, such as mental ability, aptitudes, and interests. It is also established by considering together many different kinds of incomplete but complementary evidence. The correlation of several tests and observations of students, taken together, may furnish construct validity.

Reliability

The accuracy of an evaluative instrument is known as its reliability. If repeated use of the same instrument gives consistent results, the reliability of the instrument is established. Reliability is, thus, the degree to which a true measurement of an individual is obtained when an evaluative instrument is used. Reliability is expressed statistically as a coefficient of reliability. Wrightstone has mentioned the following procedures for estimating the coefficient of reliability:

  1. Administration of two equivalent tests and correlation of the scores.
  2. Repeated administration of the same test, or testing procedure, and correlation of the resulting scores.
  3. Sub-division of a single test into two presumably equivalent halves, each scored separately, and correlation of the resulting scores (a sketch of this split-half procedure is given below).
  4. Analysis of variance among individual items and determination of the error variance from these statistics.
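
As an illustration of the third (split-half) procedure, the following is a minimal Python sketch; the item-score matrix, the odd/even split, and the use of the Spearman-Brown correction are assumptions made for demonstration, not a prescribed implementation.

```python
# Split-half reliability sketch. `item_scores` is an assumed matrix:
# rows = students, columns = items scored 0 (wrong) or 1 (right).
from statistics import mean

def pearson(x, y):
    # Plain Pearson product-moment correlation between two score lists.
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def split_half_reliability(item_scores):
    # Score each half separately: odd-numbered items vs. even-numbered items.
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    r_half = pearson(odd_totals, even_totals)
    # Spearman-Brown correction estimates the reliability of the full-length test.
    return (2 * r_half) / (1 + r_half)

# Example data: five students, six items (invented for illustration).
scores = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
]
print(round(split_half_reliability(scores), 2))  # ~0.83 for this sample
```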

Objectivity

When the results of an evaluative instrument are not affected by the examiner's personal opinion, or bias, the instrument is regarded as objective. Tests for students need to be constructed in such a way that the opinions of the teachers do not unduly influence the results. Objectivity in intelligence tests, and in tests of certain disciplines like arithmetic, can be very high, whereas in the social sciences and the humanities objectivity is not very high. Standardized tests have high objectivity because the scoring key is prepared in advance, and hence the bias of the examiner does not influence the results. Evaluative techniques like observation and interview have relatively less objectivity.

Norms

A norm is the average, or typical value of a particular characteristic measured in a specific homogeneous group.
For instance, the norm of typing speed after 20 lessons may be 40 words per minute, and a learner who still has a speed of only 20 words per minute after 20 lessons will be considered below the typing norm.
The raw scores of a test, or an examination, have meaning only when some method of comparing the students is followed. The norms establish the typical, or average, performance or ability against which each student's achievement can be evaluated and measured. There are several kinds of norms for comparing an individual's achievement with the average achievement of a homogeneous group. For instance, there are age norms, and grade or class norms, for converting a raw score into the performance of the average pupil of a given age or class; percentile norms for converting a raw score into a comparison with the percentage of pupils of a given age or class who obtained that raw score; and standard score norms for converting the raw score into a deviation from the mean, or average, of a given age, grade, or other reference group, expressed in standard deviation units.
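
As a small illustration of the last two kinds of norms, the Python sketch below converts an assumed raw score into a standard score (z-score) and a percentile rank against an assumed norm group; the scores used are invented for the example.

```python
# Standard score and percentile norms for a single raw score.
from statistics import mean, pstdev

norm_group = [32, 45, 51, 38, 47, 55, 41, 49, 44, 50]  # assumed reference-group raw scores
raw = 47                                                # assumed raw score of one learner

z = (raw - mean(norm_group)) / pstdev(norm_group)                        # standard score norm
percentile = 100 * sum(s <= raw for s in norm_group) / len(norm_group)   # percentile norm

print(f"z = {z:.2f}, percentile = {percentile:.0f}")
```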

Practicability

An evaluative instrument is a practical proposition if it is easy to use and requires little time, energy, and money. The instrument must be reasonably acceptable to the students as well as to the teachers, and the cost of using it must be within the reach of the college, the school, or the parents if they are to meet the cost. Simplicity in administering, scoring, and interpreting is an important aspect of practicability. Overly complicated scoring, or interpretation of scores, may tend to make the examiner superficial and less objective in his use of the evaluative instrument. When only limited time is available, the use of a complicated test can be frustrating.

Item Difficulty

Item difficulty is simply the percentage of students taking the test who answered the item correctly. The larger the percentage getting an item right, the easier the item; the higher the difficulty index, the easier the item is understood to be (Wood, 1960). To compute the item difficulty, divide the number of people answering the item correctly by the total number of people answering the item. This proportion is usually denoted as p and is called the item difficulty (Crocker & Algina, 1986). An item answered correctly by 85% of the examinees would have an item difficulty, or p value, of 0.85, whereas an item answered correctly by 50% of the examinees would have a lower item difficulty, or p value, of 0.50.
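
A minimal Python sketch of this computation is shown below; the 0/1 response vector is assumed for illustration and reproduces the 85% example above.

```python
# Item difficulty (p value): proportion of examinees answering the item correctly.
# Each entry is one examinee's score on the item: 1 = correct, 0 = incorrect (assumed data).
responses = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]

p = sum(responses) / len(responses)
print(f"p = {p:.2f}")  # 17/20 = 0.85, i.e. a relatively easy item
```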

A p value is basically a behavioral measure. Rather than defining difficulty in terms of some intrinsic characteristic of the item, difficulty is defined in terms of the relative frequency with which those taking the test choose the correct response (Thorndike et al., 1991).

One cannot determine which item is more difficult simply by reading the questions. Judging difficulty by how familiar or obscure an item's content appears is to compute the difficulty of the item using an intrinsic characteristic, and this determines the difficulty in a much more subjective manner than a p value does.

Another implication of a p value is that the difficulty is a characteristic of both the item and the sample taking the test. For example, an English test item that is very difficult for an elementary student will be very easy for a high school student. A p value also provides a common measure of the difficulty of test items that measure completely different domains. It is very difficult to determine whether answering a history question involves knowledge that is more obscure, complex, or specialized than that needed to answer a math problem. When p values are used to define difficulty, it is very simple to determine whether an item on a history test is more difficult than a specific item on a math test taken by the same group of students.

Item Discrimination

If the test and a single item measure the same thing, one would expect people who do well on the test to answer that item correctly, and those who do poorly to answer the item incorrectly. A good item discriminates between those who do well on the test and those who do poorly. Two indices can be computed to determine the discriminating power of an item: the item discrimination index, D, and discrimination coefficients.

Item Discrimination Index, D
The method of extreme groups can be applied to compute a very simple measure of the discriminating power of a test item. If a test is given to a large group of people, the discriminating power of an item can be measured by comparing the number of people with high test scores who answered that item correctly with the number of people with low scores who answered the same item correctly. If a particular item is doing a good job of discriminating between those who score high and those who score low, more people in the top-scoring group will have answered the item correctly.
In computing the discrimination index, D, first score each student's test and rank order the test scores. Next, the 27% of the students at the top and the 27% at the bottom are separated for the analysis. Wiersma and Jurs (1990) stated that "27% is used because it has shown that this value will maximize differences in normal distributions while providing enough cases for analysis" (p. 145). There need to be as many students as possible in each group to promote stability; at the same time, it is desirable to have the two groups be as different as possible to make the discriminations clearer. According to Kelly (as cited in Popham, 1981), the use of 27% maximizes these two characteristics. Nunnally (1972) suggested using 25%.
The discrimination index, D, is the number of people in the upper group who answered the item correctly minus the number of people in the lower group who answered the item correctly, divided by the number of people in the larger of the two groups. Wood (1960) stated that when more students in the lower group than in the upper group select the right answer to an item, the item actually has negative validity: assuming that the criterion itself has validity, the item is not only useless but is actually serving to decrease the validity of the test (p. 87).
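
The sketch below illustrates this computation in Python; the list of (total score, item score) pairs and the rounding of the 27% group size are assumptions made for demonstration.

```python
# Discrimination index D using upper and lower 27% groups.
# `results` pairs each student's total test score with their 0/1 score on one item (assumed data).
def discrimination_index(results, fraction=0.27):
    ranked = sorted(results, key=lambda r: r[0], reverse=True)  # rank by total score
    n = max(1, round(len(ranked) * fraction))                   # size of each extreme group
    upper, lower = ranked[:n], ranked[-n:]
    upper_correct = sum(item for _, item in upper)
    lower_correct = sum(item for _, item in lower)
    return (upper_correct - lower_correct) / n  # both groups have the same size here

results = [(95, 1), (90, 1), (88, 1), (84, 1), (80, 0), (76, 1), (71, 0),
           (69, 1), (66, 0), (60, 1), (55, 0), (52, 0), (48, 1), (40, 0)]
print(round(discrimination_index(results), 2))  # (4 - 1) / 4 = 0.75 for this sample
```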

The higher the discrimination index, the better the item, because such a value indicates that the item discriminates in favor of the upper group, which should get more items correct. An item that everyone gets correct, or that everyone gets incorrect, will have a discrimination index equal to zero. If more students in the lower group than in the upper group get an item correct, the item will have a negative D value and is probably flawed.

A negative discrimination index is most likely to occur when an item covers complex material written in such a way that it is possible to select the correct response without any real understanding of what is being assessed. A poor student may make a guess, select that response, and come up with the correct answer. Good students may be suspicious of a question that looks too easy, take the harder path to solving the problem, read too much into the question, and end up being less successful than those who guess. As a rule of thumb, items with a discrimination index of 0.40 and greater are very good items; 0.30 to 0.39 are reasonably good but possibly subject to improvement; 0.20 to 0.29 are marginal items and need some revision; and items below 0.19 are considered poor and need major revision or should be eliminated (Ebel & Frisbie, 1986).

Discrimination Coefficients
Two indicators of an item's discrimination effectiveness are the point-biserial correlation and the biserial correlation coefficient. The choice of correlation depends upon what kind of question we want to answer. The advantage of using discrimination coefficients over the discrimination index, D, is that every person taking the test is used to compute the discrimination coefficients, whereas only 54% (27% upper + 27% lower) are used to compute the discrimination index, D.

The point-biserial (rpbis) correlation is used to find out whether the right people are getting the item right, how much predictive power the item has, and how it would contribute to predictions. Henrysson (1971) suggests that the rpbis tells more about the predictive validity of the total test than does the biserial r, in that it tends to favor items of average difficulty. It is further suggested that the rpbis is a combined measure of item-criterion relationship and of difficulty level.

Biserial correlation coefficients (rbis) are computed to determine whether the attribute or attributes measured by the criterion are also measured by the item, and the extent to which the item measures them. The rbis gives an estimate of the well-known Pearson product-moment correlation between the criterion score and the hypothesized item continuum when the item is dichotomized into right and wrong (Henrysson, 1971). Ebel and Frisbie (1986) state that the rbis simply describes the relationship between scores on a test item (e.g., "0" or "1") and scores (e.g., "0", "1", ..., "50") on the total test for all examinees.
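
A minimal Python sketch of the point-biserial correlation is given below; the 0/1 item scores and total test scores are assumed for illustration, and the formula used is the standard one, rpbis = ((M_correct - M_all) / SD_all) * sqrt(p / q).

```python
# Point-biserial correlation between a dichotomous item score and the total test score.
from statistics import mean, pstdev

def point_biserial(item_scores, total_scores):
    # item_scores: 0/1 per examinee; total_scores: total test score per examinee.
    p = mean(item_scores)      # item difficulty
    q = 1 - p
    m_correct = mean(t for t, i in zip(total_scores, item_scores) if i == 1)
    m_all = mean(total_scores)
    sd_all = pstdev(total_scores)
    return ((m_correct - m_all) / sd_all) * (p / q) ** 0.5

# Assumed data: ten examinees' item scores and total scores.
item = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
totals = [48, 45, 30, 42, 28, 40, 44, 33, 39, 25]
print(round(point_biserial(item, totals), 2))  # high positive value: the item discriminates well
```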

Distracter Analysis

A distracter analysis addresses the performance of the incorrect response options given in multiple-choice questions. Just as the key, or correct response option, must be definitively correct, the distracters must be clearly incorrect (or clearly not the "best" option). In addition to being clearly incorrect, the distracters must also be believable; that is, they should seem likely or reasonable to an examinee who is not sufficiently knowledgeable in the content area. If a distracter appears so unlikely that almost no examinee will select it, it is not contributing to the performance of the item. In fact, the presence of one or more unreasonable distracters in a multiple-choice item can make the item artificially easier than it ought to be.

In a simple approach to distracter analysis, the proportion of examinees who selected each of the response options is examined. For the key, this proportion is equivalent to the item p-value, or difficulty. If the proportions are summed across all of an item’s response options they will add up to 1.0, or 100% of the examinees’ selections.

The proportion of examinees who select each of the distracters can be very informative. For example, it can reveal an item mis-key: whenever the proportion of examinees who selected a distracter is greater than the proportion of examinees who selected the key, the item should be examined to determine whether it has been mis-keyed or double-keyed. A distracter analysis can also reveal an implausible distracter. In criterion-referenced tests (CRTs), where the item p values are typically high, the proportions of examinees selecting the distracters are, as a result, low. Nevertheless, if examinees consistently fail to select a given distracter, this may be evidence that the distracter is implausible or simply too easy.
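
The sketch below illustrates this simple approach in Python; the response letters, the option set A-D, and the key ("B") are all assumed for demonstration.

```python
# Distracter analysis: proportion of examinees selecting each option of one item.
from collections import Counter

responses = list("BBCBADBBCBBBDBABBBCB")  # assumed: one chosen option per examinee
key = "B"                                  # assumed correct option

counts = Counter(responses)
n = len(responses)
for option in "ABCD":
    share = counts.get(option, 0) / n
    tag = " (key)" if option == key else ""
    print(f"{option}: {share:.2f}{tag}")
# The key's share equals the item p value; a distracter chosen more often than
# the key would suggest a possible mis-key, and a share near zero an implausible distracter.
```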

Implications of item analysis

Item difficulty could more accurately have been named item easiness: it expresses the proportion or percentage of students who answered the item correctly. Item difficulty can range from 0.0 (none of the students answered the item correctly) to 1.0 (all of the students answered the item correctly). Experts recommend that the average level of difficulty for a four-option multiple-choice test should be between 60% and 80%; an average difficulty within this range can, of course, be obtained even when the difficulty of individual items falls outside of it.

If an item has a low difficulty value, say, less than .25, there are several possible causes: the item may have been miskeyed; the item may be too challenging relative to the overall level of ability of the class; the item may be ambiguous or not written clearly; there may be more than one correct answer.

Insight into the cause of a low difficulty value can often be gained by examining the percentage of students who chose each response option. For example, when a high percentage of students chose a single option other than the one keyed as correct, it is advisable to check whether a mistake was made on the answer key.

The point-biserial correlation is an index of item discrimination, i.e., how well the item serves to discriminate between students with higher and lower levels of knowledge. The point-biserial correlation reflects the degree of relationship between scores on the item — 0=incorrect, 1=correct — and total test scores. Thus the point-biserial will be positive if better students answered the item correctly more frequently than poorer students did, and negative if the opposite occurred.

A negative point-biserial is denoted by a minus sign in front of the value. The value of a positive point-biserial discrimination index can range between 0 and 1; the closer the value is to 1, the better the discrimination. The value of a negative point-biserial discrimination index can range between -1 and 0, but positive values are desirable.

Item discrimination is greatly influenced by item difficulty. Items with a difficulty of either 0 or 1 will always have a discrimination index of 0, and item discrimination is maximized when item difficulty is close to 0.5. As a general rule, point-biserial values of 0.20 and above are considered to be desirable.

Items with negative discrimination values should be reviewed. A negative discrimination value, like a low difficulty value, may occur as a result of several possible causes: a miskeyed item, an item that is ambiguous, or an item that is misleading.

In small classes negative values close to zero are not necessarily reason for concern; they may be caused by one good student answering the item incorrectly or one poor student answering the item correctly.

KR 20 is an index of the internal consistency of the test. "Internal consistency" refers to the consistency of students' responses across the items on the test. KR 20 can be thought of as a measure of the extent to which the items on a test provide consistent information about a student's level of knowledge of the content assessed by the test. Assuming that all the items on a test relate to a single content domain, we would expect students with a very high level of knowledge of the domain to answer most items correctly and students with a very low level of knowledge of the domain to answer most items incorrectly. The value of KR 20 can range from 0 to 1, with numbers closer to 1 reflecting greater internal consistency. The value of KR 20 may be negative under certain circumstances, but this is a rare occurrence.
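
For reference, KR 20 is computed as (k / (k - 1)) * (1 - Σ p·q / σ²), where k is the number of items, p and q are each item's proportions of correct and incorrect responses, and σ² is the variance of the total scores. The Python sketch below applies this formula to an assumed 0/1 item-score matrix.

```python
# KR 20 internal-consistency estimate from a 0/1 item-score matrix
# (rows = students, columns = items); the data are assumed for illustration.
def kr20(item_scores):
    k = len(item_scores[0])                                   # number of items
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / len(totals)
    var_total = sum((t - mean_total) ** 2 for t in totals) / len(totals)
    # Sum of p*q over items, where p is each item's difficulty.
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / len(item_scores)
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)

scores = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
]
print(round(kr20(scores), 2))  # ~0.55 for this small sample
```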

What are acceptable values of KR 20? There is no single answer. As a general rule, values of KR 20 for professionally developed and widely administered tests such as the SAT or GRE are expected to be greater than or equal to 0.80. Values of KR 20 for tests developed by instructors are not held to the same standard; one rule of thumb states that values greater than or equal to 0.70 are acceptable. Values of KR 20 for tests that assess several content areas or topics are expected to be lower than values of KR 20 for tests that assess a single content area.

When interpreting the value of KR 20, two additional factors should be considered: the size of the class and the extent of variability in students' knowledge. Values of KR 20 for small classes (say, fewer than 15 students) should be interpreted with caution, since the observed value may be considerably different from the value that would be obtained if the test were administered to a larger sample. When a class is made up of students who are similar to each other in their level of ability, i.e., the range of test scores is small, the observed value of KR 20 will generally be lower than the value that would be obtained if the test were administered to a more diverse sample of students.

References

http://www.omet.pitt.edu/docs/OMET%20Test%20and%20Item%20Analysis.pdf
http://ericae.net/ft/tamu/Espy.htm

Assignments

  1. Construct a test and statistically determine its validity, reliability, and objectivity. Write and present a detailed report.
  2. Choose any multiple-choice question and perform a distracter analysis.