Multiple-choice test questions have substantial support as a reliable and valid evaluation method. In practice, however, instructors may prepare or select multiple-choice questions based on content and without adequate attention to item quality. This paper has two specific objectives: (1) to familiarize accounting educators with item-writing “rules” (commonly accepted guidelines intended to achieve validity); and (2) to provide evidence on the relative rule compliance of test bank questions and CPA exam questions (“quality”). We selected random samples of 100 multiple-choice questions from each source and employed judges to evaluate them for rule compliance. Because not all rules apply to all questions, we categorize the questions into eight types based on their format (question versus completion), numerical calculation requirements, and the presence or absence of words in the responses. The test bank questions showed consistently lower quality than CPA exam questions across the six main question types. Violations were about 4.6 times more frequent for test bank questions (6.42% versus 1.41%). The analysis indicates that some types of questions are especially prone to rule violation.

Multiple-choice questions (hereafter MCQs) are widely used in both professional certification examinations and classrooms. They are especially convenient today for accounting educators, since most textbook packages include a test bank. The advantages of using MCQs are ease of grading, standardization, and the opportunity to cover a large amount of material quickly. Collier and Mehrens (1985) strongly defend the value of MCQs and encourage readers to develop and use such items in accounting courses. They “believe that multiple-choice testing is the best available testing methodology to achieve the desired qualities of reliability and validity in the assignment of grades based on tests to measure student learning” (p. 41). As Aiken (1987) notes, MCQs can measure not just simple recognition but higher-order learning skills, by requiring one to examine multiple premises, to perform classifications, to determine relations and correlations, or to consider if-then situations.

The larger class sizes dictated by funding restrictions also ensure the continued prominence of MCQs. In Florida, for example, population growth and budgetary constraints are expected to increase college class sizes dramatically during the coming decade (Date, 1995). An MCQ exam can quickly test whether students have achieved some level of technical knowledge before they apply that knowledge through case studies, cooperative learning projects, and other less traditional classroom methods in developing the critical thinking and communication skills stressed by the Accounting Education Change Commission.

Research examining the validity and reliability of test banks is limited. Clute, McGrail, and Robinson (1988), assuming test bank questions were reasonably well written, found a significant level of bias in the distribution of correct answers for MCQs in principles of accounting textbooks. Clute and McGrail (1989) reported similar findings for cost accounting textbooks, and Geiger and Higgins (1998) discovered such biases in professional certification exams. The concern is that “test-wise” test-takers (Gibb, 1964) could unfairly use this bias to improve their scores and decrease the validity of a test.
Gibb (1964) examined students’ awareness of seven secondary clues, such as grammatical errors and longer correct alternatives, to determine the correct answer to an MCQ. Gibb developed an Experimental Test of Testwiseness instrument, which subsequent researchers have used to determine that certain secondary clues may be more important than others (Miller, Fagley & Lane, 1988; Miller, Fuqua & Fagley, 1990). Writers of MCQs need to be aware of these secondary clues to construct valid, reliable test questions. A number of education researchers have developed rules for MCQ construction, and although broad support exists for many of those rules, we find no studies that have investigated whether examination items used in accounting follow the rules. Further, the Framework for the Development of Accounting Education Research (Williams et al., 1988) notes that “[a]n unfortunate truth is that many classroom tests are poorly prepared” (p. 105) and that “[t]he subjects of testing and grading have been barely scratched in accounting literature” (p. 108).

This study samples MCQs from Certified Public Accountant (CPA) examinations and from test banks accompanying accounting textbooks to assess their adherence to the MCQ-writing rules. It has two objectives: (1) to familiarize accounting educators with item-writing “rules,” or commonly accepted guidelines intended to achieve validity; and (2) to provide evidence on the relative rule compliance of test bank questions and CPA exam questions (“quality”). Despite the nondisclosure of future CPA exam questions, the existing body of questions has ongoing utility, and the questions provide an interesting benchmark because of the effort that went into their construction.

The remainder of this paper is divided into four sections. The first discusses the relevant research and summarizes the various item construction rules. The second describes the methods and data sources used in this study. The third details the results of our analysis, and the last discusses the limitations, conclusions and implications of the study.

The Development of MCQ-Writing Rules

Haladyna and Downing (1989) reviewed 96 theoretical and analytical studies and developed a taxonomy of 43 rules designed to improve MCQ writing. They found that 33 of the rules enjoyed strong consensus among the sources. Aiken (1987) provides 14 guidelines for writing MCQs and also discusses MCQ arrangement, test taking, scoring and sociological effects of multiple-choice testing. In the accounting literature, Baldwin (1984) details eight guidelines for writing MCQs. Since the current study concerns individual MCQ construction, not overall test validity, we consider only the guidelines which deal with the individual MCQ, its stem (the material preceding the responses) and the responses.

Based on the guidelines cited in the literature, we developed a set of 15 rules dealing with the overall test question, the question stem and the responses. The final set of rules appears in Table 1. We selected these items based on their widespread acceptance, relevance to the accounting setting, and our ability to judge whether MCQs adhere to them. Because we chose to sample from a wide variety of topic areas, we did not attempt to assess the technical accuracy of the answers.

1. Formal validation typically is done for standardized tests, but not for classroom tests. The rules that we are considering are at an entirely different level, and have the purpose of improving validity. For a recent, thorough summary of the research on developing and validating MCQs, see Haladyna (1994).
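In this study the rules were applied by judges, but a few of the rules in Table 1 are mechanical enough to suggest what compliance looks like in concrete terms: for instance, that responses be of approximately the same length (rule 6), that “all of the above” and “none of the above” be avoided (rules 8 and 9), and that responses not repeat language better placed in the stem (rule 5). The sketch below is only an illustration under our own assumptions; the MCQ structure, the function names, and the 50 percent length tolerance are ours and are not part of the study’s procedures.

    from dataclasses import dataclass

    @dataclass
    class MCQ:
        stem: str
        responses: list[str]

    def similar_length(item: MCQ, tolerance: float = 0.5) -> bool:
        """Rule 6: responses are all of approximately the same length.
        Fails if the longest response exceeds the shortest by more than the
        tolerance (the 50% cutoff is an arbitrary illustrative choice)."""
        lengths = [len(r) for r in item.responses]
        return max(lengths) <= (1 + tolerance) * min(lengths)

    def no_all_or_none_of_the_above(item: MCQ) -> bool:
        """Rules 8 and 9: avoid 'all of the above' and 'none of the above'."""
        banned = ("all of the above", "none of the above")
        return not any(b in r.lower() for r in item.responses for b in banned)

    def no_repeated_lead_in(item: MCQ) -> bool:
        """Rule 5 (partial): responses do not unnecessarily repeat language;
        here we only flag a first word shared by every response."""
        first_words = {r.split()[0].lower() for r in item.responses if r.split()}
        return len(first_words) > 1

An item whose responses all begin with the same phrase, for example, would fail the last check and would be a candidate for moving that phrase into the stem.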
In addition to the general criteria calling for clarity of communication, researchers have tried to develop measures of readability. Readability formulas, which assign a measure of difficulty or required grade level to text, are now readily available in the major word processing packages. Although readability formulas appear frequently in accounting research (Jones & Shoemaker, 1994), they are quite limited, being based on statistics that include sentence length, word length, and other syntactical variables. The Flesch Reading Ease score decreases with sentence length and syllables per word, and ranges from 100 (very easy) to 0; a score of thirty or below is “very difficult.” Gunning’s Fog Index is intended to indicate the school grade needed for comprehension, with college graduates at grade 17, and the Flesch-Kincaid Grade Level is a similar measure. Flory, Phillips & Tassin (1992) argued for the usefulness of such formulas in evaluating textbook readability, prompting a sharp response from Stevens, Stevens, and Stevens (1993). The latter favor the use of the “cloze” procedure, which utilizes a sample of readers from the intended audience. For the items in our study, however, the intended audience consists of students who have studied specific technical materials, and the broad inclusiveness of our sample makes this procedure impractical. Thus, although we agree that the readability formulas are severely limited in their ability to predict how well students will understand a particular writer’s style, we believe that the statistics may shed some light when applied to the samples drawn from our two sources.

Table 1: List of Rules

The stem
 1. deals with one central problem. (a, b)
 2. has the to-be-completed phrase placed at the end. (a)
 3. uses simple numbers to test a concept. (b)

The responses
 4. are grammatically consistent with the stem. (a, c, e, f)
 5. do not unnecessarily repeat language from the stem. (a, f, g, h)
 6. are all of approximately the same length. (a, c, e, f, h)
 7. consist only of numbers that are plausible based on the data in the problem. (a, b, c, e, f, h)
 8. avoid the use of “all of the above.” (a, f, g)
 9. avoid the use of “none of the above.” (a, f, g)
 10. contain one allowable answer rather than the use of “A and B,” “A, B, and C,” etc. (f, g)

The question overall
 11. avoids excessive verbiage. (a, e, f, g, h)
 12. is grammatically correct throughout. (a, e, f, g, h)
 13. is clearly written. (a, e, f, g, h)
 14. contains no negative or other counter-intuitive wording without underlining or other special emphasis. (c, d, e, g, h)
 15. contains no other offensive or undesirable characteristics not listed above. (i)

Main sources, in addition to the review by Haladyna & Downing (1989):
 a. Aiken (1987)
 b. Baldwin (1984)
 c. Denova (1979)
 d. Green (1963)
 e. Gronlund (1968)
 f. Lutterodt & Grafinger (1985)
 g. Miller & Erickson (1985)
 h. Wood (1960)
 i. Added to encompass sexist language, offensive or distracting scenarios, etc.
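As a concrete illustration of the readability statistics described above (before Table 1), the following sketch computes the three measures from simple text counts using their published formulas. The vowel-group syllable counter is a rough heuristic of our own, so the resulting scores will differ somewhat from those reported by word processing packages.

    import re

    def count_syllables(word: str) -> int:
        # Rough heuristic: one syllable per group of consecutive vowels.
        # Readability tools use more careful, dictionary-assisted counts.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def readability(text: str) -> dict:
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        complex_words = sum(1 for w in words if count_syllables(w) >= 3)

        wps = len(words) / len(sentences)   # average words per sentence
        spw = syllables / len(words)        # average syllables per word

        return {
            # Flesch Reading Ease: 100 = very easy; 30 and below = very difficult
            "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
            # Flesch-Kincaid Grade Level: approximate school grade required
            "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
            # Gunning Fog Index: grade level; "complex" = three or more syllables
            "gunning_fog": 0.4 * (wps + 100.0 * complex_words / len(words)),
        }

Applied to the stem and responses of a sampled MCQ, such scores permit only a coarse, formula-based comparison between sources, consistent with the limitations acknowledged above.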