The design and content of the Texas Biology End-of-Course Examination (EOC) (available from the Texas Education Agency (TEA) at http://www.tea.state.tx.us/studentassessment /release.htm) were evaluated using the National Science Education Standards (NSES) Assessment Standard B, the NSES Life Science Content Standards for Grades 9 through 12 (National Research Council, NRC, 1996), and the recommended guidelines for standardized tests from the NRC's Board on Testing and Assessment (BOTA) (NRC, 1999). Results indicated that the Texas Biology EOC did not comply with standards from NSES in areas concerning coverage of biology concepts, depth of knowledge required to answer questions, format of questions, and lack of assessment in three NSES science content categories. Furthermore, the Texas Biology EOC did not comply with standards from BOTA in areas concerning validity, reliability and fairness.
The conversation between Lee and Laura concerning a standardized state-mandated science test is typical of many conversations in faculty lounges at the end of a day of testing. Unfortunately, these anecdotal opinions of a test rarely become part of an objective, formal, and useful test evaluation. Those who have been teachers in public school classrooms know why: in the hurried rush of a typical school day, science teachers do not have the time to make a formal evaluation of a standardized science test. Their job, after all, is to teach young people.
Intent
The intent of the paper is two-fold: 1) to provide a brief historical overview of standardized testing in science including its purpose, consequences, characteristics, and evaluation; 2) to provide a critique that is unique in that it uses national standards to evaluate a state standardized science test.
Purpose of Standardized Testing
The emergence of national standards in science education began in 1989, when the American Association for the Advancement of Science (AAAS, 1989) published Science for All Americans, a set of recommendations on what a scientifically literate person should know in science, math and technology. Others followed, such as the National Science Teachers' Association's (NSTA) The Content Core (1992), which specified which science topics should be taught in each discipline at a range of grade levels. One year later, in 1993, the AAAS published Benchmarks for Science Literacy (BSL), which also specified science topics at different ranges of grades. The development of national standards culminated in 1996 with the publication of the National Science Education Standards (NSES) by the National Research Council (NRC) (NRC, 1996). The NSES specify not only the science content that students should know, but also how to assess that knowledge. The NSES are contained in a 250 page document that provides standards in six areas of science education, five of which are: science content, assessment, professional development, science education programs and science education systems (NRC, 1996). The NSES were designed to improve science education. Thus, the use of the NSES assessment standards can enable educators to evaluate standardized science tests.
Besides the basic and vital purpose of assessing scientific literacy in the United States, standardized testing has been used for other purposes. The historical purposes of standardized testing in education have generally fallen into three categories: 1) to serve as a diagnostic tool to support learning, 2) to report the achievements of individuals and 3) to satisfy the demands of public accountability (Black, 1998). From the early part of the 20th century until today, the purpose of standardized tests has shifted from diagnosis to accountability. In the first half of the 20th century, standardized tests in the United States were composed of questions that were answered orally or by written essays for diagnostic purposes (Goslin, 1963). For example, the first standardized achievement tests in education were designed by Edward L. Thorndike in 1904 to assess reading, handwriting and spelling at Teachers College, Columbia University. They were considered standardized only because the administration and scoring of the tests were uniform. As technology improved and machine scoring made tests easier to process, standardized tests in education became ever more widely used and evolved from an oral mode to written essay to multiple-choice format. Today, standardized tests heavily emphasize multiple-choice questioning and, as a result, rote learning is more often tested (The National Center for Fair and Open Testing (NCFT), 1997). Thus, in the United States, most large-scale testing programs are weak as diagnostic tools because they generally do not provide an opportunity for sustained and engaged thinking (NCFT, 1997).
Today, most standardized testing in science is for the purposes of reporting individual achievement and public accountability. Many states, with the number increasing each year, require students to pass science tests in order to graduate from high school. In 1999, 39 states had developed their own state-wide science standards, mostly based on recommendations in the national standards documents, and 48 states had statewide science tests to measure student achievement (Jerald & Boser, 1999). For example, the state of New York, one of the first states to begin standardized testing in science in the early 1900s, administers the Regents Examinations. The Regents Examinations are comprehensive tests in 13 different subjects for grades 9 through 12 (Madaus, 1994).
Consequences of Standardized Testing
Since standardized tests today are largely for accountability purposes, teachers and administrators have become focused on having students pass standardized tests. This may contribute to significant gains in the scores on certain state and national tests. Nationally, students in grades 4, 8, and 12 are achieving higher scores in mathematics and slightly higher scores in reading (National Educators Goals Panel, 1999; NAEP, 1998). And, in specific student populations, there has been significant improvement in basic skills in certain states. For example, African-American students in Texas have shown significant improvement in writing skills and had the highest African-American scores in the United States on the 1998 NAEP writing test (NAEP, 1998). Persistent efforts by Texas teachers to teach writing skills, prompted by an exit-level writing requirement imposed in 1985 under Texas' state-wide accountability system, may have contributed to the increased writing scores.
However, there are well-known negative consequences of standardized testing on education, such as the "narrowing of the curriculum" due to "teaching to the test". When there is public pressure to improve test results, schools and teachers are more likely to emphasize, in their instruction, the material covered by the test (Shepard, 1991; Madaus, 1991; Herman & Golan, 1992). Standardized tests heavily emphasize multiple-choice questioning (The National Center for Fair & Open Testing (NCFT), 1997). As a result, rote memorizing, "cramming" of concepts and test-taking strategies have become part of the daily instruction (Madaus, 1991). This type of instruction or "teaching to the test" causes students to gain the "most elementary knowledge and skills and less of the deep understanding of even a few topics" (Stake, 1991, p. 246). This is demonstrated in the inability of test scores to generalize or transfer to other indicators of achievement. For example, when a new testing program is brought into a state, scores tend to plummet in the initial years of testing since students have not been prepared for that exact test (Bracey, 2000). Consequently, test scores that reflect higher-order thinking have been steadily declining (Darling-Hammond, 1991; 1994). In fact, the rote learning that was involved in "teaching to the test" in the 1970s has been cited as one of the reasons that U.S. students have ranked low in international achievement tests (McKnight et al., 1987).
Characteristics of Standardized Science Tests
In 26 states, K-12 standardized state-wide science achievement tests are used to evaluate the effectiveness of public school science programs (Jerald & Boser, 1999). Each year more states are adding science achievement tests to their testing program. An overview of science tests in the various states indicates that most states have developed their own tests in accordance with their state standards (Edwards, 1999). For example, New York, Texas and California, the most populous states, have developed their own unique tests. New York administers the 4th grade Elementary Science Program Test (ESPET) and the 12th grade Regents Examination (Madaus, 1994). In Texas, students must take the 8th grade Science Texas Assessment of Academic Skills (TAAS) and the Biology End-of-Course Examination (EOC) (Texas Education Agency, TEA, 1997). In California, the newly developed Golden State Examinations are administered to biology, chemistry, coordinated science and physics students. The assessment design can be viewed at the California Department of Education Web site (http://www.cde.ca.gov/statetests/).
Statewide K-12 state-developed science assessment tests share many similarities. For example, they predominantly contain multiple-choice questions. In 1995, twenty-six states used all multiple-choice and eighteen states used a majority of multiple-choice items on their state-wide assessments (NCFT, 1997). Specifically, most state-developed, standardized science achievement tests share these characteristics: they 1) are multiple-choice, 2) are short (fewer than 60 items), 3) contain few questions about science concepts, 4) are administered for one hour, 5) require only paper and pencil, 6) are made for individual student work and 7) are administered twice a year. Only fourteen states use performance assessment in their statewide science examinations, and only in a few grade levels. For example, California has a laboratory component in the Golden State Examinations in science. New York also has a laboratory component in its ESPET. By contrast, Texas does not include performance assessment in any of its standardized science examinations (TEA, 1994, 1996). Also, most tests tend to be criterion-referenced in that students are expected to answer correctly a certain percentage of items that are based on state standards (NCFT, 1997; Gong, 1990). Lastly, portfolio assessment is not used in any state-wide science assessment system (Jerald & Boser, 1999).
Evaluation of State-wide Standardized Tests
It is important to evaluate statewide standardized science tests because these tests define what will be taught to students. In a sense, the test content becomes the curriculum (Madaus, 1988). Most states do not attempt to ascertain whether the state-wide examinations measure "the ability of students to think critically or in complex ways in the various subject areas" (NCFT, 1997). A literature search (1986 to present) using the Education Resources Information Center (ERIC) and key words 1) "science tests and evaluation" and 2) "science examinations and evaluation" revealed 194 articles concerning testing in science. These keywords identified articles under the ERIC subject headings, "Evaluation Methods, Science Education, Student Evaluation, Science Tests and Sciences." The search revealed only two critiques of state-wide science examinations: the Connecticut Academic Performance Testing Program (CAPT) and the Common Core of Learning (CCL) Science Assessment Project (Lomask, 1995). Both of these critiques examined the development and the use of these performance tests rather than the tests themselves.
The only substantial critique of state science assessments was conducted by Webb (1999) in conjunction with the Council of Chief State School Officers (CCSSO) and the National Institute for Science Education (NISE). His study examined five state science assessments from two states and their alignment to state science standards based on four criteria: categorical concurrence, depth-of-knowledge consistency, range-of-knowledge correspondence, and balance of representation. Alignment was weakest in two of the criteria. The depth-of-knowledge analysis indicated that test items generally required a lower level of knowledge than those in the state standards. The range-of-knowledge analysis indicated that the test items covered a narrower range of knowledge than those expressed in the standards (Webb, 1999). Generally, the tests were balanced in that the assessment items were nearly evenly distributed among the test objectives (Webb, 1999).
Thus, it appears that, with the exception of the Webb study, states have either not evaluated their statewide science examinations or they have not made the evaluations available for a national audience. Furthermore, many state-wide standardized tests are not released to the public. For example, the North Carolina Biology End-of-Course Examination, which has been administered annually since 1987, is not available for public scrutiny. The assessment program for North Carolina can be viewed at the North Carolina Dept of Public Instruction Web site (http://www.dpi.state.nc.us/accountability/testing.abcs_testing_program.html ). On the other hand, all of the Texas End-of-Course Examinations in Biology have been released to the public since 1997. The examinations are available from the Texas Education Agency (TEA) Web site (http://www.tea.state.tx.us/student assessment/release.htm).
Without evaluation of these tests, it is difficult to determine whether rising test scores signify any real gains in learning. Rising test scores often reflect "teaching-to-a-test" that assesses recall of low level knowledge but not gains in conceptual learning (Morgenstern & Renner, 1984). Although it may be the intent of most science teachers to teach for in-depth conceptual understanding of science in their students, their efforts may be undermined by pressure to teach to a traditional science achievement test. The curriculum then becomes defined by the test even though it may only assess memorized facts (Madaus, 1988). Thus, there is a need for educators to evaluate these examinations as part of the process of science education reform, and to inform the science education community, and the public, as to their adequacy or inadequacy in complying with national guidelines, such as the NSES, for reform of assessment in science education.
An external monitoring agency that is independent of test developers and education agencies should be supported by the states in order to evaluate the adequacy of science standardized examinations for compliance with guidelines recommended for reform of assessment in science education (NRC, 1999). If an external monitoring agency is not being used in a state to evaluate the tests, or if additional evaluation is needed, those who are familiar with the guidelines for science education assessment reform, such as university science teacher educators, should conduct their own evaluation of standardized tests and disseminate their findings to the science education community. By means of such evaluations, a dialogue may begin between science educators and those who regulate testing, in order to improve standardized science examinations.
Methodology
Selection of a Test
We evaluated the effectiveness of a secondary school science assessment, the Texas Biology End-of-Course Examination (EOC), as an instrument for assessing scientific literacy. The purpose of the Texas Biology EOC is to provide the state with additional information about the effectiveness of the science program (TEA, 1992; TEA, 1993). The Texas Biology EOC was chosen because 1) it is available for evaluation since all test forms are posted on the TEA Web site (http://www.tea.state.tx.us/student assessment/release.htm), and 2) a critique of the Texas Biology EOC has not been published for a national audience. Our evaluation of the Spring 1997 Biology EOC was the first and is still the only formal and documented evaluation of the test (Westerlund and West, 1999). The Texas Biology EOC is a traditional, multiple-choice, machine-scorable test of 42 questions. Results for the tests from 1994 to 1998 indicated average state-wide passing rates (70% of test items correct) of 80%, although some population groups had passing rates as low as 53% (TEA, 1999).
Qualifications of Test Evaluators
As university science teacher educators, the authors are qualified to critique science standardized examinations in regard to science education reform and science content. Both are knowledgeable in the NSES. Familiarity with the NSES has been established through frequent use of it in designing science curricula and assessments, teaching science courses, and through participation in NSES discussions in national science education conferences (National Association of Research in Science Teaching, National Science Teachers' Association and Association for the Education of Teachers in Science). The authors are further qualified as evaluators because they are proficient in science content. Proficiency has been documented with undergraduate degrees in biology and composite science and graduate degrees in genetics, marine biology and science education from the University of Texas at Austin, Texas A&M University, and the University of Minnesota at Minneapolis/St. Paul. Proficiency in science content and teaching has also been established through teaching biology and other sciences at the high school level (8 years and 16 years experience) in public school districts in Houston, Austin and San Antonio, Texas, and at the college level (5 and 12 years experience in the biology department) at the University of Texas at Austin and Southwest Texas State University. Both have served as internal and external evaluators in numerous projects. Furthermore, Dr. Sandra West has received National Science Foundation (NSF) training in the evaluation of science assessments at The Evaluation Center at Western Michigan University.
Test Evaluation Guidelines
In our evaluation of the Texas Biology EOC, we used the National Science Education Standards (NSES) Assessment Standards, the NSES Life Science Content Standards for grades 9 through 12 and the NRC's Board on Testing and Assessment (BOTA) (NRC, 1999) guidelines for tests. Our approach was unique in that we chose to evaluate the Texas Biology EOC by its alignment with national standards, as opposed to Texas state science standards, so that other state tests could also be evaluated using this alignment model. The few evaluations of state science tests, such as Webb's study (1999), generally examined their alignment to state standards. Thus, this evaluation is unique, when compared with the literature that we encountered, in that it employed the NSES Assessment Standards. Our evaluation of the Texas Biology EOC can be used as a tool in the promotion of scientific literacy in biology. Since the purpose of the Texas Biology EOC was to provide the state with additional information about the effectiveness of the science program (TEA, 1992), it was appropriate to select the NSES Assessment Standards. As stated in the NSES, "The assessment standards provide criteria to judge progress toward the science education vision of scientific literacy for all" (NRC, 1996, p. 75).
The NSES Assessment Standard B requires measurement both of achievement and of the opportunity-to-learn science. The achievement component of the standard was used to evaluate the Biology EOC. The achievement component stipulates that: "Achievement data collected focus on the science content that is most important for students to learn" (NRC, 1996, p. 79). According to the NSES, "science content" refers to abilities in science. The achievement component of NSES Assessment Standard B was used as an evaluation tool because it concerned the science content that should be contained in science examinations. The opportunity-to-learn component of the standard is typically evaluated through other measures such as the alignment of teacher practices with assessments (Porter & Smithson, 2000).
The "most important science content for students to learn," as outlined by NSES Assessment Standard B, includes: 1) The ability to inquire, 2) Knowing and understanding scientific facts, concepts, principles, laws, and theories, 3) The ability to reason scientifically, 4) The ability to use science to make personal decisions and to take positions on societal issues, and 5) The ability to communicate effectively about science (NRC, 1996, pp. 79-82). The content of the 42 questions of the Spring 1997 Biology EOC (TEA Web Site, http://www.tea.state.tx.us/student assessment/release.htm) was examined using these five NSES categories of science content.
Questions on the Texas Biology EOC that probed for scientific knowledge and understanding were further evaluated using the NSES Life Science Content Standards for grades 9-12. These standards specify that, "As a result of their activities in grades 9-12, all students should develop an understanding of the cell, molecular basis of heredity, biological evolution, interdependence of organisms, matter, energy, and organization in living organisms and behavior of organisms" (NRC, 1996, p. 181).
State-wide science assessments must meet professional standards of validity, reliability and fairness in order to be deemed an appropriate test according to BOTA (NRC, 1999). The NRC's BOTA guidelines in these areas were also used in the evaluation of the Texas Biology EOC. The test was examined for its ability to measure what it is supposed to measure, its internal consistency measures, and whether the test scores from the examination indicate equity between all individuals and groups (NRC, 1999).
Test Analysis
We compared the overall design of the Texas Biology EOC (TEA Web Site, http://www.tea.state.tx.us/student assessment/release.htm) and the individual questions of a representative example, the Spring 1997 Biology EOC, with the NSES Assessment Standard B, the NSES Life Science Content Standards for grades 9-12 (NRC, 1996) and BOTA guidelines for appropriate tests (NRC, 1999). All of the Biology EOCs are similar in the number, type and content of questions.
All 42 questions in the Spring 1997 Biology EOC were classified and coded. Critical to the validity of the coding process was the authors' expertise in the subject area. We used standard qualitative analysis to examine individual test items rather than the traditional psychometric item analysis. During the initial phase of the data analysis, we independently sorted the 42 questions into the five NSES categories of science content that were mentioned in the previous paragraph. New categories were created for those questions that could not be grouped into the NSES categories.
We developed the new categories independently by reading each question and defining categories that clustered the questions into groups that concerned similar ideas or that contained similar key phrases or graphics (Patton, 1990). After the initial classification phase of the data analysis was completed, we examined the data together, compared our categories, and redefined the categories. Disagreements about categories or placement of questions into categories were discussed. The NSES and BSL were jointly reviewed and the authors agreed on a common definition for categorization. From our discussions, we established a rationale for categories and placement of questions into particular categories. The rationale for categories and the placement of the Spring 1997 Biology EOC questions are described in Table 1. Inter-rater reliability was not calculated since the categories and placement of questions came about through discussion between the evaluators. A calculation of inter-rater reliabilities is appropriate for test evaluation studies in which the categories are defined for the evaluators prior to the analysis such as in Webb's (1999) study. The reliability or dependability, as Marshall and Rossman (1995) define reliability in qualitative analyses, of this analysis is based on evaluators not being constrained by previously determined categories but empowered to select categories through consensus based upon their expertise.
Table 1
Rationale for Categorization of Questions
| Category Name | Rationale for Placement of Question in Category | Questions Classified in this Category |
| Knowing and understanding scientific facts, concepts, laws, and theories | Required knowledge of biology to answer question | 10, 13, 14, 15, 18, 19, 20, 22, 23, 25, 33, 35, 42 (Total = 13) |
| The ability to reason scientifically | Required knowledge of experimental research design or graphing to answer question | 3, 11, 16, 24, 32, 34, 36, 38, 40 (Total = 9) |
| The ability to interpret a chart or diagram | Required the ability to interpret a chart or diagram to answer question | 1, 2, 5, 6, 7, 9, 17, 26, 29, 30, 31, 37, 39, 41 (Total = 14) |
| Manipulative laboratory skills | Required knowledge of the safe use of laboratory equipment to answer question | 12, 21, 27, 28 (Total = 4) |
| The ability to answer common knowledge questions | Required the ability to answer common knowledge or "common sense" questions | 4, 8 (Total = 2) |
All of the Spring 1997 Biology EOC questions were organized into five categories. Two of the five categories were labeled by the NSES categories of science content that were mentioned previously since the questions fit those areas. Categorizing questions is a subjective process. Those with other perspectives may categorize the test questions differently. Our evaluation is based upon our content expertise in biology and our knowledge of the NSES. An individual trained in psychometrics would most likely evaluate the test differently. By providing the TEA Web site (http://www.tea.state.tx.us/student assessment/release.htm) which contains all of the questions on the Spring 1997 Biology EOC and Table 1 which describes the rationale for placement of specific questions into categories, the reader and others in the science education community may examine our analysis.
The content of the 42 questions on the Spring 1997 Biology EOC was classified into five categories. Two of these categories corresponded to two of the five NSES Assessment Standard B science content categories that are considered by NSES "the most important for students to learn." These were noted in the italicized quotations in the preceding section. The other three categories of the Biology EOC science content, as shown by the authors, did not conform to criteria stated by NSES as science content that is most important for students to learn. Of the 42 Biology EOC examination questions, 31% were classified in the NSES category Knowing and understanding scientific facts, concepts, laws, and theories; 21% were classified in the NSES category The ability to reason scientifically; 33% were classified as The ability to interpret a chart or diagram; 10% were classified as Manipulative laboratory skills; and 5% were classified as The ability to answer common knowledge questions. Students were tested for isolated bits of knowledge or for definitions in all but one of the questions that were classified in the NSES category Knowing and understanding scientific facts, concepts, laws, and theories. The only exception was question #20. The other questions in this NSES category were not structured in a manner that would test for understanding of a biological concept.
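The percentages reported above can be reproduced from the question lists in Table 1. The following is a minimal sketch of that tally; the names `TABLE_1` and `category_shares` are illustrative and not part of the original analysis, and the percentages are rounded to whole numbers as in the text.

```python
# Question numbers per category, transcribed from Table 1.
TABLE_1 = {
    "Knowing and understanding scientific facts, concepts, laws, and theories":
        [10, 13, 14, 15, 18, 19, 20, 22, 23, 25, 33, 35, 42],
    "The ability to reason scientifically":
        [3, 11, 16, 24, 32, 34, 36, 38, 40],
    "The ability to interpret a chart or diagram":
        [1, 2, 5, 6, 7, 9, 17, 26, 29, 30, 31, 37, 39, 41],
    "Manipulative laboratory skills": [12, 21, 27, 28],
    "The ability to answer common knowledge questions": [4, 8],
}


def category_shares(table):
    """Return each category's share of all questions, as a rounded percent."""
    total = sum(len(questions) for questions in table.values())
    return {
        name: round(100 * len(questions) / total)
        for name, questions in table.items()
    }


# Every question from 1 to 42 should appear in exactly one category.
all_questions = sorted(q for qs in TABLE_1.values() for q in qs)
assert all_questions == list(range(1, 43))

shares = category_shares(TABLE_1)
for name, percent in shares.items():
    print(f"{percent:>3}%  {name}")
```

The check on `all_questions` confirms that the five categories partition the 42 items, so the rounded shares (31, 21, 33, 10 and 5) account for the whole test.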
The questions that were classified in the NSES category Knowing and understanding scientific facts (31% of the 42 questions) were further analyzed using the NSES Life Science Content Standards for grades 9-12 (see Table 2).
Table 2
Rationale for Categorization using NSES Life Science Content Standards
| Category Name | Rationale for Placement of Question in Category | Questions Classified in this Category |
| Cell | Required knowledge of cell structures and functions | 14, 20, 33 |
| Molecular Basis of Heredity | Required knowledge of structure of DNA, cellular division, human genetics, mutations | 15, 25 |
| Biological Evolution | Required knowledge of evolutionary processes, natural selection, classification of organisms | 18, 19, 23 |
| Interdependence of organisms | Required knowledge of energy flow through ecosystem, interdependence of organisms, carrying capacity, global stability | 22 |
| Matter, energy and organization | Required knowledge of energy input into living systems, photosynthesis, respiration, energy flow through levels of living systems | None |
| Behavior of Organisms | Required knowledge of nervous systems, behavioral response to internal changes and external stimuli | None |
Only 9 questions (14, 15, 18, 19, 20, 22, 23, 25 and 33) out of the 13 questions in this category were aligned with the NSES Life Science Content Standards. Questions 14, 20, and 33 were classified as "Cell" type questions. Questions 15 and 25 were classified as "Molecular basis of heredity" type questions. Questions 18, 19 and 23 were classified as "Biological evolution" type questions. Question 22 was classified as an "Interdependence of organisms" type question. Concepts from two of the NSES Life Science Content Standards, "Matter, energy, and organization in living systems" and "Behavior of organisms", were not found in any of these questions. The questions from the other four categories in Table 1 were not analyzed using the NSES Life Science Content Standards because they did not probe for an understanding or knowledge of life science or biology. Overall, 9 questions out of 42, or 21% of the questions on the Texas Biology EOC, aligned with the NSES Life Science Content Standards. This reflects differences in content emphases between state and national content standards.
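The alignment counts can be checked against Tables 1 and 2 with a short tally. This is only a sketch: the variable names are illustrative, and the list of unaligned questions is derived by subtraction from the two tables rather than stated in the text.

```python
# The 13 questions in the "Knowing and understanding" category (Table 1).
KNOWING = [10, 13, 14, 15, 18, 19, 20, 22, 23, 25, 33, 35, 42]

# Questions placed under each NSES Life Science Content Standard (Table 2).
TABLE_2 = {
    "Cell": [14, 20, 33],
    "Molecular basis of heredity": [15, 25],
    "Biological evolution": [18, 19, 23],
    "Interdependence of organisms": [22],
    "Matter, energy, and organization in living systems": [],
    "Behavior of organisms": [],
}

TOTAL_QUESTIONS = 42

aligned = sorted(q for qs in TABLE_2.values() for q in qs)
# Knowledge questions that matched none of the six standards.
unaligned = sorted(set(KNOWING) - set(aligned))
# Standards with no question at all on this test form.
untested = [name for name, qs in TABLE_2.items() if not qs]
aligned_share = round(100 * len(aligned) / TOTAL_QUESTIONS)

print(f"Aligned: {len(aligned)} of {len(KNOWING)} knowledge questions")
print(f"Share of whole test: {aligned_share}%")
print(f"Unaligned knowledge questions: {unaligned}")
print(f"Standards with no questions: {untested}")
```

By subtraction, the four knowledge questions that did not map to any standard are 10, 13, 35 and 42.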
Validity
The validity of a test, or inferences that can be made from a test, is determined by whether it measures what it is intended to measure. The end-of-course examinations are intended to possess content validity and construct validity because (a) they are content-based, and (b) the construct that is tested is the mastery of the state-mandated curriculum at the time, the "Texas Essential Elements" (TEA, 1997).
The 9 objectives of the Biology EOC, which are based on the "Texas Essential Elements", are described in a TEA report called the "Biology I Objectives and Measurement Specifications" (TEA, 1994). The Biology EOC objectives, categorized into two domains, specify the content that appears on the test. The first domain, "Understanding Concepts," has 3 objectives. For each of the 3 objectives, there are 6 Biology EOC questions, totaling 18 questions. The second domain, "Integrating Concepts with Process Skills," has 6 objectives. For each of the 6 objectives, there are 4 Biology EOC questions, totaling 24 questions. Altogether, there are 42 questions that are related to 9 objectives. Each objective has 3 to 7 broad sub-objectives (TEA, 1994).
The Biology EOC may not be valid because it is comprised of so few questions (only 42 items). This may affect the ability of the test to assess adequately the objectives it is intended to assess. There were only 4 to 6 test items per objective. Furthermore, validity may be further compromised because some sub-objectives were not evaluated. Valid inferences may not be made about the students' knowledge of biology if the test does not provide representative coverage of the content (NRC, 1999).
Reliability
The reliability of a test is an estimate of the consistency of test scores on an examination by the same individual at different times. The reliability estimates for the biology end-of-course examinations are based on internal consistency measures. The test developers used the Kuder-Richardson formula, Number 20 (KR20), and the items were "scored dichotomously" (TEA, 1997). The KR20 is a measure of the internal consistency of a test (Hills, 1981). The Spring 1997 Biology EOC had a KR20 value of 0.877 (TEA, 1997), which means that if a student were to take it again, there would be an 87.7% probability of achieving the same score (Joe Wilson, personal communication, TEA Student Assessment Office, June 21, 1996). A KR20 of 0.877 appears to indicate high reliability. The KR20 was also used to measure the reliability for each of the 9 objectives. When items for each of the 9 objectives were examined for reliability, the reliability was low, ranging from 0.356 to 0.602. A KR20 test of reliability of each of the objectives may be meaningless since the number of items per objective (4 to 6) was small (TEA, 1997). However, it is important to include the reliabilities of specific objectives in the evaluation of the Biology EOC since the individual student Biology EOC reports that are generated indicate pass or fail on the specific objectives. Only when the reliabilities of the individual objectives are known can accurate inferences about the students' knowledge of biology be made. Since the reliabilities for the individual objectives were low on the Spring 1997 Biology EOC, the test appears not to be a reliable indicator of student knowledge at the objective level.
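For readers unfamiliar with the statistic, the sketch below shows how a KR20 value is computed from dichotomously scored items, and why short subscales tend to yield low values. The response matrix is hypothetical toy data, not TEA data, and the Spearman-Brown projection is our illustrative assumption; the source does not say that TEA or the authors used it.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson Formula 20 for 0/1 item scores.

    responses: one list per examinee, one 0/1 entry per item.
    KR20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / variance of total scores)
    """
    n_items = len(responses[0])
    n_people = len(responses)
    totals = [sum(person) for person in responses]
    total_variance = pvariance(totals)  # population variance of total scores
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(person[i] for person in responses) / n_people  # proportion correct
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_variance)


def spearman_brown(reliability, length_ratio):
    """Projected reliability when test length is multiplied by length_ratio."""
    return length_ratio * reliability / (1 + (length_ratio - 1) * reliability)


# Hypothetical responses: 5 examinees, 4 items (illustration only).
toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(f"KR20 of toy data: {kr20(toy):.3f}")

# Shrinking a 42-item test with KR20 = 0.877 down to 4 or 6 comparable items.
for n in (4, 6):
    print(f"{n} items: projected reliability {spearman_brown(0.877, n / 42):.3f}")
```

Under the assumption that the items in an objective behave like the test as a whole, the projection gives roughly 0.40 for 4 items and 0.50 for 6 items, which falls inside the reported range of 0.356 to 0.602. This suggests the low objective-level values may largely reflect the small number of items per objective.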
Fairness
The professional standard of fairness of an examination is difficult to measure. The standard concerns whether the tests were "applied consistently and equitably between individuals and groups for the proposed purposes" (NRC, 1999). There is a difference of approximately 30 percentage points in passing rates between non-minority students and minority students on the Texas Biology EOC examinations (TEA, 1999). For the Spring 1997 Biology EOC, 91% of white students, 62% of Hispanic students, 60% of African American students, and 60% of all economically disadvantaged students passed the examination. This difference has been consistent since 1994 (TEA, 1999). Fairness does not require the same outcomes across different groups. Consistent differences between groups on a test may signify real differences in knowledge levels, or the test may be "systematically underestimating the knowledge or skill of members of a particular group" (NRC, 1999, p. 72). These group differences may be due to differences in the opportunity to learn. The examinees may not have had equal access to certified teachers, classrooms and homes conducive to learning, practice materials, appropriate testing conditions, and so on. Since there is a consistently large difference between minority and non-minority students in passing rates on the Biology EOCs, the professional standard of fairness needs to be addressed. Further, the reasons for these differences need to be explored and, where possible, addressed.
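The size of the gap can be checked directly from the passing rates quoted above. The following is a small sketch of that arithmetic; the function name is hypothetical, and the choice of white students as the reference group is an assumption made only for illustration.

```python
# Spring 1997 Biology EOC passing rates in percent, as quoted in the text (TEA, 1999).
# Note: "economically disadvantaged" overlaps with the other groups.
pass_rates = {
    "white": 91,
    "Hispanic": 62,
    "African American": 60,
    "economically disadvantaged": 60,
}


def gaps_from_reference(rates, reference_group):
    """Percentage-point gap between the reference group and every other group."""
    reference = rates[reference_group]
    return {
        group: reference - rate
        for group, rate in rates.items()
        if group != reference_group
    }


gaps = gaps_from_reference(pass_rates, "white")
for group, gap in gaps.items():
    print(f"{group}: {gap} percentage points below the reference group")
```

The gaps of 29 to 31 percentage points are consistent with the figure of approximately 30 points cited from TEA (1999).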
Standardized testing in science is conducted in nearly every state in the United States. If the purpose is accountability, which is often the case, teachers will design their curriculum around the test objectives (Shepard, 1991; Madaus, 1991; Herman & Golan, 1992). To promote high-quality science education, high-quality science tests need to be used so that when teachers "teach to the test", they improve their curriculum and thus improve student understanding of important science concepts and principles. The NSES, developed by experts in the field of science education, were designed to improve science education (NRC, 1996). Standardized testing in science should therefore depend on the development and evaluation of science tests in accordance with national science standards, such as the NSES (NRC, 1996).
Our use of the NSES Assessment and Life Science Content Standards, together with guidelines from the NRC's Board on Testing and Assessment (BOTA), as tools to evaluate state standardized testing provides one approach to evaluating state-wide science examinations. We have shown in our evaluation of the Texas Biology EOC that the NSES Assessment Standard B and the BOTA standards concerning validity, reliability and fairness can be used to evaluate state-wide science assessments. The outcomes of an evaluation based on NSES and BOTA standards can be used to guide the test development process, so that future science tests meet national standards in science education. For example, future Texas Biology EOCs could be improved by 1) adding a section of open-ended questions that probe students' understanding of biology and allow students opportunities to communicate their views on societal biological issues, and 2) increasing the number of test items, particularly in biology concepts.
If teachers "teach to tests" that have been evaluated and developed according to national standards, such as the NSES and BOTA, there should be more teaching for deeper conceptual understanding of science and less teaching toward memorization of unrelated facts. This should create learning environments that are more conducive to effective science teaching and learning. Thus, by applying national standards in the development and evaluation of tests, both the teaching of science and the scientific literacy of our students can be improved.
References
American Association for the Advancement of Science (AAAS). (1989). Science for All Americans. Washington, D.C: Author.
American Association for the Advancement of Science (AAAS). (1993). Benchmarks for Science Literacy. New York, NY: Oxford University Press.
Black, P. J. (1998). Testing Friend or Foe: Theory and Practice of Assessment and Testing. Bristol, PA: Falmer Press.
Bourque, M. L., Champagne, A. B., & Chrissman, S. (1997). 1996 Performance Standards; Achievement Results for the Nation and the States. Washington, D.C: National Governing Board.
Bracey, G. A. (2000, October). The 10th Bracey Report On the Condition of Public Education. Phi Delta Kappan. 133-144.
Darling-Hammond, L. (1991, November). The Implications of Testing Policy for Quality and Equality. Phi Delta Kappan, 220-225.
Darling-Hammond, L. (1994). Performance-based Assessment and Educational Equity. Harvard Educational Review (64) 1, 5-30.
Edwards, V. B. (1999). Rewarding Results, Punishing Failure. Quality Counts '99 (Education Week/Pew Charitable Trust Report, p. 85). Bethesda, MD: Virginia B. Edwards.
Gong, B. (1990, April). Current State Science Assessments: Is Something Better Than Nothing? Paper presented at the Annual Meeting of the American Educational Research Association, Boston, MA.
Goslin, D. (1963). The Search for Ability: Standardized Testing in Social Perspectives. New York, NY: Russell Sage Foundation.
Herman, J., & Golan, S. (1992). Effects of Standardized Testing on Teaching and Learning (CSE Technical Report 334). Los Angeles, CA: University of California at Los Angeles, National Center for Research on Evaluation, Standards and Student Testing.
Hills, J.R. (1981). Measurement and Evaluation in the Classroom (2nd ed.). Columbus: Merrill Publishing Company.
Jerald, C., & Boser, U. (1999, January). Taking Stock. Quality Counts '99 (Education Week/Pew Charitable Trust Report, p. 85). Bethesda, MD: Virginia B. Edwards.
Lomask, M. S. (1995). Large-Scale Science Performance Assessment in Connecticut: Challenges and Resolutions. Washington, D.C: National Science Foundation. (ERIC Document Reproduction Service No. ED 386 463)
McKnight, C., Crosswhite, F., Dossey, J., Kifer, E., Swafford, S., Travers, K., & Cooney, T. (1987). The Underachieving Curriculum: Assessing U.S. School Mathematics from an International Perspective. Champaign, IL: Stipes.
Madaus, G. F. (1988). The Influence of Testing on the Curriculum. Critical Issues in Curriculum (87th Yearbook of the National Society for the Study of Education). Chicago: University of Chicago Press.
Madaus, G. F. (1991, November). The Effects of Important Tests on Students: Implications for a National Examination System. Phi Delta Kappan. 226-231.
Madaus, G. F. (1994). A Technological and Historical Consideration of Equity Issues Associated with Proposals to Change the Nation's Testing Policy. Harvard Educational Review 64 (1), 76-95.
Marshall, C., & Rossman, G. B. (1995). Designing Qualitative Research (2nd ed.). Thousand Oaks, CA: Sage Publications.
Medrich, E.A., & Griffith, J. E. (1992). International Mathematics and Science Assessments: What Have We Learned? (Report No. NCES 92-001). Washington, D.C.: U.S. Department of Education.
Morgenstern, C.F., & Renner, J.W. (1984). Measuring Thinking with Standardized Tests. Journal of Research in Science Teaching, 21 (6), 639-648.
National Assessment of Educational Progress (NAEP). (1996). What does the Assessment Measure? Retrieved 2000, from the World Wide Web: http://nces.ed.gov/nationsreportcard/science/sci_assess_what.asp
National Assessment of Educational Progress (NAEP). (1998). The Nation's Report Card. Retrieved 2000, from the World Wide Web: http://nces.ed.gov/nationsreportcard/site/home.asp
National Assessment of Educational Progress (NAEP). (1998). NAEP 1998 National and State Writing Summary Data Tables for Grade 8 Student Data Weighted Percentages and Average Scale Scores. Retrieved 2000, from the World Wide Web: http://nces.ed.gov/nationsreportcard/TABLES/SDTTOOL.HTM
The National Center for Education Statistics (December, 2000). Pursuing Excellence: Comparisons of International Eighth-Grade Mathematics and Science Achievement from a U.S. Perspective, 1995 and 1999. Initial Findings from the Third International Mathematics and Science Study Repeat NCES 2001-028. Retrieved December, 2000, from the World Wide Web: http://www.ed.gov/pubs/edpubs.html
The National Center for Fair & Open Testing (NCFT). (Summer, 1997). Testing Our Children A Report Card on State Assessment Systems. Retrieved 2000, from the World Wide Web: http://fairtest.org/states/intro.htm
National Education Goals Panel. (1999). Retrieved 2000, from the World Wide Web: http://www.negp.gov.
National Research Council (NRC). (1996). National Science Education Standards. Washington, D.C: National Academy Press.
National Research Council (NRC). (1997). Improving Student Learning in Mathematics and Science. Washington, D.C: National Academy Press.
National Research Council (NRC). (1999). High Stakes Testing for Tracking, Promotion, and Graduation. Washington, D.C: National Academy Press.
National Science Foundation. (1996). Chapter 7: Science and Technology: Public Attitudes and Public Understanding, In National Science Foundation, Science and Engineering Indicators 1996, 3-21. Washington, D.C: U.S. Government Printing Office.
National Science Teachers Association. (NSTA). (1993). The Content Core. Washington, D.C: Author.
Patton, M.Q. (1990). Qualitative Evaluation and Research Methods. (2nd ed.) Newbury Park, CA: Sage Publications.
Porter, A.C. & Smithson, J.L. (2000). Alignment of State Testing Programs NAEP and Reports of Teacher Practice in Mathematics and Science in Grades 4 and 8. Paper presented at the Annual Meeting of the American Education Research Association, New Orleans, LA.
Sagan, C.A. (1995). Science as a candle in the dark. The demon-haunted world. New York: Random House.
Shepard, L.A. (1991, November). Will National Tests Improve Student Learning? Phi Delta Kappan. 232-238.
Stake, R.E. (1991, November). The Teacher, Standardized Testing and Prospects of Revolution. Phi Delta Kappan. 243-247.
Suter, L.E. (1992). Indicators of Science & Mathematics Education in 1992. (Report No NSF-93-95). Washington, D.C: National Science Foundation. (ERIC Document Reproduction Service No ED 365 511)
Texas Education Agency. (1992). Biology I End of Course Exam (Report: Student Assessment Office) Austin, TX: Author.
Texas Education Agency (1993). Statewide Accountability System An Overview of the Accreditation Procedures as Revised by Senate Bill 7. (Report: The Office of Accountability). Austin, TX: Author.
Texas Education Agency. (1994). Biology I Objectives and Measurement Specifications (Report: Division of Student Assessment). Austin, TX: Author.
Texas Education Agency. (1996). Table 4: Blueprint of 1996-2000 Accountability Systems. (Report: Office of Accountability). Austin, TX: Author.
Texas Education Agency. (1997). Texas Student Assessment Program Technical Digest For the Academic Year 1996-1997. Austin, TX: Author.
Texas Education Agency. (1999). Biology End-of-Course Percent Meeting Minimum Expectations All Students Not in Special Education Retrieved 1999, from the World Wide Web: http://www.tea.state.tx.us/student.assessment/results/swresult/biology.htm
Third International Mathematics and Science Study (TIMSS). (1996). TIMSS Report. Washington, D.C: National Center for Education Statistics.
Webb, N.L. (1999). Alignment of Science and Mathematics Standards and Assessments in Four States. (Council of Chief State School Officers and National Institute for Science Education Publication). Madison, WI: University of Wisconsin, Wisconsin Center for Education Research.
Westerlund, J.F. (1996). Reform and Reality: A Two Year Study Observations of Texas Teachers on the Biology I End of Course Examination. Dissertation Abstracts International, 58 (01) A, 126. (On-line No. AAG97119516)
Westerlund, J., & West, S. (1999). The Texas Biology I End of Course Examination: A Critique. The Texas Science Teacher, 28 (1), 5-12.
About the authors...
Julie F. Westerlund is an Assistant Professor of Biology at Southwest Texas State University in San Marcos, TX. She has eight years of experience as a high school teacher in Texas. Dr. Westerlund's research interests include inquiry-based science teaching, standardized testing in science, earth science education, and science teacher professional development. Sandra S. West is an Associate Professor of Biology at Southwest Texas State University in San Marcos, TX. She has taught in public schools at the secondary level for 18 years. Dr. West's research primarily focuses on safety in science education.