SUMMARY OF THE
ASSESSING GRAMMAR BOOK
Chapter one
Differing notions of ‘grammar’ for assessment
Grammar was used to mean the analysis of a language system,
and the study of grammar was not just considered an essential feature of
language learning, but was thought to be sufficient for learners to actually
acquire another language (Rutherford, 1988). Grammar in and of itself was
deemed to be worthy of study to the
extent that in the Middle Ages in Europe, it was thought to be the foundation
of all knowledge and the gateway to sacred and secular understanding (Hillocks
and Smith, 1991). Thus, the central role of grammar in language teaching
remained relatively uncontested until the late twentieth century. Even a few
decades ago, it would have been hard to imagine language instruction without
immediately thinking of grammar.
While the central role of grammar in the language
curriculum has remained unquestioned until recent times, grammar pedagogy has unsurprisingly
been the source of much debate. For example, some language educators have
argued that foreign languages are best learned deductively, where students are
asked to memorize and recite definitions, rules, examples and exceptions.
What is meant by ‘grammar’ in theories of language?
Grammar and linguistics
Clarifying this question is important given the different definitions and
conceptualizations of grammar that have been proposed over the years, and the
diverse ways in which these notions of grammar have influenced L2 educators.
When most language teachers, second language
acquisition (SLA) researchers and language testers think of ‘grammar’, they
call to mind one of the many paradigms (e.g., ‘traditional grammar’ or
‘universal grammar’) available for the study and analysis of language. Such
linguistic grammars are typically derived from data taken from native speakers
and minimally constructed to describe well-formed utterances within an
individual framework. These grammars strive for internal consistency and are
mainly accessible to those who have been trained in that particular paradigm.
Since the 1950s, there have been many such linguistic
theories – too numerous to list here – that have been proposed to explain
language phenomena. Many of these theories have helped shape how L2 educators currently
define grammar in educational contexts. Although it is beyond the purview of
this book to provide a comprehensive review of these theories, it is,
nonetheless, helpful to mention a few, considering both the impact they have
had on L2 education and the role they play in helping define grammar for
assessment purposes.
Form-based perspectives of language
Several syntactocentric, or form-based, theories of
language have provided grammatical insights to L2 teachers. Three are
particularly noteworthy: traditional grammar, structural linguistics and
transformational-generative grammar.
Form- and use-based perspectives of language
The three theories of linguistic analysis described
thus far have provided insights to L2 educators on several grammatical forms.
These insights provide information to explain what structures are theoretically
possible in a language. Other linguistic theories, however, are better equipped
to examine how speakers and writers actually exploit linguistic forms during
language use. For example, if we wish to explain how seemingly similar
structures like I like to read and I like reading connote different meanings,
we might turn to those theories that study grammatical form and use interfaces.
This would address questions such as: Why does a language need two or more
structures that are similar in meaning? Are similar forms used to convey
different specialized meanings? To what degree are similar forms a function of
written versus spoken language, or to what degree are these forms
characteristic of a particular social group or a specific situation? It is
important for us to discuss these questions briefly if we ultimately wish to
test grammatical forms along with their meanings and uses in context.
One approach to linguistic analysis that has
contributed greatly to our understanding of the grammatical forms found in
language use, as well as the contextual factors that influence the variability
of these forms, is corpus linguistics.
Communication-based perspectives of language
Other theories have provided grammatical insights from
a communication- based perspective. Such a perspective expresses the notion
that language involves more than linguistic form. It moves beyond the view of
language as patterns of morphosyntax observed within relatively
decontextualized sentences or sentences found within naturally occurring corpora.
Rather, a communication-based perspective views grammar as a set of linguistic
norms, preferences and expectations that an individual invokes to convey a host
of pragmatic meanings that are appropriate, acceptable and natural depending on
the situation.
What is pedagogical grammar?
A pedagogical grammar represents an eclectic, but
principled description of the target-language forms, created for the express purpose
of helping teachers understand the linguistic resources of communication. These
grammars provide information about how language is organized and offer
relatively accessible ways of describing complex, linguistic phenomena for
pedagogical purposes. The more L2 teachers understand how the grammatical
system works, the better they will be able to tailor this information to their
specific instructional contexts.
Recently, there have been some comprehensive, formal
attempts at interpreting linguistic theories for the purposes of teaching (or
testing) grammar. One of these formal pedagogical grammars of English is The Grammar
Book, published by Celce-Murcia and Larsen-Freeman (1999). These authors used
transformational-generative grammar as an organizing framework for the study of
the English language. However, in the tradition of pedagogical grammars, they
also invoked other linguistic theories and methods of analysis to explain the
workings of grammatical form, meaning and use when a specific grammar point was
not amenable to a transformational-generative analysis. For example, to explain
the form and meanings of prepositions, they drew upon case grammar (Fillmore,
1968) and to describe the English tense-aspect system at the semantic level,
they referred to Bull’s (1960) framework relating tense to time. Celce-Murcia
and Larsen-Freeman’s (1999) book and other useful pedagogical English grammars
(e.g., Swan, 1995; Azar, 1998) provide teachers and testers alike with
pedagogically oriented grammars that are an invaluable resource for organizing
grammar content for instruction and assessment.
Chapter two
Research on L2 grammar teaching, learning and
assessment
I will discuss the research on L2 grammar teaching and
learning and show how this research has important insights for language
teachers and testers wanting to assess L2 grammatical ability. Similarly, I
will discuss the critical role that assessment has played in empirical inquiry
on L2 grammar teaching and learning.
Research on L2 teaching and learning
Over the years, several of the questions mentioned
above have intrigued language teachers, inspiring them to experiment with
different methods, approaches and techniques in the teaching of grammar. To
determine if students had actually learned under the different conditions,
teachers have used diverse forms of assessment and drawn their own conclusions
about their students. In so doing, these teachers have acquired a considerable
amount of anecdotal evidence on the strengths and weaknesses of using different
practices to implement L2 grammar instruction. These experiences have led most
teachers nowadays to subscribe to an eclectic approach to grammar instruction,
whereby they draw upon a variety of different instructional techniques,
depending on the individual needs, goals and learning styles of their students.
In recent years, some of these same questions have
been addressed by second language acquisition (SLA) researchers in a variety of
empirically based studies. These studies have principally focused on a description
of how a learner’s interlanguage (Selinker, 1972), or L2 system,
develops over time and on the effects that L2 instruction may have on this
progression. In most of these studies, researchers have investigated the
effects of learning grammatical forms by means of one or more assessment tasks.
Based on the conclusions drawn from these assessments, SLA researchers have
gained a much better understanding of how grammar instruction impacts both
language learning in general and grammar learning in particular. However, in
far too many SLA studies, the ability under investigation has been poorly
defined or defined with no relation to a model of L2 grammatical ability.
The SLA research looking at the role of grammar
instruction in SLA might be categorized into three strands. One set of studies
has looked at the relationship between the acquisition of L2 grammatical
knowledge and different language-teaching methods. These are referred to as the
comparative methods studies. A second set of studies has examined the
acquisition of L2 grammatical knowledge through what Long and Robinson (1998)
call a ‘non-interventionist’ approach to instruction. These studies have
examined the degree to which grammatical ability could be acquired incidentally
(while doing something else) or implicitly (without awareness), and not
through explicit (with awareness) grammar instruction. A third set of
studies has investigated the relationship between explicit grammar instruction
and the acquisition of L2 grammatical ability. These are referred to as the interventionist
studies, and are a topic of particular interest to language teachers and
testers.
Comparative methods studies
The comparative methods studies sought to compare the
effects of different language-teaching methods on the acquisition of an L2.
These studies occurred principally in the 1960s and 1970s, and stemmed from a
reaction to the grammar-translation method, which had dominated language
instruction during the first half of the twentieth century. More generally,
these studies were in reaction to form-focused instruction (referred to as
‘focus on forms’ by Long, 1991), which used a traditional structural syllabus
of grammatical forms as the organizing principle for L2 instruction. According
to Ellis (1997), form-focused instruction contrasts with meaning-focused
instruction in that meaning-focused instruction emphasizes the communication of
messages (i.e., the act of making a suggestion and the content of such a
suggestion) while form-focused instruction stresses the learning of linguistic
forms. These can be further contrasted with form-and-meaning focused
instruction (referred to by Long (1991) as ‘focus-on-form’), where grammar
instruction occurs in a meaning-based environment and where learners strive to communicate
meaning while paying attention to form.
Non-interventionist studies
While some language educators were examining different
methods of teaching grammar in the 1960s, others were feeling a growing sense
of dissatisfaction with the central role of grammar in the L2 curriculum. As a
result, questions regarding the centrality of grammar were again raised by a
small group of L2 teachers and syllabus designers who felt that the teaching of
grammar in any form simply did not produce the desired classroom results.
Newmark (1966), in fact, asserted that grammatical analysis and the systematic
practice of grammatical forms were actually interfering with the process of L2
learning, rather than promoting it, and if left uninterrupted, second language
acquisition, similar to first language acquisition, would proceed naturally.
Empirical studies in support of non-intervention
The non-interventionist position was examined
empirically by Prabhu (1987) in a project known as the Communicational Teaching
Project (CTP) in southern India. This study sought to demonstrate that the
development of grammatical ability could be achieved through a task-based,
rather than a form-focused, approach to language teaching, provided that the
tasks required learners to engage in meaningful communication. In the CTP,
Prabhu (1987) argued against the notion that the development of grammatical
ability depended on a systematic presentation of grammar followed by planned
practice.
Possible implications of fixed developmental order for language assessment
The notion that structures appear to be acquired in a
fixed developmental order and in a fixed developmental sequence might
conceivably have some relevance to the assessment of grammatical ability. First
of all, these findings could give language testers an empirical basis for
constructing grammar tests that would account for the variability inherent in a
learner’s interlanguage. In other words, information on the acquisitional order
of grammatical items could conceivably serve as a basis for selecting
grammatical content for tests that aim to measure different levels of developmental
progression, as Chang (2002, 2004) did in examining the underlying
structure of a test that attempted to measure knowledge of relative
clauses. These findings also suggest a substantive approach to defining test
tasks according to developmental order and sequence on the basis of how
grammatical features are acquired over time (Ellis, 2001b). In other words, one
task could potentially tap into developmental level one, while another taps
into developmental level two, and so forth.
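To make this idea concrete, the following is a minimal Python sketch (not taken from the source) of how tasks keyed to developmental levels might be used to estimate a learner’s stage. The task names, the four-level scale and the simple implicational assumption (passing level n implies control of all lower levels) are invented purely for illustration.

# A minimal sketch (not from the source) of mapping task results onto
# hypothetical developmental levels. Task names and levels are invented.
# It assumes an implicational pattern: a learner credited with level n is
# assumed to control all lower levels as well.
TASK_LEVELS = {
    "canonical_word_order": 1,
    "adverb_fronting": 2,
    "yes_no_inversion": 3,
    "wh_inversion": 4,
}

def estimate_level(results: dict[str, bool]) -> int:
    """Return the highest level for which this and all lower levels were passed."""
    passed_levels = {TASK_LEVELS[task] for task, passed in results.items() if passed}
    level = 0
    while level + 1 in passed_levels:
        level += 1
    return level

if __name__ == "__main__":
    results = {
        "canonical_word_order": True,
        "adverb_fronting": True,
        "yes_no_inversion": False,
        "wh_inversion": False,
    }
    print(estimate_level(results))  # -> 2

As the next subsection makes clear, any such mapping would rest on acquisitional-sequence research that is still incomplete, so a sketch like this should be read as an illustration of the idea rather than a recommended scoring procedure.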
Problems with the use of developmental sequences as a basis for assessment
Although developmental sequence research offers an
intuitively appealing complement to accuracy-based assessments in terms of
interpreting test scores, this method is fraught with a number of serious
problems, and language educators should use extreme caution in applying this
method to language testing. This is because our understanding of natural
acquisitional sequences is incomplete and at too early a stage of research to
be the basis for concrete assessment recommendations (Lightbown, 1985; Hudson,
1993).
Interventionist studies
Not all L2 educators are in agreement with the
non-interventionist position on grammar instruction. In fact, several (e.g.,
Schmidt, 1983; Swain, 1991) have maintained that although some L2 learners are
successful in acquiring selected linguistic features without explicit grammar
instruction, the majority fail to do so.
Empirical studies in support of intervention
Aside from anecdotal evidence, the non-interventionist
position has come under intense attack on both theoretical and empirical
grounds, with several SLA researchers affirming that efforts to teach L2 grammar
typically result in the development of L2 grammatical ability. Hulstijn (1989)
and Alanen (1995) investigated the effectiveness of L2 grammar instruction on
SLA in comparison with no formal instruction. They found that when coupled with
meaning-focused instruction, the formal instruction of grammar appears to be
more effective than exposure to meaning or form alone. Long (1991) also argued
for a focus on both meaning and form in classrooms that are organized around
meaningful and sustained communicative interaction. He maintained that the
focus on grammar in communicative interaction serves as an aid to clarity and
precision.
Research on instructional techniques and their effects
on acquisition
Much of the recent research on teaching grammar has
focused on four types of instructional techniques and their effects on
acquisition. Although a complete discussion of teaching interventions is
outside the purview of this book (see Ellis, 1997; Doughty and Williams, 1998),
these techniques include form- or rule-based techniques, input-based
techniques, feedback-based techniques and practice-based techniques (Norris and
Ortega, 2000).
Grammar processing and second language development
In the grammar-learning process, explicit
grammatical knowledge refers to a conscious knowledge of grammatical forms
and their meanings. Explicit knowledge is usually accessed slowly, even when it
is almost fully automatized (Ellis, 2001b). DeKeyser (1995) characterizes
grammatical instruction as ‘explicit’ when it involves the explanation of a
rule or the request to focus on a grammatical feature. Instruction can be explicitly
deductive, where learners are given rules and asked to apply them, or explicitly
inductive, where they are given samples of language from which to generate
rules and make generalizations. Similarly, many types of language test tasks
(e.g., gap-filling tasks) seem to measure explicit grammatical knowledge.
Implicit grammatical knowledge refers
to ‘the knowledge of a language that is typically manifest in some form of
naturally occurring language behavior such as conversation’ (Ellis, 2001b, p.
252). In terms of processing time, it is unconscious and is accessed quickly.
DeKeyser (1995) classifies grammatical instruction as implicit when it does not
involve rule presentation or a request to focus on form in the input; rather,
implicit grammatical instruction involves semantic processing of the input
without any degree of awareness of grammatical form.
Implications for assessing grammar
The studies investigating the effects of teaching and
learning on grammatical performance present a number of challenges for language
assessment. First of all, the notion that grammatical knowledge structures can
be differentiated according to whether they are fully automatized (i.e.,
implicit) or not (i.e., explicit) raises important questions for the testing of
grammatical ability (Ellis, 2001b). Given the many purposes of assessment, we
might wish to test explicit knowledge of grammar, implicit knowledge of grammar
or both. For example, in certain classroom contexts, we might want to assess
the learners’ explicit knowledge of one or more grammatical forms, and could,
therefore, ask learners to answer multiple-choice or short-answer questions
related to these forms. The information from these assessments would show how
well students could apply the forms in contexts where fluent and spontaneous
language use is not required and where time could be taken to figure out the
answers. Inferences from the results of these assessments could be useful for
teachers wishing to determine if their students have mastered certain
grammatical forms. However, as teachers are well aware, this type of assessment
would not necessarily show that the students had actually internalized the
grammatical forms so as to be able to use them automatically in spontaneous or
unplanned discourse. To obtain information on the students’ implicit knowledge
of grammatical forms, testers would need to create tasks designed to elicit the
fluent and spontaneous use of grammatical forms in situations where automatic
language use was required. In other words, to infer that students could
understand and produce grammar in spontaneous speech, testers would need to
present students with tasks that elicit comprehension or full production in
real time (e.g., listening and speaking). Ellis (2001b) suggests that we also
need to utilize time pressure as a means of ensuring that implicit knowledge is
being tested. Although this idea is interesting, the introduction of speed into
an assessment should be done with caution since it is often difficult to
determine the impact of speed on the test taker. In effect, speed may simply
produce a heightened sense of test anxiety, thereby introducing irrelevant
variability in the test scores. If this were the case, speed would not
necessarily provide an effective means of eliciting automatic grammatical
ability. In my opinion, comprehensive assessments of grammatical ability should
attempt to test students on both their explicit and their implicit knowledge of
grammar.
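As an illustration of the time-pressure idea mentioned above, here is a minimal Python sketch (not from the source) of a single timed item. The prompt, the answer key and the five-second limit are all invented, and a real speeded task would be delivered under far more controlled conditions; the sketch only shows how a response-time record could separate quick, presumably automatized responses from slow, deliberate ones.

# A minimal sketch (not from the source) of adding time pressure to one item.
# The 5-second limit is an arbitrary illustration, not a recommended value.
import time

def timed_item(prompt: str, expected: str, time_limit: float = 5.0) -> dict:
    """Administer one item at the command line and record the response time."""
    start = time.monotonic()
    answer = input(prompt + " ")
    elapsed = time.monotonic() - start
    return {
        "correct": answer.strip().lower() == expected.lower(),
        "elapsed_seconds": round(elapsed, 2),
        # Only responses produced within the limit would be treated as
        # possible evidence of implicit (automatized) knowledge.
        "within_limit": elapsed <= time_limit,
    }

if __name__ == "__main__":
    result = timed_item("She ___ to school every day. (go)", "goes")
    print(result)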
Chapter three
The role
of grammar in models of communicative language ability
In this chapter I will
discuss the role that grammar plays in models of communicative competence. I
will then endeavor to define grammar for assessment purposes. In this
discussion I will describe in some detail the relationships among grammatical
form, grammatical meaning and pragmatic meaning. Finally, I will present a
theoretical model of grammar that will be used in this book as a basis for a
model of grammatical knowledge. This will, in turn, be the basis for
grammar-test construction and validation. In the following chapter I will
discuss what it means for L2 learners to have grammatical ability.
The role
of grammar in models of communicative competence
In sum, many different
models of communicative competence have emerged over the years. The more recent
depictions have presented much broader conceptualizations of communicative
language ability; however, definitions of
grammatical knowledge have remained more or less the same – morphosyntax. Also,
within these expanded models, more detailed specifications are needed for how
grammatical form might interact with grammatical meaning to communicate literal
and intended meanings, and how form and meaning relate to the ability to convey
pragmatic meanings. If our assessment goal were limited to an understanding of
how learners have mastered grammatical forms, then the current models of
grammatical knowledge would suffice. However, if we hope to understand how
learners use grammatical forms as a resource for conveying a variety of
meanings in language-acquisition, -assessment and -use situations, as I think
we do, then a definition of grammatical knowledge which addresses these other
dimensions of grammatical ability is needed.
What is meant by ‘grammar’ for assessment purposes?
Now with a better understanding of how grammar has
been conceptualized in models of language ability, how might we define ‘grammar’
for assessment purposes? It should be obvious from the previous discussion that
there is no one ‘right’ way to define grammar. In one testing situation the
assessment goal might be to obtain information on students’ knowledge of
linguistic forms in minimally contextualized sentences, while in another, it
might be to determine how well learners can use linguistic forms to express a
wide range of communicative meanings. Regardless of the assessment purpose, if
we wish to make inferences about grammatical ability on the basis of a grammar
test or some other form of assessment, it is important to know what we mean by
‘grammar’ when attempting to specify components of grammatical knowledge for
measurement purposes. With this goal in mind, we need a definition of
grammatical knowledge that is broad enough to provide a theoretical basis for the
construction and validation of tests in a number of contexts. At the same time,
we need our definition to be precise enough to distinguish it from other areas
of language ability.
From a theoretical perspective, the main goal of
language use is communication, whether it be used to transmit information, to
perform transactions, to establish and maintain social relations, to construct
one’s identity or to communicate one’s intentions, attitudes or hypotheses.
As the primary resource for communication, language knowledge consists of
grammatical knowledge and pragmatic knowledge. Therefore, I propose a
theoretical definition of language knowledge that consists of two distinct, but
related, components. I will refer to one component as grammatical knowledge and
to the other as pragmatic knowledge.
Chapter four
Towards a definition of grammatical ability
What is meant by grammatical ability?
Having described how grammar has been conceptualized,
we are now faced with the challenge of defining what it means to ‘know’ the
grammar of a language so that it can be used to achieve some communicative
goal. In other words, what does it mean to have ‘grammatical ability’?
Defining grammatical constructs
A clear definition of what we believe it means to
‘know’ grammar for a particular testing context will then allow us to construct
tests that measure grammatical ability. The many possible ways of interpreting
what it means to ‘know grammar’ or to have ‘grammatical ability’ highlight the
importance in language assessment of defining key terms. Some of the same terms
used by different testers reflect a wide range of theoretical positions in the
field of applied linguistics. In this book, I will use several theoretical
terms from the domain of language testing. These include knowledge, competence,
ability, proficiency and performance, to name a few. These concepts are
abstract, not directly observable in tests and open to multiple definitions and
interpretations. Therefore, before we use abstract terms such as knowledge or
ability, we need to ‘construct’ a definition of them that will both suit our
assessment goals and be theoretically viable. I will refer to these abstract,
theoretical concepts generically as constructs or theoretical constructs.
One of the first steps in designing a test, aside from
identifying the need for a test, its purpose and audience, is to provide a
clear theoretical definition of the construct(s) to be measured. If we have a
theoretically sound, as well as a clear and precise definition of grammatical
knowledge, we can then design tasks to elicit performance samples of
grammatical ability. By having the test-takers complete grammar tasks, we can observe
– and score – their answers with relation to specific grammatical criteria for
correctness. If these performance samples reflect the underlying grammatical
constructs – an empirical question – we can then use the test results to make
inferences about the test-takers’ grammatical ability. These inferences, in
turn, may be used to make decisions about the test-takers (e.g., pass the
course). However, we need first to provide evidence that the tasks on a test
have measured the grammatical constructs we have designed them to measure
(Messick, 1993). The process of providing arguments in support of this evidence
is called validation, and this begins with a clear definition of the constructs.
Language educators thus need to define carefully the constructs
to be measured when creating tasks for tests. They must provide a clear definition
of the constructs, bearing in mind that each time a test is designed it should
reflect the different components of grammatical knowledge, the purpose of the
assessment, the group of learners about whom we would like to make inferences and
the language-use contexts to which, we hope, the results will ultimately
generalize.
Definition of key terms
Before continuing this discussion, it might be helpful
if I clarified some of the key terms.
Grammatical knowledge
Knowledge refers to a set of informational structures
that are built up through experience and stored in long-term memory. These
structures include knowledge of facts that are stored in concepts, images,
networks, production-like structures, propositions, schemata and representation
(Pressley, 1995). Language knowledge is then a mental representation of informational
structures related to language. The exact components of language knowledge,
like any other construct, need to be defined. In this book, grammar refers to a
system of language whereas grammatical knowledge is defined as a set of
internalized informational structures related to the theoretical model of
grammar proposed
Grammatical ability
Although some researchers have defined knowledge and
ability similarly, I use these terms differently. ‘Knowledge’ refers to a set
of informational structures available for use in long-term memory. Ability,
however, encompasses more than just a domain of information in memory; it also
involves the capacity to use these informational structures in some way.
Therefore, language ability, sometimes called communicative
competence or language proficiency, refers to an individual’s
capacity to utilize mental representations of language knowledge built up
through practice or experience in order to convey meaning. Given this
definition, language ability, by its very nature, involves more than just
language knowledge. Bachman and Palmer (1996) characterize language ability as
a combination of language knowledge and strategic competence, defined as
a set of metacognitive strategies (e.g., planning, evaluating) and, I
might add, cognitive strategies (e.g., associating, clarifying), for the
purpose of ‘creating and interpreting discourse in both testing and non-testing
situations’ (p. 67).
Grammatical ability is, then, the combination of
grammatical knowledge and strategic competence; it is specifically defined as
the capacity to realize grammatical knowledge accurately and meaningfully in
testing or other language-use situations. Hymes (1972) distinguished between
competence and performance, stating that communicative competence includes the
underlying potential of realizing language ability in instances of language
use, whereas language performance refers to the use of language in
actual language events. Carroll (1968) refers to language performance as ‘the
actual manifestation of linguistic competence . . . in behavior’ (p. 50).
Metalinguistic knowledge
Finally, metalanguage is the language used to describe
a language. It generally consists of technical linguistic or grammatical terms
(e.g., noun, verb). Metalinguistic knowledge, therefore, refers to informational
structures related to linguistic terminology. We must be clear that
metalinguistic knowledge is not a component of grammatical ability; rather, the
knowledge of linguistic terms would more aptly be classified as a kind of
specific topical knowledge that might be useful for language teachers to
possess. Some teachers almost never present metalinguistic terminology to their
students, while others find it useful as a means of discussing the language and
learning the grammar. It is important to remember that knowing the grammatical
terms of a language does not necessarily mean knowing how to communicate in the
language.
What is ‘grammatical ability’ for assessment purposes?
The approach to the assessment of grammatical ability
in this book is based on several specific definitions. First, grammar
encompasses grammatical form and meaning, whereas pragmatics is a separate, but
related, component of language. A second is that grammatical knowledge, along with
strategic competence, constitutes grammatical ability. A third is that grammatical
ability involves the capacity to realize grammatical knowledge accurately and
meaningfully in test-taking or other language-use contexts. The capacity to
access grammatical knowledge to understand and convey meaning is related to a
person’s strategic competence. It is this interaction that enables examinees to
implement their grammatical ability in language use. Next, in tests and other
language-use contexts, grammatical ability may interact with pragmatic ability
(i.e., pragmatic knowledge and strategic competence) on the one hand, and with
a host of non-linguistic factors such as the test-taker’s topical knowledge, personal
attributes, affective schemata and the characteristics of the task on the
other. Finally, in cases where grammatical ability is assessed by means of an
interactive test task involving two or more interlocutors, the way grammatical
ability is realized will be significantly impacted by both the contextual and
the interpretative demands of the interaction.
Knowledge of phonological or graphological form and
meaning
Knowledge of phonological/graphological form enables
us to understand and produce features of the sound or writing system (with the
exception of meaning-based orthographies such as Chinese characters) as they
are used to convey meaning in testing or language-use situations. Phonological
form includes the segmentals (i.e., vowels and consonants) and prosody (i.e.,
stress, rhythm, intonation contours, volume, tempo). These forms can be used
alone or in conjunction with other grammatical forms to encode phonological
meaning. For example, the ability to hear or pronounce meaning-distinguishing
sounds such as /b/ vs. /v/ allows us to distinguish the meanings of different
words (boat/vote), and the ability to hear or pronounce the
prosodic features of the language (e.g., intonation) could allow students to
understand or convey the notion that a sentence is an interrogative (You’re
staying?).
Knowledge of lexical form and meaning
Knowledge of lexical form enables us to understand and
produce those features of words that encode grammar rather than those that
reveal meaning. This includes words that mark gender (e.g., waitress),
countability (e.g., people) or part of speech (e.g., relate, relation). For example,
when the word think in English is followed by the preposition about before a
noun, this is considered the grammatical dimension of lexis, representing a
co-occurrence restriction with prepositions. One area of lexical form that
poses a challenge to learners of some languages is word formation. This
includes compounding in English with a noun + noun or a verb + particle pattern
(e.g., fire escape; breakup) or derivational affixation in Italian (e.g.,
ragazzino ‘little kid’, ragazzone ‘big kid’). For example, a student who says
‘a teacher of chemistry’ instead of ‘chemistry teacher’ or ‘*this people’ would
need further instruction in lexical form.
Knowledge of lexical meaning allows us to
interpret and use words based on their literal meanings. Lexical meaning here
does not encompass the suggested or implied meanings of words based on
contextual, sociocultural, psychological or rhetorical associations. For
example, the literal meaning of a rose is a kind of flower, whereas a rose can
also be used in a non-literal sense to imply a number of sociocultural meanings
depending on the context. These include love, beauty, passion and still a host
of other cultural meanings (e.g., the Rose Bowl, a rose window). Lexical
meaning also accounts for the literal meaning of formulaic or lexicalized
expressions (e.g., You’re welcome). Although it is possible to test lexical
form or meaning separately, we must recognize that lexical form and meaning are
very closely associated.
Knowledge of morphosyntactic form and meaning
Knowledge of morphosyntactic form permits us to
understand and produce both the morphological and syntactic forms of the
language. This includes the articles, prepositions, pronouns, affixes (e.g.,
-est), syntactic structures, word order, simple, compound and complex
sentences, mood, voice and modality. A learner who knows the morphosyntactic
form of the English conditionals would know that: (1) an if-clause sets up a
condition and a result clause expresses the outcome; (2) both clauses can be in
the sentence-initial position in English; (3) if can be deleted under certain
conditions as long as the subject and operator are inverted; and (4) certain
tense restrictions are imposed on if and result clauses.
Morphosyntactic forms carry morphosyntactic
meanings which allow us to interpret and express meanings from inflections
such as aspect and time, meanings from derivations such as negation and agency,
and meanings from syntax such as those used to express attitudes (e.g., subjunctive
mood) or show focus, emphasis or contrast (e.g., voice and word order). For
example, a student who knows the morphosyntactic meaning of the English
conditionals would know how to express a factual conditional relationship (If
this happens, that happens), a predictive conditional relationship (If
this happens, that will happen), or a hypothetical conditional relationship
(If this happened, that would happen). On the sentential level, the
individual morphosyntactic forms and meanings taken together allow us to
interpret and express the literal or grammatical meaning of an utterance and
they allow us to identify the direct language function associated with language
use.
Knowledge of cohesive form and meaning
Knowledge of cohesive form enables us to use the
phonological, lexical and morphosyntactic features of the language in order to
interpret and express cohesion on both the sentence and the discourse levels.
Cohesive form is directly related to cohesive meaning through cohesive devices
(e.g., she, this, here) which create links between cohesive forms and their
referential meanings within the linguistic environment or the surrounding
co-text. Halliday and Hasan (1976, 1989) list a number of grammatical forms for
displaying cohesive meaning. This can be achieved through the use of personal
referents to convey possession or reciprocity; demonstrative referents to
display spatial, temporal or psychological links; comparative referents to
encode similarity, difference and equality; and logical connectors to signal a
wide range of meanings such as addition, logical conclusion and contrast.
Cohesive meaning is also conveyed through ellipsis (e.g., When [should I arrive
at your house]?), substitution (e.g., I hope so) and lexical ties in the form
of synonymy and repetition. Finally, cohesive meaning can be communicated
through the internal relationship between pair parts in an adjacency pair
(e.g., invitation/acceptance). When the interpretation source of a cohesive
form is within the linguistic environment, the interpretation is said to be
endophoric (Halliday and Hasan, 1989).
Knowledge of information management form and meaning
Knowledge of information management form allows us to
use linguistic forms as a resource for interpreting and expressing the
information structure of discourse. Some resources that help manage the
presentation of information include, for example, prosody, word order,
tense-aspect and parallel structures. These forms are used to create
information management meaning. In other words, information can be structured
to allow us to organize old and new information (i.e., topic/comment),
topicalize, emphasize information and provide information symmetry through
parallelism and tense concordance.
Knowledge of interactional form and meaning
Knowledge of interactional form enables us to
understand and use linguistic forms as a resource for understanding and
managing talk-in-interaction. These forms include discourse markers and
communication management strategies. Discourse markers consist of a set of adverbs,
conjunctions and lexicalized expressions used to signal certain language functions.
For example, well . . . can signal disagreement, ya know or ah-huh can signal
shared knowledge, and by the way can signal topic diversion.
Conversation-management strategies include a wide range of linguistic forms
that serve to facilitate smooth interaction or to repair interaction when
communication breaks down. For example, when interaction stops because a
learner does not understand something, one person might try to repair the
breakdown by asking, *What means that? Here the learner knows the interactional
meaning, but not the form.
Similar to cohesive forms and information management
forms, interactional forms use phonological, lexical and morphosyntactic
resources to encode interactional meaning. For example, in saying *What means that?,
the learner knows how to repair a conversation by asking for clarification, but
does not know the form of the request. Other examples of interactional forms
and meanings include: backchannel signals such as ah-huh, or right in English
to signal active listening and engagement; lexicalized expressions like guess
what? and you know what? to indicate the initiation of a story sequence; and
others such as Oh my God, I can’t believe it! Oh my God, you’re kiddin’ me! or,
in current Valleyspeak, Shut up! (with stress on ‘shut’ and falling intonation
on ‘up’), commonly used to express surprise.
Although interactional form and meaning are closely
associated, it is possible for students to know the form but not the meaning,
and vice versa. For example, a student who says *thanks you to express
gratitude or *you welcome to respond to an expression of gratitude obviously
has knowledge of the interactional meanings, but not the forms. Finally, from a
pragmatic perspective, interactional forms and meanings embody a number of
implied meanings.
Chapter Five
Designing test tasks to measure L2 grammatical ability
How does test development begin? Every grammar-test
development project begins with a desire to obtain (and often provide)
information about how well a student knows grammar in order to convey meaning
in some situation where the target language is used. The information obtained
from this assessment then forms the basis for decision-making. Those situations
in which we use the target language to communicate in real life or in which we
use it for instruction or testing are referred to as the target language use
(TLU) situations (Bachman and Palmer, 1996). Within these situations, the
tasks or activities requiring language to achieve a communicative goal are
called the target language use tasks. A TLU task is one of many language
use tasks that test-takers might encounter in the target language use domain.
It is to this domain that language testers would like to make inferences about
language ability, or more specifically, about grammatical ability.
What do we mean by ‘task’?
The notion of ‘task’ in language-learning contexts has
been conceptualized in many different ways over the years. Traditionally,
‘task’ has referred to any activity that requires students to do something for
the intended purpose of learning the target language. A task, then, is any activity
(e.g., short answers, role-plays) as long as it involves a linguistic or
non-linguistic (e.g., circling the answer) response to input. Traditional learning or
teaching tasks are characterized as having an intended pedagogical purpose –
which may or may not be made explicit; they have a set of instructions that
control the kind of activity to be performed; they contain input (e.g.,
questions); and they elicit a response. More recently, learning tasks have been
characterized more in terms of their communicative goals, their success in
eliciting interaction and negotiation of meaning, and their ability to engage
learners in complex meaning-focused activities (Nunan, 1989, 1993; Berwick,
1993; Skehan, 1998).
In a discussion of the degree to which pedagogical
tasks are successful in eliciting the specific grammatical structures under
investigation, Loschky and Bley-Vroman (1993) identified three types of
grammar-to-task relationships. The first involves task-naturalness, a condition
where ‘a grammatical construction may arise naturally during the performance of
a particular task, but the task can often be performed perfectly well, even
quite easily, without it’ (p. 132). For example, in a task designed to elicit
past modals in the context of a murder mystery, we expect forms like: the
butler could have done it or the maid might have killed her, but we might get
forms like: Maybe the butler did it or I suspect the maid killed her. The
second condition is task-utility, where ‘it is possible to complete the task
[meaningfully] without the structure, but with the structure the task becomes
easier’ (ibid.). For example, in a comparison task, I once had a student say:
*Shiraz is beautiful city, but Esfahan is very, very, very, beautiful city in
Iran. Had he known the comparatives or the superlatives, his message could have
been communicated much more easily. The final and most interesting condition
for grammar assessment entails task-essentialness. This is where the
task cannot be completed unless the grammatical form is used. For example, in a
task intended to distinguish stative from dynamic adjectives, the student would
need to know the difference between I’m really bored and I’m really boring in
order to complete the task. Obviously task essentialness is the most difficult,
yet the most desirable condition to meet in the construction of grammar tasks.
What are the characteristics of grammatical test
tasks?
As the goal of grammar assessment is to provide as
useful a measurement as possible of our students’ grammatical ability, we need
to design test tasks in which the variability of our students’ scores is
attributed to the differences in their grammatical ability, and not to
uncontrolled or irrelevant variability resulting from the types of tasks or the
quality of the tasks that we have put on our tests. As all language teachers
know, the kinds of tasks we use in tests and their quality can greatly
influence how students will perform. Therefore, given the role that the effects
of task characteristics play on performance, we need to strive to manage (or at
least understand) the effects of task characteristics so that they will function
the way we designed them to – as measures of the constructs we want to measure
(Douglas, 2000). In other words, specifically designed tasks will work to
produce the types of variability in test scores that can be attributed to the
underlying constructs given the contexts in which they were measured (Tarone,
1998). To understand the characteristics of test tasks better, we turn to
Bachman and Palmer’s (1996) framework for analyzing target language use tasks
and test tasks.
The Bachman and Palmer framework
Bachman and Palmer’s (1996) framework of task
characteristics represents the most recent thinking in language assessment of
the potential relationships between task characteristics and test performance.
In this framework, they outline five general aspects of tasks, each of which is
characterized by a set of distinctive features. These five aspects describe
characteristics of (1) the setting, (2) the test rubrics, (3) the input, (4)
the expected response and (5) the relationship between the input and response.
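As a rough illustration of how these five aspects might be recorded when analyzing a grammar test task, the following Python sketch (not from the source) describes a hypothetical gap-filling task. The class name, the attribute values and the level of detail are invented; Bachman and Palmer’s framework specifies the aspects, not any particular data structure.

# A minimal sketch (not from the source) recording Bachman and Palmer's (1996)
# five aspects of task characteristics for an invented gap-filling task.
from dataclasses import dataclass, field

@dataclass
class TaskCharacteristics:
    setting: dict = field(default_factory=dict)             # physical conditions, participants, time
    rubrics: dict = field(default_factory=dict)             # instructions, structure, time allotment, scoring
    input: dict = field(default_factory=dict)               # format and language of the input
    expected_response: dict = field(default_factory=dict)   # format and language of the response
    input_response_relationship: dict = field(default_factory=dict)  # reactivity, scope, directness

gap_fill = TaskCharacteristics(
    setting={"location": "classroom", "time_allowed": "50 minutes"},
    rubrics={"scoring": "right/wrong", "instructions_language": "English"},
    input={"format": "written passage with gaps"},
    expected_response={"format": "single word per gap"},
    input_response_relationship={"reactivity": "non-reciprocal", "scope": "narrow", "directness": "direct"},
)

print(gap_fill.rubrics["scoring"])  # -> right/wrong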
Describing grammar test tasks
When teachers and testers think of
grammar tests, they call to mind a large repertoire of task types that have
been commonly used in teaching and testing contexts. We now know that these
holistic task types constitute collections of task characteristics for
eliciting performance and that these holistic task types can vary on a number
of dimensions. In designing grammar tests, we need to be familiar with a wide
range of activities to elicit grammatical performance.
Selected-response task types
Selected-response
tasks present input in the form of an item, and test-takers are expected to
select the response; beyond that, all other task characteristics can vary.
Responses are typically scored as right or wrong. However, in some instances,
partial-credit scoring may be useful, depending on how the construct is defined.
Finally, selected-response tasks can vary in terms of reactivity, scope and directness.
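The following Python sketch (not from the source) illustrates the contrast between right/wrong and partial-credit scoring of a single selected-response item. The answer key and the option weights are invented; for instance, a distractor that is accurate in form but less appropriate in meaning might earn partial credit, depending on how the construct is defined.

# A minimal sketch (not from the source) contrasting dichotomous and
# partial-credit scoring for one selected-response item. Weights are invented.
ITEM_KEY = "b"
PARTIAL_CREDIT = {"a": 0.0, "b": 1.0, "c": 0.5, "d": 0.0}

def score_dichotomous(selected: str) -> float:
    """Right/wrong scoring: full credit only for the keyed option."""
    return 1.0 if selected == ITEM_KEY else 0.0

def score_partial_credit(selected: str) -> float:
    """Weighted scoring: each option carries a construct-based weight."""
    return PARTIAL_CREDIT.get(selected, 0.0)

print(score_dichotomous("c"))     # -> 0.0
print(score_partial_credit("c"))  # -> 0.5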
The multiple-choice (MC) task
This task
presents input with gaps or underlined words or phrases. While the MC task has
many advantages, the items can be difficult and time-consuming to develop. The
format encourages guessing, and scores might be inflated due to test-wiseness,
or the test-taker’s knowledge about test-taking. This can result in serious
questions about the validity of inferences based on these items (Cohen, 1998).
Finally, many educators argue that MC tasks are inauthentic language-use tasks.
Multiple-choice error identification
task
This task presents test-takers with an item that
contains one incorrect, unacceptable, or inappropriate feature in the input.
Examinees are required to identify the error. In the context of grammatical
assessment, the errors in the input relate to grammatical accuracy and/or
meaningfulness.
The discrimination task
This task presents examinees with language and/or
non-language input along with two response choices that are polar opposites or
that contrast in some way. Some response possibilities include: true–false,
right–wrong, same–different, agree–disagree, grammatical–ungrammatical and so
forth. Discrimination items are designed to measure the differences between two
similar areas of grammatical knowledge.
The noticing task
This task presents learners with a wide range of
input in the form of language and/or non-language. Examinees are asked to
indicate (e.g., by circling, highlighting) that they have identified some
specific feature in the language.
The noticing task, also referred to as a kind of
consciousness-raising (CR) task, is intended to help students process input by
getting them to construct a conscious form–meaning representation of the
grammatical feature (Ellis, 1997), and for this reason, it seems to be
particularly effective in promoting the acquisition of new grammar points (Tuz,
1993, cited in Ellis, 1997; VanPatten and Cadierno, 1993a, 1993b).
Limited-production task types
Limited-production
tasks are intended to assess one or more areas of grammatical knowledge
depending on the construct definition. Unlike selected-response items, which
usually have only one possible answer, the range of possible answers for
limited-production tasks can, at times, be large – even when the response
involves a single word.
In
some situations, limited-production tasks can be scored with a holistic or
analytic rating scale. This method is useful if we wish to judge distinct
aspects of grammatical ability and to distinguish different levels of ability or mastery.
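By way of illustration, the Python sketch below (not from the source) shows how an analytic rating scale might combine separate ratings for grammatical form and grammatical meaning into a single score. The criteria, the 0 to 3 bands and the equal weighting are invented; a real scale would also define descriptors for each band.

# A minimal sketch (not from the source) of analytic rating-scale scoring for a
# limited-production response. Criteria, bands and weights are invented.
BANDS = range(0, 4)  # 0 = no evidence ... 3 = full control

def analytic_score(ratings: dict[str, int], weights: dict[str, float]) -> float:
    """Combine per-criterion band ratings into a weighted composite score."""
    assert all(r in BANDS for r in ratings.values())
    return sum(ratings[criterion] * weights[criterion] for criterion in ratings)

ratings = {"grammatical_form": 2, "grammatical_meaning": 3}
weights = {"grammatical_form": 0.5, "grammatical_meaning": 0.5}
print(analytic_score(ratings, weights))  # -> 2.5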
The gap-filling task
This task presents input in the form of a sentence,
passage or dialogue with a number of words deleted. Examinees are required to
fill each gap with a response that is appropriate for the context. Gap-filling
tasks are designed to measure one or more areas of grammatical knowledge, such
as the learner’s knowledge of grammatical forms and meanings.
A second type of gap-filling task is the cued
gap-filling task. In these tasks, the gaps are preceded by one or more lexical
items, or cues, which must be transformed in order to fill the gap correctly.
A third type of gap-filling task is the cloze. This
task presents the input as a passage or dialogue in which every fifth, sixth or
seventh word is mechanically deleted and replaced by a gap. Examinees have to
fill the gap with the best word for the context.
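To illustrate the mechanics of a fixed-ratio cloze, here is a minimal Python sketch (not from the source) that replaces every nth word after a short lead-in with a numbered gap. The passage, the deletion ratio and the lead-in length are invented, and real cloze construction usually involves further refinements (for example, avoiding gaps on proper nouns) that this sketch ignores.

# A minimal sketch (not from the source) of fixed-ratio cloze construction:
# the first `skip` words are left intact, then every nth word becomes a gap.
def make_cloze(text: str, n: int = 6, skip: int = 10):
    words = text.split()
    gapped, answers = [], []
    for i, word in enumerate(words):
        if i >= skip and (i - skip) % n == n - 1:
            answers.append(word)
            gapped.append(f"({len(answers)}) ______")
        else:
            gapped.append(word)
    return " ".join(gapped), answers

passage = ("The students arrived early because they wanted good seats. "
           "They had heard that the lecture would be crowded, so they "
           "brought their own chairs and waited patiently near the door.")
cloze_text, answer_key = make_cloze(passage, n=6)
print(cloze_text)
print(answer_key)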
The short-answer task
This task presents input in the form of a question,
incomplete sentence or some visual stimulus. Test-takers are expected to
produce responses that range in length from a word to a sentence or two.
Short-answer questions can be used to test several areas of grammatical
ability, and are usually scored as right or wrong with one or more criteria for
correctness or partial credit. Short-answer tasks can also be scored by means
of a rating scale.
The dialogue (or discourse)
completion task (DCT)
DCTs
are intended to measure the students’ capacity to use grammatical forms to
express a variety of literal or grammatical meanings (e.g., request), where the
relationship between the form and the meaning is relatively direct. DCTs have been used extensively in applied
linguistics research to investigate the use of semantic formulas and other
linguistic devices to express a wide range of literal and implied contextual
meanings (e.g., refusals, apologies, compliments). They have also been used to
examine sociolinguistic and sociocultural meanings (social distance, power,
register) associated with these contexts. This research has been performed with
native and non-native speakers alike.
In a DCT that includes a closing third turn, the relationship
between the input and the response is reciprocal since the response affects further
turns. The closing third is used to constrain the meaning of the
expected response, thereby limiting the forms and the range of meanings that
can be expressed in the response.
Extended-production task types
Extended-production tasks are particularly well
suited for measuring the examinee’s ability to use grammatical forms to convey
meanings in instances of language use (i.e., speaking and writing). When
assessing grammatical ability in the context of speaking, it is advisable,
whenever possible, to audiotape or videotape the interaction. This will allow
the performance samples to be scored more reliably and will provide time to
record diagnostic feedback for students.
The quality of the extended-production task
responses is judged (1) with reference to the theoretical construct(s) being
measured and (2) in terms of different levels of grammatical ability or
mastery. For this reason, extended-production tasks are scored with the
rating-scale method.
The information-gap task
(info-gap)
This task presents input in the form of two or more
sets of partially complete information. Test-takers are instructed to ask each
other questions to obtain one complete set of information. Info-gap tasks are
scored by means of the rating-scale method as described above, and they can be
used to measure the test-takers’ ability to use grammatical forms to convey
several meanings – both literal and implied.
The role-play and simulation tasks
These tasks present test-takers with a prompt in
which two or more examinees are asked to assume a role in order to solve a
problem collaboratively, make a decision or perform some transaction. The input
can be language and/or non-language, and it can contain varying amounts of
information. In terms of the expected response, role-plays and simulation tasks
elicit large amounts of language, invoking the test-takers’ grammatical and
pragmatic knowledge, their topical knowledge, strategic competence and affective
schemata. The purpose of the test and the construct definition will determine
what will be scored. The relationship between the input and response is
reciprocal and indirect. These tasks are scored with the rating-scale method in
light of the constructs being measured.
Chapter six
Developing tests to
measure L2 grammatical ability
The
information derived from language tests, of which grammar tests are a subset,
can be used to provide test-takers and other test-users with formative and summative
evaluations. Formative evaluation relating to grammar assessment supplies
information during a course of instruction or learning on how test-takers might
increase their knowledge of grammar, or how they might improve their ability to
use grammar in communicative contexts.
Score-based
inferences from grammar tests can also be used to make, or contribute to,
decisions about program placement. This information provides a basis for
deciding how students might be placed into a level of a language program that
best matches their knowledge base, or it might determine whether or not a
student is eligible to be exempted from further L2 study.
The quality of reliability
When
we talk about ‘reliability’ in reference to a car, we all know what that means.
A car is said to be reliable if it readily starts up every time we want to use
it regardless of the weather, the time of day or the user. It is also
considered reliable if the brakes never fail, and the steering is consistently
responsive. These mechanical functions, working together, make the car’s
performance anywhere from zero to one hundred percent reliable.
The quality of construct validity
The
second quality that all ‘useful’ tests possess is construct validity. Bachman
and Palmer (1996) define construct validity as ‘the extent to which we can
interpret a given test score as an indicator of the ability(ies), or
construct(s), we want to measure. Construct validity also has to do with the
domain of generalization to which our score interpretations generalize’ (p.
21). In other words, construct validity not only refers to the meaningfulness
and appropriateness of the interpretations we make based on test scores, but it
also pertains to the degree to which the score-based interpretations can be
extrapolated beyond the testing situation to a particular TLU domain (Messick,
1993).
Construct
validity is clearly one of the most important qualities a test can possess. It
tells us whether we are measuring what we intended to measure. Nonetheless, it
tells us nothing about how these assessment tasks resemble
those that the learners might encounter in some non-testing situation or about
what impact, if any, these assessments are having on the test-takers.
The quality of authenticity
A
third quality of test usefulness is authenticity, a notion much discussed in
language testing since the late 1970s, when communicative approaches to
language teaching were first taking root. Building on these discussions,
Bachman and Palmer (1996) refer to ‘authenticity’ as the degree of
correspondence between the test-task characteristics and the TLU task
characteristics.
If the
purpose of the test is to measure knowledge of grammatical forms so that we can
check on the students’ understanding of these forms, and the TLU domain to
which we wish to generalize is instructional, then selected-response tasks of
grammatical form should not be perceived as lacking in authenticity.
In
sum, test authenticity resides in the relationship between the characteristics
of the TLU domain and characteristics of the test tasks, and although a test
task may be highly authentic, this does not necessarily mean it will engage the
test-taker’s grammatical ability.
The quality of interactiveness
A
fourth quality of test usefulness outlined by Bachman and Palmer (1996) is
interactiveness. This quality refers to the degree to which the aspects of the
test-taker’s language ability we want to measure (e.g., grammatical knowledge,
language knowledge) are engaged by the test-task characteristics (e.g., the input, the expected response, and the relationship between the input and response) based on the test
constructs.
A task that engages these abilities is likely to be more interactive than one that is unsuccessful in engaging aspects of the test-taker's language ability to the same degree. The
engagement of these construct-relevant characteristics with task
characteristics is the essence of actual language use.
A
task may be interactive because it engages the examinee’s topical knowledge and
positive affective schemata; however, if the purpose of the test is to measure
grammatical ability and the task does not engage the ability of interest, the engagement is construct-irrelevant.
The quality of impact
Bachman
and Palmer (1996) refer to the degree to which testing and test score decisions
influence all aspects of society and the individuals within that society as
test impact. Therefore, impact refers to the link between the inferences we
make from scores and the decisions we make based on these interpretations. In
terms of impact, most educators would agree that tests should promote positive
test-taker experiences leading to positive attitudes (e.g., a feeling of
accomplishment) and actions (e.g., studying hard).
A
special case of test impact is washback, which is the degree to which testing
has an influence on learning and instruction. Washback can be observed in
grammar assessment through the actions and attitudes that test-takers display
as a result of their perceptions of the test and its influence over them.
The quality of practicality
Test
practicality is not a quality of a test itself, but is a function of the extent
to which we are able to balance the costs associated with designing,
developing, administering, and scoring a test in light of the available
resources (Bachman, personal communication, 2002).
In
sum, the characteristics of test usefulness, proposed by Bachman and Palmer
(1996), are critical qualities to keep in mind in the development of a grammar
test. While each testing situation may not emphasize the same characteristics
to the same degree, it is important to consider these qualities and to
determine an appropriate balance.
Overview of grammar-test
construction
Because every testing situation is different, there is no one 'right' way to develop a test; nor are there any recipes for 'good' tests that could generalize to all situations. Test
development is often presented as a linear process consisting of a number of
stages and steps. In reality, the process is anything but linear.
Bachman
and Palmer (1996) organize test development into three stages: design,
operationalization and administration. I will discuss each of these stages in
the process of describing grammar-test development. The design stage results in a design statement which, according to Bachman and Palmer (1996, p. 88), should contain the following components.
Purpose
Test development begins with what Davidson and Lynch
(2002) call a mandate. The test mandate grows out of a perceived need for a
test by one or more stakeholders. This embodies the test purpose(s), which, in
the case of grammar assessment, makes explicit the inferences we wish to make
about grammatical knowledge or the ability to use this knowledge and the uses
we intend to make of these inferences.
TLU domains and tasks
The TLU domain is identified (e.g., real-life and/or
language-instructional) and the TLU task types are selected. To identify
language-use tasks and the type of language needed to perform these tasks, a
needs analysis must be performed. This involves the collection and analysis of
information related to the students’ target-language needs.
In more and more language teaching situations,
however, the focus is on communicative language teaching. Instruction in this
approach is designed to correspond to real-life communication outside the
classroom; therefore the intended TLU domain of communicative language tests is
likely to be real-life.
Characteristics of test-takers
The design statement contains a detailed description
of the characteristics of the test-takers, so that the population of
test-takers for whom the test is intended and to whom the test scores might
generalize can be made explicit.
Construct(s) to be measured
The design statement also provides a theoretical
definition of the construct(s) to be measured in the test. Construct definition
can be based on a set of instructional objectives in a syllabus, a set of standards,
a theoretical definition of the construct or some combination of them all. In
grammar tests, construct definition based on a syllabus (or a textbook) is
useful when we want to determine to what degree students have mastered the
grammar points that have been taught during a certain period.
In addition to defining grammatical knowledge, the
test designer must specify the role of topical knowledge in the construct
definition of grammar tests. Bachman and Palmer (1996) provide three options
for defining topical knowledge. The first is to exclude topical knowledge from
the test construct(s). This is appropriate in situations where specific topics
and themes are not a consideration in instruction, and where test-takers are
not expected to have any special background knowledge to complete the task.
The second option is to include topical knowledge in
the construct. This is appropriate in situations where specific topics or
themes are an integral part of the curriculum and where topics or themes
contextualize language, provide a social–cognitive context for the tasks, and
serve to raise the students' interest level.
The third and most interesting option is to define
topical knowledge separately from the language construct(s). This is
appropriate in situations where the development of topical knowledge is as
important as, if not more important than, the development of language knowledge
in the curriculum. Finally, the test developer needs to decide if strategic
competence needs to be specified in the construct definition of grammar tests.
Plan for evaluating usefulness
The test design statement also provides a
description of a plan for assessing the qualities of test usefulness. From the
beginning of grammar-test development, it is important to consider all six
qualities of test usefulness and to determine minimum acceptable levels for
each quality. Bachman and Palmer (1996) suggest that a list of questions be
provided to guide test developers to evaluate test usefulness throughout the
process so that feedback can be provided.
Plan for managing resources
Finally,
the test design makes explicit the human, material and time resources needed
and available to develop the test. In cases of limited resources, priorities
should be made in light of the qualities of test usefulness.
Stage 2: Operationalization
The
operationalization stage of grammar-test development describes how an entire
test involving several grammar tasks is assembled, and how the individual tasks
are specified, written and scored. The outcome of the operationalization phase
is both a blueprint for the entire test including scoring materials and a draft
version of the actual test.
According
to Bachman and Palmer (1996), the blueprint
contains two parts: a description of the overall structure of the test and
a set of test-task specifications for each task. The blueprint serves as the
basis for item writing and scoring.
The
first part of the blueprint provides a description of the test structure. This
involves an overview of the entire test. Minimally, the test structure describes the number of test parts or tasks used to measure grammatical ability, the salience and sequence of these parts, their relative importance, and the number of tasks per part.
The
test-task specifications consist of a detailed list of task characteristics,
which form the basis for writing the test tasks. Test-task specifications are
an important part of the operationalization phase because they provide a means
of creating parallel forms of the same test – that is, alternate test forms
containing the same task types and approximately the same test content and
measurement characteristics.
According
to Bachman and Palmer (1996), test-task specifications provide, for each task,
a description of the following: the purpose, the construct definition, the
characteristics of the setting, the time allotment, the instructions, the
characteristics of the input and expected response, a description of the
relationship between the input and response, and finally the scoring method.
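As one way of picturing such a specification (a hypothetical sketch; the field names and sample entries below are invented rather than Bachman and Palmer's own format), the components listed above can be gathered into a single record that item writers complete for each task:

# Hypothetical sketch of a test-task specification record based on the
# components listed by Bachman and Palmer (1996); field names and sample
# entries are invented for illustration.
from dataclasses import dataclass

@dataclass
class TaskSpecification:
    purpose: str
    construct_definition: str
    setting: str
    time_allotment_minutes: int
    instructions: str
    input_characteristics: str
    expected_response: str
    input_response_relationship: str
    scoring_method: str

spec = TaskSpecification(
    purpose="Check mastery of past passive forms taught in Unit 3",
    construct_definition="Knowledge of morphosyntactic form at the sentence level",
    setting="Classroom, paper-and-pencil",
    time_allotment_minutes=10,
    instructions="Choose the one option that best completes each sentence.",
    input_characteristics="Single-sentence items with one gap",
    expected_response="Selected response (four options)",
    input_response_relationship="Direct and non-reciprocal",
    scoring_method="Dichotomous, scored against an answer key",
)
print(spec.purpose)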
I
will describe these procedures in detail so that they can be specified properly
in the blueprint.
Specifying the scoring method
Scoring
method provides an explicit description of the criteria for correctness and the
exact procedures for scoring the response. Generally speaking, tasks can be
scored objectively, where the scorer does not need to make any expert judgments in determining whether the answers are correct, or subjectively, where expert judgment is required to evaluate the performance.
Scoring selected-response tasks
The first task in the example chemistry lab test
discussed above is a selected-response task (i.e., multiple-choice) of
grammatical form. Scorers are provided with an answer key to determine if the
answers are right or wrong – no further adjudication is necessary. In this
task, each item is designed to measure a single area of explicit grammatical
knowledge.
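A minimal sketch of what this kind of objective, key-based scoring amounts to is given below; the answer key and the candidate's responses are invented for illustration.

# Minimal sketch of dichotomous (right/wrong) scoring against an answer key.
# The key and the candidate's responses are invented.
answer_key = {1: "b", 2: "d", 3: "a", 4: "c"}
responses  = {1: "b", 2: "a", 3: "a", 4: "c"}

item_scores = {item: int(responses.get(item) == key)
               for item, key in answer_key.items()}
total = sum(item_scores.values())
print(item_scores)        # {1: 1, 2: 0, 3: 1, 4: 1}
print(f"{total}/{len(answer_key)} correct")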
Scoring limited-production tasks
The second task in the lab report test is a
limited-production task. Limited-production tasks are designed to elicit a
range of possible answers and can be used to measure one or more areas of
grammatical knowledge.
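Because more than one answer may be acceptable, the key for a limited-production item is usually a set of responses rather than a single string. A minimal sketch, with invented items and keys:

# Sketch of scoring a gap-filling (limited-production) item against a set of
# acceptable answers; the items and keys are invented.
acceptable = {
    "item_1": {"has been heated", "was heated"},
    "item_2": {"is dissolved", "dissolves"},
}
responses = {"item_1": "was heated", "item_2": "dissolve"}

def score_item(item, answer):
    # Light normalization before matching; real keys may need more than this.
    return int(answer.strip().lower() in acceptable[item])

scores = {item: score_item(item, ans) for item, ans in responses.items()}
print(scores)  # {'item_1': 1, 'item_2': 0}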
Scoring extended-production tasks
The third task in the chemistry lab test asks
test-takers to write an abbreviated version of a lab report based on topical
cues in the input. This task is designed to elicit an array of grammatical
features characteristic of chemical lab reports (e.g., past active and passive
sentences).
Using scoring rubrics
Once the scoring rubric has been constructed, the
scoring process can be determined. In an attempt to avoid unreliability due to
the scoring process, certain basic procedures should be followed. First of all,
raters should be normed. To do this, raters are given a norming packet
containing the scoring rubric and samples of tasks typifying the different
levels. These benchmark samples serve to familiarize raters with the rubric and
how performance might be leveled.
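One simple way to check whether a rater has internalized the rubric during norming (a sketch using invented benchmark and trainee ratings) is to compare the rater's scores on the benchmark samples with the agreed levels and report exact and adjacent agreement:

# Sketch of a norming check: compare a trainee rater's scores on benchmark
# samples with the agreed benchmark levels. All ratings are invented.
benchmark = [5, 4, 3, 2, 1, 3, 4]   # agreed levels for seven samples
trainee   = [5, 3, 3, 2, 2, 3, 5]   # the trainee rater's scores

pairs = list(zip(benchmark, trainee))
exact    = sum(b == t for b, t in pairs) / len(pairs)
adjacent = sum(abs(b - t) <= 1 for b, t in pairs) / len(pairs)

print(f"Exact agreement:  {exact:.0%}")     # e.g. 57%
print(f"Within one level: {adjacent:.0%}")  # e.g. 100%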
Grading
The blueprint should describe the relative
importance of the test sections. This can be used to determine a final score on
the test. In the chemistry lab test blueprint, the selected-response and the
limited-production tasks together account for fifty percent of the points (20
points), while the extended-production task accounts for the other fifty percent
(20 points).
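Arriving at the final score is then simple arithmetic. The sketch below assumes the 20 + 20 point split just described; the raw scores are invented.

# Sketch of combining section scores into a final grade, assuming the
# 20 + 20 point weighting described for the chemistry lab test.
sections = {
    "selected_and_limited_production": {"raw": 16, "max": 20},
    "extended_production":             {"raw": 14, "max": 20},
}
total = sum(s["raw"] for s in sections.values())
maximum = sum(s["max"] for s in sections.values())
print(f"Final score: {total}/{maximum} ({total / maximum:.0%})")  # 30/40 (75%)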
Stage 3: Test administration and
analysis
The final stage in the process of developing grammar
tests involves the administration of the test to individual students or small
groups, and then to a large group of examinees on a trial basis. Piloting the
entire test or individual test tasks allows for the collection of response data
and other sorts of information to support and/or improve the usefulness of the
test. This information can then be analyzed and the test revised before being
put to operational use.
The actual administration of the test should
transpire in a setting that is physically comfortable and free from
distraction, and a supportive testing environment should be established.
Instructions should be clear and the administration orderly. Test
administration provides an excellent opportunity for collecting information
about the test-takers’ initial reaction to the test tasks and information about
certain test procedures such as the allotment of time.
Once the pre-test responses have been collected and
scored, a number of statistical analyses should be performed in order to
examine the psychometric properties of the test.
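Two of the most common of these analyses are item difficulty (the proportion of examinees answering an item correctly) and item discrimination (how well an item separates stronger from weaker examinees). A minimal sketch, computed on invented pilot data:

# Sketch of classical item analysis on invented pilot data: item difficulty
# (proportion correct) and a simple high-low discrimination index.
responses = [  # rows = examinees, columns = items (1 = correct)
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
n_items = len(responses[0])
difficulty = [sum(r[i] for r in responses) / len(responses) for i in range(n_items)]

# Discrimination: proportion correct in the top half minus the bottom half,
# with examinees ranked by their total scores.
ranked = sorted(responses, key=sum, reverse=True)
half = len(ranked) // 2
top, bottom = ranked[:half], ranked[-half:]
discrimination = [sum(r[i] for r in top) / half - sum(r[i] for r in bottom) / half
                  for i in range(n_items)]

for i in range(n_items):
    print(f"Item {i + 1}: difficulty {difficulty[i]:.2f}, "
          f"discrimination {discrimination[i]:+.2f}")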
Test
analyses provide different types of information to evaluate the characteristics
of test usefulness. This information serves as a basis for revising the test
before it goes operational, at which time further data are collected and
analyses performed in an iterative and recursive manner. In the end, we will
have a bank of test tasks that will be archived and from which we can draw upon
in further test administrations.
Chapter 7
Illustrative tests of
grammatical ability
In
this chapter I will examine several examples of professionally developed
language tests that measure grammatical ability. Some of these tests contain
separate sections that are exclusively devoted to the assessment of grammatical
ability, while others measure grammatical knowledge along with other components
of language ability in the context of language use – that is while test-takers
are listening, speaking, reading or writing.
I
will begin the analysis of each test by describing the context of the test, its
purpose and its intended use(s). After that, I will turn to how the construct
of grammatical ability was defined and operationalized. I will then describe
the grammar task(s) taking into account the areas of grammatical knowledge
being measured and will summarize the critical features of the test tasks. I
will highlight the priorities and compromises made in the process of balancing the
qualities of test usefulness.
The First Certificate in English
Language Test (FCE)
Purpose
The First Certificate in English (FCE) exam was
first developed by the University of Cambridge Local Examinations Syndicate
(UCLES, now Cambridge ESOL) in 1939 and has been revised periodically ever
since. The purpose of the FCE (Cambridge ESOL, 2001a) is to assess the general
English language proficiency of learners as measured by their abilities in
reading, writing, speaking, listening, and knowledge of the lexical and
grammatical systems of English (Cambridge ESOL, 1995, p. 4).
In this review, I will focus on how grammatical
ability is measured in the Use of English or grammar section of the FCE. I will
then examine how grammatical ability is measured in the writing and speaking
sections.
Construct definition and
operationalization
According
to the FCE Handbook (Cambridge ESOL, 2001a), the Use of English paper is
designed to measure the test-takers’ ability to ‘demonstrate their knowledge
and control of the language system by completing a number of tasks, some of
which are based on specially written texts’ (p. 7).
Measuring grammatical ability
through language use
In addition to measuring grammatical ability in the
Use of English paper of the test, the FCE measures grammatical ability in the
writing and speaking sections. Language use in the writing paper is measured in
the contexts of writing letters, articles, reports and compositions (Cambridge
ESOL, 2001a, p. 7). Scores are derived from a six-point (0–5), holistic rating
scale based on ‘the control, organization and cohesion, range of structures and
vocabulary, register and format, and [the effect made on] the target reader
indicated in the task’ (Cambridge ESOL, 2001a, p. 19).
The FCE and the qualities of test
usefulness
In terms of the qualities of test usefulness, the FCE clearly
gives priority to construct validity, especially as this relates to the
measurement of grammatical ability as one component of English language
proficiency. The published FCE literature provides clear, albeit very general,
information on the aspects of grammatical knowledge being measured in the Use
of English paper. Also, the FCE literature makes explicit the importance of
grammatical ability when describing the criteria for rating the writing and speaking
papers of the test. This is made even more salient by the inclusion of rater
comments on the quality of writing samples, where explicit rater judgments
about grammatical performance are expressed.
Given the purpose and uses of the FCE, the establishment
of a discrete, empirical relationship between the target language use tasks and
the test tasks in the Use of English paper of the test is difficult to
determine from the published literature.
The Comprehensive English
Language Test (CELT)
Purpose
The Comprehensive English Language Test
(CELT) (Harris and Palmer, 1970a, 1986) was designed to measure the English
language ability of nonnative speakers of English. The authors claim in the
technical manual (Harris and Palmer, 1970b) that this test is most appropriate
for students at the intermediate or advanced levels of proficiency. English
language proficiency is measured by means of a structure subtest, a vocabulary
subtest and a listening subtest.
Construct definition and
operationalization
According
to the CELT Technical Manual (Harris and Palmer, 1970b), the structure subtest
is intended to measure the students’ ‘ability to manipulate the grammatical
structures occurring in spoken English'.
Measuring
grammatical ability through language use
In addition to measuring grammatical knowledge in the structure subtest, grammatical knowledge is also measured in the listening subtest. According to the Technical Manual, the listening subtest is designed to measure the test-takers' 'ability to understand short statements, questions, and dialogues
as spoken by a native speaker’ (p. 1). The listening section has three
tasks. In the first task, candidates hear a wh- or a yes/no question (When are
you going to New York?) and are asked to select one correct response to
this question (e.g., Next Friday) from four options. To get this item right,
examinees need to understand the lexical item when and associate it with a time
expression in the response. This item is obviously designed to measure the
student’s ability to understand lexical meaning.
A second
listening task presents test-takers with a sentence involving conditions,
comparisons, and time and number expressions (Harris and Palmer, 1970b, p. 2).
The candidates then choose from among four options to select ‘the one accurate
paraphrase’ for the sentence they heard. For example: Student hears: ‘George
has just returned from vacation.’
(a) George is spending his vacation at home.
(b) George has just finished his vacation.
(c) George is just about to begin his vacation.
(d) George has decided not to take a vacation.
This item type seems to be designed to measure the examinees' ability to understand grammatical
meaning, or the literal and intended meaning of the utterance in the input. Given the slightly indirect
association that examinees need to make between ‘finishing a vacation’ (i.e.,
travel may or may not be involved in the response) and ‘returning from
vacation’ (i.e., travel is presumed in the input), it could be argued that this
item is measuring knowledge of grammatical meaning, where the relationship
between form and meaning is relatively, but not entirely, direct.
The CELT and the qualities of test usefulness
In terms
of the purposes and intended uses of the CELT, the authors explicitly stated,
‘the CELT is designed to provide a series of reliable
and easy-to-administer tests for measuring English language ability of nonnative speakers' (Harris and Palmer, 1970b, p. 1). As a result, concerns for
high reliability and ease of administration led the authors to make choices
privileging reliability and practicality over other qualities of test
usefulness. To maximize consistency of measurement, the authors used only
selected-response task types throughout the test, allowing for minimal
fluctuations in the scores due to characteristics of the test method. This
allowed them to adopt 'easy-to-administer' and 'easy-to-score' procedures for maximum practicality and reliability. Reliability was also enhanced by pretesting items with the goal of improving their psychometric
characteristics.
In my opinion, reliability might have been emphasized at the expense of other important test qualities, such as construct validity, authenticity, interactiveness and impact. For example, construct validity
was severely compromised by the mismatch among the purpose of the test, the way
the construct was defined and the types of tasks used to operationalize the
constructs. In short, scores from discrete-point grammar tasks were used to
make inferences about speaking ability rather than make interpretations about
the test-takers’ explicit grammatical knowledge.
Finally,
authenticity in the CELT was low due to the exclusive use of multiple-choice
tasks and the lack of correspondence between these tasks and those one might
encounter in the target language use domain. Interactiveness was also low due
to the test’s inability to fully involve the test-takers’ grammatical ability
in performing the tests. The impact of the CELT on stakeholders is not
documented in the published manual.
In all
fairness, the CELT was a product of its time, when emphasis was on
discrete-point testing and reliability, and when language testers were not yet
discussing qualities of test usefulness in terms of authenticity,
interactiveness and impact.
The Community English Program (CEP) Placement
Test
Purpose
The
Community English Program (CEP) Placement Test was first developed by students
and faculty in the TESOL and Applied Linguistics Programs at Teachers College,
Columbia University, in 2002, and is revised regularly. Unlike the previous
tests reviewed, the CEP Placement Test is a theme-based assessment designed to
measure the communicative language ability of learners entering the Community
English Program, a low-cost, adult ESL program servicing Columbia University
staff and people in the neighboring community. The CEP Placement Test consists
of five sections: listening, grammar, reading, writing and speaking. The first
four sections take one hour and 35 minutes to complete; the speaking test
involves a ten-minute interview. Inferences from the test scores are used to
place students in the program course that best matches their level of
communicative English language ability.
Construct definition and operationalization
Given
that the CEP is a theme-based ESL program, where language instruction is
contextualized within a number of different themes throughout the different
levels, the CEP Placement Test is also theme-based. The theme for the CEP
Placement Test under review is ‘Cooperation and Competition’. This is not one
of the themes students encounter in the program. In this test, all five test
sections assess different aspects of language ability while exposing examinees
to different aspects of the theme. To illustrate, the reading subtest presents
students with a passage on ants that explains how ants both cooperate and
compete; the listening subtest presents a passage on how students cooperate and
compete in US schools; and the grammar subtest presents a gapped passage that
revolves around competition in advertisements.
More
specifically, the grammar section of the CEP Placement Test is intended to
measure the students’ grammatical knowledge in terms of a wide range of
grammatical forms and meanings at both the sentence and the discourse levels.
Items on the test are designed to measure the students’ knowledge of lexical,
morphosyntactic and cohesive forms and meanings.
Measuring grammatical ability through language
use
In
addition to measuring grammatical ability in the grammar section, grammatical
ability is also measured in the writing and speaking sections of the test. The
writing section consists of one 30-minute essay to be written on the theme of
‘cooperation and competition’. Scores are derived from a four-point analytic
scoring rubric in which overall task fulfillment, content, organization,
vocabulary and language control are scored. The rubric constitutes an adapted
version of a rubric devised by Jacobs et al. (1981). Language use (i.e.,
grammatical ability) is implicitly defined in terms of the complexity of
grammatical forms, the number of errors and the range of vocabulary. For
example, the highest-level descriptors (4) describe performance as 'effective complex constructions; few errors of grammar, and sophisticated range of vocabulary'.
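As a rough sketch (not the actual CEP rubric), an analytic scale of this kind can be thought of as a set of categories, each rated on the same band, which together yield a profile rather than a single holistic judgment; the ratings below are invented.

# Rough sketch of an analytic writing profile in the spirit of a Jacobs-style
# rubric; the categories follow the description above, the ratings are invented.
categories = ["task fulfillment", "content", "organization",
              "vocabulary", "language control"]
ratings = {"task fulfillment": 3, "content": 3, "organization": 2,
           "vocabulary": 3, "language control": 2}

profile = ", ".join(f"{c}: {ratings[c]}/4" for c in categories)
average = sum(ratings.values()) / len(ratings)
print(profile)
print(f"Average band: {average:.1f}")  # a profile, plus an overall summary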
The CEP Placement Test and the qualities of
test usefulness
In terms
of the qualities of test usefulness, the developers of the grammar section of
the CEP Placement Test prioritize construct validity, reliability and
practicality. With regard to construct validity, the grammar section of this
test was designed to measure both grammatical form and meaning on the
sentential and discourse levels, sampling from a wide range of grammatical
features. In this test, grammatical ability is measured by means of four tasks
in the grammar section, one task in the writing section, and by several tasks
in the speaking section. In short, the CEP Placement Test measures both
explicit and implicit knowledge of grammar. Placement decisions based on
interpretations of the CEP Placement Test scores seem to be appropriate as only
a handful of misplacements are reported each term.
The
reliability of the grammar-test scores was also considered a priority from the
design stage of test development as seen in the procedures for item
development, test piloting and scoring. In an
effort to promote consistency (and quick return of the results), the CEP Placement Test developers
decided to use only multiple-choice tasks in the grammar section.
This decision was based on the results of the pilot tests, where the use
of limited-production grammar tasks showed inconsistent
scoring results and put a strain on time resources. Although MC tasks were used
to measure grammatical knowledge, the theme of the input was designed to be
aligned with the test theme, and the type of input in each task varied
(dialogue, advertisement, passage). Once the test design was established, the
grammar tasks were developed, reviewed and piloted a number of times before the
test became operational. Scoring is objective and machine-scored.
Test
authenticity was another major concern for the CEP Placement Test development
team. Therefore, in the test design phase of test development, it was decided
that test forms should contain one coherent theme across all test tasks in
order to establish a close correspondence between the TLU tasks (i.e., ones
that might be encountered in a theme-based curriculum) and the test tasks on
the CEP Placement Test. It was also decided that grammatical ability would be
measured by means of selected-response tasks in the grammar section and
extended-production tasks in the writing and speaking sections, with both task
types supporting the same overarching test theme.
It must
be noted that the use of one overarching theme in a placement test can be
controversial because of the potential for content bias. In other words, if one
group of students (e.g., the science students) is familiar with the theme, they
may be unfairly advantaged. In an attempt to minimize construct-irrelevant
variance, several measures were taken. First, one goal of the test was to
actually teach test-takers something about the theme in the process of taking
the test so that they could develop an opinion about the theme by the time they
got to the writing and speaking sections. To this end, terms and concepts
relating to the theme were explained in the listening and grammar sections and
reinforced throughout the test in an attempt to create, to the extent possible,
a direct relationship between the input and expected responses. Second, the theme
was approached from many angles: cooperation and competition in family relationships, in schools, in the animal kingdom and so forth. Third, each task was reviewed for its newsworthiness.
In other words, if the test developers felt that the information in the task
was ‘common knowledge’ for the test population, the text was changed. Finally,
in piloting the test, test-takers were asked their opinions about the use of
the theme. Results from this survey did not lead the testing committee to
suspect bias due to the use of a common theme.
Given
that the scores from the CEP Placement Test are used to make placement
decisions, the test is likely to have a considerable impact on the examinees.
Unfortunately, at this point, students are provided with no feedback other than
their course placement. The results of this test have also had a strong impact
on the CEP, since the program now has a much better system for grouping
students according to ability levels than it previously had. Unfortunately, no
research on impact is available.
Chapter 8
Learning-oriented assessments of grammatical
ability
Introduction
The
language tests reviewed in the previous chapter involved the grammar sections
from large-scale tests designed to measure global language proficiency, typically
for academic purposes. Like other large-scale and often high-stakes tests, they
were designed to make institutional decisions related to placement into or exit
from a language program, screening for language proficiency or reclassification
of school status based on whether a student had achieved the language skills
necessary to benefit from instruction in the target language. These tests
provide assessments for several components of language ability including, among
others, aspects of grammatical knowledge. In terms of the grammar sections of
the tests reviewed, a wide range of grammar points were assessed and, except
perhaps for the CEP Placement Test, the selection of test content was
relatively removed from the local constraints of instruction in specific
contexts. These large-scale tests were designed as one-shot, timed assessments
for examinees who bring to the testing situation a variety of experiences and
proficiency levels. The tests were different in the ways in which the qualities
of usefulness were prioritized, and the compromises that ensued from these
decisions.
Although
large-scale, standardized tests have an important role to play in some school
decisions and can have a positive impact on learning and instruction, the
primary mandate of large-scale exams is different from that of classroom
assessment. In the first place, large-scale language assessments are not
necessarily designed to promote learning and influence teaching in local
contexts. They rarely provide detailed feedback or diagnostic information to
students and teachers, and they are primarily oriented toward the measurement
of learner abilities at one point in time rather than continuously over a
stretch of time. Finally, large-scale language assessments do not benefit from
the knowledge that teachers bring to the assessment context regarding their
students’ instructional histories. As a result, score-based information
provided by large-scale, standardized tests is often of little practical use to
classroom teachers for pursuing a program designed to enhance learning and
personalize instruction.
In the
context of learning grammar, learning-oriented assessment of grammar reflects a
growing belief among educational assessment experts (e.g., Stiggins, 1987;
Gipps, 1994; Pellegrino, Baxter and Glaser, 1999; Rea-Dickins and Gardner,
2000) that if assessment, curriculum and instruction were more integrally
connected, student learning would improve (National Research Council, 2001b).
This approach attempts to provide teachers and learners with summative and/or
formative information on the test-takers’ grammatical ability. Summative
information from assessment allows teachers to assign grades based on specific
assessment criteria, report student progress at a single moment or over time,
and reward and motivate student learning. Formative information from assessment
provides teachers and learners with concrete information on what
aspects of the grammar students have and have not mastered
and involves them in the regulation and assessment of their own learning, so
that further learning can take place independently or in collaboration with
teachers and other students.
What is
learning-oriented assessment of grammar?
In
reaction to conventional testing practices typified by large-scale, discrete-point,
multiple-choice tests of language ability, several educators (e.g., Herman,
Aschbacher and Winters, 1992; Short, 1993; Shohamy, 1995; Shepard, 2000) have
advocated reforms so that assessment practices might better capture educational
outcomes and might be more consistent with classroom goals, curricula and
instruction. The terms alternative assessment, authentic assessment and performance assessment have all been associated with calls for reform in both large-scale and classroom assessment contexts. While alternative, authentic and performance assessment are often viewed as essentially the same, they
emphasize slightly different aspects of a move away from conventional,
discrete-point, standardized assessment.
Alternative
assessment emphasizes an alternative to and rejection of selected-response, timed and one-shot approaches to assessment, whether they occur in large-scale or classroom assessment contexts. Alternative assessment encourages assessments in which students are asked to perform,
create, produce or do meaningful tasks that both tap into higher-level thinking
(e.g., problem-solving) and have real-world implications (Herman et al., 1992).
Similar
to alternative assessment, authentic assessment stresses measurement practices
which engage students’ knowledge and skills in ways similar to those one can
observe while performing some real-life or ‘authentic’ task (O’Malley and
Valdez-Pierce, 1996).
Performance
assessment refers to the evaluation of outcomes relevant to a domain of
interest (e.g., grammatical ability), which are derived from the observation of
students performing complex tasks that invoke real-world applications (Norris et al., 1998). As with most performance data, assessments are scored by human judges (Stiggins, 1987; Herman et al., 1992; Brown, 1998) according to a scoring rubric that describes what test-takers need to do in order to demonstrate
knowledge or ability at a given performance level. Bachman (2002) characterized
language performance assessment as typically: (1) involving more complex
constructs than those measured in selected-response tasks; (2) utilizing more
complex and authentic tasks; and (3) fostering greater interactions between the
characteristics of the test-takers and the characteristics of the assessment
tasks than in other types of assessments. Performance assessment encourages
self-assessment by making explicit the performance criteria in a scoring
rubric. In this way, students can then use the criteria to evaluate their
performance and contribute proactively to their own learning.
While
these three approaches better reflect the types of academic competencies that
most language educators value and wish to promote, a learning-oriented approach
to assessment maintains a clear and unambiguous focus on assessment for the
purpose of fostering further learning relevant to some domain of interest
(e.g., grammatical ability). Learning is defined here as the accumulation
of knowledge and the ability to use this knowledge for some purpose (i.e.,
skill). To highlight the learning mandate in the assessment of grammar in
classroom contexts, I will use the term learning-oriented assessment of
grammar.
In terms
of method, learning-oriented assessment of grammar reflects the belief that
assessments must remain open to all task types if the mandate is to provide
information about student performance on the one hand, and information about
the processing of grammatical input and the production of grammatical output on
the other. Therefore, unlike with other approaches, operationalization involves
the use of selected-response, limited-production and complex,
extended-production tasks that may or may not invoke real-life applications or
interaction. Just as in large-scale assessments, though, the specification of
test tasks varies according to the specific purpose of the assessment and the
claims we would like to make about what learners know and can do, and in fact,
in some instances, a multiple-choice task may be the most appropriate task type
available.
Implementing learning-oriented assessment of
grammar
Considerations
from grammar-testing theory
The
development procedures for constructing large-scale assessments of grammatical
ability discussed in Chapter 6 are similar to those needed to develop
learning-oriented assessments of grammar for classroom purposes with the
exception that the decisions made from classroom assessments will be somewhat
different due to the learning-oriented mandate of classroom assessment. Also,
given the usual low-stakes nature of the decisions in classroom assessment, the
amount of resources that needs to be expended is generally less than that
required for large-scale assessment.
Implications for test design
In designing classroom-based, learning-oriented assessments, we
need to provide a much more explicit depiction of the assessment mandate than
we might do for large-scale assessments. This is because classroom assessment,
especially in school contexts, has many interested stakeholders (e.g. students,
teachers, parents, tutors, principals, school districts), who are likely to be
held accountable for learning and who will use the assessment information to
evaluate instructional outcomes and plan for further instruction.
A second
consideration in which the design stage of classroom-based, learning-oriented
assessment may differ from that of large-scale assessment is construct
definition. Learning-oriented assessment aims to measure simple and/or complex
constructs depending on both the claims that the assessment is designed to make
and the feedback that can result from an observation of performances. Applied
to grammar learning, a ‘simple’ construct might involve the assessment of
regular and irregular past tense verb forms presented in a passage with gaps
about the disappearance of the dinosaurs.
A third
consideration for classroom-based, learning-oriented assessment is the need to
measure the students’ explicit as well as their implicit knowledge of grammar.
Selected-response and limited-production tasks, or tasks that include planning
time, will elicit the students’ explicit knowledge of grammar. In addition, it
is important to assess the students’ implicit or internalized knowledge of the
grammar. To do this, students should be asked to demonstrate their capacity to
use grammatical knowledge to perform complex, real-time tasks that invoke the
language performances one would expect to observe in instructional or
real-life situations.
Implications for operationalization
The
operationalization stage of classroom-based, learning-oriented assessment is
also similar to that of large-scale assessments. That is, the outcome should be
a blueprint for the assessment, as described in Chapter 6. The learning
mandate, however, will obviously affect the specification of test tasks so that
characteristics such as the setting, the rubrics or the expected response can
be better aligned with instructional goals. For example, in classroom-based
assessment, we may wish to collect information about grammar ability during the
course of instruction, and we may decide to evaluate performance by means of
teacher observation reports, or we may wish to assess grammatical ability by means of informal oral interviews conducted over several days.
Learning-oriented
assessment of grammar may be achieved by means of a wide array of
data-gathering methods in classroom contexts. These obviously include
conventional quizzes and tests containing selected-response,
limited-production and all sorts of extended-production tasks, as discussed
earlier. These conventional methods provide achievement or diagnostic
information to test-users, and can occur before, during or after instruction,
depending on the assessment goals. They are often viewed as ‘separate’ from
instruction in terms of their administration. These assessments are what
most teachers typically call to mind when they think of classroom tests.
In
addition to using stand-alone tests, learning-oriented assessment promotes the
collection of data on students’ grammatical ability as an integral part of
instruction. While teachers have always evaluated student performance
incidentally in class with no other apparent purpose than to make instructional
choices, classroom assessment activities can be made more systematic by means
of learning-oriented assessment.
In order to situate how a grammar learning mandate in classroom contexts can impact operationalization
decisions, consider the following situation. Imagine you are teaching an
intermediate foreign-language course in a theme-based program. The overarching
unit theme is crime investigation or the ‘whodunit’ (i.e., ‘who done it’, short
for a detective story or mystery à la Agatha Christie or Detective Trudeau).
This theme is used to teach and assess modal auxiliary forms for the
purpose of expressing degrees of certainty (e.g., It may/might/could/can’t/must/has
to be the butler who stole the jewelry). Students are taught to speculate about
possible crime suspects by providing motives and drawing logical conclusions.
Operationalization
in classroom-based assessment, however, need not be limited to this
conventional approach to assessment. Teachers can obviously specify the
characteristics of the test setting, the characteristics of the test rubrics
and other task characteristics for that matter, in many other ways depending on
the learning mandate and the goals of assessment. Let
us examine how the setting and test rubrics, for example, can be modified to
assess grammar for different learning goals.
In
specifying the characteristics of the test rubrics, teachers might decide to
vary the scoring method to accommodate different learning-oriented assessment goals. For example, after giving the task in Figure 8.2, they might choose to score the recorded performance samples themselves,
by means of an analytic rating scale that measures modal usage in terms of
accuracy, range, complexity and meaningfulness. They might also have students
listen to the tapes to score their own and their peers’ performances.
In
classroom assessments designed to promote learning, the scoring process, whether
implemented by teachers, the students themselves or their peers, results in a
written or oral evaluation of candidate responses. This, in turn, provides
learners with summative and/or formative information (i.e., feedback) so that
they can compare or 'notice' the differences between their interlanguage utterances and the target-language utterances. According to Schmidt (1990, 1993), Sharwood Smith (1993) and Skehan (1998), this information makes the learners more ready to accommodate the differences between their interlanguage and the target language, thereby contributing to the ultimate internalization
of the learning point.
In
addition to teacher feedback, the scoring method in learning-oriented assessment can be specified to involve students. From a learning perspective, students need to develop the capacity for self-assessment so that they can learn to 'notice'
for themselves how their language compares with the target-language norms.
Learning to mark their own (or their peers’) work can, in itself, trigger
further reanalysis and restructuring of interlanguage forms, as discussed
in Chapter 2. It can also foster the development of skills needed to regulate
their own learning and it places more responsibility for learning on the
students (Rief, 1990).
Planning for further learning
The
usefulness of learning-oriented, classroom assessment is to a great extent
predicated upon the quality and explicitness of information obtained and its
relevance for further action. Research has shown, however, that the quality of
feedback contributes more to further learning than the actual presence or
absence of it (Bangert-Drowns et al., 1991).
Teachers
have many options for presenting assessment results to students. They could
present students with feedback in the form of a single overall test score, a
score for each test component, scores referenced to a rubric, a narrative
summary of teacher observations or a profile of scores showing development over
time. Feedback can also be presented in a private conference with the
individual student. In an effort to understand the effect of feedback on
further learning, Butler (1987) presented test-takers with feedback from an assessment in one
of three forms: (1) focused written comments that addressed criteria
test-takers were aware of before the assessment; (2) grades derived from
numerical scoring; and (3) grades and comments. Test-takers were then given two
subsequent tasks, and significant gains were observed with those who received
the detailed comments.
Furthermore,
research has also demonstrated that feedback focusing on learning goals has led
to greater learning gains than, for example, feedback emphasizing self-esteem
(Butler, 1988). Therefore, feedback from assessments should provide information
on not only the quality of the work at hand, but also the identification of
student improvement goals. While feedback can be provided by teachers, students
should be involved in identifying areas for improvement and for setting
realistic improvement goals. In fact, considerable research (e.g.,
Darling-Hammond, Ancess and Falk, 1995) has shown the learning benefits of
engaging students in self- and peer assessment.
Considerations from L2 learning theory
Given
that learning-oriented assessment involves the collection and interpretation of
evidence about performance so that judgments can be made about further language
development, learning-oriented assessment of grammar needs to be rooted not
only in a theory of grammar testing or language proficiency, but also in a
theory of L2 learning. What is striking in the literature is that models of
language ability rarely refer to models of language learning, and models of
language learning rarely make reference to models of language ability.
As we
have seen, implementing grammar assessment with a learning mandate has
implications for test construction. Some of these implications have already
been discussed. However, implementing learning-oriented assessment of grammar is not only about task design and operationalization; teachers also need to
consider how assessment relates to and can help promote grammar acquisition, as
described by Van Patten (1996).
SLA processes – briefly revisited
As
discussed in Chapter 2, research in SLA suggests that learning an L2 involves
three simultaneously occurring processes: input processing (Van Patten,
1996), system change (Schmidt, 1990) and output processing (Swain, 1985; Lee
and Van Patten, 2003). Input processing relates to how the learner
understands the meaning of a new grammatical feature or how form–meaning connections are made (Ellis,
1993; Van Patten, 1996). A critical first stage of acquisition is the
conversion of input to ‘intake’. The second set of processes, system change,
refers to how learners accommodate new grammatical forms into their interlanguage and how this change helps restructure their interlanguage so that it is more target-like (McLaughlin, 1987; DeKeyser, 1998).
Assessing for intake
Van Patten and Cadierno (1993b) describe
this critical first stage of acquisition as the process of converting input
into ‘intake’. In language classrooms, considerable time is spent on
determining if students have understood. As most teachers know, however, it is
difficult to discern if their students have mapped meaning onto the form. In
fact, some students can fake their way through an entire lesson without having
really understood the meaning of the target forms. Given that communicative
language classrooms encourage the use of tasks in which learners must use
language meaningfully (Savignon, 1983; Nunan, 1989) and the use of
comprehensible input as an essential component of instruction (Krashen, 1982;
Krashen and Terrell, 1983), I believe that teachers should explicitly assess
for intake as a routine part of instruction.
Assessing
for intake requires that learners understand the target forms, but do not
produce them themselves. This can be achieved by selected-response and limited-production tasks in which learners need to make form–meaning connections. Three examples of
interpretation tasks designed to assess for intake are presented below.
(For additional examples of interpretation tasks, see Ellis, 1997;
Lee and Van Patten, 2003; and Van Patten, 1996, 2003.).
Assessing to push restructuring
Once
input has been converted into intake, the new grammatical feature is ready to
be ‘accommodated’ into the learner’s developing linguistic system, causing a
restructuring of the entire system (Van Patten, 1996). To initiate this
process, teachers provide students with tasks that enable them to use the new
grammatical forms in decreasingly controlled situations so they can incorporate
these forms into their existing system of implicit grammatical knowledge. By
attending to grammatical input and by getting feedback, learners are able to
accommodate the differences between their interlanguage and the target
language. Assessment plays an important role in pushing this restructuring
process forward since it contributes concrete information to learners on the
differences between the grammatical forms they are using and those they should
be using to communicate the intended meanings.
Assessing for output processing
Although
learners may have developed an explicit knowledge of the form and meaning of a
new grammatical point, this does not necessarily mean they can access this
knowledge automatically in spontaneous communication. In order for learners to
produce unplanned, meaningful output in real time (i.e., speaking), they need
to be able to tap into grammatical knowledge that is already an unconscious
part of their developing system of language knowledge (Lee and VanPatten,
2003). Thus, to assess the test takers’ implicit knowledge of grammar (i.e.,
their ability to process output), test-takers need to be presented with tasks
that ask them to produce language in real time, where the focus is more on the
content being communicated or on the completion of the task than on the
application of explicit grammar rules.
Classroom
assessments that are cognitively complex typically involve the processing of
topical information from multiple sources in order to accomplish some task that
requires complex or higher-order thinking skills (Burns, 1986). Based on an
analysis of tasks in school subject classrooms (e.g., social studies), Marzano,
Pickering and McTighe (1993) provide a list of commonly identified reasoning
processes (which I have added to) that are used in cognitively complex tasks.
• Comparing • Error analysis • Experimental inquiry
• Classifying • Constructing support • Invention
• Induction • Analyzing perspectives • Abstracting
• Deduction • Decision making • Diagnosis
• Investigation • Problem solving • Summarizing
Some of
these processes, I might add, are not uncommon in communicative language
teaching or in language classroom assessment.
Illustrative example of learning-oriented
assessment
Let us
now turn to an illustration of a learning-oriented achievement test of
grammatical ability.
Background
The
example is taken from Unit 7 of On Target 1 Achievement Tests (Purpura et al.,
2001). This is a book of achievement tests designed to accompany On Target 1
(Purpura and Pinkley, 1999), a theme-based, integrated-skills program designed
for secondary school or adult learners of English as a second or foreign
language at the lower–mid intermediate level of proficiency. On Target provides
instruction in language (e.g., grammar, vocabulary, pronunciation), language
use (e.g., listening, speaking, reading and writing) and thematic development
(e.g., record-breaking, mysteries of science). The goal of the achievement
tests is ‘to measure the students’ knowledge of grammar, vocabulary,
pronunciation, reading and writing, as taught in each unit' (Purpura et al.,
2001, p. iii).
Making assessment learning-oriented
The On
Target achievement tests were designed with a clear learning mandate. The
content of the tests had to be strictly aligned with the content of the
curriculum. This obviously had several implications for the test design and its
operationalization. From a testing perspective, the primary purpose of the Unit
7 achievement test was to measure the students’ explicit as well as their
implicit knowledge of grammatical form and meaning on both the sentence and
discourse levels. More specifically, the test was intended to measure the
degree to which test-takers had learned the present perfect tense with repeated
actions (How many times have you been . . . ?) and with length of time actions
(How long have you been . . . ?). Test inferences also included the learners’
ability to use this knowledge to discuss life achievements.
While
the TLU domain was limited to the use of the present perfect tense to discuss
life achievements, the constructs and tasks included in the test were both
simple and complex. For example, the first gap-filling grammar task was
intended only to assess the test-takers’ explicit knowledge of morphosyntactic
form and the pronunciation task focused only on their explicit knowledge of
phonological form. The second grammar task was slightly more complex in that it
aimed to measure the test-takers’ ability to use these forms to communicate
literal and intended meanings based on more extensive input.
In terms
of scoring, the specification of participants was left to the teachers’
discretion in case they wished to score the tasks themselves or involve
students in self- or peer assessment activities. Each task contained a
recommended number of points so that learners could receive a score for each
task. To rate performance on the extended-production tasks, teachers were encouraged to use the scoring criteria provided.
The writing task was included in the test to assess the test-takers’ implicit
knowledge of grammar. In this task, test-takers had to maintain a focus on
meaning as they wrote a grammatically accurate, meaningful and well-organized
paragraph about past achievements. In sum, the On Target achievement test
attempted to take into consideration elements from both grammar-testing theory
and L2 learning theory in achieving a learning-oriented assessment mandate.
Chapter
9
Challenges and new directions in assessing
grammatical ability
Introduction
Research
and theory related to the teaching and learning of grammar have made
significant advances over the years. In applied linguistics, our understanding
of language has been vastly broadened with the work of corpus-based and
communication-based approaches to language study, and this research has made
pathways into recent pedagogical grammars.
Also,
our conceptualization of language proficiency has shifted from an emphasis on
linguistic form to one on communicative language ability and communicative language
use, which has, in turn, led to a de-emphasis on grammatical accuracy and a
greater concern for communicative effectiveness. In language teaching, we moved
from a predominant focus on structures and metalinguistic terminology to an
emphasis on comprehensible input, interaction and no explicit grammar
instruction. From there, we adopted a more balanced approach to language
instruction, where meaning and communication are still emphasized, but where
form and meaning-focused instruction have a clear role.
The state of grammar assessment
In the
last fifty years, language testers have dedicated a great deal of time to
discussing the nature of language proficiency and the testing of the four
skills, the qualities of test usefulness (i.e., reliability, authenticity), the
relationships between test-taker or task characteristics and performance, and
numerous statistical procedures for examining data and providing evidence of
test validity. In all of these discussions, very little has been said about the
assessment of grammatical ability, and unsurprisingly, until recently, not much
has changed since the 1960s. In other words, for the past fifty years,
grammatical ability has been defined in many instances as morphosyntactic form
and tested either in a discrete-point, selected-response format (a practice initiated by several large language-testing firms and emulated by classroom teachers) or in a discrete-point, limited-production format, typically by means of the cloze or some other gap-filling task.
In
recent years, the assessment of grammatical ability has taken an interesting
turn in certain situations. Grammatical ability has been assessed in the
context of language use under the rubric of testing speaking or writing. This
has led, in some cases, to examinations in which grammatical knowledge is no
longer included as a separate and explicit component of communicative language
ability in the form of a separate subtest. In other words, only the students’
implicit knowledge of grammar alongside other components of communicative
language ability (e.g., topic, organization, register) is measured. Having
discussed how grammar assessment has evolved over the years, I will discuss in
the next section some ongoing issues and challenges associated with assessing
grammar.
Challenge
1: Defining grammatical ability
One
major challenge revolves around how grammatical ability has been defined both
theoretically and operationally in language testing. As we saw in Chapters 3
and 4, in the 1960s and 1970s language teaching and language testing maintained
a strong syntactocentric view of language rooted largely in linguistic
structuralism. Moreover, models of language ability, such as those proposed by
Lado (1961) and Carroll (1961), had a clear linguistic focus, and assessment
concentrated on measuring language elements defined in terms of morphosyntactic
forms at the sentence level while performing language skills. Grammatical knowledge was determined solely in terms of linguistic accuracy. This approach to testing led to examinations such as the CELT (Harris and Palmer, 1970a) and
the English Proficiency Test battery (Davies, 1964).
With an
expanded definition of grammatical knowledge, however, come several theoretical
challenges. The first is for language educators to make clear distinctions
between the form and meaning components of grammatical knowledge and, if
relevant to the test purpose, to incorporate these distinctions in construct
definition. Making finer distinctions between form and meaning will require
adjustments in how we approach grammar assessment and may require innovation.
While the
current research on learner-oriented corpora has shown great promise, many more
insights on learner errors and interlanguage development could be obtained if
other components of grammatical form (e.g., information management forms and
interactional forms) and grammatical meaning were also tagged at both the
sentence and the discourse levels. For example, in a talk on the use of corpora
for defining learning problems of Korean ESL students at the University of
Illinois, Choi (2003) identified the following errors as passive errors:
1. The
color of her face was changed from a pale white to a bright red.
2. It is
ridiculous the women in developing countries are suffered.
Challenge
2: Scoring grammatical ability
A second
challenge relates to scoring, as the specification of both form and meaning is
likely to influence the ways in which grammar assessments are scored. As we
discussed in Chapter 6, responses with multiple criteria for correctness may
necessitate different scoring procedures. For example, the use of dichotomous
scoring, even with certain selected-response items, might need to give way
to partial-credit scoring, since some wrong answers may reflect partial
development either in form or meaning. As a result, language educators might
need to adapt their scoring procedures to reflect the two dimensions of
grammatical knowledge.
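As a purely illustrative aside (not a procedure from the book), the following Python sketch contrasts dichotomous scoring with partial-credit scoring across the two dimensions of grammatical knowledge; the 0/0.5/1 rubric levels, the equal weighting of form and meaning, and the sample response are hypothetical.

```python
# Illustrative sketch only: the 0 / 0.5 / 1 rubric levels and the equal weighting of
# form and meaning are hypothetical choices, not a procedure prescribed in the book.
from dataclasses import dataclass

@dataclass
class GrammarScore:
    form: float     # 0 = inaccurate, 0.5 = partially accurate, 1 = fully accurate
    meaning: float  # 0 = not conveyed, 0.5 = partially conveyed, 1 = fully conveyed

def dichotomous(score: GrammarScore) -> int:
    """Traditional right/wrong scoring: credit only if both dimensions are perfect."""
    return int(score.form == 1 and score.meaning == 1)

def partial_credit(score: GrammarScore, weights=(0.5, 0.5)) -> float:
    """Weighted partial credit across the form and meaning dimensions."""
    return weights[0] * score.form + weights[1] * score.meaning

# A response that conveys its meaning fully but is only partially accurate in form
response = GrammarScore(form=0.5, meaning=1.0)
print(dichotomous(response))     # 0    - no credit under dichotomous scoring
print(partial_credit(response))  # 0.75 - partial development is still rewarded
```

Scoring in this way also makes it straightforward to report a separate score for each dimension rather than a single composite.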
A clear
example of the need to score for form and meaning can be seen in some of the
latest research related to computer-assisted language learning (CALL). Several
studies (e.g., Heift, 2003) have investigated, for example, the role of
different types of corrective feedback (i.e., explicit correction,
metalinguistic information, repetition by highlighting) on grammar development.
Grammar performance errors in these studies were scored for form alone. In
future studies, the scoring of both grammatical form and meaning, when
applicable, might provide interesting insights into learner uptake in CALL.
Another
challenge relates to the scoring of grammatical ability in complex performance
tasks. In instances where the assessment goals call for the use of complex
performance tasks, we need to be sure to use well-developed scoring
rubrics and rating scales to guide raters to focus their judgments only on the
constructs relevant to the assessment goal.
McNamara
(1996) stresses that the scales in such tasks represent, explicitly or
implicitly, the theoretical basis upon which the performance is judged.
Therefore, clearly defined constructs of grammatical ability and how they are
operationalized in rating scales are critical.
Challenge
3: Assessing meanings
The
third challenge revolves around ‘meaning’ and how ‘meaning’ in a model of
communicative language ability can be defined and assessed. The ‘communicative’
in communicative language teaching, communicative language
testing, communicative language ability, or communicative competence
refers to the conveyance of ideas, information, feelings, attitudes and other
intangible meanings (e.g., social status) through language. Therefore, while
the grammatical resources used to communicate these meanings precisely are
important, the notion of meaning conveyance in the communicative curriculum is
critical. Thus, in order to test something as intangible as meaning in
second or foreign language use, we need to define what it is we are testing.
Looking
to linguists (and language philosophers) for help in defining meaning (e.g.,
Searle, 1969; Lyons, 1977; Leech, 1983; Levinson, 1983; Jaszczolt, 2002), we
will soon realize that meaning is not only a characteristic of the language and
its forms (i.e., semantics), but also a characteristic of language use (i.e.,
pragmatics). This, in turn, involves the links among explicitly stated meanings
in an utterance, the language user’s intentions, presuppositions and knowledge
of the real world, and the specific context in which the utterance is made. We
will also realize that boundary debates between semantics and pragmatics have
been long and interesting, but have produced no simple answer with respect to
the meaning of ‘meaning’ and the distinctions between semantics and pragmatics.
Challenge
4: Reconsidering grammar-test tasks
The
fourth challenge relates to the design of test tasks that are capable of both
measuring grammatical ability and providing authentic and engaging measures of
grammatical performance. Since the early 1960s, language educators have
associated grammar tests with discrete-point, multiple-choice tests of
grammatical form. These and other ‘traditional’ test tasks (e.g.,
grammaticality judgments) have been severely criticized for lacking in
authenticity, for not engaging test-takers in language use, and for promoting
behaviors that are not readily consistent with communicative language teaching.
Discrete-point testing methods may have even led some teachers to have
reservations about testing grammar or to have uncertainties about how to test
it communicatively.
In
providing grammar assessments, the challenge for language educators is to
design tasks that are authentic and engaging measures of performance. To do
this, I have argued that we must first consider the assessment purpose and the
construct we would like to measure. We also need to contemplate the kinds of
grammatical performance that we would need to obtain in order to provide
evidence in support of the inferences we want to be able to make about
grammatical ability. Once we have specified the inferences, or claims, that we
would like to make and the kinds of evidence we need to support these claims,
we can then design test tasks to measure what test-takers know about grammar or how
they are able to use grammatical resources to accomplish a wide range of
activities in the target language.
Challenge
5: Assessing the development of grammatical ability
The
fifth challenge revolves around the argument, made by some researchers, that
grammatical assessments should be constructed, scored and interpreted with
developmental proficiency levels in mind. This notion stems from the work of
several SLA researchers (e.g. Clahsen, 1985; Pienemann and Johnston, 1987;
Ellis, 2001b) who maintain that the principal finding from years of SLA
research is that structures appear to be acquired in a fixed order and a fixed
developmental sequence. Furthermore, instruction on forms in non-contiguous
stages appears to be ineffective. As a result, the acquisitional development of
learners, they argue, should be a major consideration in L2 grammar
testing.
In terms
of test construction, Clahsen (1985) claimed that grammar tests should be based
on samples of spontaneous L2 speech with a focus on syntax and morphology, and
that the structures to be measured should be selected and graded in terms of
order of acquisition in natural L2 development. Furthermore, Ellis (2001b)
argued that grammar scores should be calculated to provide a measure of both
grammatical accuracy and the underlying acquisitional development of L2
learners. In the former, the target-like accuracy of a grammatical form can be
derived from a total correct score or percentage.
As
intuitively appealing as the recommendation for developmental scores might
appear, the research base on developmental orders and sequences is vastly
incomplete and at too early a stage for use as a basis for assessment
(Lightbown, 1985; Hudson, 1993; Bachman, 1998). Furthermore, as I have
argued in Chapters 3 and 4, grammatical knowledge involves more than knowledge
of morphosyntactic or lexical forms; meaning is a critical component. In other
words, test-takers can be communicatively effective and at the same time inaccurate; they can be highly accurate but communicatively ineffective; and they can be both communicatively effective and highly accurate. Without more
complete information on the patterns of acquisition relating to other
grammatical forms as well as to grammatical meaning, language testers would not
have a solid empirical basis upon which to construct, score and interpret the
results from grammar assessments based solely on developmental scores.
In sum,
the challenge for language testers is to design, score and interpret grammar
assessments with a consideration for developmental proficiency. While this idea
makes sense, what basis can we use to infer progressive levels of development?
Results from acquisitional development research have been proposed as a basis
for such interpretations by some researchers. At this stage of our
knowledge, other more viable ways of accounting for what learners know might be
better obtained from the way grammatical performance is scored. Instead of
reporting one and only one composite, accuracy-based score, we can report a
profile of scores, one for each construct we are measuring. Furthermore, in the
determination of these scores, we can go beyond dichotomous scoring to give
more precise credit for attainment of grammatical ability. Finally, scores that
are derived from partial credit reflect different levels of development and can
be interpreted accordingly. In other words, acquisitional developmental levels
need not be the only basis upon which to make inferences about grammatical
development.
Final remarks
Despite
loud claims in the 1970s and 1980s by a few influential SLA researchers that
instruction, and in particular explicit grammar instruction, had no effect on language
learning, most language teachers around the world never really gave up grammar
teaching. Furthermore, these claims have instigated an explosion of empirical
research in SLA, the results of which have made a compelling case for the
effectiveness of certain types of both explicit and implicit grammar
instruction. This research has also highlighted the important role that meaning
plays in learning grammatical forms.
In the
same way, most language teachers and SLA researchers around the world have
never really given up grammar testing. Admittedly, some have been perplexed as
to how grammar assessment could be compatible with a communicative language
teaching agenda, and many have relied on assessment methods that do not
necessarily meet the current standards of test construction and validation.
With the exception of Rea-Dickins and a few others, language testers have
been of little help. In fact, a number of influential language proficiency
exams have abandoned the explicit measurement of grammatical knowledge and/or
have blurred the
boundaries between communicative effectiveness and communicative precision (i.e., accuracy).
My aim
in this book, therefore, has been to provide language teachers, language
testers and SLA researchers with a practical framework, firmly based in
research and theory, for the design, development and use of grammar
assessments. I have tried to show how grammar plays a critical role in
teaching, learning and assessment. I have also presented a model of grammatical
knowledge, including both form and meaning, that could be used for test
construction and validation. I then showed how L2 grammar tests can be
constructed, scored and used to make decisions about test-takers in both
large-scale and classroom contexts. Finally, in this last chapter, I have
discussed some of the challenges we still face in constructing useful grammar
assessments. My hope is that this volume will not only help language teachers,
testers and SLA researchers develop better grammar assessments for their
respective purposes, but also instigate research and continued discussion on the
assessment of grammatical ability and its role in language learning.
SUMMARY OF THE ASSESSING VOCABULARY BOOK
CHAPTER 1
The Place of Vocabulary in Language Assessment
Introduction
At
first glance, it may seem that assessing the vocabulary knowledge of second language
learners is both necessary and reasonably straightforward. It is necessary in the
sense that words are the basic building blocks of language, the units of meaning
from which larger structures such as sentences, paragraphs and whole texts are formed.
Vocabulary
assessment seems straightforward in the sense that word lists are readily available
to provide a basis for selecting a set of words to be tested. In addition, there
is a range of well-known item types that are convenient to use for vocabulary testing.
Here are some examples:
Multiple-choice
(Choose the correct answer)
Completion
(Write in the missing word)
Translation
(Give the L1 equivalent of the underlined word) They worked at the mill.
Matching
(Match each word with its meaning)
Recent trends in language testing
However,
scholars in the field of language testing have a rather different perspective on vocabulary-test items of the conventional kind. Such items fit neatly into what language testers call the discrete-point approach to testing. This involves designing tests
to assess whether learners have knowledge of particular structural elements of the
language: word meanings, word forms, sentence patterns, sound contrasts and so on.
The
widespread acceptance of the validity of these criticisms has led to the adoption – particularly in the major English-speaking countries – of the communicative approach to language testing.
Bachman and Palmer's (1996) book Language Testing in Practice is a comprehensive and influential volume on language-test design and development. Following Bachman's
(1990) earlier work, the authors see the purpose of language testing as being to
allow us to make inferences about learners' language ability, which consists of
two components. One is language knowledge and the other is strategic competence.
Three dimensions of vocabulary assessment
Up
to this point, I have outlined two contrasting perspectives on the role of vocabulary
in language assessment. One point of view is that it is perfectly sensible to write
tests that measure whether learners know the meaning and usage of a set of words,
taken as independent semantic units. The other view is that vocabulary must always
be assessed in the context of a language-use task, where it interacts in a natural
way with other components of language knowledge.
Discrete - embedded
The
first dimension focuses on the construct which underlies the assessment instrument. In language testing, the term construct refers to the mental attribute or ability that a test is designed to measure. A discrete test takes vocabulary knowledge as a distinct construct, separated from other components of language competence. An embedded vocabulary measure is one that contributes to the assessment of a larger construct. I have already given an example
of such a measure, when I referred to Bachman and Palmer's task of writing a proposal
for the improvement of university admissions procedures.
Selective - comprehensive
The
second dimension concerns the range of vocabulary to be included in the assessment.
A conventional vocabulary test is based on a set of target words selected by the
test-writer, and the test-takers are assessed according to how well they demonstrate
their knowledge of the meaning or use of those words. This is what I call a selective vocabulary measure. The target
words may either be selected as individual words and then incorporated into separate
test items, or alternatively the test-writer first chooses a suitable text and then
uses certain words from it as the basis for the vocabulary assessment.
On
the other hand, a comprehensive measure takes account of all the vocabulary content
of a spoken or written text. For example, let us take a speaking test in which the
learners are rated on various criteria, including their range of expression.
Context-independent - context-dependent
The
role of context, which is an old issue in vocabulary testing, is the basis for the
third dimension. Traditionally contextualisation has meant that a word is presented
to test-takers in a sentence rather than as an isolated element. From a contemporary
perspective, it is necessary to broaden the notion of context to include whole texts
and, more generally, discourse.
The
issue of context dependence also arises with cloze tests, in which words are systematically
deleted from a text and the test-takers' task is to write a suitable word in each
blank space.
Generally
speaking, vocabulary measures embedded in writing and speaking tasks are context
dependent in that the learners are assessed on the appropriateness of their vocabulary
use in relation to the task.
CHAPTER 2
The Nature Of Vocabulary
Introduction
Before
we start to consider how to test vocabulary, it is necessary first to explore the
nature of what we want to assess. Our everyday concept of vocabulary is dominated
by the dictionary. We tend to think of it as an inventory of individual words, with
their associated meanings. This view is shared by many second language learners,
who see the task of vocabulary learning as a matter of memorising long lists of
L2 words, and their immediate reaction when they encounter an unknown word is to
reach for a bilingual dictionary.
What is a word?
A
basic assumption in vocabulary testing is that we are assessing knowledge of words.
But the word is not an easy concept to define, either in theoretical terms or for
various applied purposes. There are some basic points that we need to spell out
from the start. One is the distinction between tokens and types, which applies to
any count of the words in a text.
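As a simple illustration of the token/type distinction (the example sentence is invented, not taken from the book), the following Python sketch counts both:

```python
# Tokens are the running words of a text; types are the distinct word forms.
# The example sentence is a hypothetical stand-in.
import re
from collections import Counter

text = "The cat sat on the mat because the mat was warm."
tokens = re.findall(r"[a-z']+", text.lower())  # every running word is a token
types = Counter(tokens)                        # each distinct form is a type

print(len(tokens))   # 11 tokens
print(len(types))    # 8 types ('the' occurs three times, 'mat' twice)
print(types.most_common(2))
```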
Mention
of words like the, a, to, and, in and that leads to the question of whether they
are to be regarded as vocabulary items. Words of this kind - articles, prepositions,
pronouns, conjunctions, auxiliaries, etc. - are often referred to as function words
and are seen as belonging more to the grammar of the language than to its vocabulary.
Unlike content words - nouns, 'full' verbs, adjectives and adverbs - they have little, if any, meaning in isolation and serve more to provide links within sentences, modify
the meaning of content words and so on.
What about larger lexical items?
The
second major point about vocabulary is that it consists of more than just single words. For a start, there are phrasal verbs and other multi-word units: the language user has 'available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments' (p. 110).
They
identify four categories of lexical phrases:
1. Polywords: short fixed phrases that
perform a variety of functions, such as for the most part (which they call a qualifier),
at any rate and so to speak (fluency devices), and hold your horses (disagreement marker).
2. Institutionalised expressions: longer
utterances that are fixed in form and include proverbs, aphorisms and formulas for
social interaction.
3. Phrasal constraints: short- to medium-length phrases consisting of a basic frame with one or two slots that can be filled with various words or phrases.
4. Sentence builders: phrases that provide
the framework for a complete sentence, with one or more slots in which a whole idea
can be expressed.
What does it mean to know a lexical
item?
The
other seven assumptions cover various aspects of what is meant by knowing a word:
2. Knowing
a word means knowing the degree of probability of encountering that word in speech
or print. For many words we also know the sort of words most likely to be found
associated with the word.
3. Knowing
a word implies knowing the limitations on the use of the word according to variations
of function and situation.
4. Knowing
a word means knowing the syntactic behaviour associated with the word.
5. Knowing
a word entails knowledge of the underlying form of a word and the derivations that
can be made from it.
6. Knowing
a word entails knowledge of the network of associations between that word and other
words in the language.
7. Knowing
a word means knowing the semantic value of a word.
8. Knowing
a word means knowing many of the different meanings associated with a word.
What is vocabulary ability?
Mention
of the term construct brings me back to the main theme I developed in Chapter 1,
which was that scholars with a specialist interest in vocabulary teaching and learning
have a rather different perspective from language testers on the question of how
- and even whether - to assess vocabulary. My three dimensions of vocabulary assessment
represent one attempt to incorporate the two perspectives within a single framework.
The context of vocabulary use
Traditionally
in vocabulary testing, the term context has referred to the sentence or utterance
in which the target word occurs. For instance, in a multiple-choice vocabulary
item, it is normally recommended that the stem should consist of a sentence containing
the word to be tested, as in the following example: The committee endorsed the proposal.
A. discussed
B. supported
C. knew about
D. prepared
Vocabulary knowledge and fundamental
processes
The
second component in Chapelle's (1994) framework of vocabulary ability is the one
that has received the most attention from applied linguists and second language
teachers. Chapelle outlines four dimensions of this component:
Vocabulary
size: This refers to the number of words that a person knows.
Knowledge
of word characteristics: a language user's understanding of particular lexical items may range
from vague to more precise (Cronbach, 1942).
Lexicon organization:
This concerns the way in which words and other lexical items are stored in the brain.
Fundamental
vocabulary processes: Language users apply these processes to gain access to their
knowledge of vocabulary, both for understanding and for their own speaking and writing.
Metacognitive strategies for vocabulary
use
This
is the third component of Chapelle's definition of vocabulary ability, and is what
Bachman (1990) refers to as 'strategic competence'. The strategies are employed
by all language users to manage the ways that they use their vocabulary knowledge
in communication. Most of the time, we operate these strategies without being aware
of it. It is only when we have to undertake unfamiliar or cognitively demanding
communication tasks that the strategies become more conscious.
Learners
have a particular need for metacognitive strategies in communication situations
because they have to overcome their lack of vocabulary knowledge in order to function
effectively. Blum-Kulka and Levenston (1983) see these strategies in terms of general
processes of simplification.
Conclusion
My
intention in this chapter has been to draw attention to the complexity of the subject
matter in any assessment of vocabulary. At the simplest level vocabulary consists
of words, but even the concept of a word is challenging to define and classify.
For a number of assessment purposes, it is important to clarify what is meant by
a 'word' if the correct conclusions are to be drawn from the test results.
The
strong association in our minds between vocabulary and single words tends to restrict
our view of the topic. For this reason, during the writing of the book I considered
the possibility of dispensing with both the terms vocabulary and word as much as
possible in favour of terms like lexicon, lexis and lexical item, in order to signal
that I was adopting a much broader conception than the traditional ideas about vocabulary.
CHAPTER 3
Research on vocabulary
acquisition and use
Introduction
The
focus of this chapter is on research in second language vocabulary acquisition
and use. There are three reasons for reviewing this research in a book on vocabulary
assessment. The first is that the researchers are significant users of vocabulary
tests as instruments in their studies. In other words, the purpose of vocabulary
assessment is not only to make decisions about what individual learners have achieved
in a teaching/learning context but also to advance our understanding of the processes
of vocabulary acquisition. Secondly, in the absence of much recent interest in vocabulary
among language testers, acquisition researchers have often had to deal with assessment issues themselves as they devised the instruments for their research. The third
reason is that the results of their research can contribute to a better understanding
of the nature of the construct of vocabulary ability, which - as I explained in the previous chapter - is important
for the validation of vocabulary tests.
Systematic vocabulary learning
Given
the number of words that learners need to know if they are to achieve any kind of
functional proficiency in a second language, it is understandable that researchers
on language teaching have been interested in evaluating the relative effectiveness
of different ways of learning new words.
The
starting point for research in this area is the traditional approach to vocabulary
learning, which involves working through a list of L2 words together with their
L1 glosses/translations and memorizing the word-gloss pairs.
Assessment Issues
As
for the assessment implications, the design of tests to evaluate how well students
have learned a set of new words is straightforward, particularly if the learning
task is restricted to memorising the association
between an L2 word and its L1 meaning. It
simply involves presenting the test-takers with one word and asking them to supply
the other-language equivalent. However, as
Ellis and Beaton (1993b) note, it makes a difference whether they are required to
translate into or out of their own language.
For example, English native speakers learning German words find it easier
to supply the English meaning in response to the German word (i.e. to translate
into the L1) than to give the German word for an English meaning (i.e. to translate
into the L2).
Incidental vocabulary learning
The
term incidental often causes problems in the discussion of research on this kind
of vocabulary acquisition. In practice it usually means that the learners are not told in advance that their knowledge of the target words will be tested.
Research with native speakers
The
first step in investigating this kind of vocabulary acquisition was to obtain evidence
that it actually happened. Teams of reading
researchers in the United States (Jenkins, Stein and Wysocki, 1984; Nagy, Herman
and Anderson, 1985; Nagy, Anderson and Herman, 1987) undertook a series of studies
with native-English-speaking school children.
The basic research design involved asking the subjects to read texts appropriate to their age level that contained unfamiliar words. The children were not told that the researchers were interested in vocabulary. After they had completed the reading task, they were given, unannounced, at least one test of their knowledge of the target words in the text.
Second language research
Now,
how about incidental learning of second language vocabulary? In a study that predates the L1 research in the
US, Saragi, Nation and Meister (1978) gave a group of native speakers of English
the task of reading Anthony Burgess's novel A Clockwork Orange, which contains a
substantial number of Russian-derived words that function as an argot used by the young delinquents who
are the main characters in the book. When
the subjects were subsequently tested, it was found on average that they could recognize
the meaning of 76 per cent of the 90 target words.
Pitts,
White and Krashen (1989) used just excerpts from the novel with two groups of American
university students and also found some evidence of vocabulary learning; however, as you might expect, the reduced scope of the study resulted in fewer target words being correctly understood: an average of only two words out of 30 tested. If you have
read the novel yourself, you may recall how you were able to gain a reasonable understanding
of what most of the Russian words meant by repeated exposure to them as the story
progressed. Of course, since Burgess deliberately
composed the text of the novel to facilitate this kind of 'incidental acquisition',
it represents an unusual kind of reading material that is significantly different
from what learners normally encounter.
Assessment issues
Now,
what are the testing issues that arise from this research on incidental vocabulary
acquisition? One concerns the need for a pretest. A basic assumption made in these
studies is that the target words are not known by the subjects. To some extent,
it is possible to rely on teachers' judgements or word-frequency counts to select
words that a particular group of learners are unlikely to know, but it is preferable
to have some more direct evidence. The use of a pre-test allows the researchers
to select from a set of potential target words ones that none of the subjects are
familiar with.
Alternatively,
if the learners turn out to have knowledge of some of the words, the test results
can provide a baseline against which post-test scores can be evaluated. The problem,
however, is that if a vocabulary test is given before the learners undertake the
reading or listening task, they will be alerted to the fact that the researchers
are interested in vocabulary and thus the 'incidental' character of any learning
that occurs may be reduced, if not lost entirely.
Various
solutions to this problem have been adopted. In some of the research where the subjects
were reading texts in their first language, the type of target items used could be assumed to be unfamiliar without giving a pre-test. For example, the researchers
who used A Clockwork Orange had only to ensure that none of their subjects had studied
Russian. And in the two experiments by Hulstijn (1992) that involved subjects who
were reading in their first language, the target items were 'pseudo-words' created
by the researcher.
Questions about lexical inferencing
This
topic is not purely a pedagogical concern.
Inferencing by learners is of great interest in second language acquisition
research and there are several types of empirical studies that are relevant here. In reviewing this work, I find it helpful to
start with five questions that seem to follow a logical sequence:
1 What kind
of contextual information is available to readers to help them in guessing the meaning
of unknown words in texts?
2 Are such
clues normally available to the reader in natural, unedited texts?
3 How well do learners infer the meaning of unknown words without being specifically trained to do so?
4 Is strategy training an effective way of developing learners' lexical inferencing skills?
5 Does successful
inferencing lead to acquisition of the words?
Assessment issues
As
in any test-design project, we first need to be clear about what the purpose of
a lexical inferencing test is. The literature
I have just reviewed above indicated at least three possible purposes:
1 to conduct
research on the processes that learners engage in when they attempt to infer the
meaning of unknown words;
2 to evaluate
the success of a program to train learners to apply lexical inferencing strategies; or
3 to assess
learners on their abilities to make inferences about unknown words.
The design
of the test should be influenced by which of these purposes is the applicable one.
There
are two possible starting points for the design of the test. The first approach is to select a set of words
which are known to be unfamiliar to the test-takers and then create a suitable context
for each one in the form of a sentence or short paragraph. This strategy allows the tester to control the
nature and the amount of the contextual clues provided, but at the risk of producing unauthentic contexts that may be unnaturally pregnant. The alternative is to take one or more texts as
the starting point and to choose certain low-frequency words within them as the
target items for the test. There are drawbacks
in this case as well: there may be too many unfamiliar words or conversely too few; and the text may not provide any usable contextual
information for a particular word, or again may provide too much.
Communication strategies
When
compared with the amount of research on ways that learners cope with unknown words
they encounter in their reading, there has been less investigation of the vocabulary
difficulties they face in expressing themselves through speaking and writing. However, within the field of second language acquisition,
there is an active tradition of research on communication strategies. Although the scholars involved have not seen themselves
as vocabulary researchers, a primary focus of their studies has been on how learners
deal with lexical gaps, that is words or phrases in the target language that they
need to express their intended meanings but don't know. It is therefore appropriate for us to consider
what the findings have been and what their possible implications for vocabulary assessment
are.
Conclusion
Let
us review the assessment procedures discussed in this chapter in terms of the three
dimensions of vocabulary assessment I presented in Chapter 1. Research on second
language vocabulary acquisition normally employs discrete tests, because the researchers
are investigating a construct that can be labeled
'vocabulary knowledge', 'vocabulary skills' or 'vocabulary learning ability'. This applies even to research on communication
strategies. Despite the apparently broad
scope of the topic area, most researchers have focused very specifically on lexical
strategies and designed tests that oblige learners to deal with their lack of knowledge
of particular vocabulary items. Embedded
measures make sense in theory, but it remains to be seen whether they can be used
as practical tools for assessing communication strategies.
Secondly,
selective rather than comprehensive measures are used in vocabulary acquisition
research, at least in the areas covered in this chapter. Tests assess whether learners have some knowledge
of a series of target words and/or specific vocabulary skills that the researcher
is interested in. However, comprehensive
measures may have a limited role in the development of incidental learning or inferencing
tests. In order to have access to the contextual
information required to gain some understanding of the unknown or partially known
target words, the test-takers need to have a reasonable knowledge of most words
in the input text. A comprehensive measure
of the non-target vocabulary - say, in the form of a readability formula - would
therefore be a useful guide to the suitability of a text for this kind of test.
As
for context dependence, there is variability according to what aspect of vocabulary is being investigated. Tests of systematic vocabulary learning are normally context independent, with the words being presented
in isolation or in a limited sentence context.
In the case of research on incidental learning, subjects are certainly presented with the target words in context as part of the experimental treatment, but
knowledge of the words is assessed afterwards in a context-independent manner, in
that the subjects cannot refer to what they
read or heard while they are taking the vocabulary test. By contrast, context dependence is an essential
characteristic of the test material in studies of lexical inferencing. In order for the items to be truly context dependent,
the test-writer needs to ensure both that contextual clues are available for each target word and that the test-takers have no prior knowledge of the words.
CHAPTER 4
Research on vocabulary
assessment
Introduction
In
the previous chapter, we saw how tests play a role in research on vocabulary within
the field of second language acquisition (SLA).
Now we move on to consider research in the field of language testing, where the focus is not so much on understanding
the processes of vocabulary learning as on measuring the level of vocabulary knowledge
and abilities that learners have reached.
Language testing is concerned with the design of tests to assess learners
for a variety of practical purposes that can be summarized under labels such as
placement, diagnosis, achievement and proficiency.
However, in practice this distinction between second
language acquisition research and assessment is difficult to maintain consistently,
because, on the one hand, language testing researchers have paid relatively little
attention to vocabulary tests and, on the other hand, second language acquisition
researchers working on vocabulary acquisition have often needed to develop tests as an integral part of their research design. Thus, some of the important work on how to measure vocabulary knowledge and abilities has been produced by vocabulary acquisition researchers rather than language testers; the latter have tended either to take vocabulary tests for granted or, in the 1990s, to be interested in more integrative and communicative measures of language proficiency.
Objective testing
The history
of vocabulary assessment in the twentieth century is very much associated with the
development of objective testing, especially in the United States. Objective tests are ones in which the learning
material is divided into small units, each of which can be assessed by means of
a test item with a single correct answer that can be specified in advance. Most commonly these are items of the multiple-choice
type.
It is easy to see how vocabulary became popular as a component of objective language tests.
• Words could be treated as independent linguistic units with a meaning expressed by a synonym, a short defining phrase or a translation equivalent.
• There was
a great deal of work in the 1920s and 1930s to prepare lists of the most frequent
words in English, as well as other words that were useful for the needs of particular
groups of students. According to Lado (1961:
181), similar though more limited work was done on the vocabulary of major European
languages.
• Multiple-choice
vocabulary tests proved to have excellent technical characteristics, in relation
to the requirements of psychometric theory.
Well-written items could discriminate effectively among learners according
to their level of ability, and thus the tests were highly reliable. Reliability was the great virtue of a psychometric
test.
• Rather than
simply measuring vocabulary knowledge, objective vocabulary tests seemed to be valid
indicators of language ability in a broad sense. As Anderson and Freebody (1981: 78-80) noted,
one of the most consistent findings in L1 reading research has been the high correlation
between tests of vocabulary and reading comprehension.
Multiple-choice vocabulary items
Although the
multiple-choice format is one of the most
widely used methods of vocabulary assessment, both for native speakers and for second
language learners, its limitations have also been recognized for a long time. Wesche and Paribakht summarize the criticisms
of these items as follows:
1. They are
difficult to construct, and require laborious field-testing, analysis and refinement.
2. The learner
may know another meaning for the word, but not the one sought.
3. The learner
may choose the right word by a process of elimination, and has in any case a 25
per cent chance of guessing the correct answer in a four-alternative format.
4. Items may
test students' knowledge of distractors rather than their ability to identify an
exact meaning of the target word.
5. The learner
may miss an item either for lack of knowledge of words or lack of understanding
of syntax in the distractors.
6. This format
permits only a very limited sampling of the learner's total vocabulary (for example,
a 25-item multiple-choice test samples one word in 400 from a 10,000-word vocabulary).
(Wesche and Paribakht, 1996: 17)
Validating tests of vocabulary knowledge
Writers
on first language reading research over the years (Kelley and Krey, 1934; Farr,
1969; Schwartz, 1984) have pointed out that, in addition to the various variations
of the multiple-choice format, a wide range of
test items and methods have been used for measuring vocabulary knowledge. Kelley and Krey (cited in Farr, 1969: 34) identified
26 different methods in standardized US vocabulary and reading tests. However, as Schwartz puts it, 'there does not
appear to be any rationale for choosing one measurement technique rather than another.
Test constructors sometimes seem to choose a particular measurement technique more or less by whim' (1984: 52). In addition,
it is not clear that the various types of tests are all measuring the same ability; this calls into question the validity of the tests.
Measuring vocabulary size
Let me first
sketch some educational situations in which consideration of vocabulary size is
relevant and where the research has been undertaken.
• Reading researchers have long been interested
in estimating how many words are known by native speakers of English as they grow
from childhood through the school years to adult life. It represents one facet of research into the role
of vocabulary knowledge in reading comprehension as it develops with age. As Anderson and Freebody (1981: 96-97) noted,
the results of such research have important implications for the way that reading
programs in schools are designed and taught.
• Estimates of native-speaker vocabulary size at
different ages provide a target - though a moving one, of course - for the acquisition
of vocabulary by children entering school with little knowledge of the language
used as the medium of instruction. Let us
take the case of children from non-English-speaking backgrounds who migrate with
their families to Canada.
• International students undertaking upper secondary or university education through a new medium of instruction are another group for whom vocabulary size is a relevant consideration.
What counts as a word?
This is an issue that I discussed in Chapter 2. The
larger estimates of vocabulary sizes for native speakers tend to be calculated on
the basis of individual word forms, whereas more conservative estimates take word
families as the units to be measured. Remember
that a word family consists of a base word together with its inflected and derived
forms that share the same meaning. For example,
the word forms extends, extending, extended, extensive, extensively, extension and
extent can be seen as members of a family headed by the base form extend. The other members of the family are linked to
extend by simple word-formation rules and all of them share a core meaning which
can be expressed as 'spread or stretch out'.
A person who knows the meaning of extend (or perhaps the meaning of any one
member of the family) should be able to figure out what the other word forms mean
by applying a little knowledge of English suffixes and getting some help from the
context in which the word occurs.
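As a rough illustration of the word-family idea (not a procedure from the book), the following Python sketch maps the forms listed above onto the base extend using naive suffix stripping; the suffix list and the extens-/extend alternation rule are simplifications invented for this example.

```python
# Naive suffix stripping: real word-family counts rely on principled word-formation
# rules, so this only illustrates the idea with the 'extend' family.
SUFFIXES = ["ively", "sion", "ing", "ive", "ed", "s", "t"]

def base_form(word: str, known_bases: set) -> str:
    """Return the base a word form maps to, if stripping a suffix yields a known base."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # allow the extens-/exten- spelling alternation seen in this family
            for candidate in (stem, stem.rstrip("s") + "d"):
                if candidate in known_bases:
                    return candidate
    return word

family = ["extends", "extending", "extended", "extensive", "extensively", "extension", "extent"]
print({w: base_form(w, {"extend"}) for w in family})  # every form maps to 'extend'
```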
How do we choose which words to test?
For
practical reasons it is impossible to test all the words that the native speaker
of a language might know. Researchers have
typically started with a large dictionary and then drawn a sample of words representing,
say, 1 per cent (1 in 100) of the total dictionary entries. The next step is to test how many of the selected
words are known by a group of subjects. Finally,
the test scores are multiplied by 100 to give an estimate of the total vocabulary
size. It seems a straightforward procedure
but, as Nation (1993b) pointed out in some detail, there are numerous traps for
unwary researchers. For example, the dictionary
headwords are not the most suitable sampling units, for the reasons I gave in response
to the first question. A single word family
may have multiple entries in the dictionary, so an estimate of vocabulary size based
on headwords would be an inflated one. Second,
a procedure by which, say, the first word on every sixth page is chosen will produce
a sample in which very common words are overrepresented because these words, with
their various meanings and uses, take up much more space in the dictionary than
low-frequency words. Third, there are technical questions concerning
the size of sample required to make a reliable estimate of the total vocabulary
size.
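Purely as an illustration of the arithmetic involved (the word list, sample ratio and simulated learner are invented), the following Python sketch carries out the dictionary-sampling procedure described above:

```python
# Illustrative sketch of the dictionary-sampling estimate; a real study would sample
# word families rather than headwords and guard against the pitfalls Nation (1993b) notes.
import random

def estimate_vocabulary_size(entries, sample_ratio, knows_word):
    """Sample entries at the given ratio, test each one, and scale the score back up."""
    sample_size = max(1, round(len(entries) * sample_ratio))
    sample = random.sample(entries, sample_size)
    known = sum(1 for word in sample if knows_word(word))
    return round(known * len(entries) / sample_size)  # e.g. known score x 100 for a 1% sample

dictionary = [f"word{i}" for i in range(20_000)]      # stand-in for dictionary entries
knows = lambda w: int(w[4:]) < 7_000                  # pretend the learner knows the first 7,000
print(estimate_vocabulary_size(dictionary, 0.01, knows))  # roughly 7,000
```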
How do we find out whether the selected words are known?
Once a sample of words has been selected, it is
necessary to find out - by means of some kind of test - whether each word is known. In studies of vocabulary size, the criterion for
knowing a word is usually quite liberal, because of the large number of words that
need to be covered in the time available for testing. The following test formats have been commonly
used:
• multiple-choice
items of various types;
• matching of words with synonyms or definitions;
• supplying
an Ll equivalent to each L2 target word;
• the checklist (or yes-no) test, in which test-takers
simply indicate whether they know the word or not.
Assessing quality of vocabulary knowledge
Whatever
the merits of vocabulary-size tests, one limitation is that they can give only a
superficial indication of how well any particular word is known. In fact this criticism has long been applied to
many objective vocabulary tests, not just those that are designed to estimate the
total vocabulary size. Dolch and Leeds (1953)
analyzed the vocabulary subtests of five major reading and general achievement test
batteries for American school children and found that
1. Only the
commonest meaning of each target word was assessed; and
2. The test-takers
were required just to identify a synonym of each word. Thus, the test items could not show whether additional,
derived or figurative meanings of the target word were known, and it was quite possible
that the children had learned to associate the target word with its synonym without
really understanding what either one meant.
How to conceptualize it?
The Dolch and Leeds (1953) test items with which I introduced this section of the chapter essentially assess precision of knowledge: do the test-takers know the specific meaning of each target word, rather than just having a vague idea about it? This represents one way to define quality of knowledge,
but it assumes that each word has only one meaning to be precisely known. Of course, words commonly have several different meanings - think of fresh, as in fresh bread, fresh ideas, fresh supplies, a fresh
breeze and so on. If we take this aspect
into account, we need to add a dimension of meaning, in addition to precision. Going a step further, vocabulary knowledge involves
more than simply word meaning. As we saw
in Chapter 2, Richards (1976) and Nation (1990) list multiple components of word
knowledge, including spelling, pronunciation, grammatical form, relative frequency,
collocations and restrictions on the use of the word, as well as the distinction between receptive and productive knowledge.
How to measure it?
A
common assessment procedure for measuring the quality of vocabulary knowledge is
an individual interview with each learner, probing how much they know about a set
of target words. For instance, in their work
with bilingual and monolingual Dutch children, Verhallen and Schoonen (1993) wanted
to elicit all aspects of the target word meaning that children might know, in order
to make an elaborate semantic analysis of the responses.
The role of context
Whether
we can separate vocabulary from other aspects of language proficiency is obviously
relevant to the question of what the role of context is in vocabulary assessment. In the early years of objective testing, many vocabulary tests presented the target words in isolation, in lists or as the stems of multiple-choice items. It was considered that such tests were pure measures of vocabulary knowledge. In fact, the distinguished American scholar John B. Carroll
wrote in an unpublished paper in 1954 (quoted in Spolsky, 1995: 165) that test items
containing a single target word were the only ones that should be classified as
vocabulary items. Any longer stimulus would
turn the item into a reading-comprehension one.
Cloze tests as vocabulary measures
A
standard cloze test consists of one or more reading passages from which words are
deleted according to a fixed ratio (e.g. every seventh word). Each deleted word is replaced by a blank of uniform
length, and the task of the test-takers is to write a suitable word in each space. Some authors (e.g. Weir, 1990: 48; Alderson, 2000) prefer to restrict the use of the label cloze to this kind of test, but the term is commonly used to include a number of modifications to the standard format. One modified version is the selective-deletion (or rational) cloze, where the test-writer deliberately chooses the words to be deleted, preferably according to principled criteria. A second modification is the multiple-choice cloze.
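As a purely illustrative sketch (the passage, the deletion ratio, the lead-in length and the blank style are all invented), the following Python function prepares a standard fixed-ratio cloze of the kind just described:

```python
# Fixed-ratio deletion as described above; a rational (selective-deletion) cloze would
# instead choose the gapped words deliberately, e.g. only content words.
def make_cloze(words, ratio=7, lead_in=10, blank="______"):
    """Return the gapped text and the answer key for a fixed-ratio deletion."""
    gapped, answers = [], []
    for i, word in enumerate(words):
        if i >= lead_in and (i - lead_in) % ratio == ratio - 1:
            gapped.append(blank)
            answers.append(word)
        else:
            gapped.append(word)
    return " ".join(gapped), answers

passage = ("Vocabulary assessment has a long history and it continues to raise "
           "difficult questions about what exactly a test of word knowledge is "
           "measuring when the words appear in a longer text").split()
text, key = make_cloze(passage, ratio=7)
print(text)
print(key)  # the deleted words, i.e. the scoring key
```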
The standard cloze
Let
us first look at the standard, fixed-ratio cloze. A popular way of exploring the validity of cloze
tests in the 1970s was to correlate them with various other types of tests. In numerous studies the cloze correlated highly
with 'integrative' tests such as dictation or composition writing and at a rather
lower level with more 'discrete-point' tests of vocabulary, grammar and phonology. This was interpreted as evidence in support of
Oller's claim that the cloze was a good measure of overall proficiency in the language. However, as I discussed earlier in this chapter,
simple correlations are not an adequate means of establishing what a test is measuring,
especially when the correlation is a moderate one, say, in the range from 0.50 to
0.80.
The rational cloze
Although
Oller has consistently favored the standard fixed-ratio format as the most valid
form of the cloze procedure for assessing second language proficiency, other scholars
have argued for a more selective approach to the deletion of words from the text. In his research, Alderson (1979) found that a
single text could produce quite different tests depending on whether you deleted,
say, every eighth word rather than every sixth.
He also obtained evidence that most cloze blanks could be filled by referring
just to the clause or the sentence in which they occurred. (There is of course some confirmation of this in the figures from Jonz's (1990) research quoted above: 68 per cent of the items could be answered from within the clause or the sentence.)
Only a few researchers have used the rational cloze in a systematic way with second language learners. Bachman (1982; 1985) conducted two studies in
which he selected items that focused primarily on cohesive links within and between
the sentences of the text. Chapelle and Abraham
(1990) chose items that matched the types of items in a fixed-ratio cloze based
on the same text. None of the published studies
has involved a rational cloze designed just to measure vocabulary, but once you
accept the logic of selective deletion of words from the text, it makes sense to
use the cloze procedure to assess the learners' ability to supply missing content
words on the basis of contextual clues and,
at a more advanced level, to choose the most stylistically appropriate word for
a particular blank.
The multiple-choice cloze
This version of the cloze provides choice items rather than the standard blanks to be filled in. Porter (1976) and Ozete (1977) argued that the standard format requires writing ability, whereas the multiple-choice version makes it more a measure of reading comprehension. Jonz (1976) pointed out that a multiple-choice cloze could be marked more objectively because it controlled the range of responses that the test-takers could give. In addition, he considered that providing response options made the test more student-centered - or 'learner-friendly', as we might say these days. For Bensoussan and Ramraz (1984), multiple-choice items were a practical necessity, because their cloze test formed part of an entrance examination for two universities in Israel, with 13,000 candidates annually.
The C-test
At
first glance the C-test - in which a series of short texts are prepared for testing by deleting the second half of every second word - may seem to be the version of the cloze procedure that is the least promising as a specific measure of vocabulary. For one thing, its creators intended that it should assess general proficiency in the language, particularly for selection and placement purposes, and that the items should represent a sample of all the elements in the text (Klein-Braley, 1985; 1997: 63-66). If that is the intention, there is no question
of using only content words as items. Second,
the fact that just the second half of a word is deleted might suggest that knowledge
of word structure is more important in this kind of test than, say, the semantic
aspects of vocabulary knowledge - especially if the language being tested has a
complex system of word endings. However, Chapelle and Abraham (1990) found that
their C-test correlated highly with their multiple-choice vocabulary test (r = 0.862). The correlation was substantially higher than
with the other parts of the placement test battery, including the writing test (0.639)
and the reading test (0.604).
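As an illustrative sketch of the deletion rule itself (the sentence is invented, and published C-tests typically leave the first sentence of each short text intact, a refinement omitted here), the following Python function damages every second word as described above:

```python
# Deletes the second half of every second word, marking the gap with underscores.
def make_c_test(text, start=1):
    out = []
    for i, word in enumerate(text.split()):
        if i >= start and (i - start) % 2 == 0 and len(word) > 1:
            keep = (len(word) + 1) // 2              # keep the first half (rounded up)
            out.append(word[:keep] + "_" * (len(word) - keep))
        else:
            out.append(word)
    return " ".join(out)

print(make_c_test("Learners complete each damaged word using their knowledge of the language"))
# Learners comp____ each dama___ word usi__ their knowl____ of th_ language
```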
Conclusion
Discrete,
selective, context-independent vocabulary tests have been an important part of the
educational measurement scene for almost the whole of the twentieth century. They have all the virtues of an objective language
test and have become so well established that for a long time they were almost taken
for granted. Multiple-choice vocabulary items are still very much in use, though generally in a more contextualized form since the 1990s, with the target words presented at least in a sentence if not a broader linguistic context. At the same time, the prevailing
view in language testing is that discrete vocabulary measurements are no longer
a valid component of tests designed to assess the learners' overall proficiency
in a second language. Vocabulary knowledge
is assessed indirectly through the test-takers' performance of integrative tasks
that show how well they can draw on all their language resources to use the language
for various communicative purposes.
Nevertheless, researchers and language-teaching
specialists with a specific interest in vocabulary learning have a continuing need
for assessment tools. Much of their work
can be classified as focusing on either vocabulary size (breadth) or quality of
vocabulary knowledge (depth). Vocabulary
size has received more attention because, despite the fact that the tests may seem
superficial, they can give a more representative picture of the overall state of
the learners' vocabulary than an in-depth probe of a limited number of words. Measures of quality of vocabulary knowledge also
have value, but for quite specific purposes.
The
construct validation studies by Corrigan and Upshur (1982) and Arnaud (1989) challenge
the notion that vocabulary can be assessed as something separate from other components
of language knowledge, even when individual words are tested in relative isolation. This is consistent with other evidence of the
integral part that vocabulary plays in language ability, such as the strong relationship
between vocabulary tests and measures of reading comprehension.
Such
findings lend support to the view that vocabulary should always be assessed in context. However, as the research on the various members
of the cloze family of tests shows, the more we contextualize the assessment of
vocabulary, the less clear it may be to what extent it is vocabulary knowledge that
is influencing the test-takers' performance.
CHAPTER 5
Vocabulary Tests: Four case studies
The four tests are:
1. The Vocabulary Levels Test;
2. The Eurocentres Vocabulary Size Test (EVST);
3. The Vocabulary Knowledge Scale (VKS); and
4. The Test of English as a Foreign Language (TOEFL).
The first two tests, the Vocabulary Levels Test and the Eurocentres Vocabulary Size Test, are both measures of vocabulary size, whereas the Vocabulary Knowledge Scale is designed to assess depth of vocabulary knowledge. The fourth test to be considered is the Test of English as a Foreign Language (TOEFL), which is certainly not a discrete vocabulary test but rather a well-researched proficiency-test battery that has incorporated vocabulary items in a variety of interesting ways throughout its history.
The
Vocabulary Levels Test
The
Vocabulary Levels Test was devised by Paul Nation at Victoria University of Wellington in New Zealand in the early 1980s as a simple instrument for classroom use by teachers, in order to help them develop a suitable vocabulary teaching and learning programme for their students.
The
design of the test
The test
is in five parts, representing five levels of word frequency in English: the first 2000 words, 3000 words, 5000 words, the University word level (beyond 5000 words) and 10,000 words.
According
to Nation (1990: 261), the 2000- and 3000-word levels contain the high-frequency words that all learners need to know in order to function effectively in English. For instance, it is difficult for learners to read unsimplified texts unless they know these words. The 5000-word level represents the upper limit of general high-frequency vocabulary that is worth spending time on in class. Words at the University level should help students in reading their textbooks and other academic reading material. Finally, the 10,000-word level covers the more common lower-frequency words of the language.
At each
level, there are 36 words and 18 definitions, in groups of six and three respectively, as in this example from the 2000-word level:
1. Apply
2. Elect
3. Jump
4. Manufacture
5. Melt
6. Threaten

_____ choose by voting
_____ become like water
_____ make
Validation
One way
to validate the test as a measure of vocabulary size is to see whether it
provides evidence for the assumption on which it is based: words that occur
more frequently in the language are more likely to be known by learners than
less frequent words.
To
evaluate the scalability of the test, it was first necessary to set a criterion (or mastery) score for each level. For the analysis of this test, the criterion was set at 16 out of 18. In other words, a student who scored at least 16 on a particular level was regarded as having mastered the vocabulary at that level.
The
Guttman scalogram analysis produces a summary statistic called the coefficient of scalability, which indicates the extent to which the test scores truly form an implicational scale, taking into account the number of errors as well as the proportion of correct responses. According to Hatch and Farhady (1982: 181), the coefficient should be well above 0.60 if the scores are to be considered scalable. In my analysis, the scores from the beginning of the course yielded a coefficient of 0.90, with the five frequency levels in their original order, as in the upper part of Table 5.1. For the end-of-course scores, I obtained the best scalability, 0.84, by reversing the order of the 5000 and University levels, following the pattern of the mean scores in the lower part of Table 5.1. Thus, the statistics showed a high degree of implicational scaling, but by no means a perfect one.
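To make the scalability analysis more concrete, the sketch below shows how such statistics can be computed from pass/fail patterns across the five levels (pass meaning the 16-out-of-18 criterion was reached). It is an illustration only: the error-counting convention (deviations from the ideal pattern implied by each learner's number of passes) and the formula for the coefficient of scalability follow common textbook practice and are assumptions here, since the original analysis is not reproduced in detail.

```python
def scalability(patterns):
    """Guttman scalogram statistics for pass/fail patterns.

    `patterns` is a list of lists of 0/1 values, one list per learner,
    with levels ordered from easiest (most frequent words) to hardest.
    Errors are deviations from the ideal pattern implied by each
    learner's total number of passes (Goodenough-Edwards counting).
    """
    n_learners = len(patterns)
    n_levels = len(patterns[0])

    errors = 0
    for p in patterns:
        total = sum(p)
        ideal = [1] * total + [0] * (n_levels - total)
        errors += sum(1 for a, b in zip(p, ideal) if a != b)

    n_responses = n_learners * n_levels
    rep = 1 - errors / n_responses        # coefficient of reproducibility

    # Minimum marginal reproducibility: what could be "reproduced"
    # from the level pass-rates alone, ignoring individual patterns.
    mmr = sum(max(list(col).count(1), list(col).count(0))
              for col in zip(*patterns)) / n_responses

    scal = (rep - mmr) / (1 - mmr)        # coefficient of scalability
    return rep, mmr, scal

# Tiny illustrative data set: 1 = reached the 16/18 criterion at that level.
data = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],   # one deviation from a perfect scale
]
print(scalability(data))
```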
New
versions
Schmitt
(1993) wrote three new forms of the test, following the original specifications and taking fresh samples of words for each level. This new material was used by Beglar and Hunt (1999), but they concentrated on the 2000- and University-word levels, treating them as separate tests. According to Laufer's (1992; 1997a) work, these levels correspond to a knowledge of 3000 word families, which is the approximate threshold required to be able to read academic texts in English relatively independently. Beglar and Hunt administered all four forms of either the 2000-word-level or the University-word-level tests to nearly 1000 learners of English in secondary and tertiary institutions in Japan. Based on the results of this trial, they selected 54 of the best-performing items to produce two new 27-item tests for each level. The two pairs of tests were then equated statistically, so that they could function as equivalent measures of learners' vocabulary knowledge at the two frequency levels.
Laufer
herself (Laufer, 1998; Laufer and Nation, 1999) describes the blank-filling version of the test as a measure of 'active' or 'productive' vocabulary knowledge (she uses the two terms interchangeably in her articles). In the studies just cited, she distinguishes it from the two other measures used, as follows:
- Levels Test - original matching version: receptive knowledge;
- Levels Test - blank-filling version: controlled productive knowledge; and
- Lexical Frequency Profile (LFP): free productive knowledge.
According
to this classification, the blank-filling version is associated with the ability to use vocabulary in writing (which is what the LFP sets out to measure), and in fact Laufer and Nation (1999) assert that the blank-filling test scores can be interpreted in terms of the approximate number of words at a particular frequency level which are 'available for productive use' (1999: 41). However, they have limited evidence to support this interpretation. In one study (Laufer and Nation, 1995: 317), they found some moderate correlations between sections of the blank-filling Levels Test and corresponding sections of the LFP. On the other hand, Laufer's (1998: 264) intercorrelations of the three measures produced some contradictory evidence. Whereas there were substantial correlations between the two versions of the Levels Test, there was no significant relationship between either version and a measure based on the LFP.
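For readers unfamiliar with the Lexical Frequency Profile, it essentially classifies the words of a learner's composition into frequency bands and reports the proportion of the text falling into each band. The sketch below is a much simplified, hypothetical version: the miniature band lists are invented for illustration, and the published LFP (Laufer and Nation, 1995) works with word families and standard frequency lists rather than raw word forms.

```python
import re
from collections import Counter

def lexical_frequency_profile(text, bands):
    """Very simplified LFP: percentage of word tokens in each frequency band.

    `bands` maps a band name (e.g. 'first_2000') to a set of word forms.
    The published LFP uses word families and standard lists; treating raw
    lower-cased forms as the unit is a simplification for illustration."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for name, wordset in bands.items():
            if tok in wordset:
                counts[name] += 1
                break
        else:
            counts["not_in_lists"] += 1
    total = len(tokens)
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# Hypothetical miniature band lists, for demonstration only.
bands = {
    "first_2000": {"the", "a", "of", "students", "read", "many", "words",
                   "in", "their", "own", "writing", "is", "but"},
    "academic":   {"acquisition", "hypothesis", "data"},
}
essay = "Students read many words but vocabulary acquisition in their own writing is slower"
print(lexical_frequency_profile(essay, bands))
```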
The
Eurocentres Vocabulary Size Test
Like the
Vocabulary Levels Test, the Eurocentres Vocabulary Size Test (EVST) makes an estimate of a learner's vocabulary size using a graded sample of words covering numerous frequency levels. A distinctive feature of the EVST is that it is administered by computer rather than as a pen-and-paper test. Let us now look at the test from two perspectives: first as a placement instrument and then as a measure of vocabulary size.
The
EVST as a placement test
The
original work on the test was carried out at Birkbeck College, University of London by Paul Meara and his associates (Meara and Buxton, 1987; Meara and Jones, 1988). In its published form (Meara and Jones, 1990a), the test was commissioned by Eurocentres, a network of language schools in various European countries, including the UK.
Meara and
Jones (1988) report a validation study of the EVST in language schools in Cambridge and London using two criterion measures. First, the scores were correlated with the scores obtained on the existing Eurocentres placement test, which yielded an overall coefficient of 0.664 for the Cambridge learners and 0.717 for those in London. The second approach was to review - after one week of classes - the cases of students who had apparently been assigned to the wrong class by the existing placement test.
The
EVST as a measure of vocabulary size
If the
Eurocentres test is to have a wider application than just as a placement tool for language schools, we also need to consider its validity as a measure of vocabulary size, and for this we should look into various aspects of its design:
- the format of the test and, in particular, the role of the non-words;
- the selection of the words to be tested; and
- the scoring of the test.
The first
thing to consider is the test format, which allows a large number of words to be covered within a short time. The second aspect of the design that we need to look at is the selection of the words for the test. And we come now to the scoring of the test. A straightforward way to produce a score would be simply to take the number of 'Yes' responses to the real words and calculate what proportion of the total inventory of 10,000 words they represent.
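The straightforward scoring method just described is easy to express in code. The sketch below is only an illustration: it projects the proportion of 'Yes' responses to real words onto the 10,000-word inventory, and it also reports the rate of 'Yes' responses to the non-words, which the operational test uses as a check on the trustworthiness of the self-report (the actual adjustment formula used in the EVST is not shown here).

```python
def evst_raw_estimate(responses, total_inventory=10_000):
    """Estimate vocabulary size from yes/no checklist responses.

    `responses` is a list of (is_real_word, said_yes) pairs. Returns the
    simple projection described in the text plus the false-alarm rate on
    non-words, which can be used to temper over-optimistic self-reports
    (the operational adjustment is not shown)."""
    real = [yes for is_real, yes in responses if is_real]
    fake = [yes for is_real, yes in responses if not is_real]

    hit_rate = sum(real) / len(real)
    false_alarm_rate = sum(fake) / len(fake) if fake else 0.0

    return {
        "estimated_size": round(hit_rate * total_inventory),
        "false_alarm_rate": round(false_alarm_rate, 2),
    }

# Three real words (two claimed as known) and two non-words (one wrongly claimed).
sample = [(True, True), (True, True), (True, False), (False, False), (False, True)]
print(evst_raw_estimate(sample))
```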
The Vocabulary
Knowledge Scale
The
instrument is of interest not only as a test in its own right but also as a way of exploring some issues that arise in any attempt to measure quality of vocabulary knowledge in a practical manner.
- The design of the scale
The VKS is a generic instrument, which can be used with any set of words that the tester or researcher is interested in assessing. It consists in effect of two scales: one for eliciting responses from the test-takers and one for scoring the responses.
- Use of the VKS
Paribakht and Wesche have used the VKS as a
tool in their research on vocabulary acquisition in an English language programme for non-native-speaking undergraduate students at the University of Ottawa. The courses in the programme, which focuses on comprehension skills, use authentic spoken and written materials linked to particular themes such as media, the environment and fitness. In their first study, Paribakht and Wesche (1993) selected two themes as the basis for a study of incidental vocabulary learning, one theme being actually used in class and the other not.
In the
second study (Paribakht and Wesche, 1997), the researchers compared two approaches to the acquisition of theme-related vocabulary through reading. One, called Reading Plus, added to the main reading activities a series of vocabulary exercises using target content words from the themes. The other, Reading Only, supplemented the main readings with further texts and comprehension exercises. As in the earlier study, the VKS was administered as a pre-test and a post-test to measure acquisition of the target words during the course.
- Evaluation of the instrument
Paribakht
and Wesche have been careful to make modest claims for their instrument: 'Its
purpose is not to estimate general vocabulary knowledge, but rather to track
the early development of specific words in an instructional or experimental
situation' (Wesche and Paribakht, 1996: 33). They have obtained various kinds
of evidence in their research for its reliability and validity as a measure of
incidental vocabulary acquisition (Wesche and Paribakht, 1996: 31-33). To
estimate reliability, they administered the VKS to a group of students twice
within two weeks and there was a high level of consistency in the students'
responses to the 24 content words tested (the correlation was 0.89). They have
found a strong relationship (with correlations of 0.92 to 0.97) between the way
that the students rated themselves on the elicitation scale and the way that
their responses were scored, which suggests that the students reported their
level of knowledge of the target words reasonably accurately.
The
Test of English as a Foreign Language
The Test
of English as a Foreign Language, or TOEFL, is administered in 180 countries and territories to more than 900,000 candidates. Like other ETS tests, TOEFL relies on sophisticated statistical analyses and testing technology in order to ensure its quality as a measuring instrument and its efficient administration to such large numbers of test-takers. Until recently, all the items in the basic TOEFL test have been of the multiple-choice type.
The
primary purpose of TOEFL is to assess whether foreign students planning to study in a tertiary institution where English is the medium of instruction have a sufficient level of proficiency in the language to be able to undertake their academic studies without being hampered by language-related difficulties.
From the
viewpoint of vocabulary assessment, the history of the TOEFL programme represents a fascinating case study of how approaches to testing have changed in the latter part of the twentieth century. In particular, vocabulary testing has become progressively more embedded and context-dependent as a result of successive revisions of the test battery during that period.
- The original vocabulary items
From its
beginning in 1964 until the mid-1970s, TOEFL consisted of five sections: listening comprehension, English structure, vocabulary, reading comprehension and writing ability. There were two types of vocabulary-test item, which were labelled sentence completion and synonym matching.
- The 1976 revision
In his
study involving students in Peru, Chile and Japan, Pike found that, although
the existing Vocabulary section of the test correlated highly (at 0.88 to 0.95)
with the Reading Comprehension section, the new Words in Context items had even
higher correlations (0.94 to 0.99) with the reading section of the experimental
test. These results suggested that Pike had achieved his objective of creating
a new vocabulary test format that simulated more closely the experience of
readers encountering words in context. However, they also raised the intriguing
question of whether both vocabulary and reading comprehension items were needed
in the test, and if not, which of the two could be dispensed with. There were
arguments both ways:
- The vocabulary items formed a very efficient section of the test, in that they achieved a very high level of reliability within a short period of testing time.
- On the other hand, reading is such a crucial skill in university study that it would have seemed very strange to have a test of English for academic purposes that did not require the students to demonstrate their ability to understand written texts.
- In the end, Pike recommended a compromise solution by which both the words-in-context vocabulary items and the reading comprehension items (based on short passages) were included in a new combined section of the test. Pike's recommendation was accepted and implemented in operational versions of the test from 1976 until 1995.
Towards
more contextualised testing
At a
conference convened by the TOEFL Program in 1984, a number of applied linguists were invited to present critical reviews of the extent to which TOEFL could be considered a measure of communicative competence. Bachman observed that the vocabulary items 'would appear to suffer from misguided attempts at contextualization' (1986: 81), because the contextual information in the stem sentence was hardly ever required to answer the item correctly. In my terms, Bachman was arguing that, despite their label, the test items were essentially context-independent.
The other
study was by Henning (1991). In order to investigate the effects of
contextualisation, Henning used eight different formats, including the
then-current words-in-context item type. Essentially the formats varied along
three dimensions:
- the length of the stem, ranging from a single word to a whole reading passage;
- the 'inference-generating quality' of the stem sentence, i.e. the extent to which it provided clues to the meaning of the target word; and
- the inclusion or deletion of the target word: it was either included in the stem sentence or replaced by a blank.
- Vocabulary in the 1995 version
The
results of this in-house research provided support for recommendations from the TOEFL Committee of Examiners, an advisory body of scholars from outside ETS, that vocabulary knowledge should be assessed in a more integrative manner. Thus, in the 1995 revision of TOEFL, the separate set of vocabulary items was eliminated from the test. Instead, they were made an integral part of the reading comprehension section.
It is
interesting here to revisit the issue of context dependence by looking at an
item from the 1995 version of the test and considering what is involved in
responding to it.
Not all
the vocabulary items in the 1995 version of the test functioned quite like this. Their context dependence has to be individually evaluated, in much the same way that researchers have analysed what is involved in answering particular cloze-test items. However, in general the items appear to be significantly more context dependent.
- The current situation
The
latest development in the story occurred in 1998, with the introduction of a computerised
version of TOEFL. In most countries candidates now take the test at an
individual testing station, sitting at a computer and using the mouse to record
their responses to the items presented to them on the screen. For the reading
test, the passages appear in a frame on the left side of the screen and the
test items are shown one by one on the right side. Vocabulary items have been
retained but in a different form from before.
CHAPTER 6
The
design of discrete vocabulary tests
Introduction
The discussion
of vocabulary-test design in the first part of this chapter is based on the framework for language-test development presented in Bachman and Palmer's (1996) book Language Testing in Practice. Since the full framework is too complex to cover here, I have chosen certain key steps in the test-development process as the basis for a discussion of important issues in the design of discrete vocabulary tests in particular. In the second part of the chapter, I offer a practical perspective on the development of vocabulary tests by means of two examples. One looks at the preparation of classroom progress tests, and the other describes the process by which I developed the word-associates format as a measure of depth of vocabulary knowledge.
Test
Purpose
Following
Bachman and Palmer's framework, an essential first step in language-test design is to define the purpose of the test. It is important to clarify what the test will be used for because, according to testing theory, a test is valid to the extent that we are justified in drawing conclusions from its results.
We can identify three uses for language tests: for research, making decisions about learners and making decisions about language programmes. Second language vocabulary researchers have needed assessment instruments for studies on:
· how broad and deep learners' vocabulary knowledge is;
· how effective different methods of systematic vocabulary learning are;
· how incidental learning occurs through reading and listening activities;
· whether and how learners can infer the meanings of unknown words encountered in context; and
· how learners deal with gaps in their vocabulary knowledge.
On the
other hand, language teachers and testers who employ tests for making decisions
about learners have a different focus. They use tests for purposes such as
placement, diagnosis, measuring progress or achievement, and assessing
proficiency, as in these examples:
- The vocabulary section of a placement test battery can be designed to estimate how many high-frequency words the learners already know.
- A progress test assesses how well students have learned the words presented in the units they have recently studied in the coursebook. The purpose can be essentially a diagnostic one for the teacher, to identify words that require some further attention in class.
- In an achievement test, one section may be designed to assess how well the learners have mastered a vocabulary skill that they have been taught, such as the ability to figure out the meaning of unfamiliar lexical items in a text on the basis of contextual cues.
Construct
definition
Bachman
and Palmer (1996: 117-120) state that there are two approaches to construct definition: syllabus-based and theory-based. A syllabus-based definition is appropriate when vocabulary assessment takes place within a course of study, so that the lexical items and the vocabulary skills to be assessed can be specified in relation to the learning objectives of the course.
For
research purposes and in proficiency testing, the definition of the construct needs to be based on theory. One thing that makes construct definition rather difficult in the area of second language vocabulary is the variety of theoretical concepts and frameworks which scholars have proposed to account for vocabulary acquisition, knowledge and use. Another effort to achieve greater clarity and standardisation is represented by Henriksen's (1999) three dimensions of vocabulary.
Receptive
and productive vocabulary
This
distinction between receptive and productive vocabulary is one that is accepted by scholars working on both first and second language vocabulary development, and it is often referred to by the alternative terms passive and active. As Melka (1997) points out, though, there are still basic problems in conceptualising and measuring the two types of vocabulary, in spite of a lengthy history of research on the subject. The difficulty at the conceptual level is to find criteria for distinguishing words that have receptive status from those which are part of a person's productive vocabulary. It is generally assumed that words are known receptively first and only later become available for productive use.
From my own reading of the literature, I have concluded that one source of confusion about the distinction between receptive and productive measures is that many authors use two different ways of defining reception and production interchangeably. Because the difference in the two definitions is quite significant for assessment purposes, let me spell out each one by using other terms:
Recognition and recall
Recognition
here means that the test-takers are presented with the target word and are
asked to show that they understand its meaning, whereas in the case of recall
they are provided with some stimulus designed to elicit the target word from
their memory.
Comprehension and use
Comprehension here means that learners can understand a word when they encounter it in context while listening or reading, whereas use means that the word occurs in their own speech or writing.
Characteristics
of the test input
The
design of test tasks is the next step in test development, according to Bachman
and Palmer's model. In this chapter, I focus on just two aspects of task
design: characteristics of the input and characteristics of the expected
response.
Selection
of target words
Based on
such findings, Nation (1990: Chapter 2) proposes that, for teaching and learning purposes, a broad three-way division can be made into high-frequency, low-frequency and specialised vocabulary. The high-frequency category in English consists of 2000 word families, which form the foundation of the vocabulary knowledge that all proficient users of the language must have acquired.
On the other hand, low-frequency vocabulary as a whole is of much less value to learners. The low-frequency words that learners do know reflect the influence of a variety of personal and social variables:
- How widely they have read and listened;
- What their personal interests are;
- How much time they have devoted to intensive vocabulary learning;
- What their educational and professional background is;
- Which community or society they live in;
- What communicative purposes they use the language for; and so on.
This is where specialised vocabulary comes in. Specialised vocabulary is likely to be better acquired through content instruction by a subject teacher than through language teaching.
Presentation of words
Words in isolation
As with
other decisions in test design, the question of how to present selected words to the test-takers needs to be related to the purpose of the assessment. We have seen various uses of vocabulary tests earlier where no context is provided at all:
- In systematic vocabulary learning, students apply memorising techniques to sets of target words and their meanings (usually expressed as L1 equivalents).
- In tests of vocabulary size, such as the Vocabulary Levels Test and the Eurocentres Test (EVST), words are often presented in isolation.
- In research on incidental vocabulary learning, the learners encounter the target words in context during reading or listening activities, but in the test afterwards the words are presented separately, because the researcher is interested in whether the learners can show an understanding of the words when they occur without contextual support.
Words
in context
For other
purposes, the presentation of target words in some context is desirable or necessary. In discrete, selective tests, the context most commonly consists of a sentence in which the target word occurs, but it can also be a paragraph or a longer text containing a whole series of target words. Although it is taken for granted these days by many language teachers that words should always be learned in context, in designing a vocabulary measure it is important to consider what role is played by the context in the assessment of vocabulary knowledge or ability.
Characteristics
of the expected response
Self-report vs. verifiable response
In some
testing situations, it is appropriate to ask the learners to assess their own lexical knowledge. In Chapter 5, we saw how the EVST and the VKS draw on self-report by the test-takers, although both instruments also incorporate a way of checking how valid the responses are as measures of vocabulary knowledge. As I noted in the conclusion to that chapter, the purpose of the test is a major consideration in deciding whether self-assessment is appropriate.
Even in 'low-stakes' testing situations where self-report is used, it is desirable to have a means of verifying the test-takers' judgements about their knowledge of the target words. Another approach to verification is found in a test-your-own-vocabulary book for native speakers of English by Diack (1975). The book contains 50 tests, each consisting of 60 words that are listed in order from most frequent to least frequent.
Monolingual vs. bilingual testing
A last design consideration concerns the language of the test itself. Whereas in a monolingual test format only the target language is used, a bilingual one employs both the target language (L2) and the learners' own language (L1). This aspect of test design involves more than just the characteristics of the expected response, but I have chosen to deal with the choice of language in an integrated way in this section of the chapter.
I have already noted one type of bilingual test format in the previous section on receptive and productive vocabulary, where translation from L2 to L1 assesses receptive knowledge and L1 to L2 translation is the corresponding productive measure. Vocabulary tests for native speakers of English learning foreign languages commonly have a similar kind of bilingual structure.
Practical examples
Classroom progress tests
The
purpose of my class tests is generally to assess the learners' progress in
vocabulary learning and, more specifically, to give them an incentive to keep
studying vocabulary on a regular basis.
Matching items
There are some aspects of the design of this item type which are worth noting:
- The reason for adding one or two extra definitions is to avoid a situation where the learner knows four of the target words and can then get the fifth definition correct by process of elimination, without actually knowing what the word means.
- Assuming that the focus of the test is on knowledge of the target words, the definitions should be easy for the learners to understand. Thus, as a general principle, they should be composed of higher-frequency vocabulary than the words to be tested and should not be written in an elliptical style that causes comprehension problems.
- If the purpose of the test is just to assess knowledge of word meaning, then all of the target items should belong to the same word class - all nouns, all adjectives, etc. - and should not include structural clues. Otherwise, the learners may be able to match up words and definitions on the basis of form as well as meaning.
- One criticism that can be made of the standard matching task is that it presents each target word in isolation.
Completion
items
Completion,
or blank-filling, items consist of a sentence from which the target word has
been deleted and replaced by a blank. As in the contextualised matching format
above, the function of the sentence is to provide a context for the word and
perhaps to cue a particular use of it. However, this is a recall task rather
than simply a recognition one because the learners have to supply the target
words from memory.
Generic
test items
In an
individualised vocabulary programme, these generic items offer a practical alternative to having separate tests for each learner in the class. The same item types could also be used more conventionally, with target words provided by the teacher, in a class where the learners have all studied the same vocabulary.
- Testing depth of vocabulary knowledge
The
word-associates test
The new
starting point was the concept of word association. The standard
word-association task involves presenting subjects with a set of stimulus words
one by one and asking them to say the first related word that comes into their
head.
CHAPTER 7
Comprehensive
measures of vocabulary
Introduction
Comprehensive
measures are particularly suitable for assessment procedures in which vocabulary is embedded as one component of the measurement of a larger construct, such as communicative competence in speaking, academic writing ability or listening comprehension. However, we cannot simply say that all comprehensive measures are embedded ones, because they can also be used on a discrete basis.
Measures
of test input
In
reading and listening tests we have to be concerned about the nature of the input text. At least two questions can be asked:
- Is it at a suitable level of difficulty that matches the ability range of the test-takers?
- Does it have the characteristics of an authentic text, especially if it has been produced or modified for use in the test?
Here we are specifically interested in the extent to which information about the vocabulary of the text can help to provide answers to these questions.
Readability
In L1 reading research, the basic concept used in the analysis of texts is readability, which refers to the various aspects of a text that are likely to make it easy or difficult for a reader to understand and enjoy. During the twentieth century a whole educational enterprise grew up in the United States devoted to devising and applying formulas to predict the readability of English texts for native-speaker readers in terms of school grade or age levels (for a comprehensive review, see Klare, 1984).
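As an illustration of the kind of formula involved, the sketch below computes the well-known Flesch Reading Ease score, which combines average sentence length with average syllables per word. The syllable counter is a crude vowel-group heuristic rather than the dictionary-based counting used in published readability research, so the result should be treated as approximate.

```python
import re

def count_syllables(word):
    """Crude heuristic: count vowel groups, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
    Higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

sample = "The cat sat on the mat. Readability formulas estimate how difficult a text is."
print(round(flesch_reading_ease(sample), 1))
```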
Listenability of spoken texts
Much more
work has been done on the comprehensibility of written texts than of spoken language. Whereas the term readability is now very well established, its oral equivalent, listenability, has had only limited currency. However, it seems reasonable to expect that spoken texts used for the assessment of listening comprehension vary in the demands that they place on listeners in comparable ways to the demands made of readers by different kinds of written language. One way to evaluate the input of a listening test, then, would be simply to treat it as a written text and apply a readability formula to determine its suitability for the target population of test-takers.
Measures of learner production
Most of
this section is concerned with statistical measures of writing, because there is more published research on that topic, but I also consider measures of spoken production, as well as the more qualitative approach to assessment represented by the use of rating scales to judge performance.
The
statistical measures could provide one kind of evidence for the validity of analytic ratings. Of course this does not necessarily mean that the statistical calculations can capture all the relevant aspects of language use, or that the subjective ratings are invalid if they turn out to be inconsistent with the statistical measures. Both kinds of evidence are needed for validating the assessment of learner performance in speaking and writing tasks.
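One simple family of statistical measures of learner writing is based on lexical diversity, such as the type-token ratio and its length-corrected variants. The sketch below is a generic illustration of that idea rather than a measure discussed at this point in the book; the segment-averaged version is one common way of reducing the well-known effect of text length on the ratio.

```python
import re

def type_token_ratio(tokens):
    """Proportion of distinct word forms (types) among all tokens."""
    return len(set(tokens)) / len(tokens)

def standardised_ttr(text, segment=50):
    """Mean TTR over consecutive fixed-length segments, a common way of
    making the measure less sensitive to text length."""
    tokens = re.findall(r"[a-z']+", text.lower())
    chunks = [tokens[i:i + segment] for i in range(0, len(tokens), segment)]
    chunks = [c for c in chunks if len(c) == segment] or [tokens]  # fall back for short texts
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)

essay = ("Learners who read widely tend to use a wider range of words "
         "in their writing than learners who read very little.")
print(round(standardised_ttr(essay), 2))
```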
CHAPTER 8
Further
developments in vocabulary assessment
Introduction
In the main body of this chapter, I want to review some current areas of work on second language vocabulary, which will provide additional evidence for my view that a wider perspective is required, and then explore the implications for further developments in vocabulary assessment in the future.
The identification of lexical units
One basic
requirement for any work on vocabulary is good quality information about the units that we are dealing with. In this section of the chapter, I first review the current state of word-frequency lists and then take up the question of how we might deal with multi-word lexical items in vocabulary assessment.
The vocabulary of informal speech
The
vocabulary of informal speech is the second area of vocabulary study that has received less attention than it should have, as indicated by the fact that perhaps the most frequently cited research study is the one conducted by Schonell et al. (1956) in the 1950s on the spoken vocabulary of Australian workers.
The social dimension of vocabulary use
In
addition, vocabulary knowledge and use are typically thought of in psycholinguistic terms, which minimises the existence of social variation among learners, apart from the fact that they undertake various courses of study, pursue different careers and have a range of personal interests.
For
assessment purposes, the education domain is obviously an area of major concern, especially when there is evidence that learners from particular social backgrounds lack the opportunity to acquire the vocabulary they need for academic study.
References
Purpura, James E. 2004. Assessing Grammar. Cambridge: Cambridge University Press.
Read, John. 2000. Assessing Vocabulary. Cambridge: Cambridge University Press.