SUMMARY OF THE
ASSESSING GRAMMAR BOOK
Chapter one
Differing notions of ‘grammar’ for assessment
Grammar was used to mean the analysis of a language system,
and the study of grammar was not just considered an essential feature of
language learning, but was thought to be sufficient for learners to actually
acquire another language (Rutherford, 1988). Grammar in and of itself was
deemed to be worthy of study to the
extent that in the Middle Ages in Europe, it was thought to be the foundation
of all knowledge and the gateway to sacred and secular understanding (Hillocks
and Smith, 1991). Thus, the central role of grammar in language teaching
remained relatively uncontested until the late twentieth century. Even a few
decades ago, it would have been hard to imagine language instruction without
immediately thinking of grammar.
While the central role of grammar in the language
curriculum has remained unquestioned until recent times, grammar pedagogy has unsurprisingly
been the source of much debate. For example, some language educators have
argued that foreign languages are best learned deductively, where students are
asked to memorize and recite definitions, rules, examples and exceptions.
What is meant by ‘grammar’ in theories of language?
Grammar and linguistics
Clarifying this question is important given the different definitions and
conceptualizations of grammar that have been proposed over the years, and the
diverse ways in which these notions of grammar have influenced L2 educators.
When most language teachers, second language
acquisition (SLA) researchers and language testers think of ‘grammar’, they
call to mind one of the many paradigms (e.g., ‘traditional grammar’ or
‘universal grammar’) available for the study and analysis of language. Such
linguistic grammars are typically derived from data taken from native speakers
and minimally constructed to describe well-formed utterances within an
individual framework. These grammars strive for internal consistency and are
mainly accessible to those who have been trained in that particular paradigm.
Since the 1950s, there have been many such linguistic
theories – too numerous to list here – that have been proposed to explain
language phenomena. Many of these theories have helped shape how L2 educators currently
define grammar in educational contexts. Although it is beyond the purview of
this book to provide a comprehensive review of these theories, it is,
nonetheless, helpful to mention a few, considering both the impact they have
had on L2 education and the role they play in helping define grammar for
assessment purposes.
Form-based perspectives of language
Several syntactocentric, or form-based, theories of
language have provided grammatical insights to L2 teachers. Three are
particularly noteworthy: traditional grammar, structural linguistics and
transformational-generative grammar.
Form- and use-based perspectives of language
The three theories of linguistic analysis described
thus far have provided insights to L2 educators on several grammatical forms.
These insights provide information to explain what structures are theoretically
possible in a language. Other linguistic theories, however, are better equipped
to examine how speakers and writers actually exploit linguistic forms during
language use. For example, if we wish to explain how seemingly similar
structures like I like to read and I like reading connote different meanings,
we might turn to those theories that study grammatical form and use interfaces.
This would address questions such as: Why does a language need two or more
structures that are similar in meaning? Are similar forms used to convey
different specialized meanings? To what degree are similar forms a function of
written versus spoken language, or to what degree are these forms
characteristic of a particular social group or a specific situation? It is
important for us to discuss these questions briefly if we ultimately wish to
test grammatical forms along with their meanings and uses in context.
One approach to linguistic analysis that has
contributed greatly to our understanding of the grammatical forms found in
language use, as well as the contextual factors that influence the variability
of these forms, is corpus linguistics.
Communication-based perspectives of language
Other theories have provided grammatical insights from
a communication- based perspective. Such a perspective expresses the notion
that language involves more than linguistic form. It moves beyond the view of
language as patterns of morphosyntax observed within relatively
decontextualized sentences or sentences found within naturally occurring corpora.
Rather, a communication-based perspective views grammar as a set of linguistic
norms, preferences and expectations that an individual invokes to convey a host
of pragmatic meanings that are appropriate, acceptable and natural depending on
the situation.
What is pedagogical grammar?
A pedagogical grammar represents an eclectic, but
principled description of the target-language forms, created for the express purpose
of helping teachers understand the linguistic resources of communication. These
grammars provide information about how language is organized and offer
relatively accessible ways of describing complex, linguistic phenomena for
pedagogical purposes. The more L2 teachers understand how the grammatical
system works, the better they will be able to tailor this information to their
specific instructional contexts.
Recently, there have been some comprehensive, formal
attempts at interpreting linguistic theories for the purposes of teaching (or
testing) grammar. One of these formal pedagogical grammars of English is The Grammar
Book, published by Celce-Murcia and Larsen-Freeman (1999). These authors used
transformational-generative grammar as an organizing framework for the study of
the English language. However, in the tradition of pedagogical grammars, they
also invoked other linguistic theories and methods of analysis to explain the
workings of grammatical form, meaning and use when a specific grammar point was
not amenable to a transformational-generative analysis. For example, to explain
the form and meanings of prepositions, they drew upon case grammar (Fillmore,
1968) and to describe the English tense-aspect system at the semantic level,
they referred to Bull’s (1960) framework relating tense to time. Celce-Murcia
and Larsen-Freeman’s (1999) book and other useful pedagogical English grammars
(e.g., Swan, 1995; Azar, 1998) provide teachers and testers alike with
pedagogically oriented grammars that are an invaluable resource for organizing
grammar content for instruction and assessment.
Chapter two
Research on L2 grammar teaching, learning and
assessment
I will discuss the research on L2 grammar teaching and
learning and show how this research has important insights for language
teachers and testers wanting to assess L2 grammatical ability. Similarly, I
will discuss the critical role that assessment has played in empirical inquiry
on L2 grammar teaching and learning.
Research on L2 teaching and learning
Over the years, several of the questions mentioned
above have intrigued language teachers, inspiring them to experiment with
different methods, approaches and techniques in the teaching of grammar. To
determine if students had actually learned under the different conditions,
teachers have used diverse forms of assessment and drawn their own conclusions
about their students. In so doing, these teachers have acquired a considerable
amount of anecdotal evidence on the strengths and weaknesses of using different
practices to implement L2 grammar instruction. These experiences have led most
teachers nowadays to subscribe to an eclectic approach to grammar instruction,
whereby they draw upon a variety of different instructional techniques,
depending on the individual needs, goals and learning styles of their students.
In recent years, some of these same questions have
been addressed by second language acquisition (SLA) researchers in a variety of
empirically based studies. These studies have principally focused on a description
of how a learner’s interlanguage (Selinker, 1972), or L2 system,
develops over time and on the effects that L2 instruction may have on this
progression. In most of these studies, researchers have investigated the
effects of learning grammatical forms by means of one or more assessment tasks.
Based on the conclusions drawn from these assessments, SLA researchers have
gained a much better understanding of how grammar instruction impacts both
language learning in general and grammar learning in particular. However, in
far too many SLA studies, the ability under investigation has been poorly
defined or defined with no relation to a model of L2 grammatical ability.
The SLA research looking at the role of grammar
instruction in SLA might be categorized into three strands. One set of studies
has looked at the relationship between the acquisition of L2 grammatical
knowledge and different language-teaching methods. These are referred to as the
comparative methods studies. A second set of studies has examined the
acquisition of L2 grammatical knowledge through what Long and Robinson (1998)
call a ‘non-interventionist’ approach to instruction. These studies have
examined the degree to which grammatical ability could be acquired incidentally
(while doing something else) or implicitly (without awareness), and not
through explicit (with awareness) grammar instruction. A third set of
studies has investigated the relationship between explicit grammar instruction
and the acquisition of L2 grammatical ability. These are referred to as the interventionist
studies, and are a topic of particular interest to language teachers and
testers.
Comparative methods studies
The comparative methods studies sought to compare the
effects of different language-teaching methods on the acquisition of an L2.
These studies occurred principally in the 1960s and 1970s, and stemmed from a
reaction to the grammar-translation method, which had dominated language
instruction during the first half of the twentieth century. More generally,
these studies were in reaction to form-focused instruction (referred to as
‘focus on forms’ by Long, 1991), which used a traditional structural syllabus
of grammatical forms as the organizing principle for L2 instruction. According
to Ellis (1997), form-focused instruction contrasts with meaning-focused
instruction in that meaning-focused instruction emphasizes the communication of
messages (i.e., the act of making a suggestion and the content of such a
suggestion) while form-focused instruction stresses the learning of linguistic
forms. These can be further contrasted with form-and-meaning focused
instruction (referred to by Long (1991) as ‘focus-on-form’), where grammar
instruction occurs in a meaning-based environment and where learners strive to communicate
meaning while paying attention to form.
Non-interventionist studies
While some language educators were examining different
methods of teaching grammar in the 1960s, others were feeling a growing sense
of dissatisfaction with the central role of grammar in the L2 curriculum. As a
result, questions regarding the centrality of grammar were again raised by a
small group of L2 teachers and syllabus designers who felt that the teaching of
grammar in any form simply did not produce the desired classroom results.
Newmark (1966), in fact, asserted that grammatical analysis and the systematic
practice of grammatical forms were actually interfering with the process of L2
learning, rather than promoting it, and if left uninterrupted, second language
acquisition, similar to first language acquisition, would proceed naturally.
Empirical studies in support of non-intervention
The non-interventionist position was examined
empirically by Prabhu (1987) in a project known as the Communicational Teaching
Project (CTP) in southern India. This study sought to demonstrate that the
development of grammatical ability could be achieved through a task-based,
rather than a form-focused, approach to language teaching, provided that the
tasks required learners to engage in meaningful communication. In the CTP,
Prabhu (1987) argued against the notion that the development of grammatical
ability depended on a systematic presentation of grammar followed by planned
practice.
Possible implications of fixed developmental order for language assessment
The notion that structures appear to be acquired in a
fixed developmental order and in a fixed developmental sequence might
conceivably have some relevance to the assessment of grammatical ability. First
of all, these findings could give language testers an empirical basis for
constructing grammar tests that would account for the variability inherent in a
learner’s interlanguage. In other words, information on the acquisitional order
of grammatical items could conceivably serve as a basis for selecting
grammatical content for tests that aim to measure different levels of developmental
progression, as Chang (2002, 2004) did in examining the underlying
structure of a test that attempted to measure knowledge of relative
clauses. These findings also suggest a substantive approach to defining test
tasks according to developmental order and sequence on the basis of how
grammatical features are acquired over time (Ellis, 2001b). In other words, one
task could potentially tap into developmental level one, while another taps
into developmental level two, and so forth.
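To make this idea concrete, the following is a minimal Python sketch (not taken from the source) of how tasks keyed to developmental levels might be used to estimate a learner’s stage. The task names, the four-level scale and the simple implicational assumption (passing level n implies control of all lower levels) are invented purely for illustration.

# A minimal sketch (not from the source) of mapping task results onto
# hypothetical developmental levels. Task names and levels are invented.
# It assumes an implicational pattern: a learner credited with level n is
# assumed to control all lower levels as well.
TASK_LEVELS = {
    "canonical_word_order": 1,
    "adverb_fronting": 2,
    "yes_no_inversion": 3,
    "wh_inversion": 4,
}

def estimate_level(results: dict[str, bool]) -> int:
    """Return the highest level for which this and all lower levels were passed."""
    passed_levels = {TASK_LEVELS[task] for task, passed in results.items() if passed}
    level = 0
    while level + 1 in passed_levels:
        level += 1
    return level

if __name__ == "__main__":
    results = {
        "canonical_word_order": True,
        "adverb_fronting": True,
        "yes_no_inversion": False,
        "wh_inversion": False,
    }
    print(estimate_level(results))  # -> 2

As the next subsection makes clear, any such mapping would rest on acquisitional-sequence research that is still incomplete, so a sketch like this should be read as an illustration of the idea rather than a recommended scoring procedure.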
Problems with the use of developmental sequences as a basis for assessment
Although developmental sequence research offers an
intuitively appealing complement to accuracy-based assessments in terms of
interpreting test scores, this method is fraught with a number of serious
problems, and language educators should use extreme caution in applying this
method to language testing. This is because our understanding of natural
acquisitional sequences is incomplete and at too early a stage of research to
be the basis for concrete assessment recommendations (Lightbown, 1985; Hudson,
1993).
Interventionist studies
Not all L2 educators are in agreement with the
non-interventionist position on grammar instruction. In fact, several (e.g.,
Schmidt, 1983; Swain, 1991) have maintained that although some L2 learners are
successful in acquiring selected linguistic features without explicit grammar
instruction, the majority fail to do so.
Empirical studies in support of intervention
Aside from anecdotal evidence, the non-interventionist
position has come under intense attack on both theoretical and empirical
grounds, with several SLA researchers affirming that efforts to teach L2 grammar
typically result in the development of L2 grammatical ability. Hulstijn (1989)
and Alanen (1995) investigated the effectiveness of L2 grammar instruction on
SLA in comparison with no formal instruction. They found that when coupled with
meaning-focused instruction, the formal instruction of grammar appears to be
more effective than exposure to meaning or form alone. Long (1991) also argued
for a focus on both meaning and form in classrooms that are organized around
meaningful and sustained communicative interaction. He maintained that the
focus on grammar in communicative interaction serves as an aid to clarity and
precision.
Research on instructional techniques and their effects
on acquisition
Much of the recent research on teaching grammar has
focused on four types of instructional techniques and their effects on
acquisition. Although a complete discussion of teaching interventions is
outside the purview of this book (see Ellis, 1997; Doughty and Williams, 1998),
these techniques include form- or rule-based techniques, input-based
techniques, feedback-based techniques and practice-based techniques (Norris and
Ortega, 2000).
Grammar processing and second language development
In the grammar-learning process, explicit
grammatical knowledge refers to a conscious knowledge of grammatical forms
and their meanings. Explicit knowledge is usually accessed slowly, even when it
is almost fully automatized (Ellis, 2001b). DeKeyser (1995) characterizes
grammatical instruction as ‘explicit’ when it involves the explanation of a
rule or the request to focus on a grammatical feature. Instruction can be explicitly
deductive, where learners are given rules and asked to apply them, or explicitly
inductive, where they are given samples of language from which to generate
rules and make generalizations. Similarly, many types of language test tasks
(e.g., gap-filling tasks) seem to measure explicit grammatical knowledge.
Implicit grammatical knowledge refers
to ‘the knowledge of a language that is typically manifest in some form of
naturally occurring language behavior such as conversation’ (Ellis, 2001b, p.
252). In terms of processing time, it is unconscious and is accessed quickly.
DeKeyser (1995) classifies grammatical instruction as implicit when it does not
involve rule presentation or a request to focus on form in the input; rather,
implicit grammatical instruction involves semantic processing of the input
without any degree of awareness of grammatical form.
Implications for assessing grammar
The studies investigating the effects of teaching and
learning on grammatical performance present a number of challenges for language
assessment. First of all, the notion that grammatical knowledge structures can
be differentiated according to whether they are fully automatized (i.e.,
implicit) or not (i.e., explicit) raises important questions for the testing of
grammatical ability (Ellis, 2001b). Given the many purposes of assessment, we
might wish to test explicit knowledge of grammar, implicit knowledge of grammar
or both. For example, in certain classroom contexts, we might want to assess
the learners’ explicit knowledge of one or more grammatical forms, and could,
therefore, ask learners to answer multiple-choice or short-answer questions
related to these forms. The information from these assessments would show how
well students could apply the forms in contexts where fluent and spontaneous
language use is not required and where time could be taken to figure out the
answers. Inferences from the results of these assessments could be useful for
teachers wishing to determine if their students have mastered certain
grammatical forms. However, as teachers are well aware, this type of assessment
would not necessarily show that the students had actually internalized the
grammatical forms so as to be able to use them automatically in spontaneous or
unplanned discourse. To obtain information on the students’ implicit knowledge
of grammatical forms, testers would need to create tasks designed to elicit the
fluent and spontaneous use of grammatical forms in situations where automatic
language use was required. In other words, to infer that students could
understand and produce grammar in spontaneous speech, testers would need to
present students with tasks that elicit comprehension or full production in
real time (e.g., listening and speaking). Ellis (2001b) suggests that we also
need to utilize time pressure as a means of ensuring that implicit knowledge is
being tested. Although this idea is interesting, the introduction of speed into
an assessment should be done with caution since it is often difficult to
determine the impact of speed on the test taker. In effect, speed may simply
produce a heightened sense of test anxiety, thereby introducing irrelevant
variability in the test scores. If this were the case, speed would not
necessarily provide an effective means of eliciting automatic grammatical
ability. In my opinion, comprehensive assessments of grammatical ability should
attempt to test students on both their explicit and their implicit knowledge of
grammar.
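As an illustration of the time-pressure idea mentioned above, here is a minimal Python sketch (not from the source) of a single timed item. The prompt, the answer key and the five-second limit are all invented, and a real speeded task would be delivered under far more controlled conditions; the sketch only shows how a response-time record could separate quick, presumably automatized responses from slow, deliberate ones.

# A minimal sketch (not from the source) of adding time pressure to one item.
# The 5-second limit is an arbitrary illustration, not a recommended value.
import time

def timed_item(prompt: str, expected: str, time_limit: float = 5.0) -> dict:
    """Administer one item at the command line and record the response time."""
    start = time.monotonic()
    answer = input(prompt + " ")
    elapsed = time.monotonic() - start
    return {
        "correct": answer.strip().lower() == expected.lower(),
        "elapsed_seconds": round(elapsed, 2),
        # Only responses produced within the limit would be treated as
        # possible evidence of implicit (automatized) knowledge.
        "within_limit": elapsed <= time_limit,
    }

if __name__ == "__main__":
    result = timed_item("She ___ to school every day. (go)", "goes")
    print(result)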
Chapter three
The role
of grammar in models of communicative language ability
In this chapter I will
discuss the role that grammar plays in models of communicative competence. I
will then endeavor to define grammar for assessment purposes. In this
discussion I will describe in some detail the relationships among grammatical
form, grammatical meaning and pragmatic meaning. Finally, I will present a
theoretical model of grammar that will be used in this book as a basis for a
model of grammatical knowledge. This will, in turn, be the basis for
grammar-test construction and validation. In the following chapter I will
discuss what it means for L2 learners to have grammatical ability.
The role
of grammar in models of communicative competence
In sum, many different
models of communicative competence have emerged over the years. The more recent
depictions have presented much broader conceptualizations of communicative
language ability; however, definitions of
grammatical knowledge have remained more or less the same – morphosyntax. Also,
within these expanded models, more detailed specifications are needed for how
grammatical form might interact with grammatical meaning to communicate literal
and intended meanings, and how form and meaning relate to the ability to convey
pragmatic meanings. If our assessment goal were limited to an understanding of
how learners have mastered grammatical forms, then the current models of
grammatical knowledge would suffice. However, if we hope to understand how
learners use grammatical forms as a resource for conveying a variety of
meanings in language-acquisition, -assessment and -use situations, as I think
we do, then a definition of grammatical knowledge which addresses these other
dimensions of grammatical ability is needed.
What is meant by ‘grammar’ for assessment purposes?
Now with a better understanding of how grammar has
been conceptualized in models of language ability, how might we define ‘grammar’
for assessment purposes? It should be obvious from the previous discussion that
there is no one ‘right’ way to define grammar. In one testing situation the
assessment goal might be to obtain information on students’ knowledge of
linguistic forms in minimally contextualized sentences, while in another, it
might be to determine how well learners can use linguistic forms to express a
wide range of communicative meanings. Regardless of the assessment purpose, if
we wish to make inferences about grammatical ability on the basis of a grammar
test or some other form of assessment, it is important to know what we mean by
‘grammar’ when attempting to specify components of grammatical knowledge for
measurement purposes. With this goal in mind, we need a definition of
grammatical knowledge that is broad enough to provide a theoretical basis for the
construction and validation of tests in a number of contexts. At the same time,
we need our definition to be precise enough to distinguish it from other areas
of language ability.
From a theoretical perspective, the main goal of
language use is communication, whether it be used to transmit information, to
perform transactions, to establish and maintain social relations, to construct
one’s identity or to communicate one’s intentions, attitudes or hypotheses.
As the primary resource for communication, language knowledge consists of
grammatical knowledge and pragmatic knowledge. Therefore, I propose a
theoretical definition of language knowledge that consists of two distinct, but
related, components. I will refer to one component as grammatical knowledge and
to the other as pragmatic knowledge.
Chapter four
Towards a definition of grammatical ability
What is meant by grammatical ability?
Having described how grammar has been conceptualized,
we are now faced with the challenge of defining what it means to ‘know’ the
grammar of a language so that it can be used to achieve some communicative
goal. In other words, what does it mean to have ‘grammatical ability’?
Defining grammatical constructs
A clear definition of what we believe it means to
‘know’ grammar for a particular testing context will then allow us to construct
tests that measure grammatical ability. The many possible ways of interpreting
what it means to ‘know grammar’ or to have ‘grammatical ability’ highlight the
importance in language assessment of defining key terms. Some of the same terms
used by different testers reflect a wide range of theoretical positions in the
field of applied linguistics. In this book, I will use several theoretical
terms from the domain of language testing. These include knowledge, competence,
ability, proficiency and performance, to name a few. These concepts are
abstract, not directly observable in tests and open to multiple definitions and
interpretations. Therefore, before we use abstract terms such as knowledge or
ability, we need to ‘construct’ a definition of them that will both suit our
assessment goals and be theoretically viable. I will refer to these abstract,
theoretical concepts generically as constructs or theoretical constructs.
One of the first steps in designing a test, aside from
identifying the need for a test, its purpose and audience, is to provide a
clear theoretical definition of the construct(s) to be measured. If we have a
theoretically sound, as well as a clear and precise definition of grammatical
knowledge, we can then design tasks to elicit performance samples of
grammatical ability. By having the test-takers complete grammar tasks, we can observe
– and score – their answers with relation to specific grammatical criteria for
correctness. If these performance samples reflect the underlying grammatical
constructs – an empirical question – we can then use the test results to make
inferences about the test-takers’ grammatical ability. These inferences, in
turn, may be used to make decisions about the test-takers (e.g., pass the
course). However, we need first to provide evidence that the tasks on a test
have measured the grammatical constructs we have designed them to measure
(Messick, 1993). The process of providing arguments in support of this evidence
is called validation, and this begins with a clear definition of the constructs.
Language educators thus need to define carefully the constructs
to be measured when creating tasks for tests. They must provide a clear definition
of the constructs, bearing in mind that each time a test is designed it should
reflect the different components of grammatical knowledge, the purpose of the
assessment, the group of learners about whom we would like to make inferences and
the language-use contexts to which, we hope, the results will ultimately
generalize.
Definition of key terms
Before continuing this discussion, it might be helpful
if I clarified some of the key terms.
Grammatical knowledge
Knowledge refers to a set of informational structures
that are built up through experience and stored in long-term memory. These
structures include knowledge of facts that are stored in concepts, images,
networks, production-like structures, propositions, schemata and representation
(Pressley, 1995). Language knowledge is then a mental representation of informational
structures related to language. The exact components of language knowledge,
like any other construct, need to be defined. In this book, grammar refers to a
system of language whereas grammatical knowledge is defined as a set of
internalized informational structures related to the theoretical model of
grammar proposed
Grammatical ability
Although some researchers have defined knowledge and
ability similarly, I use these terms differently. ‘Knowledge’ refers to a set
of informational structures available for use in long-term memory. Ability,
however, encompasses more than just a domain of information in memory; it also
involves the capacity to use these informational structures in some way.
Therefore, language ability, sometimes called communicative
competence or language proficiency, refers to an individual’s
capacity to utilize mental representations of language knowledge built up
through practice or experience in order to convey meaning. Given this
definition, language ability, by its very nature, involves more than just
language knowledge. Bachman and Palmer (1996) characterize language ability as
a combination of language knowledge and strategic competence, defined as
a set of metacognitive strategies (e.g., planning, evaluating) and, I
might add, cognitive strategies (e.g., associating, clarifying), for the
purpose of ‘creating and interpreting discourse in both testing and non-testing
situations’ (p. 67).
Grammatical ability is, then, the combination of
grammatical knowledge and strategic competence; it is specifically defined as
the capacity to realize grammatical knowledge accurately and meaningfully in
testing or other language-use situations. Hymes (1972) distinguished between
competence and performance, stating that communicative competence includes the
underlying potential of realizing language ability in instances of language
use, whereas language performance refers to the use of language in
actual language events. Carroll (1968) refers to language performance as ‘the
actual manifestation of linguistic competence . . . in behavior’ (p. 50).
Metalinguistic knowledge
Finally, metalanguage is the language used to describe
a language. It generally consists of technical linguistic or grammatical terms
(e.g., noun, verb). Metalinguistic knowledge, therefore, refers to informational
structures related to linguistic terminology. We must be clear that
metalinguistic knowledge is not a component of grammatical ability; rather, the
knowledge of linguistic terms would more aptly be classified as a kind of
specific topical knowledge that might be useful for language teachers to
possess. Some teachers almost never present metalinguistic terminology to their
students, while others find it useful as a means of discussing the language and
learning the grammar. It is important to remember that knowing the grammatical
terms of a language does not necessarily mean knowing how to communicate in the
language.
What is ‘grammatical ability’ for assessment purposes?
The approach to the assessment of grammatical ability
in this book is based on several specific definitions. First, grammar
encompasses grammatical form and meaning, whereas pragmatics is a separate, but
related, component of language. A second is that grammatical knowledge, along with
strategic competence, constitutes grammatical ability. A third is that grammatical
ability involves the capacity to realize grammatical knowledge accurately and
meaningfully in test-taking or other language-use contexts. The capacity to
access grammatical knowledge to understand and convey meaning is related to a
person’s strategic competence. It is this interaction that enables examinees to
implement their grammatical ability in language use. Next, in tests and other
language-use contexts, grammatical ability may interact with pragmatic ability
(i.e., pragmatic knowledge and strategic competence) on the one hand, and with
a host of non-linguistic factors such as the test-taker’s topical knowledge, personal
attributes, affective schemata and the characteristics of the task on the
other. Finally, in cases where grammatical ability is assessed by means of an
interactive test task involving two or more interlocutors, the way grammatical
ability is realized will be significantly impacted by both the contextual and
the interpretative demands of the interaction.
Knowledge of phonological or graphological form and
meaning
Knowledge of phonological/graphological form enables
us to understand and produce features of the sound or writing system (with the
exception of meaning-based orthographies such as Chinese characters) as they
are used to convey meaning in testing or language-use situations. Phonological
form includes the segmentals (i.e., vowels and consonants) and prosody (i.e.,
stress, rhythm, intonation contours, volume, tempo). These forms can be used
alone or in conjunction with other grammatical forms to encode phonological
meaning. For example, the ability to hear or pronounce meaning-distinguishing
sounds such as /b/ vs. /v/ allows us to distinguish the meanings of different
words (boat/vote), and the ability to hear or pronounce the
prosodic features of the language (e.g., intonation) could allow students to
understand or convey the notion that a sentence is an interrogative (You’re
staying?).
Knowledge of lexical form and meaning
Knowledge of lexical form enables us to understand and
produce those features of words that encode grammar rather than those that
reveal meaning. This includes words that mark gender (e.g., waitress),
countability (e.g., people) or part of speech (e.g., relate, relation). For example,
when the word think in English is followed by the preposition about before a
noun, this is considered the grammatical dimension of lexis, representing a
co-occurrence restriction with prepositions. One area of lexical form that
poses a challenge to learners of some languages is word formation. This
includes compounding in English with a noun + noun or a verb + particle pattern
(e.g., fire escape; breakup) or derivational affixation in Italian (e.g.,
ragazzino ‘little kid’, ragazzone ‘big kid’). For example, a student who says
‘a teacher of chemistry’ instead of ‘chemistry teacher’ or ‘*this people’ would
need further instruction in lexical form.
Knowledge of lexical meaning allows us to
interpret and use words based on their literal meanings. Lexical meaning here
does not encompass the suggested or implied meanings of words based on
contextual, sociocultural, psychological or rhetorical associations. For
example, the literal meaning of a rose is a kind of flower, whereas a rose can
also be used in a non-literal sense to imply a number of sociocultural meanings
depending on the context. These include love, beauty, passion and still a host
of other cultural meanings (e.g., the Rose Bowl, a rose window). Lexical
meaning also accounts for the literal meaning of formulaic or lexicalized
expressions (e.g., You’re welcome). Although it is possible to test lexical
form or meaning separately, we must recognize that lexical form and meaning are
very closely associated.
Knowledge of morphosyntactic form and meaning
Knowledge of morphosyntactic form permits us to
understand and produce both the morphological and syntactic forms of the
language. This includes the articles, prepositions, pronouns, affixes (e.g.,
-est), syntactic structures, word order, simple, compound and complex
sentences, mood, voice and modality. A learner who knows the morphosyntactic
form of the English conditionals would know that: (1) an if-clause sets up a
condition and a result clause expresses the outcome; (2) both clauses can be in
the sentence-initial position in English; (3) if can be deleted under certain
conditions as long as the subject and operator are inverted; and (4) certain
tense restrictions are imposed on if and result clauses.
Morphosyntactic forms carry morphosyntactic
meanings which allow us to interpret and express meanings from inflections
such as aspect and time, meanings from derivations such as negation and agency,
and meanings from syntax such as those used to express attitudes (e.g., subjunctive
mood) or show focus, emphasis or contrast (e.g., voice and word order). For
example, a student who knows the morphosyntactic meaning of the English
conditionals would know how to express a factual conditional relationship (If
this happens, that happens), a predictive conditional relationship (If
this happens, that will happen), or a hypothetical conditional relationship
(If this happened, that would happen). On the sentential level, the
individual morphosyntactic forms and meanings taken together allow us to
interpret and express the literal or grammatical meaning of an utterance and
they allow us to identify the direct language function associated with language
use.
Knowledge of cohesive form and meaning
Knowledge of cohesive form enables us to use the
phonological, lexical and morphosyntactic features of the language in order to
interpret and express cohesion on both the sentence and the discourse levels.
Cohesive form is directly related to cohesive meaning through cohesive devices
(e.g., she, this, here) which create links between cohesive forms and their
referential meanings within the linguistic environment or the surrounding
co-text. Halliday and Hasan (1976, 1989) list a number of grammatical forms for
displaying cohesive meaning. This can be achieved through the use of personal
referents to convey possession or reciprocity; demonstrative referents to
display spatial, temporal or psychological links; comparative referents to
encode similarity, difference and equality; and logical connectors to signal a
wide range of meanings such as addition, logical conclusion and contrast.
Cohesive meaning is also conveyed through ellipsis (e.g., When [should I arrive
at your house]?), substitution (e.g., I hope so) and lexical ties in the form
of synonymy and repetition. Finally, cohesive meaning can be communicated
through the internal relationship between pair parts in an adjacency pair
(e.g., invitation/acceptance). When the interpretation source of a cohesive
form is within the linguistic environment, the interpretation is said to be
endophoric (Halliday and Hasan, 1989).
Knowledge of information management form and meaning
Knowledge of information management form allows us to
use linguistic forms as a resource for interpreting and expressing the
information structure of discourse. Some resources that help manage the
presentation of information include, for example, prosody, word order,
tense-aspect and parallel structures. These forms are used to create
information management meaning. In other words, information can be structured
to allow us to organize old and new information (i.e., topic/comment),
topicalize, emphasize information and provide information symmetry through
parallelism and tense concordance.
Knowledge of interactional form and meaning
Knowledge of interactional form enables us to
understand and use linguistic forms as a resource for understanding and
managing talk-in-interaction. These forms include discourse markers and
communication management strategies. Discourse markers consist of a set of adverbs,
conjunctions and lexicalized expressions used to signal certain language functions.
For example, well . . . can signal disagreement, ya know or ah-huh can signal
shared knowledge, and by the way can signal topic diversion.
Conversation-management strategies include a wide range of linguistic forms
that serve to facilitate smooth interaction or to repair interaction when
communication breaks down. For example, when interaction stops because a
learner does not understand something, one person might try to repair the
breakdown by asking, *What means that? Here the learner knows the interactional
meaning, but not the form.
Similar to cohesive forms and information management
forms, interactional forms use phonological, lexical and morphosyntactic
resources to encode interactional meaning. For example, in saying *What means that?,
the learner knows how to repair a conversation by asking for clarification, but
does not know the form of the request. Other examples of interactional forms
and meanings include: backchannel signals such as ah-huh, or right in English
to signal active listening and engagement; lexicalized expressions like guess
what? and you know what? to indicate the initiation of a story sequence; and
others such as Oh my God, I can’t believe it! Oh my God, you’re kiddin’ me! or,
in current Valleyspeak, Shut up! (with stress on ‘shut’ and falling intonation
on ‘up’), commonly used to express surprise.
Although interactional form and meaning are closely
associated, it is possible for students to know the form but not the meaning,
and vice versa. For example, a student who says *thanks you to express
gratitude or *you welcome to respond to an expression of gratitude obviously
has knowledge of the interactional meanings, but not the forms. Finally, from a
pragmatic perspective, interactional forms and meanings embody a number of
implied meanings.
Chapter Five
Designing test tasks to measure L2 grammatical ability
How does test development begin? Every grammar-test
development project begins with a desire to obtain (and often provide)
information about how well a student knows grammar in order to convey meaning
in some situation where the target language is used. The information obtained
from this assessment then forms the basis for decision-making. Those situations
in which we use the target language to communicate in real life or in which we
use it for instruction or testing are referred to as the target language use
(TLU) situations (Bachman and Palmer, 1996). Within these situations, the
tasks or activities requiring language to achieve a communicative goal are
called the target language use tasks. A TLU task is one of many language
use tasks that test-takers might encounter in the target language use domain.
It is to this domain that language testers would like to make inferences about
language ability, or more specifically, about grammatical ability.
What do we mean by ‘task’?
The notion of ‘task’ in language-learning contexts has
been conceptualized in many different ways over the years. Traditionally,
‘task’ has referred to any activity that requires students to do something for
the intended purpose of learning the target language. A task, then, is any activity
(e.g., short answers, role-plays) as long as it involves a linguistic or
non-linguistic (e.g., circling the answer) response to input. Traditional learning or
teaching tasks are characterized as having an intended pedagogical purpose –
which may or may not be made explicit; they have a set of instructions that
control the kind of activity to be performed; they contain input (e.g.,
questions); and they elicit a response. More recently, learning tasks have been
characterized more in terms of their communicative goals, their success in
eliciting interaction and negotiation of meaning, and their ability to engage
learners in complex meaning-focused activities (Nunan, 1989, 1993; Berwick,
1993; Skehan, 1998).
In a discussion of the degree to which pedagogical
tasks are successful in eliciting the specific grammatical structures under
investigation, Loschky and Bley-Vroman (1993) identified three types of
grammar-to-task relationships. The first involves task-naturalness, a condition
where ‘a grammatical construction may arise naturally during the performance of
a particular task, but the task can often be performed perfectly well, even
quite easily, without it’ (p. 132). For example, in a task designed to elicit
past modals in the context of a murder mystery, we expect forms like: the
butler could have done it or the maid might have killed her, but we might get
forms like: Maybe the butler did it or I suspect the maid killed her. The
second condition is task-utility, where ‘it is possible to complete the task
[meaningfully] without the structure, but with the structure the task becomes
easier’ (ibid.). For example, in a comparison task, I once had a student say:
*Shiraz is beautiful city, but Esfahan is very, very, very, beautiful city in
Iran. Had he known the comparatives or the superlatives, his message could have
been communicated much more easily. The final and most interesting condition
for grammar assessment entails task-essentialness. This is where the
task cannot be completed unless the grammatical form is used. For example, in a
task intended to distinguish stative from dynamic adjectives, the student would
need to know the difference between I’m really bored and I’m really boring in
order to complete the task. Obviously task essentialness is the most difficult,
yet the most desirable condition to meet in the construction of grammar tasks.
What are the characteristics of grammatical test
tasks?
As the goal of grammar assessment is to provide as
useful a measurement as possible of our students’ grammatical ability, we need
to design test tasks in which the variability of our students’ scores is
attributed to the differences in their grammatical ability, and not to
uncontrolled or irrelevant variability resulting from the types of tasks or the
quality of the tasks that we have put on our tests. As all language teachers
know, the kinds of tasks we use in tests and their quality can greatly
influence how students will perform. Therefore, given the role that the effects
of task characteristics play on performance, we need to strive to manage (or at
least understand) the effects of task characteristics so that they will function
the way we designed them to – as measures of the constructs we want to measure
(Douglas, 2000). In other words, specifically designed tasks will work to
produce the types of variability in test scores that can be attributed to the
underlying constructs given the contexts in which they were measured (Tarone,
1998). To understand the characteristics of test tasks better, we turn to
Bachman and Palmer’s (1996) framework for analyzing target language use tasks
and test tasks.
The Bachman and Palmer framework
Bachman and Palmer’s (1996) framework of task
characteristics represents the most recent thinking in language assessment of
the potential relationships between task characteristics and test performance.
In this framework, they outline five general aspects of tasks, each of which is
characterized by a set of distinctive features. These five aspects describe
characteristics of (1) the setting, (2) the test rubrics, (3) the input, (4)
the expected response and (5) the relationship between the input and response.
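As a rough illustration of how these five aspects might be recorded when analyzing a grammar test task, the following Python sketch (not from the source) describes a hypothetical gap-filling task. The class name, the attribute values and the level of detail are invented; Bachman and Palmer’s framework specifies the aspects, not any particular data structure.

# A minimal sketch (not from the source) recording Bachman and Palmer's (1996)
# five aspects of task characteristics for an invented gap-filling task.
from dataclasses import dataclass, field

@dataclass
class TaskCharacteristics:
    setting: dict = field(default_factory=dict)             # physical conditions, participants, time
    rubrics: dict = field(default_factory=dict)             # instructions, structure, time allotment, scoring
    input: dict = field(default_factory=dict)               # format and language of the input
    expected_response: dict = field(default_factory=dict)   # format and language of the response
    input_response_relationship: dict = field(default_factory=dict)  # reactivity, scope, directness

gap_fill = TaskCharacteristics(
    setting={"location": "classroom", "time_allowed": "50 minutes"},
    rubrics={"scoring": "right/wrong", "instructions_language": "English"},
    input={"format": "written passage with gaps"},
    expected_response={"format": "single word per gap"},
    input_response_relationship={"reactivity": "non-reciprocal", "scope": "narrow", "directness": "direct"},
)

print(gap_fill.rubrics["scoring"])  # -> right/wrong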
Describing grammar test tasks
When teachers and testers think of
grammar tests, they call to mind a large repertoire of task types that have
been commonly used in teaching and testing contexts. We now know that these
holistic task types constitute collections of task characteristics for
eliciting performance and that these holistic task types can vary on a number
of dimensions. In designing grammar tests, we need to be familiar with a wide
range of activities to elicit grammatical performance.
Selected-response task types
Selected-response
tasks present input in the form of an item, and test-takers are expected to
select the response; beyond that, all other task characteristics can vary.
Responses are typically scored as right or wrong. However, in some instances,
partial-credit scoring may be useful, depending on how the construct is defined.
Finally, selected-response tasks can vary in terms of reactivity, scope and directness.
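The following Python sketch (not from the source) illustrates the contrast between right/wrong and partial-credit scoring of a single selected-response item. The answer key and the option weights are invented; for instance, a distractor that is accurate in form but less appropriate in meaning might earn partial credit, depending on how the construct is defined.

# A minimal sketch (not from the source) contrasting dichotomous and
# partial-credit scoring for one selected-response item. Weights are invented.
ITEM_KEY = "b"
PARTIAL_CREDIT = {"a": 0.0, "b": 1.0, "c": 0.5, "d": 0.0}

def score_dichotomous(selected: str) -> float:
    """Right/wrong scoring: full credit only for the keyed option."""
    return 1.0 if selected == ITEM_KEY else 0.0

def score_partial_credit(selected: str) -> float:
    """Weighted scoring: each option carries a construct-based weight."""
    return PARTIAL_CREDIT.get(selected, 0.0)

print(score_dichotomous("c"))     # -> 0.0
print(score_partial_credit("c"))  # -> 0.5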
The multiple-choice (MC) task
This task
presents input with gaps or underlined words or phrases. While the MC task has
many advantages, the items can be difficult and time-consuming to develop. The
format encourages guessing, and scores might be inflated due to test-wiseness,
or the test-taker’s knowledge about test-taking. This can result in serious
questions about the validity of inferences based on these items (Cohen, 1998).
Finally, many educators argue that MC tasks are inauthentic language-use tasks.
Multiple-choice error identification
task
This task presents test-takers with an item that
contains one incorrect, unacceptable, or inappropriate feature in the input.
Examinees are required to identify the error. In the context of grammatical
assessment, the errors in the input relate to grammatical accuracy and/or
meaningfulness.
The discrimination task
This task presents examinees with language and/or
non-language input along with two response choices that are polar opposites or
that contrast in some way. Some response possibilities include: true–false,
right–wrong, same–different, agree–disagree, grammatical–ungrammatical and so
forth. Discrimination items are designed to measure the differences between two
similar areas of grammatical knowledge.
The noticing task
This task presents learners with a wide range of
input in the form of language and/or non-language. Examinees are asked to
indicate (e.g., by circling, highlighting) that they have identified some
specific feature in the language.
The noticing task, also referred to as a kind of
consciousness-raising (CR) task, is intended to help students process input by
getting them to construct a conscious form–meaning representation of the
grammatical feature (Ellis, 1997), and for this reason, it seems to be
particularly effective in promoting the acquisition of new grammar points (Tuz,
1993, cited in Ellis, 1997; VanPatten and Cadierno, 1993a, 1993b).
Limited-production task types
Limited-production
tasks are intended to assess one or more areas of grammatical knowledge
depending on the construct definition. Unlike selected-response items, which
usually have only one possible answer, the range of possible answers for
limited-production tasks can, at times, be large – even when the response
involves a single word.
In
some situations, limited-production tasks can be scored with a holistic or
analytic rating scale. This method is useful if we wish to judge distinct
aspects of grammatical ability and to distinguish different levels of ability or mastery.
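By way of illustration, the Python sketch below (not from the source) shows how an analytic rating scale might combine separate ratings for grammatical form and grammatical meaning into a single score. The criteria, the 0 to 3 bands and the equal weighting are invented; a real scale would also define descriptors for each band.

# A minimal sketch (not from the source) of analytic rating-scale scoring for a
# limited-production response. Criteria, bands and weights are invented.
BANDS = range(0, 4)  # 0 = no evidence ... 3 = full control

def analytic_score(ratings: dict[str, int], weights: dict[str, float]) -> float:
    """Combine per-criterion band ratings into a weighted composite score."""
    assert all(r in BANDS for r in ratings.values())
    return sum(ratings[criterion] * weights[criterion] for criterion in ratings)

ratings = {"grammatical_form": 2, "grammatical_meaning": 3}
weights = {"grammatical_form": 0.5, "grammatical_meaning": 0.5}
print(analytic_score(ratings, weights))  # -> 2.5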
The gap-filling task
This task presents input in the form of a sentence,
passage or dialogue with a number of words deleted. Examinees are required to
fill each gap with a response that is appropriate for the context. Gap-filling
tasks are designed to measure one or more areas of grammatical knowledge, such
as the learner’s knowledge of grammatical forms and meanings.
A second type of gap-filling task is the cued
gap-filling task. In these tasks, the gaps are preceded by one or more lexical
items, or cues, which must be transformed in order to fill the gap correctly.
A third type of gap-filling task is the cloze. This
task presents the input as a passage or dialogue in which every fifth, sixth or
seventh word is mechanically deleted and replaced by a gap. Examinees have to
fill the gap with the best word for the context.
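To illustrate the mechanics of a fixed-ratio cloze, here is a minimal Python sketch (not from the source) that replaces every nth word after a short lead-in with a numbered gap. The passage, the deletion ratio and the lead-in length are invented, and real cloze construction usually involves further refinements (for example, avoiding gaps on proper nouns) that this sketch ignores.

# A minimal sketch (not from the source) of fixed-ratio cloze construction:
# the first `skip` words are left intact, then every nth word becomes a gap.
def make_cloze(text: str, n: int = 6, skip: int = 10):
    words = text.split()
    gapped, answers = [], []
    for i, word in enumerate(words):
        if i >= skip and (i - skip) % n == n - 1:
            answers.append(word)
            gapped.append(f"({len(answers)}) ______")
        else:
            gapped.append(word)
    return " ".join(gapped), answers

passage = ("The students arrived early because they wanted good seats. "
           "They had heard that the lecture would be crowded, so they "
           "brought their own chairs and waited patiently near the door.")
cloze_text, answer_key = make_cloze(passage, n=6)
print(cloze_text)
print(answer_key)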
The short-answer task
This task presents input in the form of a question,
incomplete sentence or some visual stimulus. Test-takers are expected to
produce responses that range in length from a word to a sentence or two.
Short-answer questions can be used to test several areas of grammatical
ability, and are usually scored as right or wrong with one or more criteria for
correctness or partial credit. Short-answer tasks can also be scored by means
of a rating scale.
The dialogue (or discourse)
completion task (DCT)
DCTs
are intended to measure the students’ capacity to use grammatical forms to
express a variety of literal or grammatical meanings (e.g., request), where the
relationship between the form and the meaning is relatively direct. DCTs have been used extensively in applied
linguistics research to investigate the use of semantic formulas and other
linguistic devices to express a wide range of literal and implied contextual
meanings (e.g., refusals, apologies, compliments). They have also been used to
examine sociolinguistic and sociocultural meanings (social distance, power,
register) associated with these contexts. This research has been performed with
native and non-native speakers alike.
In a DCT that includes a closing third turn, the relationship
between the input and the response is reciprocal since the response affects further
turns. The closing third is used to constrain the meaning of the
expected response, thereby limiting the forms and the range of meanings that
can be expressed in the response.
Extended-production task types
Extended-production tasks are particularly well
suited for measuring the examinee’s ability to use grammatical forms to convey
meanings in instances of language use (i.e., speaking and writing). When
assessing grammatical ability in the context of speaking, it is advisable,
whenever possible, to audiotape or videotape the interaction. This will allow
the performance samples to be scored more reliably and will provide time to
record diagnostic feedback for students.
The quality of the extended-production task
responses is judged (1) with reference to the theoretical construct(s) being
measured and (2) in terms of different levels of grammatical ability or
mastery. For this reason, extended-production tasks are scored with the
rating-scale method.
The information-gap task
(info-gap)
This task presents input in the form of two or more
sets of partially complete information. Test-takers are instructed to ask each
other questions to obtain one complete set of information. Info-gap tasks are
scored by means of the rating-scale method as described above, and they can be
used to measure the test-takers’ ability to use grammatical forms to convey
several meanings – both literal and implied.
The role-play and simulation tasks
These tasks present test-takers with a prompt in
which two or more examinees are asked to assume a role in order to solve a
problem collaboratively, make a decision or perform some transaction. The input
can be language and/or non-language, and it can contain varying amounts of
information. In terms of the expected response, role-plays and simulation tasks
elicit large amounts of language, invoking the test-takers’ grammatical and
pragmatic knowledge, their topical knowledge, strategic competence and affective
schemata. The purpose of the test and the construct definition will determine
what will be scored. The relationship between the input and response is
reciprocal and indirect. These tasks are scored with the rating-scale method in
light of the constructs being measured.
Chapter six
Developing tests to
measure L2 grammatical ability
The
information derived from language tests, of which grammar tests are a subset,
can be used to provide test-takers and other test-users with formative and summative
evaluations. Formative evaluation relating to grammar assessment supplies
information during a course of instruction or learning on how test-takers might
increase their knowledge of grammar, or how they might improve their ability to
use grammar in communicative contexts.
Score-based
inferences from grammar tests can also be used to make, or contribute to,
decisions about program placement. This information provides a basis for
deciding how students might be placed into a level of a language program that
best matches their knowledge base, or it might determine whether or not a
student is eligible to be exempted from further L2 study.
The quality of reliability
When
we talk about ‘reliability’ in reference to a car, we all know what that means.
A car is said to be reliable if it readily starts up every time we want to use
it regardless of the weather, the time of day or the user. It is also
considered reliable if the brakes never fail, and the steering is consistently
responsive. These mechanical functions, working together, make the car’s
performance anywhere from zero to one hundred percent reliable.
The quality of construct validity
The
second quality that all ‘useful’ tests possess is construct validity. Bachman
and Palmer (1996) define construct validity as ‘the extent to which we can
interpret a given test score as an indicator of the ability(ies), or
construct(s), we want to measure. Construct validity also has to do with the
domain of generalization to which our score interpretations generalize’ (p.
21). In other words, construct validity not only refers to the meaningfulness
and appropriateness of the interpretations we make based on test scores, but it
also pertains to the degree to which the score-based interpretations can be
extrapolated beyond the testing situation to a particular TLU domain (Messick,
1993).
Construct
validity is clearly one of the most important qualities a test can possess. It
tells us whether we are measuring what we intended to measure. Nonetheless, it
tells us nothing about how these assessment tasks resemble
those that the learners might encounter in some non-testing situation or about
what impact, if any, these assessments are having on the test-takers.
The quality of authenticity
A
third quality of test usefulness is authenticity, a notion much discussed in
language testing since the late 1970s, when communicative approaches to
language teaching were first taking root. Building on these discussions,
Bachman and Palmer (1996) refer to ‘authenticity’ as the degree of
correspondence between the test-task characteristics and the TLU task
characteristics.
If the
purpose of the test is to measure knowledge of grammatical forms so that we can
check on the students’ understanding of these forms, and the TLU domain to
which we wish to generalize is instructional, then selected-response tasks of
grammatical form should not be perceived as lacking in authenticity.
In
sum, test authenticity resides in the relationship between the characteristics
of the TLU domain and characteristics of the test tasks, and although a test
task may be highly authentic, this does not necessarily mean it will engage the
test-taker’s grammatical ability.
The quality of interactiveness
A
fourth quality of test usefulness outlined by Bachman and Palmer (1996) is
interactiveness. This quality refers to the degree to which the aspects of the
test-taker’s language ability we want to measure (e.g., grammatical knowledge,
language knowledge) are engaged by the test-task characteristics (e.g., the input, the expected response, and the relationship between the input and response) based on the test
constructs.
A task that engages these abilities is likely to be more interactive than one that is unsuccessful in engaging aspects of the test-taker's language ability to the same degree. The
engagement of these construct-relevant characteristics with task
characteristics is the essence of actual language use.
A
task may be interactive because it engages the examinee’s topical knowledge and
positive affective schemata; however, if the purpose of the test is to measure
grammatical ability and the task does not engage the ability of interest, the engagement is construct-irrelevant.
The quality of impact
Bachman
and Palmer (1996) refer to the degree to which testing and test score decisions
influence all aspects of society and the individuals within that society as
test impact. Therefore, impact refers to the link between the inferences we
make from scores and the decisions we make based on these interpretations. In
terms of impact, most educators would agree that tests should promote positive
test-taker experiences leading to positive attitudes (e.g., a feeling of
accomplishment) and actions (e.g., studying hard).
A
special case of test impact is washback, which is the degree to which testing
has an influence on learning and instruction. Washback can be observed in
grammar assessment through the actions and attitudes that test-takers display
as a result of their perceptions of the test and its influence over them.
The quality of practicality
Test
practicality is not a quality of a test itself, but is a function of the extent
to which we are able to balance the costs associated with designing,
developing, administering, and scoring a test in light of the available
resources (Bachman, personal communication, 2002).
In
sum, the characteristics of test usefulness, proposed by Bachman and Palmer
(1996), are critical qualities to keep in mind in the development of a grammar
test. While each testing situation may not emphasize the same characteristics
to the same degree, it is important to consider these qualities and to
determine an appropriate balance.
Overview of grammar-test
construction
Because every testing situation is different, there is no one 'right' way to develop a test; nor are there any recipes for 'good' tests that could generalize to all situations. Test
development is often presented as a linear process consisting of a number of
stages and steps. In reality, the process is anything but linear.
Bachman
and Palmer (1996) organize test development into three stages: design,
operationalization and administration. I will discuss each of these stages in
the process of describing grammar-test development. The design stage results in a design statement which, according to Bachman and Palmer (1996, p. 88), should contain the following components.
Purpose
Test development begins with what Davidson and Lynch
(2002) call a mandate. The test mandate grows out of a perceived need for a
test by one or more stakeholders. This embodies the test purpose(s), which, in
the case of grammar assessment, makes explicit the inferences we wish to make
about grammatical knowledge or the ability to use this knowledge and the uses
we intend to make of these inferences.
TLU domains and tasks
The TLU domain is identified (e.g., real-life and/or
language-instructional) and the TLU task types are selected. To identify
language-use tasks and the type of language needed to perform these tasks, a
needs analysis must be performed. This involves the collection and analysis of
information related to the students’ target-language needs.
In more and more language teaching situations,
however, the focus is on communicative language teaching. Instruction in this
approach is designed to correspond to real-life communication outside the
classroom; therefore the intended TLU domain of communicative language tests is
likely to be real-life.
Characteristics of test-takers
The design statement contains a detailed description
of the characteristics of the test-takers, so that the population of
test-takers for whom the test is intended and to whom the test scores might
generalize can be made explicit.
Construct(s) to be measured
The design statement also provides a theoretical
definition of the construct(s) to be measured in the test. Construct definition
can be based on a set of instructional objectives in a syllabus, a set of standards,
a theoretical definition of the construct or some combination of them all. In
grammar tests, construct definition based on a syllabus (or a textbook) is
useful when we want to determine to what degree students have mastered the
grammar points that have been taught during a certain period.
In addition to defining grammatical knowledge, the
test designer must specify the role of topical knowledge in the construct
definition of grammar tests. Bachman and Palmer (1996) provide three options
for defining topical knowledge. The first is to exclude topical knowledge from
the test construct(s). This is appropriate in situations where specific topics
and themes are not a consideration in instruction, and where test-takers are
not expected to have any special background knowledge to complete the task.
The second option is to include topical knowledge in
the construct. This is appropriate in situations where specific topics or
themes are an integral part of the curriculum and where topics or themes
contextualize language, provide a social–cognitive context for the tasks, and
serve to raise the students' interest level.
The third and most interesting option is to define
topical knowledge separately from the language construct(s). This is
appropriate in situations where the development of topical knowledge is as
important as, if not more important than, the development of language knowledge
in the curriculum. Finally, the test developer needs to decide if strategic
competence needs to be specified in the construct definition of grammar tests.
Plan for evaluating usefulness
The test design statement also provides a
description of a plan for assessing the qualities of test usefulness. From the
beginning of grammar-test development, it is important to consider all six
qualities of test usefulness and to determine minimum acceptable levels for
each quality. Bachman and Palmer (1996) suggest that a list of questions be
provided to guide test developers to evaluate test usefulness throughout the
process so that feedback can be provided.
Plan for managing resources
Finally,
the test design makes explicit the human, material and time resources needed
and available to develop the test. In cases of limited resources, priorities
should be made in light of the qualities of test usefulness.
Stage 2: Operationalization
The
operationalization stage of grammar-test development describes how an entire
test involving several grammar tasks is assembled, and how the individual tasks
are specified, written and scored. The outcome of the operationalization phase
is both a blueprint for the entire test including scoring materials and a draft
version of the actual test.
According
to Bachman and Palmer (1996), the blueprint
contains two parts: a description of the overall structure of the test and
a set of test-task specifications for each task. The blueprint serves as the
basis for item writing and scoring.
The
first part of the blueprint provides a description of the test structure. This
involves an overview of the entire test. Minimally, the test structure describes the number of test parts or tasks used to measure grammatical ability, the salience and sequence of these parts, their relative importance, and the number of tasks per part.
The
test-task specifications consist of a detailed list of task characteristics,
which form the basis for writing the test tasks. Test-task specifications are
an important part of the operationalization phase because they provide a means
of creating parallel forms of the same test – that is, alternate test forms
containing the same task types and approximately the same test content and
measurement characteristics.
According
to Bachman and Palmer (1996), test-task specifications provide, for each task,
a description of the following: the purpose, the construct definition, the
characteristics of the setting, the time allotment, the instructions, the
characteristics of the input and expected response, a description of the
relationship between the input and response, and finally the scoring method.
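As one way of picturing such a specification (a hypothetical sketch; the field names and sample entries below are invented rather than Bachman and Palmer's own format), the components listed above can be gathered into a single record that item writers complete for each task:

# Hypothetical sketch of a test-task specification record based on the
# components listed by Bachman and Palmer (1996); field names and sample
# entries are invented for illustration.
from dataclasses import dataclass

@dataclass
class TaskSpecification:
    purpose: str
    construct_definition: str
    setting: str
    time_allotment_minutes: int
    instructions: str
    input_characteristics: str
    expected_response: str
    input_response_relationship: str
    scoring_method: str

spec = TaskSpecification(
    purpose="Check mastery of past passive forms taught in Unit 3",
    construct_definition="Knowledge of morphosyntactic form at the sentence level",
    setting="Classroom, paper-and-pencil",
    time_allotment_minutes=10,
    instructions="Choose the one option that best completes each sentence.",
    input_characteristics="Single-sentence items with one gap",
    expected_response="Selected response (four options)",
    input_response_relationship="Direct and non-reciprocal",
    scoring_method="Dichotomous, scored against an answer key",
)
print(spec.purpose)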
I
will describe these procedures in detail so that they can be specified properly
in the blueprint.
Specifying the scoring method
Scoring
method provides an explicit description of the criteria for correctness and the
exact procedures for scoring the response. Generally speaking, tasks can be
scored objectively, where the scorer does not need to make any expert judgments in determining whether the answers are correct, or subjectively, where expert judgment is required to evaluate the performance.
Scoring selected-response tasks
The first task in the example chemistry lab test
discussed above is a selected-response task (i.e., multiple-choice) of
grammatical form. Scorers are provided with an answer key to determine if the
answers are right or wrong – no further adjudication is necessary. In this
task, each item is designed to measure a single area of explicit grammatical
knowledge.
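A minimal sketch of what this kind of objective, key-based scoring amounts to is given below; the answer key and the candidate's responses are invented for illustration.

# Minimal sketch of dichotomous (right/wrong) scoring against an answer key.
# The key and the candidate's responses are invented.
answer_key = {1: "b", 2: "d", 3: "a", 4: "c"}
responses  = {1: "b", 2: "a", 3: "a", 4: "c"}

item_scores = {item: int(responses.get(item) == key)
               for item, key in answer_key.items()}
total = sum(item_scores.values())
print(item_scores)        # {1: 1, 2: 0, 3: 1, 4: 1}
print(f"{total}/{len(answer_key)} correct")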
Scoring limited-production tasks
The second task in the lab report test is a
limited-production task. Limited-production tasks are designed to elicit a
range of possible answers and can be used to measure one or more areas of
grammatical knowledge.
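Because more than one answer may be acceptable, the key for a limited-production item is usually a set of responses rather than a single string. A minimal sketch, with invented items and keys:

# Sketch of scoring a gap-filling (limited-production) item against a set of
# acceptable answers; the items and keys are invented.
acceptable = {
    "item_1": {"has been heated", "was heated"},
    "item_2": {"is dissolved", "dissolves"},
}
responses = {"item_1": "was heated", "item_2": "dissolve"}

def score_item(item, answer):
    # Light normalization before matching; real keys may need more than this.
    return int(answer.strip().lower() in acceptable[item])

scores = {item: score_item(item, ans) for item, ans in responses.items()}
print(scores)  # {'item_1': 1, 'item_2': 0}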
Scoring extended-production tasks
The third task in the chemistry lab test asks
test-takers to write an abbreviated version of a lab report based on topical
cues in the input. This task is designed to elicit an array of grammatical
features characteristic of chemical lab reports (e.g., past active and passive
sentences).
Using scoring rubrics
Once the scoring rubric has been constructed, the
scoring process can be determined. In an attempt to avoid unreliability due to
the scoring process, certain basic procedures should be followed. First of all,
raters should be normed. To do this, raters are given a norming packet
containing the scoring rubric and samples of tasks typifying the different
levels. These benchmark samples serve to familiarize raters with the rubric and
how performance might be leveled.
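One simple way to check whether a rater has internalized the rubric during norming (a sketch using invented benchmark and trainee ratings) is to compare the rater's scores on the benchmark samples with the agreed levels and report exact and adjacent agreement:

# Sketch of a norming check: compare a trainee rater's scores on benchmark
# samples with the agreed benchmark levels. All ratings are invented.
benchmark = [5, 4, 3, 2, 1, 3, 4]   # agreed levels for seven samples
trainee   = [5, 3, 3, 2, 2, 3, 5]   # the trainee rater's scores

pairs = list(zip(benchmark, trainee))
exact    = sum(b == t for b, t in pairs) / len(pairs)
adjacent = sum(abs(b - t) <= 1 for b, t in pairs) / len(pairs)

print(f"Exact agreement:  {exact:.0%}")     # e.g. 57%
print(f"Within one level: {adjacent:.0%}")  # e.g. 100%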
Grading
The blueprint should describe the relative
importance of the test sections. This can be used to determine a final score on
the test. In the chemistry lab test blueprint, the selected-response and the
limited-production tasks together account for fifty percent of the points (20
points), while the extended-production task accounts for the other fifty percent
(20 points).
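Arriving at the final score is then simple arithmetic. The sketch below assumes the 20 + 20 point split just described; the raw scores are invented.

# Sketch of combining section scores into a final grade, assuming the
# 20 + 20 point weighting described for the chemistry lab test.
sections = {
    "selected_and_limited_production": {"raw": 16, "max": 20},
    "extended_production":             {"raw": 14, "max": 20},
}
total = sum(s["raw"] for s in sections.values())
maximum = sum(s["max"] for s in sections.values())
print(f"Final score: {total}/{maximum} ({total / maximum:.0%})")  # 30/40 (75%)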
Stage 3: Test administration and
analysis
The final stage in the process of developing grammar
tests involves the administration of the test to individual students or small
groups, and then to a large group of examinees on a trial basis. Piloting the
entire test or individual test tasks allows for the collection of response data
and other sorts of information to support and/or improve the usefulness of the
test. This information can then be analyzed and the test revised before being
put to operational use.
The actual administration of the test should
transpire in a setting that is physically comfortable and free from
distraction, and a supportive testing environment should be established.
Instructions should be clear and the administration orderly. Test
administration provides an excellent opportunity for collecting information
about the test-takers’ initial reaction to the test tasks and information about
certain test procedures such as the allotment of time.
Once the pre-test responses have been collected and
scored, a number of statistical analyses should be performed in order to
examine the psychometric properties of the test.
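Two of the most common of these analyses are item difficulty (the proportion of examinees answering an item correctly) and item discrimination (how well an item separates stronger from weaker examinees). A minimal sketch, computed on invented pilot data:

# Sketch of classical item analysis on invented pilot data: item difficulty
# (proportion correct) and a simple high-low discrimination index.
responses = [  # rows = examinees, columns = items (1 = correct)
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
n_items = len(responses[0])
difficulty = [sum(r[i] for r in responses) / len(responses) for i in range(n_items)]

# Discrimination: proportion correct in the top half minus the bottom half,
# with examinees ranked by their total scores.
ranked = sorted(responses, key=sum, reverse=True)
half = len(ranked) // 2
top, bottom = ranked[:half], ranked[-half:]
discrimination = [sum(r[i] for r in top) / half - sum(r[i] for r in bottom) / half
                  for i in range(n_items)]

for i in range(n_items):
    print(f"Item {i + 1}: difficulty {difficulty[i]:.2f}, "
          f"discrimination {discrimination[i]:+.2f}")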
Test
analyses provide different types of information to evaluate the characteristics
of test usefulness. This information serves as a basis for revising the test
before it goes operational, at which time further data are collected and
analyses performed in an iterative and recursive manner. In the end, we will
have a bank of test tasks that will be archived and from which we can draw upon
in further test administrations.
Chapter 7
Illustrative tests of
grammatical ability
In
this chapter I will examine several examples of professionally developed
language tests that measure grammatical ability. Some of these tests contain
separate sections that are exclusively devoted to the assessment of grammatical
ability, while others measure grammatical knowledge along with other components
of language ability in the context of language use – that is while test-takers
are listening, speaking, reading or writing.
I
will begin the analysis of each test by describing the context of the test, its
purpose and its intended use(s). After that, I will turn to how the construct
of grammatical ability was defined and operationalized. I will then describe
the grammar task(s) taking into account the areas of grammatical knowledge
being measured and will summarize the critical features of the test tasks. I
will highlight the priorities and compromises made in the process of balancing the
qualities of test usefulness.
The First Certificate in English
Language Test (FCE)
Purpose
The First Certificate in English (FCE) exam was
first developed by the University of Cambridge Local Examinations Syndicate
(UCLES, now Cambridge ESOL) in 1939 and has been revised periodically ever
since. The purpose of the FCE (Cambridge ESOL, 2001a) is to assess the general
English language proficiency of learners as measured by their abilities in
reading, writing, speaking, listening, and knowledge of the lexical and
grammatical systems of English (Cambridge ESOL, 1995, p. 4).
In this review, I will focus on how grammatical
ability is measured in the Use of English or grammar section of the FCE. I will
then examine how grammatical ability is measured in the writing and speaking
sections.
Construct definition and
operationalization
According
to the FCE Handbook (Cambridge ESOL, 2001a), the Use of English paper is
designed to measure the test-takers’ ability to ‘demonstrate their knowledge
and control of the language system by completing a number of tasks, some of
which are based on specially written texts’ (p. 7).
Measuring grammatical ability
through language use
In addition to measuring grammatical ability in the
Use of English paper of the test, the FCE measures grammatical ability in the
writing and speaking sections. Language use in the writing paper is measured in
the contexts of writing letters, articles, reports and compositions (Cambridge
ESOL, 2001a, p. 7). Scores are derived from a six-point (0–5), holistic rating
scale based on ‘the control, organization and cohesion, range of structures and
vocabulary, register and format, and [the effect made on] the target reader
indicated in the task’ (Cambridge ESOL, 2001a, p. 19).
The FCE and the qualities of test
usefulness
In terms of the qualities of test usefulness, the FCE clearly
gives priority to construct validity, especially as this relates to the
measurement of grammatical ability as one component of English language
proficiency. The published FCE literature provides clear, albeit very general,
information on the aspects of grammatical knowledge being measured in the Use
of English paper. Also, the FCE literature makes explicit the importance of
grammatical ability when describing the criteria for rating the writing and speaking
papers of the test. This is made even more salient by the inclusion of rater
comments on the quality of writing samples, where explicit rater judgments
about grammatical performance are expressed.
Given the purpose and uses of the FCE, the establishment
of a discrete, empirical relationship between the target language use tasks and
the test tasks in the Use of English paper of the test is difficult to
determine from the published literature.
The Comprehensive English
Language Test (CELT)
Purpose
The Comprehensive English Language Test
(CELT) (Harris and Palmer, 1970a, 1986) was designed to measure the English
language ability of nonnative speakers of English. The authors claim in the
technical manual (Harris and Palmer, 1970b) that this test is most appropriate
for students at the intermediate or advanced levels of proficiency. English
language proficiency is measured by means of a structure subtest, a vocabulary
subtest and a listening subtest.
Construct definition and
operationalization
According
to the CELT Technical Manual (Harris and Palmer, 1970b), the structure subtest
is intended to measure the students’ ‘ability to manipulate the grammatical
structures occurring in spoken English'.
Measuring
grammatical ability through language use
In addition to measuring grammatical knowledge in the structure subtest, grammatical knowledge is also measured in the listening subtest. According to the Technical Manual, the listening subtest is designed to measure the test-takers' 'ability to understand short statements, questions, and dialogues
as spoken by a native speaker’ (p. 1). The listening section has three
tasks. In the first task, candidates hear a wh- or a yes/no question (When are
you going to New York?) and are asked to select one correct response to
this question (e.g., Next Friday) from four options. To get this item right,
examinees need to understand the lexical item when and associate it with a time
expression in the response. This item is obviously designed to measure the
student’s ability to understand lexical meaning.
A second
listening task presents test-takers with a sentence involving conditions,
comparisons, and time and number expressions (Harris and Palmer, 1970b, p. 2).
The candidates then choose from among four options to select ‘the one accurate
paraphrase’ for the sentence they heard. For example: Student hears: ‘George
has just returned from vacation.’
(a) George is spending his vacation at home.
(b) George has just finished his vacation.
(c) George is just about to begin his vacation.
(d) George has decided not to take a vacation.
This item type seems to be designed to measure the examinees' ability to understand grammatical
meaning, or the literal and intended meaning of the utterance in the input. Given the slightly indirect
association that examinees need to make between ‘finishing a vacation’ (i.e.,
travel may or may not be involved in the response) and ‘returning from
vacation’ (i.e., travel is presumed in the input), it could be argued that this
item is measuring knowledge of grammatical meaning, where the relationship
between form and meaning is relatively, but not entirely, direct.
The CELT and the qualities of test usefulness
In terms
of the purposes and intended uses of the CELT, the authors explicitly stated,
‘the CELT is designed to provide a series of reliable
and easy-to-administer tests for measuring English language ability of nonnative speakers' (Harris and Palmer, 1970b, p. 1). As a result, concerns for
high reliability and ease of administration led the authors to make choices
privileging reliability and practicality over other qualities of test
usefulness. To maximize consistency of measurement, the authors used only
selected-response task types throughout the test, allowing for minimal
fluctuations in the scores due to characteristics of the test method. This
allowed them to adopt 'easy-to-administer' and 'easy-to-score' procedures for maximum practicality and reliability. Reliability was also enhanced by pretesting items with the goal of improving their psychometric
characteristics.
In my opinion, reliability might have been emphasized at the expense of other important test qualities, such as construct validity, authenticity, interactiveness and impact. For example, construct validity
was severely compromised by the mismatch among the purpose of the test, the way
the construct was defined and the types of tasks used to operationalize the
constructs. In short, scores from discrete-point grammar tasks were used to
make inferences about speaking ability rather than make interpretations about
the test-takers’ explicit grammatical knowledge.
Finally,
authenticity in the CELT was low due to the exclusive use of multiple-choice
tasks and the lack of correspondence between these tasks and those one might
encounter in the target language use domain. Interactiveness was also low due
to the test’s inability to fully involve the test-takers’ grammatical ability
in performing the tests. The impact of the CELT on stakeholders is not
documented in the published manual.
In all
fairness, the CELT was a product of its time, when emphasis was on
discrete-point testing and reliability, and when language testers were not yet
discussing qualities of test usefulness in terms of authenticity,
interactiveness and impact.
The Community English Program (CEP) Placement
Test
Purpose
The
Community English Program (CEP) Placement Test was first developed by students
and faculty in the TESOL and Applied Linguistics Programs at Teachers College,
Columbia University, in 2002, and is revised regularly. Unlike the previous
tests reviewed, the CEP Placement Test is a theme-based assessment designed to
measure the communicative language ability of learners entering the Community
English Program, a low-cost, adult ESL program servicing Columbia University
staff and people in the neighboring community. The CEP Placement Test consists
of five sections: listening, grammar, reading, writing and speaking. The first
four sections take one hour and 35 minutes to complete; the speaking test
involves a ten-minute interview. Inferences from the test scores are used to
place students in the program course that best matches their level of
communicative English language ability.
Construct definition and operationalization
Given
that the CEP is a theme-based ESL program, where language instruction is
contextualized within a number of different themes throughout the different
levels, the CEP Placement Test is also theme-based. The theme for the CEP
Placement Test under review is ‘Cooperation and Competition’. This is not one
of the themes students encounter in the program. In this test, all five test
sections assess different aspects of language ability while exposing examinees
to different aspects of the theme. To illustrate, the reading subtest presents
students with a passage on ants that explains how ants both cooperate and
compete; the listening subtest presents a passage on how students cooperate and
compete in US schools; and the grammar subtest presents a gapped passage that
revolves around competition in advertisements.
More
specifically, the grammar section of the CEP Placement Test is intended to
measure the students’ grammatical knowledge in terms of a wide range of
grammatical forms and meanings at both the sentence and the discourse levels.
Items on the test are designed to measure the students’ knowledge of lexical,
morphosyntactic and cohesive forms and meanings.
Measuring grammatical ability through language
use
In
addition to measuring grammatical ability in the grammar section, grammatical
ability is also measured in the writing and speaking sections of the test. The
writing section consists of one 30-minute essay to be written on the theme of
‘cooperation and competition’. Scores are derived from a four-point analytic
scoring rubric in which overall task fulfillment, content, organization,
vocabulary and language control are scored. The rubric constitutes an adapted
version of a rubric devised by Jacobs et al. (1981). Language use (i.e.,
grammatical ability) is implicitly defined in terms of the complexity of
grammatical forms, the number of errors and the range of vocabulary. For
example, the highest-level descriptors (4) describe performance as 'effective complex constructions; few errors of grammar, and sophisticated range of vocabulary'.
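As a rough sketch (not the actual CEP rubric), an analytic scale of this kind can be thought of as a set of categories, each rated on the same band, which together yield a profile rather than a single holistic judgment; the ratings below are invented.

# Rough sketch of an analytic writing profile in the spirit of a Jacobs-style
# rubric; the categories follow the description above, the ratings are invented.
categories = ["task fulfillment", "content", "organization",
              "vocabulary", "language control"]
ratings = {"task fulfillment": 3, "content": 3, "organization": 2,
           "vocabulary": 3, "language control": 2}

profile = ", ".join(f"{c}: {ratings[c]}/4" for c in categories)
average = sum(ratings.values()) / len(ratings)
print(profile)
print(f"Average band: {average:.1f}")  # a profile, plus an overall summary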
The CEP Placement Test and the qualities of
test usefulness
In terms
of the qualities of test usefulness, the developers of the grammar section of
the CEP Placement Test prioritize construct validity, reliability and
practicality. With regard to construct validity, the grammar section of this
test was designed to measure both grammatical form and meaning on the
sentential and discourse levels, sampling from a wide range of grammatical
features. In this test, grammatical ability is measured by means of four tasks
in the grammar section, one task in the writing section, and by several tasks
in the speaking section. In short, the CEP Placement Test measures both
explicit and implicit knowledge of grammar. Placement decisions based on
interpretations of the CEP Placement Test scores seem to be appropriate as only
a handful of misplacements are reported each term.
The
reliability of the grammar-test scores was also considered a priority from the
design stage of test development as seen in the procedures for item
development, test piloting and scoring. In an
effort to promote consistency (and quick return of the results), the CEP Placement Test developers
decided to use only multiple-choice tasks in the grammar section.
This decision was based on the results of the pilot tests, where the use
of limited-production grammar tasks showed inconsistent
scoring results and put a strain on time resources. Although MC tasks were used
to measure grammatical knowledge, the theme of the input was designed to be
aligned with the test theme, and the type of input in each task varied
(dialogue, advertisement, passage). Once the test design was established, the
grammar tasks were developed, reviewed and piloted a number of times before the
test became operational. Scoring is objective and machine-scored.
Test
authenticity was another major concern for the CEP Placement Test development
team. Therefore, in the test design phase of test development, it was decided
that test forms should contain one coherent theme across all test tasks in
order to establish a close correspondence between the TLU tasks (i.e., ones
that might be encountered in a theme-based curriculum) and the test tasks on
the CEP Placement Test. It was also decided that grammatical ability would be
measured by means of selected-response tasks in the grammar section and
extended-production tasks in the writing and speaking sections, with both task
types supporting the same overarching test theme.
It must
be noted that the use of one overarching theme in a placement test can be
controversial because of the potential for content bias. In other words, if one
group of students (e.g., the science students) is familiar with the theme, they
may be unfairly advantaged. In an attempt to minimize construct-irrelevant
variance, several measures were taken. First, one goal of the test was to
actually teach test-takers something about the theme in the process of taking
the test so that they could develop an opinion about the theme by the time they
got to the writing and speaking sections. To this end, terms and concepts
relating to the theme were explained in the listening and grammar sections and
reinforced throughout the test in an attempt to create, to the extent possible,
a direct relationship between the input and expected responses. Second, the theme
was approached from many angles: cooperation and competition in family relationships, in schools, in the animal kingdom and so forth. Third, each task was reviewed for its newsworthiness.
In other words, if the test developers felt that the information in the task
was ‘common knowledge’ for the test population, the text was changed. Finally,
in piloting the test, test-takers were asked their opinions about the use of
the theme. Results from this survey did not lead the testing committee to
suspect bias due to the use of a common theme.
Given
that the scores from the CEP Placement Test are used to make placement
decisions, the test is likely to have a considerable impact on the examinees.
Unfortunately, at this point, students are provided with no feedback other than
their course placement. The results of this test have also had a strong impact
on the CEP, since the program now has a much better system for grouping
students according to ability levels than it previously had. Unfortunately, no
research on impact is available.
Chapter 8
Learning-oriented assessments of grammatical
ability
Introduction
The
language tests reviewed in the previous chapter involved the grammar sections
from large-scale tests designed to measure global language proficiency, typically
for academic purposes. Like other large-scale and often high-stakes tests, they
were designed to make institutional decisions related to placement into or exit
from a language program, screening for language proficiency or reclassification
of school status based on whether a student had achieved the language skills
necessary to benefit from instruction in the target language. These tests
provide assessments for several components of language ability including, among
others, aspects of grammatical knowledge. In terms of the grammar sections of
the tests reviewed, a wide range of grammar points were assessed and, except
perhaps for the CEP Placement Test, the selection of test content was
relatively removed from the local constraints of instruction in specific
contexts. These large-scale tests were designed as one-shot, timed assessments
for examinees who bring to the testing situation a variety of experiences and
proficiency levels. The tests were different in the ways in which the qualities
of usefulness were prioritized, and the compromises that ensued from these
decisions.
Although
large-scale, standardized tests have an important role to play in some school
decisions and can have a positive impact on learning and instruction, the
primary mandate of large-scale exams is different from that of classroom
assessment. In the first place, large-scale language assessments are not
necessarily designed to promote learning and influence teaching in local
contexts. They rarely provide detailed feedback or diagnostic information to
students and teachers, and they are primarily oriented toward the measurement
of learner abilities at one point in time rather than continuously over a
stretch of time. Finally, large-scale language assessments do not benefit from
the knowledge that teachers bring to the assessment context regarding their
students’ instructional histories. As a result, score-based information
provided by large-scale, standardized tests is often of little practical use to
classroom teachers for pursuing a program designed to enhance learning and
personalize instruction.
In the
context of learning grammar, learning-oriented assessment of grammar reflects a
growing belief among educational assessment experts (e.g., Stiggins, 1987;
Gipps, 1994; Pellegrino, Baxter and Glaser, 1999; Rea-Dickins and Gardner,
2000) that if assessment, curriculum and instruction were more integrally
connected, student learning would improve (National Research Council, 2001b).
This approach attempts to provide teachers and learners with summative and/or
formative information on the test-takers’ grammatical ability. Summative
information from assessment allows teachers to assign grades based on specific
assessment criteria, report student progress at a single moment or over time,
and reward and motivate student learning. Formative information from assessment
provides teachers and learners with concrete information on what
aspects of the grammar students have and have not mastered
and involves them in the regulation and assessment of their own learning, so
that further learning can take place independently or in collaboration with
teachers and other students.
What is
learning-oriented assessment of grammar?
In
reaction to conventional testing practices typified by large-scale, discrete-point,
multiple-choice tests of language ability, several educators (e.g., Herman,
Aschbacher and Winters, 1992; Short, 1993; Shohamy, 1995; Shepard, 2000) have
advocated reforms so that assessment practices might better capture educational
outcomes and might be more consistent with classroom goals, curricula and
instruction. The terms alternative assessment, authentic assessment and performance assessment have all been associated with calls for reform in both large-scale and classroom assessment contexts. While alternative, authentic and performance assessment are often viewed as essentially the same, they
emphasize slightly different aspects of a move away from conventional,
discrete-point, standardized assessment.
Alternative
assessment emphasizes an alternative to and rejection of selected-response, timed and one-shot approaches to assessment, whether they occur in large-scale or classroom assessment contexts. Alternative assessment encourages assessments in which students are asked to perform,
create, produce or do meaningful tasks that both tap into higher-level thinking
(e.g., problem-solving) and have real-world implications (Herman et al., 1992).
Similar
to alternative assessment, authentic assessment stresses measurement practices
which engage students’ knowledge and skills in ways similar to those one can
observe while performing some real-life or ‘authentic’ task (O’Malley and
Valdez-Pierce, 1996).
Performance
assessment refers to the evaluation of outcomes relevant to a domain of
interest (e.g., grammatical ability), which are derived from the observation of
students performing complex tasks that invoke real-world applications (Norris et al., 1998). As with most performance data, assessments are scored by human judges (Stiggins, 1987; Herman et al., 1992; Brown, 1998) according to a scoring rubric that describes what test-takers need to do in order to demonstrate
knowledge or ability at a given performance level. Bachman (2002) characterized
language performance assessment as typically: (1) involving more complex
constructs than those measured in selected-response tasks; (2) utilizing more
complex and authentic tasks; and (3) fostering greater interactions between the
characteristics of the test-takers and the characteristics of the assessment
tasks than in other types of assessments. Performance assessment encourages
self-assessment by making explicit the performance criteria in a scoring
rubric. In this way, students can then use the criteria to evaluate their
performance and contribute proactively to their own learning.
While
these three approaches better reflect the types of academic competencies that
most language educators value and wish to promote, a learning-oriented approach
to assessment maintains a clear and unambiguous focus on assessment for the
purpose of fostering further learning relevant to some domain of interest
(e.g., grammatical ability). Learning is defined here as the accumulation
of knowledge and the ability to use this knowledge for some purpose (i.e.,
skill). To highlight the learning mandate in the assessment of grammar in
classroom contexts, I will use the term learning-oriented assessment of
grammar.
In terms
of method, learning-oriented assessment of grammar reflects the belief that
assessments must remain open to all task types if the mandate is to provide
information about student performance on the one hand, and information about
the processing of grammatical input and the production of grammatical output on
the other. Therefore, unlike with other approaches, operationalization involves
the use of selected-response, limited-production and complex,
extended-production tasks that may or may not invoke real-life applications or
interaction. Just as in large-scale assessments, though, the specification of
test tasks varies according to the specific purpose of the assessment and the
claims we would like to make about what learners know and can do, and in fact,
in some instances, a multiple-choice task may be the most appropriate task type
available.
Implementing learning-oriented assessment of
grammar
Considerations
from grammar-testing theory
The
development procedures for constructing large-scale assessments of grammatical
ability discussed in Chapter 6 are similar to those needed to develop
learning-oriented assessments of grammar for classroom purposes with the
exception that the decisions made from classroom assessments will be somewhat
different due to the learning-oriented mandate of classroom assessment. Also,
given the usual low-stakes nature of the decisions in classroom assessment, the
amount of resources that needs to be expended is generally less than that
required for large-scale assessment.
Implications for test design
In designing classroom-based, learning-oriented assessments, we
need to provide a much more explicit depiction of the assessment mandate than
we might do for large-scale assessments. This is because classroom assessment,
especially in school contexts, has many interested stakeholders (e.g. students,
teachers, parents, tutors, principals, school districts), who are likely to be
held accountable for learning and who will use the assessment information to
evaluate instructional outcomes and plan for further instruction.
A second
consideration in which the design stage of classroom-based, learning-oriented
assessment may differ from that of large-scale assessment is construct
definition. Learning-oriented assessment aims to measure simple and/or complex
constructs depending on both the claims that the assessment is designed to make
and the feedback that can result from an observation of performances. Applied
to grammar learning, a ‘simple’ construct might involve the assessment of
regular and irregular past tense verb forms presented in a passage with gaps
about the disappearance of the dinosaurs.
A third
consideration for classroom-based, learning-oriented assessment is the need to
measure the students’ explicit as well as their implicit knowledge of grammar.
Selected-response and limited-production tasks, or tasks that include planning
time, will elicit the students’ explicit knowledge of grammar. In addition, it
is important to assess the students’ implicit or internalized knowledge of the
grammar. To do this, students should be asked to demonstrate their capacity to
use grammatical knowledge to perform complex, real-time tasks that invoke the
language performances one would expect to observe in instructional or
real-life situations.
Implications for operationalization
The
operationalization stage of classroom-based, learning-oriented assessment is
also similar to that of large-scale assessments. That is, the outcome should be
a blueprint for the assessment, as described in Chapter 6. The learning
mandate, however, will obviously affect the specification of test tasks so that
characteristics such as the setting, the rubrics or the expected response can
be better aligned with instructional goals. For example, in classroom-based
assessment, we may wish to collect information about grammar ability during the
course of instruction, and we may decide to evaluate performance by means of
teacher observation reports, or we may wish to assess grammatical ability by means of informal oral interviews conducted over several days.
Learning-oriented
assessment of grammar may be achieved by means of a wide array of
data-gathering methods in classroom contexts. These obviously include
conventional quizzes and tests containing selected-response,
limited-production and all sorts of extended-production tasks, as discussed
earlier. These conventional methods provide achievement or diagnostic
information to test-users, and can occur before, during or after instruction,
depending on the assessment goals. They are often viewed as ‘separate’ from
instruction in terms of their administration. These assessments are what
most teachers typically call to mind when they think of classroom tests.
In
addition to using stand-alone tests, learning-oriented assessment promotes the
collection of data on students’ grammatical ability as an integral part of
instruction. While teachers have always evaluated student performance
incidentally in class with no other apparent purpose than to make instructional
choices, classroom assessment activities can be made more systematic by means
of learning-oriented assessment.
In order to situate how a grammar learning mandate in classroom contexts can impact operationalization
decisions, consider the following situation. Imagine you are teaching an
intermediate foreign-language course in a theme-based program. The overarching
unit theme is crime investigation or the ‘whodunit’ (i.e., ‘who done it’, short
for a detective story or mystery à la Agatha Christie or Detective Trudeau).
This theme is used to teach and assess modal auxiliary forms for the
purpose of expressing degrees of certainty (e.g., It may/might/could/can’t/must/has
to be the butler who stole the jewelry). Students are taught to speculate about
possible crime suspects by providing motives and drawing logical conclusions.
Operationalization
in classroom-based assessment, however, need not be limited to this
conventional approach to assessment. Teachers can obviously specify the
characteristics of the test setting, the characteristics of the test rubrics
and other task characteristics for that matter, in many other ways depending on
the learning mandate and the goals of assessment. Let
us examine how the setting and test rubrics, for example, can be modified to
assess grammar for different learning goals.
In
specifying the characteristics of the test rubrics, teachers might decide to
vary the scoring method to accommodate different learning-oriented assessment goals. For example, after giving the task in Figure 8.2, they might choose to score the recorded performance samples themselves,
by means of an analytic rating scale that measures modal usage in terms of
accuracy, range, complexity and meaningfulness. They might also have students
listen to the tapes to score their own and their peers’ performances.
In
classroom assessments designed to promote learning, the scoring process, whether
implemented by teachers, the students themselves or their peers, results in a
written or oral evaluation of candidate responses. This, in turn, provides
learners with summative and/or formative information (i.e., feedback) so that
they can compare or 'notice' the differences between their interlanguage utterances and the target-language utterances. According to Schmidt (1990, 1993), Sharwood Smith (1993) and Skehan (1998), this information makes the learners more ready to accommodate the differences between their interlanguage and the target language, thereby contributing to the ultimate internalization
of the learning point.
In
addition to teacher feedback, the scoring method in learning-oriented assessment can be specified to involve students. From a learning perspective, students need to develop the capacity for self-assessment so that they can learn to 'notice'
for themselves how their language compares with the target-language norms.
Learning to mark their own (or their peers’) work can, in itself, trigger
further reanalysis and restructuring of interlanguage forms, as discussed
in Chapter 2. It can also foster the development of skills needed to regulate
their own learning and it places more responsibility for learning on the
students (Rief, 1990).
Planning for further learning
The
usefulness of learning-oriented, classroom assessment is to a great extent
predicated upon the quality and explicitness of information obtained and its
relevance for further action. Research has shown, however, that the quality of
feedback contributes more to further learning than the actual presence or
absence of it (Bangert-Drowns et al., 1991).
Teachers
have many options for presenting assessment results to students. They could
present students with feedback in the form of a single overall test score, a
score for each test component, scores referenced to a rubric, a narrative
summary of teacher observations or a profile of scores showing development over
time. Feedback can also be presented in a private conference with the
individual student. In an effort to understand the effect of feedback on
further learning, Butler (1987) presented test-takers with feedback from an assessment in one
of three forms: (1) focused written comments that addressed criteria
test-takers were aware of before the assessment; (2) grades derived from
numerical scoring; and (3) grades and comments. Test-takers were then given two
subsequent tasks, and significant gains were observed with those who received
the detailed comments.
Furthermore,
research has also demonstrated that feedback focusing on learning goals has led
to greater learning gains than, for example, feedback emphasizing self-esteem
(Butler, 1988). Therefore, feedback from assessments should provide information
on not only the quality of the work at hand, but also the identification of
student improvement goals. While feedback can be provided by teachers, students
should be involved in identifying areas for improvement and for setting
realistic improvement goals. In fact, considerable research (e.g.,
Darling-Hammond, Ancess and Falk, 1995) has shown the learning benefits of
engaging students in self- and peer assessment.
Considerations from L2 learning theory
Given
that learning-oriented assessment involves the collection and interpretation of
evidence about performance so that judgments can be made about further language
development, learning-oriented assessment of grammar needs to be rooted not
only in a theory of grammar testing or language proficiency, but also in a
theory of L2 learning. What is striking in the literature is that models of
language ability rarely refer to models of language learning, and models of
language learning rarely make reference to models of language ability.
As we
have seen, implementing grammar assessment with a learning mandate has
implications for test construction. Some of these implications have already
been discussed. However, implementing learning-oriented assessment of grammar is not only about task design and operationalization; teachers also need to
consider how assessment relates to and can help promote grammar acquisition, as
described by Van Patten (1996).
SLA processes – briefly revisited
As
discussed in Chapter 2, research in SLA suggests that learning an L2 involves
three simultaneously occurring processes: input processing (Van Patten,
1996), system change (Schmidt, 1990) and output processing (Swain, 1985; Lee
and Van Patten, 2003). Input processing relates to how the learner
understands the meaning of a new grammatical feature or how form–meaning connections are made (Ellis,
1993; Van Patten, 1996). A critical first stage of acquisition is the
conversion of input to ‘intake’. The second set of processes, system change,
refers to how learners accommodate new grammatical forms into their interlanguage and how this change helps restructure their interlanguage so that it is more target-like (McLaughlin, 1987; DeKeyser, 1998).
Assessing for intake
Van Patten and Cadierno (1993b) describe
this critical first stage of acquisition as the process of converting input
into ‘intake’. In language classrooms, considerable time is spent on
determining if students have understood. As most teachers know, however, it is
difficult to discern if their students have mapped meaning onto the form. In
fact, some students can fake their way through an entire lesson without having
really understood the meaning of the target forms. Given that communicative
language classrooms encourage the use of tasks in which learners must use
language meaningfully (Savignon, 1983; Nunan, 1989) and the use of
comprehensible input as an essential component of instruction (Krashen, 1982;
Krashen and Terrell, 1983), I believe that teachers should explicitly assess
for intake as a routine part of instruction.
Assessing
for intake requires that learners understand the target forms, but do not
produce them themselves. This can be achieved by selected-response and limited-production tasks in which learners need to make form–meaning connections. Three examples of
interpretation tasks designed to assess for intake are presented below.
(For additional examples of interpretation tasks, see Ellis, 1997;
Lee and Van Patten, 2003; and Van Patten, 1996, 2003.).
Assessing to push restructuring
Once
input has been converted into intake, the new grammatical feature is ready to
be ‘accommodated’ into the learner’s developing linguistic system, causing a
restructuring of the entire system (Van Patten, 1996). To initiate this
process, teachers provide students with tasks that enable them to use the new
grammatical forms in decreasingly controlled situations so they can incorporate
these forms into their existing system of implicit grammatical knowledge. By
attending to grammatical input and by getting feedback, learners are able to
accommodate the differences between their interlanguage and the target
language. Assessment plays an important role in pushing this restructuring
process forward since it contributes concrete information to learners on the
differences between the grammatical forms they are using and those they should
be using to communicate the intended meanings.
Assessing for output processing
Although
learners may have developed an explicit knowledge of the form and meaning of a
new grammatical point, this does not necessarily mean they can access this
knowledge automatically in spontaneous communication. In order for learners to
produce unplanned, meaningful output in real time (i.e., speaking), they need
to be able to tap into grammatical knowledge that is already an unconscious
part of their developing system of language knowledge (Lee and VanPatten,
2003). Thus, to assess the test takers’ implicit knowledge of grammar (i.e.,
their ability to process output), test-takers need to be presented with tasks
that ask them to produce language in real time, where the focus is more on the
content being communicated or on the completion of the task than on the
application of explicit grammar rules.
Classroom
assessments that are cognitively complex typically involve the processing of
topical information from multiple sources in order to accomplish some task that
requires complex or higher-order thinking skills (Burns, 1986). Based on an
analysis of tasks in school subject classrooms (e.g., social studies), Marzano,
Pickering and McTighe (1993) provide a list of commonly identified reasoning
processes (which I have added to) that are used in cognitively complex tasks.
• Comparing • Error analysis • Experimental inquiry
• Classifying • Constructing support • Invention
• Induction • Analyzing perspectives • Abstracting
• Deduction • Decision making • Diagnosis
• Investigation • Problem solving • Summarizing
Some of
these processes, I might add, are not uncommon in communicative language
teaching or in language classroom assessment.
Illustrative example of learning-oriented
assessment
Let us
now turn to an illustration of a learning-oriented achievement test of
grammatical ability.
Background
The
example is taken from Unit 7 of On Target 1 Achievement Tests (Purpura et al.,
2001). This is a book of achievement tests designed to accompany On Target 1
(Purpura and Pinkley, 1999), a theme-based, integrated-skills program designed
for secondary school or adult learners of English as a second or foreign
language at the lower–mid intermediate level of proficiency. On Target provides
instruction in language (e.g., grammar, vocabulary, pronunciation), language
use (e.g., listening, speaking, reading and writing) and thematic development
(e.g., record-breaking, mysteries of science). The goal of the achievement
tests is ‘to measure the students’ knowledge of grammar, vocabulary,
pronunciation, reading and writing, as taught in each unit' (Purpura et al.,
2001, p. iii).
Making assessment learning-oriented
The On
Target achievement tests were designed with a clear learning mandate. The
content of the tests had to be strictly aligned with the content of the
curriculum. This obviously had several implications for the test design and its
operationalization. From a testing perspective, the primary purpose of the Unit
7 achievement test was to measure the students’ explicit as well as their
implicit knowledge of grammatical form and meaning on both the sentence and
discourse levels. More specifically, the test was intended to measure the
degree to which test-takers had learned the present perfect tense with repeated
actions (How many times have you been . . . ?) and with length of time actions
(How long have you been . . . ?). Test inferences also included the learners’
ability to use this knowledge to discuss life achievements.
While
the TLU domain was limited to the use of the present perfect tense to discuss
life achievements, the constructs and tasks included in the test were both
simple and complex. For example, the first gap-filling grammar task was
intended only to assess the test-takers’ explicit knowledge of morphosyntactic
form and the pronunciation task focused only on their explicit knowledge of
phonological form. The second grammar task was slightly more complex in that it
aimed to measure the test-takers’ ability to use these forms to communicate
literal and intended meanings based on more extensive input.
In terms
of scoring, the specification of participants was left to the teachers’
discretion in case they wished to score the tasks themselves or involve
students in self- or peer assessment activities. Each task contained a
recommended number of points so that learners could receive a score for each
task. To rate performance on the extended-production tasks, teachers were encouraged to use the scoring criteria provided.
The writing task was included in the test to assess the test-takers’ implicit
knowledge of grammar. In this task, test-takers had to maintain a focus on
meaning as they wrote a grammatically accurate, meaningful and well-organized
paragraph about past achievements. In sum, the On Target achievement test
attempted to take into consideration elements from both grammar-testing theory
and L2 learning theory in achieving a learning-oriented assessment mandate.
Chapter
9
Challenges and new directions in assessing
grammatical ability
Introduction
Research
and theory related to the teaching and learning of grammar have made
significant advances over the years. In applied linguistics, our understanding
of language has been vastly broadened with the work of corpus-based and
communication-based approaches to language study, and this research has made
pathways into recent pedagogical grammars.
Also,
our conceptualization of language proficiency has shifted from an emphasis on
linguistic form to one on communicative language ability and communicative language
use, which has, in turn, led to a de-emphasis on grammatical accuracy and a
greater concern for communicative effectiveness. In language teaching, we moved
from a predominant focus on structures and metalinguistic terminology to an
emphasis on comprehensible input, interaction and no explicit grammar
instruction. From there, we adopted a more balanced approach to language
instruction, where meaning and communication are still emphasized, but where
form and meaning-focused instruction have a clear role.
The state of grammar assessment
In the
last fifty years, language testers have dedicated a great deal of time to
discussing the nature of language proficiency and the testing of the four
skills, the qualities of test usefulness (i.e., reliability, authenticity), the
relationships between test-taker or task characteristics and performance, and
numerous statistical procedures for examining data and providing evidence of
test validity. In all of these discussions, very little has been said about the
assessment of grammatical ability, and unsurprisingly, until recently, not much
has changed since the 1960s. In other words, for the past fifty years,
grammatical ability has been defined in many instances as morphosyntactic form
and tested either in a discrete-point, selected-response format (a practice initiated by several large language-testing firms and emulated by classroom teachers) or in a discrete-point, limited-production format, typically by means of the cloze or some other gap-filling task.
In
recent years, the assessment of grammatical ability has taken an interesting
turn in certain situations. Grammatical ability has been assessed in the
context of language use under the rubric of testing speaking or writing. This
has led, in some cases, to examinations in which grammatical knowledge is no
longer included as a separate and explicit component of communicative language
ability in the form of a separate subtest. In other words, only the students’
implicit knowledge of grammar alongside other components of communicative
language ability (e.g., topic, organization, register) is measured. Having
discussed how grammar assessment has evolved over the years, I will discuss in
the next section some ongoing issues and challenges associated with assessing
grammar.
Challenge
1: Defining grammatical ability
One
major challenge revolves around how grammatical ability has been defined both
theoretically and operationally in language testing. As we saw in Chapters 3
and 4, in the 1960s and 1970s language teaching and language testing maintained
a strong syntactocentric view of language rooted largely in linguistic
structuralism. Moreover, models of language ability, such as those proposed by
Lado (1961) and Carroll (1961), had a clear linguistic focus, and assessment
concentrated on measuring language elements defined in terms of morphosyntactic
forms at the sentence level while performing language skills. Grammatical knowledge was determined solely in terms of linguistic accuracy. This approach to testing led to examinations such as the CELT (Harris and Palmer, 1970a) and
the English Proficiency Test battery (Davies, 1964).
With an
expanded definition of grammatical knowledge, however, come several theoretical
challenges. The first is for language educators to make clear distinctions
between the form and meaning components of grammatical knowledge and, if
relevant to the test purpose, to incorporate these distinctions in construct
definition. Making finer distinctions between form and meaning will require
adjustments in how we approach grammar assessment and may require innovation.
While the
current research on learner-oriented corpora has shown great promise, many more
insights on learner errors and interlanguage development could be obtained if
other components of grammatical form (e.g., information management forms and
interactional forms) and grammatical meaning were also tagged at both the
sentence and the discourse levels. For example, in a talk on the use of corpora
for defining learning problems of Korean ESL students at the University of
Illinois, Choi (2003) identified the following errors as passive errors:
1. The
color of her face was changed from a pale white to a bright red.
2. It is
ridiculous the women in developing countries are suffered.
Challenge
2: Scoring grammatical ability
A second
challenge relates to scoring, as the specification of both form and meaning is
likely to influence the ways in which grammar assessments are scored. As we
discussed in Chapter 6, responses with multiple criteria for correctness may
necessitate different scoring procedures. For example, the use of dichotomous
scoring, even with certain selected-response items, might need to give way
to partial-credit scoring, since some wrong answers may reflect partial
development either in form or meaning. As a result, language educators might
need to adapt their scoring procedures to reflect the two dimensions of
grammatical knowledge.
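As a purely illustrative aside (not a procedure from the book), the following Python sketch contrasts dichotomous scoring with partial-credit scoring across the two dimensions of grammatical knowledge; the 0/0.5/1 rubric levels, the equal weighting of form and meaning, and the sample response are hypothetical.

```python
# Illustrative sketch only: the 0 / 0.5 / 1 rubric levels and the equal weighting of
# form and meaning are hypothetical choices, not a procedure prescribed in the book.
from dataclasses import dataclass

@dataclass
class GrammarScore:
    form: float     # 0 = inaccurate, 0.5 = partially accurate, 1 = fully accurate
    meaning: float  # 0 = not conveyed, 0.5 = partially conveyed, 1 = fully conveyed

def dichotomous(score: GrammarScore) -> int:
    """Traditional right/wrong scoring: credit only if both dimensions are perfect."""
    return int(score.form == 1 and score.meaning == 1)

def partial_credit(score: GrammarScore, weights=(0.5, 0.5)) -> float:
    """Weighted partial credit across the form and meaning dimensions."""
    return weights[0] * score.form + weights[1] * score.meaning

# A response that conveys its meaning fully but is only partially accurate in form
response = GrammarScore(form=0.5, meaning=1.0)
print(dichotomous(response))     # 0    - no credit under dichotomous scoring
print(partial_credit(response))  # 0.75 - partial development is still rewarded
```

Scoring in this way also makes it straightforward to report a separate score for each dimension rather than a single composite.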
A clear
example of the need to score for form and meaning can be seen in some of the
latest research related to computer-assisted language learning (CALL). Several
studies (e.g., Heift, 2003) have investigated, for example, the role of
different types of corrective feedback (i.e., explicit correction,
metalinguistic information, repetition by highlighting) on grammar development.
Grammar performance errors in these studies were scored for form alone. In
future studies, the scoring of both grammatical form and meaning, when
applicable, might provide interesting insights into learner uptake in CALL.
Another
challenge relates to the scoring of grammatical ability in complex performance
tasks. In instances where the assessment goals call for the use of complex
performance tasks, we need to be sure to use well-developed scoring
rubrics and rating scales to guide raters to focus their judgments only on the
constructs relevant to the assessment goal.
McNamara
(1996) stresses that the scales in such tasks represent, explicitly or
implicitly, the theoretical basis upon which the performance is judged.
Therefore, clearly defined constructs of grammatical ability and how they are
operationalized in rating scales are critical.
Challenge
3: Assessing meanings
The
third challenge revolves around ‘meaning’ and how ‘meaning’ in a model of
communicative language ability can be defined and assessed. The ‘communicative’
in communicative language teaching, communicative language
testing, communicative language ability, or communicative competence
refers to the conveyance of ideas, information, feelings, attitudes and other
intangible meanings (e.g., social status) through language. Therefore, while
the grammatical resources used to communicate these meanings precisely are
important, the notion of meaning conveyance in the communicative curriculum is
critical. Thus, in order to test something as intangible as meaning in
second or foreign language use, we need to define what it is we are testing.
Looking
to linguists (and language philosophers) for help in defining meaning (e.g.,
Searle, 1969; Lyons, 1977; Leech, 1983; Levinson, 1983; Jaszczolt, 2002), we
will soon realize that meaning is not only a characteristic of the language and
its forms (i.e., semantics), but also a characteristic of language use (i.e.,
pragmatics). This, in turn, involves the links among explicitly stated meanings
in an utterance, the language user’s intentions, presuppositions and knowledge
of the real world, and the specific context in which the utterance is made. We
will also realize that boundary debates between semantics and pragmatics have
been long and interesting, but have produced no simple answer with respect to
the meaning of ‘meaning’ and the distinctions between semantics and pragmatics.
Challenge
4: Reconsidering grammar-test tasks
The
fourth challenge relates to the design of test tasks that are capable of both
measuring grammatical ability and providing authentic and engaging measures of
grammatical performance. Since the early 1960s, language educators have
associated grammar tests with discrete-point, multiple-choice tests of
grammatical form. These and other ‘traditional’ test tasks (e.g.,
grammaticality judgments) have been severely criticized for lacking in
authenticity, for not engaging test-takers in language use, and for promoting
behaviors that are not readily consistent with communicative language teaching.
Discrete-point testing methods may have even led some teachers to have
reservations about testing grammar or to have uncertainties about how to test
it communicatively.
In
providing grammar assessments, the challenge for language educators is to
design tasks that are authentic and engaging measures of performance. To do
this, I have argued that we must first consider the assessment purpose and the
construct we would like to measure. We also need to contemplate the kinds of
grammatical performance that we would need to obtain in order to provide
evidence in support of the inferences we want to be able to make about
grammatical ability. Once we have specified the inferences, or claims, that we
would like to make and the kinds of evidence we need to support these claims,
we can then design test tasks to measure what test-takers know about grammar or how
they are able to use grammatical resources to accomplish a wide range of
activities in the target language.
Challenge
5: Assessing the development of grammatical ability
The
fifth challenge revolves around the argument, made by some researchers, that
grammatical assessments should be constructed, scored and interpreted with
developmental proficiency levels in mind. This notion stems from the work of
several SLA researchers (e.g. Clahsen, 1985; Pienemann and Johnston, 1987;
Ellis, 2001b) who maintain that the principal finding from years of SLA
research is that structures appear to be acquired in a fixed order and a fixed
developmental sequence. Furthermore, instruction on forms in non-contiguous
stages appears to be ineffective. As a result, the acquisitional development of
learners, they argue, should be a major consideration in L2 grammar
testing.
In terms
of test construction, Clahsen (1985) claimed that grammar tests should be based
on samples of spontaneous L2 speech with a focus on syntax and morphology, and
that the structures to be measured should be selected and graded in terms of
order of acquisition in natural L2 development. Furthermore, Ellis (2001b)
argued that grammar scores should be calculated to provide a measure of both
grammatical accuracy and the underlying acquisitional development of L2
learners. In the former, the target-like accuracy of a grammatical form can be
derived from a total correct score or percentage.
As
intuitively appealing as the recommendation for developmental scores might
appear, the research base on developmental orders and sequences is vastly
incomplete and at too early a stage for use as a basis for assessment
(Lightbown, 1985; Hudson, 1993; Bachman, 1998). Furthermore, as I have
argued in Chapters 3 and 4, grammatical knowledge involves more than knowledge
of morphosyntactic or lexical forms; meaning is a critical component. In other
words, test-takers can be communicatively effective and at the same time inaccurate; they can be highly accurate but communicatively ineffective; and they can be both communicatively effective and highly accurate. Without more
complete information on the patterns of acquisition relating to other
grammatical forms as well as to grammatical meaning, language testers would not
have a solid empirical basis upon which to construct, score and interpret the
results from grammar assessments based solely on developmental scores.
In sum,
the challenge for language testers is to design, score and interpret grammar
assessments with a consideration for developmental proficiency. While this idea
makes sense, what basis can we use to infer progressive levels of development?
Results from acquisitional development research have been proposed as a basis
for such interpretations by some researchers. At this stage of our
knowledge, other more viable ways of accounting for what learners know might be
better obtained from the way grammatical performance is scored. Instead of
reporting one and only one composite, accuracy-based score, we can report a
profile of scores, one for each construct we are measuring. Furthermore, in the
determination of these scores, we can go beyond dichotomous scoring to give
more precise credit for attainment of grammatical ability. Finally, scores that
are derived from partial credit reflect different levels of development and can
be interpreted accordingly. In other words, acquisitional developmental levels
need not be the only basis upon which to make inferences about grammatical
development.
Final remarks
Despite
loud claims in the 1970s and 1980s by a few influential SLA researchers that
instruction, and in particular explicit grammar instruction, had no effect on language
learning, most language teachers around the world never really gave up grammar
teaching. Furthermore, these claims have instigated an explosion of empirical
research in SLA, the results of which have made a compelling case for the
effectiveness of certain types of both explicit and implicit grammar
instruction. This research has also highlighted the important role that meaning
plays in learning grammatical forms.
In the
same way, most language teachers and SLA researchers around the world have
never really given up grammar testing. Admittedly, some have been perplexed as
to how grammar assessment could be compatible with a communicative language
teaching agenda, and many have relied on assessment methods that do not
necessarily meet the current standards of test construction and validation.
With the exception of Rea-Dickins and a few others, language testers have
been of little help. In fact, a number of influential language proficiency
exams have abandoned the explicit measurement of grammatical knowledge and/or
have blurred the
boundaries between communicative effectiveness and communicative precision (i.e., accuracy).
My aim
in this book, therefore, has been to provide language teachers, language
testers and SLA researchers with a practical framework, firmly based in
research and theory, for the design, development and use of grammar
assessments. I have tried to show how grammar plays a critical role in
teaching, learning and assessment. I have also presented a model of grammatical
knowledge, including both form and meaning, that could be used for test
construction and validation. I then showed how L2 grammar tests can be
constructed, scored and used to make decisions about test-takers in both
large-scale and classroom contexts. Finally, in this last chapter, I have
discussed some of the challenges we still face in constructing useful grammar
assessments. My hope is that this volume will not only help language teachers,
testers and SLA researchers develop better grammar assessments for their
respective purposes, but also instigate research and continued discussion on the
assessment of grammatical ability and its role in language learning.
SUMMARY OF THE ASSESSING VOCABULARY BOOK
CHAPTER 1
The Place of Vocabulary in Language Assessment
Introduction
At
first glance, it may seem that assessing the vocabulary knowledge of second language
learners is both necessary and reasonably straightforward. It is necessary in the
sense that words are the basic building blocks of language, the units of meaning
from which larger structures such as sentences, paragraphs and whole texts are formed.
Vocabulary
assessment seems straightforward in the sense that word lists are readily available
to provide a basis for selecting a set of words to be tested. In addition, there
is a range of well-known item types that are convenient to use for vocabulary testing.
Here are some examples:
Multiple-choice
(Choose the correct answer)
Completion
(Write in the missing word)
Translation
(Give the L1 equivalent of the underlined word) They worked at the mill.
Matching
(Match each word with its meaning)
Recent trends in language testing
However,
scholars in the field of language testing have a rather different perspective on vocabulary-test items of the conventional kind. Such items fit neatly into what language testers call the discrete-point approach to testing. This involves designing tests
to assess whether learners have knowledge of particular structural elements of the
language: word meanings, word forms, sentence patterns, sound contrasts and so on.
The
widespread acceptance of the validity of these criticisms has led to the adoption – particularly in the major English-speaking countries – of the communicative approach to language testing.
Bachman and Palmer's (1996) book Language Testing in Practice is a comprehensive and influential volume on language-test design and development. Following Bachman's
(1990) earlier work, the authors see the purpose of language testing as being to
allow us to make inferences about learners' language ability, which consists of
two components. One is language knowledge and the other is strategic competence.
Three dimensions of vocabulary assessment
Up
to this point, I have outlined two contrasting perspectives on the role of vocabulary
in language assessment. One point of view is that it is perfectly sensible to write
tests that measure whether learners know the meaning and usage of a set of words,
taken as independent semantic units. The other view is that vocabulary must always
be assessed in the context of a language-use task, where it interacts in a natural
way with other components of language knowledge.
Discrete - embedded
The
first dimension focuses on the construct which underlies the assessment instrument. In language testing, the term construct refers to the mental attribute or ability that a test is designed to measure. A discrete test takes vocabulary knowledge as a distinct construct, separated from other components of language competence. An embedded vocabulary measure is one that contributes to the assessment of a larger construct. I have already given an example
of such a measure, when I referred to Bachman and Palmer's task of writing a proposal
for the improvement of university admissions procedures.
Selective - comprehensive
The
second dimension concerns the range of vocabulary to be included in the assessment.
A conventional vocabulary test is based on a set of target words selected by the
test-writer, and the test-takers are assessed according to how well they demonstrate
their knowledge of the meaning or use of those words. This is what I call a selective vocabulary measure. The target
words may either be selected as individual words and then incorporated into separate
test items, or alternatively the test-writer first chooses a suitable text and then
uses certain words from it as the basis for the vocabulary assessment.
On
the other hand, a comprehensive measure takes account of all the vocabulary content
of a spoken or written text. For example, let us take a speaking test in which the
learners are rated on various criteria, including their range of expression.
Context-independent - context-dependent
The
role of context, which is an old issue in vocabulary testing, is the basis for the
third dimension. Traditionally contextualisation has meant that a word is presented
to test-takers in a sentence rather than as an isolated element. From a contemporary
perspective, it is necessary to broaden the notion of context to include whole texts
and, more generally, discourse.
The
issue of context dependence also arises with cloze tests, in which words are systematically
deleted from a text and the test-takers' task is to write a suitable word in each
blank space.
Generally
speaking, vocabulary measures embedded in writing and speaking tasks are context
dependent in that the learners are assessed on the appropriateness of their vocabulary
use in relation to the task.
CHAPTER 2
The Nature Of Vocabulary
Introduction
Before
we start to consider how to test vocabulary, it is necessary first to explore the
nature of what we want to assess. Our everyday concept of vocabulary is dominated
by the dictionary. We tend to think of it as an inventory of individual words, with
their associated meanings. This view is shared by many second language learners,
who see the task of vocabulary learning as a matter of memorising long lists of
L2 words, and their immediate reaction when they encounter an unknown word is to
reach for a bilingual dictionary.
What is a word?
A
basic assumption in vocabulary testing is that we are assessing knowledge of words.
But the word is not an easy concept to define, either in theoretical terms or for
various applied purposes. There are some basic points that we need to spell out
from the start. One is the distinction between tokens and types, which applies to
any count of the words in a text.
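As a simple illustration of the token/type distinction (the example sentence is invented, not taken from the book), the following Python sketch counts both:

```python
# Tokens are the running words of a text; types are the distinct word forms.
# The example sentence is a hypothetical stand-in.
import re
from collections import Counter

text = "The cat sat on the mat because the mat was warm."
tokens = re.findall(r"[a-z']+", text.lower())  # every running word is a token
types = Counter(tokens)                        # each distinct form is a type

print(len(tokens))   # 11 tokens
print(len(types))    # 8 types ('the' occurs three times, 'mat' twice)
print(types.most_common(2))
```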
Mention
of words like the, a, to, and, in and that leads to the question of whether they
are to be regarded as vocabulary items. Words of this kind - articles, prepositions,
pronouns, conjunctions, auxiliaries, etc. - are often referred to as function words
and are seen as belonging more to the grammar of the language than to its vocabulary.
Unlike content words - nouns, 'full' verbs, adjectives and adverbs - they have little, if any, meaning in isolation and serve more to provide links within sentences, modify
the meaning of content words and so on.
What about larger lexical items?
The
second major point about vocabulary is that it consists of more than just single words. For a start, there are phrasal verbs and other multi-word units: the language user has 'available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments' (p. 110).
They
identify four categories of lexical phrases:
1. Polywords: short fixed phrases that
perform a variety of functions, such as for the most part (which they call a qualifier),
at any rate and so to speak (fluency devices), and hold your horses (disagreement marker).
2. Institutionalised expressions: longer
utterances that are fixed in form and include proverbs, aphorisms and formulas for
social interaction.
3. Phrasal constraints: short- to medium-length phrases consisting of a basic frame with one or two slots that can be filled with various words or phrases.
4. Sentence builders: phrases that provide
the framework for a complete sentence, with one or more slots in which a whole idea
can be expressed.
What does it mean to know a lexical
item?
The
other seven assumptions cover various aspects of what is meant by knowing a word:
2. Knowing
a word means knowing the degree of probability of encountering that word in speech
or print. For many words we also know the sort of words most likely to be found
associated with the word.
3. Knowing
a word implies knowing the limitations on the use of the word according to variations
of function and situation.
4. Knowing
a word means knowing the syntactic behaviour associated with the word.
5. Knowing
a word entails knowledge of the underlying form of a word and the derivations that
can be made from it.
6. Knowing
a word entails knowledge of the network of associations between that word and other
words in the language.
7. Knowing
a word means knowing the semantic value of a word.
8. Knowing
a word means knowing many of the different meanings associated with a word.
What is vocabulary ability?
Mention
of the term construct brings me back to the main theme I developed in Chapter 1,
which was that scholars with a specialist interest in vocabulary teaching and learning
have a rather different perspective from language testers on the question of how
- and even whether - to assess vocabulary. My three dimensions of vocabulary assessment
represent one attempt to incorporate the two perspectives within a single framework.
The context of vocabulary use
Traditionally
in vocabulary testing, the term context has referred to the sentence or utterance
in which the target word occurs. For instance, in a multiple-choice vocabulary
item, it is normally recommended that the stem should consist of a sentence containing
the word to be tested, as in the following example: The committee endorsed the proposal.
A. discussed
B. supported
C. knew about
D. prepared
Vocabulary knowledge and fundamental
processes
The
second component in Chapelle's (1994) framework of vocabulary ability is the one
that has received the most attention from applied linguists and second language
teachers. Chapelle outlines four dimensions of this component:
Vocabulary
size: This refers to the number of words that a person knows.
Knowledge
of word characteristics: a language user's understanding of particular lexical items may range
from vague to more precise (Cronbach, 1942).
Lexicon organization:
This concerns the way in which words and other lexical items are stored in the brain.
Fundamental
vocabulary processes: Language users apply these processes to gain access to their
knowledge of vocabulary, both for understanding and for their own speaking and writing.
Metacognitive strategies for vocabulary
use
This
is the third component of Chapelle's definition of vocabulary ability, and is what
Bachman (1990) refers to as 'strategic competence'. The strategies are employed
by all language users to manage the ways that they use their vocabulary knowledge
in communication. Most of the time, we operate these strategies without being aware
of it. It is only when we have to undertake unfamiliar or cognitively demanding
communication tasks that the strategies become more conscious.
Learners
have a particular need for metacognitive strategies in communication situations
because they have to overcome their lack of vocabulary knowledge in order to function
effectively. Blum-Kulka and Levenston (1983) see these strategies in terms of general
processes of simplification.
Conclusion
My
intention in this chapter has been to draw attention to the complexity of the subject
matter in any assessment of vocabulary. At the simplest level vocabulary consists
of words, but even the concept of a word is challenging to define and classify.
For a number of assessment purposes, it is important to clarify what is meant by
a 'word' if the correct conclusions are to be drawn from the test results.
The
strong association in our minds between vocabulary and single words tends to restrict
our view of the topic. For this reason, during the writing of the book I considered
the possibility of dispensing with both the terms vocabulary and word as much as
possible in favour of terms like lexicon, lexis and lexical item, in order to signal
that I was adopting a much broader conception than the traditional ideas about vocabulary.
CHAPTER 3
Research on vocabulary
acquisition and use
Introduction
The
focus of this chapter is on research in second language vocabulary acquisition
and use. There are three reasons for reviewing this research in a book on vocabulary
assessment. The first is that the researchers are significant users of vocabulary
tests as instruments in their studies. In other words, the purpose of vocabulary
assessment is not only to make decisions about what individual learners have achieved
in a teaching/learning context but also to advance our understanding of the processes
of vocabulary acquisition. Secondly, in the absence of much recent interest in vocabulary
among language testers, acquisition researchers have often had to deal with assessment issues themselves as they devised the instruments for their research. The third
reason is that the results of their research can contribute to a better understanding
of the nature of the construct of vocabulary ability, which - as I explained in the previous chapter - is important
for the validation of vocabulary tests.
Systematic vocabulary learning
Given
the number of words that learners need to know if they are to achieve any kind of
functional proficiency in a second language, it is understandable that researchers
on language teaching have been interested in evaluating the relative effectiveness
of different ways of learning new words.
The
starting point for research in this area is the traditional approach to vocabulary
learning, which involves working through a list of L2 words together with their
L1 glosses/translations and memorizing the word-gloss pairs.
Assessment Issues
As
for the assessment implications, the design of tests to evaluate how well students
have learned a set of new words is straightforward, particularly if the learning
task is restricted to memorising the association
between an L2 word and its L1 meaning. It
simply involves presenting the test-takers with one word and asking them to supply
the other-language equivalent. However, as
Ellis and Beaton (1993b) note, it makes a difference whether they are required to
translate into or out of their own language.
For example, English native speakers learning German words find it easier
to supply the English meaning in response to the German word (i.e. to translate
into the L1) than to give the German word for an English meaning (i.e. to translate
into the L2).
Incidental vocabulary learning
The
term incidental often causes problems in the discussion of research on this kind
of vocabulary acquisition. In practice it usually means that the learners are not told in advance that their knowledge of the target words will be tested.
Research with native speakers
The
first step in investigating this kind of vocabulary acquisition was to obtain evidence
that it actually happened. Teams of reading
researchers in the United States (Jenkins, Stein and Wysocki, 1984; Nagy, Herman
and Anderson, 1985; Nagy, Anderson and Herman, 1987) undertook a series of studies
with native-English-speaking school children.
The basic research design involved asking the subjects to read texts appropriate to their age level that contained unfamiliar words. The children were not told that the researchers were interested in vocabulary. After they had completed the reading task, they were given, unannounced, at least one test of their knowledge of the target words in the text.
Second language research
Now,
how about incidental learning of second language vocabulary? In a study that predates the L1 research in the
US, Saragi, Nation and Meister (1978) gave a group of native speakers of English
the task of reading Anthony Burgess's novel A Clockwork Orange, which contains a
substantial number of Russian-derived words that function as an argot used by the young delinquents who
are the main characters in the book. When
the subjects were subsequently tested, it was found on average that they could recognize
the meaning of 76 per cent of the 90 target words.
Pitts,
White and Krashen (1989) used just excerpts from the novel with two groups of American
university students and also found some evidence of vocabulary learning; however, as you might expect, the reduced scope of the study resulted in fewer target words being correctly understood: an average of only two words out of 30 tested. If you have
read the novel yourself, you may recall how you were able to gain a reasonable understanding
of what most of the Russian words meant by repeated exposure to them as the story
progressed. Of course, since Burgess deliberately
composed the text of the novel to facilitate this kind of 'incidental acquisition',
it represents an unusual kind of reading material that is significantly different
from what learners normally encounter.
Assessment issues
Now,
what are the testing issues that arise from this research on incidental vocabulary
acquisition? One concerns the need for a pretest. A basic assumption made in these
studies is that the target words are not known by the subjects. To some extent,
it is possible to rely on teachers' judgements or word-frequency counts to select
words that a particular group of learners are unlikely to know, but it is preferable
to have some more direct evidence. The use of a pre-test allows the researchers
to select from a set of potential target words ones that none of the subjects are
familiar with.
Alternatively,
if the learners turn out to have knowledge of some of the words, the test results
can provide a baseline against which post-test scores can be evaluated. The problem,
however, is that if a vocabulary test is given before the learners undertake the
reading or listening task, they will be alerted to the fact that the researchers
are interested in vocabulary and thus the 'incidental' character of any learning
that occurs may be reduced, if not lost entirely.
Various
solutions to this problem have been adopted. In some of the research where the subjects
were reading texts in their first language, the type of target items used could be assumed to be unfamiliar without giving a pre-test. For example, the researchers
who used A Clockwork Orange had only to ensure that none of their subjects had studied
Russian. And in the two experiments by Hulstijn (1992) that involved subjects who
were reading in their first language, the target items were 'pseudo-words' created
by the researcher.
Questions about lexical inferencing
This
topic is not purely a pedagogical concern.
Inferencing by learners is of great interest in second language acquisition
research and there are several types of empirical studies that are relevant here. In reviewing this work, I find it helpful to
start with five questions that seem to follow a logical sequence:
1 What kind
of contextual information is available to readers to help them in guessing the meaning
of unknown words in texts?
2 Are such
clues normally available to the reader in natural, unedited texts?
3 How well do learners infer the meaning of unknown words without being specifically trained to do so?
4 Is strategy training an effective way of developing learners' lexical inferencing skills?
5 Does successful
inferencing lead to acquisition of the words?
Assessment issues
As
in any test-design project, we first need to be clear about what the purpose of
a lexical inferencing test is. The literature
I have just reviewed above indicated at least three possible purposes:
1 to conduct
research on the processes that learners engage in when they attempt to infer the
meaning of unknown words;
2 to evaluate
the success of a program to train learners to apply lexical inferencing strategies; or
3 to assess
learners on their abilities to make inferences about unknown words.
The design
of the test should be influenced by which of these purposes is the applicable one.
There
are two possible starting points for the design of the test. The first approach is to select a set of words
which are known to be unfamiliar to the test-takers and then create a suitable context
for each one in the form of a sentence or short paragraph. This strategy allows the tester to control the
nature and the amount of the contextual clues provided, but at the risk of producing unauthentic contexts that may be unnaturally pregnant. The alternative is to take one or more texts as
the starting point and to choose certain low-frequency words within them as the
target items for the test. There are drawbacks
in this case as well: there may be too many unfamiliar words or conversely too few; and the text may not provide any usable contextual
information for a particular word, or again may provide too much.
Communication strategies
When
compared with the amount of research on ways that learners cope with unknown words
they encounter in their reading, there has been less investigation of the vocabulary
difficulties they face in expressing themselves through speaking and writing. However, within the field of second language acquisition,
there is an active tradition of research on communication strategies. Although the scholars involved have not seen themselves
as vocabulary researchers, a primary focus of their studies has been on how learners
deal with lexical gaps, that is words or phrases in the target language that they
need to express their intended meanings but don't know. It is therefore appropriate for us to consider
what the findings have been and what their possible implications for vocabulary assessment
are.
Conclusion
Let
us review the assessment procedures discussed in this chapter in terms of the three
dimensions of vocabulary assessment I presented in Chapter 1. Research on second
language vocabulary acquisition normally employs discrete tests, because the researchers
are investigating a construct that can be labeled
'vocabulary knowledge', 'vocabulary skills' or 'vocabulary learning ability'. This applies even to research on communication
strategies. Despite the apparently broad
scope of the topic area, most researchers have focused very specifically on lexical
strategies and designed tests that oblige learners to deal with their lack of knowledge
of particular vocabulary items. Embedded
measures make sense in theory, but it remains to be seen whether they can be used
as practical tools for assessing communication strategies.
Secondly,
selective rather than comprehensive measures are used in vocabulary acquisition
research, at least in the areas covered in this chapter. Tests assess whether learners have some knowledge
of a series of target words and/or specific vocabulary skills that the researcher
is interested in. However, comprehensive
measures may have a limited role in the development of incidental learning or inferencing
tests. In order to have access to the contextual
information required to gain some understanding of the unknown or partially known
target words, the test-takers need to have a reasonable knowledge of most words
in the input text. A comprehensive measure
of the non-target vocabulary - say, in the form of a readability formula - would
therefore be a useful guide to the suitability of a text for this kind of test.
As
for context dependence, there is variability according to what aspect of vocabulary is being investigated. Tests of systematic vocabulary learning are normally context independent, with the words being presented
in isolation or in a limited sentence context.
In the case of research on incidental learning, subjects are certainly presented with the target words in context as part of the experimental treatment, but
knowledge of the words is assessed afterwards in a context-independent manner, in
that the subjects cannot refer to what they
read or heard while they are taking the vocabulary test. By contrast, context dependence is an essential
characteristic of the test material in studies of lexical inferencing. In order for the items to be truly context dependent,
the test-writer needs to ensure both that contextual clues are available for each target word and that the test-takers have no prior knowledge of the words.
CHAPTER 4
Research on vocabulary
assessment
Introduction
In
the previous chapter, we saw how tests play a role in research on vocabulary within
the field of second language acquisition (SLA).
Now we move on to consider research in the field of language testing, where the focus is not so much on understanding
the processes of vocabulary learning as on measuring the level of vocabulary knowledge
and abilities that learners have reached.
Language testing is concerned with the design of tests to assess learners
for a variety of practical purposes that can be summarized under labels such as
placement, diagnosis, achievement and proficiency.
However, in practice this distinction between second
language acquisition research and assessment is difficult to maintain consistently,
because, on the one hand, language testing researchers have paid relatively little
attention to vocabulary tests and, on the other hand, second language acquisition
researchers working on vocabulary acquisition have often needed to develop tests as an integral part of their research design. Thus, some of the important work on how to measure vocabulary knowledge and abilities has been produced by vocabulary acquisition researchers rather than language testers; the latter have tended either to take vocabulary tests for granted or, in the 1990s, to be interested in more integrative and communicative measures of language proficiency.
Objective testing
The history
of vocabulary assessment in the twentieth century is very much associated with the
development of objective testing, especially in the United States. Objective tests are ones in which the learning
material is divided into small units, each of which can be assessed by means of
a test item with a single correct answer that can be specified in advance. Most commonly these are items of the multiple-choice
type.
It is easy to see how vocabulary became popular as a component of objective language tests.
• Words could be treated as independent linguistic units with a meaning expressed by a synonym, a short defining phrase or a translation equivalent.
• There was
a great deal of work in the 1920s and 1930s to prepare lists of the most frequent
words in English, as well as other words that were useful for the needs of particular
groups of students. According to Lado (1961:
181), similar though more limited work was done on the vocabulary of major European
languages.
• Multiple-choice
vocabulary tests proved to have excellent technical characteristics, in relation
to the requirements of psychometric theory.
Well-written items could discriminate effectively among learners according
to their level of ability, and thus the tests were highly reliable. Reliability was the great virtue of a psychometric
test.
• Rather than
simply measuring vocabulary knowledge, objective vocabulary tests seemed to be valid
indicators of language ability in a broad sense. As Anderson and Freebody (1981: 78-80) noted,
one of the most consistent findings in L1 reading research has been the high correlation
between tests of vocabulary and reading comprehension.
Multiple-choice vocabulary items
Although the
multiple-choice format is one of the most
widely used methods of vocabulary assessment, both for native speakers and for second
language learners, its limitations have also been recognized for a long time. Wesche and Paribakht summarize the criticisms
of these items as follows:
1. They are
difficult to construct, and require laborious field-testing, analysis and refinement.
2. The learner
may know another meaning for the word, but not the one sought.
3. The learner
may choose the right word by a process of elimination, and has in any case a 25
per cent chance of guessing the correct answer in a four-alternative format.
4. Items may
test students' knowledge of distractors rather than their ability to identify an
exact meaning of the target word.
5. The learner
may miss an item either for lack of knowledge of words or lack of understanding
of syntax in the distractors.
6. This format
permits only a very limited sampling of the learner's total vocabulary (for example,
a 25-item multiple-choice test samples one word in 400 from a 10,000-word vocabulary).
(Wesche and Paribakht, 1996: 17)
Validating tests of vocabulary knowledge
Writers
on first language reading research over the years (Kelley and Krey, 1934; Farr,
1969; Schwartz, 1984) have pointed out that, in addition to the various variations
of the multiple-choice format, a wide range of
test items and methods have been used for measuring vocabulary knowledge. Kelley and Krey (cited in Farr, 1969: 34) identified
26 different methods in standardized US vocabulary and reading tests. However, as Schwartz puts it, 'there does not
appear to be any rationale for choosing one measurement technique rather than another.
Test constructors sometimes seem to choose a particular measurement technique more or less by whim' (1984: 52). In addition,
it is not clear that the various types of tests are all measuring the same ability; this calls into question the validity of the tests.
Measuring vocabulary size
Let me first
sketch some educational situations in which consideration of vocabulary size is
relevant and where the research has been undertaken.
• Reading researchers have long been interested
in estimating how many words are known by native speakers of English as they grow
from childhood through the school years to adult life. It represents one facet of research into the role
of vocabulary knowledge in reading comprehension as it develops with age. As Anderson and Freebody (1981: 96-97) noted,
the results of such research have important implications for the way that reading
programs in schools are designed and taught.
• Estimates of native-speaker vocabulary size at
different ages provide a target - though a moving one, of course - for the acquisition
of vocabulary by children entering school with little knowledge of the language
used as the medium of instruction. Let us
take the case of children from non-English-speaking backgrounds who migrate with
their families to Canada.
• International students undertaking upper secondary or university education through a new medium of instruction are another group for whom vocabulary size is a relevant consideration.
What counts as a word?
This is an issue that I discussed in Chapter 2. The
larger estimates of vocabulary sizes for native speakers tend to be calculated on
the basis of individual word forms, whereas more conservative estimates take word
families as the units to be measured. Remember
that a word family consists of a base word together with its inflected and derived
forms that share the same meaning. For example,
the word forms extends, extending, extended, extensive, extensively, extension and
extent can be seen as members of a family headed by the base form extend. The other members of the family are linked to
extend by simple word-formation rules and all of them share a core meaning which
can be expressed as 'spread or stretch out'.
A person who knows the meaning of extend (or perhaps the meaning of any one
member of the family) should be able to figure out what the other word forms mean
by applying a little knowledge of English suffixes and getting some help from the
context in which the word occurs.
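As a rough illustration of the word-family idea (not a procedure from the book), the following Python sketch maps the forms listed above onto the base extend using naive suffix stripping; the suffix list and the extens-/extend alternation rule are simplifications invented for this example.

```python
# Naive suffix stripping: real word-family counts rely on principled word-formation
# rules, so this only illustrates the idea with the 'extend' family.
SUFFIXES = ["ively", "sion", "ing", "ive", "ed", "s", "t"]

def base_form(word: str, known_bases: set) -> str:
    """Return the base a word form maps to, if stripping a suffix yields a known base."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # allow the extens-/exten- spelling alternation seen in this family
            for candidate in (stem, stem.rstrip("s") + "d"):
                if candidate in known_bases:
                    return candidate
    return word

family = ["extends", "extending", "extended", "extensive", "extensively", "extension", "extent"]
print({w: base_form(w, {"extend"}) for w in family})  # every form maps to 'extend'
```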
How do we choose which words to test?
For
practical reasons it is impossible to test all the words that the native speaker
of a language might know. Researchers have
typically started with a large dictionary and then drawn a sample of words representing,
say, 1 per cent (1 in 100) of the total dictionary entries. The next step is to test how many of the selected
words are known by a group of subjects. Finally,
the test scores are multiplied by 100 to give an estimate of the total vocabulary
size. It seems a straightforward procedure
but, as Nation (1993b) pointed out in some detail, there are numerous traps for
unwary researchers. For example, the dictionary
headwords are not the most suitable sampling units, for the reasons I gave in response
to the first question. A single word family
may have multiple entries in the dictionary, so an estimate of vocabulary size based
on headwords would be an inflated one. Second,
a procedure by which, say, the first word on every sixth page is chosen will produce
a sample in which very common words are overrepresented because these words, with
their various meanings and uses, take up much more space in the dictionary than
low-frequency words. Third, there are technical questions concerning
the size of sample required to make a reliable estimate of the total vocabulary
size.
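Purely as an illustration of the arithmetic involved (the word list, sample ratio and simulated learner are invented), the following Python sketch carries out the dictionary-sampling procedure described above:

```python
# Illustrative sketch of the dictionary-sampling estimate; a real study would sample
# word families rather than headwords and guard against the pitfalls Nation (1993b) notes.
import random

def estimate_vocabulary_size(entries, sample_ratio, knows_word):
    """Sample entries at the given ratio, test each one, and scale the score back up."""
    sample_size = max(1, round(len(entries) * sample_ratio))
    sample = random.sample(entries, sample_size)
    known = sum(1 for word in sample if knows_word(word))
    return round(known * len(entries) / sample_size)  # e.g. known score x 100 for a 1% sample

dictionary = [f"word{i}" for i in range(20_000)]      # stand-in for dictionary entries
knows = lambda w: int(w[4:]) < 7_000                  # pretend the learner knows the first 7,000
print(estimate_vocabulary_size(dictionary, 0.01, knows))  # roughly 7,000
```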
How do we find out whether the selected words are known?
Once a sample of words has been selected, it is
necessary to find out - by means of some kind of test - whether each word is known. In studies of vocabulary size, the criterion for
knowing a word is usually quite liberal, because of the large number of words that
need to be covered in the time available for testing. The following test formats have been commonly
used:
• multiple-choice
items of various types;
• matching of words with synonyms or definitions;
• supplying
an Ll equivalent to each L2 target word;
• the checklist (or yes-no) test, in which test-takers
simply indicate whether they know the word or not.
Assessing quality of vocabulary knowledge
Whatever
the merits of vocabulary-size tests, one limitation is that they can give only a
superficial indication of how well any particular word is known. In fact this criticism has long been applied to
many objective vocabulary tests, not just those that are designed to estimate the
total vocabulary size. Dolch and Leeds (1953)
analyzed the vocabulary subtests of five major reading and general achievement test
batteries for American school children and found that
1. Only the
commonest meaning of each target word was assessed; and
2. The test-takers
were required just to identify a synonym of each word. Thus, the test items could not show whether additional,
derived or figurative meanings of the target word were known, and it was quite possible
that the children had learned to associate the target word with its synonym without
really understanding what either one meant.
How to conceptualize it?
The Dolch and Leeds (1953) test items with which I introduced this section of the chapter essentially assess precision of knowledge: do the test-takers know the specific meaning of each target word, rather than just having a vague idea about it? This represents one way to define quality of knowledge,
but it assumes that each word has only one meaning to be precisely known. Of course, words commonly have several different meanings - think of fresh, as in fresh bread, fresh ideas, fresh supplies, a fresh
breeze and so on. If we take this aspect
into account, we need to add a dimension of meaning, in addition to precision. Going a step further, vocabulary knowledge involves
more than simply word meaning. As we saw
in Chapter 2, Richards (1976) and Nation (1990) list multiple components of word
knowledge, including spelling, pronunciation, grammatical form, relative frequency,
collocations and restrictions on the use of the word, as well as the distinction between receptive and productive knowledge.
How to measure it?
A
common assessment procedure for measuring the quality of vocabulary knowledge is
an individual interview with each learner, probing how much they know about a set
of target words. For instance, in their work
with bilingual and monolingual Dutch children, Verhallen and Schoonen (1993) wanted
to elicit all aspects of the target word meaning that children might know, in order
to make an elaborate semantic analysis of the responses.
The role of context
Whether
we can separate vocabulary from other aspects of language proficiency is obviously
relevant to the question of what the role of context is in vocabulary assessment. In the early years of objective testing, many vocabulary tests presented the target words in isolation, in lists or as the stems of multiple-choice items. It was considered that such tests were pure measures of vocabulary knowledge. In fact, the distinguished American scholar John B. Carroll
wrote in an unpublished paper in 1954 (quoted in Spolsky, 1995: 165) that test items
containing a single target word were the only ones that should be classified as
vocabulary items. Any longer stimulus would
turn the item into a reading-comprehension one.
Cloze tests as vocabulary measures
A
standard cloze test consists of one or more reading passages from which words are
deleted according to a fixed ratio (e.g. every seventh word). Each deleted word is replaced by a blank of uniform
length, and the task of the test-takers is to write a suitable word in each space. Some authors (e.g. Weir, 1990: 48; Alderson, 2000) prefer to restrict the use of the label cloze to this kind of test, but the term is commonly used to include a number of modifications to the standard format. One modified version is the selective-deletion (or rational) cloze, where the test-writer deliberately chooses the words to be deleted, preferably according to principled criteria. A second modification is the multiple-choice cloze.
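As a purely illustrative sketch (the passage, the deletion ratio, the lead-in length and the blank style are all invented), the following Python function prepares a standard fixed-ratio cloze of the kind just described:

```python
# Fixed-ratio deletion as described above; a rational (selective-deletion) cloze would
# instead choose the gapped words deliberately, e.g. only content words.
def make_cloze(words, ratio=7, lead_in=10, blank="______"):
    """Return the gapped text and the answer key for a fixed-ratio deletion."""
    gapped, answers = [], []
    for i, word in enumerate(words):
        if i >= lead_in and (i - lead_in) % ratio == ratio - 1:
            gapped.append(blank)
            answers.append(word)
        else:
            gapped.append(word)
    return " ".join(gapped), answers

passage = ("Vocabulary assessment has a long history and it continues to raise "
           "difficult questions about what exactly a test of word knowledge is "
           "measuring when the words appear in a longer text").split()
text, key = make_cloze(passage, ratio=7)
print(text)
print(key)  # the deleted words, i.e. the scoring key
```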
The standard cloze
Let
us first look at the standard, fixed-ratio cloze. A popular way of exploring the validity of cloze
tests in the 1970s was to correlate them with various other types of tests. In numerous studies the cloze correlated highly
with 'integrative' tests such as dictation or composition writing and at a rather
lower level with more 'discrete-point' tests of vocabulary, grammar and phonology. This was interpreted as evidence in support of
Oller's claim that the cloze was a good measure of overall proficiency in the language. However, as I discussed earlier in this chapter,
simple correlations are not an adequate means of establishing what a test is measuring,
especially when the correlation is a moderate one, say, in the range from 0.50 to
0.80.
The rational cloze
Although
Oller has consistently favored the standard fixed-ratio format as the most valid
form of the cloze procedure for assessing second language proficiency, other scholars
have argued for a more selective approach to the deletion of words from the text. In his research, Alderson (1979) found that a
single text could produce quite different tests depending on whether you deleted,
say, every eighth word rather than every sixth.
He also obtained evidence that most cloze blanks could be filled by referring
just to the clause or the sentence in which they occurred. (There is of course some confirmation of this in the figures from Jonz's (1990) research quoted above: 68 per cent of the items could be answered from within the clause or the sentence.)
Only a few researchers have used the rational cloze in a systematic way with second language learners. Bachman (1982; 1985) conducted two studies in
which he selected items that focused primarily on cohesive links within and between
the sentences of the text. Chapelle and Abraham
(1990) chose items that matched the types of items in a fixed-ratio cloze based
on the same text. None of the published studies
has involved a rational cloze designed just to measure vocabulary, but once you
accept the logic of selective deletion of words from the text, it makes sense to
use the cloze procedure to assess the learners' ability to supply missing content
words on the basis of contextual clues and,
at a more advanced level, to choose the most stylistically appropriate word for
a particular blank.
The multiple-choice cloze
This version of the cloze provides choice items rather than the standard blanks to be filled in. Porter (1976) and Ozete (1977) argued that the standard format requires writing ability, whereas the multiple-choice version makes it more a measure of reading comprehension. Jonz (1976) pointed out that a multiple-choice cloze could be marked more objectively because it controlled the range of responses that the test-takers could give. In addition, he considered that providing response options made the test more student-centered - or 'learner-friendly', as we might say these days. For Bensoussan and Ramraz (1984), multiple-choice items were a practical necessity, because their cloze test formed part of an entrance examination for two universities in Israel, with 13,000 candidates annually.
The C-test
At
first glance the C-test - in which a series of short texts are prepared for testing by deleting the second half of every second word - may seem to be the version of the cloze procedure that is the least promising as a specific measure of vocabulary. For one thing, its creators intended that it should assess general proficiency in the language, particularly for selection and placement purposes, and that the items should represent a sample of all the elements in the text (Klein-Braley, 1985; 1997: 63-66). If that is the intention, there is no question
of using only content words as items. Second,
the fact that just the second half of a word is deleted might suggest that knowledge
of word structure is more important in this kind of test than, say, the semantic
aspects of vocabulary knowledge - especially if the language being tested has a
complex system of word endings. However, Chapelle and Abraham (1990) found that
their C-test correlated highly with their multiple-choice vocabulary test (r = 0.862). The correlation was substantially higher than
with the other parts of the placement test battery, including the writing test (0.639)
and the reading test (0.604).
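As an illustrative sketch of the deletion rule itself (the sentence is invented, and published C-tests typically leave the first sentence of each short text intact, a refinement omitted here), the following Python function damages every second word as described above:

```python
# Deletes the second half of every second word, marking the gap with underscores.
def make_c_test(text, start=1):
    out = []
    for i, word in enumerate(text.split()):
        if i >= start and (i - start) % 2 == 0 and len(word) > 1:
            keep = (len(word) + 1) // 2              # keep the first half (rounded up)
            out.append(word[:keep] + "_" * (len(word) - keep))
        else:
            out.append(word)
    return " ".join(out)

print(make_c_test("Learners complete each damaged word using their knowledge of the language"))
# Learners comp____ each dama___ word usi__ their knowl____ of th_ language
```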
Conclusion
Discrete,
selective, context-independent vocabulary tests have been an important part of the
educational measurement scene for almost the whole of the twentieth century. They have all the virtues of an objective language
test and have become so well established that for a long time they were almost taken
for granted. Multiple-choice vocabulary items are still very much in use, though generally in a more contextualized form since the 1990s, with the target words presented at least in a sentence if not a broader linguistic context. At the same time, the prevailing
view in language testing is that discrete vocabulary measurements are no longer
a valid component of tests designed to assess the learners' overall proficiency
in a second language. Vocabulary knowledge
is assessed indirectly through the test-takers' performance of integrative tasks
that show how well they can draw on all their language resources to use the language
for various communicative purposes.
Nevertheless, researchers and language-teaching
specialists with a specific interest in vocabulary learning have a continuing need
for assessment tools. Much of their work
can be classified as focusing on either vocabulary size (breadth) or quality of
vocabulary knowledge (depth). Vocabulary
size has received more attention because, despite the fact that the tests may seem
superficial, they can give a more representative picture of the overall state of
the learners' vocabulary than an in-depth probe of a limited number of words. Measures of quality of vocabulary knowledge also
have value, but for quite specific purposes.
The
construct validation studies by Corrigan and Upshur (1982) and Arnaud (1989) challenge
the notion that vocabulary can be assessed as something separate from other components
of language knowledge, even when individual words are tested in relative isolation. This is consistent with other evidence of the
integral part that vocabulary plays in language ability, such as the strong relationship
between vocabulary tests and measures of reading comprehension.
Such
findings lend support to the view that vocabulary should always be assessed in context. However, as the research on the various members
of the cloze family of tests shows, the more we contextualize the assessment of
vocabulary, the less clear it may be to what extent it is vocabulary knowledge that
is influencing the test-takers' performance.
CHAPTER 5
Vocabulary Tests: Four case studies
The four tests are:
1. The Vocabulary Levels Test;
2. The Eurocentres Vocabulary Size Test (EVST);
3. The Vocabulary Knowledge Scale (VKS); and
4. The Test of English as a Foreign Language (TOEFL).
The first two tests, the Vocabulary Levels Test and the Eurocentres Vocabulary Size Test, are both measures of vocabulary size, whereas the Vocabulary Knowledge Scale is designed to assess depth of vocabulary knowledge. The fourth test to be considered is the Test of English as a Foreign Language (TOEFL), which is certainly not a discrete vocabulary test but rather a well-researched proficiency-test battery that has incorporated vocabulary items in a variety of interesting ways throughout its history.
The
Vocabulary Levels Test
The
Vocabulary Levels Test was devised by Paul Nation at Victoria University of Wellington in New Zealand in the early 1980s as a simple instrument for classroom use by teachers, in order to help them develop a suitable vocabulary teaching and learning programme for their students.
The
design of the test
The test
is in five parts, representing five levels of word frequency in English: the first 2000 words, 3000 words, 5000 words, the University word level (beyond 5000 words) and 10,000 words.
According
to Nation (1990: 261), the 2000- and 3000-word levels contain the high-frequency words that all learners need to know in order to function effectively in English. For instance, it is difficult for learners to read unsimplified texts unless they know these words. The 5000-word level represents the upper limit of general high-frequency vocabulary that is worth spending time on in class. Words at the University level should help students in reading their textbooks and other academic reading material. Finally, the 10,000-word level covers the more common lower-frequency words of the language.
At each
level, there are 36 words and 18 definitions, in groups of six and three respectively, as in this example from the 2000-word level:
1. Apply
2. Elect
3. Jump
4. Manufacture
5. Melt
6. Threaten

_____ choose by voting
_____ become like water
_____ make
Validation
One way
to validate the test as a measure of vocabulary size is to see whether it
provides evidence for the assumption on which it is based: words that occur
more frequently in the language are more likely to be known by learners than
less frequent words.
To
evaluate the scalability of the test, it was first necessary to set a criterion (or mastery) score for each level. For the analysis of this test, the criterion was set at 16 out of 18. In other words, a student who scored at least 16 on a particular level was regarded as having mastered the vocabulary at that level.
The
Guttman scalogram analysis produces a summary statistic called the coefficient of scalability, which indicates the extent to which the test scores truly form an implicational scale, taking into account the number of errors as well as the proportion of correct responses. According to Hatch and Farhady (1982: 181), the coefficient should be well above 0.60 if the scores are to be considered scalable. In my analysis, the scores from the beginning of the course yielded a coefficient of 0.90, with the five frequency levels in their original order, as in the upper part of Table 5.1. For the end-of-course scores, I obtained the best scalability, 0.84, by reversing the order of the 5000 and University levels, following the pattern of the mean scores in the lower part of Table 5.1. Thus, the statistics showed a high degree of implicational scaling, but by no means a perfect one.
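To make the scalability analysis more concrete, the sketch below shows how such statistics can be computed from pass/fail patterns across the five levels (pass meaning the 16-out-of-18 criterion was reached). It is an illustration only: the error-counting convention (deviations from the ideal pattern implied by each learner's number of passes) and the formula for the coefficient of scalability follow common textbook practice and are assumptions here, since the original analysis is not reproduced in detail.

```python
def scalability(patterns):
    """Guttman scalogram statistics for pass/fail patterns.

    `patterns` is a list of lists of 0/1 values, one list per learner,
    with levels ordered from easiest (most frequent words) to hardest.
    Errors are deviations from the ideal pattern implied by each
    learner's total number of passes (Goodenough-Edwards counting).
    """
    n_learners = len(patterns)
    n_levels = len(patterns[0])

    errors = 0
    for p in patterns:
        total = sum(p)
        ideal = [1] * total + [0] * (n_levels - total)
        errors += sum(1 for a, b in zip(p, ideal) if a != b)

    n_responses = n_learners * n_levels
    rep = 1 - errors / n_responses        # coefficient of reproducibility

    # Minimum marginal reproducibility: what could be "reproduced"
    # from the level pass-rates alone, ignoring individual patterns.
    mmr = sum(max(list(col).count(1), list(col).count(0))
              for col in zip(*patterns)) / n_responses

    scal = (rep - mmr) / (1 - mmr)        # coefficient of scalability
    return rep, mmr, scal

# Tiny illustrative data set: 1 = reached the 16/18 criterion at that level.
data = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],   # one deviation from a perfect scale
]
print(scalability(data))
```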
New
versions
Schmitt
(1993) wrote three new forms of the test, following the original specifications and taking fresh samples of words for each level. This new material was used by Beglar and Hunt (1999), but they concentrated on the 2000- and University-word levels, treating them as separate tests. According to Laufer's (1992; 1997a) work, these levels correspond to a knowledge of 3000 word families, which is the approximate threshold required to be able to read academic texts in English relatively independently. Beglar and Hunt administered all four forms of either the 2000-word-level or the University-word-level tests to nearly 1000 learners of English in secondary and tertiary institutions in Japan. Based on the results of this trial, they selected 54 of the best-performing items to produce two new 27-item tests for each level. The two pairs of tests were then equated statistically, so that they could function as equivalent measures of learners' vocabulary knowledge at the two frequency levels.
Laufer
herself (Laufer, 1998; Laufer and Nation, 1999) describes the blank-filling version of the test as a measure of 'active' or 'productive' vocabulary knowledge (she uses the two terms interchangeably in her articles). In the studies just cited, she distinguishes it from the two other measures used, as follows:
- Levels Test - original matching version: receptive knowledge;
- Levels Test - blank-filling version: controlled productive knowledge; and
- Lexical Frequency Profile (LFP): free productive knowledge.
According
to this classification, the blank-filling version is associated with the ability to use vocabulary in writing (which is what the LFP sets out to measure), and in fact Laufer and Nation (1999) assert that the blank-filling test scores can be interpreted in terms of the approximate number of words at a particular frequency level which are 'available for productive use' (1999: 41). However, they have limited evidence to support this interpretation. In one study (Laufer and Nation, 1995: 317), they found some moderate correlations between sections of the blank-filling Levels Test and corresponding sections of the LFP. On the other hand, Laufer's (1998: 264) intercorrelations of the three measures produced some contradictory evidence. Whereas there were substantial correlations between the two versions of the Levels Test, there was no significant relationship between either version and a measure based on the LFP.
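For readers unfamiliar with the Lexical Frequency Profile, it essentially classifies the words of a learner's composition into frequency bands and reports the proportion of the text falling into each band. The sketch below is a much simplified, hypothetical version: the miniature band lists are invented for illustration, and the published LFP (Laufer and Nation, 1995) works with word families and standard frequency lists rather than raw word forms.

```python
import re
from collections import Counter

def lexical_frequency_profile(text, bands):
    """Very simplified LFP: percentage of word tokens in each frequency band.

    `bands` maps a band name (e.g. 'first_2000') to a set of word forms.
    The published LFP uses word families and standard lists; treating raw
    lower-cased forms as the unit is a simplification for illustration."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for name, wordset in bands.items():
            if tok in wordset:
                counts[name] += 1
                break
        else:
            counts["not_in_lists"] += 1
    total = len(tokens)
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# Hypothetical miniature band lists, for demonstration only.
bands = {
    "first_2000": {"the", "a", "of", "students", "read", "many", "words",
                   "in", "their", "own", "writing", "is", "but"},
    "academic":   {"acquisition", "hypothesis", "data"},
}
essay = "Students read many words but vocabulary acquisition in their own writing is slower"
print(lexical_frequency_profile(essay, bands))
```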
The
Eurocentres Vocabulary Size Test
Like the
Vocabulary Levels Test, the Eurocentres Vocabulary Size Test (EVST) makes an estimate of a learner's vocabulary size using a graded sample of words covering numerous frequency levels. A distinctive feature of the EVST is that it is administered by computer rather than as a pen-and-paper test. Let us now look at the test from two perspectives: first as a placement instrument and then as a measure of vocabulary size.
The
EVST as a placement test
The
original work on the test was carried out at Birkbeck College, University of London by Paul Meara and his associates (Meara and Buxton, 1987; Meara and Jones, 1988). In its published form (Meara and Jones, 1990a), the test was commissioned by Eurocentres, a network of language schools in various European countries, including the UK.
Meara and
Jones (1988) report a validation study of the EVST in language schools in Cambridge and London using two criterion measures. First, the scores were correlated with the scores obtained on the existing Eurocentres placement test, which yielded an overall coefficient of 0.664 for the Cambridge learners and 0.717 for those in London. The second approach was to review - after one week of classes - the cases of students who had apparently been assigned to the wrong class by the existing placement test.
The
EVST as a measure of vocabulary size
If the
Eurocentres test is to have a wider application than just as a placement tool for language schools, we also need to consider its validity as a measure of vocabulary size, and for this we should look into various aspects of its design:
- the format of the test and, in particular, the role of the non-words;
- the selection of the words to be tested; and
- the scoring of the test.
The first
thing to consider is the test format, which allows a large number of words to be covered within a short time. The second aspect of the design that we need to look at is the selection of the words for the test. And we come now to the scoring of the test. A straightforward way to produce a score would be simply to take the number of 'Yes' responses to the real words and calculate what proportion of the total inventory of 10,000 words they represent.
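The straightforward scoring method just described is easy to express in code. The sketch below is only an illustration: it projects the proportion of 'Yes' responses to real words onto the 10,000-word inventory, and it also reports the rate of 'Yes' responses to the non-words, which the operational test uses as a check on the trustworthiness of the self-report (the actual adjustment formula used in the EVST is not shown here).

```python
def evst_raw_estimate(responses, total_inventory=10_000):
    """Estimate vocabulary size from yes/no checklist responses.

    `responses` is a list of (is_real_word, said_yes) pairs. Returns the
    simple projection described in the text plus the false-alarm rate on
    non-words, which can be used to temper over-optimistic self-reports
    (the operational adjustment is not shown)."""
    real = [yes for is_real, yes in responses if is_real]
    fake = [yes for is_real, yes in responses if not is_real]

    hit_rate = sum(real) / len(real)
    false_alarm_rate = sum(fake) / len(fake) if fake else 0.0

    return {
        "estimated_size": round(hit_rate * total_inventory),
        "false_alarm_rate": round(false_alarm_rate, 2),
    }

# Three real words (two claimed as known) and two non-words (one wrongly claimed).
sample = [(True, True), (True, True), (True, False), (False, False), (False, True)]
print(evst_raw_estimate(sample))
```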
The Vocabulary
Knowledge Scale
The
instrument is of interest not only as a test in its own right but also as a way of exploring some issues that arise in any attempt to measure quality of vocabulary knowledge in a practical manner.
- The design of the scale
The VKS is a generic instrument, which can be used with any set of words that the tester or researcher is interested in assessing. It consists in effect of two scales: one for eliciting responses from the test-takers and one for scoring the responses.
- Use of the VKS
Paribakht and Wesche have used the VKS as a
tool in their research on vocabulary acquisition in an English language programme for non-native-speaking undergraduate students at the University of Ottawa. The courses in the programme, which focuses on comprehension skills, use authentic spoken and written materials linked to particular themes such as media, the environment and fitness. In their first study, Paribakht and Wesche (1993) selected two themes as the basis for a study of incidental vocabulary learning, one theme being actually used in class and the other not.
In the
second study (Paribakht and Wesche, 1997), the researchers compared two approaches to the acquisition of theme-related vocabulary through reading. One, called Reading Plus, added to the main reading activities a series of vocabulary exercises using target content words from the themes. The other, Reading Only, supplemented the main readings with further texts and comprehension exercises. As in the earlier study, the VKS was administered as a pre-test and a post-test to measure acquisition of the target words during the course.
- Evaluation of the instrument
Paribakht
and Wesche have been careful to make modest claims for their instrument: 'Its
purpose is not to estimate general vocabulary knowledge, but rather to track
the early development of specific words in an instructional or experimental
situation' (Wesche and Paribakht, 1996: 33). They have obtained various kinds
of evidence in their research for its reliability and validity as a measure of
incidental vocabulary acquisition (Wesche and Paribakht, 1996: 31-33). To
estimate reliability, they administered the VKS to a group of students twice
within two weeks and there was a high level of consistency in the students'
responses to the 24 content words tested (the correlation was 0.89). They have
found a strong relationship (with correlations of 0.92 to 0.97) between the way
that the students rated themselves on the elicitation scale and the way that
their responses were scored, which suggests that the students reported their
level of knowledge of the target words reasonably accurately.
The
Test of English as a Foreign Language
The Test
of English as a Foreign Language, or TOEFL, is administered in 180 countries and territories to more than 900,000 candidates. Like other ETS tests, TOEFL relies on sophisticated statistical analyses and testing technology in order to ensure its quality as a measuring instrument and its efficient administration to such large numbers of test-takers. Until recently, all the items in the basic TOEFL test have been of the multiple-choice type.
The
primary purpose of TOEFL is to assess whether foreign students planning to study in a tertiary institution where English is the medium of instruction have a sufficient level of proficiency in the language to be able to undertake their academic studies without being hampered by language-related difficulties.
From the
viewpoint of vocabulary assessment, the history of the TOEFL programme represents a fascinating case study of how approaches to testing have changed in the latter part of the twentieth century. In particular, vocabulary testing has become progressively more embedded and context-dependent as a result of successive revisions of the test battery during that period.
- The original vocabulary items
From its
beginning in 1964 until the mid-1970s, TOEFL consisted of five sections: listening comprehension, English structure, vocabulary, reading comprehension and writing ability. There were two types of vocabulary-test item, which were labelled sentence completion and synonym matching.
- The 1976 revision
In his
study involving students in Peru, Chile and Japan, Pike found that, although
the existing Vocabulary section of the test correlated highly (at 0.88 to 0.95)
with the Reading Comprehension section, the new Words in Context items had even
higher correlations (0.94 to 0.99) with the reading section of the experimental
test. These results suggested that Pike had achieved his objective of creating
a new vocabulary test format that simulated more closely the experience of
readers encountering words in context. However, they also raised the intriguing
question of whether both vocabulary and reading comprehension items were needed
in the test, and if not, which of the two could be dispensed with. There were
arguments both ways:
- The vocabulary items formed a very efficient section of the test, in that they achieved a very high level of reliability within a short period of testing time.
- On the other hand, reading is such a crucial skill in university study that it would have seemed very strange to have a test of English for academic purposes that did not require the students to demonstrate their ability to understand written texts.
- In the end, Pike recommended a compromise solution by which both the words-in-context vocabulary items and the reading comprehension items (based on short passages) were included in a new combined section of the test. Pike's recommendation was accepted and implemented in operational versions of the test from 1976 until 1995.
Towards
more contextualised testing
At a
conference convened by the TOEFL Program in 1984, a number of applied linguists were invited to present critical reviews of the extent to which TOEFL could be considered a measure of communicative competence. Bachman observed that the vocabulary items 'would appear to suffer from misguided attempts at contextualization' (1986: 81), because the contextual information in the stem sentence was hardly ever required to answer the item correctly. In my terms, Bachman was arguing that, despite their label, the test items were essentially context-independent.
The other
study was by Henning (1991). In order to investigate the effects of
contextualisation, Henning used eight different formats, including the
then-current words-in-context item type. Essentially the formats varied along
three dimensions:
- the length of the stem, ranging from a single word to a whole reading passage;
- the 'inference-generating quality' of the stem sentence, i.e. the extent to which it provided clues to the meaning of the target word; and
- the inclusion or deletion of the target word: it was either included in the stem sentence or replaced by a blank.
- Vocabulary in the 1995 version
The
results of this in-house research provided support for recommendations from the TOEFL Committee of Examiners, an advisory body of scholars from outside ETS, that vocabulary knowledge should be assessed in a more integrative manner. Thus, in the 1995 revision of TOEFL, the separate set of vocabulary items was eliminated from the test. Instead, they were made an integral part of the reading comprehension section.
It is
interesting here to revisit the issue of context dependence by looking at an
item from the 1995 version of the test and considering what is involved in
responding to it.
Not all
the vocabulary items in the 1995 version of the test functioned quite like this. Their context dependence has to be individually evaluated, in much the same way that researchers have analysed what is involved in answering particular cloze-test items. However, in general the items appear to be significantly more context dependent.
- The current situation
The
latest development in the story occurred in 1998, with the introduction of a computerised
version of TOEFL. In most countries candidates now take the test at an
individual testing station, sitting at a computer and using the mouse to record
their responses to the items presented to them on the screen. For the reading
test, the passages appear in a frame on the left side of the screen and the
test items are shown one by one on the right side. Vocabulary items have been
retained but in a different form from before.
CHAPTER 6
The
design of discrete vocabulary tests
Introduction
The discussion
of vocabulary-test design in the first part of this chapter is based on the framework for language-test development presented in Bachman and Palmer's (1996) book Language Testing in Practice. Since the full framework is too complex to cover here, I have chosen certain key steps in the test-development process as the basis for a discussion of important issues in the design of discrete vocabulary tests in particular. In the second part of the chapter, I offer a practical perspective on the development of vocabulary tests by means of two examples. One looks at the preparation of classroom progress tests, and the other describes the process by which I developed the word-associates format as a measure of depth of vocabulary knowledge.
Test
Purpose
Following
Bachman and Palmer's framework, an essential first step in language-test design is to define the purpose of the test. It is important to clarify what the test will be used for because, according to testing theory, a test is valid to the extent that we are justified in drawing conclusions from its results.
We can identify three uses for language tests: for research, making decisions about learners and making decisions about language programmes. Second language vocabulary researchers have needed assessment instruments for studies on:
· how broad and deep learners' vocabulary knowledge is;
· how effective different methods of systematic vocabulary learning are;
· how incidental learning occurs through reading and listening activities;
· whether and how learners can infer the meanings of unknown words encountered in context; and
· how learners deal with gaps in their vocabulary knowledge.
On the
other hand, language teachers and testers who employ tests for making decisions
about learners have a different focus. They use tests for purposes such as
placement, diagnosis, measuring progress or achievement, and assessing
proficiency, as in these examples:
- The vocabulary section of a placement test battery can be designed to estimate how many high-frequency words the learners already know.
- A progress test assesses how well students have learned the words presented in the units they have recently studied in the coursebook. The purpose can be essentially a diagnostic one for the teacher, to identify words that require some further attention in class.
- In an achievement test, one section may be designed to assess how well the learners have mastered a vocabulary skill that they have been taught, such as the ability to figure out the meaning of unfamiliar lexical items in a text on the basis of contextual cues.
Construct
definition
Bachman
and Palmer (1996: 117-120) state that there are two approaches to construct definition: syllabus-based and theory-based. A syllabus-based definition is appropriate when vocabulary assessment takes place within a course of study, so that the lexical items and the vocabulary skills to be assessed can be specified in relation to the learning objectives of the course.
For
research purposes and in proficiency testing, the definition of the construct needs to be based on theory. One thing that makes construct definition rather difficult in the area of second language vocabulary is the variety of theoretical concepts and frameworks which scholars have proposed to account for vocabulary acquisition, knowledge and use. Another effort to achieve greater clarity and standardisation is represented by Henriksen's (1999) three dimensions of vocabulary.
Receptive
and productive vocabulary
This
distinction between receptive and productive vocabulary is one that is accepted by scholars working on both first and second language vocabulary development, and it is often referred to by the alternative terms passive and active. As Melka (1997) points out, though, there are still basic problems in conceptualising and measuring the two types of vocabulary, in spite of a lengthy history of research on the subject. The difficulty at the conceptual level is to find criteria for distinguishing words that have receptive status from those which are part of a person's productive vocabulary. It is generally assumed that words are known receptively first and only later become available for productive use.
From my own reading of the literature, I have concluded that one source of confusion about the distinction between receptive and productive measures is that many authors use two different ways of defining reception and production interchangeably. Because the difference in the two definitions is quite significant for assessment purposes, let me spell out each one by using other terms:
Recognition and recall
Recognition
here means that the test-takers are presented with the target word and are
asked to show that they understand its meaning, whereas in the case of recall
they are provided with some stimulus designed to elicit the target word from
their memory.
Comprehension and use
Comprehension here means that learners can understand a word when they encounter it in context while listening or reading, whereas use means that the word occurs in their own speech or writing.
Characteristics
of the test input
The
design of test tasks is the next step in test development, according to Bachman
and Palmer's model. In this chapter, I focus on just two aspects of task
design: characteristics of the input and characteristics of the expected
response.
Selection
of target words
Based on
such findings, Nation (1990: Chapter 2) proposes that, for teaching and learning purposes, a broad three-way division can be made into high-frequency, low-frequency and specialised vocabulary. The high-frequency category in English consists of 2000 word families, which form the foundation of the vocabulary knowledge that all proficient users of the language must have acquired.
On the other hand, low-frequency vocabulary as a whole is of much less value to learners. The low-frequency words that learners do know reflect the influence of a variety of personal and social variables:
- How widely they have read and listened;
- What their personal interests are;
- How much time they have devoted to intensive vocabulary learning;
- What their educational and professional background is;
- Which community or society they live in;
- What communicative purposes they use the language for; and so on.
This is where specialised vocabulary comes in. Specialised vocabulary is likely to be better acquired through content instruction by a subject teacher than through language teaching.
Presentation of words
Words in isolation
As with
other decisions in test design, the question of how to present selected words to the test-takers needs to be related to the purpose of the assessment. We have seen various uses of vocabulary tests earlier where no context is provided at all:
- In systematic vocabulary learning, students apply memorising techniques to sets of target words and their meanings (usually expressed as L1 equivalents).
- In tests of vocabulary size, such as the Vocabulary Levels Test and the Eurocentres Test (EVST), words are often presented in isolation.
- In research on incidental vocabulary learning, the learners encounter the target words in context during reading or listening activities, but in the test afterwards the words are presented separately, because the researcher is interested in whether the learners can show an understanding of the words when they occur without contextual support.
Words
in context
For other
purposes, the presentation of target words in some context is desirable or necessary. In discrete, selective tests, the context most commonly consists of a sentence in which the target word occurs, but it can also be a paragraph or a longer text containing a whole series of target words. Although it is taken for granted these days by many language teachers that words should always be learned in context, in designing a vocabulary measure it is important to consider what role is played by the context in the assessment of vocabulary knowledge or ability.
Characteristics
of the expected response
Self-report vs. verifiable response
In some
testing situations, it is appropriate to ask the learners to assess their own lexical knowledge. In Chapter 5, we saw how the EVST and the VKS draw on self-report by the test-takers, although both instruments also incorporate a way of checking how valid the responses are as measures of vocabulary knowledge. As I noted in the conclusion to that chapter, the purpose of the test is a major consideration in deciding whether self-assessment is appropriate.
Even in 'low-stakes' testing situations where self-report is used, it is desirable to have a means of verifying the test-takers' judgements about their knowledge of the target words. Another approach to verification is found in a test-your-own-vocabulary book for native speakers of English by Diack (1975). The book contains 50 tests, each consisting of 60 words that are listed in order from most frequent to least frequent.
Monolingual vs. bilingual testing
A last design consideration concerns the language of the test itself. Whereas in a monolingual test format only the target language is used, a bilingual one employs both the target language (L2) and the learners' own language (L1). This aspect of test design involves more than just the characteristics of the expected response, but I have chosen to deal with the choice of language in an integrated way in this section of the chapter.
I have already noted one type of bilingual test format in the previous section on receptive and productive vocabulary, where translation from L2 to L1 assesses receptive knowledge and L1 to L2 translation is the corresponding productive measure. Vocabulary tests for native speakers of English learning foreign languages commonly have a similar kind of bilingual structure.
Practical examples
Classroom progress tests
The
purpose of my class tests is generally to assess the learners' progress in
vocabulary learning and, more specifically, to give them an incentive to keep
studying vocabulary on a regular basis.
Matching items
There are some aspects of the design of this item type which are worth noting:
- The reason for adding one or two extra definitions is to avoid a situation where the learner knows four of the target words and can then get the fifth definition correct by process of elimination, without actually knowing what the word means.
- Assuming that the focus of the test is on knowledge of the target words, the definitions should be easy for the learners to understand. Thus, as a general principle, they should be composed of higher-frequency vocabulary than the words to be tested and should not be written in an elliptical style that causes comprehension problems.
- If the purpose of the test is just to assess knowledge of word meaning, then all of the target items should belong to the same word class - all nouns, all adjectives, etc. - and should not include structural clues. Otherwise, the learners may be able to match up words and definitions on the basis of form as well as meaning.
- One criticism that can be made of the standard matching task is that it presents each target word in isolation.
Completion
items
Completion,
or blank-filling, items consist of a sentence from which the target word has
been deleted and replaced by a blank. As in the contextualised matching format
above, the function of the sentence is to provide a context for the word and
perhaps to cue a particular use of it. However, this is a recall task rather
than simply a recognition one because the learners have to supply the target
words from memory.
Generic
test items
In an
individualised vocabulary programme, these generic items offer a practical alternative to having separate tests for each learner in the class. The same item types could also be used more conventionally, with target words provided by the teacher, in a class where the learners have all studied the same vocabulary.
- Testing depth of vocabulary knowledge
The
word-associates test
The new
starting point was the concept of word association. The standard
word-association task involves presenting subjects with a set of stimulus words
one by one and asking them to say the first related word that comes into their
head.
CHAPTER 7
Comprehensive
measures of vocabulary
Introduction
Comprehensive
measures are particularly suitable for assessment procedures in which vocabulary is embedded as one component of the measurement of a larger construct, such as communicative competence in speaking, academic writing ability or listening comprehension. However, we cannot simply say that all comprehensive measures are embedded ones, because they can also be used on a discrete basis.
Measures
of test input
In
reading and listening tests we have to be concerned about the nature of the input text. At least two questions can be asked:
- Is it at a suitable level of difficulty that matches the ability range of the test-takers?
- Does it have the characteristics of an authentic text, especially if it has been produced or modified for use in the test?
Here we are specifically interested in the extent to which information about the vocabulary of the text can help to provide answers to these questions.
Readability
In L1 reading research, the basic concept used in the analysis of texts is readability, which refers to the various aspects of a text that are likely to make it easy or difficult for a reader to understand and enjoy. During the twentieth century a whole educational enterprise grew up in the United States devoted to devising and applying formulas to predict the readability of English texts for native-speaker readers in terms of school grade or age levels (for a comprehensive review, see Klare, 1984).
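As an illustration of the kind of formula involved, the sketch below computes the well-known Flesch Reading Ease score, which combines average sentence length with average syllables per word. The syllable counter is a crude vowel-group heuristic rather than the dictionary-based counting used in published readability research, so the result should be treated as approximate.

```python
import re

def count_syllables(word):
    """Crude heuristic: count vowel groups, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
    Higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

sample = "The cat sat on the mat. Readability formulas estimate how difficult a text is."
print(round(flesch_reading_ease(sample), 1))
```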
Listenability of spoken texts
Much more
work has been done on the comprehensibility of written texts than of spoken language. Whereas the term readability is now very well established, its oral equivalent, listenability, has had only limited currency. However, it seems reasonable to expect that spoken texts used for the assessment of listening comprehension vary in the demands that they place on listeners in comparable ways to the demands made of readers by different kinds of written language. One way to evaluate the input of a listening test, then, would be simply to treat it as a written text and apply a readability formula to determine its suitability for the target population of test-takers.
Measures of learner production
Most of
this section is concerned with statistical measures of writing, because there is more published research on that topic, but I also consider measures of spoken production, as well as the more qualitative approach to assessment represented by the use of rating scales to judge performance.
The
statistical measures could provide one kind of evidence for the validity of analytic ratings. Of course this does not necessarily mean that the statistical calculations can capture all the relevant aspects of language use, or that the subjective ratings are invalid if they turn out to be inconsistent with the statistical measures. Both kinds of evidence are needed for validating the assessment of learner performance in speaking and writing tasks.
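One simple family of statistical measures of learner writing is based on lexical diversity, such as the type-token ratio and its length-corrected variants. The sketch below is a generic illustration of that idea rather than a measure discussed at this point in the book; the segment-averaged version is one common way of reducing the well-known effect of text length on the ratio.

```python
import re

def type_token_ratio(tokens):
    """Proportion of distinct word forms (types) among all tokens."""
    return len(set(tokens)) / len(tokens)

def standardised_ttr(text, segment=50):
    """Mean TTR over consecutive fixed-length segments, a common way of
    making the measure less sensitive to text length."""
    tokens = re.findall(r"[a-z']+", text.lower())
    chunks = [tokens[i:i + segment] for i in range(0, len(tokens), segment)]
    chunks = [c for c in chunks if len(c) == segment] or [tokens]  # fall back for short texts
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)

essay = ("Learners who read widely tend to use a wider range of words "
         "in their writing than learners who read very little.")
print(round(standardised_ttr(essay), 2))
```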
CHAPTER 8
Further
developments in vocabulary assessment
Introduction
In the main body of this chapter, I want to review some current areas of work on second language vocabulary, which will provide additional evidence for my view that a wider perspective is required, and then explore the implications for further developments in vocabulary assessment in the future.
The identification of lexical units
One basic
requirement for any work on vocabulary is good quality information about the units that we are dealing with. In this section of the chapter, I first review the current state of word-frequency lists and then take up the question of how we might deal with multi-word lexical items in vocabulary assessment.
The vocabulary of informal speech
The
vocabulary of informal speech is the second area of vocabulary study that has received less attention than it should have, as indicated by the fact that perhaps the most frequently cited research study is the one conducted by Schonell et al. (1956) in the 1950s on the spoken vocabulary of Australian workers.
The social dimension of vocabulary use
In
addition, vocabulary knowledge and use are typically thought of in psycholinguistic terms, which minimises the existence of social variation among learners, apart from the fact that they undertake various courses of study, pursue different careers and have a range of personal interests.
For
assessment purposes, the education domain is obviously an area of major concern, especially when there is evidence that learners from particular social backgrounds lack the opportunity to acquire the vocabulary they need for academic study.
References
Purpura, James E. 2004. Assessing Grammar. Cambridge: Cambridge University Press.
Read, John. 2000. Assessing Vocabulary. Cambridge: Cambridge University Press.